  # AISHELL-2
  
  AISHELL-2 is by far the largest free speech corpus available for Mandarin ASR research.
  ## 1. DATA
  ### Training data
  * 1000 hours of speech data (around 1 million utterances)
  * 1991 speakers (845 male and 1146 female)
  * clean recording environment (studio or quiet living room)
  * read speech
  * reading prompts from various domains: entertainment, finance, technology, sports, control commands, places of interest, etc.
  * near-field recording via 3 parallel channels (iOS, Android, Microphone)
  * iOS data is free for non-commercial research and education use (e.g. universities and non-commercial institutes)
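
As a rough sanity check on the figures above, the short sketch below derives a few averages from them. It assumes exactly 1,000,000 utterances, although the list only says "around 1 million", so the results are approximate. The variable names are illustrative only.

```python
# Figures quoted from the training data list; the utterance count is approximate.
TOTAL_HOURS = 1000
APPROX_UTTERANCES = 1_000_000  # assumption: "around 1 million" taken as exactly 1e6
MALE_SPEAKERS = 845
FEMALE_SPEAKERS = 1146

# The male and female counts should add up to the stated 1991 speakers.
total_speakers = MALE_SPEAKERS + FEMALE_SPEAKERS

# Average utterance duration in seconds.
avg_utterance_seconds = TOTAL_HOURS * 3600 / APPROX_UTTERANCES

# Average amount of material per speaker.
avg_utterances_per_speaker = APPROX_UTTERANCES / total_speakers
avg_minutes_per_speaker = TOTAL_HOURS * 60 / total_speakers

print(f"speakers: {total_speakers}")
print(f"average utterance length: {avg_utterance_seconds:.1f} s")
print(f"average utterances per speaker: {avg_utterances_per_speaker:.0f}")
print(f"average speech per speaker: {avg_minutes_per_speaker:.0f} min")
```

Under that assumption, utterances are short (a few seconds each) and each speaker contributes roughly half an hour of speech.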
  
  ### Evaluation data
  Currently we release AISHELL2-2018A-EVAL, containing:
  * dev: 2500 utterances from 5 speakers
  * test: 5000 utterances from 10 speakers
  
  Both sets are available across the three channel conditions.
  
  Those interested can download the sets from [here](http://www.aishelltech.com/aishell_eval). Note that we may update and release other evaluation sets on the website later, targeting different applications and scenarios.
  
  ## 2. RECIPE
  Based on the standard Kaldi system, AISHELL-2 provides a self-contained Mandarin ASR recipe, with:
  * a word segmentation module, which is a must-have component for Chinese ASR systems
  * an open-source Mandarin lexicon (DaCiDian, available [here](https://github.com/aishell-foundation/DaCiDian))
  * a simplified GMM training and alignment generation recipe (it stops at the speaker-independent stage)
  * an LF-MMI TDNN training and decoding recipe
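
Word segmentation is needed because Chinese text is written without spaces between words, so transcripts have to be split into lexicon words before they can be used for training. The sketch below illustrates the idea with greedy forward maximum matching. This is a stand-in technique chosen for brevity, not necessarily what the recipe's segmentation module does, and `segment`, `max_word_len` and `TOY_LEXICON` are hypothetical names that are not part of the recipe or of DaCiDian.

```python
def segment(text, lexicon, max_word_len=4):
    """Greedy forward maximum matching: at each position, take the longest lexicon word."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; a single character is always accepted
        # as a fallback so that out-of-lexicon characters do not stall the loop.
        for length in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in lexicon:
                words.append(candidate)
                i += length
                break
    return words


# Toy lexicon for illustration only; a real system would load a full lexicon.
TOY_LEXICON = {"中文", "语音", "识别", "语音识别", "系统"}

print(" ".join(segment("中文语音识别系统", TOY_LEXICON)))
```

Greedy matching is easy to follow but can choose the wrong split when words overlap, which is one reason practical segmenters usually rely on statistical models.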
  
  # REFERENCE
  We released a [paper on arXiv](https://arxiv.org/abs/1808.10583) with a more detailed description of the corpus and some preliminary results. If you use AISHELL-2 in your experiments, please cite the paper as follows:
  ```
  @ARTICLE{aishell2,
     author = {{Du}, J. and {Na}, X. and {Liu}, X. and {Bu}, H.},
     title = "{AISHELL-2: Transforming Mandarin ASR Research Into Industrial Scale}",
     journal = {ArXiv},
     eprint = {1808.10583},
     primaryClass = "cs.CL",
     year = 2018,
     month = Aug,
  }
  ```
  
  # APPLY FOR DATA/CONTACT
  The AISHELL foundation is a non-profit online organization, with members from the speech industry and research institutes.

  We hope the AISHELL-2 corpus and recipe will benefit the entire speech community.

  Depending on your location and internet speed, we distribute the corpus in one of two ways:
  * hard-disk delivery
  * cloud-disk downloading
  
  To apply for the AISHELL-2 corpus free of charge, fill in a simple application form confirming that:
  * your university department / educational institute information has been fully provided
  * the data will be used only for non-commercial research / education

  The AISHELL foundation covers all data distribution fees (including the corpus, hard-disk cost, etc.).
  
  Data re-distribution inside your university department is OK for convenience. However, users are not supposed to re-distribute the data to other universities or educational institutes.
  
  To get the application form, or if you come across any problem with the recipe, contact us at:
  
  aishell.foundation@gmail.com