# AISHELL-2

AISHELL-2 is by far the largest free speech corpus available for Mandarin ASR research.
## 1. DATA
### Training data
* 1000 hours of speech data (around 1 million utterances)
* 1991 speakers (845 male and 1146 female)
* clean recording environment (studio or quiet living room)
* read speech
* reading prompts from various domain: entertainment, finance, technology, sports, control command, place of interest etc.
* near field recording via 3 parallel channels (iOS, Android, Microphone).
* iOS data is free for non-commercial research and education use (e.g. universities and non-commercial institutes)

### Evaluation data:
Currently we release AISHELL2-2018A-EVAL, containing:
* dev: 2500 utterances from 5 speakers
* test: 5000 utterances from 10 speakers

Both sets are available across the three channel conditions.

One of interest can download the sets from [here](http://www.aishelltech.com/aishell_eval). Note that we may update and release other evaluation sets on the website later, targeting on different applications and senarios.

## 2. RECIPE
Based on Kaldi standard system, AISHELL-2 provides a self-contained Mandarin ASR recipe, with:
* a word segmentation module, which is a must-have component for Chinese ASR systems
* an open-sourced Mandarin lexicon (DaCiDian, open-sourced at [here](https://github.com/aishell-foundation/DaCiDian))
* Simplified GMM training & alignment generating recipe (we stopped at speaker independent stage)
* LFMMI TDNN training and decoding recipe

# REFERENCE
We released a [paper on Arxiv](https://arxiv.org/abs/1808.10583) on a more detailed description about the corpus with some preliminary resulting numbers. If one would like to use AISHELL-2 in experiments, please cite the paper as below:
```
@ARTICLE{aishell2,
   author = {{Du}, J. and {Na}, X. and {Liu}, X. and {Bu}, H.},
   title = "{AISHELL-2: Transforming Mandarin ASR Research Into Industrial Scale}",
   journal = {ArXiv},
   eprint = {1808.10583},
   primaryClass = "cs.CL",
   year = 2018,
   month = Aug,
}
```

# APPLY FOR DATA/CONTACT
AISHELL foundation is a non-profit online organization, with members from speech industry and research institutes.

We hope AISHELL-2 corpus and recipe could be beneficial to the entire speech community.

Depends on your location and internet speed, we distribute the corpus in two ways:
* hard-disk delivery
* cloud-disk downloading

To apply for AISHELL-2 corpus for free, you need to fill in a very simple application form, confirming that:
* university department / educational institute information has been fully provided
* only for non-commercial research / education use

AISHELL-foundation covers all data distribution fees (including the corpus, hard-disk cost etc)

Data re-distribution inside your university department is OK for convenience. However, users are not supposed to re-distribute the data to other universities or educational institutes.

To get the application form, or you come across any problem with the recipe, contact us via:

aishell.foundation@gmail.com