  Summary
  -------
  The data comprise over 41 hours of speech in English.
  
  The English recordings were collected from humans interacting via telephone 
  calls with statistical dialogue systems, designed to provide the user 
  with information on a suitable dining venue in the town.
  
  The data collection process is described in detail
  in the article "Free English and Czech telephone speech corpus shared under the CC-BY-SA 3.0 license",
  published at LREC 2014 (to appear).
  
  We use the standard Kaldi decoders in these scripts (gmm-latgen-faster, via steps/decode.sh).
  However, the main purpose of providing the data and scripts
  is to train acoustic models for the real-time speech recognition unit
  of the ALEX dialogue system, which uses a modified real-time Kaldi OnlineLatgenRecogniser.
  The modified Kaldi decoders are NOT required for running the scripts!
  
  The modified OnlineLatgenRecogniser is actively developed at
  https://github.com/UFAL-DSG/pykaldi/tree/master/src/onl-rec
  and has a Python wrapper:
  https://github.com/UFAL-DSG/pykaldi/tree/master/src/pykaldi
  Note that I am currently moving the online recogniser to:
  http://sourceforge.net/p/kaldi/code/HEAD/tree/sandbox/oplatek2/
  
  Credits and license
  ------------------------
  The scripts are partially based on the Voxforge Kaldi recipe.
  Both the original scripts and these scripts are licensed under the Apache 2.0 license.
  The data are distributed under the Creative Commons Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0) license.
  Czech data: https://lindat.mff.cuni.cz/repository/xmlui/handle/11858/00-097C-0000-0023-4670-6
  English data: https://lindat.mff.cuni.cz/repository/xmlui/handle/11858/00-097C-0000-0023-4671-4
  
  The data collection process and the development of these training scripts
  were partly funded by the Ministry of Education, Youth and Sports
  of the Czech Republic under grant agreement LK11221
  and by the core research funding of Charles University in Prague.
  When citing, please use the following BibTeX entry:
  
  @inproceedings{korvas_2014,
    title={{Free English and Czech telephone speech corpus shared under the CC-BY-SA 3.0 license}},
    author={Korvas, Mat\v{e}j and Pl\'{a}tek, Ond\v{r}ej and Du\v{s}ek, Ond\v{r}ej and \v{Z}ilka, Luk\'{a}\v{s} and Jur\v{c}\'{i}\v{c}ek, Filip},
    booktitle={Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC 2014)},
    pages={To Appear},
    year={2014},
  }
  
  
  Expected results
  ----------------
  The expected results were obtained simply by running
  bash train_voip_cs.sh or bash train_voip_en.sh.
  Note that you need SRILM installed on your PATH or in the kaldi/tools/ directory!
  See s5/RESULTS. The following notation is used:
  
      build2 - bigram LM estimated from the training data by the scripts using SRILM
      build0 - zerogram LM estimated from the test data by the scripts using Python code
      LMW - language model weight; we pick the best value from (min_lmw, max_lmw) based on decoding results on the DEV set
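  To illustrate what the build0 zerogram LM looks like, here is a minimal Python
  sketch that renders a uniform unigram LM in ARPA format from a word list. This
  is a hedged reconstruction, not the recipe's actual code; the function name and
  formatting details are assumptions.

```python
import math

def zerogram_arpa(words):
    """Render a zerogram (uniform unigram) LM in ARPA format.

    Every word and the end-of-sentence symbol </s> get the same
    probability 1/V, so the LM expresses no preference between words.
    <s> gets the conventional dummy log-probability -99 because it is
    only conditioned on, never predicted.
    """
    vocab = sorted(set(words))
    v = len(vocab) + 1                 # real words plus </s>
    logp = math.log10(1.0 / v)         # uniform log10 probability
    lines = ["\\data\\", "ngram 1=%d" % (v + 1), "", "\\1-grams:"]
    lines.append("-99\t<s>")
    lines.append("%.6f\t</s>" % logp)
    for w in vocab:
        lines.append("%.6f\t%s" % (logp, w))
    lines += ["", "\\end\\", ""]
    return "\n".join(lines)

print(zerogram_arpa(["yes", "no", "maybe"]))
```

  Such an LM is useful as a worst-case baseline: decoding accuracy then reflects
  the acoustic model alone.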
  
  
  Details
  -------
  * Requires a Kaldi installation and a Linux environment. (Tested on Ubuntu 10.04, 12.04 and 12.10.)
  * The config file s5/env_voip_en.sh sets the data directory,
    the MFCC directory and the experiments directory.
  * Our scripts prepare the data into the expected format in s5/data.
  * Experiment files are stored in the $exp directory, e.g. s5/exp.
  * The local directory contains data-preparation scripts that build the
    lang directory.
  * path.sh, cmd.sh and common/* contain configurations for the
    recipe.
  * The language model (LM) is either built from the training data using
    SRILM (http://www.speech.sri.com/projects/srilm/), or you can supply
    one in the ARPA format.
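  As a sketch, the build2 bigram LM could be estimated along these lines. This is
  a hedged illustration only: the input/output paths and the Witten-Bell
  smoothing flag are assumptions, not necessarily what the recipe uses.

```shell
order=2                        # bigram, as in the "build2" LM
lm=data/local/lm/build2.arpa   # output path (an assumption of this sketch)
if command -v ngram-count >/dev/null 2>&1; then
  # -text expects one transcription per line; -wbdiscount = Witten-Bell smoothing
  ngram-count -text data/train/text -order "$order" -wbdiscount -lm "$lm"
else
  echo "SRILM not found on PATH; install it under kaldi/tools/" >&2
fi
```

  The resulting ARPA file can be swapped for any LM of your own in the same format.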
  
  
  Running experiments
  -------------------
  Before running the experiments, check that:
  
  * you have the Kaldi toolkit compiled: 
    http://sourceforge.net/projects/kaldi/.
  * you have SRILM compiled. (This is needed for building a language model 
    unless you supply your own LM in the ARPA format.) 
    See http://www.speech.sri.com/projects/srilm/.
  * the number of jobs, njobs, is set correctly in path.sh.
  * in cmd.sh, you have switched to running the training on an SGE[*] grid if
    required (disabled by default).
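  The checklist above can be sketched as a small pre-flight script, run from the
  s5 directory. The file names are taken from this README; the check itself is
  only an illustration.

```shell
# Warn about anything the recipe will need but cannot find.
required="path.sh cmd.sh env_voip_en.sh run.sh"  # scripts the recipe expects
for f in $required; do
  [ -f "$f" ] || echo "missing: $f" >&2
done
# SRILM is needed unless you supply your own ARPA LM.
command -v ngram-count >/dev/null 2>&1 || echo "note: SRILM not on PATH" >&2
```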
  
  Start the recipe from the s5 directory by running
  bash run.sh.
  It will create the s5/mfcc, s5/data and s5/exp directories.
  If any of them exists, the script asks whether you want it to be overwritten.
  
  .. [*] Sun Grid Engine
  
  Extracting the results and trained models
  -----------------------------------------
  The main script, s5/run.sh,
  performs not only training of the acoustic models but also decoding.
  The acoustic models are evaluated after training, and
  reports are printed to the standard output.
  
  The command s5/local/results.py exp extracts the results from the $exp directory
  and stores them in exp/results.log.
  
  If you want to use the trained acoustic model with your own language model
  outside the prepared scripts, you need to build the HCLG decoding graph yourself.
  See http://kaldi-asr.org/doc/graph.html for a general introduction to the FST
  framework in Kaldi.
  
  The simplest way to start decoding is to use the same LM that
  was used by the s5/run.sh script.
  Let's say you want to decode with
  the acoustic model stored in exp/tri2b_bmmi;
  then you need the files listed below:
  
  ================================  ====================================================================================
  mfcc.conf                         Speech parametrisation (MFCC) settings. The training and decoding setups must match.
  exp/tri2b_bmmi/graph/HCLG.fst     Decoding graph: the graph part of the AM plus the lexicon, the phone-to-triphone
                                    context dependency and the LM representation.
  exp/tri2b_bmmi/graph/words.txt    Word symbol table: a mapping between words and the integer ids that are decoded.
  exp/tri2b_bmmi/graph/silence.csl  List of integer phone ids representing silence phones.
  exp/tri2b_bmmi/final.mdl          Trained acoustic model (AM).
  exp/tri2b_bmmi/final.mat          Trained feature-space transformation matrix (e.g. LDA+MLLT).
  ================================  ====================================================================================
  
  
  We recommend studying the standard Kaldi script steps/decode.sh
  for standalone decoding with the gmm-latgen-faster Kaldi decoder.
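  A standalone decoding run with the files from the table above might look as
  follows. This is a hedged sketch only: the beams, the acoustic scale, the wav
  list name and the feature pipeline are illustrative guesses (for an LDA+MLLT
  model the raw MFCCs usually have to be spliced and transformed by final.mat),
  so check steps/decode.sh for the recipe's actual settings.

```shell
model=exp/tri2b_bmmi
# Feature pipeline: MFCCs -> frame splicing -> final.mat transform (assumed).
feats="ark:compute-mfcc-feats --config=mfcc.conf scp:test_wav.scp ark:- |"
feats="$feats splice-feats ark:- ark:- |"
feats="$feats transform-feats $model/final.mat ark:- ark:- |"
if command -v gmm-latgen-faster >/dev/null 2>&1; then
  gmm-latgen-faster --beam=13.0 --lattice-beam=6.0 --acoustic-scale=0.0833 \
    --word-symbol-table=$model/graph/words.txt \
    $model/final.mdl $model/graph/HCLG.fst "$feats" \
    "ark:|gzip -c > lat.1.gz"
fi
```

  The lattices written to lat.1.gz can then be rescored or reduced to one-best
  transcriptions with the usual Kaldi lattice tools.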
  
  In order to build your own HCLG decoding graph,
  you need an LM in ARPA format and the files in the table below.
  
  * Note 1: Building the HCLG decoding graph is out of the scope of this README.
  * Note 2: Each acoustic model needs a corresponding HCLG graph.
  * Note 3: The phonetic dictionary applied to the vocabulary
    should generate only a subset of the phones seen in the training data!
  
  =========================  ===============================================================
  LM.arpa                    Language model in ARPA format. [You supply it.]
  vocabulary.txt             List of words you want to decode. [You supply it.]
  OOV_SYMBOL                 String representing an out-of-vocabulary word. [You supply it.]
  dictionary.txt             Phonetic dictionary. [You supply it.]
  exp/tri2b_bmmi/final.mdl   Trained acoustic model (AM).
  exp/tri2b_bmmi/final.tree  Phonetic decision tree.
  =========================  ===============================================================
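  Although building HCLG is out of the scope of this README, the usual Kaldi
  route with the files above looks roughly like this. A hedged outline only: the
  directory names and the OOV symbol are assumptions, and the exact arpa2fst
  invocation depends on your Kaldi version.

```shell
dict=data/local/dict_own   # dir with lexicon.txt derived from dictionary.txt
lang=data/lang_own
oov="_SIL_"                # OOV_SYMBOL from the table; use your own value
if [ -x utils/prepare_lang.sh ] && [ -x utils/mkgraph.sh ]; then
  # Build L.fst, words.txt etc. from the dictionary directory.
  utils/prepare_lang.sh "$dict" "$oov" data/local/lang_own_tmp "$lang"
  # Convert the ARPA LM into G.fst inside the lang directory.
  arpa2fst --disambig-symbol='#0' \
    --read-symbol-table="$lang/words.txt" LM.arpa "$lang/G.fst"
  # Compose lexicon, context and LM with the model into the decoding graph.
  utils/mkgraph.sh "$lang" exp/tri2b_bmmi exp/tri2b_bmmi/graph_own
fi
```

  The resulting graph_own directory then plays the role of exp/tri2b_bmmi/graph
  in the decoding table above.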