Summary
-------
The data comprise over 15 hours of speech in Czech.

The Czech recordings were collected in three ways:

1. using a free Call Friend phone service
2. using the Repeat After Me speech data collecting process
3. from telephone interactions with the PublicTransportInfo Spoken Dialog System (SDS)
   Alex: http://ufal.ms.mff.cuni.cz/alex-dialogue-systems-framework/.

The data collection process is described in detail
in the article "Free English and Czech telephone speech corpus shared under the CC-BY-SA 3.0 license",
published at LREC 2014 (to appear).

The scripts use the common Kaldi decoders (gmm-latgen-faster through steps/decode.sh).
However, the main purpose of providing the data and scripts
is to train acoustic models for the real-time speech recognition unit
of the ALEX dialog system, which uses the modified real-time Kaldi OnlineLatgenRecogniser.
The modified Kaldi decoders are NOT required for running the scripts!

The modified OnlineLatgenRecogniser is actively developed at 
https://github.com/UFAL-DSG/pykaldi/tree/master/src/onl-rec
and has a Python wrapper:
https://github.com/UFAL-DSG/pykaldi/tree/master/src/pykaldi
Note that I am currently moving the online recogniser to:
http://sourceforge.net/p/kaldi/code/HEAD/tree/sandbox/oplatek2/

Credits and license
------------------------
The scripts are partially based on the Voxforge Kaldi recipe.
The original scripts as well as these scripts are licensed under the Apache 2.0 license.
The data are distributed under the Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0) license.
Czech data: https://lindat.mff.cuni.cz/repository/xmlui/handle/11858/00-097C-0000-0023-4670-6
English data: https://lindat.mff.cuni.cz/repository/xmlui/handle/11858/00-097C-0000-0023-4671-4

The data collection process and the development of these training scripts
were partly funded by the Ministry of Education, Youth and Sports
of the Czech Republic under the grant agreement LK11221
and by core research funding of Charles University in Prague.
To cite this work, please use the following BibTeX entry:

@inproceedings{korvas_2014,
  title={{Free English and Czech telephone speech corpus shared under the CC-BY-SA 3.0 license}},
  author={Korvas, Mat\v{e}j and Pl\'{a}tek, Ond\v{r}ej and Du\v{s}ek, Ond\v{r}ej and \v{Z}ilka, Luk\'{a}\v{s} and Jur\v{c}\'{i}\v{c}ek, Filip},
  booktitle={Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC 2014)},
  pages={To Appear},
  year={2014},
}


Expected results
----------------
The expected results were obtained simply by running
bash train_voip_cs.sh OR bash train_voip_en.sh.
Note that you need SRILM installed on your PATH or in the kaldi/tools/ directory!
See s5/RESULTS. The following notation is used:

    build2 - bigram LM from train data, estimated by the scripts using SRILM
    build0 - zerogram LM from test data, estimated by the scripts using Python code
    LMW - language model weight; we picked the best value from the range (min_lmw, max_lmw) based on decoding results on the DEV set
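
As a rough illustration of what a zerogram LM is, the sketch below writes
a uniform unigram model in ARPA format, where every word in the vocabulary
gets the same log10 probability. It is not the recipe's own code: the
function name zerogram_arpa and the sample transcripts are hypothetical,
and sentence-boundary symbols are left out for brevity.

```python
import math


def zerogram_arpa(words):
    """Return a uniform unigram ("zerogram") LM as an ARPA-format string."""
    vocab = sorted(set(words))
    if not vocab:
        raise ValueError("vocabulary is empty")
    # Every word is equally likely; ARPA stores log10 probabilities.
    logprob = math.log10(1.0 / len(vocab))
    lines = ["\\data\\", f"ngram 1={len(vocab)}", "", "\\1-grams:"]
    lines += [f"{logprob:.6f}\t{word}" for word in vocab]
    lines += ["", "\\end\\"]
    return "\n".join(lines) + "\n"


# Hypothetical transcripts standing in for the test-set text.
transcripts = ["ahoj jak se máš", "jak se jmenuješ"]
words = [w for line in transcripts for w in line.split()]
print(zerogram_arpa(words))
```

Because the model carries no information beyond the vocabulary, results
obtained with it mostly reflect the quality of the acoustic model.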


Details
-------
* Requires Kaldi installation and Linux environment. (Tested on Ubuntu 10.04, 12.04 and 12.10.)
* The config file s5/env_voip_cs.sh sets the data directory,
  mfcc directory and experiments directory.
* Our scripts convert the data to the expected format in s5/data.
* Experiment files are stored in the $exp directory, e.g. s5/exp.
* The local directory contains the data preparation scripts that build
  the lang directory.
* path.sh, cmd.sh and common/* contain configurations for the
  recipe.
* The language model (LM) is either built from the training data using
  SRILM (http://www.speech.sri.com/projects/srilm/) or we supply one in
  the ARPA format.


Running experiments
-------------------
Before running the experiments, check that:

* you have the Kaldi toolkit compiled: 
  http://sourceforge.net/projects/kaldi/.
* you have SRILM compiled. (This is needed for building a language model 
  unless you supply your own LM in the ARPA format.) 
  See http://www.speech.sri.com/projects/srilm/.
* the number of jobs (njobs) is set correctly in path.sh.
* in cmd.sh, you have switched to running the training on an SGE[*] grid if
  required (disabled by default).

Start the recipe from the s5 directory by running 
bash run.sh.
It will create s5/mfcc, s5/data and s5/exp directories.
If any of them exists, it will ask you if you want them to be overwritten.

.. [*] Sun Grid Engine
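
Part of the checklist above can be automated. The sketch below only tests
whether the needed binaries are visible on PATH; it assumes the binary
names ngram-count (shipped with SRILM) and gmm-latgen-faster (shipped with
Kaldi), and the helper missing_tools is hypothetical. It will report SRILM
as missing if you installed it under kaldi/tools/ without adding it to
PATH, and Kaldi binaries are usually only on PATH after sourcing path.sh.

```python
import shutil


def missing_tools(tools):
    """Return the names from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


# Assumed binary names: ngram-count (SRILM), gmm-latgen-faster (Kaldi).
REQUIRED = ["ngram-count", "gmm-latgen-faster"]

missing = missing_tools(REQUIRED)
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("All required tools found on PATH.")
```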

Extracting the results and trained models
-----------------------------------------
The main script, s5/run.sh,
performs not only training of the acoustic models, but also decoding.
The acoustic models are evaluated after training, and
reports are printed to the standard output.

The s5/local/results.py exp command extracts the results from the $exp directory
and stores them in exp/results.log.
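
How results.py works internally is not described here. As a minimal
sketch of the idea, assuming the scoring output follows the usual Kaldi
"%WER ..." line format with one score per language model weight (LMW),
the code below parses the word error rates and picks the LMW with the
lowest one. The helper names and all numbers are made up for
illustration; they are not results of this recipe.

```python
import re

WER_RE = re.compile(r"%WER\s+([0-9.]+)")


def parse_wer(text):
    """Return the WER in percent from the first '%WER' line, or None."""
    match = WER_RE.search(text)
    return float(match.group(1)) if match else None


def best_lmw(wer_by_lmw):
    """Return the (LMW, WER) pair with the lowest WER; ties go to the smaller LMW."""
    return min(wer_by_lmw.items(), key=lambda kv: (kv[1], kv[0]))


# Made-up DEV-set scoring lines keyed by LMW; not real results.
dev_scores = {
    9: "%WER 25.10 [ 251 / 1000, 30 ins, 40 del, 181 sub ]",
    10: "%WER 24.30 [ 243 / 1000, 28 ins, 42 del, 173 sub ]",
    11: "%WER 24.90 [ 249 / 1000, 25 ins, 47 del, 177 sub ]",
}
wers = {lmw: parse_wer(text) for lmw, text in dev_scores.items()}
lmw, wer = best_lmw(wers)
print(f"best LMW on DEV: {lmw} (WER {wer:.2f}%)")
```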

If you want to use the trained acoustic model with your own language model
outside the prepared script, you need to build the HCLG decoding graph yourself.
See http://kaldi-asr.org/doc/graph.html for a general introduction to the FST
framework in Kaldi.

The simplest way to start decoding is to use the same LM that
was used by the s5/run.sh script.
For example, to decode with
the acoustic model stored in exp/tri2b_bmmi,
you need the files listed below:

================================= =====================================================================================
mfcc.conf                          Speech parametrisation (MFCC) settings. Training and decoding setup must match.
exp/tri2b_bmmi/graph/HCLG.fst      Decoding Graph. Graph part of AM plus lexicon, phone->3phone & LM representation.
exp/tri2b_bmmi/graph/words.txt     Word symbol table, a mapping between words and integers which are decoded.
exp/tri2b_bmmi/graph/silence.csl   List of phone integer ids, which represent silent phones. 
exp/tri2b_bmmi/final.mdl           Trained acoustic model (AM).
exp/tri2b_bmmi/final.mat           Trained matrix of feature/space transformations (e.g. LDA and bMMI).
================================= =====================================================================================
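
Two of the files in the table are small text files that are easy to
inspect. The sketch below assumes the usual Kaldi conventions: words.txt
holds one "word id" pair per line, and silence.csl holds a single line of
colon-separated integer phone ids. The helper names and the sample
contents are hypothetical.

```python
def parse_words(text):
    """Parse a word symbol table ('word id' per line) into {id: word}."""
    table = {}
    for line in text.splitlines():
        if line.strip():
            word, idx = line.split()
            table[int(idx)] = word
    return table


def parse_csl(text):
    """Parse a colon-separated list of integer phone ids."""
    return [int(item) for item in text.strip().split(":") if item]


# Hypothetical file contents, inlined so the example runs anywhere.
words_txt = "<eps> 0\nahoj 1\njak 2\n"
silence_csl = "1:2:3:4:5\n"

id2word = parse_words(words_txt)
decoded_ids = [1, 2]  # integer word ids as a decoder would output them
print(" ".join(id2word[i] for i in decoded_ids))
print(parse_csl(silence_csl))
```

The decoder outputs integer word ids, so a mapping like id2word is what
turns the recognition result back into text.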


We recommend studying the standard Kaldi script steps/decode.sh
for standalone decoding with the gmm-latgen-faster Kaldi decoder.

In order to build your own HCLG decoding graph,
you need an LM in ARPA format and the files in the table below.

* Note 1: Building the HCLG decoding graph is out of the scope of this README.
* Note 2: Each acoustic model needs its own corresponding HCLG graph.
* Note 3: The phonetic dictionary applied to the vocabulary
  should always generate only a subset of the phones seen in the training data!

===============================  =========================================================================
LM.arpa                           Language model in ARPA format [You should supply it]
vocabulary.txt                    List of words you want to decode [You should supply it]
OOV_SYMBOL                        String representing out of vocabulary word. [You should supply it]
dictionary.txt                    Phonetic dictionary. [You should supply it]
exp/tri2b_bmmi/final.mdl          Trained acoustic model (AM).
exp/tri2b_bmmi/final.tree         Phonetic decision tree.
===============================  =========================================================================
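
The constraint in Note 3 can be checked before building the graph. The
sketch below assumes that dictionary.txt uses the common
"word phone1 phone2 ..." layout, one pronunciation per line, and that you
already have the set of phones seen in training; the helper names and the
sample data are hypothetical.

```python
def dictionary_phones(dictionary_text):
    """Collect every phone used in a 'word phone1 phone2 ...' dictionary."""
    phones = set()
    for line in dictionary_text.splitlines():
        fields = line.split()
        phones.update(fields[1:])  # the first field is the word itself
    return phones


def unseen_phones(dictionary_text, training_phones):
    """Return the sorted dictionary phones that never occurred in training."""
    return sorted(dictionary_phones(dictionary_text) - set(training_phones))


# Hypothetical phone set and dictionary, inlined so the example runs anywhere.
training_phones = {"a", "h", "o", "j", "k", "SIL"}
dictionary = "ahoj a h o j\njak j a k\nxena x e n a\n"

print(unseen_phones(dictionary, training_phones))
```

An empty list means the dictionary is safe to use with the trained
acoustic model; any phone listed has no trained model behind it.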