  // doc/data_prep.dox
  
  // Copyright 2012  Johns Hopkins University (author: Daniel Povey)
  
  // See ../../COPYING for clarification regarding multiple authors
  //
  // Licensed under the Apache License, Version 2.0 (the "License");
  // you may not use this file except in compliance with the License.
  // You may obtain a copy of the License at
  
  //  http://www.apache.org/licenses/LICENSE-2.0
  
  // THIS CODE IS PROVIDED *AS IS* BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
  // KIND, EITHER EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION ANY IMPLIED
  // WARRANTIES OR CONDITIONS OF TITLE, FITNESS FOR A PARTICULAR PURPOSE,
// MERCHANTABILITY OR NON-INFRINGEMENT.
  // See the Apache 2 License for the specific language governing permissions and
  // limitations under the License.
  
  /**
   \page data_prep  Data preparation
  
    \section data_prep_intro Introduction
  
    After running the example scripts (see \ref tutorial), you may want to set up
    Kaldi to run with your own data.  This section explains how to prepare the data.
    This page will assume that you are using the latest version of the example scripts
    (typically named "s5" in the example directories, e.g. egs/rm/s5/).
    In addition to this page, you can refer to the data preparation scripts in those
   directories.  The top-level run.sh scripts (e.g. egs/rm/s5/run.sh) have a few commands at
   the top of them that relate to various phases of data preparation.  The parts in
   the sub-directory named local/ are always specific to the database.  For example,
   in the Resource Management (RM) setup it is local/rm_data_prep.sh.  In the case of
   RM these commands are:
  \verbatim
  local/rm_data_prep.sh /export/corpora5/LDC/LDC93S3A/rm_comp || exit 1;
  
  utils/prepare_lang.sh data/local/dict '!SIL' data/local/lang data/lang || exit 1;
  
  local/rm_prepare_grammar.sh || exit 1;
  \endverbatim
  
  In the WSJ case the commands are:
  \verbatim
  
  wsj0=/export/corpora5/LDC/LDC93S6B
  wsj1=/export/corpora5/LDC/LDC94S13B
  
  local/wsj_data_prep.sh $wsj0/??-{?,??}.? $wsj1/??-{?,??}.?  || exit 1;
  
  local/wsj_prepare_dict.sh || exit 1;
  
  utils/prepare_lang.sh data/local/dict "<SPOKEN_NOISE>" data/local/lang_tmp data/lang || exit 1;
  
  local/wsj_format_data.sh || exit 1;
  \endverbatim
  There are more commands after these in the WSJ script that relate to training language
  models locally (rather than using the ones supplied by LDC), but the ones above are the most
  important ones.
  
  
  The output of the data preparation stage consists of two sets of things.  One relates
  to "the data" (directories like data/train/) and one relates to "the language"
  (directories like data/lang/).  The "data" part relates to the specific recordings you
  have, and the "lang" part contains things that relate more to the language itself,
  such as the lexicon, the phone set, and various extra information about the phone set
  that Kaldi needs.  If you want to prepare data which you will decode with an
  already existing system and an already existing language model, the "data" part is
  all you need to touch.
  
  \section data_prep_data Data preparation-- the "data" part.
  
  As an example of the "data" part of the data preparation, look at the directory
  "data/train" in one of the example directories (assuming you have already run
  the scripts there).  Note: there is nothing special about the directory name
  "data/train".  There are other directories such as "data/eval2000" (for a test set)
  that have essentially the same format ("essentially" because we may have an "stm" and
  "glm" file in the test directory, to enable sclite scoring).
The specific example we'll look at is from the Switchboard recipe
in egs/swbd/s5.
  \verbatim
  s5# ls data/train
  cmvn.scp  feats.scp  reco2file_and_channel  segments  spk2utt  text  utt2spk  wav.scp
  \endverbatim
  Not all of the files are equally important.  For a simple setup where there is no
  "segmentation" information (i.e. each utterance corresponds to a single file), the only
  files you have to create yourself are "utt2spk", "text" and "wav.scp" and possibly
  "segments" and "reco2file_and_channel", and the rest will be created by standard scripts.
  
  We will describe the files in this directory, starting with the files you need to create
  yourself.
  
  \subsection data_prep_data_yourself Files you need to create yourself
  
   The file "text" contains the transcriptions of each utterance.
  \verbatim
  s5# head -3 data/train/text
  sw02001-A_000098-001156 HI UM YEAH I'D LIKE TO TALK ABOUT HOW YOU DRESS FOR WORK AND
  sw02001-A_001980-002131 UM-HUM
  sw02001-A_002736-002893 AND IS
  \endverbatim
  The first element on each line is the utterance-id, which is an arbitrary text string,
  but if you have speaker information in your setup, you should make the speaker-id a
  prefix of the utterance id; this is important for reasons relating to the sorting of
  these files.  The rest of the line is the transcription of each sentence.  You don't
  have to make sure that all words in this file are in your vocabulary; out of vocabulary words will
  get mapped to a word specified in the file data/lang/oov.txt.
  
It needs to be the case that when you sort both the utt2spk and spk2utt files,
the orders "agree", e.g. the list of speaker-ids extracted from the utt2spk file
is the same as the string-sorted order of the speaker-ids themselves.  The easiest
way to make this happen is to make the speaker-ids a prefix of the utterance-ids.
Although in this particular example we have used an underscore to separate the
"speaker" and "utterance" parts of the utterance-id, in general it is probably
safer to use a dash ("-").  This is because the dash has a lower ASCII value; if
the speaker-ids vary in length, in certain cases the speaker-ids and their
corresponding utterance-ids can end up being sorted in different orders when
using the standard "C"-style ordering on strings, which will lead to a crash.
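To see why the separator matters, here is a small shell demonstration (the speaker-ids "abc" and "abc2" are invented for this example) of how underscore and dash sort differently under C-style string ordering:

```shell
# With an underscore separator, the utterance-id of speaker "abc"
# sorts AFTER that of "abc2", because '_' (0x5F) > '2' (0x32),
# disagreeing with the speaker order (abc before abc2):
printf 'abc_1\nabc2_1\n' | LC_ALL=C sort
# With a dash separator the orders agree, because '-' (0x2D) < '2':
printf 'abc-1\nabc2-1\n' | LC_ALL=C sort
```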
  Another important file is <DFN>wav.scp</DFN>.  In the Switchboard example,
  \verbatim
  s5# head -3 data/train/wav.scp
  sw02001-A /home/dpovey/kaldi-trunk/tools/sph2pipe_v2.5/sph2pipe -f wav -p -c 1 /export/corpora3/LDC/LDC97S62/swb1/sw02001.sph |
  sw02001-B /home/dpovey/kaldi-trunk/tools/sph2pipe_v2.5/sph2pipe -f wav -p -c 2 /export/corpora3/LDC/LDC97S62/swb1/sw02001.sph |
  \endverbatim
  The format of this file is
  \verbatim
  <recording-id> <extended-filename>
  \endverbatim
  where the "extended-filename" may be
  an actual filename, or as in this case, a command that extracts a wav-format file.  The pipe symbol
  on the end of the extended-filename specifies that it is to be interpreted as a pipe.  We will
explain what "recording-id" is below, but we would first like to point out that if the "segments" file
does not exist, the first token on each line of the "wav.scp" file is just the utterance-id.
  The files in wav.scp must be single-channel (mono); if the underlying wav files have multiple
  channels, then a sox command must be used in the wav.scp to extract a particular channel.
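For instance, a sketch of a wav.scp for a stereo recording (the file names and paths here are invented) might use sox's "remix" effect to pull out each channel, with the trailing pipe symbol marking each entry as a command:

```shell
# Write a two-entry example wav.scp; the paths are hypothetical.
cat > wav_example.scp <<'EOF'
rec001-A sox /path/to/rec001.wav -t wav - remix 1 |
rec001-B sox /path/to/rec001.wav -t wav - remix 2 |
EOF
cat wav_example.scp
```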
  
  In the Switchboard setup we have the "segments" file, so we'll discuss this next.
  \verbatim
  s5# head -3 data/train/segments
  sw02001-A_000098-001156 sw02001-A 0.98 11.56
  sw02001-A_001980-002131 sw02001-A 19.8 21.31
  sw02001-A_002736-002893 sw02001-A 27.36 28.93
  \endverbatim
  The format of the "segments" file is:
  \verbatim
  <utterance-id> <recording-id> <segment-begin> <segment-end>
  \endverbatim
  where the segment-begin and segment-end are measured in seconds.
  These specify time offsets into a recording.  The "recording-id"
  is the same identifier as is used in the "wav.scp" file-- again, this is
  an arbitrary identifier that you can choose.
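Since the begin and end times are measured in seconds, a quick sanity check on a segments file is to print each segment's duration; zero or negative values indicate swapped or malformed fields.  A sketch, using data mirroring the Switchboard lines shown above:

```shell
printf 'sw02001-A_000098-001156 sw02001-A 0.98 11.56\n' > segments_example
printf 'sw02001-A_001980-002131 sw02001-A 19.8 21.31\n' >> segments_example
# Print utterance-id and duration (end minus begin) for each segment:
awk '{ printf "%s %.2f\n", $1, $4 - $3 }' segments_example
```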
  The file "reco2file_and_channel" is only used when scoring (measuring
  error rates) with NIST's "sclite" tool:
  \verbatim
  s5# head -3 data/train/reco2file_and_channel
  sw02001-A sw02001 A
  sw02001-B sw02001 B
  sw02005-A sw02005 A
  \endverbatim
  The format is:
  \verbatim
  <recording-id> <filename> <recording-side (A or B)>
  \endverbatim
  The filename is typically the name of the .sph file, without the suffix, but in
  general it's whatever identifier you have in your "stm" file.
The recording side is a concept that relates to telephone conversations where there are
two channels; if your data is not of this kind, it's probably safe to use "A". If you don't have
  an "stm" file or you have no idea what this is all about, then you don't need
  the "reco2file_and_channel" file.
  
  The last file you need to create yourself is the "utt2spk" file.  This says, for each
  utterance, which speaker spoke it.
  \verbatim
  s5# head -3 data/train/utt2spk
  sw02001-A_000098-001156 2001-A
  sw02001-A_001980-002131 2001-A
  sw02001-A_002736-002893 2001-A
  \endverbatim
  The format is
  \verbatim
  <utterance-id> <speaker-id>
  \endverbatim
  Note that the speaker-ids don't need to correspond in any very accurate sense
  to the names of actual speakers-- they simply need to represent a reasonable guess.
  In this case we assume each conversation side (each side of the telephone conversation)
  corresponds to a single speaker.  This is not entirely true -- sometimes one person
  may hand the phone to another person, or the same person may be speaking in multiple
  calls -- but it's good enough for our purposes.  <b> If you have no information at all about
  the speaker identities, you can just make the speaker-ids the same as the utterance-ids </b>,
  so the format of the file would be just <DFN>\<utterance-id\> \<utterance-id\></DFN>.
  We have made the previous sentence bold because we have encountered people creating
  a "global" speaker-id.  This is a bad idea because it makes cepstral mean normalization
  ineffective in training (since it's applied globally), and because it will create problems
  when you use utils/split_data_dir.sh to split your data into pieces.
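If you do need to create such a one-to-one utt2spk, a one-line sketch (the file contents below are invented) is to take the utterance-ids from the first column of the "text" file:

```shell
# Hypothetical two-utterance "text" file:
printf 'utt1 HELLO THERE\nutt2 GOOD MORNING\n' > text_example
# Map each utterance-id to itself, keeping the file in C-sorted order:
awk '{ print $1, $1 }' text_example | LC_ALL=C sort > utt2spk_example
cat utt2spk_example
```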
  
  There is another file that exists in some setups; it is used only occasionally and
  not in the Kaldi system build.  We show what it looks like in the Resource Management
  (RM) setup:
  \verbatim
  s5# head -3 ../../rm/s5/data/train/spk2gender
  adg0 f
  ahh0 m
  ajp0 m
  \endverbatim
  This file maps from speaker-id to either "m" or "f" depending on the speaker gender.
  
  All of these files should be sorted.  If they are not sorted, you will get errors
  when you run the scripts.  In \ref io_sec_tables we explain why this is needed.
  It has to do with the I/O framework; the ultimate reason for the sorting is to
  enable something equivalent to random-access lookup on a stream that doesn't support
  fseek(), such as a piped command.  Many Kaldi programs are reading multiple pipes
  from other Kaldi commands, reading different types of object, and are doing something
  roughly comparable to merge-sort
  on the different inputs; merge-sort, of course, requires that the inputs be sorted.
  Be careful when you sort that you have the shell variable LC_ALL defined as "C",
  for example (in bash),
  \verbatim
  export LC_ALL=C
  \endverbatim
  If you don't do this, the files will be sorted in an order that's different from how
  C++ sorts strings, and Kaldi will crash.  You have been warned!
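You can check whether a file is already in the required order, without re-sorting it, using sort's -c option under the same locale (the file contents below are invented for the demonstration):

```shell
# A correctly C-sorted utt2spk fragment:
printf 'spk1-utt1 spk1\nspk1-utt2 spk1\nspk2-utt1 spk2\n' > utt2spk_check
# sort -c exits with status 0 only if the input is already sorted:
LC_ALL=C sort -c utt2spk_check && echo "utt2spk_check is sorted"
```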
  
  If your data consists of a test set from NIST that has an "stm" and a "glm" file
  provided so that you can measure WER, then you can put these files in the data
  directory with the names "stm" and "glm".  Note that we put the scoring
  script (which measures WER) in <DFN>local/score.sh</DFN>, which means it is
  specific to the setup; not all of the scoring scripts in all of the setups will
  recognize the stm and glm file.  An example of a scoring script that uses those files is
  the one the Switchboard setup, i.e. <DFN>egs/swbd/s5/local/score_sclite.sh</DFN>,
  which is invoked by the top-level scoring script
  <DFN>egs/swbd/s5/local/score.sh</DFN> if it notices that your test set has the
  stm and glm files.
  
  \subsection data_prep_data_noneed Files you don't need to create yourself
  
  The other files in this directory can be generated from the files you provide.
  You can create the "spk2utt" file by a command like the following
  (this one is extracted from egs/rm/s5/local/rm_data_prep.sh)
  \verbatim
  utils/utt2spk_to_spk2utt.pl data/train/utt2spk > data/train/spk2utt
  \endverbatim
  This is possible because the utt2spk and spk2utt files contain exactly
  the same information; the format of the spk2utt file is
  <DFN>\<speaker-id\> \<utterance-id1\> \<utterance-id2\> ...</DFN>.
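The conversion itself is a simple grouping operation; a minimal awk sketch of what utils/utt2spk_to_spk2utt.pl does (assuming a sorted utt2spk, so that each speaker's utterances are contiguous; the file contents are invented) is:

```shell
# Hypothetical sorted utt2spk with two speakers:
printf 'spk1-a spk1\nspk1-b spk1\nspk2-a spk2\n' > utt2spk_demo
# Group the utterance-ids (column 1) under their speaker-id (column 2):
awk '{ if ($2 != prev) { if (NR > 1) printf "\n"; printf "%s", $2; prev = $2 }
       printf " %s", $1 } END { printf "\n" }' utt2spk_demo > spk2utt_demo
cat spk2utt_demo
```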
  
  Next we come to the <DFN>feats.scp</DFN> file.
  \verbatim
  s5# head -3 data/train/feats.scp
  sw02001-A_000098-001156 /home/dpovey/kaldi-trunk/egs/swbd/s5/mfcc/raw_mfcc_train.1.ark:24
  sw02001-A_001980-002131 /home/dpovey/kaldi-trunk/egs/swbd/s5/mfcc/raw_mfcc_train.1.ark:54975
  sw02001-A_002736-002893 /home/dpovey/kaldi-trunk/egs/swbd/s5/mfcc/raw_mfcc_train.1.ark:62762
  \endverbatim
  This points to the extracted features-- MFCC features in this case, because
  that is what we use in this particular script.  The format is:
  \verbatim
  <utterance-id> <extended-filename-of-features>
  \endverbatim
  Each of the feature files contains a matrix, in Kaldi format.
  In this case the dimension of the matrix would be (the length of the file in 10ms intervals) by 13.
  The "extended filename" <DFN>/home/dpovey/kaldi-trunk/egs/swbd/s5/mfcc/raw_mfcc_train.1.ark:24</DFN>
  means, open the "archive" file /home/dpovey/kaldi-trunk/egs/swbd/s5/mfcc/raw_mfcc_train.1.ark, fseek()
  to position 24, and read the data that's there.
  
  This feats.scp file is created by the command
  \verbatim
  steps/make_mfcc.sh --nj 20 --cmd "$train_cmd" data/train exp/make_mfcc/train $mfccdir
  \endverbatim
  which is invoked by the top-level "run.sh" script.  For the definitions of the
  shell variables, see that script.  <DFN>\$mfccdir</DFN> is a user-specified directory where the
  .ark files will be written.
  
  The last file in the directory data/train is "cmvn.scp".  This contains statistics
  for cepstral mean and variance normalization, indexed by speaker.  Each set of
  statistics is a matrix, of dimension 2 by 14 in this case.  In our example, we have:
  \verbatim
  s5# head -3 data/train/cmvn.scp
  2001-A /home/dpovey/kaldi-trunk/egs/swbd/s5/mfcc/cmvn_train.ark:7
  2001-B /home/dpovey/kaldi-trunk/egs/swbd/s5/mfcc/cmvn_train.ark:253
  2005-A /home/dpovey/kaldi-trunk/egs/swbd/s5/mfcc/cmvn_train.ark:499
  \endverbatim
  Unlike feats.scp, this scp file is indexed by speaker-id, not utterance-id.
  This file is created by a command such as this:
  \verbatim
  steps/compute_cmvn_stats.sh data/train exp/make_mfcc/train $mfccdir
  \endverbatim
  (this example is from <DFN>egs/swbd/s5/run.sh</DFN>).
  
  Because errors in data preparation can cause problems later on, we have a script to
  check that the data directory is correctly formatted.  Run e.g.
  \verbatim
  utils/validate_data_dir.sh data/train
  \endverbatim
  You may also find the following command useful:
  \verbatim
  utils/fix_data_dir.sh data/train
  \endverbatim
  (of course the command will work for any data directory, not just data/train).  This
  script will fix sorting errors and will remove any utterances for which some required
  data, such as feature data or transcripts, is missing.
  
  \section data_prep_lang Data preparation-- the "lang" directory.
  
  Now we turn our attention to the "lang" directory.
  \verbatim
  s5# ls data/lang
  L.fst  L_disambig.fst  oov.int	oov.txt  phones  phones.txt  topo  words.txt
  \endverbatim
There may be other directories with a very similar format: in this case we have
  a directory "data/lang_test" that contains the same information but also a file
  G.fst that is a Finite State Transducer form of the language model:
  \verbatim
  s5# ls data/lang_test
  G.fst  L.fst  L_disambig.fst  oov.int  oov.txt	phones	phones.txt  topo  words.txt
  \endverbatim
  Note that lang_test/ was created by copying lang/ and adding G.fst.
  Each of these directories seems to contain only a few files.
  It's not quite as simple as this though, because "phones" is a directory:
  \verbatim
  s5# ls data/lang/phones
  context_indep.csl  disambig.txt         nonsilence.txt        roots.txt    silence.txt
  context_indep.int  extra_questions.int  optional_silence.csl  sets.int     word_boundary.int
  context_indep.txt  extra_questions.txt  optional_silence.int  sets.txt     word_boundary.txt
  disambig.csl       nonsilence.csl       optional_silence.txt  silence.csl
  \endverbatim
  The phones directory contains various bits of information about the phone set; there
  are three versions of some of the files, with extensions .csl, .int and .txt, that contain
  the same information in three formats.  Fortunately you, as a Kaldi user, don't have
  to create all of these files because we have a script "utils/prepare_lang.sh" that
  creates it all for you based on simpler inputs.  Before we describe that script
  and the simpler inputs it takes, we feel obligated to explain what is in the "lang" directory.
  After that we will explain the easy way to create it.  The user who is simply
  aiming to quickly build a system without needing to understand how Kaldi works
  may skip to \ref data_prep_lang_creating below.
  
  \section data_prep_lang_contents Contents of the "lang" directory
  
  First there are the files <DFN>phones.txt</DFN> and <DFN>words.txt</DFN>.  These
  are both symbol-table files, in the OpenFst format, where each line is
  the text form and then the integer form:
  \verbatim
  s5# head -3 data/lang/phones.txt
  <eps> 0
  SIL 1
  SIL_B 2
  s5# head -3 data/lang/words.txt
  <eps> 0
  !SIL 1
  -'S 2
  \endverbatim
  These files are used by Kaldi to map back and forth between the integer and
  text forms of these symbols.  They are mostly only accessed by the scripts
  utils/int2sym.pl and utils/sym2int.pl, and by the OpenFst programs fstcompile and
  fstprint.
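As a toy illustration (this is not Kaldi's actual scripts) of what such a lookup does, assuming the three phones.txt lines shown above:

```shell
printf '<eps> 0\nSIL 1\nSIL_B 2\n' > phones_demo.txt
# Text-to-integer lookup, as utils/sym2int.pl would do for one symbol:
awk -v sym=SIL_B '$1 == sym { print $2 }' phones_demo.txt
# Integer-to-text lookup, as utils/int2sym.pl would do:
awk -v id=1 '$2 == id { print $1 }' phones_demo.txt
```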
  
  The file <DFN>L.fst</DFN> is the Finite State Transducer form of the lexicon (L,
  see  <a href=http://www.cs.nyu.edu/~mohri/pub/hbka.pdf> "Speech Recognition
  with Weighted Finite-State Transducers" </a> by Mohri, Pereira and
Riley, in Springer Handbook on Speech Processing and Speech Communication, 2008),
with phone symbols on the input and word symbols on the output.  The file
  <DFN>L_disambig.fst</DFN> is the lexicon, as above but including the disambiguation
  symbols \#1, \#2, and so on, as well as the self-loop with \#0 on it to "pass through"
  the disambiguation symbol from the grammar.  See \ref graph_disambig for more
  explanation.  Anyway, you won't have to deal with this directly.
  
  The file <DFN>data/lang/oov.txt</DFN> contains just a single line:
  \verbatim
  s5# cat data/lang/oov.txt
  <UNK>
  \endverbatim
  This is the word that we will map all out-of-vocabulary words to during
training.  There is nothing special about "\<UNK\>" here, and it does not have
to be this particular word; what is important is that this word should have a pronunciation
  containing just a phone that we designate as a "garbage phone"; this phone will
align with various kinds of spoken noise.  In our particular setup, this phone
is called <DFN>SPN</DFN> (short for "spoken noise"):
  \verbatim
  s5# grep -w UNK data/local/dict/lexicon.txt
  <UNK> SPN
  \endverbatim
  The file <DFN>oov.int</DFN> contains the integer form of this (extracted from <DFN>words.txt</DFN>),
  which happens to be 221 in this setup.  You might notice that in the Resource Management
  setup, oov.txt contains the silence word, which in that setup happens to be called "!SIL".
  In that case we simply chose an arbitrary word from the vocabulary-- there are no out of vocabulary
  words in the training set, so the word we choose has no effect.
  
  The file data/lang/topo contains the following data:
  \verbatim
  s5# cat data/lang/topo
  <Topology>
  <TopologyEntry>
  <ForPhones>
  21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188
  </ForPhones>
  <State> 0 <PdfClass> 0 <Transition> 0 0.75 <Transition> 1 0.25 </State>
  <State> 1 <PdfClass> 1 <Transition> 1 0.75 <Transition> 2 0.25 </State>
  <State> 2 <PdfClass> 2 <Transition> 2 0.75 <Transition> 3 0.25 </State>
  <State> 3 </State>
  </TopologyEntry>
  <TopologyEntry>
  <ForPhones>
  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
  </ForPhones>
  <State> 0 <PdfClass> 0 <Transition> 0 0.25 <Transition> 1 0.25 <Transition> 2 0.25 <Transition> 3 0.25 </State>
  <State> 1 <PdfClass> 1 <Transition> 1 0.25 <Transition> 2 0.25 <Transition> 3 0.25 <Transition> 4 0.25 </State>
  <State> 2 <PdfClass> 2 <Transition> 1 0.25 <Transition> 2 0.25 <Transition> 3 0.25 <Transition> 4 0.25 </State>
  <State> 3 <PdfClass> 3 <Transition> 1 0.25 <Transition> 2 0.25 <Transition> 3 0.25 <Transition> 4 0.25 </State>
  <State> 4 <PdfClass> 4 <Transition> 4 0.75 <Transition> 5 0.25 </State>
  <State> 5 </State>
  </TopologyEntry>
  </Topology>
  \endverbatim
  This specifies the topology of the HMMs we use.  In this case, the "real" phones contain
  three emitting states
  with the standard 3-state left-to-right topology-- the "Bakis model".
  (Emitting states are states that "emit" feature vectors, as distinct from the "fake"
  non-emitting states that are just used to glue other states together).
  Phones 1 to 20 are various kinds of silence and noise; we have a lot because of word-position-dependency,
  and in fact most of these will never be used; the real number excluding word position
dependency is more like five.  The "silence phones" have a more complex topology, with an
initial emitting state and a final emitting state, plus three emitting states in the middle
(five emitting states in total).
  You don't have to create this file by hand.
  
  There are a number of files in <DFN>data/lang/phones/</DFN> that specify various things about
  the phone set.  Most of these files exist in three separate versions: a ".txt" form, e.g.:
  \verbatim
  s5# head -3 data/lang/phones/context_indep.txt
  SIL
  SIL_B
  SIL_E
  \endverbatim
  a ".int" form, e.g:
  \verbatim
  s5# head -3 data/lang/phones/context_indep.int
  1
  2
  3
  \endverbatim
and a ".csl" form which, in a slight abuse of notation, denotes a colon-separated list,
not a comma-separated list:
  \verbatim
  s5# cat data/lang/phones/context_indep.csl
  1:2:3:4:5:6:7:8:9:10:11:12:13:14:15:16:17:18:19:20
  \endverbatim
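The three forms are mechanically related; for instance, the ".csl" file is just the lines of the ".int" file joined with colons, which you can reproduce with paste (the file contents below are invented):

```shell
# A three-line .int file:
printf '1\n2\n3\n' > context_indep_demo.int
# Join the lines with ':' to get the .csl form:
paste -sd: context_indep_demo.int
```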
  These files always contain the same information, so let's focus on the ".txt" form which
  is more human-readable.  The file "context_indep.txt" contains a list of those phones
  for which we build context-independent models: that is, for those phones, we do not build a decision tree
  that gets to ask questions about the left and right phonetic context.  In fact, we do build
  smaller trees where we get to ask questions about the central phone and the HMM-state;
  this depends on the "roots.txt" file which we'll describe below.  See \ref tree_externals
  for more in-depth discussion of tree issues.
  
  The file <DFN>context_indep.txt</DFN> contains all the phones which are not "real phones":
  i.e. silence (SIL), spoken noise (SPN), non-spoken noise (NSN), and laughter (LAU):
  \verbatim
  # cat data/lang/phones/context_indep.txt
  SIL
  SIL_B
  SIL_E
  SIL_I
  SIL_S
  SPN
  SPN_B
  SPN_E
  SPN_I
  SPN_S
  NSN
  NSN_B
  NSN_E
  NSN_I
  NSN_S
  LAU
  LAU_B
  LAU_E
  LAU_I
  LAU_S
  \endverbatim
  There are a lot of variants of these phones because of word-position dependency; not all of these variants
  will ever be used.  Here, <DFN>SIL</DFN> would be the silence that gets optionally inserted by the
  lexicon (not part of a word), <DFN>SIL_B</DFN> would be a silence phone at the beginning of a word
  (which should never exist), <DFN>SIL_I</DFN> word-internal silence (unlikely to exist), <DFN>SIL_E</DFN>
word-ending silence (should never exist), and <DFN>SIL_S</DFN> would be silence as a "singleton
word", i.e. a word whose pronunciation is just this one phone -- this might be used if you had a
"silence word" in your lexicon and explicit silences appear in your transcriptions.
  
The files <DFN>silence.txt</DFN> and <DFN>nonsilence.txt</DFN> contain lists of the silence
phones and nonsilence phones respectively.  These lists should be mutually exclusive and,
together, should contain all the phones.  In this particular setup, <DFN>silence.txt</DFN> is identical
  to <DFN>context_indep.txt</DFN>.
What we mean by "nonsilence" phones is phones that we
intend to estimate various kinds of linear transforms on: that is, global transforms
such as LDA and MLLT, and speaker adaptation transforms such as fMLLR.  Our belief based
  on prior experience is that it does not pay to include silence in the estimation of
  such transforms.  Our practice is
  to designate all silence, noise and vocalized-noise phones as "silence" phones, and all
  phones representing traditional phonemes as "nonsilence" phones.  We haven't experimented
  in Kaldi with the best way to do this.
  \verbatim
  s5# head -3 data/lang/phones/silence.txt
  SIL
  SIL_B
  SIL_E
  s5# head -3 data/lang/phones/nonsilence.txt
  IY_B
  IY_E
  IY_I
  \endverbatim
  
  The file <DFN>disambig.txt</DFN> contains a list of the "disambiguation symbols"
  (see \ref graph_disambig):
  \verbatim
  s5# head -3 data/lang/phones/disambig.txt
  #0
  #1
  #2
  \endverbatim
  These symbols appear in the file <DFN>phones.txt</DFN> as if they were phones.
  
  The file <DFN>optional_silence.txt</DFN> contains a single phone which can optionally
  appear between words:
  \verbatim
  s5# cat data/lang/phones/optional_silence.txt
  SIL
  \endverbatim
  The mechanism by which it appears optionally between words is that it appears
  optionally in the lexicon FST at the end of every word (and also the beginning of the
  utterance).  The reason it has to be specified in <DFN>phones/</DFN> instead of just appearing
  in <DFN>L.fst</DFN> is obscure and we won't go into it here.
  
  The file <DFN>sets.txt</DFN> contains sets of phones that we group together (consider as
  the same phone) while clustering the phones in order to create the context-dependency questions
  (in Kaldi we use automatically generated questions when building decision trees,
  rather than linguistically meaningful ones).
  In this particular setup, <DFN>sets.txt</DFN> groups together all the word-position-dependent
  versions of each phone:
  \verbatim
  s5# head -3 data/lang/phones/sets.txt
  SIL SIL_B SIL_E SIL_I SIL_S
  SPN SPN_B SPN_E SPN_I SPN_S
  NSN NSN_B NSN_E NSN_I NSN_S
  \endverbatim
  
  The file <DFN>extra_questions.txt</DFN> contains some extra questions which we'll include
  in addition to the automatically generated questions:
  \verbatim
  s5# cat data/lang/phones/extra_questions.txt
  IY_B B_B D_B F_B G_B K_B SH_B L_B M_B N_B OW_B AA_B TH_B P_B OY_B R_B UH_B AE_B S_B T_B AH_B V_B W_B Y_B Z_B CH_B AO_B DH_B UW_B ZH_B EH_B AW_B AX_B EL_B AY_B EN_B HH_B ER_B IH_B JH_B EY_B NG_B
  IY_E B_E D_E F_E G_E K_E SH_E L_E M_E N_E OW_E AA_E TH_E P_E OY_E R_E UH_E AE_E S_E T_E AH_E V_E W_E Y_E Z_E CH_E AO_E DH_E UW_E ZH_E EH_E AW_E AX_E EL_E AY_E EN_E HH_E ER_E IH_E JH_E EY_E NG_E
  IY_I B_I D_I F_I G_I K_I SH_I L_I M_I N_I OW_I AA_I TH_I P_I OY_I R_I UH_I AE_I S_I T_I AH_I V_I W_I Y_I Z_I CH_I AO_I DH_I UW_I ZH_I EH_I AW_I AX_I EL_I AY_I EN_I HH_I ER_I IH_I JH_I EY_I NG_I
  IY_S B_S D_S F_S G_S K_S SH_S L_S M_S N_S OW_S AA_S TH_S P_S OY_S R_S UH_S AE_S S_S T_S AH_S V_S W_S Y_S Z_S CH_S AO_S DH_S UW_S ZH_S EH_S AW_S AX_S EL_S AY_S EN_S HH_S ER_S IH_S JH_S EY_S NG_S
  SIL SPN NSN LAU
  SIL_B SPN_B NSN_B LAU_B
  SIL_E SPN_E NSN_E LAU_E
  SIL_I SPN_I NSN_I LAU_I
  SIL_S SPN_S NSN_S LAU_S
  \endverbatim
  You will observe that a question is simply a set of phones.
  The first four questions ask about the word position of regular phones, and the last five do the same for
  the "silence" phones.  The "silence" phones also come in a variety without a suffix like <DFN>_B</DFN>,
  for example <DFN>SIL</DFN>.  These may appear as optional silence in the lexicon, i.e. not inside an
  actual word.  In setups with things like tone dependency or stress markings, <DFN>extra_questions.txt</DFN>
  may contain questions that relate to those features.
  
  The file <DFN>word_boundary.txt</DFN> explains how the phones relate to word positions:
  \verbatim
  s5# head  data/lang/phones/word_boundary.txt
  SIL nonword
  SIL_B begin
  SIL_E end
  SIL_I internal
  SIL_S singleton
  SPN nonword
  SPN_B begin
  \endverbatim
  This is the same information as is in the suffixes of the phones (<DFN>_B</DFN> and so on), but
  we don't like to hardcode this in the text form of the phones-- for one thing, Kaldi executables
  never see the text form of the phones, but only an integerized form.  So it is specified
  by this file <DFN>word_boundary.txt</DFN>.  The main reason we need this information is
  in order to recover the word boundaries within lattices (for example, the program
  lattice-align-words reads the integer version of this file, <DFN>word_boundary.int</DFN>).
  Finding the word boundaries is useful for NIST sclite scoring, which requires
  time markings for words, and for other downstream processing.
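  The suffix-to-position mapping that <DFN>word_boundary.txt</DFN> encodes can be sketched as a
  small shell function (purely illustrative; <DFN>position_of</DFN> is a hypothetical helper, and
  the real tools read <DFN>word_boundary.int</DFN> rather than parsing suffixes):
  ```shell
  # Map a word-position-dependent phone to the position label that
  # word_boundary.txt assigns it (suffix-free phones are "nonword").
  position_of() {
    case "$1" in
      *_B) echo begin ;;
      *_E) echo end ;;
      *_I) echo internal ;;
      *_S) echo singleton ;;
      *)   echo nonword ;;
    esac
  }
  position_of SIL_B   # prints "begin"
  position_of SIL     # prints "nonword"
  ```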
  
  The file <DFN>roots.txt</DFN> contains information that relates to how we build the phonetic-context
  decision tree:
  \verbatim
  s5# head data/lang/phones/roots.txt
  shared split SIL SIL_B SIL_E SIL_I SIL_S
  shared split SPN SPN_B SPN_E SPN_I SPN_S
  shared split NSN NSN_B NSN_E NSN_I NSN_S
  shared split LAU LAU_B LAU_E LAU_I LAU_S
  ...
  shared split B_B B_E B_I B_S
  \endverbatim
  For now you can ignore the words "shared" and "split"-- these relate to certain options
  in how we build the decision tree (see \ref tree_externals for more information).
  The significance of having a number of phones on a single line, for
  example <DFN>SIL SIL_B SIL_E SIL_I SIL_S</DFN>, is that all of these phones
  have a single "shared root" in the decision tree, so states may be shared
  between those phones.  For stress and tone-dependent systems, typically
  all the stress or tone-dependent versions of a particular phone will appear on
  the same line.  In addition, all three states of an HMM (or all five states, for
  silences) share the root, and the decision-tree building process gets to
  ask about the state.  This sharing of the decision-tree root
  between the HMM-states is what we mean by "shared" in the roots file.
  
  \section data_prep_lang_creating Creating the "lang" directory
  
  The <DFN>data/lang/</DFN> directory contains a lot of different files, so we have
  provided a script that creates it for you starting from a relatively simple
  input:
  \verbatim
  utils/prepare_lang.sh data/local/dict "<UNK>" data/local/lang data/lang
  \endverbatim
  Here, the inputs are the directory <DFN>data/local/dict/</DFN> and the label <DFN>\<UNK\></DFN>,
  which is the dictionary word to which we map OOV words when they appear in transcripts
  (this becomes <DFN>data/lang/oov.txt</DFN>).  The location <DFN>data/local/lang/</DFN> is simply a
  temporary directory which the script will use; <DFN>data/lang/</DFN> is where
  it actually puts its output.
  
  The thing which you, as the data preparer, need to create is the directory
  <DFN>data/local/dict/</DFN>.  This directory contains the following files:
  \verbatim
  s5# ls data/local/dict
  extra_questions.txt  lexicon.txt nonsilence_phones.txt  optional_silence.txt  silence_phones.txt
  \endverbatim
  (in fact there are a few more files there which we haven't listed, but they are just temporary files that
  were put there while creating that directory, and we can ignore them).  The commands below give
  you an idea what is in these files:
  \verbatim
  s5# head -3 data/local/dict/nonsilence_phones.txt
  IY
  B
  D
  s5# cat data/local/dict/silence_phones.txt
  SIL
  SPN
  NSN
  LAU
  s5# cat data/local/dict/extra_questions.txt
  s5# head -5 data/local/dict/lexicon.txt
  !SIL SIL
  -'S S
  -'S Z
  -'T K UH D EN T
  -1K W AH N K EY
  \endverbatim
  As you can see, the contents of this directory are very simple in this
  setup (the Switchboard setup).  We just have lists of the "real" phones and of the
  "silence" phones respectively, an empty file called <DFN>extra_questions.txt</DFN>, and
  a file called <DFN>lexicon.txt</DFN> which has the format
  \verbatim
  <word> <phone1> <phone2> ...
  \endverbatim
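  Given this format, one easy consistency check is that every phone used in <DFN>lexicon.txt</DFN>
  is listed in <DFN>silence_phones.txt</DFN> or <DFN>nonsilence_phones.txt</DFN>.  A rough awk
  sketch (<DFN>check_lexicon</DFN> is a hypothetical helper, not part of Kaldi):
  ```shell
  # check_lexicon: print any lexicon entry that uses a phone not listed
  # in the given phone list (one or more phones per line).
  check_lexicon() {  # usage: check_lexicon <phones-file> <lexicon-file>
    awk 'NR==FNR { for (i = 1; i <= NF; i++) known[$i] = 1; next }
         { for (i = 2; i <= NF; i++)
             if (!($i in known)) { print "unknown phone " $i ": " $0; break } }' \
        "$1" "$2"
  }
  # e.g.: check_lexicon <(cat data/local/dict/silence_phones.txt \
  #                           data/local/dict/nonsilence_phones.txt) \
  #                     data/local/dict/lexicon.txt
  ```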
  Note: <DFN>lexicon.txt</DFN> will contain repeated entries for the same word,
  on separate lines,
  if we have multiple pronunciations for it.  If you want to use pronunciation
  probabilities, instead of creating the file <DFN>lexicon.txt</DFN>, create a file
  called <DFN>lexiconp.txt</DFN> that has the probability as the second field.
  Note that it is common practice to normalize the pronunciation probabilities so that,
  instead of summing to one, the most probable pronunciation for each word has probability one.  This
  tends to give better results.  For a top-level script that runs with
  pronunciation probabilities, search for <DFN>pp</DFN> in <DFN>egs/wsj/s5/run.sh</DFN>.
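  That normalization (scaling so the best pronunciation of each word gets probability one) can
  be sketched in awk, assuming <DFN>lexiconp.txt</DFN> lines of the form "word prob phone1
  phone2 ..." (<DFN>normalize_pron_probs</DFN> is a hypothetical helper, not the script Kaldi
  actually uses):
  ```shell
  # normalize_pron_probs: two passes over lexiconp.txt; the first finds
  # each word's maximum probability, the second rescales every entry so
  # the most probable pronunciation gets probability 1.0.
  normalize_pron_probs() {  # usage: normalize_pron_probs <lexiconp-file>
    awk 'NR==FNR { if ($2 > max[$1]) max[$1] = $2; next }
         { $2 = $2 / max[$1]; print }' "$1" "$1"
  }
  ```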
  
  Notice that in this input there is no notion of word-position dependency,
  i.e. no suffixes like <DFN>_B</DFN> and <DFN>_E</DFN>.  This is because it is the
  script <DFN>prepare_lang.sh</DFN> that adds those suffixes.
  
  You can see from the empty <DFN>extra_questions.txt</DFN> file that there
  is some kind of potential here that is not being fully exploited.  This relates
  to things like stress markings or tone markings.  You may want to have different
  versions of a particular phone that have different stress or tone.  In order
  to demonstrate what this looks like, we'll view the same files as above,
  but in the <DFN>egs/wsj/s5/</DFN> setup.  The result is below:
  \verbatim
  s5# cat data/local/dict/silence_phones.txt
  SIL
  SPN
  NSN
  s5# head data/local/dict/nonsilence_phones.txt
  S
  UW UW0 UW1 UW2
  T
  N
  K
  Y
  Z
  AO AO0 AO1 AO2
  AY AY0 AY1 AY2
  SH
  s5# head -6 data/local/dict/lexicon.txt
  !SIL SIL
  <SPOKEN_NOISE> SPN
  <UNK> SPN
  <NOISE> NSN
  !EXCLAMATION-POINT  EH2 K S K L AH0 M EY1 SH AH0 N P OY2 N T
  "CLOSE-QUOTE  K L OW1 Z K W OW1 T
  s5# cat data/local/dict/extra_questions.txt
  SIL SPN NSN
  S UW T N K Y Z AO AY SH W NG EY B CH OY JH D ZH G UH F V ER AA IH M DH L AH P OW AW HH AE R TH IY EH
  UW1 AO1 AY1 EY1 OY1 UH1 ER1 AA1 IH1 AH1 OW1 AW1 AE1 IY1 EH1
  UW0 AO0 AY0 EY0 OY0 UH0 ER0 AA0 IH0 AH0 OW0 AW0 AE0 IY0 EH0
  UW2 AO2 AY2 EY2 OY2 UH2 ER2 AA2 IH2 AH2 OW2 AW2 AE2 IY2 EH2
  s5#
  \endverbatim
  You may notice that some of the lines in <DFN>nonsilence_phones.txt</DFN> contain
  multiple phones on a single line.  These are the different stress-dependent
  versions of the vowels.  Note that four different versions of each such phone
  appear in the CMU dictionary: for example, <DFN>UW UW0 UW1 UW2</DFN>;
  for some reason, one of these versions has no numeric suffix.
  The order of the phones on the line does not matter, but the grouping into
  different lines does matter; in general, we advise users to put all forms of
  each "real phone" together on a single line.
  We use the stress markings present in the CMU
  dictionary.  The file extra_questions.txt contains a single question
  containing all of the "silence" phones (in fact this is unnecessary as
  it appears that the script <DFN>prepare_lang.sh</DFN> adds such a question anyway),
  and also a question corresponding to each of the different stress markers.
  These questions are necessary in order to get any benefit from the
  stress markers: because the different stress-dependent versions
  of each phone appear together on the lines of <DFN>nonsilence_phones.txt</DFN>,
  they stay together in <DFN>data/lang/phones/roots.txt</DFN> and
  <DFN>data/lang/phones/sets.txt</DFN>, which in turn ensures that they
  share the same tree root and can never be distinguished by an automatically generated question.  Thus,
  we have to provide a special question that affords the decision-tree building
  process a way to distinguish between the phones.  Note: the reason we put the
  phones together in the <DFN>sets.txt</DFN> and <DFN>roots.txt</DFN> is that some
  of the stress-dependent versions of phones may have too little data to
  robustly estimate either a separate decision tree or the phone clustering
  information that's used in producing the questions.  By grouping them together
  like this, we ensure that in the absence of enough data to estimate them
  separately, these different versions of the phone all "stay together" throughout
  the decision-tree building process.
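  The stress-marker questions shown earlier can be derived mechanically from a
  <DFN>nonsilence_phones.txt</DFN>-style file; a rough sketch (<DFN>stress_questions</DFN> is a
  hypothetical helper, and it assumes the CMU-style convention that stress variants end in a
  digit while the base form has no suffix):
  ```shell
  # stress_questions: emit one question (a set of phones) per stress marker.
  # Phones sharing an input line are the stress variants of one base phone;
  # the trailing digit (or its absence) selects the question a phone joins.
  stress_questions() {  # usage: stress_questions <nonsilence-phones-file>
    awk '{ for (i = 1; i <= NF; i++) {
             key = ($i ~ /[0-9]$/) ? substr($i, length($i)) : "plain";
             q[key] = q[key] " " $i } }
         END { for (key in q) print substr(q[key], 2) }' "$1"
  }
  ```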
  
  We should mention at this point that the script <DFN>utils/prepare_lang.sh</DFN>
  supports a number of options.  To give you an idea of what they are, here is
  the usage message of that script:
  \verbatim
  usage: utils/prepare_lang.sh <dict-src-dir> <oov-dict-entry> <tmp-dir> <lang-dir>
  e.g.: utils/prepare_lang.sh data/local/dict <SPOKEN_NOISE> data/local/lang data/lang
  options:
       --num-sil-states <number of states>             # default: 5, #states in silence models.
       --num-nonsil-states <number of states>          # default: 3, #states in non-silence models.
       --position-dependent-phones (true|false)        # default: true; if true, use _B, _E, _S & _I
                                                       # markers on phones to indicate word-internal positions.
       --share-silence-phones (true|false)             # default: false; if true, share pdfs of
                                                       # all silence phones.
       --sil-prob <probability of silence>             # default: 0.5 [must have 0 < silprob < 1]
  \endverbatim
  A potentially important option is the <DFN>--share-silence-phones</DFN> option.
  The default is false.  If this option is true, all the pdf's (the Gaussian
  mixture models) of all the silence phones such as silence, vocalized-noise,
  noise and laughter, will be shared and only the transition probabilities will
  differ between those models.  It's not clear why this should help, but we found
  that it was extremely helpful for the Cantonese data of IARPA's BABEL project.
  That data is very messy and has long untranscribed portions that we try to
  align to a special phone which we designate for that purpose.  We suspect
  that the training data was somehow failing to align correctly, and for some reason
  setting this option to true changed that.
  
  Another potentially important option is the <DFN>--sil-prob</DFN> option.  In general, we have
  not experimented much with any of these options so we cannot give very detailed advice.
  
  \section data_prep_grammar Creating the language model or grammar
  
  Our tutorial above on how to create the <DFN>lang/</DFN> directory did not address how
  to create the file <DFN>G.fst</DFN>, which is the finite state transducer form of
  the language model or grammar that we'll decode with.  In fact, in some setups
  we may have many "lang" directories for testing purposes, with different
  language models and dictionaries.  The Wall Street Journal (WSJ) setup is an example:
  \verbatim
  s5# echo data/lang*
  data/lang data/lang_test_bd_fg data/lang_test_bd_tg data/lang_test_bd_tgpr data/lang_test_bg \
   data/lang_test_bg_5k data/lang_test_tg data/lang_test_tg_5k data/lang_test_tgpr data/lang_test_tgpr_5k
  \endverbatim
  
  The process for creating <DFN>G.fst</DFN> is different depending on whether we're using
  a statistical language model or some kind of grammar.  In the RM setup there is
  a bigram grammar, which only allows certain pairs of words.  We make the probabilities
  sum to one within each grammar state by assigning each outgoing arc a probability of
  one over the number of outgoing arcs.  There is a statement in <DFN>local/rm_data_prep.sh</DFN> that does:
  \verbatim
  local/make_rm_lm.pl $RMROOT/rm1_audio1/rm1/doc/wp_gram.txt  > $tmpdir/G.txt || exit 1;
  \endverbatim
  This script <DFN>local/make_rm_lm.pl</DFN> creates a grammar in FST format (text format,
  not binary format).  It contains lines like the following:
  \verbatim
  s5# head data/local/tmp/G.txt
  0    1    ADD    ADD    5.19849703126583
  0    2    AJAX+S    AJAX+S    5.19849703126583
  0    3    APALACHICOLA+S    APALACHICOLA+S    5.19849703126583
  \endverbatim
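  The number on each arc is a cost, i.e. the negated natural log of the probability.  The value
  5.19849703126583 above is consistent with -log(1/181), i.e. a state with 181 equally probable
  outgoing arcs (181 is our inference from the number itself, not something stated in the RM
  documentation); a quick check:
  ```shell
  # With N equally probable outgoing arcs, each arc's cost is -log(1/N).
  # N=181 reproduces the arc cost shown above to the printed precision
  # (N itself is inferred, not documented).
  awk 'BEGIN { printf "%.6f\n", -log(1/181) }'
  ```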
  See <a href="http://www.openfst.org">www.openfst.org</a> for more information on OpenFst (they
  have a useful tutorial).  The script <DFN>local/rm_prepare_grammar.sh</DFN> will turn this into
  the binary-format file <DFN>G.fst</DFN> using the following statement:
  \verbatim
  fstcompile --isymbols=data/lang/words.txt --osymbols=data/lang/words.txt --keep_isymbols=false \
      --keep_osymbols=false $tmpdir/G.txt > data/lang/G.fst
  \endverbatim
  If you want to create your own grammar, you will probably want to do something similar.
  Note: this type of procedure only applies to grammars of a certain class: it won't
  allow you to compile an arbitrary context-free grammar, because that cannot be represented
  in OpenFst format.  There are ways to do this in the WFST framework
  (e.g. see recent work by Mike Riley on pushdown transducers), but we have not yet
  worked with those ideas in Kaldi.
  
  Please, before asking any questions on the list about language models or about making
  grammar FSTs, read "A Bit of Progress in Language Modeling" by Joshua Goodman; and go to
  www.openfst.org and do the FST tutorial so that you understand the basics of finite
  state transducers.  (Note that language models would be represented as finite state
  acceptors, or FSAs, which can be considered as a special case of finite state transducers).
  
  The script <DFN>utils/format_lm.sh</DFN> deals with converting ARPA-format language
  models into OpenFst format.  Here is the usage message of that script:
  \verbatim
  Usage: utils/format_lm.sh <lang_dir> <arpa-LM> <lexicon> <out_dir>
  E.g.: utils/format_lm.sh data/lang data/local/lm/foo.kn.gz data/local/dict/lexicon.txt data/lang_test
  Convert ARPA-format language models to FSTs.
  \endverbatim
  Some of the key commands from that script are:
  \verbatim
  gunzip -c $lm \
    | arpa2fst --disambig-symbol=#0 \
               --read-symbol-table=$out_dir/words.txt - $out_dir/G.fst
  \endverbatim
  This Kaldi program, <DFN>arpa2fst</DFN>, turns the ARPA-format language model
  into a Weighted Finite State Transducer (actually, an acceptor).
  
  A popular toolkit for building language models is SRILM.  Various language
  modeling toolkits are used in the Kaldi example scripts.  SRILM is the best
  documented and most fully featured, and we generally recommend it (its only
  drawback is that it doesn't have the most free license).  Here is the usage
  message of <DFN>utils/format_lm_sri.sh</DFN>:
  
  \verbatim
  Usage: utils/format_lm_sri.sh [options] <lang-dir> <arpa-LM> <out-dir>
  E.g.: utils/format_lm_sri.sh data/lang data/local/lm/foo.kn.gz data/lang_test
  Converts ARPA-format language models to FSTs. Change the LM vocabulary using SRILM.
  \endverbatim
  
  
  \section data_prep_unknown Note on unknown words
  
  This is an explanation of how Kaldi deals with unknown words (words not in the
  vocabulary); we are putting it on the "data preparation" page for lack of a more obvious
  location.
  
  In many setups, <DFN>\<unk\></DFN> or something similar will be present in the
  LM as long as the data that you used to train the LM had words that were not
  in the vocabulary you used to train the LM,
  because language modeling toolkits tend to map them all to a
  single special word, usually called <DFN>\<unk\></DFN> or
  <DFN>\<UNK\></DFN>.  You can look at the arpa file to figure out what it's called; it
  will usually be one of those two.
  
  
  During training, if there are words in the <DFN>text</DFN> file in your data
  directory that are not in the <DFN>words.txt</DFN> in the lang directory that
  you are using, Kaldi will map them to a special word that's specified in the
  lang directory in the file <DFN>data/lang/oov.txt</DFN>; it will usually be
  either <DFN>\<unk\></DFN>, <DFN>\<UNK\></DFN> or maybe
  <DFN>\<SPOKEN_NOISE\></DFN>.  This word will have been chosen by the user
  (i.e., you), and supplied to <DFN>prepare_lang.sh</DFN> as a command-line argument.
  If this word has nonzero probability in the language model (which you can test
  by looking at the arpa file), then it will be possible for Kaldi to recognize
  this word at test time.  This will often be the case if you call this word
  <DFN>\<unk\></DFN>, because as we mentioned above, language modeling toolkits
  will often use this spelling for ``unknown word'' (which is a special word that
  all out-of-vocabulary words get mapped to).  Decoding output will always be limited to the
  intersection of the words in the language model with the words in <DFN>lexicon.txt</DFN> (or whatever file you supplied the
  lexicon in, e.g. <DFN>lexiconp.txt</DFN>); these words will all be present in the <DFN>words.txt</DFN>
  in your <DFN>lang</DFN> directory.
  So if Kaldi's "unknown word" doesn't match the LM's "unknown word", you will
  simply never decode this word.  In any
  case, even when allowed to be decoded, this word typically won't be output very
  often and in practice it doesn't tend to have much impact on WERs.
  
  Of course a single phone isn't a very good, or accurate, model of OOV words.  In
  some Kaldi setups we have example scripts named
  <DFN>local/run_unk_model.sh</DFN>: e.g., see the file
  <DFN>tedlium/s5_r2/local/run_unk_model.sh</DFN>.  These scripts replace the unk
  phone with a phone-level language model.  They make it possible to get access to
  the sequence of phones in a hypothesized unknown word.  Note: unknown words
  should be considered an "advanced topic" in speech recognition and we discourage
  beginners from looking into this topic too closely.
  
  
  
  */