  // doc/online_decoding.dox
  
  // Copyright 2014  Johns Hopkins University (author: Daniel Povey)
  
  // See ../../COPYING for clarification regarding multiple authors
  //
  // Licensed under the Apache License, Version 2.0 (the "License");
  // you may not use this file except in compliance with the License.
  // You may obtain a copy of the License at
  
  //  http://www.apache.org/licenses/LICENSE-2.0
  
  // THIS CODE IS PROVIDED *AS IS* BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
  // KIND, EITHER EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION ANY IMPLIED
  // WARRANTIES OR CONDITIONS OF TITLE, FITNESS FOR A PARTICULAR PURPOSE,
  // MERCHANTABLITY OR NON-INFRINGEMENT.
  // See the Apache 2 License for the specific language governing permissions and
  // limitations under the License.
  
  namespace kaldi {
  
  /**
     \page online_decoding  Online decoding in Kaldi
  
 This page documents the capabilities for "online decoding" in Kaldi.  By
 "online decoding" we mean decoding where the features are arriving in real
 time, and you don't want to wait until all the audio is captured before
 starting the decoding.  (We're not using the phrase "real-time decoding"
 because "real-time decoding" can also be used to mean decoding whose speed is
 not slower than real time, even if it is applied in batch mode.)
  
   Note: see \subpage online_programs for some now-deprecated online recognizers.
  
   The approach that we took with Kaldi was to focus for the first few years
   on off-line recognition, in order to reach state of the art performance
   as quickly as possible.  Now we are making more of an effort to support
   online decoding.
  
   There are two online-decoding setups: the "old" online-decoding setup, in the
   subdirectories online/ and onlinebin/, and the "new" decoding setup,
   in online2/ and online2bin/.  The "old" online-decoding setup is now
   deprecated, and may eventually be removed from the trunk (but remain in
   ^/branches/complete).
  
   There is some documentation for the older setup \ref online_programs "here",
 but we recommend reading this page first.
  
    \section online_decoding_scope Scope of online decoding in Kaldi
  
    In Kaldi we aim to provide facilities for online decoding as a library.
    That is, we aim to provide the functionality for online decoding but
    not necessarily command-line tools for it.  The reason is, different
    people's requirements will be very different depending on how the data is
    captured and transmitted.  In the "old" online-decoding setup we provided facilities
    for transferring data over UDP and the like, but in the "new" online-decoding
    setup our only aim is to demonstrate the internal code, and for now
    we don't provide any example programs that you could hook up to actual
    real-time audio capture; you would have to do that yourself.
  
    We have decoding programs for GMM-based models (see next section) and for
    neural net models (see section \ref online_decoding_nnet2).
  
  \section online_decoding_gmm GMM-based online decoding
  
  The program online2-wav-gmm-latgen-faster.cc is currently the primary example program
  for the GMM-based online-decoding setup.  It reads in whole wave files, but internally it processes them chunk by chunk with
  no dependency on the future.  The example script egs/rm/s5/local/online/run_gmm.sh
  shows how to build models suitable for this program to use, and how to evaluate them.
  The main purpose of the program is to apply the GMM-based online-decoding
  procedure within a typical batch-processing framework, so that you can easily
  evaluate word error rates.  We plan to add similar programs for SGMMs and DNNs.
  In order to actually do online decoding, you would have to modify this program.
  We should note (and this is obvious to speech recognition people but not to outsiders)
  that the audio sample rate needs to exactly match what you used in training (you
  can downsample higher-rate audio, but upsampling lower-rate audio won't work).
  
    \section online_decoding_decoders Decoders versus decoding programs
  
    In Kaldi, when we use the term "decoder" we don't generally mean the entire decoding
    program.  We mean the inner decoder object, generally of the type LatticeFasterDecoder.
    This object takes the decoding graph (as an FST), and the decodable object
    (see \ref decodable_interface).  All the decoders naturally support online decoding; it
    is the code in the decoding program (but outside of the decoder) that needs to
    change.  We should note, though, a difference in how you need to invoke the decoder
    for online decoding.
       - In the old online-decoding setup (in online/), if "decoder" is some decoder
         (e.g. of type LatticeFasterDecoder) and "decodable" is a decodable object of
         a suitable type, you would call decoder.Decode(&decodable),
         and this call would block until the input was finished (because the decoder
         calls decodable.IsLastFrame(), which blocks).
     - In the new online-decoding setup (in online2/), you would instead call
       decoder.InitDecoding(), and then each time you get more feature data, you
       would call decoder.AdvanceDecoding().  For offline use, you can still call
       Decode().  (A short sketch of this calling pattern appears just after this list.)
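
   Below is a minimal sketch of the new-style calling pattern, assuming a
   "decodable" object whose features get appended as audio arrives.  This is
   only an illustration: the constructor arguments and the code that feeds the
   decodable object are omitted, and names like "audio_is_still_arriving" are
   placeholders.
 \verbatim
// Sketch of the online calling pattern described above (abbreviated).
LatticeFasterDecoder decoder(decode_fst, decoder_opts);
decoder.InitDecoding();
while (audio_is_still_arriving) {
  // ... append newly captured features to the decodable object ...
  decoder.AdvanceDecoding(&decodable);  // decodes as many frames as are ready
}
decoder.FinalizeDecoding();             // finish off and prune the lattice
Lattice best_path;
decoder.GetBestPath(&best_path);
 \endverbatim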
  
    We should mention here that in the old online setup, there is a decoder called
    OnlineFasterDecoder.  Do not assume from the name of this that it is the only
    decoder to support online decoding.  The special thing about the OnlineFasterDecoder
    is that it has the ability to work out which words are going to be "inevitably"
    decoded regardless of what audio data comes in in future, so you can output those
    words.  This is useful in an online-transcription context, and if there seems to
    be a demand for this, we may move that decoder from online/ into the decoder/
    directory and make it compatible with the new online setup.
  
  
    \section online_decoding_feature Feature extraction in online decoding
  
    Most of the complexity in online decoding relates to feature extraction
    and adaptation.
  
    In online-feature.h we provide classes that provide various components
    of feature extraction, all inheriting from class OnlineFeatureInterface.
    OnlineFeatureInterface is a base class for online feature extraction.
    The interface specifies how the object provides the features to the caller
    (OnlineFeatureInterface::GetFrame()) and how it says how many frames
    are ready (OnlineFeatureInterface::NumFramesReady()), but does not
    say how it obtains those features.  That is up to the child class.
  
    In online-feature.h we define classes OnlineMfcc and OnlinePlp which
    are the lowest-level features.  They have a member function
    OnlineMfccOrPlp::AcceptWaveform(), which the user should call when
    data is captured.  All the other online feature types in online-feature.h
    are "derived" features, so they take an object of OnlineFeatureInterface
    in their constructor and get their input features through a stored pointer
    to that object.
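
   For orientation, here is a minimal sketch of how these objects can be
   chained together; the constructor arguments are abbreviated and the exact
   set of derived features depends on your setup, so treat this as an
   illustration rather than a recipe.
 \verbatim
// Sketch of chaining online feature objects (argument lists abbreviated).
OnlineMfcc mfcc(mfcc_opts);                      // lowest-level features
OnlineCmvn cmvn(cmvn_opts, cmvn_state, &mfcc);   // "derived": reads from mfcc
OnlineSpliceFrames splice(splice_opts, &cmvn);   // "derived": reads from cmvn

mfcc.AcceptWaveform(samp_freq, waveform_chunk);  // call as audio is captured

Vector<BaseFloat> feat(splice.Dim());
for (int32 t = 0; t < splice.NumFramesReady(); t++)
  splice.GetFrame(t, &feat);                     // get the features for frame t
 \endverbatim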
  
    The only part of the online feature extraction code in online-feature.h
    that is non-trivial is the cepstral mean and variance normalization (CMVN)
    (and note that the fMLLR, or linear transform, estimation is not trivial but
    the complexity lies elsewhere).  We describe the CMVN below.
  
    \section online_decoding_cmvn Cepstral mean and variance normalization in online decoding
  
    Cepstral mean normalization is a normalization method in which
    the mean of the data (typically of the raw MFCC features) is subtracted.
   "Cepstral" simply refers to the normal feature type; the first C in MFCC means
    "Cepstral".. the cepstrum is the inverse fourier transform of the log spectrum,
    although it's actually the cosine transform that is used.
    Anyway, in cepstral variance normalization, each feature dimension is scaled
    so that its variance is one.  In all the current scripts, we turn cepstral
    variance normalization off and only use cepstral mean normalization, but the
    same code handles both.  In the discussion below, for brevity we will refer only to cepstral
    mean normalization.
  
    In the Kaldi scripts, cepstral mean and variance normalization (CMVN) is
    generally done on a per-speaker basis.  Obviously in an online-decoding
    context, this is impossible to do because it is "non-causal" (the current
    feature depends on future features).
  
    The basic solution we use is to do "moving-window" cepstral mean
    normalization.  We accumulate the mean over a moving window of, by default, 6
    seconds (see the "--cmn-window" option to programs in online2bin/, which
    defaults to 600).  The options class for this computation, OnlineCmvnOptions,
    also has extra configuration variables, speaker-frames (default: 600), and
    global-frames (default: 200).  These specify how we make use of prior
    information from the same speaker, or a global average of the cepstra, to
    improve the estimate for the first few seconds of each utterance.
    The program \ref apply-cmvn-online.cc "apply-cmvn-online" can apply this normalization
    as part of a training pipeline so that we can train on matched features.
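
    To illustrate the idea, the following is only a conceptual sketch of how the
    statistics are smoothed for the first few seconds of an utterance; it is not
    the exact code in online-feature.cc, and the variable and member names here
    are made up.
 \verbatim
// Conceptual sketch: pad the moving window's stats with prior information
// when only a few frames of the current utterance are available so far.
double count = utt_stats.Count();               // frames in the moving window
if (count < opts.cmn_window) {
  // First fall back on stats from previous utterances of the same speaker,
  // up to a total of about opts.speaker_frames frames of evidence ...
  double from_speaker = std::min<double>(opts.speaker_frames - count,
                                         speaker_stats.Count());
  if (from_speaker > 0.0)
    utt_stats.AddScaled(from_speaker / speaker_stats.Count(), speaker_stats);
  // ... and then on a global average of the cepstra, up to opts.global_frames.
  double from_global = std::min<double>(opts.global_frames - utt_stats.Count(),
                                        global_stats.Count());
  if (from_global > 0.0)
    utt_stats.AddScaled(from_global / global_stats.Count(), global_stats);
}
 \endverbatim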
  
    \subsection online_decoding_cmvn_freeze Freezing the state of CMN
  
    The OnlineCmvn class has functions \ref OnlineCmvn::GetState "GetState" and
    \ref OnlineCmvn::SetState "SetState" that make it possible to keep track of
    the state of the CMVN computation between speakers.  It also has a function
    \ref OnlineCmvn::Freeze() "Freeze()".  This function causes it to freeze the
    state of the cepstral mean normalization at a particular value, so that after
    calling \ref OnlineCmvn::Freeze() "Freeze()", any calls to \ref
    OnlineCmvn::GetFrame() "GetFrame()", even for earlier times, will apply the
    mean offset that we were using when the user called \ref OnlineCmvn::Freeze()
    "Freeze()".  This frozen state will also be propagated to future utterances of
    the same speaker via the \ref OnlineCmvn::GetState "GetState" and \ref
    OnlineCmvn::SetState "SetState" function calls.  The reason we do this is that
    we don't believe it makes sense to do speaker adaptation with fMLLR on top
    of a constantly varying CMN offset.  So when we start estimating fMLLR
    (see below), we freeze the CMN state and leave it fixed in future.  The
    value of CMN at the time we freeze it is not especially critical because fMLLR subsumes
    CMN.   The reason we freeze the CMN state to a particular value rather than just
    skip over the CMN when we start estimating fMLLR, is that we are actually
    using a method called basis-fMLLR (again, see below) where we incrementally
    estimate the parameters, and it is not completely invariant to offsets.
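
    As an illustration, a decoding program might carry the CMVN state across a
    speaker's utterances roughly as follows (argument lists are abbreviated and
    the loop is schematic; see online-feature.h for the actual interfaces).
 \verbatim
// Sketch of carrying the CMVN state between utterances of one speaker.
OnlineCmvnState cmvn_state(global_cmvn_stats);   // start from global stats
for (int32 utt = 0; utt < num_utts_of_this_speaker; utt++) {
  OnlineMfcc mfcc(mfcc_opts);
  OnlineCmvn cmvn(cmvn_opts, cmvn_state, &mfcc);
  // ... feed in audio and decode; when we first estimate fMLLR we would call
  //     cmvn.Freeze(cur_frame) so that the mean offset stops varying ...
  cmvn.GetState(cmvn.NumFramesReady() - 1, &cmvn_state);  // save (possibly frozen) state
}
 \endverbatim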
  
  
    \section online_decoding_adaptation  Adaptation in online decoding
  
   The most standard adaptation method used for speech recognition is
   feature-space Maximum Likelihood Linear Regression (fMLLR), also known in the
   literature as Constrained MLLR (CMLLR), but we use the term fMLLR in the Kaldi
   code and documentation.  fMLLR consists of an affine (linear + offset) transform
   of the features; the number of parameters is d * (d+1), where d is the
 final feature dimension (typically 40).  In the online decoding program we use
 a basis method to incrementally estimate an increasing number of
 transform parameters as we decode more data.  The top-level logic for this at the
 decoder level is mostly implemented in class SingleUtteranceGmmDecoder.
  
 The fMLLR estimation is done not continuously but periodically, since it involves
 computing lattice posteriors and this can't very easily be done in a continuous
   manner.  Configuration variables in class OnlineGmmDecodingAdaptationPolicyConfig
   determine when we re-estimate fMLLR.  The default currently is, during the first
   utterance, to estimate it after 2 seconds, and thereafter at times in a geometrically
   increasing ratio with constant 1.5 (so at 2 seconds, 3 seconds, 4.5 seconds...).
   For later utterances we estimate it after 5 seconds, 10 seconds, 20 seconds and so on.
   For all utterances we estimate it at the end of the utterance.
  
   Note that the CMN adaptation state is frozen, as mentioned above, the first time
   we estimate fMLLR for a speaker, which by default will be two seconds into the
   first utterance.
  
   \section online_decoding_models  Use of multiple models in GMM-based online decoding
  
   In the online decoding code for GMMs in online-gmm-decoding.h, up to three
    models can be supplied.  These are held in class OnlineGmmDecodingModels, which
    takes care of the logic necessary to decide which model to use for different purposes
    if fewer models are supplied.  The three models are:
      - A speaker-independent model, trained with online-mode CMVN from
        \ref apply-cmvn-online.cc "apply-cmvn-online"
       - A speaker adapted model, trained with fMLLR
       - A discriminatively trained version of the speaker adapted model
    It is our practice to use a Maximum Likelihood estimated model to estimate
   adaptation parameters, as this is more consistent with the Maximum Likelihood framework
    than using a discriminatively trained model, although this probably makes little
    difference and you would lose little (and save some memory) by using the discriminatively
    trained model for this purpose.
  
  
   \section online_decoding_nnet2  Neural net based online decoding with iVectors
  
    Our best online-decoding setup, which we recommend should be used, is the neural
 net based setup.  The adaptation philosophy is to give the neural net un-adapted
 and non-mean-normalized features (MFCCs, in our example recipes), and also to give
 it an iVector.  An iVector is a vector of dimension in the hundreds (one or two hundred,
 in this particular context) which represents the speaker properties.  For more information
   on this the reader can look at the speaker identification literature.  Our idea is that
   the iVector gives the neural net as much as it needs to know about the speaker properties.
   This has proved quite useful.  The iVector is estimated in a left-to-right way, meaning
   that at a certain time t, it sees input from time zero to t.   It also sees information
   from previous utterances of the current speaker, if available.  The iVector estimation is
   Maximum Likelihood, involving Gaussian Mixture Models.
  
   If pitch is used (e.g. for tonal languages), we don't include it in the features used for
   iVector estimation, in order to simplify things; we just include it in the features given
   to the neural network.  We don't yet have example scripts for the online-neural-net decoding
   for tonal languages; it is still being debugged.
  
   The neural nets in our example scripts for online decoding are p-norm neural networks, typically
   trained in parallel on several GPUs.  We have these example scripts for several different
   example setups, e.g. in egs/rm/s5, egs/wsj/s5, egs/swbd/s5b,  and egs/fisher_english/s5.
   The top-level example script is always called local/online/run_nnet2.sh.  In the case of the
   Resource Management recipe there is also a script local/online/run_nnet2_wsj.sh.  This demonstrates
   how to take a larger neural net trained on out-of-domain speech with the same sampling rate (in
   this example, WSJ), and retrain it on in-domain data.  In this way we obtained our best-ever
   results on RM.
  
   We are currently working on example scripts for discriminative training for this setup.
  
    \subsection online_decoding_nnet2_example Example for using already-built online-nnet2 models
  
   In this section we will explain how to download already-built online-nnet2 models from www.kaldi-asr.org
    and evaluate them on your own data.
  
  The reader can download the models and other related files from <b>
   http://kaldi-asr.org/downloads/build/2/sandbox/online/egs/fisher_english/s5 </b>,
   which are built using the fisher_english recipe. To use the online-nnet2 models, the reader
   only needs to download two directories: exp/tri5a/graph and exp/nnet2_online/nnet_a_gpu_online. Use the
   following commands to download the archives and extract them:
  
   \verbatim
  wget http://kaldi-asr.org/downloads/build/5/trunk/egs/fisher_english/s5/exp/nnet2_online/nnet_a_gpu_online/archive.tar.gz -O nnet_a_gpu_online.tar.gz
  wget http://kaldi-asr.org/downloads/build/2/sandbox/online/egs/fisher_english/s5/exp/tri5a/graph/archive.tar.gz -O graph.tar.gz
  mkdir -p nnet_a_gpu_online graph
  tar zxvf nnet_a_gpu_online.tar.gz -C nnet_a_gpu_online
  tar zxvf graph.tar.gz -C graph
   \endverbatim
   Here the archives are extracted to the local directory.  We need to modify pathnames in the
   config files, which we can do as follows:
  \verbatim
  for x in nnet_a_gpu_online/conf/*conf; do
    cp $x $x.orig
    sed s:/export/a09/dpovey/kaldi-clean/egs/fisher_english/s5/exp/nnet2_online/:$(pwd)/: < $x.orig > $x
  done
  \endverbatim
   Next, choose a single wav file to decode. The reader can download a sample file by typing
   \verbatim
   wget http://www.signalogic.com/melp/EngSamples/Orig/ENG_M.wav
   \endverbatim
  This is an 8kHz-sampled wav file that we found online (unfortunately it is UK
   English, so the accuracy is not very good).  It can be decoded with the following command:
   \verbatim
  ~/kaldi-online/src/online2bin/online2-wav-nnet2-latgen-faster --do-endpointing=false \
      --online=false \
      --config=nnet_a_gpu_online/conf/online_nnet2_decoding.conf \
      --max-active=7000 --beam=15.0 --lattice-beam=6.0 \
      --acoustic-scale=0.1 --word-symbol-table=graph/words.txt \
     nnet_a_gpu_online/final.mdl graph/HCLG.fst "ark:echo utterance-id1 utterance-id1|" "scp:echo utterance-id1 ENG_M.wav|" \
     ark:/dev/null
   \endverbatim
   We added the <code>--online=false</code> option because it tends to slightly improve results.
   You can see the result in the logging output (although there are other ways to retrieve this).
   For us, the logging output was as follows:
  \verbatim
  /home/dpovey/kaldi-online/src/online2bin/online2-wav-nnet2-latgen-faster --do-endpointing=false --online=false --config=nnet_a_gpu_online/conf/online_nnet2_decoding.conf --max-active=7000 --beam=15.0 --lattice-beam=6.0 --acoustic-scale=0.1 --word-symbol-table=graph/words.txt nnet_a_gpu_online/smbr_epoch2.mdl graph/HCLG.fst 'ark:echo utterance-id1 utterance-id1|' 'scp:echo utterance-id1 ENG_M.wav|' ark:/dev/null
  LOG (online2-wav-nnet2-latgen-faster:ComputeDerivedVars():ivector-extractor.cc:180) Computing derived variables for iVector extractor
  LOG (online2-wav-nnet2-latgen-faster:ComputeDerivedVars():ivector-extractor.cc:201) Done.
  utterance-id1 tons of who was on the way for races two miles and then in nineteen ninety to buy sodas sale the rate them all these to commemorate columbus is drawn into the new world five hundred years ago on the one to the moon is to promote the use of so the sales in space exploration
  LOG (online2-wav-nnet2-latgen-faster:main():online2-wav-nnet2-latgen-faster.cc:253) Decoded utterance utterance-id1
  LOG (online2-wav-nnet2-latgen-faster:Print():online-timing.cc:51) Timing stats: real-time factor for offline decoding was 1.62102 = 26.7482 seconds  / 16.5009 seconds.
  LOG (online2-wav-nnet2-latgen-faster:main():online2-wav-nnet2-latgen-faster.cc:259) Decoded 1 utterances, 0 with errors.
  LOG (online2-wav-nnet2-latgen-faster:main():online2-wav-nnet2-latgen-faster.cc:261) Overall likelihood per frame was 0.230575 per frame over 1648 frames.
  \endverbatim
  
  Note that for mismatched data, sometimes the iVector estimation can get confused and lead to bad results.
  Something that we have found useful is to weight down the silence in the iVector estimation.
  To do this you can set e.g. <code>--ivector-silence-weighting.silence-weight=0.001</code>; you need to set the silence
  phones as appropriate, e.g. <code>--ivector-silence-weighting.silence-phones=1:2:3:4</code>
  (this should be a list of silence or noise phones in your phones.txt; you can experiment with
  which ones to include).
  
  \subsection online_decoding_nnet2_lm Example for using your own language model with existing online-nnet2 models
 Oftentimes users will want to use their own language model to improve
 recognition accuracy. In this section we will explain how to build a language
 model with SRILM, and how to incorporate this language model into the existing
 online-nnet2 models.
  
 We first have to build an ARPA format language model with SRILM. Note that SRILM
 comes with a lot of training options, and it is the user's responsibility to
 figure out the best settings for their own application.
 Suppose "train.txt" is our language model training corpus (e.g., training
 data transcriptions), and "wordlist" is our vocabulary. Here we assume the
 language model vocabulary is the same as the recognizer's vocabulary, i.e., it
 only contains the words from data/lang/words.txt, except the epsilon symbol
 "<eps>" and the disambiguation symbol "#0". We will explain how to use a
 different vocabulary in the next section. We can build a trigram Kneser-Ney
 language model using the following SRILM command
  \verbatim
  ngram-count -text train.txt -order 3 -limit-vocab -vocab wordlist -unk \
    -map-unk "<unk>" -kndiscount -interpolate -lm srilm.o3g.kn.gz
  \endverbatim
  
  Now that we have the ARPA format language model trained, we have to compile it
  into WFST format. Let's first define the following variables
  \verbatim
  dict_dir=data/local/dict                # The dict directory provided by the online-nnet2 models
  lm=srilm.o3g.kn.gz                      # ARPA format LM you just built.
  lang=data/lang                          # Old lang directory provided by the online-nnet2 models
  lang_own=data/lang_own                  # New lang directory we are going to create, which contains the new language model
  \endverbatim
  
 Given the above variables, we can compile the ARPA format language model into
 WFST format using the following command
  \verbatim
  utils/format_lm.sh $lang $lm $dict_dir/lexicon.txt $lang_own
  \endverbatim
  
  Now, we can compile the decoding graph with the new language model, using the
  following command
  \verbatim
  graph_own_dir=$model_dir/graph_own
  utils/mkgraph.sh $lang_own $model_dir $graph_own_dir || exit 1;
  \endverbatim
 where $model_dir is the model directory which contains the model "final.mdl"
 and the tree "tree". At this point, we can replace the old HCLG.fst with
 $graph_own_dir/HCLG.fst, which uses the language model we just built.
  
  \subsection online_decoding_nnet2_vocab Example for using a different vocabulary with existing online-nnet2 models
  For most applications users will also have to change the recognizer's existing
  vocabulary, for example, adding out-of-vocabulary words such as person names
  to the existing vocabulary. In this section we will explain how this can be
  done.
  
  We first have to create a new pronunciation lexicon, typically by adding more
  words to the recognizer's existing pronunciation lexicon. The recognizer's
 lexicon that we are going to modify is usually located at $dict_dir/lexicon.txt,
  where $dict_dir is the recognizer's dictionary directory, and is usually
  data/local/dict. The new lexicon can be created manually by adding new lexical
  entries to $dict_dir/lexicon.txt. If we do not have pronunciations for the new
  words, we can use grapheme-to-phoneme (G2P) conversion to generate pronunciations
 automatically. Commonly used G2P tools are Sequitur and Phonetisaurus; the
 latter is usually much faster.
  
  The second step is to create a dictionary directory for our new lexicon, which
  contains the required files, for example, lexicon.txt, lexiconp.txt, etc.
 If we do not change the lexicon's phone set, the old files such as
 extra_questions.txt, nonsilence_phones.txt, optional_silence.txt and
 silence_phones.txt can most likely be re-used. For details of how to create those files, we
 suggest that users follow the existing Kaldi scripts, for example this one:
  egs/wsj/s5/local/wsj_prepare_dict.sh. The format of the dictionary directory is
  described \ref data_prep_lang_creating "here".
  
  Now we can create a new lang directory with the updated lexicon. Suppose
  $lang is the recognizer's old lang directory, $lang_own is the new lang
  directory that we are going to create, $dict_own is the dictionary directory we
  just created, and "<SPOKEN_NOISE>" is the word symbol that represents
  out-of-vocabulary words in the lexicon, we can generate the new lang directory
  with the updated lexicon using the following command
  \verbatim
  lang_own_tmp=data/local/lang_own_tmp/   # Temporary directory.
  utils/prepare_lang.sh \
    --phone-symbol-table $lang/phones.txt \
    $dict_own "<SPOKEN_NOISE>" $lang_own_tmp $lang_own
  \endverbatim
 Make sure you use the option "--phone-symbol-table", which ensures that the
 phones in your new lexicon will be compatible with the recognizer.
  
  The last step is of course to update the decoding graph, using the following
  command
  \verbatim
  graph_own_dir=$model_dir/graph_own
  utils/mkgraph.sh $lang_own $model_dir $graph_own_dir || exit 1;
  \endverbatim
  where $model_dir is the model directory which contains the model "final.mdl"
  and the tree "tree". We now can use $graph_own_dir/HCLG.fst to replace the old
  HCLG.fst.
  
  
  \section online_decoding_nnet3 Online decoding with nnet3 models
  
  Online decoding with nnet3 models is basically the same as with nnet2
  models as described in \ref online_decoding_nnet2.  However, there are
  some limitations as to the model type you can use.  In Kaldi 5.0 and
  earlier, online nnet3 decoding does not support recurrent models.
  In Kaldi 5.1 and later, online nnet3 decoding supports "forward"
  recurrent models such as LSTMs, but not bidirectional ones like BLSTMs.
  In addition, online nnet3 decoding with recurrent
  models may not give optimal results unless
  you use "Kaldi-5.1-style" configuration, including the "decay-time"
  option and specifying --extra-left-context-initial 0; see
 \ref dnn3_scripts_context for more discussion of these issues.
  
  
 Many of the issues in online nnet3 decoding are the same as in nnet2
 decoding, and the command lines are quite similar.  For online nnet3
 decoding with Kaldi 5.1 and later, the best example script that covers both
 model training and online decoding is, at the time of writing,
 egs/tedlium/s5_r2/local/chain/tuning/run_tdnn_lstm_1e.sh
 (at the time of writing this is only available in the 'shortcut' branch,
 as Kaldi 5.1 has not yet been merged to master).
 For downloadable models that can be used with online nnet3 decoding, please
 see http://kaldi-asr.org/models.html (the first model there, the ASPIRE model,
 includes instructions in a README file).
  
  \subsection online_decoding_nnet3_tcp TCP server for nnet3 online decoding
  
 The program that runs the TCP server is online2-tcp-nnet3-decode-faster, located in the
 ~/src/online2bin folder. The usage is as follows:
  
  \verbatim
  online2-tcp-nnet3-decode-faster <nnet3-in> <fst-in> <word-symbol-table>
  \endverbatim
  
  For example:
  
  \verbatim
  online2-tcp-nnet3-decode-faster model/final.mdl graph/HCLG.fst graph/words.txt
  \endverbatim
  
  The word symbol table is mandatory (unlike other nnet3 online decoding programs) because
  the server outputs word strings. Endpointing is mandatory to make the operation of the
  program reasonable. Other, non-standard options include:
      - port-num - the port the server listens on (by default 5050)
      - samp-freq - sampling frequency of audio (usually 8000 for telephony and 16000 for other uses)
      - chunk-length - length of signal being processed by decoder at each step
     - output-period - how often we check for changes in the decoding (i.e. the output refresh rate, default 1s)
     - num-threads-startup - number of threads used when initializing the iVector extractor
     - read-timeout - if the program doesn't receive data during this timeout, the server terminates
         the connection; use -1 to disable this feature.
  
  The TCP protocol simply takes RAW signal on input (16-bit signed integer
  encoding at chosen sampling frequency) and outputs simple text using the following
  logic:
     - each refresh period (the output-period argument) the current state of the decoding is output
     - each line is terminated by '\r'
     - once an utterance boundary is detected due to endpointing, a '\n' char is output
  
 Each output string (delimited by '\r') should be treated as uncertain and can change
 entirely until the utterance delimiter ('\n') is sent. The delimiter chars are chosen
 specifically in order to make the output look neat in the terminal. It is possible to
 use it with other interfaces, and a web demo (HTML/JS AudioAPI+WebSockets) exists.
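
 As a rough illustration of how a client might interpret this output stream, consider
 the following sketch; read_char() and OnFinalResult() are hypothetical placeholders
 for "read one character from the socket" and "handle a finalized utterance".
 \verbatim
// Hypothetical client-side logic: text ending in '\r' is a partial hypothesis
// that may still change; '\n' marks the end of an utterance.
std::string partial, line;
char c;
while (read_char(&c)) {            // placeholder: read one char from the socket
  if (c == '\r') {
    partial = line; line.clear();  // updated (still uncertain) hypothesis
  } else if (c == '\n') {
    OnFinalResult(partial);        // placeholder: the utterance is now final
    partial.clear(); line.clear();
  } else {
    line += c;
  }
}
 \endverbatim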
  
  To run the program from the terminal you can use one of the following commands. First,
  make sure the server is running and accepting connections. Using the Aspire models, the
  command should look like this:
  \verbatim
  online2-tcp-nnet3-decode-faster --samp-freq=8000 --frames-per-chunk=20 --extra-left-context-initial=0
      --frame-subsampling-factor=3 --config=model/conf/online.conf --min-active=200 --max-active=7000
      --beam=15.0 --lattice-beam=6.0 --acoustic-scale=1.0 --port-num=5050 model/final.mdl graph/HCLG.fst graph/words.txt
  \endverbatim
  
 Note that in order to make the communication as simple as possible, the server has to accept
 any data on input and cannot figure out when the stream is over. It will therefore not
 be able to terminate the connection, and it is the client's responsibility to disconnect
 when it is ready to do so. As a fallback for certain situations, the read-timeout option
 was added, which will automatically disconnect if a chosen number of seconds has passed.
 Keep in mind that this is not an ideal solution, and it is a better idea to design your
 client to properly close the connection when necessary.
  
 For testing purposes, we will use the netcat program. We will also use sox to re-encode the
 files properly from any source. Netcat has an issue that, similarly to what was stated above
 about the server, it cannot always interpret the data and usually it won't automatically
 disconnect the TCP connection. To get around this, we will use the '-N' switch, which kills
 the connection once streaming of the file is complete, but this can have the small side effect of
 not reading the whole output from the Kaldi server if the disconnect comes too fast. Just
 keep this in mind if you intend to use any of these programs in a production environment.
  
  To send a WAV file into the server, it first needs to be decoded into raw audio, then it can be
  sent to the socket:
  \verbatim
  sox audio.wav -t raw -c 1 -b 16 -r 8k -e signed-integer - | nc -N localhost 5050
  \endverbatim
  
 It is possible to play the audio back (almost) simultaneously with decoding. It may require installing the
 'pv' program (used to throttle the signal into Kaldi at the same speed as the playback):
  
  \verbatim
  sox audio.wav -t raw -c 1 -b 16 -r 8k -e signed-integer - | \
      tee >(play -t raw -r 8k -e signed-integer -b 16 -c 1 -q -) | \
      pv -L 16000 -q | nc -N localhost 5050
  \endverbatim
  
  Finally, it is possible to send audio from the microphone directly into the server:
  
  \verbatim
  rec -r 8k -e signed-integer -c 1 -b 16 -t raw -q - | nc -N localhost 5050
  \endverbatim
  
  
  */
  
  
  }