  // doc/dnn3_scripts_context.dox
  
  
  // Copyright 2015   Johns Hopkins University (author: Daniel Povey)
  
  // See ../../COPYING for clarification regarding multiple authors
  //
  // Licensed under the Apache License, Version 2.0 (the "License");
  // you may not use this file except in compliance with the License.
  // You may obtain a copy of the License at
  
  //  http://www.apache.org/licenses/LICENSE-2.0
  
  // THIS CODE IS PROVIDED ON AN *AS IS* BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
  // KIND, EITHER EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION ANY IMPLIED
  // WARRANTIES OR CONDITIONS OF TITLE, FITNESS FOR A PARTICULAR PURPOSE,
  // MERCHANTABILITY OR NON-INFRINGEMENT.
  // See the Apache 2 License for the specific language governing permissions and
  // limitations under the License.
  
  namespace kaldi {
  namespace nnet3 {
  
  /**
    \page dnn3_scripts_context  Context and chunk-size in the "nnet3" setup
  
    \section dnn3_scripts_context_intro Introduction
  
    This page discusses certain issues of terminology in the nnet3 setup
    about chunk sizes for decoding and training, and left and right context.
    This will be helpful in understanding some of the scripts.  At the current
    time we don't have any 'overview' documentation of nnet3 from a scripting perspective,
    so this will have to stand as an isolated piece of documentation.
  
   \section dnn3_scripts_context_basics The basics
  
   If you have read the previous documentation available for \ref dnn3, you will
   realize that the "nnet3" setup supports network topologies other than simple feedforward
   DNNs.  It can be used for time delay neural networks (TDNNs), where temporal
   splicing (frame splicing) is done at internal layers of the network, and also
   for recurrent topologies (RNNs, LSTMs, BLSTMs, etc.).  So nnet3
   "knows about" the time axis.  Below we establish some terminology.
  
     \subsection dnn3_scripts_context_basics_context Left and right context
  
     Suppose we want a network to compute an output for a specific time index;
     to be concrete, say time t = 154.  If the network does frame splicing
     internally (or anything else nontrivial with the 't' indexes), it may not be able to
     compute this output without seeing a range of input frames.  For example,
     it may be impossible to compute the output without seeing the range of
     't' values from t = 150 through t = 157.  In this case (glossing over details),
     we'd say that the network has a \b left-context of 4 and a \b right-context of 3.
     The actual computation of the context is a bit more complex as it has to
     take into account special cases like where, say, the behavior for odd and
     even 't' values is different (c.f. Round() descriptors in
     \ref dnn3_dt_nnet_descriptor_config).
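
     To make this concrete, here is a small illustrative Python sketch (not code from
     Kaldi, and it ignores special cases like the Round() descriptors mentioned above):
     for a TDNN in which each layer splices a fixed set of time offsets, the per-layer
     offsets simply add up into the total left and right context of the network.
     \verbatim
     # Hypothetical sketch, not Kaldi code: total context of a network whose layers
     # each splice a fixed set of time offsets relative to their input.
     def total_context(layer_offsets):
         left, right = 0, 0
         for offsets in layer_offsets:
             left += -min(offsets)    # most negative offset adds to the left-context
             right += max(offsets)    # most positive offset adds to the right-context
         return left, right

     # Made-up example: splicing (-2..2), then (-1,0,1), then (-1,0)
     left, right = total_context([[-2, -1, 0, 1, 2], [-1, 0, 1], [-1, 0]])
     print(left, right)            # -> 4 3
     t = 154
     print(t - left, t + right)    # -> 150 157: the input range needed for output t=154
     \endverbatim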
  
     There are cases with recurrent topologies where, in addition to the
     "required" left and right context, we want to give the training or the
     decoding "extra" context.  For such topologies, the network can make use
     of context beyond the required context.
     In the scripts you'll generally see variables called
     \b extra-left-context and \b extra-right-context, which mean
     "the amount of context that we're going to provide in addition to what is required".
  
     In some circumstances the names \b left-context and
     \b right-context simply mean the total left and right context that we're
     adding to the chunks, i.e. the sums of the model left/right context and the
     extra left/right context.  So in some circumstances you may have to work out
     from the context whether a variable refers to the <em>model</em> left/right context
     or the left/right context of the chunks of data.
  
     In Kaldi version 5.0 and earlier, the left and right context in the chunks
     of data was not affected by whether the chunks were at the
     beginning or end of the utterance; at the ends we padded the input with copies of the
     first or last frame.  This means that for recurrent topologies, we might end up
     padding the start or end of the utterance with a lot of frames (up to 40 or so).
     This is wasteful and rather strange.
     In versions 5.1 and later, you can specify configuration values \b extra-left-context-initial and
     \b extra-right-context-final that allow the start/end of the utterance to have a different
     amount of context.  If you specify these values, you would normally set them both to 0
     (i.e. no extra context).  However, for backward compatibility with older setups, they
     generally default to -1 (meaning: just copy the default extra-left-context and extra-right-context).
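
     The arithmetic is simple enough to sketch in Python (illustrative only; the
     chunk size, model context and extra-context numbers below are made up):
     \verbatim
     # Hypothetical sketch, not Kaldi code: number of input frames needed for one chunk.
     # extra_left_initial / extra_right_final follow the -1 convention described above:
     # -1 means "just use the normal extra-left-context / extra-right-context".
     def chunk_input_frames(chunk_size, model_left, model_right,
                            extra_left, extra_right,
                            extra_left_initial=-1, extra_right_final=-1,
                            is_first_chunk=False, is_last_chunk=False):
         left = extra_left_initial if (is_first_chunk and extra_left_initial >= 0) else extra_left
         right = extra_right_final if (is_last_chunk and extra_right_final >= 0) else extra_right
         return model_left + left + chunk_size + right + model_right

     # e.g. chunk-size 140, model context (29, 29), extra context (40, 40),
     # extra-left-context-initial=0:
     print(chunk_input_frames(140, 29, 29, 40, 40))                             # 278 (interior chunk)
     print(chunk_input_frames(140, 29, 29, 40, 40, 0, -1, is_first_chunk=True)) # 238 (first chunk)
     \endverbatim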
  
  
     \subsection dnn3_scripts_context_basics_chunk Chunk size
  
     The \b chunk-size is the number of (output) frames for each chunk of data
     that we evaluate in training or decoding.  In the get_egs.sh script
     and train_dnn.py it is also referred to as \b frames-per-eg (in some contexts,
     this is not the same as the chunk size; see below).  In decoding we call this
     the \b frames-per-chunk.
  
     \subsubsection dnn3_scripts_context_basics_chunk_dnn Non-recurrent, non-chain case
  
     For the very simplest types of networks, such as feedforward networks or TDNNs
     trained with the cross-entropy objective function, we randomize the entire
     dataset at the frame level and we just train on one frame at a time.   In order
     for the training jobs to mostly do sequential I/O, we aim to pre-randomize the
     data at the frame level.  However, when you consider that we might easily
     require 10 frames each of left and right context, and we have to write this out,
     we could easily be increasing the amount of data by a factor of 20 or so when we
     generate the training examples.  To solve this problem we include labels for
     a range of time values, controlled by \b frames-per-eg (normally 8), and include
     enough left/right context that we can train on any of those 8 frames.  Then
     when we train the model, any given training job will pick one of those 8 frames to
     train on.
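
     A sketch of what such an example looks like (hypothetical code, not the actual
     example format; edge effects at utterance boundaries are ignored):
     \verbatim
     # Hypothetical sketch, not Kaldi code: a training example covering frames-per-eg=8
     # output frames, with enough input context to train on any one of them.
     import random

     def make_example(feats, labels, start_t, frames_per_eg=8,
                      left_context=10, right_context=10):
         first_input = start_t - left_context
         last_input = start_t + frames_per_eg - 1 + right_context
         return {"input": feats[first_input:last_input + 1],
                 "labels": labels[start_t:start_t + frames_per_eg],
                 "start_t": start_t}

     def pick_training_frame(eg):
         # a training job trains on just one of the labeled frames
         # (chosen at random here, purely for illustration)
         i = random.randrange(len(eg["labels"]))
         return eg["start_t"] + i, eg["labels"][i]
     \endverbatim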
  
     \subsubsection dnn3_scripts_context_basics_chunk_rnn  Recurrent or chain case
  
    In models that are RNNs or LSTMs or are \ref chain, we always train on fairly large
    chunks (generally in the range 40 to 150 frames).  This is referred to as the
    \b chunk-size.  When we decode, we also generally evaluate the neural net on fairly
    large chunks of data (e.g. 30, 50 or 100 frames).  This is usually referred to
    as the \b frames-per-chunk.  For recurrent networks we tend to
    make sure that the \b chunk-size/\b frames-per-chunk
    and the \b extra-left-context and \b extra-right-context are about the same in
    training and decoding, because this generally gives the best results (although
    sometimes it's best to make the extra-context values slightly larger in decoding).
    One might expect that at decoding time, longer context would always be better, but
    this does not always seem to be the case (however, see \ref dnn3_scripts_context_looped
    below, where we mention a way around this).
  
  
     \subsubsection dnn3_scripts_context_basics_chunk_subsampling Interaction of chunk size with frame-subsampling-factor
  
     In cases where there is frame-subsampling at the output (like the chain model),
     the chunk-size is still measured in multiples of 't', and we make sure (via
     rounding up in the code) that it's  a multiple of the frame-subsampling factor.
     Bear in mind that if the \b chunk-size is 90 and the \b frame-subsampling-factor
     is 3, then we're only evaluating 30 distinct output indexes for each chunk of
     90 frames (e.g. t=0, t=3 ... t=87).
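
     In code form (an illustrative sketch, not what Kaldi actually runs):
     \verbatim
     # Hypothetical sketch, not Kaldi code: round the chunk size up to a multiple of the
     # frame-subsampling-factor and list the output 't' indexes that get evaluated.
     def round_up_chunk(chunk_size, frame_subsampling_factor):
         f = frame_subsampling_factor
         return ((chunk_size + f - 1) // f) * f

     def output_indexes(chunk_size, frame_subsampling_factor):
         chunk_size = round_up_chunk(chunk_size, frame_subsampling_factor)
         return list(range(0, chunk_size, frame_subsampling_factor))

     # chunk-size 90, frame-subsampling-factor 3: 30 outputs, t = 0, 3, ..., 87
     assert output_indexes(90, 3) == list(range(0, 90, 3))
     assert len(output_indexes(90, 3)) == 30
     \endverbatim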
  
    \subsection dnn3_scripts_context_basics_variable Variable chunk size
  
    Variable chunk size is something used in training that is only available in Kaldi version
    5.1 or later.  This is a mechanism to allow fairly large chunks while avoiding
    the loss of data due to files that are not exact multiples of the chunk size.
    Instead of specifying the chunk size as (say) 150, we might specify the chunk
    size as a comma-separated list like 150,120,90,75, and the commands that generate the
    training examples are allowed to create chunks of any of those sizes.  The
    first chunk size specified is referred to as the primary chunk size, and is
    "special" in that for any given utterance, we are allowed to use at most two chunks
    of non-primary sizes; the remaining chunks must be of the primary chunk size.
    This restriction makes it easier to work out the optimal split of a file of
    a given length into chunks, and allows us to bias the chunk generation toward
    chunks of a certain length.
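
    The following Python sketch shows the flavour of the constraint (illustrative
    only; the real example-generation code is more sophisticated about how it
    distributes gaps and biases the choice of sizes):
    \verbatim
    # Hypothetical sketch, not Kaldi code: split an utterance into chunks drawn from
    # e.g. 150,120,90,75, using the primary size (150) for all but at most two chunks,
    # and minimizing the number of frames left over.
    from itertools import combinations_with_replacement

    def split_into_chunks(num_frames, sizes=(150, 120, 90, 75)):
        primary, others = sizes[0], sizes[1:]
        best = None
        for r in range(3):                            # 0, 1 or 2 non-primary chunks
            for extras in combinations_with_replacement(others, r):
                rest = num_frames - sum(extras)
                if rest < 0:
                    continue
                n_primary = rest // primary           # as many primary chunks as fit
                if n_primary + r == 0:
                    continue
                waste = rest - n_primary * primary
                if best is None or waste < best[0]:
                    best = (waste, [primary] * n_primary + list(extras))
        return best   # (frames discarded, list of chunk sizes), or None if too short

    print(split_into_chunks(435))   # -> (15, [150, 150, 120]): 15 frames are unused
    \endverbatim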
  
  
    \subsection dnn3_scripts_context_basics_minibatch Minibatch size
  
    The program nnet3-merge-egs merges individual training examples into
    minibatches containing many different examples (each original example
    gets a different 'n' index).  The \b minibatch-size is the desired
    size of a minibatch, by which we mean the number of examples (frames or
    sequences) that we combine into one (for example, minibatch-size=128).
    When the chunk sizes
    are variable (and taking into account that the context may be different
    at the start/end of utterances if we set the \b extra-left-context-initial
    and \b extra-right-context-final), it's important to ensure that only
    ``similar'' examples are merged into minibatches; this prevents expensive
    recompilation from happening on every single minibatch.
  
    In Kaldi version
    5.1 and later, nnet3-merge-egs only merges together chunks of the same
    structure (i.e. the same chunk-size and left and right context).
    It keeps reading chunks from the input until it finds that
    for some structure of input, there are \b minibatch-size examples ready
    to merge into one.  In Kaldi versions prior to 5.1 we generally discarded
    the "odd-numbered" examples that couldn't be fit into a normal-sized
    minibatch, but this becomes problematic now that there are many different
    chunk-sizes (we'd discard too much data).
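
    A sketch of the merging logic (illustrative only, not the actual nnet3-merge-egs
    implementation):
    \verbatim
    # Hypothetical sketch, not Kaldi code: merge examples into minibatches, but only
    # merge examples that share the same structure (chunk size, left and right context),
    # so that each distinct structure leads to just one compiled computation.
    from collections import defaultdict

    def merge_egs(examples, minibatch_size=128):
        """examples: iterable of (structure, eg) pairs, where structure is a hashable
        tuple such as (chunk_size, left_context, right_context)."""
        buckets = defaultdict(list)
        for structure, eg in examples:
            buckets[structure].append(eg)
            if len(buckets[structure]) == minibatch_size:
                yield buckets[structure]              # a full minibatch of one structure
                buckets[structure] = []
        # examples left over at the end of the input; what happens to these is
        # governed by the minibatch-size rules described in the next subsection
        for bucket in buckets.values():
            if bucket:
                yield bucket
    \endverbatim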
  
    \subsubsection dnn3_scripts_context_basics_minibatch_variable Variable minibatch size
  
    In Kaldi 5.1 and later,
    the --minibatch-size option is a more general string that gives the user more
    control than a single fixed minibatch size.  For example, you can specify --minibatch-size=64,128 and
    for each type of example it will try to accumulate batches of the
    largest specified size (128) and output
    them, until it reaches the end of the input; then it will output
    a minibatch of size 64 if there are >= 64 egs left.  Ranges are also
    supported, e.g. --minibatch-size=1:64 means to output minibatches of size 64
    until the end of the input, then output all remaining examples as a single
    minibatch.  You may also specify different rules for examples of different
    sizes (run nnet3-merge-egs without arguments for details of this); this can be useful
    to stay within GPU memory limits.
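
    As an illustration (hypothetical code; the syntax actually accepted by
    nnet3-merge-egs is richer, and running it without arguments prints the real rules),
    a spec like "64,128" or "1:64" can be thought of as a set of acceptable sizes or
    size ranges:
    \verbatim
    # Hypothetical sketch, not Kaldi code: interpret a --minibatch-size value.
    def parse_minibatch_spec(spec):
        """Returns a sorted list of (lo, hi) ranges of acceptable minibatch sizes."""
        ranges = []
        for piece in spec.split(","):
            if ":" in piece:
                lo, hi = piece.split(":")
                ranges.append((int(lo), int(hi)))
            else:
                n = int(piece)
                ranges.append((n, n))
        return sorted(ranges)

    def largest_allowed(spec):
        # in normal operation we accumulate up to the largest acceptable size;
        # the smaller sizes / ranges come into play at the end of the input
        return max(hi for _, hi in parse_minibatch_spec(spec))

    assert parse_minibatch_spec("64,128") == [(64, 64), (128, 128)]
    assert parse_minibatch_spec("1:64") == [(1, 64)]
    assert largest_allowed("64,128") == 128
    \endverbatim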
  
    \section dnn3_scripts_context_looped  Looped decoding
  
    Looped decoding in nnet3 is another feature that is new in Kaldi version 5.1.
    It is applicable to forward-recurrent neural networks such as RNNs and LSTMs
    (but not to BLSTMs).  It allows us to re-use hidden-state activations from
    previously-computed chunks.  This allows us to have effectively unlimited left
    context.  The reason why it's called ``looped decoding'' relates to the way
    it's implemented: we create a computation whose last statement is a 'goto'
    that jumps to somewhere in the middle, so effectively it has a loop like
    'while(1)'.  (Note: the computations have statements that request user input or
    provide output, so the loop doesn't cause the computation to run indefinitely when called;
    it will stop when an I/O operation is reached).  Looped computation is intended to solve two problems: wasteful
    computation, and latency.  Suppose we trained our LSTMs with 40 frames of left
    context and a chunk-size of 100.  Without looped computation, we'd probably
    want to decode with chunks of about 100 frames and we'd left-pad the input with around 40
    frames.  But this requires about 40\% extra computation; and a chunk size of 1
    second would be a problem for latency/responsiveness in a real-time
    application.  With looped computation, we can choose any chunk size that's
    convenient, because the effective left context is infinite; and the chunk size
    doesn't affect the computed output any more.
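
    The essential effect can be illustrated with a toy recurrence (hypothetical code;
    the real looped computation is compiled from the network itself):
    \verbatim
    # Hypothetical sketch, not Kaldi code: state computed for one chunk is carried into
    # the next chunk, so the effective left context is unbounded and the chunk size no
    # longer changes the outputs that are computed.
    def decode_looped(frames, chunk_size, step, initial_state):
        """step(state, frame) -> (new_state, output); state persists across chunks."""
        state, outputs = initial_state, []
        for chunk_start in range(0, len(frames), chunk_size):
            for frame in frames[chunk_start:chunk_start + chunk_size]:
                state, out = step(state, frame)
                outputs.append(out)
            # no re-initialization between chunks: the 'goto' in the looped computation
            # plays the same role, resuming from the saved activations when more
            # input arrives
        return outputs
    \endverbatim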
  
    However, there is a slight problem with what we sketched out above.  In
    practice, we've found for LSTMs that decoding works best with about the same
    chunk sizes and context as we trained with.  That is, adding more context than
    we trained on is not helpful.  Our theory about why this happens is that
    as the context gets longer we reach parts of activation space that were unreachable
    before.  The maximum value of the cells \f$c_t\f$ in LSTMs rises linearly with
    the number of frames we've seen.  Following this theory, we made a modification
    to LSTMs that seems to fix the problem.  We scale the \f$c_t\f$ in the LSTM equations
    by a value slightly less than one in the recurrence (for example, 0.9).
    This puts a bound on the maximum hidden activations and makes them
    increase less dramatically with increasing recurrence time.  It is specified
    via the "decay-time" configuration value of the LSTM components in the "xconfig"
    configuration files, e.g. "decay-time=20".  This doesn't seem to
    degrade the Word Error Rates, and it removes the discrepancy between regular
    and looped decoding (i.e. it makes the networks tolerant to longer context than
    was seen in training).
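
    A toy numerical example of why the scaling helps (hypothetical code; it does not
    show how the "decay-time" value maps to the actual scale used in the LSTM layers,
    we just assume a scale of 0.95 for illustration):
    \verbatim
    # Hypothetical sketch, not Kaldi code: with per-frame increments of size up to 1,
    # an unscaled cell value can grow linearly with the number of frames seen, while
    # scaling the recurrence by s < 1 bounds it by 1 / (1 - s).
    def max_cell_value(num_frames, scale, increment=1.0):
        c = 0.0
        for _ in range(num_frames):
            c = scale * c + increment     # worst case: every frame adds the maximum
        return c

    print(max_cell_value(1000, 1.0))      # ~1000: grows linearly with time
    print(max_cell_value(1000, 0.95))     # ~20: bounded by increment / (1 - scale)
    \endverbatim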
  
    The script steps/nnet3/decode_looped.sh (only available from Kaldi version 5.1)
    takes only two chunk- or context-related configuration values:
    \b frames-per-chunk (which only affects the speed/latency tradeoff and not
    results), and \b extra-left-context-initial, which should be set to
    match the training condition (generally this will be zero, in up-to-date
    scripts).
  
  
    At the time of writing, we have not yet created a program similar to
    online2-wav-nnet3-latgen-faster that uses the looped decoder; that is
    on our TODO list (it's not inherently difficult).
  
  
   - Up: \ref dnn3
   - Previous: \ref dnn3_code_optimization
  
  */
  
  }
  }