  // doc/dnn3_scripts_context.dox
  
  
  // Copyright 2015   Johns Hopkins University (author: Daniel Povey)
  
  // See ../../COPYING for clarification regarding multiple authors
  //
  // Licensed under the Apache License, Version 2.0 (the "License");
  // you may not use this file except in compliance with the License.
  // You may obtain a copy of the License at
  
  //  http://www.apache.org/licenses/LICENSE-2.0
  
  // THIS CODE IS PROVIDED ON AN *AS IS* BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
  // KIND, EITHER EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION ANY IMPLIED
  // WARRANTIES OR CONDITIONS OF TITLE, FITNESS FOR A PARTICULAR PURPOSE,
  // MERCHANTABILITY OR NON-INFRINGEMENT.
  // See the Apache 2 License for the specific language governing permissions and
  // limitations under the License.
  
  namespace kaldi {
  namespace nnet3 {
  
  /**
    \page dnn3_scripts_context  Context and chunk-size in the "nnet3" setup
  
    \section dnn3_scripts_context_intro Introduction
  
    This page discusses certain issues of terminology in the nnet3 setup
    about chunk sizes for decoding and training, and left and right context.
    This will be helpful in understanding some of the scripts.  At the current
    time we don't have any 'overview' documentation of nnet3 from a scripting perspective,
    so this will have to stand as an isolated piece of documentation.
  
   \section dnn3_scripts_context_basics The basics
  
   If you have read the previous documentation available for \ref dnn3, you will
   realize that the "nnet3" setup supports network topologies other than simple feedforward
   DNNs.  It can be used for time delay neural networks (TDNNs), where temporal
   splicing (frame splicing) is done at internal layers of the network, and also
   for recurrent topologies (RNNs, LSTMs, BLSTMs, etc.).  So nnet3
   "knows about" the time axis.  Below we establish some terminology.
  
     \subsection dnn3_scripts_context_basics_context Left and right context
  
     Suppose we want a network to compute an output for a specific time index;
     to be concrete, say time t = 154.  If the network does frame splicing
     internally (or anything else nontrivial with the 't' indexes), it may not be able to
     compute this output without seeing a range of input frames.  For example,
     it may be impossible to compute the output without seeing the range of
     't' values from t = 150 through t = 157.  In this case (glossing over details),
     we'd say that the network has a \b left-context of 4 and a \b right-context of 3.
     The actual computation of the context is a bit more complex as it has to
     take into account special cases like where, say, the behavior for odd and
     even 't' values is different (c.f. Round() descriptors in
     \ref dnn3_dt_nnet_descriptor_config).
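
     To make this concrete, here is a small illustrative Python sketch (not code from
     Kaldi, and it ignores special cases like the Round() descriptors mentioned above):
     for a TDNN in which each layer splices a fixed set of time offsets, the per-layer
     offsets simply add up into the total left and right context of the network.
     \verbatim
     # Hypothetical sketch, not Kaldi code: total context of a network whose layers
     # each splice a fixed set of time offsets relative to their input.
     def total_context(layer_offsets):
         left, right = 0, 0
         for offsets in layer_offsets:
             left += -min(offsets)    # most negative offset adds to the left-context
             right += max(offsets)    # most positive offset adds to the right-context
         return left, right

     # Made-up example: splicing (-2..2), then (-1,0,1), then (-1,0)
     left, right = total_context([[-2, -1, 0, 1, 2], [-1, 0, 1], [-1, 0]])
     print(left, right)            # -> 4 3
     t = 154
     print(t - left, t + right)    # -> 150 157: the input range needed for output t=154
     \endverbatim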
  
     There are cases with recurrent topologies where, in addition to the
     "required" left and right context, we want to give the training or the
     decoding "extra" context.  For such topologies, the network can make use
     of context beyond the required context.
     In the scripts you'll generally see variables called
     \b extra-left-context and \b extra-right-context, which mean
     "the amount of context that we're going to provide in addition to what is required".
  
     In some circumstances the names \b left-context and
     \b right-context simply mean the total left and right context that we're
     adding to the chunks, i.e. the sums of the model left/right context and the
     extra left/right context.  So in some circumstances you may have to work out
     from the context whether a variable refers to the <em>model</em> left/right context
     or the left/right context of the chunks of data.
  
     In Kaldi version 5.0 and earlier, the left and right context in the chunks
     of data was not affected by whether the chunks were at the
     beginning or end of the utterance; at the ends we padded the input with copies of the
     first or last frame.  This means that for recurrent topologies, we might end up
     padding the start or end of the utterance with a lot of frames (up to 40 or so).
     This is wasteful and rather strange.
     In versions 5.1 and later, you can specify configuration values \b extra-left-context-initial and
     \b extra-right-context-final that allow the start/end of the utterance to have a different
     amount of context.  If you specify these values, you would normally set them both to 0
     (i.e. no extra context).  However, for backward compatibility with older setups, they
     generally default to -1 (meaning: just copy the default extra-left-context and extra-right-context).
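
     The arithmetic is simple enough to sketch in Python (illustrative only; the
     chunk size, model context and extra-context numbers below are made up):
     \verbatim
     # Hypothetical sketch, not Kaldi code: number of input frames needed for one chunk.
     # extra_left_initial / extra_right_final follow the -1 convention described above:
     # -1 means "just use the normal extra-left-context / extra-right-context".
     def chunk_input_frames(chunk_size, model_left, model_right,
                            extra_left, extra_right,
                            extra_left_initial=-1, extra_right_final=-1,
                            is_first_chunk=False, is_last_chunk=False):
         left = extra_left_initial if (is_first_chunk and extra_left_initial >= 0) else extra_left
         right = extra_right_final if (is_last_chunk and extra_right_final >= 0) else extra_right
         return model_left + left + chunk_size + right + model_right

     # e.g. chunk-size 140, model context (29, 29), extra context (40, 40),
     # extra-left-context-initial=0:
     print(chunk_input_frames(140, 29, 29, 40, 40))                             # 278 (interior chunk)
     print(chunk_input_frames(140, 29, 29, 40, 40, 0, -1, is_first_chunk=True)) # 238 (first chunk)
     \endverbatim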
  
  
     \subsection dnn3_scripts_context_basics_chunk Chunk size
  
     The \b chunk-size is the number of (output) frames for each chunk of data
     that we evaluate in training or decoding.  In the get_egs.sh script
     and train_dnn.py it is also referred to as \b frames-per-eg (in some contexts,
     this is not the same as the chunk size; see below).  In decoding we call this
     the \b frames-per-chunk.
  
     \subsubsection dnn3_scripts_context_basics_chunk_dnn Non-recurrent, non-chain case
  
     For the very simplest types of networks, such as feedforward networks or TDNNs
     trained with the cross-entropy objective function, we randomize the entire
     dataset at the frame level and we just train on one frame at a time.   In order
     for the training jobs to mostly do sequential I/O, we aim to pre-randomize the
     data at the frame level.  However, when you consider that we might easily
     require 10 frames each of left and right context, and we have to write this out,
     we could easily be increasing the amount of data by a factor of 20 or so when we
     generate the training examples.  To solve this problem we include labels for
     a range of time values, controlled by \b frames-per-eg (normally 8), and include
     enough left/right context that we can train on any of those 8 frames.  Then
     when we train the model, any given training job will pick one of those 8 frames to
     train on.
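
     A sketch of what such an example looks like (hypothetical code, not the actual
     example format; edge effects at utterance boundaries are ignored):
     \verbatim
     # Hypothetical sketch, not Kaldi code: a training example covering frames-per-eg=8
     # output frames, with enough input context to train on any one of them.
     import random

     def make_example(feats, labels, start_t, frames_per_eg=8,
                      left_context=10, right_context=10):
         first_input = start_t - left_context
         last_input = start_t + frames_per_eg - 1 + right_context
         return {"input": feats[first_input:last_input + 1],
                 "labels": labels[start_t:start_t + frames_per_eg],
                 "start_t": start_t}

     def pick_training_frame(eg):
         # a training job trains on just one of the labeled frames
         # (chosen at random here, purely for illustration)
         i = random.randrange(len(eg["labels"]))
         return eg["start_t"] + i, eg["labels"][i]
     \endverbatim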
  
     \subsubsection dnn3_scripts_context_basics_chunk_rnn  Recurrent or chain case
  
    In models that are RNNs or LSTMs or are \ref chain, we always train on fairly large
    chunks (generally in the range 40 to 150 frames).  This is referred to as the
    \b chunk-size.  When we decode, we also generally evaluate the neural net on fairly
    large chunks of data (e.g. 30, 50 or 100 frames).  This is usually referred to
    as the \b frames-per-chunk.  For recurrent networks we tend to
    make sure that the \b chunk-size/\b frames-per-chunk
    and the \b extra-left-context and \b extra-right-context are about the same in
    training and decoding, because this generally gives the best results (although
    sometimes it's best to make the extra-context values slightly larger in decoding).
    One might expect that at decoding time, longer context would always be better, but
    this does not always seem to be the case (however, see \ref dnn3_scripts_context_looped
    below, where we mention a way around this).
  
  
     \subsubsection dnn3_scripts_context_basics_chunk_subsampling Interaction of chunk size with frame-subsampling-factor
  
     In cases where there is frame-subsampling at the output (like the chain model),
     the chunk-size is still measured in multiples of 't', and we make sure (via
     rounding up in the code) that it's  a multiple of the frame-subsampling factor.
     Bear in mind that if the \b chunk-size is 90 and the \b frame-subsampling-factor
     is 3, then we're only evaluating 30 distinct output indexes for each chunk of
     90 frames (e.g. t=0, t=3 ... t=87).
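
     In code form (an illustrative sketch, not what Kaldi actually runs):
     \verbatim
     # Hypothetical sketch, not Kaldi code: round the chunk size up to a multiple of the
     # frame-subsampling-factor and list the output 't' indexes that get evaluated.
     def round_up_chunk(chunk_size, frame_subsampling_factor):
         f = frame_subsampling_factor
         return ((chunk_size + f - 1) // f) * f

     def output_indexes(chunk_size, frame_subsampling_factor):
         chunk_size = round_up_chunk(chunk_size, frame_subsampling_factor)
         return list(range(0, chunk_size, frame_subsampling_factor))

     # chunk-size 90, frame-subsampling-factor 3: 30 outputs, t = 0, 3, ..., 87
     assert output_indexes(90, 3) == list(range(0, 90, 3))
     assert len(output_indexes(90, 3)) == 30
     \endverbatim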
  
    \subsection dnn3_scripts_context_basics_variable Variable chunk size
  
    Variable chunk size is something used in training that is only available in Kaldi version
    5.1 or later.  This is a mechanism to allow fairly large chunks while avoiding
    the loss of data due to files that are not exact multiples of the chunk size.
    Instead of specifying the chunk size as (say) 150, we might specify the chunk
    size as a comma-separated list like 150,120,90,75, and the commands that generate the
    training examples are allowed to create chunks of any of those sizes.  The
    first chunk size specified is referred to as the primary chunk size, and is
    "special" in that for any given utterance, we are allowed to use at most two chunks
    of non-primary sizes; the remaining chunks must be of the primary chunk size.
    This restriction makes it easier to work out the optimal split of a file of
    a given length into chunks, and allows us to bias the chunk generation toward
    chunks of a certain length.
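
    The following Python sketch shows the flavour of the constraint (illustrative
    only; the real example-generation code is more sophisticated about how it
    distributes gaps and biases the choice of sizes):
    \verbatim
    # Hypothetical sketch, not Kaldi code: split an utterance into chunks drawn from
    # e.g. 150,120,90,75, using the primary size (150) for all but at most two chunks,
    # and minimizing the number of frames left over.
    from itertools import combinations_with_replacement

    def split_into_chunks(num_frames, sizes=(150, 120, 90, 75)):
        primary, others = sizes[0], sizes[1:]
        best = None
        for r in range(3):                            # 0, 1 or 2 non-primary chunks
            for extras in combinations_with_replacement(others, r):
                rest = num_frames - sum(extras)
                if rest < 0:
                    continue
                n_primary = rest // primary           # as many primary chunks as fit
                if n_primary + r == 0:
                    continue
                waste = rest - n_primary * primary
                if best is None or waste < best[0]:
                    best = (waste, [primary] * n_primary + list(extras))
        return best   # (frames discarded, list of chunk sizes), or None if too short

    print(split_into_chunks(435))   # -> (15, [150, 150, 120]): 15 frames are unused
    \endverbatim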
  
  
    \subsection dnn3_scripts_context_basics_minibatch Minibatch size
  
    The program nnet3-merge-egs merges individual training examples into
    minibatches containing many different examples (each original example
    gets a different 'n' index).  The \b minibatch-size is the desired
    size of a minibatch, by which we mean the number of examples (frames or
    sequences) that we combine into one (for example, minibatch-size=128).
    When the chunk sizes
    are variable (and taking into account that the context may be different
    at the start/end of utterances if we set the \b extra-left-context-initial
    and \b extra-right-context-final), it's important to ensure that only
    ``similar'' examples are merged into minibatches; this prevents expensive
    recompilation from happening on every single minibatch.
  
    In Kaldi version
    5.1 and later, nnet3-merge-egs only merges together chunks of the same
    structure (i.e. the same chunk-size and left and right context).
    It keeps reading chunks from the input until it finds that
    for some structure of input, there are \b minibatch-size examples ready
    to merge into one.  In Kaldi versions prior to 5.1 we generally discarded
    the "odd-numbered" examples that couldn't be fit into a normal-sized
    minibatch, but this becomes problematic now that there are many different
    chunk-sizes (we'd discard too much data).
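
    A sketch of the merging logic (illustrative only, not the actual nnet3-merge-egs
    implementation):
    \verbatim
    # Hypothetical sketch, not Kaldi code: merge examples into minibatches, but only
    # merge examples that share the same structure (chunk size, left and right context),
    # so that each distinct structure leads to just one compiled computation.
    from collections import defaultdict

    def merge_egs(examples, minibatch_size=128):
        """examples: iterable of (structure, eg) pairs, where structure is a hashable
        tuple such as (chunk_size, left_context, right_context)."""
        buckets = defaultdict(list)
        for structure, eg in examples:
            buckets[structure].append(eg)
            if len(buckets[structure]) == minibatch_size:
                yield buckets[structure]              # a full minibatch of one structure
                buckets[structure] = []
        # examples left over at the end of the input; what happens to these is
        # governed by the minibatch-size rules described in the next subsection
        for bucket in buckets.values():
            if bucket:
                yield bucket
    \endverbatim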
  
    \subsubsection dnn3_scripts_context_basics_minibatch_variable Variable minibatch size
  
    In Kaldi 5.1 and later,
    the --minibatch-size option is a more general string that gives the user more
    control than a single fixed minibatch size.  For example, you can specify --minibatch-size=64,128 and
    for each type of example it will try to accumulate batches of the
    largest specified size (128) and output
    them, until it reaches the end of the input; then it will output
    a minibatch of size 64 if there are >= 64 egs left.  Ranges are also
    supported, e.g. --minibatch-size=1:64 means to output minibatches of size 64
    until the end of the input, then output all remaining examples as a single
    minibatch.  You may also specify different rules for examples of different
    sizes (run nnet3-merge-egs without arguments for details of this); this can be useful
    to stay within GPU memory limits.
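
    As an illustration (hypothetical code; the syntax actually accepted by
    nnet3-merge-egs is richer, and running it without arguments prints the real rules),
    a spec like "64,128" or "1:64" can be thought of as a set of acceptable sizes or
    size ranges:
    \verbatim
    # Hypothetical sketch, not Kaldi code: interpret a --minibatch-size value.
    def parse_minibatch_spec(spec):
        """Returns a sorted list of (lo, hi) ranges of acceptable minibatch sizes."""
        ranges = []
        for piece in spec.split(","):
            if ":" in piece:
                lo, hi = piece.split(":")
                ranges.append((int(lo), int(hi)))
            else:
                n = int(piece)
                ranges.append((n, n))
        return sorted(ranges)

    def largest_allowed(spec):
        # in normal operation we accumulate up to the largest acceptable size;
        # the smaller sizes / ranges come into play at the end of the input
        return max(hi for _, hi in parse_minibatch_spec(spec))

    assert parse_minibatch_spec("64,128") == [(64, 64), (128, 128)]
    assert parse_minibatch_spec("1:64") == [(1, 64)]
    assert largest_allowed("64,128") == 128
    \endverbatim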
  
    \section dnn3_scripts_context_looped  Looped decoding
  
    Looped decoding in nnet3 is another feature that is new in Kaldi version 5.1.
    It is applicable to forward-recurrent neural networks such as RNNs and LSTMs
    (but not to BLSTMs).  It allows us to re-use hidden-state activations from
    previously-computed chunks.  This allows us to have effectively unlimited left
    context.  The reason why it's called ``looped decoding'' relates to the way
    it's implemented: we create a computation whose last statement is a 'goto'
    that jumps to somewhere in the middle, so effectively it has a loop like
    'while(1)'.  (Note: the computations have statements that request user input or
    provide output, so the loop doesn't cause the computation to run indefinitely when called;
    it will stop when an I/O operation is reached).  Looped computation is intended to solve two problems: wasteful
    computation, and latency.  Suppose we trained our LSTMs with 40 frames of left
    context and a chunk-size of 100.  Without looped computation, we'd probably
    want to decode with chunks of about 100 frames and we'd left-pad the input with around 40
    frames.  But this requires about 40\% extra computation; and a chunk size of 1
    second would be a problem for latency/responsiveness in a real-time
    application.  With looped computation, we can choose any chunk size that's
    convenient, because the effective left context is infinite; and the chunk size
    doesn't affect the computed output any more.
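
    The essential effect can be illustrated with a toy recurrence (hypothetical code;
    the real looped computation is compiled from the network itself):
    \verbatim
    # Hypothetical sketch, not Kaldi code: state computed for one chunk is carried into
    # the next chunk, so the effective left context is unbounded and the chunk size no
    # longer changes the outputs that are computed.
    def decode_looped(frames, chunk_size, step, initial_state):
        """step(state, frame) -> (new_state, output); state persists across chunks."""
        state, outputs = initial_state, []
        for chunk_start in range(0, len(frames), chunk_size):
            for frame in frames[chunk_start:chunk_start + chunk_size]:
                state, out = step(state, frame)
                outputs.append(out)
            # no re-initialization between chunks: the 'goto' in the looped computation
            # plays the same role, resuming from the saved activations when more
            # input arrives
        return outputs
    \endverbatim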
  
    However, there is a slight problem with what we sketched out above.  In
    practice, we've found for LSTMs that decoding works best with about the same
    chunk sizes and context as we trained with.  That is, adding more context than
    we trained on is not helpful.  Our theory about why this happens is that
    as the context gets longer we reach parts of activation space that were unreachable
    before.  The maximum value of the cells \f$c_t\f$ in LSTMs rises linearly with
    the number of frames we've seen.  Following this theory, we made a modification
    to LSTMs that seems to fix the problem.  We scale the \f$c_t\f$ in the LSTM equations
    by a value slightly less than one in the recurrence (for example, 0.9).
    This puts a bound on the maximum hidden activations and makes them
    increase less dramatically with increasing recurrence time.  It is specified
    via the "decay-time" configuration value of the LSTM components in the "xconfig"
    configuration files, e.g. "decay-time=20".  This doesn't seem to
    degrade the Word Error Rates, and it removes the discrepancy between regular
    and looped decoding (i.e. it makes the networks tolerant to longer context than
    was seen in training).
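
    A toy numerical example of why the scaling helps (hypothetical code; it does not
    show how the "decay-time" value maps to the actual scale used in the LSTM layers,
    we just assume a scale of 0.95 for illustration):
    \verbatim
    # Hypothetical sketch, not Kaldi code: with per-frame increments of size up to 1,
    # an unscaled cell value can grow linearly with the number of frames seen, while
    # scaling the recurrence by s < 1 bounds it by 1 / (1 - s).
    def max_cell_value(num_frames, scale, increment=1.0):
        c = 0.0
        for _ in range(num_frames):
            c = scale * c + increment     # worst case: every frame adds the maximum
        return c

    print(max_cell_value(1000, 1.0))      # ~1000: grows linearly with time
    print(max_cell_value(1000, 0.95))     # ~20: bounded by increment / (1 - scale)
    \endverbatim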
  
    The script steps/nnet3/decode_looped.sh (only available from Kaldi version 5.1)
    takes only two chunk- or context-related configuration values:
    \b frames-per-chunk (which only affects the speed/latency tradeoff and not
    results), and \b extra-left-context-initial, which should be set to
    match the training condition (generally this will be zero, in up-to-date
    scripts).
  
  
    At the time of writing, we have not yet created a program similar to
    online2-wav-nnet3-latgen-faster that uses the looped decoder; that is
    on our TODO list (it's not inherently difficult).
  
  
   - Up: \ref dnn3
   - Previous: \ref dnn3_code_optimization
  
  */
  
  }
  }