  // doc/graph_recipe_train.dox
  
  
  // Copyright 2009-2011 Microsoft Corporation
  
  // See ../../COPYING for clarification regarding multiple authors
  //
  // Licensed under the Apache License, Version 2.0 (the "License");
  // you may not use this file except in compliance with the License.
  // You may obtain a copy of the License at
  
  //  http://www.apache.org/licenses/LICENSE-2.0
  
  // THIS CODE IS PROVIDED *AS IS* BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
  // KIND, EITHER EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION ANY IMPLIED
  // WARRANTIES OR CONDITIONS OF TITLE, FITNESS FOR A PARTICULAR PURPOSE,
  // MERCHANTABLITY OR NON-INFRINGEMENT.
  // See the Apache 2 License for the specific language governing permissions and
  // limitations under the License.
  
  namespace kaldi {
  
  /**
  
   \page graph_recipe_train Decoding-graph creation recipe (training time)
  
   Our graph creation recipe during training time is simpler than during test time,
   mainly because we do not need the disambiguation symbols.  We will assume
 that you have already read the \ref graph_recipe_test "test-time
   version" of this recipe.
  
 At training time we use the same HCLG formalism as at test time, except that
   G consists of just a linear acceptor corresponding to the training transcript
   (of course, this setup is easy to extend to cases where there is uncertainty
   in the transcripts).
  
   \section graph_recipe_command Command-line programs involved in decoding-graph creation
  
  Here we explain what happens at the script level in a typical training script;
  below we will dig into what happens inside the programs.  Suppose we have built
  a tree and model.  The following command creates an archive (c.f. \ref
  io_sec_tables) that contains the graph HCLG corresponding to each of the
  training transcripts.
  \verbatim
  compile-train-graphs $dir/tree $dir/1.mdl data/L.fst ark:data/train.tra \
     ark:$dir/graphs.fsts
  \endverbatim
The input file train.tra contains integer versions of the training
transcripts; a typical line might look like this:
  \verbatim
  011c0201 110906 96419 79214 110906 52026 55810 82385 79214 51250 106907 111943 99519 79220
  \endverbatim
  (the first token is the utterance id).  The output of this program is the archive
  graphs.fsts; it contains an FST (in binary form) for each utterance id in train.tra.
  The FSTs in this archive correspond to HCLG, except
  that there are no transition probabilities (by default, compile-train-graphs
  has --self-loop-scale=0 and --transition-scale=0).  This is because these graphs
  will be used for multiple stages of training and the transition probabilities will
change, so we add them in later.  However, the FSTs in the archive will have
probabilities that arise from the silence probabilities (these are encoded into
L.fst), and if we were using pronunciation probabilities those would be present too.
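
As a concrete illustration (the paths below, such as utils/sym2int.pl and
data/lang/words.txt, are assumptions about a typical setup rather than anything
this recipe requires): the integer transcripts are normally produced from
word-level text with the sym2int.pl script, and the compiled archive can be
examined by converting it to text form, for instance with the fstcopy program:
\verbatim
# produce integer transcripts from lines of the form "utt-id word1 word2 ..."
utils/sym2int.pl -f 2- data/lang/words.txt < data/train.txt > data/train.tra
# print the first compiled graph in text (OpenFst) form, preceded by its utterance id
fstcopy ark:$dir/graphs.fsts ark,t:- | head
\endverbatim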
  
A command that reads this archive and
decodes the training data, producing state-level alignments
(c.f. \ref hmm_alignment), is as follows; we will briefly review this command,
although our main focus on this page is the graph creation itself.
  \verbatim
  gmm-align-compiled \
    --transition-scale=1.0 --acoustic-scale=0.1 --self-loop-scale=0.1 \
    --beam=8 --retry-beam=40 \
     $dir/$x.mdl ark:$dir/graphs.fsts \
     "ark:add-deltas --print-args=false scp:data/train.scp ark:- |" \
     ark:$dir/cur.ali
  \endverbatim
The first three options are probability scales (c.f. \ref hmm_scale).  The
transition-scale and self-loop-scale options are there because this program adds
the transition probabilities to the graph before decoding (c.f.
\ref hmm_graph_add_transition_probs).  The next options are the beams: we align
with the initial beam, and if alignment fails to reach the end state we retry with
the larger beam.  Since we apply a scale of 0.1 to the acoustics, we would have to
multiply these beams by 10 to get figures comparable to acoustic likelihoods.  The
program requires the model; \$x.mdl would expand to 1.mdl or 2.mdl, depending on the
iteration.  It reads the graphs from the archive (graphs.fsts).  The argument in
quotes is evaluated as a pipe (minus the "ark:" and "|"), and is interpreted as
an archive, indexed by utterance id, containing features.  The output goes to
cur.ali; if written in text form it would look like the .tra file described
above, except that the integers now correspond not to word-ids but to transition-ids
(c.f. \ref transition_model_identifiers).
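
To sanity-check the alignments it can help to view them in text form; the commands
below are a sketch (the file names are the ones used above, but are otherwise
assumptions), using copy-int-vector to print the raw transition-ids and
ali-to-phones to map each alignment to a phone-level transcript:
\verbatim
# print the first alignment as a sequence of transition-ids
copy-int-vector ark:$dir/cur.ali ark,t:- | head -n 1
# convert the alignments to phone-id sequences using the model's transition model
ali-to-phones $dir/$x.mdl ark:$dir/cur.ali ark,t:- | head -n 1
\endverbatim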
  
  We note that these two stages (graph creation and decoding) can be done
  by a single command-line program, gmm-align, which compiles the graphs as they
are needed.  Because graph compilation takes a significant amount of time, we
write the graphs to disk whenever they will be needed more than once.
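
For completeness, the corresponding gmm-align invocation would look roughly like the
following.  This is a sketch that mirrors the gmm-align-compiled command above; the
argument order shown (tree, model, lexicon, features, transcripts, alignments) should
be checked against the program's usage message.
\verbatim
gmm-align --transition-scale=1.0 --acoustic-scale=0.1 --self-loop-scale=0.1 \
  --beam=8 --retry-beam=40 \
  $dir/tree $dir/$x.mdl data/L.fst \
  "ark:add-deltas --print-args=false scp:data/train.scp ark:- |" \
  ark:data/train.tra ark:$dir/cur.ali
\endverbatim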
  
  
  \section graph_recipe_internal Internals of graph creation
  
 The graph creation done at training time, whether from the program
 compile-train-graphs or from gmm-align, is handled by the same code:
 the class TrainingGraphCompiler.  Its behaviour when compiling graphs
 one by one is as follows:
  
   In initialization:
     - Add the subsequential loop (c.f. \ref graph_c) to the lexicon L.fst, make sure it is 
      sorted on its output label, and store it.
  
 When compiling a transcript, it does the following (a code sketch of these steps
 appears after the list):
     - Make a linear acceptor corresponding to the word sequence (G)
     - Compose it (using TableCompose) with the lexicon to get an FST (LG) with
       phones on the input and words on the output; this FST contains
       pronunciation alternatives and optional silence.
     - Create a new context FST "C" that will be expanded on demand.
     - Compose C with LG to get CLG; the composition uses a special Matcher
       that expands C on demand.
     - Call the function GetHTransducer to get the transducer H (this covers
       just the context-dependent phones that were seen).
     - Compose H with CLG to get HCLG (this will have transition-ids on the
      input and words on the output, but no self-loops).
     - Determinize HCLG with the function DeterminizeStarInLog; this
       casts to the log semiring (to preserve stochasticity) before determinizing
       with epsilon removal.
     - Minimize HCLG with MinimizeEncoded (this does transducer minimization without
       weight-pushing, to preserve stochasticity).
     - Add self-loops.
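
 Below is a minimal C++ sketch of driving these steps through the class interface,
 roughly in the way compile-train-graphs does.  The file names, header paths and
 option values are assumptions for illustration; see decoder/training-graph-compiler.h
 and compile-train-graphs.cc for the authoritative interfaces.
\verbatim
#include "decoder/training-graph-compiler.h"
#include "hmm/transition-model.h"
#include "tree/context-dep.h"
#include "fstext/kaldi-fst-io.h"
#include "util/common-utils.h"

// Sketch: compile one training graph for a transcript given as word-ids.
void CompileOneGraph(const std::vector<kaldi::int32> &transcript,
                     fst::VectorFst<fst::StdArc> *out_fst) {
  using namespace kaldi;
  ContextDependency ctx_dep;            // the decision tree
  ReadKaldiObject("tree", &ctx_dep);
  TransitionModel trans_model;          // only the transition model is needed here
  ReadKaldiObject("1.mdl", &trans_model);
  // The lexicon; the compiler takes ownership of this pointer, adds the
  // subsequential loop and sorts it on the output label.
  fst::VectorFst<fst::StdArc> *lex_fst = fst::ReadFstKaldi("L.fst");

  TrainingGraphCompilerOptions opts;
  opts.transition_scale = 0.0;  // matching compile-train-graphs' defaults:
  opts.self_loop_scale = 0.0;   // transition probabilities get added later.
  std::vector<int32> disambig_syms;  // empty: no disambiguation symbols needed.

  TrainingGraphCompiler compiler(trans_model, ctx_dep, lex_fst,
                                 disambig_syms, opts);
  // Runs the sequence above: L o G, C o LG, H o CLG, determinization,
  // minimization, and adding the self-loops.
  compiler.CompileGraphFromText(transcript, out_fst);
}
\endverbatim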
  
   The TrainingGraphCompiler class has a function CompileGraphs() that will combine
   a number of graphs in a batch.  This is used in the tool compile-train-graphs
   to speed up the graph compilation.  The main reason it helps to do it in a batch
   is that when we create the H transducer, we only have to process each seen
   context-window once, even if it was used in many of the instances of CLG.
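
 For example, assuming the tool's --batch-size option (check compile-train-graphs
 --help to confirm the option name and its default), the batch size can be reduced
 if memory becomes a concern, at some cost in compilation speed:
\verbatim
compile-train-graphs --batch-size=100 $dir/tree $dir/1.mdl data/L.fst \
   ark:data/train.tra ark:$dir/graphs.fsts
\endverbatim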
  
 The reason we can determinize without first adding disambiguation symbols
 is that HCLG in this case is functional (each accepted input-label sequence is
 transduced to exactly one output string) and acyclic; any acyclic FST has the twins
 property, and any functional FST with the twins property is determinizable.
  
  
  */
  
  
  }