// doc/hmm.dox
// Copyright 2009-2011 Microsoft Corporation
// See ../../COPYING for clarification regarding multiple authors
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
// http://www.apache.org/licenses/LICENSE-2.0
// THIS CODE IS PROVIDED *AS IS* BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, EITHER EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION ANY IMPLIED
// WARRANTIES OR CONDITIONS OF TITLE, FITNESS FOR A PARTICULAR PURPOSE,
// MERCHANTABLITY OR NON-INFRINGEMENT.
// See the Apache 2 License for the specific language governing permissions and
// limitations under the License.
namespace kaldi {
/**
\page hmm HMM topology and transition modeling
\section hmm_intro Introduction
In this page we describe how HMM topologies are represented by Kaldi and
how we model and train HMM transitions. We briefly mention how this interacts
with decision trees; decision trees are covered more fully in \ref tree_externals and
\ref tree_internals. For a list of classes and functions in this group, see
\ref hmm_group.
\section hmm_topology HMM topologies
The class HmmTopology is how the user specifies the topology of the phones' HMMs
to the toolkit. In the normal recipe, the scripts write the text form of the
HmmTopology object to a file, which is then given to the command-line programs.
To give some idea of what this object contains, below is the text format for the
HmmTopology object in the "normal" case (the 3-state Bakis model):
\verbatim
<Topology>
<TopologyEntry>
<ForPhones> 1 2 3 4 5 6 7 8 </ForPhones>
<State> 0 <PdfClass> 0
<Transition> 0 0.5
<Transition> 1 0.5
</State>
<State> 1 <PdfClass> 1
<Transition> 1 0.5
<Transition> 2 0.5
</State>
<State> 2 <PdfClass> 2
<Transition> 2 0.5
<Transition> 3 0.5
</State>
<State> 3
</State>
</TopologyEntry>
</Topology>
\endverbatim
There is one TopologyEntry in this particular HmmTopology object, and it covers
phones 1 through 8 (so in this example there are just eight phones and they all
share the same topology). There are three emitting states (i.e. states that
have pdfs associated with them and 'emit' feature vectors); each has a self-loop
and a transition to the next state. There is also a fourth, non-emitting state,
state 3 (there is no \<PdfClass\> entry for it) which has no transitions out of it
(implicitly, it connects to the next phone in the sequence).
This is a standard feature of these topology entries: Kaldi treats the first state
(state zero) as the start state, and the last state, which should always be
nonemitting and have no transitions out of it, has final-probability one. You
can treat the transition probability to the last state as equivalent to the
"final-probability" in an HMM.
All of the emitting states in this particular example can have different pdfs in
them (since the PdfClass numbers are all distinct). We can enforce tying by
making the \<PdfClass\> numbers the same.
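As a rough illustration of the structure described above, the following sketch
builds the 3-state topology in memory and checks the properties we expect of the
final state. The names ProtoState, ProtoHmm, kNoPdfClass, MakeThreeStateTopology()
and IsWellFormed() are hypothetical and exist only for this example; this is not
the HmmTopology class itself.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-ins for illustration; these are not Kaldi's classes.
const int kNoPdfClass = -1;  // marks a nonemitting state

struct ProtoState {
  int pdf_class;  // kNoPdfClass if the state is nonemitting
  std::vector<std::pair<int, double> > transitions;  // (destination, probability)
};

typedef std::vector<ProtoState> ProtoHmm;

// Builds the 3-state left-to-right topology from the text example above.
ProtoHmm MakeThreeStateTopology() {
  ProtoHmm hmm(4);
  for (int s = 0; s < 3; s++) {
    hmm[s].pdf_class = s;
    hmm[s].transitions.push_back(std::make_pair(s, 0.5));      // self-loop
    hmm[s].transitions.push_back(std::make_pair(s + 1, 0.5));  // forward
  }
  hmm[3].pdf_class = kNoPdfClass;  // final state: no pdf, no transitions
  return hmm;
}

// Returns true if the last state is nonemitting with no transitions out of
// it, and every transition points at a valid state index.
bool IsWellFormed(const ProtoHmm &hmm) {
  if (hmm.empty()) return false;
  const ProtoState &last = hmm.back();
  if (last.pdf_class != kNoPdfClass || !last.transitions.empty()) return false;
  for (size_t s = 0; s < hmm.size(); s++) {
    for (size_t t = 0; t < hmm[s].transitions.size(); t++) {
      int dest = hmm[s].transitions[t].first;
      if (dest < 0 || dest >= static_cast<int>(hmm.size())) return false;
    }
  }
  return true;
}
```

Tying two states in this sketch would simply mean giving them the same
pdf_class value.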
The probabilities given in the HmmTopology object are those that are used to
initialize training; the trained probabilities are specific to the
context-dependent HMMs and are stored in the TransitionModel object. The
TransitionModel stores the HmmTopology object as a class member, but be aware
that the transition probabilities in the HmmTopology object are generally not
used after initializing the TransitionModel object. There is an exception to
this, however; for nonemitting states that are non-final (i.e. those that have
transitions out of them but no \<PdfClass\> entry), Kaldi does not train the
transition probabilities and instead it uses the probabilities given in the
HmmTopology object. The decision not to support trainable transition probabilities
for non-emitting states simplifies our training mechanisms, and since it is not
normal to have non-emitting states with transitions, we felt that this was no
great loss.
\section pdf_class Pdf-classes
The pdf-class is a concept that relates to the HmmTopology object. The
HmmTopology object specifies a prototype HMM for each phone. Each
numbered state of a "prototype HMM" has two variables, "forward_pdf_class" and
"self_loop_pdf_class". The "self_loop_pdf_class" is the pdf-class associated
with the self-loop transition. It is by default identical to "forward_pdf_class",
but it can be used to define less-conventional HMM topologies
where the pdfs on the self-loop and forward transitions are different.
The decision to allow the pdf-class on just the self-loop to be different,
while not embracing a fully "arc-based" representation where the pdfs on
all transitions in the HMM are potentially independent, was made as a compromise,
to allow for compatibility with previous versions of Kaldi while supporting the topology
used in our "chain models" AKA lattice-free MMI.
If two states have the same
pdf_class variable, then they will always share the same probability
distribution function (p.d.f.) if they are in the same phonetic context. This
is because the decision-tree code does not get to "see" the HMM-state directly;
it only gets to see the pdf-class. In the normal case the pdf-class is the same
as the HMM state index (e.g. 0, 1 or 2), but pdf-classes provide a way for the
user to enforce sharing. This would mainly be useful if you wanted richer
transition modeling but wanted to leave the acoustic model otherwise the same.
Another function of the pdf-class is to specify nonemitting states. If the
pdf-class for some HMM state is set to the constant \ref kNoPdf = -1, then the
HMM state is nonemitting (it has no associated pdf). This can be achieved
simply by omitting the \<PdfClass\> tag and associated number, in the text form of the object.
The set of pdf-classes for a particular prototype HMM is expected to start
from zero and be contiguous (e.g. 0, 1, 2). This is for the convenience of
the graph-building code, and does not lead to any loss of generality.
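The contiguity requirement can be sketched as a small check. The function
PdfClassesAreContiguous() is a hypothetical helper written for this page (it is
not part of the toolkit), and it assumes that -1 marks a nonemitting state as
described above.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Hypothetical helper, not part of Kaldi. pdf_classes[s] is the pdf-class of
// HMM state s, with -1 standing for a nonemitting state.
bool PdfClassesAreContiguous(const std::vector<int> &pdf_classes) {
  std::set<int> seen;
  for (size_t s = 0; s < pdf_classes.size(); s++) {
    if (pdf_classes[s] < -1) return false;  // not a valid pdf-class
    if (pdf_classes[s] >= 0) seen.insert(pdf_classes[s]);
  }
  // A set of n distinct non-negative integers equals {0, ..., n-1} exactly
  // when it is empty or its largest element is n-1.
  return seen.empty() ||
         *seen.rbegin() == static_cast<int>(seen.size()) - 1;
}
```

Note that repeated values are allowed (this is how tying is expressed); only
gaps in the numbering are rejected.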
\section transition_model Transition models (the TransitionModel object)
The TransitionModel object stores the transition probabilities and information
about HMM topologies (it contains a HmmTopology object).
The graph-building code depends on the TransitionModel object to get
the topology and transition probabilities (it also requires a ContextDependencyInterface
object to get the pdf-ids associated with particular phonetic contexts).
\subsection transition_model_how How we model transition probabilities in Kaldi
The decision that underlies a lot of the transition-modeling code is as follows:
we have decided to make the transition probability of a
context dependent HMM state depend on the following five things (you could view
them as a 5-tuple):
- The phone (whose HMM we are in)
- The source HMM-state (as interpreted by the HmmTopology object, i.e. normally 0, 1 or 2)
- The \ref pdf_id "forward-pdf-id"
(i.e. the index of the pdf associated with the forward transitions out of the state)
- The \ref pdf_id "self-loop-pdf-id"
(i.e. the index of the pdf associated with the self-loop of the state)
- The index of the transition in the HmmTopology object.
The last of these five items could be viewed as encoding the destination
HMM-state in the HmmTopology object.
The reason for making the transition probabilities depend on these things
is that this is the most fine-grained way we could model transitions without
increasing the size of the compiled decoding graphs; it is also quite convenient
for training the transition probabilities. In practice, with conventional setups
it probably does not make any difference to model the transitions as precisely
as this, and the HTK approach of sharing the transitions at the monophone level
would probably be sufficient.
\subsection transition_model_mappings The reason for transition-ids etc.
The TransitionModel object sets up a number of integer mappings when it is
initialized, and is used by other parts of the code to perform these mappings.
Apart from the quantities mentioned above, there are quantities called
transition identifiers (transition-ids), transition indexes (which are not
the same thing as transition-ids), and transition states. The reason we
have introduced these identifiers and the associated mappings is so that we can
use a completely FST-based training recipe. The most "natural" FST-based setups
would have what we call pdf-ids on the input labels. However, given our
tree-building algorithms it will not always be possible to map uniquely from a
pdf-id to a phone. This would make it hard to map from an input-label sequence
to a phone sequence, which is inconvenient for a number of reasons;
it would also make it hard in general to train the transition probabilities using
the information in the FST alone. For this reason we put
identifiers called transition-ids on the input labels of the FST, and these can be mapped
to the pdf-id but also to the phone and to a particular transition in a
prototype HMM (as given in the HmmTopology object).
\subsection transition_model_identifiers Integer identifiers used by TransitionModel
The following types of identifiers are used in the TransitionModel interface. All of
them are represented as type int32. Note that some of these quantities are one-based
indices and some are zero-based. We have tried to avoid one-based indices as
much as possible in the toolkit
because they are not very compatible with C++ array indexing, but because OpenFst
treats zero as a special case (meaning the special symbol epsilon), we have decided
to allow one-based indexing for quantities that frequently appear as input symbols
on FSTs. Most importantly, transition-ids are one-based.
Since we do not imagine \ref pdf_id "pdf-ids" appearing very frequently as
FST labels, and since we often use them as C++ array indexes, we have decided
to make them zero-based; if they appear as FST input symbols (which should be
rare), we add one to them.
When reading the TransitionModel code, be aware that when indexing arrays with
one-based quantities there are cases where we subtract one and some cases where we do not;
this is documented where the member variables are declared. In any case, such code
is not in the public interface so it should not lead to too much confusion.
A complete list of the various integer quantities used in TransitionModel is as follows:
- phone (one-based): this type of identifier is used throughout the toolkit; it
can be converted to a phone name via an OpenFst symbol table. Not necessarily
contiguous (the toolkit allows "skips" in the phone indices).
- hmm-state (zero-based): this is an index into something of type HmmTopology::TopologyEntry.
In the normal case, it is one of {0, 1, 2}.
- pdf, or pdf-id (zero-based): this is the index of the p.d.f., as
originally allocated by the decision-tree clustering; (see \ref pdf_id).
There would normally be several thousand pdf-ids in a system.
- transition-state, or trans_state (one-based): this is an index that is defined
by the TransitionModel itself. Each possible 4-tuple of (phone, hmm-state, forward pdf, self-loop pdf)
maps to a unique transition-state. Think of it as the finest granularity of
HMM-state for which transitions are separately estimated.
- transition-index, or trans_index (zero-based): this is an index into the
"transitions" array of type HmmTopology::HmmState. It numbers the
transitions out of a particular transition-state.
- transition-id, or trans_id (one-based): each of these corresponds to a
unique transition probability in the transition model. There is a mapping
from (transition-state, transition-index) to transition-id, and vice versa.
The transition-modeling code also refers to the following concepts:
- A tuple means a 4-tuple (phone, hmm-state, forward pdf, self-loop pdf) which is mappable to and from a transition-state.
- A pair means a pair (transition-state, transition-index) which is mappable to and from a transition-id.
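To make the relationship between pairs and transition-ids concrete, here is a
minimal sketch of one way such a two-way mapping could be laid out. The class
TransitionIdMap and its members are hypothetical and are not the TransitionModel
interface; the only facts taken from the text are that transition-states and
transition-ids are one-based and transition-indexes are zero-based.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical sketch, not Kaldi's implementation. num_transitions[i] is the
// number of transitions out of transition-state i+1 (transition-states are
// one-based, so they are stored at offset -1 in this zero-based vector).
class TransitionIdMap {
 public:
  explicit TransitionIdMap(const std::vector<int> &num_transitions) {
    int next_id = 1;  // transition-ids are one-based
    for (size_t i = 0; i < num_transitions.size(); i++) {
      first_id_.push_back(next_id);
      next_id += num_transitions[i];
    }
    first_id_.push_back(next_id);  // sentinel: one past the last id
    id_to_state_.assign(next_id, 0);  // index 0 is unused
    for (size_t i = 0; i < num_transitions.size(); i++)
      for (int id = first_id_[i]; id < first_id_[i + 1]; id++)
        id_to_state_[id] = static_cast<int>(i) + 1;
  }

  int NumTransitionIds() const { return first_id_.back() - 1; }

  // (transition-state, transition-index) -> transition-id.
  int PairToTransitionId(int trans_state, int trans_index) const {
    assert(trans_state >= 1 &&
           trans_state < static_cast<int>(first_id_.size()));
    int num = first_id_[trans_state] - first_id_[trans_state - 1];
    assert(trans_index >= 0 && trans_index < num);
    return first_id_[trans_state - 1] + trans_index;
  }

  // transition-id -> (transition-state, transition-index).
  std::pair<int, int> TransitionIdToPair(int trans_id) const {
    assert(trans_id >= 1 &&
           trans_id < static_cast<int>(id_to_state_.size()));
    int trans_state = id_to_state_[trans_id];
    return std::make_pair(trans_state,
                          trans_id - first_id_[trans_state - 1]);
  }

 private:
  std::vector<int> first_id_;     // first transition-id of each state
  std::vector<int> id_to_state_;  // transition-id -> transition-state
};
```

With three transition-states that each have two transitions, this sketch would
assign transition-ids 1 through 6, and the pair (2, 1) would map to
transition-id 4.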
\section hmm_transition_training Training the transition model
The training procedure for the transition model is very simple. The FSTs that
we create (for both training and test) have transition-ids as their input
labels. In training we do a Viterbi decoding that gives us the input-label
sequence, which is a sequence of transition-ids (one for each feature vector).
The statistics we accumulate for training transitions are essentially counts of
how many times each transition-id was seen in training (the code itself uses
floating-point occupation counts but these are just integers in our normal
training setup). The function TransitionModel::MleUpdate() does the ML update for each
transition-state. This works in the "obvious" way. There are also some minor
issues related to probability flooring and what to do if a particular
transition-state is unseen; for these details, see the code.
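For readers who want a picture of the "obvious" update, the following is a
minimal sketch for a single transition-state. It is not the toolkit's code: the
function name UpdateTransitionProbs(), the floor value of 0.01, and the choice
to keep the old probabilities for an unseen transition-state are all assumptions
made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch of a maximum-likelihood update for one transition-state.
// counts[k] is the occupation count of the k-th transition out of the state.
// The floor and the handling of unseen states are assumptions of this sketch.
std::vector<double> UpdateTransitionProbs(const std::vector<double> &counts,
                                          const std::vector<double> &old_probs,
                                          double floor = 0.01) {
  assert(counts.size() == old_probs.size() && !counts.empty());
  double total = 0.0;
  for (size_t k = 0; k < counts.size(); k++) total += counts[k];
  if (total <= 0.0) return old_probs;  // unseen state: leave unchanged

  std::vector<double> probs(counts.size());
  double sum = 0.0;
  for (size_t k = 0; k < counts.size(); k++) {
    probs[k] = counts[k] / total;            // relative frequency
    if (probs[k] < floor) probs[k] = floor;  // probability flooring
    sum += probs[k];
  }
  for (size_t k = 0; k < probs.size(); k++) probs[k] /= sum;  // renormalize
  return probs;
}
```

For example, counts of 30 and 10 for the two transitions out of a state give
probabilities 0.75 and 0.25 in this sketch.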
\section hmm_alignment Alignments in Kaldi
At this point we introduce the concept of an alignment. By "alignment", we
generally mean something of type vector<int32>, which contains a sequence
of transition-ids (cf. \ref transition_model_identifiers) whose length is
the number of frames in the utterance the alignment corresponds to. This sequence of transition-ids
would generally be obtained from the decoder as the input-label sequence.
Alignments are used at training time for Viterbi training, and at test time
for adaptation. Because transition-ids encode
the phone information, it is possible to work out the phonetic sequence from
an alignment (cf. SplitToPhones() and ali-to-phones.cc).
We often need to deal with collections of alignments indexed by utterance.
To do this conveniently, we read and write alignments with tables;
see \ref table_examples_ali for more information.
The function \ref ConvertAlignment() (cf. the command-line program
\ref convert-ali.cc "convert-ali") converts alignments from one transition-model to
another. The typical case is where you have alignments created using one
transition-model (created from a particular decision tree) and want to convert
them to be valid for another transition model with a different tree. It
optionally takes a mapping from the original phones to a new phone set; this
feature is not normally needed but we have used it when dealing with simplified models
based on a reduced (clustered) phone set.
Programs that read in alignments generally have the suffix "-ali".
\section hmm_post State-level posteriors
State-level posteriors are an extension of the "alignment" concept (previous section):
instead of having a single transition-id per frame we have an arbitrary
number of transition-ids per frame, and each one has a weight. They are stored in
the following type of structure:
\verbatim
typedef std::vector<std::vector<std::pair<int32, BaseFloat> > > Posterior;
\endverbatim
where if we have an object "post" of type Posterior, post.size() will be equal
to the length of the utterance in frames, and post[i] is a list of pairs (stored
as a vector), where each pair consists of a (transition-id, posterior).
In the current programs, there are only two ways to create posteriors: either
- By converting alignments to posteriors using the program ali-to-post, which
gives us a rather trivial Posterior object where each frame has a single transition-id
with unit posterior
- By modifying posteriors using the program weight-silence-post, which is usually
used to weight down the posteriors corresponding to silence phones.
In future, when lattice generation is added, we will add utilities to create posteriors
from lattices.
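The two operations listed above can be sketched as follows. This is an
illustration only and not the source of ali-to-post or weight-silence-post: the
function names AlignmentToPosteriorSketch() and ScaleSilencePosteriors() are
hypothetical, the typedefs are local stand-ins so that the example compiles on
its own, and the is_silence lookup (indexed by transition-id) is an assumed
input that a real program would derive from the transition model.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Local stand-ins for Kaldi's typedefs, so that this sketch compiles alone.
typedef int int32;
typedef float BaseFloat;
typedef std::vector<std::vector<std::pair<int32, BaseFloat> > > Posterior;

// Converts an alignment into a trivial Posterior: one transition-id per
// frame, each with unit posterior.
Posterior AlignmentToPosteriorSketch(const std::vector<int32> &alignment) {
  Posterior post(alignment.size());
  for (size_t i = 0; i < alignment.size(); i++)
    post[i].push_back(std::make_pair(alignment[i], 1.0f));
  return post;
}

// Multiplies by "scale" the posterior of every transition-id tid for which
// is_silence[tid] is true. is_silence is indexed by the one-based
// transition-id, so element 0 is unused (an assumption of this sketch).
void ScaleSilencePosteriors(const std::vector<bool> &is_silence,
                            BaseFloat scale, Posterior *post) {
  for (size_t i = 0; i < post->size(); i++) {
    for (size_t j = 0; j < (*post)[i].size(); j++) {
      int32 tid = (*post)[i][j].first;
      assert(tid >= 1 && tid < static_cast<int32>(is_silence.size()));
      if (is_silence[tid]) (*post)[i][j].second *= scale;
    }
  }
}
```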
Programs that read in posteriors don't have a suffix comparable to "-ali", which is the
suffix for programs that read in alignments. This is for brevity; reading in
state-level posteriors is considered the "default" behavior of programs that
need this type of alignment information.
\section hmm_gpost Gaussian-level posteriors
A set of Gaussian-level posteriors for an utterance may be stored using the
following typedef:
\verbatim
typedef std::vector<std::vector<std::pair<int32, Vector<BaseFloat> > > > GauPost;
\endverbatim
This is like the Posterior structure, except the floating-point value (which represents
the posterior of the state) is now a vector of floating-point values, one for
each Gaussian in the state. The size of the vector would be the same as the
number of Gaussians in the pdf corresponding to the transition-id which is the
first element of the pair.
The program post-to-gpost converts Posterior structures into GauPost structures; it
uses the model and the features to compute the Gaussian-level posteriors. This
is mainly useful in situations where we may need to compute Gaussian-level posteriors
with a different model or features than the ones we need to accumulate statistics
with. Programs that read in Gaussian-level posteriors have the suffix "-gpost".
\section hmm_graph Functions for converting HMMs to FSTs
A complete list of functions and classes involved in converting HMMs to FSTs may
be found \ref hmm_group_graph "here".
\subsection hmm_graph_get_h_transducer GetHTransducer()
The most important one is the function
GetHTransducer(), declared as follows:
\code
fst::VectorFst<fst::StdArc>*
GetHTransducer (const std::vector<std::vector<int32> > &ilabel_info,
                const ContextDependencyInterface &ctx_dep,
                const TransitionModel &trans_model,
                const HTransducerConfig &config,
                std::vector<int32> *disambig_syms_left);
\endcode
There are aspects of this function which will be hard to understand
without having first understood the \ref tree_ilabel "ilabel_info" object,
the \ref tree_ctxdep "ContextDependencyInterface" interface, and at
least the basics of how FSTs are used in speech recognition. This function
returns an FST whose input labels are \ref transition_model_identifiers "transition-ids"
and whose output labels represent context-dependent phones (they are
indices into the \ref tree_ilabel "ilabel_info" object). The FST it
returns has a state that is both initial and final, and all the transitions out
of it have output-labels (for efficient composition with CLG). Each transition
out of it will typically enter a structure representing a 3-state HMM, and
then loop back to the initial state. The FST returned by
GetHTransducer() will only be valid for the phonetic contexts represented
in ilabel_info, which the caller can specify.
This is useful because for wide-context systems there can
be a large number of contexts, most of which are never used. The ilabel_info
object can be obtained from the ContextFst object (which represents C) after
composing it with something, and it contains just the contexts that have
been used. We would then provide this same ilabel_info object to
GetHTransducer() to get an H transducer that covers everything we need.
Note that the GetHTransducer() function does not include the self-loops. These
must be added later by the function AddSelfLoops(); it is normally convenient
to only add the self-loops after all stages of decoding-graph optimization.
\subsection hmm_graph_config The HTransducerConfig configuration class
The HTransducerConfig configuration class controls the behavior of
GetHTransducer.
- The variable \ref HTransducerConfig::trans_prob_scale
"trans_prob_scale" is the transition probability scale. When transition
probabilities are included in the graph, they are included with this scale.
As a command-line option this is called --transition-scale.
See \ref hmm_scale for a discussion of the appropriate scale to use.
\subsection hmm_graph_get_hmm_as_fst The function GetHmmAsFst()
The function GetHmmAsFst() takes a phonetic context window and returns
the corresponding finite state acceptor with transition-ids as the symbols.
This is used in GetHTransducer(). A function GetHmmAsFstSimple() that
takes fewer options is also provided as a form of documentation,
in order to show in principle how the process works.
\subsection hmm_graph_add_self_loops AddSelfLoops()
The AddSelfLoops() function adds self-loops to a graph that has been
created without self-loops. A typical setup is to create the H transducer
without self-loops, compose with CLG, do determinization and minimization,
and then add the self-loops. This enables more efficient determinization and
minimization. The AddSelfLoops() function has the option to reorder the
transitions; see below \ref hmm_reorder for more details on this. It also
takes a transition-probability scale, "self_loop_scale", which does not
have to be the same as the normal transition-probability scale; for more
on this, see below \ref hmm_scale.
\subsection hmm_graph_add_transition_probs Adding transition probabilities to FSTs
The AddTransitionProbs() function adds transition probabilities to an FST.
The reason this is useful is so that graphs can be created without transition
probabilities on them (i.e. without the component of the weights that arises
from the HMM transitions), and these can be added in later; this makes it
possible to use the same graph on different iterations of training the
model, and keep the transition-probabilities in the graph up to date.
Creating the graph without transition-probabilities is accomplished by
using a zero value for trans_prob_scale (command-line option: --transition-scale).
At training time, our scripts tend to store the
graphs on disk without the transition probabilities, and then each time we
realign we add in the currently valid transition probabilities.
\section hmm_reorder Reordering transitions
The AddSelfLoops() function takes a boolean option "reorder" which
tells it to reorder transition probabilities so that the self-loop comes after
the transition out of the state. Where applicable this becomes a
boolean command-line option, e.g. you can do --reorder=true to enable
reordering during graph creation. This option makes the "simple" and
"faster" decoders more efficient (see \ref decoders), although it is
not compatible with the "kaldi" decoder.
The idea of reordering is that we switch the order of the self-loop arcs
with all the other arcs that come out of a state, so the self-loop is
located at the destination state of each of the other arcs. For this
to work, we have to ensure that the FST has
certain properties, namely that all the arcs into a particular state must
induce the same self-loop (also, a state with a self-loop cannot have
input arcs with epsilon inputs, or be the start state). The AddSelfLoops()
function modifies the graphs to ensure that they have this property. A similar
property is required even if the "reorder" option is set to false.
The graphs created with the "reorder" option are exactly equivalent to the
non-reordered graphs in terms
of the acoustic and transition-model probabilities you get when decoding
an utterance. The transition-ids on the resulting alignment are in a different
order, but this does not matter given the ways that we make use of these
alignments.
\section hmm_scale Scaling of transition and acoustic probabilities
There are three types of scaling that can be applied in Kaldi:
<table border="1">
<tr>
<td> Name in code</td> <td> Name in command-line arguments</td> <td> Example value (train) </td> <td> Example value (test) </td>
</tr>
<tr>
<td> acoustic_scale </td> <td> --acoustic-scale=? </td> <td> 0.1 </td> <td> 0.08333 </td>
</tr>
<tr>
<td> self_loop_scale </td> <td> --self-loop-scale=? </td> <td> 0.1 </td> <td> 0.1 </td>
</tr>
<tr>
<td> transition_scale </td> <td> --transition-scale=? </td> <td> 1.0 </td> <td> 1.0 </td>
</tr>
</table>
You may notice that there is no language model scale on this list; everything is
scaled relative to the language model. Also we don't support a word insertion
penalty, in general (although the "kaldi" decoder does support this).
The idea is that the language model represents "real"
probabilities so it makes sense to scale everything else relative to them.
The scales during training time are the scales we use in decoding to get Viterbi
alignments. In general, we use a figure of 0.1 whenever a parameter is not too
critical and is expected to be small. The acoustic scale used during
test is quite critical and is typically tuned to the task.
We now explain what these three scales do:
- The acoustic scale is the scale applied to the acoustics (i.e. to the log-likelihood
of a frame given an acoustic state).
- The transition scale is the scale on the transition probabilities, but this only
applies to HMM states that have multiple transitions out of them; it applies to the
relative weight between such transitions. It does not have any effect for typical
topologies.
- The self-loop scale is the scale that we apply to the self-loops. More specifically,
when we add the self-loop, let the probability mass given to the self-loop be p
and the mass given to the rest be (1-p). We add a self-loop with log-probability
self_loop_scale * log(p), and add (self_loop_scale * log(1-p)) to all the other
log transition probabilities out of that state. (Note: in the initial stage of
graph creation we create a graph without self-loops, and with the non-self-loop
transition probabilities renormalized to sum to one). In typical topologies, the
self-loop scale is the only scale that matters.
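The arithmetic in the last item can be written out as a short sketch. This is
not the code of AddSelfLoops(); the struct ScaledLogProbs and the function
ApplySelfLoopScale() are hypothetical names, and other_log_probs is assumed to
hold whatever log transition probabilities are already on the non-self-loop
arcs out of the state.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical sketch of the self-loop scaling rule described above.
struct ScaledLogProbs {
  double self_loop;            // self_loop_scale * log(p)
  std::vector<double> others;  // original value + self_loop_scale * log(1 - p)
};

// p is the probability mass given to the self-loop; the remaining arcs out
// of the state share the mass (1 - p).
ScaledLogProbs ApplySelfLoopScale(double p,
                                  const std::vector<double> &other_log_probs,
                                  double self_loop_scale) {
  assert(p > 0.0 && p < 1.0);
  ScaledLogProbs out;
  out.self_loop = self_loop_scale * std::log(p);
  const double offset = self_loop_scale * std::log(1.0 - p);
  for (size_t k = 0; k < other_log_probs.size(); k++)
    out.others.push_back(other_log_probs[k] + offset);
  return out;
}
```

With a self-loop scale of 1.0 the self-loop keeps its full probability p; with
the example value of 0.1 from the table, both the self-loop and the leaving
arcs are penalized far less in the log domain.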
The reason we feel it might make sense to apply a different probability scale to
the self-loops versus the normal transition scale is that we think they could be
dictating the probabilities of events at different timescales. A slightly more
subtle argument is the following. All the transition probabilities can be
regarded as "real" probabilities (comparable to LM probabilities), because the
problem of correlation between acoustic probabilities does not occur for
transitions. However, a problem arises because we use the Viterbi algorithm in
testing (and in our case, in training too). The transition probabilities would
only represent real probabilities when summed over, as in the forward-backward
algorithm. We expect this to be more of an issue for the self-loops than for
probabilities that dictate the weight to give entirely different paths through
the HMM, as in the latter case the acoustic distributions will often be quite
disjoint, and the difference between forward-backward and Viterbi will be
small.
*/
/**
\defgroup hmm_group Classes and functions related to HMM topology and transition modeling
*/
}