// doc/transform.dox
// Copyright 2009-2011 Microsoft Corporation
// See ../../COPYING for clarification regarding multiple authors
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
// http://www.apache.org/licenses/LICENSE-2.0
// THIS CODE IS PROVIDED *AS IS* BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, EITHER EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION ANY IMPLIED
// WARRANTIES OR CONDITIONS OF TITLE, FITNESS FOR A PARTICULAR PURPOSE,
// MERCHANTABLITY OR NON-INFRINGEMENT.
// See the Apache 2 License for the specific language governing permissions and
// limitations under the License.
namespace kaldi {
/**
\page transform Feature and model-space transforms in Kaldi
\section transform_intro Introduction
Kaldi code currently supports a number of feature and model-space transformations
and projections. Feature-space transforms and projections are treated in a consistent
way by the tools (they are essentially just matrices), and the following sections
relate to the commonalities:
- \ref transform_apply
- \ref transform_perspk
- \ref transform_utt2spk
- \ref transform_compose
- \ref transform_weight
Transforms, projections and other feature operations that are typically not speaker specific include:
- \ref transform_lda
- \ref transform_splice and \ref transform_delta
- \ref transform_hlda
- \ref transform_mllt
Global transforms that are typically applied in a speaker adaptive way are:
- \ref transform_cmllr_global
- \ref transform_lvtln
- \ref transform_et
- \ref transform_cmvn
We next discuss regression class trees and transforms that use them:
- \ref transform_regtree
\section transform_apply Applying global linear or affine feature transforms
In the case of feature-space transforms and projections that are global,
and not associated with classes (e.g. speech/silence or regression classes), we
represent them as matrices. A linear transform or
projection is represented as a matrix by which we will left-multiply a feature vector,
so the transformed feature is \f$ A x \f$. An affine transform or projection
is represented the same way, but we imagine a 1 has been appended to the
feature vector, so the transformed feature is
\f$ W \left[ \begin{array}{c} x \\ 1 \end{array} \right] \f$ where
\f$ W = \left[ A ; b \right] \f$, with A and b being the linear transform
and the constant offset.
Note that this convention differs from some of the literature, where the 1 may appear as
the first dimension rather than the last.
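For example, with a two-dimensional feature vector, an affine transform would be
a 2 x 3 matrix acting as follows:
\f[
  \mathbf{W} \left[ \begin{array}{c} x_1 \\ x_2 \\ 1 \end{array} \right]
   = \left[ \begin{array}{ccc} a_{11} & a_{12} & b_1 \\ a_{21} & a_{22} & b_2 \end{array} \right]
     \left[ \begin{array}{c} x_1 \\ x_2 \\ 1 \end{array} \right]
   = \mathbf{A} \mathbf{x} + \mathbf{b} .
\f]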
Global transforms and projections are generally written
as a type Matrix<BaseFloat> to a single file, and speaker or utterance-specific
transforms or projections are stored in a table of such matrices (see \ref io_sec_tables)
indexed by speaker-id or utterance-id.
Transforms may be applied to features
using the program transform-feats. Its syntax is
\verbatim
transform-feats <transform> <input-feats> <output-feats>
\endverbatim
where <input-feats> is an rspecifier, <output-feats> is a wspecifier, and <transform>
may be an rxfilename or an rspecifier (see \ref io_sec_specifiers and \ref io_sec_xfilename).
The program works out whether the transform is linear or affine based on whether
the matrix's number of columns equals the feature dimension (linear) or the
feature dimension plus one (affine).
This program is typically used as part of a pipe.
A typical example is:
\verbatim
feats="ark:splice-feats scp:data/train.scp ark:- |
transform-feats $dir/0.mat ark:- ark:-|"
some-program some-args "$feats" some-other-args ...
\endverbatim
Here, the file 0.mat contains a single matrix. An example of applying
speaker-specific transforms is:
\verbatim
feats="ark:add-deltas scp:data/train.scp ark:- |
transform-feats --utt2spk=ark:data/train.utt2spk ark:$dir/0.trans ark:- ark:-|"
some-program some-args "$feats" some-other-args ...
\endverbatim
A per-utterance example would be as above, but with the --utt2spk option removed.
In this example, the archive file 0.trans would contain transforms (e.g. CMLLR transforms)
indexed by speaker-id, and the file data/train.utt2spk would have
lines of the form "utt-id spk-id" (see next section for more explanation).
The program transform-feats does not care how the transformation matrix was
estimated; it just applies it to the
features. After it has been through all the features it prints out the average
per-frame log determinant. This can be useful when comparing objective functions
(this log determinant would have to be added to the per-frame likelihood printed
out by programs like gmm-align, gmm-acc-stats, or gmm-decode-kaldi). If the
linear part A of the transformation (i.e. ignoring the offset term) is not square,
then the program will instead print out the per-frame average of
\f$ \frac{1}{2} \mathbf{logdet} (A A^T) \f$. It refers to this as the pseudo-log-determinant.
This is useful in checking convergence of MLLT estimation where the transformation matrix
being applied is the MLLT matrix times an LDA matrix.
\section transform_perspk Speaker-independent versus per-speaker versus per-utterance adaptation
Programs that estimate transforms are generally set up to do a particular kind of
adaptation, i.e. either speaker-independent or speaker- or utterance-specific adaptation. For example, LDA
and MLLT/STC transforms are speaker-independent but fMLLR transforms are speaker- or
utterance-specific. Programs that estimate speaker- or utterance-specific transforms
will work in per-utterance mode by default, but in per-speaker mode if the --spk2utt
option is supplied (see below).
One program that can accept either speaker-independent or speaker- or utterance-specific
transforms is transform-feats. This program detects whether the first argument (the transform)
is an rxfilename (see \ref io_sec_xfilename)
or an rspecifier (see \ref io_sec_specifiers). If the former, it treats it as a speaker-independent
transform (e.g. a file containing a single matrix).
If the latter, there are two choices. If no --utt2spk option is provided,
it treats the transform as a table of matrices indexed by utterance id. If an --utt2spk option is provided
(utt2spk is a table of strings indexed by utterance that contains the string-valued speaker id),
then the transforms are assumed to be indexed by speaker id, and the table
provided to the --utt2spk option is used to map each utterance to a speaker id.
\section transform_utt2spk Utterance-to-speaker and speaker-to-utterance maps
At this point we give a general overview of the --utt2spk and --spk2utt options.
These options are accepted by programs that deal with transformations; they are used when
you are doing per-speaker (as opposed to per-utterance) adaptation.
Typically programs that process already-created transforms will need the --utt2spk
option and programs that create the transforms will need the --spk2utt option.
A typical case is that there will be a file called some-directory/utt2spk
that looks like:
\verbatim
spk1utt1 spk1
spk1utt2 spk1
spk2utt1 spk2
spk2utt2 spk2
...
\endverbatim
where these strings are just examples; they stand for generic speaker and
utterance identifiers. There will also be a file called some-directory/spk2utt that looks like:
\verbatim
spk1 spk1utt1 spk1utt2
spk2 spk2utt1 spk2utt2
...
\endverbatim
and you will supply options that look like --utt2spk=ark:some-directory/utt2spk
or --spk2utt=ark:some-directory/spk2utt. The 'ark:' prefix is necessary because
these files are given as rspecifiers by the Table code, and are interpreted as archives
that contain strings (or vectors of strings, in the spk2utt case). Note that
the utt2spk archive is generally accessed in a random-access manner, so if you
are processing subsets of data it is safe to provide the whole file, but the
spk2utt archive is accessed in a sequential manner so if you are using subsets
of data you would have to split up the spk2utt archive.
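If you have only the utt2spk file, the spk2utt file can be generated from it.
Below is a minimal sketch using awk; it assumes the utt2spk file is sorted on
the speaker-id column:
\verbatim
# Collapse lines of the form "utt-id spk-id" into lines "spk-id utt-id1 utt-id2 ...".
awk '{ if ($2 != prev) { if (NR > 1) printf("\n"); printf("%s", $2); prev = $2; }
       printf(" %s", $1); }
     END { if (NR > 0) printf("\n"); }' \
    some-directory/utt2spk > some-directory/spk2utt
\endverbatim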
Programs that accept the spk2utt option will normally iterate over the
speaker-ids in the spk2utt file, and for each speaker-id they will iterate over
the utterances for each speaker, accumulating statistics for each utterance. Access to the feature
files will then be in random-access mode, rather than the normal sequential
access. This requires some care to set up, because feature files are quite large
and fully-processed features are normally read from an archive, which does not
allow the most memory-efficient random access unless carefully configured. To avoid memory
bloat when accessing the feature files in this case, it may be advisable to
ensure that all archives are sorted on utterance-id, that the utterances in the
file given to the --spk2utt option appear in sorted order, and that the
appropriate options are given on the rspecifiers that specify the feature input
to such programs (e.g. "ark,s,cs:-" if it is the standard input). See \ref io_sec_bloat
for more discussion of this issue.
\section transform_compose Composing transforms
Another program that accepts generic transforms is the program compose-transforms.
The general syntax is "compose-transforms a b c", and it performs the multiplication
c = a b (although this involves a little more than matrix multiplication if a is affine).
An example modified from a script is as follows:
\verbatim
feats="ark:splice-feats scp:data/train.scp ark:- |
transform-feats
\"ark:compose-transforms ark:1.trans 0.mat ark:- |\"
ark:- ark:- |"
some-program some-args "$feats" ...
\endverbatim
This example also illustrates using two levels of commands invoked from a program.
Here, 0.mat is a global matrix (e.g. LDA) and 1.trans is a set of fMLLR/CMLLR matrices
indexed by utterance id. The program compose-transforms composes the transforms
together. The same features could be computed more simply, but less efficiently, as follows:
\verbatim
feats="ark:splice-feats scp:data/train.scp ark:- |
transform-feats 0.mat ark:- ark:- |
transform-feats ark:1.trans ark:- ark:- |"
...
\endverbatim
In general, the transforms a and b that are the inputs to compose-transforms
may be either speaker-independent transforms or speaker- or utterance-specific
transforms. If a is utterance-specific and b is speaker-specific then you have to supply
the --utt2spk option. However, the combination of a being speaker-specific and b being utterance-specific
(which does not make much sense) is not supported. The output of compose-transforms
will be a table if either a or b are tables. The three arguments a, b and c may all
represent either tables or normal files (i.e. either {r,w}specifiers or {r,w}xfilenames),
subject to consistency requirements.
If a is an affine transform, in order to perform the composition correctly, compose-transforms
needs to know whether b is affine or linear (it does not know this because it does not have access
to the dimension of the features
that are transformed by b). This is controlled by the option --b-is-affine (bool, default false).
If b is affine but you forget to set this option and a is affine, compose-transforms
will treat b as a linear transform whose input dimension is the real feature dimension
plus one, and will output a transform whose input dimension is the real feature
dimension plus two. There is no way for transform-feats to interpret such a matrix
when applying it to features, so the error should become obvious as a dimension
mismatch at that point.
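For example, if 1.trans contained affine transforms indexed by utterance and 0.mat
were a global affine transform (the file names here are illustrative), a correct
invocation would look like:
\verbatim
compose-transforms --b-is-affine=true ark:1.trans 0.mat ark:2.trans
\endverbatim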
\section transform_weight Silence weighting when estimating transforms
Eliminating silence frames can be helpful when estimating speaker adaptive
transforms such as CMLLR. This even appears to be true when using
a multi-class approach with a regression tree (for which, see \ref transform_regtree).
The way we implement this is by weighting down the posteriors associated with
silence phones. This takes place as a modification to the \ref hmm_post
"state-level posteriors". An extract of a bash shell script that does this
is below (this script is discussed in more detail in \ref transform_cmllr_global):
\verbatim
ali-to-post ark:$srcdir/test.ali ark:- | \
weight-silence-post 0.0 $silphones $model ark:- ark:- | \
gmm-est-fmllr --fmllr-min-count=$mincount \
--spk2utt=ark:data/test.spk2utt $model "$sifeats" \
ark,o:- ark:$dir/test.fmllr 2>$dir/fmllr.log
\endverbatim
Here, the shell variable "silphones" would be set to a colon-separated
list of the integer id's of the silence phones.
\section transform_lda Linear Discriminant Analysis (LDA) transforms
Kaldi supports LDA estimation via class LdaEstimate. This class does not interact
directly with any particular type of model; it needs to be initialized with the
number of classes, and the accumulation function is declared as:
\verbatim
class LdaEstimate {
...
void Accumulate(const VectorBase<BaseFloat> &data, int32 class_id,
BaseFloat weight=1.0);
};
\endverbatim
The program acc-lda accumulates LDA statistics using the acoustic states (i.e. pdf-ids) as the
classes. It requires the transition model in order to map the alignments (expressed in terms
of transition-ids) to pdf-ids. However, it is not limited to a particular type of acoustic model.
The program est-lda does the LDA estimation (it reads in the statistics from acc-lda). The features you get from the transform will
have unit variance, but not necessarily zero mean. The program est-lda outputs the LDA transformation matrix,
and using the option --write-full-matrix you can write out the full matrix without dimensionality
reduction (its first rows will be equivalent to the LDA projection matrix). This can be useful
when using LDA as an initialization for HLDA.
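A sketch of how these programs are typically invoked, following the patterns used
elsewhere on this page (the file names, and the choice of 40 for the reduced
dimension, are illustrative):
\verbatim
# Convert alignments to posteriors, accumulate LDA statistics, then estimate
# the projection matrix.
ali-to-post ark:$dir/0.ali ark:- | \
  acc-lda $dir/0.mdl "$splicedfeats" ark:- $dir/lda.acc
est-lda --dim=40 $dir/0.mat $dir/lda.acc
\endverbatim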
\section transform_splice Frame splicing
Frame splicing (e.g. splicing nine consecutive frames together) is typically done
to the raw MFCC features prior to LDA. The program splice-feats does this. A typical
line from a script that uses this is the following:
\verbatim
feats="ark:splice-feats scp:data/train.scp ark:- |
transform-feats $dir/0.mat ark:- ark:-|"
\endverbatim
and the "feats" variable would later be used as an rspecifier (c.f. \ref io_sec_specifiers)
by some program that needs to read features. In this example we don't specify the number of frames to splice
together because we are using the defaults (--left-context=4, --right-context=4, or
9 frames in total).
\section transform_delta Delta feature computation
Computation of delta features is done by the program add-deltas, which uses the
function ComputeDeltas. The delta feature computation has the same default setup
as HTK's, i.e. to compute the first delta feature we multiply the features
by a sliding window of values [ -2, -1, 0, 1, 2 ], and then normalize by
dividing by (2^2 + 1^2 + 0^2 + 1^2 + 2^2 = 10). The second delta feature
is computed by applying the same approach to the first delta feature. The
number of frames of context on each side is controlled by --delta-window (default: 2)
and the number of delta features to add is controlled by --delta-order (default: 2).
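Written out, the first delta feature with the default window is:
\f[
  \Delta \mathbf{x}(t) = \frac{ \sum_{k=-2}^{2} k \, \mathbf{x}(t+k) }{ \sum_{k=-2}^{2} k^2 }
   = \frac{ -2 \mathbf{x}(t-2) - \mathbf{x}(t-1) + \mathbf{x}(t+1) + 2 \mathbf{x}(t+2) }{ 10 } .
\f]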
A typical script line that uses this is:
\verbatim
feats="ark:add-deltas --print-args=false scp:data/train.scp ark:- |"
\endverbatim
\section transform_hlda Heteroscedastic Linear Discriminant Analysis (HLDA)
HLDA is a dimension-reducing linear feature projection, estimated using
Maximum Likelihood, where "rejected" dimensions are modeled using a global
mean and variance, and "accepted" dimensions are modeled with a particular
model whose means and variances are estimated via Maximum Likelihood.
The form of HLDA
that is currently integrated with the tools is as implemented in
HldaAccsDiagGmm. It estimates HLDA for GMMs, using a relatively compact
form of the statistics. The classes correspond to the Gaussians in the model.
Since it does not use a standard estimation method, we will explain the idea
here. Firstly, because of memory limitations we do not want to store
the largest form of HLDA statistics, which is mean and full-covariance statistics
for each class. We observe that if during the HLDA update phase we leave
the variances fixed, then the problem of HLDA estimation reduces to MLLT
(or global STC) estimation. See "Semi-tied Covariance Matrices for Hidden
Markov Models", by Mark Gales, IEEE Transactions on Speech and Audio Processing,
vol. 7, 1999, pages 272-281, e.g. Equations (22) and (23). The statistics
that are written \f$ \mathbf{G}^{(ri)} \f$ there are also used here, but in the HLDA
case they need to be defined
slightly differently for the accepted and rejected dimensions.
Suppose the original feature dimension is D and the
reduced feature dimension is K.
Let us forget the iteration superscript r, and use subscript j for state and
m for Gaussian mixture.
For accepted dimensions (\f$0 \leq i < K\f$), the statistics are:
\f[
\mathbf{G}^{(i)} = \sum_{t,j,m} \gamma_{jm}(t) \frac{1}{ \sigma^2_{jm}(i) } (\mu_{jm} - \mathbf{x}(t)) (\mu_{jm} - \mathbf{x}(t))^T
\f]
where \f$\mu_{jm} \in \Re^{D}\f$ is the Gaussian mean in the original D-dimensional space,
and \f$\mathbf{x}(t)\f$ is the feature in the original D-dimensional space, but
\f$\sigma^2_{jm}(i)\f$ is the i'th dimension of the variance within the K-dimensional model.
For rejected dimensions (\f$ K \leq i < D\f$), we use a unit variance Gaussian, and
the statistics are as follows:
\f[
\mathbf{G}^{(i)} = \sum_{t,j,m} \gamma_{jm}(t) (\mu - \mathbf{x}(t)) (\mu - \mathbf{x}(t))^T ,
\f]
where \f$\mu\f$ is the global feature mean in the original D-dimensional space. Once we have
these statistics, HLDA estimation is the same as MLLT/STC estimation in dimension D.
Note here that all the \f$\mathbf{G}\f$ statistics for rejected dimensions are the
same, so in the code we only store statistics for K+1 rather than D dimensions.
Also, it is convenient for the program that accumulates the statistics to only have
access to the K-dimensional model, so during HLDA accumulation we accumulate
statistics sufficient to estimate the D-dimensional means \f$\mu_{jm}\f$, and instead of
G we accumulate the following statistics: for accepted dimensions (\f$0 \leq i < K\f$),
\f[
\mathbf{S}^{(i)} = \sum_{t,j,m} \gamma_{jm}(t) \frac{1}{ \sigma^2_{jm}(i) } \mathbf{x}(t) \mathbf{x}(t)^T
\f]
and for rejected dimensions \f$K \leq i < D\f$
\f[
\mathbf{S}^{(i)} = \sum_{t,j,m} \gamma_{jm}(t) \mathbf{x}(t) \mathbf{x}(t)^T ,
\f]
and of course we only need to store one of these (e.g. for i = K) because they are all the same.
Then at update time we can compute the G statistics for \f$0 \leq i < K\f$ as:
\f[
\mathbf{G}^{(i)} = \mathbf{S}^{(i)} - \sum_{j,m} \gamma_{jm} \mu_{jm} \mu_{jm}^T ,
\f]
and for \f$K \leq i < D\f$,
\f[
\mathbf{G}^{(i)} = \mathbf{S}^{(i)} - \beta \mu \mu^T,
\f]
where \f$ \beta = \sum_{j,m} \gamma_{jm} \f$ is the total count and \f$\mu = \frac{1}{\beta} \sum_{j,m} \gamma_{jm} \mu_{jm}\f$
is the global feature mean. After computing the transform from the G statistics using the same computation as MLLT,
we output the transform, and we also use the first K rows of the transform to project the means
into dimension K and write out the transformed model.
The computation described here is fairly slow; it is \f$ O(K^3) \f$ on each frame,
and K is fairly large (e.g. 117). This is the price we pay for compact statistics;
if we stored full mean and variance statistics, the per-frame computation would be \f$O(K^2)\f$.
To speed it up, we have an optional parameter ("speedup" in the code) which
selects a random subset of frames to actually compute the HLDA statistics on.
For instance, if speedup=0.1 we would only accumulate HLDA statistics on 1/10 of
the frames. If this option is activated, we need to store two separate
versions of the sufficient statistics for the means. One version of the mean
statistics, accumulated on the subset, is only used in the HLDA computation, and
corresponds to the quantities \f$\gamma_{jm}\f$ and \f$\mu_{jm}\f$ in the formulas above.
The other version of the mean statistics is accumulated on all the training data
and is used to write out the transformed model.
The overall HLDA estimation process is as follows (see rm_recipe_2/scripts/train_tri2j.sh):
- First initialize it with LDA (we store both the reduced dimension matrix
and the full matrix).
- Start the model-building and training process. On certain (non-consecutive)
iterations where we have decided to do the HLDA update, do the following:
- Accumulate HLDA statistics (S, plus statistics for the full-dimensional means).
The program that accumulates these (gmm-acc-hlda) needs the model, the un-transformed features,
and the current transform (which it needs to transform the features in order
to compute Gaussian posteriors).
- Update the HLDA transform. The program that does this (gmm-est-hlda)
needs the model; the statistics; and the previous full (square)
transformation matrix which it needs to start the optimization and to correctly
report auxiliary function changes. It outputs the new transform (both full and
reduced dimension), and the model with newly estimated and transformed means.
\section transform_mllt Global Semi-tied Covariance (STC) / Maximum Likelihood Linear Transform (MLLT) estimation
Global STC/MLLT is a square feature-transformation matrix. For more details,
see "Semi-tied Covariance Matrices for Hidden Markov Models", by Mark Gales,
IEEE Transactions on Speech and Audio Processing, vol. 7, 1999, pages 272-281.
Viewing it as a feature-space transform, the objective function is the average
per-frame log-likelihood of the transformed features given the model, plus the
log determinant of the transform. The means of the model are also rotated by the
transform in the update phase. The sufficient statistics are the following,
for \f$ 0 \leq i < D \f$ where D is the feature dimension:
\f[
\mathbf{G}^{(i)} = \sum_{t,j,m} \gamma_{jm}(t) \frac{1}{ \sigma^2_{jm}(i) } (\mu_{jm} - \mathbf{x}(t)) (\mu_{jm} - \mathbf{x}(t))^T
\f]
See the reference, Equations (22) and (23) for the update equations. These are
basically a simplified form of the diagonal row-by-row Constrained MLLR/fMLLR update
equations, where the first-order term of the quadratic equation disappears. Note that
our implementation differs from that reference by using a column of the inverse of the matrix
rather than the cofactor, since multiplying by the determinant does not make a difference to the
result and could potentially cause problems with floating-point underflow or overflow.
We describe the overall process as if we are doing MLLT on top of LDA features,
but it is also applicable on top of traditional delta features. See the script
rm_recipe_2/steps/train_tri2f for an example. The process is as follows:
- Estimate the LDA transformation matrix (we only need the first rows of this, not the full matrix).
Call this matrix \f$\mathbf{M}\f$.
- Start a normal model building process, always using features transformed with \f$\mathbf{M}\f$.
At certain selected iterations (where we will update the MLLT matrix), we do the following:
- Accumulate MLLT statistics in the current fully-transformed space
(i.e., on top of features transformed with \f$\mathbf{M}\f$). For efficiency we do this using
a subset of the training data.
- Do the MLLT update; let this produce a square matrix \f$\mathbf{T}\f$.
- Transform the model means by setting \f$ \mu_{jm} \leftarrow \mathbf{T} \mu_{jm} \f$.
- Update the current transform by setting \f$ \mathbf{M} \leftarrow \mathbf{T} \mathbf{M} \f$
The programs involved in MLLT estimation are gmm-acc-mllt and est-mllt. We also need the
programs gmm-transform-means (to transform the Gaussian means using \f$\mathbf{T}\f$), and
compose-transforms (to do the multiplication \f$\mathbf{M} \leftarrow \mathbf{T} \mathbf{M} \f$).
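A sketch of one MLLT iteration in script form, using the programs just mentioned
(the file names are illustrative):
\verbatim
# Accumulate MLLT statistics from state-level posteriors.
ali-to-post ark:$dir/0.ali ark:- | \
  gmm-acc-mllt $dir/0.mdl "$feats" ark:- $dir/mllt.acc
# Estimate the square matrix T from the statistics.
est-mllt $dir/T.mat $dir/mllt.acc
# Rotate the model means: mu <-- T mu.
gmm-transform-means $dir/T.mat $dir/0.mdl $dir/1.mdl
# Update the feature transform: M <-- T M.
compose-transforms $dir/T.mat $dir/0.mat $dir/1.mat
\endverbatim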
\section transform_cmllr_global Global CMLLR/fMLLR transforms
Constrained Maximum Likelihood Linear Regression (CMLLR), also known as feature-space MLLR (fMLLR),
is an affine feature transform of the form \f$ \mathbf{x} \rightarrow \mathbf{A} \mathbf{x} + \mathbf{b} \f$,
which we write in the form \f$ \mathbf{x} \rightarrow \mathbf{W} \mathbf{x}^+ \f$, where
\f$\mathbf{x}^+ = \left[\begin{array}{c} \mathbf{x} \\ 1 \end{array} \right]\f$ is the feature with
a 1 appended. Note that this differs from some of the literature where the 1 comes first.
For a review paper that explains CMLLR and the estimation techniques we use, see
"Maximum likelihood linear transformations for HMM-based speech recognition" by Mark Gales,
Computer Speech and Language Vol. 12, pages 75-98.
The sufficient statistics we store are:
\f[ \mathbf{K} = \sum_{t,j,m} \gamma_{jm}(t) \Sigma_{jm}^{-1} \mu_{jm} \left.\mathbf{x}(t)^+\right.^T \f]
where \f$\Sigma_{jm}^{-1}\f$ is the inverse covariance matrix,
and for \f$0 \leq i < D \f$ where D is the feature dimension,
\f[ \mathbf{G}^{(i)} = \sum_{t,j,m} \gamma_{jm}(t) \frac{1}{\sigma^2_{jm}(i)} \mathbf{x}(t)^+ \left.\mathbf{x}(t)^+\right.^T \f]
Our estimation scheme is the standard one, see Appendix B of the reference (in particular section B.1,
"Direct method over rows"). We differ by using a column of the inverse in place of the cofactor row,
i.e. ignoring the factor of the determinant, as it does not affect the result and causes danger of
numerical underflow or overflow.
Estimation of global Constrained MLLR (CMLLR) transforms is done by the
class FmllrDiagGmmAccs,
and by the program gmm-est-fmllr (also see gmm-est-fmllr-gpost). The syntax
of gmm-est-fmllr is:
\verbatim
gmm-est-fmllr [options] <model-in> <feature-rspecifier> \
<post-rspecifier> <transform-wspecifier>
\endverbatim
The "<post-rspecifier>" item corresponds to posteriors at the transition-id level
(see \ref hmm_post). The program writes out a table of CMLLR transforms
indexed by utterance by default, or if the --spk2utt option is given, indexed by speaker.
Below is a simplified extract of a script
(rm_recipe_2/steps/decode_tri_fmllr.sh) that estimates and uses CMLLR transforms based
on alignments from a previous, unadapted decoding. The previous decoding is assumed
to be with the same model (otherwise we would have to convert the alignments with
the program "convert-ali").
\verbatim
...
silphones=48 # colon-separated list with one phone-id in it.
mincount=500 # min-count to estimate an fMLLR transform
sifeats="ark:add-deltas --print-args=false scp:data/test.scp ark:- |"
# The next command computes the fMLLR transforms.
ali-to-post ark:$srcdir/test.ali ark:- | \
weight-silence-post 0.0 $silphones $model ark:- ark:- | \
gmm-est-fmllr --fmllr-min-count=$mincount \
--spk2utt=ark:data/test.spk2utt $model "$sifeats" \
ark,o:- ark:$dir/test.fmllr 2>$dir/fmllr.log
feats="ark:add-deltas --print-args=false scp:data/test.scp ark:- |
transform-feats --utt2spk=ark:data/test.utt2spk ark:$dir/test.fmllr
ark:- ark:- |"
# The next command decodes the data.
gmm-decode-faster --beam=30.0 --acoustic-scale=0.08333 \
--word-symbol-table=data/words.txt $model $graphdir/HCLG.fst \
"$feats" ark,t:$dir/test.tra ark,t:$dir/test.ali 2>$dir/decode.log
\endverbatim
\section transform_lvtln Linear VTLN (LVTLN)
In recent years, there have been a number of papers that describe
implementations of Vocal Tract Length Normalization (VTLN) that
work out a linear feature transform corresponding to each VTLN
warp factor. See, for example, "Using VTLN for broadcast news transcription",
by D. Y. Kim, S. Umesh, M. J. F. Gales, T. Hain and P. C. Woodland, ICSLP 2004.
We implement a method in this general category using the class LinearVtln, and programs
such as gmm-init-lvtln, gmm-train-lvtln-special, and gmm-est-lvtln-trans.
The LinearVtln object essentially stores a set of linear feature transforms,
one for each warp factor. Let these linear feature transform matrices
be
\f[\mathbf{A}^{(i)}, 0\leq i < N, \f]
where for instance we might have \f$N\f$=31, corresponding to 31 different warp
factors. We describe below how we obtain these matrices.
The way the speaker-specific transform is estimated is as follows.
First, we require some kind of model and a corresponding alignment. In the
example scripts we do this either with a small monophone model, or with
a full triphone model. From this model and alignment, and using the original,
unwarped features, we compute the conventional statistics for estimating
CMLLR. When computing the LVTLN transform, what we do is take each matrix
\f$\mathbf{A}^{(i)}\f$, and compute the offset vector \f$\mathbf{b}\f$ that
maximizes the CMLLR auxiliary function for the transform
\f$\mathbf{W} = \left[ \mathbf{A}^{(i)} \, ; \, \mathbf{b} \right]\f$.
The value of \f$\mathbf{W}\f$ that gives the best auxiliary function value
(i.e. maximizing over i) becomes the transform for that speaker. Since we
are estimating a mean offset here,
we are essentially combining a kind of model-based cepstral mean normalization
(or alternatively an offset-only form of CMLLR) with VTLN warping implemented
as a linear transform. This avoids us having to implement mean normalization
as a separate step.
We next describe how we estimate the matrices \f$\mathbf{A}^{(i)}\f$. We
don't do this in the same way as described in the referenced paper; our method
is simpler (and easier to justify). Here we describe our computation for a
particular warp factor; in the current scripts we have 31 distinct warp
factors: 0.85, 0.86, ..., 1.15.
We take a subset of feature data (e.g. several tens of utterances),
and for this subset we compute both the original and transformed features,
where the transformed features are computed using a conventional VTLN computation
(see \ref feat_vtln).
Call the original and transformed features \f$\mathbf{x}(t)\f$ and \f$\mathbf{y}(t)\f$ respectively,
where \f$t\f$ will range over the frames of the selected utterances.
We compute the affine transform that maps \f$\mathbf{x}\f$ to \f$\mathbf{y}\f$ in a least-squares
sense, i.e. if \f$\mathbf{y}' = \mathbf{A} \mathbf{x} + \mathbf{b}\f$,
we compute \f$\mathbf{A}\f$ and \f$\mathbf{b}\f$ that minimize the sum-of-squares
difference \f$\sum_t (\mathbf{y}'(t) - \mathbf{y}(t) )^T (\mathbf{y}'(t) - \mathbf{y}(t) )\f$.
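This is a standard least-squares problem; writing \f$\bar{\mathbf{x}}\f$ and
\f$\bar{\mathbf{y}}\f$ for the means of the original and VTLN-warped features,
the solution is:
\f[
  \mathbf{A} = \left( \sum_t (\mathbf{y}(t) - \bar{\mathbf{y}}) (\mathbf{x}(t) - \bar{\mathbf{x}})^T \right)
               \left( \sum_t (\mathbf{x}(t) - \bar{\mathbf{x}}) (\mathbf{x}(t) - \bar{\mathbf{x}})^T \right)^{-1} ,
  \quad \mathbf{b} = \bar{\mathbf{y}} - \mathbf{A} \bar{\mathbf{x}} .
\f]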
Then we normalize the diagonal variance as follows: we compute the
variance of the original features as \f$\mathbf{\Sigma}^{(x)}\f$ and of the linearly transformed
features as \f$\mathbf{\Sigma}^{(y')}\f$, and for each dimension index d we multiply the
d'th row of \f$\mathbf{A}\f$ by
\f$\sqrt{ \frac{\mathbf{\Sigma}^{(x)}_{d,d}}{\mathbf{\Sigma}^{(y')}_{d,d}}}\f$.
The resulting matrix will become \f$\mathbf{A}^{(i)}\f$ for some value of i.
The command-line tools support the option to ignore the log determinant term
when evaluating which of the transform matrices to use (e.g., you can set
--logdet-scale=0.0). Under certain circumstances this appears to improve
results; ignoring the log determinant always makes the distribution of warp
factors more bimodal, because the log determinant is never positive and is zero
for a warp factor of 1.0, so it essentially acts as a penalty
on warp factors that are far away from 1. However, for certain types of
features (in particular, features derived from LDA), ignoring the log
determinant makes results a lot worse and leads to very odd distributions of
warp factors, so our example scripts always use the log-determinant. This is
anyway the "right" thing to do.
The internal C++ code supports accumulating statistics for Maximum Likelihood
re-estimation of the transform matrices \f$\mathbf{A}^{(i)}\f$. Our expectation
was that this would improve results; however, it led to a degradation in
performance, so we do not include example scripts for doing this.
\section transform_et Exponential Transform (ET)
The Exponential Transform (ET) is another approach to computing a VTLN-like
transform, but unlike Linear VTLN we completely sever the connection
to frequency warping, and learn it in a data-driven way. For normal
training data, we find that it does learn something very similar to
conventional VTLN.
ET is a transform of the form:
\f[
\mathbf{W}_s = \mathbf{D}_s \exp ( t_s \mathbf{A} ) \mathbf{B} ,
\f]
where exp is the matrix exponential function, defined via a Taylor
series in \f$\mathbf{A}\f$ that is the same as the Taylor series for
the scalar exponential function. Quantities with a subscript "s"
are speaker-specific; other quantities (i.e. \f$\mathbf{A}\f$ and
\f$\mathbf{B}\f$) are global and shared across all speakers.
The most important factor in this equation is the middle one,
with the exponential function in it.
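For reference, the Taylor series defining the matrix exponential is:
\f[
  \exp(\mathbf{M}) = \mathbf{I} + \mathbf{M} + \frac{1}{2!} \mathbf{M}^2 + \frac{1}{3!} \mathbf{M}^3 + \ldots
\f]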
The factor \f$\mathbf{D}_s\f$ gives us the ability to combine
model-based mean and optionally variance normalization (i.e. offset-only
or diagonal-only CMLLR)
with this technique, and the factor \f$\mathbf{B}\f$ allows the transform to include
MLLT (a.k.a. global STC), and is also a byproduct of the process
of renormalizing the \f$t_s\f$ quantities on each iteration of
re-estimation. The dimensions of these quantities are as follows,
where D is the feature dimension:
\f[
\mathbf{D}_s \in \Re^{D \times (D+1)}, \ t_s \in \Re, \ \mathbf{A} \in \Re^{(D+1)\times(D+1)}, \ \mathbf{B} \in \Re^{(D+1)\times (D+1)} .
\f]
Note that if \f$\mathbf{D}_s\f$ were a completely unconstrained CMLLR matrix,
there would be no point to this technique as the other quantities in the
equation would add no degrees of freedom. The tools support three kinds of
constraints on \f$\mathbf{D}_s\f$: it may be of the form
\f$[ {\mathbf I} \, \;\, {\mathbf 0} ]\f$ (no adaptation), or
\f$[ {\mathbf I} \, \;\, {\mathbf m} ]\f$ (offset only), or
\f$[ {\mathrm{diag}}( {\mathbf d} ) \, \;\, {\mathbf m} ]\f$ (diagonal CMLLR);
this is controlled by the --normalize-type option to the command-line tools.
The last rows of \f$\mathbf{A}\f$ and \f$\mathbf{B}\f$ are
fixed at particular values (these rows are involved in propagating the
last vector element with value 1.0, which is appended to the feature in order
to express an affine transform as a matrix). The last row
of \f$\mathbf{A}\f$ is fixed at zero and the last row
of \f$\mathbf{B}\f$ is fixed at \f$[ 0\ 0\ 0 \ \ldots\ 0 \ 1]\f$.
The speaker-specific quantity \f$t_s\f$ may be interpreted
very loosely as the log of the speaker-specific warp factor.
The basic intuition behind the use of the exponential function is that
if we were to warp by a factor f and then a factor g,
this should be the same as warping by the combined factor
fg. Let l = log(f) and m = log(g). Then we achieve this
property via the identity
\f[ \exp( l \mathbf{A} ) \exp( m \mathbf{A}) = \exp( (l+m) \mathbf{A} ) . \f]
The ET computation for a particular speaker is as follows; this assumes we
are given \f$\mathbf{A}\f$ and \f$\mathbf{B}\f$. We accumulate conventional
CMLLR sufficient statistics for the speaker. In the update phase we iteratively optimize
\f$t_s\f$ and \f$\mathbf{D}_s\f$ to maximize the auxiliary function.
The update for \f$t_s\f$ is an iterative procedure based on Newton's method;
the update for \f$\mathbf{D}_s\f$ is based on the conventional CMLLR
update, specialized for the diagonal or offset-only case, depending on
the exact constraints we are putting on \f$\mathbf{D}_s\f$.
The overall training-time computation is as follows:
- First, initialize \f$\mathbf{B}\f$ to the identity and \f$\mathbf{A}\f$ to
a random matrix with zero final row.
Then, starting with some known model, start an iterative E-M process.
On each iteration, we first estimate the speaker-specific parameters
\f$t_s\f$ and \f$\mathbf{D}_s\f$, and compute the transforms \f$\mathbf{W}_s\f$
that result from them. Then we choose to update either \f$\mathbf{A}\f$, or
\f$\mathbf{B}\f$, or the model.
- If updating \f$\mathbf{A}\f$, we do this given fixed values of
\f$t_s\f$ and \f$\mathbf{D}_s\f$. The update is not guaranteed to
converge, but converges rapidly in practice; it's based on a
quadratic "weak-sense auxiliary function"
where the quadratic term is obtained using a first-order truncation
of the Taylor series expansion of the matrix exponential function.
After updating \f$\mathbf{A}\f$, we modify \f$\mathbf{B}\f$ in order
to renormalize the \f$t_s\f$ to zero; this involves premultiplying
\f$\mathbf{B}\f$ by \f$\exp(t \mathbf{A})\f$, where t is the average
value of \f$t_s\f$.
- If updating \f$\mathbf{B}\f$, this is also done using fixed values of
\f$t_s\f$ and \f$\mathbf{D}_s\f$, and the update is similar to MLLT
(a.k.a. global STC).
For purposes of the accumulation and update, we imagine we are estimating
an MLLT matrix just to the left of \f$\mathbf{A}\f$, i.e. some matrix
\f$\mathbf{C} \in \Re^{D\times D}\f$; let us define
\f$\mathbf{C}^+ = \left[ \begin{array}{cc} \mathbf{C} & 0 \\ 0 & 1 \end{array} \right]\f$.
The transform will be
\f$\mathbf{W}_s = \mathbf{D}_s \mathbf{C}^+ \exp ( t_s \mathbf{A} ) \mathbf{B}\f$.
Conceptually, while estimating \f$\mathbf{C}\f$ we view \f$\mathbf{D}_s\f$ as
a model-space transform creating speaker-specific models, which is only possible
due to the diagonal structure of \f$\mathbf{D}_s\f$; and we view
\f$\exp ( t_s \mathbf{A} ) \mathbf{B}\f$ as a feature-space transform (i.e.
as part of the features). After estimating \f$\mathbf{C}\f$, we will use the identity
\f[
\mathbf{C}^+ \exp ( t_s \mathbf{A} ) = \exp ( t_s \mathbf{C}^+ \mathbf{A} \left.\mathbf{C}^+\right.^{-1} ) \mathbf{C}^+
\f]
so the update becomes:
\f[
\mathbf{A} \leftarrow \mathbf{C}^+ \mathbf{A} \left.\mathbf{C}^+\right.^{-1} , \ \ \mathbf{B} \leftarrow \mathbf{C}^+ \mathbf{B} .
\f]
At this point we need to transform the model means with the matrix
\f$\mathbf{C}\f$. The reader might question how this interacts with the
fact that for estimating \f$\mathbf{C}\f$, we viewed the quantity
\f$\mathbf{D}_s\f$ as a model-space transform. If \f$\mathbf{D}_s\f$ only
contains a mean offset, we can still prove that the auxiliary function
would increase, except we would have to change the offsets appropriately
(this is not necessary to do explicitly, as we will re-estimate them on
the next iteration anyway). However, if \f$\mathbf{D}_s\f$ has non-unit
diagonal (i.e. is diagonal not offset CMLLR), this re-estimation process
is not guaranteed to improve the likelihood; the tools will print a warning
in this case. In order to avoid encountering this case, our scripts
train in a mode where \f$\mathbf{D}_s\f$ is an offset-only transform; but
at test time we allow \f$\mathbf{D}_s\f$ to be a diagonal CMLLR transform, which seems
to give slightly better results than the offset-only case.
- Updating the model is straightforward; it just involves training on the adapted
features.
Important programs related to the use of exponential transforms are as follows:
- gmm-init-et initializes the exponential transform object (that contains A and B) and writes it to disk; the initialization of A is random.
- gmm-est-et estimates the exponential transforms for a set of speakers; it reads the exponential transform object, the model, the features and \ref hmm_gpost "Gaussian-level posteriors", and it writes out the transforms \f$\mathbf{W}_s\f$ and optionally the "warp factors" \f$t_s\f$.
- gmm-et-acc-a accumulates statistics for updating \f$\mathbf{A}\f$, and gmm-et-est-a does the corresponding update.
- gmm-et-acc-b accumulates statistics for updating \f$\mathbf{B}\f$, and gmm-et-est-b does the corresponding update.
\section transform_cmvn Cepstral mean and variance normalization
Cepstral mean and variance normalization consists of normalizing the mean
and variance of the raw cepstra, usually to give zero-mean, unit-variance
cepstra, either on a per-utterance or per-speaker basis. We provide code
to support this, and some example scripts, but we do not particularly recommend its use.
In general we prefer model-based approaches to mean and variance normalization;
e.g., our code for \ref transform_lvtln also learns a mean offset and the code
for \ref transform_et does a diagonal CMLLR transform that has the same power as
cepstral mean and variance normalization (except usually applied to the fully
expanded features). For very fast operation, it is possible to apply these
approaches using a very tiny model with a phone-based language model, and some of
our example scripts demonstrate this. There is also the capability in the
feature extraction code to subtract the mean on a per-utterance basis (the
--subtract-mean option to compute-mfcc-feats and compute-plp-feats).
In order to support per-utterance and per-speaker mean and variance normalization
we provide the programs compute-cmvn-stats and apply-cmvn. The program
compute-cmvn-stats will, by default, compute the sufficient statistics for mean
and variance normalization, as a matrix (the format is not very important; see
the code for details), and will write out a table of these statistics indexed by
utterance-id. If it is given the --spk2utt option, it will write out the
statistics on a per-speaker basis instead (warning: before using this option,
read \ref io_sec_bloat, as this option causes the input features to be read in
random-access mode). The program "apply-cmvn" reads in features and cepstral
mean and variance statistics; the statistics are expected to be indexed per
utterance by default, or per speaker if the --utt2spk option is applied. It
writes out the features after mean and variance normalization. These programs,
despite the names, do not care whether the features in question consist of
cepstra or anything else; they simply regard them as matrices. Of course, the
features supplied to compute-cmvn-stats and apply-cmvn must have the same
dimension.
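A sketch of per-speaker use, following the conventions of the earlier examples
(the file names are illustrative):
\verbatim
# Compute CMVN statistics, indexed by speaker rather than by utterance.
compute-cmvn-stats --spk2utt=ark:data/train.spk2utt \
  scp:data/train.scp ark:$dir/cmvn.ark
# Apply the per-speaker statistics to the features.
feats="ark:apply-cmvn --utt2spk=ark:data/train.utt2spk \
  ark:$dir/cmvn.ark scp:data/train.scp ark:- |"
\endverbatim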
We note that it would probably be more consistent with the overall design of the
feature transformation code to supply a version of compute-cmvn-stats that would
write out the mean and variance normalizing transforms as generic affine
transforms (in the same format as CMLLR transforms), so that they could be
applied by the program transform-feats, and composed as needed with other
transforms using compose-transforms. If needed we may supply such a program, but
because we don't regard mean and variance normalization as an important part of
any recipes, we have not done so yet.
\section transform_regtree Building regression trees for adaptation
Kaldi supports regression-tree MLLR and CMLLR (also known as fMLLR). For
an overview of regression trees, see "The generation and use of regression class trees for MLLR
adaptation" by M. J. F. Gales, CUED technical report, 1996.
*/
}