// doc/transform.dox


// Copyright 2009-2011 Microsoft Corporation

// See ../../COPYING for clarification regarding multiple authors
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at

//  http://www.apache.org/licenses/LICENSE-2.0

// THIS CODE IS PROVIDED *AS IS* BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, EITHER EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION ANY IMPLIED
// WARRANTIES OR CONDITIONS OF TITLE, FITNESS FOR A PARTICULAR PURPOSE,
// MERCHANTABLITY OR NON-INFRINGEMENT.
// See the Apache 2 License for the specific language governing permissions and
// limitations under the License.

namespace kaldi {

/**
  \page transform Feature and model-space transforms in Kaldi

  \section transform_intro Introduction

  Kaldi code currently supports a number of feature and model-space transformations
  and projections.  Feature-space transforms and projections are treated in a consistent
  way by the tools (they are essentially just matrices), and the following sections
  relate to the commonalities:
   - \ref transform_apply
   - \ref transform_perspk
   - \ref transform_utt2spk
   - \ref transform_compose
   - \ref transform_weight

  Transforms, projections and other feature operations that are typically not speaker specific include:
   - \ref transform_lda
   - \ref transform_splice and \ref transform_delta
   - \ref transform_hlda
   - \ref transform_mllt

  Global transforms that are typically applied in a speaker adaptive way are:
    - \ref transform_cmllr_global
    - \ref transform_lvtln
    - \ref transform_et
    - \ref transform_cmvn

  We next discuss regression class trees and transforms that use them:
    - \ref transform_regtree


  \section transform_apply Applying global linear or affine feature transforms

  In the case of feature-space transforms and projections that are global,
  and not associated with classes (e.g. speech/silence or regression classes), we
  represent them as matrices.  A linear transform or
  projection is represented as a matrix by which we will left-multiply a feature vector,
  so the transformed feature is \f$ A x \f$.  An affine transform or projection
  is represented the same way, but we imagine a 1 has been appended to the
  feature vector, so the transformed feature is
  \f$ W \left[ \begin{array}{c} x \\ 1 \end{array} \right] \f$ where
   \f$ W = \left[ A ; b \right] \f$, with A and b being the linear transform
  and the constant offset.
  Note that this convention differs from some of the literature, where the 1 may appear as
  the first dimension rather than the last.
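  To make the convention concrete: with \f$ W = \left[ A ; b \right] \f$ we have
  \f$ W \left[ \begin{array}{c} x \\ 1 \end{array} \right] = A x + b \f$.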
  Global transforms and projections are generally written as a single object of
  type Matrix<BaseFloat> to a file, and speaker- or utterance-specific
  transforms or projections are stored in a table of such matrices (see \ref io_sec_tables)
  indexed by speaker-id or utterance-id.

  Transforms may be applied to features
  using the program transform-feats.  Its syntax is
\verbatim
 transform-feats <transform> <input-feats> <output-feats>
\endverbatim
  where <input-feats> is an rspecifier, <output-feats> is a wspecifier, and <transform>
  may be an rxfilename or an rspecifier (see \ref io_sec_specifiers and \ref io_sec_xfilename).
  The program works out whether the transform is linear or affine from the matrix's
  number of columns: if it equals the feature dimension the transform is treated as linear,
  and if it equals the feature dimension plus one it is treated as affine (for instance,
  with 39-dimensional features a 39-column matrix is linear and a 40-column matrix is affine).
  This program is typically used as part of a pipe.
  A typical example is:
\verbatim
 feats="ark:splice-feats scp:data/train.scp ark:- |
          transform-feats $dir/0.mat ark:- ark:-|"
 some-program some-args "$feats" some-other-args ...
\endverbatim
 Here, the file 0.mat contains a single matrix.  An example of applying
 speaker-specific transforms is:
\verbatim
 feats="ark:add-deltas scp:data/train.scp ark:- |
   transform-feats --utt2spk=ark:data/train.utt2spk ark:$dir/0.trans ark:- ark:-|"
 some-program some-args "$feats" some-other-args ...
\endverbatim
A per-utterance example would be as above but removing the --utt2spk option.
In this example, the archive file 0.trans would contain transforms (e.g. CMLLR transforms)
indexed by speaker-id, and the file data/train.utt2spk would have
lines of the form "utt-id spk-id" (see next section for more explanation).
The program transform-feats does not care how the transformation matrix was
estimated; it just applies it to the
features.  After it has been through all the features it prints out the average
per-frame log determinant.  This can be useful when comparing objective functions
(this log determinant would have to be added to the per-frame likelihood printed
out by programs like gmm-align, gmm-acc-stats, or gmm-decode-kaldi).  If the
linear part A of the transformation (i.e. ignoring the offset term) is not square,
then the program will instead print out the per-frame average of
\f$ \frac{1}{2} \mathbf{logdet} (A A^T) \f$.  It refers to this as the pseudo-log-determinant.
This is useful in checking convergence of MLLT estimation where the transformation matrix
being applied is the MLLT matrix times an LDA matrix.

\section transform_perspk Speaker-independent versus per-speaker versus per-utterance adaptation

Programs that estimate transforms are generally set up to do a particular kind of
adaptation, i.e. either speaker-independent or speaker- or utterance-specific.  For example, LDA
and MLLT/STC transforms are speaker-independent but fMLLR transforms are speaker- or
utterance-specific.  Programs that estimate speaker- or utterance-specific transforms
will work in per-utterance mode by default, but in per-speaker mode if the --spk2utt
option is supplied (see below).

One program that can accept either speaker-independent or speaker- or utterance-specific
transforms is transform-feats.  This program detects whether the first argument (the transform)
is an rxfilename (see \ref io_sec_xfilename)
or an rspecifier (see \ref io_sec_specifiers).  If the former, it treats it as a speaker-independent
transform (e.g. a file containing a single matrix).
If the latter, there are two choices.  If no --utt2spk option is provided,
it treats the transform as a table of matrices indexed by utterance id.  If an --utt2spk option is provided
(utt2spk is a table of strings indexed by utterance that contains the string-valued speaker id),
then the transforms are assumed to be indexed by speaker id, and the table
provided to the --utt2spk option is used to map each utterance to a speaker id.
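
To summarize, the three modes of transform-feats look like this (the filenames
here are just illustrative):
\verbatim
 # speaker-independent: the transform is a single matrix (an rxfilename)
 transform-feats $dir/0.mat ark:feats.ark ark:-
 # per-utterance: the transform is a table of matrices indexed by utterance-id
 transform-feats ark:$dir/trans.ark ark:feats.ark ark:-
 # per-speaker: as above, but mapping utterance-ids to speaker-ids
 transform-feats --utt2spk=ark:data/train.utt2spk ark:$dir/trans.ark ark:feats.ark ark:-
\endverbatim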

\section transform_utt2spk Utterance-to-speaker and speaker-to-utterance maps

 At this point we give a general overview of the --utt2spk and --spk2utt options.
 These options are accepted by programs that deal with transformations; they are used when
 you are doing per-speaker (as opposed to per-utterance) adaptation.
 Typically programs that process already-created transforms will need the --utt2spk
 option and programs that create the transforms will need the --spk2utt option.
 A typical case is that there will be a file called some-directory/utt2spk
 that looks like:
\verbatim
spk1utt1  spk1
spk1utt2  spk1
spk2utt1  spk2
spk2utt2  spk2
...
\endverbatim
where these strings are just examples standing for generic speaker and
utterance identifiers; and there will be a file called some-directory/spk2utt that looks like:
\verbatim
spk1 spk1utt1 spk1utt2
spk2 spk2utt1 spk2utt2
...
\endverbatim
 and you will supply options that look like --utt2spk=ark:some-directory/utt2spk
 or --spk2utt=ark:some-directory/spk2utt.  The 'ark:' prefix is necessary because
 these files are given as rspecifiers by the Table code, and are interpreted as archives
 that contain strings (or vectors of strings, in the spk2utt case).  Note that
 the utt2spk archive is generally accessed in a random-access manner, so if you
 are processing subsets of data it is safe to provide the whole file, but the
 spk2utt archive is accessed in a sequential manner so if you are using subsets
 of data you would have to split up the spk2utt archive.
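
 The two files contain the same information, so one can be derived from the other;
 for instance, a spk2utt file could be generated from an utt2spk file with a one-liner
 like the following (this is just an illustration, not a Kaldi tool):
\verbatim
 awk '{utts[$2] = utts[$2] " " $1} END{for (s in utts) print s utts[s]}' \
   some-directory/utt2spk | sort > some-directory/spk2utt
\endverbatim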

 Programs that accept the spk2utt option will normally iterate over the
 speaker-ids in the spk2utt file, and for each speaker-id they will iterate over
 the utterances for that speaker, accumulating statistics for each utterance.  Access to the feature
 files will then be in random-access mode, rather than the normal sequential
 access.  This requires some care to set up, because feature files are quite large
 and fully-processed features are normally read from an archive, which does not
 allow the most memory-efficient random access unless carefully set up.  To avoid memory
 bloat when accessing the feature files in this case, it may be advisable to
 ensure that all archives are sorted on utterance-id, that the utterances in the
 file given to the --spk2utt option appear in sorted order, and that the
 appropriate options are given on the rspecifiers that specify the feature input
 to such programs (e.g. "ark,s,cs:-" if it is the standard input).  See \ref io_sec_bloat
 for more discussion of this issue.
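
 For example, the speaker-independent features supplied to a program like gmm-est-fmllr
 (see \ref transform_cmllr_global) might be specified as follows (illustrative):
\verbatim
 sifeats="ark,s,cs:add-deltas scp:data/train.scp ark:- |"
\endverbatim
 where the "s" and "cs" options assert that the archive is sorted and will be
 called in sorted order, which allows memory-efficient random access.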

 \section transform_compose Composing transforms

 Another program that accepts generic transforms is the program compose-transforms.
 The general syntax is "compose-transforms a b c", and it performs the multiplication
 c = a b (although this involves a little more than matrix multiplication if a is affine).
 An example modified from a script is as follows:
\verbatim
 feats="ark:splice-feats scp:data/train.scp ark:- |
         transform-feats
           \"ark:compose-transforms ark:1.trans 0.mat ark:- |\"
           ark:- ark:- |"
 some-program some-args "$feats" ...
\endverbatim
 This example also illustrates using two levels of commands invoked from a program.
 Here, 0.mat is a global matrix (e.g. LDA) and 1.trans is a set of fMLLR/CMLLR matrices
 indexed by utterance id.   The program compose-transforms composes the transforms
 together.  The same features could be computed more simply,  but less efficiently, as follows:
\verbatim
 feats="ark:splice-feats scp:data/train.scp ark:- |
         transform-feats 0.mat ark:- ark:- |
         transform-feats ark:1.trans ark:- ark:- |"
 ...
\endverbatim
 In general, the transforms a and b that are the inputs to compose-transforms
 may be either speaker-independent transforms or speaker- or utterance-specific
 transforms.  If a is utterance-specific and b is speaker-specific then you have to supply
 the --utt2spk option.  However, the combination of a being speaker-specific and b being utterance-specific
 (which does not make much sense) is not supported.  The output of compose-transforms
 will be a table if either a or b are tables.  The three arguments a, b and c may all
 represent either tables or normal files (i.e. either {r,w}specifiers or {r,w}xfilenames),
 subject to consistency requirements.

 If a is an affine transform, in order to perform the composition correctly, compose-transforms
 needs to know whether b is affine or linear (it does not know this because it does not have access
 to the dimension of the features
 that are transformed by b).  This is controlled by the option --b-is-affine (bool, default false).
 If b is affine but you forget to set this option and a is affine, compose-transforms
 will treat b as a linear transform whose input dimension is the real feature dimension plus one,
 and will output a transform whose input dimension is the real feature dimension plus two.  There
 is no way for "transform-feats" to interpret this when it is to be applied to features,
 so the error should become obvious as a dimension mismatch at this point.
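
 As an example, suppose we have estimated a second set of per-speaker CMLLR transforms
 on top of features already adapted with a first set; to obtain a single combined
 transform per speaker we could compose the two (both are affine, so we must pass
 --b-is-affine=true; the filenames here are just illustrative):
\verbatim
 compose-transforms --b-is-affine=true ark:$dir/trans.second ark:$dir/trans.first \
    ark:$dir/trans.combined
\endverbatim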


\section transform_weight Silence weighting when estimating transforms

Eliminating silence frames can be helpful when estimating speaker adaptive
transforms such as CMLLR.  This even appears to be true when using
a multi-class approach with a regression tree (for which, see \ref transform_regtree).
The way we implement this is by weighting down the posteriors associated with
silence phones.  This takes place as a modification to the \ref hmm_post
"state-level posteriors".  An extract of a bash shell script that does this
is below (this script is discussed in more detail in \ref transform_cmllr_global):
\verbatim
ali-to-post ark:$srcdir/test.ali ark:- | \
  weight-silence-post 0.0 $silphones $model ark:- ark:- | \
  gmm-est-fmllr --fmllr-min-count=$mincount \
    --spk2utt=ark:data/test.spk2utt $model "$sifeats" \
   ark,o:- ark:$dir/test.fmllr 2>$dir/fmllr.log
\endverbatim
Here, the shell variable "silphones" would be set to a colon-separated
list of the integer id's of the silence phones.

\section transform_lda Linear Discriminant Analysis (LDA) transforms

Kaldi supports LDA estimation via class LdaEstimate.  This class does not interact
directly with any particular type of model; it needs to be initialized with the
number of classes, and the accumulation function is declared as:
\verbatim
class LdaEstimate {
  ...
  void Accumulate(const VectorBase<BaseFloat> &data, int32 class_id,
                  BaseFloat weight=1.0);
};
\endverbatim
The program acc-lda accumulates LDA statistics using the acoustic states (i.e. pdf-ids) as the
classes.  It requires the transition model in order to map the alignments (expressed in terms
of transition-ids) to pdf-ids.  However, it is not limited to a particular type of acoustic model.

The program est-lda does the LDA estimation (it reads in the statistics from acc-lda).  The features you get from the transform will
have unit variance, but not necessarily zero mean.  The program est-lda outputs the LDA transformation matrix,
and using the option --write-full-matrix you can write out the full matrix without dimensionality
reduction (its first rows will be equivalent to the LDA projection matrix).  This can be useful
when using LDA as an initialization for HLDA.
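
A sketch of how these programs might be invoked is below; the argument order shown is
only a sketch (check each program's usage message), and the filenames and the
$splicedfeats variable are just illustrative:
\verbatim
ali-to-post ark:$dir/train.ali ark:- | \
  acc-lda $model "$splicedfeats" ark:- $dir/lda.acc
est-lda --write-full-matrix=$dir/lda.full.mat $dir/lda.mat $dir/lda.acc
\endverbatim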

\section transform_splice Frame splicing

Frame splicing (e.g. splicing nine consecutive frames together) is typically done
to the raw MFCC features prior to LDA.  The program splice-feats does this.  A typical
line from a script that uses this is the following:
\verbatim
feats="ark:splice-feats scp:data/train.scp ark:- |
        transform-feats $dir/0.mat ark:- ark:-|"
\endverbatim
and the "feats" variable would later be used as an rspecifier (c.f. \ref io_sec_specifiers)
by some program that needs to read features.  In this example we don't specify the number of frames to splice
together because we are using the defaults (--left-context=4, --right-context=4, or
9 frames in total).

\section transform_delta Delta feature computation

Computation of delta features is done by the program add-deltas, which uses the
function ComputeDeltas.  The delta feature computation has the same default setup
as HTK's, i.e. to compute the first delta feature we multiply the features
by a sliding window of values [ -2, -1, 0, 1, 2 ], and then normalize by
dividing by (2^2 + 1^2 + 0^2 + 1^2 + 2^2 = 10).  The second delta feature
is computed by applying the same approach to the first delta feature.  The
number of frames of context on each side is controlled by --delta-window (default: 2)
and the number of delta features to add is controlled by --delta-order (default: 2).
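Written out, with the default --delta-window=2 the first-order delta for frame \f$t\f$ is
\f[
  \Delta\mathbf{x}(t) = \frac{ -2\,\mathbf{x}(t-2) - \mathbf{x}(t-1) + \mathbf{x}(t+1) + 2\,\mathbf{x}(t+2) }{ 2^2 + 1^2 + 0^2 + 1^2 + 2^2 } .
\f]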
A typical script line that uses this is:
\verbatim
feats="ark:add-deltas --print-args=false scp:data/train.scp ark:- |"
\endverbatim

\section transform_hlda Heteroscedastic Linear Discriminant Analysis (HLDA)

 HLDA is a dimension-reducing linear feature projection, estimated using
 Maximum Likelihood, where "rejected" dimensions are modeled using a global
 mean and variance, and "accepted" dimensions are modeled with a particular
 model whose means and variances are estimated via Maximum Likelihood.
 The form of HLDA
 that is currently integrated with the tools is as implemented in
 HldaAccsDiagGmm.  It estimates HLDA for GMMs, using a relatively compact
 form of the statistics.  The classes correspond to the Gaussians in the model.
 Since it does not use a standard estimation method, we will explain the idea
 here.  Firstly, because of memory limitations we do not want to store
 the largest form of HLDA statistics which is mean and full-covariance statistics
 for each class.  We observe that if during the HLDA update phase we leave
 the variances fixed, then the problem of HLDA estimation reduces to MLLT
 (or global STC) estimation.  See "Semi-tied Covariance Matrices for Hidden
 Markov Models", by Mark Gales, IEEE Transactions on Speech and Audio Processing,
 vol. 7, 1999, pages 272-281, e.g. Equations (22) and (23).  The statistics
 that are \f$ \mathbf{G}^{(ri)} \f$ there, are also used here, but in the HLDA
 case they need to be defined
 slightly differently for the accepted and rejected dimensions.
 Suppose the original feature dimension is D and the
 reduced feature dimension is K.
 Let us forget the iteration superscript r, and use subscript j for state and
 m for Gaussian mixture.
 For accepted dimensions (\f$0 \leq i < K\f$), the statistics are:
 \f[
   \mathbf{G}^{(i)} = \sum_{t,j,m} \gamma_{jm}(t) \frac{1}{ \sigma^2_{jm}(i) } (\mu_{jm} - \mathbf{x}(t)) (\mu_{jm} - \mathbf{x}(t))^T
 \f]
 where \f$\mu_{jm} \in \Re^{D}\f$ is the Gaussian mean in the original D-dimensional space,
 and \f$\mathbf{x}(t)\f$ is the feature in the original D-dimensional space, but
 \f$\sigma^2_{jm}(i)\f$ is the i'th dimension of the variance within the K-dimensional model.

 For rejected dimensions (\f$ K \leq i < D\f$), we use a unit variance Gaussian, and
 the statistics are as follows:
 \f[
   \mathbf{G}^{(i)} = \sum_{t,j,m} \gamma_{jm}(t)  (\mu - \mathbf{x}(t)) (\mu - \mathbf{x}(t))^T ,
 \f]
 where \f$\mu\f$ is the global feature mean in the original D-dimensional space.  Once we have
 these statistics, HLDA estimation is the same as MLLT/STC estimation in dimension D.
 Note here that all the \f$\mathbf{G}\f$ statistics for rejected dimensions are the
 same, so in the code we only store statistics for K+1 rather than D dimensions.

 Also, it is convenient for the program that accumulates the statistics to only have
 access to the K-dimensional model, so during HLDA accumulation we accumulate
 statistics sufficient to estimate the full D-dimensional means \f$\mu_{jm}\f$, and instead of
 G we accumulate the following statistics: for accepted dimensions (\f$0 \leq i < K\f$),
 \f[
   \mathbf{S}^{(i)} = \sum_{t,j,m} \gamma_{jm}(t) \frac{1}{ \sigma^2_{jm}(i) }  \mathbf{x}(t) \mathbf{x}(t)^T
 \f]
 and for rejected dimensions \f$K \leq i < D\f$
 \f[
   \mathbf{S}^{(i)} = \sum_{t,j,m} \gamma_{jm}(t)  \mathbf{x}(t) \mathbf{x}(t)^T ,
 \f]
 and of course we only need to store one of these (e.g. for i = K) because they are all the same.
 Then in the update time we can compute the G statistics for \f$0 \leq i < K\f$ as:
 \f[
  \mathbf{G}^{(i)} = \mathbf{S}^{(i)}  - \sum_{j,m} \gamma_{jm}  \mu_{jm} \mu_{jm}^T ,
 \f]
 and for \f$K \leq i < D\f$,
 \f[
  \mathbf{G}^{(i)} = \mathbf{S}^{(i)} - \beta \mu \mu^T,
 \f]
 where \f$ \beta = \sum_{j,m} \gamma_{jm} \f$ is the total count and \f$\mu = \frac{1}{\beta} \sum_{j,m} \gamma_{jm} \mu_{jm}\f$
 is the global feature mean.   After computing the transform from the G statistics using the same computation as MLLT,
 we output the transform, and we also use the first K rows of the transform to project the means
 into dimension K and write out the transformed model.

 The computation described here is fairly slow: on each frame we accumulate K+1 outer
 products of D-dimensional features, i.e. roughly \f$ O(K D^2) \f$ work, and D is fairly
 large (e.g. 117).  This is the price we pay for compact statistics; if we stored full
 mean and variance statistics for each class, the per-frame computation would be \f$O(D^2)\f$.
 To speed it up, we have an optional parameter ("speedup" in the code) which
 selects a random subset of frames to actually compute the HLDA statistics on.
 For instance, if speedup=0.1 we would only accumulate HLDA statistics on 1/10 of
 the frames.  If this option is activated, we need to store two separate
 versions of the sufficient statistics for the means.  One version of the mean
 statistics, accumulated on the subset, is only used in the HLDA computation, and
 corresponds to the quantities \f$\gamma_{jm}\f$ and \f$\mu_{jm}\f$ in the formulas above.
 The other version of the mean statistics is accumulated on all the training data
 and is used to write out the transformed model.

 The overall HLDA estimation process is as follows (see rm_recipe_2/scripts/train_tri2j.sh):
    - First initialize the transform with LDA (we store both the reduced-dimension matrix
      and the full matrix).
    - Start model-building and training process.  On certain (non-consecutive)
      iterations where we have decided to do the HLDA update, do the following:
      - Accumulate HLDA statistics (S, plus statistics for the full-dimensional means).
        The program that accumulates these (gmm-acc-hlda) needs the model, the un-transformed features,
        and the current transform (which it needs to transform the features in order
        to compute Gaussian posteriors)
      - Update the HLDA transform.  The program that does this (gmm-est-hlda)
        needs the model; the statistics; and the previous full (square)
        transformation matrix which it needs to start the optimization and to correctly
        report auxiliary function changes.  It outputs the new transform (both full and
        reduced dimension), and the model with newly estimated and transformed means.

 \section transform_mllt Global Semi-tied Covariance (STC) / Maximum Likelihood Linear Transform (MLLT) estimation

  Global STC/MLLT is a square feature-transformation matrix.  For more details,
  see "Semi-tied Covariance Matrices for Hidden Markov Models", by Mark Gales,
  IEEE Transactions on Speech and Audio Processing, vol. 7, 1999, pages 272-281.
  Viewing it as a feature-space transform, the objective function is the average
  per-frame log-likelihood of the transformed features given the model, plus the
  log determinant of the transform.  The means of the model are also rotated by
  the transform in the update phase.  The sufficient statistics are the following,
  for \f$ 0 \leq i < D \f$ where D is the feature dimension:
 \f[
   \mathbf{G}^{(i)} = \sum_{t,j,m} \gamma_{jm}(t) \frac{1}{ \sigma^2_{jm}(i) } (\mu_{jm} - \mathbf{x}(t)) (\mu_{jm} - \mathbf{x}(t))^T
 \f]
  See the reference, Equations (22) and (23) for the update equations.  These are
  basically a simplified form of the diagonal row-by-row Constrained MLLR/fMLLR update
  equations, where the first-order term of the quadratic equation disappears.  Note that
  our implementation differs from that reference by using a column of the inverse of the matrix
  rather than the cofactor, since multiplying by the determinant does not make a difference to the
  result and could potentially cause problems with floating-point underflow or overflow.

  We describe the overall process as if we are doing MLLT on top of LDA features,
  but it is also applicable on top of traditional delta features.  See the script
  rm_recipe_2/steps/train_tri2f for an example.  The process is as follows:

  - Estimate the LDA transformation matrix (we only need the first rows of this, not the full matrix).
    Call this matrix \f$\mathbf{M}\f$.
  - Start a normal model building process, always using features transformed with \f$\mathbf{M}\f$.
    At certain selected iterations (where we will update the MLLT matrix), we do the following:
      - Accumulate MLLT statistics in the current fully-transformed space
        (i.e., on top of features transformed with \f$\mathbf{M}\f$).  For efficiency we do this using
        a subset of the training data.
      - Do the MLLT update; let this produce a square matrix \f$\mathbf{T}\f$.
      - Transform the model means by setting \f$ \mu_{jm} \leftarrow \mathbf{T} \mu_{jm} \f$.
      - Update the current transform by setting \f$ \mathbf{M} \leftarrow \mathbf{T} \mathbf{M} \f$

  The programs involved in MLLT estimation are gmm-acc-mllt and est-mllt.  We also need the
  programs gmm-transform-means (to transform the Gaussian means using \f$\mathbf{T}\f$), and
  compose-transforms (to do the multiplication \f$\mathbf{M} \leftarrow \mathbf{T} \mathbf{M} \f$).
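
  A single MLLT iteration might therefore look something like the following sketch
  (the argument order here is approximate, so check each program's usage message,
  and the filenames are just illustrative):
\verbatim
 ali-to-post ark:$dir/cur.ali ark:- | \
   gmm-acc-mllt $dir/cur.mdl "$feats" ark:- $dir/mllt.acc
 est-mllt $dir/T.mat $dir/mllt.acc
 gmm-transform-means $dir/T.mat $dir/cur.mdl $dir/next.mdl
 compose-transforms $dir/T.mat $dir/cur.mat $dir/next.mat
\endverbatim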


 \section transform_cmllr_global Global CMLLR/fMLLR transforms

  Constrained Maximum Likelihood Linear Regression (CMLLR), also known as feature-space MLLR (fMLLR),
  is an affine feature transform of the form \f$ \mathbf{x} \rightarrow \mathbf{A} \mathbf{x}  + \mathbf{b} \f$,
  which we write in the form  \f$ \mathbf{x} \rightarrow \mathbf{W} \mathbf{x}^+ \f$, where
  \f$\mathbf{x}^+ = \left[\begin{array}{c} \mathbf{x} \\ 1 \end{array} \right]\f$ is the feature with
  a 1 appended.  Note that this differs from some of the literature where the 1 comes first.

  For a review paper that explains CMLLR and the estimation techniques we use, see
 "Maximum likelihood linear transformations for HMM-based speech recognition" by Mark Gales,
  Computer Speech and Language Vol. 12, pages 75-98.

  The sufficient statistics we store are:
  \f[ \mathbf{K} = \sum_{t,j,m} \gamma_{jm}(t) \Sigma_{jm}^{-1} \mu_{jm} \left.\mathbf{x}(t)^+\right.^T \f]
  where \f$\Sigma_{jm}^{-1}\f$ is the inverse covariance matrix,
  and for \f$0 \leq i < D \f$ where D is the feature dimension,
  \f[ \mathbf{G}^{(i)} = \sum_{t,j,m} \gamma_{jm}(t) \frac{1}{\sigma^2_{jm}(i)} \mathbf{x}(t)^+  \left.\mathbf{x}(t)^+\right.^T \f]

  Our estimation scheme is the standard one, see Appendix B of the reference (in particular section B.1,
  "Direct method over rows").  We differ by using a column of the inverse in place of the cofactor row,
  i.e. ignoring the factor of the determinant, as it does not affect the result and causes danger of
  numerical underflow or overflow.

  Estimation of global Constrained MLLR (CMLLR) transforms is done by the
  class FmllrDiagGmmAccs,
  and by the program gmm-est-fmllr (also see gmm-est-fmllr-gpost).  The syntax
  of gmm-est-fmllr is:
\verbatim
gmm-est-fmllr [options] <model-in> <feature-rspecifier> \
   <post-rspecifier> <transform-wspecifier>
\endverbatim
 The "<post-rspecifier>" item corresponds to posteriors at the transition-id level
 (see \ref hmm_post).  The program writes out a table of CMLLR transforms
  indexed by utterance by default, or if the --spk2utt option is given, indexed by speaker.

  Below is a simplified extract of a script
  (rm_recipe_2/steps/decode_tri_fmllr.sh) that estimates and uses CMLLR transforms based
  on alignments from a previous, unadapted decoding.  The previous decoding is assumed
  to be with the same model (otherwise we would have to convert the alignments with
 the program "convert-ali").
\verbatim
...
silphones=48 # colon-separated list with one phone-id in it.
mincount=500 # min-count to estimate an fMLLR transform
sifeats="ark:add-deltas --print-args=false scp:data/test.scp ark:- |"

# The next command computes the fMLLR transforms.
ali-to-post ark:$srcdir/test.ali ark:- | \
  weight-silence-post 0.0 $silphones $model ark:- ark:- | \
  gmm-est-fmllr --fmllr-min-count=$mincount \
    --spk2utt=ark:data/test.spk2utt $model "$sifeats" \
   ark,o:- ark:$dir/test.fmllr 2>$dir/fmllr.log

feats="ark:add-deltas --print-args=false scp:data/test.scp ark:- |
  transform-feats --utt2spk=ark:data/test.utt2spk ark:$dir/test.fmllr
       ark:- ark:- |"

# The next command decodes the data.
gmm-decode-faster --beam=30.0 --acoustic-scale=0.08333 \
  --word-symbol-table=data/words.txt $model $graphdir/HCLG.fst \
 "$feats" ark,t:$dir/test.tra ark,t:$dir/test.ali 2>$dir/decode.log
\endverbatim

 \section transform_lvtln Linear VTLN (LVTLN)

 In recent years, there have been a number of papers that describe
 implementations of Vocal Tract Length Normalization (VTLN) that
 work out a linear feature transform corresponding to each VTLN
 warp factor.  See, for example, ``Using VTLN for broadcast news transcription'',
 by D. Y. Kim, S. Umesh, M. J. F. Gales, T. Hain and P. C. Woodland, ICSLP 2004.

 We implement a method in this general category using the class LinearVtln, and programs
 such as gmm-init-lvtln, gmm-train-lvtln-special, and gmm-est-lvtln-trans.
 The LinearVtln object essentially stores a set of linear feature transforms,
 one for each warp factor.  Let these linear feature transform matrices
 be
   \f[\mathbf{A}^{(i)},  0\leq i < N,  \f]
 where for instance we might have \f$N\f$=31, corresponding to 31 different warp
 factors.  We describe below how we obtain these matrices.
 The way the speaker-specific transform is estimated is as follows.
 First, we require some kind of model and a corresponding alignment.  In the
 example scripts we do this either with a small monophone model, or with
 a full triphone model.  From this model and alignment, and using the original,
 unwarped features, we compute the conventional statistics for estimating
 CMLLR.  When computing the LVTLN transform, what we do is take each matrix
 \f$\mathbf{A}^{(i)}\f$, and compute the  offset vector \f$\mathbf{b}\f$ that
 maximizes the CMLLR auxiliary function for the transform
  \f$\mathbf{W} = \left[  \mathbf{A}^{(i)} \, ; \, \mathbf{b} \right]\f$.
 The value of \f$\mathbf{W}\f$ that gives the best auxiliary function value
 (i.e. maximizing over i) becomes the transform for that speaker.  Since we
 are estimating a mean offset here,
 we are essentially combining a kind of model-based cepstral mean normalization
 (or alternatively an offset-only form of CMLLR) with VTLN warping implemented
 as a linear transform.  This avoids us having to implement mean normalization
 as a separate step.

 We next describe how we estimate the matrices \f$\mathbf{A}^{(i)}\f$.  We
 don't do this in the same way as described in the referenced paper; our method
 is simpler (and easier to justify).  Here we describe our computation for a
 particular warp factor; in the current scripts we have 31 distinct warp
 factors ranging from 0.85 to 1.15 in steps of 0.01 (i.e. 0.85, 0.86, ..., 1.15).
 We take a subset of feature data (e.g. several tens of utterances),
 and for this subset we compute both the original and transformed features,
 where the transformed features are computed using a conventional VTLN computation
 (see \ref feat_vtln).
 Call the original and transformed features \f$\mathbf{x}(t)\f$ and \f$\mathbf{y}(t)\f$ respectively,
 where \f$t\f$ will range over the frames of the selected utterances.
 We compute the affine transform that maps \f$\mathbf{x}\f$ to \f$\mathbf{y}\f$ in a least-squares
 sense, i.e. if \f$\mathbf{y}' = \mathbf{A} \mathbf{x} + \mathbf{b}\f$,
 we compute the \f$\mathbf{A}\f$ and \f$\mathbf{b}\f$ that minimize the sum-of-squares
 difference \f$\sum_t (\mathbf{y}'(t) - \mathbf{y}(t) )^T (\mathbf{y}'(t) - \mathbf{y}(t) )\f$.
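 This least-squares problem has the standard closed-form solution (the code may organize
 the computation differently, but the result is the same): writing \f$\bar{\mathbf{x}}\f$ and
 \f$\bar{\mathbf{y}}\f$ for the means of the original and VTLN-warped features,
 \f[
   \mathbf{A} = \left( \sum_t (\mathbf{y}(t)-\bar{\mathbf{y}})(\mathbf{x}(t)-\bar{\mathbf{x}})^T \right)
                \left( \sum_t (\mathbf{x}(t)-\bar{\mathbf{x}})(\mathbf{x}(t)-\bar{\mathbf{x}})^T \right)^{-1},
   \ \ \ \mathbf{b} = \bar{\mathbf{y}} - \mathbf{A} \bar{\mathbf{x}} .
 \f]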
 Then we normalize the diagonal variance as follows: we compute the
 variance of the original features as \f$\mathbf{\Sigma}^{(x)}\f$ and of the linearly transformed
 features as \f$\mathbf{\Sigma}^{(y')}\f$, and for each dimension index d we multiply the
 d'th row of \f$\mathbf{A}\f$ by
  \f$\sqrt{ \frac{\mathbf{\Sigma}^{(x)}_{d,d}}{\mathbf{\Sigma}^{(y')}_{d,d}}}\f$.
 The resulting matrix will become \f$\mathbf{A}^{(i)}\f$ for some value of i.

 The command-line tools support the option to ignore the log determinant term
 when evaluating which of the transform matrices to use (e.g., you can set
 --logdet-scale=0.0).  Under certain circumstances this appears to improve
 results; ignoring the log determinant always makes the distribution of warp
 factors more bimodal because the log determinant is never positive and is zero
 for a warp factor of 1.0, so the log determinant essentially acts as a penalty
 on warp factors that are far away from 1.  However, for certain types of
 features (in particular, features derived from LDA), ignoring the log
 determinant makes results a lot worse and leads to very odd distributions of
 warp factors, so our example scripts always use the log-determinant.  This is
 anyway the "right" thing to do.

 The internal C++ code supports accumulating statistics for Maximum Likelihood
 re-estimation of the transform matrices \f$\mathbf{A}^{(i)}\f$.  Our expectation
 was that this would improve results.  However, it led to a degradation in
 performance so we do not include example scripts for doing this.


 \section transform_et Exponential Transform (ET)

 The Exponential Transform (ET) is another approach to computing a VTLN-like
 transform, but unlike Linear VTLN we completely sever the connection
 to frequency warping, and learn it in a data-driven way.  For normal
 training data, we find that it does learn something very similar to
 conventional VTLN.

 ET is a transform of the form:
\f[
  \mathbf{W}_s = \mathbf{D}_s \exp ( t_s \mathbf{A} ) \mathbf{B} ,
\f]
 where exp is the matrix exponential function, defined via a Taylor
 series in \f$\mathbf{A}\f$ that is the same as the Taylor series for
 the scalar exponential function.  Quantities with a subscript "s"
 are speaker-specific; other quantities (i.e. \f$\mathbf{A}\f$ and
 \f$\mathbf{B}\f$) are global and shared across all speakers.

 The most important factor in this equation is the middle one,
 with the exponential function in it.
 The factor \f$\mathbf{D}_s\f$ gives us the ability to combine
 model-based mean and optionally variance normalization (i.e. offset-only
 or diagonal-only CMLLR)
 with this technique, and the factor \f$\mathbf{B}\f$ allows the transform to include
 MLLT (a.k.a. global STC), and is also a byproduct of the process
 of renormalizing the \f$t_s\f$ quantities on each iteration of
 re-estimation.  The dimensions of these quantities are as follows,
  where D is the feature dimension:
\f[
   \mathbf{D}_s \in \Re^{D \times (D+1)}, \ t_s \in \Re, \  \mathbf{A} \in \Re^{(D+1)\times(D+1)}, \ \mathbf{B} \in \Re^{(D+1)\times (D+1)}  .
\f]
 Note that if \f$\mathbf{D}_s\f$ were a completely unconstrained CMLLR matrix,
 there would be no point to this technique as the other quantities in the
 equation would add no degrees of freedom.  The tools support three kinds of
 constraints on \f$\mathbf{D}_s\f$: it may be of the form
 \f$[ {\mathbf I} \, \;\, {\mathbf 0} ]\f$ (no adaptation), or
 \f$[ {\mathbf I} \, \;\, {\mathbf m} ]\f$ (offset only), or
 \f$[ {\mathrm{diag}}( {\mathbf d} ) \, \;\, {\mathbf m} ]\f$ (diagonal CMLLR);
  this is controlled by the --normalize-type option to the command-line tools.
 The last rows of \f$\mathbf{A}\f$ and \f$\mathbf{B}\f$ are
 fixed at particular values (these rows are involved in propagating the
 last vector element with value 1.0, which is appended to the feature in order
 to express an affine transform as a matrix).  The last row
 of \f$\mathbf{A}\f$ is fixed at zero and the last row
 of \f$\mathbf{B}\f$ is fixed at \f$[ 0\ 0\ 0 \ \ldots\ 0 \ 1]\f$.

 The speaker-specific quantity \f$t_s\f$ may be interpreted
 very loosely as the log of the speaker-specific warp factor.
 The basic intuition behind the use of the exponential function is that
 if we were to warp by a factor f and then a factor g,
 this should be the same as warping by the combined factor
 fg.  Let l = log(f) and m = log(g).  Then we achieve this
 property via the identity
  \f[ \exp( l \mathbf{A} ) \exp( m \mathbf{A}) = \exp( (l+m) \mathbf{A} ) . \f]

 The ET computation for a particular speaker is as follows; this assumes we
 are given \f$\mathbf{A}\f$ and \f$\mathbf{B}\f$.  We accumulate conventional
 CMLLR sufficient statistics for the speaker.  In the update phase we iteratively optimize
 \f$t_s\f$ and \f$\mathbf{D}_s\f$ to maximize the auxiliary function.
  The update for \f$t_s\f$ is an iterative procedure based on Newton's method;
 the update for \f$\mathbf{D}_s\f$ is based on the conventional CMLLR
 update,  specialized for the diagonal or offset-only case, depending on
  the exact constraints we are putting on \f$\mathbf{D}_s\f$.

 The overall training-time computation is as follows:
  - First, initialize \f$\mathbf{B}\f$ to the identity and \f$\mathbf{A}\f$ to
    a random matrix with zero final row.

 Then, starting with some known model, start an iterative E-M process.
 On each iteration, we first estimate the speaker-specific parameters
 \f$t_s\f$ and \f$\mathbf{D}_s\f$, and compute the transforms \f$\mathbf{W}_s\f$
 that result from them.  Then we choose to update either \f$\mathbf{A}\f$, or
 \f$\mathbf{B}\f$, or the model.
   - If updating \f$\mathbf{A}\f$, we do this given fixed values of
     \f$t_s\f$ and \f$\mathbf{D}_s\f$.  The update is not guaranteed to
     converge, but converges rapidly in practice; it's based on a
     quadratic "weak-sense auxiliary function"
     where the quadratic term is obtained using a first-order truncation
     of the Taylor series expansion of the matrix exponential function.
     After updating \f$\mathbf{A}\f$, we modify \f$\mathbf{B}\f$ in order
      to renormalize the average of the \f$t_s\f$ to zero; this involves premultiplying
     \f$\mathbf{B}\f$  by \f$\exp(t \mathbf{A})\f$, where t is the average
     value of \f$t_s\f$.

   - If updating \f$\mathbf{B}\f$, this is also done using fixed values of
     \f$t_s\f$ and \f$\mathbf{D}_s\f$, and the update is similar to MLLT
     (a.k.a. global STC).
     For purposes of the accumulation and update, we imagine we are estimating
      an MLLT matrix just to the left of \f$\exp(t_s \mathbf{A})\f$, i.e. some matrix
     \f$\mathbf{C} \in \Re^{D\times D}\f$; let us define
     \f$\mathbf{C}^+ = \left[ \begin{array}{cc} \mathbf{C} & 0 \\ 0 & 1 \end{array} \right]\f$.
     The transform will be
     \f$\mathbf{W}_s = \mathbf{D}_s \mathbf{C}^+ \exp ( t_s \mathbf{A} ) \mathbf{B}\f$.
     Conceptually, while estimating \f$\mathbf{C}\f$ we view \f$\mathbf{D}_s\f$ as
      a model-space transform creating speaker-specific models, which is only possible
     due to the diagonal structure of \f$\mathbf{D}_s\f$; and we view
     \f$\exp ( t_s \mathbf{A} ) \mathbf{B}\f$ as a feature-space transform (i.e.
     as part of the features).  After estimating \f$\mathbf{C}\f$, we will use the identity
\f[
   \mathbf{C}^+ \exp ( t_s \mathbf{A} ) =  \exp ( t_s \mathbf{C}^+ \mathbf{A}  \left.\mathbf{C}^+\right.^{-1} ) \mathbf{C}^+
\f]
  so the update becomes:
\f[
        \mathbf{A} \leftarrow \mathbf{C}^+ \mathbf{A}  \left.\mathbf{C}^+\right.^{-1} , \ \ \mathbf{B} \leftarrow \mathbf{C}^+ \mathbf{B} .
\f]
     At this point we need to transform the model means with the matrix
     \f$\mathbf{C}\f$.  The reader might question how this interacts with the
     fact that for estimating \f$\mathbf{C}\f$, we viewed the quantity
     \f$\mathbf{D}_s\f$ as a model-space transform.  If \f$\mathbf{D}_s\f$ only
     contains a mean offset, we can still prove that the auxiliary function
     would increase, except we would have to change the offsets appropriately
     (this is not necessary to do explicitly, as we will re-estimate them on
     the next iteration anyway).  However, if \f$\mathbf{D}_s\f$ has non-unit
     diagonal (i.e. is diagonal not offset CMLLR),  this re-estimation process
     is not guaranteed to improve the likelihood; the tools will print a warning
     in this case.  In order to avoid encountering this case, our scripts
     train in a mode where \f$\mathbf{D}_s\f$ is an offset-only transform; but
     in test time we allow \f$\mathbf{D}_s\f$ to be a diagonal CMLLR transform, which seems
     to give slightly better results than the offset-only case.

   - Updating the model is straightforward; it just involves training on the adapted
     features.

  Important programs related to the use of exponential transforms are as follows:
   - gmm-init-et initializes the exponential transform object (that contains A and B) and writes it to disk; the initialization of A is random.
   - gmm-est-et estimates the exponential transforms for a set of speakers; it reads the exponential transform object, the model, the features and \ref hmm_gpost "Gaussian-level posteriors", and it writes out the transforms \f$\mathbf{W}_s\f$ and optionally the "warp factors" \f$t_s\f$.
   - gmm-et-acc-a accumulates statistics for updating \f$\mathbf{A}\f$, and gmm-et-est-a does the corresponding update.
   - gmm-et-acc-b accumulates statistics for updating \f$\mathbf{B}\f$, and gmm-et-est-b does the corresponding update.

\section transform_cmvn Cepstral mean and variance normalization

Cepstral mean and variance normalization consists of normalizing the mean
and variance of the raw cepstra, usually to give zero-mean, unit-variance
cepstra, either on a per-utterance or per-speaker basis.  We provide code
to support this, and some example scripts, but we do not particularly recommend its use.
In general we prefer model-based approaches to mean and variance normalization;
e.g., our code for \ref transform_lvtln also learns a mean offset and the code
for \ref transform_et does a diagonal CMLLR transform that has the same power as
cepstral mean and variance normalization (except usually applied to the fully
expanded features).  For very fast operation, it is possible to apply these
approaches using a very tiny model with a phone-based language model, and some of
our example scripts demonstrate this.  There is also the capability in the
feature extraction code to subtract the mean on a per-utterance basis (the
--subtract-mean option to compute-mfcc-feats and compute-plp-feats).

In order to support per-utterance and per-speaker mean and variance normalization
we provide the programs compute-cmvn-stats and apply-cmvn.  The program
compute-cmvn-stats will, by default, compute the sufficient statistics for mean
and variance normalization, as a matrix (the format is not very important; see
the code for details), and will write out a table of these statistics indexed by
utterance-id.  If it is given the --spk2utt option, it will write out the
statistics on a per-speaker basis instead (warning: before using this option,
read \ref io_sec_bloat, as this option causes the input features to be read in
random-access mode).  The program "apply-cmvn" reads in features and cepstral
mean and variance statistics; the statistics are expected to be indexed per
utterance by default, or per speaker if the --utt2spk option is applied.  It
writes out the features after mean and variance normalization.  These programs,
despite the names, do not care whether the features in question consist of
cepstra or anything else; they simply regard them as matrices.  Of course, the
features supplied to compute-cmvn-stats and apply-cmvn must have the same
dimension.
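
A typical use might look like the following (per-speaker normalization; the filenames
are just illustrative):
\verbatim
compute-cmvn-stats --spk2utt=ark:data/train/spk2utt scp:data/train/feats.scp \
   ark:$dir/cmvn.ark
feats="ark:apply-cmvn --utt2spk=ark:data/train/utt2spk ark:$dir/cmvn.ark \
   scp:data/train/feats.scp ark:- |"
\endverbatim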

We note that it would probably be more consistent with the overall design of the
feature transformation code to supply a version of compute-cmvn-stats that would
write out the mean and variance normalizing transforms as generic affine
transforms (in the same format as CMLLR transforms), so that they could be
applied by the program transform-feats, and composed as needed with other
transforms using compose-transforms.  If needed we may supply such a program, but
because we don't regard mean and variance normalization as an important part of
any recipes, we have not done so yet.


\section transform_regtree Building regression trees for adaptation

  Kaldi supports regression-tree MLLR and CMLLR (also known as fMLLR).  For
  an overview of regression trees, see "The generation and use of regression class trees for MLLR
  adaptation" by M. J. F. Gales, CUED technical report, 1996.




*/

}