// doc/model.dox


// Copyright 2009-2011 Microsoft Corporation

// See ../../COPYING for clarification regarding multiple authors
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at

//  http://www.apache.org/licenses/LICENSE-2.0

// THIS CODE IS PROVIDED *AS IS* BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, EITHER EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION ANY IMPLIED
// WARRANTIES OR CONDITIONS OF TITLE, FITNESS FOR A PARTICULAR PURPOSE,
// MERCHANTABLITY OR NON-INFRINGEMENT.
// See the Apache 2 License for the specific language governing permissions and
// limitations under the License.

namespace kaldi {

/**
  \page model Acoustic modeling code

  \section model_intro Introduction

  We will start with a few words about the general philosophy of our modeling
  code, and why we chose this path.  Our aim is for Kaldi to support conventional
  models (i.e. diagonal GMMs) and Subspace Gaussian Mixture Models (SGMMs), but
  also to be easily extensible to new kinds of model.  In a previous iteration of
  designing this software, we used a virtual base class that both the GMM and
  SGMM classes inherited from, and wrote command-line tools that handled both types of
  model.  Our experience was that a base class is not as useful as one might
  think, because there are too many differences in the models (e.g. they support
  different types of adaptation), and we were forced to constantly expand the
  base-class so that our supposedly "generic" code could access functionality
  specific to one model or the other.  Eventually our command-line tools reached
  a state where they were almost impossible to modify.

  When redesigning the code, we decided on a more "modern" software engineering
  approach that focused less on using class hierarchies to capture commonalities,
  and more on creating simple, reusable components.  For example, our decoder
  code (see \ref decoders) is generic because its requirements are very limited:
  it requires only that we create an object inheriting from the simple base-class
  DecodableInterface, which behaves much like a matrix of acoustic likelihoods for
  an utterance.  Individual command-line tools generally have simple and
  limited functionalities (e.g. gmm-align produces state-level alignments of 
  utterances given a diagonal GMM).  The idea is that implementing a new
  technique will generally involve creating a new command-line program, rather
  than increasing the complexity of any of the existing command-line programs.
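
  To illustrate why such a narrow interface is enough for generic code, here is a
  minimal sketch.  The names DecodableSketch, MatrixDecodable and BestIndexPerFrame
  are hypothetical and chosen for this example only; they are not the actual
  declarations used by the decoders, and the "decoder" is a toy that picks the
  best-scoring index for each frame.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-in for a decodable interface: generic code sees only
// "log-likelihood of index i at frame t", never the model that produced it.
class DecodableSketch {
 public:
  virtual ~DecodableSketch() {}
  virtual float LogLikelihood(std::size_t frame, std::size_t index) const = 0;
  virtual std::size_t NumFrames() const = 0;
  virtual std::size_t NumIndices() const = 0;
};

// Simplest possible implementation: a precomputed frames-by-indices matrix.
class MatrixDecodable : public DecodableSketch {
 public:
  explicit MatrixDecodable(std::vector<std::vector<float> > loglikes)
      : loglikes_(std::move(loglikes)) {}
  float LogLikelihood(std::size_t frame, std::size_t index) const override {
    assert(frame < loglikes_.size() && index < loglikes_[frame].size());
    return loglikes_[frame][index];
  }
  std::size_t NumFrames() const override { return loglikes_.size(); }
  std::size_t NumIndices() const override {
    return loglikes_.empty() ? 0 : loglikes_[0].size();
  }

 private:
  std::vector<std::vector<float> > loglikes_;
};

// Toy "decoder": uses only the interface, so it works with any model type
// for which someone writes a DecodableSketch subclass.
std::vector<std::size_t> BestIndexPerFrame(const DecodableSketch &d) {
  std::vector<std::size_t> best(d.NumFrames(), 0);
  for (std::size_t t = 0; t < d.NumFrames(); ++t)
    for (std::size_t i = 1; i < d.NumIndices(); ++i)
      if (d.LogLikelihood(t, i) > d.LogLikelihood(t, best[t])) best[t] = i;
  return best;
}
```

  Supporting a new model type in this sketch would mean writing one more subclass
  of DecodableSketch; BestIndexPerFrame itself would not change.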

 \section model_diag Diagonal GMMs


  The class DiagGmm represents a single diagonal-covariance Gaussian Mixture Model.
  An acoustic model based on a collection of objects of type DiagGmm, indexed by
  zero-based "pdf-ids", is implemented as class AmDiagGmm.  You can think of
  AmDiagGmm as a vector of DiagGmm objects, although it has a slightly richer interface
  than that.  Representing an acoustic model as a collection of individual models, one for
  each p.d.f., is not the way we imagine all models would be represented; for example,
  SGMMs cannot be represented that way, and if we implemented GMMs with tying of
  Gaussians among states we would not be able to represent the pdfs separately.

 \subsection model_diag_gmm Individual GMMs

  Class DiagGmm is, conceptually, a fairly simple and passive object that stores the
  parameters of a Gaussian Mixture Model and has member functions that compute
  likelihoods.  It does not "know anything" about how it will be used; it just
  provides access to its members.  It does not handle accumulation or update;
  for that, see below, or class MlEstimateDiagGmm.  The class DiagGmm stores
  its parameters as inverse variances and means multiplied by inverse variances.
  This means that likelihoods can be computed with simple dot products.
  The "gconsts" (i.e. the precomputed constant terms in the likelihood) are
  different from, say, HTK's gconsts because they depend on the mean also. 
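
  The following sketch shows how this storage scheme reduces the log-likelihood
  of one weighted diagonal Gaussian to a constant plus two dot products, and why
  the constant depends on the mean.  StoredGauss, MakeStoredGauss and this
  LogLikelihood function are hypothetical names used for illustration; they are
  not the interface of DiagGmm.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical struct: one weighted diagonal Gaussian in the stored form.
struct StoredGauss {
  std::vector<double> inv_var;       // 1 / var[d]
  std::vector<double> mean_inv_var;  // mean[d] / var[d]
  double gconst;                     // every term that does not depend on x
};

// Converts (weight, mean, var) into the stored form.  The gconst absorbs the
// term -0.5 * mean^2 / var, which is why it depends on the mean.
StoredGauss MakeStoredGauss(double weight, const std::vector<double> &mean,
                            const std::vector<double> &var) {
  assert(weight > 0.0 && mean.size() == var.size());
  const double log_2pi = std::log(2.0 * std::acos(-1.0));
  StoredGauss g;
  g.gconst = std::log(weight) - 0.5 * log_2pi * static_cast<double>(mean.size());
  for (std::size_t d = 0; d < mean.size(); ++d) {
    assert(var[d] > 0.0);
    g.inv_var.push_back(1.0 / var[d]);
    g.mean_inv_var.push_back(mean[d] / var[d]);
    g.gconst -= 0.5 * (std::log(var[d]) + mean[d] * mean[d] / var[d]);
  }
  return g;
}

// log(weight * N(x; mean, var)) = gconst + dot(mean_inv_var, x)
//                                 - 0.5 * dot(inv_var, x squared)
double LogLikelihood(const StoredGauss &g, const std::vector<double> &x) {
  assert(x.size() == g.inv_var.size());
  double linear = 0.0, quadratic = 0.0;
  for (std::size_t d = 0; d < x.size(); ++d) {
    linear += g.mean_inv_var[d] * x[d];
    quadratic += g.inv_var[d] * x[d] * x[d];
  }
  return g.gconst + linear - 0.5 * quadratic;
}
```

  Expanding the square in the usual Gaussian exponent gives exactly these three
  terms, so the result agrees with the textbook formula.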

  Since it is quite complicated to modify the Gaussian parameters in this form,
  we also provide a class DiagGmmNormal, which contains the parameters in
  a simpler, more obvious form, and we provide functions to convert back and
  forth between DiagGmm and DiagGmmNormal representations.  Most of the update
  code works with the DiagGmmNormal representation.
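
  As a sketch of this convert, edit, convert-back pattern, consider the code
  below.  NormalForm, StoredForm, ToStored, ToNormal and ShiftMean are hypothetical
  names for this example and do not reproduce the members of DiagGmm or
  DiagGmmNormal; weights and gconsts are left out for brevity, and in a complete
  version the gconsts would need recomputing after any change to the means or
  variances.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical "normal" form: plain means and variances.
struct NormalForm {
  std::vector<double> mean;
  std::vector<double> var;
};

// Hypothetical stored form: inverse variances and means times inverse variances.
struct StoredForm {
  std::vector<double> inv_var;
  std::vector<double> mean_inv_var;
};

StoredForm ToStored(const NormalForm &n) {
  assert(n.mean.size() == n.var.size());
  StoredForm s;
  for (std::size_t d = 0; d < n.mean.size(); ++d) {
    assert(n.var[d] > 0.0);
    s.inv_var.push_back(1.0 / n.var[d]);
    s.mean_inv_var.push_back(n.mean[d] / n.var[d]);
  }
  return s;
}

NormalForm ToNormal(const StoredForm &s) {
  assert(s.inv_var.size() == s.mean_inv_var.size());
  NormalForm n;
  for (std::size_t d = 0; d < s.inv_var.size(); ++d) {
    assert(s.inv_var[d] > 0.0);
    n.var.push_back(1.0 / s.inv_var[d]);
    n.mean.push_back(s.mean_inv_var[d] / s.inv_var[d]);
  }
  return n;
}

// Shifting a mean is a one-line change in the normal form, so the update
// converts, edits, and converts back rather than editing the stored form.
StoredForm ShiftMean(const StoredForm &s, std::size_t d, double delta) {
  NormalForm n = ToNormal(s);
  assert(d < n.mean.size());
  n.mean[d] += delta;
  return ToStored(n);
}
```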

  \subsection model_diag_am GMM-based acoustic model

 Class AmDiagGmm represents a collection of DiagGmm objects, indexed by \ref pdf_id "pdf-id".
 This class does not represent an HMM-GMM, just a collection of GMMs.  Putting it 
 together with the HMM structure is the responsibility of other code, principally
 the topology and transition-modeling code and the code responsible for compiling
 decoding graphs (see \ref hmm).  We mention at this point that
 we never write an object of type AmDiagGmm to disk on its own; instead
 we write an object of type TransitionModel and then an object of type AmDiagGmm.
 This is simply a convenience to avoid having to write too many separate files to disk,
 since normally we update the Gaussians and the transitions at the same time.
 The idea is that with other model types we would create a file with
 a TransitionModel and then an object of some other model type.  This way, programs that need
 to read only the transition model (e.g. for graph creation) can read the
 file without needing to know the type of the model.
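
 The benefit of this ordering can be sketched with a toy text format.  Everything
 in the example (the record layout, the tags such as "<ToyTransitionModel>", and
 the functions WriteRecord, ReadRecord, WriteModelFile and ReadTransitionsOnly) is
 hypothetical and does not reflect the real on-disk format or the real Read and
 Write functions.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: writes "tag n v1 ... vn" as one text record.
void WriteRecord(std::ostream &os, const std::string &tag,
                 const std::vector<double> &values) {
  os << tag << ' ' << values.size();
  for (std::size_t i = 0; i < values.size(); ++i) os << ' ' << values[i];
  os << '\n';
}

// Hypothetical helper: reads one record and checks that it has the expected tag.
std::vector<double> ReadRecord(std::istream &is, const std::string &tag) {
  std::string found;
  std::size_t n = 0;
  is >> found >> n;
  assert(!is.fail() && found == tag);
  std::vector<double> values(n);
  for (std::size_t i = 0; i < n; ++i) is >> values[i];
  assert(!is.fail());
  return values;
}

// Writes the transition record first, then a model record with any tag.
void WriteModelFile(std::ostream &os, const std::vector<double> &transitions,
                    const std::string &model_tag,
                    const std::vector<double> &model_params) {
  WriteRecord(os, "<ToyTransitionModel>", transitions);
  WriteRecord(os, model_tag, model_params);
}

// What a graph-building tool would do: read the transitions and stop, without
// looking at (or knowing the type of) whatever follows in the file.
std::vector<double> ReadTransitionsOnly(std::istream &is) {
  return ReadRecord(is, "<ToyTransitionModel>");
}
```

 In this sketch ReadTransitionsOnly succeeds whether the second record is tagged
 as a GMM collection or as some other model type, because it never reads that far.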

 Class AmDiagGmm is a fairly simple object and does not take responsibility
 for such things as model estimation (e.g. see AccumAmDiagGmm) or transform
 estimation (various pieces of code do this; see \ref transform).

 \subsection model_full_gmm Full-covariance GMMs

 We have a class \ref FullGmm for full-covariance GMMs, which has similar functionality
 to the \ref DiagGmm class but with full covariances.  This is mainly of use for training
 full-covariance Universal Background Models (UBMs) in the SGMM recipe (see below).
 The only command-line tools available for full GMMs are used to train global mixture models
 (i.e. UBMs); we have not implemented a full-covariance version of the AmDiagGmm class
 or the corresponding command-line tools, although doing so would be fairly easy.

 \section model_sgmm Subspace Gaussian Mixture Models (SGMMs)

 Subspace Gaussian Mixture Models (SGMMs) are implemented by class
 AmSgmm.  This class essentially implements the approach described in
 ``The Subspace Gaussian Mixture Model -- a Structured Model for Speech 
 Recognition'', by D. Povey, Lukas Burget et al., Computer Speech and Language,
 2011.
 The class AmSgmm represents a whole collection of pdfs; there
 is no class that represents a single pdf of the SGMM (as there is for
 GMMs).  Estimation of SGMMs is handled (at the C++ level) by the classes
 \ref MleAmSgmmAccs and MleAmSgmmUpdater.  

 For example scripts that demonstrate how to
 build an SGMM-based system, see egs/rm/s1/steps/train_ubma.sh,
 egs/rm/s1/steps/train_sgmma.sh, and egs/rm/s1/steps/decode_sgmma.sh.


*/
}