src/doc/feat.dox 7.56 KB
8dcb6dfcb   Yannick Estève   first commit
  // doc/feat.dox
  
  
  // Copyright 2009-2011 Microsoft Corporation
  
  // See ../../COPYING for clarification regarding multiple authors
  //
  // Licensed under the Apache License, Version 2.0 (the "License");
  // you may not use this file except in compliance with the License.
  // You may obtain a copy of the License at
  
  //  http://www.apache.org/licenses/LICENSE-2.0
  
  // THIS CODE IS PROVIDED *AS IS* BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
  // KIND, EITHER EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION ANY IMPLIED
  // WARRANTIES OR CONDITIONS OF TITLE, FITNESS FOR A PARTICULAR PURPOSE,
  // MERCHANTABLITY OR NON-INFRINGEMENT.
  // See the Apache 2 License for the specific language governing permissions and
  // limitations under the License.
  
  namespace kaldi {
  
  /**
     \page feat Feature extraction
  
    \section feat_intro Introduction
  
    Our feature extraction and waveform-reading code aims to create standard MFCC
    and PLP features, setting reasonable defaults but leaving available the options
    that people are most likely to want to tweak (for example, the number of mel
    bins, minimum and maximum frequency cutoffs, and so on).  This code only reads
    from .wav files containing PCM data.  These files commonly have the suffix .wav
    or .pcm (although sometimes the .pcm suffix is applied to NIST SPHERE files; in
    this case the file would have to be converted).  If the source data is not a
    wave file, it is up to the user to find a command-line tool to convert it, but
    to cover a common case, we do provide installation instructions for sph2pipe.
  
    The command-line tools compute-mfcc-feats and compute-plp-feats compute the
    features; as with other Kaldi tools, running them without arguments will give a
    list of options.  The example scripts demonstrate the usage of these tools.
    
    \section feat_mfcc Computing MFCC features
  
    Here we describe how MFCC features are computed by the command-line tool
    compute-mfcc-feats.  This program requires two command-line
    arguments: an rspecifier to read the .wav data (indexed by utterance) and a
    wspecifier to write the features (indexed by utterance); see \ref
    io_sec_tables and \ref io_sec_specifiers for more explanation of these terms.  
    In typical usage, we will write the
    data to one big "archive" file but also write out an "scp" file for easy
    random access; see \ref io_sec_specifiers_both for explanation.  The program
    does not add delta features (for that, see add-deltas).  It accepts an option --channel
    to select the channel (e.g. --channel=0, --channel=1), which is useful when
    reading stereo data.  
  
    The computation of MFCC features is done by an object of type Mfcc, which has
    a function \ref Mfcc::Compute() "Compute()" to compute the features
    from the waveform.
  
    The overall MFCC computation is as follows:
      - Work out the number of frames in the file (typically 25 ms frames shifted
        by 10 ms each time).
      - For each frame:
        - Extract the data, do optional dithering, preemphasis and DC offset removal,
          and multiply it by a windowing function (various options are supported here, e.g. Hamming).
        - Work out the energy at this point (if log-energy is used instead of C0).
        - Do the FFT and compute the power spectrum.
        - Compute the energy in each mel bin; these are e.g. 23 triangular overlapping bins
          whose centers are equally spaced in the mel-frequency domain.
        - Compute the log of the energies and take the cosine transform, keeping
          as many coefficients as specified (e.g. 13).
        - Optionally do cepstral liftering; this is just a scaling of the
          coefficients, which ensures they have a reasonable range.
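
    A few of the steps above can be sketched in standalone C++.  This is only an
    illustration under stated assumptions, not the code used by compute-mfcc-feats:
    the function names NumFrames, ProcessFrame and Cepstrum are hypothetical, a
    trailing partial frame is assumed to be dropped, the window is a plain Hamming
    window, and the cosine transform is an unnormalized DCT-II.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Number of complete frames in a waveform.  Assumption: a trailing partial
// frame is dropped rather than padded.
inline int NumFrames(int num_samples, int frame_length, int frame_shift) {
  if (num_samples < frame_length) return 0;
  return 1 + (num_samples - frame_length) / frame_shift;
}

// DC offset removal, preemphasis and Hamming windowing of one frame, in place.
// The caller chooses preemph_coeff; no default is implied here.
inline void ProcessFrame(std::vector<double> &x, double preemph_coeff) {
  const std::size_t n = x.size();
  if (n < 2) return;  // too short to window

  double mean = 0.0;
  for (double v : x) mean += v;
  mean /= static_cast<double>(n);
  for (double &v : x) v -= mean;  // DC offset removal

  // Preemphasis: x[i] -= coeff * x[i-1], done backwards so that each sample
  // still sees its unmodified predecessor.
  for (std::size_t i = n - 1; i > 0; --i) x[i] -= preemph_coeff * x[i - 1];
  x[0] -= preemph_coeff * x[0];

  const double kPi = 3.14159265358979323846;
  for (std::size_t i = 0; i < n; ++i)  // Hamming window
    x[i] *= 0.54 - 0.46 * std::cos(2.0 * kPi * i / (n - 1));
}

// Unnormalized DCT-II of the log mel energies, keeping num_ceps coefficients.
inline std::vector<double> Cepstrum(const std::vector<double> &log_mel,
                                    int num_ceps) {
  const double kPi = 3.14159265358979323846;
  const std::size_t n = log_mel.size();
  std::vector<double> c(num_ceps, 0.0);
  for (int k = 0; k < num_ceps; ++k)
    for (std::size_t j = 0; j < n; ++j)
      c[k] += log_mel[j] * std::cos(kPi * k * (j + 0.5) / n);
  return c;
}
```

    Under these assumptions, one second of 16 kHz audio (16000 samples) with
    400-sample frames and a 160-sample shift gives NumFrames(16000, 400, 160) == 98.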
       
    The lower and upper cutoff of the frequency range covered by the triangular mel bins
    are controlled by the options --low-freq and --high-freq, which are usually set close
    to zero and the Nyquist frequency respectively, e.g. --low-freq=20 and --high-freq=7800
    for 16 kHz sampled speech.
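
    To make the placement of the mel bins concrete, the following sketch computes
    bin centers that are equally spaced in the mel domain between the two cutoffs,
    and the triangular weight a bin gives to a frequency.  It is an illustration
    only: the names HzToMel, MelToHz, MelBinCenters and MelBinWeight are
    hypothetical, and the mel formula 1127 ln(1 + f/700) is a commonly used
    convention assumed here rather than taken from the text above.

```cpp
#include <cmath>
#include <vector>

// Hz to mel and back, using the widely used 1127 ln(1 + f/700) convention
// (an assumption; the surrounding text does not state the formula).
inline double HzToMel(double hz) { return 1127.0 * std::log(1.0 + hz / 700.0); }
inline double MelToHz(double mel) { return 700.0 * (std::exp(mel / 1127.0) - 1.0); }

// Center frequencies in Hz of num_bins triangular bins whose centers are
// equally spaced in mel between low_freq and high_freq.
inline std::vector<double> MelBinCenters(int num_bins, double low_freq,
                                         double high_freq) {
  const double mel_low = HzToMel(low_freq);
  const double delta = (HzToMel(high_freq) - mel_low) / (num_bins + 1);
  std::vector<double> centers(num_bins);
  for (int i = 0; i < num_bins; ++i)
    centers[i] = MelToHz(mel_low + (i + 1) * delta);
  return centers;
}

// Weight that bin `bin` (0-based) gives to frequency hz: a triangle in the mel
// domain that peaks at the bin center and reaches zero at the neighboring
// centers (or at the cutoffs, for the first and last bin).
inline double MelBinWeight(int bin, int num_bins, double low_freq,
                           double high_freq, double hz) {
  const double mel_low = HzToMel(low_freq);
  const double delta = (HzToMel(high_freq) - mel_low) / (num_bins + 1);
  const double center = mel_low + (bin + 1) * delta;
  const double dist = std::fabs(HzToMel(hz) - center);
  return dist >= delta ? 0.0 : 1.0 - dist / delta;
}
```

    For example, MelBinCenters(23, 20.0, 7800.0) returns 23 increasing center
    frequencies that lie strictly between the two cutoffs and are spaced more
    widely in Hz at the high end than at the low end.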
  
    The features differ from HTK features in a number of ways, but almost all of
    these relate to having different defaults.  With the option --htk-compat=true,
    and setting parameters correctly, it is possible to get very close to HTK
    features.  One possibly important option that we do not support is energy
    max-normalization.  This is because we prefer normalization methods that can
    be applied in a stateless way, and would like to keep the feature computation
    such that it could in principle be done frame by frame and still give the same
    results.  The program compute-mfcc-feats does, however, have an option
    --subtract-mean to subtract the mean of the features.  This is done per
    utterance; there are different ways to do it per speaker (e.g. search for
    "cmvn", meaning cepstral mean and variance normalization, in the scripts).
  
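    Per-utterance mean subtraction, as described above, amounts to subtracting
    from every frame the average feature vector of the utterance.  A minimal
    sketch, assuming the features are stored with one row per frame (the function
    name SubtractMean and the container type are hypothetical, not Kaldi's API):

```cpp
#include <cstddef>
#include <vector>

// Subtracts the per-utterance mean of each coefficient, in place.
// feats holds one row per frame; all rows are assumed to have the same size.
inline void SubtractMean(std::vector<std::vector<double>> &feats) {
  if (feats.empty()) return;
  const std::size_t dim = feats[0].size();
  std::vector<double> mean(dim, 0.0);
  for (const auto &frame : feats)
    for (std::size_t d = 0; d < dim; ++d) mean[d] += frame[d];
  for (double &m : mean) m /= static_cast<double>(feats.size());
  for (auto &frame : feats)
    for (std::size_t d = 0; d < dim; ++d) frame[d] -= mean[d];
}
```

    Note that this needs the whole utterance before any frame can be output, so
    unlike the rest of the computation it cannot be done frame by frame.
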
    \section feat_plp Computing PLP features
  
    The algorithm to compute PLP features is similar to the MFCC one in the early
    stages.  We may add more to this section later, but for now see "Perceptual
    linear predictive (PLP) analysis of speech" by Hynek Hermansky, Journal of the
    Acoustical Society of America, vol. 87, no. 4, pages 1738--1752 (1990).
    
  
    \section feat_vtln Feature-level Vocal Tract Length Normalization (VTLN)
  
    The programs compute-mfcc-feats and compute-plp-feats accept a VTLN warp factor
    option.  In current scripts this is only used as a means of initializing linear
    transforms for linear versions of VTLN.  VTLN acts by moving the locations of
    the center frequencies of the triangular frequency bins.  The warping function
    that moves the frequency bins around is a piecewise linear function in
    frequency space.  To understand it, bear in mind the following quantities:
  
      0 <= low-freq <= vtln-low < vtln-high < high-freq  <= nyquist
  
    Here, low-freq and high-freq are the lowest and highest frequencies that
    are used in the standard MFCC or PLP computation (lower and higher frequencies
    are discarded).  vtln-low and vtln-high are frequency cutoffs used in VTLN,
    and their function is to ensure that all the mel bins get a reasonable width.
  
    The VTLN warping function we implement is a piecewise linear function with
    three segments that maps the interval [low-freq, high-freq] to [low-freq,
    high-freq].  Let the warping function be W(f), where f is the frequency.  The
    central segment maps f to f/scale, where "scale" is the VTLN warp factor
    (typically in the range 0.8 to 1.2).  The point on the x-axis at which the
    lower segment joins the middle segment is the point f defined so that min(f,
    W(f)) = vtln-low.  The point on the x-axis at which the middle segment joins
    the upper segment is the point f defined so that max(f, W(f)) = vtln-high.  The
    slope and offsets of the lower and upper segments are dictated by continuity
    and by the requirement that W(low-freq)=low-freq and W(high-freq)=high-freq.
    This warping function differs from HTK's; in the HTK version, the "vtln-low"
    and "vtln-high" quantities are interpreted as the points on the x-axis at which
    the slope of the warping function changes, and this means that the "vtln-high"
    variable has to be selected quite carefully based on knowledge of the possible
    range of warp factors (otherwise mel bins of zero width can occur).
  
    A reasonable setup is the following (for 16 kHz sampled speech; all values are
    in Hz).  Note that these values reflect our understanding of what is
    reasonable, and are not the product of any careful tuning experiments.
  
  <table border="1">
  <tr>
  <td>low-freq</td> <td>vtln-low</td> <td>vtln-high</td> <td>high-freq</td> <td>nyquist</td>
  </tr>
  <tr>
  <td> 40      </td> <td> 60    </td> <td> 7200    </td> <td>7800     </td> <td> 8000  </td>    
  </tr>
  </table>
  
  
  */
  
  
  }