feat.dox
7.56 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
// doc/feat.dox
// Copyright 2009-2011 Microsoft Corporation
// See ../../COPYING for clarification regarding multiple authors
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
// http://www.apache.org/licenses/LICENSE-2.0
// THIS CODE IS PROVIDED *AS IS* BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, EITHER EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION ANY IMPLIED
// WARRANTIES OR CONDITIONS OF TITLE, FITNESS FOR A PARTICULAR PURPOSE,
// MERCHANTABLITY OR NON-INFRINGEMENT.
// See the Apache 2 License for the specific language governing permissions and
// limitations under the License.
namespace kaldi {
/**
\page feat Feature extraction
\section feat_intro Introduction
Our feature extraction and waveform-reading code aims to create standard MFCC
and PLP features, setting reasonable defaults but leaving available the options
that people are most likely to want to tweak (for example, the number of mel
bins, minimum and maximum frequency cutoffs, and so on). This code only reads
from .wav files containing pcm data. These files commonly have the suffix .wav
or .pcm (although sometimes the .pcm suffix is applied to sphere files; in this
case the file would have to be converted). If the source data is not a wave
file then it is up to the user to find a command-line tool to convert it, but
to cover a common case, we do provide installation instructions for sph2pipe.
The command-line tools compute-mfcc-feats and compute-plp-feats compute the
features; as with other Kaldi tools, running them without arguments will give a
list of options. The example scripts demonstrate the usage of these tools.
\section feat_mfcc Computing MFCC features
Here we describe how MFCC features are computed by the command-line tool
compute-mfcc-feats. This program requires two command-line
arguments: an rspecifier to read the .wav data (indexed by utterance) and a
wspecifier to write the features (indexed by utterance); see \ref
io_sec_tables and \ref io_sec_specifiers for more explanation of these terms.
In typical usage, we will write the
data to one big "archive" file but also write out an "scp" file for easy
random access; see \ref io_sec_specifiers_both for explanation. The program
does not add delta features (for that, see add-deltas). It accepts an option --channel
to select the channel (e.g. --channel=0, --channel=1), which is useful when
reading stereo data.
The computation of MFCC features is done by an object of type Mfcc, which has
a function \ref Mfcc::Compute() "Compute()" to compute the features
from the waveform.
The overall MFCC computation is as follows:
- Work out the number of frames in the file (typically 25 ms frames shifted
by 10ms each time).
- For each frame:
- Extract the data, do optional dithering, preemphasis and dc offset removal,
and multiply it by a windowing function (various options are supported here, e.g. Hamming)
- Work out the energy at this point (if using log-energy not C0).
- Do FFT and compute the power spectrum
- Compute the energy in each mel bin; these are e.g. 23 triangular overlapping bins
whose centers are equally spaced in the mel-frequency domain.
- Compute the log of the energies and take the cosine transform, keeping
as many coefficients as specified (e.g. 13)
- Optionally do cepstral liftering; this is just a scaling of the
coefficients, which ensures they have a reasonable range.
The lower and upper cutoff of the frequency range covered by the triangular mel bins
are controlled by the options --low-freq and --high-freq, which are usually set close
to zero and the Nyquist frequency respectively, e.g. --low-freq=20 and --high-freq=7800
for 16kHz sampled speech.
The features differ from HTK features in a number of ways, but almost all of
these relate to having different defaults. With the option --htk-compat=true,
and setting parameters correctly, it is possible to get very close to HTK
features. One possibly important option that we do not support is energy
max-normalization. This is because we prefer normalization methods that can
be applied in a stateless way, and would like to keep the feature computation
such that it could in principle be done frame by frame and still give the same
results. The program compute-mfcc-feats does, however, have an option
--subtract-mean to subtract the mean of the features. This is done per
utterance; there are different ways to do it per speaker (e.g. search for
"cmvn", meaning cepstral mean and variance normalization, in the scripts).
\section feat_plp Computing PLP features
The algorithm to compute PLP features is similar to the MFCC one in the early
stages. We may add more to this section later, but for now see "Perceptual
linear predictive (PLP) analysis of speech" by Hynek Hermansky, Journal of the
Acoustical Society of America, vol. 87, no. 4, pages 1738--1752 (1990).
\section feat_vtln Feature-level Vocal Tract Length Normalization (VTLN).
The programs compute-mfcc-feats and compute-plp-feats accept a VTLN warp factor
option. In current scripts this is only used as a means of initializing linear
transforms for linear versions of VTLN. VTLN acts by moving the locations of
the center frequencies of the triangular frequency bins. The warping function
that moves the frequency bins around is a piecewise linear function in
frequency space. To understand it, bear in mind the following quantities:
0 <= low-freq <= vtln-low < vtln-high < high-freq <= nyquist
Here, low-freq and high-freq are the lowest and highest frequencies that
are used in the standard MFCC or PLP computation (lower and higher frequencies
are discarded). vtln-low and vtln-high are frequency cutoffs used in VTLN,
and their function is to ensure that all the mel bins get a reasonable width.
The VTLN warping function we implement is a piecewise linear function with
three segments that maps the interval [low-freq, high-freq] to [low-freq,
high-freq]. Let the warping function be W(f), where f is the frequency. The
central segment maps f to f/scale, where "scale" is the VTLN warp factor
(typically in the range 0.8 to 1.2). The point on the x-axis at which the
lower segment joins the middle segment is the point f defined so that min(f,
W(f)) = vtln-low. The point on the x-axis at which the middle segment joins
the upper segment is the point f defined so that max(f, W(f)) = vtln-high. The
slope and offsets of the lower and upper segments are dictated by continuity
and by the requirement that W(low-freq)=low-freq and W(high-freq)=high-freq.
This warping function differs from HTK's; in the HTK version, the "vtln-low"
and "vtln-high" quantities are interpreted as the points on the x-axis at which
the discontinuity happens, and this means that the "vtln-high" variable has to
be selected quite carefully based on knowledge of the possible range of warp
factors (otherwise mel bins with empty size can occur).
A reasonable setup is the following (for 16kHz sampled speech); note that this is
a reflection of our understanding of the reasonable values, and is not
the product of any very careful tuning experiments.
<table border="1">
<tr>
<td>low-freq</td> <td>vtln-low</td> <td>vtln-high</td> <td>high-freq</td> <td>nyquist</td>
</tr>
<tr>
<td> 40 </td> <td> 60 </td> <td> 7200 </td> <td>7800 </td> <td> 8000 </td>
</tr>
</table>
*/
}