  // doc/tree_internals.dox
  
  
  // Copyright 2009-2011 Microsoft Corporation
  
  // See ../../COPYING for clarification regarding multiple authors
  //
  // Licensed under the Apache License, Version 2.0 (the "License");
  // you may not use this file except in compliance with the License.
  // You may obtain a copy of the License at
  
  //  http://www.apache.org/licenses/LICENSE-2.0
  
  // THIS CODE IS PROVIDED *AS IS* BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
  // KIND, EITHER EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION ANY IMPLIED
  // WARRANTIES OR CONDITIONS OF TITLE, FITNESS FOR A PARTICULAR PURPOSE,
  // MERCHANTABILITY OR NON-INFRINGEMENT.
  // See the Apache 2 License for the specific language governing permissions and
  // limitations under the License.
  
  namespace kaldi {
  /**
   \page tree_internals Decision tree internals
  
    This page describes the internals of the phonetic decision-tree clustering
    code, which are implemented as quite generic structures and algorithms.  For a
    higher-level view that explains the overall algorithm being implemented and how it
    is used by the toolkit, see \ref tree_externals.
  
    \section treei_event_map Event maps
  
    The central concept in the decision-tree building code is the event map,
    represented by type EventMap.  Don't be misled by the term "event" to think
    that this is something happening at a particular time.  An event is just a set
    of (key,value) pairs, with no key repeated.  Conceptually, it could be represented by 
    the type std::map<int32,int32>.  In fact, for efficiency we represent it as a
    sorted vector of pairs, via the typedef:
  \code
    typedef std::vector<std::pair<EventKeyType,EventValueType> > EventType; 
  \endcode
    Here, EventKeyType and EventValueType are just typedefs to int32, but the code is
    easier to understand if we use distinct names for them.  Think of an Event
    (type EventType) as being like a collection of variables, except that the
    names and values of the variables are both integers.   There is also the type 
    \ref EventAnswerType, and this is also int32.  It is the type returned by the EventMap;
    in practice it is a pdf identifier (acoustic-state index).
    Functions that take an EventType expect it to have been sorted first, e.g. by calling
    std::sort(e.begin(), e.end()).
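
    For example, here is a minimal sketch of constructing such an event by hand (the key
    and value numbers are purely illustrative):
   \code
    EventType e;
    e.push_back(std::make_pair(static_cast<EventKeyType>(1), static_cast<EventValueType>(11)));
    e.push_back(std::make_pair(static_cast<EventKeyType>(-1), static_cast<EventValueType>(1)));
    std::sort(e.begin(), e.end());  // sort on the key, so lookups can rely on the ordering.
   \endcode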
  
    The purpose of the EventMap class is best illustrated via an example.  Suppose our phone-in-context is a/b/c;
    suppose we have the standard 3-state topology; and suppose we want to ask what is the index of the pdf
    for the central state of phone "b" in this context.  The state in question would be
    state 1, since we use zero-based indexing.
    We are skimming over a few details here when we refer to "state"; see \ref pdf_class for more information.
    Suppose the integer indices of phones a, b and c are 10, 11 and 12 respectively.
    The "event" would correspond to the mapping:
   \verbatim
     (0  -> 10), (1  -> 11), (2  -> 12), (-1 -> 1)
   \endverbatim
    where 0, 1 and 2 are the positions in the 3-phone window "a/b/c", and -1 is a special
    index we use to encode the state-id (c.f. the constant \ref kPdfClass = -1).  The way this
    would be represented as a sorted vector of pairs is:
   \code
     // Note: brace initialization like this requires C++11 or later.
     EventType e = { {-1,1}, {0,10}, {1,11}, {2,12} };
   \endcode
    Suppose that the acoustic-state index (pdf-id) corresponding to this
    acoustic state happens to be 1000.  Then, if we have an
    EventMap "emap" representing the tree, we would expect the following assert
    not to fail:
   \code
     EventAnswerType ans;
     bool ret = emap.Map(e, &ans); // emap is of type EventMap; e is EventType
     KALDI_ASSERT(ret == true && ans == 1000);
   \endcode
    So when we state that an EventMap is a map from EventType to EventAnswerType, you can think of
    this roughly as a map from a phone-in-context to an integer index.  The phone-in-context
    is represented as a set of (key-value) pairs, and in principle we could add new keys and
    put in more information.  Notice that the EventMap::Map() function returns bool.  This is
    because it's possible for certain events not to map to any answer at all (e.g. consider
    an invalid phone, or imagine what happens when the EventType doesn't contain all the keys
    that the EventMap is querying).
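
    As a small hedged illustration of that last point (reusing the "emap" of the example
    above, and assuming the tree queries keys such as the central phone or kPdfClass):
   \code
    EventType partial;  // an event that defines only key 0 (the left phone).
    partial.push_back(std::make_pair(static_cast<EventKeyType>(0), static_cast<EventValueType>(10)));
    EventAnswerType ans;
    if (!emap.Map(partial, &ans)) {
      // The tree asked about a key (e.g. the central phone or kPdfClass) that is not
      // defined in "partial", so no answer could be produced.
    }
   \endcode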
  
    The EventMap is a very passive object.  It does not have any
    capability to learn decision-trees; it is just a means of storing them.
    Think of it as a structure that represents a function from EventType to EventAnswerType.
    EventMap is a polymorphic, pure-virtual class (i.e. it cannot be instantiated because it
    has virtual functions that are not implemented).  There are three concrete classes
    that implement the EventMap interface:
        - ConstantEventMap:  Think of this as a decision-tree leaf node. 
            This class stores an integer of type EventAnswerType and its Map function always returns that value.  
        - SplitEventMap: Think of this as a decision-tree non-leaf node that queries a certain
           key and goes to the "yes" or "no" child node depending on the answer.  Its Map
           function calls the Map function of the appropriate child node.  It stores a set of
           integers of type EventValueType that correspond to the "yes" child (everything else
           goes to "no"); see the sketch after this list.
        - TableEventMap: This does a complete split on a particular key.  A typical example is:
           you might first split completely on the central phone and then have a separate 
           decision tree for each value of that phone.  Internally it stores a vector of
           EventMap* pointers.  It looks up the value corresponding to the key that it
           is splitting on, and calls the Map function of the EventMap at the corresponding
           position in the vector.
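
    As a rough sketch of how these pieces nest, here is a two-leaf tree built by hand; the
    constructor arguments shown are an assumption based on the descriptions above rather
    than a normative API reference:
   \code
    // Leaves returning pdf-ids 0 and 1.
    EventMap *yes = new ConstantEventMap(0), *no = new ConstantEventMap(1);
    // Ask whether key 1 (the central phone, in the example above) is in the set {10, 11}:
    std::vector<EventValueType> yes_set;
    yes_set.push_back(10);
    yes_set.push_back(11);
    EventMap *tree = new SplitEventMap(1, yes_set, yes, no);  // assumed to take ownership of yes/no.
    delete tree;  // assuming the destructor recursively deletes the child nodes it owns.
   \endcode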
    
    The EventMap cannot really do very much apart from map EventType to EventAnswerType.  
    Its interface does not provide functions that allow you to traverse the tree, either 
    up or down.  There is just one function that lets you derive a modified EventMap, and this
    is the EventMap::Copy() function, declared like this (as a class member):
   \code
    virtual EventMap *Copy(const std::vector<EventMap*> &new_leaves) const;
   \endcode
   This has an effect similar to function composition.  
   If you call Copy() with an empty vector "new_leaves", it will just return a deep copy
   of the entire object, copying all the pointers as it goes down the tree.  However,
   if new_leaves is nonempty, each time the Copy() function reaches a leaf, if that leaf
   \c l is within the range [0, new_leaves.size()-1] and \c new_leaves[\c l] is non-NULL,
   the Copy() function will return the result of calling \c new_leaves[\c l]->\c Copy(), rather
   than the \c Copy() function of the leaf itself (which would have returned a new ConstantEventMap).
   A typical example is where you decide to split a particular leaf, say leaf 852.
   You could create an object of type vector<EventMap*> whose only non-NULL member is
   at position 852.  That member would point to an object whose real type is
   SplitEventMap, and whose "yes" and "no" pointers would point to objects of type
   ConstantEventMap with leaf values of, say, 852 and 1234 (we reuse the old leaf-node
   id for one of the new leaves).  The real tree-building code is not this inefficient.
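
   A hedged sketch of that leaf-splitting pattern (old_tree here is a hypothetical EventMap*
   for the existing tree, and the ConstantEventMap/SplitEventMap constructor arguments are
   assumed as in the earlier sketch):
  \code
   std::vector<EventValueType> yes_set;  // the set of left-phone values that answer "yes".
   yes_set.push_back(10);
   std::vector<EventMap*> new_leaves(853, static_cast<EventMap*>(NULL));
   new_leaves[852] = new SplitEventMap(0, yes_set,
                                       new ConstantEventMap(852),    // reuse the old leaf-id.
                                       new ConstantEventMap(1234));  // newly allocated leaf-id.
   EventMap *new_tree = old_tree->Copy(new_leaves);  // old_tree itself is unchanged.
   delete new_leaves[852];  // Copy() copies what it needs, so we still own this template node.
  \endcode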
   
    \section treei_stats Statistics for building the tree
   
   The statistics used to build the phonetic decision tree are of the following type:
   \code
    typedef std::vector<std::pair<EventType, Clusterable*> > BuildTreeStatsType;
   \endcode
   A reference to an object of this type is passed to all the decision-tree
   building routines.  These statistics are expected to contain no duplicates with
   the same EventType member, i.e. they conceptually represent a map from \ref EventType
   to Clusterable.  In our current code the Clusterable objects are really of type
   GaussClusterable, but the tree-building code doesn't know this.  A program that
   accumulates these statistics is acc-tree-stats.cc. 
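
   Here is a minimal sketch of accumulating one entry of such statistics.  The GaussClusterable
   constructor taking (dimension, variance-floor), its AddStats() member, and the exact form of
   DeleteBuildTreeStats() are assumptions to be checked against the headers; "feature_frame" is a
   hypothetical Vector<BaseFloat> holding one frame of features.
  \code
   BuildTreeStatsType stats;
   EventType e;  // the (key,value) pairs for one acoustic state, as described above.
   e.push_back(std::make_pair(static_cast<EventKeyType>(-1), static_cast<EventValueType>(1)));
   e.push_back(std::make_pair(static_cast<EventKeyType>(1), static_cast<EventValueType>(11)));
   std::sort(e.begin(), e.end());
   GaussClusterable *gc = new GaussClusterable(39, 0.01);  // dim = 39, variance floor = 0.01.
   gc->AddStats(feature_frame);  // accumulate one observation (default weight 1.0 assumed).
   stats.push_back(std::make_pair(e, static_cast<Clusterable*>(gc)));
   // ... and when the statistics are no longer needed:
   DeleteBuildTreeStats(&stats);
  \endcode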
       
    \section treei_func Classes and functions involved in tree-building
  
    \subsection treei_func_questions Questions (config class)
  
    The Questions class, in its interaction with the
    tree-building code, behaves a little like a "configuration" class.
    It is really a map from the "key" value (of type \ref EventKeyType)
    to a configuration class of type QuestionsForKey.
  
    Class QuestionsForKey
    has a set of "questions", each of which is just a set of integers of
    type \ref EventValueType; these mostly correspond to sets of
    phones, or if the key is -1, they would in the typical case correspond to
    sets of HMM-state indices, i.e. subsets of {0,1,2}.  QuestionsForKey also contains
    a configuration class
    of type RefineClustersOptions.  This controls tree-building behaviour in that
    the tree-building code will attempt to iteratively move values (e.g. phones) between
    the two sides of the split in order to maximize likelihood (as in K-means
    with K=2).  However, this refinement can be turned off by setting the number
    of iterations of "refining clusters" to zero, which corresponds to choosing only
    from among a fixed list of questions; in practice this seems to work slightly better.
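
    As a hedged configuration sketch (assuming QuestionsForKey exposes an initial_questions
    member and a refine_opts member with a num_iters field, and that Questions has a
    SetQuestionsOf() method; check build-tree-questions.h for the exact interface):
   \code
    Questions qopts;
    QuestionsForKey qk;                  // questions for one key, here key 1 (the central phone).
    std::vector<EventValueType> q1, q2;  // each question is a set of phone indices.
    q1.push_back(10);                    // purely illustrative sets of phones.
    q1.push_back(11);
    q2.push_back(12);
    qk.initial_questions.push_back(q1);
    qk.initial_questions.push_back(q2);
    qk.refine_opts.num_iters = 0;        // zero: choose only from the fixed list of questions.
    qopts.SetQuestionsOf(1, qk);
   \endcode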
  
    \subsection treei_func_low Lowest-level functions
         
    See \ref tree_group_lower for a complete list; 
    we summarize some important ones here.
  
    These functions mostly involve manipulations of objects of type BuildTreeStatsType, which as
    we saw above is a vector of pairs of (EventType, Clusterable*).  The simplest ones
    are DeleteBuildTreeStats(), WriteBuildTreeStats() and ReadBuildTreeStats().
  
    The function PossibleValues() finds out which values a particular key takes within
    a collection of statistics (and informs the user whether or not the key was always
    defined);  SplitStatsByKey() will split up an object of type
    BuildTreeStatsType into a vector<BuildTreeStatsType> according to the value taken by
    a particular key (e.g. splitting on the central phone); SplitStatsByMap() does the
    same but the index is not the value for that key, but the answer returned by an EventMap
    that is provided to the function.  
  
    SumStats() sums up the statistics (i.e. the Clusterable objects) in an object
    of type BuildTreeStatsType and returns the corresponding Clusterable* object.
    SumStatsVec() takes an object of
    type vector<BuildTreeStatsType> and outputs something of type vector<Clusterable*>,
    i.e. it's like SumStats() but for a vector; it's useful in processing the output of
    SplitStatsByKey() and SplitStatsByMap().
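
    A minimal usage sketch, splitting on the central-phone key of the earlier example
    (assuming the function signatures match the descriptions above, that entries with no
    statistics come back as NULL, and that DeletePointers() from the utility code is
    available to free the summed objects):
   \code
    // "stats" is a BuildTreeStatsType, e.g. accumulated as sketched above.
    std::vector<BuildTreeStatsType> split_stats;
    SplitStatsByKey(stats, 1, &split_stats);  // one entry per value of key 1 (the central phone).
    std::vector<Clusterable*> summed;
    SumStatsVec(split_stats, &summed);        // summed[p] is NULL if phone-index p had no stats.
    // ... use the summed statistics, then free them:
    DeletePointers(&summed);
   \endcode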
   
    ObjfGivenMap() evaluates the objective function given some statistics and an EventMap:
    it sums up all the statistics within each cluster (corresponding to each distinct answer
    of the EventMap::Map() function), adds up the objective functions across the clusters,
    and returns the total.
  
    FindAllKeys() will find all the keys that are defined in a collection of statistics, and
    depending on the arguments it can find all keys that are defined for all the events, or
    all the keys that are defined for any event (i.e. take the intersection or union of sets of
    defined keys).
  
    \subsection treei_func_intermediate Intermediate-level functions
  
    The next group of functions involved in tree building are listed \ref
    tree_group_intermediate "here" and correspond to various stages of tree building.
    We will now only mention some representative ones.
  
    Firstly, we point out that many of these functions have a parameter int32
    *num_leaves.  The integer this points to acts as a counter used to allocate new
    leaves.  At the start of tree building the caller sets it to zero.
    Each time a new leaf is required, the tree-building code takes the number it
    currently points to as the new leaf-id, and then increments it.
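
    In code, this pattern is simply the following (a trivial sketch, not a quote from the
    tree-building code itself):
   \code
    int32 num_leaves = 0;  // the caller sets the counter to zero before tree building.
    // ... inside the tree-building code, whenever a new leaf is needed:
    EventAnswerType new_leaf_id = num_leaves++;
   \endcode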
  
    One important function is GetStubMap().  This function returns a tree that has
    not yet been split, i.e. the pdfs do not depend on the context.  The input to this
    function controls whether all the phones are distinct or some of them share 
    decision-tree roots, and whether all the states within a particular phone share
    the same decision-tree root.
  
    The function SplitDecisionTree() takes as input an EventMap that would normally
    correspond to a non-split "stub" decision tree of the type created by GetStubMap(),
    and does the decision-tree splitting until either the maximum number of leaves
    has been reached, or the gain from splitting a leaf is less than a specified threshold.
  
    The function ClusterEventMap() takes an EventMap and a threshold, and merges the
    leaves of the EventMap as long as the cost of doing so is less than the threshold.
    This function would normally be called after SplitDecisionTree().
    There are other versions of this function that operate with restrictions (e.g. to
    avoid merging leaves from distinct phones). 
  
    \subsection treei_func_top Top-level tree-building functions
  
    The top-level tree-building functions are listed  \ref tree_group_top "here".
    These functions are called directly from the command-line programs.
    The most important one is BuildTree().  It is given a Questions configuration class,
    and some information about sets of phones whose roots to share and (for each 
    set of phones) whether to share the \ref pdf_class "pdf-classes" in a single decision
    tree root or to have a separate root for each pdf-class.  It is also passed information
    about the lengths of the phones, plus various thresholds.  It builds the tree and
    returns the EventMap object which is used to construct the ContextDependency object.
   
   Another important routine in this group is
   AutomaticallyObtainQuestions(), which is used to obtain questions via automatic clustering of
   phones.  It clusters the phones into a tree, and for each node in the tree, all leaves
   accessible from that node form one question (questions are equivalent to sets of phones).
   This (called from cluster-phones.cc) makes our recipes independent of human-generated
   phone sets.
  
   The function AccumulateTreeStats() accumulates statistics for training the tree, given
   features and an alignment that is a sequence of transition-ids (see \ref transition_model_identifiers).
   This function is defined in a different directory from the other tree-building-related functions 
   (hmm/ not tree/) as it depends on more code (e.g. it knows about the TransitionModel class), 
   and we preferred to keep the dependencies of the core tree-building code to a minimum.
    
  */
  
  /**
    \defgroup tree_group Modules involved in tree building
    See \ref tree_internals for documentation.
  */
  }