Blame view
src/doc/fst_algo.dox
10.1 KB
8dcb6dfcb first commit |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 |
// doc/fst_algo.dox // Copyright 2009-2011 Microsoft Corporation // See ../../COPYING for clarification regarding multiple authors // // Licensed under the Apache License, Version 2.0 (the "License"); // you may not use this file except in compliance with the License. // You may obtain a copy of the License at // http://www.apache.org/licenses/LICENSE-2.0 // THIS CODE IS PROVIDED *AS IS* BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY // KIND, EITHER EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION ANY IMPLIED // WARRANTIES OR CONDITIONS OF TITLE, FITNESS FOR A PARTICULAR PURPOSE, // MERCHANTABLITY OR NON-INFRINGEMENT. // See the Apache 2 License for the specific language governing permissions and // limitations under the License. namespace kaldi { /** \page fst_algo Finite State Transducer algorithms Here we describe the FST algorithms in the Kaldi toolkit that are new or different than the the ones in OpenFst (we use the OpenFst code itself for many algorithms). These algorithms are in the directory fstext/, and the corresponding command-line programs, where they exist, are in fstbin/. This code uses the OpenFst library. We will only describe here the algorithms that are actually used in our current recipes. \section fst_algo_det Determinization We use a different determinization algorithm from the one in OpenFst. We distinguish this with the name DeterminizeStar(); the corresponding command-line program is named fstdeterminizestar. Our determinization algorithm is actually closer to the standard FST determinization algorithm than the one in OpenFst, in that it does epsilon removal along with determinization (thus, like many other FST algorithms, we do not consider epsilon to be a "real symbol". Our determinization algorithm has a different way of handling what happens when a transition in the initial determinized output has more than one output symbol on it. The OpenFst determinization algorithm uses a function called FactorWeights that moves around the output symbols (encoded as weights) around while maintaining equivalence, in order to ensure that no arc has more than one (encoded) output symbol; it does not introduce new states with epsilon symbols on their input, which is the most "obvious" thing to do. However, the FactorWeights algorithm can fail for the output of DeterminizeStar, because there can be cycles with more outputs than states on the cycle (this is not possible for the output of the normal Determinize algorithm, because it does not do epsilon removal). Instead, whenever we encounter a link with more than one output symbols, we create a chain with a sufficient number of intermediate states to accommodate all the output symbols. The weight and input symbol go on the first link of this chain. The output of our DeterminizeStar algorithm is deterministic according to the definitions OpenFst uses, i.e. treating epsilons as a normal symbols. Its output does have epsilons on the input side, which is against the normal definition of determinism, but this is to be viewed as an encoding mechanism for allowing more than one output symbol on a link, and in any case it only happens in quite specific circumstances (an epsilon arc is always the only arc out of a state). One other difference is that our program fstdeterminizestar does not require the input FST to have its output symbols encoded as weights. \subsection fst_algo_det_debug Debugging determinization In general debugging determinization is quite hard, because when a process takes a long time it's hard to tell whether ever terminate. The program fstdeterminizestar has the feature that if you send it a signal with "kill -SIGUSR1 \<its processid\>", it will stop and print out some information that can be useful for debugging the determinization. \subsection fst_algo_det_log Determinization in the log semiring We supply a function DeterminizeInLog() which casts an Fst in the normal (tropical) semiring to the log semiring before determinizing, and then converts back. This is the form of determinization used in our algorithms, as it preserves stochasticity (see \ref fst_algo_stochastic). \section fst_algo_eps Removing epsilons We supply an epsilon-removal algorithm called RemoveEpsLocal() that is guaranteed to never blow up the FST, but on the other hand is not guaranteed to remove all epsilons. Essentially it removes whatever epsilons it can easily remove without making the graph larger. There is no optimality guaranteed here, as this is a hard problem. The RemoveEpsLocal() function has slightly different behaviour from OpenFst's RemoveEps() function, because it will combine two arcs if one has an input epsilon and one an output epsilon. The function RemoveEpsLocal() preserves FST equivalence. There is also a function RemoveEpsLocalSpecial() that preserves equivalence in the tropical semiring while preserving stochasticity in the log semiring (for more on stochasticity see next section). This appears to be a case where the usefulness of the semiring formalism breaks down a little bit, as we have to consider two semirings simultaneously. \section fst_algo_stochastic Preserving stochasticity and testing it We define a stochastic FST as an FST in which, in the FST's semiring the sum of the weights of the arcs out of any given state (plus the final-weight) equals one (in the semiring). This concept is most useful and natural in the log semiring; essentially, a stochastic FST is one where the sum of the weights out of a given arc is one (e.g. a properly normalized HMM would correspond to a stochastic FST). The function IsStochasticFst() tests stochasticity. Optionally it can output minimum and maximum weights, to inform the user of the extent to which the FST fails to be stochastic. The command-line version of this is fstisstochastic. We aim for most of the FST algorithms we use to preserve stochasticity, in the sense that they will produce stochastic outputs given stochastic inputs. For non-stochastic inputs, we aim that the minimum and maximum range of the weights will not get larger. By default the program isstochasticfst will test stochasticity after casting to the log semiring, as this is the most useful case (you can stop this by supplying the option --test-in-log=false). In order to preserve stochasticity, the FSTs that we compose with have to have certain properties. These should apply to L, C and H. Consider L, for example. We require that for any linear FST corresponding to a path through G, call this FST F, the product L o F must be stochastic. This basically means that L have properly normalized pronunciation probabilities. The actual property formally required may be a little stronger than this (this relates to ensuring that the probabilities appear at the "right time"). In practice we verify stochasticity by running the program isstochasticfst for each stage of graph creation. \section fst_algo_minimization Minimization We use the minimization algorithm supplied by OpenFst, but we apply a patch before compiling OpenFst so that minimization can be applied to non-deterministic FSTs. The reason for this is so that we can remove disambiguation symbols before minimizing, which is more optimal (it allows minimization to combine more states). The patch fixes data-structure related issues; fundamentally, OpenFst's minimization algorithm is applicable to non-deterministic FSTs. We supply a command-line program called fstminimizeencoded that minimizes FSTs after encoding weights and output symbols as part of the input symbols (so the FST becomes an acceptor). This is the same thing the fstminimize program does except that it does not do weight pushing. This is desirable for us because the way we ensure stochasticity entails avoiding any weight pushing. \section fst_algo_composition Composition For the most part we use OpenFst's own composition algorithms, but we do make use of a function TableCompose(), and a corresponding command-line program fsttablecompose, which is a more efficient composition algorithm for certain common cases. It uses the "Matcher" concept of OpenFst; a Matcher is a kind of helper class used during composition that performs lookup on the arcs out of a state to find any arcs with a particular input or output symbol. The normal matcher that OpenFst uses is SortedMatcher, which relies on arcs being sorted on the relevant label, and does a binary search. TableMatcher detects cases where it would be efficient to build a table indexed by label, and for those states it avoids the overhead of binary search. This leads to a speedup when composing with things like lexicons that have a very high out-degree. \section fst_algo_disambig Adding and removing disambiguation symbols Our FST recipes (like other transducer-based recipes) rely on disambiguation symbols. In the normal recipes, these are added to the input side of the lexicon FST (L) to make it determizable. We also add disambiguation symbols to G and C (see \ref graph_disambig). Whenever we do a composition and the FST on the right has disambiguation symbols on its input, we (in theory) add to each state in the left-hand FST a self-loop for each of the disambiguation symbols, which has that symbol on both its input and output. Note that the actual integer symbols id's for the disambiguation symbols on the left and right may not be the same. For instance, we have a special symbol \#0 in G (where epsilon would normally be). The symbol-id for this would generally be the highest-numbered word plus one. But when we want to pass this symbol through L, we need a symbol in L's input symbol table (which mainly contains phones), to represent \#0. We have a function AddSelfLoops() that takes a mutable FST and two vectors of labels (a label is an integer id for a symbol). The vectors are the same size, and represent corresponding input and output labels for the disambiguation symbols. This function adds self-loops to each final state and each state with non-epsilon output symbols on at least one arc out of it. We remove disambiguation symbols with the function DeleteISymbols(), accessible on the command line with the program fstrmsymbols. */ } |