// doc/online_programs.dox
// Copyright 2013 Polish-Japanese Institute of Information Technology (author: Danijel Korzinek)
// See ../../COPYING for clarification regarding multiple authors
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//  http://www.apache.org/licenses/LICENSE-2.0
//
// THIS CODE IS PROVIDED *AS IS* BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, EITHER EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION ANY IMPLIED
// WARRANTIES OR CONDITIONS OF TITLE, FITNESS FOR A PARTICULAR PURPOSE,
// MERCHANTABLITY OR NON-INFRINGEMENT.
// See the Apache 2 License for the specific language governing permissions and
// limitations under the License.

namespace kaldi {

/**
  \page online_programs Online Recognizers

  Warning: this page is deprecated, as it refers to the older online-decoding
  setup.  The page for the new setup is \ref online_decoding.

  There are several programs in the Kaldi toolkit that can be used for online
  recognition.  They are all located in the src/onlinebin folder and require the
  files from the src/online folder to be compiled as well (you can currently
  compile these with "make ext").  Many of these programs also require the
  PortAudio library to be present in the tools folder; it can be downloaded using
  the appropriate script found there.

  The programs are as follows:
    - online-gmm-decode-faster - records audio from the microphone and outputs the result on stdout
    - online-wav-gmm-decode-faster - reads a list of WAV files and outputs the result in a special output format
    - online-server-gmm-decode-faster - reads vectors of MFCC features from a UDP socket and prints the result to stdout
    - online-net-client - records audio from the microphone, converts it to vectors of features and sends them over UDP to online-server-gmm-decode-faster
    - online-audio-server-decode-faster - reads RAW audio data from a TCP socket and outputs aligned words to the same socket
    - online-audio-client - reads a list of WAV files, sends their audio over TCP to online-audio-server-decode-faster, reads the output and saves the result in the chosen file format

  There is also a Java equivalent of the online-audio-client, which has slightly
  more features and a GUI.

  In addition, there is a GStreamer 1.0 compatible plugin that acts as a filter,
  taking raw audio as input and producing recognized words as output.  The plugin
  is based on \ref OnlineFasterDecoder, as are the other online recognition
  programs.

  \section audio_server Online Audio Server

  The main difference between the online-server-gmm-decode-faster and
  online-audio-server-decode-faster programs is the input: the former accepts
  feature vectors, while the latter accepts RAW audio.  The advantage of the
  latter is that it can be deployed directly as a back-end for any client,
  whether it is another computer on the Internet or a mobile device.  The key
  point is that the client does not need to know anything about the feature set
  used to train the models: provided it can record standard audio at the
  predetermined sampling frequency and bit depth, it will always be compatible
  with the server.  An advantage of the server that accepts feature vectors
  instead of audio is the lower cost of data transfer between the client and the
  server, but this can easily be offset by compressing the audio with a
  state-of-the-art codec (something that may be done in the future).

  The communication between the online-audio-client and
  online-audio-server-decode-faster consists of two phases: first, the client
  sends packets of raw audio to the server; second, the server replies with the
  output of the recognition.  The two phases may happen asynchronously, meaning
  that the decoder can output results online, as soon as it is certain of their
  outcome, without waiting for the end of the data to arrive.  This opens up more
  possibilities for creating applications in the future.

  \subsection audio_data Audio Data

  The audio data format is currently hardcoded to be RAW 16 kHz, 16-bit, signed,
  little-endian (server native), linear PCM.  The protocol works by splitting the
  data into chunks and prepending each chunk with a 4-byte code containing its
  exact size.  The size is also (server native) little-endian and can contain any
  value as long as it is positive and even (because of the 16-bit sample size).
  A final packet of size 0 is treated as the end of stream and forces the decoder
  to dump the rest of its results and finish the recognition process.

  \subsection results Recognition Results

  There are two types of packets sent by the server: time-aligned results and
  partial results.  Partial results are sent as soon as the decoder recognizes
  each word, and time alignment is performed every time the decoder recognizes
  the end of an utterance (which may or may not correspond to silences and/or
  sentence boundaries).

  Each partial result packet is prepended by the characters "PARTIAL:" followed
  by exactly one word.  Each word is sent in a separate partial packet.

  Time-aligned results are prepended by a header starting with the characters
  "RESULT:".  Following that is a comma-separated list of key=value parameters
  containing some useful information:
    - NUM - number of words to follow
    - FORMAT - the format of the result to follow; currently only WSE (word-start-end), but other formats may be allowed in the future
    - RECO-DUR - time it took to recognize the sequence (as a floating point value in seconds)
    - INPUT-DUR - length of the input audio sequence (as a floating point value in seconds)

  You can divide INPUT-DUR by RECO-DUR to get the real-time recognition speed of
  the decoder.

  The header "RESULT:DONE" is sent when there are no more results that can be
  returned by the server.  In this case, the server simply waits either for more
  data from the client, or for a disconnect.

  The data underneath the header consists of exactly NUM lines of words formatted
  in the way determined by the FORMAT parameter.  In the case of the WSE format,
  each line is simply a comma-separated list containing three tokens: the word
  (as present in the dictionary), and its start and end times (as floating point
  values in seconds).

  Beware that the words are going to be encoded exactly as they are in the
  dictionary provided to the server; therefore, the client must make sure to
  perform the appropriate character conversion if necessary.  The
  online-audio-client, for example, doesn't perform any character conversion
  while generating WebVTT files (which require UTF-8), so you need to convert the
  resulting files to UTF-8 using iconv (or a similar program).

  An example of the results sent by the server is as follows:
\verbatim
RESULT:NUM=3,FORMAT=WSE,RECO_DUR=1.7,INPUT_DUR=3.22
one,0.4,1.2
two,1.4,1.9
three,2.2,3.4
RESULT:DONE
\endverbatim
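
  To make the framing concrete, below is a minimal client sketch in Python.  It
  is not part of the Kaldi distribution; the host, port, file name and chunk size
  are placeholders, and the parsing is deliberately simplistic.  It streams a raw
  16 kHz, 16-bit, little-endian PCM file to the server using the length-prefixed
  chunks described above, sends the zero-length end-of-stream chunk, and then
  prints the "PARTIAL:" and "RESULT:" replies (assumed to be newline-separated,
  as in the example above) until "RESULT:DONE" arrives.  For simplicity it reads
  the replies only after all audio has been sent, so it does not exploit the
  asynchronous behaviour described earlier.
\verbatim
# Hypothetical example client for the online-audio-server-decode-faster protocol.
# Assumes "audio.raw" contains 16 kHz, 16-bit, little-endian, mono linear PCM.
import socket
import struct

HOST, PORT = "localhost", 5010     # placeholder server address
CHUNK_BYTES = 8000                 # 0.25 s of audio; any positive even size works

with socket.create_connection((HOST, PORT)) as sock, open("audio.raw", "rb") as f:
    # Phase 1: send length-prefixed chunks of raw audio.
    while True:
        chunk = f.read(CHUNK_BYTES)
        if not chunk:
            break
        sock.sendall(struct.pack("<i", len(chunk)))   # 4-byte little-endian size
        sock.sendall(chunk)
    sock.sendall(struct.pack("<i", 0))                # zero-length chunk = end of stream

    # Phase 2: read the server's textual replies until RESULT:DONE arrives.
    buf = b""
    while b"RESULT:DONE" not in buf:
        data = sock.recv(4096)
        if not data:
            break
        buf += data

    # The word encoding depends on the dictionary used by the server (see above).
    for line in buf.decode("utf-8", errors="replace").splitlines():
        if line.startswith("PARTIAL:") or line.startswith("RESULT:"):
            print(line)
        elif line:
            word, start, end = line.split(",")        # WSE: word, start time, end time
            print("%-20s %s - %s" % (word, start, end))
\endverbatim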

  \subsection usage Example Usage

  Command line to start the server:
\verbatim
online-audio-server-decode-faster --verbose=1 --rt-min=0.5 --rt-max=3.0 --max-active=6000 \
    --beam=72.0 --acoustic-scale=0.0769 final.mdl graph/HCLG.fst graph/words.txt '1:2:3:4:5' \
    graph/word_boundary.int 5010 final.mat
\endverbatim

  The arguments are as follows:
    - final.mdl - the acoustic model
    - HCLG.fst - the complete FST
    - words.txt - the word symbol table (mapping word ids to their textual representation)
    - '1:2:3:4:5' - list of silence phoneme ids
    - word_boundary.int - phoneme boundary information required for word alignment
    - 5010 - the port the server is listening on
    - final.mat - the feature LDA matrix

  Command line to start the client:
\verbatim
online-audio-client --htk --vtt localhost 5010 scp:test.scp
\endverbatim

  The arguments are as follows:
    - --htk - save results as an HTK label file
    - --vtt - save results as a WebVTT file
    - localhost - the server to connect to
    - 5010 - the port to connect to
    - scp:test.scp - the list of WAV files to send

  Command line to start the Java client:
\verbatim
java -jar online-audio-client.jar
\endverbatim

  Or simply double-click the JAR file from a graphical file manager.

  \section gst_plugin GStreamer plugin

  The Kaldi toolkit comes with a plugin for the
  <a href="http://gstreamer.freedesktop.org/">GStreamer</a> media streaming
  framework (version 1.0 or compatible).  The plugin acts as a filter that
  accepts raw audio as input and produces recognized words as output.  The main
  benefit of the plugin is that it makes Kaldi's online speech recognition
  functionality available to all programming languages that support GStreamer 1.0
  (that includes Python, Ruby, Java, Vala and many more).  It also simplifies the
  integration of the Kaldi online decoder into applications, since communicating
  with the decoder follows GStreamer standards.

  \subsection gst_plugin_installation Installation

  The source of the GStreamer plugin is located in the `src/gst-plugin`
  directory.  To compile the plugin, the rest of the Kaldi toolkit has to be
  compiled to use shared libraries.  To do this, invoke `configure` with the
  `--shared` flag.  Also compile the online extensions (`make ext`).

  Make sure the package that provides the GStreamer 1.0 development headers is
  installed on your system.  On Debian Jessie (the current 'testing'
  distribution), the needed package is called `libgstreamer1.0-dev`.  On Debian
  Wheezy (the current stable), GStreamer 1.0 is available from the backports
  repository (http://backports.debian.org).  Install the 'good' GStreamer plugins
  (package `gstreamer1.0-plugins-good`) and the GStreamer 1.0 tools (package
  `gstreamer1.0-tools`).  The demo program also requires the PulseAudio GStreamer
  plugin (package `gstreamer1.0-pulseaudio`).

  Finally, run `make depend` and `make` in the `src/gst-plugin` directory.  This
  should result in a file `src/gst-plugin/libgstonlinegmmdecodefaster.so`, which
  contains the GStreamer plugin.

  To make GStreamer able to find the Kaldi plugin, you have to add the
  `src/gst-plugin` directory to its plugin search path, by adding the directory
  to the GST_PLUGIN_PATH environment variable:
\verbatim
export GST_PLUGIN_PATH=$KALDI_ROOT/src/gst-plugin
\endverbatim

  Of course, replace `$KALDI_ROOT` with the actual location of the Kaldi root
  folder on your file system.
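
  Taken together, and assuming `$KALDI_ROOT` points at your Kaldi checkout, the
  installation steps above amount to roughly the following sequence:
\verbatim
# adjust paths as needed for your system
cd $KALDI_ROOT/src
./configure --shared
make depend && make
make ext
cd gst-plugin
make depend && make
export GST_PLUGIN_PATH=$KALDI_ROOT/src/gst-plugin
\endverbatim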

  Now, running `gst-inspect-1.0 onlinegmmdecodefaster` should provide information
  about the plugin:
\verbatim
# gst-inspect-1.0 onlinegmmdecodefaster
Factory Details:
  Rank:         none (0)
  Long-name:    OnlineGmmDecodeFaster
  Klass:        Speech/Audio
  Description:  Convert speech to text
  Author:       Tanel Alumae <tanel.alumae@phon.ioc.ee>
[..]
Element Properties:
  name            : The name of the object
                    flags: readable, writable
                    String. Default: "onlinegmmdecodefaster0"
  parent          : The parent of the object
                    flags: readable, writable
                    Object of type "GstObject"
  silent          : Determines whether incoming audio is sent to the decoder or not
                    flags: readable, writable
                    Boolean. Default: false
  model           : Filename of the acoustic model
                    flags: readable, writable
                    String. Default: "final.mdl"
  fst             : Filename of the HCLG FST
                    flags: readable, writable
                    String. Default: "HCLG.fst"
[..]
  min-cmn-window  : Minumum CMN window used at start of decoding (adds latency only at start)
                    flags: readable, writable
                    Integer. Range: -2147483648 - 2147483647 Default: 100

Element Signals:
  "hyp-word" :  void user_function (GstElement* object,
                                    gchararray arg0,
                                    gpointer user_data);
\endverbatim

  \subsection usage_cli Usage through the command-line

  The simplest way to use the GStreamer plugin is via the command line.  You have
  to specify the model files used for decoding when launching the plugin.  To do
  this, set the `model`, `fst`, `word-syms`, `silence-phones` and optionally the
  `lda-mat` plugin properties (similarly to Kaldi's command-line online
  decoders).  The decoder accepts only 16 kHz, 16-bit mono audio.  Any audio
  stream can be automatically converted to the required format using GStreamer's
  `audioresample` and `audioconvert` plugins.

  For example, to decode the file `test1.wav` using the model files in
  `tri2b_mmi`, and have the recognized stream of words printed to stdout,
  execute:
\verbatim
gst-launch-1.0 -q filesrc location=test1.wav \
    ! decodebin ! audioconvert ! audioresample \
    ! onlinegmmdecodefaster model=tri2b_mmi/model fst=tri2b_mmi/HCLG.fst \
      word-syms=tri2b_mmi/words.txt silence-phones="1:2:3:4:5" lda-mat=tri2b_mmi/matrix \
    ! filesink location=/dev/stdout buffer-mode=2
\endverbatim

  Note that the audio stream is segmented on the fly, with "<#s>" denoting
  silence.  You can easily try live decoding of microphone input by replacing
  `filesrc location=test1.wav` with `pulsesrc` (given that your OS uses the
  PulseAudio framework).

  An example script that uses the plugin via the command line to process a bunch
  of audio files is located in `egs/voxforge/gst_demo/run-simulated.sh`.

  \subsection usage_gst Usage through GStreamer bindings

  An example of a Python GUI program that uses the plugin via the GStreamer
  bindings is located in `egs/voxforge/gst_demo/run-live.py`.  The program
  constructs, in its `init_gst(self)` method, a pipeline of GStreamer elements
  similar to the one in the command-line example.  The model files and some
  decoding parameters are communicated to the `onlinegmmdecodefaster` element
  through the standard `set_property()` method.

  More interesting is this part of the code:
\verbatim
self.asr.connect('hyp-word', self._on_word)
\endverbatim

  This expression instructs our decoding plugin to call the GUI's `_on_word`
  method whenever it produces a new recognized word.

  The `_on_word()` method looks like this:
\verbatim
    def _on_word(self, asr, word):
        Gdk.threads_enter()
        if word == "<#s>":
            self.textbuf.insert_at_cursor("\n")
        else:
            self.textbuf.insert_at_cursor(word)
            self.textbuf.insert_at_cursor(" ")
        Gdk.threads_leave()
\endverbatim

  What it does (apart from some GUI-related chemistry) is insert the recognized
  word into the text buffer that is connected to the GUI's main text box.  If the
  segmentation symbol is recognized, it inserts a line break instead.

  Recognition start and stop are controlled by setting the `silent` property of
  the decoder plugin to `False` or `True`.  Setting the property to `True` orders
  the plugin not to process any incoming audio (although audio that is already
  being processed might still produce some new recognized words).
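
  Putting those pieces together, a stripped-down, non-GUI use of the plugin
  through the Python GStreamer bindings might look roughly like the sketch below.
  It is not part of the Kaldi distribution: the model paths are placeholders, the
  `on_word` callback is illustrative, a `fakesink` replaces the text-file sink of
  the command-line example, and the plugin directory is assumed to be on
  GST_PLUGIN_PATH.
\verbatim
# Hypothetical minimal example; model file paths below are placeholders.
import gi
gi.require_version('Gst', '1.0')
from gi.repository import Gst

Gst.init(None)

pipeline = Gst.parse_launch(
    "filesrc location=test1.wav ! decodebin ! audioconvert ! audioresample "
    "! onlinegmmdecodefaster name=asr ! fakesink")

asr = pipeline.get_by_name("asr")
asr.set_property("model", "tri2b_mmi/model")            # acoustic model
asr.set_property("fst", "tri2b_mmi/HCLG.fst")           # decoding graph
asr.set_property("word-syms", "tri2b_mmi/words.txt")    # word symbol table
asr.set_property("silence-phones", "1:2:3:4:5")
asr.set_property("lda-mat", "tri2b_mmi/matrix")         # optional LDA matrix

def on_word(element, word):
    # Called once per recognized word; "<#s>" marks a segment boundary.
    print(word)

asr.connect("hyp-word", on_word)

# Run the pipeline until the end of the file, then clean up.
pipeline.set_state(Gst.State.PLAYING)
bus = pipeline.get_bus()
bus.timed_pop_filtered(Gst.CLOCK_TIME_NONE,
                       Gst.MessageType.EOS | Gst.MessageType.ERROR)
pipeline.set_state(Gst.State.NULL)
\endverbatim

  In the actual `run-live.py` demo the same properties are set on an element
  inside a larger GUI pipeline, and the callback updates a text buffer instead of
  printing.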

  \subsection code_structure Overview of the online decoders' code

  The main components used in the implementation of the online decoders are as
  follows:
    - \ref OnlineAudioSource - represents a source of raw audio samples.  There
      are different implementations of this "interface", which can, for example,
      acquire samples from a microphone (e.g. \ref OnlinePaSource, using
      PortAudio), or read them from an existing \ref Vector
      (\ref OnlineVectorSource).  Technically, OnlineAudioSource is currently not
      a C++ interface using virtual methods; instead, the classes "implementing"
      it are passed as template parameters to their users.
    - \ref OnlineFeatInputItf - used to read feature vectors that are computed in
      some manner.  The computation in this case can mean feature extraction
      (\ref OnlineFeInput), cepstral mean normalization (\ref OnlineCmnInput),
      calculation of higher-order features (\ref OnlineDeltaInput) and so on.
    - \ref OnlineDecodableDiagGmmScaled - an implementation of
      \ref DecodableInterface, reading its input features from an implementation
      of OnlineFeatInputItf (through an OnlineFeatureMatrix proxy).
    - \ref OnlineFasterDecoder - extends \ref FasterDecoder (described in the
      \ref decoders section) to make it work with real-time input.

  The process of feature extraction and transformation is implemented by forming
  a pipeline of several objects, each of which pulls data from the previous
  object in the pipeline, performs a specific type of computation and passes the
  result to the next one.  Usually, at the start of the pipeline there is an
  OnlineAudioSource object, which is used to acquire the raw audio in some way.
  Then comes a component of type \ref OnlineFeInput, which reads the raw samples
  and extracts features.  OnlineFeInput can be used to produce either MFCC or PLP
  features, by passing a reference to the object doing the actual work
  (\ref Mfcc or \ref Plp) as a parameter to its constructor.  After that the
  pipeline continues with other OnlineFeatInputItf objects.  For example, if we
  want the decoder to receive audio from a microphone using PortAudio and to use
  a vector of MFCC, delta and delta-delta features, the pipeline looks like:
\verbatim
OnlinePaSource -> OnlineFeInput<OnlinePaSource, Mfcc> -> OnlineCmnInput -> OnlineDeltaInput
\endverbatim

  <B>NOTE:</B> The pipeline shown above has nothing to do with the GStreamer
  pipeline described in the previous sections - the two structures are completely
  independent of each other.  The GStreamer plug-in makes it possible to grab
  audio samples from GStreamer's pipeline, by implementing a new
  OnlineAudioSource (\ref GstBufferSource), and feed them into the Kaldi online
  decoder's pipeline.

  \ref OnlineFasterDecoder itself is completely oblivious to the details of
  feature computation.  As is the case with the other Kaldi decoders, the decoder
  only "knows" that it can get scores for a given feature frame using a Decodable
  object.  The actual type of this object, for the currently implemented example
  programs, is OnlineDecodableDiagGmmScaled.  OnlineDecodableDiagGmmScaled
  doesn't directly access the final OnlineFeatInputItf either.  Instead, the
  responsibility for bookkeeping and for requesting batches of features from the
  pipeline, as well as for coping with potential timeouts, is delegated to an
  object of type \ref OnlineFeatureMatrix.  The chain of function calls that
  results in pulling new features along the pipeline looks like the following:
    -# In \ref OnlineFasterDecoder::ProcessEmitting(), the decoder invokes
       \ref OnlineDecodableDiagGmmScaled::LogLikelihood().
    -# This triggers a call to \ref OnlineFeatureMatrix::IsValidFrame().  If the
       requested feature frame is not already available inside the cache
       maintained by OnlineFeatureMatrix, a new batch of features is requested by
       calling the \ref OnlineFeatInputItf::Compute() method of the last object
       in the pipeline.
    -# The Compute() method of the last object (e.g. OnlineDeltaInput in the
       example above) then calls the Compute() method of the previous object in
       the pipeline, and so on down to the first OnlineFeatInputItf object, which
       invokes OnlineAudioSource::Read().

  OnlineFasterDecoder inherits most of its implementation from FasterDecoder, but
  its Decode() method processes the feature vectors in batches of configurable
  size and returns a code corresponding to the current decoder state.  There are
  three different codes:
    - <I>kEndFeats</I> - there are no more features left in the pipeline.
    - <I>kEndUtt</I> - signals the end of the current utterance, which can be
      used, for example, to separate the utterances by newline characters, as in
      our demos, or for other application-dependent feedback.  An utterance is
      considered to be finished when a sufficiently long silence is detected in
      the traceback.  The number of silence frames that is interpreted as an
      utterance delimiter is controlled by \ref OnlineDecoderOpts::inter_utt_sil.
    - <I>kEndBatch</I> - nothing extraordinary has happened; the decoder just
      finished processing the current batch of feature vectors.

  OnlineFasterDecoder makes it possible to get recognition results as soon as
  possible by implementing partial traceback.  The back-pointers of all active
  \ref FasterDecoder::Token objects are followed until a point is reached, back
  in time, where there is only one active token.  This means that this (immortal)
  token is a common ancestor of all currently active tokens.  Moreover, by
  keeping track of the previous such token (for the previously processed
  batches), the decoder can produce a partial result by taking the output symbols
  along the traceback from the current immortal token to the previous one.

*/
}