io.dox
34.6 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
// doc/io.dox
// Copyright 2009-2011 Microsoft Corporation
// 2013 Johns Hopkins University (author: Daniel Povey)
// See ../../COPYING for clarification regarding multiple authors
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
// http://www.apache.org/licenses/LICENSE-2.0
// THIS CODE IS PROVIDED *AS IS* BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, EITHER EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION ANY IMPLIED
// WARRANTIES OR CONDITIONS OF TITLE, FITNESS FOR A PARTICULAR PURPOSE,
// MERCHANTABLITY OR NON-INFRINGEMENT.
// See the Apache 2 License for the specific language governing permissions and
// limitations under the License.
namespace kaldi {
/** \page io Kaldi I/O mechanisms
This page gives an overview of input-output mechanisms in Kaldi.
This section of the documentation is oriented towards the code-level mechanisms
for I/O; for documentation more oriented towards the command-line, see \ref io_tut.
\section io_sec_style The input/output style of Kaldi classes
Classes defined in Kaldi have a uniform interface for
I/O. The standard interface is illustrated here:
\code
class SomeKaldiClass {
public:
void Read(std::istream &is, bool binary);
void Write(std::ostream &os, bool binary) const;
};
\endcode
Notice that these return void; errors are indicated via exceptions
(see \ref error). The boolean "binary" argument indicates whether the
object should be written (or read) as binary data or text data. The calling
code must know whether we want the object to be written or read
in binary or text form (see \ref io_sec_files for how it knows this in the
case of reading). Note that this "binary" variable is not necessarily the
same as the binary or text mode the file is opened with (on Windows);
see \ref io_sec_windows for more explanation.
The Read and Write functions may have additional optional arguments.
A common case is to have a Read function of the form:
\code
class SomeKaldiClass {
public:
void Read(std::istream &is, bool binary, bool add = false);
};
\endcode
If add==true, the Read function would add whatever is on disk (e.g. statistics)
to the current class's contents, if the class is not currently empty.
\section io_sec_basic Input/output mechanisms for fundamental types and STL types
See \ref io_funcs_basic for a list of functions involved in this. We have
provided thse functions to make it easier to read and write fundamental types;
they are mostly called from the Read and Write functions of Kaldi classes.
The Kaldi classes are under no obligation to use
these functions, as long as they ensure that their Read function can read the
data that their Write function produces.
The most important functions in this category are ReadBasicType() and WriteBasicType();
these are templates that cover bool, float, double, and integer types. An example of using these
in Read and Write functions is:
\code
// we suppose that class_member_ is of type int32.
void SomeKaldiClass::Read(std::istream &is, bool binary) {
ReadBasicType(is, binary, &class_member_);
}
void SomeKaldiClass::Write(std::ostream &os, bool binary) const {
WriteBasicType(os, binary, class_member_);
}
\endcode
We have assumed that \c class_member_ is of type int32, which is a type of known
size. Using types like int with these functions is not safe. In binary mode,
these functions actually write a character that encodes the
size and signedness of integer types, and the read will fail if it doesn't match. We
could have decided to attempt to convert them automatically, but we didn't;
currently, you have to use integer types of known size in I/O (int32 is recommended for
"normal" use). Floating-point types, on the other hand, are automatically
converted. This is for ease of debugging, so you can compile with
\c -DKALDI_DOUBLE_PRECISION and still read your binary files that were written without
that option. Our I/O routines have no byte swapping; if this is a problem for you,
use the text formats.
There are also the WriteIntegerVector() and ReadIntegerVector() templated functions.
These are in the same style as the WriteBasicType() and ReadBasicType() functions, but
work for \c std::vector<I>, where I is some integer type (again, its size should
be known at compile time, e.g. int32).
Some other important low-level I/O functions are;
\code
void ReadToken(std::istream &is, bool binary, std::string *token);
void WriteToken(std::ostream &os, bool binary, const std::string & token);
\endcode
A token must be a nonempty string with no spaces, typically in practice an XML-looking
string like "<SomeKaldiClass>" or "<SomeClassMemberName>" or "</SomeKaldiClass>".
These functions do what they look like they would do. For convenience, we also
provide ExpectToken(), which is like ReadToken() except you give it the string
you expect (and it will throw an exception if it doesn't get it). Typical lines
of code invoking these are:
\code
// in writing code:
WriteToken(os, binary, "<MyClassName>");
// in reading code:
ExpectToken(is, binary, "<MyClassName>");
// or, if a class has multiple forms:
std::string token;
ReadToken(is, binary, &token);
if(token == "<OptionA>") { ... }
else if(token == "<OptionB>") { ... }
...
\endcode
There are also the WritePretty() and ExpectPretty() functions.
These are less frequently used, and they behave like the corresponding Token
functions except that they only actually read and write in text mode, and they
accept arbitrary strings (i.e. they allow spaces); the ReadPretty function also
accepts input that has differs in whitespace versus what was expected.
The Read functions in Kaldi classes never check for end of file, but are expected
to read until the end of where the Write function wrote to (in text mode,
leaving some whitespace unread doesn't matter). This is so
that multiple Kaldi objects can be put in the same file, and also allows
the archive concept (see \ref io_sec_archive) to work.
\section io_sec_files How Kaldi objects are stored in files
As we have seen above, the Kaldi reading code needs to know whether it is
reading in text or binary mode, and we don't want the user to have to keep
track of whether a given file is text or binary. For this reason,
files that contain Kaldi objects need to announce whether they contain
binary or text data. A binary Kaldi file will start with the string
"\0B"; since text files can't contain "\0", they don't need a header.
If you opened a file using standard C++ mechanisms (and you won't normally
be doing this, see \ref io_sec_opening), you would have to take care of
this header before doing anything. You could do this with
the functions InitKaldiOutputStream()
(this also sets the stream precision), and InitKaldiInputStream().
\section io_sec_opening How to open files in Kaldi
Suppose you want to load or save a Kaldi object from/to disk,
and suppose it is something like speech model (but not something
that you need many of, like speech features; for that, see \ref io_sec_tables).
You will typically use the Input and Output classes. An example is:
\code
{ // input.
bool binary_in;
Input ki(some_rxfilename, &binary_in);
my_object.Read(ki.Stream(), binary_in);
// you can have more than one object in a file:
my_other_object.Read(ki.Stream(), binary_in);
}
// output. note, "binary" is probably a command-line option.
{
Output ko(some_wxfilename, binary);
my_object.Write(ko.Stream(), binary);
}
\endcode
The purpose of the braces is to make the Input and Output objects go out of scope
as soon as we're done, so the file gets closed immediately. This might seem
a bit pointless (why not use a normal C++ stream?). The reason is so we can
support various extended types of filename. It also makes handling errors
a bit easier (the Input and Output classes will print an informative
error message and throw an exception on error). Notice the filenames have "rxfilename"
and "wxfilename" in them. We use these types of names a lot, and they are supposed
to remind the coder that these are extended filenames. We describe these entities
in the next section.
The Input and Output classes have a slightly richer interface than used in the
example code above. You can open them with Open(), and you can call Close()
rather than just letting them go out of scope. These functions return boolean
status values rather than throwing exceptions on error the way the constructors
and destructors will. The Open() functions (and the constructors) can also be
called in such a way that they don't handle the Kaldi binary header, in case
you need to read or write non-Kaldi objects. You probably won't need any of
this extra functionality.
See \ref io_group for classes and functions related to Input and Output,
and to rxfilenames and wxfilenames (next section).
\section io_sec_xfilename Extended filenames: rxfilenames and wxfilenames
The words "rxfilename" and "wxfilename" are not classes; they are descriptors that usually
appear in variable names, and they indicate the following:
- an rxfilename is a string that is to be interpreted by the Input class
as an extended filename for reading
- a wxfilename is a string that is to be interpreted by the Output class
as an extended filename for writing
The types of rxfilename are as follows:
- "-" or "" means the standard input
- "some command |" means an input piped command, i.e. we strip off the "|" and give the
rest of the string to the shell via popen().
- "/some/filename:12345" means an offset into a file, i.e. we open the file and
seek to position 12345.
- "/some/filename" ... anything not matching the patterns above is treated as a normal filename
(however, some obviously wrong things will be recognized as errors before attempting
to open them).
You can find out what type an rxfilename is using ClassifyRxfilename(), but this typically
won't be necessary.
The types of wxfilename are as follows:
- "-" or "" means the standard input
- "| some command" means an output piped command, i.e. we strip off the "|" and give the
rest of the string to the shell via popen().
- "/some/filename" ... anything not matching the patterns above is treated as a normal
filename (again, barring obvious errors).
Again, ClassifyWxfilename() tells you the type of a filename.
\section io_sec_tables The Table concept
A Table is a concept rather than actual C++ class. It consists of a collection of
objects of some known type, indexed by strings. These strings must be
tokens (a token is defined as a non-empty string without whitespaces). Typical examples
of Tables include:
- A collection of feature files (represented as Matrix<float>) indexed by utterance id
- A collection of transcriptions (represented as std::vector<int32>), indexed
by utterance id
- A collection of Constrained MLLR transforms (represented as Matrix<float>), indexed
by speaker id.
We will deal with these types of tables in more detail on the page
\subpage table_examples; here we just explain the general principles and the
internal mechanisms.
A Table can exist on disk (or indeed, in a pipe) in two possible formats: a script
file, or an archive (see below, \ref io_sec_scp and \ref io_sec_archive).
For a list of classes and types that relate to Tables, see \ref table_group.
A Table can be accessed in three ways: using a TableWriter, a
SequentialTableReader, and a RandomAccessTableReader (there is also
RandomAccessTableReaderMapped, which is a special case we will introduce later).
These are all templates; they are templated not on the
object in the table, but on a Holder type (see below, \ref io_sec_holders) that
tells the Table code how to read and write that type of object. To open
a Table type, you must provide a string called a wspecifier or rspecifier (see below, \ref
io_sec_specifiers) that tells the Table code how the table is stored on
disk and gives it various other directives. We illustrate this with some example code.
This code reads features, linearly transforms them and writes them out.
\code
std::string feature_rspecifier = "scp:/tmp/my_orig_features.scp",
transform_rspecifier = "ark:/tmp/transforms.ark",
feature_wspecifier = "ark,t:/tmp/new_features.ark";
// there are actually more convenient typedefs for the types below,
// e.g. BaseFloatMatrixWriter, SequentialBaseFloatMatrixReader, etc.
TableWriter<BaseFloatMatrixHolder> feature_writer(feature_wspecifier);
SequentialTableReader<BaseFloatMatrixHolder> feature_reader(feature_rspecifier);
RandomAccessTableReader<BaseFloatMatrixHolder> transform_reader(transform_rspecifier);
for(; !feature_reader.Done(); feature_reader.Next()) {
std::string utt = feature_reader.Key();
if(transform_reader.HasKey(utt)) {
Matrix<BaseFloat> new_feats(feature_reader.Value());
ApplyFmllrTransform(new_feats, transform_reader.Value(utt));
feature_writer.Write(utt, new_feats);
}
}
\endcode
The nice thing about this setup is that the code that accesses the tables
can treat them as generic maps or lists. The format of the data and
other aspects of the reading process (e.g., its error tolerance) can be
controlled by options in the rspecifiers and wspecifiers and does not
have to be handled by the calling code; in the example above,
the option ",t" tells it to write the data in text form.
The Platonic ideal of a Table would probably be a map from a string to the object.
However, as long as we're not doing random access on a particular table, the
code will not complain if it contains duplicate entries for a particular string
(i.e. for writing and sequential access, it behaves more like a list of pairs).
For a list of typedefs corresponding to Table types to read and write
specific types, see \ref table_types.
\section io_sec_scp The Kaldi script-file format
A script file (perhaps slightly misnamed) is a text file where each line
will typically contain something like:
\verbatim
some_string_identifier /some/filename
\endverbatim
Another valid line in a script file would be:
\verbatim
utt_id_01002 gunzip -c /usr/data/file_010001.wav.gz |
\endverbatim
The general form of these lines is:
\verbatim
<key> <rxfilename>
\endverbatim
\subsection io_sec_scp_range Ranges in script-file lines (for taking sub-parts of matrices)
We also allow an optional 'range-specifier' to appear after the rxfilename;
this is useful for representing parts of matrices, such as row ranges.
Ranges are currently not supported for any data types other than matrices.
For example, we can express a row range of a matrix as follows:
\verbatim
utt_id_01002 foo.ark:89142[0:51]
\endverbatim
which means rows 0 through 51 (inclusive) of the matrix.
Both row and column ranges may be expressed, e.g.
\verbatim
utt_id_01002 foo.ark:89142[0:51,89:100]
\endverbatim
and if you just want to express a column range, you can leave the row-range blank, as follows:
\verbatim
utt_id_01002 foo.ark:89142[,89:100]
\endverbatim
\subsection io_sec_scp_details How Kaldi processes lines of scp files
When reading a line of script file, Kaldi will trim off leading and trailing whitespace,
and then split the line on the first region of whitespace. The first part
becomes the key into the table (e.g. the utterance id, in this case "utt_id_01001"),
and the second part (after stripping off the optional range-specifier)
becomes the xfilename (by which we mean an wxfilename or rxfilename, in
this case "gunzip -c /usr/data/file_010001.wav.gz |").
An empty line or an empty xfilename is not allowed. A script file may be
valid for reading or writing or both, depending whether the xfilenames are
valid rxfilenames, or wxfilenames, or both.
Note: once the optional ranges are stripped off,
the (r,x)filenames that appear on lines of script files may generally be given
to any Kaldi program in the same way you'd give a filename. This is even
true of rspecifiers that contain byte offsets, like foo.ark:8432. The byte offsets
will point to the beginning of the data of the object (not to the key-value that
precedes the data in the archive). For binary data, the byte offset points to
the "\0B" that precedes the object; this allows the reading code to ascertain
that the data is binary before it reads the object.
\section io_sec_archive The Kaldi archive format
The Kaldi archive format is quite simple. First recall that a token is defined
as a whitespace-free string. The archive format could be described as:
\verbatim
token1 [something]token2 [something]token3 [something] ....
\endverbatim
We can describe this as zero or more repetitions of: (a token; then a
space character; then the result of calling the Write function of the Holder).
Recall that the Holder is an object that tells the Table code how to read or
write something.
When writing Kaldi objects, the [something] written by the Holder will constist
of the binary-mode header (if binary), and then the result of calling the Write
function of the object. When writing non-Kaldi objects that are simple (like
int32 or float or vector<int32>), the Holder classes that we write generally
ensure that in the text format, the [something] is a newline-terminated string.
That way, the archive has a nice one-line-per-entry format that looks
superfically like a script file, for instance:
\verbatim
utt_id_1 5
utt_id_2 7
...
\endverbatim
is the text archive format we use for storing integers.
The archive format is such that you can concatenate archives together and they
will still be a valid archive (assuming they hold the same type of object). The
format has been designed to be pipe-friendly, i.e. you can put an archive in a pipe
and the program reading it won't have to wait till the end of the pipe before
it can process the data. For efficient random access into archives it's possible
to simultaneously write an archive to disk together with a script file that contains
offsets into the archive. For this, see the next section.
\section io_sec_specifiers Specifying Table formats: wspecifiers and rspecifiers
The Table classes require a string that is passed to the constructor or to the
Open method. This string is called a wspecifier if passed to the TableWriter
class, or a rspecifier if passed to the RandomAccessTableReader or SequentialTableReader
classes. Examples of valid rspecifiers and wspecifiers include:
\code
std::string rspecifier1 = "scp:data/train.scp"; // script file.
std::string rspecifier2 = "ark:-"; // archive read from stdin.
// write to a gzipped text archive.
std::string wspecifier1 = "ark,t:| gzip -c > /some/dir/foo.ark.gz";
std::string wspecifier2 = "ark,scp:data/my.ark,data/my.scp";
\endcode
Usually, an rspecifier or wspecifier consists of a comma-separated, unordered
list of one or two-letter options and one of the strings "ark" and "scp",
followed by a colon, followed by an rxfilename or wxfilename respectively.
The order of options before the colon doesn't matter.
\subsection io_sec_specifiers_both Writing an archive and a script file simultaneously
There is a special case available for wspecifiers: they can "ark,scp" before the
colon, and after the colon, a wxfilename for writing the archive, then a comma,
then a wxfilename (for the script file). For example,
\verbatim
"ark,scp:/some/dir/foo.ark,/some/dir/foo.scp"
\endverbatim
This will write an archive, and a
script file with lines like "utt_id /somedir/foo.ark:1234" that specify offsets into the
archive for more efficient random access. You can then do whatever you like with
the script file, including breaking it up into segments, and it will behave like
any other script file. Note that although the order of options before the colon
doesn't generally matter, in this particular case the "ark" must come before
the "scp"; this is in order to prevent confusion about the order of the
two wxfilenames after the colon (the archive always comes first). The wxfilename
that specifies the archive should be a normal filename or otherwise the script file that gets
written won't be directly readable by Kaldi, but the code doesn't enforce this.
\subsection io_sec_wspecifiers Valid options for wspecifiers
The allowable wspecifier options are:
- "b" (binary) means write in binary mode (currently unnecessary as it's always the default).
- "t" (text) means write in text mode.
- "f" (flush) means flush the stream after each write operation.
- "nf" (no-flush) means don't flush the stream after each write operation (would currently
be pointless, but calling code can change the default).
- "p" means permissive mode, which affects "scp:" wspecifiers where the scp
file is missing some entries: the "p" option will cause it to silently
not write anything for these files, and report no error.
Examples of wspecifiers using a lot of options are
\verbatim
"ark,t,f:data/my.ark"
"ark,scp,t,f:data/my.ark,|gzip -c > data/my.scp.gz"
\endverbatim
\subsection io_sec_rspecifiers Valid options for rspecifiers
When reading the options below, bear in mind the code that reads archives can
never seek in the archive, in case the archive is actually a pipe (and it very
often is). If a RandomAccessTableReader is reading an archive, the reading
code may have to store many objects in memory just in case they are requested
again later, or it may have to seek to the end of an archive while looking for
a key that was not actually present in the archive. Some of the options below
represent ways to prevent this.
The important rspecifier options are:
- "o" (once) is the user's way of asserting to the RandomAccessTableReader code
that each key will be queried only once. This stops it
from having to keep already-read objects in memory just in case they are needed again.
- "p" (permissive) instructs the code to ignore errors and just provide what
data it can; invalid data is treated as not existing. In scp files,
this means that a query to HasKey() forces the load of the corresponding file,
so the code can know to return false if the file is corrupt. In archives,
this option
stops exceptions from being raised if the archive is corrupted or truncated
(it will just stop reading at that point).
- "s" (sorted) instructs the code that the keys in an archive being read are in
sorted string order. For RandomAccessTableReader, this means that when HasKey() is
called for some key not in the archive, it can return false as soon as it
encounters a "higher" key; it won't have to read till the end.
- "cs" (called-sorted) instructs the code that the calls to HasKey() and Value()
will be in sorted string order. Thus, if one of these functions is called for
some string, the reading code can discard the objects for lower-numbered keys.
This saves memory. In effect, "cs" represents the user's assertion that some other
archive that the program may be iterating over, is itself sorted.
If the user provides any of these options wrongly, e.g. provides the "s" option for
an archive that is not actually sorted, the RandomAccessTableReader code will make
a best-effort attempt to detect this error and crash.
The following options are included for symmetry and convenience but are
not very useful at the moment.
- "no" (not-once) is the opposite of "o" (in current code,
this would never have any effect).
- "np" (not-permissive) is the opposite of "p" (in current code,
this would never have any effect).
- "ns" (not-sorted) is the opposite of "s" (in current code,
this would never have any effect).
- "ncs" (not-called-sorted) is the opposite of "cs" (in current code,
this would never have any effect).
- "b" (binary) does nothing but is allowed for scripting convenience.
- "t" (text) does nothing but is allowed for scripting convenience.
Typical examples of rspecifiers using a lot of options are:
\verbatim
"ark:o,s,cs:-"
"scp,p:data/my.scp"
\endverbatim
\section io_sec_holders Holders as helpers to Table classes
As mentioned before, the Table classes i.e. TableWriter, RandomAccessTableReader
and SequentialTableReader, are templated on a Holder class. Holder is not an actual
class or base class but describes a category of classes, and these have been given names ending in Holder,
e.g. TokenHolder or KaldiObjectHolder. (KaldiObjectHolder is a generic Holder that
may be templated on any class satisfying that Kaldi I/O style described
in \ref io_sec_style). We have written the template class GenericHolder, which is not intended
to be used, in order to document the properties that the Holder classes must satisfy.
The type of the class "held" by the Holder class is a typedef Holder::T (where Holder is
the name of the actual Holder class in question).
A list of the available holder types may be found in \ref holders.
\section io_sec_windows How the binary/text mode relates to the file open mode
This section is only relevant on the Windows platform. The general rule is
that when writing, the file mode will always match the "binary" argument to the
Write function; when reading binary data, the file mode will always be
binary, but when reading text data, the file mode may be binary or text (thus
the text-mode reading functions must always accept the extra '\\r' characters
that Windows inserts). This is because we don't always know until we open a
file, whether its contents are binary or text and so when unsure, we open
in binary mode.
\section io_sec_bloat Avoiding memory bloat when reading archives in random-access mode
When large archives are read in random access mode by the Table code, there is a
potential for memory bloat. This potentially occurs whenever an object of type
RandomAccessTableReader<SomeHolder> reads in an archive. The Table code is
written so as to first and foremost ensure correctness, so when reading an
archive in random access mode, unless you give the Table code some additional
information (which we will discuss below), it can never throw away any object it
has read in case you ask for it again. An obvious questions here is: why
doens't the Table code simply keep track of the position in the file at which
each object starts, and fseek() to that location when needed? We have not
implemented this, and the reason is as follows: the only situation that you can
fseek() is when the archive being read is an actual file (i.e. not a piped
command or the standard input). If the archive was an actual file on disk, you
could have written it out with an attached scp file containing offsets into the
file (using the "ark,scp:" prefix, see \ref io_sec_specifiers_both), and then
provided that scp file to the program that needs to read the archive. This
would be almost as time-efficient as reading the archive directly, since the
code that reads in scp files is smart enough to avoid reopening files when not
needed and calling fseek() unnecessarily. So treating file archives as a
special case and caching offsets into the file would not solve any problems.
There are two separate problems that can happen when you read an archive in random
access mode; these can both happen if you use just the "ark:" prefix with no
additional options.
- If you ask for a key that is not present in the archive, the reading code
is forced to read till the end of the archive to make sure it is not there.
- Every time the code reads an object, it is forced to keep it in memory in case
you ask for it later.
With regard to the first problem (having to read till the end of the file),
the way you can avoid this is to assert that the archive is sorted on key (using
the normal string sorted order that "C" uses, and that the program "sort" uses
if you do "export LC_ALL=C"). You can do this using the "s" option when reading
archives: for example, the rspecifier "ark,s:-" instructs the code to read the
standard input as an archive and expect it to be in sorted order. The Table code
checks that what you have asserted is actually true, and will crash if not.
Of course, you have to set up your scripts in such a way that the archives are
actually sorted on key (usually this will be done in the initial feature-extraction
stage).
With regard to the second problem (being forced to keep things in memory in
case needed later), there are two solutions.
- The first solution, which is
a rather brittle solution, is to provide the "once" option;
for example, the rspecifier "ark,o:-" reads in from the standard input and asserts
that you will only ask for each object once. To be able to assert this you would
have to know something about how the program in question works and you would probably
have to know that some other Table provided to the program does not contain any
repeated keys (yes, Tables can have repeated keys as long as they are only accessed
in sequential mode).
If you provide the "o" option the Table can deallocate objects after they have been
accessed. However, this only works well if your archives are perfectly synchronized with
no gaps or missing elements. For example, suppose you execute the command:
\verbatim
some-program ark:somedir/some.ark "ark,o:some command|"
\endverbatim
The program "some-program" will first iterate sequentially over the archive "somedir/some.ark"
and then for each key it encounters, access the second archive via random access.
Note that the order of command-line arguments is not arbitrary: we have tried to
adopt the convention that rspecifiers that will be accessed sequentially appear
before those that will be accessed via random access.
Suppose the two archives are mostly synchronized but may have gaps (i.e. missing keys,
e.g. due to failures in feature extraction, data alignment, and so on).
Any time there
is a gap in the first archive, the program will have to cache the associated object
from the second archive because it doesn't know that it won't be called for later
(it can only throw away an object once you have read it). Gaps in the second
archive are more serious, because if there is a gap of even one element, when
the program asks for that key it will have to read right till the end of the
second archive to look for it, and will have to cache all objects along the way.
- The second solution, which is more robust, is to use the "called-sorted" (cs) option.
This asserts that the objects will be requested in sorted order, and again this
requires knowledge of how the program works, plus that any sequentially accessed
archives are in sorted order. The "cs" option is normally most useful in conjunction
with the "s" option. Suppose we execute the following command:
\verbatim
some-program ark:somedir/some.ark "ark,s,cs:some command|"
\endverbatim
We assume that both archives are in sorted order, and the the program does
sequential access on the first archive and random access on the second.
This is now robust to gaps
in the archives. First imagine there is a gap in the first archive (e.g., its keys
are 001, 002, 003, 081, 082, ...). When the second archive is searched for key 081 right
after key 003, the code that reads the
second archive will encounter keys 004, 005, and so on, but it can discard the associated
objects because it knows that no key before 081 will be asked for again (thanks to the "cs" option).
If there is a gap in the second archive, it can use the fact that the second archive is sorted
to avoid searching till the end of the file (this is the job of the "s" option).
\subsection io_sec_mapped
In order to condense a particular code pattern that was recurring in many programs, we have introduced the template type
RandomAccessTableReaderMapped. Unlike RandomAccessTableReader, this takes two initializer arguments, for instance:
\verbatim
std::string rspecifier, utt2spk_map_rspecifier; // get these from somewhere.
RandomAccessTableReaderMapped<BaseFloatMatrixHolder> transform_reader(rspecifier,
utt2spk_map_rspecifier);
\endverbatim
If utt2spk_map_rspecifier is the empty string, this will behave just like a
regular RandomAccessTableReader. If it is nonempty, e.g. ark:data/train/utt2spk,
it will read an utterance-to-speaker map from that location and whenever a particular
string e.g. utt1 is queried, it will use that map to convert the utterance-id
to a speaker-id (e.g. spk1) and use that as the key to query the table being
read from rspecifier. The utterance-to-speaker map is also an archive
because it happens that the Table code is the easiest way to read in such maps.
*/
/**
\defgroup io_funcs_basic "Low-level I/O functions"
These functions are provided to write fundamental types, strings, and a few STL types
to and from C++ streams; see \ref io_sec_basic for how this fits into the bigger picture
of Kaldi-style I/O.
\defgroup holders "Holder types"
Holder types are types that are used as template arguments to the Table types
(see \ref table_group), and which help the Table types to read and write the object of type SomeHolder::T;
see \ref io_sec_holders for more information.
\defgroup table_group "Table types and related functions"
This group is for classes and functions relatied to Tables; see also
\ref table_impl_types and \ref table_types, and for a description
of the Table concept see \ref io_sec_tables.
\defgroup table_impl_types "Implementation classes for Table types"
This group is for classes that implement specific ways of reading and
writing Tables; see also \ref table_group, \ref table_types, \ref
table_types, and for a description of the Table concept see \ref io_sec_tables.
\defgroup table_types "Specific Table types"
This group is for typedefs that define specific instantiations of
Table types, for various kinds of access to collections of various
kinds of types, indexed by strings;
for a description of the Table concept see \ref io_sec_tables.
\defgroup io_group "Classes for opening streams"
This group contains the Input and Output classes, which are provided
to open streams for reading and writing in Kaldi code; for an explanation
of how this fits into the bigger picture of Kaldi I/O, see \ref io_sec_opening.
*/
}