// doc/data_prep.dox

// Copyright 2012  Johns Hopkins University (author: Daniel Povey)

// See ../../COPYING for clarification regarding multiple authors
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at

//  http://www.apache.org/licenses/LICENSE-2.0

// THIS CODE IS PROVIDED *AS IS* BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, EITHER EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION ANY IMPLIED
// WARRANTIES OR CONDITIONS OF TITLE, FITNESS FOR A PARTICULAR PURPOSE,
// MERCHANTABLITY OR NON-INFRINGEMENT.
// See the Apache 2 License for the specific language governing permissions and
// limitations under the License.

/**
 \page data_prep  Data preparation

  \section data_prep_intro Introduction

  After running the example scripts (see \ref tutorial), you may want to set up
  Kaldi to run with your own data.  This section explains how to prepare the data.
  This page will assume that you are using the latest version of the example scripts
  (typically named "s5" in the example directories, e.g. egs/rm/s5/).
  In addition to this page, you can refer to the data preparation scripts in those
 directories.  The top-level run.sh scripts (e.g. egs/rm/s5/run.sh) have a few commands at
 the top of them that relate to various phases of data preparation.  The parts in
 the sub-directory named local/ are always specific to the database.  For example,
 in the Resource Management (RM) setup it is local/rm_data_prep.sh.  In the case of
 RM these commands are:
\verbatim
local/rm_data_prep.sh /export/corpora5/LDC/LDC93S3A/rm_comp || exit 1;

utils/prepare_lang.sh data/local/dict '!SIL' data/local/lang data/lang || exit 1;

local/rm_prepare_grammar.sh || exit 1;
\endverbatim

In the WSJ case the commands are:
\verbatim

wsj0=/export/corpora5/LDC/LDC93S6B
wsj1=/export/corpora5/LDC/LDC94S13B

local/wsj_data_prep.sh $wsj0/??-{?,??}.? $wsj1/??-{?,??}.?  || exit 1;

local/wsj_prepare_dict.sh || exit 1;

utils/prepare_lang.sh data/local/dict "<SPOKEN_NOISE>" data/local/lang_tmp data/lang || exit 1;

local/wsj_format_data.sh || exit 1;
\endverbatim
There are more commands after these in the WSJ script that relate to training language
models locally (rather than using the ones supplied by LDC), but the ones above are the most
important ones.


The output of the data preparation stage consists of two sets of things.  One relates
to "the data" (directories like data/train/) and one relates to "the language"
(directories like data/lang/).  The "data" part relates to the specific recordings you
have, and the "lang" part contains things that relate more to the language itself,
such as the lexicon, the phone set, and various extra information about the phone set
that Kaldi needs.  If you want to prepare data which you will decode with an
already existing system and an already existing language model, the "data" part is
all you need to touch.

\section data_prep_data Data preparation-- the "data" part.

As an example of the "data" part of the data preparation, look at the directory
"data/train" in one of the example directories (assuming you have already run
the scripts there).  Note: there is nothing special about the directory name
"data/train".  There are other directories such as "data/eval2000" (for a test set)
that have essentially the same format ("essentially" because we may have an "stm" and
"glm" file in the test directory, to enable sclite scoring).
The specific example we'll look at is the Switchboard recipe
in egs/swbd/s5.
\verbatim
s5# ls data/train
cmvn.scp  feats.scp  reco2file_and_channel  segments  spk2utt  text  utt2spk  wav.scp
\endverbatim
Not all of the files are equally important.  For a simple setup where there is no
"segmentation" information (i.e. each utterance corresponds to a single file), the only
files you have to create yourself are "utt2spk", "text" and "wav.scp" and possibly
"segments" and "reco2file_and_channel", and the rest will be created by standard scripts.

We will describe the files in this directory, starting with the files you need to create
yourself.

\subsection data_prep_data_yourself Files you need to create yourself

 The file "text" contains the transcriptions of each utterance.
\verbatim
s5# head -3 data/train/text
sw02001-A_000098-001156 HI UM YEAH I'D LIKE TO TALK ABOUT HOW YOU DRESS FOR WORK AND
sw02001-A_001980-002131 UM-HUM
sw02001-A_002736-002893 AND IS
\endverbatim
The first element on each line is the utterance-id, which is an arbitrary text string,
but if you have speaker information in your setup, you should make the speaker-id a
prefix of the utterance id; this is important for reasons relating to the sorting of
these files.  The rest of the line is the transcription of that utterance.  You don't
have to make sure that all words in this file are in your vocabulary; out-of-vocabulary words will
get mapped to a word specified in the file data/lang/oov.txt.

It needs to be the case that when you sort both the utt2spk and spk2utt files,
the orders "agree", i.e. the list of speaker-ids extracted from the sorted utt2spk file
is itself in string-sorted order.  The easiest way to make this happen is
to make the speaker-ids a prefix of the utterance-ids.  Although in this particular
example we have used an underscore to separate the "speaker" and "utterance"
parts of the utterance-id, in general it is probably safer to use a dash ("-").
This is because it has a lower ASCII value; if the speaker-ids vary in length,
in certain cases the speaker-ids and their corresponding utterance ids can end
up being sorted in different orders when using the standard "C"-style ordering
on strings, which will lead to a crash.

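
The effect of the separator can be seen in a small experiment.  The following is
a minimal sketch, assuming two hypothetical speaker-ids "spk1" and "spk10" (one is
a prefix of the other) and one utterance per speaker; it sorts the utterance-ids
in "C" order and prints the speaker order that results.

```shell
#!/bin/sh
# Sketch: compare "_" and "-" as the speaker/utterance separator.
# The speaker-ids spk1 and spk10 are hypothetical.
export LC_ALL=C

speaker_order() {
  # $1 is the separator; print the speakers in the order implied by
  # sorting the utterance-ids.
  printf '%s\n' "spk1${1}utt1 spk1" "spk10${1}utt1 spk10" |
    sort | awk '{ print $2 }' | paste -sd ' ' -
}

with_underscore=$(speaker_order _)
with_dash=$(speaker_order -)
echo "underscore: $with_underscore"   # spk10 sorts first, unlike the sorted speaker list
echo "dash:       $with_dash"         # spk1 sorts first, matching the sorted speaker list
```

With the underscore, "0" sorts before "_", so the utterance of spk10 comes first even
though spk1 sorts before spk10; with the dash the two orders agree.
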
Another important file is <DFN>wav.scp</DFN>.  In the Switchboard example,
\verbatim
s5# head -3 data/train/wav.scp
sw02001-A /home/dpovey/kaldi-trunk/tools/sph2pipe_v2.5/sph2pipe -f wav -p -c 1 /export/corpora3/LDC/LDC97S62/swb1/sw02001.sph |
sw02001-B /home/dpovey/kaldi-trunk/tools/sph2pipe_v2.5/sph2pipe -f wav -p -c 2 /export/corpora3/LDC/LDC97S62/swb1/sw02001.sph |
\endverbatim
The format of this file is
\verbatim
<recording-id> <extended-filename>
\endverbatim
where the "extended-filename" may be
an actual filename, or as in this case, a command that extracts a wav-format file.  The pipe symbol
on the end of the extended-filename specifies that it is to be interpreted as a pipe.  We will
explain what "recording-id" is below, but we would first like to point out that if the "segments" file
does not exist, the first token on each line of the "wav.scp" file is just the utterance id.
The files in wav.scp must be single-channel (mono); if the underlying wav files have multiple
channels, then a sox command must be used in the wav.scp to extract a particular channel.
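
As an illustration of the last point, the following sketch writes wav.scp entries
that each select one channel of a stereo file.  The recording names and the
directory /path/to/audio are hypothetical, and the sox invocation ("remix N" to
keep channel N, "-t wav -" to write wav data to standard output) is one common
way to do this rather than the only one.

```shell
#!/bin/sh
# Sketch: write wav.scp entries that pick one channel of a stereo file.
# Recording names and the /path/to/audio directory are hypothetical.
export LC_ALL=C
wav_scp=$(mktemp)
for rec in rec2 rec1; do
  # "remix N" keeps only channel N; "-t wav -" writes wav data to stdout,
  # and the trailing "|" marks the entry as a pipe.
  echo "${rec}-A sox /path/to/audio/${rec}.wav -t wav - remix 1 |"
  echo "${rec}-B sox /path/to/audio/${rec}.wav -t wav - remix 2 |"
done | sort > "$wav_scp"
cat "$wav_scp"
```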

In the Switchboard setup we have the "segments" file, so we'll discuss this next.
\verbatim
s5# head -3 data/train/segments
sw02001-A_000098-001156 sw02001-A 0.98 11.56
sw02001-A_001980-002131 sw02001-A 19.8 21.31
sw02001-A_002736-002893 sw02001-A 27.36 28.93
\endverbatim
The format of the "segments" file is:
\verbatim
<utterance-id> <recording-id> <segment-begin> <segment-end>
\endverbatim
where the segment-begin and segment-end are measured in seconds.
These specify time offsets into a recording.  The "recording-id"
is the same identifier as is used in the "wav.scp" file-- again, this is
an arbitrary identifier that you can choose.
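
If you have a list of recording-ids with begin and end times, lines in this format
can be generated mechanically.  The following is a minimal sketch; the utterance-id
convention it uses (times in centiseconds, zero-padded to six digits) is inferred
from the example above and is an assumption, not something the format requires.

```shell
#!/bin/sh
# Sketch: build "segments" lines from "<recording-id> <begin> <end>".
# The utterance-id convention (centiseconds, zero-padded to six digits)
# is inferred from the example output, not taken from the recipe.
segments=$(printf '%s\n' "sw02001-A 0.98 11.56" "sw02001-A 19.8 21.31" |
  awk '{
    b = int($2 * 100 + 0.5)   # round to whole centiseconds
    e = int($3 * 100 + 0.5)
    if (e <= b) { print "bad segment: " $0 > "/dev/stderr"; next }
    printf "%s_%06d-%06d %s %s %s\n", $1, b, e, $1, $2, $3
  }')
echo "$segments"
```
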
The file "reco2file_and_channel" is only used when scoring (measuring
error rates) with NIST's "sclite" tool:
\verbatim
s5# head -3 data/train/reco2file_and_channel
sw02001-A sw02001 A
sw02001-B sw02001 B
sw02005-A sw02005 A
\endverbatim
The format is:
\verbatim
<recording-id> <filename> <recording-side (A or B)>
\endverbatim
The filename is typically the name of the .sph file, without the suffix, but in
general it's whatever identifier you have in your "stm" file.
The recording side is a concept that relates to telephone conversations where there are
two channels; if your recordings are not like this, it's probably safe to use "A". If you don't have
an "stm" file or you have no idea what this is all about, then you don't need
the "reco2file_and_channel" file.

The last file you need to create yourself is the "utt2spk" file.  This says, for each
utterance, which speaker spoke it.
\verbatim
s5# head -3 data/train/utt2spk
sw02001-A_000098-001156 2001-A
sw02001-A_001980-002131 2001-A
sw02001-A_002736-002893 2001-A
\endverbatim
The format is
\verbatim
<utterance-id> <speaker-id>
\endverbatim
Note that the speaker-ids don't need to correspond in any very accurate sense
to the names of actual speakers-- they simply need to represent a reasonable guess.
In this case we assume each conversation side (each side of the telephone conversation)
corresponds to a single speaker.  This is not entirely true -- sometimes one person
may hand the phone to another person, or the same person may be speaking in multiple
calls -- but it's good enough for our purposes.  <b> If you have no information at all about
the speaker identities, you can just make the speaker-ids the same as the utterance-ids </b>,
so the format of the file would be just <DFN>\<utterance-id\> \<utterance-id\></DFN>.
We have made the previous sentence bold because we have encountered people creating
a "global" speaker-id.  This is a bad idea because it makes cepstral mean normalization
ineffective in training (since it's applied globally), and because it will create problems
when you use utils/split_data_dir.sh to split your data into pieces.
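
For the case with no speaker information, the utt2spk file can be derived directly
from the "text" file.  A minimal sketch, using a temporary directory and two
hypothetical utterances:

```shell
#!/bin/sh
# Sketch: no speaker information, so each utterance is its own "speaker".
# The two utterance-ids and transcripts are hypothetical.
export LC_ALL=C
dir=$(mktemp -d)
printf '%s\n' "utt002 GOOD MORNING" "utt001 HELLO THERE" | sort > "$dir/text"
# Repeat the utterance-id in the speaker-id column.
awk '{ print $1, $1 }' "$dir/text" > "$dir/utt2spk"
cat "$dir/utt2spk"
```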

There is another file that exists in some setups; it is used only occasionally and
not in the Kaldi system build.  We show what it looks like in the Resource Management
(RM) setup:
\verbatim
s5# head -3 ../../rm/s5/data/train/spk2gender
adg0 f
ahh0 m
ajp0 m
\endverbatim
This file maps from speaker-id to either "m" or "f" depending on the speaker gender.

All of these files should be sorted.  If they are not sorted, you will get errors
when you run the scripts.  In \ref io_sec_tables we explain why this is needed.
It has to do with the I/O framework; the ultimate reason for the sorting is to
enable something equivalent to random-access lookup on a stream that doesn't support
fseek(), such as a piped command.  Many Kaldi programs are reading multiple pipes
from other Kaldi commands, reading different types of object, and are doing something
roughly comparable to merge-sort
on the different inputs; merge-sort, of course, requires that the inputs be sorted.
Be careful when you sort that you have the shell variable LC_ALL defined as "C",
for example (in bash),
\verbatim
export LC_ALL=C
\endverbatim
If you don't do this, the files will be sorted in an order that's different from how
C++ sorts strings, and Kaldi will crash.  You have been warned!
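
A quick way to check the ordering yourself is "sort -c", which tests whether a file
is sorted without changing it.  The sketch below uses a temporary directory with
hypothetical contents; it is not a replacement for the validation script described
later on this page.

```shell
#!/bin/sh
# Sketch: verify that data files are sorted in "C" order.
# The directory and its contents are hypothetical.
export LC_ALL=C
dir=$(mktemp -d)
printf '%s\n' "spk1-utt1 spk1" "spk10-utt1 spk10" > "$dir/utt2spk"   # sorted
printf '%s\n' "spk2-utt1 HI" "spk1-utt1 HELLO" > "$dir/text"         # not sorted

unsorted=""
for f in utt2spk text; do
  # "sort -c" only checks the order; it exits non-zero if the file is unsorted.
  if ! sort -c "$dir/$f" 2>/dev/null; then
    unsorted="$unsorted $f"
  fi
done
echo "files needing a re-sort:$unsorted"
```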

If your data consists of a test set from NIST that has an "stm" and a "glm" file
provided so that you can measure WER, then you can put these files in the data
directory with the names "stm" and "glm".  Note that we put the scoring
script (which measures WER) in <DFN>local/score.sh</DFN>, which means it is
specific to the setup; not all of the scoring scripts in all of the setups will
recognize the stm and glm file.  An example of a scoring script that uses those files is
the one in the Switchboard setup, i.e. <DFN>egs/swbd/s5/local/score_sclite.sh</DFN>,
which is invoked by the top-level scoring script
<DFN>egs/swbd/s5/local/score.sh</DFN> if it notices that your test set has the
stm and glm files.

\subsection data_prep_data_noneed Files you don't need to create yourself

The other files in this directory can be generated from the files you provide.
You can create the "spk2utt" file by a command like the following
(this one is extracted from egs/rm/s5/local/rm_data_prep.sh)
\verbatim
utils/utt2spk_to_spk2utt.pl data/train/utt2spk > data/train/spk2utt
\endverbatim
This is possible because the utt2spk and spk2utt files contain exactly
the same information; the format of the spk2utt file is
<DFN>\<speaker-id\> \<utterance-id1\> \<utterance-id2\> ...</DFN>.
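
To make the relationship between the two formats concrete, here is a minimal awk
sketch of the conversion.  It is an illustration only, not the
utils/utt2spk_to_spk2utt.pl script itself, and the third input line (speaker 2001-B)
is a hypothetical entry added so that there are two speakers.

```shell
#!/bin/sh
# Sketch of the utt2spk -> spk2utt conversion in awk; this is an
# illustration, not the utils/utt2spk_to_spk2utt.pl script itself.
spk2utt=$(printf '%s\n' \
  "sw02001-A_000098-001156 2001-A" \
  "sw02001-A_001980-002131 2001-A" \
  "sw02001-B_000000-000100 2001-B" |
  awk '{
    if (!($2 in utts)) order[++n] = $2   # remember first-seen speaker order
    utts[$2] = utts[$2] " " $1           # append the utterance to its speaker
  }
  END { for (i = 1; i <= n; i++) print order[i] utts[order[i]] }')
echo "$spk2utt"
```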

Next we come to the <DFN>feats.scp</DFN> file.
\verbatim
s5# head -3 data/train/feats.scp
sw02001-A_000098-001156 /home/dpovey/kaldi-trunk/egs/swbd/s5/mfcc/raw_mfcc_train.1.ark:24
sw02001-A_001980-002131 /home/dpovey/kaldi-trunk/egs/swbd/s5/mfcc/raw_mfcc_train.1.ark:54975
sw02001-A_002736-002893 /home/dpovey/kaldi-trunk/egs/swbd/s5/mfcc/raw_mfcc_train.1.ark:62762
\endverbatim
This points to the extracted features-- MFCC features in this case, because
that is what we use in this particular script.  The format is:
\verbatim
<utterance-id> <extended-filename-of-features>
\endverbatim
Each of the feature files contains a matrix, in Kaldi format.
In this case the dimension of the matrix would be (the length of the file in 10ms intervals) by 13.
The "extended filename" <DFN>/home/dpovey/kaldi-trunk/egs/swbd/s5/mfcc/raw_mfcc_train.1.ark:24</DFN>
means, open the "archive" file /home/dpovey/kaldi-trunk/egs/swbd/s5/mfcc/raw_mfcc_train.1.ark, fseek()
to position 24, and read the data that's there.
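
The "path:offset" convention can be imitated with standard shell tools.  The sketch
below is only an analogy: it uses a dummy text file rather than a real Kaldi archive,
and "tail -c" stands in for the fseek() that the Kaldi reader performs.

```shell
#!/bin/sh
# Sketch: split "<archive>:<offset>" and read from that byte offset.
# The archive here is a dummy text file, not a real Kaldi archive.
ark=$(mktemp)
printf 'utt1 AAAA\nutt2 BBBB\n' > "$ark"
entry="${ark}:15"

path=${entry%:*}      # everything before the last ":"
offset=${entry##*:}   # everything after the last ":"

# tail -c +N starts at byte N (1-based), so add 1 to the 0-based offset.
data=$(tail -c "+$((offset + 1))" "$path")
echo "path=$path offset=$offset data=$data"
```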

This feats.scp file is created by the command
\verbatim
steps/make_mfcc.sh --nj 20 --cmd "$train_cmd" data/train exp/make_mfcc/train $mfccdir
\endverbatim
which is invoked by the top-level "run.sh" script.  For the definitions of the
shell variables, see that script.  <DFN>\$mfccdir</DFN> is a user-specified directory where the
.ark files will be written.

The last file in the directory data/train is "cmvn.scp".  This contains statistics
for cepstral mean and variance normalization, indexed by speaker.  Each set of
statistics is a matrix, of dimension 2 by 14 in this case.  In our example, we have:
\verbatim
s5# head -3 data/train/cmvn.scp
2001-A /home/dpovey/kaldi-trunk/egs/swbd/s5/mfcc/cmvn_train.ark:7
2001-B /home/dpovey/kaldi-trunk/egs/swbd/s5/mfcc/cmvn_train.ark:253
2005-A /home/dpovey/kaldi-trunk/egs/swbd/s5/mfcc/cmvn_train.ark:499
\endverbatim
Unlike feats.scp, this scp file is indexed by speaker-id, not utterance-id.
This file is created by a command such as this:
\verbatim
steps/compute_cmvn_stats.sh data/train exp/make_mfcc/train $mfccdir
\endverbatim
(this example is from <DFN>egs/swbd/s5/run.sh</DFN>).

Because errors in data preparation can cause problems later on, we have a script to
check that the data directory is correctly formatted.  Run e.g.
\verbatim
utils/validate_data_dir.sh data/train
\endverbatim
You may also find the following command useful:
\verbatim
utils/fix_data_dir.sh data/train
\endverbatim
(of course the command will work for any data directory, not just data/train).  This
script will fix sorting errors and will remove any utterances for which some required
data, such as feature data or transcripts, is missing.

\section data_prep_lang Data preparation-- the "lang" directory.

Now we turn our attention to the "lang" directory.
\verbatim
s5# ls data/lang
L.fst  L_disambig.fst  oov.int	oov.txt  phones  phones.txt  topo  words.txt
\endverbatim
There may be other directories with a very similar format: in this case we have
a directory "data/lang_test" that contains the same information but also a file
G.fst that is a Finite State Transducer form of the language model:
\verbatim
s5# ls data/lang_test
G.fst  L.fst  L_disambig.fst  oov.int  oov.txt	phones	phones.txt  topo  words.txt
\endverbatim
Note that lang_test/ was created by copying lang/ and adding G.fst.
Each of these directories seems to contain only a few files.
It's not quite as simple as this though, because "phones" is a directory:
\verbatim
s5# ls data/lang/phones
context_indep.csl  disambig.txt         nonsilence.txt        roots.txt    silence.txt
context_indep.int  extra_questions.int  optional_silence.csl  sets.int     word_boundary.int
context_indep.txt  extra_questions.txt  optional_silence.int  sets.txt     word_boundary.txt
disambig.csl       nonsilence.csl       optional_silence.txt  silence.csl
\endverbatim
The phones directory contains various bits of information about the phone set; there
are three versions of some of the files, with extensions .csl, .int and .txt, that contain
the same information in three formats.  Fortunately you, as a Kaldi user, don't have
to create all of these files because we have a script "utils/prepare_lang.sh" that
creates it all for you based on simpler inputs.  Before we describe that script
and the simpler inputs it takes, we feel obligated to explain what is in the "lang" directory.
After that we will explain the easy way to create it.  The user who is simply
aiming to quickly build a system without needing to understand how Kaldi works
may skip to \ref data_prep_lang_creating below.

\section data_prep_lang_contents Contents of the "lang" directory

First there are the files <DFN>phones.txt</DFN> and <DFN>words.txt</DFN>.  These
are both symbol-table files, in the OpenFst format, where each line is
the text form and then the integer form:
\verbatim
s5# head -3 data/lang/phones.txt
<eps> 0
SIL 1
SIL_B 2
s5# head -3 data/lang/words.txt
<eps> 0
!SIL 1
-'S 2
\endverbatim
These files are used by Kaldi to map back and forth between the integer and
text forms of these symbols.  They are mostly only accessed by the scripts
utils/int2sym.pl and utils/sym2int.pl, and by the OpenFst programs fstcompile and
fstprint.

The file <DFN>L.fst</DFN> is the Finite State Transducer form of the lexicon (L,
see  <a href=http://www.cs.nyu.edu/~mohri/pub/hbka.pdf> "Speech Recognition
with Weighted Finite-State Transducers" </a> by Mohri, Pereira and
Riley, in Springer Handbook on Speech Processing and Speech Communication, 2008),
with phone symbols on the input and word symbols on the output.  The file
<DFN>L_disambig.fst</DFN> is the lexicon, as above but including the disambiguation
symbols \#1, \#2, and so on, as well as the self-loop with \#0 on it to "pass through"
the disambiguation symbol from the grammar.  See \ref graph_disambig for more
explanation.  Anyway, you won't have to deal with this directly.

The file <DFN>data/lang/oov.txt</DFN> contains just a single line:
\verbatim
s5# cat data/lang/oov.txt
<UNK>
\endverbatim
This is the word that we will map all out-of-vocabulary words to during
training.  There is nothing special about "<UNK>" here, and it does not have
to be this particular word; what is important is that this word should have a pronunciation
containing just a phone that we designate as a "garbage phone"; this phone will
align with various kinds of spoken noise.  In our particular setup, this phone
is called <DFN>SPN</DFN> (short for "spoken noise"):
\verbatim
s5# grep -w UNK data/local/dict/lexicon.txt
<UNK> SPN
\endverbatim
The file <DFN>oov.int</DFN> contains the integer form of this (extracted from <DFN>words.txt</DFN>),
which happens to be 221 in this setup.  You might notice that in the Resource Management
setup, oov.txt contains the silence word, which in that setup happens to be called "!SIL".
In that case we simply chose an arbitrary word from the vocabulary-- there are no out of vocabulary
words in the training set, so the word we choose has no effect.
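
The way words.txt and oov.txt work together can be sketched as follows.  This is an
illustration in awk, not the utils/sym2int.pl script, and the symbol table with its
integer ids is hypothetical: every word of the transcript is replaced by its integer,
and any word missing from the table is replaced by the integer of the OOV word.

```shell
#!/bin/sh
# Sketch: map a transcript to integers, sending unknown words to the OOV word.
# This is an illustration, not utils/sym2int.pl; the word ids are hypothetical.
dir=$(mktemp -d)
printf '%s\n' "<eps> 0" "!SIL 1" "<UNK> 2" "HELLO 3" > "$dir/words.txt"
echo "<UNK>" > "$dir/oov.txt"
echo "utt1 HELLO WORLD" > "$dir/text"

oov=$(cat "$dir/oov.txt")
ints=$(awk -v oov="$oov" '
  NR == FNR { id[$1] = $2; next }          # first file: load the symbol table
  {
    out = $1                               # keep the utterance-id as text
    for (i = 2; i <= NF; i++)
      out = out " " (($i in id) ? id[$i] : id[oov])
    print out
  }' "$dir/words.txt" "$dir/text")
echo "$ints"
```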

The file data/lang/topo contains the following data:
\verbatim
s5# cat data/lang/topo
<Topology>
<TopologyEntry>
<ForPhones>
21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188
</ForPhones>
<State> 0 <PdfClass> 0 <Transition> 0 0.75 <Transition> 1 0.25 </State>
<State> 1 <PdfClass> 1 <Transition> 1 0.75 <Transition> 2 0.25 </State>
<State> 2 <PdfClass> 2 <Transition> 2 0.75 <Transition> 3 0.25 </State>
<State> 3 </State>
</TopologyEntry>
<TopologyEntry>
<ForPhones>
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
</ForPhones>
<State> 0 <PdfClass> 0 <Transition> 0 0.25 <Transition> 1 0.25 <Transition> 2 0.25 <Transition> 3 0.25 </State>
<State> 1 <PdfClass> 1 <Transition> 1 0.25 <Transition> 2 0.25 <Transition> 3 0.25 <Transition> 4 0.25 </State>
<State> 2 <PdfClass> 2 <Transition> 1 0.25 <Transition> 2 0.25 <Transition> 3 0.25 <Transition> 4 0.25 </State>
<State> 3 <PdfClass> 3 <Transition> 1 0.25 <Transition> 2 0.25 <Transition> 3 0.25 <Transition> 4 0.25 </State>
<State> 4 <PdfClass> 4 <Transition> 4 0.75 <Transition> 5 0.25 </State>
<State> 5 </State>
</TopologyEntry>
</Topology>
\endverbatim
This specifies the topology of the HMMs we use.  In this case, the "real" phones contain
three emitting states
with the standard 3-state left-to-right topology-- the "Bakis model".
(Emitting states are states that "emit" feature vectors, as distinct from the "fake"
non-emitting states that are just used to glue other states together).
Phones 1 to 20 are various kinds of silence and noise; we have a lot because of word-position-dependency,
and in fact most of these will never be used; the real number excluding word position
dependency is more like five.  The "silence phones" have a more complex topology with an
initial emitting state and an end emitting state, but then three emitting states in the middle.
You don't have to create this file by hand.
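
If you do edit a topology, a simple sanity check is to add up the transition
probabilities leaving each emitting state.  The sketch below does this for the
three-state entry shown above, copied into a temporary file; it only reads the
"<Transition>" fields and makes no claim about how Kaldi itself parses the file.

```shell
#!/bin/sh
# Sketch: add up the outgoing transition probabilities of each emitting
# state in a topology entry.
topo=$(mktemp)
cat > "$topo" <<'EOF'
<State> 0 <PdfClass> 0 <Transition> 0 0.75 <Transition> 1 0.25 </State>
<State> 1 <PdfClass> 1 <Transition> 1 0.75 <Transition> 2 0.25 </State>
<State> 2 <PdfClass> 2 <Transition> 2 0.75 <Transition> 3 0.25 </State>
<State> 3 </State>
EOF
report=$(awk '$1 == "<State>" {
  sum = 0; n = 0
  # Each "<Transition>" token is followed by a target state and a probability.
  for (i = 2; i < NF; i++)
    if ($i == "<Transition>") { sum += $(i + 2); n++ }
  if (n > 0) printf "state %s: %d transitions, total %.2f\n", $2, n, sum
}' "$topo")
echo "$report"
```

The final state has no transitions and is skipped.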

There are a number of files in <DFN>data/lang/phones/</DFN> that specify various things about
the phone set.  Most of these files exist in three separate versions: a ".txt" form, e.g.:
\verbatim
s5# head -3 data/lang/phones/context_indep.txt
SIL
SIL_B
SIL_E
\endverbatim
a ".int" form, e.g.:
\verbatim
s5# head -3 data/lang/phones/context_indep.int
1
2
3
\endverbatim
and a ".csl" form, which in a slight abuse of notation, denotes a colon-separated list,
not a comma-separated list:
\verbatim
s5# cat data/lang/phones/context_indep.csl
1:2:3:4:5:6:7:8:9:10:11:12:13:14:15:16:17:18:19:20
\endverbatim
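
The ".csl" form is just the ".int" form joined with colons, as the following sketch
shows on a three-line stand-in for the ".int" file (the real files are produced for
you by the scripts):

```shell
#!/bin/sh
# Sketch: derive the ".csl" form from the ".int" form.
int_file=$(mktemp)
printf '%s\n' 1 2 3 > "$int_file"
# paste -s joins all lines of the file; -d: uses ":" as the separator.
csl=$(paste -sd: "$int_file")
echo "$csl"
```
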
These files always contain the same information, so let's focus on the ".txt" form which
is more human-readable.  The file "context_indep.txt" contains a list of those phones
for which we build context-independent models: that is, for those phones, we do not build a decision tree
that gets to ask questions about the left and right phonetic context.  In fact, we do build
smaller trees where we get to ask questions about the central phone and the HMM-state;
this depends on the "roots.txt" file which we'll describe below.  See \ref tree_externals
for more in-depth discussion of tree issues.

The file <DFN>context_indep.txt</DFN> contains all the phones which are not "real phones":
i.e. silence (SIL), spoken noise (SPN), non-spoken noise (NSN), and laughter (LAU):
\verbatim
s5# cat data/lang/phones/context_indep.txt
SIL
SIL_B
SIL_E
SIL_I
SIL_S
SPN
SPN_B
SPN_E
SPN_I
SPN_S
NSN
NSN_B
NSN_E
NSN_I
NSN_S
LAU
LAU_B
LAU_E
LAU_I
LAU_S
\endverbatim
There are a lot of variants of these phones because of word-position dependency; not all of these variants
will ever be used.  Here, <DFN>SIL</DFN> would be the silence that gets optionally inserted by the
lexicon (not part of a word), <DFN>SIL_B</DFN> would be a silence phone at the beginning of a word
(which should never exist), <DFN>SIL_I</DFN> word-internal silence (unlikely to exist), <DFN>SIL_E</DFN>
word-ending silence (should never exist), and <DFN>SIL_S</DFN> would be silence as a "singleton
word", i.e. a word with only one phone-- this might be used if you had a "silence word" in your
lexicon and explicit silences appear in your transcriptions.

The files <DFN>silence.txt</DFN> and <DFN>nonsilence.txt</DFN> contain lists of the silence
phones and nonsilence phones respectively.  These should be mutually exclusive and, together,
should contain all the phones.  In this particular setup, <DFN>silence.txt</DFN> is identical
to <DFN>context_indep.txt</DFN>.
What we mean by "nonsilence" phones is, phones that we
intend to estimate various kinds of linear transforms on: that is, global transforms
such as LDA and MLLT, and speaker adaptation transforms such as fMLLR.  Our belief based
on prior experience is that it does not pay to include silence in the estimation of
such transforms.  Our practice is
to designate all silence, noise and vocalized-noise phones as "silence" phones, and all
phones representing traditional phonemes as "nonsilence" phones.  We haven't experimented
in Kaldi with the best way to do this.
\verbatim
s5# head -3 data/lang/phones/silence.txt
SIL
SIL_B
SIL_E
s5# head -3 data/lang/phones/nonsilence.txt
IY_B
IY_E
IY_I
\endverbatim
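
The requirement that the two lists be mutually exclusive is easy to test.  A minimal
sketch, using short hypothetical lists in which one phone has deliberately been put
in both files:

```shell
#!/bin/sh
# Sketch: check that silence.txt and nonsilence.txt have no phone in common.
# The lists are hypothetical; SIL_B is duplicated on purpose.
export LC_ALL=C
dir=$(mktemp -d)
printf '%s\n' SIL SIL_B SPN > "$dir/silence.txt"
printf '%s\n' IY_B IY_E SIL_B > "$dir/nonsilence.txt"
# A phone listed in both files shows up as a duplicate after sorting.
overlap=$(sort "$dir/silence.txt" "$dir/nonsilence.txt" | uniq -d)
echo "phones in both lists: $overlap"
```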

The file <DFN>disambig.txt</DFN> contains a list of the "disambiguation symbols"
(see \ref graph_disambig):
\verbatim
s5# head -3 data/lang/phones/disambig.txt
#0
#1
#2
\endverbatim
These symbols appear in the file <DFN>phones.txt</DFN> as if they were phones.

The file <DFN>optional_silence.txt</DFN> contains a single phone which can optionally
appear between words:
\verbatim
s5# cat data/lang/phones/optional_silence.txt
SIL
\endverbatim
The mechanism by which it appears optionally between words is that it appears
optionally in the lexicon FST at the end of every word (and also the beginning of the
utterance).  The reason it has to be specified in <DFN>phones/</DFN> instead of just appearing
in <DFN>L.fst</DFN> is obscure and we won't go into it here.

The file <DFN>sets.txt</DFN> contains sets of phones that we group together (consider as
the same phone) while clustering the phones in order to create the context-dependency questions
(in Kaldi we use automatically generated questions when building decision trees,
rather than linguistically meaningful ones).
In this particular setup, <DFN>sets.txt</DFN> groups together all the word-position-dependent
versions of each phone:
\verbatim
s5# head -3 data/lang/phones/sets.txt
SIL SIL_B SIL_E SIL_I SIL_S
SPN SPN_B SPN_E SPN_I SPN_S
NSN NSN_B NSN_E NSN_I NSN_S
\endverbatim

The file <DFN>extra_questions.txt</DFN> contains some extra questions which we'll include
in addition to the automatically generated questions:
\verbatim
s5# cat data/lang/phones/extra_questions.txt
IY_B B_B D_B F_B G_B K_B SH_B L_B M_B N_B OW_B AA_B TH_B P_B OY_B R_B UH_B AE_B S_B T_B AH_B V_B W_B Y_B Z_B CH_B AO_B DH_B UW_B ZH_B EH_B AW_B AX_B EL_B AY_B EN_B HH_B ER_B IH_B JH_B EY_B NG_B
IY_E B_E D_E F_E G_E K_E SH_E L_E M_E N_E OW_E AA_E TH_E P_E OY_E R_E UH_E AE_E S_E T_E AH_E V_E W_E Y_E Z_E CH_E AO_E DH_E UW_E ZH_E EH_E AW_E AX_E EL_E AY_E EN_E HH_E ER_E IH_E JH_E EY_E NG_E
IY_I B_I D_I F_I G_I K_I SH_I L_I M_I N_I OW_I AA_I TH_I P_I OY_I R_I UH_I AE_I S_I T_I AH_I V_I W_I Y_I Z_I CH_I AO_I DH_I UW_I ZH_I EH_I AW_I AX_I EL_I AY_I EN_I HH_I ER_I IH_I JH_I EY_I NG_I
IY_S B_S D_S F_S G_S K_S SH_S L_S M_S N_S OW_S AA_S TH_S P_S OY_S R_S UH_S AE_S S_S T_S AH_S V_S W_S Y_S Z_S CH_S AO_S DH_S UW_S ZH_S EH_S AW_S AX_S EL_S AY_S EN_S HH_S ER_S IH_S JH_S EY_S NG_S
SIL SPN NSN LAU
SIL_B SPN_B NSN_B LAU_B
SIL_E SPN_E NSN_E LAU_E
SIL_I SPN_I NSN_I LAU_I
SIL_S SPN_S NSN_S LAU_S
\endverbatim
You will observe that a question is simply a set of phones.
The first four questions are asking about the word-position, for regular phones; and the last five do the same for
the "silence phones".  The "silence" phones also come in a variety without a suffix like <DFN>_B</DFN>,
for example <DFN>SIL</DFN>.  These may appear as optional silence in the lexicon, i.e. not inside an
actual word.  In setups with things like tone dependency or stress markings, <DFN>extra_questions.txt</DFN>
may contain questions that relate to those features.

The file <DFN>word_boundary.txt</DFN> explains how the phones relate to word positions:
\verbatim
s5# head  data/lang/phones/word_boundary.txt
SIL nonword
SIL_B begin
SIL_E end
SIL_I internal
SIL_S singleton
SPN nonword
SPN_B begin
\endverbatim
This is the same information as is in the suffixes of the phones (<DFN>_B</DFN> and so on), but
we don't like to hardcode this in the text form of the phones-- for one thing, Kaldi executables
never see the text form of the phones, but only an integerized form.  So it is specified
by this file <DFN>word_boundary.txt</DFN>.  The main reason we need this information is
in order to recover the word boundaries within lattices (for example, the program
lattice-align-words reads the integer version of this file, <DFN>word_boundary.int</DFN>).
Finding the word boundaries is useful for reasons including NIST sclite scoring, which requires
the time markings for words, and for other downstream processing.
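
To make the role of this file concrete, here is a minimal Python sketch of how word boundaries could be recovered from a phone sequence using the labels in <DFN>word_boundary.txt</DFN>.  The function names are hypothetical, and the sketch works on the text form of the phones for readability; it only illustrates the idea and is not how lattice-align-words is implemented.

```python
# Illustrative only: hypothetical helpers, not part of Kaldi.

def load_word_boundary(lines):
    """Parse lines such as 'SIL_B begin' into a {phone: position} dict."""
    table = {}
    for line in lines:
        fields = line.split()
        if len(fields) == 2:
            table[fields[0]] = fields[1]
    return table

def split_into_words(phones, table):
    """Group a phone sequence into words using the position labels."""
    words, current = [], []
    for phone in phones:
        kind = table[phone]
        if kind == "nonword":
            continue                 # e.g. optional silence between words
        current.append(phone)
        if kind in ("end", "singleton"):
            words.append(current)    # this phone closes a word
            current = []
    return words

boundary = load_word_boundary([
    "SIL nonword", "K_B begin", "AE_I internal", "T_E end", "AY_S singleton",
])
print(split_into_words(["SIL", "K_B", "AE_I", "T_E", "SIL", "AY_S"], boundary))
```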

The file <DFN>roots.txt</DFN> contains information that relates to how we build the phonetic-context
decision tree:
\verbatim
s5# head data/lang/phones/roots.txt
shared split SIL SIL_B SIL_E SIL_I SIL_S
shared split SPN SPN_B SPN_E SPN_I SPN_S
shared split NSN NSN_B NSN_E NSN_I NSN_S
shared split LAU LAU_B LAU_E LAU_I LAU_S
...
shared split B_B B_E B_I B_S
\endverbatim
For now you can ignore the words "shared" and "split"-- these relate to certain options
in how we build the decision tree (see \ref tree_externals for more information).
The significance of having a number of phones on a single line, for
example <DFN>SIL SIL_B SIL_E SIL_I SIL_S</DFN>, is that all of these phones
have a single "shared root" in the decision tree, so states may be shared
between those phones.  For stress and tone-dependent systems, typically
all the stress or tone-dependent versions of a particular phone will appear on
the same line.  In addition, all three states of an HMM (or all five states, for
silences) share the root, and the decision-tree building process gets to
ask about the state.  This sharing of the decision-tree root
between the HMM-states is what we mean by "shared" in the roots file.
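
The grouping of phones by line can be pictured with a short Python sketch that assigns every phone on a line of <DFN>roots.txt</DFN> to the same root index.  The function name <DFN>phone_to_root</DFN> is an assumption made for this example, and the sketch simply skips the first two fields of each line (the "shared" and "split" options) rather than interpreting them.

```python
# Illustrative only: maps each phone to the index of the roots.txt line it is on.

def phone_to_root(lines):
    mapping = {}
    for root_id, line in enumerate(lines):
        fields = line.split()
        # fields[0] and fields[1] are the two option words; the rest are phones.
        for phone in fields[2:]:
            mapping[phone] = root_id
    return mapping

roots = [
    "shared split SIL SIL_B SIL_E SIL_I SIL_S",
    "shared split B_B B_E B_I B_S",
]
root_of = phone_to_root(roots)
print(root_of["SIL"] == root_of["SIL_S"])   # same line, so same tree root
print(root_of["B_B"] == root_of["SIL"])     # different lines, different roots
```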

\section data_prep_lang_creating Creating the "lang" directory

The <DFN>data/lang/</DFN> directory contains a lot of different files, so we have
provided a script that creates it for you starting from a relatively simple
input:
\verbatim
utils/prepare_lang.sh data/local/dict "<UNK>" data/local/lang data/lang
\endverbatim
Here, the inputs are the directory <DFN>data/local/dict/</DFN> and the label <DFN>\<UNK\></DFN>,
which is the dictionary word that OOV words are mapped to when they appear in transcripts
(this label is written to <DFN>data/lang/oov.txt</DFN>).  The location <DFN>data/local/lang/</DFN> is simply a
temporary directory which the script will use; <DFN>data/lang/</DFN> is where
it actually puts its output.

The thing which you, as the data-preparer, need to create, is the directory
<DFN>data/local/dict/</DFN>.  The directory contains the following contents:
\verbatim
s5# ls data/local/dict
extra_questions.txt  lexicon.txt nonsilence_phones.txt  optional_silence.txt  silence_phones.txt
\endverbatim
(in fact there are a few more files there which we haven't listed, but they are just temporary files that
were put there while creating that directory, and we can ignore them).  The commands below give
you an idea what is in these files:
\verbatim
s5# head -3 data/local/dict/nonsilence_phones.txt
IY
B
D
s5# cat data/local/dict/silence_phones.txt
SIL
SPN
NSN
LAU
s5# cat data/local/dict/extra_questions.txt
s5# head -5 data/local/dict/lexicon.txt
!SIL SIL
-'S S
-'S Z
-'T K UH D EN T
-1K W AH N K EY
\endverbatim
As you can see, the contents of this directory are very simple in this
setup (the Switchboard setup).  We just have lists of the "real" phones and of the
"silence" phones respectively, an empty file called <DFN>extra_questions.txt</DFN>, and
a file called <DFN>lexicon.txt</DFN> which has the format
\verbatim
<word> <phone1> <phone2> ...
\endverbatim
Note: <DFN>lexicon.txt</DFN> will contain repeated entries for the same word,
on separate lines,
if we have multiple pronunciations for it.  If you want to use pronunciation
probabilities, instead of creating the file <DFN>lexicon.txt</DFN>, create a file
called <DFN>lexiconp.txt</DFN> that has the probability as the second field.
Note that it is common practice to normalize the pronunciation probabilities so that,
instead of summing to one for each word, the most probable pronunciation of each word has probability one.  This
tends to give better results.  For a top-level script that runs with
pronunciation probabilities, search for <DFN>pp</DFN> in <DFN>egs/wsj/s5/run.sh</DFN>.
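
As an illustration of that normalization, here is a minimal Python sketch.  The function name <DFN>normalize_max_to_one</DFN> and the probabilities 0.75 and 0.25 are made up for this example; the sketch assumes all probabilities are positive, and it is not taken from any Kaldi script.

```python
from collections import defaultdict

def normalize_max_to_one(entries):
    """entries: list of (word, prob, phones) tuples, one per pronunciation."""
    best = defaultdict(float)
    for word, prob, _ in entries:
        best[word] = max(best[word], prob)
    # Divide by the largest probability seen for that word.
    return [(word, prob / best[word], phones) for word, prob, phones in entries]

# Hypothetical probabilities for the two pronunciations of -'S.
entries = [("-'S", 0.75, ["S"]), ("-'S", 0.25, ["Z"]), ("!SIL", 1.0, ["SIL"])]
normalized = normalize_max_to_one(entries)
for word, prob, phones in normalized:
    # One lexiconp.txt-style line: <word> <prob> <phone1> <phone2> ...
    print(word, prob, " ".join(phones))
```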

Notice that in this input there is no notion of word-position dependency,
i.e. no suffixes like <DFN>_B</DFN> and <DFN>_E</DFN>.  This is because it is the
script <DFN>prepare_lang.sh</DFN> that adds those suffixes.
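
The suffix convention itself is simple, and can be sketched in a few lines of Python.  The function name <DFN>add_position_suffixes</DFN> is hypothetical, and this is only a sketch of the convention (<DFN>_B</DFN> for begin, <DFN>_I</DFN> for internal, <DFN>_E</DFN> for end, <DFN>_S</DFN> for singleton), not the code the script actually runs.

```python
def add_position_suffixes(phones):
    """Attach word-position markers to the phones of one pronunciation."""
    if not phones:
        raise ValueError("a pronunciation needs at least one phone")
    if len(phones) == 1:
        return [phones[0] + "_S"]                    # single-phone word
    middle = [p + "_I" for p in phones[1:-1]]        # word-internal phones
    return [phones[0] + "_B"] + middle + [phones[-1] + "_E"]

print(add_position_suffixes(["W", "AH", "N", "K", "EY"]))
print(add_position_suffixes(["S"]))
```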

You can see from the empty <DFN>extra_questions.txt</DFN> file that there
is some kind of potential here that is not being fully exploited.  This relates
to things like stress markings or tone markings.  You may want to have different
versions of a particular phone that have different stress or tone.  In order
to demonstrate what this looks like, we'll view the same files as above,
but in the <DFN>egs/wsj/s5/</DFN> setup.  The result is below:
\verbatim
s5# cat data/local/dict/silence_phones.txt
SIL
SPN
NSN
s5# head data/local/dict/nonsilence_phones.txt
S
UW UW0 UW1 UW2
T
N
K
Y
Z
AO AO0 AO1 AO2
AY AY0 AY1 AY2
SH
s5# head -6 data/local/dict/lexicon.txt
!SIL SIL
<SPOKEN_NOISE> SPN
<UNK> SPN
<NOISE> NSN
!EXCLAMATION-POINT  EH2 K S K L AH0 M EY1 SH AH0 N P OY2 N T
"CLOSE-QUOTE  K L OW1 Z K W OW1 T
s5# cat data/local/dict/extra_questions.txt
SIL SPN NSN
S UW T N K Y Z AO AY SH W NG EY B CH OY JH D ZH G UH F V ER AA IH M DH L AH P OW AW HH AE R TH IY EH
UW1 AO1 AY1 EY1 OY1 UH1 ER1 AA1 IH1 AH1 OW1 AW1 AE1 IY1 EH1
UW0 AO0 AY0 EY0 OY0 UH0 ER0 AA0 IH0 AH0 OW0 AW0 AE0 IY0 EH0
UW2 AO2 AY2 EY2 OY2 UH2 ER2 AA2 IH2 AH2 OW2 AW2 AE2 IY2 EH2
s5#
\endverbatim
You may notice that some of the lines in <DFN>nonsilence_phones.txt</DFN> contain
multiple phones on a single line.  These are the different stress-dependent
versions of the vowels.  Note that four different versions of each vowel
appear in the CMU dictionary: for example, <DFN>UW UW0 UW1 UW2</DFN>;
for some reason, one of these versions does not have a numeric suffix.
The order of the phones on the line does not matter, but the grouping into
different lines does matter; in general, we advise users to group all forms of
each "real phone" on a separate line.
We use the stress markings present in the CMU
dictionary.  The file <DFN>extra_questions.txt</DFN> contains a single question
containing all of the "silence" phones (in fact this is unnecessary, as
it appears that the script <DFN>prepare_lang.sh</DFN> adds such a question anyway),
and also a question corresponding to each of the different stress markers.
These questions are necessary in order to get any benefit from the
stress markers.  Because the different stress-dependent versions
of each phone are together on the lines of <DFN>nonsilence_phones.txt</DFN>,
they stay together in <DFN>data/lang/phones/roots.txt</DFN> and
<DFN>data/lang/phones/sets.txt</DFN>, which in turn ensures that they
share the same tree root and can never be distinguished by one of the automatically generated questions.  Thus,
we have to provide a special question that affords the decision-tree building
process a way to distinguish between the phones.  Note: the reason we put the
phones together in the <DFN>sets.txt</DFN> and <DFN>roots.txt</DFN> is that some
of the stress-dependent versions of phones may have too little data to
robustly estimate either a separate decision tree or the phone clustering
information that's used in producing the questions.  By grouping them together
like this, we ensure that in the absence of enough data to estimate them
separately, these different versions of the phone all "stay together" throughout
the decision-tree building process.
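
To show how stress questions of this kind relate to <DFN>nonsilence_phones.txt</DFN>, here is a small Python sketch that groups phones by their trailing stress digit.  The function name <DFN>stress_questions</DFN> is an assumption made for this example, the order in which the questions come out is just a choice of the sketch, and we do not claim that the WSJ recipe generates its file this way.

```python
from collections import defaultdict

def stress_questions(phone_lines):
    """Return one question (a space-separated phone set) per stress marker."""
    groups = defaultdict(list)
    for line in phone_lines:
        for phone in line.split():
            # Phones without a trailing digit go into the "" group.
            marker = phone[-1] if phone[-1].isdigit() else ""
            groups[marker].append(phone)
    return [" ".join(groups[marker]) for marker in sorted(groups)]

lines = ["S", "UW UW0 UW1 UW2", "T", "AO AO0 AO1 AO2"]
questions = stress_questions(lines)
for question in questions:
    print(question)
```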

We should mention at this point that the script <DFN>utils/prepare_lang.sh</DFN>
supports a number of options.  To give you an idea of what they are, here is
the usage message of that script:
\verbatim
usage: utils/prepare_lang.sh <dict-src-dir> <oov-dict-entry> <tmp-dir> <lang-dir>
e.g.: utils/prepare_lang.sh data/local/dict <SPOKEN_NOISE> data/local/lang data/lang
options:
     --num-sil-states <number of states>             # default: 5, #states in silence models.
     --num-nonsil-states <number of states>          # default: 3, #states in non-silence models.
     --position-dependent-phones (true|false)        # default: true; if true, use _B, _E, _S & _I
                                                     # markers on phones to indicate word-internal positions.
     --share-silence-phones (true|false)             # default: false; if true, share pdfs of
                                                     # all silence phones.
     --sil-prob <probability of silence>             # default: 0.5 [must have 0 < silprob < 1]
\endverbatim
A potentially important option is the <DFN>--share-silence-phones</DFN> option.
The default is false.  If this option is true, all the pdfs (the Gaussian
mixture models) of all the silence phones such as silence, vocalized-noise,
noise and laughter, will be shared and only the transition probabilities will
differ between those models.  It's not clear why this should help, but we found
that it was extremely helpful for the Cantonese data of IARPA's BABEL project.
That data is very messy and has long untranscribed portions that we try to
align to a special phone which we designate for that purpose.  We suspect
that the training data was somehow failing to align correctly, and for some reason
setting this option to true changed that.

Another potentially important option is the <DFN>--sil-prob</DFN> option.  In general, we have
not experimented much with any of these options so we cannot give very detailed advice.

\section data_prep_grammar Creating the language model or grammar

Our tutorial above on how to create the <DFN>lang/</DFN> directory did not address how
to create the file <DFN>G.fst</DFN>, which is the finite state transducer form of
the language model or grammar that we'll decode with.  In fact, in some setups
we may have many "lang" directories for testing purposes, with different
language models and dictionaries.  The Wall Street Journal (WSJ) setup is an example:
\verbatim
s5# echo data/lang*
data/lang data/lang_test_bd_fg data/lang_test_bd_tg data/lang_test_bd_tgpr data/lang_test_bg \
 data/lang_test_bg_5k data/lang_test_tg data/lang_test_tg_5k data/lang_test_tgpr data/lang_test_tgpr_5k
\endverbatim

The process for creating <DFN>G.fst</DFN> is different depending on whether we're using
a statistical language model or some kind of grammar.  In the RM setup there is
a bigram grammar, which only allows certain pairs of words.  We make this sum to
one within each grammar state by assigning a probability of 1 over the number of
outgoing arcs.  There is a statement in <DFN>local/rm_data_prep.sh</DFN> that does:
\verbatim
local/make_rm_lm.pl $RMROOT/rm1_audio1/rm1/doc/wp_gram.txt  > $tmpdir/G.txt || exit 1;
\endverbatim
This script <DFN>local/make_rm_lm.pl</DFN> creates a grammar in FST format (text format,
not binary format).  It contains lines like the following:
\verbatim
s5# head data/local/tmp/G.txt
0    1    ADD    ADD    5.19849703126583
0    2    AJAX+S    AJAX+S    5.19849703126583
0    3    APALACHICOLA+S    APALACHICOLA+S    5.19849703126583
\endverbatim
See <a href="http://www.openfst.org">www.openfst.org</a> for more information on OpenFst (they
have a useful tutorial).  The script <DFN>local/rm_prepare_grammar.sh</DFN> will turn this into
the binary-format file <DFN>G.fst</DFN> using the following statement:
\verbatim
fstcompile --isymbols=data/lang/words.txt --osymbols=data/lang/words.txt --keep_isymbols=false \
    --keep_osymbols=false $tmpdir/G.txt > data/lang/G.fst
\endverbatim
If you want to create your own grammar, you will probably want to do something similar.
Note: this type of procedure only applies to grammars of a certain class: it won't
allow you to compile a complete Context Free Grammar, because it can't be represented
in OpenFst format.  There are ways to do this in the WFST framework
(e.g. see recent work by Mike Riley with pushdown transducers), but we have not yet
worked with those ideas in Kaldi.
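
As a rough check on the costs in the <DFN>G.txt</DFN> excerpt above, the following Python sketch computes the cost of an arc when each of the N outgoing arcs of a state gets probability 1/N.  It assumes that costs are negative natural logarithms of probabilities; under that assumption, the value 5.198... shown above would be consistent with a state that has 181 outgoing arcs.  The function names are hypothetical and this is not the code of <DFN>local/make_rm_lm.pl</DFN>.

```python
import math

def uniform_arc_cost(num_outgoing_arcs):
    """Cost of one arc when probability is split evenly: -log(1/N)."""
    return -math.log(1.0 / num_outgoing_arcs)

def grammar_lines(src_state, successors):
    """successors: list of (dest_state, word) pairs leaving src_state.

    Returns text-format lines: src, dest, input label, output label, cost.
    """
    cost = uniform_arc_cost(len(successors))
    return [f"{src_state}\t{dest}\t{word}\t{word}\t{cost}"
            for dest, word in successors]

print(uniform_arc_cost(181))
for line in grammar_lines(0, [(1, "ADD"), (2, "AJAX+S")]):
    print(line)
```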

Please, before asking any questions on the list about language models or about making
grammar FSTs, read "A Bit of Progress in Language Modeling" by Joshua Goodman; and go to
www.openfst.org and do the FST tutorial so that you understand the basics of finite
state transducers.  (Note that language models would be represented as finite state
acceptors, or FSAs, which can be considered as a special case of finite state transducers).

The script <DFN>utils/format_lm.sh</DFN> deals with converting the ARPA-format language
models into an OpenFst format. Here is the usage message of that script:
\verbatim
Usage: utils/format_lm.sh <lang_dir> <arpa-LM> <lexicon> <out_dir>
E.g.: utils/format_lm.sh data/lang data/local/lm/foo.kn.gz data/local/dict/lexicon.txt data/lang_test
Convert ARPA-format language models to FSTs.
\endverbatim
Some of the key commands from that script are:
\verbatim
gunzip -c $lm \
  | arpa2fst --disambig-symbol=#0 \
             --read-symbol-table=$out_dir/words.txt - $out_dir/G.fst
\endverbatim
This Kaldi program, <DFN>arpa2fst</DFN>, turns the ARPA-format language model
into a Weighted Finite State Transducer (actually, an acceptor).

A popular toolkit for building language models is SRILM.  Various language
modeling toolkits are used in the Kaldi example scripts.  SRILM is the best
documented and most fully featured, and we generally recommend it (its only
drawback is that it doesn't have the most free licence).  Here is the usage
message of <DFN>utils/format_lm_sri.sh</DFN>:

\verbatim
Usage: utils/format_lm_sri.sh [options] <lang-dir> <arpa-LM> <out-dir>
E.g.: utils/format_lm_sri.sh data/lang data/local/lm/foo.kn.gz data/lang_test
Converts ARPA-format language models to FSTs. Change the LM vocabulary using SRILM.
\endverbatim


\section data_prep_unknown Note on unknown words

This is an explanation of how Kaldi deals with unknown words (words not in the
vocabulary); we are putting it on the "data preparation" page for lack of a more obvious
location.

In many setups, <DFN>\<unk\></DFN> or something similar will be present in the
LM as long as the data that you used to train the LM had words that were not
in the vocabulary you used to train the LM,
because language modeling toolkits tend to map those all to a
single special word, usually called <DFN>\<unk\></DFN> or
<DFN>\<UNK\></DFN>.  You can look at the ARPA file to figure out what it's called; it
will usually be one of those two.


During training, if there are words in the <DFN>text</DFN> file in your data
directory that are not in the <DFN>words.txt</DFN> in the lang directory that
you are using, Kaldi will map them to a special word that's specified in the
lang directory in the file <DFN>data/lang/oov.txt</DFN>; it will usually be
either <DFN>\<unk\></DFN>, <DFN>\<UNK\></DFN> or maybe
<DFN>\<SPOKEN_NOISE\></DFN>.  This word will have been chosen by the user
(i.e., you), and supplied to <DFN>prepare_lang.sh</DFN> as a command-line argument.
If this word has nonzero probability in the language model (which you can test
by looking at the ARPA file), then it will be possible for Kaldi to recognize
this word at test time.  This will often be the case if you call this word
<DFN>\<unk\></DFN>, because as we mentioned above, language modeling toolkits
will often use this spelling for "unknown word" (which is a special word that
all out-of-vocabulary words get mapped to).  Decoding output will always be limited to the
intersection of the words in the language model with the words in <DFN>lexicon.txt</DFN> (or whatever file format you supplied the
lexicon in, e.g. <DFN>lexiconp.txt</DFN>); these words will all be present in the <DFN>words.txt</DFN>
in your <DFN>lang</DFN> directory.
So if Kaldi's "unknown word" doesn't match the LM's "unknown word", you will
simply never decode this word.  In any
case, even when allowed to be decoded, this word typically won't be output very
often and in practice it doesn't tend to have much impact on WERs.
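
The training-time mapping described above can be pictured with a minimal Python sketch.  The function names and the example word ZYZZYVA are hypothetical; the sketch assumes that each line of <DFN>words.txt</DFN> holds a word followed by its integer id, and it is meant only to illustrate the mapping, not to reproduce the Kaldi scripts that perform it.

```python
def load_vocabulary(words_txt_lines):
    """Collect the words from lines of the form '<word> <integer-id>'."""
    return {line.split()[0] for line in words_txt_lines if line.strip()}

def map_oov(transcript, vocabulary, oov_word):
    """Replace every word that is not in the vocabulary by the OOV word."""
    return [w if w in vocabulary else oov_word for w in transcript.split()]

vocabulary = load_vocabulary(["<eps> 0", "<UNK> 1", "HELLO 2", "WORLD 3"])
# "<UNK>" plays the role of the word stored in data/lang/oov.txt.
print(" ".join(map_oov("HELLO ZYZZYVA WORLD", vocabulary, "<UNK>")))
```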

Of course a single phone (such as <DFN>SPN</DFN>, the pronunciation given to <DFN>\<UNK\></DFN> in the WSJ lexicon shown earlier) isn't a very good, or accurate, model of OOV words.  In
some Kaldi setups we have example scripts named
<DFN>local/run_unk_model.sh</DFN>: e.g., see the file
<DFN>tedlium/s5_r2/local/run_unk_model.sh</DFN>.  These scripts replace the unk
phone with a phone-level LM.  They make it possible to get access to
the sequence of phones in a hypothesized unknown word.  Note: unknown words
should be considered an "advanced topic" in speech recognition and we discourage
beginners from looking into this topic too closely.



*/