// doc/data_prep.dox

// Copyright 2012  Johns Hopkins University (author: Daniel Povey)

// See ../../COPYING for clarification regarding multiple authors
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at

//  http://www.apache.org/licenses/LICENSE-2.0

// THIS CODE IS PROVIDED *AS IS* BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, EITHER EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION ANY IMPLIED
// WARRANTIES OR CONDITIONS OF TITLE, FITNESS FOR A PARTICULAR PURPOSE,
// MERCHANTABLITY OR NON-INFRINGEMENT.
// See the Apache 2 License for the specific language governing permissions and
// limitations under the License.

/**
 \page data_prep  Data preparation

  \section data_prep_intro Introduction

  After running the example scripts (see \ref tutorial), you may want to set up
  Kaldi to run with your own data.  This section explains how to prepare the data.
  This page will assume that you are using the latest version of the example scripts
  (typically named "s5" in the example directories, e.g. egs/rm/s5/).
  In addition to this page, you can refer to the data preparation scripts in those
 directories.  The top-level run.sh scripts (e.g. egs/rm/s5/run.sh) have a few commands at
 the top of them that relate to various phases of data preparation.  The parts in
 the sub-directory named local/ are always specific to the database.  For example,
 in the Resource Management (RM) setup it is local/rm_data_prep.sh.  In the case of
 RM these commands are:
\verbatim
local/rm_data_prep.sh /export/corpora5/LDC/LDC93S3A/rm_comp || exit 1;

utils/prepare_lang.sh data/local/dict '!SIL' data/local/lang data/lang || exit 1;

local/rm_prepare_grammar.sh || exit 1;
\endverbatim

In the WSJ case the commands are:
\verbatim

wsj0=/export/corpora5/LDC/LDC93S6B
wsj1=/export/corpora5/LDC/LDC94S13B

local/wsj_data_prep.sh $wsj0/??-{?,??}.? $wsj1/??-{?,??}.?  || exit 1;

local/wsj_prepare_dict.sh || exit 1;

utils/prepare_lang.sh data/local/dict "<SPOKEN_NOISE>" data/local/lang_tmp data/lang || exit 1;

local/wsj_format_data.sh || exit 1;
\endverbatim
There are more commands after these in the WSJ script that relate to training language
models locally (rather than using the ones supplied by LDC), but the ones above are the most
important ones.


The output of the data preparation stage consists of two sets of things.  One relates
to "the data" (directories like data/train/) and one relates to "the language"
(directories like data/lang/).  The "data" part relates to the specific recordings you
have, and the "lang" part contains things that relate more to the language itself,
such as the lexicon, the phone set, and various extra information about the phone set
that Kaldi needs.  If you want to prepare data which you will decode with an
already existing system and an already existing language model, the "data" part is
all you need to touch.

\section data_prep_data Data preparation-- the "data" part.

As an example of the "data" part of the data preparation, look at the directory
"data/train" in one of the example directories (assuming you have already run
the scripts there).  Note: there is nothing special about the directory name
"data/train".  There are other directories such as "data/eval2000" (for a test set)
that have essentially the same format ("essentially" because we may have an "stm" and
"glm" file in the test directory, to enable sclite scoring).
The specific example we'll look at is the Switchboard recipe
in egs/swbd/s5.
\verbatim
s5# ls data/train
cmvn.scp  feats.scp  reco2file_and_channel  segments  spk2utt  text  utt2spk  wav.scp
\endverbatim
Not all of the files are equally important.  For a simple setup where there is no
"segmentation" information (i.e. each utterance corresponds to a single file), the only
files you have to create yourself are "utt2spk", "text" and "wav.scp"; setups with
segmentation information may also need "segments" and "reco2file_and_channel".
The rest will be created by standard scripts.

We will describe the files in this directory, starting with the files you need to create
yourself.

\subsection data_prep_data_yourself Files you need to create yourself

 The file "text" contains the transcriptions of each utterance.
\verbatim
s5# head -3 data/train/text
sw02001-A_000098-001156 HI UM YEAH I'D LIKE TO TALK ABOUT HOW YOU DRESS FOR WORK AND
sw02001-A_001980-002131 UM-HUM
sw02001-A_002736-002893 AND IS
\endverbatim
The first element on each line is the utterance-id, which is an arbitrary text string,
but if you have speaker information in your setup, you should make the speaker-id a
prefix of the utterance id; this is important for reasons relating to the sorting of
these files.  The rest of the line is the transcription of that utterance.  You don't
have to make sure that all words in this file are in your vocabulary; out-of-vocabulary words will
get mapped to a word specified in the file data/lang/oov.txt.

The sorted orders of the utt2spk and spk2utt files need to "agree": that is, the
list of speaker-ids extracted from the sorted utt2spk file must itself be in
string-sorted order.  The easiest way to make this happen is
to make the speaker-ids a prefix of the utterance-ids.  Although in this particular
example we have used an underscore to separate the "speaker" and "utterance"
parts of the utterance-id, in general it is probably safer to use a dash ("-").
This is because it has a lower ASCII value; if the speaker-ids vary in length,
in certain cases the speaker-ids and their corresponding utterance ids can end
up being sorted in different orders when using the standard "C"-style ordering
on strings, which will lead to a crash.

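
To see why the separator matters, here is a small sketch using the made-up
speaker-ids "spk1" and "spk10" (these names are assumptions for illustration, not
taken from any recipe).  With an underscore, the utterance of "spk10" sorts before
that of "spk1" under "C"-style ordering, even though "spk1" sorts before "spk10" as
a speaker-id; with a dash the two orders agree:

```shell
export LC_ALL=C
# Underscore (ASCII 95) sorts after the digit "0" (ASCII 48), so spk10 comes first.
under=$(printf '%s\n' spk1_u1 spk10_u1 | sort | tr '\n' ' ')
# Dash (ASCII 45) sorts before the digit "0", so spk1 stays first.
dash=$(printf '%s\n' spk1-u1 spk10-u1 | sort | tr '\n' ' ')
echo "underscore: $under"
echo "dash:       $dash"
```
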
Another important file is <DFN>wav.scp</DFN>.  In the Switchboard example,
\verbatim
s5# head -3 data/train/wav.scp
sw02001-A /home/dpovey/kaldi-trunk/tools/sph2pipe_v2.5/sph2pipe -f wav -p -c 1 /export/corpora3/LDC/LDC97S62/swb1/sw02001.sph |
sw02001-B /home/dpovey/kaldi-trunk/tools/sph2pipe_v2.5/sph2pipe -f wav -p -c 2 /export/corpora3/LDC/LDC97S62/swb1/sw02001.sph |
\endverbatim
The format of this file is
\verbatim
<recording-id> <extended-filename>
\endverbatim
where the "extended-filename" may be
an actual filename, or as in this case, a command that extracts a wav-format file.  The pipe symbol
on the end of the extended-filename specifies that it is to be interpreted as a pipe.  We will
explain what "recording-id" is below, but we would first like to point out that if the "segments" file
does not exist, the first token on each line of "wav.scp" file is just the utterance id.
The files in wav.scp must be single-channel (mono); if the underlying wav files have multiple
channels, then a sox command must be used in the wav.scp to extract a particular channel.
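
As a rough sketch of the multi-channel case, the loop below writes one "wav.scp"
entry per channel of some two-channel wav files.  The corpus path, the recording
names and the "-A"/"-B" naming are assumptions made for this example; the sox
"remix" effect is used here to select a channel, which is one way of doing it and
not necessarily what any particular recipe does.

```shell
# Hypothetical corpus path and recording names; substitute your own.
corpus=/path/to/corpus
out=$(mktemp -d)
for rec in rec001 rec002; do
  # One line per channel: "remix N" keeps only channel N, and the
  # trailing "|" marks the entry as a command whose output is read.
  echo "${rec}-A sox $corpus/$rec.wav -t wav - remix 1 |"
  echo "${rec}-B sox $corpus/$rec.wav -t wav - remix 2 |"
done | LC_ALL=C sort > "$out/wav.scp"
cat "$out/wav.scp"
```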

In the Switchboard setup we have the "segments" file, so we'll discuss this next.
\verbatim
s5# head -3 data/train/segments
sw02001-A_000098-001156 sw02001-A 0.98 11.56
sw02001-A_001980-002131 sw02001-A 19.8 21.31
sw02001-A_002736-002893 sw02001-A 27.36 28.93
\endverbatim
The format of the "segments" file is:
\verbatim
<utterance-id> <recording-id> <segment-begin> <segment-end>
\endverbatim
where the segment-begin and segment-end are measured in seconds.
These specify time offsets into a recording.  The "recording-id"
is the same identifier as is used in the "wav.scp" file-- again, this is
an arbitrary identifier that you can choose.
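
The utterance-ids in the example above appear to encode the segment times in
hundredths of a second, zero-padded to six digits.  That convention is inferred
from the sample lines and is not a requirement; assuming an input list with the
recording-id, begin time and end time on each line (a hypothetical format), a
"segments" file of that style could be sketched as:

```shell
# Hypothetical input: recording-id, begin and end time in seconds, one segment per line.
segs=$(printf '%s\n' 'sw02001-A 0.98 11.56' 'sw02001-A 19.8 21.31' |
  awk '{
    # Times in hundredths of a second, rounded and zero-padded to six digits.
    printf "%s_%06d-%06d %s %s %s\n", $1, $2 * 100 + 0.5, $3 * 100 + 0.5, $1, $2, $3
  }' | LC_ALL=C sort)
echo "$segs"
```
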
The file "reco2file_and_channel" is only used when scoring (measuring
error rates) with NIST's "sclite" tool:
\verbatim
s5# head -3 data/train/reco2file_and_channel
sw02001-A sw02001 A
sw02001-B sw02001 B
sw02005-A sw02005 A
\endverbatim
The format is:
\verbatim
<recording-id> <filename> <recording-side (A or B)>
\endverbatim
The filename is typically the name of the .sph file, without the suffix, but in
general it's whatever identifier you have in your "stm" file.
The recording side is a concept that relates to telephone conversations where there are
two channels; if your data is not of that type, it's probably safe to use "A". If you don't have
an "stm" file or you have no idea what this is all about, then you don't need
the "reco2file_and_channel" file.

The last file you need to create yourself is the "utt2spk" file.  This says, for each
utterance, which speaker spoke it.
\verbatim
s5# head -3 data/train/utt2spk
sw02001-A_000098-001156 2001-A
sw02001-A_001980-002131 2001-A
sw02001-A_002736-002893 2001-A
\endverbatim
The format is
\verbatim
<utterance-id> <speaker-id>
\endverbatim
Note that the speaker-ids don't need to correspond in any very accurate sense
to the names of actual speakers-- they simply need to represent a reasonable guess.
In this case we assume each conversation side (each side of the telephone conversation)
corresponds to a single speaker.  This is not entirely true -- sometimes one person
may hand the phone to another person, or the same person may be speaking in multiple
calls -- but it's good enough for our purposes.  <b> If you have no information at all about
the speaker identities, you can just make the speaker-ids the same as the utterance-ids </b>,
so the format of the file would be just <DFN>\<utterance-id\> \<utterance-id\></DFN>.
We have made the previous sentence bold because we have encountered people creating
a "global" speaker-id.  This is a bad idea because it makes cepstral mean normalization
ineffective in training (since it's applied globally), and because it will create problems
when you use utils/split_data_dir.sh to split your data into pieces.
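
A minimal sketch of the "no speaker information" case, assuming a "text" file
already exists (the temporary directory and the two utterances are made up for the
example):

```shell
dir=$(mktemp -d)   # stands in for a data directory such as data/train
printf '%s\n' 'utt001 HELLO WORLD' 'utt002 GOOD MORNING' > "$dir/text"
# Use each utterance-id as its own speaker-id.
awk '{ print $1, $1 }' "$dir/text" > "$dir/utt2spk"
cat "$dir/utt2spk"
```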

There is another file that exists in some setups; it is used only occasionally and
not in the Kaldi system build.  We show what it looks like in the Resource Management
(RM) setup:
\verbatim
s5# head -3 ../../rm/s5/data/train/spk2gender
adg0 f
ahh0 m
ajp0 m
\endverbatim
This file maps from speaker-id to either "m" or "f" depending on the speaker gender.

All of these files should be sorted.  If they are not sorted, you will get errors
when you run the scripts.  In \ref io_sec_tables we explain why this is needed.
It has to do with the I/O framework; the ultimate reason for the sorting is to
enable something equivalent to random-access lookup on a stream that doesn't support
fseek(), such as a piped command.  Many Kaldi programs read multiple pipes
from other Kaldi commands, reading different types of object, and do something
roughly comparable to the merge step of merge-sort
on the different inputs; merging, of course, requires that the inputs be sorted.
Be careful when you sort that you have the shell variable LC_ALL defined as "C",
for example (in bash),
\verbatim
export LC_ALL=C
\endverbatim
If you don't do this, the files will be sorted in an order that's different from how
C++ sorts strings, and Kaldi will crash.  You have been warned!
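
As an illustration (a sketch on a made-up three-line file, not part of the Kaldi
scripts), the following checks whether a file is sorted on its first field under
the "C" ordering and re-sorts it if not.  Note that in this ordering all
upper-case letters come before all lower-case letters, which is not the case in
many other locales.

```shell
export LC_ALL=C
f=$(mktemp)
printf '%s\n' 'b-utt1 x' 'B-utt1 x' 'a-utt1 x' > "$f"
# "sort -c" only checks the order; it exits non-zero if the file is unsorted.
if ! sort -c -k1,1 "$f" 2>/dev/null; then
  sort -k1,1 -o "$f" "$f"
fi
cat "$f"
```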

If your data consists of a test set from NIST that has an "stm" and a "glm" file
provided so that you can measure WER, then you can put these files in the data
directory with the names "stm" and "glm".  Note that we put the scoring
script (which measures WER) in <DFN>local/score.sh</DFN>, which means it is
specific to the setup; not all of the scoring scripts in all of the setups will
recognize the stm and glm files.  An example of a scoring script that uses those files is
the one in the Switchboard setup, i.e. <DFN>egs/swbd/s5/local/score_sclite.sh</DFN>,
which is invoked by the top-level scoring script
<DFN>egs/swbd/s5/local/score.sh</DFN> if it notices that your test set has the
stm and glm files.

\subsection data_prep_data_noneed Files you don't need to create yourself

The other files in this directory can be generated from the files you provide.
You can create the "spk2utt" file by a command like the following
(this one is extracted from egs/rm/s5/local/rm_data_prep.sh)
\verbatim
utils/utt2spk_to_spk2utt.pl data/train/utt2spk > data/train/spk2utt
\endverbatim
This is possible because the utt2spk and spk2utt files contain exactly
the same information; the format of the spk2utt file is
<DFN>\<speaker-id\> \<utterance-id1\> \<utterance-id2\> ...</DFN>.

Next we come to the <DFN>feats.scp</DFN> file.
\verbatim
s5# head -3 data/train/feats.scp
sw02001-A_000098-001156 /home/dpovey/kaldi-trunk/egs/swbd/s5/mfcc/raw_mfcc_train.1.ark:24
sw02001-A_001980-002131 /home/dpovey/kaldi-trunk/egs/swbd/s5/mfcc/raw_mfcc_train.1.ark:54975
sw02001-A_002736-002893 /home/dpovey/kaldi-trunk/egs/swbd/s5/mfcc/raw_mfcc_train.1.ark:62762
\endverbatim
This points to the extracted features-- MFCC features in this case, because
that is what we use in this particular script.  The format is:
\verbatim
<utterance-id> <extended-filename-of-features>
\endverbatim
Each of the feature files contains a matrix, in Kaldi format.
In this case the dimension of the matrix would be (the length of the file in 10ms intervals) by 13.
The "extended filename" <DFN>/home/dpovey/kaldi-trunk/egs/swbd/s5/mfcc/raw_mfcc_train.1.ark:24</DFN>
means, open the "archive" file /home/dpovey/kaldi-trunk/egs/swbd/s5/mfcc/raw_mfcc_train.1.ark, fseek()
to position 24, and read the data that's there.
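
The "filename:offset" convention can be illustrated in the shell.  In the sketch
below the path is hypothetical, and the file that is actually read is a toy text
file rather than a real Kaldi archive; it only shows the idea of splitting the
entry at the last colon and then seeking to a byte offset.

```shell
entry='/tmp/raw_mfcc_train.1.ark:24'   # hypothetical extended filename, as in feats.scp
path=${entry%:*}      # everything before the last colon
offset=${entry##*:}   # everything after the last colon
echo "file=$path offset=$offset"

# Toy demonstration of seeking in an ordinary file (not a real archive):
toy=$(mktemp)
printf 'utt1 AAAAutt2 BBBB' > "$toy"
# Skip the first 9 bytes and read the next 9, much as fseek() followed by a read would.
chunk=$(dd if="$toy" bs=1 skip=9 count=9 2>/dev/null)
echo "$chunk"
```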

This feats.scp file is created by the command
\verbatim
steps/make_mfcc.sh --nj 20 --cmd "$train_cmd" data/train exp/make_mfcc/train $mfccdir
\endverbatim
which is invoked by the top-level "run.sh" script.  For the definitions of the
shell variables, see that script.  <DFN>\$mfccdir</DFN> is a user-specified directory where the
.ark files will be written.

The last file in the directory data/train is "cmvn.scp".  This contains statistics
for cepstral mean and variance normalization, indexed by speaker.  Each set of
statistics is a matrix, of dimension 2 by 14 in this case.  In our example, we have:
\verbatim
s5# head -3 data/train/cmvn.scp
2001-A /home/dpovey/kaldi-trunk/egs/swbd/s5/mfcc/cmvn_train.ark:7
2001-B /home/dpovey/kaldi-trunk/egs/swbd/s5/mfcc/cmvn_train.ark:253
2005-A /home/dpovey/kaldi-trunk/egs/swbd/s5/mfcc/cmvn_train.ark:499
\endverbatim
Unlike feats.scp, this scp file is indexed by speaker-id, not utterance-id.
This file is created by a command such as this:
\verbatim
steps/compute_cmvn_stats.sh data/train exp/make_mfcc/train $mfccdir
\endverbatim
(this example is from <DFN>egs/swbd/s5/run.sh</DFN>).

Because errors in data preparation can cause problems later on, we have a script to
check that the data directory is correctly formatted.  Run e.g.
\verbatim
utils/validate_data_dir.sh data/train
\endverbatim
You may also find the following command useful:
\verbatim
utils/fix_data_dir.sh data/train
\endverbatim
(of course the command will work for any data directory, not just data/train).  This
script will fix sorting errors and will remove any utterances for which some required
data, such as feature data or transcripts, is missing.
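
To give a feel for the kind of problem these scripts guard against, the sketch
below compares the utterance-ids in a made-up "text" file and "utt2spk" file.  It
is an illustration only, and is not how validate_data_dir.sh or fix_data_dir.sh
are implemented.

```shell
dir=$(mktemp -d)   # stands in for a data directory
printf '%s\n' 'utt1 HELLO' 'utt2 WORLD' > "$dir/text"
printf '%s\n' 'utt1 spk1' 'utt2 spk1' 'utt3 spk2' > "$dir/utt2spk"
cut -d' ' -f1 "$dir/text" | LC_ALL=C sort > "$dir/ids.text"
cut -d' ' -f1 "$dir/utt2spk" | LC_ALL=C sort > "$dir/ids.utt2spk"
# comm -3 prints the ids that appear in only one of the two sorted lists.
mismatch=$(LC_ALL=C comm -3 "$dir/ids.text" "$dir/ids.utt2spk" | tr -d '\t')
echo "ids not present in both files: $mismatch"
```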

\section data_prep_lang Data preparation-- the "lang" directory.

Now we turn our attention to the "lang" directory.
\verbatim
s5# ls data/lang
L.fst  L_disambig.fst  oov.int	oov.txt  phones  phones.txt  topo  words.txt
\endverbatim
There may be other directories with a very similar format: in this case we have
a directory "data/lang_test" that contains the same information but also a file
G.fst that is a Finite State Transducer form of the language model:
\verbatim
s5# ls data/lang_test
G.fst  L.fst  L_disambig.fst  oov.int  oov.txt	phones	phones.txt  topo  words.txt
\endverbatim
Note that lang_test/ was created by copying lang/ and adding G.fst.
Each of these directories seems to contain only a few files.
It's not quite as simple as this though, because "phones" is a directory:
\verbatim
s5# ls data/lang/phones
context_indep.csl  disambig.txt         nonsilence.txt        roots.txt    silence.txt
context_indep.int  extra_questions.int  optional_silence.csl  sets.int     word_boundary.int
context_indep.txt  extra_questions.txt  optional_silence.int  sets.txt     word_boundary.txt
disambig.csl       nonsilence.csl       optional_silence.txt  silence.csl
\endverbatim
The phones directory contains various bits of information about the phone set; there
are three versions of some of the files, with extensions .csl, .int and .txt, that contain
the same information in three formats.  Fortunately you, as a Kaldi user, don't have
to create all of these files because we have a script "utils/prepare_lang.sh" that
creates it all for you based on simpler inputs.  Before we describe that script
and the simpler inputs it takes, we feel obligated to explain what is in the "lang" directory.
After that we will explain the easy way to create it.  The user who is simply
aiming to quickly build a system without needing to understand how Kaldi works
may skip to \ref data_prep_lang_creating below.

\section data_prep_lang_contents Contents of the "lang" directory

First there are the files <DFN>phones.txt</DFN> and <DFN>words.txt</DFN>.  These
are both symbol-table files, in the OpenFst format, where each line is
the text form and then the integer form:
\verbatim
s5# head -3 data/lang/phones.txt
<eps> 0
SIL 1
SIL_B 2
s5# head -3 data/lang/words.txt
<eps> 0
!SIL 1
-'S 2
\endverbatim
These files are used by Kaldi to map back and forth between the integer and
text forms of these symbols.  They are mostly only accessed by the scripts
utils/int2sym.pl and utils/sym2int.pl, and by the OpenFst programs fstcompile and
fstprint.
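
The mapping itself is simple.  As a sketch (using a toy symbol table whose word
ids are made up, and awk rather than the sym2int.pl script), converting a word
sequence to its integer form looks like this:

```shell
dir=$(mktemp -d)
printf '%s\n' '<eps> 0' '!SIL 1' 'HELLO 2' 'WORLD 3' > "$dir/words.txt"
# Load the symbol table from the first file, then translate each word read from stdin.
ints=$(echo 'HELLO WORLD' |
  awk 'NR == FNR { id[$1] = $2; next }
       { out = ""; for (i = 1; i <= NF; i++) out = out (i > 1 ? " " : "") id[$i]; print out }' \
      "$dir/words.txt" -)
echo "$ints"
```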

The file <DFN>L.fst</DFN> is the Finite State Transducer form of the lexicon (L,
see  <a href=http://www.cs.nyu.edu/~mohri/pub/hbka.pdf> "Speech Recognition
with Weighted Finite-State Transducers" </a> by Mohri, Pereira and
Riley, in Springer Handbook on Speech Processing and Speech Communication, 2008),
with phone symbols on the input and word symbols on the output.  The file
<DFN>L_disambig.fst</DFN> is the lexicon, as above but including the disambiguation
symbols \#1, \#2, and so on, as well as the self-loop with \#0 on it to "pass through"
the disambiguation symbol from the grammar.  See \ref graph_disambig for more
explanation.  Anyway, you won't have to deal with this directly.

The file <DFN>data/lang/oov.txt</DFN> contains just a single line:
\verbatim
s5# cat data/lang/oov.txt
<UNK>
\endverbatim
This is the word that we will map all out-of-vocabulary words to during
training.  There is nothing special about "<UNK>" here, and it does not have
to be this particular word; what is important is that this word should have a pronunciation
containing just a phone that we designate as a "garbage phone"; this phone will
align with various kinds of spoken noise.  In our particular setup, this phone
is called <DFN>SPN</DFN> (short for "spoken noise"):
\verbatim
s5# grep -w UNK data/local/dict/lexicon.txt
<UNK> SPN
\endverbatim
The file <DFN>oov.int</DFN> contains the integer form of this (extracted from <DFN>words.txt</DFN>),
which happens to be 221 in this setup.  You might notice that in the Resource Management
setup, oov.txt contains the silence word, which in that setup happens to be called "!SIL".
In that case we simply chose an arbitrary word from the vocabulary-- there are no out of vocabulary
words in the training set, so the word we choose has no effect.
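
The effect of the OOV word on a transcript can be sketched as follows.  The
vocabulary here is a toy one with made-up ids, and the awk command merely
illustrates the mapping; it is not the code the training scripts use.

```shell
dir=$(mktemp -d)
printf '%s\n' '<eps> 0' '<UNK> 1' 'HELLO 2' 'WORLD 3' > "$dir/words.txt"
echo '<UNK>' > "$dir/oov.txt"
oov=$(cat "$dir/oov.txt")
# Keep the utterance-id (field 1); replace any word missing from words.txt with the OOV word.
mapped=$(echo 'utt1 HELLO KALDI WORLD' |
  awk -v oov="$oov" 'NR == FNR { known[$1] = 1; next }
       { for (i = 2; i <= NF; i++) if (!($i in known)) $i = oov; print }' \
      "$dir/words.txt" -)
echo "$mapped"
```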

The file data/lang/topo contains the following data:
\verbatim
s5# cat data/lang/topo
<Topology>
<TopologyEntry>
<ForPhones>
21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188
</ForPhones>
<State> 0 <PdfClass> 0 <Transition> 0 0.75 <Transition> 1 0.25 </State>
<State> 1 <PdfClass> 1 <Transition> 1 0.75 <Transition> 2 0.25 </State>
<State> 2 <PdfClass> 2 <Transition> 2 0.75 <Transition> 3 0.25 </State>
<State> 3 </State>
</TopologyEntry>
<TopologyEntry>
<ForPhones>
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
</ForPhones>
<State> 0 <PdfClass> 0 <Transition> 0 0.25 <Transition> 1 0.25 <Transition> 2 0.25 <Transition> 3 0.25 </State>
<State> 1 <PdfClass> 1 <Transition> 1 0.25 <Transition> 2 0.25 <Transition> 3 0.25 <Transition> 4 0.25 </State>
<State> 2 <PdfClass> 2 <Transition> 1 0.25 <Transition> 2 0.25 <Transition> 3 0.25 <Transition> 4 0.25 </State>
<State> 3 <PdfClass> 3 <Transition> 1 0.25 <Transition> 2 0.25 <Transition> 3 0.25 <Transition> 4 0.25 </State>
<State> 4 <PdfClass> 4 <Transition> 4 0.75 <Transition> 5 0.25 </State>
<State> 5 </State>
</TopologyEntry>
</Topology>
\endverbatim
This specifies the topology of the HMMs we use.  In this case, the "real" phones contain
three emitting states
with the standard 3-state left-to-right topology-- the "Bakis model".
(Emitting states are states that "emit" feature vectors, as distinct from the "fake"
non-emitting states that are just used to glue other states together).
Phones 1 to 20 are various kinds of silence and noise; we have a lot because of word-position-dependency,
and in fact most of these will never be used; excluding word-position dependency there are
only four (SIL, SPN, NSN and LAU).  The "silence phones" have a more complex topology with an
initial emitting state and an end emitting state, and three emitting states in the middle.
You don't have to create this file by hand.
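
As a worked example of what the numbers in the 3-state topology imply: a state
with self-loop probability p emits on average 1/(1-p) frames, so with p = 0.75 each
state accounts for about 4 frames and a phone for about 12.  The small calculation
below assumes the 10ms frame shift mentioned earlier; bear in mind that these
probabilities are only the initial values given in the topology file.

```shell
# Self-loop probability of each emitting state in the 3-state topology above.
p=0.75
# A state with self-loop probability p emits 1/(1-p) frames on average; there are 3 states.
frames=$(awk -v p="$p" 'BEGIN { print 3 / (1 - p) }')
echo "expected frames per phone: $frames"   # 12 frames, i.e. 120 ms at a 10 ms frame shift
```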

There are a number of files in <DFN>data/lang/phones/</DFN> that specify various things about
the phone set.  Most of these files exist in three separate versions: a ".txt" form, e.g.:
\verbatim
s5# head -3 data/lang/phones/context_indep.txt
SIL
SIL_B
SIL_E
\endverbatim
a ".int" form, e.g.:
\verbatim
s5# head -3 data/lang/phones/context_indep.int
1
2
3
\endverbatim
and a ".csl" form, which in a slight abuse of notation, denotes a colon-separated list,
not a comma-separated list:
\verbatim
s5# cat data/lang/phones/context_indep.csl
1:2:3:4:5:6:7:8:9:10:11:12:13:14:15:16:17:18:19:20
\endverbatim
These files always contain the same information, so let's focus on the ".txt" form which
is more human-readable.  The file "context_indep.txt" contains a list of those phones
for which we build context-independent models: that is, for those phones, we do not build a decision tree
that gets to ask questions about the left and right phonetic context.  In fact, we do build
smaller trees where we get to ask questions about the central phone and the HMM-state;
this depends on the "roots.txt" file which we'll describe below.  See \ref tree_externals
for more in-depth discussion of tree issues.
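
Since the ".txt", ".int" and ".csl" forms described above carry the same
information, the latter two can be derived from the first given phones.txt.  The
following is a sketch on a three-phone toy example; prepare_lang.sh does this for
you, and this is not a copy of its code.

```shell
dir=$(mktemp -d)
printf '%s\n' '<eps> 0' 'SIL 1' 'SIL_B 2' 'SIL_E 3' > "$dir/phones.txt"
printf '%s\n' SIL SIL_B SIL_E > "$dir/context_indep.txt"
# .int: look up each phone in the symbol table.
awk 'NR == FNR { id[$1] = $2; next } { print id[$1] }' \
    "$dir/phones.txt" "$dir/context_indep.txt" > "$dir/context_indep.int"
# .csl: join the integers with colons on a single line.
paste -s -d: "$dir/context_indep.int" > "$dir/context_indep.csl"
cat "$dir/context_indep.csl"
```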

The file <DFN>context_indep.txt</DFN> contains all the phones which are not "real phones":
i.e. silence (SIL), spoken noise (SPN), non-spoken noise (NSN), and laughter (LAU):
\verbatim
# cat data/lang/phones/context_indep.txt
SIL
SIL_B
SIL_E
SIL_I
SIL_S
SPN
SPN_B
SPN_E
SPN_I
SPN_S
NSN
NSN_B
NSN_E
NSN_I
NSN_S
LAU
LAU_B
LAU_E
LAU_I
LAU_S
\endverbatim
There are a lot of variants of these phones because of word-position dependency; not all of these variants
will ever be used.  Here, <DFN>SIL</DFN> would be the silence that gets optionally inserted by the
lexicon (not part of a word), <DFN>SIL_B</DFN> would be a silence phone at the beginning of a word
(which should never exist), <DFN>SIL_I</DFN> word-internal silence (unlikely to exist), <DFN>SIL_E</DFN>
word-ending silence (should never exist), and <DFN>SIL_S</DFN> would be silence as a "singleton
word", i.e. a word with only one phone-- this might be used if you had a "silence word" in your
lexicon and explicit silences appear in your transcriptions.

The files <DFN>silence.txt</DFN> and <DFN>nonsilence.txt</DFN> contain lists of the silence
phones and nonsilence phones respectively.  These should be mutually exclusive and, together,
should contain all the phones.  In this particular setup, <DFN>silence.txt</DFN> is identical
to <DFN>context_indep.txt</DFN>.
What we mean by "nonsilence" phones is, phones that we
intend to estimate various kinds of linear transforms on: that is, global transforms
such as LDA and MLLT, and speaker adaptation transforms such as fMLLR.  Our belief based
on prior experience is that it does not pay to include silence in the estimation of
such transforms.  Our practice is
to designate all silence, noise and vocalized-noise phones as "silence" phones, and all
phones representing traditional phonemes as "nonsilence" phones.  We haven't experimented
in Kaldi with the best way to do this.
\verbatim
s5# head -3 data/lang/phones/silence.txt
SIL
SIL_B
SIL_E
s5# head -3 data/lang/phones/nonsilence.txt
IY_B
IY_E
IY_I
\endverbatim
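
The requirement that the two lists be mutually exclusive and together cover all
the phones can be checked mechanically.  The sketch below does so on a toy phone
set (the phones and ids are made up for the example); it treats every entry of
phones.txt other than the epsilon symbol and symbols starting with a hash as a
phone.

```shell
dir=$(mktemp -d)
printf '%s\n' SIL SPN > "$dir/silence.txt"
printf '%s\n' AA IY > "$dir/nonsilence.txt"
printf '%s\n' '<eps> 0' 'SIL 1' 'SPN 2' 'AA 3' 'IY 4' '#0 5' > "$dir/phones.txt"
# Phones listed in both files (there should be none).
overlap=$(cat "$dir/silence.txt" "$dir/nonsilence.txt" | LC_ALL=C sort | uniq -d)
# All phones except <eps> and the symbols that start with "#".
awk '$1 != "<eps>" && $1 !~ /^#/ { print $1 }' "$dir/phones.txt" | LC_ALL=C sort > "$dir/all"
cat "$dir/silence.txt" "$dir/nonsilence.txt" | LC_ALL=C sort > "$dir/union"
if [ -z "$overlap" ] && cmp -s "$dir/all" "$dir/union"; then status=ok; else status=bad; fi
echo "$status"
```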

The file <DFN>disambig.txt</DFN> contains a list of the "disambiguation symbols"
(see \ref graph_disambig):
\verbatim
s5# head -3 data/lang/phones/disambig.txt
#0
#1
#2
\endverbatim
These symbols appear in the file <DFN>phones.txt</DFN> as if they were phones.

The file <DFN>optional_silence.txt</DFN> contains a single phone which can optionally
appear between words:
\verbatim
s5# cat data/lang/phones/optional_silence.txt
SIL
\endverbatim
The mechanism by which it appears optionally between words is that it appears
optionally in the lexicon FST at the end of every word (and also the beginning of the
utterance).  The reason it has to be specified in <DFN>phones/</DFN> instead of just appearing
in <DFN>L.fst</DFN> is obscure and we won't go into it here.

The file <DFN>sets.txt</DFN> contains sets of phones that we group together (consider as
the same phone) while clustering the phones in order to create the context-dependency questions
(in Kaldi we use automatically generated questions when building decision trees,
rather than linguistically meaningful ones).
In this particular setup, <DFN>sets.txt</DFN> groups together all the word-position-dependent
versions of each phone:
\verbatim
s5# head -3 data/lang/phones/sets.txt
SIL SIL_B SIL_E SIL_I SIL_S
SPN SPN_B SPN_E SPN_I SPN_S
NSN NSN_B NSN_E NSN_I NSN_S
\endverbatim

The file <DFN>extra_questions.txt</DFN> contains some extra questions which we'll include
in addition to the automatically generated questions:
\verbatim
s5# cat data/lang/phones/extra_questions.txt
IY_B B_B D_B F_B G_B K_B SH_B L_B M_B N_B OW_B AA_B TH_B P_B OY_B R_B UH_B AE_B S_B T_B AH_B V_B W_B Y_B Z_B CH_B AO_B DH_B UW_B ZH_B EH_B AW_B AX_B EL_B AY_B EN_B HH_B ER_B IH_B JH_B EY_B NG_B
IY_E B_E D_E F_E G_E K_E SH_E L_E M_E N_E OW_E AA_E TH_E P_E OY_E R_E UH_E AE_E S_E T_E AH_E V_E W_E Y_E Z_E CH_E AO_E DH_E UW_E ZH_E EH_E AW_E AX_E EL_E AY_E EN_E HH_E ER_E IH_E JH_E EY_E NG_E
IY_I B_I D_I F_I G_I K_I SH_I L_I M_I N_I OW_I AA_I TH_I P_I OY_I R_I UH_I AE_I S_I T_I AH_I V_I W_I Y_I Z_I CH_I AO_I DH_I UW_I ZH_I EH_I AW_I AX_I EL_I AY_I EN_I HH_I ER_I IH_I JH_I EY_I NG_I
IY_S B_S D_S F_S G_S K_S SH_S L_S M_S N_S OW_S AA_S TH_S P_S OY_S R_S UH_S AE_S S_S T_S AH_S V_S W_S Y_S Z_S CH_S AO_S DH_S UW_S ZH_S EH_S AW_S AX_S EL_S AY_S EN_S HH_S ER_S IH_S JH_S EY_S NG_S
SIL SPN NSN LAU
SIL_B SPN_B NSN_B LAU_B
SIL_E SPN_E NSN_E LAU_E
SIL_I SPN_I NSN_I LAU_I
SIL_S SPN_S NSN_S LAU_S
\endverbatim
You will observe that a question is simply a set of phones.
The first four questions are asking about the word-position, for regular phones; and the last five do the same for
the "silence phones".  The "silence" phones also come in a variety without a suffix like <DFN>_B</DFN>,
for example <DFN>SIL</DFN>.  These may appear as optional silence in the lexicon, i.e. not inside an
actual word.  In setups with things like tone dependency or stress markings, <DFN>extra_questions.txt</DFN>
may contain questions that relate to those features.
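
To make the structure of these questions concrete, here is a minimal Python sketch that builds
the same nine questions from plain phone lists.  It is only an illustration: the function name
<DFN>word_position_questions</DFN> is hypothetical, and this is not the code that
<DFN>prepare_lang.sh</DFN> actually runs.

```python
# Word-position suffixes, in the order they appear in extra_questions.txt above.
SUFFIXES = ["_B", "_E", "_I", "_S"]


def word_position_questions(nonsilence, silence):
    """Return a list of questions; each question is a list of phone names.

    Hypothetical helper for illustration only.
    """
    questions = []
    # One question per word position for the regular phones.
    for suffix in SUFFIXES:
        questions.append([phone + suffix for phone in nonsilence])
    # Silence phones also exist without a suffix, so they get five questions.
    questions.append(list(silence))
    for suffix in SUFFIXES:
        questions.append([phone + suffix for phone in silence])
    return questions


if __name__ == "__main__":
    # A shortened phone list; the real setup has many more regular phones.
    for question in word_position_questions(["IY", "B", "D"],
                                            ["SIL", "SPN", "NSN", "LAU"]):
        print(" ".join(question))
```

Each printed line has the same layout as a line of <DFN>extra_questions.txt</DFN>: four
questions for the regular phones, then five for the silence phones.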

The file <DFN>word_boundary.txt</DFN> explains how the phones relate to word positions:
\verbatim
s5# head  data/lang/phones/word_boundary.txt
SIL nonword
SIL_B begin
SIL_E end
SIL_I internal
SIL_S singleton
SPN nonword
SPN_B begin
\endverbatim
This is the same information as is in the suffixes of the phones (<DFN>_B</DFN> and so on), but
we don't like to hardcode this in the text form of the phones-- for one thing, Kaldi executables
never see the text form of the phones, but only an integerized form.  So it is specified
by this file <DFN>word_boundary.txt</DFN>.  The main reason we need this information is
in order to recover the word boundaries within lattices (for example, the program
lattice-align-words reads the integer version of this file, <DFN>word_boundary.int</DFN>).
Finding the word boundaries is useful for reasons including NIST sclite scoring, which requires
the time markings for words, and for other downstream processing.
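
As a rough illustration of how this table can be used, the Python sketch below reads lines in
the <DFN>word_boundary.txt</DFN> format and finds where words start in a phone sequence.  The
helper names are hypothetical, the entries for non-silence phones are assumed to follow the
same pattern as the silence entries shown above, and the real lattice-align-words program does
considerably more than this.

```python
def parse_word_boundary(lines):
    """Map each phone name to its word-position label (e.g. 'begin')."""
    table = {}
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        phone, position = fields
        table[phone] = position
    return table


def word_start_indices(phones, table):
    """Indices at which a word starts: 'begin' or 'singleton' phones."""
    return [i for i, phone in enumerate(phones)
            if table[phone] in ("begin", "singleton")]


if __name__ == "__main__":
    # Assumed entries; only the SIL and SPN lines appear in the excerpt above.
    table = parse_word_boundary([
        "SIL nonword",
        "K_B begin",
        "AE_I internal",
        "T_E end",
        "AY_S singleton",
    ])
    print(word_start_indices(["SIL", "K_B", "AE_I", "T_E", "AY_S"], table))
```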

The file <DFN>roots.txt</DFN> contains information that relates to how we build the phonetic-context
decision tree:
\verbatim
s5# head data/lang/phones/roots.txt
shared split SIL SIL_B SIL_E SIL_I SIL_S
shared split SPN SPN_B SPN_E SPN_I SPN_S
shared split NSN NSN_B NSN_E NSN_I NSN_S
shared split LAU LAU_B LAU_E LAU_I LAU_S
...
shared split B_B B_E B_I B_S
\endverbatim
For now you can ignore the words "shared" and "split"-- these relate to certain options
in how we build the decision tree (see \ref tree_externals for more information).
The significance of having a number of phones on a single line, for
example <DFN>SIL SIL_B SIL_E SIL_I SIL_S</DFN>, is that all of these phones
have a single "shared root" in the decision tree, so states may be shared
between those phones.  For stress and tone-dependent systems, typically
all the stress or tone-dependent versions of a particular phone will appear on
the same line.  In addition, all three states of an HMM (or all five states, for
silences) share the root, and the decision-tree building process gets to
ask about the state.  This sharing of the decision-tree root
between the HMM-states is what we mean by "shared" in the roots file.
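
The layout of a <DFN>roots.txt</DFN> line can be sketched in Python as follows.  This is
an illustration under assumptions, not Kaldi's own parser: the function names are hypothetical,
and the negated keywords <DFN>not-shared</DFN> and <DFN>not-split</DFN> do not appear in the
excerpt above (we assume them as the alternatives to "shared" and "split").

```python
def parse_roots_line(line):
    """Split a roots.txt line into (shared, split, phones)."""
    fields = line.split()
    if len(fields) < 3:
        raise ValueError("expected two keywords and at least one phone: %r" % line)
    shared_kw, split_kw, phones = fields[0], fields[1], fields[2:]
    # The negated keywords are an assumption; see the text above.
    if shared_kw not in ("shared", "not-shared"):
        raise ValueError("bad first keyword: %r" % shared_kw)
    if split_kw not in ("split", "not-split"):
        raise ValueError("bad second keyword: %r" % split_kw)
    return shared_kw == "shared", split_kw == "split", phones


def phone_to_root(lines):
    """Map each phone to the index of the line (tree root) it appears on."""
    mapping = {}
    for index, line in enumerate(lines):
        _, _, phones = parse_roots_line(line)
        for phone in phones:
            mapping[phone] = index
    return mapping


if __name__ == "__main__":
    roots = phone_to_root([
        "shared split SIL SIL_B SIL_E SIL_I SIL_S",
        "shared split B_B B_E B_I B_S",
    ])
    # Phones on the same line share a root, so they get the same index.
    print(roots["SIL"], roots["SIL_B"], roots["B_S"])
```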

\section data_prep_lang_creating Creating the "lang" directory

The <DFN>data/lang/</DFN> directory contains a lot of different files, so we have
provided a script that creates it for you starting from a relatively simple
input:
\verbatim
utils/prepare_lang.sh data/local/dict "<UNK>" data/local/lang data/lang
\endverbatim
Here, the inputs are the directory <DFN>data/local/dict/</DFN>, and the label <DFN>\<UNK\></DFN>
which is the dictionary word we will map OOV words to when they appear in transcripts
(this becomes data/lang/oov.txt).  The location <DFN>data/local/lang/</DFN> is simply a
temporary directory which the script will use; <DFN>data/lang/</DFN> is where
it actually puts its output.

The thing which you, as the data-preparer, need to create, is the directory
<DFN>data/local/dict/</DFN>.  The directory contains the following contents:
\verbatim
s5# ls data/local/dict
extra_questions.txt  lexicon.txt nonsilence_phones.txt  optional_silence.txt  silence_phones.txt
\endverbatim
(in fact there are a few more files there which we haven't listed, but they are just temporary files that
were put there while creating that directory, and we can ignore them).  The commands below give
you an idea what is in these files:
\verbatim
s5# head -3 data/local/dict/nonsilence_phones.txt
IY
B
D
s5# cat data/local/dict/silence_phones.txt
SIL
SPN
NSN
LAU
s5# cat data/local/dict/extra_questions.txt
s5# head -5 data/local/dict/lexicon.txt
!SIL SIL
-'S S
-'S Z
-'T K UH D EN T
-1K W AH N K EY
\endverbatim
As you can see, the contents of this directory are very simple in this
setup (the Switchboard setup).  We just have lists of the "real" phones and of the
"silence" phones respectively, an empty file called <DFN>extra_questions.txt</DFN>, and
a file called <DFN>lexicon.txt</DFN> which has the format
\verbatim
<word> <phone1> <phone2> ...
\endverbatim
Note: <DFN>lexicon.txt</DFN> will contain repeated entries for the same word,
on separate lines,
if we have multiple pronunciations for it.  If you want to use pronunciation
probabilities, instead of creating the file <DFN>lexicon.txt</DFN>, create a file
called <DFN>lexiconp.txt</DFN> that has the probability as the second field.
Note that it is common practice to normalize the pronunciation probabilities so that,
instead of summing to one, the most probable pronunciation for each word has probability one.  This
tends to give better results.  For a top-level script that runs with
pronunciation probabilities, search for <DFN>pp</DFN> in <DFN>egs/wsj/s5/run.sh</DFN>.
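
The max-normalization described above can be sketched in a few lines of Python.  The function
name is hypothetical and the probabilities in the example are made up for illustration; they do
not come from any real lexicon.

```python
from collections import defaultdict


def normalize_max_to_one(entries):
    """Rescale probabilities so the best pronunciation of each word has 1.0.

    entries: list of (word, probability, phones) tuples, as in lexiconp.txt.
    """
    best = defaultdict(float)
    for word, prob, _ in entries:
        best[word] = max(best[word], prob)
    return [(word, prob / best[word], phones) for word, prob, phones in entries]


if __name__ == "__main__":
    # Made-up probabilities for the two pronunciations of -'S shown above.
    entries = [
        ("-'S", 0.75, ["S"]),
        ("-'S", 0.25, ["Z"]),
        ("!SIL", 1.0, ["SIL"]),
    ]
    for word, prob, phones in normalize_max_to_one(entries):
        # Same field order as lexiconp.txt: word, probability, phones.
        print(word, prob, " ".join(phones))
```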

Notice that in this input there is no notion of word-position dependency,
i.e. no suffixes like <DFN>_B</DFN> and <DFN>_E</DFN>.  This is because it is the
script <DFN>prepare_lang.sh</DFN> that adds those suffixes.
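
To show what adding the suffixes amounts to, here is a small Python sketch of the mapping from a
plain pronunciation to a word-position-dependent one.  It is modeled on the suffix convention
described on this page (<DFN>_B</DFN>, <DFN>_I</DFN>, <DFN>_E</DFN>, <DFN>_S</DFN>); the
function name is hypothetical and the script itself may differ in detail.

```python
def add_word_position(phones):
    """Attach word-position suffixes to the phones of one pronunciation."""
    if not phones:
        raise ValueError("a pronunciation needs at least one phone")
    if len(phones) == 1:
        # A one-phone word is both the beginning and the end: singleton.
        return [phones[0] + "_S"]
    middle = [phone + "_I" for phone in phones[1:-1]]
    return [phones[0] + "_B"] + middle + [phones[-1] + "_E"]


if __name__ == "__main__":
    # Two entries from the Switchboard lexicon.txt excerpt above.
    print(" ".join(add_word_position(["S"])))
    print(" ".join(add_word_position(["K", "UH", "D", "EN", "T"])))
```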

You can see from the empty <DFN>extra_questions.txt</DFN> file that there
is some kind of potential here that is not being fully exploited.  This relates
to things like stress markings or tone markings.  You may want to have different
versions of a particular phone that have different stress or tone.  In order
to demonstrate what this looks like, we'll view the same files as above,
but in the <DFN>egs/wsj/s5/</DFN> setup.  The result is below:
\verbatim
s5# cat data/local/dict/silence_phones.txt
SIL
SPN
NSN
s5# head data/local/dict/nonsilence_phones.txt
S
UW UW0 UW1 UW2
T
N
K
Y
Z
AO AO0 AO1 AO2
AY AY0 AY1 AY2
SH
s5# head -6 data/local/dict/lexicon.txt
!SIL SIL
<SPOKEN_NOISE> SPN
<UNK> SPN
<NOISE> NSN
!EXCLAMATION-POINT  EH2 K S K L AH0 M EY1 SH AH0 N P OY2 N T
"CLOSE-QUOTE  K L OW1 Z K W OW1 T
s5# cat data/local/dict/extra_questions.txt
SIL SPN NSN
S UW T N K Y Z AO AY SH W NG EY B CH OY JH D ZH G UH F V ER AA IH M DH L AH P OW AW HH AE R TH IY EH
UW1 AO1 AY1 EY1 OY1 UH1 ER1 AA1 IH1 AH1 OW1 AW1 AE1 IY1 EH1
UW0 AO0 AY0 EY0 OY0 UH0 ER0 AA0 IH0 AH0 OW0 AW0 AE0 IY0 EH0
UW2 AO2 AY2 EY2 OY2 UH2 ER2 AA2 IH2 AH2 OW2 AW2 AE2 IY2 EH2
s5#
\endverbatim
You may notice that some of the lines in <DFN>nonsilence_phones.txt</DFN> contain
multiple phones on a single line.  These are the different stress-dependent
versions of the vowels.  Note that four different versions of each phone
appear in the CMU dictionary: for example, <DFN>UW UW0 UW1 UW2</DFN>;
for some reason, one of these versions does not have a numeric suffix.
The order of the phones on the line does not matter, but the grouping into
different lines does matter; in general, we advise users to group all forms of
each "real phone" on a separate line.
We use the stress markings present in the CMU
dictionary.  The file <DFN>extra_questions.txt</DFN> contains a single question
containing all of the "silence" phones (in fact this is unnecessary, as
it appears that the script <DFN>prepare_lang.sh</DFN> adds such a question anyway),
and also a question corresponding to each of the different stress markers.
These questions are necessary in order to get any benefit from the
stress markers.  Because the different stress-dependent versions
of each phone are together on the lines of <DFN>nonsilence_phones.txt</DFN>,
they stay together in <DFN>data/lang/phones/roots.txt</DFN> and
<DFN>data/lang/phones/sets.txt</DFN>, which in turn ensures that they
share the same tree root and can never be distinguished by an automatically
generated question.  Thus,
we have to provide a special question that affords the decision-tree building
process a way to distinguish between the phones.  Note: the reason we put the
phones together in the <DFN>sets.txt</DFN> and <DFN>roots.txt</DFN> is that some
of the stress-dependent versions of phones may have too little data to
robustly estimate either a separate decision tree or the phone clustering
information that's used in producing the questions.  By grouping them together
like this, we ensure that in the absence of enough data to estimate them
separately, these different versions of the phone all "stay together" throughout
the decision-tree building process.
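
The stress questions above can be derived mechanically from <DFN>nonsilence_phones.txt</DFN>.
The Python sketch below groups phones by their trailing stress digit; it is an illustration
only (the function name is hypothetical, and we are not claiming this is how the WSJ recipe
generates its file).

```python
def stress_questions(nonsilence_lines):
    """Group phones by stress marker ('' for no marker, else '0', '1', '2')."""
    by_marker = {}
    for line in nonsilence_lines:
        for phone in line.split():
            marker = phone[-1] if phone[-1].isdigit() else ""
            by_marker.setdefault(marker, []).append(phone)
    return by_marker


if __name__ == "__main__":
    # The first few lines of the WSJ nonsilence_phones.txt shown above.
    groups = stress_questions(["S", "UW UW0 UW1 UW2", "T", "AO AO0 AO1 AO2"])
    for marker in sorted(groups):
        # Each printed line is one question: a set of phones.
        print(" ".join(groups[marker]))
```

Note that the phones on one input line end up in different questions; this is exactly what
lets the tree-building process separate them even though they share a root.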

We should mention at this point that the script <DFN>utils/prepare_lang.sh</DFN>
supports a number of options.  To give you an idea of what they are, here is
the usage message of that script:
\verbatim
usage: utils/prepare_lang.sh <dict-src-dir> <oov-dict-entry> <tmp-dir> <lang-dir>
e.g.: utils/prepare_lang.sh data/local/dict <SPOKEN_NOISE> data/local/lang data/lang
options:
     --num-sil-states <number of states>             # default: 5, #states in silence models.
     --num-nonsil-states <number of states>          # default: 3, #states in non-silence models.
     --position-dependent-phones (true|false)        # default: true; if true, use _B, _E, _S & _I
                                                     # markers on phones to indicate word-internal positions.
     --share-silence-phones (true|false)             # default: false; if true, share pdfs of
                                                     # all silence phones.
     --sil-prob <probability of silence>             # default: 0.5 [must have 0 < silprob < 1]
\endverbatim
A potentially important option is the <DFN>--share-silence-phones</DFN> option.
The default is false.  If this option is true, all the pdf's (the Gaussian
mixture models) of all the silence phones such as silence, vocalized-noise,
noise and laughter, will be shared and only the transition probabilities will
differ between those models.  It's not clear why this should help, but we found
that it was extremely helpful for the Cantonese data of IARPA's BABEL project.
That data is very messy and has long untranscribed portions that we try to
align to a special phone which we designate for that purpose.  We suspect
that the training data was somehow failing to align correctly, and for some reason
setting this option to true changed that.

Another potentially important option is the <DFN>--sil-prob</DFN> option.  In general, we have
not experimented much with any of these options so we cannot give very detailed advice.

\section data_prep_grammar Creating the language model or grammar

Our tutorial above on how to create the <DFN>lang/</DFN> directory did not address how
to create the file <DFN>G.fst</DFN>, which is the finite state transducer form of
the language model or grammar that we'll decode with.  In fact, in some setups
we may have many "lang" directories for testing purposes, with different
language models and dictionaries.  The Wall Street Journal (WSJ) setup is an example:
\verbatim
s5# echo data/lang*
data/lang data/lang_test_bd_fg data/lang_test_bd_tg data/lang_test_bd_tgpr data/lang_test_bg \
 data/lang_test_bg_5k data/lang_test_tg data/lang_test_tg_5k data/lang_test_tgpr data/lang_test_tgpr_5k
\endverbatim

The process for creating <DFN>G.fst</DFN> is different depending on whether we're using
a statistical language model or some kind of grammar.  In the RM setup there is
a bigram grammar, which only allows certain pairs of words.  We make this sum to
one within each grammar state by assigning a probability of 1 over the number of
outgoing arcs.  There is a statement in <DFN>local/rm_data_prep.sh</DFN> that does:
\verbatim
local/make_rm_lm.pl $RMROOT/rm1_audio1/rm1/doc/wp_gram.txt  > $tmpdir/G.txt || exit 1;
\endverbatim
This script <DFN>local/make_rm_lm.pl</DFN> creates a grammar in FST format (text format,
not binary format).  It contains lines like the following:
\verbatim
s5# head data/local/tmp/G.txt
0    1    ADD    ADD    5.19849703126583
0    2    AJAX+S    AJAX+S    5.19849703126583
0    3    APALACHICOLA+S    APALACHICOLA+S    5.19849703126583
\endverbatim
See <a href="http://www.openfst.org">www.openfst.org</a> for more information on OpenFst (they
have a useful tutorial).  The script <DFN>local/rm_prepare_grammar.sh</DFN> will turn this into
the binary-format file <DFN>G.fst</DFN> using the following statement:
\verbatim
fstcompile --isymbols=data/lang/words.txt --osymbols=data/lang/words.txt --keep_isymbols=false \
    --keep_osymbols=false $tmpdir/G.txt > data/lang/G.fst
\endverbatim
If you want to create your own grammar, you will probably want to do something similar.
Note: this type of procedure only applies to grammars of a certain class: it won't
allow you to compile a complete Context Free Grammar, because it can't be represented
in OpenFst format.  There are ways to do this in the WFST framework
(e.g. see recent work by Mike Riley with pushdown transducers), but we have not yet
worked with those ideas in Kaldi.
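
The costs in <DFN>G.txt</DFN> above are negated natural logarithms of the arc probabilities.
The Python sketch below shows the "one over the number of outgoing arcs" rule and writes arcs
in the same five-column text layout.  The function names are hypothetical, and the figure of
181 outgoing arcs is only inferred from the cost shown above (it is the count for which the
formula reproduces that value); it is not stated anywhere in the RM scripts quoted here.

```python
import math


def uniform_arc_cost(num_outgoing_arcs):
    """Cost of each arc when a state's arcs share probability equally."""
    if num_outgoing_arcs < 1:
        raise ValueError("a state needs at least one outgoing arc")
    return -math.log(1.0 / num_outgoing_arcs)


def arcs_for_state(src, successors):
    """Text-format arcs (src, dest, input, output, cost) for one state.

    successors: list of (dest_state, word) pairs.
    """
    cost = uniform_arc_cost(len(successors))
    return ["%d\t%d\t%s\t%s\t%.14f" % (src, dst, word, word, cost)
            for dst, word in successors]


if __name__ == "__main__":
    print(uniform_arc_cost(181))  # compare with the cost column in G.txt above
    for line in arcs_for_state(0, [(1, "ADD"), (2, "AJAX+S")]):
        print(line)
```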

Please, before asking any questions on the list about language models or about making
grammar FSTs, read "A Bit of Progress in Language Modeling" by Joshua Goodman; and go to
www.openfst.org and do the FST tutorial so that you understand the basics of finite
state transducers.  (Note that language models would be represented as finite state
acceptors, or FSAs, which can be considered as a special case of finite state transducers).

The script <DFN>utils/format_lm.sh</DFN> deals with converting the ARPA-format language
models into an OpenFst format. Here is the usage message of that script:
\verbatim
Usage: utils/format_lm.sh <lang_dir> <arpa-LM> <lexicon> <out_dir>
E.g.: utils/format_lm.sh data/lang data/local/lm/foo.kn.gz data/local/dict/lexicon.txt data/lang_test
Convert ARPA-format language models to FSTs.
\endverbatim
Some of the key commands from that script are:
\verbatim
gunzip -c $lm \
  | arpa2fst --disambig-symbol=#0 \
             --read-symbol-table=$out_dir/words.txt - $out_dir/G.fst
\endverbatim
This Kaldi program, <DFN>arpa2fst</DFN>, turns the ARPA-format language model
into a Weighted Finite State Transducer (actually, an acceptor).

A popular toolkit for building language models is SRILM.  Various language
modeling toolkits are used in the Kaldi example scripts.  SRILM is the best
documented and most fully featured, and we generally recommend it (its only
drawback is that it doesn't have the most permissive license).  Here is the usage
message of <DFN>utils/format_lm_sri.sh</DFN>:

\verbatim
Usage: utils/format_lm_sri.sh [options] <lang-dir> <arpa-LM> <out-dir>
E.g.: utils/format_lm_sri.sh data/lang data/local/lm/foo.kn.gz data/lang_test
Converts ARPA-format language models to FSTs. Change the LM vocabulary using SRILM.
\endverbatim


\section data_prep_unknown Note on unknown words

This is an explanation of how Kaldi deals with unknown words (words not in the
vocabulary); we are putting it on the "data preparation" page for lack of a more obvious
location.

In many setups, <DFN>\<unk\></DFN> or something similar will be present in the
LM as long as the data that you used to train the LM had words that were not
in the vocabulary you used to train the LM,
because language modeling toolkits tend to map those all to a
single special word, usually called <DFN>\<unk\></DFN> or
<DFN>\<UNK\></DFN>.  You can look at the arpa file to figure out what it's called; it
will usually be one of those two.
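
If you would rather not read the arpa file by eye, a short script can do the check.  The Python
sketch below scans the unigram section for one of the two usual spellings.  It is a minimal
sketch under assumptions: the function name is hypothetical, the input is assumed to be an
uncompressed ARPA file already split into lines, and only the two spellings mentioned above
are tried.

```python
def find_unknown_word(arpa_lines, candidates=("<unk>", "<UNK>")):
    """Return (word, log10 probability) for the unknown word, or None."""
    in_unigrams = False
    for raw in arpa_lines:
        line = raw.strip()
        if line.startswith("\\"):
            # Section markers such as \data\, \1-grams:, \2-grams:, \end\
            in_unigrams = (line == "\\1-grams:")
            continue
        if in_unigrams and line:
            fields = line.split()
            # Unigram lines are: logprob word [backoff]
            if len(fields) >= 2 and fields[1] in candidates:
                return fields[1], float(fields[0])
    return None


if __name__ == "__main__":
    # A tiny made-up ARPA fragment, for illustration only.
    sample = [
        "\\data\\",
        "ngram 1=2",
        "",
        "\\1-grams:",
        "-1.2\t<unk>",
        "-0.5\thello\t-0.3",
        "",
        "\\end\\",
    ]
    print(find_unknown_word(sample))
```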


During training, if there are words in the <DFN>text</DFN> file in your data
directory that are not in the <DFN>words.txt</DFN> in the lang directory that
you are using, Kaldi will map them to a special word that's specified in the
lang directory in the file <DFN>data/lang/oov.txt</DFN>; it will usually be
either <DFN>\<unk\></DFN>, <DFN>\<UNK\></DFN> or maybe
<DFN>\<SPOKEN_NOISE\></DFN>.  This word will have been chosen by the user
(i.e., you), and supplied to <DFN>prepare_lang.sh</DFN> as a command-line argument.
If this word has nonzero probability in the language model (which you can test
by looking at the arpa file), then it will be possible for Kaldi to recognize
this word at test time.  This will often be the case if you call this word
<DFN>\<unk\></DFN>, because as we mentioned above, language modeling toolkits
will often use this spelling for ``unknown word'' (which is a special word that
all out-of-vocabulary words get mapped to).  Decoding output will always be limited to the
intersection of the words in the language model with the words in the <DFN>lexicon.txt</DFN>
(or whatever file format you supplied the lexicon in, e.g. <DFN>lexiconp.txt</DFN>); these
words will all be present in the <DFN>words.txt</DFN>
in your <DFN>lang</DFN> directory.
So if Kaldi's "unknown word" doesn't match the LM's "unknown word", you will
simply never decode this word.  In any
case, even when allowed to be decoded, this word typically won't be output very
often and in practice it doesn't tend to have much impact on WERs.
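
The two behaviors described above, mapping OOV words in training transcripts and restricting
decoding to words shared by the LM and the lexicon, can be sketched in Python as follows.
The function names and the example words are hypothetical; Kaldi's own scripts work on
integerized symbol tables rather than on Python sets.

```python
def map_oovs(transcript_words, vocabulary, oov_word):
    """Replace every word outside the vocabulary with the OOV word."""
    return [word if word in vocabulary else oov_word
            for word in transcript_words]


def decodable_words(lm_words, lexicon_words):
    """Words that can appear in decoding output: in both LM and lexicon."""
    return set(lm_words) & set(lexicon_words)


if __name__ == "__main__":
    lexicon_words = {"HELLO", "WORLD", "<UNK>"}
    # Training: the unseen word is mapped to the word named in oov.txt.
    print(map_oovs(["HELLO", "ZYZZYVA"], lexicon_words, "<UNK>"))
    # Decoding: the LM spells its unknown word differently, so neither
    # spelling is in the intersection and it can never be decoded.
    lm_words = {"HELLO", "WORLD", "<unk>"}
    print(sorted(decodable_words(lm_words, lexicon_words)))
```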

Of course a single phone (such as the <DFN>SPN</DFN> phone that <DFN>\<UNK\></DFN> maps to in
the WSJ lexicon shown earlier) isn't a very good, or accurate, model of OOV words.  In
some Kaldi setups we have example scripts with names
<DFN>local/run_unk_model.sh</DFN>: e.g., see the file
<DFN>tedlium/s5_r2/local/run_unk_model.sh</DFN>.  These scripts replace the unk
phone with a phone-level LM.  They make it possible to get access to
the sequence of phones in a hypothesized unknown word.  Note: unknown words
should be considered an "advanced topic" in speech recognition and we discourage
beginners from looking into this topic too closely.


*/