This is a WIP English LVCSR recipe that trains on data from multiple corpora. By default, it uses all of the following:

  • Fisher (1761 hours)
  • Switchboard (317 hours)
  • WSJ (81 hours)
  • HUB4 (1996 & 1997) English Broadcast News (75 + 72 hours)
  • TED-LIUM (118 hours)
  • Librispeech (960 hours)

It is possible to add or remove datasets as necessary.
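
Concretely, adding or removing a corpus comes down to which per-corpus training directories get pooled together. A purely hypothetical sketch (the corpora variable and the destination stage directory are made up for illustration; local/make_partitions.sh, described below, is where this actually happens):

# Hypothetical: choose corpora, then pool their per-corpus training
# directories (data/{CORPUS}/train, see the table below) into one
# stage directory using Kaldi's standard combine utility.
corpora="fisher swbd wsj tedlium librispeech"   # add or drop names here
train_dirs=
for c in $corpora; do
  train_dirs="$train_dirs data/$c/train"
done
utils/combine_data.sh data/multi_a/tri4 $train_dirs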

This recipe was developed by Allen Guo (ICSI), Korbinian Riedhammer (Remeeting) and Xiaohui Zhang (JHU). The original spec for the recipe is at #699.

Recipe features

Recipe variants

To make it as easy as possible to extend and modify this example, we provide the ability to create separate recipe variants. These variants are named multi_a, multi_b, etc. For instance, you could have

  • multi_a = default
    • Bootstrap GMM-HMM on WSJ SI-84
    • Train SAT system on AMI/Fisher/Librispeech/SWBD/Tedlium/WSJ (no AMI-SDM)
    • Train TDNN (iVector) nnet3 model on top of that
  • multi_b = your first experiment
    • Train monophone model on SWBD
    • Then add in WSJ SI-284 for remaining GMM-HMM steps
    • Then train AMI/SWBD/WSJ nnet3 model
  • multi_c = your second experiment
    • Train GMM-HMM on SWBD
    • Then train SWBD/Tedlium nnet3 model
  • ...

The data and exp directories for these variants can exist side-by-side: multi_x uses data/multi_x for training data directories and exp/multi_x for exp directories. This means you can easily train models on arbitrary combinations of whatever corpora you have on hand without overwriting previous work—simply create one recipe variant per experiment.
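
For example, after running two variants side by side, you might see:

$ ls -d data/multi_*/ exp/multi_*/
data/multi_a/
data/multi_b/
exp/multi_a/
exp/multi_b/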

Training partitions

Instead of having a few train_* directories (like train_200k or train_50_shortest), there is one training data directory (or symlink) for each step of training, e.g.

$ ls -1F data/multi_a/
mono/
mono_ali/
tri1@
tri1_ali@
tri2@
tri2_ali/
tri3@
tri3_ali/
tri4@
# ...

The result is that the training script is much easier to read, since it essentially boils down to:

| Do... | ...with the data in... | ...and output the model to... |
| --- | --- | --- |
| training | data/multi_a/mono | exp/multi_a/mono |
| alignment | data/multi_a/mono_ali | exp/multi_a/mono_ali |
| training | data/multi_a/tri1 | exp/multi_a/tri1 |
| training | data/multi_a/tri2 | exp/multi_a/tri2 |
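
In code, each of those rows corresponds to one call to a stock Kaldi script. A minimal sketch of the pattern (the lang directory and the Gaussian counts are placeholders, not this recipe's actual settings):

# Each stage reads its own data/multi_a/{STAGE} directory.
steps/train_mono.sh data/multi_a/mono data/lang exp/multi_a/mono
steps/align_si.sh data/multi_a/mono_ali data/lang exp/multi_a/mono exp/multi_a/mono_ali
steps/train_deltas.sh 2500 30000 data/multi_a/tri1 data/lang exp/multi_a/mono_ali exp/multi_a/tri1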

What training data to use for each stage is specified by local/make_partitions.sh, which creates the data/{MULTI}/{STAGE} directories.
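
As a hedged illustration of what one branch of that script might do (the corpora, subset size, and --shortest trick are assumptions for the sketch, not the script's actual contents):

# Illustrative only: a small short-utterance subset for mono,
# a larger pool for tri1, and a symlink so tri2 reuses tri1's data
# (matching the @ entries in the ls listing above).
utils/subset_data_dir.sh --shortest data/swbd/train 50000 data/multi_a/mono
utils/combine_data.sh data/multi_a/tri1 data/swbd/train data/wsj/train
ln -sf tri1 data/multi_a/tri2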

Again, this convention was chosen for its simplicity and extensibility.

Full table

The table below lists all major structural differences between this recipe and the fisher_swbd recipe.

| Description | Location in fisher_swbd | Location in this recipe | Example(s) in this recipe |
| --- | --- | --- | --- |
| Config files | conf | conf/{CORPUS} | conf/swbd |
| Corpus-specific training data directories | data/train_swbd or data/train_fisher | data/{CORPUS}/train | data/tedlium/train |
| Data directories for testing | data/rt03 or data/eval2000 | data/{CORPUS}/test | data/tedlium/test or data/eval2000/test |
| Data directories used during training (may combine data from multiple corpora) | data/train_all | data/{MULTI}/{STAGE} | data/multi_a/tri2 |
| Results | exp | exp/{MULTI} | exp/multi_a |
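
One practical consequence: since every result for a variant lands under exp/{MULTI}, the usual Kaldi idiom for scanning WERs works per variant (assuming the decode directories follow standard naming, which this sketch takes for granted):

$ for d in exp/multi_a/*/decode*; do grep WER $d/wer_* | utils/best_wer.sh; done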

Scripts copied from other recipes

The files in local/ that are prefixed with a database name (e.g. wsj_*) come from those respective recipes. There is one exception: files that start with fisher_ or swbd_ come from the fisher_swbd recipe.

Each script copied from another recipe has a header that explains 1) where the file was copied from, 2) what revision it was copied from, and 3) what changes were made.
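
As an illustration, such a header might look like the following (the source path, revision placeholder, and change note are invented for the example):

#!/usr/bin/env bash
# Copied from: egs/wsj/s5/local/wsj_format_data.sh (illustrative)
# Revision: <commit hash left as placeholder>
# Changes: paths adapted to this recipe's data/{CORPUS} layout.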