This is a WIP English LVCSR recipe that trains on data from multiple corpora. By default, it uses all of the following:
- Fisher (1761 hours)
- Switchboard (317 hours)
- WSJ (81 hours)
- HUB4 (1996 & 1997) English Broadcast News (75 + 72 hours)
- TED-LIUM (118 hours)
- Librispeech (960 hours)
It is possible to add or remove datasets as necessary.
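One way to keep datasets easy to add or remove is to drive every preparation step from a single list. This is a hypothetical sketch, not the recipe's actual code; the variable name and corpus identifiers are illustrative:

```shell
# Hypothetical sketch: holding the corpus list in one shell variable lets
# you add or remove a dataset in a single place. Names are illustrative.
corpora="fisher swbd wsj hub4 tedlium librispeech"

for c in $corpora; do
  # each corpus would get its own data-preparation call here
  echo "preparing $c"
done
```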
This recipe was developed by Allen Guo (ICSI), Korbinian Riedhammer (Remeeting) and Xiaohui Zhang (JHU). The original spec for the recipe is at #699.
To make it as easy as possible to extend and modify this example, we provide the ability to create separate recipe variants. These variants are named multi_a, multi_b, etc. For instance, you could have:
- multi_a = the recipe default
  - Bootstrap GMM-HMM on WSJ SI-84
  - Train SAT system with AMI/Fisher/Librispeech/SWBD/Tedlium/WSJ nnet3 model (no AMI-SDM)
  - Train tdnn (ivector) system on top of that
- multi_b = your first experiment
  - Train monophone model on SWBD
  - Then add in WSJ SI-284 for remaining GMM-HMM steps
  - Then train AMI/SWBD/WSJ nnet3 model
- multi_c = your second experiment
  - Train GMM-HMM on SWBD
  - Then train SWBD/Tedlium nnet3 model
Data and exp directories for these variants can exist side-by-side: data/multi_x holds the training data directories and exp/multi_x holds the exp directories. This means you can easily train models on arbitrary combinations of whatever corpora you have on hand without overwriting previous work; simply create one recipe variant per experiment.
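A minimal sketch of that side-by-side layout (the variant names here are placeholders, not taken from the recipe):

```shell
# Each variant owns a parallel pair of trees under data/ and exp/, so a new
# experiment never overwrites an old one. Variant names are placeholders.
for variant in multi_a multi_b multi_c; do
  mkdir -p data/$variant exp/$variant
done
ls -d data/multi_* exp/multi_*
```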
Instead of having a few train_* directories (like train_50_shortest), there is one such directory (or symlink) for each step during training, e.g.:

```
$ ls -1 data/multi_a/
mono/
mono_ali/
tri1@
tri1_ali@
tri2@
tri2_ali/
tri3@
tri3_ali/
tri4@
# ...
```
The result is that the training script is much easier to read, since each stage basically boils down to: do *this step* with the data in *this directory*, and output the model to *that directory*.
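That three-part pattern can be sketched as follows; the stage name and paths are placeholders, not commands taken from run.sh:

```shell
# Illustrative only: each training stage reads one data partition and
# writes one model directory, keyed by the same stage name.
stage=tri2
data=data/multi_a/$stage   # "do X with the data in ..."
dir=exp/multi_a/$stage     # "... and output the model to ..."
echo "train $stage using $data -> $dir"
```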
What training data to use for each stage is specified by local/make_partitions.sh, which creates the per-stage directories (and symlinks) under data/multi_x shown above.
Again, this convention was chosen for its simplicity and extensibility.
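A minimal sketch of the per-stage partition convention, assuming (hypothetically) that a later stage reuses an earlier partition via a symlink rather than a copy; all paths here are illustrative:

```shell
# data/multi_demo is a placeholder variant name.
mkdir -p data/multi_demo/mono
# If tri1 trains on the same data as mono, its partition can simply be a
# symlink (this matches the "@" entries in the ls listing above).
ln -sfn mono data/multi_demo/tri1
```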
The table below lists all major structural differences between this recipe and the fisher_swbd recipe (link).
| | Location in this recipe | Example(s) in this recipe |
|---|---|---|
| Corpora-specific data directories for training | | |
| Data directories for testing | | |
| Data directories used during training (may combine data from multiple corpora) | | |
The files in local/ that are prefixed with a database name (e.g. …) were copied from those respective recipes. There is one exception: files that start with swbd_ come from the fisher_swbd recipe.
Each script copied from another recipe has a header that explains 1) where the file was copied from, 2) what revision it was copied from, and 3) what changes were made.