About the Wall Street Journal corpus:
This is a corpus of read sentences from the Wall Street Journal, recorded
under clean conditions. The vocabulary is quite large, and there are about
80 hours of training data.
Available from the LDC as either: [ catalog numbers LDC93S6A (WSJ0) and LDC94S13A (WSJ1) ]
or: [ catalog numbers LDC93S6B (WSJ0) and LDC94S13B (WSJ1) ]
The latter option is cheaper and includes only the Sennheiser
microphone data (which is all we use in the example scripts).
Each subdirectory of this directory contains the scripts for a sequence of
experiments. [Note: most of the older example scripts have been deleted, but
they are still available at ^/branches/complete.]
s5: This is the current recommended recipe.