A Kaldi recipe for Spanish using the Heroico corpus.
Description of the Heroico Corpus
The Heroico corpus (LDC2006S37) was originally collected to train acoustic models for pronunciation modeling in Spanish language learning applications.
The corpus consists of two main subcorpora:
1. A subcorpus collected at Mexico's Military Academy called Heroico.
2. A subcorpus collected at the United States Military Academy (USMA) in West Point, New York.
The Heroico subcorpus is further divided into recited and prompted speech subcorpora.
In the LDC distribution, the recited speech appears under the recordings directory and the prompted speech under the answers directory.
The USMA subcorpus includes 1.2 hours of speech from nonnative informants and 1 hour of speech from native speakers.
All the speech in the USMA subcorpus was recited.
With the exception of one hour of speech, the Heroico subcorpus was used for training and the USMA subcorpus was used for testing.
The Heroico subcorpus has 11.8 hours of speech.
In this recipe, 10.8 hours were used as training data.
All 2.2 hours of speech from the USMA subcorpus were used as testing data.
A one-hour segment of speech in the Heroico subcorpus was recited from the same set of prompts that was used in the USMA collection.
To avoid overlap between the training and testing sets, this one-hour segment was separated out of the Heroico subcorpus into a devtest set.
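The hour accounting described above can be checked with a short sketch. The durations are the ones stated in this description; the variable names are illustrative only and are not part of the recipe. Decimal is used so that tenths of an hour compare exactly.

```python
from decimal import Decimal

# Durations in hours, as stated in the corpus description.
heroico_total = Decimal("11.8")
heroico_devtest = Decimal("1.0")  # recited from the USMA prompts, held out
usma_nonnative = Decimal("1.2")
usma_native = Decimal("1.0")

# Training data is the Heroico subcorpus minus the held-out hour.
train_hours = heroico_total - heroico_devtest
# Testing data is the whole USMA subcorpus.
test_hours = usma_nonnative + usma_native

print(f"train: {train_hours} h, devtest: {heroico_devtest} h, test: {test_hours} h")
```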
In summary, the Heroico corpus was split into five parts for this recipe:
1. Heroico answers train
2. Heroico recited train
3. Heroico recited devtest
4. USMA native test
5. USMA nonnative test
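The five-way split listed above can be sketched as a small assignment function. This is only an illustration of the partitioning logic: the function name, argument names, and partition labels are hypothetical and do not correspond to fields in the LDC distribution or to names used by the recipe scripts.

```python
def assign_partition(subcorpus, style=None, shares_usma_prompts=False, native=None):
    """Return one of five partition labels for an utterance.

    All arguments are hypothetical descriptors of an utterance:
      subcorpus           -- "heroico" or "usma"
      style               -- "answers" or "recited" (Heroico only)
      shares_usma_prompts -- True if recited from the USMA prompt set
      native              -- True/False for USMA speakers
    """
    if subcorpus == "heroico":
        if style == "answers":
            return "heroico_answers_train"
        if style == "recited":
            # Recited speech that reuses the USMA prompts is held out
            # so that training and testing prompts do not overlap.
            if shares_usma_prompts:
                return "heroico_recited_devtest"
            return "heroico_recited_train"
        raise ValueError(f"unknown Heroico style: {style!r}")
    if subcorpus == "usma":
        if native is None:
            raise ValueError("USMA utterances need a native/nonnative flag")
        return "usma_native_test" if native else "usma_nonnative_test"
    raise ValueError(f"unknown subcorpus: {subcorpus!r}")


print(assign_partition("heroico", style="recited", shares_usma_prompts=True))
```

Keeping the decision in a single function makes the no-overlap rule explicit: any Heroico utterance recited from the USMA prompts can only land in the devtest partition.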