

A Kaldi recipe for Spanish using the Heroico corpus.

Description of the Heroico Corpus
The Heroico corpus (LDC2006S37) was originally collected to train acoustic models for pronunciation modeling in Spanish language learning applications.
The corpus consists of two main subcorpora:
1. A subcorpus collected at Mexico's Military Academy called Heroico.
2. A subcorpus collected at the United States Military Academy (USMA) in West Point, New York.

The Heroico subcorpus is further divided into recited and prompted speech subcorpora.
In the LDC distribution, the recited speech appears under the recordings directory and the prompted speech under the answers directory.

The USMA subcorpus includes 1.2 hours of speech from nonnative informants and 1 hour of speech from native speakers.
All the speech in the USMA subcorpus was recited.

With the exception of one hour of speech, the Heroico subcorpus was used for training and the USMA subcorpus was used for testing.
The Heroico subcorpus has 11.8 hours of speech.
In this recipe, 10.8 hours were used as training data.
The 2.2 hours of speech from the USMA subcorpus were used as testing data.
A one-hour segment of speech in the Heroico subcorpus was recited from the same set of prompts that was used in the USMA collection.
To avoid overlap between the training and testing sets, this one-hour segment was separated out of the Heroico subcorpus into a devtest set.

In summary, the Heroico corpus was split into 5 parts for this recipe:
1. Heroico answers train
2. Heroico recited train
3. Heroico recited devtest
4. USMA native test
5. USMA nonnative test
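
The hour counts behind this split can be checked with a short calculation. The following is only an illustrative sketch, not part of the recipe: the constant and function names are hypothetical, and the figures are the ones quoted in this README. The README does not say how the 10.8 training hours are divided between the answers and recited parts, so the sketch keeps them together.

```python
# Illustrative check of the split sizes described above.
# All names are hypothetical; the hour figures come from this README.

HEROICO_TOTAL_HOURS = 11.8   # whole Heroico subcorpus
DEVTEST_HOURS = 1.0          # Heroico recited segment sharing prompts with USMA
USMA_NATIVE_HOURS = 1.0      # USMA native speakers
USMA_NONNATIVE_HOURS = 1.2   # USMA nonnative informants


def split_hours():
    """Return the hours of speech in the train, devtest and test sets."""
    # Training data is the Heroico subcorpus minus the held-out devtest hour.
    train = HEROICO_TOTAL_HOURS - DEVTEST_HOURS
    # Testing data is the whole USMA subcorpus (native plus nonnative).
    test = USMA_NATIVE_HOURS + USMA_NONNATIVE_HOURS
    # Round to one decimal place to hide floating-point noise.
    return {
        "train": round(train, 1),
        "devtest": DEVTEST_HOURS,
        "test": round(test, 1),
    }


if __name__ == "__main__":
    for name, hours in split_hours().items():
        print(f"{name}: {hours:.1f} hours")
```

The result matches the figures in the text: 10.8 hours for training, 1 hour for devtest, and 2.2 hours for testing.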