README.txt 4.69 KB
### 
# Swahili Data collected by Hadrien Gelas 
# Prepared by Tien-Ping Tan & Laurent Besacier
# GETALP LIG, Grenoble, France
###


### OVERVIEW
The package contains swahili speech corpus with audio data in the directory asr_swahili/data. The data directory contains 2 subdirectories:
a. train - speech data and transription for training automatic speech recognition system (Kaldi ASR format [1])
b. test - speech data and transription for testing automatic speech recognition system (Kaldi ASR format)

A text corpus and language model in the directory asr_swahili/LM, and lexicon in the directory asr_swahili/lang

### PUBLICATION ON SWAHILI SPEECH & LM DATA
More details on the corpus and how it was collected can be found on the following publication (please cite this bibtex if you use this data)

 @InProceedings { gelas:hal-00954048,
  author = {Gelas, Hadrien and Besacier, Laurent and Pellegrino, Francois},
  title = {{D}evelopments of {S}wahili resources for an automatic speech recognition system},
  booktitle = {{SLTU} - {W}orkshop on {S}poken {L}anguage {T}echnologies for {U}nder-{R}esourced {L}anguages},
  year = {2012},
  address = {Cape-Town, Afrique Du Sud},
  abstract = {no abstract},
  x-international-audience = {yes},
  url = {http://hal.inria.fr/hal-00954048},
}

### SWAHILI SPEECH CORPUS
Directory: asr_swahili/data/train
Files: text (training transcription), wav.scp (file id and path), utt2spk (file id and audio id), spk2utt (audio id and file id), wav (.wav files). 
For more information about the format, please refer to Kaldi website http://kaldi-asr.org/doc/data_prep.html
Description: training data in Kaldi format about 10 hours. Note: The path of wav files in wav.scp have to be modified to point to the actual locatiion.  

Directory: asr_swahili/data/test
Files: text (test transcription), wav.scp (file id and path), utt2spk (file id and audio id), spk2utt (audio id and file id), wav (.wav files)
Description: testing data in Kaldi format about 1.8 hours. The audio files for testing has the format 
ID_16k-emission_swahili_TThTT_-_TThTT_tu_YYYYMMDD_part001Q, where ID is the audio id, T is the time, Y is the year, M is the month and D is the day of recording. The 
last character Q indicate the quality of the utterance. g is good, m is utterance with background music, n is utterance with noise, s is utterance
with overlap speech, l is very noise utterance. Note: The path of wav files in wav.scp have to be modified to point to the actual locatiion. 



### SWAHILI TEXT CORPUS
Directory: asr_swahili/LM
Files: 00-LM_SWH-CORPUS.txt, 01-CLN4-TRN.txt.zip, 02-CLN4-DEV.txt and 03-CLN4-TST.txt, swahili.arpa.zip

N.B.: You need to unzip 01-CLN4-TRN.txt.zip for 01-CLN4-TRN.txt and swahili.arpa.zip for swahili.arpa

# /00-LM_SWH-CORPUS.txt
Contains 28 M Words. Full text data grabbed from online newspaper and cleaned as much as it could
All files below are extracted from this file

# /01-CLN4-TRN.txt
Training data for LM

# /02-CLN4-DEV.txt
Dev data for LM

# /03-CLN4-TST.txt
Testing data for LM

# /swahili.arpa
A language model created using SRILM [2] using the text from the text in 01-CLN4-TRN.txt



### LEXICON/PRONUNCIATION DICTIONARY
Directory: asr_swahili/lang
Files: lexicon.txt (lexicon), nonsilence_phones.txt (speech phones), optional_silence.txt (silence phone)
Description: lexicon contains words and their respective pronunciation, non-speech sound and noise in Kaldi format. G2P conversion rules, please refer to [3]

### SCRIPTS
In asr_swahili/kaldi-scripts you find the scripts used to train and test models
from the existing data and lang directory you find scripts for run the sequence one by one : 04_train_mono.sh + 04a_train_triphone.sh + 04b_train_MLLT_LDA.sh + 04c_train_SAT_FMLLR.sh + 04d_train_MMI_FMMI.sh + 04e_train_sgmm.sh

### RUN.SH
This script automatically creates all you need to build an ASR for swahili with Kaldi : it prepares the data and runs the models. 
At the end, you should obtain the results below.

### WER RESULTS OBTAINED SO FAR (you should obtain the same on this data if same protocol used)
Monophone (13 MFCC): 49.28% (All)
Triphone (13 MFCC): 33.55% (All)
Triphone (13 MFCC + delta + delta2): 33.61% (All)
Triphone (39 features) + LDA and MLLT: 31.92% (All)
Triphone (39 features) + LDA and MLLT + SAT and FMLLR: 31.56% (All)
Triphone (39 features) + LDA and MLLT + SAT and FMLLR + MMI and fMMI: 30.87% (All)
Triphone (39 features) + LDA and MLLT + SGMM: 27.36% (All)


### REFERENCES
[1] KALDI: http://kaldi-asr.org/doc/tutorial_running.html
[2] SRILM: http://www.speech.sri.com/projects/srilm/
[3] Hadrien Gelas, Laurent Besacier, François Pellegrino, Developments of Swahili Resources for an Automatic Speech Recognition System, 
http://www.ddl.ish-lyon.cnrs.fr/fulltext/Gelas/Gelas_2012_SLTU.pdf