Kaldi recipe for the Fisher and Callhome Spanish Corpora

About the Fisher Spanish Corpus
  Fisher Spanish - Speech was developed by the Linguistic
  Data Consortium (LDC) and consists of audio files covering
  roughly 163 hours of telephone speech from 136 native
  speakers of Caribbean and non-Caribbean Spanish.
  Full orthographic transcripts of these audio files are available
  in LDC2010T04.

  Speech : LDC2010S01
  Transcripts : LDC2010T04

About the Callhome Spanish Corpus
  The CALLHOME Spanish corpus of telephone speech consists
  of 120 unscripted telephone conversations between native speakers of Spanish.
  All calls, which lasted up to 30 minutes, originated in North America
  and were placed to international locations. Most participants called
  family members or close friends.

  Speech : LDC96S35
  Transcripts : LDC96T17

The LDC Spanish rule-based lexicon
  The CALLHOME Spanish collection includes a lexical component. 
  The CALLHOME Spanish Lexicon consists of 45,582 words and contains
  separate information fields with phonological, morphological and
  frequency information for each word.

  Lexicon : LDC96L16


Each subdirectory of this directory contains the
scripts for a sequence of experiments.

  s5: This recipe is based on the WSJ s5 recipe. It works with the
      transcripts (available along with the script in LDC97T19). In addition,
      it uses a phonetic lexicon generated from the rule-based LDC lexicon.
      The recipe follows the Triphone+SGMM+SAT+fMLLR+SGMM+DNN pipeline. It uses
      the data partitions specified by LDC in the Callhome corpus description.
      For Fisher, custom partitions are available (see run.sh for the location
      of the split file, which can be changed).