Blame view

tools/sctk-2.4.10/doc/infmts.htm 12.8 KB
8dcb6dfcb   Yannick Estève   first commit
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
  <!-- $Id: infmts.htm,v 1.2 2008/09/05 17:12:23 ajot Exp $ -->
  <HTML><HEAD>
  <CENTER><TITLE>options</TITLE>
  </HEAD>
  <BODY></CENTER><p><hr>
  
  <H1> 
  <A NAME="infmts_name_0">
  <A HREF="sclite.htm#sclite_name_0">Sclite</A> Input file formats: </A>
  <a href="infmts.htm#trn_fmt_name_0">trn</a>,
  <a href="infmts.htm#txt_fmt_name_0">txt</a>,
  <a href="infmts.htm#stm_fmt_name_0">stm</a>,
  <a href="infmts.htm#ctm_fmt_name_0">ctm</a>
  
  </H1>
  
  The inputs to "<a href="sclite.htm#sclite_name_0">sclite</a>" are the
  reference file and a hypothesis file(s), the text portions of which
  may be either ASCII characters or GB encoded Chinese characters.
  There are a number of different input formats permitted: 
  "<a href="infmts.htm#trn_fmt_name_0">trn</a>",
  "<a href="infmts.htm#txt_fmt_name_0">txt</a>",
  "<a href="infmts.htm#stm_fmt_name_0">stm</a>", and
  "<a href="infmts.htm#ctm_fmt_name_0">ctm</a>".
  As new scoring paradigms were created for the ARPA
  tests, accompanying formats were created to support the evaluations.
  
  <p>
  <a name="trn_fmt_name_0">
  <strong> trn - Definition of a transcript input file </strong>
  </a>
  <ul>
  <p>
            The transcript  format  is  a  file  of  word  sequence
            records  separated by newlines.  Each record contains a
            word sequence, follow by the an utterance  ID  enclosed
            in  parenthesis.   See  the  '<a href="options.htm#option_i_name_0">-i</a>'  option for a list of
            accepted utterance id types.
  <p>
            example.
  <ul>
                 she had your dark suit in greasy  wash  water  all
                 year (cmh_sa01)
  </ul>
  <p>
            Transcript alternations, described above, can  be  used
            in the word sequence by using this BNF format:
  <p>
  <ul>
                 ALTERNATE :== "{" TEXT ALT+ "}"
  <br> 
                 ALT       :== "/" TEXT
  <br>
                 TEXT      :== 1 or more whitespace separated words |
  	                    "@" | ALTERNATE
  </ul>
  <p>
       The "@" represents a  NULL  word  in  the  transcript.   For
       scoring  purposes,  an  error  is  not counted if the "@" is
       aligned as an insertion.
  <p>
       example
  <ul>
            i've { um / uh / @ } as far as i'm concerned
  </ul>
  </ul>
  <p>
  <a name="txt_fmt_name_0">
  <strong> txt - Definition of a text input file </strong>
  </a>
  <ul>
            This format is simply  free-form  text  with  no  page,
            paragraphs, sentence or speaker breaks.
  </ul>
  <a name="stm_fmt_name_0">
  <strong> stm - Definition of segment time mark input file </strong>
  </a>
  <ul>
  <p>
       This describes the segment time marked files to be used  for
       scoring  the  output  of  speech  recognizers  via  the NIST
       sclite() program.  This is a reference file format.
  <p>
       The segment time mark file consists of  a  concatenation  of
       text  segment  records from a waveform file.  Each record is
       separated by a newline and contains: the waveform's filename
       and  channel  identifier  [A | B], the talkers id, begin and
       end times (in seconds), optional subset label and  the  text
       for the segment.  Each record follows this BNF format:
  <p>
        STM :== &lt;F&gt; &lt;C&gt; &lt;S&gt; &lt;BT&gt; &lt;ET&gt; [ &lt;LABEL&gt; ] transcript . . .
  	<ul>
           Where :
  	 <ul>
            &lt;F&gt;
  	   <ul>
                 The waveform  filename.   NOTE:  no  pathnames  or
                 extensions are expected.
    	   </ul>
            &lt;C&gt;
  	   <ul>
                 The waveform channel.   The text of the waveform channel
  		is not restricted by sclite.  The text can be any text string without
  		witespace so long as the matching string is found in both the reference
  		and hypothesis input files.
    	   </ul>
            &lt;S&gt;
  	   <ul>
                 The speaker id,  no  restrictions  apply  to  this
                 name.
    	   </ul>
            &lt;BT&gt;
  	   <ul>
                 The begin time (seconds) of the segment.
    	   </ul>
            &lt;ET&gt;
  	   <ul>
                 The end time (seconds) of the segment.
    	   </ul>
            &lt;LABEL&gt;
  	   <ul>
                 A  comma  separated  list  of  subset  identifiers
                 enclosed  in angle brackets.  Ex. "&lt;O,F,00&gt;".  See
                 "USING STM FORMAT FOR  LABELED  UTTERANCE  REPORTS
                 (LUR)" below.
    	   </ul> 
           transcript
  	   <ul>
  
  The transcript can take on two forms: 1) a whitespace separated list of
  words, or 2) the string "IGNORE_TIME_SEGMENT_IN_SCORING".
  <p>
  The list of
  words can contain an transcript alternation using the following
  BNF format:
  <p>
  <ul>
            ALTERNATE :== "{" &lt;text&gt; ALT+ "}"
  	<br>
            ALT       :== "/" &lt;text&gt;
  	<br>
            TEXT      :== 1 thru n words | "@" | ALTERNATE
  </ul>
  <p>
       The "@" represents a NULL word in the transcript.  For scoring
  purposes, an error is not counted if the "@" is aligned as an
  insertion.  
  <p>
  <ul>
       Example:     "i've { um / uh / @ } as far  as i'm concerned"
  </ul>
  <p>
  When the string "IGNORE_TIME_SEGMENT_IN_SCORING" is used as the transcript,
  the process which chops the hypothesis file to matching reference segments
  ignores all hypothesis words whose time-midpoints occur within the reference
  segments beginning
  and ending time.  The effect is to declare this segments regions as 
  "out-of-bounds" for scoring, thus generation no errors from that time 
  region. 
  <p>
  <ul>
  NOTE: this only works with DP alignment of a referenc stm file
  and hypothesis ctm file.
  </ul>
  <p>
  	 </ul>
  	</ul>
  
       Example STM file:
  	<ul>
  	  ;; comment
  	<br>
            2345 A 2345-a 0.10 2.03 uh huh yes i thought
  	<br>
            2345 A 2345-b 2.10 3.04 dog walking is a very
  	<br>
            2345 A 2345-a 3.50 4.59 yes but it's worth it
  	</ul>
  
  <p>
       The file must be sorted by the first and second  columns  in
       ASCII order, and the fourth in numeric order.  The UNIX sort
       command:  "sort +0 -1 +1 -2 +3nb -4"  will  sort  the  words
       into appropriate order.
  <p>
  Lines beginning with ';;' are considered  comments  and  are
  ignored.  Blank lines are also ignored.
  <p>
  
  </ul>
       USING STM FORMAT FOR LABELED UTTERANCE REPORTS (LUR):
  <ul>
       Motivation:
     <ul>
            For the Fall '95 ARPA CSR Evaluation, it was  desirable
            to  not  only  report overall error-rate statistics but
            also error-rate  statistics  for  arbitrary  partitions
            and/or  groups  of  partitions within the test set.  To
            this end, the STM file format was  extended  to  encode
            arbitrary subset information for each segment.
     </ul>
       Usage:
     <ul>
            The subset information is encoded by adding  two  types
            of  information  into the STM file.  The first information
  	  type, is a special comment line, the subset information line, (SIL).
  	  The SIL defines the subset's label
            id, a short column heading and a description.  The special
  	  comment line format is:
  	<ul>
             ;; LABEL "&lt;ID&gt;" "&lt;COL_HD&gt;" "&lt;DESC&gt;"
  	  <ul>
               where:
  	     <ul>
                 &lt;ID&gt;
                   <ul>
                      The subset id.  Used to  label  each  segment
                      that  belongs  to  the subset.  The format is
                      arbitrary, but without spaces.
                   </ul>
                 &lt;COL_HD&gt; 
                   <ul>
                      Used as column headings in generated reports.
                      Format is arbitrary.
                   </ul>
                 &lt;DESC&gt; 
                   <ul>
                      Used for  subset  descriptions  in  generated
                      reports.  May be of arbitrary length and for-
                      mat.  Double  backslashes  '\\'  add  a  line
                      feed.
                   </ul>
               </ul>
            </ul>
            The order of the SIL lines in the STM file defines  the
            order of subset presentation the generated reports.
            The second type of information  incorporated  into  the
            STM file is an optional sixth field to the text segment
            record.  The field consists of a comma  separated  list
            of  subset ids enclosed in angle brackets.  Each unique
            id must have a special comment line,  specified  above,
            to  be  properly interpreted.  Otherwise the id will be
            ignored.
  <p>
            Each position within the label field,  separated  by  a
            commas,  defines  a group of subsets that are presented
            separately in the generated reports.  So for  instance,
            the  first  group might be all segments, and the second
            might be either male or female, and the third might  be
            the story.  The example below shows an STM file encoded
            with this information.
  <ul>
                 ;; LABEL "M" "Male" "Male Talkers"
  <br>
                 ;; LABEL "F" "Female" "Female Talkers"
  <br>
                 ;; LABEL "01" "Story 1" "Business news"
  <br>
                 ;; LABEL "00" "Not in Story" "Words or Phrases not
  			 contained in a story"
  <br>
                 940328 1 A 4.00 18.10 &lt;O,F,00&gt; FROM LOS ANGELES
  <br>
                 940328 1 B 18.10 25.55 &lt;O,M,01&gt; MEXICO IN TURMOIL
  
  </ul>
  </ul>
  </ul>
  </ul>
  </ul>
  <p>
  <a name="ctm_fmt_name_0">
  <strong> ctm - Definition of time marked conversation scoring input </strong>
  </a>
  <ul>
  <p>
       This describes the time marked conversation input  files  to
       be used for scoring the output of speech recognizers via the
       NIST sclite() program.  Both the  reference  and  hypothesis
       input files can share this format.
  <p>
       The ctm file format is a concatenation of time mark  records
       for  each  word  in each channel of a waveform.  The records
       are separated with a newline.  Each word token must  have  a
       waveform  id,  channel identifier [A | B], start time, dura-
       tion, and word text.  Optionally a confidence score  can  be
       appended  for  each word.  Each record follows this BNF for-
       mat:
  <p>
        CTM :== &lt;F&gt; &lt;C&gt; &lt;BT&gt; &lt;DUR&gt; word [ &lt;CONF&gt; ]
  <ul>
           Where :
  <ul>
            &lt;F&gt;  -&gt;
  <ul>
                 The waveform  filename.   NOTE:  no  pathnames  or
                 extensions are expected.
  </ul>
            &lt;C&gt;  -&gt;
  <ul>
                 The waveform channel.  Either "A" or "B".  The text of the waveform channel
  		is not restricted by sclite.  The text can be any text string without
  		witespace so long as the matching string is found in both the reference
  		and hypothesis input files.
  </ul>
            &lt;BT&gt; -&gt;
  <ul>
                 The begin time (seconds)  of  the  word,  measured
                 from the start time of the file.
  </ul>
            &lt;DUR&gt;  -&gt;
  <ul>
                 The duration (seconds) of the word.
  </ul>
            &lt;CONF&gt;  -&gt;
  <ul>
                 Optional confidence score.  It  is  proposed  that
                 this score will be used in the future.
  </ul>
  </ul>
  </ul>
  <p>
       The file must be sorted by  the  first  three  columns:  the
       first  and  the  second  in  ASCII order, and the third by a
       numeric order.  The UNIX sort command: "sort  +0  -1  +1  -2
       +2nb -3" will sort the words into appropriate order.
  <p>
       Lines beginning with ';;' are considered  comments  and  are
       ignored.  Blank lines are also ignored.
  <p>
       Included below is an example:
  <ul>
       ;;
       <br>
       ;;  Comments follow ';;' 
       <br>
       ;;
       <br>
       ;;  The Blank lines are ignored
       <br>
  
       <br>
       ;;
       <br>
       7654 A 11.34 0.2  YES -6.763
       <br>
       7654 A 12.00 0.34 YOU -12.384530
       <br>
       7654 A 13.30 0.5  CAN 2.806418
       <br>
       7654 A 17.50 0.2  AS 0.537922
       <br>
             :
       <br>
       7654 B 1.34 0.2  I -6.763
       <br>
       7654 B 2.00 0.34 CAN -12.384530
       <br>
       7654 B 3.40 0.5  ADD 2.806418
       <br>
       7654 B 7.00 0.2  AS 0.537922
       <br>
             :
  </ul>
  <p>
       For CTM reference files, a format extension exists to permit
       marking  alternate  transcripts.   The  alternation uses the
       same file format  as  described  above,  except  three  word
       strings, "&lt;ALT_BEGIN&gt;", "&lt;ALT&gt;" and "&lt;ALT_END&gt;", are used to
       delimit the alternation.  Each tag is  treated  as  a  word,
       with  a conversation id, channel and "*"'s for the begin and
       duration time.
  <p>
       The alternation is begun using the word  "&lt;ALT_BEGIN&gt;",  and
       terminated using the word "&lt;ALT_END&gt;".  In between the start
       and  end,  are  at  least  2  alternative  time-marked  word
       sequences separated by the word "&lt;ALT&gt;".  Each word sequence
       can contain any number of words.  An empty alternative  sig-
       nifies a null word.
  <p>
       Below is and example alternate reference transcript for  the
       words "uh" and "um".
  <p>
    <ul>
       ;;
       <br>
       7654 A   *    *   &lt;ALT_BEGIN&gt;
       <br>
       7654 A 12.00 0.34 UM
       <br>
       7654 A   *    *   &lt;ALT&gt;
       <br>
       7654 A 12.00 0.34 UH
       <br>
       7654 A   *    *   &lt;ALT_END&gt;
    <ul>
  </ul>
  </ul>
  </body>
  </html>