infmts.htm
12.8 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
<!-- $Id: infmts.htm,v 1.2 2008/09/05 17:12:23 ajot Exp $ -->
<HTML><HEAD>
<CENTER><TITLE>options</TITLE>
</HEAD>
<BODY></CENTER><p><hr>
<H1>
<A NAME="infmts_name_0">
<A HREF="sclite.htm#sclite_name_0">Sclite</A> Input file formats: </A>
<a href="infmts.htm#trn_fmt_name_0">trn</a>,
<a href="infmts.htm#txt_fmt_name_0">txt</a>,
<a href="infmts.htm#stm_fmt_name_0">stm</a>,
<a href="infmts.htm#ctm_fmt_name_0">ctm</a>
</H1>
The inputs to "<a href="sclite.htm#sclite_name_0">sclite</a>" are the
reference file and a hypothesis file(s), the text portions of which
may be either ASCII characters or GB encoded Chinese characters.
There are a number of different input formats permitted:
"<a href="infmts.htm#trn_fmt_name_0">trn</a>",
"<a href="infmts.htm#txt_fmt_name_0">txt</a>",
"<a href="infmts.htm#stm_fmt_name_0">stm</a>", and
"<a href="infmts.htm#ctm_fmt_name_0">ctm</a>".
As new scoring paradigms were created for the ARPA
tests, accompanying formats were created to support the evaluations.
<p>
<a name="trn_fmt_name_0">
<strong> trn - Definition of a transcript input file </strong>
</a>
<ul>
<p>
The transcript format is a file of word sequence
records separated by newlines. Each record contains a
word sequence, follow by the an utterance ID enclosed
in parenthesis. See the '<a href="options.htm#option_i_name_0">-i</a>' option for a list of
accepted utterance id types.
<p>
example.
<ul>
she had your dark suit in greasy wash water all
year (cmh_sa01)
</ul>
<p>
Transcript alternations, described above, can be used
in the word sequence by using this BNF format:
<p>
<ul>
ALTERNATE :== "{" TEXT ALT+ "}"
<br>
ALT :== "/" TEXT
<br>
TEXT :== 1 or more whitespace separated words |
"@" | ALTERNATE
</ul>
<p>
The "@" represents a NULL word in the transcript. For
scoring purposes, an error is not counted if the "@" is
aligned as an insertion.
<p>
example
<ul>
i've { um / uh / @ } as far as i'm concerned
</ul>
</ul>
<p>
<a name="txt_fmt_name_0">
<strong> txt - Definition of a text input file </strong>
</a>
<ul>
This format is simply free-form text with no page,
paragraphs, sentence or speaker breaks.
</ul>
<a name="stm_fmt_name_0">
<strong> stm - Definition of segment time mark input file </strong>
</a>
<ul>
<p>
This describes the segment time marked files to be used for
scoring the output of speech recognizers via the NIST
sclite() program. This is a reference file format.
<p>
The segment time mark file consists of a concatenation of
text segment records from a waveform file. Each record is
separated by a newline and contains: the waveform's filename
and channel identifier [A | B], the talkers id, begin and
end times (in seconds), optional subset label and the text
for the segment. Each record follows this BNF format:
<p>
STM :== <F> <C> <S> <BT> <ET> [ <LABEL> ] transcript . . .
<ul>
Where :
<ul>
<F>
<ul>
The waveform filename. NOTE: no pathnames or
extensions are expected.
</ul>
<C>
<ul>
The waveform channel. The text of the waveform channel
is not restricted by sclite. The text can be any text string without
witespace so long as the matching string is found in both the reference
and hypothesis input files.
</ul>
<S>
<ul>
The speaker id, no restrictions apply to this
name.
</ul>
<BT>
<ul>
The begin time (seconds) of the segment.
</ul>
<ET>
<ul>
The end time (seconds) of the segment.
</ul>
<LABEL>
<ul>
A comma separated list of subset identifiers
enclosed in angle brackets. Ex. "<O,F,00>". See
"USING STM FORMAT FOR LABELED UTTERANCE REPORTS
(LUR)" below.
</ul>
transcript
<ul>
The transcript can take on two forms: 1) a whitespace separated list of
words, or 2) the string "IGNORE_TIME_SEGMENT_IN_SCORING".
<p>
The list of
words can contain an transcript alternation using the following
BNF format:
<p>
<ul>
ALTERNATE :== "{" <text> ALT+ "}"
<br>
ALT :== "/" <text>
<br>
TEXT :== 1 thru n words | "@" | ALTERNATE
</ul>
<p>
The "@" represents a NULL word in the transcript. For scoring
purposes, an error is not counted if the "@" is aligned as an
insertion.
<p>
<ul>
Example: "i've { um / uh / @ } as far as i'm concerned"
</ul>
<p>
When the string "IGNORE_TIME_SEGMENT_IN_SCORING" is used as the transcript,
the process which chops the hypothesis file to matching reference segments
ignores all hypothesis words whose time-midpoints occur within the reference
segments beginning
and ending time. The effect is to declare this segments regions as
"out-of-bounds" for scoring, thus generation no errors from that time
region.
<p>
<ul>
NOTE: this only works with DP alignment of a referenc stm file
and hypothesis ctm file.
</ul>
<p>
</ul>
</ul>
Example STM file:
<ul>
;; comment
<br>
2345 A 2345-a 0.10 2.03 uh huh yes i thought
<br>
2345 A 2345-b 2.10 3.04 dog walking is a very
<br>
2345 A 2345-a 3.50 4.59 yes but it's worth it
</ul>
<p>
The file must be sorted by the first and second columns in
ASCII order, and the fourth in numeric order. The UNIX sort
command: "sort +0 -1 +1 -2 +3nb -4" will sort the words
into appropriate order.
<p>
Lines beginning with ';;' are considered comments and are
ignored. Blank lines are also ignored.
<p>
</ul>
USING STM FORMAT FOR LABELED UTTERANCE REPORTS (LUR):
<ul>
Motivation:
<ul>
For the Fall '95 ARPA CSR Evaluation, it was desirable
to not only report overall error-rate statistics but
also error-rate statistics for arbitrary partitions
and/or groups of partitions within the test set. To
this end, the STM file format was extended to encode
arbitrary subset information for each segment.
</ul>
Usage:
<ul>
The subset information is encoded by adding two types
of information into the STM file. The first information
type, is a special comment line, the subset information line, (SIL).
The SIL defines the subset's label
id, a short column heading and a description. The special
comment line format is:
<ul>
;; LABEL "<ID>" "<COL_HD>" "<DESC>"
<ul>
where:
<ul>
<ID>
<ul>
The subset id. Used to label each segment
that belongs to the subset. The format is
arbitrary, but without spaces.
</ul>
<COL_HD>
<ul>
Used as column headings in generated reports.
Format is arbitrary.
</ul>
<DESC>
<ul>
Used for subset descriptions in generated
reports. May be of arbitrary length and for-
mat. Double backslashes '\\' add a line
feed.
</ul>
</ul>
</ul>
The order of the SIL lines in the STM file defines the
order of subset presentation the generated reports.
The second type of information incorporated into the
STM file is an optional sixth field to the text segment
record. The field consists of a comma separated list
of subset ids enclosed in angle brackets. Each unique
id must have a special comment line, specified above,
to be properly interpreted. Otherwise the id will be
ignored.
<p>
Each position within the label field, separated by a
commas, defines a group of subsets that are presented
separately in the generated reports. So for instance,
the first group might be all segments, and the second
might be either male or female, and the third might be
the story. The example below shows an STM file encoded
with this information.
<ul>
;; LABEL "M" "Male" "Male Talkers"
<br>
;; LABEL "F" "Female" "Female Talkers"
<br>
;; LABEL "01" "Story 1" "Business news"
<br>
;; LABEL "00" "Not in Story" "Words or Phrases not
contained in a story"
<br>
940328 1 A 4.00 18.10 <O,F,00> FROM LOS ANGELES
<br>
940328 1 B 18.10 25.55 <O,M,01> MEXICO IN TURMOIL
</ul>
</ul>
</ul>
</ul>
</ul>
<p>
<a name="ctm_fmt_name_0">
<strong> ctm - Definition of time marked conversation scoring input </strong>
</a>
<ul>
<p>
This describes the time marked conversation input files to
be used for scoring the output of speech recognizers via the
NIST sclite() program. Both the reference and hypothesis
input files can share this format.
<p>
The ctm file format is a concatenation of time mark records
for each word in each channel of a waveform. The records
are separated with a newline. Each word token must have a
waveform id, channel identifier [A | B], start time, dura-
tion, and word text. Optionally a confidence score can be
appended for each word. Each record follows this BNF for-
mat:
<p>
CTM :== <F> <C> <BT> <DUR> word [ <CONF> ]
<ul>
Where :
<ul>
<F> ->
<ul>
The waveform filename. NOTE: no pathnames or
extensions are expected.
</ul>
<C> ->
<ul>
The waveform channel. Either "A" or "B". The text of the waveform channel
is not restricted by sclite. The text can be any text string without
witespace so long as the matching string is found in both the reference
and hypothesis input files.
</ul>
<BT> ->
<ul>
The begin time (seconds) of the word, measured
from the start time of the file.
</ul>
<DUR> ->
<ul>
The duration (seconds) of the word.
</ul>
<CONF> ->
<ul>
Optional confidence score. It is proposed that
this score will be used in the future.
</ul>
</ul>
</ul>
<p>
The file must be sorted by the first three columns: the
first and the second in ASCII order, and the third by a
numeric order. The UNIX sort command: "sort +0 -1 +1 -2
+2nb -3" will sort the words into appropriate order.
<p>
Lines beginning with ';;' are considered comments and are
ignored. Blank lines are also ignored.
<p>
Included below is an example:
<ul>
;;
<br>
;; Comments follow ';;'
<br>
;;
<br>
;; The Blank lines are ignored
<br>
<br>
;;
<br>
7654 A 11.34 0.2 YES -6.763
<br>
7654 A 12.00 0.34 YOU -12.384530
<br>
7654 A 13.30 0.5 CAN 2.806418
<br>
7654 A 17.50 0.2 AS 0.537922
<br>
:
<br>
7654 B 1.34 0.2 I -6.763
<br>
7654 B 2.00 0.34 CAN -12.384530
<br>
7654 B 3.40 0.5 ADD 2.806418
<br>
7654 B 7.00 0.2 AS 0.537922
<br>
:
</ul>
<p>
For CTM reference files, a format extension exists to permit
marking alternate transcripts. The alternation uses the
same file format as described above, except three word
strings, "<ALT_BEGIN>", "<ALT>" and "<ALT_END>", are used to
delimit the alternation. Each tag is treated as a word,
with a conversation id, channel and "*"'s for the begin and
duration time.
<p>
The alternation is begun using the word "<ALT_BEGIN>", and
terminated using the word "<ALT_END>". In between the start
and end, are at least 2 alternative time-marked word
sequences separated by the word "<ALT>". Each word sequence
can contain any number of words. An empty alternative sig-
nifies a null word.
<p>
Below is and example alternate reference transcript for the
words "uh" and "um".
<p>
<ul>
;;
<br>
7654 A * * <ALT_BEGIN>
<br>
7654 A 12.00 0.34 UM
<br>
7654 A * * <ALT>
<br>
7654 A 12.00 0.34 UH
<br>
7654 A * * <ALT_END>
<ul>
</ul>
</ul>
</body>
</html>