<!-- $Id: infmts.htm,v 1.2 2008/09/05 17:12:23 ajot Exp $ -->
<HTML><HEAD> <CENTER><TITLE>Sclite input file formats</TITLE> </HEAD> <BODY></CENTER><p><hr>
<H1> <A NAME="infmts_name_0"> <A HREF="sclite.htm#sclite_name_0">Sclite</A> Input file formats: </A> <a href="infmts.htm#trn_fmt_name_0">trn</a>, <a href="infmts.htm#txt_fmt_name_0">txt</a>, <a href="infmts.htm#stm_fmt_name_0">stm</a>, <a href="infmts.htm#ctm_fmt_name_0">ctm</a> </H1>
The inputs to "<a href="sclite.htm#sclite_name_0">sclite</a>" are a reference file and one or more hypothesis files, the text portions of which may be either ASCII characters or GB-encoded Chinese characters. Several input formats are permitted: "<a href="infmts.htm#trn_fmt_name_0">trn</a>", "<a href="infmts.htm#txt_fmt_name_0">txt</a>", "<a href="infmts.htm#stm_fmt_name_0">stm</a>", and "<a href="infmts.htm#ctm_fmt_name_0">ctm</a>". As new scoring paradigms were created for the ARPA tests, accompanying formats were created to support the evaluations.
<p> <a name="trn_fmt_name_0"> <strong> trn - Definition of a transcript input file </strong> </a>
<ul>
<p> The transcript format is a file of word sequence records separated by newlines. Each record contains a word sequence, followed by an utterance ID enclosed in parentheses. See the '<a href="options.htm#option_i_name_0">-i</a>' option for a list of accepted utterance ID types.
<p> Example:
<ul> she had your dark suit in greasy wash water all year (cmh_sa01) </ul>
<p> Transcript alternations can be used in the word sequence by following this BNF format:
<p> <ul> ALTERNATE :== "{" TEXT ALT+ "}" <br> ALT :== "/" TEXT <br> TEXT :== 1 or more whitespace-separated words | "@" | ALTERNATE </ul>
<p> The "@" represents a NULL word in the transcript. For scoring purposes, an error is not counted if the "@" is aligned as an insertion.
<p> Example:
<ul> i've { um / uh / @ } as far as i'm concerned </ul>
</ul>
<p> <a name="txt_fmt_name_0"> <strong> txt - Definition of a text input file </strong> </a>
<ul> This format is simply free-form text with no page, paragraph, sentence, or speaker breaks. </ul>
<a name="stm_fmt_name_0"> <strong> stm - Definition of segment time mark input file </strong> </a>
<ul>
<p> This section describes the segment time mark files used for scoring the output of speech recognizers with the NIST sclite program. It is a reference file format.
<p> The segment time mark file consists of a concatenation of text segment records from a waveform file. Records are separated by newlines, and each contains: the waveform's filename and channel identifier [A | B], the talker's ID, begin and end times (in seconds), an optional subset label, and the text for the segment. Each record follows this BNF format:
<p> STM :== &lt;F&gt; &lt;C&gt; &lt;S&gt; &lt;BT&gt; &lt;ET&gt; [ &lt;LABEL&gt; ] transcript . . .
<ul> Where :
<ul>
&lt;F&gt; <ul> The waveform filename. NOTE: no pathnames or extensions are expected. </ul>
&lt;C&gt; <ul> The waveform channel. sclite does not restrict the channel text: it can be any string without whitespace, so long as the matching string is found in both the reference and hypothesis input files. </ul>
&lt;S&gt; <ul> The speaker ID. No restrictions apply to this name. </ul>
&lt;BT&gt; <ul> The begin time (seconds) of the segment. </ul>
&lt;ET&gt; <ul> The end time (seconds) of the segment. </ul>
&lt;LABEL&gt; <ul> A comma-separated list of subset identifiers enclosed in angle brackets. Example: "&lt;O,F,00&gt;". See "USING STM FORMAT FOR LABELED UTTERANCE REPORTS (LUR)" below. </ul>
transcript <ul> The transcript can take one of two forms: 1) a whitespace-separated list of words, or 2) the string "IGNORE_TIME_SEGMENT_IN_SCORING".
<p> The list of words can contain a transcript alternation using the following BNF format:
<p> <ul> ALTERNATE :== "{" TEXT ALT+ "}" <br> ALT :== "/" TEXT <br> TEXT :== 1 through n words | "@" | ALTERNATE </ul>
<p> The "@" represents a NULL word in the transcript. For scoring purposes, an error is not counted if the "@" is aligned as an insertion.
<p> <ul> Example: "i've { um / uh / @ } as far as i'm concerned" </ul>
<p> When the string "IGNORE_TIME_SEGMENT_IN_SCORING" is used as the transcript, the process that chops the hypothesis file into matching reference segments ignores all hypothesis words whose time midpoints fall between the reference segment's beginning and ending times. The effect is to declare the segment's region "out-of-bounds" for scoring, thus generating no errors from that time region.
<p> <ul> NOTE: this works only with DP alignment of a reference stm file and a hypothesis ctm file. </ul>
<p> </ul> </ul>
Example STM file:
<ul> ;; comment <br> 2345 A 2345-a 0.10 2.03 uh huh yes i thought <br> 2345 A 2345-b 2.10 3.04 dog walking is a very <br> 2345 A 2345-a 3.50 4.59 yes but it's worth it </ul>
<p> The file must be sorted by the first and second columns in ASCII order, and by the fourth column in numeric order. The UNIX sort command "sort +0 -1 +1 -2 +3nb -4" will sort the records into the appropriate order.
<p> Lines beginning with ';;' are considered comments and are ignored. Blank lines are also ignored.
<p> </ul>
USING STM FORMAT FOR LABELED UTTERANCE REPORTS (LUR):
<ul> Motivation:
<ul> For the Fall '95 ARPA CSR Evaluation, it was desirable to report not only overall error-rate statistics but also error-rate statistics for arbitrary partitions and/or groups of partitions within the test set. To this end, the STM file format was extended to encode arbitrary subset information for each segment. </ul>
Usage:
<ul> The subset information is encoded by adding two types of information to the STM file.
The first type of information is a special comment line, the subset information line (SIL). The SIL defines the subset's label ID, a short column heading, and a description. The special comment line format is:
<ul> ;; LABEL "&lt;ID&gt;" "&lt;COL_HD&gt;" "&lt;DESC&gt;"
<ul> where:
<ul>
&lt;ID&gt; <ul> The subset ID. Used to label each segment that belongs to the subset. The format is arbitrary, but must not contain spaces. </ul>
&lt;COL_HD&gt; <ul> Used as the column heading in generated reports. The format is arbitrary. </ul>
&lt;DESC&gt; <ul> Used for subset descriptions in generated reports. May be of arbitrary length and format. Double backslashes '\\' add a line feed. </ul>
</ul> </ul>
The order of the SIL lines in the STM file defines the order in which subsets are presented in the generated reports. The second type of information incorporated into the STM file is an optional sixth field in the text segment record. The field consists of a comma-separated list of subset IDs enclosed in angle brackets. Each unique ID must have a special comment line, as specified above, to be properly interpreted. Otherwise the ID will be ignored.
<p> Each position within the label field, separated by commas, defines a group of subsets that are presented separately in the generated reports. For instance, the first group might be all segments, the second might be either male or female, and the third might be the story. The example below shows an STM file encoded with this information.
<ul> ;; LABEL "M" "Male" "Male Talkers" <br> ;; LABEL "F" "Female" "Female Talkers" <br> ;; LABEL "01" "Story 1" "Business news" <br> ;; LABEL "00" "Not in Story" "Words or Phrases not contained in a story" <br> 940328 1 A 4.00 18.10 &lt;O,F,00&gt; FROM LOS ANGELES <br> 940328 1 B 18.10 25.55 &lt;O,M,01&gt; MEXICO IN TURMOIL </ul>
</ul> </ul> </ul> </ul>
<p> <a name="ctm_fmt_name_0"> <strong> ctm - Definition of time marked conversation scoring input </strong> </a>
<ul>
<p> This section describes the time marked conversation input files used for scoring the output of speech recognizers with the NIST sclite program. Both the reference and hypothesis input files can share this format.
<p> The ctm file format is a concatenation of time mark records for each word in each channel of a waveform. The records are separated by newlines. Each word token must have a waveform ID, channel identifier [A | B], start time, duration, and word text. Optionally, a confidence score can be appended for each word. Each record follows this BNF format:
<p> CTM :== &lt;F&gt; &lt;C&gt; &lt;BT&gt; &lt;DUR&gt; word [ &lt;CONF&gt; ]
<ul> Where :
<ul>
&lt;F&gt; -&gt; <ul> The waveform filename. NOTE: no pathnames or extensions are expected. </ul>
&lt;C&gt; -&gt; <ul> The waveform channel, conventionally "A" or "B". sclite does not restrict the channel text: it can be any string without whitespace, so long as the matching string is found in both the reference and hypothesis input files. </ul>
&lt;BT&gt; -&gt; <ul> The begin time (seconds) of the word, measured from the start time of the file. </ul>
&lt;DUR&gt; -&gt; <ul> The duration (seconds) of the word. </ul>
&lt;CONF&gt; -&gt; <ul> Optional confidence score. It is proposed that this score will be used in the future. </ul>
</ul> </ul>
<p> The file must be sorted by the first three columns: the first and the second in ASCII order, and the third in numeric order. The UNIX sort command "sort +0 -1 +1 -2 +2nb -3" will sort the words into the appropriate order.
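<p> The record layout and sort order described above can be sketched in Python. This is a minimal illustration and not part of sclite: the names CtmRecord, parse_ctm and sort_ctm are hypothetical, and the sketch assumes that every record carries numeric begin-time and duration fields and that lines beginning with ';;' and blank lines are skipped.

```python
from typing import List, NamedTuple, Optional


class CtmRecord(NamedTuple):
    """One word token of a ctm file (hypothetical container, not an sclite type)."""
    waveform: str                # F: waveform filename, no path or extension
    channel: str                 # C: channel identifier, e.g. "A" or "B"
    begin: float                 # BT: begin time in seconds
    duration: float              # DUR: duration in seconds
    word: str
    confidence: Optional[float]  # CONF: None when the field is absent


def parse_ctm(text: str) -> List[CtmRecord]:
    """Parse ctm text, skipping ';;' comment lines and blank lines."""
    records = []
    for line in text.splitlines():
        fields = line.split()
        if not fields or fields[0].startswith(";;"):
            continue
        waveform, channel, begin, duration, word = fields[:5]
        confidence = float(fields[5]) if len(fields) > 5 else None
        records.append(CtmRecord(waveform, channel, float(begin),
                                 float(duration), word, confidence))
    return records


def sort_ctm(records: List[CtmRecord]) -> List[CtmRecord]:
    """Order by waveform and channel as strings, then by begin time as a number."""
    return sorted(records, key=lambda r: (r.waveform, r.channel, r.begin))
```

<p> Sorting on the tuple (waveform, channel, begin) compares the first two fields as strings and the third as a number, which matches the required ordering when the first two fields are plain ASCII.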
<p> Lines beginning with ';;' are considered comments and are ignored. Blank lines are also ignored.
<p> Included below is an example:
<ul> ;; <br> ;; Comments follow ';;' <br> ;; <br> ;; Blank lines are ignored <br> <br> ;; <br> 7654 A 11.34 0.2 YES -6.763 <br> 7654 A 12.00 0.34 YOU -12.384530 <br> 7654 A 13.30 0.5 CAN 2.806418 <br> 7654 A 17.50 0.2 AS 0.537922 <br> : <br> 7654 B 1.34 0.2 I -6.763 <br> 7654 B 2.00 0.34 CAN -12.384530 <br> 7654 B 3.40 0.5 ADD 2.806418 <br> 7654 B 7.00 0.2 AS 0.537922 <br> : </ul>
<p> For CTM reference files, a format extension exists to permit marking alternate transcripts. The alternation uses the same file format as described above, except that three word strings, "&lt;ALT_BEGIN&gt;", "&lt;ALT&gt;" and "&lt;ALT_END&gt;", are used to delimit the alternation. Each tag is treated as a word, with a conversation ID, a channel, and "*" in place of the begin time and duration.
<p> The alternation begins with the word "&lt;ALT_BEGIN&gt;" and ends with the word "&lt;ALT_END&gt;". Between the two are at least two alternative time-marked word sequences separated by the word "&lt;ALT&gt;". Each word sequence can contain any number of words. An empty alternative signifies a null word.
<p> Below is an example alternate reference transcript for the words "uh" and "um".
<p> <ul> ;; <br> 7654 A * * &lt;ALT_BEGIN&gt; <br> 7654 A 12.00 0.34 UM <br> 7654 A * * &lt;ALT&gt; <br> 7654 A 12.00 0.34 UH <br> 7654 A * * &lt;ALT_END&gt; <ul> </ul> </ul> </body> </html>