
<!-- $Id: infmts.htm,v 1.2 2008/09/05 17:12:23 ajot Exp $ -->
<HTML><HEAD>
<TITLE>options</TITLE>
</HEAD>
<BODY><p><hr>

<H1> 
<A NAME="infmts_name_0">
<A HREF="sclite.htm#sclite_name_0">Sclite</A> Input file formats: </A>
<a href="infmts.htm#trn_fmt_name_0">trn</a>,
<a href="infmts.htm#txt_fmt_name_0">txt</a>,
<a href="infmts.htm#stm_fmt_name_0">stm</a>,
<a href="infmts.htm#ctm_fmt_name_0">ctm</a>

</H1>

The inputs to "<a href="sclite.htm#sclite_name_0">sclite</a>" are a
reference file and one or more hypothesis files, the text portions of which
may be either ASCII characters or GB encoded Chinese characters.
Several input formats are permitted: 
"<a href="infmts.htm#trn_fmt_name_0">trn</a>",
"<a href="infmts.htm#txt_fmt_name_0">txt</a>",
"<a href="infmts.htm#stm_fmt_name_0">stm</a>", and
"<a href="infmts.htm#ctm_fmt_name_0">ctm</a>".
As new scoring paradigms were created for the ARPA
tests, accompanying formats were created to support the evaluations.

<p>
<a name="trn_fmt_name_0">
<strong> trn - Definition of a transcript input file </strong>
</a>
<ul>
<p>
          The transcript format is a file of word sequence
          records separated by newlines.  Each record contains a
          word sequence, followed by an utterance ID enclosed
          in parentheses.  See the '<a href="options.htm#option_i_name_0">-i</a>' option for a list of
          accepted utterance ID types.
<p>
          Example:
<ul>
               she had your dark suit in greasy  wash  water  all
               year (cmh_sa01)
</ul>
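<p>
          As an illustration only, a trn record can be split into its word
          sequence and utterance ID with a few lines of Python.  The helper
          name parse_trn_line is hypothetical and not part of sclite, and the
          sketch assumes the utterance ID is the last parenthesized token on
          the line.

```python
import re

# Hypothetical helper, not part of sclite: the utterance ID is assumed to be
# the final parenthesized token of the record.
_TRN_RECORD = re.compile(r"^(.*?)\s*\(([^()\s]+)\)\s*$")


def parse_trn_line(line):
    """Return (list_of_words, utterance_id) for one trn record."""
    match = _TRN_RECORD.match(line)
    if match is None:
        raise ValueError("no utterance ID in parentheses: %r" % (line,))
    return match.group(1).split(), match.group(2)


words, utt_id = parse_trn_line(
    "she had your dark suit in greasy wash water all year (cmh_sa01)"
)
print(utt_id, words)
```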
<p>
          Transcript alternations can be used
          in the word sequence by using this BNF format:
<p>
<ul>
               ALTERNATE :== "{" TEXT ALT+ "}"
<br> 
               ALT       :== "/" TEXT
<br>
               TEXT      :== 1 or more whitespace separated words |
	                    "@" | ALTERNATE
</ul>
<p>
     The "@" represents a  NULL  word  in  the  transcript.   For
     scoring  purposes,  an  error  is  not counted if the "@" is
     aligned as an insertion.
<p>
     Example:
<ul>
          i've { um / uh / @ } as far as i'm concerned
</ul>
</ul>
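<p>
     The alternation grammar can be read as "pick one branch per braced
     group".  The following Python sketch expands a transcript into every
     word sequence it can stand for, treating "@" as an empty branch.  The
     function name expand_alternations is hypothetical, the tokens are
     assumed to be whitespace separated as in the example, and this is not
     how sclite itself aligns alternations; it only illustrates the grammar.

```python
from itertools import product


def expand_alternations(transcript):
    """Return every word sequence a transcript with alternations allows."""
    tokens = transcript.split()
    pos = 0

    def parse_sequence(stop_tokens):
        # Collect items until a stop token; each item multiplies the variants.
        nonlocal pos
        variants = [[]]
        while pos != len(tokens) and tokens[pos] not in stop_tokens:
            token = tokens[pos]
            if token == "{":
                choices = parse_alternate()
            elif token == "@":
                pos += 1
                choices = [[]]  # NULL word: contributes nothing
            else:
                pos += 1
                choices = [[token]]
            variants = [head + tail for head, tail in product(variants, choices)]
        return variants

    def parse_alternate():
        # ALTERNATE :== "{" TEXT ALT+ "}"
        nonlocal pos
        pos += 1  # consume "{"
        choices = parse_sequence(("/", "}"))
        while pos != len(tokens) and tokens[pos] == "/":
            pos += 1  # consume "/"
            choices += parse_sequence(("/", "}"))
        if pos == len(tokens) or tokens[pos] != "}":
            raise ValueError("unterminated alternation")
        pos += 1  # consume "}"
        return choices

    return [" ".join(words) for words in parse_sequence(())]


for variant in expand_alternations("i've { um / uh / @ } as far as i'm concerned"):
    print(variant)
```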
<p>
<a name="txt_fmt_name_0">
<strong> txt - Definition of a text input file </strong>
</a>
<ul>
          This format is simply free-form text with no page,
          paragraph, sentence, or speaker breaks.
</ul>
<a name="stm_fmt_name_0">
<strong> stm - Definition of segment time mark input file </strong>
</a>
<ul>
<p>
     This describes the segment time marked files to be used  for
     scoring  the  output  of  speech  recognizers  via  the NIST
     sclite() program.  This is a reference file format.
<p>
     The segment time mark file consists of a concatenation of
     text segment records from a waveform file.  Each record is
     separated by a newline and contains: the waveform's filename
     and channel identifier [A | B], the talker's id, begin and
     end times (in seconds), an optional subset label, and the text
     for the segment.  Each record follows this BNF format:
<p>
      STM :== &lt;F&gt; &lt;C&gt; &lt;S&gt; &lt;BT&gt; &lt;ET&gt; [ &lt;LABEL&gt; ] transcript . . .
	<ul>
         Where :
	 <ul>
          &lt;F&gt;
	   <ul>
               The waveform  filename.   NOTE:  no  pathnames  or
               extensions are expected.
  	   </ul>
          &lt;C&gt;
	   <ul>
               The waveform channel.  The text of the waveform channel
		is not restricted by sclite.  It can be any text string without
		whitespace, so long as the matching string is found in both the reference
		and hypothesis input files.
  	   </ul>
          &lt;S&gt;
	   <ul>
               The speaker id,  no  restrictions  apply  to  this
               name.
  	   </ul>
          &lt;BT&gt;
	   <ul>
               The begin time (seconds) of the segment.
  	   </ul>
          &lt;ET&gt;
	   <ul>
               The end time (seconds) of the segment.
  	   </ul>
          &lt;LABEL&gt;
	   <ul>
               A  comma  separated  list  of  subset  identifiers
               enclosed  in angle brackets.  Ex. "&lt;O,F,00&gt;".  See
               "USING STM FORMAT FOR  LABELED  UTTERANCE  REPORTS
               (LUR)" below.
  	   </ul> 
         transcript
	   <ul>

The transcript can take on two forms: 1) a whitespace separated list of
words, or 2) the string "IGNORE_TIME_SEGMENT_IN_SCORING".
<p>
The list of
words can contain a transcript alternation using the following
BNF format:
<p>
<ul>
          ALTERNATE :== "{" TEXT ALT+ "}"
	<br>
          ALT       :== "/" TEXT
	<br>
          TEXT      :== 1 or more whitespace separated words | "@" | ALTERNATE
</ul>
<p>
     The "@" represents a NULL word in the transcript.  For scoring
purposes, an error is not counted if the "@" is aligned as an
insertion.  
<p>
<ul>
     Example:     "i've { um / uh / @ } as far  as i'm concerned"
</ul>
<p>
When the string "IGNORE_TIME_SEGMENT_IN_SCORING" is used as the transcript,
the process which chops the hypothesis file into matching reference segments
ignores all hypothesis words whose time-midpoints occur within the reference
segment's beginning
and ending times.  The effect is to declare this segment's region 
"out-of-bounds" for scoring, thus generating no errors from that time 
region. 
<p>
<ul>
NOTE: this only works with DP alignment of a reference stm file
and a hypothesis ctm file.
</ul>
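<p>
The midpoint rule can be sketched in Python as follows.  This is an
illustration under stated assumptions, not sclite's implementation: the
function name drop_ignored_words and the tuple layouts are hypothetical, and
the segment boundaries are assumed to be inclusive, which the text above does
not specify.

```python
IGNORE_TAG = "IGNORE_TIME_SEGMENT_IN_SCORING"


def drop_ignored_words(ref_segments, hyp_words):
    """Remove hypothesis words whose time-midpoint lies in an ignored segment.

    ref_segments: (filename, channel, begin, end, transcript) tuples (stm).
    hyp_words:    (filename, channel, begin, duration, word) tuples (ctm).
    """
    ignored = [
        (filename, channel, begin, end)
        for filename, channel, begin, end, transcript in ref_segments
        if transcript.strip() == IGNORE_TAG
    ]
    kept = []
    for filename, channel, begin, duration, word in hyp_words:
        midpoint = begin + duration / 2.0
        out_of_bounds = any(
            filename == seg_file
            and channel == seg_channel
            and seg_begin <= midpoint <= seg_end  # assumed inclusive bounds
            for seg_file, seg_channel, seg_begin, seg_end in ignored
        )
        if not out_of_bounds:
            kept.append((filename, channel, begin, duration, word))
    return kept


# Made-up data for illustration only.
reference = [
    ("2345", "A", 0.10, 2.03, "uh huh yes i thought"),
    ("2345", "A", 2.10, 3.04, IGNORE_TAG),
]
hypothesis = [
    ("2345", "A", 0.50, 0.30, "uh"),
    ("2345", "A", 2.40, 0.20, "dog"),   # midpoint 2.50: inside ignored segment
    ("2345", "B", 2.40, 0.20, "cat"),   # other channel: kept
    ("2345", "A", 3.00, 0.20, "yes"),   # midpoint 3.10: after the segment
]
kept_words = [record[4] for record in drop_ignored_words(reference, hypothesis)]
print(kept_words)
```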
<p>
	 </ul>
	</ul>

     Example STM file:
	<ul>
	  ;; comment
	<br>
          2345 A 2345-a 0.10 2.03 uh huh yes i thought
	<br>
          2345 A 2345-b 2.10 3.04 dog walking is a very
	<br>
          2345 A 2345-a 3.50 4.59 yes but it's worth it
	</ul>
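<p>
     A reader for this record layout might look like the Python sketch
     below.  The names StmSegment and parse_stm are hypothetical and not part
     of sclite.  The sketch assumes the optional label field contains no
     whitespace, and it skips comment and blank lines.

```python
from collections import namedtuple

# Hypothetical record type mirroring the STM fields F, C, S, BT, ET, LABEL.
StmSegment = namedtuple(
    "StmSegment", "filename channel speaker begin end labels transcript"
)


def parse_stm(lines):
    """Parse STM records, skipping ';;' comment lines and blank lines."""
    segments = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith(";;"):
            continue
        fields = line.split()
        if len(fields) < 5:
            raise ValueError("too few fields in STM record: %r" % (line,))
        filename, channel, speaker = fields[:3]
        begin, end = float(fields[3]), float(fields[4])
        rest = fields[5:]
        labels = []
        # Optional sixth field: comma separated ids inside angle brackets.
        if rest and rest[0].startswith("<") and rest[0].endswith(">"):
            labels = rest[0][1:-1].split(",")
            rest = rest[1:]
        segments.append(
            StmSegment(filename, channel, speaker, begin, end, labels, " ".join(rest))
        )
    return segments


sample_stm = [
    ";; comment",
    "",
    "2345 A 2345-a 0.10 2.03 uh huh yes i thought",
    "2345 A 2345-b 2.10 3.04 dog walking is a very",
    "2345 A 2345-a 3.50 4.59 yes but it's worth it",
    "940328 1 A 4.00 18.10 <O,F,00> FROM LOS ANGELES",
]
stm_segments = parse_stm(sample_stm)
for segment in stm_segments:
    print(segment)
```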

<p>
     The file must be sorted by the first and second columns in
     ASCII order, and the fourth in numeric order.  The UNIX sort
     command: "sort +0 -1 +1 -2 +3nb -4" will sort the records
     into the appropriate order.
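<p>
     The "+POS -POS" key syntax shown above is the obsolete form; with a
     POSIX sort the same keys are written "sort -k1,1 -k2,2 -k4,4nb".  The
     same ordering can also be sketched in Python.  The helper names
     stm_sort_key and sort_stm_records are hypothetical, and the sketch
     assumes it is given record lines only (no comment or blank lines).

```python
def stm_sort_key(record):
    """Key: filename and channel as text, begin time (4th column) as a number."""
    fields = record.split()
    return (fields[0], fields[1], float(fields[3]))


def sort_stm_records(records):
    """Return the STM record lines in the order described above."""
    return sorted(records, key=stm_sort_key)


unsorted_records = [
    "2345 A 2345-a 3.50 4.59 yes but it's worth it",
    "2345 A 2345-a 0.10 2.03 uh huh yes i thought",
    "2345 A 2345-b 2.10 3.04 dog walking is a very",
]
sorted_records = sort_stm_records(unsorted_records)
for record in sorted_records:
    print(record)
```

     Python compares strings by code point, which matches ASCII order for
     ASCII filenames and channel identifiers.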
<p>
Lines beginning with ';;' are considered  comments  and  are
ignored.  Blank lines are also ignored.
<p>

</ul>
     USING STM FORMAT FOR LABELED UTTERANCE REPORTS (LUR):
<ul>
     Motivation:
   <ul>
          For the Fall '95 ARPA CSR Evaluation, it was  desirable
          to  not  only  report overall error-rate statistics but
          also error-rate  statistics  for  arbitrary  partitions
          and/or  groups  of  partitions within the test set.  To
          this end, the STM file format was  extended  to  encode
          arbitrary subset information for each segment.
   </ul>
     Usage:
   <ul>
          The subset information is encoded by adding two types
          of information to the STM file.  The first
	  type is a special comment line, the subset information line (SIL).
	  The SIL defines the subset's label
          id, a short column heading, and a description.  The special
	  comment line format is:
	<ul>
           ;; LABEL "&lt;ID&gt;" "&lt;COL_HD&gt;" "&lt;DESC&gt;"
	  <ul>
             where:
	     <ul>
               &lt;ID&gt;
                 <ul>
                    The subset id.  Used to  label  each  segment
                    that  belongs  to  the subset.  The format is
                    arbitrary, but without spaces.
                 </ul>
               &lt;COL_HD&gt; 
                 <ul>
                    Used as column headings in generated reports.
                    Format is arbitrary.
                 </ul>
               &lt;DESC&gt; 
                 <ul>
                    Used for subset descriptions in generated
                    reports.  May be of arbitrary length and
                    format.  Double backslashes '\\' add a line
                    feed.
                 </ul>
             </ul>
          </ul>
          The order of the SIL lines in the STM file defines the
          order of subset presentation in the generated reports.
<p>
          The second type of information incorporated into the
          STM file is an optional sixth field in the text segment
          record.  The field consists of a comma separated list
          of subset ids enclosed in angle brackets.  Each unique
          id must have a special comment line, as specified above,
          to be properly interpreted; otherwise the id will be
          ignored.
<p>
          Each position within the label field, separated by
          commas, defines a group of subsets that are presented
          separately in the generated reports.  So for  instance,
          the  first  group might be all segments, and the second
          might be either male or female, and the third might  be
          the story.  The example below shows an STM file encoded
          with this information.
<ul>
               ;; LABEL "M" "Male" "Male Talkers"
<br>
               ;; LABEL "F" "Female" "Female Talkers"
<br>
               ;; LABEL "01" "Story 1" "Business news"
<br>
               ;; LABEL "00" "Not in Story" "Words or Phrases not
			 contained in a story"
<br>
               940328 1 A 4.00 18.10 &lt;O,F,00&gt; FROM LOS ANGELES
<br>
               940328 1 B 18.10 25.55 &lt;O,M,01&gt; MEXICO IN TURMOIL

</ul>
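<p>
          The two pieces of subset information can be read with a short
          Python sketch.  The names read_subset_labels and known_subsets are
          hypothetical and not part of sclite.  The sketch assumes each
          LABEL comment occupies a single line in the file, and it drops ids
          that have no LABEL line, as described above.

```python
import re

# ;; LABEL "<ID>" "<COL_HD>" "<DESC>"
_LABEL_LINE = re.compile(r'^;;\s*LABEL\s+"([^"]*)"\s+"([^"]*)"\s+"([^"]*)"\s*$')


def read_subset_labels(lines):
    """Return {id: (column_heading, description)} in file order."""
    labels = {}
    for line in lines:
        match = _LABEL_LINE.match(line.strip())
        if match:
            subset_id, heading, description = match.groups()
            # Double backslashes in the description stand for a line feed.
            labels[subset_id] = (heading, description.replace("\\\\", "\n"))
    return labels


def known_subsets(label_field, labels):
    """Split a label field such as '<O,F,00>' and keep only the defined ids."""
    ids = label_field.strip()[1:-1].split(",")
    return [subset_id for subset_id in ids if subset_id in labels]


stm_lines = [
    ';; LABEL "M" "Male" "Male Talkers"',
    ';; LABEL "F" "Female" "Female Talkers"',
    ';; LABEL "01" "Story 1" "Business news"',
    ';; LABEL "00" "Not in Story" "Words or Phrases not contained in a story"',
]
subset_labels = read_subset_labels(stm_lines)
print(list(subset_labels))
print(known_subsets("<O,F,00>", subset_labels))
```

          A dict keeps insertion order in current Python versions, so
          iterating over the result follows the order of the SIL lines.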
</ul>
</ul>
</ul>
</ul>
<p>
<a name="ctm_fmt_name_0">
<strong> ctm - Definition of time marked conversation scoring input </strong>
</a>
<ul>
<p>
     This describes the time marked conversation input  files  to
     be used for scoring the output of speech recognizers via the
     NIST sclite() program.  Both the  reference  and  hypothesis
     input files can share this format.
<p>
     The ctm file format is a concatenation of time mark records
     for each word in each channel of a waveform.  The records
     are separated with a newline.  Each word token must have a
     waveform id, channel identifier [A | B], start time,
     duration, and word text.  Optionally, a confidence score can be
     appended for each word.  Each record follows this BNF
     format:
<p>
      CTM :== &lt;F&gt; &lt;C&gt; &lt;BT&gt; &lt;DUR&gt; word [ &lt;CONF&gt; ]
<ul>
         Where :
<ul>
          &lt;F&gt;  -&gt;
<ul>
               The waveform  filename.   NOTE:  no  pathnames  or
               extensions are expected.
</ul>
          &lt;C&gt;  -&gt;
<ul>
               The waveform channel, conventionally "A" or "B".  The text of the waveform channel
		is not restricted by sclite.  It can be any text string without
		whitespace, so long as the matching string is found in both the reference
		and hypothesis input files.
</ul>
          &lt;BT&gt; -&gt;
<ul>
               The begin time (seconds)  of  the  word,  measured
               from the start time of the file.
</ul>
          &lt;DUR&gt;  -&gt;
<ul>
               The duration (seconds) of the word.
</ul>
          &lt;CONF&gt;  -&gt;
<ul>
               Optional confidence score.  It  is  proposed  that
               this score will be used in the future.
</ul>
</ul>
</ul>
<p>
     The file must be sorted by  the  first  three  columns:  the
     first  and  the  second  in  ASCII order, and the third by a
     numeric order.  The UNIX sort command: "sort  +0  -1  +1  -2
     +2nb -3" will sort the words into appropriate order.
<p>
     Lines beginning with ';;' are considered  comments  and  are
     ignored.  Blank lines are also ignored.
<p>
     Included below is an example:
<ul>
     ;;
     <br>
     ;;  Comments follow ';;' 
     <br>
     ;;
     <br>
     ;;  The Blank lines are ignored
     <br>

     <br>
     ;;
     <br>
     7654 A 11.34 0.2  YES -6.763
     <br>
     7654 A 12.00 0.34 YOU -12.384530
     <br>
     7654 A 13.30 0.5  CAN 2.806418
     <br>
     7654 A 17.50 0.2  AS 0.537922
     <br>
           :
     <br>
     7654 B 1.34 0.2  I -6.763
     <br>
     7654 B 2.00 0.34 CAN -12.384530
     <br>
     7654 B 3.40 0.5  ADD 2.806418
     <br>
     7654 B 7.00 0.2  AS 0.537922
     <br>
           :
</ul>
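<p>
     A reader for plain CTM records might be sketched in Python as below.
     The names CtmWord and parse_ctm are hypothetical and not part of
     sclite.  The sketch handles the optional confidence score and skips
     comment and blank lines; it does not handle the "*" time fields used by
     the alternation tags in reference files.

```python
from collections import namedtuple

# Hypothetical record type mirroring the CTM fields F, C, BT, DUR, word, CONF.
CtmWord = namedtuple("CtmWord", "filename channel begin duration word confidence")


def parse_ctm(lines):
    """Parse CTM word records, skipping ';;' comment lines and blank lines."""
    words = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith(";;"):
            continue
        fields = line.split()
        if len(fields) not in (5, 6):
            raise ValueError("expected 5 or 6 fields in CTM record: %r" % (line,))
        filename, channel, begin, duration, word = fields[:5]
        confidence = float(fields[5]) if len(fields) == 6 else None
        words.append(
            CtmWord(filename, channel, float(begin), float(duration), word, confidence)
        )
    return words


sample_ctm = [
    ";;  Comments follow ';;'",
    "",
    "7654 A 11.34 0.2  YES -6.763",
    "7654 A 12.00 0.34 YOU -12.384530",
    "7654 B 1.34 0.2  I",
]
ctm_words = parse_ctm(sample_ctm)
for entry in ctm_words:
    # The end time of a word is its begin time plus its duration.
    print(entry.word, entry.begin, entry.begin + entry.duration, entry.confidence)
```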
<p>
     For CTM reference files, a format extension exists to permit
     marking  alternate  transcripts.   The  alternation uses the
     same file format  as  described  above,  except  three  word
     strings, "&lt;ALT_BEGIN&gt;", "&lt;ALT&gt;" and "&lt;ALT_END&gt;", are used to
     delimit the alternation.  Each tag is  treated  as  a  word,
     with  a conversation id, channel and "*"'s for the begin and
     duration time.
<p>
     The alternation is begun using the word "&lt;ALT_BEGIN&gt;", and
     terminated using the word "&lt;ALT_END&gt;".  Between the start
     and end are at least 2 alternative time-marked word
     sequences separated by the word "&lt;ALT&gt;".  Each word sequence
     can contain any number of words.  An empty alternative
     signifies a null word.
<p>
     Below is an example alternate reference transcript for the
     words "uh" and "um".
<p>
  <ul>
     ;;
     <br>
     7654 A   *    *   &lt;ALT_BEGIN&gt;
     <br>
     7654 A 12.00 0.34 UM
     <br>
     7654 A   *    *   &lt;ALT&gt;
     <br>
     7654 A 12.00 0.34 UH
     <br>
     7654 A   *    *   &lt;ALT_END&gt;
  </ul>
</ul>
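<p>
     The alternation tags in a CTM reference file can be grouped with a
     small Python sketch.  The function name group_ctm_alternations is
     hypothetical and not part of sclite.  The sketch assumes alternations
     are not nested, which the description above does not state either way.

```python
def group_ctm_alternations(lines):
    """Group CTM records into plain words and alternations.

    Returns a list whose items are either one record (a list of fields) or
    an alternation (a list of alternatives, each a list of records).  An
    empty alternative stands for a null word.
    """
    items = []
    alternatives = None  # None while outside an alternation
    for line in lines:
        line = line.strip()
        if not line or line.startswith(";;"):
            continue
        fields = line.split()
        word = fields[4]
        if word == "<ALT_BEGIN>":
            if alternatives is not None:
                raise ValueError("nested alternations are not handled here")
            alternatives = [[]]
        elif word in ("<ALT>", "<ALT_END>"):
            if alternatives is None:
                raise ValueError("alternation tag outside an alternation")
            if word == "<ALT>":
                alternatives.append([])
            else:
                items.append(alternatives)
                alternatives = None
        elif alternatives is not None:
            alternatives[-1].append(fields)
        else:
            items.append(fields)
    if alternatives is not None:
        raise ValueError("alternation is missing its end tag")
    return items


reference_ctm = [
    ";;",
    "7654 A   *    *   <ALT_BEGIN>",
    "7654 A 12.00 0.34 UM",
    "7654 A   *    *   <ALT>",
    "7654 A 12.00 0.34 UH",
    "7654 A   *    *   <ALT_END>",
]
grouped = group_ctm_alternations(reference_ctm)
print(grouped)
```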
</body>
</html>