  <!-- $Id: infmts.htm,v 1.2 2008/09/05 17:12:23 ajot Exp $ -->
  <HTML><HEAD>
  <TITLE>Sclite input file formats</TITLE>
  </HEAD>
  <BODY><p><hr>
  
  <H1> 
  <A NAME="infmts_name_0">
  <A HREF="sclite.htm#sclite_name_0">Sclite</A> Input file formats: </A>
  <a href="infmts.htm#trn_fmt_name_0">trn</a>,
  <a href="infmts.htm#txt_fmt_name_0">txt</a>,
  <a href="infmts.htm#stm_fmt_name_0">stm</a>,
  <a href="infmts.htm#ctm_fmt_name_0">ctm</a>
  
  </H1>
  
  The inputs to "<a href="sclite.htm#sclite_name_0">sclite</a>" are a
  reference file and one or more hypothesis files, the text portions of which
  may be either ASCII characters or GB encoded Chinese characters.
  Several input formats are permitted: 
  "<a href="infmts.htm#trn_fmt_name_0">trn</a>",
  "<a href="infmts.htm#txt_fmt_name_0">txt</a>",
  "<a href="infmts.htm#stm_fmt_name_0">stm</a>", and
  "<a href="infmts.htm#ctm_fmt_name_0">ctm</a>".
  As new scoring paradigms were created for the ARPA
  tests, accompanying formats were created to support the evaluations.
  
  <p>
  <a name="trn_fmt_name_0">
  <strong> trn - Definition of a transcript input file </strong>
  </a>
  <ul>
  <p>
            The transcript format is a file of word sequence
            records separated by newlines.  Each record contains a
            word sequence, followed by an utterance ID enclosed
            in parentheses.  See the '<a href="options.htm#option_i_name_0">-i</a>' option for a list of
            accepted utterance ID types.
  <p>
            example.
  <ul>
                 she had your dark suit in greasy  wash  water  all
                 year (cmh_sa01)
  </ul>
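  <p>
  A minimal Python sketch of reading one trn record, assuming the utterance ID
  is the final parenthesized token on the line.  The name parse_trn_line is
  hypothetical and is not part of sclite.

```python
import re

# Assumption: the utterance ID is the last "(...)" group on the line and
# contains no nested parentheses.
TRN_RECORD = re.compile(r"^(?P<words>.*?)\s*\((?P<utt_id>[^()]*)\)\s*$")


def parse_trn_line(line):
    """Split a trn record into (list of words, utterance ID)."""
    match = TRN_RECORD.match(line)
    if match is None:
        raise ValueError(f"not a trn record: {line!r}")
    return match.group("words").split(), match.group("utt_id")


words, utt_id = parse_trn_line(
    "she had your dark suit in greasy wash water all year (cmh_sa01)"
)
```

  Which utterance ID forms are accepted is governed by the '-i' option; this
  sketch does not validate the ID.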
  <p>
            Transcript alternations, described above, can  be  used
            in the word sequence by using this BNF format:
  <p>
  <ul>
                 ALTERNATE :== "{" TEXT ALT+ "}"
  <br> 
                 ALT       :== "/" TEXT
  <br>
                 TEXT      :== 1 or more whitespace separated words |
  	                    "@" | ALTERNATE
  </ul>
  <p>
       The "@" represents a  NULL  word  in  the  transcript.   For
       scoring  purposes,  an  error  is  not counted if the "@" is
       aligned as an insertion.
  <p>
       example
  <ul>
            i've { um / uh / @ } as far as i'm concerned
  </ul>
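  <p>
  The alternation grammar can be illustrated with a small Python sketch that
  expands a transcript into every word sequence it encodes.  This is an
  illustration only, not how sclite aligns alternations internally; the
  function names are hypothetical, and the sketch assumes that "{", "/" and
  "}" are whitespace separated tokens.  It does not enforce the minimum of
  two alternatives.

```python
from itertools import product


def expand(tokens):
    """Return every word sequence encoded by a token list with alternations."""
    sequences, _ = _parse_sequence(tokens, 0, stop=())
    return sequences


def _parse_sequence(tokens, pos, stop):
    # Each entry of parts is a list of options; each option is a list of words.
    parts = []
    while pos < len(tokens) and tokens[pos] not in stop:
        token = tokens[pos]
        if token == "{":
            options, pos = _parse_alternate(tokens, pos + 1)
            parts.append(options)
        elif token == "@":
            parts.append([[]])  # the NULL word contributes no words
            pos += 1
        else:
            parts.append([[token]])
            pos += 1
    # Combine one option from each part, in order.
    combos = [sum(choice, []) for choice in product(*parts)]
    return combos, pos


def _parse_alternate(tokens, pos):
    options = []
    while True:
        sequences, pos = _parse_sequence(tokens, pos, stop=("/", "}"))
        options.extend(sequences)
        if pos >= len(tokens):
            raise ValueError("unterminated alternation")
        if tokens[pos] == "}":
            return options, pos + 1
        pos += 1  # skip the "/" separator


variants = expand("i've { um / uh / @ } as far as i'm concerned".split())
```

  For the example record, variants holds three sequences: one with "um", one
  with "uh", and one with neither.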
  </ul>
  <p>
  <a name="txt_fmt_name_0">
  <strong> txt - Definition of a text input file </strong>
  </a>
  <ul>
            This format is simply free-form text with no page,
            paragraph, sentence, or speaker breaks.
  </ul>
  <a name="stm_fmt_name_0">
  <strong> stm - Definition of segment time mark input file </strong>
  </a>
  <ul>
  <p>
       This describes the segment time marked files to be used  for
       scoring  the  output  of  speech  recognizers  via  the NIST
       sclite() program.  This is a reference file format.
  <p>
       The segment time mark file consists of a concatenation of
       text segment records from a waveform file.  Each record is
       separated by a newline and contains: the waveform's filename
       and channel identifier [A | B], the talker's id, begin and
       end times (in seconds), an optional subset label, and the text
       for the segment.  Each record follows this BNF format:
  <p>
        STM :== &lt;F&gt; &lt;C&gt; &lt;S&gt; &lt;BT&gt; &lt;ET&gt; [ &lt;LABEL&gt; ] transcript . . .
  	<ul>
           Where :
  	 <ul>
            &lt;F&gt;
  	   <ul>
                 The waveform  filename.   NOTE:  no  pathnames  or
                 extensions are expected.
    	   </ul>
            &lt;C&gt;
  	   <ul>
                 The waveform channel.   The text of the waveform channel
  		is not restricted by sclite.  The text can be any text string without
  		whitespace so long as the matching string is found in both the reference
  		and hypothesis input files.
    	   </ul>
            &lt;S&gt;
  	   <ul>
                 The speaker id,  no  restrictions  apply  to  this
                 name.
    	   </ul>
            &lt;BT&gt;
  	   <ul>
                 The begin time (seconds) of the segment.
    	   </ul>
            &lt;ET&gt;
  	   <ul>
                 The end time (seconds) of the segment.
    	   </ul>
            &lt;LABEL&gt;
  	   <ul>
                 A  comma  separated  list  of  subset  identifiers
                 enclosed  in angle brackets.  Ex. "&lt;O,F,00&gt;".  See
                 "USING STM FORMAT FOR  LABELED  UTTERANCE  REPORTS
                 (LUR)" below.
    	   </ul> 
           transcript
  	   <ul>
  
  The transcript can take on two forms: 1) a whitespace separated list of
  words, or 2) the string "IGNORE_TIME_SEGMENT_IN_SCORING".
  <p>
  The list of
  words can contain a transcript alternation using the following
  BNF format:
  <p>
  <ul>
            ALTERNATE :== "{" TEXT ALT+ "}"
  	<br>
            ALT       :== "/" TEXT
  	<br>
            TEXT      :== 1 or more whitespace separated words | "@" | ALTERNATE
  </ul>
  <p>
       The "@" represents a NULL word in the transcript.  For scoring
  purposes, an error is not counted if the "@" is aligned as an
  insertion.  
  <p>
  <ul>
       Example:     "i've { um / uh / @ } as far  as i'm concerned"
  </ul>
  <p>
  When the string "IGNORE_TIME_SEGMENT_IN_SCORING" is used as the transcript,
  the process which chops the hypothesis file into matching reference segments
  ignores all hypothesis words whose time-midpoints occur within the reference
  segment's beginning
  and ending times.  The effect is to declare this segment's region 
  "out-of-bounds" for scoring, thus generating no errors from that time 
  region. 
  <p>
  <ul>
  NOTE: this only works with DP alignment of a reference stm file
  and hypothesis ctm file.
  </ul>
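  <p>
  The midpoint rule can be sketched in Python as follows.  This is an
  illustration, not sclite's own code: the function name and the tuple
  layouts are hypothetical, and treating the segment boundaries as inclusive
  is an assumption.

```python
def drop_ignored_words(hyp_words, ignored_segments):
    """Remove hypothesis words whose time-midpoint lies in an ignored segment.

    hyp_words:        list of (begin_time, duration, word) tuples, in seconds
    ignored_segments: list of (begin_time, end_time) tuples for reference
                      segments marked IGNORE_TIME_SEGMENT_IN_SCORING
    """
    kept = []
    for begin, duration, word in hyp_words:
        midpoint = begin + duration / 2.0
        # Assumption: a midpoint exactly on a boundary counts as inside.
        inside = any(
            seg_begin <= midpoint <= seg_end
            for seg_begin, seg_end in ignored_segments
        )
        if not inside:
            kept.append((begin, duration, word))
    return kept


hyp = [(1.0, 0.4, "a"), (2.0, 0.5, "b"), (3.9, 0.4, "c")]
scored = drop_ignored_words(hyp, [(2.0, 4.0)])
```

  Here only "b" is dropped: its midpoint (2.25) falls inside the ignored
  segment, while the midpoint of "c" (about 4.1) falls just after it.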
  <p>
  	 </ul>
  	</ul>
  
       Example STM file:
  	<ul>
  	  ;; comment
  	<br>
            2345 A 2345-a 0.10 2.03 uh huh yes i thought
  	<br>
            2345 A 2345-b 2.10 3.04 dog walking is a very
  	<br>
            2345 A 2345-a 3.50 4.59 yes but it's worth it
  	</ul>
  
  <p>
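       The file must be sorted by the first and second columns in
       ASCII order, and the fourth in numeric order.  The UNIX sort
       command:  "sort +0 -1 +1 -2 +3nb -4"  will  sort  the  records
       into the appropriate order.  On systems whose sort no longer
       accepts this older "+POS" key syntax, the equivalent "-k" form
       is "sort -k1,1 -k2,2 -k4nb,4".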
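  <p>
  The same ordering can be sketched in Python.  The helper stm_sort_key is
  hypothetical, and the sketch assumes comment and blank lines have already
  been set aside, since they have no fourth column to compare.

```python
def stm_sort_key(line):
    """Columns 1 and 2 compared as text, column 4 compared as a number."""
    fields = line.split()
    return (fields[0], fields[1], float(fields[3]))


lines = [
    "2345 A 2345-a 3.50 4.59 yes but it's worth it",
    "2345 A 2345-a 0.10 2.03 uh huh yes i thought",
    "2345 A 2345-b 2.10 3.04 dog walking is a very",
]
ordered = sorted(lines, key=stm_sort_key)
```

  Python compares strings by code point, which agrees with ASCII order for
  ASCII-only filenames and channel identifiers.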
  <p>
  Lines beginning with ';;' are considered  comments  and  are
  ignored.  Blank lines are also ignored.
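  <p>
  A minimal Python sketch of reading one STM record under the rules above.
  StmSegment and parse_stm_line are hypothetical names.  The sketch assumes
  that a sixth field beginning with an opening angle bracket is the optional
  label field, so a transcript whose first word starts with that character
  would be misread.

```python
from typing import List, NamedTuple, Optional


class StmSegment(NamedTuple):
    filename: str
    channel: str
    speaker: str
    begin: float
    end: float
    labels: List[str]
    transcript: str


def parse_stm_line(line: str) -> Optional[StmSegment]:
    """Parse one STM record; return None for comment and blank lines."""
    stripped = line.strip()
    if not stripped or stripped.startswith(";;"):
        return None
    fields = stripped.split(None, 5)
    if len(fields) < 5:
        raise ValueError(f"too few fields in STM record: {line!r}")
    filename, channel, speaker, begin, end = fields[:5]
    rest = fields[5] if len(fields) > 5 else ""
    labels: List[str] = []
    # Assumption: a leading "<...>" token is the optional label field.
    if rest.startswith("<") and ">" in rest:
        label_text, _, rest = rest[1:].partition(">")
        labels = [label.strip() for label in label_text.split(",")]
        rest = rest.strip()
    return StmSegment(filename, channel, speaker,
                      float(begin), float(end), labels, rest)


plain = parse_stm_line("2345 A 2345-a 0.10 2.03 uh huh yes i thought")
labeled = parse_stm_line("940328 1 A 4.00 18.10 <O,F,00> FROM LOS ANGELES")
```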
  <p>
  
  </ul>
       USING STM FORMAT FOR LABELED UTTERANCE REPORTS (LUR):
  <ul>
       Motivation:
     <ul>
            For the Fall '95 ARPA CSR Evaluation, it was  desirable
            to  not  only  report overall error-rate statistics but
            also error-rate  statistics  for  arbitrary  partitions
            and/or  groups  of  partitions within the test set.  To
            this end, the STM file format was  extended  to  encode
            arbitrary subset information for each segment.
     </ul>
       Usage:
     <ul>
            The subset information is encoded by adding two types
            of information to the STM file.  The first
  	  type is a special comment line, the subset information line (SIL).
  	  The SIL defines the subset's label
            id, a short column heading and a description.  The special
  	  comment line format is:
  	<ul>
             ;; LABEL "&lt;ID&gt;" "&lt;COL_HD&gt;" "&lt;DESC&gt;"
  	  <ul>
               where:
  	     <ul>
                 &lt;ID&gt;
                   <ul>
                      The subset id.  Used to  label  each  segment
                      that  belongs  to  the subset.  The format is
                      arbitrary, but without spaces.
                   </ul>
                 &lt;COL_HD&gt; 
                   <ul>
                      Used as column headings in generated reports.
                      Format is arbitrary.
                   </ul>
                 &lt;DESC&gt; 
                   <ul>
                      Used for subset descriptions in generated
                      reports.  May be of arbitrary length and
                      format.  Double backslashes '\\' add a line
                      feed.
                   </ul>
               </ul>
            </ul>
            The order of the SIL lines in the STM file defines  the
            order of subset presentation in the generated reports.
            The second type of information  incorporated  into  the
            STM file is an optional sixth field to the text segment
            record.  The field consists of a comma  separated  list
            of  subset ids enclosed in angle brackets.  Each unique
            id must have a special comment line,  specified  above,
            to  be  properly interpreted.  Otherwise the id will be
            ignored.
  <p>
            Each position within the label field,  separated  by
            commas,  defines  a group of subsets that are presented
            separately in the generated reports.  So for  instance,
            the  first  group might be all segments, and the second
            might be either male or female, and the third might  be
            the story.  The example below shows an STM file encoded
            with this information.
  <ul>
                 ;; LABEL "M" "Male" "Male Talkers"
  <br>
                 ;; LABEL "F" "Female" "Female Talkers"
  <br>
                 ;; LABEL "01" "Story 1" "Business news"
  <br>
                 ;; LABEL "00" "Not in Story" "Words or Phrases not
  			 contained in a story"
  <br>
                 940328 1 A 4.00 18.10 &lt;O,F,00&gt; FROM LOS ANGELES
  <br>
                 940328 1 B 18.10 25.55 &lt;O,M,01&gt; MEXICO IN TURMOIL
  
  </ul>
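  <p>
  A minimal Python sketch of collecting the subset information lines and
  applying the rule that undefined ids are ignored.  The function names are
  hypothetical, and the sketch assumes each SIL sits on a single line with
  its three values in double quotes.

```python
import re

# ;; LABEL "<ID>" "<COL_HD>" "<DESC>"
LABEL_LINE = re.compile(
    r'^;;\s*LABEL\s+"([^"]*)"\s+"([^"]*)"\s+"([^"]*)"\s*$'
)


def read_subset_definitions(lines):
    """Map subset id -> (column heading, description), in file order."""
    subsets = {}
    for line in lines:
        match = LABEL_LINE.match(line.strip())
        if match:
            subset_id, heading, description = match.groups()
            # Double backslashes in the description stand for a line feed.
            subsets[subset_id] = (heading, description.replace("\\\\", "\n"))
    return subsets


def known_labels(segment_labels, subsets):
    """Keep only the ids that have a matching LABEL comment line."""
    return [label for label in segment_labels if label in subsets]


header = [
    ';; LABEL "M" "Male" "Male Talkers"',
    ';; LABEL "F" "Female" "Female Talkers"',
    ';; LABEL "01" "Story 1" "Business news"',
    ';; LABEL "00" "Not in Story" "Words or Phrases not contained in a story"',
]
subsets = read_subset_definitions(header)
used = known_labels(["O", "F", "00"], subsets)
```

  A Python dict keeps insertion order, so iterating over subsets follows the
  order of the SIL lines.  In this example the id "O" has no LABEL line, so
  it is dropped from used.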
  </ul>
  </ul>
  </ul>
  </ul>
  <p>
  <a name="ctm_fmt_name_0">
  <strong> ctm - Definition of time marked conversation scoring input </strong>
  </a>
  <ul>
  <p>
       This describes the time marked conversation input  files  to
       be used for scoring the output of speech recognizers via the
       NIST sclite() program.  Both the  reference  and  hypothesis
       input files can share this format.
  <p>
       The ctm file format is a concatenation of time mark  records
       for  each  word  in each channel of a waveform.  The records
       are separated with a newline.  Each word token must  have  a
       waveform  id,  channel identifier [A | B], start time,
       duration, and word text.  Optionally a confidence score  can  be
       appended  for  each word.  Each record follows this BNF
       format:
  <p>
        CTM :== &lt;F&gt; &lt;C&gt; &lt;BT&gt; &lt;DUR&gt; word [ &lt;CONF&gt; ]
  <ul>
           Where :
  <ul>
            &lt;F&gt;  -&gt;
  <ul>
                 The waveform  filename.   NOTE:  no  pathnames  or
                 extensions are expected.
  </ul>
            &lt;C&gt;  -&gt;
  <ul>
                 The waveform channel, conventionally "A" or "B".  The text of the waveform channel
  		is not restricted by sclite.  The text can be any text string without
  		whitespace so long as the matching string is found in both the reference
  		and hypothesis input files.
  </ul>
            &lt;BT&gt; -&gt;
  <ul>
                 The begin time (seconds)  of  the  word,  measured
                 from the start time of the file.
  </ul>
            &lt;DUR&gt;  -&gt;
  <ul>
                 The duration (seconds) of the word.
  </ul>
            &lt;CONF&gt;  -&gt;
  <ul>
                 Optional confidence score.  It  is  proposed  that
                 this score will be used in the future.
  </ul>
  </ul>
  </ul>
  <p>
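       The file must be sorted by  the  first  three  columns:  the
       first  and  the  second  in  ASCII order, and the third in
       numeric order.  The UNIX sort command: "sort  +0  -1  +1  -2
       +2nb -3" will sort the words into the appropriate order.  On
       systems whose sort no longer accepts this older "+POS" key
       syntax, the equivalent "-k" form is "sort -k1,1 -k2,2 -k3nb,3".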
  <p>
       Lines beginning with ';;' are considered  comments  and  are
       ignored.  Blank lines are also ignored.
  <p>
       Included below is an example:
  <ul>
       ;;
       <br>
       ;;  Comments follow ';;' 
       <br>
       ;;
       <br>
       ;;  The Blank lines are ignored
       <br>
  
       <br>
       ;;
       <br>
       7654 A 11.34 0.2  YES -6.763
       <br>
       7654 A 12.00 0.34 YOU -12.384530
       <br>
       7654 A 13.30 0.5  CAN 2.806418
       <br>
       7654 A 17.50 0.2  AS 0.537922
       <br>
             :
       <br>
       7654 B 1.34 0.2  I -6.763
       <br>
       7654 B 2.00 0.34 CAN -12.384530
       <br>
       7654 B 3.40 0.5  ADD 2.806418
       <br>
       7654 B 7.00 0.2  AS 0.537922
       <br>
             :
  </ul>
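  <p>
  A minimal Python sketch of reading CTM records and putting them in the
  required order.  CtmWord and parse_ctm_line are hypothetical names.  The
  sketch handles plain word records only; the alternation tags, which carry
  "*" in the time columns, would need separate handling.

```python
from typing import NamedTuple, Optional


class CtmWord(NamedTuple):
    filename: str
    channel: str
    begin: float
    duration: float
    word: str
    confidence: Optional[float]


def parse_ctm_line(line: str) -> Optional[CtmWord]:
    """Parse one CTM record; return None for comment and blank lines."""
    fields = line.split()
    if not fields or fields[0].startswith(";;"):
        return None
    if len(fields) not in (5, 6):
        raise ValueError(f"expected 5 or 6 fields in CTM record: {line!r}")
    confidence = float(fields[5]) if len(fields) == 6 else None
    return CtmWord(fields[0], fields[1], float(fields[2]),
                   float(fields[3]), fields[4], confidence)


raw = [
    ";;  Comments follow ';;'",
    "",
    "7654 B 1.34 0.2  I -6.763",
    "7654 A 12.00 0.34 YOU -12.384530",
    "7654 A 11.34 0.2  YES -6.763",
]
parsed = [parse_ctm_line(line) for line in raw]
words = [record for record in parsed if record is not None]
# Columns 1 and 2 as text, column 3 as a number.
words.sort(key=lambda w: (w.filename, w.channel, w.begin))
```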
  <p>
       For CTM reference files, a format extension exists to permit
       marking  alternate  transcripts.   The  alternation uses the
       same file format  as  described  above,  except  three  word
       strings, "&lt;ALT_BEGIN&gt;", "&lt;ALT&gt;" and "&lt;ALT_END&gt;", are used to
       delimit the alternation.  Each tag is  treated  as  a  word,
       with  a conversation id, channel and "*"'s for the begin and
       duration time.
  <p>
       The alternation is begun using the word  "&lt;ALT_BEGIN&gt;"  and
       terminated using the word "&lt;ALT_END&gt;".  Between the start
       and  end  are  at  least  two  alternative  time-marked  word
       sequences separated by the word "&lt;ALT&gt;".  Each word sequence
       can contain any number of words.  An empty alternative
       signifies a null word.
  <p>
       Below is an example alternate reference transcript for  the
       words "uh" and "um".
  <p>
    <ul>
       ;;
       <br>
       7654 A   *    *   &lt;ALT_BEGIN&gt;
       <br>
       7654 A 12.00 0.34 UM
       <br>
       7654 A   *    *   &lt;ALT&gt;
       <br>
       7654 A 12.00 0.34 UH
       <br>
       7654 A   *    *   &lt;ALT_END&gt;
    </ul>
  </ul>
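  <p>
  A minimal Python sketch of grouping the records of a CTM reference file
  into plain words and alternations.  The function name and the returned
  structure are hypothetical, and the sketch assumes alternations are not
  nested.

```python
def group_alternations(lines):
    """Return a list whose items are either one record (a list of fields)
    or one alternation (a list of alternatives, each a list of records)."""
    items = []
    alternatives = None  # None while outside an alternation
    for line in lines:
        fields = line.split()
        if not fields or fields[0].startswith(";;"):
            continue  # comment or blank line
        if len(fields) < 5:
            raise ValueError(f"too few fields in CTM record: {line!r}")
        word = fields[4]
        if word == "<ALT_BEGIN>":
            alternatives = [[]]
        elif word in ("<ALT>", "<ALT_END>"):
            if alternatives is None:
                raise ValueError(f"{word} outside an alternation")
            if word == "<ALT>":
                alternatives.append([])
            else:
                items.append(alternatives)
                alternatives = None
        elif alternatives is not None:
            alternatives[-1].append(fields)
        else:
            items.append(fields)
    if alternatives is not None:
        raise ValueError("missing <ALT_END>")
    return items


example = [
    ";;",
    "7654 A   *    *   <ALT_BEGIN>",
    "7654 A 12.00 0.34 UM",
    "7654 A   *    *   <ALT>",
    "7654 A 12.00 0.34 UH",
    "7654 A   *    *   <ALT_END>",
]
grouped = group_alternations(example)
```

  For the example, grouped holds a single alternation with two alternatives,
  one containing the record for "UM" and the other the record for "UH".  An
  alternative left empty stays an empty list, matching the null word case.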
  </body>
  </html>