
Sclite Input file formats: trn, txt, stm, ctm

The inputs to "sclite" are a reference file and one or more hypothesis files, the text portions of which may be either ASCII characters or GB encoded Chinese characters. A number of different input formats are permitted: "trn", "txt", "stm", and "ctm". As new scoring paradigms were created for the ARPA tests, accompanying formats were created to support the evaluations.

trn - Definition of a transcript input file

The transcript format is a file of word sequence records separated by newlines. Each record contains a word sequence, followed by an utterance ID enclosed in parentheses. See the '-i' option for a list of accepted utterance id types.

example.

she had your dark suit in greasy wash water all year (cmh_sa01)
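A minimal Python sketch of how such a record could be split into its word sequence and utterance ID. The function name `parse_trn_line` and the regular expression are assumptions made for illustration; they are not part of sclite, and the sketch assumes the ID is the final parenthesized token on the line.

```python
import re

# Assumed record shape: "<word sequence> (<utterance id>)", id at end of line.
_TRN_RECORD = re.compile(r"^(?P<words>.*?)\s*\((?P<utt_id>[^()\s]+)\)\s*$")


def parse_trn_line(line):
    """Split one trn record into (list of words, utterance id)."""
    match = _TRN_RECORD.match(line)
    if match is None:
        raise ValueError(f"not a trn record: {line!r}")
    return match.group("words").split(), match.group("utt_id")


words, utt_id = parse_trn_line(
    "she had your dark suit in greasy wash water all year (cmh_sa01)"
)
print(utt_id, len(words))
```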

Transcript alternations, described above, can be used in the word sequence by using this BNF format:

The "@" represents a NULL word in the transcript. For scoring purposes, an error is not counted if the "@" is aligned as an insertion.

example

i've { um / uh / @ } as far as i'm concerned
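To make the meaning of an alternation concrete, the following Python sketch enumerates every word sequence that a transcript with alternations permits. It is an illustration only: `expand_alternations` is a hypothetical helper, it assumes alternations are not nested and that no word contains "/", and it models "@" as an empty word sequence rather than reproducing sclite's alignment rules.

```python
import itertools
import re

# Matches one flat (non-nested) alternation such as "{ um / uh / @ }".
_ALTERNATION = re.compile(r"\{([^{}]*)\}")


def expand_alternations(transcript):
    """Return every word sequence allowed by flat { a / b / @ } groups."""
    # With one capture group, re.split yields plain text at even indexes
    # and alternation bodies at odd indexes.
    pieces = _ALTERNATION.split(transcript)
    choices = []
    for index, piece in enumerate(pieces):
        if index % 2 == 0:
            choices.append([piece.split()])
        else:
            options = []
            for alternative in piece.split("/"):
                words = alternative.split()
                # "@" is the NULL word: this alternative contributes nothing.
                options.append([] if words == ["@"] else words)
            choices.append(options)
    return [
        [word for part in combination for word in part]
        for combination in itertools.product(*choices)
    ]


for sequence in expand_alternations("i've { um / uh / @ } as far as i'm concerned"):
    print(" ".join(sequence))
```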

txt - Definition of a text input file

This format is simply free-form text with no page, paragraph, sentence or speaker breaks.

stm - Definition of segment time mark input file

This describes the segment time marked files to be used for scoring the output of speech recognizers via the NIST sclite() program. This is a reference file format.

The segment time mark file consists of a concatenation of text segment records from a waveform file. Each record is separated by a newline and contains: the waveform's filename and channel identifier [A | B], the talker's id, begin and end times (in seconds), an optional subset label, and the text for the segment. Each record follows this BNF format:

STM :== <F> <C> <S> <BT> <ET> [ <LABEL> ] transcript . . .

<F> The waveform filename. NOTE: no pathnames or extensions are expected.

<C> The waveform channel. The text of the waveform channel is not restricted by sclite. The text can be any text string without whitespace so long as the matching string is found in both the reference and hypothesis input files.

<S> The speaker id; no restrictions apply to this name.

<BT> The begin time (seconds) of the segment.

<ET> The end time (seconds) of the segment.

<LABEL> A comma-separated list of subset identifiers enclosed in angle brackets. Ex. "<O,F,00>". See "USING STM FORMAT FOR LABELED UTTERANCE REPORTS (LUR)" below.

The list of words can contain a transcript alternation using the following BNF format:

The "@" represents a NULL word in the transcript. For scoring purposes, an error is not counted if the "@" is aligned as an insertion.

Example: "i've { um / uh / @ } as far as i'm concerned"

When the string "IGNORE_TIME_SEGMENT_IN_SCORING" is used as the transcript, the process which chops the hypothesis file into matching reference segments ignores all hypothesis words whose time-midpoints occur within the reference segment's beginning and ending times. The effect is to declare this segment's region "out-of-bounds" for scoring, thus generating no errors from that time region.

NOTE: this only works with DP alignment of a reference stm file and hypothesis ctm file.
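The midpoint rule can be sketched in Python as follows. This is an illustration under stated assumptions, not sclite's implementation: the helper `drop_ignored_words` is hypothetical, the segment boundaries are treated as inclusive (the text does not say), and matching on waveform filename and channel is omitted for brevity.

```python
IGNORE = "IGNORE_TIME_SEGMENT_IN_SCORING"


def drop_ignored_words(hyp_words, ref_segments):
    """Remove hypothesis words whose time-midpoint lies in an ignored segment.

    hyp_words:    iterable of (begin_seconds, duration_seconds, word)
    ref_segments: iterable of (begin_seconds, end_seconds, transcript)
    """
    ignored = [
        (begin, end)
        for begin, end, transcript in ref_segments
        if transcript.strip() == IGNORE
    ]
    kept = []
    for begin, duration, word in hyp_words:
        midpoint = begin + duration / 2.0
        # Assumption: both segment boundaries count as inside the segment.
        if not any(start <= midpoint <= stop for start, stop in ignored):
            kept.append((begin, duration, word))
    return kept


# Made-up times and words, for illustration only.
segments = [(0.0, 2.0, "hello there"), (2.0, 4.0, IGNORE)]
words = [(0.5, 0.4, "hello"), (1.2, 0.5, "there"),
         (2.5, 0.6, "noise"), (3.9, 0.4, "okay")]
print([word for _, _, word in drop_ignored_words(words, segments)])
```

In this sketch the word "okay" is kept because its midpoint (about 4.1 seconds) falls after the ignored segment ends, even though the word begins inside it.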

The file must be sorted by the first and second columns in ASCII order, and the fourth in numeric order. The UNIX sort command: "sort +0 -1 +1 -2 +3nb -4" will sort the words into appropriate order.

Lines beginning with ';;' are considered comments and are ignored. Blank lines are also ignored.
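The record layout, the comment rule, and the sort order described above can be sketched together in Python. The names `StmSegment`, `parse_stm`, and `sort_stm` are hypothetical, the sample records are made up, and the sketch assumes the optional label is a single whitespace-free token that starts with "<" and ends with ">".

```python
from typing import NamedTuple, Optional


class StmSegment(NamedTuple):
    filename: str
    channel: str
    speaker: str
    begin: float
    end: float
    label: Optional[str]
    transcript: str


def parse_stm(text):
    """Parse STM text, skipping blank lines and ';;' comment lines."""
    segments = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith(";;"):
            continue
        fields = stripped.split(None, 5)
        if len(fields) < 5:
            raise ValueError(f"too few fields in STM record: {line!r}")
        filename, channel, speaker, begin, end = fields[:5]
        rest = fields[5] if len(fields) > 5 else ""
        label = None
        tokens = rest.split(None, 1)
        # Assumption: a leading "<...>" token is the optional subset label.
        if tokens and tokens[0].startswith("<") and tokens[0].endswith(">"):
            label = tokens[0]
            rest = tokens[1] if len(tokens) > 1 else ""
        segments.append(StmSegment(filename, channel, speaker,
                                   float(begin), float(end), label, rest.strip()))
    return segments


def sort_stm(segments):
    """Order by filename and channel (string order), then begin time (numeric)."""
    return sorted(segments, key=lambda s: (s.filename, s.channel, s.begin))


sample = """\
;; made-up records, for illustration only
show1 A spk2 2.50 4.00 in greasy wash water

show1 A spk1 0.00 2.50 <O,F,00> she had your dark suit
"""
for segment in sort_stm(parse_stm(sample)):
    print(segment)
```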

For the Fall '95 ARPA CSR Evaluation, it was desirable to not only report overall error-rate statistics but also error-rate statistics for arbitrary partitions and/or groups of partitions within the test set. To this end, the STM file format was extended to encode arbitrary subset information for each segment.

The subset id. Used to label each segment that belongs to the subset. The format is arbitrary, but without spaces.

Used as column headings in generated reports. Format is arbitrary.

Used for subset descriptions in generated reports. May be of arbitrary length and format. Double backslashes '\\' add a line feed.

Each position within the label field, separated by commas, defines a group of subsets that are presented separately in the generated reports. So for instance, the first group might be all segments, the second might be either male or female, and the third might be the story. The example below shows an STM file encoded with this information.
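As a rough illustration of how the label positions partition segments, the Python sketch below groups segment indexes by position and subset id. The helpers `split_label` and `group_by_position` are hypothetical, and the labels are made up in the style of the "<O,F,00>" example; this is not how sclite builds its reports.

```python
from collections import defaultdict


def split_label(label):
    """Turn a label such as '<O,F,00>' into ['O', 'F', '00']."""
    if not (label.startswith("<") and label.endswith(">")):
        raise ValueError(f"label must be enclosed in angle brackets: {label!r}")
    return label[1:-1].split(",")


def group_by_position(labels):
    """Map position -> subset id -> indexes of the segments carrying that id."""
    groups = defaultdict(lambda: defaultdict(list))
    for segment_index, label in enumerate(labels):
        for position, subset_id in enumerate(split_label(label)):
            groups[position][subset_id].append(segment_index)
    return groups


# Made-up labels: position 0 = all segments, 1 = sex, 2 = story.
labels = ["<O,F,00>", "<O,M,00>", "<O,F,01>"]
groups = group_by_position(labels)
for position in sorted(groups):
    print(position, dict(groups[position]))
```

Each position yields its own partition of the segments, which matches the idea that every position is reported as a separate group.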

ctm - Definition of time marked conversation scoring input

This describes the time marked conversation input files to be used for scoring the output of speech recognizers via the NIST sclite() program. Both the reference and hypothesis input files can share this format.

The ctm file format is a concatenation of time mark records for each word in each channel of a waveform. The records are separated with a newline. Each word token must have a waveform id, channel identifier [A | B], start time, duration, and word text. Optionally a confidence score can be appended for each word. Each record follows this BNF format:

CTM :== <F> <C> <BT> <DUR> word [ <CONF> ]

<F> The waveform filename. NOTE: no pathnames or extensions are expected.

<C> The waveform channel. Either "A" or "B". The text of the waveform channel is not restricted by sclite. The text can be any text string without whitespace so long as the matching string is found in both the reference and hypothesis input files.

<BT> The begin time (seconds) of the word, measured from the start time of the file.

<DUR> The duration (seconds) of the word.

<CONF> Optional confidence score. It is proposed that this score will be used in the future.

The file must be sorted by the first three columns: the first and the second in ASCII order, and the third in numeric order. The UNIX sort command: "sort +0 -1 +1 -2 +2nb -3" will sort the words into appropriate order.

Lines beginning with ';;' are considered comments and are ignored. Blank lines are also ignored.
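A minimal Python sketch of reading and ordering CTM records along the lines described above. `CtmWord`, `parse_ctm`, and `sort_ctm` are hypothetical names and the sample records are made up. The sketch handles only plain word records with numeric times; records whose time fields are "*" (the alternation tags of the reference-file extension) are not handled here.

```python
from typing import NamedTuple, Optional


class CtmWord(NamedTuple):
    filename: str
    channel: str
    begin: float
    duration: float
    word: str
    confidence: Optional[float]


def parse_ctm(text):
    """Parse CTM text, skipping blank lines and ';;' comment lines."""
    words = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith(";;"):
            continue
        fields = stripped.split()
        if len(fields) not in (5, 6):
            raise ValueError(f"expected 5 or 6 fields in CTM record: {line!r}")
        confidence = float(fields[5]) if len(fields) == 6 else None
        words.append(CtmWord(fields[0], fields[1], float(fields[2]),
                             float(fields[3]), fields[4], confidence))
    return words


def sort_ctm(words):
    """Order by filename and channel (string order), then begin time (numeric)."""
    return sorted(words, key=lambda w: (w.filename, w.channel, w.begin))


sample = """\
;; made-up records, for illustration only
show1 A 0.52 0.30 dark 0.91
show1 A 0.10 0.40 she
"""
for word in sort_ctm(parse_ctm(sample)):
    print(word)
```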

Included below is an example:

For CTM reference files, a format extension exists to permit marking alternate transcripts. The alternation uses the same file format as described above, except three word strings, "<ALT_BEGIN>", "<ALT>" and "<ALT_END>", are used to delimit the alternation. Each tag is treated as a word, with a conversation id, channel and "*"'s for the begin and duration time.

The alternation is begun using the word "<ALT_BEGIN>", and terminated using the word "<ALT_END>". In between the start and end are at least 2 alternative time-marked word sequences separated by the word "<ALT>". Each word sequence can contain any number of words. An empty alternative signifies a null word.

Below is an example alternate reference transcript for the words "uh" and "um".
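The following Python sketch shows hypothetical records of this kind and one way to group them into alternatives. The waveform id `show1`, the channel, and all times are invented for illustration, `group_alternations` is not part of sclite, and the sketch assumes alternations are not nested.

```python
def group_alternations(records):
    """Group CTM records into plain words and alternation blocks.

    records: lists of fields [file, channel, begin, duration, word, ...].
    Returns ("word", record) and ("alt", [sequence, sequence, ...]) tuples.
    """
    result = []
    alternatives = None  # None while outside an alternation
    for record in records:
        word = record[4]
        if word == "<ALT_BEGIN>":
            if alternatives is not None:
                raise ValueError("nested alternations are not handled")
            alternatives = [[]]
        elif word == "<ALT>":
            if alternatives is None:
                raise ValueError("<ALT> outside an alternation")
            alternatives.append([])
        elif word == "<ALT_END>":
            if alternatives is None or len(alternatives) < 2:
                raise ValueError("an alternation needs at least 2 alternatives")
            result.append(("alt", alternatives))
            alternatives = None
        elif alternatives is not None:
            alternatives[-1].append(record)
        else:
            result.append(("word", record))
    if alternatives is not None:
        raise ValueError("missing <ALT_END>")
    return result


sample = """\
show1 A 11.50 0.40 well
show1 A * * <ALT_BEGIN>
show1 A 12.00 0.34 uh
show1 A * * <ALT>
show1 A 12.00 0.34 um
show1 A * * <ALT_END>
show1 A 12.40 0.30 yes
"""
records = [line.split() for line in sample.splitlines() if line.strip()]
for kind, value in group_alternations(records):
    print(kind, value)
```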