The program sclite is a tool for scoring and evaluating the output of speech recognition systems. Sclite is part of the NIST SCTK Scoring Toolkit. The program compares the hypothesized text (HYP) output by the speech recognizer to the correct, or reference (REF), text. After comparing REF to HYP (a process called alignment), statistics are gathered during the scoring process and a variety of reports can be produced to summarize the performance of the recognition system.
The Alignment process consists of two steps: 1) selecting matching REF and HYP texts, and 2) performing an alignment of the reference and hypothesis texts.
Step 1: Selection of matching REF and HYP texts
Utterance ID Matching:
When alignments are performed via DP, corresponding REF and HYP records with the same utterance IDs are located in the REF and HYP files. DP alignment and scoring are then performed on each pair of records. Only the utterance IDs present in the HYP file are aligned and scored, which means the REF file may contain more utterance records than the HYP file.
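As an illustration, a minimal sketch of this pairing step, assuming the "trn" transcript format in which each record is a line of words terminated by its utterance ID in parentheses (the file names and helper function below are hypothetical):

    import re

    def load_trn(path):
        """Parse a trn-style transcript file: words followed by '(utterance-id)'.
        Returns a dict mapping utterance ID -> list of words."""
        records = {}
        with open(path) as f:
            for line in f:
                m = re.match(r"^(.*)\((\S+)\)\s*$", line.strip())
                if m:
                    records[m.group(2)] = m.group(1).split()
        return records

    ref = load_trn("ref.trn")   # hypothetical file names
    hyp = load_trn("hyp.trn")

    # Only the utterance IDs present in the HYP file are aligned and scored;
    # extra records in the REF file are simply never paired.
    pairs = [(utt_id, ref[utt_id], hyp[utt_id]) for utt_id in hyp if utt_id in ref]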
When "diff" is used for alignment, corresponding REF and HYP records with the same utterance id's are located in the REF and HYP files. Rather than execute "diff" for each pair of records, all matching REF and HYP pairs are re-formatted to be newline separated words and written to a temporary files. Using the two temporary files, "diff" is then called to perform a global alignment. The output of "diff" is re-chunking into REF/HYP records by applying the rule: include all words in the output stream up to and including the last word in the reference record.
The reference file can contain extra transcripts; only the needed transcripts are loaded.
Word Time Mark Matching:
By default, the DP alignment is performed using word-to-word distance measures of 0, 3, 3, and 4 for correct words, insertions, deletions and substitutions respectively.
Optionally, the command line flag '-T' forces the alignments to be time-mediated.
Reference Segment Time Mark to Hypothesis Word Time Mark
When DP alignments are performed, the hypothesis file is segmented to match the reference segments by selecting the string of hypothesized words whose times occur before the end of each reference segment. The midpoint time of a word is used to determine if the word falls within a segment. DP alignments are then performed on the selected hypothesis words and the reference segment.
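A sketch of that segmentation step, assuming each hypothesis word carries a begin time and a duration as in a "ctm" file (how trailing words past the final segment are handled is an assumption here):

    def segment_hypothesis(hyp_words, ref_segments):
        """hyp_words: list of (word, begin_time, duration); ref_segments: list of
        (segment_begin, segment_end) in time order.  Returns one word list per segment."""
        assigned = [[] for _ in ref_segments]
        for word, begin, duration in hyp_words:
            midpoint = begin + duration / 2.0
            # the word belongs to the first segment that ends after its midpoint;
            # words past the final segment end are attached to the last segment
            target = len(ref_segments) - 1
            for i, (seg_begin, seg_end) in enumerate(ref_segments):
                if midpoint < seg_end:
                    target = i
                    break
            assigned[target].append(word)
        return assigned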
If the alignments are performed via "diff", the input reference and hypothesis texts are pre-processed into temporary reference and hypothesis files with one word per line. GNU's "diff" program is then used to perform a global alignment on the word lists, and its output is re-chunked into segments for scoring. Alternate reference transcripts cannot be used with "diff" alignments.
Reference Segment Time Mark to Hypothesis Text file
Step 2: Text Alignments
Dynamic Programming string alignment:
The DP string alignment algorithm performs a global minimization of a Levenshtein distance function which weights the cost of correct words, insertions, deletions and substitutions as 0, 3, 3 and 4 respectively. The computational complexity of DP is O(N_ref * N_hyp).
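A minimal sketch of such a weighted Levenshtein alignment over two word lists (it returns only the total distance; backtracking to recover the alignment path is omitted):

    # Costs used by sclite's DP alignment: correct=0, insertion=3, deletion=3, substitution=4.
    COSTS = {"correct": 0, "insertion": 3, "deletion": 3, "substitution": 4}

    def dp_align(ref, hyp):
        """Total weighted Levenshtein distance between two word lists."""
        n, m = len(ref), len(hyp)
        # d[i][j] = cost of aligning the first i reference words to the first j hypothesis words
        d = [[0] * (m + 1) for _ in range(n + 1)]
        for i in range(1, n + 1):
            d[i][0] = d[i - 1][0] + COSTS["deletion"]
        for j in range(1, m + 1):
            d[0][j] = d[0][j - 1] + COSTS["insertion"]
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                match = COSTS["correct"] if ref[i - 1] == hyp[j - 1] else COSTS["substitution"]
                d[i][j] = min(d[i - 1][j - 1] + match,           # correct or substitution
                              d[i - 1][j] + COSTS["deletion"],   # reference word deleted
                              d[i][j - 1] + COSTS["insertion"])  # hypothesis word inserted
        return d[n][m]

    # Example: dp_align("this is a test".split(), "this is the test".split()) returns 4
    # (one substitution).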
When evaluating the output of speech recognition systems, the precision of the generated statistics is directly correlated with the accuracy of the reference text. But uttered words can be coarticulated or mumbled to the point where they have ambiguous transcriptions (e.g., "what are" vs. "what're"). In order to represent ambiguous transcriptions more accurately, and not penalize recognition systems, the ARPA community agreed upon a format for specifying alternative reference transcriptions. The convention, applied to the case above, allows the recognition system to output either transcript, "what are" or "what're", and still be correct.
The case above handles ambiguously spoken words which are loud enough for the transcriber to think something should be recognized. For mumbled or quietly spoken words, the ARPA community agreed to neither penalize systems which correctly recognized the word, nor penalize systems which did not. To accommodate this, a NULL word, "@", can be added to an alternative reference transcript. For example, "the" is often spoken quickly with little acoustic evidence. If "the" and "@" are alternates, the recognition system will be given credit for outputting "the" but not penalized if it does not.
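For illustration, reference records using this convention might look like the following (the utterances and IDs are made up; alternates are enclosed in braces and separated by slashes):

    i think { what are / what're } the best (utt_001)
    it was { the / @ } best of times (utt_002)

A hypothesis is scored as correct against such a record if it matches any path through the alternations.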
The presence of alternate transcriptions represents added computational complexity to the DP algorithm. Rather than align all alternate reference texts to the hypothesis text, then choose the lowest error rate alignment, this implementation of DP aligns two word networks, thus reducing the computational complexity from 2^(ref_alts + hyp_alts) * O(N_ref * N_hyp) to O((N_ref+ref_alts) * (N_hyp+hyp_alts)).
As noted above, DP alignment minimizes a distance function that is applied to word pairs. In addition to the standard "word" alignment, which uses a distance function defined by static weights, the sclite DP alignment module can use two other distance functions: the first is called Time-Mediated alignment and the second Word-Weight-Mediated alignment.
Time-mediated alignments are computed by replacing the standard word-to-word distance weights of 0, 3, 3, and 4 with measures based on beginning and ending word times. The formulas for time-mediated word-to-word distances are:

    D(correct)      = |T1(ref) - T1(hyp)| + |T2(ref) - T2(hyp)|
    D(insertion)    = T2(hyp) - T1(hyp)
    D(deletion)     = T2(ref) - T1(ref)
    D(substitution) = |T1(ref) - T1(hyp)| + |T2(ref) - T2(hyp)| + 0.001

Where,

    T1(x) is the beginning time mark of word x
    T2(x) is the ending time mark of word x
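The same distance function can be written as a small helper; the Word tuple and the function name below are illustrative, not part of sclite:

    from collections import namedtuple

    Word = namedtuple("Word", ["text", "t1", "t2"])   # t1 = begin time, t2 = end time

    def time_mediated_distance(ref, hyp):
        """Word-to-word distance for time-mediated alignment.
        Pass ref=None for an insertion and hyp=None for a deletion."""
        if ref is None:                    # insertion: the hypothesis word's duration
            return hyp.t2 - hyp.t1
        if hyp is None:                    # deletion: the reference word's duration
            return ref.t2 - ref.t1
        distance = abs(ref.t1 - hyp.t1) + abs(ref.t2 - hyp.t2)
        if ref.text == hyp.text:           # correct
            return distance
        return distance + 0.001            # substitution carries a small extra penalty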
String alignments via GNU's "diff":
While the DP algorithm has the advantage of flexibility, it is slow for aligning large chunks of text. To address the speed concerns, an alternative string alignment module, which utilizes GNU's "diff", has been added to sclite. The sclite program pre-processes the input reference and hypothesis texts, creating temporary reference and hypothesis files with one word per line. Then GNU's "diff" program is used to perform a global alignment on the word lists and the output is re-chunked into utterances or text segments for scoring.
Alignments can be performed with "diff" in about half the time taken for DP alignments on the standard 300-utterance ARPA CSRNAB test set. However, in the opinion of the author, "diff" has the following drawbacks:
1. it cannot accommodate alternate transcriptions,
2. "diff" does not produce the same alignments as the DP alignments, and
3. it increases the measured error rates.
The categories tallied are:
    Percent of correct words     = (# correct words / # reference words) * 100
    Percent of substituted words = (# substituted words / # reference words) * 100
    Percent of inserted words    = (# inserted words / # reference words) * 100
    Percent of deleted words     = (# deleted words / # reference words) * 100
    Percent of sentence errors   = (# incorrect ref and hyp pairs / # ref and hyp pairs) * 100
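Given the per-category counts produced by the alignment, these percentages reduce to a few lines of arithmetic. The sketch below is illustrative; note that the number of reference words is the sum of the correct, substituted and deleted counts, since insertions have no reference counterpart:

    def word_scores(n_correct, n_substituted, n_inserted, n_deleted):
        """Word scoring percentages from per-category counts."""
        n_ref = n_correct + n_substituted + n_deleted   # number of reference words
        pct = lambda n: 100.0 * n / n_ref
        return {
            "percent correct":       pct(n_correct),
            "percent substitutions": pct(n_substituted),
            "percent insertions":    pct(n_inserted),
            "percent deletions":     pct(n_deleted),
            # overall word error rate: the three error categories combined
            "percent word error":    pct(n_substituted + n_inserted + n_deleted),
        }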
A variation on scoring, called Weighted-Word Scoring, can also be performed by sclite. After Word-Weight-Mediated alignment, the word weights can be tallied to produce weighted-word scores. The formulas for weighted-word scoring are very similar to the word scoring formulas described above; the difference is that rather than every word having the same weight (1 in the case of word scoring), each individual word has its own weight. The word scoring formulas become:
    Weighted percent of correct words     = (sum of W(hyp) for correct words / sum of W(ref)) * 100
    Weighted percent of substituted words = (sum of W(hyp) + W(ref) for substituted words / sum of W(ref)) * 100
    Weighted percent of inserted words    = (sum of W(hyp) for inserted words / sum of W(ref)) * 100
    Weighted percent of deleted words     = (sum of W(ref) for deleted words / sum of W(ref)) * 100
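Under the same assumptions, only the quantities being summed change. The alignment representation (tagged word pairs) and the weight functions below are illustrative:

    def weighted_word_scores(alignment, w_ref, w_hyp):
        """alignment: list of (tag, ref_word, hyp_word) with tag in {'C','S','I','D'}
        for correct, substitution, insertion and deletion.  w_ref and w_hyp return the
        weight of a word; ordinary word scoring is recovered when both always return 1."""
        ref_total = sum(w_ref(r) for tag, r, h in alignment if tag in ("C", "S", "D"))
        correct   = sum(w_hyp(h) for tag, r, h in alignment if tag == "C")
        subs      = sum(w_hyp(h) + w_ref(r) for tag, r, h in alignment if tag == "S")
        ins       = sum(w_hyp(h) for tag, r, h in alignment if tag == "I")
        dels      = sum(w_ref(r) for tag, r, h in alignment if tag == "D")
        pct = lambda x: 100.0 * x / ref_total
        return {"weighted percent correct": pct(correct),
                "weighted percent substitutions": pct(subs),
                "weighted percent insertions": pct(ins),
                "weighted percent deletions": pct(dels)}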
Confidence scores, as they have been implemented, are associated with each hypothesized word. (The issue has been raised whether, for languages such as Mandarin, where character error rate is considered the primary measure of performance, the confidence ought to be associated with characters.) The confidence score p_c associated with a word must be in the closed interval [0,1] and presumably, given the entropy-related metric defined below, in the open interval (0,1). It should represent the system's best estimate of the a posteriori probability that the hypothesized word is correct. (Correctness here is necessarily with respect to an alignment procedure of the reference and hypothesis word strings.)
A single metric to use in the evaluation of confidence scores was adopted at the August meeting. This is a normalized version of the cross entropy, or mutual information. Specifically, the metric is defined as:

    NCE = ( H_max + sum over correctly hypothesized words of log2(p(w))
                  + sum over incorrectly hypothesized words of log2(1 - p(w)) ) / H_max

where H_max = -( n_c * log2(p_c) + (n - n_c) * log2(1 - p_c) ), n is the number of hypothesized words, n_c is the number of correctly hypothesized words, p_c = n_c / n is the average probability that a hypothesized word is correct, and p(w) is the confidence score assigned to word w.
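A sketch of how the metric can be computed from per-word confidences; the function name and the input representation (parallel lists of confidence scores and correct/incorrect flags) are illustrative:

    import math

    def normalized_cross_entropy(confidences, correct):
        """confidences: p(w) for each hypothesized word, strictly inside (0, 1);
        correct: parallel list of booleans, True when the word matched the reference.
        Assumes 0 < p_c < 1, i.e. at least one correct and one incorrect word."""
        n = len(confidences)
        n_c = sum(correct)
        p_c = n_c / n                              # average probability a word is correct
        h_max = -(n_c * math.log2(p_c) + (n - n_c) * math.log2(1.0 - p_c))
        log_likelihood = sum(math.log2(p) if ok else math.log2(1.0 - p)
                             for p, ok in zip(confidences, correct))
        return (h_max + log_likelihood) / h_max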
Sclite will automatically detect the presence of confidence measures when reading in a hypothesis "ctm" file. When sclite detects the confidence scores, the report generated by the "-o sum" option has an additional column containing the Normalized Cross Entropy (NCE).
Output graphs concerning confidence estimates are generated by using the '-C' option. A variety of graphs can be created: