Chop up the words into separate characters before doing
the alignment. It is generally not the practice of the
ARPA community to score at the character level. The
intent of this option is to be able to score Mandarin
Chinese at the character level. The option "NOASCII"
does not separate characters if they are ASCII. The
option "DH" deletes hyphens from the ref and hyp
strings before alignment. This option only works using
the DP alignment algorithm. (-c & -d are exclusive)
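The character splitting described above can be sketched as follows. This is only an illustration, not SCTK code: the function name `split_for_char_scoring` and its keyword arguments are hypothetical, and it assumes that "NOASCII" keeps a word whole only when every character in it is ASCII and that "DH" is applied before splitting.

```python
def split_for_char_scoring(words, noascii=False, dh=False):
    """Return the token list to align, one token per character.

    noascii: keep words made only of ASCII characters whole (assumption:
             the test is applied per word, not per character).
    dh:      delete hyphens before splitting.
    """
    tokens = []
    for word in words:
        if dh:
            word = word.replace("-", "")
        if not word:
            continue  # the word consisted only of hyphens
        if noascii and word.isascii():
            tokens.append(word)   # ASCII word stays a single token
        else:
            tokens.extend(word)   # one token per character
    return tokens


print(split_for_char_scoring(["你好", "re-do"], noascii=True, dh=True))
# ['你', '好', 'redo']
```

The resulting token lists for the ref and hyp strings would then be aligned exactly as word lists are.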
Use GNU diff for alignments rather than the default
dynamic programming. (-c & -d are exclusive)
Perform the alignment using a cost function which
counts fragments, words ending or beginning with a
hyphen, as correct if the spelling up to the hyphen
matches the spelling of the hypothesized word.
Options -F and -d are exclusive.
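A minimal sketch of such a fragment-aware cost function is shown below. The names `fragment_matches` and `substitution_cost` are hypothetical, and the 0/1 costs are illustrative rather than SCTK's actual values. It also assumes that a trailing hyphen is compared against the start of the hypothesized word and a leading hyphen against its end.

```python
def fragment_matches(ref_word, hyp_word):
    """True if ref_word is a fragment whose spelling agrees with hyp_word."""
    if len(ref_word) > 1 and ref_word.endswith("-"):
        # "th-" matches any hypothesized word starting with "th"
        return hyp_word.startswith(ref_word[:-1])
    if len(ref_word) > 1 and ref_word.startswith("-"):
        # "-ing" matches any hypothesized word ending with "ing" (assumption)
        return hyp_word.endswith(ref_word[1:])
    return False


def substitution_cost(ref_word, hyp_word):
    """Illustrative cost: 0 for a match or a matching fragment, 1 otherwise."""
    if ref_word == hyp_word or fragment_matches(ref_word, hyp_word):
        return 0
    return 1


print(substitution_cost("th-", "think"))   # 0
print(substitution_cost("th-", "banana"))  # 1
```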
Define the CMU-Cambridge Statistical Language Modeling Toolkit v2
language model file to be 'LM'. The LM file must be created using the
idngram2lm program. (See the toolkit documentation for details of how
to make the language model.) Currently, SCTK supports 1-, 2- and 3-grams.
The language model is used to compute an individual weight for each
word in the reference and hypothesis strings. The weight is defined
to be Log2(P(word|context)). Each pair of aligned
strings is considered to be independent, so there is
no context for initial words in each pair.
The word-weights are used in two ways: first, as a method to define
word-to-word distances for word-weight-mediated alignment, and second,
to perform weighted word scoring.
Out-of-Vocabulary words get the default weight of 20.0, and optionally
deletable words get a default weight of 0.0.
-m [ ref | hyp ]
When scoring a hypothesis ctm file against a reference
stm file, the time spans of the two may not match
(i.e. the start time of the first word/segment may not
match or the end time of the last word/segment may not
match). When this option is used, the alignment phase of scoring
ignores any segment or word (depending on the
option(s) used) which is not in the time span of the
opposite file. The time span of a file is defined to
be the interval from the start time of the first time mark
to the end time of the last time mark.
The "ref" option reduces the reference segments to
those which are within the hypothesis file time span.
The "hyp" option reduces the hypothesis words to those
which are within the reference file time span.
Both "ref" and "hyp" may be used simultaneously.
The argument -m by itself defaults to '-m ref'.
Exclusive with -d.
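The time-span reduction can be illustrated with the sketch below. It is not SCTK's implementation: the names `time_span` and `reduce_to_span` are hypothetical, time marks are modeled as (start, end, text) tuples in time order, and a mark is assumed to be "in" a span only if it lies entirely inside it.

```python
def time_span(marks):
    """Span of a file: start of the first time mark to end of the last."""
    return marks[0][0], marks[-1][1]


def reduce_to_span(marks, other_marks):
    """Keep only the marks lying entirely inside the span of the other file."""
    start, end = time_span(other_marks)
    return [m for m in marks if m[0] >= start and m[1] <= end]


# Reference segments (as from an stm file) and hypothesis words (as from a ctm file).
ref = [(0.0, 2.0, "hello there"), (2.0, 5.0, "how are you"), (5.0, 9.0, "fine thanks")]
hyp = [(2.0, 2.4, "how"), (2.5, 2.8, "are"), (3.0, 3.4, "you"), (9.5, 9.9, "uh")]

kept_ref = reduce_to_span(ref, hyp)  # corresponds to the "ref" option
kept_hyp = reduce_to_span(hyp, ref)  # corresponds to the "hyp" option

print([text for _, _, text in kept_ref])  # ['how are you', 'fine thanks']
print([text for _, _, text in kept_hyp])  # ['how', 'are', 'you']
```

In this sketch both spans are computed from the original, unreduced files when "ref" and "hyp" are used together; whether SCTK does the same is an assumption.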
Do case-sensitive alignments. Otherwise all input is mapped to
a single case before scoring. Of course, GB and EUC encoded text
data is never case-converted.
-S algo1 lexicon [ ASCIITOO ]
The '-S' option performs an inferred word segmentation
alignment algorithm. This option
is intended to be used for the LVCSR evaluation of Mandarin
Chinese. A problem with scoring Mandarin at the
word level is the lack of clearly defined words in Mandarin
text. This option implements an algorithm which,
given a word segmentation for the reference string and
a "lexicon" of legal words, computes a minimal error
rate word alignment. The algorithm is as follows:
- Convert the previously word-segmented reference
string into a word network.
- Convert the hypothesis text to a string of characters,
each character representing a word. The data
represented is then converted to a network.
ex. * --- A --- * --- T --- * --- 0 --- *
- Consider all possible sequences of letters through
the network. If a sequence creates a word which is
represented in the lexicon, add an arc to the network
representing the word. The maximum characters per word
is limited to the maximum word length in the lexicon.
                  ,-------- TO -------.
                 /                     \
ex. * --- A --- * --- T --- * --- 0 --- *
     \                     /
      `------- AT --------'
- DP Align the reference and hypothesis networks, and
extract a minimal cost path.
The supplied "lexicon" must be a sorted list of word
records, each separated by a newline. Only the first
column, separated by whitespace, is read in and used
for the lexicon. By default, the algorithm only
separates hypothesis characters that are GB or EUC
encoded. If the option "ASCIITOO" is used, ASCII
hypothesis words are also converted to characters in
step 2.
Exclusive with -d.
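The arc-adding step of the algorithm above (step 3) can be sketched as follows. This is an illustration under stated assumptions, not SCTK code: the function name `lexicon_arcs` is hypothetical, the network is represented as a plain list of (start node, end node, label) arcs, and the final DP alignment step is not shown.

```python
def lexicon_arcs(chars, lexicon):
    """Return (start, end, label) arcs for a character sequence.

    Nodes are numbered 0..len(chars); an arc (i, j, w) spans chars[i:j].
    """
    # Longest word in the lexicon bounds the length of any added arc.
    max_len = max((len(w) for w in lexicon), default=1)
    # One arc per character: the initial linear network.
    arcs = [(i, i + 1, c) for i, c in enumerate(chars)]
    # Add an arc for every multi-character sequence found in the lexicon.
    for i in range(len(chars)):
        for j in range(i + 2, min(i + max_len, len(chars)) + 1):
            candidate = "".join(chars[i:j])
            if candidate in lexicon:
                arcs.append((i, j, candidate))
    return arcs


print(lexicon_arcs(list("ATO"), {"AT", "TO"}))
# [(0, 1, 'A'), (1, 2, 'T'), (2, 3, 'O'), (0, 2, 'AT'), (1, 3, 'TO')]
```

A set is used for the lexicon here for fast membership tests; the sorted-file requirement stated above applies to SCTK's own reader, not to this sketch.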
-S algo2 lexicon [ ASCIITOO ]
Perform an algorithm similar to the one described in '-S algo1', except that
the roles of the reference and hypothesis transcripts are reversed.
In this algorithm, the segmentation of the hypothesis text is held
constant, while the reference transcript is converted to characters
and arcs are added to the network for words
found in the lexicon. Both "lexicon" and "ASCIITOO" have the same
usage as in algo1.
Exclusive with -d.
-w wwl_file
Define the word-weight list (WWL) file to be 'wwl_file'. The WWL file
defines an arbitrary weight for each word in the lexicon. The weights are
used in two ways: first, as a method to define word-to-word distances
for word-weight-mediated alignment, and second, to perform
weighted word scoring.
If the supplied WWL filename is "unity", then no file of weights is read in.
Instead, this is a shorthand notation to use a weight of 1.0 for all words.
Optionally deletable words get a default weight of 0.0 (even if "unity"
is supplied as the WWL filename).
The format of the WWL file is as follows.
Comment lines begin with
double semi-colons. There are two forms of "special" comment lines. The
first defines heading labels for each column in the table. The format for this
line is:
;; 'Headings' '<COL1>' '<COL2>' '<COL3>' ....
The label for column 1 should be "Word Spelling" since this column is the
word's text. The labels for columns 2 through 10 are defined by the user.
The second "special" comment line defines the default weight applied to
out-of-vocabulary words if any exist. The format for this line is:
;; Default missing weight '<number>'
'number' must be a floating point number.
The remainder of the file consists of word records, each word record separated by
a newline. The format of each record is:
There should be no whitespace at the beginning of the line, and the word
text cannot include whitespace. The remainder of the line consists of
whitespace-separated floating point weights; up to a maximum of 10 weights
can be assigned per word.
NOTE: The current version of SCTK only utilizes the first weight.
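A reader for the format described above might look like the sketch below. It is not SCTK's parser: the name `parse_wwl` and the sample data are hypothetical, and it assumes the single quotes shown in the special comment lines may or may not appear literally in a real file, so both forms are accepted.

```python
import re


def parse_wwl(text):
    """Parse WWL text into (headings, default_weight, weights)."""
    headings, default_weight, weights = [], None, {}
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.startswith(";;"):
            body = line[2:].strip()
            m = re.match(r"'?Headings'?\s+(.*)", body)
            if m:  # column heading labels
                headings = re.findall(r"'([^']*)'", m.group(1))
                continue
            m = re.match(r"Default missing weight\s+'?([^'\s]+)'?", body)
            if m:  # weight applied to out-of-vocabulary words
                default_weight = float(m.group(1))
            continue  # any other ";;" line is an ordinary comment
        fields = line.split()
        # First column is the word text, the rest are its weights.
        weights[fields[0]] = [float(x) for x in fields[1:]]
    return headings, default_weight, weights


sample = """\
;; 'Headings' 'Word Spelling' 'Weight'
;; Default missing weight '5.0'
;; an ordinary comment
the 0.5
speech 3.25 1.0
"""

headings, default_weight, weights = parse_wwl(sample)
print(headings)           # ['Word Spelling', 'Weight']
print(default_weight)     # 5.0
print(weights["speech"])  # [3.25, 1.0]
```

Since only the first weight is currently used, a lookup would take `weights[word][0]` and fall back to `default_weight` for words not in the table.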