File GLMRules.txt V. 2.0 11/12/96 NIST/WMF
1. General.
The rules that program "rfilter1" uses are simple string-rewriting
rules of the form
A => B
or
A => B / C __ D
where A, B, C, and D are character strings. In the second form, A is
rewritten to B only when it is preceded by C and followed by D.
Each of these strings may contain internal spaces. Strings may be
bounded with either square brackets or single quotation marks, in
order to allow beginning and ending spaces.
The main purpose of using rules like these is to document a
string-rewriting process in a form that is easily understood.
Examples:
CANCELLED => CANCELED ;; per AHD
[  ] => [ ] ;; reduce 2 spaces to 1
Falkner => Faulkner / [William ] __
JETLINER => JET LINER
VIDEOTAPE => VIDEO TAPE / [ ] __ [ ]
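The rule syntax above can be read mechanically. The following Python
sketch is not taken from "rfilter1"; it is a minimal illustration that
assumes the comment token is ";;" and that the tokens "=>", "/", and
"__" never occur inside a rule's strings. The names Rule, unwrap, and
parse_rule are hypothetical.

```python
from typing import NamedTuple


class Rule(NamedTuple):
    a: str           # input field (A)
    b: str           # output field (B)
    left: str = ""   # left context (C), empty if absent
    right: str = ""  # right context (D), empty if absent


def unwrap(field: str) -> str:
    """Trim outer whitespace, then drop one pair of [..] or '..' delimiters."""
    field = field.strip()
    if len(field) >= 2 and (
        (field[0] == "[" and field[-1] == "]") or field[0] == field[-1] == "'"
    ):
        return field[1:-1]  # spaces inside the delimiters are preserved
    return field


def parse_rule(line: str, comment: str = ";;") -> Rule:
    """Parse 'A => B' or 'A => B / C __ D' (assumed comment token: ';;')."""
    body = line.split(comment, 1)[0]
    lhs, arrow, rest = body.partition("=>")
    if not arrow:
        raise ValueError(f"not a rule: {line!r}")
    rhs, _, context = rest.partition("/")
    left, _, right = context.partition("__")
    return Rule(unwrap(lhs), unwrap(rhs), unwrap(left), unwrap(right))
```

With this sketch, parse_rule("Falkner => Faulkner / [William ] __")
yields a rule whose left context is "William " (with the trailing
space kept) and whose right context is empty.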
2. Application Algorithm.
In rewriting the input string into the output string, a character
cursor is moved from the start to the end of the input string. At
each position, the list of rules in the specified rule file is
searched from top to bottom. The first rule (if any) that is found to
match the input string with the first character of its input field (A)
lined up at the cursor position is applied by concatenating its output
field (B) onto the output string and advancing the cursor by the
number of characters in the rule's input field (A). If no rule
matches, then the input character pointed to by the cursor is either
passed on to the output string or ignored, depending on a switch
setting, and the cursor position is advanced by one.
An indexing scheme is used for speed, but the logical effect is
the same as a linear search of the rules in the order in which they
appear in the rules file.
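The procedure described in this section can be sketched in Python as
follows. This is an illustration of the logical effect only, using a
plain linear search rather than the indexing scheme. It assumes that
rules are given as (A, B, C, D) tuples with empty strings for missing
contexts, that contexts are tested against the input string, and that
matching is case-sensitive. The function name apply_rules and its
copy_no_hit parameter are hypothetical.

```python
def apply_rules(text, rules, copy_no_hit=True):
    """Rewrite text in one left-to-right pass; the first matching rule wins."""
    out = []
    pos = 0
    while pos < len(text):
        for a, b, left, right in rules:
            if (
                a
                and text.startswith(a, pos)               # A lines up at the cursor
                and text.endswith(left, 0, pos)           # C precedes the cursor
                and text.startswith(right, pos + len(a))  # D follows A
            ):
                out.append(b)   # concatenate B onto the output
                pos += len(a)   # advance by the length of A
                break
        else:
            # No rule matched: copy or skip the character, per the switch.
            if copy_no_hit:
                out.append(text[pos])
            pos += 1
    return "".join(out)


rules = [
    ("CANCELLED", "CANCELED", "", ""),
    ("Falkner", "Faulkner", "William ", ""),
    ("VIDEOTAPE", "VIDEO TAPE", " ", " "),
]
result = apply_rules("William Falkner CANCELLED a VIDEOTAPE deal", rules)
# result == "William Faulkner CANCELED a VIDEO TAPE deal"
```

Because the cursor moves only over the input string, rewritten output
is never rescanned: with the single rule that reduces two spaces to
one, four consecutive spaces come out as two, not one.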
3. Comments.
The comment token is taken to be the first token on the first line
of the file of rules. Anything on a line after this comment token is
ignored by the program. The examples in this document use ";;".
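As a rough illustration of this convention, the Python sketch below
takes the first whitespace-delimited token of the first line as the
comment flag and removes everything from that flag onward on every
line. Treating tokens as whitespace-delimited is an assumption, and
the function name strip_comments is hypothetical.

```python
def strip_comments(lines):
    """Drop comments, using the first token of the first line as the flag."""
    if not lines:
        return []
    tokens = lines[0].split()
    if not tokens:
        return list(lines)  # blank first line: no flag to work with
    flag = tokens[0]
    # Keep only the text before the first occurrence of the flag.
    return [line.split(flag, 1)[0].rstrip() for line in lines]


cleaned = strip_comments([
    ";; spelling rules",
    "CANCELLED => CANCELED ;; per AHD",
    "JETLINER => JET LINER",
])
# cleaned == ["", "CANCELLED => CANCELED", "JETLINER => JET LINER"]
```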
4. Header Information.
At the beginning of a file of rules, certain information may be
given in keyword/value format in "auxiliary" lines that begin with
"*". The value must be the only token on the line bounded by quotes,
either single or double; the keyword need only appear as some token on
the line.
Here are the current keywords and values:
KEYWORD          VALUE
NAME             The name of the rules, for documentation.
DESC             A description of the rules, for documentation.
FORMAT           NIST1 : context-free rules only (the default)
                 NIST2 : context-sensitive rules
MAX_NRULES       <N> : the number of rules to allocate space for.
COPY_NO_HIT      "T" / "YES" / "TRUE"  : if no rule hits, copy input
                 "F" / "NO" / "FALSE"  : if no rule hits, skip input
CASE_SENSITIVE   "T" / "YES" / "TRUE"  : case-sensitive matching.
                 "F" / "NO" / "FALSE"  : case-insensitive matching.
Both the keyword and its value (when alphabetic) may be upper- or
lower-case.
Examples:
* NAME = "spcor1.rls"
* DESC : "Spelling Correction Rules #1"
* FORMAT = 'NIST2'
* MAX_NRULES = '200'
* COPY_NO_HIT = 'T'
* CASE_SENSITIVE = 'F'
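One way to read such auxiliary lines is sketched below in Python. It
is an illustration, not the program's own parser: it assumes that "*"
may be preceded by whitespace, that "=" and ":" act only as
separators, and that the first quoted token on the line is the value.
The names parse_header_line and as_bool are hypothetical.

```python
import re

KEYWORDS = {"NAME", "DESC", "FORMAT", "MAX_NRULES", "COPY_NO_HIT", "CASE_SENSITIVE"}
TRUE_VALUES = {"T", "YES", "TRUE"}
FALSE_VALUES = {"F", "NO", "FALSE"}

# A value bounded by matching single or double quotes.
QUOTED = re.compile(r"""(['"])(.*?)\1""")


def parse_header_line(line):
    """Return (KEYWORD, value) for an auxiliary '*' line, else None."""
    if not line.lstrip().startswith("*"):
        return None
    match = QUOTED.search(line)
    if match is None:
        return None
    value = match.group(2)
    # Look for a known keyword among the remaining tokens, ignoring case.
    rest = line[:match.start()] + " " + line[match.end():]
    for token in re.split(r"[\s*=:]+", rest):
        if token.upper() in KEYWORDS:
            return token.upper(), value
    return None


def as_bool(value):
    """Interpret a COPY_NO_HIT or CASE_SENSITIVE value, ignoring case."""
    upper = value.upper()
    if upper in TRUE_VALUES:
        return True
    if upper in FALSE_VALUES:
        return False
    raise ValueError(f"not a boolean value: {value!r}")
```

For the examples above, parse_header_line("* FORMAT = 'NIST2'") gives
("FORMAT", "NIST2"), and as_bool("F") gives False.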
History:
V2.0 - part of the tranfilt distribution as rules.doc
V2.1 - Moved to SCTK