| File ichartp.icn |
###########################################################################
File: ichartp.icn
Subject: Procedures for a simple chart parser
Author: Richard L. Goerwitz
Date: August 3, 2000
###########################################################################
This file is in the public domain.
###########################################################################
Version: 1.11
###########################################################################
General:
Ichartp implements a simple chart parser - a slow but
easy-to-implement strategy for parsing context free grammars (it
has a cubic worst-case time factor). Chart parsers are flexible
enough to handle a lot of natural language constructs. They also
lack many of the troubles associated with empty and left-recursive
derivations. To obtain a parse, just create a BNF file, obtain a
line of input, and then invoke parse_sentence(sentence,
bnf_filename, start-symbol). Parse_sentence suspends successive
edge structures corresponding to possible parses of the input
sentence. There is a routine called edge_2_tree() that converts
these edges to a more standard form. See the stub main() procedure
for an example of how to make use of all these facilities.
###########################################################################
Implementation details:
The parser itself operates in bottom-up fashion, but it might
just as well have been coded top-down, or for that matter as a
combination bottom-up/top-down parser (chart parsers don't care).
The parser operates in breadth-first fashion, rather than walking
through each alternative until it is exhausted. As a result, there
tends to be a pregnant pause before any results appear, but when
they appear they come out in rapid succession. To use a depth-first
strategy, just change the "put" in "put(ch.active, new_e)" to read
"push." I haven't tried to do this, but it should be that simple
to implement.
BNFs are specified using the same notation used in Griswold &
Griswold, and as described in the IPL program "pargen.icn," with
the following difference: All metacharacters (space, tab, vertical
slash, right/left parends, brackets and angle brackets) are
converted to literals by prepending a backslash. Comments can be
include along with BNFs using the same notation as for Icon code
(i.e. #-sign).
###########################################################################
Gotchas:
Pitfalls to be aware of include things like <L> ::= <L> | ha |
() (a weak attempt at a laugh recognizer). This grammar will
accept "ha," "ha ha," etc. but will suspend an infinite number of
possible parses. The right way to do this sort of thing is <L> ::=
ha <S> | ha, or if you really insist on having the empty string as
a possibility, try things like:
<S> ::= () | <LAUGHS>
<LAUGHS> ::= ha <LAUGHS> | ha
Of course, the whole problem of infinite parses can be avoided by
simply invoking the parser in a context where it is not going to
be resumed, or else one in which it will be resumed a finite number
of times.
###########################################################################
Motivation:
I was reading Byte Magazine (vol. 17:2 [February, 1992]), and
ran into an article entitled "A Natural Solution" (pages 237-244)
in which a standard chart parser was described in terms of its C++
implementation. The author remarked at how his optimizations made
it possible to parse a 14-word sentence in only 32 seconds (versus
146 for a straight Gazdar-Mellish LISP chart parser). 32 seconds
struck me as hardly anything to write home about, so I coded up a
quick system in Icon to see how it compared. This library is the
result.
I'm quite sure that this code could be very much improved upon.
As it stands, its performance seems as good as the C++ parser in
BYTE, if not better. It's hard to tell, though, seeing as I have
no idea what hardware the guy was using. I'd guess a 386 running
DOS. On a 386 running Xenix the Icon version beats the BYTE times
by a factor of about four. The Icon compiler creates an executable
that (in the above environment) parses 14-15 word sentences in
anywhere from 6 to 8 seconds. Once the BNF file is read, it does
short sentences in a second or two. If I get around to writing it,
I'll probably use the code here as the basic parsing engine for an
adventure game my son wants me to write.
###########################################################################
Links: trees, rewrap, scan, strip, stripcom, strings
###########################################################################
Requires: co-expressions
###########################################################################
Here's a sample BNF file (taken, modified, from the BYTE
Magazine article mentioned above). Note again the conventions a)
that nonterminals be enclosed in angle brackets & b) that overlong
lines be continued by terminating the preceding line with a
backslash. Although not illustrated below, the metacharacters <,
>, (, ), and | can all be escaped (i.e. can all have their special
meaning neutralized) with a backslash (e.g. \<). Comments can also
be included using the Icon #-notation. Empty symbols are illegal,
so if you want to specify a zero-derivation, use "()." There is an
example of this usage below.
<S> ::= <NP> <VP> | <S> <CONJ> <S>
<VP> ::= <VP> <CONJ> <VP> | <IV> ( () | <PP> ) | \
<TV> ( <NP> | <NP> <PP> | <NP> <VP> | <REL> <S> )
<NP> ::= <DET> ( <NP> | <ADJ> <NP> | <ADJ> <NP> <PP> | <NP> <PP> ) | \
<ADJ> <NP> | <N> | <N> <CONJ> <N> | \
<NP> <CONJ> <NP>
<PP> ::= <P> ( <NP> | <ADJ> <NP> ) | <PP> <CONJ> <PP>
<ADJ> ::= <ADJ> <CONJ> <ADJ>
<CONJ> ::= and
<DET> ::= the | a | his | her
<NP> ::= her | he | they
<N> ::= nurse | nurses | book | books | travel | arrow | arrows | \
fortune | fortunes | report
<ADJ> ::= outrageous | silly | blue | green | heavy | white | red | \
black | yellow
<IV> ::= travel | travels | report | see | suffer
<TV> ::= hear | see | suffer
<P> ::= on | of
<REL> ::= that
###########################################################################
Addendum:
Sometimes, when writing BNFs, one finds oneself repeatedly
writing the same things. In efforts to help eliminate the need for
doing this, I've written a simple macro facility. It involves one
reserved word: "define." Just make sure it begins a line. It
takes two arguments. The first is the macro. The second is its
expansion. The first argument must not contain any spaces. The
second, however, may. Here's an example:
define <silluq-clause> ( <silluq-phrase> | \
<tifcha-silluq-clause> | \
<zaqef-silluq-clause> \
)
###########################################################################
This file is part of the (main) package.
Source code.| Details |
| Procedures: |
bnf_2_edges: string -> edge records
s -> Es (a generator)
where s is a CFPSG rule in BNF form
where Es are edges
bnf_file_2_edges: concatenate backslash-final lines & parse
do_parends: string -> string(s)
Given a(b)c suspend abc; given a(b|c)d suspend abd and acd, etc.
Used in conjuction with do_slash().
do_slash: string -> string(s)
Given a|b suspend a then b. Used in conjunction with do_parends().
edge2tree: edge -> tree
e -> t
where e is an edge structure (active or inactive; both are okay)
where t is a tree like what's described in Ralph Griswold's
structs library (IPL); I don't know about the 2nd ed. of
Griswold & Griswold, but the structure is described in the 1st
ed. in section 16.1
fails if, for some reason, the conversion can't be made (e.g. the
edge structure has been screwed around with in some way)
fullcopy: make full recursive copy of object
For debugging only.
p_err: print error message to stderr & abort
parse_sentence(s, filename, start_symbol)
parse_sentence: string x string -> edge records
(s, filename) -> Es
where s is a chunk of text presumed to constitute a sentence
where filename is the name of a grammar file containing BNFs
where Es are edge records containing possible parses of s
string_2_list: string -> list
s -> L
where L is a list of partially constructed (short) edges, having
only LHS and RHS; in the case of nonterminals, the RHS is set
to 1, while for terminals the RHS is null (and remains that way
throughout the parse)
Remove non-backslashed spaces and tabs.
tokenize: break up a sentence into constituent words, using spaces,
tabs, and other punctuation as separators (we'll need to
change this a bit later on to cover apostrophed words)
| Records: |
inactive - set; active - list
edge(LHS, RHS, LEN, DONE, BEG, END, SEEN)
stats(edge_list, lhs_table, term_set)