ichartp.icn

File ichartp.icn

Summary

###########################################################################

	File:     ichartp.icn

	Subject:  Procedures for a simple chart parser

	Author:   Richard L. Goerwitz

	Date:	  August 3, 2000

###########################################################################

   This file is in the public domain.

###########################################################################

	Version:  1.11

###########################################################################

  General:

      Ichartp implements a simple chart parser - a slow but
  easy-to-implement strategy for parsing context free grammars (it
  has a cubic worst-case time factor).  Chart parsers are flexible
  enough to handle a lot of natural language constructs.  They also
  lack many of the troubles associated with empty and left-recursive
  derivations.  To obtain a parse, just create a BNF file, obtain a
  line of input, and then invoke parse_sentence(sentence,
  bnf_filename, start-symbol).  Parse_sentence suspends successive
  edge structures corresponding to possible parses of the input
  sentence.  There is a routine called edge_2_tree() that converts
  these edges to a more standard form.  See the stub main() procedure
  for an example of how to make use of all these facilities.

###########################################################################

  Implementation details:

      The parser itself operates in bottom-up fashion, but it might
  just as well have been coded top-down, or for that matter as a
  combination bottom-up/top-down parser (chart parsers don't care).
  The parser operates in breadth-first fashion, rather than walking
  through each alternative until it is exhausted.  As a result, there
  tends to be a pregnant pause before any results appear, but when
  they appear they come out in rapid succession.  To use a depth-first
  strategy, just change the "put" in "put(ch.active, new_e)" to read
  "push."  I haven't tried to do this, but it should be that simple
  to implement.
      BNFs are specified using the same notation used in Griswold &
  Griswold, and as described in the IPL program "pargen.icn," with
  the following difference:  All metacharacters (space, tab, vertical
  slash, right/left parends, brackets and angle brackets) are
  converted to literals by prepending a backslash.  Comments can be
  include along with BNFs using the same notation as for Icon code
  (i.e. #-sign).

###########################################################################

  Gotchas:

      Pitfalls to be aware of include things like <L> ::= <L> | ha |
  () (a weak attempt at a laugh recognizer).  This grammar will
  accept "ha," "ha ha," etc. but will suspend an infinite number of
  possible parses.  The right way to do this sort of thing is <L> ::=
  ha <S> | ha, or if you really insist on having the empty string as
  a possibility, try things like:

          <S>      ::= () | <LAUGHS>
          <LAUGHS> ::= ha <LAUGHS> | ha

  Of course, the whole problem of infinite parses can be avoided by
  simply invoking the parser in a context where it is not going to
  be resumed, or else one in which it will be resumed a finite number
  of times.

###########################################################################

  Motivation:

      I was reading Byte Magazine (vol. 17:2 [February, 1992]), and
  ran into an article entitled "A Natural Solution" (pages 237-244)
  in which a standard chart parser was described in terms of its C++
  implementation.  The author remarked at how his optimizations made
  it possible to parse a 14-word sentence in only 32 seconds (versus
  146 for a straight Gazdar-Mellish LISP chart parser).  32 seconds
  struck me as hardly anything to write home about, so I coded up a
  quick system in Icon to see how it compared.  This library is the
  result.
      I'm quite sure that this code could be very much improved upon.
  As it stands, its performance seems as good as the C++ parser in
  BYTE, if not better.  It's hard to tell, though, seeing as I have
  no idea what hardware the guy was using.  I'd guess a 386 running
  DOS.  On a 386 running Xenix the Icon version beats the BYTE times
  by a factor of about four.  The Icon compiler creates an executable
  that (in the above environment) parses 14-15 word sentences in
  anywhere from 6 to 8 seconds.  Once the BNF file is read, it does
  short sentences in a second or two.  If I get around to writing it,
  I'll probably use the code here as the basic parsing engine for an
  adventure game my son wants me to write.

###########################################################################

  Links: trees, rewrap, scan, strip, stripcom, strings

###########################################################################

  Requires:  co-expressions

###########################################################################

       Here's a sample BNF file (taken, modified, from the BYTE
  Magazine article mentioned above).  Note again the conventions a)
  that nonterminals be enclosed in angle brackets & b) that overlong
  lines be continued by terminating the preceding line with a
  backslash.  Although not illustrated below, the metacharacters <,
  >, (, ), and | can all be escaped (i.e. can all have their special
  meaning neutralized) with a backslash (e.g. \<).  Comments can also
  be included using the Icon #-notation.  Empty symbols are illegal,
  so if you want to specify a zero-derivation, use "()."  There is an
  example of this usage below.

  <S>    ::= <NP> <VP> | <S> <CONJ> <S>
  <VP>   ::= <VP> <CONJ> <VP> | <IV> ( () | <PP> ) | \
  	   <TV> ( <NP> | <NP> <PP> | <NP> <VP> | <REL> <S> )
  <NP>   ::= <DET> ( <NP> | <ADJ> <NP> | <ADJ> <NP> <PP> | <NP> <PP> ) | \
  	   <ADJ> <NP> | <N> | <N> <CONJ> <N> | \
  	   <NP> <CONJ> <NP>
  <PP>   ::= <P> ( <NP> | <ADJ> <NP> ) | <PP> <CONJ> <PP>
  <ADJ>  ::= <ADJ> <CONJ> <ADJ>
  <CONJ> ::= and
  <DET>  ::= the | a | his | her
  <NP>   ::= her | he | they
  <N>    ::= nurse | nurses | book | books | travel | arrow | arrows | \
  	  fortune | fortunes | report
  <ADJ>  ::= outrageous | silly | blue | green | heavy | white | red | \
  	  black | yellow
  <IV>   ::= travel | travels | report | see | suffer
  <TV>   ::= hear | see | suffer
  <P>    ::= on | of
  <REL>  ::= that

###########################################################################

  Addendum:

      Sometimes, when writing BNFs, one finds oneself repeatedly
  writing the same things.  In efforts to help eliminate the need for
  doing this, I've written a simple macro facility.  It involves one
  reserved word:  "define."  Just make sure it begins a line.  It
  takes two arguments.  The first is the macro.  The second is its
  expansion.  The first argument must not contain any spaces.  The
  second, however, may.  Here's an example:

      define <silluq-clause>    (   <silluq-phrase> | \
                                    <tifcha-silluq-clause> | \
                                    <zaqef-silluq-clause> \
                                )

###########################################################################

Procedures:

bnf_2_edges, bnf_file_2_edges, do_parends, do_slash, edge2tree, fullcopy, main, p_err, parse_sentence, string_2_list, stripspaces, tokenize

Records:

chart, edge, retval, short_edge, stats

Links:

rewrap.icn, scan.icn, strings.icn, strip.icn, stripcom.icn, trees.icn

This file is part of the (main) package.

Source code.

Details

Procedures:

bnf_2_edges(s)


 bnf_2_edges: string -> edge records
              s -> Es (a generator)
    where s is a CFPSG rule in BNF form
    where Es are edges

bnf_file_2_edges(f)


 bnf_file_2_edges: concatenate backslash-final lines & parse

do_parends(s)


 do_parends:  string -> string(s)
    Given a(b)c suspend abc; given a(b|c)d suspend abd and acd, etc.
    Used in conjuction with do_slash().

do_slash(s)


 do_slash:  string -> string(s)
     Given a|b suspend a then b.  Used in conjunction with do_parends().

edge2tree(e)


 edge2tree:  edge -> tree
             e -> t

    where e is an edge structure (active or inactive; both are okay)
    where t is a tree like what's described in Ralph Griswold's
    structs library (IPL); I don't know about the 2nd ed. of
    Griswold & Griswold, but the structure is described in the 1st
    ed. in section 16.1

    fails if, for some reason, the conversion can't be made (e.g. the
    edge structure has been screwed around with in some way)

fullcopy(obj)


 fullcopy:  make full recursive copy of object

main(a)


 For debugging only.

p_err(s, n)


 p_err:  print error message to stderr & abort

parse_sentence(s, filename, start_symbol)


 parse_sentence:  string x string -> edge records
                  (s, filename) -> Es
     where s is a chunk of text presumed to constitute a sentence
     where filename is the name of a grammar file containing BNFs
     where Es are edge records containing possible parses of s

string_2_list(s)


 string_2_list:  string -> list
                 s -> L
    where L is a list of partially constructed (short) edges, having
    only LHS and RHS; in the case of nonterminals, the RHS is set
    to 1, while for terminals the RHS is null (and remains that way
    throughout the parse)

stripspaces(s)


 Remove non-backslashed spaces and tabs.

tokenize(s)


 tokenize:  break up a sentence into constituent words, using spaces,
            tabs, and other punctuation as separators (we'll need to
            change this a bit later on to cover apostrophed words)

Records:

chart(inactive, active)

 inactive - set; active - list

edge(LHS, RHS, LEN, DONE, BEG, END, SEEN)

retval(no, item)

short_edge(LHS, RHS)

stats(edge_list, lhs_table, term_set)

This page produced by UniDoc on 2021/04/15 @ 23:59:54.