findre.icn

File findre.icn
Summary

###########################################################################

	File:     findre.icn

	Subject:  Procedure to find regular expression

	Author:   Richard L. Goerwitz

	Date:	  March 3, 1996

###########################################################################

   This file is in the public domain.

###########################################################################

	Version:  1.17

###########################################################################

  DESCRIPTION:  findre() is like the Icon builtin function find(),
  except that it takes, as its first argument, a regular expression
  pretty much like the ones the Unix egrep command uses (the few
  minor differences are listed below).  Its syntax is the same as
  find's (i.e. findre(s1,s2,i,j)), with the exception that a no-
  argument invocation wipes out all static structures utilized by
  findre, and then forces a garbage collection.

###########################################################################

  (For those not familiar with regular expressions and the Unix egrep
  command: findre() offers a simple and compact wildcard-based search
  system.  If you do a lot of searches through text files, or write
  programs which do searches based on user input, then findre is a
  utility you might want to look over.)

  IMPORTANT DIFFERENCES between find and findre:  As noted above,
  findre() is just a find() function that takes a regular expression
  as its first argument.  One major problem with this setup is that
  it leaves the user with no easy way to tab past a matched
  substring, as with
 
	s ? write(tab(find("hello")+5))

  In order to remedy this intrinsic deficiency, findre() sets the
  global variable __endpoint to the first position after any given
  match occurs.  Use this variable with great care, preferably
  assigning its value to some other variable immediately after the
  match (for example, findre("hello [.?!]*",s) & tmp := __endpoint).
  Otherwise, you will certainly run into trouble.  (See the example
  below for an illustration of how __endpoint is used).

  IMPORTANT DIFFERENCES between egrep and findre:  findre utilizes
  the same basic language as egrep.  The only big difference is that
  findre uses intrinsic Icon data structures and escaping conven-
  tions rather than those of any particular Unix variant.  Be care-
  ful!  If you put findre("\(hello\)",s) into your source file,
  findre will treat it just like findre("(hello)",s).  If, however,
  you enter '\(hello\)' at run-time (via, say, findre(!&input,s)),
  what Icon receives will depend on your operating system (most
  likely, a trace will show "\\(hello\\)").

###########################################################################

  BUGS:  Space has essentially been conserved at the expense of time
  in the automata produced by findre().  The algorithm, in other
  words, will produce the equivalent of a pushdown automaton under
  certain circumstances, rather than strive (at the expense of space)
  for full determinism.  I tried to make up a nfa -> dfa converter
  that would only create that portion of the dfa it needed to accept
  or reject a string, but the resulting automaton was actually quite
  slow (if anyone can think of a way to do this in Icon, and keep it
  small and fast, please let us all know about it).  Note that under
  version 8 of Icon, findre takes up negligible storage space, due to
  the much improved hashing algorithm.  I have not tested it under
  version 7, but I would expect it to use up quite a bit more space
  in that environment.

  IMPORTANT NOTE:  Findre takes a shortest-possible-match approach
  to regular expressions.  In other words, if you look for "a*",
  findre will not even bother looking for an "a."  It will just match
  the empty string.  Without this feature, findre would perform a bit
  more slowly.  The problem with such an approach is that often the
  user will want to tab past the longest possible string of matched
  characters (say tab((findre("a*|b*"), __endpoint)).  In circumstan-
  ces like this, please just use something like:

      s ? {
          tab(find("a")) &  # or use Arb() from the IPL (patterns.icn)
          tab(many('a'))
          tab(many('b'))
      }

  or else use some combination of findre and the above.
    
###########################################################################

  REGULAR EXPRESSION SYNTAX: Regular expression syntax is complex,
  and yet simple.  It is simple in the sense that most of its power
  is concentrated in about a dozen easy-to-learn symbols.  It is
  complex in the sense that, by combining these symbols with
  characters, you can represent very intricate patterns.

  I make no pretense here of offering a full explanation of regular
  expressions, their usage, and the deeper nuances of their syntax.
  As noted above, this should be gleaned from a Unix manual.  For
  quick reference, however, I have included a brief summary of all
  the special symbols used, accompanied by an explanation of what
  they mean, and, in some cases, of how they are used (most of this
  is taken from the comments prepended to Jerry Nowlin's Icon-grep
  command, as posted a couple of years ago):

     ^   -  matches if the following pattern is at the beginning
            of a line (i.e. ^# matches lines beginning with "#")
     $   -  matches if the preceding pattern is at the end of a line
     .   -  matches any single character
     +   -  matches from 1 to any number of occurrences of the
            previous expression (i.e. a character, or set of paren-
            thesized/bracketed characters)
     *   -  matches from 0 to any number of occurrences of the previous
            expression
     \   -  removes the special meaning of any special characters
            recognized by this program (i.e if you want to match lines
            beginning with a "[", write ^\[, and not ^[)
     |   -  matches either the pattern before it, or the one after
            it (i.e. abc|cde matches either abc or cde)
     []  -  matches any member of the enclosed character set, or,
            if ^ is the first character, any nonmember of the
            enclosed character set (i.e. [^ab] matches any character
	     _except_ a and b).
     ()  -  used for grouping (e.g. ^(abc|cde)$ matches lines consist-
            ing of either "abc" or "cde," while ^abc|cde$ matches
            lines either beginning with "abc" or ending in "cde")

###########################################################################

  EXAMPLE program:

  procedure main(a)
      while line := !&input do {
          token_list := tokenize_line(line,a[1])
          every write(!token_list)
      }
  end

  procedure tokenize_line(s,sep)
      tmp_lst := []
      s ? {
          while field := tab(findre(sep)|0) &
          mark := __endpoint
          do {
              put(tmp_lst,"" ~== field)
              if pos(0) then break
              else tab(mark)
          }
      }
      return tmp_lst
  end

  The above program would be compiled with findre (e.g. "icont
  test_prg.icn findre.icn") to produce a single executable which
  tokenizes each line of input based on a user-specified delimiter.
  Note how __endpoint is set soon after findre() succeeds.  Note
  also how empty fields are excluded with "" ~==, etc.  Finally, note
  that the temporary list, tmp_lst, is not needed.  It is included
  here merely to illustrate one way in which tokens might be stored.

  Tokenizing is, of course, only one of many uses one might put
  findre to.  It is very helpful in allowing the user to construct
  automata at run-time.  If, say, you want to write a program that
  searches text files for patterns given by the user, findre would be
  a perfect utility to use.  Findre in general permits more compact
  expression of patterns than one can obtain using intrinsic Icon
  scanning facilities.  Its near complete compatibility with the Unix
  regexp library, moreover, makes for greater ease of porting,
  especially in cases where Icon is being used to prototype C code.

###########################################################################
Procedures: