File regexp.icn

Summary

###########################################################################

	File:     regexp.icn

	Subject:  Procedure for regular-expression pattern matching

	Author:   Robert J. Alexander

	Date:     May 19, 1996

###########################################################################

   This file is in the public domain.

###########################################################################

  This is a kit of procedures to deal with UNIX-like regular expression
  patterns.

  These procedures are interesting partly because of the "recursive
  suspension" (or "suspensive recursion" :-) technique used to simulate
  conjunction of an arbitrary number of computed expressions (see
  notes, below).

  The public procedures are:

  ReMatch(pattern,s,i1,i2) : i3,i4,...,iN
  ReFind(pattern,s,i1,i2) : i3,i4,...,iN
  RePat(s) : pattern list

###########################################################################

  ReMatch() produces the sequence of positions in "s" past a substring
  starting at "i1" that matches "pattern", but fails if there is no
  such position.  Similar to match(), but is capable of generating
  multiple positions.

  ReFind() produces the sequence of positions in "s" where substrings
  begin that match "pattern", but fails if there is no such position.
  Similar to find().  Each position is produced only once, even if
  several possible matches are possible at that position.

  "pattern" can be either a string or a pattern list -- see RePat(),
  below.

  Default values of s, i1, and i2 are handled as for Icon's built-in
  string scanning procedures such as match().

###########################################################################

  RePat(s) : L

  Creates a pattern element list from pattern string "s", but fails if
  the pattern string is not syntactically correct.  ReMatch() and
  ReFind() will automatically convert a pattern string to a pattern
  list, but it is faster to do the conversion explicitly if multiple
  operations are done using the same pattern.  An additional advantage
  to compiling the pattern separately is avoiding ambiguity of failure
  caused by an incorrect pattern and failure to match a correct pattern.

###########################################################################

  ReCaseIndependent() : n
  ReCaseDependent() : n

  Set mode for case-independent or case-dependent matching.  The initial
  mode is case-dependent.

###########################################################################

  Accessible Global Variables

  After a match, the strings matched by parenthesized regular
  expressions are left in list "Re_ParenGroups", and can be accessed by
  subscripting in using the same number as the \N construct.

  If it is desired that regular expression format be similar to UNIX
  filename generation patterns but still retain the power of full
  regular expressions, make the following assignments prior to
  compiling the pattern string:

       Re_ArbString := "*"     # Defaults to ".*"

  The sets of characters (csets) that define a word, digits, and white
  space can be modified.  The following assignments can be made before
  compiling the pattern string.  The character sets are captured when
  the pattern is compiled, so changing them after pattern compilation
  will not alter the behavior of matches unless the pattern string is
  recompiled.

       Re_WordChars := 'whatever you like'
                       # Defaults to &letters ++ &digits ++ "_"
       Re_Digits := &digits ++ 'ABCDEFabcdef'
                       # Defaults to &digits
       Re_Space := 'whatever you like'
                       # Defaults to ' \t\v\n\r\f'

  These globals are normally not initialized until the first call to
  RePat(), and then only if they are null.  They can be explicitly
  initialized to their defaults (if they are null) by calling
  Re_Default().

###########################################################################

  Characters compiled into patterns can be passed through a
  user-supplied filter procedure, provided in global variable
  Re_Filter.  The filtering is done before the characters are bound
  into the pattern.  The filter proc is passed one argument, the string
  to filter, and it must return the filtered string as its result.  If
  the filter proc fails, the string will be used unfiltered.  The
  filter proc is called with an argument of either type string (for
  characters in the pattern) or cset (for character classes [...]).

  Filtering is done only as the pattern is compiled.  Any filtering of
  strings to be matched must be explicitly done.

###########################################################################

  By default, individual pattern elements are matched in a "leftmost-
  longest-first" sequence, which is the order observed by perl, egrep,
  and most other regular expression matchers.  If the order of matching
  is not important a performance improvement might be seen if pattern
  elements are matched in "shortest-first" order.  The following global
  variable setting causes the matcher to operate in leftmost-shortest-
  first order.

   Re_LeftmostShortest := 1
  
###########################################################################

  In the case of patterns containing alternation, ReFind() will
  generally not produce positions in increasing order, but will produce
  all positions from the first term of the alternation (in increasing
  order) followed by all positions from the second (in increasing
  order).  If it is necessary that the positions be generated in
  strictly increasing order, with no duplicates, assign any non-null
  value to Re_Ordered:

       Re_Ordered := 1

  If the Re_Ordered option is chosen, there is a *small* penalty in
  efficiency in some cases, and the co-expression facility is required
  in your Icon implementation.
  
###########################################################################

  Regular Expression Characters and Features Supported

  The regular expression format supported by procedures in this file
  model very closely those supported by the UNIX "egrep" program, with
  modifications as described in the Perl programming language
  definition.  Following is a brief description of the special
  characters used in regular expressions.  In the description, the
  abbreviation RE means regular expression.

  c            An ordinary character (not one of the special characters
               discussed below) is a one-character RE that matches that
               character.

  \c           A backslash followed by any special character is a one-
               character RE that matches the special character itself.

               Note that backslash escape sequences representing
               non-graphic characters are not supported directly
               by these procedures.  Of course, strings coded in an
               Icon program will have such escapes handled by the
               Icon translator.  If such escapes must be supported
               in strings read from the run-time environment (e.g.
               files), they will have to be converted by other means,
               such as the Icon Program Library procedure "escape()".

  .            A period is a one-character RE that matches any
               character.

  [string]     A non-empty string enclosed in square brackets is a one-
               character RE that matches any *one* character of that
               string.  If, the first character is "^" (circumflex),
               the RE matches any character not in the remaining
               characters of the string.  The "-" (minus), when between
               two other characters, may be used to indicate a range of
               consecutive ASCII characters (e.g. [0-9] is equivalent to
               [0123456789]).  Other special characters stand for
               themselves in a bracketed string.

  *            Matches zero or more occurrences of the RE to its left.

  +            Matches one or more occurrences of the RE to its left.

  ?            Matches zero or one occurrences of the RE to its left.

  {N}          Matches exactly N occurrences of the RE to its left.

  {N,}         Matches at least N occurrences of the RE to its left.

  {N,M}        Matches at least N occurrences but at most M occurrences
               of the RE to its left.

  ^            A caret at the beginning of an entire RE constrains
               that RE to match an initial substring of the subject
               string.

  $            A currency symbol at the end of an entire RE constrains
               that RE to match a final substring of the subject string.

  |            Alternation: two REs separated by "|" match either a
               match for the first or a match for the second.

  ()           A RE enclosed in parentheses matches a match for the
               regular expression (parenthesized groups are used
               for grouping, and for accessing the matched string
               subsequently in the match using the \N expression).

  \N           Where N is a digit in the range 1-9, matches the same
               string of characters as was matched by a parenthesized
               RE to the left in the same RE.  The sub-expression
               specified is that beginning with the Nth occurrence
               of "(" counting from the left.  E.g., ^(.*)\1$ matches
               a string consisting of two consecutive occurrences of
               the same string.

###########################################################################

  Extensions beyond UNIX egrep

  The following extensions to UNIX REs, as specified in the Perl
  programming language, are supported.

  \w           Matches any alphanumeric (including "_").
  \W           Matches any non-alphanumeric.

  \b           Matches only at a word-boundary (word defined as a string
               of alphanumerics as in \w).
  \B           Matches only non-word-boundaries.

  \s           Matches any white-space character.
  \S           Matches any non-white-space character.

  \d           Matches any digit [0-9].
  \D           Matches any non-digit.

  \w, \W, \s, \S, \d, \D can be used within [string] REs.

###########################################################################

  Notes on computed conjunction expressions by "suspensive recursion"

  A conjunction expression of an arbitrary number of terms can be
  computed in a looping fashion by the following recursive technique:

       procedure Conjunct(v)
          if <there is another term to be appended to the conjunction> then
             suspend Conjunct(<the next term expression>)
          else
             suspend v
       end

  The argument "v" is needed for producing the value of the last term
  as the value of the conjunction expression, accurately modeling Icon
  conjunction.  If the value of the conjunction is not needed, the
  technique can be slightly simplified by eliminating "v":

       procedure ConjunctAndProduceNull()
          if <there is another term to be appended to the conjunction> then
             suspend ConjunctAndProduceNull(<the next term expression>)
          else
             suspend
       end

  Note that <the next term expression> must still remain in the suspend
  expression to test for failure of the term, although its value is not
  passed to the recursive invocation.  This could have been coded as

             suspend <the next term expression> & ConjunctAndProduceNull()

  but wouldn't have been as provocative.

  Since the computed conjunctions in this program are evaluated only for
  their side effects, the second technique is used in two situations:

       (1)     To compute the conjunction of all of the elements in the
               regular expression pattern list (Re_match1()).

       (2)     To evaluate the "exactly N times" and "N to M times"
               control operations (Re_NTimes()).

###########################################################################
Procedures:
ReCaseDependent, ReCaseIndependent, ReFind, ReMatch, RePat, Re_Alt, Re_Arb, Re_ArbNo, Re_Default, Re_MatchParenGroup, Re_MatchReg, Re_NOrMoreTimes, Re_NTimes, Re_NToMTimes, Re_NonWordBoundary, Re_OneOrMore, Re_TabAny, Re_WordBoundary, Re_ZeroOrOneTimes, Re_c_tabmatch, Re_cset, Re_match1, Re_pat1, Re_prevTok, Re_result_merge, Re_skip, Re_string, Re_tok_match, untab

Records:
Re_Tok

Global variables:
Re_AnyString, Re_ArbString, Re_Digits, Re_Filter, Re_LeftmostShortest, Re_NonDigits, Re_NonSpace, Re_NonWordChars, Re_Ordered, Re_ParenGroups, Re_Space, Re_WordChars, Re__any, Re__find, Re__many, Re__match, Re__tabmatch, Re__upto

Links:
noncase.icn

This file is part of the (main) package.

Source code.

Details
Procedures:

ReCaseDependent()


ReCaseIndependent()


ReFind(plist, s, i1, i2)

: position where regular expression matched


ReMatch(plist, s, i1, i2)

: position past regular expression matched


RePat(s)

: regular expression pattern list


Re_Alt(tokList1, tokList2)


Re_Arb(plist, i)


Re_ArbNo(tok)


Re_Default()


Re_MatchParenGroup(n)


Re_MatchReg(tokList, groupNbr)


Re_NOrMoreTimes(tok, n)


Re_NTimes(tok, n)


Re_NToMTimes(tok, n, m)


Re_NonWordBoundary(wd, nonwd)


Re_OneOrMore(tok)


Re_TabAny(C)


Re_WordBoundary(wd, nonwd)


Re_ZeroOrOneTimes(tok)


Re_c_tabmatch(s)


Re_cset()


Re_match1(plist, i)

 s1,s2,...,sN


Re_pat1(level)

 L


Re_prevTok(plist)


Re_result_merge(L)


Re_skip(plist, i)

 s1,s2,...,sN


Re_string(level)


Re_tok_match(tok, plist, i)


untab(origPos)


Records:

Re_Tok(proc, args)


Global variables:
Re_AnyString

Re_ArbString

Re_Digits

Re_Filter

Re_LeftmostShortest

Re_NonDigits

Re_NonSpace

Re_NonWordChars

Re_Ordered

Re_ParenGroups

Re_Space

Re_WordChars

Re__any

Re__find

Re__many

Re__match

Re__tabmatch

Re__upto


This page produced by UniDoc on 2021/04/15 @ 23:59:54.