File itokens.icn

Summary

###########################################################################

	File:     itokens.icn

	Subject:  Procedures for tokenizing Icon code

	Author:   Richard L. Goerwitz

	Date:	  March 3, 1996

	Modified: Bruce Rennie

	Date:     August 7, 2020

###########################################################################

   This file is in the public domain.

###########################################################################

	Version:  1.12

###########################################################################

  This file contains itokens() - a utility for breaking Icon source
  files up into individual tokens.  This is the sort of routine one
  needs to have around when implementing things like pretty printers,
  preprocessors, code obfuscators, etc.  It would also be useful for
  implementing cut-down versions of Icon written in Icon - the sort
  of thing one might use in an interactive tutorial.

  Itokens(f, x) takes, as its first argument, f, an open file, and
  suspends successive TOK records.  TOK records contain two fields.
  The first field, sym, contains a string that represents the name of
  the next token (e.g. "CSET", "STRING", etc.).  The second field,
  str, gives that token's literal value.  E.g. the TOK for a literal
  semicolon is TOK("SEMICOL", ";").  For a mandatory newline, itokens
  would suspend TOK("SEMICOL", "\n").
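
  For example, a minimal driver (a sketch; the file name "prog.icn"
  is an assumption, not part of this file) might look like this:

      link itokens

      procedure main()
         local f, t
         f := open("prog.icn") | stop("can't open prog.icn")
         # itokens() suspends one TOK record per token
         every t := itokens(f) do
            write(t.sym, "\t", image(t.str))
      end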

  Unlike Icon's own tokenizer, itokens() does not return an EOFX
  token on end-of-file, but rather simply fails.  It also can be
  instructed to return syntactically meaningless newlines by passing
  it a nonnull second argument (e.g. itokens(infile, 1)).  These
  meaningless newlines are returned as TOK records with a null sym
  field (i.e. TOK(&null, "\n")).
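
  To keep those newlines, the loop in the sketch above would pass a
  nonnull second argument and allow for null sym fields:

      every t := itokens(f, 1) do
         write(\t.sym | "null", "\t", image(t.str))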

  NOTE WELL: If new reserved words or operators are added to a given
  implementation, the tables below will have to be altered.  Note
  also that &keywords should be implemented on the syntactic level -
  not on the lexical one.  As a result, a keyword like &features will
  be suspended as TOK("CONJUNC", "&") and TOK("IDENT", "features").

  Updates to this file bring it up to date with the current Unicon 13
  language.

###########################################################################

  Links:  scan

###########################################################################

  Requires: coexpressions

###########################################################################
Procedures:
create_arcs, do_apostrophe, do_digits, do_dot, do_identifier, do_newline, do_number_sign, do_operator, do_quotation_mark, do_whitespace, expand_fake_beginner, iparse_tokens, itokens, recognop

Records:
dfstn_state, start_state

Global variables:
TOK, line_number, next_c, str, sym

Links:
scan.icn

This file is part of the (main) package.


Details
Procedures:

create_arcs(master_list, field, current_state, POS)


  create_arcs:  fill out a table of arcs leading out of the current
                state, and place that table in the tbl field for
                current_state


do_apostrophe(getchar)


  do_apostrophe:  coexpression -> TOK record
                  getchar      -> t

      Where getchar is the coexpression that yields another character
      from the input stream, and t is a TOK record with "CSETLIT"
      as its sym field.  Puts everything up to and including the next
      non-backslashed apostrophe into the str field.
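
      Given, for example, the literal 'aeiou' on the input stream,
      the result is something like TOK("CSETLIT", "'aeiou'").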


do_digits(getchar)


  do_digits:  coexpression -> TOK record
              getchar      -> t

      Where getchar is the coexpression that produces the next char
      on the input stream, and where t is a TOK record containing
      either "REALLIT" or "INTLIT" in its sym field, and the text of
      the numeric literal in its str field.
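
      For example:

          42    ->  TOK("INTLIT", "42")
          3.14  ->  TOK("REALLIT", "3.14")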


do_dot(getchar)


  do_dot:  coexpression -> TOK record
           getchar      -> t

      Where getchar is the coexpression that produces the next
      character from the input stream and t is a token record whose
      sym field contains either "REALLIT" or "DOT".  Essentially,
      do_dot checks the next char on the input stream to see if it's
      an integer.  Since the preceding char was a dot, an integer
      tips us off that we have a real literal.  Otherwise, it's just
      a dot operator.  Note that do_dot resets next_c for the next
      cycle through the main case loop in the calling procedure.
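
      For example:

          .5    ->  TOK("REALLIT", ".5")
          .fld  ->  TOK("DOT", ".") then TOK("IDENT", "fld")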


do_identifier(getchar, reserved_tbl)


  do_identifier:  coexpression x table    -> TOK record
                  (getchar, reserved_tbl) -> t

      Where getchar is the coexpression that pops off characters from
      the input stream, reserved_tbl is a table of reserved words
      (keys = the string values, values = the names qua symbols in
      the grammar), and t is a TOK record containing all subsequent
      letters, digits, or underscores after next_c (which must be a
      letter or underscore).  Note that next_c is global and gets
      reset by do_identifier.
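
      For example, "while" appears in the reserved table and comes
      back as something like TOK("WHILE", "while"), whereas "whilst"
      does not, and comes back as TOK("IDENT", "whilst").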


do_newline(getchar, last_token, be_tbl)


  do_newline:  coexpression x TOK record x table -> TOK records
               (getchar, last_token, be_tbl)     -> Ts (a generator)

      Where getchar is the coexpression that returns the next
      character from the input stream, last_token is the last TOK
      record suspended by the calling procedure, be_tbl is a table of
      tokens and their "beginner/ender" status, and Ts are TOK
      records.  Note that do_newline resets next_c.  Do_newline is a
      mess.  What it does is check the last token suspended by the
      calling procedure to see if it was a beginner or ender.  It
      then gets the next token by calling iparse_tokens again.  If
      the next token is a beginner and the last token is an ender,
      then we have to suspend a SEMICOL token.  In either event, both
      the last and next token are suspended.
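
      For example, in

          write(x)
          write(y)

      the ")" closing the first line can end an expression and the
      "write" opening the second can begin one, so a mandatory
      TOK("SEMICOL", "\n") is suspended between the two lines.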


do_number_sign(getchar)


  do_number_sign:  coexpression -> &null
                   getchar      ->

      Where getchar is the coexpression that pops characters off the
      main input stream.  Sets the global variable next_c.  This
      procedure simply reads characters until it gets a newline, then
      returns with next_c == "\n".  Since the starting character was
      a number sign, this has the effect of stripping comments.
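
      Conceptually (a sketch, not the actual code):

          # eat characters through the end of the comment line
          while "\n" ~== @getchar
          next_c := "\n"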


do_operator(getchar, operators)


  do_operator:  coexpression x list  -> TOK record
                (getchar, operators) -> t

     Where getchar is the coexpression that produces the next
     character on the input stream, operators is the operator list,
     and where t is a TOK record describing the operator just
     scanned.  Calls recognop, which creates a DFSA to recognize
     valid Icon operators.  Arg2 (operators) is the list of lists
     containing valid Icon operator string values and names (see
     above).
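
     The effect is maximal-munch scanning: given ":=:" on the input
     stream, do_operator produces a single token for the swap
     operator rather than ":=" followed by ":".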


do_quotation_mark(getchar)


  do_quotation_mark:  coexpression -> TOK record
                      getchar      -> t

      Where getchar is the coexpression that yields another character
      from the input stream, and t is a TOK record with "STRINGLIT"
      as its sym field.  Puts everything up to and including the next
      non-backslashed quotation mark into the str field.  Handles the
      underscore continuation convention.
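
      For example, the two physical lines

          "hello, _
          world"

      form one string literal: the trailing underscore continues the
      literal onto the next line, and the whole thing comes back as
      a single "STRINGLIT" token.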


do_whitespace(getchar, whitespace)


  do_whitespace:  coexpression x cset  -> &null
                  getchar x whitespace -> &null

      Where getchar is the coexpression producing the next char on
      the input stream.  Do_whitespace just repeats until it finds a
      non-whitespace character, whitespace being defined as
      membership of a given character in the whitespace argument (a
      cset).
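
      Conceptually (a sketch, not the actual code):

          # spin until next_c holds a non-whitespace character
          while any(whitespace, next_c) do
             next_c := @getchar | fail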


expand_fake_beginner(next_token)


 expand_fake_beginner: TOK record -> TOK records

     Some "beginner" tokens aren't really beginners.  They are token
     sequences that could be either a single binary operator or a
     series of unary operators.  The tokenizer's job is just to snap
     up as many characters as could logically constitute an operator.
     Here is where we decide whether to break the sequence up into
     more than one op or not.
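
     For instance, "**" scans as the single binary intersection
     operator, but in a position where only a beginner makes sense,
     "**x" has to be read as two unary "*" (size) operators applied
     to x; this is where that sequence gets broken back up.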


iparse_tokens(stream, getchar)


 iparse_tokens:  file     -> TOK records (a generator)
                 (stream) -> tokens

     Where file is an open input stream, and tokens are TOK records
     holding both the token type and actual token text.

     TOK records contain two parts, a preterminal symbol (the first
     "sym" field), and the actual text of the token ("str").  The
     parser only pays attention to the sym field, although the
     strings themselves get pushed onto the value stack.

     Note the following kludge:  Unlike real Icon tokenizers, this
     procedure returns syntactically meaningless newlines as TOK
     records with a null sym field.  Normally they would be ignored.
     I wanted to return them so they could be printed on the output
     stream, thus preserving the line structure of the original
     file, and making later diagnostic messages more usable.

     Changes to this procedure include adding all of the additional
     reserved words that have entered the Unicon language, and adding
     each of the additional operators that relate to message passing
     and pattern matching.
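
     The getchar argument is the character-producing coexpression
     shared with the do_* procedures above (do_newline, for one,
     passes it back in when re-entering the tokenizer).  A plausible
     sketch of such a coexpression, assuming the real code also
     updates the global line_number as newlines go by:

         getchar := create |reads(stream)   # one character per activation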


itokens(stream, nostrip)


 itokens:  file x anything    -> TOK records (a generator)
           (stream, nostrip)  -> Rs

     Where stream is an open file, anything is any object (it only
     matters whether it is null or not), and Rs are TOK records.
     Note that itokens strips out useless newlines.  If the second
     argument is nonnull, itokens does not strip out superfluous
     newlines.  It may be useful to keep them when the original line
     structure of the input file must be maintained.


recognop(l, s, i)


  recognop: list x string x integer -> list
            (l, s, i)               -> l2

     Where l is the list of lists created by the calling procedure
     (each element contains a token string value, name, and
     beginner/ender string), where s is a string possibly
     corresponding to a token in the list, where i is the position in
     the elements of l where the operator string values are recorded,
     and where l2 is a list of elements from l that contain operators
     for which string s is an exact match.  Fails if there are no
     operators that s is a prefix of, but returns an empty list if
     there just aren't any that happen to match exactly.

      What this does is let the calling procedure just keep adding
      characters to s until recognop fails, then check the last list
      it returned to see if it is of length 1.  If it is, then it
      contains the vital stats for the operator last
      recognized.  If it is of length 0, then string s did not
      contain any recognizable operator.
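
      A sketch of that protocol (names and argument positions are
      illustrative, not the actual calling code):

          s := next_c
          last := []
          # extend s one character at a time until recognop fails
          while l2 := recognop(operators, s, 1) do {
             last := l2
             s ||:= @getchar | break
          }
          if *last = 1 then
             op := last[1]   # string value, name, beginner/ender status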


Records:

dfstn_state(b, e, tbl)


start_state(b, e, tbl, master_list)


Global variables:
TOK

line_number

next_c

str

sym


This page produced by UniDoc on 2021/04/15 @ 23:59:54.