| File itokens.icn |
###########################################################################
File: itokens.icn
Subject: Procedures for tokenizing Icon code
Author: Richard L. Goerwitz
Date: March 3, 1996
Modified: Bruce Rennie
Date: August 7, 2020
###########################################################################
This file is in the public domain.
###########################################################################
Version: 1.12
###########################################################################
This file contains itokens() - a utility for breaking Icon source
files up into individual tokens. This is the sort of routine one
needs to have around when implementing things like pretty printers,
preprocessors, code obfuscators, etc. It would also be useful for
implementing cut-down implementations of Icon written in Icon - the
sort of thing one might use in an interactive tutorial.
Itokens(f, x) takes, as its first argument, f, an open file, and
suspends successive TOK records. TOK records contain two fields.
The first field, sym, contains a string that represents the name of
the next token (e.g. "CSET", "STRING", etc.). The second field,
str, gives that token's literal value. E.g. the TOK for a literal
semicolon is TOK("SEMICOL", ";"). For a mandatory newline, itokens
would suspend TOK("SEMICOL", "\n").
Unlike Icon's own tokenizer, itokens() does not return an EOFX
token on end-of-file, but rather simply fails. It also can be
instructed to return syntactically meaningless newlines by passing
it a nonnull second argument (e.g. itokens(infile, 1)). These
meaningless newlines are returned as TOK records with a null sym
field (i.e. TOK(&null, "\n")).
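
As a quick illustration, here is a minimal sketch of a driver
that dumps a file's token stream (the error-message wording is
illustrative):

   link itokens

   procedure main(av)
      local f, t
      f := open(av[1]) | stop("cannot open ", av[1])
      # itokens() simply fails at end-of-file, so this loop
      # needs no EOFX sentinel.
      every t := itokens(f) do
         write(\t.sym | "null", "\t", image(t.str))
      close(f)
   end
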
NOTE WELL: If new reserved words or operators are added to a given
implementation, the tables below will have to be altered. Note
also that &keywords should be implemented on the syntactic level -
not on the lexical one. As a result, a keyword like &features will
be suspended as TOK("CONJUNC", "&") and TOK("IDENT", "features").
Updates to this file bring it up to date with the current Unicon 13 language.
###########################################################################
Links: scan
###########################################################################
Requires: coexpressions
###########################################################################
This file is part of the (main) package.
| Details |
| Procedures: |
create_arcs(master_list, field, current_state, POS)
create_arcs: fill out a table of arcs leading out of the current
state, and place that table in the tbl field for
current_state
do_apostrophe(getchar)
do_apostrophe: coexpression -> TOK record
getchar -> t
Where getchar is the coexpression that yields another character
from the input stream, and t is a TOK record with "CSETLIT"
as its sym field. Puts everything up to and including the next
non-backslashed apostrophe into the str field.
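
A hedged sketch of the scanning idea (not the file's actual
code; end-of-file handling is elided):

   str := "'"
   while c := @getchar do {
      str ||:= c
      if c == "\\" then str ||:= @getchar   # keep escaped char as-is
      else if c == "'" then
         return TOK("CSETLIT", str)         # closing apostrophe
   }
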
do_digits(getchar)
do_digits: coexpression -> TOK record
getchar -> t
Where getchar is the coexpression that produces the next char
on the input stream, and where t is a TOK record containing
either "REALLIT" or "INTLIT" in its sym field, and the text of
the numeric literal in its str field.
do_dot(getchar)
do_dot: coexpression -> TOK record
getchar -> t
Where getchar is the coexpression that produces the next
character from the input stream and t is a token record whose
sym field contains either "REALLIT" or "DOT". Essentially,
do_dot checks the next char on the input stream to see if it's
a digit. Since the preceding char was a dot, a digit tips us
off that we have a real literal. Otherwise, it's just
a dot operator. Note that do_dot resets next_c for the next
cycle through the main case loop in the calling procedure.
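
A hedged sketch of that decision (not the file's actual code;
end-of-file handling is elided):

   if any(&digits, c := @getchar) then {
      num := "." || c
      while any(&digits, c := @getchar) do   # gather the fraction
         num ||:= c
      next_c := c                  # push back the lookahead char
      return TOK("REALLIT", num)
   }
   next_c := c                     # no digit: it was just a dot
   return TOK("DOT", ".")
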
do_identifier(getchar, reserved_tbl)
do_identifier: coexpression x table -> TOK record
(getchar, reserved_tbl) -> t
Where getchar is the coexpression that pops off characters from
the input stream, reserved_tbl is a table of reserved words
(keys = the string values, values = the names qua symbols in
the grammar), and t is a TOK record containing all subsequent
letters, digits, or underscores after next_c (which must be a
letter or underscore). Note that next_c is global and gets
reset by do_identifier.
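
The reserved-word table thus has entries of this shape (the
entries shown here are just a sample):

   reserved_tbl := table()
   reserved_tbl["if"]    := "IF"      # key = source text,
   reserved_tbl["then"]  := "THEN"    # value = grammar symbol
   reserved_tbl["while"] := "WHILE"
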
do_newline(getchar, last_token, be_tbl)
do_newline: coexpression x TOK record x table -> TOK records
(getchar, last_token, be_tbl) -> Ts (a generator)
Where getchar is the coexpression that returns the next
character from the input stream, last_token is the last TOK
record suspended by the calling procedure, be_tbl is a table of
tokens and their "beginner/ender" status, and Ts are TOK
records. Note that do_newline resets next_c. Do_newline is a
mess. What it does is check the last token suspended by the
calling procedure to see if it was a beginner or ender. It
then gets the next token by calling iparse_tokens again. If
the next token is a beginner and the last token is an ender,
then we have to suspend a SEMICOL token. In either event, both
the last and next token are suspended.
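
A hedged sketch of the core rule (that be_tbl encodes the
beginner/ender status as "b"/"e" flag strings is an assumption):

   # next_token was just obtained by re-calling iparse_tokens
   if find("e", be_tbl[last_token.sym]) &
      find("b", be_tbl[next_token.sym]) then
      suspend TOK("SEMICOL", "\n")    # insert implicit semicolon
   suspend next_token
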
do_number_sign(getchar)
do_number_sign: coexpression -> &null
getchar ->
Where getchar is the coexpression that pops characters off the
main input stream. Sets the global variable next_c. This
procedure simply reads characters until it gets a newline, then
returns with next_c == "\n". Since the starting character was
a number sign, this has the effect of stripping comments.
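
The logic is simple enough to sketch (a hedged approximation):

   # consume the remainder of a "#" comment
   while next_c := @getchar do
      if next_c == "\n" then return   # leave next_c at the newline
   # if the input ends first, falling off the end of the
   # procedure makes it fail
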
do_operator(getchar, operators)
do_operator: coexpression x list -> TOK record
(getchar, operators) -> t
Where getchar is the coexpression that produces the next
character on the input stream, operators is the operator list,
and where t is a TOK record describing the operator just
scanned. Calls recognop, which creates a DFSA to recognize
valid Icon operators. Arg2 (operators) is the list of lists
containing valid Icon operator string values and names (see
above).
do_quotation_mark(getchar)
do_quotation_mark: coexpression -> TOK record
getchar -> t
Where getchar is the coexpression that yields another character
from the input stream, and t is a TOK record with "STRINGLIT"
as its sym field. Puts everything up to and including the next
non-backslashed quotation mark into the str field. Handles the
underscore continuation convention (a literal whose line ends
in an underscore continues on the next line).
do_whitespace(getchar, whitespace)
do_whitespace: coexpression x cset -> &null
getchar x whitespace -> &null
Where getchar is the coexpression producing the next char on
the input stream. Do_whitespace just repeats until it finds a
non-whitespace character, whitespace being defined as
membership of a given character in the whitespace argument (a
cset).
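
In sketch form (hedged; end-of-file simply stops the loop):

   while any(whitespace, next_c := @getchar)  # spin past whitespace
   return                                     # next_c is now non-white
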
expand_fake_beginner(next_token)
expand_fake_beginner: TOK record -> TOK records
Some "beginner" tokens aren't really beginners. They are token
sequences that could be either a single binary operator or a
series of unary operators. The tokenizer's job is just to snap
up as many characters as could logically constitute an operator.
Here is where we decide whether to break the sequence up into
more than one op or not.
iparse_tokens(stream, getchar)
iparse_tokens: file -> TOK records (a generator)
(stream) -> tokens
Where file is an open input stream, and tokens are TOK records
holding both the token type and actual token text.
TOK records contain two parts, a preterminal symbol (the first
"sym" field), and the actual text of the token ("str"). The
parser only pays attention to the sym field, although the
strings themselves get pushed onto the value stack.
Note the following kludge: Unlike real Icon tokenizers, this
procedure returns syntactically meaningless newlines as TOK
records with a null sym field. Normally they would be ignored.
I wanted to return them so they could be printed on the output
stream, thus preserving the line structure of the original
file, and making later diagnostic messages more usable.
Changes to this procedure include adding all additional
reserved words that have entered the Unicon language, as well
as the additional operators that relate to message passing and
pattern matching.
itokens(stream, nostrip)
itokens: file x anything -> TOK records (a generator)
(stream, nostrip) -> Rs
Where stream is an open file, anything is any object (it only
matters whether it is null or not), and Rs are TOK records.
Note that, by default, itokens strips out useless newlines. If
the second argument is nonnull, it leaves them in, which is
useful when the original line structure of the input file must
be maintained.
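
For instance, a hedged sketch of a pass-through filter that
keeps those newlines so as to reproduce the input's line
structure (inter-token spacing is approximate):

   every t := itokens(f, 1) do
      if t.str == "\n" then write() else writes(t.str, " ")
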
recognop(l, s, i)
recognop: list x string x integer -> list
(l, s, i) -> l2
Where l is the list of lists created by the calling procedure
(each element contains a token string value, name, and
beginner/ender string), where s is a string possibly
corresponding to a token in the list, where i is the position in
the elements of l where the operator string values are recorded,
and where l2 is a list of elements from l that contain operators
for which string s is an exact match. Fails if there are no
operators that s is a prefix of, but returns an empty list if
there just aren't any that happen to match exactly.
What this does is let the calling procedure just keep adding
characters to s until recognop fails, then check the last list
it returned to see if it is of length 1. If it is, then it
contains a list with the vital stats for the operator last
recognized. If it is of length 0, then string s did not
contain any recognizable operator.
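
A hedged sketch of that protocol (variable names are
illustrative; per the description above, each element of l is a
list [string value, name, beginner/ender string], and push-back
of the final unconsumed character is elided):

   s := next_c
   while result := recognop(operators, s, 1) do {
      last_result := result
      s ||:= @getchar | break    # extend the candidate operator
   }
   if *\last_result = 1 then
      # exactly one exact match: return its grammar name and text
      return TOK(last_result[1][2], last_result[1][1])
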
| Records: |
start_state(b, e, tbl, master_list)
| Global variables: |