File itokens.icn
###########################################################################

File:     itokens.icn
Subject:  Procedures for tokenizing Icon code
Author:   Richard L. Goerwitz
Date:     March 3, 1996
Modified: Bruce Rennie
Date:     August 7, 2020

###########################################################################

This file is in the public domain.

###########################################################################

Version:  1.12

###########################################################################

This file contains itokens() - a utility for breaking Icon source files up into individual tokens. This is the sort of routine one needs to have around when implementing things like pretty printers, preprocessors, code obfuscators, etc. It would also be useful for implementing cut-down implementations of Icon written in Icon - the sort of thing one might use in an interactive tutorial.

Itokens(f, x) takes, as its first argument, f, an open file, and suspends successive TOK records. TOK records contain two fields. The first field, sym, contains a string that represents the name of the next token (e.g. "CSETLIT", "STRINGLIT", etc.). The second field, str, gives that token's literal value. E.g. the TOK for a literal semicolon is TOK("SEMICOL", ";"). For a mandatory newline, itokens would suspend TOK("SEMICOL", "\n").

Unlike Icon's own tokenizer, itokens() does not return an EOFX token on end-of-file, but rather simply fails. It can also be instructed to return syntactically meaningless newlines by passing it a nonnull second argument (e.g. itokens(infile, 1)). These meaningless newlines are returned as TOK records with a null sym field (i.e. TOK(&null, "\n")).

NOTE WELL: If new reserved words or operators are added to a given implementation, the tables below will have to be altered. Note also that &keywords should be implemented on the syntactic level - not on the lexical one. As a result, a keyword like &features will be suspended as TOK("CONJUNC", "&") and TOK("IDENT", "features").

Updates to this file bring it up to the current Unicon 13 language.

###########################################################################

Links: scan

###########################################################################

Requires: coexpressions

###########################################################################
This file is part of the (main) package.
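As a quick illustration of the TOK stream described above (the operator sym names shown here are the standard Icon grammar names and are meant as illustration; the exact values come from the operator table in the source), a line like

    i := i + 1

followed by a mandatory newline would be suspended as:

    TOK("IDENT",   "i")
    TOK("ASSIGN",  ":=")
    TOK("IDENT",   "i")
    TOK("PLUS",    "+")
    TOK("INTLIT",  "1")
    TOK("SEMICOL", "\n")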
Procedures:
create_arcs(master_list, field, current_state, POS)
create_arcs: fill out a table of arcs leading out of the current state, and place that table in the tbl field of current_state.
do_apostrophe(getchar)
do_apostrophe: coexpression -> TOK record
               getchar -> t
Where getchar is the coexpression that yields another character from the input stream, and t is a TOK record with "CSETLIT" as its sym field. Puts everything up to and including the next non-backslashed apostrophe into the str field.
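A minimal sketch of the same collecting logic, written over an in-memory string rather than the getchar co-expression the real procedure reads from (the procedure name and the simplifications are mine):

    record TOK(sym, str)   # the record this file declares

    # Collect everything up to and including the next non-backslashed
    # apostrophe; s is assumed to start just after the opening apostrophe.
    procedure cset_body(s)
        local lit, c
        lit := ""
        s ? {
            while c := move(1) do {
                lit ||:= c
                if c == "\\" then lit ||:= move(1)   # keep escaped character
                else if c == "'" then return TOK("CSETLIT", lit)
            }
        }
        fail   # no closing apostrophe found
    end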
do_digits(getchar)
do_digits: coexpression -> TOK record
           getchar -> t
Where getchar is the coexpression that produces the next char on the input stream, and where t is a TOK record containing either "REALLIT" or "INTLIT" in its sym field, and the text of the numeric literal in its str field.
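As a toy illustration of the INTLIT/REALLIT split (a sketch over an already-isolated literal; the real do_digits() works character by character and also handles radix literals such as 16r1F, which this version would misclassify):

    record TOK(sym, str)

    # A decimal point or exponent marker makes the literal a real.
    procedure classify_number(s)
        if upto('.eE', s) then return TOK("REALLIT", s)
        else return TOK("INTLIT", s)
    end

    # classify_number("42")    ->  TOK("INTLIT",  "42")
    # classify_number("3.14")  ->  TOK("REALLIT", "3.14")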
do_dot(getchar)
do_dot: coexpression -> TOK record
        getchar -> t
Where getchar is the coexpression that produces the next character from the input stream and t is a token record whose sym field contains either "REALLIT" or "DOT". Essentially, do_dot checks the next char on the input stream to see if it's a digit. Since the preceding char was a dot, a digit tips us off that we have a real literal (e.g. ".5" scans as TOK("REALLIT", ".5")). Otherwise, it's just a dot operator, as in "a.b". Note that do_dot resets next_c for the next cycle through the main case loop in the calling procedure.
do_identifier(getchar, reserved_tbl)
do_identifier: coexpression x table -> TOK record
               (getchar, reserved_tbl) -> t
Where getchar is the coexpression that pops off characters from the input stream, reserved_tbl is a table of reserved words (keys = the string values, values = the names qua symbols in the grammar), and t is a TOK record whose str field contains next_c (which must be a letter or underscore) plus all subsequent letters, digits, and underscores. Note that next_c is global and gets reset by do_identifier.
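The reserved-word check is essentially a table probe; a sketch with an abbreviated table and an assumed "IDENT" default (the real reserved_tbl covers every reserved word in the language):

    record TOK(sym, str)

    # Classify a completed identifier string: reserved words map to
    # their grammar symbols, anything else is a plain IDENT.
    procedure classify_ident(s)
        static reserved_tbl
        initial {
            reserved_tbl := table("IDENT")    # default value: "IDENT"
            reserved_tbl["if"]    := "IF"
            reserved_tbl["while"] := "WHILE"
            reserved_tbl["end"]   := "END"
        }
        return TOK(reserved_tbl[s], s)
    end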
do_newline(getchar, last_token, be_tbl)
do_newline: coexpression x TOK record x table -> TOK records
            (getchar, last_token, be_tbl) -> Ts (a generator)
Where getchar is the coexpression that returns the next character from the input stream, last_token is the last TOK record suspended by the calling procedure, be_tbl is a table of tokens and their "beginner/ender" status, and Ts are TOK records. Note that do_newline resets next_c. Do_newline is a mess. What it does is check the last token suspended by the calling procedure to see if it was a beginner or ender. It then gets the next token by calling iparse_tokens again. If the next token is a beginner and the last token is an ender, then we have to suspend a SEMICOL token. In either event, both the last and next token are suspended.
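A sketch of the decision being made, assuming the "b"/"e" flag-string encoding alluded to under recognop below (e.g. be_tbl["IDENT"] == "be", since an identifier can both begin and end an expression; the sketch also assumes every sym is keyed in be_tbl):

    record TOK(sym, str)

    # Succeeds when a semicolon must be inserted: the last token can
    # end an expression and the next one can begin a new one.
    procedure needs_semicolon(last_token, next_token, be_tbl)
        return find("e", be_tbl[last_token.sym]) &
               find("b", be_tbl[next_token.sym])
    end

So, for instance, with "i := 1" on one line and "j := 2" on the next, the INTLIT "1" is an ender and the IDENT "j" a beginner, and TOK("SEMICOL", "\n") is suspended between them.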
do_number_sign(getchar)
do_number_sign: coexpression -> &null
                getchar ->
Where getchar is the coexpression that pops characters off the main input stream. Sets the global variable next_c. This procedure simply reads characters until it gets a newline, then returns with next_c == "\n". Since the starting character was a number sign, this has the effect of stripping comments.
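In sketch form, using the getchar/next_c conventions described above:

    global next_c

    # Discard characters through the end of a "#" comment, leaving
    # next_c == "\n" for the main tokenizing loop.
    procedure skip_comment(getchar)
        while next_c := @getchar do
            if next_c == "\n" then return
        fail   # end of file inside a comment
    end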
do_operator(getchar, operators)
do_operator: coexpression x list -> TOK record
             (getchar, operators) -> t
Where getchar is the coexpression that produces the next character on the input stream, operators is the operator list, and where t is a TOK record describing the operator just scanned. Calls recognop, which creates a DFSA to recognize valid Icon operators. Arg2 (operators) is the list of lists containing valid Icon operator string values and names (see recognop below).
do_quotation_mark(getchar)
do_quotation_mark: coexpression -> TOK record
                   getchar -> t
Where getchar is the coexpression that yields another character from the input stream, and t is a TOK record with "STRINGLIT" as its sym field. Puts everything up to and including the next non-backslashed quotation mark into the str field. Handles the underscore continuation convention.
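A sketch of the same logic over an in-memory string (the real procedure reads from getchar; the handling of leading whitespace on continuation lines is simplified here, and the opening quote is assumed already consumed):

    record TOK(sym, str)

    # Collect up to and including the closing non-backslashed quotation
    # mark, splicing out "_<newline>" continuations.
    procedure string_body(s)
        local lit, c
        lit := ""
        s ? {
            while c := move(1) do {
                if c == "\\" then lit ||:= c || move(1)          # keep escape
                else if c == "_" & ="\n" then tab(many(' \t'))   # continuation
                else if c == "\"" then return TOK("STRINGLIT", lit || c)
                else lit ||:= c
            }
        }
        fail   # unterminated string literal
    end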
do_whitespace(getchar, whitespace)
do_whitespace: coexpression x cset -> &null
               getchar x whitespace -> &null
Where getchar is the coexpression producing the next char on the input stream. Do_whitespace just repeats until it finds a non-whitespace character, whitespace being defined as membership of a given character in the whitespace argument (a cset).
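In sketch form, using the global next_c convention:

    global next_c

    # Spin past whitespace, leaving the first non-whitespace character
    # in next_c for the caller.
    procedure skip_whitespace(getchar, whitespace)
        while any(whitespace, next_c) do
            next_c := @getchar | fail   # fail at end of file
        return
    end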
expand_fake_beginner(next_token)
expand_fake_beginner: TOK record -> TOK records
Some "beginner" tokens aren't really beginners. They are token sequences that could be either a single binary operator or a series of unary operators. The tokenizer's job is just to snap up as many characters as could logically constitute an operator. Here is where we decide whether to break the sequence up into more than one op or not.
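For example, a "--" that the maximal munch scanned as the binary cset-difference operator must, in beginner position, really be two unary minuses. A sketch of the expansion (the table here is abbreviated and its entries are illustrative; see the procedure itself for the actual set):

    record TOK(sym, str)

    # Either suspend the unary-operator expansion of a fake beginner
    # or return the token unchanged.
    procedure expand_beginner(t)
        static exptbl
        initial {
            exptbl := table()
            exptbl["DIFF"]   := [TOK("MINUS", "-"), TOK("MINUS", "-")]
            exptbl["CONCAT"] := [TOK("BAR", "|"), TOK("BAR", "|")]
        }
        if \exptbl[t.sym] then suspend !exptbl[t.sym]
        else return t
    end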
iparse_tokens(stream, getchar)
iparse_tokens: file -> TOK records (a generator)
               (stream) -> tokens
Where file is an open input stream, and tokens are TOK records holding both the token type and actual token text. TOK records contain two parts, a preterminal symbol (the first field, "sym"), and the actual text of the token ("str"). The parser only pays attention to the sym field, although the strings themselves get pushed onto the value stack. Note the following kludge: unlike real Icon tokenizers, this procedure returns syntactically meaningless newlines as TOK records with a null sym field. Normally they would be ignored. I wanted to return them so they could be printed on the output stream, thus preserving the line structure of the original file and making later diagnostic messages more usable. Changes to this procedure include adding all the additional reserved words that have entered the Unicon language, as well as the additional operators that relate to message passing and pattern matching.
itokens(stream, nostrip)
itokens: file x anything -> TOK records (a generator)
         (stream, nostrip) -> Rs
Where stream is an open file, anything is any object (it only matters whether it is null or not), and Rs are TOK records. Note that itokens strips out useless newlines by default. If the second argument is nonnull, itokens does not strip out superfluous newlines. Keeping them may be useful when the original line structure of the input file must be maintained.
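A minimal driver (a sketch: it assumes this file is linked under the name itokens and that a source file name arrives as the first command-line argument):

    link itokens

    # Echo every token, one per line, keeping the meaningless newlines
    # so the line structure of the input survives in the output.
    procedure main(args)
        local f, t
        f := open(args[1]) | stop("cannot open input file")
        every t := itokens(f, 1) do
            write(\t.sym | "*null*", "\t", image(t.str))
        close(f)
    end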
recognop(l, s, i)
recognop: list x string x integer -> list
          (l, s, i) -> l2
Where l is the list of lists created by the calling procedure (each element contains a token string value, name, and beginner/ender string), where s is a string possibly corresponding to a token in the list, where i is the position in the elements of l where the operator string values are recorded, and where l2 is a list of elements from l that contain operators for which string s is an exact match. Fails if there are no operators that s is a prefix of, but returns an empty list if there just aren't any that happen to match exactly. What this does is let the calling procedure keep adding characters to s until recognop fails, then check the last list it returned to see if it is of length 1. If it is, then it contains the list with the vital stats for the operator last recognized. If it is of length 0, then string s did not contain any recognizable operator.
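That calling convention amounts to a maximal-munch loop; a sketch of how a caller such as do_operator might drive it (names follow the descriptions above; this is not the file's actual code):

    record TOK(sym, str)
    global next_c

    # Extend s one character at a time until recognop fails, then
    # consult the last list it returned.
    procedure scan_operator(getchar, operators)
        local s, result, last
        s := next_c
        while result := recognop(operators, s, 1) do {
            last := result
            s ||:= (next_c := @getchar) | break
        }
        if *\last = 1 then
            # each element is [string value, name, b/e flags]
            return TOK(last[1][2], last[1][1])
        else fail   # no exact operator match
    end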
Records:
start_state(b, e, tbl, master_list)
Global variables: