Summary
###########################################################################
File: html.icn
Subject: Procedures for parsing HTML
Author: Gregg M. Townsend
Date: July 4, 2000
###########################################################################
This file is in the public domain.
###########################################################################
These procedures parse HTML files:
    htchunks(f)         generates the basic chunks -- tags and text --
                        that compose an HTML file.
    htrefs(f)           generates the tagname/keyword/value combinations
                        that reference other files.
These procedures process strings from HTML files:
    httag(s)            extracts the name of a tag.
    htvals(s)           generates the keyword/value pairs from a tag.
    urlmerge(base,new)  interprets a new URL in the context of a base.
###########################################################################
htchunks(f) generates the HTML chunks from file f.
It generates strings beginning with
    <!--           for unclosed comments (legal comments are deleted)
    <              for tags (will end with ">" unless unclosed at EOF)
    anything else  for text
At this level entities such as &amp; are left unprocessed and all
whitespace is preserved, including newlines.
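
As an illustration of the three chunk kinds, here is a minimal sketch that
classifies each chunk by its leading characters. It assumes the library is
reachable through "link html" and that the file name is passed on the
command line; the counter names are made up for this example.

```icon
link html

# Count tags, unclosed comments, and text chunks in an HTML file.
# Taking the file name from the command line is an assumption of this sketch.
procedure main(args)
   local f, chunk, ntags, ncomments, ntext

   f := open(args[1]) | stop("usage: chunks file.html")
   ntags := ncomments := ntext := 0

   every chunk := htchunks(f) do {
      if match("<!--", chunk) then
         ncomments +:= 1        # unclosed comment (test this before "<")
      else if match("<", chunk) then
         ntags +:= 1            # tag
      else
         ntext +:= 1            # text, with whitespace and entities untouched
      }

   close(f)
   write(ntags, " tags, ", ntext, " text chunks, ",
      ncomments, " unclosed comments")
end
```

The "<!--" test has to come first, because an unclosed comment also begins
with "<".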
###########################################################################
htrefs(f) extracts file/url references from within an HTML file
and generates a string of the form
    tagname keyword value
for each reference.
A single space character separates the three fields, but if no
value is supplied for the keyword, no space follows the keyword.
Tag and keyword names are always returned in upper case.
Quotation marks are stripped from the value, but note that the
value can contain spaces or other special characters (although
by strict HTML rules it probably shouldn't).
A table in the code determines which fields are references to
other files. For example, with <IMG>, SRC= is a reference but
WIDTH= is not. The table is based on the HTML 4.0 standard:
http://www.w3.org/TR/REC-html40/
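
Because the value may itself contain spaces, only the first two spaces in a
generated string are field separators. The following sketch splits each
reference with string scanning; it assumes the library is reachable through
"link html", and the variable names and tab-separated output are choices
made for this example only.

```icon
link html

# List every file/URL reference in an HTML file, one per line.
procedure main(args)
   local f, ref, tag, kw, val

   f := open(args[1]) | stop("usage: refs file.html")

   every ref := htrefs(f) do {
      ref ? {
         tag := tab(upto(' '))              # tag name, upper case
         move(1)                            # skip the separating space
         kw := tab(upto(' ') | 0)           # keyword, upper case
         val := if move(1) then tab(0)      # rest of string; may hold spaces
                else &null                  # keyword given without a value
         }
      write(tag, "\t", kw, "\t", \val | "(no value)")
      }

   close(f)
end
```

A value produced this way can be passed to urlmerge() together with the URL
of the page itself to obtain a full URL.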
###########################################################################
httag(s) extracts and returns the tag name from within an HTML
tag string of the form "<tagname...>". The tag name is returned
in upper case.
###########################################################################
htvals(s) generates the tag values contained within an HTML tag
string of the form "<tagname kw=val kw=val ...>". For each
keyword=value pair beyond the tagname, a string of the form
    keyword value
is generated. One space follows the keyword, which is returned
in upper case, and quotation marks are stripped from the value.
The value itself can be an empty string.
For each keyword given without a value, the keyword is generated
in upper case with no following space.
Parsing is somewhat tolerant of errors.
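
A short sketch of both string procedures applied to one tag follows. The
sample tag is invented for this example, and "link html" is assumed to
find the library.

```icon
link html

procedure main()
   local tag

   # Sample tag, made up for illustration.
   tag := "<img src=\"logo.gif\" width=40 ismap>"

   write(httag(tag))           # the tag name, in upper case
   every write(htvals(tag))    # one line per keyword found in the tag
end
```

Going by the descriptions above, this should write IMG, then
"SRC logo.gif", "WIDTH 40", and "ISMAP", the last with no trailing space
because no value was given.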
###########################################################################
urlmerge(base,new) interprets a full or partial new URL in the
context of a base URL, returning the combined URL.
Here are some examples of applying urlmerge() with a base value
of "http://www.vcu.edu/misc/sched.html" and a new value as given:
    new              result
    --------------   ------------------------------------------
    #tuesday         http://www.vcu.edu/misc/sched.html#tuesday
    bulletin.html    http://www.vcu.edu/misc/bulletin.html
    ./results.html   http://www.vcu.edu/misc/results.html
    images/rs.gif    http://www.vcu.edu/misc/images/rs.gif
    ../              http://www.vcu.edu/
    /greet.html      http://www.vcu.edu/greet.html
    file:a.html      file:a.html
Path components of "./" and "../" at the beginning of the
new URL are handled specially to produce a simpler result.
No other simplifications are applied.
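
The examples can be reproduced with a few lines of Icon. This is only a
usage sketch, again assuming that "link html" finds the library; the
alternation supplies several new values in turn.

```icon
link html

procedure main()
   local base

   base := "http://www.vcu.edu/misc/sched.html"

   # The alternation generates each new value; every drives urlmerge()
   # once per value and writes the merged URL.
   every write(urlmerge(base,
      "#tuesday" | "bulletin.html" | "../" | "/greet.html"))
end
```

Each line written should match the corresponding result listed in the
examples.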
###########################################################################
Procedures:
htc_comment, htc_tag, htc_text, htchunks, htrefs, httag, htvals, url_trim, urlmerge
This file is part of the (main) package.
htc_comment(f)
htc_tag(f)
htc_text(f)
htchunks(f): generate chunks of HTML file
htrefs(f): generate references from HTML file
httag(s): extract name of HTML tag
htvals(s): generate values in HTML tag
url_trim(path)
urlmerge(base, new): merge URLs
This page produced by UniDoc on 2021/04/15 @ 23:59:54.