File html.icn

Summary

###########################################################################

	File:     html.icn

	Subject:  Procedures for parsing HTML

	Author:   Gregg M. Townsend

	Date:     July 4, 2000

###########################################################################

   This file is in the public domain.

###########################################################################

	These procedures parse HTML files:

	htchunks(f)	generates the basic chunks -- tags and text --
			that compose an HTML file.

	htrefs(f)	generates the tagname/keyword/value combinations
			that reference other files.

	These procedures process strings from HTML files:

	httag(s)	extracts the name of a tag.

	htvals(s)	generates the keyword/value pairs from a tag.

	urlmerge(base,new) interprets a new URL in the context of a base.

###########################################################################

	htchunks(f) generates the HTML chunks from file f.
	Each generated string begins with

		<!--		for unclosed comments (legal comments are deleted)
		<		for tags (will end with ">" unless unclosed at EOF)
		anything else	for text

	At this level entities such as &amp; are left unprocessed and all
	whitespace is preserved, including newlines.
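
	The three chunk kinds can be told apart by their leading
	characters.  The following is a minimal sketch, assuming the
	library is on the link path as "html" and that "page.html" (a
	hypothetical file name) exists in the current directory:

```icon
link html

# Classify each chunk of an HTML file by its leading characters.
# "page.html" is a hypothetical input file used only for illustration.
procedure main()
   local f, chunk
   f := open("page.html") | stop("cannot open page.html")
   every chunk := htchunks(f) do {
      if match("<!--", chunk) then
         write("unclosed comment: ", image(chunk))
      else if match("<", chunk) then
         write("tag:  ", image(chunk))
      else
         write("text: ", image(chunk))
   }
   close(f)
end
```

	image() is used so that the preserved whitespace and newlines
	remain visible in the output.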

###########################################################################

	htrefs(f) extracts file/URL references from within an HTML file
	and generates a string of the form
		tagname keyword value
	for each reference.

	A single space character separates the three fields, but if no
	value is supplied for the keyword, no space follows the keyword.
	Tag and keyword names are always returned in upper case.

	Quotation marks are stripped from the value, but note that the
	value can contain spaces or other special characters (although
	by strict HTML rules it probably shouldn't).

	A table in the code determines which fields are references to
	other files.  For example, with <IMG>, SRC= is a reference but
	WIDTH= is not.  The table is based on the HTML 4.0 standard:
		http://www.w3.org/TR/REC-html40/
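
	Because the value may itself contain spaces, a caller should
	split each result at the first two spaces only.  A minimal
	sketch, assuming the library is linked as "html" and that the
	file name is given as the first command-line argument:

```icon
link html

# List the references in an HTML file, splitting each result into
# its tag name, keyword, and (possibly empty or absent) value.
procedure main(args)
   local f, ref, tag, kw
   f := open(args[1]) | stop("usage: refs file.html")
   every ref := htrefs(f) do
      ref ? {
         tag := tab(upto(' '))          # tag name ends at the first space
         move(1)
         if kw := tab(upto(' ')) then { # a second space means a value follows
            move(1)
            write(tag, " ", kw, " -> ", image(tab(0)))
         }
         else                           # keyword supplied without a value
            write(tag, " ", tab(0), " (no value)")
      }
   close(f)
end
```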

###########################################################################

	httag(s) extracts and returns the tag name from within an HTML
	tag string of the form "<tagname...>".  The tag name is returned
	in upper case.
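
	A minimal usage sketch, assuming the library is linked as
	"html"; the tag string is made up for illustration.  By the
	description above, the call should write IMG:

```icon
link html

# Extract the tag name from a single tag string.
procedure main()
   write(httag("<img src=\"logo.gif\" width=100>"))
end
```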

###########################################################################

	htvals(s) generates the tag values contained within an HTML tag
	string of the form "<tagname kw=val kw=val ...>".  For each
	keyword=value pair beyond the tagname, a string of the form

		keyword value

	is generated.  One space follows the keyword, which is returned
	in upper case, and quotation marks are stripped from the value.
	The value itself can be an empty string.

	For each keyword given without a value, the keyword is generated
	in upper case with no following space.

	Parsing is somewhat tolerant of errors.
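
	The presence or absence of a space distinguishes a keyword
	with a value (even an empty one) from a keyword given alone.
	A minimal sketch, assuming the library is linked as "html";
	the tag string is made up for illustration:

```icon
link html

# Separate each generated result into keyword and value.
procedure main()
   local tag, pair, kw
   tag := "<INPUT type=checkbox name=\"opt\" checked>"
   every pair := htvals(tag) do
      pair ? {
         if kw := tab(upto(' ')) then {   # a space means a value follows
            move(1)
            write(kw, " = ", image(tab(0)))
         }
         else                             # keyword given without a value
            write(tab(0), " (no value)")
      }
end
```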

###########################################################################

	urlmerge(base,new) interprets a full or partial new URL in the
	context of a base URL, returning the combined URL.

	Here are some examples of applying urlmerge() with a base value
	of "http://www.vcu.edu/misc/sched.html" and a new value as given:

	new		result
	-------------	-------------------
	#tuesday	http://www.vcu.edu/misc/sched.html#tuesday
	bulletin.html	http://www.vcu.edu/misc/bulletin.html
	./results.html	http://www.vcu.edu/misc/results.html
	images/rs.gif	http://www.vcu.edu/misc/images/rs.gif
	../		http://www.vcu.edu/
	/greet.html	http://www.vcu.edu/greet.html
	file:a.html	file:a.html

	Path components of "./" and "../" at the beginning of the
	new URL are handled specially to produce a simpler result.
	No other simplifications are applied.
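
	The table above can be regenerated with a short program.  This
	is a minimal sketch, assuming the library is linked as "html";
	the column width of 16 is an arbitrary choice:

```icon
link html

# Apply urlmerge() to each example value of "new" from the table.
procedure main()
   local base, new
   base := "http://www.vcu.edu/misc/sched.html"
   every new := "#tuesday" | "bulletin.html" | "./results.html" |
                "images/rs.gif" | "../" | "/greet.html" | "file:a.html" do
      write(left(new, 16), urlmerge(base, new))
end
```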

###########################################################################
Procedures:
htc_comment, htc_tag, htc_text, htchunks, htrefs, httag, htvals, url_trim, urlmerge

This file is part of the (main) package.

Details
Procedures:

htc_comment(f)

htc_tag(f)

htc_text(f)

htchunks(f): generate chunks of HTML file

htrefs(f): generate references from HTML file

httag(s): extract name of HTML tag

htvals(s): generate values in HTML tag

url_trim(path)

urlmerge(base, new): merge URLs

This page produced by UniDoc on 2021/04/15 @ 23:59:54.