mostly adding some documentation bits
This commit is contained in:
@@ -0,0 +1,129 @@
|
||||
# The Amsterdam HTML Checker
|
||||
|
||||
One of the key elements to making conferencing work in the Amsterdam system is the HTML Checker, which is applied to every post,
|
||||
pseud, and topic name that are added to the system. This component, which lives in the "htmlcheck" package and subdirectory,
|
||||
is derived from an original CommunityWare ASP ActiveX component (written in C++ with ATL and STL), which was then reimplemented
|
||||
in Java for Venice, and then reimplemented in Go for Amsterdam.
|
||||
|
||||
The component's objective is to balance _safety,_ _expressiveness,_ and _forgiveness._ It ensures that user-generated content
|
||||
cannot break page layout, while still allowing limited markup and community-specific linking syntax.
|
||||
|
||||
## Functions
|
||||
|
||||
The HTML Checker takes the raw input text as typed into the post box by a user, and performs the following functions:
|
||||
|
||||
1. Only a limited number of HTML tags are permitted to pass through to the output; all others are "escaped out" by turning
|
||||
their angle brackets into the appropriate HTML entities. The exact tags that are allowed through is configurable.
|
||||
2. Text is word-wrapped to a selectable number of columns, which fits with the preformatted blocks that hold posts.
|
||||
3. Any HTML tags that are opened and need to be closed, but have not been, are automatically closed. This prevents
|
||||
a user's malformed HTML from affecting the larger site layout.
|
||||
4. In a "preview" mode, words in the text are spell-checked by matching them against an internal dictionary, and words
|
||||
not appearing in the dictionary are highlighted in red.
|
||||
5. Bare URLs and E-mail addresses appearing in the text are automatically converted into links. These may also be enclosed
|
||||
in angle brackets and converted into links.
|
||||
6. User names appearing in either angle brackets or parentheses are converted into links to the user's profile.
|
||||
7. Links to other posts, topics, conferences, or even communities may appear in angle brackets, using an established syntax;
|
||||
these are automatically turned into links.
|
||||
|
||||
## Configuration
|
||||
|
||||
When an instance of the HTML Checker is instantiated with **AmNewHTMLChecker,** the caller specifies a _configuration name_
|
||||
for the checker to use. These configurations are expressed in the `configs.yaml` file, which is deserialized at Amsterdam
|
||||
startup time. The configuration specifies:
|
||||
|
||||
* Basic settings like the word-wrap length and master switches that control the recognition of angle brackets and parentheses.
|
||||
* Lists of _output filters_ to apply to visible text and "raw" text.
|
||||
* Lists of _rewriters,_ software components that are applied to various pieces of text:
|
||||
* Strings of non-whitespace characters.
|
||||
* Individual words.
|
||||
* "Tags," text enclosed inside angle brackets.
|
||||
* Text enclosed inside parentheses.
|
||||
* A _tag set_ name, that specifies which HTML tags are allowed through.
|
||||
|
||||
## HTML Tag Sets
|
||||
|
||||
Individual HTML tags are devided into groups, described as follows:
|
||||
|
||||
* _Inline formatting_ tags, such as B, I, EM, and STRONG.
|
||||
* _Anchors,_ meaning the A tag.
|
||||
* _Block formatting_ tags, such as P, BR, and BLOCKQUOTE.
|
||||
* _Active content_ tags, such as EMBED and SCRIPT. These are generally never allowed.
|
||||
* _Image map_ tags, such as MAP and AREA. These are generally never allowed.
|
||||
* _Document formatting_ tags like HEAD and BODY, as well as metadata tags like META and LINK. These are generally never allowed.
|
||||
* _Font format_ tags, which is basically FONT and all the Hx tags.
|
||||
* _Form_ tags, such as FORM, INPUT, BUTTON, and SELECT. These are generally never allowed.
|
||||
* _Table_ tags, such as TABLE, TR, and TD. These are generally never allowed.
|
||||
* _Change markup_ tags, such as DEL and INS.
|
||||
* _Frame_ tags such as FRAME, IFRAME, and FRAMESET. These are generally never allowed.
|
||||
* _Image_ tags, basically the IMG tag.
|
||||
* _Preformatting_ tags, such as PRE and PLAINTEXT. These are generally never allowed.
|
||||
* A number of groups of Netscape-specific and Microsoft-specific tags, such as Netscape's (infamous) BLINK tag and Microsoft's
|
||||
(equally infamous) MARQUEE and BGSOUND tags. These are generally never allowed.
|
||||
* Certain tags used in server-side and Java server-side markup. These are generally never allowed.
|
||||
|
||||
These groups are further aggregated into _tag sets._ The `normal` tag set consists of the following groups:
|
||||
|
||||
* Inline formatting
|
||||
* Anchor
|
||||
* Block format
|
||||
* Font format
|
||||
* Images
|
||||
|
||||
The `restricted` tag set consists of only the "inline formatting" group. It is intended for post pseuds and topic names.
|
||||
|
||||
## Rewriters
|
||||
|
||||
Rewriters are components identified by registered names that examine a chunk of text, decide if it can be rewritten, and,
|
||||
if so, apply markup before and after it as necessary. Examples of rewriters configured in the HTML checker are:
|
||||
|
||||
* `emoticon` - Takes certain character pattrns and rewrites them as emoji. The patterns it recognizes are configured in the
|
||||
`emoticons.yaml` file.
|
||||
* `emoticon_tag` - A variant of the above used inside tag text.
|
||||
* `email` - Recognizes an E-mail address and creates a `mailto:` link to it.
|
||||
* `url` - Recognize a URL and create a link surrounding it.
|
||||
* `postlink` - Recognize a post link and creates a link to it. (The link is created with an `x-postlink:` schema, which is
|
||||
further fixed up when the post is displayed.)
|
||||
* `userlink` - Recognize a username and create a link to that user's profile. (The link is created with an `x-userlink:`
|
||||
schema, which is further fixed up when the post is displayed.)
|
||||
|
||||
## Post Links
|
||||
|
||||
Post links have a specific syntax, which was originated on The WELL and implemented in WellEngaged before being reimplemented
|
||||
in CommunityWare, Venice, and Ansterdam. Post links are always enclosed in angle brackets.
|
||||
|
||||
Here are the various forms of post links supported:
|
||||
|
||||
* `<45>` - Link to a single post by number, within the current topic.
|
||||
* `<13-17>` - Link to a range of posts by number, within the current topic.
|
||||
* `<64->` - Link to all posts in the current topic, starting at the specified post number.
|
||||
* `<16.>` - Link to another topic by number, within the current conference. The trailing "." is required.
|
||||
* `<8.101>` - Link to a single post by number, in a topic by number, in the current conference. (The "range" syntaxes
|
||||
for the post number are also supported.)
|
||||
* `<Commons.>` - Link to another conference by "alias" within the same community. The trailing "." is required.
|
||||
* `<Altered.6>` - Link to another topic by number, in a conference within the same community.
|
||||
* `<TechSide.8.14>` - Link to a single post by number, in a topic by number, in a conference within the same community.
|
||||
(The "range" syntaxes for the post number are also supported.)
|
||||
* `<minds!>` - Link to another community by "alias." The trailing "!" is required.
|
||||
* `<minds!Commons>` Link to another conference in another community.
|
||||
* `<minds!Altered.9>` Link to another topic by number, in a conference in another community.
|
||||
* `<minds!TechSide.8.14>` - Link to a single post by number, in a topic by number, in a conference in another community.
|
||||
(The "range" syntaxes for the post number are also supported.)
|
||||
|
||||
Any of the post link types that are "fully qualified" (that is, start with a specific community alias) can be concatenated to
|
||||
the special Amsterdam `/go/` URL to jump to the specified community, conference, topic, or post(s). This is, in fact, how
|
||||
`x-postlink:` URLs are resolved at display time.
|
||||
|
||||
## Internal operation
|
||||
|
||||
The HTML Checker employs a finite-state machine examining the input text one byte at a time. Characters in the "current"
|
||||
state are saved in a temporary buffer before being written to the main output buffer when the state changes, possibly
|
||||
after having been modified by a rewriter.
|
||||
|
||||
The `context.Context` value passed to **AmNewHTMLChecker** is checked on every iteration of the main parse loop. If
|
||||
it returns an error, the parser is stopped, allowing the HTML checker to respect external timeouts. (The value is
|
||||
stored in the HTML Checker itself, which is generally frowned upon, but used in this case to simplify the external
|
||||
API since HTML Checker objects are typically scoped to a single request.)
|
||||
|
||||
Currently-open tags are managed on an internal stack, which supports the special operation of "remove most recent,"
|
||||
which searches the stack from the top down for a specific data element and removes it. This is required because
|
||||
HTML tags need not be strictly nested.
|
||||
@@ -0,0 +1,47 @@
|
||||
# TEMPORARY NOTES
|
||||
|
||||
(to be moved elsewhere)
|
||||
|
||||
## Amsterdam Identifier Values
|
||||
|
||||
Amsterdam identifier values are used for user names, community aliases, and conference aliases, and may be used
|
||||
for other purposes in the future. A valid Amsterdam ID consists of characters from the following character set:
|
||||
|
||||
* Alphanumerics (A-Z, a-z, 0-9)
|
||||
* Dash (-)
|
||||
* Underscore (_)
|
||||
* Tilde (~)
|
||||
* Asterisk (*)
|
||||
* Apostrophe (')
|
||||
* Dollar sign ($)
|
||||
|
||||
All characters are represented in the ISO 8859-1 character set, and may be represented with single-byte encoding
|
||||
in UTF-8. Also note that all Amsterdam identifiers are case-insensitive.
|
||||
|
||||
### Rationale
|
||||
|
||||
The character set was defined starting with the list of characters allowable in URL path components ("pchar" as
|
||||
defined in [RFC 3986](https://www.rfc-editor.org/rfc/rfc3986.txt), section 3.3, page 23), minus the percent-encoded
|
||||
forms, so that Amsterdam identifiers would be usable as "path information" in a URL.
|
||||
|
||||
The ampersand (&) was excluded because of its possible confusion with a URL parameter separator, and because it requires HTML escaping.
|
||||
|
||||
The at sign (@) was excluded because of possible confusion with E-mail addresses and XMPP identifiers.
|
||||
|
||||
The plus sign (+) was excluded because of possible confusion with a URL=encoded space character.
|
||||
|
||||
The comma (,) was excluded because of its possible interpretation as a separator character.
|
||||
|
||||
The equals sign (=) was excluded because of its possible confusion with a URL parameter/value separator.
|
||||
|
||||
The semicolon (;) was excluded because of its possible interpretation as a separator character.
|
||||
|
||||
The colon (:) was withheld to provide for a possible future "namespace" expansion (as in XML namespaces).
|
||||
|
||||
The parentheses ((, )) were excluded because of possible confusion with user link syntax in conferencing.
|
||||
|
||||
The period (.) was excluded because of possible confusion with post link syntax in conferencing.
|
||||
|
||||
The exclamation point (!) was excluded because of possible confusion with extended post link syntax in conferencing.
|
||||
|
||||
The definition of Amsterdam identifiers was taken almost directly from the definition of Venice identifiers in the predecessor project.
|
||||
Reference in New Issue
Block a user