mostly adding some documentation bits
This commit is contained in:
@@ -0,0 +1,129 @@
|
|||||||
|
# The Amsterdam HTML Checker
|
||||||
|
|
||||||
|
One of the key elements to making conferencing work in the Amsterdam system is the HTML Checker, which is applied to every post,
|
||||||
|
pseud, and topic name that are added to the system. This component, which lives in the "htmlcheck" package and subdirectory,
|
||||||
|
is derived from an original CommunityWare ASP ActiveX component (written in C++ with ATL and STL), which was then reimplemented
|
||||||
|
in Java for Venice, and then reimplemented in Go for Amsterdam.
|
||||||
|
|
||||||
|
The component's objective is to balance _safety,_ _expressiveness,_ and _forgiveness._ It ensures that user-generated content
|
||||||
|
cannot break page layout, while still allowing limited markup and community-specific linking syntax.
|
||||||
|
|
||||||
|
## Functions
|
||||||
|
|
||||||
|
The HTML Checker takes the raw input text as typed into the post box by a user, and performs the following functions:
|
||||||
|
|
||||||
|
1. Only a limited number of HTML tags are permitted to pass through to the output; all others are "escaped out" by turning
|
||||||
|
their angle brackets into the appropriate HTML entities. The exact tags that are allowed through is configurable.
|
||||||
|
2. Text is word-wrapped to a selectable number of columns, which fits with the preformatted blocks that hold posts.
|
||||||
|
3. Any HTML tags that are opened and need to be closed, but have not been, are automatically closed. This prevents
|
||||||
|
a user's malformed HTML from affecting the larger site layout.
|
||||||
|
4. In a "preview" mode, words in the text are spell-checked by matching them against an internal dictionary, and words
|
||||||
|
not appearing in the dictionary are highlighted in red.
|
||||||
|
5. Bare URLs and E-mail addresses appearing in the text are automatically converted into links. These may also be enclosed
|
||||||
|
in angle brackets and converted into links.
|
||||||
|
6. User names appearing in either angle brackets or parentheses are converted into links to the user's profile.
|
||||||
|
7. Links to other posts, topics, conferences, or even communities may appear in angle brackets, using an established syntax;
|
||||||
|
these are automatically turned into links.
|
||||||
|
|
||||||
|
## Configuration
|
||||||
|
|
||||||
|
When an instance of the HTML Checker is instantiated with **AmNewHTMLChecker,** the caller specifies a _configuration name_
|
||||||
|
for the checker to use. These configurations are expressed in the `configs.yaml` file, which is deserialized at Amsterdam
|
||||||
|
startup time. The configuration specifies:
|
||||||
|
|
||||||
|
* Basic settings like the word-wrap length and master switches that control the recognition of angle brackets and parentheses.
|
||||||
|
* Lists of _output filters_ to apply to visible text and "raw" text.
|
||||||
|
* Lists of _rewriters,_ software components that are applied to various pieces of text:
|
||||||
|
* Strings of non-whitespace characters.
|
||||||
|
* Individual words.
|
||||||
|
* "Tags," text enclosed inside angle brackets.
|
||||||
|
* Text enclosed inside parentheses.
|
||||||
|
* A _tag set_ name, that specifies which HTML tags are allowed through.
|
||||||
|
|
||||||
|
## HTML Tag Sets
|
||||||
|
|
||||||
|
Individual HTML tags are devided into groups, described as follows:
|
||||||
|
|
||||||
|
* _Inline formatting_ tags, such as B, I, EM, and STRONG.
|
||||||
|
* _Anchors,_ meaning the A tag.
|
||||||
|
* _Block formatting_ tags, such as P, BR, and BLOCKQUOTE.
|
||||||
|
* _Active content_ tags, such as EMBED and SCRIPT. These are generally never allowed.
|
||||||
|
* _Image map_ tags, such as MAP and AREA. These are generally never allowed.
|
||||||
|
* _Document formatting_ tags like HEAD and BODY, as well as metadata tags like META and LINK. These are generally never allowed.
|
||||||
|
* _Font format_ tags, which is basically FONT and all the Hx tags.
|
||||||
|
* _Form_ tags, such as FORM, INPUT, BUTTON, and SELECT. These are generally never allowed.
|
||||||
|
* _Table_ tags, such as TABLE, TR, and TD. These are generally never allowed.
|
||||||
|
* _Change markup_ tags, such as DEL and INS.
|
||||||
|
* _Frame_ tags such as FRAME, IFRAME, and FRAMESET. These are generally never allowed.
|
||||||
|
* _Image_ tags, basically the IMG tag.
|
||||||
|
* _Preformatting_ tags, such as PRE and PLAINTEXT. These are generally never allowed.
|
||||||
|
* A number of groups of Netscape-specific and Microsoft-specific tags, such as Netscape's (infamous) BLINK tag and Microsoft's
|
||||||
|
(equally infamous) MARQUEE and BGSOUND tags. These are generally never allowed.
|
||||||
|
* Certain tags used in server-side and Java server-side markup. These are generally never allowed.
|
||||||
|
|
||||||
|
These groups are further aggregated into _tag sets._ The `normal` tag set consists of the following groups:
|
||||||
|
|
||||||
|
* Inline formatting
|
||||||
|
* Anchor
|
||||||
|
* Block format
|
||||||
|
* Font format
|
||||||
|
* Images
|
||||||
|
|
||||||
|
The `restricted` tag set consists of only the "inline formatting" group. It is intended for post pseuds and topic names.
|
||||||
|
|
||||||
|
## Rewriters
|
||||||
|
|
||||||
|
Rewriters are components identified by registered names that examine a chunk of text, decide if it can be rewritten, and,
|
||||||
|
if so, apply markup before and after it as necessary. Examples of rewriters configured in the HTML checker are:
|
||||||
|
|
||||||
|
* `emoticon` - Takes certain character pattrns and rewrites them as emoji. The patterns it recognizes are configured in the
|
||||||
|
`emoticons.yaml` file.
|
||||||
|
* `emoticon_tag` - A variant of the above used inside tag text.
|
||||||
|
* `email` - Recognizes an E-mail address and creates a `mailto:` link to it.
|
||||||
|
* `url` - Recognize a URL and create a link surrounding it.
|
||||||
|
* `postlink` - Recognize a post link and creates a link to it. (The link is created with an `x-postlink:` schema, which is
|
||||||
|
further fixed up when the post is displayed.)
|
||||||
|
* `userlink` - Recognize a username and create a link to that user's profile. (The link is created with an `x-userlink:`
|
||||||
|
schema, which is further fixed up when the post is displayed.)
|
||||||
|
|
||||||
|
## Post Links
|
||||||
|
|
||||||
|
Post links have a specific syntax, which was originated on The WELL and implemented in WellEngaged before being reimplemented
|
||||||
|
in CommunityWare, Venice, and Ansterdam. Post links are always enclosed in angle brackets.
|
||||||
|
|
||||||
|
Here are the various forms of post links supported:
|
||||||
|
|
||||||
|
* `<45>` - Link to a single post by number, within the current topic.
|
||||||
|
* `<13-17>` - Link to a range of posts by number, within the current topic.
|
||||||
|
* `<64->` - Link to all posts in the current topic, starting at the specified post number.
|
||||||
|
* `<16.>` - Link to another topic by number, within the current conference. The trailing "." is required.
|
||||||
|
* `<8.101>` - Link to a single post by number, in a topic by number, in the current conference. (The "range" syntaxes
|
||||||
|
for the post number are also supported.)
|
||||||
|
* `<Commons.>` - Link to another conference by "alias" within the same community. The trailing "." is required.
|
||||||
|
* `<Altered.6>` - Link to another topic by number, in a conference within the same community.
|
||||||
|
* `<TechSide.8.14>` - Link to a single post by number, in a topic by number, in a conference within the same community.
|
||||||
|
(The "range" syntaxes for the post number are also supported.)
|
||||||
|
* `<minds!>` - Link to another community by "alias." The trailing "!" is required.
|
||||||
|
* `<minds!Commons>` Link to another conference in another community.
|
||||||
|
* `<minds!Altered.9>` Link to another topic by number, in a conference in another community.
|
||||||
|
* `<minds!TechSide.8.14>` - Link to a single post by number, in a topic by number, in a conference in another community.
|
||||||
|
(The "range" syntaxes for the post number are also supported.)
|
||||||
|
|
||||||
|
Any of the post link types that are "fully qualified" (that is, start with a specific community alias) can be concatenated to
|
||||||
|
the special Amsterdam `/go/` URL to jump to the specified community, conference, topic, or post(s). This is, in fact, how
|
||||||
|
`x-postlink:` URLs are resolved at display time.
|
||||||
|
|
||||||
|
## Internal operation
|
||||||
|
|
||||||
|
The HTML Checker employs a finite-state machine examining the input text one byte at a time. Characters in the "current"
|
||||||
|
state are saved in a temporary buffer before being written to the main output buffer when the state changes, possibly
|
||||||
|
after having been modified by a rewriter.
|
||||||
|
|
||||||
|
The `context.Context` value passed to **AmNewHTMLChecker** is checked on every iteration of the main parse loop. If
|
||||||
|
it returns an error, the parser is stopped, allowing the HTML checker to respect external timeouts. (The value is
|
||||||
|
stored in the HTML Checker itself, which is generally frowned upon, but used in this case to simplify the external
|
||||||
|
API since HTML Checker objects are typically scoped to a single request.)
|
||||||
|
|
||||||
|
Currently-open tags are managed on an internal stack, which supports the special operation of "remove most recent,"
|
||||||
|
which searches the stack from the top down for a specific data element and removes it. This is required because
|
||||||
|
HTML tags need not be strictly nested.
|
||||||
@@ -0,0 +1,47 @@
|
|||||||
|
# TEMPORARY NOTES
|
||||||
|
|
||||||
|
(to be moved elsewhere)
|
||||||
|
|
||||||
|
## Amsterdam Identifier Values
|
||||||
|
|
||||||
|
Amsterdam identifier values are used for user names, community aliases, and conference aliases, and may be used
|
||||||
|
for other purposes in the future. A valid Amsterdam ID consists of characters from the following character set:
|
||||||
|
|
||||||
|
* Alphanumerics (A-Z, a-z, 0-9)
|
||||||
|
* Dash (-)
|
||||||
|
* Underscore (_)
|
||||||
|
* Tilde (~)
|
||||||
|
* Asterisk (*)
|
||||||
|
* Apostrophe (')
|
||||||
|
* Dollar sign ($)
|
||||||
|
|
||||||
|
All characters are represented in the ISO 8859-1 character set, and may be represented with single-byte encoding
|
||||||
|
in UTF-8. Also note that all Amsterdam identifiers are case-insensitive.
|
||||||
|
|
||||||
|
### Rationale
|
||||||
|
|
||||||
|
The character set was defined starting with the list of characters allowable in URL path components ("pchar" as
|
||||||
|
defined in [RFC 3986](https://www.rfc-editor.org/rfc/rfc3986.txt), section 3.3, page 23), minus the percent-encoded
|
||||||
|
forms, so that Amsterdam identifiers would be usable as "path information" in a URL.
|
||||||
|
|
||||||
|
The ampersand (&) was excluded because of its possible confusion with a URL parameter separator, and because it requires HTML escaping.
|
||||||
|
|
||||||
|
The at sign (@) was excluded because of possible confusion with E-mail addresses and XMPP identifiers.
|
||||||
|
|
||||||
|
The plus sign (+) was excluded because of possible confusion with a URL=encoded space character.
|
||||||
|
|
||||||
|
The comma (,) was excluded because of its possible interpretation as a separator character.
|
||||||
|
|
||||||
|
The equals sign (=) was excluded because of its possible confusion with a URL parameter/value separator.
|
||||||
|
|
||||||
|
The semicolon (;) was excluded because of its possible interpretation as a separator character.
|
||||||
|
|
||||||
|
The colon (:) was withheld to provide for a possible future "namespace" expansion (as in XML namespaces).
|
||||||
|
|
||||||
|
The parentheses ((, )) were excluded because of possible confusion with user link syntax in conferencing.
|
||||||
|
|
||||||
|
The period (.) was excluded because of possible confusion with post link syntax in conferencing.
|
||||||
|
|
||||||
|
The exclamation point (!) was excluded because of possible confusion with extended post link syntax in conferencing.
|
||||||
|
|
||||||
|
The definition of Amsterdam identifiers was taken almost directly from the definition of Venice identifiers in the predecessor project.
|
||||||
+1
-3
@@ -53,9 +53,7 @@ func (stk *Stack[T]) RemoveMostRecent(data T) bool {
|
|||||||
} else if (i + 1) == len(stk.elements) {
|
} else if (i + 1) == len(stk.elements) {
|
||||||
stk.elements = stk.elements[:i]
|
stk.elements = stk.elements[:i]
|
||||||
} else {
|
} else {
|
||||||
high := stk.elements[i+1:]
|
stk.elements = append(stk.elements[:i], stk.elements[i+1:]...)
|
||||||
stk.elements = stk.elements[:i]
|
|
||||||
stk.elements = append(stk.elements, high...)
|
|
||||||
}
|
}
|
||||||
return true
|
return true
|
||||||
}
|
}
|
||||||
|
|||||||
Reference in New Issue
Block a user