diff --git a/docs/HTMLChecker.md b/docs/HTMLChecker.md
new file mode 100644
index 0000000..a89ae12
--- /dev/null
+++ b/docs/HTMLChecker.md
@@ -0,0 +1,129 @@
+# The Amsterdam HTML Checker
+
+One of the key elements to making conferencing work in the Amsterdam system is the HTML Checker, which is applied to every post,
+pseud, and topic name that is added to the system. This component, which lives in the "htmlcheck" package and subdirectory,
+is derived from an original CommunityWare ASP ActiveX component (written in C++ with ATL and STL), which was then reimplemented
+in Java for Venice, and then reimplemented in Go for Amsterdam.
+
+The component's objective is to balance _safety,_ _expressiveness,_ and _forgiveness._ It ensures that user-generated content
+cannot break page layout, while still allowing limited markup and community-specific linking syntax.
+
+## Functions
+
+The HTML Checker takes the raw input text as typed into the post box by a user, and performs the following functions:
+
+1. Only a limited number of HTML tags are permitted to pass through to the output; all others are "escaped out" by turning
+   their angle brackets into the appropriate HTML entities. The exact set of tags allowed through is configurable.
+2. Text is word-wrapped to a selectable number of columns, which fits with the preformatted blocks that hold posts.
+3. Any HTML tags that are opened and need to be closed, but have not been, are automatically closed. This prevents
+   a user's malformed HTML from affecting the larger site layout.
+4. In a "preview" mode, words in the text are spell-checked by matching them against an internal dictionary, and words
+   not appearing in the dictionary are highlighted in red.
+5. Bare URLs and E-mail addresses appearing in the text are automatically converted into links. These may also be enclosed
+   in angle brackets and converted into links.
+6. User names appearing in either angle brackets or parentheses are converted into links to the user's profile.
+7. Links to other posts, topics, conferences, or even communities may appear in angle brackets, using an established syntax;
+   these are automatically turned into links.
+
+## Configuration
+
+When an instance of the HTML Checker is created with **AmNewHTMLChecker,** the caller specifies a _configuration name_
+for the checker to use. These configurations are expressed in the `configs.yaml` file, which is deserialized at Amsterdam
+startup time. The configuration specifies:
+
+* Basic settings like the word-wrap length and master switches that control the recognition of angle brackets and parentheses.
+* Lists of _output filters_ to apply to visible text and "raw" text.
+* Lists of _rewriters,_ software components that are applied to various pieces of text:
+    * Strings of non-whitespace characters.
+    * Individual words.
+    * "Tags," text enclosed inside angle brackets.
+    * Text enclosed inside parentheses.
+* A _tag set_ name that specifies which HTML tags are allowed through.
+
+## HTML Tag Sets
+
+Individual HTML tags are divided into groups, described as follows:
+
+* _Inline formatting_ tags, such as B, I, EM, and STRONG.
+* _Anchors,_ meaning the A tag.
+* _Block formatting_ tags, such as P, BR, and BLOCKQUOTE.
+* _Active content_ tags, such as EMBED and SCRIPT. These are generally never allowed.
+* _Image map_ tags, such as MAP and AREA. These are generally never allowed.
+* _Document formatting_ tags like HEAD and BODY, as well as metadata tags like META and LINK. These are generally never allowed.
+* _Font format_ tags, which are basically FONT and all the Hx tags.
+* _Form_ tags, such as FORM, INPUT, BUTTON, and SELECT. These are generally never allowed.
+* _Table_ tags, such as TABLE, TR, and TD. These are generally never allowed.
+* _Change markup_ tags, such as DEL and INS.
+* _Frame_ tags such as FRAME, IFRAME, and FRAMESET. These are generally never allowed.
+* _Image_ tags, basically the IMG tag.
+* _Preformatting_ tags, such as PRE and PLAINTEXT. These are generally never allowed.
+* A number of groups of Netscape-specific and Microsoft-specific tags, such as Netscape's (infamous) BLINK tag and Microsoft's
+  (equally infamous) MARQUEE and BGSOUND tags. These are generally never allowed.
+* Certain tags used in server-side and Java server-side markup. These are generally never allowed.
+
+These groups are further aggregated into _tag sets._ The `normal` tag set consists of the following groups:
+
+* Inline formatting
+* Anchor
+* Block format
+* Font format
+* Images
+
+The `restricted` tag set consists of only the "inline formatting" group. It is intended for post pseuds and topic names.
+
+## Rewriters
+
+Rewriters are components identified by registered names that examine a chunk of text, decide if it can be rewritten, and,
+if so, apply markup before and after it as necessary. Examples of rewriters configured in the HTML checker are:
+
+* `emoticon` - Takes certain character patterns and rewrites them as emoji. The patterns it recognizes are configured in the
+  `emoticons.yaml` file.
+* `emoticon_tag` - A variant of the above used inside tag text.
+* `email` - Recognizes an E-mail address and creates a `mailto:` link to it.
+* `url` - Recognizes a URL and creates a link surrounding it.
+* `postlink` - Recognizes a post link and creates a link to it. (The link is created with an `x-postlink:` scheme, which is
+  further fixed up when the post is displayed.)
+* `userlink` - Recognizes a username and creates a link to that user's profile. (The link is created with an `x-userlink:`
+  scheme, which is further fixed up when the post is displayed.)
+
+## Post Links
+
+Post links have a specific syntax, which originated on The WELL and was implemented in WellEngaged before being reimplemented
+in CommunityWare, Venice, and Amsterdam. Post links are always enclosed in angle brackets.
+
+Here are the various forms of post links supported:
+
+* `<45>` - Link to a single post by number, within the current topic.
+* `<13-17>` - Link to a range of posts by number, within the current topic.
+* `<64->` - Link to all posts in the current topic, starting at the specified post number.
+* `<16.>` - Link to another topic by number, within the current conference. The trailing "." is required.
+* `<8.101>` - Link to a single post by number, in a topic by number, in the current conference. (The "range" syntaxes
+  for the post number are also supported.)
+* `` - Link to another conference by "alias" within the same community. The trailing "." is required.
+* `` - Link to another topic by number, in a conference within the same community.
+* `` - Link to a single post by number, in a topic by number, in a conference within the same community.
+  (The "range" syntaxes for the post number are also supported.)
+* `` - Link to another community by "alias." The trailing "!" is required.
+* `` - Link to another conference in another community.
+* `` - Link to another topic by number, in a conference in another community.
+* `` - Link to a single post by number, in a topic by number, in a conference in another community.
+  (The "range" syntaxes for the post number are also supported.)
+
+Any of the post link types that are "fully qualified" (that is, start with a specific community alias) can be concatenated to
+the special Amsterdam `/go/` URL to jump to the specified community, conference, topic, or post(s). This is, in fact, how
+`x-postlink:` URLs are resolved at display time.
+
+## Internal operation
+
+The HTML Checker employs a finite-state machine that examines the input text one byte at a time. Characters in the "current"
+state are saved in a temporary buffer before being written to the main output buffer when the state changes, possibly
+after having been modified by a rewriter.
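As a rough illustration of the buffering described above, the following Go sketch scans its input one byte at a time, collects the current run in a temporary buffer, and flushes that buffer to the output whenever the state changes. It is not the Amsterdam implementation: `scan`, the `rewriter` type, and the two states are hypothetical names invented for this example, and the actual checker presumably tracks more states than whitespace and text.

```go
package main

import (
	"fmt"
	"strings"
)

// state identifies the kind of run the scanner is currently inside.
type state int

const (
	stateSpace state = iota // a run of whitespace
	stateText               // a run of non-whitespace characters
)

// rewriter is a hypothetical hook: it receives one finished run of
// non-whitespace text and returns the text to emit in its place.
type rewriter func(chunk string) string

// scan walks the input one byte at a time, saves the current run in a
// temporary buffer, and flushes it to the output when the state changes.
func scan(input string, rw rewriter) string {
	var out, tmp strings.Builder
	cur := stateSpace

	flush := func() {
		if tmp.Len() == 0 {
			return
		}
		chunk := tmp.String()
		if cur == stateText {
			chunk = rw(chunk) // only text runs are offered to the rewriter
		}
		out.WriteString(chunk)
		tmp.Reset()
	}

	for i := 0; i < len(input); i++ {
		b := input[i]
		next := stateText
		if b == ' ' || b == '\t' || b == '\n' || b == '\r' {
			next = stateSpace
		}
		if next != cur {
			flush() // state change: emit the buffered run first
			cur = next
		}
		tmp.WriteByte(b)
	}
	flush() // emit whatever run was still open at end of input
	return out.String()
}

func main() {
	// A toy rewriter that wraps one specific word in <B> tags.
	bold := func(chunk string) string {
		if chunk == "hello" {
			return "<B>" + chunk + "</B>"
		}
		return chunk
	}
	fmt.Println(scan("say hello  world", bold)) // say <B>hello</B>  world
}
```

Holding each run in a temporary buffer until the state changes lets a rewriter see the whole chunk at once, which a rule that matches complete words or URLs needs.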
+
+The `context.Context` value passed to **AmNewHTMLChecker** is checked on every iteration of the main parse loop. If
+its `Err` method returns a non-nil error, the parser is stopped, allowing the HTML checker to respect external timeouts. (The value is
+stored in the HTML Checker itself, which is generally frowned upon, but used in this case to simplify the external
+API since HTML Checker objects are typically scoped to a single request.)
+
+Currently-open tags are managed on an internal stack, which supports the special operation of "remove most recent,"
+which searches the stack from the top down for a specific data element and removes it. This is required because
+HTML tags need not be strictly nested.
diff --git a/docs/tmpnotes.md b/docs/tmpnotes.md
new file mode 100644
index 0000000..08e76fa
--- /dev/null
+++ b/docs/tmpnotes.md
@@ -0,0 +1,47 @@
+# TEMPORARY NOTES
+
+(to be moved elsewhere)
+
+## Amsterdam Identifier Values
+
+Amsterdam identifier values are used for user names, community aliases, and conference aliases, and may be used
+for other purposes in the future. A valid Amsterdam ID consists of characters from the following character set:
+
+* Alphanumerics (A-Z, a-z, 0-9)
+* Dash (-)
+* Underscore (_)
+* Tilde (~)
+* Asterisk (*)
+* Apostrophe (')
+* Dollar sign ($)
+
+All characters are represented in the ISO 8859-1 character set, and may be represented with single-byte encoding
+in UTF-8. Also note that all Amsterdam identifiers are case-insensitive.
+
+### Rationale
+
+The character set was defined starting with the list of characters allowable in URL path components ("pchar" as
+defined in [RFC 3986](https://www.rfc-editor.org/rfc/rfc3986.txt), section 3.3, page 23), minus the percent-encoded
+forms, so that Amsterdam identifiers would be usable as "path information" in a URL.
+
+The ampersand (&) was excluded because of its possible confusion with a URL parameter separator, and because it requires HTML escaping.
+
+The at sign (@) was excluded because of possible confusion with E-mail addresses and XMPP identifiers.
+
+The plus sign (+) was excluded because of possible confusion with a URL-encoded space character.
+
+The comma (,) was excluded because of its possible interpretation as a separator character.
+
+The equals sign (=) was excluded because of its possible confusion with a URL parameter/value separator.
+
+The semicolon (;) was excluded because of its possible interpretation as a separator character.
+
+The colon (:) was withheld to provide for a possible future "namespace" expansion (as in XML namespaces).
+
+The parentheses ((, )) were excluded because of possible confusion with user link syntax in conferencing.
+
+The period (.) was excluded because of possible confusion with post link syntax in conferencing.
+
+The exclamation point (!) was excluded because of possible confusion with extended post link syntax in conferencing.
+
+The definition of Amsterdam identifiers was taken almost directly from the definition of Venice identifiers in the predecessor project.
diff --git a/util/stack.go b/util/stack.go
index 4e6233f..4aba52b 100644
--- a/util/stack.go
+++ b/util/stack.go
@@ -53,9 +53,7 @@ func (stk *Stack[T]) RemoveMostRecent(data T) bool {
 			} else if (i + 1) == len(stk.elements) {
 				stk.elements = stk.elements[:i]
 			} else {
-				high := stk.elements[i+1:]
-				stk.elements = stk.elements[:i]
-				stk.elements = append(stk.elements, high...)
+				stk.elements = append(stk.elements[:i], stk.elements[i+1:]...)
 			}
 			return true
 		}
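The one-line form in the `util/stack.go` hunk uses the common Go idiom for deleting element `i` from a slice. The sketch below shows that idiom in a self-contained program. The `Stack` type here is a hypothetical stand-in (with a `comparable` constraint, a `Push` method, and a plain top-down search loop, all assumed for illustration), not the actual contents of the `util` package.

```go
package main

import "fmt"

// Stack is a hypothetical stand-in for the stack type touched by the diff.
// The comparable constraint is an assumption so that == can find a match.
type Stack[T comparable] struct {
	elements []T
}

// Push adds an element to the top of the stack.
func (stk *Stack[T]) Push(data T) {
	stk.elements = append(stk.elements, data)
}

// RemoveMostRecent searches from the top of the stack down for data and
// removes the first match, reporting whether anything was removed.
func (stk *Stack[T]) RemoveMostRecent(data T) bool {
	for i := len(stk.elements) - 1; i >= 0; i-- {
		if stk.elements[i] == data {
			// append copies the tail down over index i within the same
			// backing array, so the slice shrinks by one element.
			stk.elements = append(stk.elements[:i], stk.elements[i+1:]...)
			return true
		}
	}
	return false
}

func main() {
	var open Stack[string]
	open.Push("B")
	open.Push("I")
	open.Push("EM")

	// Close the I tag even though EM was opened after it.
	fmt.Println(open.RemoveMostRecent("I")) // true
	fmt.Println(open.elements)              // [B EM]
}
```

Because `stk.elements[:i]` keeps the capacity of the original slice, the `append` call never needs to allocate. The same expression also gives the right result when the match is the top element, since the tail is then empty.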