lost by storing partial message sections in HI while waiting for reassemble() would be more than
compensated for by not having two instances of zlib.
-When the end of a script is found and the normal flush point has not been found, the current TCP
-segment and all previous segments for the current message section are flushed using a special
-procedure known as partial inspection. From the perspective of Stream (or H2I) a partial inspection
-is a regular flush in every respect.
-
-scan() calls prep_partial_flush() to prepare for the partial inspection. Then it returns a normal
-flush point to Stream at the end of the current TCP segment. Partial inspections perform all of the
-functions of a regular inspection including forwarding data to file processing and detection.
-
-The difference between a partial inspection and a regular inspection is reassemble() saves the
-input data for future reuse. Eventually there will be a regular full inspection of the entire
-message section. reassemble() will accomplish this by combining the input data for the partial
-inspection with later data that complete the message section.
-
-Correct and efficient execution of a full inspection following a partial inspection requires
-special handling of certain functions. Unzipping is only done once in reassemble(). The stored
-input in reassemble() has already been through dechunking and unzipping. Data is forwarded to file
-processing during the partial inspection and duplicate data will not be forwarded again. Some
-of the message body normalization steps are done once during partial inspection with work
-products saved for reuse.
-
-It is possible to do more than one partial inspection of a single message section. Each partial
-inspection is cumulative, covering the new data and all previous data.
-
-Compared to just doing a full inspection, a partial inspection followed by a partial inspection
-will not miss anything. The benefits of partial inspection are in addition to the benefits of a
-full inspection.
-
-The http_inspect partial inspection mechanism is also used by http2_inspect to manage frame
-boundaries. When inspecting HTTP/2, a partial inspection by http_inspect may occur because script
-detection triggered it, because H2I wanted it, or both.
-
-Some applications may be affected by blocks too late scenarios related to seeing part of the
-zero-length chunk. For example a TCP packet that ends with:
-
- 8<CR><LF>abcdefgh<CR><LF>0
-
-might be sufficient to forward the available data ("abcdefgh") to the application even though the
-final <CR><LF> has not been received.
-
-Note that the actual next bytes are uncertain here. The next packet might begin with <CR><LF>, but
-
- 100000<CR><LF>ijklmnopq ...
-
-is another perfectly legal possibility. There is no rule against starting a nonzero chunk length
-with a zero character and some applications reputedly do this.
-
-As a precaution partial inspection is performed when 1) a TCP segment ends inside a possible
-zero-length chunk or 2) chunk processing fails (broken chunk).
-
HttpFlowData is a data class representing all HI information relating to a flow. It serves as
persistent memory between invocations of HI by the framework. It also glues together the inspector,
the client-to-server splitter, and the server-to-client splitter which pass information through the
flow data.
-Message section is a core concept of HI. A message section is a piece of an HTTP message that is
-processed together. There are eight types of message section:
-
-1. Request line (client-to-server start line)
-2. Status line (server-to-client start line)
-3. Headers (all headers after the start line as a group)
-4. Content-Length message body (a block of message data usually not much larger than 16K from a
- body defined by the Content-Length header)
-5. Chunked message body (same but from a chunked body)
-6. Old message body (same but from a response body with no Content-Length header that runs to
- connection close)
-7. HTTP/X message body (same but content taken from an HTTP/2 or HTTP/3 Data frame)
-8. Trailers (all header lines following a chunked or HTTP/X body as a group)
-
-Message sections are represented by message section objects that contain and process them. There
-are twelve message section classes that inherit as follows. An asterisk denotes a virtual class.
-
-1. HttpMsgSection* - top level with all common elements
-2. HttpMsgStart* : HttpMsgSection - common elements of request and status
-3. HttpMsgRequest : HttpMsgStart
-4. HttpMsgStatus : HttpMsgStart
-5. HttpMsgHeadShared* : HttpMsgSection - common elements of headers and trailers
-6. HttpMsgHeader : HttpMsgHeadShared
-7. HttpMsgTrailer : HttpMsgHeadShared
-8. HttpMsgBody* : HttpMsgSection - common elements of message body processing
-9. HttpMsgBodyCl : HttpMsgBody
-10. HttpMsgBodyChunk : HttpMsgBody
-11. HttpMsgBodyOld : HttpMsgBody
-12. HttpMsgBodyHX : HttpMsgBody
-
An HttpTransaction is a container that keeps all the sections of a message together and associates
the request message with the response message. Transactions may be organized into pipelines when an
HTTP pipeline is present. The current transaction and any pipeline live in the flow data. A
The attach_my_transaction() factory method contains all the logic that makes this work. There are
many corner cases. Don't mess with it until you fully understand it.
-Message sections implement the Just-In-Time (JIT) principle for work products. A minimum of
-essential processing is done under process(). Other work products are derived and stored the first
-time detection or some other customer asks for them.
-
-HI also supports defining custom "x-forwarded-for" type headers. In a multi-vendor world, it is
-quite possible that the header name carrying the original client IP could be vendor-specific. This
-is due to the absence of standardization which would otherwise standardize the header name. In such
-a scenario, it is important to provide a configuration with which such x-forwarded-for type headers
-can be introduced to HI. The headers can be introduced with the xff_headers configuration. The
-default value of this configuration is "x-forwarded-for true-client-ip". The default definition
-introduces the two commonly known "x-forwarded-for" type headers and is preferred in the same order
-by the inspector as they are defined, e.g "x-forwarded-for" will be preferred than "true-client-ip"
-if both headers are present in the stream. Every HTTP Header is mapped to an ID internally. The
-custom headers are mapped to a dynamically generated ID and the mapping is appended at the end
-of the mapping of the known HTTP headers. Every HI instance can have its own list of custom
-headers and thus an instance of HTTP header mapping list is also associated with an HI instance.
-
-The Field class is an important tool for managing JIT. It consists of a pointer to a raw message
-field or derived work product with a length field. Various negative length values specify the
-status of the field. For instance STAT_NOTCOMPUTE means the item has not been computed yet,
-STAT_NOTPRESENT means the item does not exist, and STAT_PROBLEMATIC means an attempt to compute the
-item failed. Never dereference the pointer without first checking the length value.
-
-All of these values and more are in http_enums.h which is a general repository for enumerated
-values in HI.
-
-A Field is intended to represent an immutable object. It is either part of the original message
-section or it is a work product that has been derived from the original message section. In the
-former case the original message is constant and there is no reason for a Field value to change. In
-the latter case, once the value has been derived from the original message there is no reason to
-derive it again.
-
-Once Field is set to a non-null value it should never change. The set() functions will assert if
-this rule is disregarded.
-
-Partial inspections have created an exception. Fields may be used to store work products from a
-partial inspection that may be updated by subsequent inspections. The reset() method has been
-provided for this situation. It deletes any owned buffer and reinitializes the Field to null.
-This feature should be used with care to avoid weakening the architecture.
-
-A Field may own the buffer containing the message or it may point to a buffer that belongs to
-someone else. When a Field owning a buffer is deleted the buffer is deleted as well. Ownership is
-determined with the Field is initially set. In general any dynamically allocated buffer should be
-owned by a Field. If you follow this rule you won't need to keep track of allocated buffers or have
-delete[]s all over the place.
-
HI implements flow depth using the request_depth and response_depth parameters. HI seeks to provide
a consistent experience to detection by making flow depth independent of factors that a sender
could easily manipulate, such as header length, chunking, compression, and encodings. The maximum
depth is computed against normalized message body data.
-HttpUri is the class that represents a URI. HttpMsgRequest objects have an HttpUri that is created
-during analyze().
-
-URI normalization is performed during HttpUri construction in four steps.
-
-Step 1: Identify the type of URI.
-
-HI recognizes four types of URI:
-
-1. Asterisk: a lone ‘*’ character signifying that the request does not refer to any resource in
-particular. Often used with the OPTIONS method. This is not normalized.
-
-2. Authority: any URI used with the CONNECT method. The entire URI is treated as an authority.
-
-3. Origin: a URI that begins with a slash. Consists of only an absolute path with no scheme or
-authority present.
-
-4. Absolute: a URI which includes a scheme and a host as well as an absolute path. E.g.
-http://example.com/path/to/resource.
-
-In addition there are malformed URIs that don't meet any of the four types. These are protocol
-errors and will trigger an alert. Because their format is unrecognized they are not normalized.
-
-Step 2: Decompose the URI into its up to six constituent pieces: scheme, host, port, path, query,
-and fragment.
-
-Based on the URI type the overall URI is divided into scheme, authority, and absolute path. The
-authority is subdivided into host and port. The absolute path is subdivided into path, query, and
-fragment.
-
-The raw URI pieces can be accessed via rules. For example: http_raw_uri: query; content: “foo”
-will only match the query portion of the URI.
-
-Step 3: Normalize the individual pieces.
-
-The port is not normalized. The scheme is normalized to lower case. The other four pieces are
-normalized in a fashion similar to 2.X with an important exception. Path-related normalizations
-such as eliminating directory traversals and squeezing out extra slashes are only done for the
-path.
-
-The normalized URI pieces can be accessed via rules. For example: http_uri: path; content:
-“foo/bar”.
-
-Step 4: Stitch the normalized pieces back together into a complete normalized URI.
-
-This allows rules to be written against a normalized whole URI as is done in 2.X.
-
-The procedures for normalizing the individual pieces are mostly identical to 2.X. Some points
-warrant mention:
-
-1. HI considers it to be normal for reserved characters to be percent encoded and does not
-generate an alert. The 119/1 alert is used only for unreserved characters that are found to be
-percent encoded. The ignore_unreserved configuration option allows the user to specify a list of
-unreserved characters that are exempt from this alert.
-
-2. Plus to space substitution is a configuration option. It is not currently limited to the query
-but that would not be a difficult feature to add.
-
-3. The 2.X multi_slash and directory options are combined into a single option called
-simplify_path.
-
-HttpJsNorm class serves as a script Normalizer, and currently has two implementations:
-the Legacy Normalizer and the Enhanced Normalizer.
-
-During message body analysis the Enhanced Normalizer does one of the following:
-1. If Content-Type says its an external script then Normalizer processes the
- whole message body as a script text.
-2. If it is an HTML-page, Normalizer searches for an opening tag and processes
- subsequent bytes in a stream mode, until it finds a closing tag.
- It proceeds and scans the entire message body for inline scripts.
-
-Enhanced Normalizer is a stateful JavaScript whitespace and identifiers normalizer.
-Normalizer will remove all extraneous whitespace and newlines, keeping a single space where
-syntactically necessary. Comments will be removed, but contents of string literals will
-be kept intact. Any string literals, added by the plus operator,
-will be concatenated. This also works for functions that result in string
-literals. Semicolons will be inserted, if not already present, according to ECMAScript
-automatic semicolon insertion rules.
-
-All JavaScript identifier names, except those from the ident_ignore or prop_ignore lists,
-will be substituted with unified names in the following format: var_0000 -> var_ffff.
-The number of unique identifiers available is 65536 names per HTTP transaction. If Normalizer
-overruns the configured limit, built-in alert is generated.
-
-A config option to set the limit manually:
-
- * http_inspect.js_norm_identifier_depth.
-
-Identifiers from the ident_ignore list will be placed as is, without substitution. Starting with
-the listed identifier, any chain of dot accessors, brackets and function calls will be kept
-intact.
-
-For example:
-
- * console.log("bar")
- * document.getElementById("id").text
- * eval("script")
- * foo["bar"]
-
-Ignored identifiers are configured via the following config option that accepts a list of object
-and function names:
-
- * http_inspect.js_norm_ident_ignore = { 'console', 'document', 'eval', 'foo' }
-
-When a variable assignment that 'aliases' an identifier from the list is found,
-the assignment will be tracked and subsequent occurrences of the variable will be
-replaced with the stored value. This substitution will follow JavaScript variable scope
-limits.
-
-For example:
-
- var a = console.log
- a("hello") // will be substituted to 'console.log("hello")'
- a.foo.bar() // will be normalized as 'console.log.foo.bar()'. When variable is 'de-aliased',
- // following identifiers are not normalized, just like identifiers from ident_ignore
-
-When an object is created using a 'new' keyword, and the class/constructor is found in ident_ignore
-list, the object will be tracked, and although its own identifier will be converted to normal form
-its property and function calls will be kept intact, as with ignored identifiers.
-
-For example:
-
- var obj = new Array()
- obj.insert(1,2,3) // will be normalized to var_0000.insert(1,2,3)
-
-For properties and methods of objects that can be created implicitly, there is a
-js_norm_prop_ignore list. All names in the call chain after the first property or
-method from the list has been occurred will not be normalized.
-
-Note that identifiers are normalized by name, i.e. an identifier and a property with the same name
-will be normalized to the same value. However, the ignore lists act separately on identifiers
-and properties.
-
-For example:
-
- http_inspect.js_norm_prop_ignore = { 'split' }
-
- in: "string".toUpperCase().split("").reverse().join("");
- out: "string".var_0000().split("").reverse().join("");
-
-In addition to the scope tracking, JS Normalizer specifically tracks unescape-like JavaScript
-functions (unescape, decodeURI, decodeURIComponent, String.fromCharCode, String.fromCodePoint).
-This allows detection of unescape functions nested within other unescape functions, which is
-a potential indicator of a multilevel obfuscation. The definition of a function call depends on
-identifier substitution, so such identifiers must be included in the ignore list in
-order to use this feature. After determining the unescape sequence, it is decoded into the
-corresponding string, and the name of unescape function will not be present in the output.
-Single-byte escape sequences within the string and template literals which are arguments of
-unescape, decodeURI and decodeURIComponent functions will be decoded according to ISO/IEC 8859-1
-(Latin-1) charset. Except these cases, escape sequences and code points will be decoded to UTF-8
-format.
-
-For example:
-
- unescape('\u0062\u0061\u0072') -> 'bar'
- decodeURI('%62%61%72') -> 'bar'
- decodeURIComponent('\x62\x61\x72') -> 'bar'
- String.fromCharCode(98, 0x0061, 0x72) -> 'bar'
- String.fromCodePoint(65600, 65601, 0x10042) -> '𐁀𐁁𐁂'
-
-Supported formats follow
-
- \xXX
- \uXXXX
- \u{XXXX}
- %XX
- \uXX
- %uXXXX
- decimal code point
- hexadecimal code point
-
-JS Normalizer is able to decode mixed encoding sequences. However, a built-in alert rises
-in such case.
-
-JS Normalizer's syntax parser follows ECMA-262 standard. For various features,
-tracking of variable scope and individual brackets is done in accordance to the standard.
-Additionally, Normalizer enforces standard limits on HTML content in JavaScript:
- * no nesting tags allowed, i.e. two opening tags in a row
- * script closing tag is not allowed in string literals, block comments, regular expression literals, etc.
-
-If source JavaScript is syntactically incorrect (containing a bad token, brackets mismatch,
-HTML-tags, etc) Normalizer fires corresponding built-in rule and abandons the current script,
-though the already-processed data remains in the output buffer.
-
-Enhanced Normalizer supports scripts over multiple PDUs.
-So, if the script is not ended, Normalizer's context is saved in HttpFlowData.
-The script continuation will be processed with the saved context.
-
-In order to support Script Detection feature for inline scripts, Normalizer ensures
-that after reaching the script end (legitimate closing tag or bad token),
-it falls back to an initial state, so that the next script can be processed by the same context.
-
-Algorithm for reassembling chunked message bodies:
-
-NHI parses chunked message bodies using an algorithm based on the HTTP RFC. Chunk headers are not
-included in reassembled message sections and do not count against the message section length. The
-attacker cannot affect split points by adjusting their chunks.
-
-Built-in alerts for chunking are generated for protocol violations and suspicious usages. Many
-irregularities can be compensated for but others cannot. Whenever a fatal problem occurs, NHI
-generates 119:213 HTTP chunk misformatted and converts to a mode very similar to run to connection
-close. The rest of the flow is sent to detection as is. No further attempt is made to dechunk the
-message body or look for the headers that begin the next message. The user should block 119:213
-unless they are willing to run the risk of continuing with no real security.
-
-In addition to 119:213 there will often be a more specific alert based on what went wrong.
-
-From the perspective of NHI, a chunked message body is a sequence of zero or more chunks followed
-by a zero-length chunk. Following the zero-length chunk there will be trailers which may be empty
-(CRLF only).
-
-Each chunk begins with a header and is parsed as follows:
-
-1. Zero or more unexpected CR or LF characters. If any are present 119:234 is generated and
-processing continues.
-
-2. Zero or more unexpected space and tab characters. If any are present 119:214 is generated. If
-five or more are present that is a fatal error as described above and chunk processing stops.
-
-3. Zero or more '0' characters. Leading zeros before other digits are meaningless and ignored. A
-chunk length consisting solely of zeros is the zero-length chunk. Five or more leading zeros
-generate 119:202 regardless of whether the chunk length eventually turns out to be zero or nonzero.
-
-4. The chunk length in hexadecimal format. The chunk length may be zero (see above) but it must be
-present. Both upper and lower case hex letters are acceptable. The 0x prefix for hex numbers is not
-acceptable.
-+
-The goal here is a hexadecimal number followed by CRLF ending the chunk header. Many things may go
-wrong:
-+
-* More than 8 hex digits other than the leading zeros. The number is limited by Snort to fit into
- 32 bits and if it does not that is a fatal error.
-* The CR may be missing, leaving a bare LF as the separator. That generates 119:235 after which
- processing continues normally.
-* There may be one or more trailing spaces or tabs following the number. If any are present 119:214
- is generated after which processing continues normally.
-* There may be chunk options. This is legal and parsing is supported but options are so unexpected
- that they are suspicious. 119:210 is generated.
-* There may be a completely illegal character in the chunk length (other than those mentioned
- above). That is a fatal error.
-* The character following the CR may not be LF. This is a fatal error. This is different from
- similar bare CR errors because it does not provide a transparent data channel. An "innocent"
- sender that implements this error has no way to transmit chunk data that begins with LF.
-
-5. Following the chunk header should be a number of bytes of transparent user data equal to the
-chunk length. This is the part of the chunked message body that is reassembled and inspected.
-Everything else is discarded.
-
-6. Following the chunk data should be CRLF which do not count against the chunk length. These are
-not present for the zero length chunk. If one of the two separators is missing, 119:234 is
-generated and processing continues normally. If there is no separator at all that is a fatal error.
-
-Then we return to #1 as the next chunk begins. In particular extra separators beyond the two
-expected are attributed to the beginning of the next chunk.
-
-MIME processing:
-
-NHI processes request message bodies in MIME format differently from other message bodies. Message
-sections are forwarded to the MIME library instead of being directly input to file processing. The
-library parses the input into individual MIME attachments. This creates a design issue because
-there may be multiple attachments within a single message body section. The email inspectors solve
-this issue by splitting MIME attachments within their stream splitters so that there is only one
-attachment per reassembled packet. This attachment, if it contains a file, is the source material for
-the file_data rule option.
-
-NHI stream splitter does not work this way. It does not consider MIME at all. Split points between
-message sections are never based on MIME or any other type of message body content.
-
-The problem for NHI is that file_data is a singular entity and cannot accomodate multiple
-simultaneous files derived from a message section. NHI resolves this by accumulating the processed
-file attachments in a list and directly calling detection multiple times--once for each file
-attachment installed as file_data.
-
-Rule options:
-
-HttpIpsOption is the base class for http rule options. It supports the parameters field and request
-that are used by some rule options. HttpBufferIpsOption is a rule option that sets a buffer. It
-implements most of the rule options.
-
-Test tool usage instructions:
+===== Partial inspection
-The HI test tool consists of two features. test_output provides extensive information about the
-inner workings of HI. It is strongly focused on showing work products (Fields) rather than being a
-tracing feature. Given a problematic pcap, the developer can see what the input is, how HI
-interprets it, and what the output to rule options will be. Several related configuration options
-(see help) allow the developer to customize the output.
+include::dev_notes_partial_inspection.txt[]
-test_input is provided by the HttpTestInput class. It allows the developer to write tests that
-simulate HTTP messages split into TCP segments at specified points. The tests cover all of splitter
-and inspector and the impact on downstream customers such as detection and file processing. The
-test_input option activates a modified form of test_output. It is not necessary to also specify
-test_output.
+===== Message section
-The test input comes from the file http_test_msgs.txt in the current directory. Enter HTTP test
-message text as you want it to be presented to the StreamSplitter.
+include::dev_notes_message_section.txt[]
-The easiest way to format is to put a blank line between message sections so that each message
-section is its own "paragraph". Within a paragraph the placement of single new lines does not have
-any effect. Format a paragraph any way you are comfortable. Extra blank lines between paragraphs
-also do not have any effect.
+===== Field class
-Each paragraph represents a TCP segment. The splitter can be tested by putting multiple sections in
-the same paragraph (splitter must split) or continuing a section in the next paragraph (splitter
-must search and reassemble).
+include::dev_notes_field_class.txt[]
-Lines beginning with # are comments. Lines beginning with @ are commands. These do not apply to
-lines in the middle of a paragraph. Lines that begin with $ are insert commands - a special class
-of commands that may be used within a paragraph to insert data into the message buffer.
+===== Support for custom xff type headers
-Commands:
- @break resets HTTP Inspect data structures and begins a new test. Use it liberally to prevent
- unrelated tests from interfering with each other.
- @tcpclose simulates a half-duplex TCP close.
- @request and @response set the message direction. Applies to subsequent paragraphs until changed.
- The initial direction is always request and the break command resets the direction to request.
- @fileset <pathname> specifies a file from which the tool will read data into the message buffer.
- This may be used to include a zipped or other binary file into a message body. Data is read
- beginning at the start of the file. The file is closed automatically whenever a new file is
- set or there is a break command.
- @fileskip <decimal number> skips over the specified number of bytes in the included file. This
- must be a positive number. To move backward do a new fileset and skip forward from the
- beginning.
- @<decimal number> sets the test number and hence the test output file name. Applies to subsequent
- sections until changed. Don't reuse numbers.
+include::dev_notes_hff_headers.txt[]
-Insert commands:
- $fill <decimal number> create a paragraph consisting of <number> octets of auto-fill data
- ABCDEFGHIJABC ....
- $fileread <decimal number> read the specified number of bytes from the included file into the
- message buffer.
- $h2preface creates the HTTP/2 connection preface "PRI * HTTP/2.0\r\n\r\nSM\r\n\r\n"
- $h2frameheader <frame_type> <frame_length> <flags> <stream_id> generates an HTTP/2 frame header.
- The frame type may be the frame type name in all lowercase or the numeric frame type code:
- (data|headers|priority|rst_stream|settings|push_promise|ping|goaway|window_update|
- continuation|\{0:9\})
- The frame length is the length of the frame payload, may be in decimal or test tool hex value
- (\xnn, see below under escape sequence for more details)
- The frame flags are represented as a single test tool hex byte (\xnn)
- The stream id is optional. If provided it must be a decimal number. If not included it defaults
- to 0.
+===== URI normalization
-Escape sequences begin with '\'. They may be used within a paragraph or to begin a paragraph.
- \r - carriage return
- \n - linefeed
- \t - tab
- \\ - backslash
- \# - #
- \@ - @
- \$ - $
- \xnn or \Xnn - where nn is a two-digit hexadecimal number. Insert an arbitrary 8-bit number as
- the next character. a-f and A-F are both acceptable.
+include::dev_notes_uri_norm.txt[]
-Data are separated into segments for presentation to the splitter whenever a paragraph ends (blank
-line).
+===== JS normalization
-When the inspector aborts the connection (scan() returns StreamSplitter::ABORT) it does not expect
-to receive any more input from stream on that connection in that direction. Accordingly the test
-tool should not send it any more input. A paragraph of test input expected to result in an abort
-should be the last paragraph. The developer should either start a new test (@break, etc.) or at
-least reverse the direction and not send any more data in the original direction. Sending more data
-after an abort is likely to lead to confusing output that has no bearing on the test.
+include::dev_notes_js_norm.txt[]
-This test tool does not implement the feature of being hardened against bad input. If you write a
-badly formatted or improper test case the program may assert or crash. The responsibility is on the
-developer to get it right.
+===== Reassembling chunked message
-The test tool is designed for single-threaded operation only.
+include::dev_notes_chunked_processing.txt[]
-The test tool is only available when compiled with REG_TEST.
+===== MIME inspection
-NHI has some trace messages available. Trace options follow:
+include::dev_notes_mime_inspection.txt[]
-* trace.module.http_inspect.js_proc turns on messages from script processing flow.
-+
-Verbosity levels:
-+
-1. Script opening tag detected (available in release build)
-2. Attributes of detected script (available in release build)
-3. Normalizer return code (available in release build)
-4. Contexts management (debug build only)
-5. Parser states (debug build only)
-6. Input stream states (debug build only)
+===== HI test tool usage
-* trace.module.http_inspect.js_dump dumps JavaScript data from processing layers.
-+
-Verbosity levels:
-+
-1. js_data buffer as it is being passed to detection (available in release build)
-2. (no messages available currently)
-3. Payload passed to Normalizer (available in release build)
-4. Temporary buffer (debug build only)
-5. Matched token (debug build only)
-6. Identifier substitution (debug build only)
+include::dev_notes_test_tool.txt[]
--- /dev/null
+HttpJsNorm class serves as a script Normalizer, and currently has two implementations:
+the Legacy Normalizer and the Enhanced Normalizer.
+
+During message body analysis the Enhanced Normalizer does one of the following:
+1. If Content-Type says its an external script then Normalizer processes the
+ whole message body as a script text.
+2. If it is an HTML-page, Normalizer searches for an opening tag and processes
+ subsequent bytes in a stream mode, until it finds a closing tag.
+ It proceeds and scans the entire message body for inline scripts.
+
+Enhanced Normalizer is a stateful JavaScript whitespace and identifiers normalizer.
+Normalizer will remove all extraneous whitespace and newlines, keeping a single space where
+syntactically necessary. Comments will be removed, but contents of string literals will
+be kept intact. Any string literals, added by the plus operator,
+will be concatenated. This also works for functions that result in string
+literals. Semicolons will be inserted, if not already present, according to ECMAScript
+automatic semicolon insertion rules.
+
+All JavaScript identifier names, except those from the ident_ignore or prop_ignore lists,
+will be substituted with unified names in the following format: var_0000 -> var_ffff.
+The number of unique identifiers available is 65536 names per HTTP transaction. If Normalizer
+overruns the configured limit, built-in alert is generated.
+
+A config option to set the limit manually:
+
+ * http_inspect.js_norm_identifier_depth.
+
+Identifiers from the ident_ignore list will be placed as is, without substitution. Starting with
+the listed identifier, any chain of dot accessors, brackets and function calls will be kept
+intact.
+
+For example:
+
+ * console.log("bar")
+ * document.getElementById("id").text
+ * eval("script")
+ * foo["bar"]
+
+Ignored identifiers are configured via the following config option that accepts a list of object
+and function names:
+
+ * http_inspect.js_norm_ident_ignore = { 'console', 'document', 'eval', 'foo' }
+
+When a variable assignment that 'aliases' an identifier from the list is found,
+the assignment will be tracked and subsequent occurrences of the variable will be
+replaced with the stored value. This substitution will follow JavaScript variable scope
+limits.
+
+For example:
+
+ var a = console.log
+ a("hello") // will be substituted to 'console.log("hello")'
+ a.foo.bar() // will be normalized as 'console.log.foo.bar()'. When variable is 'de-aliased',
+ // following identifiers are not normalized, just like identifiers from ident_ignore
+
+When an object is created using a 'new' keyword, and the class/constructor is found in ident_ignore
+list, the object will be tracked, and although its own identifier will be converted to normal form
+its property and function calls will be kept intact, as with ignored identifiers.
+
+For example:
+
+ var obj = new Array()
+ obj.insert(1,2,3) // will be normalized to var_0000.insert(1,2,3)
+
+For properties and methods of objects that can be created implicitly, there is a
+js_norm_prop_ignore list. All names in the call chain after the first property or
+method from the list has been occurred will not be normalized.
+
+Note that identifiers are normalized by name, i.e. an identifier and a property with the same name
+will be normalized to the same value. However, the ignore lists act separately on identifiers
+and properties.
+
+For example:
+
+ http_inspect.js_norm_prop_ignore = { 'split' }
+
+ in: "string".toUpperCase().split("").reverse().join("");
+ out: "string".var_0000().split("").reverse().join("");
+
+In addition to the scope tracking, JS Normalizer specifically tracks unescape-like JavaScript
+functions (unescape, decodeURI, decodeURIComponent, String.fromCharCode, String.fromCodePoint).
+This allows detection of unescape functions nested within other unescape functions, which is
+a potential indicator of a multilevel obfuscation. The definition of a function call depends on
+identifier substitution, so such identifiers must be included in the ignore list in
+order to use this feature. After determining the unescape sequence, it is decoded into the
+corresponding string, and the name of unescape function will not be present in the output.
+Single-byte escape sequences within the string and template literals which are arguments of
+unescape, decodeURI and decodeURIComponent functions will be decoded according to ISO/IEC 8859-1
+(Latin-1) charset. Except these cases, escape sequences and code points will be decoded to UTF-8
+format.
+
+For example:
+
+ unescape('\u0062\u0061\u0072') -> 'bar'
+ decodeURI('%62%61%72') -> 'bar'
+ decodeURIComponent('\x62\x61\x72') -> 'bar'
+ String.fromCharCode(98, 0x0061, 0x72) -> 'bar'
+ String.fromCodePoint(65600, 65601, 0x10042) -> '𐁀𐁁𐁂'
+
+Supported formats follow
+
+ \xXX
+ \uXXXX
+ \u{XXXX}
+ %XX
+ \uXX
+ %uXXXX
+ decimal code point
+ hexadecimal code point
+
+JS Normalizer is able to decode mixed encoding sequences. However, a built-in alert rises
+in such case.
+
+JS Normalizer's syntax parser follows ECMA-262 standard. For various features,
+tracking of variable scope and individual brackets is done in accordance to the standard.
+Additionally, Normalizer enforces standard limits on HTML content in JavaScript:
+ * no nesting tags allowed, i.e. two opening tags in a row
+ * script closing tag is not allowed in string literals, block comments, regular expression literals, etc.
+
+If source JavaScript is syntactically incorrect (containing a bad token, brackets mismatch,
+HTML-tags, etc) Normalizer fires corresponding built-in rule and abandons the current script,
+though the already-processed data remains in the output buffer.
+
+Enhanced Normalizer supports scripts over multiple PDUs.
+So, if the script is not ended, Normalizer's context is saved in HttpFlowData.
+The script continuation will be processed with the saved context.
+
+In order to support Script Detection feature for inline scripts, Normalizer ensures
+that after reaching the script end (legitimate closing tag or bad token),
+it falls back to an initial state, so that the next script can be processed by the same context.