From: Arran Cudbard-Bell Date: Mon, 5 Sep 2022 20:24:18 +0000 (-0400) Subject: sbuff-ng X-Git-Url: http://git.ipfire.org/cgi-bin/gitweb.cgi?a=commitdiff_plain;h=c1389574fff8b8ec61e3d61ed95f67aff6f2d9c2;p=thirdparty%2Ffreeradius-server.git sbuff-ng --- diff --git a/doc/antora/modules/developers/pages/sbuff-ng.adoc b/doc/antora/modules/developers/pages/sbuff-ng.adoc new file mode 100644 index 00000000000..84d39fdab62 --- /dev/null +++ b/doc/antora/modules/developers/pages/sbuff-ng.adoc @@ -0,0 +1,192 @@ +# Current sbuff failings + +The major issues with the current sbuffs are + +- Markers need to be released before the function they're created in returns. This means we often create temporary `fr_sbuff_t` on the stack which is inefficient. +- Using pointers often results in abuse, i.e. people storing pointer offsets direct from the sbuff. +- Sbuffs don't support streaming properly. Large parts of the buffer ends up getting pinned in place and we're unable to shift it out. +- Shifting data out of sbuffs is expensive because we need to update every nested sbuff, and every marker associated with them. +- Passing terminals and unescape callbacks into functions leads to massive complexity, and needing to allocate memory to allocate merged tables of token sequences. +- Last matched token isn't stored anywhere, so the caller of a parsing function needs to check the current position of an sbuff against its tokens. +- Returning negative offsets to indicate error positions dosn't work when there are transform functions in the sbuff chain. + +# Fixes +## Marker release + +Current sbuffs require markers to be tracked because they need to be updated as the underlying marker is released. + +Using offsets relative to the start of the current sbuff removes the tracking requirement, and means that the markers no longer need to be updated. + +Sbuffs no long record references to the markers. + +[source,c] +---- +typedef struct { + sbuff_t *parent; + size_t offset; +} sbuff_marker_t; +---- + +## Pointer abuse + +In almost every instance a "legacy" C function that's operating on a standard C buffer takes a buffer and a length. + +Direct access into the sbuff should be restricted. + +A macro should be provided that provides both the length of the data and a pointer into the sbuff as consective function arguments. +This prevents assignments to C pointers as they will result in a syntax error. + +[source,c] +---- +#define sbuff_ptr_and_len(_sbuff_or_marker, _max_len) (sbuff->buff + sbuff_offset(_sbuff_or_marker)), sbuff_max_len(_max_len) +---- + +Standard copy functions could copy data from one "parsing" sbuff to a "printing" sbuff which wrapped a buffer on the stack. +Printing sbuffs could also be setup to do standard escaping for \0, making the string "c safe". + +## Overzealous pinning + +The key to fixing the pinning problem is to allow backtracking. + +Not all sources support backtracking, so we'd need a type of sbuff which can grow a temporary buffer and will allow backtracking +within a given range. This sbuff can be inserted into the stack of sbuffs to buffer a unidirectional stream. + +We probably want an initialisation macro that can determine whether the parent sbuff supports backtracking, and if not allocate +a "backtrack buffer" of a given size. Only the function that wants to backtrack knows when data can be released from the backtrack +buffer so it needs to be responsible for doing that. + +This unfortunately, places responsibility on the developer to add these "backtrack" buffers in functions where processing the +token stream doesn't move forward monotonically. + +A good example of one of these functions is `tmpl_afrom_substr`, where multiple parsing functions attempt matches against known +data types, like mac addresses and integers. In this case `tmpl_afrom_substr` must insert an sbuff into the sbuff chain which +will guarantee sufficient bytes are retainted that it can always backtrack. + +Again, if the parent sbuff supports seeking back to its current position, no underlying buffer need be created, and this decision +can be made transparently. + +### Backtrackable child buffers + +As with parsing functions today, whenever a parse function is entered, it should allocate a new sbuff. This sbuff records the +current offset of its parent and performs the same mapping of pointers as current sbuffs, i.e. + +``` +new->parent_start = parent->current; +new->start = 0; //!< Increases as data is shifted from the start position of the buffer. +new->current = 0; //!< Always 0 on init. +new->end = 0; //!< Maximum offset readable, zero for functions which perform transofmration, or parent->max +``` + +All markers and positions in the child sbuff start from 0. If the child sbuff is shifted, then start indicates the number of +bytes shifted, and is used to calculate positions in the underlying buffer. + +It is impossible to seek a child sbuff past its starting point of zero, because we can't map this to an offset in its parent +sbuff. + +Consider a chain of sbuffs: ` -> -> `. The offsets of `parse sbuff``, will likely +not map 1:1 with `FILE * sbuff`. This is because `` is transforming the text. As soon as one escape sequence +is removed during the work done by the ``, offsets between `FILE * sbuff` and `parse sbuff` are no longer relative. + +In order to be able to map any offset of `` to any offset of `FILE * sbuff` we'd need a lookup table. This isn't feasible. + +As backtracking will be relatively rare, the mechanism we use can be expensive. + +If a backtrack is requested to an offset < `start`, we request a backtrack in the parent to `parent_start`, the parent sbuff +can request a backtrack in its parent to its `parent_start`. In this way the entire chain of sbuffs can backtrack to +an earlier position in the stream being read. + +This will likely result in work being redone, to retransform text which had previously been read in from the file. +This doesn't matter. + +One concern is a `parent_start` offset falling in the middle of an atom in the parent, but if this happens, then it means the +transform function is broken. i.e. if a transform function provides the `\` from `\t` in its output buffer, then the transform +function must be rewritten. The output buffer of a transform sbuff must contain only complete, transformed, atoms. + +## Shifting data out of sbuffs is expensive + +The reason why it's expensive is because the entire chain needs to have its pointers updated. If every sbuff maintains positions +using offsets, only the start and end pointers need to be changed. This gives a real advantage to using markers, as they will +never need to be updated during shifts. Their offset remains relative to `sbuff->start`, even if `sbuff->start changes` .i.e. +`char *current = sbuff->buff + (marker->offset - sbuff->start)`. + +If `sbuff->offset < sbuff->start`, this event would trigger a backtrack. + +## Pasing terminal sequences and escape rules into functions is awful + +Text transformation must be transparent to the parsing function, anything else will not function correctly. + +A trasnformation sbuff takes data from its parent, transforms it, and places it in and output buffer. The transform sbuff's +offsets are valid for its output buffer only. + +A terminal sbuff takes requests for more data and scans the stream coming from its parent sbuff for terminal sequences. +If a terminal sequence is found, it notes the last terminal found in its internal structure, and will refuse to allow data +to be read past the terminal sequence. + +Both these functions could be combined into a single sbuff type, or implemented separately. + +Consider parsing a double quoted string + +[source,c] +---- +int my_parse_function(fr_sbuff_t *in) +{ + fr_sbuff_switch(in) { + case '"': + { + sbuff_term_table[] = { + { "\"" , DOUBLE_QUOTE } + }; + + sbuff_t dquote = FR_SBUFF_TRANSFORM_TERM(in, dquote_transform_func, quote_terminal_table); + + tmpl_afrom_substr(ctx, &dquote, TMPL_TYPE_DOUBLE_QUOTE); + + if (sbuff_last_token(&dquote) != DOUBLE_QUOTE) /* ERROR */ + } + + default: + ... + } +} +---- + +Here `tmpl_afrom_substr` would be completely unaware it was processing a double quoted string unless we told it explicitly +(we need to do that so it can find xlats). All unescaping and terminal sequence location is done transparently, and +`tmpl_afrom_substr` only sees the unescaped byte stream. + +## Negative offsets don't work when transform sbuffs are used + +One key realisation is that returning negative offsets up the call stack will not work when the offsets don't map 1:1 +between parent and child. + +As we only use negative offsets for error printing, it makes most sense to duplicate some of the code from the `fr_strerror` +API. + +[source,c] +---- +/** Pushes an error onto the error stack + * + * @param[in] subject that experienced the error + * @param[in] ctx_len Number of bytes to store before the current position. Might want to have this accept a marker or len. + * @param[in] subject_len Maximum number of bytes to store after the current position. Might want to have this accept a marker or len. + * @param[in] err_fmt The error message. + * @param[in] ... Arguments for err_fmt. + */ +void sbuff_error_printf(sbuff_t *subject, size_t ctx_len, size_t subject_len, char const *err_fmt, ...); +void sbuff_error_printf_push(sbuff_t *sbuff, size_t ctx_len, size_t subject_len, char const *err_fmt, ...); +void sbuff_error(void (error_print_t *)(char const *ptr, size_t len, void *uctx), void *uctx) +#define sbuff_error_foreach(...) +---- + +These functions work similarly to the freeradius strerror functions using Thread Local Storage. + +When `sbuff_error` is called, it calls a callback function multiple times to print any errors stored in the thread local +stack. + +Behind the scenes, when `sbuff_error_printf` is called, it copies `ctx_len` bytes from before the current position, +and `subject_len` bytes after the current position, records the offset of the error in the string it just copied, +and creates/stored the error string from `err_fmt`, and `...`. + +In this way we still get rich, contextful errors, but without passing an offset back up the stack. + +All parse functions can revert to returning `int (0, -1)` instead of `fr_slen_t` as they do now. diff --git a/doc/antora/modules/developers/pages/sbuff-parsing.adoc b/doc/antora/modules/developers/pages/sbuff-parsing.adoc new file mode 100644 index 00000000000..7e32a854970 --- /dev/null +++ b/doc/antora/modules/developers/pages/sbuff-parsing.adoc @@ -0,0 +1,69 @@ +# Parsing with sbuffs + +## Introduction + +The aim of parsing is to convert a token stream into some internal format we can operate on easily. +The style of parsing used in FreeRADIUS is almost exclusively recursive descent parsing. + +In recursive descent parsing functions are used to encapsulate grammar used for parsing a given token +(or sequence of tokens). Each time we navigate to a deeper level in a syntax tree we call a parsing +function which represents the grammar for that level. Whenever we find a terminal for that +grammar (or an obvious syntax error) we return up the call stack to allow the caller to determine +what to do. + +Unfortunately, because we're working in C some conventions must be applied by the developer in order +to make parsing functions broadly compatible and callable from each other. + +This document sets out standard function prototypes and conversions to ensure interoperability. + +## Public functions + +Unlike private parse functions public functions can be called anywhere from within the FreeRADIUS +code base. Public functions **MUST** conform to the set of conventions below. + +- On success the number of bytes must be returned. If consuming no bytes is valid, `0` should + be returned. +- On success the input `fr_sbuff_t` must be advanced by the number of bytes consumed. +- On error the input `fr_sbuff_t` **MUST NOT** be advanced. +- On error a negative integer representing the error location must be returned. + This negative integer is indexed from 1 (not 0). + For example a return value of `-5` with an input string `the lazy brown fox` indicates the error + occurred at `l`. +- On success or error, any sbuff markers added to the input `fr_sbuff_t` must be released. +- The return type of the function should be `fr_slen_t`, this indicates that the function conforms + to the above conventions. + +The easiest way to ensure these conventions are met is to initialise a new `fr_sbuff_t` from the +input `fr_sbuff_t` as soon as the function is entered. + +[source,c] +/** Looks for the sequence `octo`,`puss` in a given string + * + * If `puss` does not proceed after `octo`, this is a syntax error. + * If `octo` isn't found, this is ignored, the caller should verify the return value + * is `> 0` if it requires the `octo`,`puss` token sequence is present. + */ +fr_slen_t my_parse_function(fr_sbuff_t *in) +{ + /* our_in->start = in->p, our_in->p = in->p, our_in->end = in->end */ + fr_sbuff_t our_in = FR_SBUFF(in); + + /* + * Optional token sequence not found. + */ + if (!fr_sbuff_adv_past_str_literal(&our_in, "octo")) return 0; + + /* + * Can't have an octo without a puss, are you crazy?! + * + * Return the parse error as a negative integer and leave + * `in` unchanged. + */ + if (!fr_sbuff_adv_past_str_literal(&our_in, "puss")) FR_SBUFF_RETURN_ERROR(&our_in); /* Returns -5 */ + + /* + * Got an `octo` and a `puss`, inform the caller how + * much we consumed, and update `in`. + */ + FR_SBUFF_RETURN_SET(in, &our_in); /* Returns 8 */ +}