From: drh <> Date: Thu, 21 Dec 2023 18:08:05 +0000 (+0000) Subject: Add internal core-developer-only documentation of the JSONB format. X-Git-Tag: version-3.45.0~50 X-Git-Url: http://git.ipfire.org/?a=commitdiff_plain;h=c2eff91bd320b0199cd37b0c448466f3262a8f74;p=thirdparty%2Fsqlite.git Add internal core-developer-only documentation of the JSONB format. FossilOrigin-Name: 4d30478863b2a60512010de9ec6e3099bfaf75d4afee20acec536713fe94334d --- diff --git a/doc/jsonb.md b/doc/jsonb.md new file mode 100644 index 0000000000..5beed1631d --- /dev/null +++ b/doc/jsonb.md @@ -0,0 +1,290 @@ +# The JSONB Format + +This document describes SQLite's JSONB binary encoding of +JSON. + +## 1.0 What Is JSONB? + +Beginning with version 3.45.0 (circa 2024-01-01), SQLite supports an +alternative binary encoding of JSON which we call "JSONB". JSONB is +a binary format that stored as a BLOB. + +The advantage of JSONB over ordinary text RFC 8259 JSON is that JSONB +is both slightly smaller (by between 5% and 10% in most cases) and +can be processed in less than half the number of CPU cycles. The built-in +[JSON SQL functions] of SQLite can accept either ordinary text JSON +or the binary JSONB encoding for any of their JSON inputs. + +The "JSONB" name is inspired by [PostgreSQL](https://postgresql.org), but the +on-disk format for SQLite's JSONB is not the same as PostgreSQL's. +The two formats have the same name, but they have wildly different internal +representations and are not in any way binary compatible. + +The central idea behind this JSONB specification is that each element +begins with a header that includes the size and type of that element. +The header takes the place of punctuation such as double-quotes, +curly-brackes, square-brackets, commas, and colons. Since the size +and type of each element is contained in its header, the element can +be read faster since it is no longer necessary to carefully scan forward +looking for the closing delimiter. The payload of JSONB is the same +as for corresponding text JSON. The same payload bytes occur in the +same order. The only real difference between JSONB and ordinary text +JSON is that JSONB includes a binary header on +each element and omits delimiter and separator punctuation. + +### 1.1 Internal Use Only + +The details of the JSONB are not intended to be visible to application +developers. Application developers should look at JSONB as an opaque BLOB +used internally by SQLite. Nevertheless, we want the format to be backwards +compatible across all future versions of SQLite. To that end, the format +is documented by this file in the source tree. But this file should be +used only by SQLite core developers, not by developers of applications +that only use SQLite. + +## 2.0 The Purpose Of This Document + +JSONB is not intended as an external format to be used by +applications. JSONB is designed for internal use by SQLite only. +Programmers do not need to understand the JSONB format in order to +use it effectively. +Applications should access JSONB only through the [JSON SQL functions], +not by looking at individual bytes of the BLOB. + +However, JSONB is intended to be portable and backwards compatible +for all future versions of SQLite. In other words, you should not have +to export and reimport your SQLite database files when you upgrade to +a newer SQLite version. For that reason, the JSONB format needs to +be well-defined. + +This document is therefore similar in purpose to the +[SQLite database file format] document that describes the on-disk +format of an SQLite database file. Applications are not expected +to directly read and write the bits and bytes of SQLite database files. +The SQLite database file format is carefully documented so that it +can be stable and enduring. In the same way, the JSONB representation +of JSON is documented here so that it too can be stable and enduring, +not so that applications can read or writes individual bytes. + +## 3.0 Encoding + +JSONB is a direct translation of the underlying text JSON. The difference +is that JSONB uses a binary encoding that is faster to parse compared to +the detailed syntax of text JSON. + +Each JSON element is encoded as a header and a payload. The header +determines type of element (string, numeric, boolean, null, object, or +array) and the size of the payload. The header can be between 1 and +9 bytes in size. The payload can be any size from zero bytes up to the +maximum allowed BLOB size. + +### 3.1 Payload Size + +The upper four bits of the first byte of the header determine size of the +header and possibly also the size of the payload. +If the upper four bits have a value between 0 and 11, then the header is +exactly one byte in size and the payload size is determined by those +upper four bits. If the upper four bits have a value between 12 and 15, +that means that the total header size is 2, 3, 5, or 9 bytes and the +payload size is unsigned big-endian integer that is contained in the +subsequent bytes. The size integer is the one byte that following the +initial header byte if the upper four bits +are 12, two bytes if the upper bits are 13, four bytes if the upper bits +are 14, and eight bytes if the upper bits are 15. The current design +of SQLite does not support BLOB values larger than 2GiB, so the eight-byte +variant of the payload size integer will never be used by the current code. +The eight-byte payload size integer is included in the specification +to allow for future expansion. + +The header for an element does *not* need to be in its simplest +form. For example, consider the JSON numeric value "`1`". +That element can be encode in five different ways: + + * `0x13 0x31` + * `0xc3 0x01 0x31` + * `0xd3 0x00 0x01 0x31` + * `0xe3 0x00 0x00 0x00 0x01 0x31` + * `0xf3 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x01 0x31` + +The shortest encoding is preferred, of course, and usually happens with +primitive elements such as numbers. However the total size of an array +or object might not be known exactly when the header of the element is +first generated. It is convenient to reserve space for the largest +possible header and then go back and fill in the correct payload size +at the end. This technique can result in array or object headers that +are larger than absolutely necessary. + +### 3.2 Element Type + +The least-significant four bits of the first byte of the header (the first +byte masked against 0x0f) determine element type. The following codes are +used: + +
NULL → +The element is a JSON "null". The payload size for a true JSON NULL must +must be zero. Future versions of SQLite might extend the JSONB format +with elements that have a zero element type but a non-zero size. In that +way, legacy versions of SQLite will interpret the element as a NULL +for backwards compatibility while newer versions will interpret the +element in some other way. + +
TRUE → +The element is a JSON "true". The payload size must be zero for a actual +"true" value. Elements with type 1 and a non-zero payload size are +reserved for future expansion. Legacy implementations that see an element +type of 1 with a non-zero payload size should continue to interpret that +element as "true" for compatibility. + +
FALSE → +The element is a JSON "false". The payload size must be zero for a actual +"false" value. Elements with type 2 and a non-zero payload size are +reserved for future expansion. Legacy implementations that see an element +type of 2 with a non-zero payload size should continue to interpret that +element as "false" for compatibility. + +
INT → +The element is a JSON integer value in the canonical +RFC 8259 format, without extensions. The payload is the ASCII +text representation of that numeric value. + +
INT5 → +The element is a JSON integer value that is not in the +canonical format. The payload is the ASCII +text representation of that numeric value. Because the payload is in a +non-standard format, it will need to be translated when the JSONB is +converted into RFC 8259 text JSON. + +
FLOAT → +The element is a JSON floating-point value in the canonical +RFC 8259 format, without extensions. The payload is the ASCII +text representation of that numeric value. + +
FLOAT5 → +The element is a JSON floating-point value that is not in the +canonical format. The payload is the ASCII +text representation of that numeric value. Because the payload is in a +non-standard format, it will need to be translated when the JSONB is +converted into RFC 8259 text JSON. + +
TEXT → +The element is a JSON string value that does not contain +any escapes nor any characters that need to be escaped for either SQL or +JSON. The payload is the UTF8 text representation of the string value. +The payload does not include string delimiters. + +
TEXTJ → +The element is a JSON string value that contains +RFC 8259 character escapes (such as "\n" or "\u0020"). +Those escapes will need to be translated into actual UTF8 if this element +is [json_extract|extracted] into SQL. +The payload is the UTF8 text representation of the escaped string value. +The payload does not include string delimiters. + +
TEXT5 → +The element is a JSON string value that contains +character escapes, including some character escapes that part of JSON5 +and which are not found in the canonical RFC 8259 spec. +Those escapes will need to be translated into standard JSON prior to +rendering the JSON as text, or into their actual UTF8 characters if this +element is [json_extract|extracted] into SQL. +The payload is the UTF8 text representation of the escaped string value. +The payload does not include string delimiters. + +
TEXTRAW → +The element is a JSON string value that contains +UTF8 characters that need to be escaped if this string is rendered into +standard JSON text. +The payload does not include string delimiters. + +
ARRAY → +The element is a JSON array. The payload contains +JSONB elements that comprise values contained within the array. + +
OBJECT → +The element is a JSON object. The payload contains +pairs of JSONB elements that comprise entries for the JSON object. +The first element in each pair must be a string (types 7 through 10). +The second element of each pair may be any types, including nested +arrays or objects. + +
RESERVED-13 → +Reserved for future expansion. Legacy implements that encounter this +element type should raise an error. + +
RESERVED-14 → +Reserved for future expansion. Legacy implements that encounter this +element type should raise an error. + +
RESERVED-15 → +Reserved for future expansion. Legacy implements that encounter this +element type should raise an error. +