From: drh <> Date: Thu, 21 Dec 2023 18:08:05 +0000 (+0000) Subject: Add internal core-developer-only documentation of the JSONB format. X-Git-Tag: version-3.45.0~50 X-Git-Url: http://git.ipfire.org/?a=commitdiff_plain;h=c2eff91bd320b0199cd37b0c448466f3262a8f74;p=thirdparty%2Fsqlite.git Add internal core-developer-only documentation of the JSONB format. FossilOrigin-Name: 4d30478863b2a60512010de9ec6e3099bfaf75d4afee20acec536713fe94334d --- diff --git a/doc/jsonb.md b/doc/jsonb.md new file mode 100644 index 0000000000..5beed1631d --- /dev/null +++ b/doc/jsonb.md @@ -0,0 +1,290 @@ +# The JSONB Format + +This document describes SQLite's JSONB binary encoding of +JSON. + +## 1.0 What Is JSONB? + +Beginning with version 3.45.0 (circa 2024-01-01), SQLite supports an +alternative binary encoding of JSON which we call "JSONB". JSONB is +a binary format that stored as a BLOB. + +The advantage of JSONB over ordinary text RFC 8259 JSON is that JSONB +is both slightly smaller (by between 5% and 10% in most cases) and +can be processed in less than half the number of CPU cycles. The built-in +[JSON SQL functions] of SQLite can accept either ordinary text JSON +or the binary JSONB encoding for any of their JSON inputs. + +The "JSONB" name is inspired by [PostgreSQL](https://postgresql.org), but the +on-disk format for SQLite's JSONB is not the same as PostgreSQL's. +The two formats have the same name, but they have wildly different internal +representations and are not in any way binary compatible. + +The central idea behind this JSONB specification is that each element +begins with a header that includes the size and type of that element. +The header takes the place of punctuation such as double-quotes, +curly-brackes, square-brackets, commas, and colons. Since the size +and type of each element is contained in its header, the element can +be read faster since it is no longer necessary to carefully scan forward +looking for the closing delimiter. The payload of JSONB is the same +as for corresponding text JSON. The same payload bytes occur in the +same order. The only real difference between JSONB and ordinary text +JSON is that JSONB includes a binary header on +each element and omits delimiter and separator punctuation. + +### 1.1 Internal Use Only + +The details of the JSONB are not intended to be visible to application +developers. Application developers should look at JSONB as an opaque BLOB +used internally by SQLite. Nevertheless, we want the format to be backwards +compatible across all future versions of SQLite. To that end, the format +is documented by this file in the source tree. But this file should be +used only by SQLite core developers, not by developers of applications +that only use SQLite. + +## 2.0 The Purpose Of This Document + +JSONB is not intended as an external format to be used by +applications. JSONB is designed for internal use by SQLite only. +Programmers do not need to understand the JSONB format in order to +use it effectively. +Applications should access JSONB only through the [JSON SQL functions], +not by looking at individual bytes of the BLOB. + +However, JSONB is intended to be portable and backwards compatible +for all future versions of SQLite. In other words, you should not have +to export and reimport your SQLite database files when you upgrade to +a newer SQLite version. For that reason, the JSONB format needs to +be well-defined. + +This document is therefore similar in purpose to the +[SQLite database file format] document that describes the on-disk +format of an SQLite database file. Applications are not expected +to directly read and write the bits and bytes of SQLite database files. +The SQLite database file format is carefully documented so that it +can be stable and enduring. In the same way, the JSONB representation +of JSON is documented here so that it too can be stable and enduring, +not so that applications can read or writes individual bytes. + +## 3.0 Encoding + +JSONB is a direct translation of the underlying text JSON. The difference +is that JSONB uses a binary encoding that is faster to parse compared to +the detailed syntax of text JSON. + +Each JSON element is encoded as a header and a payload. The header +determines type of element (string, numeric, boolean, null, object, or +array) and the size of the payload. The header can be between 1 and +9 bytes in size. The payload can be any size from zero bytes up to the +maximum allowed BLOB size. + +### 3.1 Payload Size + +The upper four bits of the first byte of the header determine size of the +header and possibly also the size of the payload. +If the upper four bits have a value between 0 and 11, then the header is +exactly one byte in size and the payload size is determined by those +upper four bits. If the upper four bits have a value between 12 and 15, +that means that the total header size is 2, 3, 5, or 9 bytes and the +payload size is unsigned big-endian integer that is contained in the +subsequent bytes. The size integer is the one byte that following the +initial header byte if the upper four bits +are 12, two bytes if the upper bits are 13, four bytes if the upper bits +are 14, and eight bytes if the upper bits are 15. The current design +of SQLite does not support BLOB values larger than 2GiB, so the eight-byte +variant of the payload size integer will never be used by the current code. +The eight-byte payload size integer is included in the specification +to allow for future expansion. + +The header for an element does *not* need to be in its simplest +form. For example, consider the JSON numeric value "`1`". +That element can be encode in five different ways: + + * `0x13 0x31` + * `0xc3 0x01 0x31` + * `0xd3 0x00 0x01 0x31` + * `0xe3 0x00 0x00 0x00 0x01 0x31` + * `0xf3 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x01 0x31` + +The shortest encoding is preferred, of course, and usually happens with +primitive elements such as numbers. However the total size of an array +or object might not be known exactly when the header of the element is +first generated. It is convenient to reserve space for the largest +possible header and then go back and fill in the correct payload size +at the end. This technique can result in array or object headers that +are larger than absolutely necessary. + +### 3.2 Element Type + +The least-significant four bits of the first byte of the header (the first +byte masked against 0x0f) determine element type. The following codes are +used: + +
    +
  1. NULL → +The element is a JSON "null". The payload size for a true JSON NULL must +must be zero. Future versions of SQLite might extend the JSONB format +with elements that have a zero element type but a non-zero size. In that +way, legacy versions of SQLite will interpret the element as a NULL +for backwards compatibility while newer versions will interpret the +element in some other way. + +

  2. TRUE → +The element is a JSON "true". The payload size must be zero for a actual +"true" value. Elements with type 1 and a non-zero payload size are +reserved for future expansion. Legacy implementations that see an element +type of 1 with a non-zero payload size should continue to interpret that +element as "true" for compatibility. + +

  3. FALSE → +The element is a JSON "false". The payload size must be zero for a actual +"false" value. Elements with type 2 and a non-zero payload size are +reserved for future expansion. Legacy implementations that see an element +type of 2 with a non-zero payload size should continue to interpret that +element as "false" for compatibility. + +

  4. INT → +The element is a JSON integer value in the canonical +RFC 8259 format, without extensions. The payload is the ASCII +text representation of that numeric value. + +

  5. INT5 → +The element is a JSON integer value that is not in the +canonical format. The payload is the ASCII +text representation of that numeric value. Because the payload is in a +non-standard format, it will need to be translated when the JSONB is +converted into RFC 8259 text JSON. + +

  6. FLOAT → +The element is a JSON floating-point value in the canonical +RFC 8259 format, without extensions. The payload is the ASCII +text representation of that numeric value. + +

  7. FLOAT5 → +The element is a JSON floating-point value that is not in the +canonical format. The payload is the ASCII +text representation of that numeric value. Because the payload is in a +non-standard format, it will need to be translated when the JSONB is +converted into RFC 8259 text JSON. + +

  8. TEXT → +The element is a JSON string value that does not contain +any escapes nor any characters that need to be escaped for either SQL or +JSON. The payload is the UTF8 text representation of the string value. +The payload does not include string delimiters. + +

  9. TEXTJ → +The element is a JSON string value that contains +RFC 8259 character escapes (such as "\n" or "\u0020"). +Those escapes will need to be translated into actual UTF8 if this element +is [json_extract|extracted] into SQL. +The payload is the UTF8 text representation of the escaped string value. +The payload does not include string delimiters. + +

  10. TEXT5 → +The element is a JSON string value that contains +character escapes, including some character escapes that part of JSON5 +and which are not found in the canonical RFC 8259 spec. +Those escapes will need to be translated into standard JSON prior to +rendering the JSON as text, or into their actual UTF8 characters if this +element is [json_extract|extracted] into SQL. +The payload is the UTF8 text representation of the escaped string value. +The payload does not include string delimiters. + +

  11. TEXTRAW → +The element is a JSON string value that contains +UTF8 characters that need to be escaped if this string is rendered into +standard JSON text. +The payload does not include string delimiters. + +

  12. ARRAY → +The element is a JSON array. The payload contains +JSONB elements that comprise values contained within the array. + +

  13. OBJECT → +The element is a JSON object. The payload contains +pairs of JSONB elements that comprise entries for the JSON object. +The first element in each pair must be a string (types 7 through 10). +The second element of each pair may be any types, including nested +arrays or objects. + +

  14. RESERVED-13 → +Reserved for future expansion. Legacy implements that encounter this +element type should raise an error. + +

  15. RESERVED-14 → +Reserved for future expansion. Legacy implements that encounter this +element type should raise an error. + +

  16. RESERVED-15 → +Reserved for future expansion. Legacy implements that encounter this +element type should raise an error. +

+ +Element types outside the range of 0 to 12 are reserved for future +expansion. The current implement raises an error if see an element type +other than those listed above. However, future versions of SQLite might +use of the three remaining element types to implement indexing or similar +optimizations, to speed up lookup against large JSON arrays and/or objects. + +### 3.3 Design Rationale For Element Types + +A key goal of JSONB is that it should be quick to translate +to and from text JSON and/or be constructed from SQL values. +When converting from text into JSONB, we do not want the +converter subroutine to burn CPU cycles converting elements +values into some standard format which might never be used. +Format conversion is "lazy" - it is deferred until actually +needed. This has implications for the JSONB format design: + + 1. Numeric values are stored as text, not a numbers. The values are + a direct copy of the text JSON values from which they are derived. + + 2. There are multiple element types depending on the details of value + formats. For example, INT is used for pure RFC-8259 integer + literals and INT5 exists for JSON5 extensions such as hexadecimal + notation. FLOAT is used for pure RFC-8259 floating point literals + and FLOAT5 is used for JSON5 extensions. There are four different + representations of strings, depending on where the string came from + and how special characters within the string are escaped. + +A second goal of JSONB is that it should be capable of serving as the +"parse tree" for JSON when a JSON value is being processed by the +various [JSON SQL functions] built into SQLite. Before JSONB was +developed, operations such [json_replace()] and [json_patch()] +and similar worked in three stages: + + + 1. Translate the text JSON into a internal format that is + easier to scan and edit. + 2. Perform the requested operation on the JSON. + 3. Translate the internal format back into text. + +JSONB seeks to serve as the internal format directly - bypassing +the first and third stages of that process. Since most of the CPU +cycles are spent on the first and third stages, that suggests that +JSONB processing will be much faster than text JSON processing. + +So when processing JSONB, only the second stage of the three-stage +process is required. But when processing text JSON, it is still necessary +to do stages one and three. If JSONB is to be used as the internal +binary representation, this is yet another reason to store numeric +values as text. Storing numbers as text minimizes the amount of +conversion work needed for stages one and three. This is also why +there are four different representations of text in JSONB. Different +text representations are used for text coming from different sources +(RFC-8259 JSON, JSON5, or SQL string values) and conversions only +happen if and when they are actually needed. + +### 3.4 Valid JSONB BLOBs + +A valid JSONB BLOB consists of a single JSON element. The element must +exactly fill the BLOB. This one element is often a JSON object or array +and those usually contain additional elements as its payload, but the +element can be a primite value such a string, number, boolean, or null. + +When the built-in JSON functions are attempting to determine if a BLOB +argument is a JSONB or just a random BLOB, they look at the header of +the outer element to see that it is well-formed and that the element +completely fills the BLOB. If these conditions are met, then the BLOB +is accepted as a JSONB value. diff --git a/manifest b/manifest index 801a4eda9d..ef0d0902d3 100644 --- a/manifest +++ b/manifest @@ -1,5 +1,5 @@ -C Fix\sSQLITE_ENABLE_SETLK_TIMEOUT\sassert()\sstatements\sin\sos_unix.c\sto\savoid\sreading\spast\sthe\send\sof\sthe\sunixShmNode.aMutex[]\sarray. -D 2023-12-20T19:33:41.679 +C Add\sinternal\score-developer-only\sdocumentation\sof\sthe\sJSONB\sformat. +D 2023-12-21T18:08:05.881 F .fossil-settings/empty-dirs dbb81e8fc0401ac46a1491ab34a7f2c7c0452f2f06b54ebb845d024ca8283ef1 F .fossil-settings/ignore-glob 35175cdfcf539b2318cb04a9901442804be81cd677d8b889fcc9149c21f239ea F LICENSE.md df5091916dbb40e6e9686186587125e1b2ff51f022cc334e886c19a0e9982724 @@ -39,6 +39,7 @@ F contrib/sqlitecon.tcl 210a913ad63f9f991070821e599d600bd913e0ad F doc/F2FS.txt c1d4a0ae9711cfe0e1d8b019d154f1c29e0d3abfe820787ba1e9ed7691160fcd F doc/compile-for-windows.md 50b27d77be96195c66031a3181cb8684ed822327ea834e07f9c014213e5e3bcf F doc/json-enhancements.md e356fc834781f1f1aa22ee300027a270b2c960122468499bf347bb123ce1ea4f +F doc/jsonb.md 5fab4b8613aa9153fbeb6259297bd4697988af8b3d23900deba588fa7841456b F doc/lemon.html 44a53a1d2b42d7751f7b2f478efb23c978e258d794bfd172442307a755b9fa44 F doc/pager-invariants.txt 27fed9a70ddad2088750c4a2b493b63853da2710 F doc/testrunner.md 8d36ec692cf4994bb66d84a4645b9afa1ce9d47dc12cbf8d437c5a5fb6ddeedb @@ -696,7 +697,7 @@ F src/hash.h 3340ab6e1d13e725571d7cee6d3e3135f0779a7d8e76a9ce0a85971fa3953c51 F src/hwtime.h f9c2dfb84dce7acf95ce6d289e46f5f9d3d1afd328e53da8f8e9008e3b3caae6 F src/in-operator.md 10cd8f4bcd225a32518407c2fb2484089112fd71 F src/insert.c 3f0a94082d978bbdd33c38fefea15346c6c6bffb70bc645a71dc0f1f87dd3276 -F src/json.c c6fae579d681d5ab43387c8667cc6ef0c0f8769fd881fde3af6260e837cae762 +F src/json.c 498cbe3346f216e3119e6d01585c3bc09994612c16459f2fa682a1cec61b1cf7 F src/legacy.c d7874bc885906868cd51e6c2156698f2754f02d9eee1bae2d687323c3ca8e5aa F src/loadext.c 7432c944ff197046d67a1207790a1b13eec4548c85a9457eb0896bb3641dfb36 F src/main.c ce714ee501122c76eb2e69b292bebe443aba611fc3b88e6786eb910285515fe4 @@ -2155,8 +2156,8 @@ F vsixtest/vsixtest.tcl 6a9a6ab600c25a91a7acc6293828957a386a8a93 F vsixtest/vsixtest.vcxproj.data 2ed517e100c66dc455b492e1a33350c1b20fbcdc F vsixtest/vsixtest.vcxproj.filters 37e51ffedcdb064aad6ff33b6148725226cd608e F vsixtest/vsixtest_TemporaryKey.pfx e5b1b036facdb453873e7084e1cae9102ccc67a0 -P 206d8c650d937bc700946c40a82a62ea6bc4a80e5f3fb42d0ae2968de25f0644 -R f91549b5ba9f8490543a9270abff43ad -U dan -Z 03359f67466d165815d902ddaee87b98 +P 029a05cd2928d43d81e4549cce5388c432e2c9e75e3fa0b2fe6e91021b2fb9ac +R 37c4deae61c278596f4d5133c7cb9761 +U drh +Z eab6e0f06a97a969bebac69bfb9fe946 # Remove this line to create a well-formed Fossil manifest. diff --git a/manifest.uuid b/manifest.uuid index 4428cae8a4..39fbf7193d 100644 --- a/manifest.uuid +++ b/manifest.uuid @@ -1 +1 @@ -029a05cd2928d43d81e4549cce5388c432e2c9e75e3fa0b2fe6e91021b2fb9ac \ No newline at end of file +4d30478863b2a60512010de9ec6e3099bfaf75d4afee20acec536713fe94334d \ No newline at end of file diff --git a/src/json.c b/src/json.c index 8fae3d1140..e4f658ca7f 100644 --- a/src/json.c +++ b/src/json.c @@ -113,6 +113,9 @@ ** checks are true, the BLOB is assumed to be JSONB and processing continues. ** Errors are only raised if some other miscoding is discovered during ** processing. +** +** Additional information can be found in the doc/jsonb.md file of the +** canonical SQLite source tree. */ #ifndef SQLITE_OMIT_JSON #include "sqliteInt.h"