From 539f2fef03004a11476235a942615fb49f1ce48a Mon Sep 17 00:00:00 2001 From: drh Date: Thu, 25 Aug 2016 14:00:15 +0000 Subject: [PATCH] Add notes on the implementation of the IN operator. FossilOrigin-Name: d256b2caeb9e3eb5dd88bb569ec71f91e9991c81 --- manifest | 11 ++--- manifest.uuid | 2 +- src/in-operator.md | 105 +++++++++++++++++++++++++++++++++++++++++++++ 3 files changed, 112 insertions(+), 6 deletions(-) create mode 100644 src/in-operator.md diff --git a/manifest b/manifest index 105c9bfbb2..b69115a5fc 100644 --- a/manifest +++ b/manifest @@ -1,5 +1,5 @@ -C Simplified\sVDBE\scode\sfor\sthe\svector\sNOT\sIN\snull-scanning\sloop. -D 2016-08-24T21:54:47.055 +C Add\snotes\son\sthe\simplementation\sof\sthe\sIN\soperator. +D 2016-08-25T14:00:15.437 F Makefile.in cfd8fb987cd7a6af046daa87daa146d5aad0e088 F Makefile.linux-gcc 7bc79876b875010e8c8f9502eb935ca92aa3c434 F Makefile.msc d66d0395c38571aab3804f8db0fa20707ae4609a @@ -346,6 +346,7 @@ F src/global.c c45ea22aff29334f6a9ec549235ac3357c970015 F src/hash.c 55b5fb474100cee0b901edaf203e26c970940f36 F src/hash.h ab34c5c54a9e9de2e790b24349ba5aab3dbb4fd4 F src/hwtime.h 747c1bbe9df21a92e9c50f3bbec1de841dc5e5da +F src/in-operator.md 8176075ceca5b1a0564d4aff1e415ff6f9037280 F src/insert.c a255eb795cf475e7a0659297144fc80f70eb4e30 F src/legacy.c 75d3023be8f0d2b99d60f905090341a03358c58e F src/loadext.c dd7a2b77902cc66c22555aef02e1a682554b7aec @@ -1520,7 +1521,7 @@ F vsixtest/vsixtest.tcl 6a9a6ab600c25a91a7acc6293828957a386a8a93 F vsixtest/vsixtest.vcxproj.data 2ed517e100c66dc455b492e1a33350c1b20fbcdc F vsixtest/vsixtest.vcxproj.filters 37e51ffedcdb064aad6ff33b6148725226cd608e F vsixtest/vsixtest_TemporaryKey.pfx e5b1b036facdb453873e7084e1cae9102ccc67a0 -P bbc1b016164ed0793e07302614384d52119463e0 -R 5a565b60b3cfcc2e549e66fdc0e5eedc +P 7ae504e62e9bbbbd85a676f3c3922b7fd0cc73d2 +R 85a14e909442e83b4fd5e48b00047186 U drh -Z 08a1319918b046fa6beb913b7de3f092 +Z fb24d479cf53b7630adc4dea69c62760 diff --git a/manifest.uuid b/manifest.uuid index 0459a673e1..0fc9cd5be3 100644 --- a/manifest.uuid +++ b/manifest.uuid @@ -1 +1 @@ -7ae504e62e9bbbbd85a676f3c3922b7fd0cc73d2 \ No newline at end of file +d256b2caeb9e3eb5dd88bb569ec71f91e9991c81 \ No newline at end of file diff --git a/src/in-operator.md b/src/in-operator.md new file mode 100644 index 0000000000..76002b8597 --- /dev/null +++ b/src/in-operator.md @@ -0,0 +1,105 @@ +IN-Operator Implementation Notes +================================ + +## Definitions: + +An IN operator has one of the following formats: + +> + x IN (list) + x IN (subquery) + +The "x" is referred to as the LHS (left-hand side). The list or subquery +on the right is called the RHS (right-hand side). If the RHS is a list +it must be a non-empty list. But if the RHS is a subquery, it can be an +empty set. + +Both the LHS and RHS can be scalars or vectors. The two must match. +In other words, they must both be scalar or else they must both be +vectors of the same length. + +NULL values can occur in either or both of the LHS and RHS. +If the LHS contains only +NULL values then we say that it is a "total-NULL". If the LHS contains +some NULL values and some non-NULL values, then it is a "partial-NULL". +For a scalar, there is no difference between a partial-NULL and a total-NULL. +The RHS is a partial-NULL if any row contains a NULL value. The RHS is +a total-NULL if it contains one or more rows that contain only NULL values. +The LHS is called "non-NULL" if it contains no NULL values. The RHS is +called "non-NULL" if it contains no NULL values in any row. + +The result of an IN operator is one of TRUE, FALSE, or NULL. A NULL result +means that it cannot be determined if the LHS is contained in the RHS due +to the presence of NULL values. In some contexts (for example, when the IN +operator occurs in a WHERE clause) +the system only needs a binary result: TRUE or NOT-TRUE. One can also +to define a binary result of FALSE and NOT-FALSE, but +it turns out that no extra optimizations are possible in that case, so if +the FALSE/NOT-FALSE binary is needed, we have to compute the three-state +TRUE/FALSE/NULL result and then combine the TRUE and NULL values into +NOT-FALSE. + +A "NOT IN" operator is computed by first computing the equivalent IN +operator, then interchanging the TRUE and FALSE results. + +## Simple Full-Scan Algorithm + +The following algorithm always compute the correct answer. However, this +algorithm is suboptimal, especially if there are many rows on the RHS. + + 1. Set the null-flag to false + 2. For each row in the RHS: +
    +
  1. Compare the LHS against the RHS +
  2. If the LHS exactly matches the RHS, immediately return TRUE +
  3. If the comparison result is NULL, set the null-flag to true +
+ 3. If the null-flag is true, return NULL. + 4. Return FALSE + +## Optimized Algorithm + +The following procedure computes the same answer as the simple full-scan +algorithm, though it does so with less work in the common case. This +is the algorithm that is implemented in SQLite. The steps must occur +in the order specified. Except for the INDEX_NOOP optimization of step 1, +none of the steps can be skipped. + + 1. If the RHS is a constant list of length 1 or 2, then rewrite the + IN operator as a simple expression. Implement + + x IN (y1,y2) + + as if it were + + x=y1 OR x=y2 + + This is the INDEX_NOOP optimization and is only undertaken if the + IN operator is used for membership testing. If the IN operator is + driving a loop, then skip this step entirely. + + 2. If the RHS is empty, return FALSE. + + 3. If the LHS is a total-NULL or if the RHS contains a total-NULL, + then return NULL. + + 4. If the LHS is non-NULL, then use the LHS as a probe in a binary + search of the RHS + +
    +
  1. If the binary search finds an exact match, return TRUE + +
  2. If the RHS is known to be not-null, return FALSE +
+ + 5. At this point, it is known that the result cannot be TRUE. All + that remains is to distinguish between NULL and FALSE. + If a NOT-TRUE result is acceptable, then return NOT-TRUE now. + + 6. For each row in the RHS, compare that row against the LHS and + if the result is NULL, immediately return NULL. This step is + essentially the "Simple Full-scan Algorithm" above with the + tests for TRUE removed, since we know that the result cannot be + TRUE at this point. + + 7. Return FALSE. -- 2.47.2