Add notes on the implementation of the IN operator.

author drh <drh@noemail.net>

Thu, 25 Aug 2016 14:00:15 +0000 (14:00 +0000)

committer drh <drh@noemail.net>

Thu, 25 Aug 2016 14:00:15 +0000 (14:00 +0000)
author drh <drh@noemail.net>
Thu, 25 Aug 2016 14:00:15 +0000 (14:00 +0000)
committer drh <drh@noemail.net>
Thu, 25 Aug 2016 14:00:15 +0000 (14:00 +0000)
diff --git a/manifest b/manifest

index 105c9bfbb28e4de998c2d03a42a5e4bb9bdc2024..b69115a5fc66d0ca33a7ec17d4835ae1a5842e7a 100644 (file)
--- a/manifest
+++ b/manifest
@@ -1,5 +1,5 @@
-C Simplified\sVDBE\scode\sfor\sthe\svector\sNOT\sIN\snull-scanning\sloop.
-D 2016-08-24T21:54:47.055
+C Add\snotes\son\sthe\simplementation\sof\sthe\sIN\soperator.
+D 2016-08-25T14:00:15.437
  F Makefile.in cfd8fb987cd7a6af046daa87daa146d5aad0e088
  F Makefile.linux-gcc 7bc79876b875010e8c8f9502eb935ca92aa3c434
  F Makefile.msc d66d0395c38571aab3804f8db0fa20707ae4609a
@@ -346,6 +346,7 @@ F src/global.c c45ea22aff29334f6a9ec549235ac3357c970015
  F src/hash.c 55b5fb474100cee0b901edaf203e26c970940f36
  F src/hash.h ab34c5c54a9e9de2e790b24349ba5aab3dbb4fd4
  F src/hwtime.h 747c1bbe9df21a92e9c50f3bbec1de841dc5e5da
+F src/in-operator.md 8176075ceca5b1a0564d4aff1e415ff6f9037280
  F src/insert.c a255eb795cf475e7a0659297144fc80f70eb4e30
  F src/legacy.c 75d3023be8f0d2b99d60f905090341a03358c58e
  F src/loadext.c dd7a2b77902cc66c22555aef02e1a682554b7aec
@@ -1520,7 +1521,7 @@ F vsixtest/vsixtest.tcl 6a9a6ab600c25a91a7acc6293828957a386a8a93
  F vsixtest/vsixtest.vcxproj.data 2ed517e100c66dc455b492e1a33350c1b20fbcdc
  F vsixtest/vsixtest.vcxproj.filters 37e51ffedcdb064aad6ff33b6148725226cd608e
  F vsixtest/vsixtest_TemporaryKey.pfx e5b1b036facdb453873e7084e1cae9102ccc67a0
-P bbc1b016164ed0793e07302614384d52119463e0
-R 5a565b60b3cfcc2e549e66fdc0e5eedc
+P 7ae504e62e9bbbbd85a676f3c3922b7fd0cc73d2
+R 85a14e909442e83b4fd5e48b00047186
  U drh
-Z 08a1319918b046fa6beb913b7de3f092
+Z fb24d479cf53b7630adc4dea69c62760
diff --git a/manifest.uuid b/manifest.uuid

index 0459a673e1dee4c1f2b123e8b00acb2402e7de47..0fc9cd5be359e6478ab292307ca322886465051a 100644 (file)
--- a/manifest.uuid
+++ b/manifest.uuid
@@ -1 +1 @@
-7ae504e62e9bbbbd85a676f3c3922b7fd0cc73d2
-\ No newline at end of file
+d256b2caeb9e3eb5dd88bb569ec71f91e9991c81
+\ No newline at end of file
diff --git a/src/in-operator.md b/src/in-operator.md

new file mode 100644 (file)

index 0000000..76002b8
--- /dev/null
+++ b/src/in-operator.md
@@ -0,0 +1,105 @@
+IN-Operator Implementation Notes
+================================
+
+## Definitions:
+
+An IN operator has one of the following formats:
+
+>
+     x IN (list)
+     x IN (subquery)
+
+The "x" is referred to as the LHS (left-hand side).  The list or subquery
+on the right is called the RHS (right-hand side).  If the RHS is a list
+it must be a non-empty list.  But if the RHS is a subquery, it can be an
+empty set.
+
+Both the LHS and RHS can be scalars or vectors.  The two must match.
+In other words, they must both be scalar or else they must both be
+vectors of the same length.
+
+NULL values can occur in either or both of the LHS and RHS.
+If the LHS contains only
+NULL values then we say that it is a "total-NULL".  If the LHS contains
+some NULL values and some non-NULL values, then it is a "partial-NULL".
+For a scalar, there is no difference between a partial-NULL and a total-NULL.
+The RHS is a partial-NULL if any row contains a NULL value.  The RHS is
+a total-NULL if it contains one or more rows that contain only NULL values.
+The LHS is called "non-NULL" if it contains no NULL values.  The RHS is
+called "non-NULL" if it contains no NULL values in any row.
+
+The result of an IN operator is one of TRUE, FALSE, or NULL.  A NULL result
+means that it cannot be determined if the LHS is contained in the RHS due
+to the presence of NULL values.  In some contexts (for example, when the IN
+operator occurs in a WHERE clause)
+the system only needs a binary result: TRUE or NOT-TRUE.  One can also
+to define a binary result of FALSE and NOT-FALSE, but
+it turns out that no extra optimizations are possible in that case, so if
+the FALSE/NOT-FALSE binary is needed, we have to compute the three-state
+TRUE/FALSE/NULL result and then combine the TRUE and NULL values into 
+NOT-FALSE.
+
+A "NOT IN" operator is computed by first computing the equivalent IN
+operator, then interchanging the TRUE and FALSE results.
+
+## Simple Full-Scan Algorithm
+
+The following algorithm always compute the correct answer.  However, this
+algorithm is suboptimal, especially if there are many rows on the RHS.
+
+  1.  Set the null-flag to false
+  2.  For each row in the RHS:
+      <ol type='a'>
+      <li>  Compare the LHS against the RHS
+      <li>  If the LHS exactly matches the RHS, immediately return TRUE
+      <li>  If the comparison result is NULL, set the null-flag to true
+      </ol>
+  3.  If the null-flag is true, return NULL.
+  4.  Return FALSE
+
+## Optimized Algorithm
+
+The following procedure computes the same answer as the simple full-scan
+algorithm, though it does so with less work in the common case.  This
+is the algorithm that is implemented in SQLite.  The steps must occur
+in the order specified.  Except for the INDEX_NOOP optimization of step 1,
+none of the steps can be skipped.
+
+  1.  If the RHS is a constant list of length 1 or 2, then rewrite the
+      IN operator as a simple expression.  Implement
+
+            x IN (y1,y2)
+
+      as if it were
+
+            x=y1 OR x=y2
+
+      This is the INDEX_NOOP optimization and is only undertaken if the
+      IN operator is used for membership testing.  If the IN operator is
+      driving a loop, then skip this step entirely.
+
+  2.  If the RHS is empty, return FALSE.
+
+  3.  If the LHS is a total-NULL or if the RHS contains a total-NULL,
+      then return NULL.
+
+  4.  If the LHS is non-NULL, then use the LHS as a probe in a binary
+      search of the RHS 
+
+      <ol type='a'>
+      <li> If the binary search finds an exact match, return TRUE
+
+      <li> If the RHS is known to be not-null, return FALSE
+      </ol>
+
+  5.  At this point, it is known that the result cannot be TRUE.  All
+      that remains is to distinguish between NULL and FALSE.
+      If a NOT-TRUE result is acceptable, then return NOT-TRUE now.
+
+  6.  For each row in the RHS, compare that row against the LHS and
+      if the result is NULL, immediately return NULL.  This step is
+      essentially the "Simple Full-scan Algorithm" above with the
+      tests for TRUE removed, since we know that the result cannot be
+      TRUE at this point.
+
+  7.  Return FALSE.
author	drh <drh@noemail.net>
	Thu, 25 Aug 2016 14:00:15 +0000 (14:00 +0000)
committer	drh <drh@noemail.net>
	Thu, 25 Aug 2016 14:00:15 +0000 (14:00 +0000)
manifest		patch \| blob \| blame \| history
manifest.uuid		patch \| blob \| blame \| history
src/in-operator.md	[new file with mode: 0644]	patch \| blob