From: dan <dan@noemail.net>
Date: Mon, 9 Oct 2017 19:49:08 +0000 (+0000)
Subject: Add a header comment to wal.c describing the differences between wal and wal2
X-Git-Url: http://git.ipfire.org/cgi-bin/gitweb.cgi?a=commitdiff_plain;h=40927fd61e4db1149e31a65d4ca05d64bb6212dc;p=thirdparty%2Fsqlite.git

Add a header comment to wal.c describing the differences between wal and wal2
mode.

FossilOrigin-Name: 9c80cd202f1c966929b279e18f19c663912686fcf92f03b85a02b9c7e55a0fc6
---
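Note on the STARTUP/RECOVERY rules in the WAL2 NOTES comment added
below: the following is a minimal sketch, not code from wal.c, of how
the nCkpt sequence and the salt/checksum link could be used to decide
which wal files to recover. The struct and function names
(Wal2HdrSketch, wal2NextCkpt, wal2FilesToRecover) are hypothetical.
The rule used to decide which file is older (the newer file carries the
successor nCkpt value) is an assumption; the comment only says that the
nCkpt fields are compared. Step 1 of the recovery procedure (truncating
the second wal file when the first is missing or empty) is not modelled.

```c
#include <stdint.h>

/* Hypothetical summary of one wal file as seen by recovery. These names
** are illustrative only; they are not identifiers from wal.c. */
typedef struct Wal2HdrSketch {
  int valid;              /* Non-zero if the wal header is valid */
  uint32_t nCkpt;         /* nCkpt field from the wal header (0..15) */
  uint32_t salt[2];       /* Salt values from the wal header */
  uint32_t lastCksum[2];  /* Checksum of the final valid frame */
} Wal2HdrSketch;

/* The nCkpt value a writer stores in the header of the wal file it is
** switching to, given the value nCkpt0 in the previous wal file. */
static uint32_t wal2NextCkpt(uint32_t nCkpt0){
  return (nCkpt0 + 1) & 0x0F;
}

/* Return a bitmask of the wal files to recover: bit 0 for the first
** wal file, bit 1 for the second. */
static int wal2FilesToRecover(
  const Wal2HdrSketch *p1,        /* First wal file ("<db>-wal") */
  const Wal2HdrSketch *p2         /* Second wal file ("<db>-wal2") */
){
  const Wal2HdrSketch *pOld, *pNew;
  int maskOld, maskNew;

  /* Only one, or neither, wal file has a valid header. */
  if( !p1->valid && !p2->valid ) return 0;
  if( !p2->valid ) return 1;
  if( !p1->valid ) return 2;

  /* ASSUMPTION: the newer file is the one whose nCkpt is the successor
  ** of the other file's nCkpt. */
  if( p2->nCkpt==wal2NextCkpt(p1->nCkpt) ){
    pOld = p1; maskOld = 1; pNew = p2; maskNew = 2;
  }else{
    pOld = p2; maskOld = 2; pNew = p1; maskNew = 1;
  }

  /* The newer file is used only if its salts are a copy of the checksum
  ** of the final frame in the older file. */
  if( pNew->salt[0]==pOld->lastCksum[0]
   && pNew->salt[1]==pOld->lastCksum[1]
  ){
    return maskOld | maskNew;
  }
  return maskOld;
}
```

Because 16 is even, the "& 0x0F" wrap-around keeps the parity property
described in the comment: starting from 0 in the first wal file, the
first file always sees an even nCkpt and the second always an odd one.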

diff --git a/manifest b/manifest
index aa48e86c68..0f5352281d 100644
--- a/manifest
+++ b/manifest
@@ -1,5 +1,5 @@
-C Ignore\sthe\s*-wal2\sfile\sif\sthe\s*-wal\sfile\sis\szero\sbytes\sin\ssize.
-D 2017-10-07T19:55:37.322
+C Add\sa\sheader\scomment\sto\swal.c\sdescribing\sthe\sdifferences\sbetween\swal\sand\swal2\nmode.
+D 2017-10-09T19:49:08.605
 F Makefile.in 4bc36d913c2e3e2d326d588d72f618ac9788b2fd4b7efda61102611a6495c3ff
 F Makefile.linux-gcc 7bc79876b875010e8c8f9502eb935ca92aa3c434
 F Makefile.msc 6033b51b6aea702ea059f6ab2d47b1d3cef648695f787247dd4fb395fe60673f
@@ -537,7 +537,7 @@ F src/vdbesort.c 731a09e5cb9e96b70c394c1b7cf3860fbe84acca7682e178615eb941a3a0ef2
 F src/vdbetrace.c 48e11ebe040c6b41d146abed2602e3d00d621d7ebe4eb29b0a0f1617fd3c2f6c
 F src/vtab.c 0e4885495172e1bdf54b12cce23b395ac74ef5729031f15e1bc1e3e6b360ed1a
 F src/vxworks.h d2988f4e5a61a4dfe82c6524dd3d6e4f2ce3cdb9
-F src/wal.c 11314f64edbb7613e80290830a7379ff083b511bd1ee97846518c3102ebf5c4f
+F src/wal.c c025455c9d6cf48ca55bd894be4a37a160565de9c845510a048fb9113761b2f4
 F src/wal.h b6063e6be1b03389372f3f32240e99b8ab92c32cdd05aa0e31b30a21e4e41654
 F src/walker.c 3ccfa8637f95355bff61144e01a615b8ef26f79c312880848da73f03367da1e6
 F src/where.c 049522adcf5426f1a8c3ed07be15e1ffa3266afd34e8e7bee64b63e2fbfad0b5
@@ -1657,7 +1657,7 @@ F vsixtest/vsixtest.tcl 6a9a6ab600c25a91a7acc6293828957a386a8a93
 F vsixtest/vsixtest.vcxproj.data 2ed517e100c66dc455b492e1a33350c1b20fbcdc
 F vsixtest/vsixtest.vcxproj.filters 37e51ffedcdb064aad6ff33b6148725226cd608e
 F vsixtest/vsixtest_TemporaryKey.pfx e5b1b036facdb453873e7084e1cae9102ccc67a0
-P 8932b2f1d7e6a26221ea3dea01000832b2d1eb17ac0b70ef6028f9286ae450a3
-R c46e3ca31dc60584535891a8280b1e6f
+P f7360fad51f224f347bb7d263eb89056b27461c278309e00e575a0e8898c9f40
+R 3eebf17818b82033242954f80ad24b93
 U dan
-Z fe4e7b40b37235ec7703f9add7af6779
+Z fe65619fadae77c6923d24a16aa4aa37
diff --git a/manifest.uuid b/manifest.uuid
index bd32a9a8c9..f96be1474c 100644
--- a/manifest.uuid
+++ b/manifest.uuid
@@ -1 +1 @@
-f7360fad51f224f347bb7d263eb89056b27461c278309e00e575a0e8898c9f40
\ No newline at end of file
+9c80cd202f1c966929b279e18f19c663912686fcf92f03b85a02b9c7e55a0fc6
\ No newline at end of file
diff --git a/src/wal.c b/src/wal.c
index 277f674bbb..836e876c5c 100644
--- a/src/wal.c
+++ b/src/wal.c
@@ -101,7 +101,7 @@
 **
 ** To read a page from the database (call it page number P), a reader
 ** first checks the WAL to see if it contains page P.  If so, then the
-** last valid instance of page P that is a followed by a commit frame
+** last valid instance of page P that is followed by a commit frame
 ** or is a commit frame itself becomes the value read.  If the WAL
 ** contains no copies of page P that are valid and which are a commit
 ** frame or are followed by a commit frame, then page P is read from
@@ -229,7 +229,7 @@
 ** and to the wal-index) might be using a different value K1, where K1>K0.
 ** Both readers can use the same hash table and mapping section to get
 ** the correct result.  There may be entries in the hash table with
-** K>K0 but to the first reader, those entries will appear to be unused
+** K>K0, but to the first reader those entries will appear to be unused
 ** slots in the hash table and so the first reader will get an answer as
 ** if no values greater than K0 had ever been inserted into the hash table
 ** in the first place - which is what reader one wants.  Meanwhile, the
@@ -240,6 +240,166 @@
 ** that correspond to frames greater than the new K value are removed
 ** from the hash table at this point.
 */
+
+/*
+** WAL2 NOTES
+**
+** This file also contains the implementation of "wal2" mode - activated
+** using "PRAGMA journal_mode = wal2". Wal2 mode is very similar to wal
+** mode, except that it uses two wal files instead of one. Under some
+** circumstances, wal2 mode provides more concurrency than legacy wal 
+** mode.
+**
+** THE PROBLEM WAL2 SOLVES:
+**
+** In legacy wal mode, if a writer wishes to write to the database while
+** a checkpoint is ongoing, it may append frames to the existing wal file.
+** This means that after the checkpoint has finished, the wal file consists
+** of a large block of checkpointed frames, followed by a block of
+** uncheckpointed frames. In a deployment that features a high volume of
+** write traffic, this may mean that the wal file is never completely
+** checkpointed, and so grows indefinitely.
+**
+** An alternative is to use "PRAGMA wal_checkpoint=RESTART" or similar to
+** force a complete checkpoint of the wal file. But this must:
+**
+**   1) Wait on all existing readers to finish,
+**   2) Wait on any existing writer, and then block all new writers,
+**   3) Do the checkpoint,
+**   4) Wait on any new readers that started during steps 2 and 3. Writers
+**      are still blocked during this step.
+**
+** This means that in order to avoid the wal file growing indefinitely 
+** in a busy system, writers must periodically pause to allow a checkpoint
+** to complete. In a system with long running readers, such pauses may be
+** for a non-trivial amount of time.
+**
+** OVERVIEW OF SOLUTION
+**
+** Wal2 mode uses two wal files. After writers have grown the first wal 
+** file to a pre-configured size, they begin appending transactions to 
+** the second wal file. Once all existing readers are reading snapshots
+** new enough to include the entire first wal file, a checkpointer can
+** checkpoint it.
+**
+** Meanwhile, writers are writing transactions to the second wal file.
+** Once that wal file has grown larger than the pre-configured size, each
+** new writer checks if:
+**
+**    * the first wal file has been checkpointed, and if so, if
+**    * there are no readers still reading from the first wal file (once
+**      it has been checkpointed, new readers read only from the second
+**      wal file).
+**
+** If both these conditions are true, the writer may switch back to the
+** first wal file. Eventually, a checkpointer can checkpoint the second
+** wal file, and so on.
+**
+** The wal file that writers are currently appending to (the one they
+** don't have to check the above two criteria before writing to) is called
+** the "current" wal file.
+**
+** The first wal file takes the same name as the wal file in legacy wal
+** mode systems - "<db>-wal". The second is named "<db>-wal2".
+**
+** WAL FILE FORMAT
+**
+** The file format used for each wal file in wal2 mode is the same as for
+** legacy wal mode, except that the file format field is set to 3021000
+** instead of 3007000.
+**
+** WAL-INDEX FORMAT
+**
+** The wal-index format is also very similar. Even though there are two
+** wal files, there is still a single wal-index shared-memory area (*-shm
+** file with the default unix or win32 VFS). The wal-index header is the
+** same size and, with the following exceptions, has the same format:
+**
+**   * The version field is set to 3021000 instead of 3007000.
+**
+**   * An unused 32-bit field in the legacy wal-index header is
+**     now used to store (a) a single bit indicating which of the
+**     two wal files writers should append to and (b) the number
+**     of frames in the second wal file (31 bits).
+**
+** The first hash table in the wal-index contains entries corresponding
+** to the first HASHTABLE_NPAGE_ONE frames stored in the first wal file.
+** The second hash table in the wal-index contains entries indexing the
+** first HASHTABLE_NPAGE frames in the second wal file. The third hash
+** table contains the next HASHTABLE_NPAGE frames in the first wal file,
+** and so on.
+**
+** LOCKS
+**
+** Read-locks are simpler than for legacy wal mode. There are no locking
+** slots that contain frame numbers. Instead, there are four distinct
+** combinations of read locks a reader may hold:
+**
+**   WAL_LOCK_PART1:       "part" lock on first wal, no lock on second.
+**   WAL_LOCK_PART1_FULL2: "part" lock on first wal, "full" lock on second.
+**   WAL_LOCK_PART2:       no lock on first wal, "part" lock on second.
+**   WAL_LOCK_PART2_FULL1: "full" lock on first wal, "part" lock on second.
+**
+** When a reader reads the wal-index header as part of opening a read
+** transaction, it takes a "part" lock on the current wal file. "Part" 
+** because the wal file may grow while the read transaction is active, in 
+** which case the reader would be reading only part of the wal file. 
+** A part lock prevents a checkpointer from checkpointing the wal file 
+** on which it is held.
+**
+** If there is data in the non-current wal file that has not been 
+** checkpointed, the reader takes a "full" lock on that wal file. A 
+** "full" lock indicates that the reader is using the entire wal file.
+** A full lock prevents a writer from overwriting the wal file on which
+** it is held, but does not prevent a checkpointer from checkpointing 
+** it.
+**
+** There is still a single WRITER and a single CHECKPOINTER lock. The
+** recovery procedure still takes the same exclusive lock on the entire
+** range of SQLITE_SHM_NLOCK shm-locks. This works because the read-locks
+** above use four of the six read-locking slots used by legacy wal mode.
+** See the header comment for function walLockReader() for details.
+**
+** STARTUP/RECOVERY
+**
+** The read and write version fields of the database header in a wal2
+** database are set to 0x03, instead of 0x02 as in legacy wal mode.
+**
+** The wal file format used in wal2 mode is the same as the format used
+** in legacy wal mode. However, in order to support recovery, there are two
+** differences in the way wal file header fields are populated, as follows:
+**
+**   * When the first wal file is first created, the "nCkpt" field in
+**     the wal file header is set to 0. Thereafter, each time the writer
+**     switches wal file, it sets the nCkpt field in the new wal file
+**     header to ((nCkpt0 + 1) & 0x0F), where nCkpt0 is the value in
+**     the previous wal file header. This means that the first wal file
+**     always has an even value in the nCkpt field, and the second wal
+**     file always has an odd value.
+**
+**   * When a writer switches wal file, it sets the salt values in the
+**     new wal file to a copy of the checksum for the final frame in
+**     the previous wal file.
+**
+** Recovery proceeds as follows:
+**
+** 1. Each wal file is recovered separately. Except, if the first wal 
+**    file does not exist or is zero bytes in size, the second wal file
+**    is truncated to zero bytes before it is "recovered".
+**
+** 2. If both wal files contain valid headers, then the nCkpt fields
+**    are compared to see which of the two wal files is older. If the
+**    salt keys in the newer wal file match the final frame checksum
+**    in the older wal file, then both wal files are used. Otherwise,
+**    the newer wal file is ignored.
+**
+** 3. Or, if only one or neither of the wal files has a valid header, 
+**    then only a single or no wal files are recovered into the 
+**    reconstructed wal-index.
+**
+** Refer to header comments for walIndexRecover() for further details.
+*/
+
 #ifndef SQLITE_OMIT_WAL
 
 #include "wal.h"
@@ -1491,6 +1651,10 @@ static int walIndexRecover(Wal *pWal){
         ){
           SWAP(WalIndexHdr, pWal->hdr, hdr);
           walidxSetMxFrame(&pWal->hdr, 1, hdr.mxFrame);
+        }else{
+          walidxSetFile(&pWal->hdr, 1);
+          walidxSetMxFrame(&pWal->hdr, 1, pWal->hdr.mxFrame);
+          walidxSetMxFrame(&pWal->hdr, 0, 0);
         }
       }else