From: dan
Date: Mon, 9 Oct 2017 19:49:08 +0000 (+0000)
Subject: Add a header comment to wal.c describing the differences between wal and wal2
X-Git-Url: http://git.ipfire.org/cgi-bin/gitweb.cgi?a=commitdiff_plain;h=40927fd61e4db1149e31a65d4ca05d64bb6212dc;p=thirdparty%2Fsqlite.git

Add a header comment to wal.c describing the differences between wal and wal2 mode.

FossilOrigin-Name: 9c80cd202f1c966929b279e18f19c663912686fcf92f03b85a02b9c7e55a0fc6
---

diff --git a/manifest b/manifest
index aa48e86c68..0f5352281d 100644
--- a/manifest
+++ b/manifest
@@ -1,5 +1,5 @@
-C Ignore\sthe\s*-wal2\sfile\sif\sthe\s*-wal\sfile\sis\szero\sbytes\sin\ssize.
-D 2017-10-07T19:55:37.322
+C Add\sa\sheader\scomment\sto\swal.c\sdescribing\sthe\sdifferences\sbetween\swal\sand\swal2\nmode.
+D 2017-10-09T19:49:08.605
 F Makefile.in 4bc36d913c2e3e2d326d588d72f618ac9788b2fd4b7efda61102611a6495c3ff
 F Makefile.linux-gcc 7bc79876b875010e8c8f9502eb935ca92aa3c434
 F Makefile.msc 6033b51b6aea702ea059f6ab2d47b1d3cef648695f787247dd4fb395fe60673f
@@ -537,7 +537,7 @@ F src/vdbesort.c 731a09e5cb9e96b70c394c1b7cf3860fbe84acca7682e178615eb941a3a0ef2
 F src/vdbetrace.c 48e11ebe040c6b41d146abed2602e3d00d621d7ebe4eb29b0a0f1617fd3c2f6c
 F src/vtab.c 0e4885495172e1bdf54b12cce23b395ac74ef5729031f15e1bc1e3e6b360ed1a
 F src/vxworks.h d2988f4e5a61a4dfe82c6524dd3d6e4f2ce3cdb9
-F src/wal.c 11314f64edbb7613e80290830a7379ff083b511bd1ee97846518c3102ebf5c4f
+F src/wal.c c025455c9d6cf48ca55bd894be4a37a160565de9c845510a048fb9113761b2f4
 F src/wal.h b6063e6be1b03389372f3f32240e99b8ab92c32cdd05aa0e31b30a21e4e41654
 F src/walker.c 3ccfa8637f95355bff61144e01a615b8ef26f79c312880848da73f03367da1e6
 F src/where.c 049522adcf5426f1a8c3ed07be15e1ffa3266afd34e8e7bee64b63e2fbfad0b5
@@ -1657,7 +1657,7 @@ F vsixtest/vsixtest.tcl 6a9a6ab600c25a91a7acc6293828957a386a8a93
 F vsixtest/vsixtest.vcxproj.data 2ed517e100c66dc455b492e1a33350c1b20fbcdc
 F vsixtest/vsixtest.vcxproj.filters 37e51ffedcdb064aad6ff33b6148725226cd608e
 F vsixtest/vsixtest_TemporaryKey.pfx e5b1b036facdb453873e7084e1cae9102ccc67a0
-P 8932b2f1d7e6a26221ea3dea01000832b2d1eb17ac0b70ef6028f9286ae450a3
-R c46e3ca31dc60584535891a8280b1e6f
+P f7360fad51f224f347bb7d263eb89056b27461c278309e00e575a0e8898c9f40
+R 3eebf17818b82033242954f80ad24b93
 U dan
-Z fe4e7b40b37235ec7703f9add7af6779
+Z fe65619fadae77c6923d24a16aa4aa37
diff --git a/manifest.uuid b/manifest.uuid
index bd32a9a8c9..f96be1474c 100644
--- a/manifest.uuid
+++ b/manifest.uuid
@@ -1 +1 @@
-f7360fad51f224f347bb7d263eb89056b27461c278309e00e575a0e8898c9f40
\ No newline at end of file
+9c80cd202f1c966929b279e18f19c663912686fcf92f03b85a02b9c7e55a0fc6
\ No newline at end of file
diff --git a/src/wal.c b/src/wal.c
index 277f674bbb..836e876c5c 100644
--- a/src/wal.c
+++ b/src/wal.c
@@ -101,7 +101,7 @@
 **
 ** To read a page from the database (call it page number P), a reader
 ** first checks the WAL to see if it contains page P. If so, then the
-** last valid instance of page P that is a followed by a commit frame
+** last valid instance of page P that is followed by a commit frame
 ** or is a commit frame itself becomes the value read. If the WAL
 ** contains no copies of page P that are valid and which are a commit
 ** frame or are followed by a commit frame, then page P is read from
@@ -229,7 +229,7 @@
 ** and to the wal-index) might be using a different value K1, where K1>K0.
 ** Both readers can use the same hash table and mapping section to get
 ** the correct result. There may be entries in the hash table with
-** K>K0 but to the first reader, those entries will appear to be unused
+** K>K0, but to the first reader those entries will appear to be unused
 ** slots in the hash table and so the first reader will get an answer as
 ** if no values greater than K0 had ever been inserted into the hash table
 ** in the first place - which is what reader one wants. Meanwhile, the
@@ -240,6 +240,166 @@
 ** that correspond to frames greater than the new K value are removed
 ** from the hash table at this point.
 */
+
+/*
+** WAL2 NOTES
+**
+** This file also contains the implementation of "wal2" mode - activated
+** using "PRAGMA journal_mode = wal2". Wal2 mode is very similar to wal
+** mode, except that it uses two wal files instead of one. Under some
+** circumstances, wal2 mode provides more concurrency than legacy wal
+** mode.
+**
+** THE PROBLEM WAL2 SOLVES:
+**
+** In legacy wal mode, if a writer wishes to write to the database while
+** a checkpoint is ongoing, it may append frames to the existing wal file.
+** This means that after the checkpoint has finished, the wal file consists
+** of a large block of checkpointed frames, followed by a block of
+** uncheckpointed frames. In a deployment that features a high volume of
+** write traffic, this may mean that the wal file is never completely
+** checkpointed. And so grows indefinitely.
+**
+** An alternative is to use "PRAGMA wal_checkpoint=RESTART" or similar to
+** force a complete checkpoint of the wal file. But this must:
+**
+**   1) Wait on all existing readers to finish,
+**   2) Wait on any existing writer, and then block all new writers,
+**   3) Do the checkpoint,
+**   4) Wait on any new readers that started during steps 2 and 3. Writers
+**      are still blocked during this step.
+**
+** This means that in order to avoid the wal file growing indefinitely
+** in a busy system, writers must periodically pause to allow a checkpoint
+** to complete. In a system with long running readers, such pauses may be
+** for a non-trivial amount of time.
+**
+** OVERVIEW OF SOLUTION
+**
+** Wal2 mode uses two wal files. After writers have grown the first wal
+** file to a pre-configured size, they begin appending transactions to
+** the second wal file. Once all existing readers are reading snapshots
+** new enough to include the entire first wal file, a checkpointer can
+** checkpoint it.
+**
+** Meanwhile, writers are writing transactions to the second wal file.
+** Once that wal file has grown larger than the pre-configured size, each
+** new writer checks if:
+**
+**   * the first wal file has been checkpointed, and if so, if
+**   * there are no readers still reading from the first wal file (once
+**     it has been checkpointed, new readers read only from the second
+**     wal file).
+**
+** If both these conditions are true, the writer may switch back to the
+** first wal file. Eventually, a checkpointer can checkpoint the second
+** wal file, and so on.
+**
+** The wal file that writers are currently appending to (the one they
+** don't have to check the above two criteria before writing to) is called
+** the "current" wal file.
+**
+** The first wal file takes the same name as the wal file in legacy wal
+** mode systems - "-wal". The second is named "-wal2".
+**
+** WAL FILE FORMAT
+**
+** The file format used for each wal file in wal2 mode is the same as for
+** legacy wal mode. Except, the file format field is set to 3021000
+** instead of 3007000.
+**
+** WAL-INDEX FORMAT
+**
+** The wal-index format is also very similar. Even though there are two
+** wal files, there is still a single wal-index shared-memory area (*-shm
+** file with the default unix or win32 VFS). The wal-index header is the
+** same size, with the following exceptions it has the same format:
+**
+**   * The version field is set to 3021000 instead of 3007000.
+**
+**   * An unused 32-bit field in the legacy wal-index header is
+**     now used to store (a) a single bit indicating which of the
+**     two wal files writers should append to and (b) the number
+**     of frames in the second wal file (31 bits).
+**
+** The first hash table in the wal-index contains entries corresponding
+** to the first HASHTABLE_NPAGE_ONE frames stored in the first wal file.
+** The second hash table in the wal-index contains entries indexing the
+** first HASHTABLE_NPAGE frames in the second wal file. The third hash
+** table contains the next HASHTABLE_NPAGE frames in the first wal file,
+** and so on.
+**
+** LOCKS
+**
+** Read-locks are simpler than for legacy wal mode. There are no locking
+** slots that contain frame numbers. Instead, there are four distinct
+** combinations of read locks a reader may hold:
+**
+**   WAL_LOCK_PART1:       "part" lock on first wal, none of second.
+**   WAL_LOCK_PART1_FULL2: "part" lock on first wal, "full" of second.
+**   WAL_LOCK_PART2:       no lock on first wal, "part" lock on second.
+**   WAL_LOCK_PART2_FULL1: "full" lock on first wal, "part" lock on second.
+**
+** When a reader reads the wal-index header as part of opening a read
+** transaction, it takes a "part" lock on the current wal file. "Part"
+** because the wal file may grow while the read transaction is active, in
+** which case the reader would be reading only part of the wal file.
+** A part lock prevents a checkpointer from checkpointing the wal file
+** on which it is held.
+**
+** If there is data in the non-current wal file that has not been
+** checkpointed, the reader takes a "full" lock on that wal file. A
+** "full" lock indicates that the reader is using the entire wal file.
+** A full lock prevents a writer from overwriting the wal file on which
+** it is held, but does not prevent a checkpointer from checkpointing
+** it.
+**
+** There is still a single WRITER and a single CHECKPOINTER lock. The
+** recovery procedure still takes the same exclusive lock on the entire
+** range of SQLITE_SHM_NLOCK shm-locks. This works because the read-locks
+** above use four of the six read-locking slots used by legacy wal mode.
+** See the header comment for function walLockReader() for details.
+**
+** STARTUP/RECOVERY
+**
+** The read and write version fields of the database header in a wal2
+** database are set to 0x03, instead of 0x02 as in legacy wal mode.
+**
+** The wal file format used in wal2 mode is the same as the format used
+** in legacy wal mode. However, in order to support recovery, there are two
+** differences in the way wal file header fields are populated, as follows:
+**
+**   * When the first wal file is first created, the "nCkpt" field in
+**     the wal file header is set to 0. Thereafter, each time the writer
+**     switches wal file, it sets the nCkpt field in the new wal file
+**     header to ((nCkpt0 + 1) & 0x0F), where nCkpt0 is the value in
+**     the previous wal file header. This means that the first wal file
+**     always has an even value in the nCkpt field, and the second wal
+**     file always has an odd value.
+**
+**   * When a writer switches wal file, it sets the salt values in the
+**     new wal file to a copy of the checksum for the final frame in
+**     the previous wal file.
+**
+** Recovery proceeds as follows:
+**
+**   1. Each wal file is recovered separately. Except, if the first wal
+**      file does not exist or is zero bytes in size, the second wal file
+**      is truncated to zero bytes before it is "recovered".
+**
+**   2. If both wal files contain valid headers, then the nCkpt fields
+**      are compared to see which of the two wal files is older. If the
+**      salt keys in the second wal file match the final frame checksum
+**      in the older wal file, then both wal files are used. Otherwise,
+**      the newer wal file is ignored.
+**
+**   3. Or, if only one or neither of the wal files has a valid header,
+**      then only a single or no wal files are recovered into the
+**      reconstructed wal-index.
+**
+** Refer to header comments for walIndexRecover() for further details.
+*/
+
 #ifndef SQLITE_OMIT_WAL
 
 #include "wal.h"
@@ -1491,6 +1651,10 @@ static int walIndexRecover(Wal *pWal){
       ){
         SWAP(WalIndexHdr, pWal->hdr, hdr);
         walidxSetMxFrame(&pWal->hdr, 1, hdr.mxFrame);
+      }else{
+        walidxSetFile(&pWal->hdr, 1);
+        walidxSetMxFrame(&pWal->hdr, 1, pWal->hdr.mxFrame);
+        walidxSetMxFrame(&pWal->hdr, 0, 0);
       }
     }else