git.ipfire.org Git - thirdparty/squid.git/commit

Fix rock disk entry contamination related to aborted swapouts (#444)

Also probably fixed hit response corruption related to aborted rock
swapins.

The following disk entry contamination sequence was observed in detailed
cache logs during high-load Polygraph tests. Some of the observed
level-1 errors matched those in real-world deployments.

1. Worker A schedules AX – a request to write a piece of entry A to disk
   slot X. The disker is busy writing and reading other slots for worker
   A and other workers so AX stays in the A-to-Disker queue for a while.
2. Entry A aborts swapout (for any reason, including network errors
   while receiving the being-stored response). Squid makes disk slot X
   available for other entries to use. AX stays queued.
3. Another worker B picks up disk slot X (from the shared free disk slot
   pool) and schedules BX, a request to write a piece of entry B to disk
   slot X. BX gets queued in an B-to-Disker queue. AX stays queued.
4. The disker satisfies write request BX. Disk slot X contains entry B
   contents now. AX stays queued.
5. The disker satisfies write request AX. Disk slot X is a part of entry
   B slot chain but contains former entry A contents now! HTTP requests
   for entry B now read entry A leftovers from disk and complain about
   metadata mismatches (at best) or get wrong response contents (at
   worst).

To prevent premature disk slot reuse, we now keep disk slots reserved
while they are in the disk queue, even if the corresponding cache entry
is long gone: Individual disk write requests now "own" the slot they are
writing. The Rock::IoState object owns reserved but not yet used slots
so that they can be freed when the object is gone. The disk entry owns
the (successfully written) slots added to its chain in the map.

The new slot ownership scheme required changes in what metadata the
writing code has to maintain. For example, we now keep the address of
the previous slot in the entry chain so that we can update its .next
field after a successful disk write. Also, the old code detecting
dropped write requests could no longer rely on the now-empty .next field
in the previous map entry. The rewritten code numbers I/O transactions
so that out-of-order replies can be detected without using the map.

I tried to simplify the metadata maintenance code by shortening
reservation lifetimes and using just-in-time [first] slot reservations.
The new code may also leak fewer slots when facing C++ exceptions.

As for reading, I realized that we had no protection from dropped rock
read requests. If the first read request is dropped, the metadata
decoding would probably fail but if subsequent reads are skipped, the
client may be fed with data that is missing those skipped blocks. I did
not try to reproduce these problems, but they are now fixed using the
same I/O transaction numbering mechanism that the writing code now uses.
Negative length checks in store_client.cc treat dropped reads as errors.

I also removed commented out "partial writing" code because IoState
class member changes should expose any dangerous merge problems.

author	Eduard Bagdasaryan <eduard.bagdasaryan@measurement-factory.com>
	Wed, 7 Aug 2019 12:04:30 +0000 (12:04 +0000)
committer	Squid Anubis <squid-anubis@squid-cache.org>
	Wed, 7 Aug 2019 12:04:34 +0000 (12:04 +0000)
commit	18102f7de14e3ba35d2b4e9457b3611a178bf8b9
tree	cbd841374f5c13349fe2b43e207884dc21798dc9	tree
parent	71778e77daf5231e8b97195b831b4889aa1fa53e	commit \| diff

src/fs/rock/RockIoRequests.cc		diff \| blob \| blame \| history
src/fs/rock/RockIoRequests.h		diff \| blob \| blame \| history
src/fs/rock/RockIoState.cc		diff \| blob \| blame \| history
src/fs/rock/RockIoState.h		diff \| blob \| blame \| history
src/fs/rock/RockSwapDir.cc		diff \| blob \| blame \| history
src/fs/rock/RockSwapDir.h		diff \| blob \| blame \| history
src/fs/rock/forward.h		diff \| blob \| blame \| history
src/store_client.cc		diff \| blob \| blame \| history