From: Alex Rousskov
Date: Tue, 13 Sep 2011 16:47:32 +0000 (-0600)
Subject: SMP Caching: Core changes, IPC primitives, Shared memory cache, and Rock Store
X-Git-Tag: take08~3^2
X-Git-Url: http://git.ipfire.org/?a=commitdiff_plain;h=35a1b223314eddb67b631cb0224d6e03a2360c77;p=thirdparty%2Fsquid.git

SMP Caching: Core changes, IPC primitives, Shared memory cache, and Rock Store

Core changes
------------

* Added MemObject::expectedReplySize() and used it instead of object_sz.

When deciding whether an object with a known content length can be swapped out, do not wait until the object is completely received and its size (mem_obj->object_sz) becomes known (while asking the store to recheck in vain with every incoming chunk). Instead, use the known content length, if any, to make the decision.

This optimizes the common case where the complete object is eventually received and swapped out, preventing the accumulation of potentially large objects in RAM while waiting for the end of the response. It should not affect objects with unknown content length.

Side effect 1: probably fixes several cases of unknowingly using a negative (unknown) mem_obj->object_sz in calculations. I added a few assertions to double-check some of the remaining object_sz/objectLen() uses.

Side effect 2: when expectedReplySize() is stored on disk as StoreEntry metadata, it may help to detect truncated entries when the writer process dies before completing the swapout.

* Removed mem->swapout.memnode in favor of mem->swapout.queue_offset.

The code used the swapout.memnode pointer to keep track of the last page that was swapped out. The code was semi-buggy because it could reset the pointer to NULL if no new data came in before the call to doPages(). Perhaps the code relied on the assumption that the caller will never call doPages() if there is no new data, but I am not sure that assumption was correct in all cases (it could be that I broke the calling code, of course).
Moreover, the page pointer was kept without any protection against the page disappearing during asynchronous swapout. There were "Evil hack time" comments discussing how the page might disappear.

Fortunately, we already have mem->swapout.queue_offset, which can be fed to getBlockContainingLocation() to find the page that needs to be swapped out. There is no need to keep the page pointer around. The queue_offset-based math is the same, so we are not adding any overhead by using that offset (in fact, we are removing some minor computations).

* Added a "close how?" parameter to storeClose() and friends.

The old code followed the same path when closing swapout activity for an aborted entry and when completing a perfectly healthy swapout. In the non-shared case, that may have been OK because the abort code would then release the entry, removing any half-written entry from the index and the disk (but I am not sure that release happened fast enough in 100% of cases). When the index and disk storage are shared among workers, such "temporary" inconsistencies result in truncated responses being delivered by other workers to the user, because once the swapout activity is closed, other workers can start using the entry. Adding the "close how?" parameter to the closing methods allows the core and SwapDir-specific code to handle aborted swapouts appropriately.

Since swapin code is "read only", we do not currently distinguish between aborted and fully satisfied readers: the readerGone enum value applies to both cases. If needed, the SwapDir reading code can make that distinction by analyzing how much was actually swapped in.

* Moved "can you store this entry?" code to virtual SwapDir::canStore().

The old code had some of the tests in SwapDir-specific canStore() methods and some in storeDirSelect*() methods. This resulted in inconsistencies, code duplication, and extra calculation overhead. Making this call virtual allows individual cache_dir types to do custom access controls.
The same method is used for cache_dir load reporting (if it returns true). Load management needs more work, but the current code is no worse than the old one in this respect, and further improvements are outside the scope of this change.

* Minimized from-disk StoreEntry loading/unpacking code duplication.

Moved common (and often rather complex!) code from store modules into storeRebuildLoadEntry(), storeRebuildParseEntry(), and storeRebuildKeepEntry().

* Do not set object_sz when the entry is aborted, because the true object size (HTTP reply headers + body) is not known in that case. Setting object_sz may fool client-side code into believing that the object is complete. This addresses an old complaint by RBC.

* When swapout initiation fails, mark the swapout decision as MemObject::SwapOut::swImpossible. This prevents the caller code from trying to swap out again and again because swap_status becomes SWAPOUT_NONE.

TODO: Consider adding SWAPOUT_ERROR, STORE_ERROR, and similar states. They may solve several problems where the code sees _NONE or _OK and thinks everything is peachy when in fact there was an error.

* Call haveParsedReplyHeaders() before entry->replaceHttpReply().

haveParsedReplyHeaders() sets the entry's public key and various flags (at least). replaceHttpReply() packs the reply headers, starting the swapout process. It feels natural to adjust the entry _before_ we pack/swap it, but I may be missing some side effects here.

The change was necessary because we started calling checkCachable() from swapoutPossible(). If haveParsedReplyHeaders() is not called before the swapout checks, the entry will still have the private key and will be declared impossible to cache.

* Extracted the write-to-store step from StoreEntry::replaceHttpReply().

This allows the caller to set the reply for the entry and then update the entry and the reply before writing them to the store.
For example, the server-side haveParsedReplyHeaders() code needs to set the entry timestamps and make the entry key public before the entry starts swapping out, but the same code also needs access to entry->getReply() and such for timestampsSet() and similar code to work correctly.

TODO: Calls to StoreEntry::replaceHttpReply() do not have to be modified because replaceHttpReply() writes by default. However, it is likely that callers other than ServerStateData::setFinalReply() should take advantage of the new split interface because they call timestampsSet() and such after replaceHttpReply().

* Moved SwapDir::cur_size and n_disk_objects to specific SwapDirs. Removed updateSize(). Some cache_dirs maintain their own maps and size statistics, making the one-size-fits-all SwapDir members inappropriate.

* Added a new public SwapDir method, swappedOut(). It is called from storeSwapOutFileClosed() to notify the SwapDir that an object was swapped out.

* Changed SwapDir::max_size to bytes and made it protected; callers use maxSize() instead. Changed SwapDir::cur_size to bytes and made it private; callers use currentSize() instead. Store Config.Store.avgObjectSize in bytes to avoid repeated and error-prone KB<->bytes conversions.

* Changed the type of Config.cacheSwap.swapDirs and StoreEntry::store() to SwapDir. This allows using the SwapDir API without dynamic_cast.

* Always call StoreEntry::abort() instead of setting ENTRY_ABORTED manually.

* Rely on entry->abort() side effects if ENTRY_ABORTED was set.

* Added or updated comments to better document the current code.

* Added operator << for dumping a StoreEntry summary into the debugging log. Needs more work to report more info (and not report yet-unknown info).

* Fixed blocking reads that were sometimes reading from random file offsets.

The core "disk file" reading code assumed that if the globally stored disk.offset matches the desired offset, there is no reason to seek. This was probably done to reduce seek overhead between consecutive reads.
Unfortunately, the disk writing code did not know about that optimization and left F->disk.offset unchanged after writing. This may have worked OK for UFS if it never writes to the file it reads from, but it does not work for store modules that do both kinds of I/O at different offsets of the same disk file. Eventually, we should implement this optimization correctly or remove disk.offset.

IPC primitives
--------------

To make SMP disk and memory caching non-blocking and correct, worker and disker processes must communicate with each other asynchronously. We are adding a collection of classes that support such communication.

At the base of the collection is the AtomicWordT template, which uses GCC atomic primitives such as __sync_add_and_fetch() to perform atomic operations on integral values in memory shared by multiple Squid kids. AtomicWordT is used to implement non-blocking shared locks, queues, store tables, and page pools.

To avoid blocking or very long searches, many operations are "optimistic" in nature. For example, it is possible that an atomic store map will refuse to allocate an entry for two processes even though a blocking implementation would have allowed one of the processes to get the map slot. We speculate that such conflict resolution is better than blocking locks when it comes to caching, especially if conflicts are rare due to the large number of cache entries, fast operations, and the relatively small number of kids.

TODO: Eventually, consider breaking locks left by dead kids.

Shared Memory Cache
-------------------

* Added an initial shared memory cache implementation (MemStore).

The shared memory cache keeps its own compact index of cached entries using an extended Ipc::StoreMap class (MemStoreMap). The cache also strives to keep its Root.get() results out of the store_table except during transit.

Eventually, the non-shared/local memory cache should also be implemented using a MemStore-like class, I think.
This will allow us to clearly isolate local memory cache code from shared memory cache code.

* Allow the user to explicitly disable shared memory caching in SMP mode via memory_cache_shared in squid.conf. Report whether the memory cache is shared.

* Disable shared memory caching by default, and prohibit it outright, if atomic operations are not supported.

TODO: Better limits/separation for cache and I/O shared memory pages. Eventually, support shared memory caching of multi-page entries.

Rock Store
----------

Rock Store uses a single [large] database-style file per cache_dir to store cached responses and metadata. This part of the design is similar to COSS. Rock Store does not maintain or rely on a swap.state "log" for recovery. Instead, the database is scanned in the background to load entries when Squid starts. Rock Store maintains its own index of cached entries and avoids the global store_table. All entries must be max-size or smaller.

In SMP mode, each Rock cache_dir is given a dedicated kid process called a "disker". All SMP workers communicate with diskers to store misses and load hits, using shared memory pages and atomic shared memory queues. A disker blocks when doing disk I/O, but workers do not. Any diskers:workers ratio is supported, so the user can find and configure the optimal number of workers and diskers for a given number of disks and CPU cores.

In non-SMP mode, Rock Store should use good old blocking disk I/O, without any diskers, but this has not been tested recently and probably needs more work.

Feature page: http://wiki.squid-cache.org/Features/RockStore

TODO: Disk rate limit to protect Squid from disk overload. More stats. Multiple readers? Seek optimization? Remove the known max-size requirement?