From: Alex Rousskov
Date: Tue, 13 Sep 2011 16:47:32 +0000 (-0600)
Subject: SMP Caching: Core changes, IPC primitives, Shared memory cache, and Rock Store
X-Git-Tag: take08~3^2
X-Git-Url: http://git.ipfire.org/?a=commitdiff_plain;h=35a1b223314eddb67b631cb0224d6e03a2360c77;p=thirdparty%2Fsquid.git

SMP Caching: Core changes, IPC primitives, Shared memory cache, and Rock Store

Core changes
------------

* Added MemObject::expectedReplySize() and used it instead of object_sz.

When deciding whether an object with a known content length can be swapped out, do not wait until the object is completely received and its size (mem_obj->object_sz) becomes known (while asking the store to recheck in vain with every incoming chunk). Instead, use the known content length, if any, to make the decision.

This optimizes the common case where the complete object is eventually received and swapped out, preventing the accumulation of potentially large objects in RAM while waiting for the end of the response. It should not affect objects with unknown content length.

Side effect 1: probably fixes several cases of unknowingly using a negative (unknown) mem_obj->object_sz in calculations. I added a few assertions to double-check some of the remaining object_sz/objectLen() uses.

Side effect 2: when expectedReplySize() is stored on disk as StoreEntry metadata, it may help to detect truncated entries when the writer process dies before completing the swapout.

* Removed mem->swapout.memnode in favor of mem->swapout.queue_offset.

The code used the swapout.memnode pointer to keep track of the last page that was swapped out. The code was semi-buggy because it could reset the pointer to NULL if no new data came in before the call to doPages(). Perhaps the code relied on the assumption that the caller will never call doPages() if there is no new data, but I am not sure that assumption was correct in all cases (it could be that I broke the calling code, of course).
Moreover, the page pointer was kept without any protection against the page disappearing during asynchronous swapout. There were "Evil hack time" comments discussing how the page might disappear.

Fortunately, we already have mem->swapout.queue_offset, which can be fed to getBlockContainingLocation() to find the page that needs to be swapped out. There is no need to keep the page pointer around. The queue_offset-based math is the same, so we are not adding any overhead by using that offset (in fact, we are removing some minor computations).

* Added a "close how?" parameter to storeClose() and friends.

The old code followed the same path when closing swapout activity for an aborted entry and when completing a perfectly healthy swapout. In the non-shared case, that may have been OK because the abort code would then release the entry, removing any half-written entry from the index and the disk (but I am not sure that release happened fast enough in 100% of cases). When the index and disk storage are shared among workers, such "temporary" inconsistencies result in truncated responses being delivered by other workers to the user, because once the swapout activity is closed, other workers can start using the entry. Adding the "close how?" parameter to the closing methods allows the core and SwapDir-specific code to handle aborted swapouts appropriately.

Since swapin code is "read only", we do not currently distinguish between aborted and fully satisfied readers: the readerGone enum value applies to both cases. If needed, the SwapDir reading code can make that distinction by analyzing how much was actually swapped in.

* Moved "can you store this entry?" code to virtual SwapDir::canStore().

The old code had some of the tests in SwapDir-specific canStore() methods and some in storeDirSelect*() methods. This resulted in inconsistencies, code duplication, and extra calculation overhead. Making this call virtual allows individual cache_dir types to do custom access controls.
The same method is used for cache_dir load reporting (if it returns true). Load management needs more work, but the current code is no worse than the old one in this respect, and further improvements are outside the scope of this change.

* Minimized from-disk StoreEntry loading/unpacking code duplication.

Moved common (and often rather complex!) code from store modules into storeRebuildLoadEntry(), storeRebuildParseEntry(), and storeRebuildKeepEntry().

* Do not set object_sz when the entry is aborted, because the true object size (HTTP reply headers + body) is not known in that case. Setting object_sz may fool client-side code into believing that the object is complete. This addresses an old complaint by RBC.

* When swapout initiation fails, mark the swapout decision as MemObject::SwapOut::swImpossible. This prevents the caller code from trying to swap out again and again because swap_status becomes SWAPOUT_NONE.

TODO: Consider adding SWAPOUT_ERROR, STORE_ERROR, and similar states. They may solve several problems where the code sees _NONE or _OK and thinks everything is peachy when in fact there was an error.

* Call haveParsedReplyHeaders() before entry->replaceHttpReply().

haveParsedReplyHeaders() sets the entry's public key and various flags (at least). replaceHttpReply() packs the reply headers, starting the swapout process. It feels natural to adjust the entry _before_ we pack/swap it, but I may be missing some side effects here.

The change was necessary because we started calling checkCachable() from swapoutPossible(). If haveParsedReplyHeaders() is not called before the swapout checks, the entry will still have the private key and will be declared impossible to cache.

* Extracted the write-to-store step from StoreEntry::replaceHttpReply().

This allows the caller to set the reply for the entry and then update the entry and the reply before writing them to the store.
For example, the server-side haveParsedReplyHeaders() code needs to set the entry timestamps and make the entry key public before the entry starts swapping out, but the same code also needs access to entry->getReply() and such for timestampsSet() and similar code to work correctly.

TODO: Calls to StoreEntry::replaceHttpReply() do not have to be modified because replaceHttpReply() writes by default. However, it is likely that callers other than ServerStateData::setFinalReply() should take advantage of the new split interface because they call timestampsSet() and such after replaceHttpReply().

* Moved SwapDir::cur_size and n_disk_objects to specific SwapDirs. Removed updateSize(). Some cache_dirs maintain their own maps and size statistics, making the one-size-fits-all SwapDir members inappropriate.

* Added a new public SwapDir method, swappedOut(). It is called from storeSwapOutFileClosed() to notify the SwapDir that an object was swapped out.

* Changed SwapDir::max_size to bytes and made it protected; callers use maxSize() instead. Changed SwapDir::cur_size to bytes and made it private; callers use currentSize() instead. Store Config.Store.avgObjectSize in bytes to avoid repeated and error-prone KB<->bytes conversions.

* Changed the type of Config.cacheSwap.swapDirs and StoreEntry::store() to SwapDir. This allows using the SwapDir API without dynamic_cast.

* Always call StoreEntry::abort() instead of setting ENTRY_ABORTED manually.

* Rely on entry->abort() side effects if ENTRY_ABORTED was set.

* Added or updated comments to better document the current code.

* Added operator << for dumping a StoreEntry summary into the debugging log. Needs more work to report more info (and not report yet-unknown info).

* Fixed blocking reads that were sometimes reading from random file offsets.

The core "disk file" reading code assumed that if the globally stored disk.offset matches the desired offset, there is no reason to seek. This was probably done to reduce seek overhead between consecutive reads.
Unfortunately, the disk writing code did not know about that optimization and left F->disk.offset unchanged after writing. This may have worked OK for UFS if it never writes to the file it reads from, but it does not work for store modules that do both kinds of I/O at different offsets of the same disk file. Eventually, we should implement this optimization correctly or remove disk.offset.

IPC primitives
--------------

To make SMP disk and memory caching non-blocking and correct, worker and disker processes must communicate with each other asynchronously. We are adding a collection of classes that support such communication.

At the base of the collection is the AtomicWordT template, which uses GCC atomic primitives such as __sync_add_and_fetch() to perform atomic operations on integral values in memory shared by multiple Squid kids. AtomicWordT is used to implement non-blocking shared locks, queues, store tables, and page pools.

To avoid blocking or very long searches, many operations are "optimistic" in nature. For example, it is possible that an atomic store map will refuse to allocate an entry for two processes even though a blocking implementation would have allowed one of the processes to get the map slot. We speculate that such conflict resolution is better than blocking locks when it comes to caching, especially if conflicts are rare due to the large number of cache entries, fast operations, and the relatively small number of kids.

TODO: Eventually, consider breaking locks left by dead kids.

Shared Memory Cache
-------------------

* Added an initial shared memory cache implementation (MemStore).

The shared memory cache keeps its own compact index of cached entries using an extended Ipc::StoreMap class (MemStoreMap). The cache also strives to keep its Root.get() results out of the store_table except during transit.

Eventually, the non-shared/local memory cache should also be implemented using a MemStore-like class, I think.
This will allow us to clearly isolate local memory cache code from shared memory cache code.

* Allow the user to explicitly disable shared memory caching in SMP mode via memory_cache_shared in squid.conf. Report whether the memory cache is shared.

* Disable shared memory caching by default, and prohibit it outright, if atomic operations are not supported.

TODO: Better limits/separation for cache and I/O shared memory pages. Eventually, support shared memory caching of multi-page entries.

Rock Store
----------

Rock Store uses a single [large] database-style file per cache_dir to store cached responses and metadata. This part of the design is similar to COSS. Rock Store does not maintain or rely on a swap.state "log" for recovery. Instead, the database is scanned in the background to load entries when Squid starts. Rock Store maintains its own index of cached entries and avoids the global store_table. All entries must be max-size or smaller.

In SMP mode, each Rock cache_dir is given a dedicated kid process called a "disker". All SMP workers communicate with diskers to store misses and load hits, using shared memory pages and atomic shared memory queues. A disker blocks when doing disk I/O, but workers do not. Any diskers:workers ratio is supported, so the user can find and configure the optimal number of workers and diskers for a given number of disks and CPU cores.

In non-SMP mode, Rock Store should use good old blocking disk I/O, without any diskers, but this has not been tested recently and probably needs more work.

Feature page: http://wiki.squid-cache.org/Features/RockStore

TODO: Disk rate limit to protect Squid from disk overload. More stats. Multiple readers? Seek optimization? Remove the known max-size requirement?