Alex Rousskov [Tue, 12 Apr 2011 00:33:41 +0000 (18:33 -0600)]
Added initial shared memory cache implementation (MemStore) and integrated it.
Like Rock Store, the shared memory cache keeps its own compact index of cached
entries using an extended Ipc::StoreMap class (MemStoreMap). Like Rock Store, the
cache also struggles to keep its Root.get() results out of the store_table
except during transit.
There are several XXXs and TODOs that still need to be addressed for a more
polished implementation.
Eventually, the non-shared/local memory cache should also be implemented
using a MemStore-like class, I think. That would make it possible to clearly
isolate local from shared memory cache code.
Alex Rousskov [Mon, 11 Apr 2011 23:50:50 +0000 (17:50 -0600)]
Avoid creating unlocked store_table entries when handling rebuild conflicts.
Such StoreEntry objects persist until a hit locks and unlocks them (or the
replacement policy removes them?), creating SMP synchronization problems
because they are treated as in-transit objects even though their store slot
may be gone already.
Alex Rousskov [Sat, 9 Apr 2011 04:24:06 +0000 (22:24 -0600)]
Split Rock-only Rock::DirMap into Rock::DirMap and reusable Ipc pieces
which a shared memory cache implementation can use:
Ipc::StoreMap is responsible for maintaining a collection of lockable slots,
each with readable/writeable/free state and a "waiting to be free" flag. Kids
of this class can add more metadata (in parallel structures using the same
index as primary slots). I tried extending the slots themselves, but that
turned out to be more complex/messy.
Ipc::ReadWriteLock is a basic multiple readers, single writer lock. Its
earlier implementation inside Rock::DirMap mixed slot locking and slot
state/flags. That simplified the caller code a little, but the current simpler
class is easier to understand and reuse.
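In sketch form, such an atomic-counter lock could look like this (illustrative
only; the actual Ipc::ReadWriteLock API may differ):

    #include <atomic>

    // A minimal multiple-readers, single-writer lock built from two atomic
    // counters so that it can live in a shared memory segment. Names and
    // details are illustrative, not the actual Ipc::ReadWriteLock code.
    class ReadWriteLock
    {
    public:
        bool lockShared() {
            ++readers; // optimistically announce the reader first
            if (writers == 0)
                return true; // no writer active; shared lock held
            --readers; // a writer is active; back off
            return false;
        }

        bool lockExclusive() {
            if (++writers == 1 && readers == 0)
                return true; // sole writer, no readers; exclusive lock held
            --writers; // lock is busy; back off
            return false;
        }

        void unlockShared() { --readers; }
        void unlockExclusive() { --writers; }

    private:
        std::atomic<int> readers{0}; // active (or arriving) readers
        std::atomic<int> writers{0}; // active (or arriving) writers
    };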
Rock::DirMap now just adds Rock::DbCellHeader metadata to Ipc::StoreMap slots.
Simplified mapping API by reducing the number of similar-but-different
methods. For example, instead of putAt, the caller can use an
openForWriting/closeForWriting pair. This helps with moving custom metadata
manipulations outside of the reusable Ipc::StoreMap.
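For example, a writer might now follow this pattern (a sketch; the exact
Ipc::StoreMap signatures are assumptions based on the method names above):

    // Hypothetical caller code: instead of a single putAt() call, open a
    // slot for writing, fill in core and custom metadata, then close.
    sfileno fileno;
    if (Ipc::StoreMap::Slot *slot = map->openForWriting(key, fileno)) {
        slot->set(entry);              // core metadata kept by Ipc::StoreMap
        extras[fileno] = myCellHeader; // kid metadata in a parallel structure
        map->closeForWriting(fileno);  // make the slot readable to others
    }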
It would be possible to split Ipc::StoreMap further by moving Store-specific
bits outside of its slots. Currently, there is no need for that though.
Alex Rousskov [Sat, 9 Apr 2011 04:20:21 +0000 (22:20 -0600)]
Added a reserve() method to let nested classes and similar related users
safely bite off pieces of the same shared segment. Still need to convert the
callers.
The reserve() method is useful for single users as well because it lets them
check that a segment has enough bytes allocated for that single user.
Changed theSize type from int to "size of any single object in RAM" size_t.
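A sketch of the intended reserve() usage (names and the exact failure behavior
are assumptions):

    // Hypothetical sketch: related users carving up one shared segment.
    // reserve() is assumed to fail loudly if the request does not fit into
    // the bytes allocated for the segment.
    Ipc::Mem::Segment segment("squid-map");
    segment.create(sharedSize); // allocate sharedSize bytes
    void *slots = segment.reserve(slotCount * sizeof(Slot));    // one user
    void *extras = segment.reserve(slotCount * sizeof(Extras)); // nested user
    // A single user can call reserve(totalSize) once simply to check that
    // the segment has enough bytes allocated for it.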
Alex Rousskov [Thu, 31 Mar 2011 21:22:12 +0000 (15:22 -0600)]
Moved ipc/SharedMemory to ipc/mem/ and renamed it to Ipc::Mem::Segment.
All IPC things are "shared" by default.
The new name is also more specific because there are other shared memory
structures such as queues and stacks (all using shared memory segments
internally though).
Alex Rousskov [Tue, 22 Feb 2011 23:13:13 +0000 (16:13 -0700)]
Fixed blocking reads that were sometimes reading from random file offsets.
Core "disk file" reading code assumed that if the globally stored disk.offset
matches the desired offset, there is no reason to seek. This was probably done
to reduce seek overhead between consecutive reads. Unfortunately, the disk
writing code did not know about that optimization and left F->disk.offset
unchanged after writing.
This may have worked OK for UFS if it never writes to the file it reads from,
but it does not work for store modules that do both kinds of I/O at different
offsets of the same disk file.
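In simplified form, the broken interplay looked like this (a sketch, not the
actual file I/O code):

    // The read path trusted F->disk.offset to skip "unnecessary" seeks:
    if (F->disk.offset != readOffset)
        lseek(fd, readOffset, SEEK_SET); // seek only when we think we must
    read(fd, buf, len);
    F->disk.offset = readOffset + len;

    // ...but the write path moved the file position without updating
    // F->disk.offset:
    lseek(fd, writeOffset, SEEK_SET);
    write(fd, data, size); // F->disk.offset is now stale; the next read may
                           // wrongly skip its seek and read from here instead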
TODO: Implement this optimization correctly or remove disk.offset.
Alex Rousskov [Tue, 15 Feb 2011 04:09:58 +0000 (21:09 -0700)]
* Removed old Rock Store hack to update swap_file_sz before it is packed into
the store entry disk "header". Prepend our own disk header to each db cell.
Eventually, the same header may be used to store "next page" or other info.
Modern Squid code should not use packed swap_file_sz or object_sz because
either may be unknown at packing time. The storage code should either
record its own size information (Rock Store does that now) or rely on
3rd-party metadata (UFS uses fstat(2) system call).
* Do not promise to read an entry that is still being written. Use peekAtReader() to check
that we can read the presumably locked (by us) entry.
It is probably possible to queue a reader somehow and wait for the entry to
become readable or even to read initial pages as we write (when we start doing
incremental writing) but it would be rather complex and expensive to
synchronize such shared read/write access across workers.
* Fixed Rock maintenance of ::StoreIOState::offset_. The old code probably
misinterpreted the meaning of offset_ and kept the disk offset there. It should have
been just the distance from the beginning of the db cell (not db file)
instead. It probably worked because we write everything at once and the
offset_ values were usually big enough to pass "wrote everything" checks.
* Extracted storeRebuildParseEntry from storeRebuildLoadEntry.
This allows Rock Store to analyze and skip the disk cell header before giving
the packed store entry info to the common core code.
* Report more disk stats. Needs more work, especially in the SMP reporting case.
Map stats are only reported for tiny maps because scanning the entire map may
be too expensive for bigger maps (is it worth the overhead to maintain these
stats all the time instead of computing them from scratch on-demand?).
* Moved StoreEntryBasics into Rock namespace.
* Be more consistent in setting Rock::IoState's swap_filen, swap_dirn,
diskOffset, and payloadEnd.
* Renamed vague (and confusing after the header was added to the db cell)
entrySize to more specific payloadEnd.
* Synced with r11228 changes (making Store work with SMP-shared max-size
cache_dirs).
Alex Rousskov [Tue, 15 Feb 2011 04:02:28 +0000 (21:02 -0700)]
Changes revolving around making Store work with SMP-shared max-size cache_dirs:
* Added MemObject::expectedReplySize() and used it instead of object_sz.
When deciding whether an object with a known content length can be swapped
out, do not wait until the object is completely received and its size
(mem_obj->object_sz) becomes known (while asking the store to recheck in vain
with every incoming chunk). Instead, use the known content length, if any, to
make the decision.
This optimizes the common case where the complete object is eventually
received and swapped out, preventing potentially large objects from
accumulating in RAM while waiting for the end of the response. It should not
affect objects with unknown content length.
Side effect 1: probably fixes several cases of unknowingly using negative
(unknown) mem_obj->object_sz in calculations. I added a few assertions to
double check some of the remaining object_sz/objectLen() uses.
Side effect 2: When expectedReplySize() is stored on disk as StoreEntry
metadata, it may help to detect truncated entries when the writer process dies
before completing the swapout.
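In sketch form, the new swap-out decision is roughly this (expectedReplySize()
and object_sz come from the text above; the surrounding check is illustrative):

    // Prefer the advertised content length over waiting for object_sz.
    int64_t expectedEnd = mem_obj->expectedReplySize(); // -1 if unknown
    if (expectedEnd < 0)
        expectedEnd = mem_obj->object_sz; // known only after the response ends
    if (expectedEnd >= 0 && expectedEnd > store->maxObjectSize())
        return false; // certainly too big; stop accumulating it in RAM now
    return true; // fits, or size still unknown (keep checking as data arrives)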
* Removed mem->swapout.memnode in favor of mem->swapout.queue_offset.
The code used the swapout.memnode pointer to keep track of the last page that
was swapped out. That code was semi-buggy because it could reset the pointer to
NULL if no new data came in before the call to doPages(). Perhaps the code
relied on the assumption that the caller would never call doPages() if there is
no new data, but I am not sure that assumption held in all cases (it could be
that I broke the calling code, of course).
Moreover, the page pointer was kept without any protection against the page
disappearing during an asynchronous swapout. There were "Evil hack time" comments
discussing how the page might disappear.
Fortunately, we already have mem->swapout.queue_offset that can be fed to
getBlockContainingLocation to find the page that needs to be swapped out.
There is no need to keep the page pointer around. The queue_offset-based math
is the same, so we are not adding any overhead by using that offset (in fact,
we are removing some minor computations).
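A sketch of the offset-based lookup (getBlockContainingLocation() is the
existing call named above; the rest is illustrative):

    // Derive the page to swap out from the queue offset instead of caching
    // a raw mem_node pointer that may dangle during an async swapout.
    const int64_t off = mem->swapout.queue_offset; // bytes already queued
    if (mem_node *page = mem->data_hdr.getBlockContainingLocation(off)) {
        // write this page out; when the write completes, advance
        // queue_offset by the number of bytes actually written
    }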
* Added "close how?" parameter to storeClose() and friends.
The old code would follow the same path when closing swapout activity for an
aborted entry and when completing a perfectly healthy swapout. In the non-shared
case, that could have been OK because the abort code would then release the
entry, removing any half-written entry from the index and the disk (but I am
not sure that release happened fast enough in 100% of cases).
When the index and disk storage are shared among workers, such "temporary"
inconsistencies result in truncated responses being delivered by other workers
to the user because once the swapout activity is closed, other workers can
start using the entry.
By adding the "close how?" parameter to closing methods we allow the core and
SwapDir-specific code to handle aborted swapouts appropriately.
Since swapin code is "read only", we do not currently distinguish between
aborted and fully satisfied readers: The readerGone enum value applies to both
cases. If needed, the SwapDir reading code can make that distinction by
analyzing how much was actually swapped in.
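A sketch of what the parameter could look like (only readerGone is named
above; the other values and the signature are assumptions):

    // Hypothetical "close how?" values.
    enum CloseHow {
        writerDone,    // healthy, complete swapout; entry becomes readable
        writerAborted, // half-written entry; do not serve it to other workers
        readerGone     // reader finished or aborted; swapin is read-only
    };

    void storeClose(StoreIOState::Pointer sio, CloseHow how); // was storeClose(sio)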
* Moved "can you store this entry?" code to virtual SwapDir::canStore().
The old code had some of the tests in SwapDir-specific canStore() methods and
some in storeDirSelect*() methods. This resulted in inconsistencies, code
duplication, and extra calculation overheads. Making this call virtual allows
individual cache_dir types to do custom access controls.
The same method is used for cache_dir load reporting (if it returns true).
Load management needs more work, but the current code is no worse than the old
one in this aspect, and further improvements are outside this change scope.
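The resulting interface might look roughly like this (the load out-parameter
follows from the load-reporting sentence above; the exact signature is an
assumption):

    class SwapDir : public Store
    {
    public:
        // Each cache_dir type applies its own admission checks. When the
        // method returns true, it also reports the dir's current load so
        // that selection code can pick the least busy eligible dir.
        virtual bool canStore(const StoreEntry &e, int64_t diskSpaceNeeded,
                              int &load) const = 0;
    };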
* Moved common (and often rather complex!) code from store modules into
storeRebuildLoadEntry, storeRebuildParseEntry, and storeRebuildKeepEntry.
* Do not set object_sz when the entry is aborted because the true object size
(HTTP reply headers + body) is not known in this case. Setting object_sz may
fool client-side code into believing that the object is complete.
This addresses an old complaint from RBC.
* When swapout initiation fails, release StoreEntry. This prevents the caller
code from trying to swap out again and again because swap_status becomes
SWAPOUT_NONE.
TODO: Consider adding SWAPOUT_ERROR, STORE_ERROR, and similar states. It may
solve several problems where the code sees _NONE or _OK and thinks everything
is peachy when in fact there was an error.
* Always call StoreEntry::abort() instead of setting ENTRY_ABORTED manually.
* Rely on entry->abort() side-effects if ENTRY_ABORTED was set.
* Added or updated comments to better document current code.
* Added operator << for dumping StoreEntry summary into the debugging log.
Needs more work to report more info (and not report yet-unknown info).
Alex Rousskov [Mon, 14 Feb 2011 04:56:59 +0000 (21:56 -0700)]
Call StoreEntry::abort() instead of setting ENTRY_ABORTED flag.
This may be necessary because the abort() method does more than just setting
the flag and releasing the request. It guarantees, for example, that the
swapout I/O, if any, is closed.
Alex Rousskov [Mon, 14 Feb 2011 04:48:35 +0000 (21:48 -0700)]
Polished mgr query handoff from the original recipient worker to Coordinator.
When a worker receives a cache manager query, hands it to Coordinator, and
receives an ACK from Coordinator, the worker should stop handling the
originating transaction without declaring the associated StoreEntry complete,
because doing so triggers store client activity on the client side and
might cause undesirable output on the now-shared HTTP client socket.
Besides, declaring an empty entry complete is kind of wrong.
Alex Rousskov [Mon, 7 Feb 2011 17:59:28 +0000 (10:59 -0700)]
Made Rock::Rebuild an AsyncJob because it is.
Increment StoreController::store_dirs_rebuilding early, when SwapDir is
created and before the disk db file is opened and the actual rebuild starts.
Otherwise, if one SwapDir finishes rebuild before others start,
storeRebuildComplete() will see StoreController::store_dirs_rebuilding equal
to one, and think the rebuild is over.
This was not a problem for cache_dirs using blocking I/O because they either
did not try to open some file at SwapDir::init() time or did so synchronously,
resulting in "immediate" StoreController::store_dirs_rebuilding increment from
Store init loop point of view.
Alex Rousskov [Fri, 4 Feb 2011 22:25:45 +0000 (15:25 -0700)]
Quiet down swap out error reporting.
Do not report swap out errors at level 1. When things go wrong, the already
bad situation is made worse by writing lots of error messages to cache.log.
Do not report the system error because errno may be stale or irrelevant.
If error details are needed, the code should save and propagate the actual
errno in addition to the DISK_ERROR or similar status.
When StoreEntry is deleted, we need to release the SwapDir map slot locks it
holds, if any. This is difficult because SwapDir maintains the locks while
Squid Core maintains the entry swap_status. The Core gets swap_status-related
notifications using async calls so it is easy for swap_status to get out of
sync if SwapDir updates the map slot proactively.
The new code no longer releases the slot lock until the associated StoreEntry
is unlinked or gone, even if the slot is known to be unusable and waiting to
be deleted. We also do not rely on swap_status to guess which lock to release;
we use slot state to determine that instead.
Removed rock-specific code from StoreEntry destructor by introducing a general
SwapDir::disconnect(StoreEntry&) API.
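A hedged sketch of the Rock side of that hook (the signature comes from the
text above; the map call and body are assumptions):

    // The StoreEntry destructor now simply notifies its SwapDir; Rock then
    // releases whatever slot lock the entry still holds, based on the slot
    // state rather than entry swap_status.
    void
    Rock::SwapDir::disconnect(StoreEntry &e)
    {
        if (e.swap_filen < 0)
            return; // the entry was never attached to our map
        map->abortIo(e.swap_filen); // hypothetical: drop read or write lock
        e.swap_dirn = -1;
        e.swap_filen = -1;
    }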
Alex Rousskov [Thu, 3 Feb 2011 23:41:32 +0000 (16:41 -0700)]
Revised Slot management in Rock::DirMap.
Old code was occasionally hitting an s.state == Slot::Writing assertion when
closing the writing state. Since I could not find a specific bug that would
lead to this, I decided to simplify state management by moving Slot locking
further away from the Slot state.
Two kinds of Slot locks are now supported: exclusive and shared. These are
implemented using simple atomic counters. To obtain the shared lock, the slot
must also be in a readable, not-marked-for-freeing state (this is where the
lock and the state still overlap). The code should eventually be polished
to use explicit creation-is-acquisition lock objects.
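In sketch form, the remaining lock/state overlap is (field names are
assumptions):

    // Grant a shared (read) lock only if the counters allow it AND the
    // slot is readable and not marked for freeing.
    bool
    Slot::lockShared()
    {
        if (!lock.lockShared())
            return false; // a writer holds the slot
        if (state != Readable || waitingToBeFreed) {
            lock.unlockShared(); // slot state forbids reading; back off
            return false;
        }
        return true;
    }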
Old code could not cope with a Slot deletion event arriving when the Slot was
being written to. We now mark the slot as in need of freeing, regardless of
the slot state. This may need more work to properly clean up marked slots.
The old code used open/closeForWriting sequences for rebuilding the map from
disk. There were possibly some race conditions in that code. It is now
replaced with a dedicated, simpler, and optimized putAt() method.
Alex Rousskov [Thu, 3 Feb 2011 05:33:05 +0000 (22:33 -0700)]
Support IpcIo timeouts.
Pending IpcIo requests are now stored in two alternating maps: "old" and
"new". Every T seconds, any requests remaining in the "old" map are treated
as timed out. After that check, the current "new" and (now empty) "old" map
pointers are swapped so that the previously "new" requests can now age for T
seconds. New requests are always added to the "new" map. Responses are
always checked against both maps.
This approach gives us access to pending request information and lets us
report errors to the right I/O requestors without creating additional
per-request state attached to a per-request timeout event. The price is (a)
two instead of one map lookups when the response comes and (b) timeout
precision decrease from "about T" to "anywhere from T to 2*T".
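A sketch of the rotation (types and names are illustrative, not the actual
IpcIoFile code):

    #include <map>
    #include <algorithm> // std::swap

    struct PendingRequest {
        void handleTimeout(); // notify the requestor of the error (sketch)
    };
    typedef std::map<unsigned int, PendingRequest *> RequestMap;

    static RequestMap maps[2];
    static RequestMap *oldRequests = &maps[0];
    static RequestMap *newRequests = &maps[1];

    // Called every T seconds: anything still in "old" has waited between
    // T and 2*T seconds and is timed out; then the maps swap roles.
    void checkTimeouts() {
        for (RequestMap::iterator i = oldRequests->begin();
                i != oldRequests->end(); ++i)
            i->second->handleTimeout();
        oldRequests->clear();
        std::swap(oldRequests, newRequests); // "new" requests start aging
    }

    // New requests always go into the "new" map.
    void track(unsigned int id, PendingRequest *p) { (*newRequests)[id] = p; }

    // Responses may match a request in either map.
    PendingRequest *dequeue(unsigned int id) {
        for (int n = 0; n < 2; ++n) {
            const RequestMap::iterator i = maps[n].find(id);
            if (i != maps[n].end()) {
                PendingRequest *const p = i->second;
                maps[n].erase(i);
                return p;
            }
        }
        return 0; // not found: probably timed out earlier
    }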
Alex Rousskov [Wed, 2 Feb 2011 19:05:25 +0000 (12:05 -0700)]
Fixed Rock DirMap read and write locking:
The IoState object created by openStoreIO() can be used for many reads. Thus,
incrementing the read level at open and decrementing it at [each] readCompleted
leads to negative read levels if the stored object needs more than one I/O.
Moreover, the only way core Squid can swap in an entry is if the entry has our
fileno set (by our get()). Thus, the slot is already locked for reading by
get(), with the entry responsible for decreasing the read level upon
destruction. We do not need to open/close for reading in
openStoreIO/readCompleted.
When writing fails, invalidate the slot before unlocking it.
Alex Rousskov [Wed, 2 Feb 2011 01:49:34 +0000 (18:49 -0700)]
Polished skipping of cache_dirs inactive in a given strand (e.g., Coordinator)
by adding a SwapDir::active() method. The directory is active if it makes sense
to call its init/create/get methods in a given strand.
Fixed counting of cache_dirs that need dedicated strands. We no longer assume
that all cache_dirs do; instead, we ask each dir via SwapDir::needsDiskStrand().
The result is stored in Config.cacheSwap.n_strands to optimize NumberOfKids().
Alex Rousskov [Tue, 1 Feb 2011 20:35:42 +0000 (13:35 -0700)]
Call ioCompletedNotification after we are done with the opening sequence,
not in the middle of it. The effect should be the same, but the logs may be
easier to read, and there will be fewer chances of getting into a reentrant
mess of some kind.
Alex Rousskov [Tue, 1 Feb 2011 20:27:13 +0000 (13:27 -0700)]
Do not start rebuilding a cache_dir (i.e., loading its index into RAM) until we
complete the cache_dir initialization sequence, which ends not in
Rock::SwapDir::init but in Rock::SwapDir::ioCompletedNotification, where we
open the shared map or bail on errors.
It does not make sense to start loading the index before the map is configured
because there will be no place to store the loaded information.
Alex Rousskov [Tue, 1 Feb 2011 20:18:27 +0000 (13:18 -0700)]
Use the Blocking DiskIO module when running in no-daemon mode.
We cannot use the IpcIo module in no-daemon mode because there are no diskers
to communicate with. If our implementation is correct, the IpcIo module should
contain no shared map or other rock-specific manipulations and, hence,
should not be required for Rock Store to work.
Alex Rousskov [Tue, 1 Feb 2011 08:22:59 +0000 (01:22 -0700)]
Preserve old registration tag when updating registration info.
Sometimes, a tagless strand registers itself only after its module (like
IpcIoFile) has supplied a tag. We need to keep that tag for future tag searches
to succeed.
Alex Rousskov [Tue, 1 Feb 2011 05:01:43 +0000 (22:01 -0700)]
Added IpcIo DiskIO module for communication with remote disk processes via UDS.
Used IpcIo for Rock Store filesystem module.
Added StrandSearch API: Workers use it to ask Coordinator for the right
address (i.e., kid identifier) of the disk process for a given cache_dir path.
If Coordinator does not know the answer, it waits for more disk processes to
register. Implemented using generic tagging of kids (StrandCoord) and
searching for the right tag.
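A sketch of the Coordinator side of the search (StrandSearch and StrandCoord
are named above; method and field names are assumptions):

    // Answer immediately if a registered strand already carries the
    // requested tag (the cache_dir path); otherwise remember the request
    // and retry it whenever another disker registers.
    void
    Coordinator::handleSearchRequest(const StrandSearchRequest &request)
    {
        if (const StrandCoord *found = findTaggedStrand(request.tag)) {
            notifySearcher(request, *found); // reply with the kid identifier
            return;
        }
        searchers.push_back(request); // wait for more diskers to register
    }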
Raised UDS message size maximum to 36K in order to accommodate non-trivial
rock store I/O while we are using UDS messages for I/O content.
Fixed shutdown handling, which was broken by hiding cache_dirs from Coordinator
while switching IamPrimaryProcess() logic to use NumberOfKids(), which needs
the cache_dir count.
Alex Rousskov [Sun, 30 Jan 2011 23:16:22 +0000 (16:16 -0700)]
Added "disker" processes to be responsible for individual cache_dir I/O.
Determine kid process role based on the process name rather than kid ID.
This allows the process to perform role-specific actions before (or while)
squid.conf is parsed.
Alex Rousskov [Sat, 29 Jan 2011 00:08:52 +0000 (17:08 -0700)]
Added a configuration check to prevent IoState::startWriting() assertions.
Rock::IoState::startWriting() asserts that [padded] write request size does
not exceed the slot size. Padded request size always exceeds the slot size for
slots smaller than the page.
This check may also help avoid using an unallocated buffer for padding, but that
part may need more work.
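The arithmetic behind the check, in sketch form (names are assumptions):

    // Write requests are padded up to a whole page:
    //   paddedSize = roundUp(writeSize, pageSize)
    // so any slotSize < pageSize gives paddedSize >= pageSize > slotSize,
    // and Rock::IoState::startWriting() would assert. Reject such
    // configurations early instead:
    if (slotSize < pageSize)
        fatalf("cache_dir slot size (%ld bytes) must be at least one page (%ld bytes)",
               (long)slotSize, (long)pageSize);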