From: Alex Rousskov
Date: Sat, 26 Apr 2014 17:30:33 +0000 (-0600)
Subject: Document counter-intuitive round-robin cache_dir selection bias; decrease it.
X-Git-Tag: SQUID_3_5_0_1~170^2~5
X-Git-Url: http://git.ipfire.org/cgi-bin/gitweb.cgi?a=commitdiff_plain;h=29a238a3a24c7776da9a6904824f86980a40c2de;p=thirdparty%2Fsquid.git

Document counter-intuitive round-robin cache_dir selection bias; decrease it.

Many squid.confs have at least two groups of cache_dir lines. For example,
rare "large" objects go to larger/slower HDDs while popular "small" objects
go to smaller/faster SSDs:

    # HDDs
    cache_dir rock /hdd1 ... min-size=large
    cache_dir rock /hdd2 ... min-size=large
    cache_dir rock /hdd3 ... min-size=large
    # SSDs
    cache_dir rock /ssd1 ... max-size=large-1
    cache_dir rock /ssd2 ... max-size=large-1
    cache_dir rock /ssd3 ... max-size=large-1
    # rock store does not support least-load yet
    store_dir_select_algorithm round-robin

Since round-robin selects the first suitable disk during a sequential scan,
the probability of /hdd1 (/ssd1) selection is higher than that of the other
HDDs (SSDs). Consider a large object that needs an HDD: /hdd1 is selected
whenever the scan starts at /ssd1, /ssd2, /ssd3, or /hdd1, while /hdd2 is
selected only when the scan starts at /hdd2 itself!

The documentation now warns against the above cache_dir configuration
approach and suggests interleaving cache_dirs instead:

    cache_dir rock /hdd1 ... min-size=large
    cache_dir rock /ssd1 ... max-size=large-1
    cache_dir rock /hdd2 ... min-size=large
    cache_dir rock /ssd2 ... max-size=large-1
    cache_dir rock /hdd3 ... min-size=large
    cache_dir rock /ssd3 ... max-size=large-1
    store_dir_select_algorithm round-robin

Squid's historic round-robin implementation made this natural bias even
worse because it used the last selected dir as the starting point of the
sequential scan.
In the first configuration example above, the historic implementation boosted
the probability that the scan would start at one of the SSDs: smaller objects
are more popular and, hence, their dirs are selected more often. With the
starting point usually at an SSD, even more _large_ objects were sent to
/hdd1 compared to the other HDDs! The code change avoids this artificial
boost (but the cache_dir lines should still be interleaved to avoid the
natural round-robin bias discussed earlier).
---

diff --git a/src/cf.data.pre b/src/cf.data.pre
index 194ee1696a..07851c60e7 100644
--- a/src/cf.data.pre
+++ b/src/cf.data.pre
@@ -3583,6 +3583,19 @@ DOC_START
 	disks. This algorithm does not spread objects by size, so any
 	I/O loading per-disk may appear very unbalanced and volatile.
 
+	If several cache_dirs use similar min-size, max-size, or other
+	limits to reject certain responses, then do not group such
+	cache_dir lines together, to avoid round-robin selection bias
+	towards the first cache_dir after the group. Instead, interleave
+	cache_dir lines from different groups. For example:
+
+		store_dir_select_algorithm round-robin
+		cache_dir rock /hdd1 ... min-size=100000
+		cache_dir rock /ssd1 ... max-size=99999
+		cache_dir rock /hdd2 ... min-size=100000
+		cache_dir rock /ssd2 ... max-size=99999
+		cache_dir rock /hdd3 ... min-size=100000
+		cache_dir rock /ssd3 ... max-size=99999
 DOC_END
 
 NAME: max_open_disk_fds

diff --git a/src/store_dir.cc b/src/store_dir.cc
index 8d124736b8..28d1c32a1f 100644
--- a/src/store_dir.cc
+++ b/src/store_dir.cc
@@ -207,22 +207,22 @@ SwapDir::objectSizeIsAcceptable(int64_t objsize) const
 static int
 storeDirSelectSwapDirRoundRobin(const StoreEntry * e)
 {
-    static int dirn = 0;
-    int i;
-    int load;
-    RefCount<SwapDir> sd;
-
     // e->objectLen() is negative at this point when we are still STORE_PENDING
     ssize_t objsize = e->mem_obj->expectedReplySize();
     if (objsize != -1)
         objsize += e->mem_obj->swap_hdr_sz;
 
-    for (i = 0; i < Config.cacheSwap.n_configured; ++i) {
-        if (++dirn >= Config.cacheSwap.n_configured)
-            dirn = 0;
+    // Increment the first candidate once per selection (not once per
+    // iteration) to reduce bias when some disk(s) attract more entries.
+    static int firstCandidate = 0;
+    if (++firstCandidate >= Config.cacheSwap.n_configured)
+        firstCandidate = 0;
 
-        sd = dynamic_cast<SwapDir *>(INDEXSD(dirn));
+    for (int i = 0; i < Config.cacheSwap.n_configured; ++i) {
+        const int dirn = (firstCandidate + i) % Config.cacheSwap.n_configured;
+        const SwapDir *sd = dynamic_cast<SwapDir *>(INDEXSD(dirn));
+        int load = 0;
         if (!sd->canStore(*e, objsize, load))
             continue;