TrieNode::add() incorrectly computed an offset of an internal data
structure, resulting in out-of-bounds memory accesses that could cause
corruption or crashes.
This bug was discovered and detailed by Joshua Rogers at
https://megamansec.github.io/Squid-Security-Audit/esi-underflow.html
where it was filed as "Buffer Underflow in ESI".
Alex Rousskov [Mon, 20 May 2024 14:50:19 +0000 (14:50 +0000)]
Docs: REQUIRED in ident_regex, proxy_auth_regex, ext_user_regex (#1818)
The three ACLs were documented as matching any username when configured
with a parameter spelled "REQUIRED". None of them actually treated that
parameter specially -- all interpreted it as an ordinary regex.
This dangerous documentation bug was introduced in 2000 commit 145cf92
that added ident_regex and proxy_auth_regex support. It was then
duplicated in 2003 commit abb929f that added ext_user_regex support.
This minimal documentation fix does not imply that these ACLs should not
treat REQUIRED values specially. Enabling such special treatment
requires significant code changes, especially if we want to do that well
and without duplicating the corresponding code.
Andreas Weigel [Tue, 14 Nov 2023 15:17:09 +0000 (15:17 +0000)]
FTP: Ignore credentials with a NUL-prefixed username (#1557)
FATAL: FTP login parsing destroyed username info
This bug was discovered and detailed by Joshua Rogers at
https://megamansec.github.io/Squid-Security-Audit/ftp-fatal.html
where it was filed as "FTP Authentication Crash".
Thomas Leroy [Tue, 28 Nov 2023 07:35:46 +0000 (07:35 +0000)]
Limit the number of allowed X-Forwarded-For hops (#1589)
Squid will ignore all X-Forwarded-For elements listed after the first 64
addresses allowed by the follow_x_forwarded_for directive. A different
limit can be specified by defining a C++ SQUID_X_FORWARDED_FOR_HOP_MAX
macro, but that macro is not a supported Squid configuration interface
and may change or disappear at any time.
Squid will log a cache.log ERROR if the hop limit has been reached.
This change works around problematic ACLChecklist and/or slow ACL
implementations that result in immediate nonBlockingCheck() callbacks.
Such callbacks have caused many bugs and development complications. In
clientFollowXForwardedForCheck() context, they lead to indirect
recursion that was bound only by the number of allowed XFF entries,
which could reach thousands and exhaust Squid process call stack.
This recursion bug was discovered and detailed by Joshua Rogers at
https://megamansec.github.io/Squid-Security-Audit/xff-stackoverflow.html
where it was filed as "X-Forwarded-For Stack Overflow".
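The bounded, non-recursive processing described above can be sketched as
follows. The SQUID_X_FORWARDED_FOR_HOP_MAX macro name comes from this
entry; the parsing helper and its details are invented for illustration:

```cpp
#include <cassert>
#include <string>
#include <vector>

#ifndef SQUID_X_FORWARDED_FOR_HOP_MAX
#define SQUID_X_FORWARDED_FOR_HOP_MAX 64
#endif

// Collect at most hopMax X-Forwarded-For entries, walking right-to-left
// (nearest hop first) so that entries beyond the limit are simply
// ignored instead of driving further (recursive) processing.
std::vector<std::string> xffHops(const std::string &header,
                                 size_t hopMax = SQUID_X_FORWARDED_FOR_HOP_MAX)
{
    std::vector<std::string> hops;
    size_t end = header.size();
    while (end > 0 && hops.size() < hopMax) {
        const size_t comma = header.rfind(',', end - 1);
        const size_t begin = (comma == std::string::npos) ? 0 : comma + 1;
        std::string hop = header.substr(begin, end - begin);
        // trim surrounding spaces
        const auto first = hop.find_first_not_of(' ');
        const auto last = hop.find_last_not_of(' ');
        if (first != std::string::npos)
            hops.push_back(hop.substr(first, last - first + 1));
        if (comma == std::string::npos)
            break;
        end = comma;
    }
    return hops;
}
```

A plain loop with an explicit cap cannot exhaust the call stack, no
matter how many XFF entries an attacker supplies.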
Squids receive IPv6 addresses from traffic, configuration, or
hard-coded constants even when ./configured with --disable-ipv6 or when
IPv6 support was automatically disabled at startup after failing IPv6
tests. To handle IPv6 correctly, such Squids must support basic IPv6
operations like recognizing an IPv6 address in a request-target or
reporting an unsolicited IPv6 DNS record. At least for now, such Squids
must also correctly parse configuration-related IPv6 addresses.
All those activities rely on various low-level operations like filling
addrinfo structure with IP address information. Since 2012 commit
c5fbbc7, Ip::Address::getAddrInfo() was failing for IPv6 addresses when
Ip::EnableIpv6 was falsy. That change correctly recognized[^1] the need
for such Squids to handle IPv6, but to support basic operations, we need
to reject IPv6 addresses at a higher level and without asserting.
That high-level rejection work is ongoing, but initial attempts have
exposed difficult problems that will take time to address. For now, we
just avoid the assertion while protecting IPv6-disabled Squid from
listening on or opening connections to IPv6 addresses. Since Squid
already expects (and usually correctly handles) socket opening failures,
disabling those operations is better than failing in low-level IP
manipulation code.
The overall IPv6 posture of IPv6-disabled Squids that lack http_access
or other rules to deny IPv6 requests will change: This fix exposes more
of IPv6-disabled Squid code to IPv6 addresses. It is possible that such
exposure will make some IPv6 resources inside Squid (e.g., a previously
cached HTTP response) accessible to external requests. Squids will not
open or accept IPv6 connections but may forward requests with raw IPv6
targets to IPv4 cache_peers. Whether these and similar behavior changes
are going to be permanent is open for debate, but even if they are
temporary, they are arguably better than the corresponding assertions.
These changes do not affect IPv6-enabled Squids.
The assertion in IPv6-disabled Squid was reported by Joshua Rogers at
https://megamansec.github.io/Squid-Security-Audit/ipv6-assert.html where
it was filed as "Assertion on IPv6 Host Requests with --disable-ipv6".
Alex Rousskov [Sun, 12 Nov 2023 09:33:20 +0000 (09:33 +0000)]
Do not update StoreEntry expiration after errorAppendEntry() (#1580)
errorAppendEntry() is responsible for setting entry expiration times,
which it does by calling StoreEntry::storeErrorResponse() that calls
StoreEntry::negativeCache().
This change was triggered by a vulnerability report by Joshua Rogers at
https://megamansec.github.io/Squid-Security-Audit/cache-uaf.html where
it was filed as "Use-After-Free in Cache Manager Errors". The reported
"use after free" vulnerability was unknowingly addressed by 2022 commit 1fa761a that removed excessively long "reentrant" store_client calls
responsible for the disappearance of the properly locked StoreEntry in
this (and probably other) contexts.
2015 commit 21cd322 started to continue ClientStream processing after
socket write(2) failures. In most cases, the code still "worked". For
example, initiateClose() would close the client-Squid connection, and
connStateClosed() would be called before Store has a chance to deliver
response body data requested by pullData() in writeComplete().
However, that response body data could sometimes reach Server, and
handleReply() would assert because startOfOutput() says that we have not
written the headers, but ClientStream state (i.e. a nil `rep` parameter)
says that we have. These assertions can be triggered by disabling
initiateClose(), and they can probably be triggered by traffic as well.
Now, after a Comm::Write() error, we terminateAll() client transactions
on the failed connection[^1] and do not call afterClientWrite() that is
not equipped to handle I/O errors and would continue ClientStream
processing if called.
This bug was discovered and detailed by Joshua Rogers at
https://megamansec.github.io/Squid-Security-Audit/stream-assert.html
where it was filed as "Implicit Assertion in Stream Handling".
[^1]: We terminateAll() instead of potentially postponing closure with
initiateClose() because the failed client-Squid connection most likely
cannot be salvaged for, say, reading the remainder of the request body.
Alex Rousskov [Wed, 25 Oct 2023 11:47:19 +0000 (11:47 +0000)]
Improve handling of expanding HTTP header values (#1536)
Squid manipulations often increase HTTP header value length compared to
the corresponding raw value received by Squid. Raw header length is
checked against request_header_max_size and reply_header_max_size that
default to 64KB, making the raw value safe to store in a String object
(by default). However, when the increased length of a manipulated value
exceeds String class limits, Squid leaks memory, asserts, or possibly
stalls affected transactions. The long-term fix for this problem is a
complete String elimination from Squid sources, but that takes time.
Known manipulations may effectively concatenate headers and/or increase
header value length by 50%. This workaround makes such known increases
safe by essentially tripling String class limits:
(64KB + 64KB) * 150% = 3 * 64KB
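The arithmetic behind the tripled limit can be spelled out as a
compile-time check (identifier names are invented; only the numbers come
from this entry):

```cpp
#include <cassert>
#include <cstddef>

// request_header_max_size / reply_header_max_size default (64KB):
constexpr std::size_t rawHeaderMax = 64 * 1024;

// Two concatenated raw headers, each possibly grown by 50% (*3/2),
// must still fit under the tripled String limit:
constexpr std::size_t worstCase = (rawHeaderMax + rawHeaderMax) * 3 / 2;

static_assert(worstCase == 3 * rawHeaderMax,
              "tripled String limit covers known manipulations");
```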
This bug was discovered and detailed by Joshua Rogers at
https://megamansec.github.io/Squid-Security-Audit/response-memleaks.html
where it was filed as "Memory Leak in HTTP Response Parsing".
Alex Rousskov [Wed, 25 Oct 2023 19:41:45 +0000 (19:41 +0000)]
RFC 1123: Fix date parsing (#1538)
The bug was discovered and detailed by Joshua Rogers at
https://megamansec.github.io/Squid-Security-Audit/datetime-overflow.html
where it was filed as "1-Byte Buffer OverRead in RFC 1123 date/time
Handling".
Alex Rousskov [Thu, 25 May 2023 02:10:28 +0000 (02:10 +0000)]
Fix userinfo percent-encoding (#1367)
%X expects an unsigned int, and that is what we were giving it. However,
to get to the correct unsigned int value from a (signed) char, one has
to cast to an unsigned char (or equivalent) first.
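The cast requirement can be demonstrated with a minimal sketch
(percentEncode is a hypothetical helper, not Squid's actual function;
note that whether plain char is signed is platform-dependent):

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Percent-encode one byte. Casting straight to unsigned int sign-extends
// high-bit bytes on platforms where plain char is signed, producing
// output like "%FFFFFFFF"; going through unsigned char first is correct.
std::string percentEncode(char c, bool castViaUnsignedChar)
{
    char buf[16];
    if (castViaUnsignedChar)
        std::snprintf(buf, sizeof(buf), "%%%02X",
                      static_cast<unsigned int>(static_cast<unsigned char>(c)));
    else
        std::snprintf(buf, sizeof(buf), "%%%02X",
                      static_cast<unsigned int>(c)); // wrong for negative chars
    return buf;
}
```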
Andreas Weigel [Wed, 18 Oct 2023 04:14:31 +0000 (04:14 +0000)]
Fix validation of certificates with CN=* (#1523)
The bug was discovered and detailed by Joshua Rogers at
https://megamansec.github.io/Squid-Security-Audit/
where it was filed as "Buffer UnderRead in SSL CN Parsing".
squidadm [Tue, 17 Oct 2023 15:50:56 +0000 (04:50 +1300)]
Fix stack buffer overflow when parsing Digest Authorization (#1517)
The bug was discovered and detailed by Joshua Rogers at
https://megamansec.github.io/Squid-Security-Audit/digest-overflow.html
where it was filed as "Stack Buffer Overflow in Digest Authentication".
---------
Co-authored-by: Alex Bason <nonsleepr@gmail.com>
Co-authored-by: Amos Jeffries <yadij@users.noreply.github.com>
Bug 5301: cachemgr.cgi not showing new manager interface URLs (#1479)
Also fix several related UI issues uncovered during testing:
* Prune the list of servers accessible via the CGI tool login.
Their responses would be badly mangled if accessed via the old
tool's parsing logic.
Also, hide the old login form if all servers use the new
manager interface.
* Ensure the 'menu' report is always used by default after
the CGI tool login. This prevents errors about MGR_INDEX
not being available on recent Squid releases, restoring the
expected CGI tool behavior.
Alex Rousskov [Tue, 27 Jun 2023 09:00:37 +0000 (05:00 -0400)]
Fix reported assertion condition being always "EX" (#1397)
FATAL: assertion failed: foobar.cc:123: "EX"
Autoconf v2.71 provides AC_HEADER_STDC that does not define
STDC_HEADERS. As a result, our assert() macro definitions print "EX"
instead of the actual failed condition when Squid is bootstrapped with
Autoconf v2.71 or later.
Autoconf documentation suggests writing code as if STDC_HEADERS is true
(but without actually checking it). This change follows that suggestion:
Squid no longer uses AC_HEADER_STDC and STDC_HEADERS.
A `#define STDC_HEADERS 1` line may still be present in generated
./configure because AC_USE_SYSTEM_EXTENSIONS places it there. However,
in Squid v5 ./configure, AC_USE_SYSTEM_EXTENSIONS only runs on Solaris.
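The failure mode can be illustrated with two macro variants modeled on
the description above (a sketch, not copied from Squid sources): with
the # stringizing operator the macro reports the real condition, while
the pre-ANSI fallback quotes the literal text "EX".

```cpp
#include <cassert>
#include <string>

// ANSI variant: #EX stringizes the actual condition tokens.
#define FAILED_COND_ANSI(EX) ((EX) ? std::string() : std::string(#EX))

// Pre-ANSI fallback (taken when STDC_HEADERS is undefined): no
// stringizing operator exists, so the literal string "EX" is reported.
#define FAILED_COND_KNR(EX)  ((EX) ? std::string() : std::string("EX"))
```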
----
This change contains a tiny portion of master/v6 commit a1c2236, a
commit that does not disclose the assert() message fix and contains many
more changes (some of which needed fixups). That commit also did not
remove STDC_HEADERS from compat/GnuRegex.c.
Fix default build with libgnutls but absent GnuTLS headers (#1332)
Ensure that when libgnutls is found but any of the required GnuTLS
headers cannot be used, GnuTLS support is not enabled by default and an
explicit request for GnuTLS support is fatally rejected during
./configure. Otherwise, the build fails later anyway, during "make".
ClientHttpRequest::out.offset is documented to only count body bytes,
and that is what receivedBodyTooLarge() needs. We should not subtract
the guessed header size, occasionally violating reply_body_max_size
configuration and often leading to a confusing negative body size in
receivedBodyTooLarge() debugging.
Possibly broken since 2003 commit b51aec6: out.offset was not documented
back then, but, AFAICT, it was already used just for the body bytes.
The changes in commit 6ab5d388ae8cffc4ac0e99b0bfac4f0c6356e81f
(PR#1254) are targeting code post commit 96f97829e13 (PR#1062)
which will not be backported to v5.
Adapting the stub file to the v5 methods signature.
Alex Rousskov [Sun, 2 Apr 2023 07:42:10 +0000 (07:42 +0000)]
Fix GCC v13 build [-Warray-bounds] (#1318)
src/ipc/mem/FlexibleArray.h:34:52: error: array subscript -1 is
below array bounds of 'int [1]' [-Werror=array-bounds]
We suspect this warning is a GCC v13 regression bug because the callers
marked as problematic by GCC (e.g., Rock::LoadingEntry::LoadingEntry) do
not use "array subscript -1", and the Ipc::StoreMapItems::at() operator
they use even asserts that the subscript is not negative. It might be
GCC bug 107699: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107699
This change replaces the fake one-item array hack with a properly
aligned byte (still used as the "start of the real array" marker).
Also removed some unused and problematic code (instead of polishing it).
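A minimal sketch of the replacement technique, assuming a simplified
FlexibleArray (the real Squid class differs): an aligned single byte
marks where the dynamically sized array begins, avoiding the fake
one-item array that tripped GCC 13's -Warray-bounds.

```cpp
#include <cassert>
#include <cstdlib>
#include <new>

template <class Item>
class FlexibleArray
{
public:
    Item *raw() { return reinterpret_cast<Item *>(&start_); }
    Item &operator[](const int index) { return raw()[index]; }

private:
    alignas(Item) char start_; // the "start of the real array" marker
};

struct SharedHeader {
    int capacity = 0;
    FlexibleArray<int> slots; // must be the last data member
};

// Demo: allocate the header plus 8 trailing ints in one block.
int demo()
{
    void *mem = std::malloc(sizeof(SharedHeader) + 8 * sizeof(int));
    auto *h = new (mem) SharedHeader;
    h->capacity = 8;
    for (int i = 0; i < h->capacity; ++i)
        h->slots[i] = i * i;
    const int last = h->slots[7];
    std::free(mem);
    return last;
}
```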
Alex Rousskov [Tue, 14 Feb 2023 21:31:11 +0000 (21:31 +0000)]
CI: Disable optimizations in "minimal" build tests (#1278)
Our "minimal" layer tests had to stop disabling optimizations in 2019
commit 4f3c41b due to problems specific to GCC v9 (v9.1 or earlier). GCC
v9.4 (and later) appear to work OK (at least on Ubuntu 22.04), so let's
try to resume testing --disable-optimizations ./configure option.
Since inception (2017 commit e99fa72), FileOpeningConfig::RetryGapUsec
was declared but not defined in any translation unit (a bug). The build
probably "worked" because compilers simply inlined the value of the
constant data member when calling xusleep(). Commit 5eee0e4 removed code
that passed RetryGapUsec to that external "C" function. Now, the same
member is passed to a heavily-inlined and convoluted STL code, and at
least some compilers (e.g., GCC v10-v12 on Ubuntu 22.04) stopped
inlining RetryGapUsec unless optimization is enabled. Our CI tests
passed because we do not test --disable-optimizations builds (yet).
We do not need FileOpeningConfig::RetryGapUsec to be static, breaking
builds, requiring extra code, and triggering questions. We do not even
need it to be constant, but removing "const" is an arguably out-of-scope
change, as is improving its type.
File::InvalidHandle is missing a definition as well, except on Windows,
but we are leaving that code "as is" for now, until we can test whether
Windows is OK with removing that "static" keyword.
The original text contained a "last modification age" typo and was
misinterpreted by some admins as if Squid applied the configured percent
to the current object age.
The text also did not make it clear that the percent-based heuristic is
effectively only applied to responses with a Last-Modified header (in
addition to the usual "without an explicit expiry time" precondition).
Fix build with clang v15.0.6 [-Wdeprecated-non-prototype] (#1246)
clang v15.0.6 distributed with Fedora Rawhide complains about function
definitions using K&R style:
lib/snmplib/mib.c:229:1: error: a function definition without a
prototype is deprecated in all versions of C and is not supported in
C2x [-Werror,-Wdeprecated-non-prototype]
ext_kerberos_ldap_group_acl: Support -b with -D (#1207)
When both '-b' (i.e. bind DN) and '-D' (i.e. Kerberos domain) options
are specified, '-b' is ignored completely. This breaks the helper when a
search subtree has to be limited (e.g., when using FreeIPA).
Fix it to take '-b' into account if it was specified with '-D'.
Alex Rousskov [Wed, 21 Sep 2022 13:53:22 +0000 (13:53 +0000)]
Fix escaping leading regex dashes in ./configure sources (#1152)
grep: invalid option -- 't'
Usage: grep [OPTION]... PATTERNS [FILE]...
Try 'grep --help' for more information.
Usage: grep [OPTION]... PATTERNS [FILE]...
Try 'grep --help' for more information.
./configure: line 27524: test: too many arguments
Usage: grep [OPTION]... PATTERNS [FILE]...
Try 'grep --help' for more information.
./configure: line 27698: test: too many arguments
Configure scripts are written in m4, not shell. M4 treats [square
brackets] specially. We could use quadrigraphs instead (as documented in
autoconf manual) but `(-)` looks a lot simpler than `@<:@-@:>@`!
Alex Rousskov [Mon, 17 Oct 2022 13:22:34 +0000 (13:22 +0000)]
Bug 5241: Block all non-localhost requests by default (#1164)
This change goes one step further than commit 6d2f8ed in satisfying the
same "by default, introduce no new attack vectors" principle.
Commit 6d2f8ed blocked access to localhost and to-link-local services
but still allowed access to potentially vulnerable popular cloud
instance metadata services running on site-local IPv6 addresses. It also
still allowed external access to localnet services that could be
completely unprepared for such dangers! This change closes those holes.
This default configuration has two extra deny rules. These rules become
necessary only when the admin adds an "http_access allow" rule below
them. We enabled these rules despite their overheads, including more DNS
queries in some environments, so that the admin does not have to
remember to enable them when following our "INSERT...HERE" instructions.
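In squid.conf terms, the resulting default may be sketched like this (a
simplified illustration using the ACL names discussed in these entries;
the stock configuration contains additional rules):

```
# Deny requests to localhost and to link-local services by default,
# before any "http_access allow" rules the admin inserts below.
http_access deny to_localhost
http_access deny to_linklocal
# INSERT YOUR OWN RULE(S) HERE to allow access from your clients
http_access deny all
```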
Alex Rousskov [Mon, 10 Oct 2022 23:07:02 +0000 (23:07 +0000)]
Bug 5241: Block to-localhost, to-link-local requests by default (#1161)
Squid suggested blocking to-localhost access since 2001 commit 4cc6eb1.
At that time, the default was not adjusted because some use cases were
known to require to-localhost access. However, the existence of special
cases should not affect defaults! The _default_ configuration should
either block all traffic or only allow traffic that is unlikely to
introduce new attack vectors into the network.
Also block to-link-local traffic (by default), for very similar reasons:
Popular cloud services use well-known IPv4 link-local (i.e. RFC 3927)
addresses (a.k.a. APIPA), to provide sensitive instance metadata
information, via HTTP, to instance users and scripts. Given cloud
popularity, those special addresses become nearly as ubiquitous as
127.0.0.1. Cloud provider networks shield metadata services from
external accesses, but proxies like Squid that forward external HTTP
requests may circumvent that protection.
Bug 5162: mgr:index URL does not produce MGR_INDEX template (#1191)
Satisfy mgr:index requests using
* a 200 OK response with a body derived from the MGR_INDEX template (if
that template file was found during (re)configuration) or
* a 404 (Not Found) error response (otherwise).
Broken in 2019 commit 7e6eabb, when Squid started replying using a 200
OK response with a hard-coded "mgr_index" text as a body, ignoring any
configured MGR_INDEX template.
* Fix build with OpenSSL v3.
* Refactor RSA key generation to avoid deprecated RSA_*() APIs.
* Refactor DH parameter and key config to avoid deprecated DH_*() APIs.
* Refactor ECDH key creation to avoid deprecated EC_*() APIs.
* Deprecate ssl_engine support in builds with OpenSSL v1-.
* Disable ssl_engine support in builds with OpenSSL v3+.
We deprecated/removed ssl_engine support (as summarized in the last two
bullets above) without providing an OpenSSL Providers-based alternative
because of the following factors:
1. We do not have the resources to update ssl_engine code to build
(without deprecation warnings) with OpenSSL v3 when the feature is
unused.
2. We do not have the resources to create an OpenSSL v3 Provider-based
replacement for ssl_engine code that uses deprecated Engine APIs.
3. OpenSSL v3 deprecated Engine support (triggering deprecation warnings
in applications that use Engine APIs with OpenSSL v3). Since Squid
default builds use -Werror, doing nothing would break such builds.
4. Squid ssl_engine does not appear to be a popular feature.
Alex Rousskov [Fri, 12 Aug 2022 16:06:12 +0000 (16:06 +0000)]
Do not tell admins that we are replacing their .* with our .* (#1119)
acl hasHeaderFoo req_header Foo .*
2022/08/12 11:43:27| WARNING: regular expression '.*' has only
wildcards and matches all strings. Using '.*' instead.
Broken since RE optimization inception (commit 6564dae), but the
excessive WARNING was "invisible" to many admins until 61be1d8 started
saving early level-0/1 messages to cache.log.
Alex Rousskov [Tue, 19 Jul 2022 21:25:32 +0000 (21:25 +0000)]
Discarded connections do not contribute to forward_max_tries (#1093)
The forward_max_tries directive is said to limit the number of
"high-level request forwarding attempts", but its documentation does not
specify whether each Happy Eyeballs connection opening attempt
constitutes such a forwarding attempt (because commit 3eebd26 that
clarified forward_max_tries meaning came before Happy Eyeballs code
started opening multiple concurrent connections in commit 5562295).
Official Squid counts each Happy Eyeballs connection attempt as a
request forwarding attempt. For example, if forward_max_tries is 2, then
Squid, to a likely admin surprise, may refuse to re-forward the request
after an HTTP failure because the second request forwarding attempt has
been used up by the (canceled!) spare TCP connection opening attempt.
This change stops counting discarded connections as request forwarding
attempts. For example, if forward_max_tries is 2, Squid will re-forward
the request (if needed and possible) even if Squid must open up to 3
connections to do that (discarding the second one in the process). In
this context, discarding the race-losing connection and going with the
winning one may be viewed as a low-level retry activity that
forward_max_tries is already documented to ignore.
Eventually, Squid may stop discarding connections that lost the Happy
Eyeballs race. When/if that complicated improvement is implemented,
those canceled connection opening attempts should be counted, but the
then-saved connections should be used for request re-forwarding,
preserving the same overall outcome: With forward_max_tries set to 2,
Squid will still re-forward the request (if needed and possible).
Before this change, setting forward_max_tries to 1 prevented all spare
connection attempts: The first DNS response received (A or AAAA) would
be used for the primary connection attempt, increasing n_tries, making
ranOutOfTimeOrAttempts() true, and blocking any spare attempt (both
concurrent and sequential). That low-level side effect is as wrong as
the blocked retries described in the "likely admin surprise" example
above. Now, admins that really want to disable spare connection attempts
may set forward_max_tries to 1 _and_ happy_eyeballs_connect_limit to 0.
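Per the last paragraph, admins who really want no spare connection
attempts now need both directives, for example:

```
# Disable spare (Happy Eyeballs) connection attempts entirely:
forward_max_tries 1
happy_eyeballs_connect_limit 0
```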
Alex Rousskov [Sun, 3 Jul 2022 14:48:50 +0000 (14:48 +0000)]
Maintenance: Fix cf_gen build warnings with clang v12 (#1076)
cf_gen.cc:... warning: comparison of array 'buff' not equal to a
null pointer is always true [-Wtautological-pointer-compare]
These conditions are not just always true but a bit odd or misleading
because they check the pointer after it was dereferenced. They were
added in c1f8bbd which misinterpreted the "else" conditions. The two
"regular DOC/CONFIG line" conditions are self-documented well enough.
When built with Samba TrivialDB, ext_session_acl would never
successfully look a session up due to an uninitialized key argument. As
a result, the helper would be unable to validate that a user had logged
in. Broken since TrivialDB support was added in acd207a.
Detected by Coverity. CID 1441979: Uninitialized scalar variable
(UNINIT).
Use SBuf for Fs::Ufs::RebuildState file paths (#1063)
In theory, a combination of a non-terminating-on-overflows snprintf()
and an incorrectly guessed MAXPATHLEN (or a cache_dir root path being at
the edge of a correctly guessed MAXPATHLEN) may result in non-terminated
fullpath. Use SBuf to avoid these particular headaches.
Detected by Coverity. CID 1461166: Copy of overlapping memory
(OVERLAPPING_COPY).
When the "no more destinations to try" notification comes after the last
forwarding/tunneling attempt has failed, the !destinationsFound block
does not run (because destinations were found), the usingDestination()
block does not run (because we are done with that last/failed
destination), but transportWait is false (for the same reason).
Also applied Bug 5090 (master/v6 commit 15bde30) FwdState protections to
tunnel.cc code. Tunnels may continue writing to the client while the
to-server connection is closing or closed, so TunnelStateData can be
potentially exposed to bug 5090 "no server connection but still
transporting" concerns. TunnelStateData never _retries_ successfully
established tunnels and, hence, can (and now does) stop receiving spare
destinations after committedToServer becomes true, but future tunnels
may start reforwarding in many cases, and most of the code does not need
to know about this (temporary?) simplification.
Also re-unified and polished related FwdState and TunnelStateData code,
including fixing lying source code comments and debug messages.
Segmentation fault in tests/testStore is another symptom.
testStoreHashIndex calls Store::Root().init() which eventually gets to
MemStore::Requested() which correctly requires Config.memShared to be
configured. Hack the missing configuration replacement, following the
inconsistent pattern in three other test suites.
TODO: Reduce configuration code duplication and such bugs by providing
initialization method(s) for all Store::Root()-using tests to (re)use.
What does LTO have to do with any of it? CPPUnit [allegedly] randomizes
execution of test suites using raw C++ pointers, and a test suite in the
testStore collection probably contains the right initialization code.
Unfortunately, Config initialization is evidently shared across test
suites (XXX). LTO evidently changes those raw pointer values enough to
change the order in an unfortunate way.
Initialise default_keytab in negotiate_kerberos_auth (#1032)
Address a Coverity-identified issue, where default_keytab might be read
when uninitialised in negotiate_kerberos_auth.
Ensure it is initialised at declaration.
Detected by Coverity, CID 1503291 (Uninitialized scalar variable)
Alex Rousskov [Thu, 28 Apr 2022 10:37:56 +0000 (10:37 +0000)]
Bug 5208: Part 1: Restart kids killed by SIGKILL (#1035)
OOM killer uses SIGKILL. Squid did not restart kids killed by SIGKILL.
Kids are essential Squid components. Essential components should be
revived (by default) because providing a service without an essential
component would violate Squid functionality requirements.
Squid did not revive a kid killed by SIGKILL because we thought that
doing so would interfere with the "squid -k kill" feature that uses that
signal to kill the whole Squid instance. However, that feature does not
actually work[^1] -- the signal is sent to (and kills) the master
process only, the process whose PID is in the squid.pid file. This change is
orthogonal to fixing "squid -k kill" (a difficult out-of-scope project).
[^1]: Except in the special case of the no-daemon (squid -N) mode.
Rob Cowart [Tue, 19 Apr 2022 22:39:11 +0000 (22:39 +0000)]
Fix SQUID-MIB smilint errors (#1020)
- Reorganized cachePeerTable and related objects to fix
smilint errors which prevent importing the MIB in compilers
with more strict validation (e.g. MGsoft).
- Import of Unsigned32 from SNMPv2-SMI was removed as it was
not being used.
Kid restart leads to persistent queue overflows, delays/timeouts (#706) (#1012)
WARNING: communication with ... may be too slow or disrupted...
WARNING: abandoning ... I/Os
ERROR: worker I/O push queue for ... overflow...
ERROR: Collapsed forwarding queue overflow...
SMP queues rely on the shared memory QueueReader::popSignal flag to
reduce the number of UDS messages that queue readers and writers need to
send each other. If the flag is true but there is no corresponding "wake
up, you have new queued items to read" UDS message for the reader, the
reader may stall. This happens when the reader restarts (e.g., after
hitting an assertion) while the flag is true. A stalled queue reader
leads to delays and queue overflows:
* When the problem affects worker-disker queues, disk I/O delays under
the hard-coded 7-second timeout are not reported to the admin but may
affect user experience. Larger delays trigger level-1 WARNINGs. Push
queue overflows trigger level-1 ERRORs.
* Transient worker-worker queue problems may stall concurrent
transactions that are reading from the cache entry being written by
another process. Overflows trigger level-1 ERRORs.
The restarted worker usually starts working just fine because it does
not expect any messages. A busy restarted worker may also appear to
continue working fine because workers always pop queued items before
pushing new ones -- as long as the worker queues new items, it will see
and pop responses to earlier requests, masking the problem. However, the
"stuck popSignal" problem never goes away: Squid only clears the flag
when receiving a notification, but sending new notifications is blocked
by that stuck flag.
Upon kid start, we now clear popSignal (to reflect the actual
communication state) and empty the queue (to reduce overflows). Since
commit 4c21861, any corresponding in-flight UDS queue notification is
ignored because it was sent to the previous process playing the same kid
role. The queue writer will see the false popSignal flag and send a new
notification when queuing a new item, preventing queue stalls.
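The fix described above can be sketched with invented names (the real
code uses shared memory queues and UDS notifications; here a counter
stands in for sending a wake-up message):

```cpp
#include <atomic>
#include <cassert>
#include <deque>

struct QueueReader {
    std::atomic<bool> popSignal{false}; // "you have new queued items to read"
};

struct Queue {
    QueueReader reader;
    std::deque<int> items;
    int notificationsSent = 0;

    void push(int item) {
        items.push_back(item);
        // Notify only on the false->true flag transition, as the SMP
        // queues do to reduce UDS message traffic.
        if (!reader.popSignal.exchange(true))
            ++notificationsSent; // stands in for sending a UDS message
    }

    bool pop(int &item) {
        reader.popSignal = false; // the notification was consumed
        if (items.empty())
            return false;
        item = items.front();
        items.pop_front();
        return true;
    }

    // The fix: upon kid (re)start, reflect the actual communication
    // state. A stale true flag would block all future notifications.
    void onReaderRestart() {
        reader.popSignal = false;
        items.clear(); // reduce overflows from stale items
    }
};
```

Without onReaderRestart(), a reader that restarts while popSignal is
true never receives another notification: the writer sees the stuck
flag and stays silent, which is exactly the stall described above.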
Also properly ignore stale disker responses about cache_dirs we have not
opened yet, especially since we are now trying to empty the queues ASAP
after a restart, before Coordinator has a chance to inform us about
available diskers, populating the IpcIoFiles container. We already have
similar protection from stale UDS messages and from stale disker queue
messages about _opened_ cache_dirs ("LATE disker response to...").
Nick Wellnhofer [Sun, 20 Feb 2022 19:42:40 +0000 (19:42 +0000)]
ESI: Drop incorrect and unnecessary xmlSetFeature() call (#988)
xmlSetFeature() has been deprecated for 10+ years and will eventually be
removed from libxml2. Squid calls xmlSetFeature() with the wrong
argument: a nil `value` pointer instead of a pointer to a zero value.
When called with a nil `value`, the function does nothing but return an
error. Squid does not check whether the xmlSetFeature() call is
successful, and the bug went unnoticed since libxml2 support was added
in commit 964b44c.
Since libxml2 does not substitute entities by default, the call can be
removed to achieve the intended effect.
Amos Jeffries [Sat, 29 Jan 2022 05:02:38 +0000 (05:02 +0000)]
Bug 5192: esi_parser default is incorrect (#968)
Since commit f5f7786 reversed the internal list order of ESI parser
registry, the esi_parser documentation and code comment in esi/Module.cc
have been incorrect.
Return ESI parsers to the documented default behaviour and make that
default explicit in the code selecting which Parser to initialize.
Also fixed an inverted test that resulted in the library configured via
esi_parser _not_ being the one used.
Alex Rousskov [Wed, 5 Jan 2022 18:38:07 +0000 (18:38 +0000)]
Bug 5177: clientca certificates sent to https_port clients (#955)
When sending an https_port _server_ certificate chain to the client,
Squid may send intermediate CA certificates found in clientca=... or
tls-cafile=... client certificate bundles. This "leak" of client CAs
surprises admins, may trigger traffic monitoring alarms, and might even
break https_port certificate validation in some TLS clients.
This surprising "leak" of client CAs is triggered by OpenSSL default
behavior of auto-completing server certificate chains using whatever CA
certificates happened to be in the TLS context certificate store. When
client certificate authentication is enabled, that store may contain
clientca CAs (or equivalent). OpenSSL CHANGES file acknowledges that
this aggressive default behavior can be a problem and introduces
SSL_MODE_NO_AUTO_CHAIN as a way to disable it.
This fix breaks misconfigured Squid deployments that (usually
unknowingly) rely on the OpenSSL clientca "leak" to build a complete
https_port server certificate chain sent to TLS clients. Such
deployments should add the right intermediate CA certificate(s) to their
https_port tls-cert=... bundle (or equivalent).
FwdState::usingDestination() should cover 3 post-transportWait periods:
1. peerWait: opening a CONNECT tunnel through the cache_peer
2. encryptionWait: TLS negotiations to secure the connection
3. Comm::IsConnOpen(serverConn): a Squid-peer transaction
The condition for the last period did not account for the time between
FwdState::unregister() and FwdState::complete() (or their equivalents),
when the transport connection is either closed or moved to the pconn
pool, but FwdState is still waiting for complete() and must not attempt
to connect to another destination. The bug is usually hidden because
complete() is usually called immediately after unregister(). However,
RESPMOD adaptation (at least) may delay complete(). If peer selection
news comes during that delay, usingDestination() lies, and various
Must()s may fail, depending on Squid version and HTTP request details.
Now, FwdState does not rely on the to-peer connection state for the
third period condition. Instead, we explicitly track whether the last
dispatch()ed activity has ended. This tracking is tricky because
complete() may be called without dispatching an asynchronous activity,
and because there are at least three ways for an activity to end:
explicit complete(), explicit handleUnregisteredServerEnd(), and
implicit serverClosed() connection closure callback. This tracking will
become easier and more reliable when FwdState no longer co-owns/stores
the to-peer connection. This change simplifies the future removal of
that connection ownership.
Also reordered usingDestination() conditions to match the chronological
order of the corresponding three periods.
Also reordered transportWait-vs-usingDestination() checks to match the
chronological order of those two forwarding stages.
That (void) was added in commit 7ec6d51 to "remove GNU-specific syntax",
but we cannot tell what specific problem that 10-year-old change solved.
All known documentation for the 3-parameter CPU_AND(3) says the call
returns void.
OpenBSD does not offer CPU_SET and CPU_ISSET. Implement their stubs as
inline functions to give the compiler proper hints about argument
(non-)use.
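A minimal sketch of the stub approach, using hypothetical names (not Squid's actual compat layer): inline no-op functions, unlike empty macros, let the compiler type-check the arguments and avoid unused-variable warnings at call sites.

```cpp
#include <cassert>

// Hypothetical placeholder for cpu_set_t on platforms without one.
struct CpuSetStub { unsigned long mask = 0; };

// No-op stub: arguments are accepted and type-checked, then ignored.
inline void CpuSetAdd(int cpu, CpuSetStub *set)
{
    (void)cpu;
    (void)set;
}

// No-op stub: with no affinity support, every CPU reports "not set".
inline bool CpuSetHas(int cpu, const CpuSetStub *set)
{
    (void)cpu;
    (void)set;
    return false;
}
```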
Update buildtest.sh to try relative paths first. This prevents autoconf
from complaining and failing when the directory path includes characters
from an unsafe set.
This change has several overlapping parts. Unfortunately, merging
individual parts is both difficult and likely to cause crashes.
## Part 1: Bug 5055.
FwdState used to check serverConn to decide whether to open a connection
to forward the request. Since commit 25b0ce4, a nil serverConn pointer
no longer implies that a new connection should be opened: FwdState
helper jobs may already be working on preparing an existing open
connection (e.g., sending a CONNECT request or negotiating encryption).
Bad serverConn checks in both FwdState::noteDestination() and
FwdState::noteDestinationsEnd() methods led to extra connectStart()
calls creating two conflicting concurrent helper jobs.
To fix this, we replaced direct serverConn inspection with a
usingDestination() call which also checks whether we are waiting for a
helper job. Testing that fix exposed another set of bugs: helper job
pointers and in-job connections were left stale or set after forwarding
failures. The changes described below addressed most of those problems.
## Part 2: Connection establishing helper jobs and their callbacks
A proper fix for Bug 5055 required answering a difficult question: When
should a dying job call its callbacks? We only found one answer which
required cooperation from the job creator and led to the following
rules:
* AsyncJob destructors must not call any callbacks.
* AsyncJob::swanSong() is responsible for async-calling any remaining
(i.e. set, uncalled, and uncancelled) callbacks.
* AsyncJob::swanSong() is called (only) for started jobs.
* AsyncJob destructing sequence should validate that no callbacks remain
uncalled for started jobs.
... where an AsyncJob x is considered "started" if AsyncJob::Start(x)
has returned without throwing.
A new JobWait class helps job creators follow these rules while keeping
track of in-progress helper jobs and killing no-longer-needed helpers.
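The callback rules above can be sketched with a minimal hypothetical job class (not Squid's actual AsyncJob): the destructor never fires the callback, swanSong() calls whatever callback remains, and the destructing sequence validates that started jobs left no callback uncalled.

```cpp
#include <cassert>
#include <functional>
#include <utility>

// Hypothetical job type illustrating the callback rules; real AsyncJob
// schedules callbacks asynchronously rather than calling them directly.
class SketchJob {
public:
    explicit SketchJob(std::function<void(bool)> cb): callback_(std::move(cb)) {}

    ~SketchJob() {
        // rule: destructors must not call callbacks; started jobs must
        // have none left uncalled by the time they are destroyed
        assert(!started_ || !callback_);
    }

    // rule: swanSong() calls any remaining (set, uncalled) callback
    void swanSong() {
        if (callback_) {
            auto cb = std::move(callback_); // detach before calling
            callback_ = nullptr;
            cb(false); // real code would async-call this
        }
    }

    // a job is "started" iff Start() returns without throwing
    static void Start(SketchJob &job) { job.started_ = true; }

private:
    bool started_ = false;
    std::function<void(bool)> callback_;
};
```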
Also fixed very similar bugs in tunnel.cc code.
## Part 3: ConnOpener fixes
1. Many ConnOpener users are written to keep a ConnectionPointer to the
destination given to ConnOpener. This means that their connection
magically opens when ConnOpener successfully connects, before
ConnOpener has a chance to notify the user about the changes. Having
multiple concurrent connection owners is always dangerous, and the
user cannot even have a close handler registered for its now-open
connection. When something happens to ConnOpener or its answer, the
user job may be in trouble. Now, ConnOpener callers no longer pass
Connection objects they own, cloning them as needed. That adjustment
required adjustment 2:
2. Refactored ConnOpener users to stop assuming that the answer contains
a pointer to their connection object. After adjustment 1 above, it
does not. HappyConnOpener relied on that assumption quite a bit, so we
had to refactor it to use two custom callback methods instead of one
with a complicated if-statement distinguishing prime from spare
attempts. This refactoring is an overall improvement because it
simplifies the code. Other ConnOpener users just needed to remove a
few no longer valid paranoid assertions/Musts.
3. ConnOpener users were forced to remember to close params.conn when
processing negative answers. Some, naturally, forgot, triggering
warnings about orphaned connections (e.g., Ident and TcpLogger).
ConnOpener now closes its open connection before sending a negative
answer.
4. ConnOpener would trigger orphan connection warnings if the job ended
after opening the connection but without supplying the connection to
the requestor (e.g., because the requestor has gone away). Now,
ConnOpener explicitly closes its open connection if it has not been
sent to the requestor.
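Rules 3 and 4 above can be sketched with hypothetical types (not the actual ConnOpener API): the opener owns the descriptor until a positive answer hands it over; on a negative answer, or when the requestor is gone, the opener closes the descriptor itself instead of relying on the recipient to remember to do it.

```cpp
#include <cassert>
#include <functional>

// Hypothetical connection: tracks only whether it is open.
struct SketchConn {
    bool open = false;
    void close() { open = false; }
};

// Hypothetical opener illustrating the ownership rule.
class SketchOpener {
public:
    SketchConn conn;                              // owned while the job runs
    std::function<void(SketchConn *)> requestor;  // null if requestor is gone

    void finish(bool success) {
        if (success && requestor) {
            requestor(&conn);   // positive answer: ownership passes on
        } else {
            conn.close();       // negative answer or no requestor: close
            if (requestor)      // first, then answer with no connection,
                requestor(nullptr); // avoiding orphaned-connection warnings
        }
    }
};
```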
Also fixed Comm::ConnOpener::cleanFd() debugging that was incorrectly
saying that the method closes the temporary descriptor.
Also fixed ConnOpener callback's syncWithComm(): The stale
CommConnectCbParams override was testing unused (i.e. always negative)
CommConnectCbParams::fd and was trying to cancel the callback that most
(possibly all) recipients rely on: ConnOpener users expect a negative
answer rather than no answer at all.
Also, after comparing the needs of the two old/existing and one
temporarily added ("clone everything") Connection cloning method
callers, we decided there is no need to maintain three different
methods. All existing
callers should be fine with a single method because none of them suffers
from "extra" copying of members that others need. Right now, the new
cloneProfile() method copies everything except FD and a few
special-purpose members (with documented reasons for not copying).
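A minimal sketch of that single cloning method, using hypothetical fields (not the actual Comm::Connection members): the clone copies the destination "profile" but never the FD or other special-purpose state, so it describes the same destination without co-owning the open socket.

```cpp
#include <cassert>
#include <string>

// Hypothetical connection type illustrating profile-only cloning.
struct SketchConnection {
    std::string remoteAddr;   // destination metadata: copied
    int flags = 0;            // destination metadata: copied
    int fd = -1;              // OS descriptor: intentionally NOT copied
    std::string rfc931;       // special-purpose member: intentionally NOT copied

    SketchConnection cloneProfile() const {
        SketchConnection c;
        c.remoteAddr = remoteAddr;
        c.flags = flags;
        // c.fd and c.rfc931 keep their defaults: the clone is not open
        return c;
    }
};
```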
Also added Comm::Connection::cloneDestinationDetails() debugging to
simplify tracking dependencies between half-baked Connection objects
carrying destination/flags/other metadata and open Connection objects
created by ConnOpener using that metadata (which are later delivered to
ConnOpener users and, in some cases, replace those half-baked
connections mentioned earlier). Long-term, we need to find a better way
to express these and other Connection states/stages than comments and
debugging messages.
## Part 4: Comm::Connection closure callbacks
We improved many closure callbacks to make sure (to the extent possible)
that Connection and other objects are in sync with Comm. There are lots
of small bugs, inconsistencies, and other problems in Connection closure
handlers. It is not clear whether any of those problems could result in
serious runtime errors or leaks. In theory, the rest of the code could
neutralize their negative side effects. However, even in that case, it
was just a matter of time before the next bug would bite us due to stale
Connection::fd and such. These changes themselves carry elevated risk,
but they get us closer to reliable code as far as Connection maintenance
is concerned; otherwise, we will keep chasing their deadly side effects.
Long-term, all these manual efforts to keep things in sync should become
unnecessary with the introduction of appropriate Connection ownership
APIs that automatically maintain the corresponding environments (TODO).
## Part 5: Other notable improvements in the adjusted code
Improved Security::PeerConnector::serverConn and
Http::Tunneler::connection management, especially when sending negative
answers. When sending a negative answer, we would set answer().conn to
an open connection, async-send that answer, and then hurry to close the
connection using our pointer to the shared Connection object. If
everything went according to the plan, the recipient would get a non-nil
but closed Connection object. Now, a negative answer simply means no
connection at all. Same for a tunneled answer.
Refactored ICAP connection-establishing code to delay Connection
ownership until the ICAP connection is fully ready. This change
addresses primary Connection ownership concerns (as they apply to this
ICAP code) except orphaning of the temporary Connection object by helper
job start exceptions (now an explicit XXX). For example, the transaction
no longer shares a Connection object with ConnOpener and
IcapPeerConnector jobs.
Probably fixed a bug where PeerConnector::negotiate() assumed that
sslFinalized() cannot return true after callBack(). It may (e.g., when
CertValidationHelper::Submit() throws). Same for
PeekingPeerConnector::checkForPeekAndSpliceMatched().
Fixed FwdState::advanceDestination() bug that did not save
ERR_GATEWAY_FAILURE details and "lost" the address of that failed
destination, making it unavailable to future retries (if any).
Polished PeerPoolMgr, Ident, and Gopher code to be able to fix similar
job callback and connection management issues.
Polished AsyncJob::Start() API. Start() used to return a job pointer,
but that was a bad idea:
* It implies that a failed Start() will return a nil pointer, and that
the caller should check the result. Neither is true.
* It encourages callers to dereference the returned pointer to further
adjust the job. That technically works (today) but violates the rules
of communicating with an async job. The Start() method is the boundary
after which the job is deemed asynchronous.
Also removed old "and wait for..." post-Start() comments because the
code itself became clear enough, and those comments were becoming
increasingly stale (because they duplicated the callback names above
them).
Fix Tunneler and PeerConnector handling of last-resort callback
requirements. Events like handleStopRequest() and callException() stop
the job but should not be reported as a BUG (e.g., it is up to
callException() to decide how to report the caught exception). There
might (or there will) be other, similar cases where the job is stopped
prematurely for some non-BUG reason beyond swanSong() knowledge. The
existence of non-bug cases does not mean there could be no bugs worth
reporting here, but until they can be identified more reliably than all
these benign/irrelevant cases, reporting no BUGs is a (much) lesser
evil.
TODO: Revise AsyncJob::doneAll(). Many of its overrides are written to
check for both positive (i.e. mission accomplished) and negative (i.e.
mission cancelled or cannot be accomplished) conditions, but the latter
is usually unnecessary, especially after we added handleStopRequest()
API to properly support external job cancellation events. Many doneAll()
overrides can probably be greatly simplified.
In git log order:
* e64a6c1: Undone an out-of-scope change and added a missing 'auto'
* aeaf83d: Fixed an 'unused parameter' error
* f49d009: fixup: No explicit destructors with implicit copying methods
* c30c37f: Removed an unnecessary explicit copy constructor
* 012f5ec: Excluded Connection::rfc931 from cloning
* 366c78a: ICAP: do not set connect_timeout on the established conn...
This branch is now in sync with SQUID-568-premature-serverconn-use (S)
commit e64a6c1, except for official changes merged from master/v6 into S
closer to the end of PR 877 work (i.e., S's merge commit 0a7432a).
Since master/v6 commit 2b6b1bc, a timeout on a ready-to-shovel
Squid-service ICAP connection was decrementing theBusyConns level one
extra time because Adaptation::Icap::Xaction::noteCommTimedout() started
calling both noteConnectionFailed() and closeConnection(). Depending on
the actual theBusyConns level, the extra decrement could result in FATAL
errors later, when putConnection() was called (for a different ICAP
transaction) with zero theBusyConns in an exception-unprotected context.
Throughout these changes, Xaction still counts the above timeouts as a
service failure. That is done by calling ServiceRep::noteFailure() from
Xaction::callException(), including in timeout cases described above.
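The double-decrement bug class can be sketched with a hypothetical counter (not the actual Icap::Xaction code): both the failure handler and the close handler adjust the busy-connection level, so one timeout decrements it twice unless the accounting is guarded to run exactly once.

```cpp
#include <cassert>

// Hypothetical transaction that accounts for one busy connection and
// must release that accounting exactly once, even when a timeout
// triggers both the failure path and the close path.
class SketchXaction {
public:
    explicit SketchXaction(int &busy): busy_(busy) { ++busy_; }

    void noteConnectionFailed() { release(); }
    void closeConnection() { release(); }

private:
    void release() {
        if (released_)
            return;      // both paths may run on a timeout; count once
        released_ = true;
        --busy_;
    }

    int &busy_;
    bool released_ = false;
};
```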
The order was reversed in commit cf48712, probably by accident. The
wrong order violates the TLS protocol and breaks TLS clients that are
incapable of reordering received intermediate CAs. Squid deployments that use
wrong-order bundles (to compensate for this bug) should reorder their
bundles when deploying this fix (or wait for Squid to order certificates
correctly, regardless of the bundle order -- a work in progress).
Bug 5187: Properly track (and mark) truncated store entries (#909)
Squid used an error-prone approach to identifying truncated responses:
The response is treated as whole[^1] unless somebody remembers to mark
it as truncated. This dangerous default naturally resulted in bugs where
truncated responses are treated as complete under various conditions.
This change reverses that approach: Responses not explicitly marked as
whole are treated as truncated. This change affects all Squid-server
FwdState-dispatched communications: HTTP, FTP, Gopher, and WHOIS. It
also affects responses received from the adaptation services.
Squid still tries to deliver responses with truncated bodies to clients
in most cases (no changes are expected/intended in that area).
[^1]: A better word to describe a "whole" response would be "complete",
but several key Squid APIs use "complete" to mean "no more content
is coming", misleading developers into thinking that a "completed"
response has all the expected bytes and may be cached/shared/etc.
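The reversed default can be sketched with a hypothetical entry type (not the actual StoreEntry API): an entry is treated as truncated unless the receiving code explicitly marks it whole, so forgetting the mark can no longer make a short response look complete.

```cpp
#include <cassert>

// Hypothetical store entry illustrating the safe default: assume
// truncation until the code that received the response proves otherwise.
class SketchStoreEntry {
public:
    void markWhole() { whole_ = true; }   // call only when all bytes arrived
    bool truncated() const { return !whole_; }

private:
    bool whole_ = false;   // default: truncated unless explicitly marked
};
```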
Transactions that failed due to origin server or peer timeout (a common
source of truncation) are now logged with a _TIMEOUT %Ss suffix and
ERR_READ_TIMEOUT/WITH_SRV %err_code/%err_detail.
Transactions prematurely canceled by Squid during client-Squid
communication (usually due to various timeouts) now have WITH_CLT
default %err_detail. This detail helps distinguish otherwise
similarly-logged problems that may happen when talking to the client or
to the origin server/peer.
FwdState now (indirectly) complete()s truncated entries _after_
releasing/adjusting them so that invokeHandlers() and others do not get
a dangerous glimpse at a seemingly OK entry before its release().
Alex Rousskov [Sun, 9 Jan 2022 10:41:24 +0000 (10:41 +0000)]
Bug 5132: Close the tunnel if to-server conn closes after client (#957)
Since commit 25d2603, blind CONNECT tunnel "jobs" (and equivalent) were
not destroyed upon a "lonely" to-server connection closure, leading to
memory leaks. And when a from-client connection was still present at the
time of the to-server connection closure, we did not try to reforward,
violating the spirit of commit 25d2603 changes. Calling retryOrBail() is
sufficient to handle both cases.