store/Disks.cc:690: error: argument 1 value 18446744073709551615
exceeds maximum object size 9223372036854775807
[-Werror=alloc-size-larger-than=]
const auto tmp = new SwapDir::Pointer[swap->n_allocated];
pconn.cc:43:53: error: argument 1 value 18446744073709551615 ...
theList_ = new Comm::ConnectionPointer[capacity_];
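The reported value 18446744073709551615 is (size_t)-1: the compiler proved
that an unchecked element count may wrap before reaching new[]. A minimal
sketch of the usual remedy (the actual fix may differ; allocateTable() and
maxEntries are invented names):

    #include <cassert>
    #include <cstddef>

    template <typename Pointer>
    static Pointer *allocateTable(const std::size_t n, const std::size_t maxEntries)
    {
        // rejecting wrapped or absurd counts lets the compiler prove the
        // allocation size stays below the maximum object size
        assert(n > 0 && n <= maxEntries);
        return new Pointer[n];
    }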
During shutdown, the iocb_table global was deleted, but the corresponding
cleanup code ignored some IoCallback members, triggering misleading
Ipc::UdsSender memory leak reports from Valgrind.
This change applies the NoNewGlobals design to address the above problem,
but this code needs a lot more refactoring to address other old problems
associated with Comm initialization.
test-builds.sh records build output in per-layer logs, and we are
already collecting those logs. Send build output to standard output as
well, to expose it in the GitHub Actions user interface. This new detailed
output is enclosed in a "layer ... output" group to keep emphasizing
problematic output lines extracted by test-builds.sh.
Also enclose MacOS build setup output in a group, for similar reasons.
Level-2 "Tunnel Server RESPONSE:..." debugs() incorrectly assumed that
its readBuf parameter contained hdr_sz header bytes. In reality, by the
time code reached that debugs(), readBuf no longer had any header bytes
(and often had no bytes at all). Besides broken header dumps, this bug
could lead to problems that Valgrind reports as "Conditional jump or
move depends on uninitialised value" in DebugChannel::writeToStream().
This fix mimics HttpStateData::processReplyHeader() reporting code,
including its known problems. Future changes should address those
problems and reduce code duplication across at least ten functions
containing similar "decorated" level-2 message dumps.
Bug 5093: List http_port params that https_port/ftp_port lack (#1977)
To avoid documentation duplication, current https_port and ftp_port
directive descriptions reference the http_port directive instead of
detailing their own supported parameters. For https_port, this solution
creates a false impression that the directive supports all http_port
options. Our ftp_port documentation is better but still leaves the
reader guessing which options are actually supported.
This change starts enumerating http_port configuration parameters that
ftp_port and https_port directives do _not_ support. Eventually, Squid
should reject configurations with unsupported listening port options.
Alex Rousskov [Tue, 31 Dec 2024 21:59:05 +0000 (21:59 +0000)]
Work around some mgr:forward accounting/reporting bugs (#1969)
In modern code, FwdReplyCodes[0][i] is usually zero because n_tries is
usually at least one at logReplyStatus() call time. This leads to the
mgr:forward report showing nothing but the table heading (i.e., no stats).
Also improve `try#N` heading:data matching by skipping FwdReplyCodes[0]
reporting (there is still no `try#0` heading) and adding a previously
missing `try#9` heading.
Alex Rousskov [Tue, 31 Dec 2024 20:40:46 +0000 (20:40 +0000)]
Clarify --enable-ecap failure on missing shared library support (#1968)
checking if libtool supports shared libraries... no
checking whether to build shared libraries... no
configure: error: eCAP support requires loadable modules.
Please do not use --disable-shared with --enable-ecap.
After 2022 commit 5a2409b7, our advice for handling the above error
became misleading in environments that do not use --disable-shared
explicitly but lack shared library support for other reasons.
Alex Rousskov [Tue, 31 Dec 2024 19:22:21 +0000 (19:22 +0000)]
Do not lookup IP addresses of X509 certificate subject CNs (#1967)
A true-vs-false `nodns` parameter value bug in a recent commit 22b2a7a0
caused, in some environments, significant startup delays and/or runtime
stalls because getaddrinfo(3) performed blocking DNS lookups when
parsing common names of X509 certificate subjects. Squid parses CNs when
loading configured certificates and when validating received ones. Other side
effects may have included Squid-generated certificates having wrong
alternative subject names and/or wrong certificate validation results.
Negative names and context-disassociated boolean constants strike again!
Fortunately, associated problematic Ip::Address::lookupHostIP() will be
replaced when the existing Ip::Address::Parse() TODO is addressed.
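The "negative names and boolean constants" lament hints at a general remedy
for such boolean traps; a hypothetical sketch (DnsLookup and parseName are
invented names, not actual Squid types):

    // A scoped enum instead of a bare bool makes call sites self-documenting
    // and makes silent true-vs-false swaps impossible.
    enum class DnsLookup { forbid, allow };

    static void parseName(const char *rawName, const DnsLookup dns)
    {
        // would call getaddrinfo(3) only when dns == DnsLookup::allow
        (void)rawName;
        (void)dns;
    }

    // parseName(cn, DnsLookup::forbid); // cannot be misread at the call site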
Alex Rousskov [Tue, 31 Dec 2024 17:27:40 +0000 (17:27 +0000)]
REQMOD stuck when adapted request (body) is not forwarded (#1966)
Squid forwards request bodies using BodyPipe objects. A BodyPipe object
has two associated agents: producer and consumer. Those agents are set
independently, at different processing stages. If BodyPipe consumer is
not set, the producer may get stuck waiting for BodyPipe buffer space.
When the producer creates a BodyPipe, it effectively relies on some code
somewhere to register a consumer (or declare a registration failure), but
automatically tracking that expectation fulfillment is impractical.
For REQMOD transactions involving adapted request bodies, including ICAP
204 transactions, Client::startRequestBodyFlow() sets body consumer. If
that method is not called, there will be no consumer, and REQMOD may get
stuck. Many `if` statements can block request forwarding altogether or
block a being-forwarded request from reaching that method. For example,
adapted_http_access and miss_access denials block request forwarding.
Without REQMOD, request processing can probably get stuck for similar
lack-of-consumer reasons, but regular requests ought to be killed by
various I/O or forwarding timeouts. There are no such timeouts for those
REQMOD transactions that are only waiting for request body consumer to
clear adapted BodyPipe space (e.g., after all ICAP 204 I/Os are over).
Relying on timeouts is also very inefficient.
For a `mgr:mem` observer, stuck REQMOD transactions look like a ModXact
memory leak. A `mgr:jobs` report shows ModXact jobs with RBS(1) status.
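Schematically, the contract described above (a simplified illustration, not
the real BodyPipe API):

    // The producer blocks on buffer space that only a consumer can free, so
    // a never-registered consumer means a stuck transaction.
    struct Agent {};

    class Pipe
    {
    public:
        void setProducer(Agent *a) { producer = a; }
        void setConsumer(Agent *a) { consumer = a; } // the step that may never happen

        bool mayGetStuck(const bool bufferFull) const
        {
            return producer && !consumer && bufferFull;
        }

    private:
        Agent *producer = nullptr;
        Agent *consumer = nullptr; // nothing guarantees this is ever set
    };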
Report cache_peer context in probe and standby pool messages (#1960)
The absence of the usual "current master transaction:..." detail in
certain errors raises "Has Squid lost the current transaction context?"
red flags:
ERROR: Connection to peerXyz failed
In some cases, Squid may have lost that context, but for cache_peer TCP
probes, Squid has not because those probes are not associated with
master transactions. It is difficult to distinguish the two cases
because no processing context is reported. To address those concerns,
we now report current cache_peer probing context (instead of just not
reporting absent master transaction context):
ERROR: Connection to peerXyz failed
current cache_peer probe: peerXyzIP
When maintaining a cache_peer standby=N connection pool, Squid has and
now reports both contexts, attributing messages to pool maintenance:
ERROR: Connection to peerXyz failed
current cache_peer standby pool: peerXyz
current master transaction: master1234
The new PrecomputedCodeContext class handles both reporting cases and
can be reused whenever the cost of pre-computing detailCodeContext()
output is acceptable.
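The idea behind that class, roughly (a sketch; the real CodeContext API has
more members):

    #include <ostream>
    #include <string>
    #include <utility>

    class PrecomputedContext
    {
    public:
        explicit PrecomputedContext(std::string detail): detail_(std::move(detail)) {}

        // called whenever a message needs context; simply replays the
        // detail rendered once at construction time
        std::ostream &detailCodeContext(std::ostream &os) const { return os << detail_; }

    private:
        std::string detail_; // e.g., "current cache_peer probe: peerXyzIP"
    };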
CI: Add OpenBSD build tests for staged commits (#1964)
Use a GitHub-hosted VM to create OpenBSD test environment. Requires
GitHub repository configuration that permits the use of
`vmactions/openbsd-vm@v1` Action.
We have not enabled ccache optimization for OpenBSD tests because we
do not know how to copy updated ccache files from the VM back into the runner.
CI: Add GitHub Actions workflow for periodic Coverity Scan (#1958)
Implement a weekly scheduled GitHub Actions workflow to run Coverity
Scan (i.e. cov-build). Currently, we run Coverity Scan using Jenkins.
The new job uses the Squid Project pre-made docker image because
installing the tools required to use the free Coverity Scan service cannot
be easily automated at the moment.
The job only runs for the official Squid Project repository.
Portability: remove explicit check for libdl (#1963)
OpenBSD does not have libdl, as it has dlopen() in libc.
It is not really needed, and force-requiring the presence of libdl
causes ./configure to fail on OpenBSD:
checking for dlopen in -ldl... no
configure: error: Required library 'dl' not found
stat5minClientRequests() was meant to return the number of recent
client requests. However, the function did not provide the implied 5-minute
precision. It returned, roughly speaking, the number of requests during
the last 0-6 minutes. The new, less strict function name and boolean
type avoid this false precision implication.
Squid may crash when the SquidConfig global is auto-destructed after
main() ends. Since the SquidConfig global is used by cleanup code, we should
keep its fields alive, essentially emulating "No New Globals" policy
effects. This surgical fix will be followed up with more changes to
address general OpenSSL cleanup problems exposed by this bug.
This bug fix facilitates backporting by using FuturePeerContext shim.
Alex Rousskov [Wed, 27 Nov 2024 02:32:02 +0000 (02:32 +0000)]
Treat responses to collapsed requests as fresh (#1927)
Squid transactions involving collapsed requests receive their responses
as Store cache hits. Cache hits undergo mandatory freshness validation
checks and, if those checks detect a stale response, may trigger
revalidation (e.g., an If-Modified-Since request to the origin server).
This logic results in a still-being-delivered-to-Squid response
triggering its own revalidation if that response is deemed stale on
arrival by a collapsed request (e.g., has a Cache-Control: max-age=0
header field).
HTTP RFC 9111 Section 4.7 briefly mentions collapsed requests but is
ambiguous with regard to validation of responses to collapsed requests.
The IETF HTTP Working Group chair believes that specs are unclear, and that
these responses should not be treated as cache hits (in primary cases):
https://lists.w3.org/Archives/Public/ietf-http-wg/2024JanMar/0095.html
This change follows the above advice and prevents arguably excessive
freshness checks for responses to collapsed requests. This change is
collapsed-forwarding specific: It does not prevent freshness checks for
responses that were, at the time of a hit request, either fully cached
or still receiving response body bytes.
After this change, clientReplyContext can no longer collapse twice, once
after initial Store lookup and then again during refresh, because the
first collapse now precludes refresh.
Simplified quick_abort_pct code and improved its docs (#1921)
Instead of ignoring quick_abort_pct settings that would, together with
other conditions, abort a pending download of a 99-byte or smaller
response, Squid now honors quick_abort_pct for all response sizes. Most
Squids are not going to be affected by this change because default
quick_abort_min settings (16KB) prevent aborts of 99-byte responses even
before quick_abort_pct is checked.
Due to conversion from integer to floating point math, this change may
affect responses larger than 99 bytes as well, but these effects ought
to be limited to cases where the decision is based on a tiny difference
(e.g., receiving 1% more bytes would have triggered full download). In
most such cases, the decision could probably go either way due to
response header size fluctuations anyway.
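For illustration, the percentage check after the switch to floating-point
math might look like the sketch below; the function and parameter names are
invented, and the quick_abort_min/max checks are omitted:

    #include <cstdint>

    // finish the download when the received fraction reaches quick_abort_pct,
    // now for all response sizes, including 99-byte and smaller ones
    static bool finishDespiteAbort(const std::uint64_t receivedBytes,
                                   const std::uint64_t expectedBytes,
                                   const double abortPct)
    {
        if (!expectedBytes)
            return false; // unknown length is decided by other settings
        const auto receivedPct = 100.0 * receivedBytes / expectedBytes;
        return receivedPct >= abortPct; // integer math would truncate for tiny responses
    }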
Also updated quick_abort_pct documentation, primarily to clarify a
misleading statement: Squid did not and does not treat 16KB or smaller
responses specially in this context. The original statement was probably
based on quick_abort_min _default_ setting of 16KB, but statement
phrasing and placement hid that connection.
Tony Walker [Sat, 16 Nov 2024 22:10:39 +0000 (22:10 +0000)]
Bug 5363: Handle IP-based X.509 SANs better (#1793)
Most X.509 Subject Alternate Name extensions encountered by Squid are
based on DNS domain names. However, real-world servers (including
publicly available servers that use vanity IP addresses) also use
IP-based SANs. Squid mishandled IP-based SANs in several ways:
* When generating certificates for servers targeted by their IP
addresses, addAltNameWithSubjectCn() used that target IP as a
DNS-based SAN, resulting in a frankenstein DNS:[ip] SAN value that
clients ignored when validating a Squid-generated certificate.
* When validating a received certificate, Squid was ignoring IP-based
SANs. When Subject CN did not match the requested IP target, Squid
only looked at DNS-based SANs, incorrectly failing validation.
* When checking certificate-related ACLs like ssl::server_name,
matchX509CommonNames() ignored IP-based SANs, not matching
certificates containing ACL-listed IP addresses.
Squid now recognizes and generates IP-based SANs.
Squid now attempts to match IP-based SANs with ACL-listed IP addresses,
but the success of that attempt depends on whether ACL IP parameters are
formatted the same way inet_ntop(3) formats those IP addresses: Matching
is still done using c-string/domain-based ::matchDomainName() (for
ssl::server_name) and string-based regexes (for ssl::server_name_regex).
Similar problems affect dstdomain and dstdomain_regex ACLs. A dedicated
fix is needed to stop treating IPs as domain names in those contexts.
This change introduces partial support for preserving IP-vs-domain
distinction in parsed/internal Squid state rather than converting both
to a string and then assuming that string is a DNS domain name.
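For reference, distinguishing the two SAN kinds with OpenSSL looks roughly
like this (a sketch with error handling omitted; reportSans() is an invented
name):

    #include <arpa/inet.h>
    #include <openssl/x509v3.h>

    static void reportSans(X509 *cert)
    {
        // IP-based SANs arrive as GEN_IPADD entries carrying raw 4- or
        // 16-byte addresses, not as GEN_DNS strings; inet_ntop(3) output is
        // what ACL parameters must currently match against.
        const auto sans = static_cast<GENERAL_NAMES *>(
            X509_get_ext_d2i(cert, NID_subject_alt_name, nullptr, nullptr));
        if (!sans)
            return;
        for (int i = 0; i < sk_GENERAL_NAME_num(sans); ++i) {
            const auto gn = sk_GENERAL_NAME_value(sans, i);
            if (gn->type == GEN_IPADD) {
                const auto ip = gn->d.iPAddress; // ASN1_OCTET_STRING, raw bytes
                char buf[INET6_ADDRSTRLEN];
                inet_ntop(ip->length == 4 ? AF_INET : AF_INET6, ip->data, buf, sizeof(buf));
            } // GEN_DNS entries carry domain name strings instead
        }
        GENERAL_NAMES_free(sans);
    }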
Alex Rousskov [Wed, 13 Nov 2024 18:16:08 +0000 (18:16 +0000)]
Annotate PoolMalloc memory in valgrind builds (#1946)
MemPoolMalloc code (i.e. memory_pools code used by default) was missing
VALGRIND_MAKE_MEM_*() calls. Similar calls do exist in MemPoolChunked
code (i.e. code enabled by setting MEMPOOLS environment variable to 1).
Even with these markings, "memory_pools on" configuration is still not
quite compatible with Valgrind leak reporting suppressions: In some
cases, Valgrind may still incorrectly suppress leak reporting (or report
leaks that should have been suppressed) because Valgrind associates leak
suppressions with memory _allocators_ while buggy code may leak memory
allocated by others. The long-term solution (if it exists) requires
upgrading these markings to VALGRIND_MEMPOOL_*() API targeting memory
pools, but that requires a serious effort, especially when dealing with
MemPoolChunked complexities. The added markings help detect non-leak
violations and improve PoolMalloc/MemPoolChunked code symmetry.
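The added calls follow the standard memcheck client-request pattern;
schematically (pool internals simplified to two invented hooks):

    #include <cstddef>
    #include <valgrind/memcheck.h>

    // freed pool memory is poisoned so any use-after-free is reported, and
    // re-issued memory is marked undefined so uninitialized reads are too
    void *poolHandOut(void *obj, const std::size_t objSize)
    {
        VALGRIND_MAKE_MEM_UNDEFINED(obj, objSize); // fresh object: no valid bits yet
        return obj;
    }

    void poolTakeBack(void *obj, const std::size_t objSize)
    {
        VALGRIND_MAKE_MEM_NOACCESS(obj, objSize); // idle object: all access is a bug
    }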
CI: Cancel obsolete concurrent GitHub Actions workflow jobs (#1940)
If a new push happens to a staging branch or a PR branch, continuing to
run now-obsolete tests is pointless and wasteful. However, we do want to
finish any jobs running on previous master branch commits, so that every
master branch commit has full test results.
Alex Rousskov [Mon, 11 Nov 2024 16:17:09 +0000 (16:17 +0000)]
CI: Use ccache-action repo maintained by its original author (#1943)
squid-cache/ccache-action@v1.2.14 is not allowed to be used in ...
For recent commit 627cca6d, we cloned hendrikmuhs/ccache-action
repository because GitHub Actions prohibited usage of that repository in
Squid Project CI tests. Cloning worked around that restriction, but we
did not realize that there are other solutions, and that cloning forces
all other Squid repositories to either clone squid-cache/ccache-action
or permit squid-cache/ccache-action execution by other means.
To reduce the number of obscure repositories Squid Project and others
have to deal with, it is better to adjust repository configuration to
allow well-known hendrikmuhs/ccache-action in "Actions permissions" at
https://github.com/.../squid/settings/actions
Also prevent nil connection pointer dereference and setsockopt() calls
with negative FD in comm_reset_close() and old_comm_reset_close().
It is unknown whether such bugs could be triggered before these changes.
Also removed fde::flags::nolinger as unused since 1999 commit 2391a162.
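The guarded sequence presumably looks something like this sketch (names
invented; the real comm code also closes the descriptor afterwards):

    #include <sys/socket.h>

    static void resetClose(const int fd)
    {
        if (fd < 0)
            return; // avoids the negative-FD setsockopt() calls described above

        // SO_LINGER with a zero timeout makes close() send RST instead of FIN
        struct linger zeroLinger = {};
        zeroLinger.l_onoff = 1;
        zeroLinger.l_linger = 0;
        (void)setsockopt(fd, SOL_SOCKET, SO_LINGER, &zeroLinger, sizeof(zeroLinger));
        // ... close(fd) follows
    }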
Amos Jeffries [Tue, 5 Nov 2024 16:46:41 +0000 (16:46 +0000)]
Fix systemd startup sequence to require active Local Filesystem (#1937)
Squid requires the Local Filesystem to be active. While uncommon, the
filesystem may in some cases be incomplete or delayed. Ensure that the
dependency is explicitly listed to prevent failures during Squid's
early initialization.
This change resolves Debian Bug 956581:
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=956581
Nil request dereference in ACLExtUser and SourceDomainCheck ACLs (#1931)
ACLExtUser-based ACLs (i.e. ext_user and ext_user_regex) dereferenced a
nil request pointer when they were used in a context without a request
(e.g., when honoring on_unsupported_protocol).
SourceDomainCheck-based ACLs (i.e. srcdomain and srcdom_regex) have a
similar bug, although we do not know whether broken slow ACL code is
reachable without a request (e.g., on_unsupported_protocol tests cannot
reach that code until that directive starts supporting slow ACLs). This
change does not start to require request presence for these two ACLs to
avoid breaking any existing configurations that "work" without one.
Writing a nil c-string to an std::ostream object results in undefined
behavior. When Security::IoResult::errorDescription is nil, that UB
leads to crashes for some STL implementations. These changes avoid UB
while making higher-level reporting code simpler and safer.
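The underlying UB is easy to reproduce and to guard against:

    #include <iostream>

    void report(const char *errorDescription)
    {
        // std::cerr << errorDescription;  // UB when errorDescription is nil
        std::cerr << (errorDescription ? errorDescription : "[no description]");
    }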
This change alters affected level-1 ERROR text lines a little, including
removing duplicated connection information from clientNegotiateSSL()
message (cache_log_message id=62). That duplication existed because
Squid reports the same Connection info automatically via CodeContext.
New WithExtras() mechanism may be useful for other "low-level debugging
and level-0/1 details for admins ought to differ" cases as well. Today,
the only known debugging context for Security::IoResult is
Security::PeerConnector::suspendNegotiation(), but that is likely to
change as we upgrade TLS callers to library-independent wrappers beyond
the current Security::Accept() and Security::Connect() pair.
Ftp::Gateway may segfault in level-3 double-complete debugs() (#1923)
Ftp::Gateway::completeForwarding() must check data.conn pointer before
dereferencing it. Long-term, we should improve Comm::ConnectionPointer
printing to safely report Connection::id (where available). This minimal
fix just mimics existing Ftp::Relay::abortOnData() solution.
Use GitHub Actions for more OS-specific build tests (#1917)
Test staged commits using some of our docker images from Jenkins tests.
The added tests (see staged.yaml) were easier to adopt; we are also
working on enabling GitHub Actions for FreeBSD and some other images.
We do not run these added tests for PR commits because existing Ubuntu
tests (defined in pr.yaml) already expose the vast majority of build
problems, and we are worried that running a lot more tests for each PR
push event would consume too many GitHub resources and significantly
increase noise in PR checks summary, obscuring often-easier-to-handle
failures detected by Ubuntu tests.
Also postpone MacOS tests until PR staging. On GitHub, MacOS runners are
x10 more expensive than Linux runners. We use cheaper runners for fast
feedback while still checking MacOS build before each merged commit.
This change does not increase GitHub CI wait time for PR push tests
because those checks are dominated by the unchanged 35min CodeQL-tests
job. However, it reduces total CPU time used (from 2h 30m to 2h) because
we no longer perform 3 MacOS tests for PR push events.
This change doubles GitHub CI wait time for staged commit tests (from
35m to 1h) and drastically increases total CPU time used (from 2h to
17h), primarily due to 84 added docker-based linux-distros checks.
GitHub provides neither riscv64 runners nor free Linux/arm64 runners.
Our initial attempts to virtualize Linux/arm64 tests on
MacOS/arm64 runners and to make virtualized FreeBSD and OpenBSD tests
work on Linux/x64 were not successful. Thus, we still rely on Jenkins
for Linux/riscv64, Linux/arm64, FreeBSD/x64, and OpenBSD/x64 tests.
uhliarik [Fri, 11 Oct 2024 03:31:19 +0000 (03:31 +0000)]
Bug 5449: Ignore SP and HTAB chars after chunk-size (#1914)
Prior to 2023 commit 951013d0, Squid accepted Transfer-Encoding chunks
with chunk-size followed by spaces or tabs (before CRLF). This HTTP
syntax violation was allowed to address Bug 4492 (fixed in 2017 commit
26f0a359). This change restores that fix's functionality. FWIW, our
research shows that nginx and httpd also accept similar input.
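A sketch of the restored tolerance, over a raw buffer for simplicity (the
real parser works in tokenizer terms):

    #include <string_view>

    // after the chunk-size digits, skip optional SP/HTAB bytes before the
    // caller insists on CRLF -- the lenient pre-951013d0 behavior
    static std::size_t skipChunkSizePadding(const std::string_view buf, std::size_t pos)
    {
        while (pos < buf.size() && (buf[pos] == ' ' || buf[pos] == '\t'))
            ++pos; // tolerated violation of strict chunk-size syntax
        return pos;
    }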
Fix validation of Digest auth header parameters (#1906)
Insufficient validation of Digest authentication parameters resulted in
a DigestCalcHA1() call that dereferenced a nil pointer.
This bug was discovered and detailed by Joshua Rogers at
https://megamansec.github.io/Squid-Security-Audit/ where it was filed as
"strlen(NULL) Crash Using Digest Authentication".
Alex Rousskov [Sat, 5 Oct 2024 16:18:33 +0000 (16:18 +0000)]
Prohibit bad --enable-linux-netfilter combinations (#1893)
Since 2009 commit 51f4d36b, Squid declared that "all transparent build
options are independent, and may be used in any combination". That
declaration was accurate from a "successful build" point of view, but
Ip::Intercept::LookupNat() (and its precursors) started mishandling at
least two combinations of options as detailed below.
LookupNat() tries up to four lookup methods (until one succeeds):
The first method -- NetfilterInterception() -- fails to look up the true
destination address of an intercepted connection when, for example, the
client goes away just before the lookup. In those (relatively common in
busy environments) cases, the intended destination address cannot be
obtained via getsockname(), but LookupNat() proceeds calling other
methods, including the two methods that may rely on getsockname().
Methods 2 and 3 may rely on a previous getsockname() call to provide
true destination address, but getsockname() answers are not compatible
with what NetfilterInterception() must provide -- the destination
address returned by getsockname() is Squid's own http(s)_port address.
When Squid reaches problematic code now encapsulated in a dedicated
UseInterceptionAddressesLookedUpEarlier() function, Squid may end up
connecting to its own http(s)_port! Such connections may cause
forwarding loops and other problems. In SslBump deployments, these loops
form _before_ Forwarded-For protection can detect and break them.
These problems can be prevented if an admin does not enable incompatible
combinations of interception lookup methods. The relevant code is
correctly documented as "Trust the user configured properly", but that
statement essentially invalidates our "may be used in any combination"
design assertion and leads to runtime failures when the user configures
improperly. Those runtime failures are difficult to triage because they
lack signs pointing to a build misconfiguration.
This change bans incompatible NetfilterInterception()+getsockname()
combinations at compile time: Squid ./configured with
--enable-linux-netfilter cannot use --enable-ipfw-transparent or
(--enable-pf-transparent --without-nat-devpf).
TODO: Ban incompatible combinations at ./configure time as well!
We have considered and rejected an alternative solution where all
./configure option combinations are still allowed, but LookupNat()
returns immediately on NetfilterInterception()/SO_ORIGINAL_DST failures.
That solution is inferior to build-time bans because an admin may think
that their Squid uses other configured lookup method(s) if/as needed,
but Squid would never reach them in --enable-linux-netfilter builds.
The only viable alternative is to specify lookup methods in squid.conf,
similar to the existing "tproxy" vs. "intercept" http(s)_port modes. In
that case, squid.conf will be validated for incompatible method
combinations (if combinations are supported at all), and LookupNat()
will only use configured method(s).
Do not mark successful FTP PUT entries with ENTRY_BAD_LENGTH (#1904)
Since 2021 commit ba3fe8d, we explicitly mark complete responses and
treat all other responses as truncated. That commit missed a case where
the FTP server responds with 226 or 250 code after receiving the upload.
The bug affects HTTP PUT requests using ftp URI scheme.
Incorrect truncation marking adds an unwanted WITH_CLIENT %err_detail to
ERR_FTP_PUT_CREATED transaction records in access.log:
201 PUT ftp://... ERR_FTP_PUT_CREATED/FTP_REPLY_CODE=226+WITH_CLIENT
Fixed code logs:
201 PUT ftp://... ERR_FTP_PUT_CREATED/FTP_REPLY_CODE=226
Squid builds without pkg-config, but results are likely to surprise
administrators because many optional features will not be
default-enabled despite properly installed libraries.
The ESI feature has a number of bugs and security vulnerabilities.
It is also rarely used, and a survey of active community members
has not revealed a need to keep maintaining this code.
Alex Rousskov [Fri, 20 Sep 2024 09:35:04 +0000 (09:35 +0000)]
Use ERR_ACCESS_DENIED for HTTP 403 (Forbidden) errors (#1899)
... when request authentication fails. Do not use
ERR_CACHE_ACCESS_DENIED for those "permanent" errors.
Default ERR_CACHE_ACCESS_DENIED is meant for cases where the user is
likely to eventually gain access (e.g., by supplying credentials). Its
default text says "not currently allowed... until you have authenticated
yourself". When the error page was added in 1998 commit cb69b4c7 it was
only used for HTTP 407 errors. The same logic was preserved when that
code was refactored in 1999 commit 1cfdbcf0, but exceptions started to
creep in, perhaps accidentally, since 2011 when HTTP 403 case was added
in commit 2f1431ea that introduced USE_AUTH macro. 2011 commit 21512911
added a similar "not possible to authenticate" SslBump case.
Other HTTP 403 (Forbidden) cases already use ERR_ACCESS_DENIED or a
similar "permanent" error (e.g., ERR_FORWARDING_DENIED or ERR_TOO_BIG).
It is still possible to customize the returned error page via deny_info.
The replaced assertion was wrong because a stale entry may be aborted
while we are revalidating it. The exact real-world conditions that
triggered this assertion are unknown, but many conditions lead to
aborted entries. The assertion can be triggered in lab tests using a hit
transaction that collapses on a miss transaction that runs into storage
problems (e.g., a memory cache that ran out of usable space).
Also adjusted cacheHit() to avoid similar problems. We have not
reproduced them, but no code maintains the asserted invariant on the
cacheHit() path either. Moreover, async-called cacheHit() initiates
processExpired() that leads to problematic sendClientOldEntry() call, so
seeing an aborted StoreEntry at cacheHit() time is probably a matter of
sufficient concurrency levels and asynchronous callback delays.
When establishing a TLS connection to an origin server _through_ a
cache_peer, Security::CreateClientSession() used cache_peer's
Security::PeerOptions instead of global ProxyOutgoingConfig (i.e.
tls_outgoing_options). The cache_peer PeerOptions that were used are unrelated to
the (tunneled) TLS connection in question (and are currently empty
because Squid does not support TLS inside TLS -- the cache_peer accepts
plain HTTP connections).
This TLS context:options mismatch exists in both OpenSSL and GnuTLS
builds, but it currently does not affect OpenSSL builds where
CreateSession() gets TLS options from its Security::Context parameter
rather than its (unused) Security::PeerOptions parameter.
The mismatch exists on both PeekingPeerConnector/SslBump and
BlindPeerConnector code paths.
This minimal change pairs TLS context with its TLS options. Long-term,
the added FuturePeerContext shim (that stores references to independent
context and options objects) should be replaced with a PeerContext class
that encapsulates those two objects. We may also be able to avoid
re-computing GnuTLS context from options and to simplify code by
preventing PeerConnector construction in Squid builds that do not
support TLS. That refactoring should be done separately because it
triggers many changes unrelated to Bug 5293.
Also removed updateSessionOptions() from
PeekingPeerConnector::initialize() because Security::CreateSession(),
called by our parent initialize() method, already sets session options.
It is easier to remove that call/code than keep it up to date.
Security::BlindPeerConnector does not updateSessionOptions().
After a ReadNow() call, the buffer length must not exceed accumulation
limits (e.g., client_request_buffer_max_size). SBuf::reserve() alone
cannot reliably enforce those limits because it does not decrease SBuf
space; various SBuf manipulations may lead to excessive SBuf space. When
ReadNow() fills that space, the buffer length exceeds the limit.
This change uses documented CommIoCbParams::size trick to limit how much
Comm::ReadNow() may read, obeying SQUID_TCP_SO_RCVBUF (server-to-Squid)
and client_request_buffer_max_size (client-to-Squid) accumulation limits.
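Schematically, the described cap (a sketch; maxSingleRead() is an invented
helper around the documented CommIoCbParams::size member):

    #include <cstdint>

    // cap a single read so buffered bytes plus newly read bytes can never
    // exceed the configured accumulation limit
    static std::uint64_t maxSingleRead(const std::uint64_t bufferedBytes,
                                       const std::uint64_t limit)
    {
        return bufferedBytes < limit ? limit - bufferedBytes : 0; // 0: do not read
    }

    // params.size = maxSingleRead(inBuf.length(), Config.maxRequestBufferSize);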
Alex Rousskov [Thu, 5 Sep 2024 17:46:20 +0000 (17:46 +0000)]
Bug 5417: An empty annotation value does not match (#1896)
Helpers may return annotations with empty values:
OK team_=""
A note ACL may be configured to match an annotation with an empty value:
configuration_includes_quoted_values on
acl emptyTeam note team_ ""
However, that emptyTeam ACL did not match the above helper annotation:
* AppendTokens() split an empty annotation value into an empty vector
instead of a vector with a single empty entry. That "never match an
empty value received from the helper" bug was probably introduced in
2017 commit 75d47340 when it replaced an "always try to match an empty
value, even when it was not received from the helper" bug in
ACLNoteStrategy::matchNotes().
* ACLStringData::match(SBuf v) never matched an empty value v. That bug
was probably introduced in 2015 commit 76ee67ac that mistook a nil
c-string pointer for an empty c-string.
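The first bug boils down to a common tokenizer edge case: splitting an empty
string must yield one empty token, not zero tokens. A self-contained sketch
(splitValues() is an invented name):

    #include <string>
    #include <vector>

    static std::vector<std::string> splitValues(const std::string &raw, const char sep)
    {
        std::vector<std::string> tokens;
        std::string::size_type start = 0;
        while (true) {
            const auto end = raw.find(sep, start);
            tokens.emplace_back(raw.substr(start, end - start)); // "" yields {""}
            if (end == std::string::npos)
                return tokens; // emptyTeam-style values remain matchable
            start = end + 1;
        }
    }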
Alex Rousskov [Sun, 1 Sep 2024 00:39:34 +0000 (00:39 +0000)]
Bug 5405: Large uploads fill request buffer and die (#1887)
maybeMakeSpaceAvailable: request buffer full
ReadNow: ... size 0, retval 0, errno 0
terminateAll: 1/1 after ERR_CLIENT_GONE/WITH_CLIENT
%Ss=TCP_MISS_ABORTED
This bug is triggered by a combination of the following two conditions:
* HTTP client upload fills Squid request buffer faster than it is
drained by an origin server, cache_peer, or REQMOD service. The buffer
accumulates 576 KB (default 512 KB client_request_buffer_max_size + 64
KB internal "pipe" buffer).
* The affected server or service consumes a few bytes after the critical
accumulation is reached. In other words, the bug cannot be triggered
if nothing is consumed after the first condition above is met.
Comm::ReadNow() must not be called with a full buffer: Related
FD_READ_METHOD() code cannot distinguish "received EOF" from "had no
buffer space" outcomes. Server::readSomeData() tried to prevent such
calls, but the corresponding check had two problems:
* The check had an unsigned integer underflow bug[^1] that made it
ineffective when inBuf length exceeded Config.maxRequestBufferSize.
That length could exceed the limit due to reconfiguration and when
inBuf space size first grew outside of maybeMakeSpaceAvailable()
protections (e.g., during an inBuf.c_str() call) and then got filled
with newly read data. That growth started happening after 2020 commit 1dfbca06 optimized SBuf::cow() to merge leading and trailing space.
Prior to that commit, Bug 5405 could probably only affect Squid
reconfigurations that lower client_request_buffer_max_size.
* The check was separated from the ReadNow() call it was meant to
protect. While ConnStateData was waiting for the socket to become
ready for reading, various asynchronous events could alter inBuf or
Config.maxRequestBufferSize.
This change fixes both problems.
This change also fixes Squid Bug 5214.
[^1]: That underflow bug was probably introduced in 2015 commit 4d1376d7
while trying to emulate the original "do not read less than two bytes"
ConnStateData::In::maybeMakeSpaceAvailable() condition. That condition
itself looks like a leftover from manual zero-terminated input buffer
days that ended with 2014 commit e7287625. It is now removed.
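Schematically, the underflow described in the first bullet (a sketch with
invented names; all sizes are unsigned):

    #include <cstddef>

    static bool mayRead(const std::size_t bufferedBytes, const std::size_t limit)
    {
        // broken form: the subtraction wraps to a huge value whenever
        // bufferedBytes > limit, so the "buffer full" guard silently passes:
        //   return limit - bufferedBytes >= 2;

        // additive form cannot underflow (bufferedBytes is far from SIZE_MAX):
        return bufferedBytes + 2 <= limit;
    }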
Alex Rousskov [Wed, 28 Aug 2024 21:01:15 +0000 (21:01 +0000)]
Reject config with unknown directives before committing to it (#1897)
Ideally, we want to reject configurations with unknown directives before
applying any configuration changes that correspond to known directives,
but current apply-as-you-parse architecture makes that impractical.
Pending smooth reconfiguration refactoring will make that possible, but
we can make a step towards that ideal future now.
Rejecting bad configurations before calling configDoConfigure() reduces
the set of configuration errors that Squid can detect in one execution
(because configDoConfigure() error-checking code is not reached), but
that small reduction is a lesser evil compared to running
configDoConfigure() with a clearly broken config, especially when we are
going to kill Squid anyway. While many legacy parse_foo() functions do
apply significant changes before configDoConfigure(), we cannot easily
prevent that (for now). We can easily prevent configDoConfigure().
Use git to extract default build-info (when enabled) (#1892)
Have the --enable-build-info[=yes] configure option refer to git
instead of bazaar when extracting information about what is being
built for inclusion in `squid -v` output.
Alex Rousskov [Wed, 21 Aug 2024 18:51:29 +0000 (18:51 +0000)]
Fix Comm::TcpAcceptor::status() reporting of listening address (#1891)
Buggy "static" caching optimization led to TcpAcceptor X reporting
listening address of TcpAcceptor Y or an incorrect 0.0.0.0 address.
TcpAcceptor status is visible in mgr:jobs and level-5+ debugs() output.
Side effect: For a listening port configured without an explicit IP
address, the listening address part of status() is now reported using
"[::]" rather than "0.0.0.0" string.
Alex Rousskov [Wed, 21 Aug 2024 07:09:15 +0000 (07:09 +0000)]
Optimization: Do not zero-terminate Server::inBuf (#1889)
... when storing unencoded request body bytes in ConnStateData::bodyPipe
Calling SBuf::c_str() is unnecessary in this context because
BodyPipe::append() does not need/expect a zero-terminated buffer. The
call is relatively expensive when the previous Comm::ReadNow() was able
to fill the entire inBuf trailing space, which is a common occurrence
because Server::maybeMakeSpaceAvailable() uses CLIENT_REQ_BUF_SZ as
idealSpace size, and CLIENT_REQ_BUF_SZ is hard-coded to just 4096 bytes.
The call was added in 2014 commit e7287625 that converted inBuf to SBuf.
Alex Rousskov [Mon, 19 Aug 2024 17:20:06 +0000 (17:20 +0000)]
Fixed and redefined total peering time (%<tt) (#1828)
This change adjusts `%[http::]<tt` definition to include peer selection
stage (with any relevant DNS lookups). That stage runs in parallel with
connection establishment attempts, so excluding portions of that stage
made interpreting logged delays more difficult (and code more complex).
Also documented how cache refresh (TCP_REFRESH_MODIFIED) affects `%<tt`.
Fetching a new version of the stale object creates a second FwdState
object, effectively repeating peer selection activities and request
forwarding attempts. `%<tt` includes the time spent by those activities.
The following three bugs were fixed (at least):
* Squid logged dash `%<tt` values for pinned connections (e.g.,
connections bumped at SslBump step 2 or 3) because Squid did not call
HierarchyLogEntry::startPeerClock() for pinned connections.
* Logged `%<tt` values for connections spliced at SslBump step 2 or 3
included TCP connection establishment time and TLS Server Hello
fetching time (if any) but were missing the rest of Squid-peer
communication (i.e. when Squid was tunneling bytes to and from the
peer) because Squid stopped counting after calling switchToTunnel().
* Squid logged dash `%<tt` values for external_acl_type and most other
directives that support logformat codes because ALE::hier data member
is unset until access logging, when prepareLogWithRequestDetails()
copies HttpRequest::hier to ALE. This is a short-term fix. A proper
one deserves a dedicated change (that should also address similar
concerns for other HierarchyLogEntry-derived %codes).
Amos Jeffries [Mon, 12 Aug 2024 14:26:55 +0000 (14:26 +0000)]
Fix performance regressions with fastCheck() result copying (#1883)
Detected by Coverity. CID 1616162: Performance inefficiencies
(AUTO_CAUSES_COPY). Using the 'auto' keyword without an '&' causes the
copy of an object of type Acl::Answer.
This change fixes a few manually picked cases in sources that are not
normally compiled on our CI nodes. There are probably more missing
declarations, but we do not know how to find them automatically.
Also removed a few seemingly unused functions and declarations.
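The flagged pattern and its fix, schematically (Answer and fastCheck() are
simplified stand-ins):

    struct Answer { int code = 0; };

    static Answer TheAnswer;
    static const Answer &fastCheck() { return TheAnswer; }

    void demo()
    {
        const auto copied = fastCheck();  // AUTO_CAUSES_COPY: auto deduces Answer
        const auto &bound = fastCheck();  // auto & preserves the reference
        (void)copied;
        (void)bound;
    }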
Protect ACLFilledChecklist heap allocations from leaking (#1870)
Non-blocking ACL checks follow this code pattern:
const auto ch = new ACLFilledChecklist(...);
fillWithInformation(ch); // may throw
ch->nonBlockingCheck(&throwingCallback); // may delete ch or throw!
// ch may be a dangling raw pointer here
The checklist object is leaked if an exception is thrown by
fillWithInformation() (and the code it calls) or by nonBlockingCheck()
(and the code it calls, including, in some cases, throwingCallback()).
Use std::unique_ptr to avoid such leaks.
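A sketch of the std::unique_ptr version of the same pattern (assuming
ownership still passes to nonBlockingCheck(); the real change may differ):

    auto ch = std::make_unique<ACLFilledChecklist>(...);
    fillWithInformation(ch.get());             // a throw here destroys *ch
    ch.release()->nonBlockingCheck(&callback); // ownership handed off only now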
Remove peerDigestFetchStop() and peerDigestFetchAbort() wrappers (#1869)
These wrappers were left in recent commit f036532 to reduce noise. Now
we always call finishAndDeleteFetch() directly (instead of sometimes
going through those wrappers). The new function name helps reduce the
number of use-after-free bugs in this code.
This flag became effectively unused in 2010 commit cfd66529 when
copyFDFlags() -- the only function checking the flag -- became unused.
That unused function was removed a bit later in commit 5ae21d99.
Supporting logging to a file complicates upgrading helper code to use
debugs() because DebugFile code calls commSetCloseOnExec(), and our
comm/libminimal does not currently provide a functioning implementation
for that API; the existing implementation is an unconditional assert.
To save development time while upgrading helpers, we are dropping this
feature. It can probably be emulated using shell redirection tricks.
Alex Rousskov [Mon, 5 Aug 2024 19:30:08 +0000 (19:30 +0000)]
Extend in-use ACLs lifetime across reconfiguration (#1826)
When Squid is reconfigured, all previously created Acl::Node objects
become unusable[^1], resulting in ACLChecklist::resumeNonBlockingCheck()
ACCESS_DUNNO outcomes. ACCESS_DUNNO leads to Squid applying the affected
directive "default" action rather than the configured one (e.g., an
http_access denies a transaction that should be allowed or miss_access
allows a transaction that should be denied). These unwanted and
surprising effects make hot reconfiguration unreliable. They are also
difficult to discover and triage.
An Acl::Node object should last as long as the object/scope that saved
that ACL for reuse. This change applies this (re)configuration principle
to ACLChecklist::accessList by converting that raw Acl::Tree pointer to
use reference counting. That change requires reference counting of
individual Acl::Node objects and related types. Several TODOs and XXXs
asked for reference counted ACLs, but more than 50 explicit raw ACL
pointers in Squid code make that transition difficult:
* Config.accessList.foo and similar items used raw Acl::Tree pointers to
objects destroyed during reconfiguration by aclDestroyAcls(). They now
point to refcounted Acl::Trees that ACLChecklist and other
configuration-using code can safely store across reconfigurations. The
outer Config.accessList.foo pointer is still raw because Squid
invasive RefCount smart pointers cannot be stored directly in various
configuration objects without significant out-of-scope refactoring.
* Acl::Node::next items formed an invasive list of raw ACL pointers
(rooted in Config.aclList). The list was used to FindByName() and also
to prepareForUse() explicitly named ACLs. Reconfiguration destroyed
listed nodes via aclDestroyAcls(). Acl::Node::next is now gone.
* RegisteredAcls std::set stored raw ACL pointers to destroy all ACL
objects (named and internally-generated) during reconfiguration, after
all the free_foo() calls, in aclDestroyAcls(). It is now gone.
* InnerNode::nodes std::vector stored raw pointers to
internally-generated Acl::Node objects. They are now refcounted.
The remaining raw Acl::Node pointers are not actually _stored_ across
async calls. They are used to pass ACLs around. Some of them should be
replaced with references (where it is known that the ACL must exist and
is not expected to be stored), but that is a different polishing issue.
For now, we just make sure that no raw Acl::Node pointers are stored
across async calls (because any stored raw ACL pointer may be
invalidated by reconfiguration).
Extending lifetime of in-use ACLs is necessary but not sufficient to
make Squid reconfiguration reliable. We need to preserve more in-use
resources...
[^1]: Prior to these changes, reconfiguration made Acl::Node objects
unusable because their C++ destructors were called. Cbdata locks may
have kept underlying object memory, but any linked resources freed by
destructors were gone, and the destructed objects could not be accessed.
gmake[3]: Entering directory .../libltdl
.../cfgaux/missing: line 85: aclocal-1.16: command not found
gmake[3]: *** [Makefile:561: .././../libltdl/aclocal.m4] Error 127
During bootstrap.sh run, libtoolize copies prepackaged configure and
Makefile.in files into our libltdl directory:
* libltdl/configure from libtool v2.4 has aclocal version set to 1.16;
* libltdl/Makefile.in from libtool v2.4 uses configure-set aclocal
version to build aclocal.m4
Thus, libltdl/Makefile (generated from libltdl/Makefile.in above) runs
aclocal-1.16 if "make" needs to build libltdl/aclocal.m4.
Normally, "make" does not need to build libltdl/aclocal.m4 because that
file was created by libtoolize. However, libtool v2.4 is packaged with
(generated by packaging folks) libltdl/Makefile.in that makes
libltdl/aclocal.m4 target dependent on files in libltdl/../m4 directory.
Squid does not have that ./m4 directory, so "make" attempts to
re-generate libltdl/aclocal.m4. When it does, it uses aclocal-1.16.
Our bootstrap.sh generated new ./configure but preserved copied
libltdl/configure with its aclocal version set to 1.16. In other words,
our bootstrap.sh did not bootstrap libltdl sub-project. In build
environments without aclocal-1.16, Squid build failed.
Several solutions or workarounds have been tried or considered:
* Adjust bootstrap.sh to bootstrap libltdl (this change). A 2008 attempt
to do that was reverted in commit bfd6b6a9 with a "better to fix libtool
installation" rationale. Another potential argument against this
option is that packages should be bootstrapped by their distributors,
not "users". We are not distributing libtool, but this is a gray area
because we do distribute files that libtoolize creates. Finally,
libtool itself does not provide a bootstrapping script and does not
explicitly recommend bootstrapping in documentation.
* "Fix libtool installation". We failed to find a good way to do that on
MacOS (without building and installing newer libtool from sources).
* Place m4 files where libtool v2.4 expects to find them. That change
fixes MacOS builds that use automake v1.17, but breaks Gentoo builds
because Gentoo libtool installs a buggy libltdl/Makefile.in that must
be regenerated by automake before it can work. Fixing m4 files
location prevents that regeneration.
We picked the first option despite its drawbacks because the third
option did not work on Gentoo, and asking Squid developers to install
libtool from sources (i.e. the second option) felt like a greater evil.
This old problem was exposed by recently introduced CI MacOS tests that
started to fail when MacOS brew updated automake package from v1.16
without the corresponding libtoolize package changes.
Also work around what seems to be a libtool packaging bug affecting
MacOS/brew environments, including the GitHub Actions runners we use for
CI. That workaround will be removed after the libtool package is fixed.
Also removed a single-iteration "for dir" loop with several stale hacks
from bootstrap.sh: With only two directories to bootstrap and with a
directory-specific mkdir command, source comments, and progress
messages, it is best to unroll that loop.
Alex Rousskov [Tue, 16 Jul 2024 01:38:59 +0000 (01:38 +0000)]
Enable EDNS for DNS A queries and reverse IPv4 lookups (#1864)
This change brings Squid code in sync with existing dns_packet_max
directive documentation and allows admins to enable a useful performance
optimization for still very popular IPv4-related DNS queries.
An enabled dns_packet_max feature may now break Squid compatibility with
more buggy nameservers (that appeared to "work" before), but we should
not penalize many Squid deployments (or complicate configuration and
spend a lot of development time) to accommodate a few exceptional ones,
at least until such heroic efforts are proven to be necessary.
Alex Rousskov [Mon, 15 Jul 2024 21:59:54 +0000 (21:59 +0000)]
Do not drop queued TCP DNS queries when queuing a new one (#1863)
Since 2005 commit 032785bf, previously queued serialized DNS query
(waiting for a TCP connection to the nameserver or for a write on an
already established TCP connection) was erased every time a new query
was serialized and appended to the `vc->queue` MemBuf. Thus, at most one
TCP query was submitted when the nameserver became available.
Non-serialized versions of erased queries remained on the DNS lru_list
and would eventually be retried or timeout. This change restores
conditional MemBuf initialization that was replaced with an
unconditional MemBuf reset in that 2005 commit.
It would take more than 5 hours of 1000/s unique request load for
100-byte DNS queries to overflow a 2GB MemBuf in a properly functioning
Squid, but this change does not rely on MEM_BUF_MAX_SIZE always being
large enough. New code prevents MemBuf::grow() assertions and informs
the admin about dropped queries that may indicate a Squid bug or a
persistent problem with nameserver communication.
Alex Rousskov [Thu, 11 Jul 2024 20:04:06 +0000 (20:04 +0000)]
Be more careful when updating nameservers global (#1862)
Due to I/O delays and timeouts, DNS nsvc objects may be deleted after
the `nameservers` global pointing to them was cleared and then populated
with new ns objects. Thus, nsvc destructor should check that nsvc and
`nameservers` states are still in sync before manipulating the latter.
DNS code should use similar validation in a few other places, but they
are all about read-only debugging that requires a rather noisy cleanup.
Alex Rousskov [Wed, 10 Jul 2024 10:30:33 +0000 (10:30 +0000)]
Recover after failing to open a TCP connection to DNS server (#1861)
ERROR: Failed to connect to nameserver 127.0.0.1 using TCP.
After failing to establish a TCP connection to a DNS server, all DNS
queries that needed a TCP connection to that DNS server would timeout
because the nsvc object representing TCP connectivity got stuck in a
"queuing new queries but too busy to send any right now" state. Such
timeouts typically lead to HTTP 503 ERR_DNS_FAIL responses. This bug was
introduced when Comm closure handler registration was moved/delayed in
2010 commit cfd66529.
With this change, the affected nsvc object is destroyed, and Squid
attempts to open another TCP connection to the DNS server (when needed).
The original query is typically retried (subject to dns_timeout and
dns_retransmit_interval idiosyncrasies).
XXX: This fix increases the surface of reconfiguration and shutdown
problems documented in nsvc class destructor XXX.
This change fixes how cbdata is used for managing CachePeer::digest
lifetime. Prior to these changes, cbdata was (ab)used as a reference
counting mechanism: CachePeer::digest object could outlive CachePeer
(effectively its creator), necessitating complex "is it time to delete
this digest?" cleanup logic. Now, CachePeer is an exclusive digest owner
that no longer locks/unlocks its digest field; it just creates/deletes
the object. Digest fetching code no longer needs to cleanup the digest.
"CachePeer::digest is always valid" invariant simplifies digest fetching
code and, hopefully, reduces the probability of bugs similar to Bug 5329
fixed in minimal commit 4657405 (that promised this refactoring).
Removed effectively unused acl_checklist members (#1860)
Four classes had problematic acl_checklist data members:
* SnmpRequest does not use acl_checklist at all.
* ClientRequestContext and Adaptation::AccessCheck do not use
acl_checklist beyond checklist creation/configuration code. A local
variable works better for creation/configuration, and having a data
member complicates upcoming checklist creation code improvements.
* PeerSelector creates/configures a checklist and uses acl_checklist
during destruction, but the latter code is marked as a Squid BUG and
is itself buggy: Checklist objects are cbdata-protected. They must
have one owner at any time. Starting with a nonBlockingCheck() call,
that owner is the checklist object itself. That owner destructs the
checklist object (i.e. "this"). If that Squid BUG code could be
reached, Squid would delete the same object twice. There are no known
ways to trigger that bug.
Alex Rousskov [Sun, 7 Jul 2024 03:03:00 +0000 (03:03 +0000)]
Fix Tokenizer::int64() parsing of "0" when guessing base (#1842)
Known bug victims in current code were tcp_outgoing_mark,
mark_client_packet, clientside_mark, and mark_client_connection
directives as well as client_connection_mark and (deprecated)
clientside_mark ACLs if they were configured to match a zero mark using
"0" or "0/..." syntax: