Maintenance: No ptr copy in future CredentialsCache::insert()s (#1807)
Currently, the insert() method calls implicitly convert raw "this"
pointers to RefCount pointers. That self-registration code raises red
flags and may eventually be refactored. If it is refactored, insert()
calls are likely to start using RefCount pointers as parameters. This
change allows those future calls to avoid RefCount pointer copies. This
change does not affect the performance of current calls.
This change was triggered by Coverity CID 1554665: Unnecessary object
copies can affect performance (COPY_INSTEAD_OF_MOVE). However, the
insert() method is still calling RefCount copy assignment operator
rather than its (cheaper) move assignment operator. Safely removing that
overhead is only possible by reintroducing future parameter copying
overhead described above _and_ using an explicit std::move() call that
we are trying to avoid in general code. It is arguably not worth it.
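The trade-off above can be sketched with a toy cache, assuming std::shared_ptr as a stand-in for Squid's RefCount pointer (all names here are illustrative, not the actual CredentialsCache API):

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <unordered_map>

// Illustrative credentials record.
struct Credentials {
    std::string name;
};

class Cache {
public:
    // Taking the smart pointer by value means a caller that already
    // holds one can std::move() it in with no reference-count churn,
    // while a caller passing an lvalue pays exactly one copy.
    void insert(std::shared_ptr<Credentials> c) {
        const auto key = c->name;  // use c before moving from it
        store[key] = std::move(c); // move assignment, no extra copy
    }

    std::shared_ptr<Credentials> find(const std::string &key) const {
        const auto i = store.find(key);
        return i == store.end() ? nullptr : i->second;
    }

private:
    std::unordered_map<std::string, std::shared_ptr<Credentials>> store;
};
```

As the commit message notes, avoiding the last copy inside insert() would force an explicit std::move() on the caller side, which is the overhead/readability trade-off being declined here.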
Alex Rousskov [Fri, 3 May 2024 02:00:08 +0000 (02:00 +0000)]
Bug 5322: Do not leak HttpReply when checking http_reply_access (#1764)
... as well as response_delay_pool and send_hit directives.
auto check = clientAclChecklistCreate(...); // sets check->reply
check->reply = reply; // self-assignment does nothing
HTTPMSGLOCK(check->reply); // an unwanted/extra lock
When ACLFilledChecklist::reply is already set to X, resetting it to X
should not change HttpReply lock count, but some manual locking code did
not check that "already set" precondition and over-locked reply objects
set to ClientHttpRequest::al::reply by clientAclChecklistFill().
Current known leaks were probably introduced in 2021 commit e227da8 that
started supplying HttpReply to ACLChecklist in clientAclChecklistFill().
The added code locked the reply object correctly, but it was
incompatible with unconditional manual locks in three existing indirect
clientAclChecklistFill() callers (calling clientAclChecklistCreate()).
This change removes all known similar leaks and improves
ACLFilledChecklist API to prevent future similar leaks.
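The over-locking pattern can be sketched with a toy intrusive lock counter (illustrative names only, not Squid's actual HttpReply/ACLFilledChecklist code):

```cpp
#include <cassert>

// Minimal stand-in for HTTPMSGLOCK-style intrusive lock counting.
struct Reply {
    int locks = 0;
    void lock() { ++locks; }
    void unlock() { --locks; } // real code would destroy at zero
};

struct Checklist {
    Reply *reply = nullptr;

    // Safe setter: adjusts lock counts only when the pointer actually
    // changes, so re-setting to the same reply is a no-op instead of
    // an extra (leaked) lock.
    void setReply(Reply *r) {
        if (reply == r)
            return; // the "already set" precondition check
        if (r)
            r->lock();
        if (reply)
            reply->unlock();
        reply = r;
    }

    ~Checklist() {
        if (reply)
            reply->unlock();
    }
};
```

An unconditional manual lock() in the self-assignment case would leave the count at two after destruction releases one, which is the leak shape described above.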
Alex Rousskov [Sat, 13 Apr 2024 08:15:00 +0000 (08:15 +0000)]
Bug 5352: Do not get stuck when RESPMOD is slower than read(2) (#1777)
... RESPMOD BodyPipe buffer becomes full ...
maybeMakeSpaceAvailable: will not read up to 0
The AsyncCall Client::noteDelayAwareReadChance constructed
... RESPMOD consumes some buffered virgin body data ...
entering BodyProducer::noteMoreBodySpaceAvailable
leaving BodyProducer::noteMoreBodySpaceAvailable
When RESPMOD does not empty its adaptation BodyPipe buffer fast enough,
readReply() may eventually fill that buffer and call delayRead(),
anticipating a noteDelayAwareReadChance() callback from Store or Server
delay pools. That callback never happens if Store and Server are not
getting any data -- they do not even start working until RESPMOD service
starts releasing adapted/echoed response back to Squid! Meanwhile, our
flags.do_next_read (cleared by readReply() caller) remains false.
When/if RESPMOD service eventually frees some BodyPipe buffer space,
triggering noteMoreBodySpaceAvailable() notification, nothing changes
because maybeReadVirginBody() quits when flags.do_next_read is false.
noteMoreBodySpaceAvailable() could not just make flags.do_next_read true
because that flag may be false for a variety of other/permanent reasons.
Instead, we replaced that one-size-fits-all flag with more specific
checks so that reading can resume if it is safe to resume it. This
change addresses a couple of flag-related XXXs.
The bug was introduced in 2023 commit 50c5af88. Prior to that change,
delayRead() was not called when RESPMOD BodyPipe buffer became full
because maybeMakeSpaceAvailable() returned false in that case, blocking
maybeReadVirginBody() from triggering readReply() via Comm::Read(). We
missed flags.do_next_read dependency and that Store-specific delayRead()
cannot be used to wait for adaptation buffer space to become available.
XXX: To reduce risks, this change duplicates a part of
calcBufferSpaceToReserve() logic. Removing that duplication requires
significant (and risky) refactoring of several related methods.
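The flag-to-predicate idea can be sketched roughly like this (illustrative names only, not the actual Client/adaptation code):

```cpp
#include <cassert>

// Instead of one do_next_read flag that many unrelated conditions
// clear (and that no single notification can safely re-set), the
// decision to read is recomputed from specific, current conditions.
struct Reader {
    bool eof = false;      // permanent: producer finished
    bool aborted = false;  // permanent: transaction failed
    bool pipeFull = true;  // transient: adaptation buffer has no space

    // Permanent reasons never to read again.
    bool doneReading() const { return eof || aborted; }

    // Safe to read right now? Re-evaluated on each notification.
    bool mayReadNow() const { return !doneReading() && !pipeFull; }
};
```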
Alex Rousskov [Mon, 8 Apr 2024 22:06:02 +0000 (22:06 +0000)]
Upgrade Acl::Node::name to SBuf; remove AclMatchedName global (#1766)
AclMatchedName global has been localized into a regular Acl::Answer data
member (Acl::Answer maintains the result of ACLChecklist evaluations).
This long overdue change resolves an old TODO and XXXs, paving the way
for Acl::Node reference counting.
No significant functionality changes are expected, but it is possible
that some deny_info configurations will now be handled better in
reconfiguration corner cases (because Squid no longer forgets the name
of the last checked ACL when a slow ACL check crosses reconfiguration
barrier).
Most of these changes are performance-neutral or -positive because they
eliminate or reduce memory allocations and associated name copying (and
more reduction will become possible after upgrading squid.conf parsers
to use SBuf). This change adds SBuf object copies when Acl::Answer is
propagated to ACLCB callbacks, but those read-only copies are cheap.
Also renamed and polished aclGetDenyInfoPage() because we had to update
its parameter type (to supply the last evaluated ACL). All callers were
also supplying the same first argument (that is unlikely to change in
the foreseeable future); that argument is now gone. We did not fix the
redirect_allowed name and debugs(): Fixing that behavior deserves a
dedicated change.
Also polished legacy aclIsProxyAuth() profile and description because
we had to change the parameter type (to supply the last evaluated ACL).
Also removed 63-character aclname parameter limit for acl directives.
Alex Rousskov [Wed, 3 Apr 2024 09:25:55 +0000 (09:25 +0000)]
Fix const-correctness of ACLHTTPHeaderData::match() parameter (#1771)
ACLHTTPHeaderData::match() required a pointer to non-const HttpHeader
but did not (and should not) modify the supplied HttpHeader.
Also removed support for nil HttpHeader in that method. All callers
already require HttpHeader presence. If that changes, it is the _caller_
that should decide what HttpHeader absence means (match, mismatch,
exception/dunno, etc.); ACLHTTPHeaderData does not have enough
information to make the right choice and, hence, should not be asked to
choose.
Also polished related #includes to remove unnecessary ones.
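The signature fix can be sketched as follows (placeholder types, not Squid's real HttpHeader/ACLHTTPHeaderData):

```cpp
#include <cassert>
#include <string>

// Illustrative stand-in for a parsed header.
struct Header {
    std::string value;
};

// After the fix: the matcher takes the header by reference-to-const,
// documenting that it neither modifies the header nor accepts a
// missing one -- absence handling is the caller's decision.
bool matchHeader(const Header &h, const std::string &needle) {
    return h.value.find(needle) != std::string::npos;
}
```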
Alex Rousskov [Tue, 2 Apr 2024 20:37:31 +0000 (20:37 +0000)]
Do not blame cache_peer for CONNECT errors (#1772)
ERROR: Connection to [such-and-such-cache_peer] failed
TCP_TUNNEL/503 CONNECT nxdomain.test:443 FIRSTUP_PARENT
Squid does not alert an admin about (and decrease health level of) a
cache_peer that responded with an error to a GET request. Just like GET
responses from a cache_peer, CONNECT responses may (and often do!)
reflect client or origin server failures. We should not penalize
cache_peers (and alert admins) until we can distinguish these frequent
client/origin failures from (relatively rare) cache_peer problems. This
change absolves cache_peers of CONNECT problems, restoring parity with
GETs and restoring v4 behavior changed (probably by accident) in v5.
Also removed Http::StatusCode parameter from failure notification
functions because it became essentially unused after the primary
Http::Tunneler changes. Tunneler was the only source of status code
information that (in some cases) used received HTTP response to compute
that status code. All other cases extracted that status code from
Squid-generated errors. Those errors were arguably never meant to supply
status code information for "this failure is not our fault" decision,
and they do not supply 4xx status codes driving that decision.
### Problem evolution
2019 commit f5e1794 effectively started blaming cache_peer for all
FwdState CONNECT errors. That functionality change was probably
accidental, likely influenced by the names of noteConnectFailure() and
peerConnectFailed() functions that abbreviated "Connection", making the
functions look applicable to CONNECT failures. Prior to that commit,
the functions were never used for CONNECT errors. After it, FwdState
started calling peerConnectFailed() for all CONNECT failures.
In 2020 commit 25b0ce4, TunnelStateData started blaming cache_peers as
well (by moving that FwdState-only error handling code into Tunneler).
The same "accidental functionality change" speculations apply here.
In 2022 commit 022dbab, we made an exception for 4xx CONNECT errors as
folks deploying newer code started complaining about cache_peers getting
blamed for client-caused errors (e.g., HTTP 403 Forbidden replies). We
did not realize that the blaming code itself was an unwanted accident.
Now we are getting complaints about cache_peers getting blamed for 502
and 503 CONNECT errors caused by, for example, domain names without IPs:
As these CONNECT error responses are propagated from parent to child
caches, every child cache in the chain logs ERRORs and every cache_peer
in the chain gets its health counter decreased!
Class constructor and destructor sources should not manually duplicate
class names in debugs() statements. Many existing classes do the right
thing, but find-alive.pl still insists on a manually added class name,
forcing its users to add a colon: `find-alive.pl MemBlob:`.
We now accommodate both legacy/manual and modern/auto use cases.
When FwdState decides to re-forward a request, it forgets the original
response[^1] but does not create an error object. Since commit e2bbd3b,
FwdState::noteDestinationsEnd() correctly assumed that we only idly wait
for additional destinations after encountering a problem, but
incorrectly asserted that past problems imply error object existence.
Now Squid generates an HTTP 502 (Bad Gateway) response while setting
%err_code/%err_detail to ERR_CANNOT_FORWARD/REFORWARD_TO_NONE.
TunnelStateData::noteDestinationsEnd() code is similar, but it probably
does not suffer from the same bug because an error object is created
before every retryOrBail() call, and there is no re-forwarding code that
forgets an HTTP error response without creating an error. Those
invariants are not well tracked, and this change mimics FwdState changes
in TunnelStateData just in case and to keep the two methods in sync. In
those hypothetical cases, ERR_CANNOT_FORWARD/RETRY_TO_NONE is logged.
[^1]: Long-term we probably want to preserve that original response so
that we do not have to replace it with a generated error, but doing so
requires significant refactoring and is outside this minimal fix scope.
Alex Rousskov [Wed, 27 Mar 2024 18:41:41 +0000 (18:41 +0000)]
CI: Require source-maintenance.sh application before merging PRs (#1760)
By default, fixing PR-introduced code formatting problems should be that
PR responsibility. We did not enforce formatting because we did not
have CI automation to support that enforcement, and also because it was
tedious even for core developers to maintain the right astyle version.
Neither excuse remains valid these days.
In most development environments, it is not very difficult to install
Astyle v3.1 (and other Squid source maintenance tools). Our GitHub
Actions configuration file (.github/default.yaml) has a recipe. In
exceptional cases, including cases where a pull request was not
formatted because a different astyle version was installed, other
developers can (re)format PR sources (proactively or at author request).
Also fixed wrong error message when running source-maintenance.sh fails.
HTTP: Protect just-parsed responses from accidental destruction (#1752)
The lack of HttpReply locking meant that we could not (safely) pass the
freshly created reply object to ACLFilledChecklist because the checklist
destructor would unwittingly destroy the object (by unlocking it). There
is at least one applicable ACL check in handle1xx(), but we got lucky
because that method adds the reply object to ALE, increasing its
reference counter. This change stops relying on such "external" locks.
This change also protects from HttpReply memory leaks when an exception
is thrown between HttpReply creation and a setVirginReply() call.
TODO: Fix all cases where newly created HttpReply objects are remembered
using raw pointers (but may be passed to locking/unlocking code).
ankor2023 [Tue, 5 Mar 2024 11:13:19 +0000 (11:13 +0000)]
negotiate_kerberos_auth: Support Kerberos PAC-ResourceGroups (#1597)
Parse the ResourceGroupIds pac-data structure to have information
about the user's membership in AD Domain Local groups.
Previously, the helper obtained user groups information only from
GroupIds and ExtraSids pac-data structures (of the
KERB_VALIDATION_INFO structure).
The patch extends the functionality of the helper: it now additionally
parses the ResourceGroupIds pac-data structure where Domain Local
AD-group RIDs are located, and appends these groups to the list
generated by parsing GroupIds and ExtraSids.
No changes in existing helper deployments are required.
The new parsing functions are similar to those already used for
parsing GroupIds and ExtraSids.
Alex Rousskov [Fri, 1 Mar 2024 22:20:20 +0000 (22:20 +0000)]
Bug 5069: Keep listening after getsockname() error (#1713)
ERROR: Stopped accepting connections:
error: getsockname() failed to locate local-IP on ...
In many cases, these failures are intermittent client-triggered errors
(e.g., client shut down the accepted socket); Squid will successfully
accept other connections and, hence, should keep listening for them.
Store is an essential service, used by a lot of Squid code. As was
already established in 2022 commit 23b7963, this service should be
available during shutdown. That commit correctly removed explicit Store
service termination, but missed the fact that the reference-counting
TheRoot pointer (that provides access to the Store Controller singleton)
gets _automatically_ destroyed during C++ cleanup. This change removes
TheRoot reference counting, making that Controller singleton immortal.
Squid asserted when exiting with active entries in the shared memory
cache because TheRoot destruction leads to Controller destruction,
Controller destructor cleans up cache index, and that cleanup code may
result in calls to Store::Root() that dereferences destroyed TheRoot.
These assertions were seen when Squid was shut down cleanly (e.g., using
SIGTERM) and when a kid process was exiting due to a fatal() error.
Making Store::Controller singleton immortal means that the class
destructor is never called. Fortunately, the destructor did nothing
particularly useful; Store flushing is performed by Controller::sync()
which is explicitly called during early stages of clean shutdown. The
now-unused destructor code was removed: Implementing this destructor in
a meaningful way (while avoiding accessing a being-destructed global
Store!) requires heroic efforts (which would be wasted since the
destructor is never actually called).
Also made Store Controller singleton available on the first use,
complying with Store's "essential, widely used service" status and
removing the need for an explicit Store::Init(void) call.
Also removed unit tests that required support for dynamic replacement of
Store Controller singleton. The minuscule value of the removed tests did
not justify the costs of supporting a replaceable Store Controller.
Also removed Store::FreeMemory(). Its implementation contradicted Store
status as an essential service. The function was only used by unit
tests, and its removal addresses the corresponding commit 23b7963 TODO.
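The immortal-singleton idea can be sketched as follows (illustrative code, not the actual Store::Root() implementation):

```cpp
#include <cassert>

// Illustrative stand-in for the Store Controller service.
struct Controller {
    int syncs = 0;
    void sync() { ++syncs; } // flushing happens here, not in a dtor
};

// "Immortal" singleton: heap-allocated on first use and intentionally
// never deleted, so C++ static-destruction order cannot tear it down
// while late cleanup code still needs the service.
Controller &Root() {
    static Controller *const instance = new Controller(); // never freed
    return *instance;
}
```

The deliberate "leak" is harmless at process exit and removes the need for both an explicit Init() call and a destructor that would have to run against a half-destroyed global state.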
POSIX.1-2001 marks vfork(2) OBSOLETE.
POSIX.1-2008 removes the specification of vfork(2).
MacOS system headers declare vfork(2) as deprecated.
We only use vfork(2) in negotiate_wrapper, where it is not necessary.
On MacOS / Homebrew, libcppunit is not part of the system path.
pkg-config will report it; use it.
This change also ensures squid honours the promise made in configure's
help text to let administrators specify a LIBCPPUNIT_LIBS environment
variable to override automatic detection.
Alex Rousskov [Fri, 23 Feb 2024 04:26:38 +0000 (04:26 +0000)]
Fix marking of problematic cached IP addresses (#1691)
Since inception in 2017 commit fd9c47d, Dns::CachedIps::have() always
returned position zero after finding a matching IP address (at zero or
positive position). The bug affected two callers:
* markAsBad() always marked the first stored address (as bad);
* forgetMarking() always cleared the first stored address marking.
Buggy markings led to Squid sometimes not attempting to use a working
address (e.g., IPv4) while using a known problematic one (e.g., IPv6).
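The bug pattern can be sketched like this (illustrative code, not Dns::CachedIps itself):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// A lookup that correctly reports *whether* an item exists but
// returns the wrong *position*, so callers mark the wrong element.
bool haveBuggy(const std::vector<int> &ips, int ip, size_t &pos) {
    for (size_t i = 0; i < ips.size(); ++i)
        if (ips[i] == ip) { pos = 0; return true; } // bug: always zero
    return false;
}

// The fix: report the index where the match was actually found.
bool haveFixed(const std::vector<int> &ips, int ip, size_t &pos) {
    for (size_t i = 0; i < ips.size(); ++i)
        if (ips[i] == ip) { pos = i; return true; }
    return false;
}
```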
Maintenance: Remove Red Hat Linux workarounds predating RHEL (#1698)
The last Red Hat Linux release went EOL in 2004, replaced by Red Hat
Enterprise Linux and Fedora Linux. We no longer support Red Hat Linux
releases and expect that these hacks are no longer necessary in
supported environments.
squid-conf-tests: Ignore tests with mismatching autoconf macro (#1648)
The 'skip-unless-autoconf-defines' directive should be able to
distinguish autoconf macro values such as '0' (feature disabled) from
'1' (feature enabled). For example, the --disable-ipv6 configuration
option defines USE_IPV6 as '0'. This change allows activation of IPv6
tests, addressing a TODO.
Alex Rousskov [Sun, 18 Feb 2024 00:45:41 +0000 (00:45 +0000)]
Fix debugging for responses that Expire at check time (#1683)
Since 2000 commit 65fa5c6, our level-3 debugging misled about Expires
being less than the check time when the two times were identical. The
actual checked conditions are correct: Roughly speaking, the response
with Expires value T is considered expired at that time T.
Also dropped extra (and inconsistent) trailing space on debugs() lines.
This space was added by the same 2000 commit, probably accidentally.
Alex Rousskov [Fri, 16 Feb 2024 13:02:54 +0000 (13:02 +0000)]
Maintenance: Removed unused bits of Format::FmtConfig code (#1681)
This code was probably accidentally copied from Log::LogConfig when
FmtConfig was created in 2011 commit 31971e6. It mentions logfile_format
directive that never existed. The code also duplicates a dangerous
Log::LogConfig snippet (see Bug 5344).
Alex Rousskov [Fri, 16 Feb 2024 04:03:40 +0000 (04:03 +0000)]
Bug 5344: mgr:config segfaults without logformat (#1680)
Since 2011 commit 38e16f9, Log::LogConfig::dumpFormats() dereferenced a
nil `logformats` pointer while reporting a non-existent logformat
configuration (e.g., squid.conf.default): `logformats->dump(e, name)`.
In most environments, that code "worked" because the corresponding
Format::Format::dump() method happens to do nothing if "this" is nil.
However, in some environments, Squid segfaulted.
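Calling a member function through a nil pointer is undefined behavior even when the method body would "do nothing", which matches the "worked in most environments, segfaulted in some" symptom. A minimal sketch of the missing check (illustrative names, not the actual Log::LogConfig code):

```cpp
#include <cassert>

// Illustrative stand-in for the logformat list.
struct FormatList {
    int dumped = 0;
    void dump() { ++dumped; }
};

// The fix shape: never call a method through a possibly-nil pointer;
// handle the "no formats configured" case explicitly instead.
int dumpFormats(FormatList *logformats) {
    if (!logformats) // the check that was missing
        return 0;    // nothing configured, nothing to report
    logformats->dump();
    return logformats->dumped;
}
```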
The code loading a response from the shared memory cache incorrectly
assumed that the being-loaded response could not have been updated by an
HTTP 304 (Not Modified) reply and called adjustableBaseReply() that is
banned for updated responses. The goal of that call was to determine
whether the cached response header has been parsed. That determination
can be made without using a method that is banned for updated responses.
StoreClient and clientReplyContext had very similar checks that used the
right/safe approach because their current code did not need an immediate
access to an "adjustable" response. We have updated all external
psParsed checks to reduce chances that this bug resurfaces.
Alex Rousskov [Thu, 8 Feb 2024 22:03:44 +0000 (22:03 +0000)]
Fix max-stale in default refresh_pattern (#1664)
RefreshPattern constructor must set data fields to honor refresh_pattern
defaults promised in squid.conf.documented. For max-stale, that implies
making max_stale negative. A negative value allows refreshCheck() to use
a max_stale directive value (i.e. Config.maxStale that defaults to 1
week). The buggy constructor set max_stale to 0 instead and, hence,
refreshCheck() ignored max_stale directive when no refresh_pattern rules
were configured.
The fixed bug did not affect Squids configured using explicit
refresh_pattern rules because those rules are handled by
parse_refreshpattern() which sets max_stale to -1 by default. Our
squid.conf.default does have explicit refresh_pattern rules.
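The sentinel convention can be sketched as follows (illustrative values and names; the real defaults are documented in squid.conf.documented):

```cpp
#include <cassert>

// A negative max_stale means "defer to the global max_stale
// directive"; the fix makes the constructor default match
// parse_refreshpattern()'s -1.
struct RefreshPattern {
    int max_stale = -1;
};

constexpr int GlobalMaxStale = 7 * 24 * 3600; // 1 week, in seconds

int effectiveMaxStale(const RefreshPattern &p) {
    return p.max_stale < 0 ? GlobalMaxStale : p.max_stale;
}
```

With the buggy default of 0, the global directive would never apply to the default refresh_pattern rule; the sentinel restores the documented fallback.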