Jeff King [Tue, 16 Apr 2024 08:35:33 +0000 (04:35 -0400)]
upload-pack: disable lazy-fetching by default
The upload-pack command tries to avoid trusting the repository in which
it's run (e.g., by not running any hooks and not using any config that
contains arbitrary commands). But if the server side of a fetch or a
clone is a partial clone, then either upload-pack or its child
pack-objects may run a lazy "git fetch" under the hood. And it is very
easy to convince fetch to run arbitrary commands.
The "server" side can be a local repository owned by someone else, who
would be able to configure commands that are run during a clone with the
current user's permissions. This issue has been designated
CVE-2024-32004.
The fix in this commit's parent helps in this scenario, as well as in
related scenarios using SSH to clone, where the untrusted .git directory
is owned by a different user id. But if you received one as a zip file,
on a USB stick, etc, it may be owned by your user but still untrusted.
This has been designated CVE-2024-32465.
To mitigate the issue more completely, let's disable lazy fetching
entirely during `upload-pack`. While fetching from a partial repository
should be relatively rare, it is certainly not an unreasonable workflow.
And thus we need to provide an escape hatch.
This commit works by respecting a GIT_NO_LAZY_FETCH environment variable
(to skip the lazy-fetch), and setting it in upload-pack, but only when
the user has not already done so (which gives us the escape hatch).
The name of the variable is specifically chosen to match what has
already been added in 'master' via e6d5479e7a (git: extend
--no-lazy-fetch to work across subprocesses, 2024-02-27). Since we're
building this fix as a backport for older versions, we could cherry-pick
that patch and its earlier steps. However, we don't really need the
niceties (like a "--no-lazy-fetch" option) that it offers. By using the
same name, everything should just work when the two are eventually
merged, but here are a few notes:
- the blocking of the fetch in e6d5479e7a is incomplete! It sets
fetch_if_missing to 0 when we setup the repository variable, but
that isn't enough. pack-objects in particular will call
prefetch_to_pack() even if that variable is 0. This patch by
contrast checks the environment variable at the lowest level before
we call the lazy fetch, where we can be sure to catch all code
paths.
Possibly the setting of fetch_if_missing from e6d5479e7a can be
reverted, but it may be useful to have. For example, some code may
want to use that flag to change behavior before it gets to the point
of trying to start the fetch. At any rate, that's all outside the
scope of this patch.
- there's documentation for GIT_NO_LAZY_FETCH in e6d5479e7a. We can
live without that here, because for the most part the user shouldn't
need to set it themselves. The exception is if they do want to
override upload-pack's default, and that requires a separate
documentation section (which is added here)
- it would be nice to use the NO_LAZY_FETCH_ENVIRONMENT macro added by e6d5479e7a, but those definitions have moved from cache.h to
environment.h between 2.39.3 and master. I just used the raw string
literals, and we can replace them with the macro once this topic is
merged to master.
At least with respect to CVE-2024-32004, this does render this commit's
parent commit somewhat redundant. However, it is worth retaining that
commit as defense in depth, and because it may help other issues (e.g.,
symlink/hardlink TOCTOU races, where zip files are not really an
interesting attack vector).
The tests in t0411 still pass, but now we have _two_ mechanisms ensuring
that the evil command is not run. Let's beef up the existing ones to
check that they failed for the expected reason, that we refused to run
upload-pack at all with an alternate user id. And add two new ones for
the same-user case that both the restriction and its escape hatch.
Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
fetch/clone: detect dubious ownership of local repositories
When cloning from somebody else's repositories, it is possible that,
say, the `upload-pack` command is overridden in the repository that is
about to be cloned, which would then be run in the user's context who
started the clone.
To remind the user that this is a potentially unsafe operation, let's
extend the ownership checks we have already established for regular
gitdir discovery to extend also to local repositories that are about to
be cloned.
This protection extends also to file:// URLs.
The fixes in this commit address CVE-2024-32004.
Note: This commit does not touch the `fetch`/`clone` code directly, but
instead the function used implicitly by both: `enter_repo()`. This
function is also used by `git receive-pack` (i.e. pushes), by `git
upload-archive`, by `git daemon` and by `git http-backend`. In setups
that want to serve repositories owned by different users than the
account running the service, this will require `safe.*` settings to be
configured accordingly.
Also note: there are tiny time windows where a time-of-check-time-of-use
("TOCTOU") race is possible. The real solution to those would be to work
with `fstat()` and `openat()`. However, the latter function is not
available on Windows (and would have to be emulated with rather
expensive low-level `NtCreateFile()` calls), and the changes would be
quite extensive, for my taste too extensive for the little gain given
that embargoed releases need to pay extra attention to avoid introducing
inadvertent bugs.
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Filip Hejsek [Sun, 28 Jan 2024 03:29:33 +0000 (04:29 +0100)]
t0411: add tests for cloning from partial repo
Cloning from a partial repository must not fetch missing objects into
the partial repository, because that can lead to arbitrary code
execution.
Add a couple of test cases, pretending to the `upload-pack` command (and
to that command only) that it is working on a repository owned by
someone else.
Helped-by: Jeff King <peff@peff.net> Signed-off-by: Filip Hejsek <filip.hejsek@gmail.com> Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
We observed a series of clone failures arose in a specific set of
repositories after we fully enabled the MIDX bitmap feature within our
Codebase service. These failures were accompanied with error messages
such as:
Cloning into bare repository 'clone.git'...
remote: Enumerating objects: 8, done.
remote: Total 8 (delta 0), reused 0 (delta 0), pack-reused 8 (from 1)
Receiving objects: 100% (8/8), done.
fatal: did not receive expected object ...
fatal: fetch-pack: invalid index-pack output
Temporarily disabling the MIDX feature eliminated the reported issues.
After some investigation we found that all repositories experiencing
failures contain replace references, which seem to be improperly
acknowledged by the MIDX bitmap generation logic.
A more thorough explanation about the root cause from Taylor Blau says:
Indeed, the pack-bitmap-write machinery does not itself call
disable_replace_refs(). So when it generates a reachability bitmap, it
is doing so with the replace refs in mind. You can see that this is
indeed the cause of the problem by looking at the output of an
instrumented version of Git that indicates what bits are being set
during the bitmap generation phase.
With replace refs (incorrectly) enabled, we get:
[2, 4, 6, 8, 13, 3, 6, 7, 3, 4, 6, 8]
and doing the same after calling disable_replace_refs(), we instead get:
[2, 5, 6, 13, 3, 6, 7, 3, 4, 6, 8]
Single pack bitmaps are unaffected by this issue because we generate
them from within pack-objects, which does call disable_replace_refs().
This patch updates the MIDX logic to disable replace objects within the
multi-pack-index builtin, and a test showing a clone (which would fail
with MIDX bitmap) is added to demonstrate the bug.
Helped-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Xing Xin <xingxin.xx@bytedance.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
builtin/receive-pack: convert to use git-maintenance(1)
In 850b6edefa (auto-gc: extract a reusable helper from "git fetch",
2020-05-06), we have introduced a helper function `run_auto_gc()` that
kicks off `git gc --auto`. The intent of this function was to pass down
the "--quiet" flag to git-gc(1) as required without duplicating this at
all callsites. In 7c3e9e8cfb (auto-gc: pass --quiet down from am,
commit, merge and rebase, 2020-05-06) we then converted callsites that
need to pass down this flag to use the new helper function. This has the
notable omission of git-receive-pack(1), which is the only remaining
user of `git gc --auto` that sets up the proccess manually. This is
probably because it unconditionally passes down the `--quiet` flag and
thus didn't benefit much from the new helper function.
In a95ce12430 (maintenance: replace run_auto_gc(), 2020-09-17) we then
replaced `run_auto_gc()` with `run_auto_maintenance()` which invokes
git-maintenance(1) instead of git-gc(1). This command is the modern
replacement for git-gc(1) and is both more thorough and also more
flexible because administrators can configure which tasks exactly to run
during maintenance.
But due to git-receive-pack(1) not using `run_auto_gc()` in the first
place it did not get converted to use git-maintenance(1) like we do
everywhere else now. Address this oversight and start to use the newly
introduced function `prepare_auto_maintenance()`. This will also make it
easier for us to adapt this code together with all the other callsites
that invoke auto-maintenance in the future.
This removes the last internal user of `git gc --auto`.
Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
run-command: introduce function to prepare auto-maintenance process
The `run_auto_maintenance()` function is responsible for spawning a new
`git maintenance run --auto` process. To do so, it sets up the `sturct
child_process` and then runs it by executing `run_command()` directly.
This is rather inflexible in case callers want to modify the child
process somewhat, e.g. to redirect stderr or stdout.
Introduce a new `prepare_auto_maintenance()` function to plug this gap.
Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
brian m. carlson [Wed, 17 Apr 2024 00:02:40 +0000 (00:02 +0000)]
credential: add method for querying capabilities
Right now, there's no specific way to determine whether a credential
helper or git credential itself supports a given set of capabilities.
It would be helpful to have such a way, so let's let credential helpers
and git credential take an argument, "capability", which has it list the
capabilities and a version number on standard output.
Specifically choose a format that is slightly different from regular
credential output and assume that no capabilities are supported if a
non-zero exit status occurs or the data deviates from the format. It is
common for users to write small shell scripts as the argument to
credential.helper, which will almost never be designed to emit
capabilities. We want callers to gracefully handle this case by
assuming that they are not capable of extended support because that is
almost certainly the case, and specifying the error behavior up front
does this and preserves backwards compatibility in a graceful way.
Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
brian m. carlson [Wed, 17 Apr 2024 00:02:39 +0000 (00:02 +0000)]
credential-cache: implement authtype capability
Now that we have full support in Git for the authtype capability, let's
add support to the cache credential helper.
When parsing data, we always set the initial capabilities because we're
the helper, and we need both the initial and helper capabilities to be
set in order to have the helper capabilities take effect.
When emitting data, always emit the supported capability and make sure
we emit items only if we have them and they're supported by the caller.
Since we may no longer have a username or password, be sure to emit
those conditionally as well so we don't segfault on a NULL pointer.
Similarly, when comparing credentials, consider both the password and
credential fields when we're matching passwords.
Adjust the partial credential detection code so that we can store
credentials missing a username or password as long as they have an
authtype and credential.
Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
brian m. carlson [Wed, 17 Apr 2024 00:02:38 +0000 (00:02 +0000)]
t: add credential tests for authtype
It's helpful to have some basic tests for credential helpers supporting
the authtype and credential fields. Let's add some tests for this case
so that we can make sure newly supported helpers work correctly.
Note that we explicitly check that credential helpers can produce
different sets of authtype and credential values based on the username.
While the username is not used in the HTTP protocol with authtype and
credential, it can still be specified in the URL and thus may be part of
the protocol. Additionally, because it is common for users to have
multiple accounts on one service (say, both personal and professional
accounts), it's very helpful to be able to store different credentials
for different accounts in the same helper, and that doesn't become less
useful if one is using, say, Bearer authentication instead of Basic.
Thus, credential helpers should be expected to support this
functionality as basic functionality, so verify here that they do so.
Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
brian m. carlson [Wed, 17 Apr 2024 00:02:37 +0000 (00:02 +0000)]
credential: add support for multistage credential rounds
Over HTTP, NTLM and Kerberos require two rounds of authentication on the
client side. It's possible that there are custom authentication schemes
that also implement this same approach. Since these are tricky schemes
to implement and the HTTP library in use may not always handle them
gracefully on all systems, it would be helpful to allow the credential
helper to implement them instead for increased portability and
robustness.
To allow this to happen, add a boolean flag, continue, that indicates
that instead of failing when we get a 401, we should retry another round
of authentication. However, this necessitates some changes in our
current credential code so that we can make this work.
Keep the state[] headers between iterations, but only use them to send
to the helper and only consider the new ones we read from the credential
helper to be valid on subsequent iterations. That avoids us passing
stale data when we finally approve or reject the credential. Similarly,
clear the multistage and wwwauth[] values appropriately so that we
don't pass stale data or think we're trying a multiround response when
we're not. Remove the credential values so that we can actually fill a
second time with new responses.
Limit the number of iterations of reauthentication we do to 3. This
means that if there's a problem, we'll terminate with an error message
instead of retrying indefinitely and not informing the user (and
possibly conducting a DoS on the server).
In our tests, handle creating multiple response output files from our
helper so we can verify that each of the messages sent is correct.
Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
brian m. carlson [Wed, 17 Apr 2024 00:02:36 +0000 (00:02 +0000)]
t5563: refactor for multi-stage authentication
Some HTTP authentication schemes, such as NTLM- and Kerberos-based
options, require more than one round trip to authenticate. Currently,
these can only be supported in libcurl, since Git does not have support
for this in the credential helper protocol.
However, in a future commit, we'll add support for this functionality
into the credential helper protocol and Git itself. Because we don't
really want to implement either NTLM or Kerberos, both of which are
complex protocols, we'll want to test this using a fake credential
authentication scheme. In order to do so, update t5563 and its backend
to allow us to accept multiple sets of credentials and respond with
different behavior in each case.
Since we can now provide any number of possible status codes, provide a
non-specific reason phrase so we don't have to generate a more specific
one based on the response. The reason phrase is mandatory according to
the status-line production in RFC 7230, but clients SHOULD ignore it,
and curl does (except to print it).
Each entry in the authorization and challenge fields contains an ID,
which indicates a corresponding credential and response. If the
response is a 200 status, then we continue to execute git-http-backend.
Otherwise, we print the corresponding status and response. If no ID is
matched, we use the default response with a status of 401.
Note that there is an implicit order to the parameters. The ID is
always first and the creds or response value is always last, and
therefore may contain spaces, equals signs, or other arbitrary data.
Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
brian m. carlson [Wed, 17 Apr 2024 00:02:35 +0000 (00:02 +0000)]
docs: set a limit on credential line length
We recently introduced a way for credential helpers to add arbitrary
state as part of the protocol. Set some limits on line length to avoid
helpers passing extremely large amounts of data. While Git doesn't have
a fixed parsing length, there are other tools which support this
protocol and it's kind to allow them to use a reasonable fixed-size
buffer for parsing. In addition, we would like to be moderate in our
memory usage and imposing reasonable limits is helpful for that purpose.
In the event a credential helper is incapable of storing its serialized
state in 64 KiB, it can feel free to serialize it on disk and store a
reference instead.
Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
brian m. carlson [Wed, 17 Apr 2024 00:02:33 +0000 (00:02 +0000)]
credential: add an argument to keep state
Until now, our credential code has mostly deal with usernames and
passwords and we've let libcurl deal with the variant of authentication
to be used. However, now that we have the credential value, the
credential helper can take control of the authentication, so the value
provided might be something that's generated, such as a Digest hash
value.
In such a case, it would be helpful for a credential helper that gets an
erase or store command to be able to keep track of an identifier for the
original secret that went into the computation. Furthermore, some types
of authentication, such as NTLM and Kerberos, actually need two round
trips to authenticate, which will require that the credential helper
keep some state.
In order to allow for these use cases and others, allow storing state in
a field called "state[]". This value is passed back to the credential
helper that created it, which avoids confusion caused by parsing values
from different helpers.
Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
brian m. carlson [Wed, 17 Apr 2024 00:02:32 +0000 (00:02 +0000)]
http: add support for authtype and credential
Now that we have the credential helper code set up to handle arbitrary
authentications schemes, let's add support for this in the HTTP code,
where we really want to use it. If we're using this new functionality,
don't set a username and password, and instead set a header wherever
we'd normally do so, including for proxy authentication.
Since we can now handle this case, ask the credential helper to enable
the appropriate capabilities.
Finally, if we're using the authtype value, set "Expect: 100-continue".
Any type of authentication that requires multiple rounds (such as NTLM
or Kerberos) requires a 100 Continue (if we're larger than
http.postBuffer) because otherwise we send the pack data before we're
authenticated, the push gets a 401 response, and we can't rewind the
stream. We don't know for certain what other custom schemes might
require this, the HTTP/1.1 standard has required handling this since
1999, the broken HTTP server for which we disabled this (Google's) is
now fixed and has been for some time, and libcurl has a 1-second
fallback in case the HTTP server is still broken. In addition, it is
not unreasonable to require compliance with a 25-year old standard to
use new Git features. For all of these reasons, do so here.
Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
brian m. carlson [Wed, 17 Apr 2024 00:02:31 +0000 (00:02 +0000)]
docs: indicate new credential protocol fields
Now that we have new fields (authtype and credential), let's document
them for users and credential helper implementers.
Indicate specifically what common values of authtype are and what values
are allowed. Note that, while common, digest and NTLM authentication
are insecure because they require unsalted, uniterated password hashes
to be stored.
Tell users that they can continue to use a username and password even if
the new capability is supported.
Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
brian m. carlson [Wed, 17 Apr 2024 00:02:30 +0000 (00:02 +0000)]
credential: add a field called "ephemeral"
Now that we have support for a wide variety of types of authentication,
it's important to indicate to other credential helpers whether they
should store credentials, since not every credential helper may
intuitively understand all possible values of the authtype field. Do so
with a boolean field called "ephemeral", to indicate whether the
credential is expected to be temporary.
For example, in HTTP Digest authentication, the Authorization header
value is based off a nonce. It isn't useful to store this value
for later use because reusing the credential long term will not result
in successful authentication due to the nonce necessarily differing.
An additional case is potentially short-lived credentials, which may
last only a few hours. It similarly wouldn't be helper for other
credential helpers to attempt to provide these much later.
We do still pass the value to "git credential store" or "git credential
erase", since it may be helpful to the original helper to know whether
the operation was successful.
Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
brian m. carlson [Wed, 17 Apr 2024 00:02:29 +0000 (00:02 +0000)]
credential: gate new fields on capability
We support the new credential and authtype fields, but we lack a way to
indicate to a credential helper that we'd like them to be used. Without
some sort of indication, the credential helper doesn't know if it should
try to provide us a username and password, or a pre-encoded credential.
For example, the helper might prefer a more restricted Bearer token if
pre-encoded credentials are possible, but might have to fall back to
more general username and password if not.
Let's provide a simple way to indicate whether Git (or, for that matter,
the helper) is capable of understanding the authtype and credential
fields. We send this capability when we generate a request, and the
other side may reply to indicate to us that it does, too.
For now, don't enable sending capabilities for the HTTP code. In a
future commit, we'll introduce appropriate handling for that code,
which requires more in-depth work.
The logic for determining whether a capability is supported may seem
complex, but it is not. At each stage, we emit the capability to the
following stage if all preceding stages have declared it. Thus, if the
caller to git credential fill didn't declare it, then we won't send it
to the helper, and if fill's caller did send but the helper doesn't
understand it, then we won't send it on in the response. If we're an
internal user, then we know about all capabilities and will request
them.
For "git credential approve" and "git credential reject", we set the
helper capability before calling the helper, since we assume that the
input we're getting from the external program comes from a previous call
to "git credential fill", and thus we'll invoke send a capability to the
helper if and only if we got one from the standard input, which is the
correct behavior.
Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
brian m. carlson [Wed, 17 Apr 2024 00:02:28 +0000 (00:02 +0000)]
credential: add a field for pre-encoded credentials
At the moment, our credential code wants to find a username and password
for access, which, for HTTP, it will pass to libcurl to encode and
process. However, many users want to use authentication schemes that
libcurl doesn't support, such as Bearer authentication. In these
schemes, the secret is not a username and password pair, but some sort
of token that meets the production for authentication data in the RFC.
In fact, in general, it's useful to allow our credential helper to have
knowledge about what specifically to put in the protocol header. Thus,
add a field, credential, which contains data that's preencoded to be
suitable for the protocol in question. If we have such data, we need
neither a username nor a password, so make that adjustment as well.
It is in theory possible to reuse the password field for this. However,
if we do so, we must know whether the credential helper supports our new
scheme before sending it data, which necessitates some sort of
capability inquiry, because otherwise an uninformed credential helper
would store our preencoded data as a password, which would fail the next
time we attempted to connect to the remote server. This design is
substantially simpler, and we can hint to the credential helper that we
support this approach with a simple new field instead of needing to
query it first.
Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
brian m. carlson [Wed, 17 Apr 2024 00:02:27 +0000 (00:02 +0000)]
http: use new headers for each object request
Currently we create one set of headers for all object requests and reuse
it. However, we'll need to adjust the headers for authentication
purposes in the future, so let's create a new set for each request so
that we can adjust them if the authentication changes.
Note that the cost of allocation here is tiny compared to the fact that
we're making a network call, not to mention probably a full TLS
connection, so this shouldn't have a significant impact on performance.
Moreover, nobody who cares about performance is using the dumb HTTP
protocol anyway, since it often makes huge numbers of requests compared
to the smart protocol.
Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
brian m. carlson [Wed, 17 Apr 2024 00:02:26 +0000 (00:02 +0000)]
remote-curl: reset headers on new request
When we retry a post_rpc request, we currently reuse the same headers as
before. In the future, we'd like to be able to modify them based on the
result we get back, so let's reset them on each retry so we can avoid
sending potentially duplicate headers if the values change.
Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
brian m. carlson [Wed, 17 Apr 2024 00:02:25 +0000 (00:02 +0000)]
credential: add an authtype field
When Git makes an HTTP request, it can negotiate the type of
authentication to use with the server provided the authentication scheme
is one of a few well-known types (Basic, Digest, NTLM, or Negotiate).
However, some servers wish to use other types of authentication, such as
the Bearer type from OAuth2. Since libcurl doesn't natively support
this type, it isn't possible to use it, and the user is forced to
specify the Authorization header using the http.extraheader setting.
However, storing a plaintext token in the repository configuration is
not very secure, especially if a repository can be shared by multiple
parties. We already have support for many types of secure credential
storage by using credential helpers, so let's teach credential helpers
how to produce credentials for an arbitrary scheme.
If the credential helper specifies an authtype field, then it specifies
an authentication scheme (e.g., Bearer) and the password field specifies
the raw authentication token, with any encoding already specified. We
reuse the password field for this because some credential helpers store
the metadata without encryption even though the password is encrypted,
and we'd like to avoid insecure storage if an older version of the
credential helper gets ahold of the data.
The username is not used in this case, but it is still preserved for the
purpose of finding the right credential if the user has multiple
accounts.
If the authtype field is not specified, then the password behaves as
normal and it is passed along with the username to libcurl.
Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
builtin/clone: refuse local clones of unsafe repositories
When performing a local clone of a repository we end up either copying
or hardlinking the source repository into the target repository. This is
significantly more performant than if we were to use git-upload-pack(1)
and git-fetch-pack(1) to create the new repository and preserves both
disk space and compute time.
Unfortunately though, performing such a local clone of a repository that
is not owned by the current user is inherently unsafe:
- It is possible that source files get swapped out underneath us while
we are copying or hardlinking them. While we do perform some checks
here to assert that we hardlinked the expected file, they cannot
reliably thwart time-of-check-time-of-use (TOCTOU) style races. It
is thus possible for an adversary to make us copy or hardlink
unexpected files into the target directory.
Ideally, we would address this by starting to use openat(3P),
fstatat(3P) and friends. Due to platform compatibility with Windows
we cannot easily do that though. Furthermore, the scope of these
fixes would likely be quite broad and thus not fit for an embargoed
security release.
- Even if we handled TOCTOU-style races perfectly, hardlinking files
owned by a different user into the target repository is not a good
idea in general. It is possible for an adversary to rewrite those
files to contain whatever data they want even after the clone has
completed.
Address these issues by completely refusing local clones of a repository
that is not owned by the current user. This reuses our existing infra we
have in place via `ensure_valid_ownership()` and thus allows a user to
override the safety guard by adding the source repository path to the
"safe.directory" configuration.
This addresses CVE-2024-32020.
Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Introduce a new function `die_upon_dubious_ownership()` that uses
`ensure_valid_ownership()` to verify whether a repositroy is safe for
use, and causes Git to die in case it is not.
This function will be used in a subsequent commit.
Helped-by: Johannes Schindelin <Johannes.Schindelin@gmx.de> Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
builtin/clone: abort when hardlinked source and target file differ
When performing local clones with hardlinks we refuse to copy source
files which are symlinks as a mitigation for CVE-2022-39253. This check
can be raced by an adversary though by changing the file to a symlink
after we have checked it.
Fix the issue by checking whether the hardlinked destination file
matches the source file and abort in case it doesn't.
This addresses CVE-2024-32021.
Reported-by: Apple Product Security <product-security@apple.com> Suggested-by: Linus Torvalds <torvalds@linuxfoundation.org> Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
builtin/clone: stop resolving symlinks when copying files
When a user performs a local clone without `--no-local`, then we end up
copying the source repository into the target repository directly. To
optimize this even further, we try to hardlink files into place instead
of copying data over, which helps both disk usage and speed.
There is an important edge case in this context though, namely when we
try to hardlink symlinks from the source repository into the target
repository. Depending on both platform and filesystem the resulting
behaviour here can be different:
- On macOS and NetBSD, calling link(3P) with a symlink target creates
a hardlink to the file pointed to by the symlink.
- On Linux, calling link(3P) instead creates a hardlink to the symlink
itself.
To unify this behaviour, 36596fd2df (clone: better handle symlinked
files at .git/objects/, 2019-07-10) introduced logic to resolve symlinks
before we try to link(3P) files. Consequently, the new behaviour was to
always create a hard link to the target of the symlink on all platforms.
Eventually though, we figured out that following symlinks like this can
cause havoc when performing a local clone of a malicious repository,
which resulted in CVE-2022-39253. This issue was fixed via 6f054f9fb3
(builtin/clone.c: disallow `--local` clones with symlinks, 2022-07-28),
by refusing symlinks in the source repository.
But even though we now shouldn't ever link symlinks anymore, the code
that resolves symlinks still exists. In the best case the code does not
end up doing anything because there are no symlinks anymore. In the
worst case though this can be abused by an adversary that rewrites the
source file after it has been checked not to be a symlink such that it
actually is a symlink when we call link(3P). Thus, it is still possible
to recreate CVE-2022-39253 due to this time-of-check-time-of-use bug.
Remove the call to `realpath()`. This doesn't yet address the actual
vulnerability, which will be handled in a subsequent commit.
Reported-by: Apple Product Security <product-security@apple.com> Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
ci(linux32): add a note about Actions that must not be updated
The Docker container used by the `linux32` job comes without Node.js,
and therefore the `actions/checkout` and `actions/upload-artifact`
Actions cannot be upgraded to the latest versions (because they use
Node.js).
One time too many, I accidentally tried to update them, where
`actions/checkout` at least fails immediately, but the
`actions/upload-artifact` step is only used when any test fails, and
therefore the CI run usually passes even though that Action was updated
to a version that is incompatible with the Docker container in which
this job runs.
So let's add a big fat warning, mainly for my own benefit, to avoid
running into the very same issue over and over again.
Backported-from: 20e0ff8835 (ci(linux32): add a note about Actions that must not be updated, 2024-02-11) Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de> Signed-off-by: Junio C Hamano <gitster@pobox.com> Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Junio C Hamano [Fri, 2 Feb 2024 20:39:35 +0000 (12:39 -0800)]
GitHub Actions: update to github-script@v7
We seem to be getting "Node.js 16 actions are deprecated." warnings
for jobs that use github-script@v6. Update to github-script@v7,
which is said to use Node.js 20.
Backported-from: c4ddbe043e (GitHub Actions: update to github-script@v7, 2024-02-02) Signed-off-by: Junio C Hamano <gitster@pobox.com> Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
After activating automatic Dependabot updates in the
git-for-windows/git repository, Dependabot noticed a couple
of yet-unaddressed updates. They avoid "Node.js 16 Actions"
deprecation messages by bumping the following Actions'
versions:
- actions/upload-artifact from 3 to 4
- actions/download-artifact from 3 to 4
- actions/cache from 3 to 4
Backported-from: 820a340085 (ci: bump remaining outdated Actions versions, 2024-02-11) Helped-by: Matthias Aßhauer <mha1993@live.de> Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de> Signed-off-by: Junio C Hamano <gitster@pobox.com> Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Junio C Hamano [Fri, 2 Feb 2024 20:39:34 +0000 (12:39 -0800)]
GitHub Actions: update to checkout@v4
We seem to be getting "Node.js 16 actions are deprecated." warnings
for jobs that use checkout@v3. Except for the i686 containers job
that is kept at checkout@v1 [*], update to checkout@v4, which is
said to use Node.js 20.
[*] 6cf4d908 (ci(main): upgrade actions/checkout to v3, 2022-12-05)
refers to https://github.com/actions/runner/issues/2115 and
explains why container jobs are kept at checkout@v1. We may
want to check the current status of the issue and move it to the
same version as other jobs, but that is outside the scope of
this step.
Backported-from: e94dec0c1d (GitHub Actions: update to checkout@v4, 2024-02-02) Signed-off-by: Junio C Hamano <gitster@pobox.com> Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Tests with LSan from time to time seem to emit harmless message
that makes our tests unnecessarily flakey; we work it around by
filtering the uninteresting output.
Merge branch 'backport/jk/libcurl-8.7-regression-workaround' into maint-2.39
Fix was added to work around a regression in libcURL 8.7.0 (which has
already been fixed in their tip of the tree).
* jk/libcurl-8.7-regression-workaround:
remote-curl: add Transfer-Encoding header only for older curl
INSTALL: bump libcurl version to 7.21.3
http: reset POSTFIELDSIZE when clearing curl handle
Signed-off-by: Johannes Schindelin <Johannes.Schindelin@gmx.de>
* jk/httpd-test-updates:
t/lib-httpd: increase ssl key size to 2048 bits
t/lib-httpd: drop SSLMutex config
t/lib-httpd: bump required apache version to 2.4
t/lib-httpd: bump required apache version to 2.2
This is a backport onto the `maint-2.39` branch, to improve the CI
health of that branch.
Signed-off-by: Johannes Schindelin <Johannes.Schindelin@gmx.de>
* jk/http-test-fixes:
t5559: make SSL/TLS the default
t5559: fix test failures with LIB_HTTPD_SSL
t/lib-httpd: enable HTTP/2 "h2" protocol, not just h2c
t/lib-httpd: respect $HTTPD_PROTO in expect_askpass()
t5551: drop curl trace lines without headers
t5551: handle v2 protocol in cookie test
t5551: simplify expected cookie file
t5551: handle v2 protocol in upload-pack service test
t5551: handle v2 protocol when checking curl trace
t5551: stop forcing clone to run with v0 protocol
t5551: handle HTTP/2 when checking curl trace
t5551: lower-case headers in expected curl trace
t5551: drop redundant grep for Accept-Language
t5541: simplify and move "no empty path components" test
t5541: stop marking "used receive-pack service" test as v0 only
t5541: run "used receive-pack service" test earlier
This is a backport onto the `maint-2.39` branch, starting to take care
of making that branch's CI builds healthy again.
Signed-off-by: Johannes Schindelin <Johannes.Schindelin@gmx.de>
Every once in a while, the `git-p4` tests flake for reasons outside of
our control. It typically fails with "Connection refused" e.g. here:
https://github.com/git/git/actions/runs/5969707156/job/16196057724
[...]
+ git p4 clone --dest=/home/runner/work/git/git/t/trash directory.t9807-git-p4-submit/git //depot
Initialized empty Git repository in /home/runner/work/git/git/t/trash directory.t9807-git-p4-submit/git/.git/
Perforce client error:
Connect to server failed; check $P4PORT.
TCP connect to localhost:9807 failed.
connect: 127.0.0.1:9807: Connection refused
failure accessing depot: could not run p4
Importing from //depot into /home/runner/work/git/git/t/trash directory.t9807-git-p4-submit/git
[...]
This happens in other jobs, too, but in the `linux-asan`/`linux-ubsan`
jobs it hurts the most because those jobs often take an _awfully_ long
time to run, therefore re-running a failed `linux-asan`/`linux-ubsan`
jobs is _very_ costly.
The purpose of the `linux-asan`/`linux-ubsan` jobs is to exercise the C
code of Git, anyway, and any part of Git's source code that the `git-p4`
tests run and that would benefit from the attention of ASAN/UBSAN are
run better in other tests anyway, as debugging C code run via Python
scripts can get a bit hairy.
In fact, it is not even just `git-p4` that is the problem (even if it
flakes often enough to be problematic in the CI builds), but really the
part about Python scripts. So let's just skip any Python parts of the
tests from being run in that job.
For good measure, also skip the Subversion tests because debugging C
code run via Perl scripts is as much fun as debugging C code run via
Python scripts. And it will reduce the time this very expensive job
takes, which is a big benefit.
Backported to `maint-2.39` as another step to get that branch's CI
builds back to a healthy state.
Backported-from: 6ba913629f (ci(linux-asan-ubsan): let's save some time, 2023-08-29) Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de> Signed-off-by: Junio C Hamano <gitster@pobox.com> Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
PtraceRegistersStatus have_registers =
suspended_threads.GetRegistersAndSP(i, ®isters, &sp);
if (have_registers != REGISTERS_AVAILABLE) {
Report("Unable to get registers from thread %llu.\n", os_id);
// If unable to get SP, consider the entire stack to be reachable unless
// GetRegistersAndSP failed with ESRCH.
if (have_registers == REGISTERS_UNAVAILABLE_FATAL)
continue;
sp = stack_begin;
}
The program itself still runs fine and LSan doesn't cause us to abort.
But test-lib.sh looks for any non-empty LSan logs and marks the test as
a failure anyway, under the assumption that we simply missed the failing
exit code somehow.
I don't think I've ever seen this happen in the CI job, but running
locally using clang-14 on an 8-core machine, I can't seem to make it
through a full run of the test suite without having at least one
failure. And it's a different one every time (though they do seem to
often be related to packing tests, which makes sense, since that is one
of our biggest users of threaded code).
We can hack around this by only counting LSan log files that contain a
line that doesn't match our known-uninteresting pattern.
Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com> Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
In April, GitHub announced that the `macos-13` pool is available:
https://github.blog/changelog/2023-04-24-github-actions-macos-13-is-now-available/.
It is only a matter of time until the `macos-12` pool is going away,
therefore we should switch now, without pressure of a looming deadline.
Since the `macos-13` runners no longer include Python2, we also drop
specifically testing with Python2 and switch uniformly to Python3, see
https://github.com/actions/runner-images/blob/HEAD/images/macos/macos-13-Readme.md
for details about the software available on the `macos-13` pool's
runners.
Also, on macOS 13, Homebrew seems to install a `gcc@9` package that no
longer comes with a regular `unistd.h` (there seems only to be a
`ssp/unistd.h`), and hence builds would fail with:
In file included from base85.c:1:
git-compat-util.h:223:10: fatal error: unistd.h: No such file or directory
223 | #include <unistd.h>
| ^~~~~~~~~~
compilation terminated.
The reason why we install GCC v9.x explicitly is historical, and back in
the days it was because it was the _newest_ version available via
Homebrew: 176441bfb58 (ci: build Git with GCC 9 in the 'osx-gcc' build
job, 2019-11-27).
To reinstate the spirit of that commit _and_ to fix that build failure,
let's switch to the now-newest GCC version: v13.x.
Backported-from: 682a868f67 (ci: upgrade to using macos-13, 2023-11-03) Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de> Signed-off-by: Junio C Hamano <gitster@pobox.com> Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Junio C Hamano [Tue, 16 Apr 2024 21:50:30 +0000 (14:50 -0700)]
Merge branch 'ba/osxkeychain-updates'
Update osxkeychain backend with features required for the recent
credential subsystem.
* ba/osxkeychain-updates:
osxkeychain: store new attributes
osxkeychain: erase matching passwords only
osxkeychain: erase all matching credentials
osxkeychain: replace deprecated SecKeychain API
Junio C Hamano [Tue, 16 Apr 2024 21:50:30 +0000 (14:50 -0700)]
Merge branch 'jt/reftable-geometric-compaction'
The strategy to compact multiple tables of reftables after many
operations accumulate many entries has been improved to avoid
accumulating too many tables uncollected.
* jt/reftable-geometric-compaction:
reftable/stack: use geometric table compaction
reftable/stack: add env to disable autocompaction
reftable/stack: expose option to disable auto-compaction
Adjust to an upcoming changes to GNU make that breaks our Makefiles.
* tb/make-indent-conditional-with-non-spaces:
Makefile(s): do not enforce "all indents must be done with tab"
Makefile(s): avoid recipe prefix in conditional statements
vreportf(), which is usede by error() and friends, has been taught
to give the error message printf-format string when its vsnprintf()
call fails, instead of showing nothing useful to identify the
nature of the error.
Junio C Hamano [Tue, 16 Apr 2024 21:50:27 +0000 (14:50 -0700)]
Merge branch 'jc/local-extern-shell-rules'
Document and apply workaround for a buggy version of dash that
mishandles "local var=val" construct.
* jc/local-extern-shell-rules:
t1016: local VAR="VAL" fix
t0610: local VAR="VAL" fix
t: teach lint that RHS of 'local VAR=VAL' needs to be quoted
t: local VAR="VAL" (quote ${magic-reference})
t: local VAR="VAL" (quote command substitution)
t: local VAR="VAL" (quote positional parameters)
CodingGuidelines: quote assigned value in 'local var=$val'
CodingGuidelines: describe "export VAR=VAL" rule
rerere: fix crashes due to unmatched opening conflict markers
When rerere handles a conflict with an unmatched opening conflict marker
in a file with other conflicts, it will fail create a preimage and also
fail allocate the status member of struct rerere_dir. Currently the
status member is allocated after the error handling. This will lead to a
SEGFAULT when the status member is accessed during cleanup of the failed
parse.
Additionally, in subsequent executions of rerere, after removing the
MERGE_RR.lock manually, rerere crashes for a similar reason. MERGE_RR
points to a conflict id that has no preimage, therefore the status
member is not allocated and a SEGFAULT happens when trying to check if a
preimage exists.
Solve this by making sure the status field is allocated correctly and add
tests to prevent the bug from reoccurring.
This does not fix the root cause, failing to parse stray conflict
markers, but I don't think we can do much better than recognizing it,
printing an error, and moving on gracefully.
Signed-off-by: Marcel Röthke <marcel@roethke.info> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Fix was added to work around a regression in libcURL 8.7.0 (which has
already been fixed in their tip of the tree).
* jk/libcurl-8.7-regression-workaround:
remote-curl: add Transfer-Encoding header only for older curl
INSTALL: bump libcurl version to 7.21.3
http: reset POSTFIELDSIZE when clearing curl handle
Junio C Hamano [Mon, 15 Apr 2024 21:11:43 +0000 (14:11 -0700)]
Merge branch 'gt/add-u-commit-i-pathspec-check'
"git add -u <pathspec>" and "git commit [-i] <pathspec>" did not
diagnose a pathspec element that did not match any files in certain
situations, unlike "git add <pathspec>" did.
* gt/add-u-commit-i-pathspec-check:
builtin/add: error out when passing untracked path with -u
builtin/commit: error out when passing untracked path with -i
revision: optionally record matches with pathspec elements
Junio C Hamano [Mon, 15 Apr 2024 21:11:43 +0000 (14:11 -0700)]
Merge branch 'ds/fetch-config-parse-microfix'
A config parser callback function fell through instead of returning
after recognising and processing a variable, wasting cycles, which
has been corrected.
* ds/fetch-config-parse-microfix:
fetch: return when parsing submodule.recurse
René Scharfe [Sun, 14 Apr 2024 16:47:52 +0000 (18:47 +0200)]
imap-send: increase command size limit
nfvasprintf() has a 8KB limit, but it's not relevant, as its result is
combined with other strings and added to a 1KB buffer by its caller.
That 1KB limit is not mentioned in RFC 9051, which specifies IMAP.
While 1KB is plenty for user names, passwords and mailbox names,
there's no point in limiting our commands like that. Call xstrvfmt()
instead of open-coding it and use strbuf to format the command to
send, as we need its length. Fail hard if it exceeds INT_MAX, because
socket_write() can't take more than that.
Suggested-by: Jeff King <peff@peff.net> Signed-off-by: René Scharfe <l.s.r@web.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Peter Krefting [Sat, 13 Apr 2024 20:14:48 +0000 (21:14 +0100)]
bisect: report the found commit with "show"
When "git bisect" finds the first bad commit and shows it to the user,
it calls "git diff-tree" to do so, whose output is meant to be stable
and deliberately ignores end-user customizations.
As the output is supposed to be consumed by humans, replace this with
a call to "git show". This command honors configuration options (such
as "log.date" and "log.mailmap") and other UI improvements (renames
are detected).
Pass some hard-coded options to "git show" to make the output similar
to the one we are replacing, such as showing a patch summary only.
Reported-by: Michael Osipov <michael.osipov@innomotics.com> Signed-off-By: Peter Krefting <peter@softwolves.pp.se> Signed-off-by: Junio C Hamano <gitster@pobox.com>
René Scharfe [Sun, 14 Apr 2024 16:47:54 +0000 (18:47 +0200)]
git-compat-util: fix NO_OPENSSL on current macOS
b195aa00c1 (git-compat-util: suppress unavoidable Apple-specific
deprecation warnings, 2014-12-16) started to define
__AVAILABILITY_MACROS_USES_AVAILABILITY in git-compat-util.h. On
current versions it is already defined (e.g. on macOS 14.4.1). Undefine
it before redefining it to avoid a compilation error.
Signed-off-by: René Scharfe <l.s.r@web.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
In 0fea6b73f1 (Merge branch 'tb/multi-pack-verbatim-reuse', 2024-01-12)
we have introduced multi-pack verbatim reuse of objects. This series has
introduced a new BTMP chunk, which encodes information about bitmapped
objects in the multi-pack index. Starting with dab60934e3 (pack-bitmap:
pass `bitmapped_pack` struct to pack-reuse functions, 2023-12-14) we use
this information to figure out objects which we can reuse from each of
the packfiles.
One thing that we glossed over though is backwards compatibility with
repositories that do not yet have BTMP chunks in their multi-pack index.
In that case, `nth_bitmapped_pack()` would return an error, which causes
us to emit a warning followed by another error message. These warnings
are visible to users that fetch from a repository:
While the fetch succeeds the user is left wondering what they did wrong.
Furthermore, as visible both from the warning and from the reuse stats,
pack-reuse is completely disabled in such repositories.
What is quite interesting is that this issue can even be triggered in
case `pack.allowPackReuse=single` is set, which is the default value.
One could have expected that in this case we fall back to the old logic,
which is to use the preferred packfile without consulting BTMP chunks at
all. But either we fail with the above error in case they are missing,
or we use the first pack in the multi-pack-index. The former case
disables pack-reuse altogether, whereas the latter case may result in
reusing objects from a suboptimal packfile.
Fix this issue by partially reverting the logic back to what we had
before this patch series landed. Namely, in the case where we have no
BTMP chunks or when `pack.allowPackReuse=single` are set, we use the
preferred pack instead of consulting the BTMP chunks.
Helped-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
reftable/block: avoid copying block iterators on seek
When seeking a reftable record in a block we need to position the
iterator _before_ the sought-after record so that the next call to
`block_iter_next()` would yield that record. To achieve this, the loop
that performs the linear seeks to restore the previous position once it
has found the record.
This is done by advancing two `block_iter`s: one to check whether the
next record is our sought-after record, and one that we update after
every iteration. This of course involves quite a lot of copying and also
leads to needless memory allocations.
Refactor the code to get rid of the `next` iterator and the copying this
involves. Instead, we can restore the previous offset such that the call
to `next` will return the correct record.
Next to being simpler conceptually this also leads to a nice speedup.
The following benchmark parser 10k refs out of 100k existing refs via
`git-rev-list --no-walk`:
Benchmark 1: rev-list: print many refs (HEAD~)
Time (mean ± σ): 170.2 ms ± 1.7 ms [User: 86.1 ms, System: 83.6 ms]
Range (min … max): 166.4 ms … 180.3 ms 500 runs
Benchmark 2: rev-list: print many refs (HEAD~)
Time (mean ± σ): 161.6 ms ± 1.6 ms [User: 78.1 ms, System: 83.0 ms]
Range (min … max): 158.4 ms … 172.3 ms 500 runs
Summary
rev-list: print many refs (HEAD) ran
1.05 ± 0.01 times faster than rev-list: print many refs (HEAD~)
Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
reftable/block: reuse `zstream` state on inflation
When calling `inflateInit()` and `inflate()`, the zlib library will
allocate several data structures for the underlying `zstream` to keep
track of various information. Thus, when inflating repeatedly, it is
possible to optimize memory allocation patterns by reusing the `zstream`
and then calling `inflateReset()` on it to prepare it for the next chunk
of data to inflate.
This is exactly what the reftable code is doing: when iterating through
reflogs we need to potentially inflate many log blocks, but we discard
the `zstream` every single time. Instead, as we reuse the `block_reader`
for each of the blocks anyway, we can initialize the `zstream` once and
then reuse it for subsequent inflations.
Refactor the code to do so, which leads to a significant reduction in
the number of allocations. The following measurements were done when
iterating through 1 million reflog entries. Before:
HEAP SUMMARY:
in use at exit: 13,473 bytes in 122 blocks
total heap usage: 23,028 allocs, 22,906 frees, 162,813,552 bytes allocated
After:
HEAP SUMMARY:
in use at exit: 13,473 bytes in 122 blocks
total heap usage: 302 allocs, 180 frees, 88,352 bytes allocated
Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
The reftable format stores log blocks in a compressed format. Thus,
whenever we want to read such a block we first need to decompress it.
This is done by calling the convenience function `uncompress2()` of the
zlib library, which is a simple wrapper that manages the lifecycle of
the `zstream` structure for us.
While nice for one-off inflation of data, when iterating through reflogs
we will likely end up inflating many such log blocks. This requires us
to reallocate the state of the `zstream` every single time, which adds
up over time. It would thus be great to reuse the `zstream` instead of
discarding it after every inflation.
Open-code the call to `uncompress2()` such that we can start reusing the
`zstream` in the subsequent commit. Note that our open-coded variant is
different from `uncompress2()` in two ways:
- We do not loop around `inflate()` until we have processed all input.
As our input is limited by the maximum block size, which is 16MB, we
should not hit limits of `inflate()`.
- We use `Z_FINISH` instead of `Z_NO_FLUSH`. Quoting the `inflate()`
documentation: "inflate() should normally be called until it returns
Z_STREAM_END or an error. However if all decompression is to be
performed in a single step (a single call of inflate), the parameter
flush should be set to Z_FINISH."
Furthermore, "Z_FINISH also informs inflate to not maintain a
sliding window if the stream completes, which reduces inflate's
memory footprint."
Other than that this commit is expected to be functionally equivalent
and does not yet reuse the `zstream`.
Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
The reftable backend stores reflog entries in a compressed format and
thus needs to uncompress blocks before one can read records from it.
For each reflog block we thus have to allocate an array that we can
decompress the block contents into. This block is being discarded
whenever the table iterator moves to the next block. Consequently, we
reallocate a new array on every block, which is quite wasteful.
Refactor the code to reuse the uncompressed block data when moving the
block reader to a new block. This significantly reduces the number of
allocations when iterating through many compressed blocks. The following
measurements are done with `git reflog list` when listing 100k reflogs.
Before:
HEAP SUMMARY:
in use at exit: 13,473 bytes in 122 blocks
total heap usage: 45,755 allocs, 45,633 frees, 254,779,456 bytes allocated
After:
HEAP SUMMARY:
in use at exit: 13,473 bytes in 122 blocks
total heap usage: 23,028 allocs, 22,906 frees, 162,813,547 bytes allocated
Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
The table iterator has to iterate towards the next block once it has
yielded all records of the current block. This is done by creating a new
table iterator, initializing it to the next block, releasing the old
iterator and then copying over the data.
Refactor the code to instead advance the table iterator in place. This
is simpler and unlocks some optimizations in subsequent patches. Also,
it allows us to avoid some allocations.
The following measurements show a single matching ref out of 1 million
refs. Before this change:
HEAP SUMMARY:
in use at exit: 13,603 bytes in 125 blocks
total heap usage: 7,235 allocs, 7,110 frees, 301,481 bytes allocated
After:
HEAP SUMMARY:
in use at exit: 13,603 bytes in 125 blocks
total heap usage: 315 allocs, 190 frees, 107,027 bytes allocated
Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
reftable/block: move ownership of block reader into `struct table_iter`
The table iterator allows the caller to iterate through all records in a
reftable table. To do so it iterates through all blocks of the desired
type one by one, where for each block it creates a new block iterator
and yields all its entries.
One of the things that is somewhat confusing in this context is who owns
the block reader that is being used to read the blocks and pass them to
the block iterator. Intuitively, as the table iterator is responsible
for iterating through the blocks, one would assume that this iterator is
also responsible for managing the lifecycle of the reader. And while it
somewhat is, the block reader is ultimately stored inside of the block
iterator.
Refactor the code such that the block reader is instead fully managed by
the table iterator. Instead of passing the reader to the block iterator,
we now only end up passing the block data to it. Despite clearing up the
lifecycle of the reader, it will also allow for better reuse of the
reader in subsequent patches.
The following benchmark prints a single matching ref out of 1 million
refs. Before:
HEAP SUMMARY:
in use at exit: 13,603 bytes in 125 blocks
total heap usage: 6,607 allocs, 6,482 frees, 509,635 bytes allocated
After:
HEAP SUMMARY:
in use at exit: 13,603 bytes in 125 blocks
total heap usage: 7,235 allocs, 7,110 frees, 301,481 bytes allocated
Note that while there are more allocation and free calls now, the
overall number of bytes allocated is significantly lower. The number of
allocations will be reduced significantly by the next patch though.
Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Introduce a new function `block_reader_release()` that releases
resources acquired by the block reader. This function will be extended
in a subsequent commit.
Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Function definitions and declaration of `struct block_reader` and
`struct block_iter` are somewhat mixed up, making it hard to see which
functions belong together. Rearrange them.
Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
reftable/block: merge `block_iter_seek()` and `block_reader_seek()`
The function `block_iter_seek()` is merely a simple wrapper around
`block_reader_seek()`. Merge those two functions into a new function
`block_iter_seek_key()` that more clearly says what it is actually
doing.
Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
The function `block_reader_start()` does not really apply to the block
reader, but to the block iterator. It's name is thus somewhat confusing.
Rename it to `block_iter_seek_start()` to clarify.
We will rename `block_reader_seek()` in similar spirit in the next
commit.
Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
When advice.waitingForEditor configuration is not set to false, we show
a hint telling that we are waiting for user's editor to close the file
when we launch an editor and wait for it to return control back to us.
We give the message on an incomplete line, expecting that we can go back
to the beginning of the line and clear the message when the editor returns.
However, it is possible that the editor exits with an error status, in
which case we show an error message and then return to our caller. In
such a case, the error message is given where the terminal cursor
happens to be, which is most likely after the "we are waiting for your
editor" message on the same line.
Clear the line before showing the error.
While we're here, make the error message follow our CodingGuideLines.
Signed-off-by: Rubén Justo <rjusto@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
NUL cannot appear in paths. Even disregarding filesystem path
limitations, the tree object format delimits with NUL, so such a path
cannot be encoded by Git.
When a quoted path is unquoted, it could possibly contain NUL from
"\000". Forbid it so it isn't truncated.
fast-import still has other issues with NUL, but those will be addressed
later.
Signed-off-by: Thalia Archibald <thalia@archibald.dev> Signed-off-by: Junio C Hamano <gitster@pobox.com>
fast-import: improve documentation for path quoting
It describes what characters cannot be in an unquoted path, but not
their semantics. Reframe it as a definition of unquoted paths. From the
perspective of the parser, whether it starts with `"` is what defines
whether it will parse it as quoted or unquoted.
The restrictions on characters in unquoted paths (with starting-", LF,
and spaces) are explained in the quoted paragraph. Move it to the
unquoted paragraph and reword.
The restriction that the source paths of filecopy and filerename cannot
contain SP is only stated in their respective sections. Restate it in
the <path> section.
Signed-off-by: Thalia Archibald <thalia@archibald.dev> Signed-off-by: Junio C Hamano <gitster@pobox.com>
The strbuf in `note_change_n` is to copy the remainder of `p` before
potentially invalidating it when reading the next line. However, `p` is
not used after that point. It has been unused since the function was
created in a8dd2e7d2b (fast-import: Add support for importing commit
notes, 2009-10-09) and looks to be a fossil from adapting
`file_change_m`. Remove it.
Signed-off-by: Thalia Archibald <thalia@archibald.dev> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Ever since filerename was added in f39a946a1f (Support wholesale
directory renames in fast-import, 2007-07-09) and filecopy in b6f3481bb4
(Teach fast-import to recursively copy files/directories, 2007-07-15),
both have produced an error when the destination path is empty. Later,
when support for targeting the root directory with an empty string was
added in 2794ad5244 (fast-import: Allow filemodify to set the root,
2010-10-10), this had the effect of allowing the quoted empty string
(`""`), but forbidding its unquoted variant (``). This seems to have
been intended as simple data validation for parsing two paths, rather
than a syntax restriction, because it was not extended to the other
operations.
All other occurrences of paths (in filemodify, filedelete, the source of
filecopy and filerename, and ls) allow both.
For most of this feature's lifetime, the documentation has not
prescribed the use of quoted empty strings. In e5959106d6
(Documentation/fast-import: put explanation of M 040000 <dataref> "" in
context, 2011-01-15), its documentation was changed from “`<path>` may
also be an empty string (`""`) to specify the root of the tree” to “The
root of the tree can be represented by an empty string as `<path>`”.
Thus, we should assume that some front-ends have depended on this
behavior.
Remove this restriction for the destination paths of filecopy and
filerename and change tests targeting the root to test `""` and ``.
Signed-off-by: Thalia Archibald <thalia@archibald.dev> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Previously, one case would not write the path to the strbuf: when the
path is unquoted and at the end of the string. It was essentially
copy-on-write. However, with the logic simplification of the previous
commit, this case was eliminated and the strbuf is always populated.
Directly use the strbufs now instead of an alias.
Since this already changes all the lines that use the strbufs, rename
them from `uq` to be more descriptive. That they are unquoted is not
their most important property, so name them after what they carry.
Additionally, `file_change_m` no longer needs to copy the path before
reading inline data.
Signed-off-by: Thalia Archibald <thalia@archibald.dev> Signed-off-by: Junio C Hamano <gitster@pobox.com>
and fast-import.c parses them in five different ways:
1. For filemodify and filedelete:
Try to unquote <path>. If it unquotes without errors, use the
unquoted version; otherwise, treat it as literal bytes to the end of
the line (including any number of SP).
2. For filecopy (source) and filerename (source):
Try to unquote <path>. If it unquotes without errors, use the
unquoted version; otherwise, treat it as literal bytes up to, but not
including, the next SP.
3. For filecopy (dest) and filerename (dest):
Like 1., but an unquoted empty string is forbidden.
4. For ls:
If <path> starts with `"`, unquote it and report parse errors;
otherwise, treat it as literal bytes to the end of the line
(including any number of SP).
5. For ls-commit:
Unquote <path> and report parse errors.
(It must start with `"` to disambiguate from ls.)
In the first three, any errors from trying to unquote a string are
suppressed, so a quoted string that contains invalid escapes would be
interpreted as literal bytes. For example, `"\xff"` would fail to
unquote (because hex escapes are not supported), and it would instead be
interpreted as the byte sequence '"', '\\', 'x', 'f', 'f', '"', which is
certainly not intended. Some front-ends erroneously use their language's
standard quoting routine instead of matching Git's, which could silently
introduce escapes that would be incorrectly parsed due to this and lead
to data corruption.
The documentation states “To use a source path that contains SP the path
must be quoted.”, so it is expected that some implementations depend on
spaces being allowed in paths in the final position. Thus we have two
documented ways to parse paths, so simplify the implementation to that.
Now we have:
1. `parse_path_eol` for filemodify, filedelete, filecopy (dest),
filerename (dest), ls, and ls-commit:
If <path> starts with `"`, unquote it and report parse errors;
otherwise, treat it as literal bytes to the end of the line
(including any number of SP).
2. `parse_path_space` for filecopy (source) and filerename (source):
If <path> starts with `"`, unquote it and report parse errors;
otherwise, treat it as literal bytes up to, but not including, the
next SP. It must be followed by SP.
There remain two special cases: The dest <path> in filecopy and rename
cannot be an unquoted empty string (this will be addressed subsequently)
and <path> in ls-commit must be quoted to disambiguate it from ls.
Signed-off-by: Thalia Archibald <thalia@archibald.dev> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Junio C Hamano [Fri, 12 Apr 2024 18:31:39 +0000 (11:31 -0700)]
Merge branch 'tb/midx-write'
Code clean-up by splitting code responsible for writing midx files
into its own file.
* tb/midx-write:
midx-write.c: use `--stdin-packs` when repacking
midx-write.c: check count of packs to repack after grouping
midx-write.c: factor out common want_included_pack() routine
midx-write: move writing-related functions from midx.c
Junio C Hamano [Fri, 12 Apr 2024 18:31:39 +0000 (11:31 -0700)]
Merge branch 'rs/t-prio-queue-cleanup'
t-prio-queue test has been cleaned up by using C99 compound
literals; this is meant to also serve as a weather-balloon to smoke
out folks with compilers who have trouble compiling code that uses
the feature.
* rs/t-prio-queue-cleanup:
t-prio-queue: simplify using compound literals
Junio C Hamano [Fri, 12 Apr 2024 18:31:39 +0000 (11:31 -0700)]
Merge branch 'ps/reftable-binsearch-updates'
Reftable code clean-up and some bugfixes.
* ps/reftable-binsearch-updates:
reftable/block: avoid decoding keys when searching restart points
reftable/record: extract function to decode key lengths
reftable/block: fix error handling when searching restart points
reftable/block: refactor binary search over restart points
reftable/refname: refactor binary search over refnames
reftable/basics: improve `binsearch()` test
reftable/basics: fix return type of `binsearch()` to be `size_t`
"git checkout/switch --detach foo", after switching to the detached
HEAD state, gave the tracking information for the 'foo' branch,
which was pointless.
Tested-by: M Hickford <mirth.hickford@gmail.com>
cf. <CAGJzqsmE9FDEBn=u3ge4LA3ha4fDbm4OWiuUbMaztwjELBd7ug@mail.gmail.com>
* jc/checkout-detach-wo-tracking-report:
checkout: omit "tracking" information on a detached HEAD
merge-tree: fix argument type of the `--merge-base` option
In 5f43cf5b2e4 (merge-tree: accept 3 trees as arguments, 2024-01-28), I
taught `git merge-tree` to perform three-way merges on trees. This
commit even changed the manual page to state that the `--merge-base`
option takes a tree-ish rather than requiring a commit.
But I forgot to adjust the in-program help text. This patch fixes that.
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
This commit corrects a typographical error found in both
date-formats.txt and git-fast-import.txt documentation, where the term
`email format` was mistakenly used instead of `date format`.
Signed-off-by: Xing Xin <xingxin.xx@bytedance.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
t0612: add tests to exercise Git/JGit reftable compatibility
While the reftable format is a recent introduction in Git, JGit already
knows to read and write reftables since 2017. Given the complexity of
the format there is a very real risk of incompatibilities between those
two implementations, which is something that we really want to avoid.
Add some basic tests that verify that reftables written by Git and JGit
can be read by the respective other implementation. For now this test
suite is rather small, only covering basic functionality. But it serves
as a good starting point and can be extended over time.
Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>