Derrick Stolee [Mon, 3 Feb 2025 17:11:03 +0000 (17:11 +0000)]
backfill: add builtin boilerplate
In anticipation of implementing 'git backfill', populate the necessary files
with the boilerplate of a new builtin. Mark the builtin as experimental at
this time, allowing breaking changes in the near future, if necessary.
Signed-off-by: Derrick Stolee <stolee@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Junio C Hamano [Tue, 4 Feb 2025 00:12:33 +0000 (16:12 -0800)]
Merge branch 'master' into ds/backfill
* master: (446 commits)
The seventh batch
The sixth batch
The fifth batch
The fourth batch
refs/reftable: fix uninitialized memory access of `max_index`
remote: announce removal of "branches/" and "remotes/"
The third batch
hash.h: drop unsafe_ function variants
csum-file: introduce hashfile_checkpoint_init()
t/helper/test-hash.c: use unsafe_hash_algo()
csum-file.c: use unsafe_hash_algo()
hash.h: introduce `unsafe_hash_algo()`
csum-file.c: extract algop from hashfile_checksum_valid()
csum-file: store the hash algorithm as a struct field
t/helper/test-tool: implement sha1-unsafe helper
trace2: prevent segfault on config collection with valueless true
refs: fix creation of reflog entries for symrefs
ci: wire up Visual Studio build with Meson
ci: raise error when Meson generates warnings
meson: fix compilation with Visual Studio
...
Jiang Xin [Mon, 3 Feb 2025 06:29:38 +0000 (07:29 +0100)]
send-pack: gracefully close the connection for atomic push
Patrick reported an issue that the exit code of git-receive-pack(1) is
ignored during an atomic push with the "--porcelain" flag, and added new test
cases in t5543.
This issue originated from commit 7dcbeaa0df (send-pack: fix
inconsistent porcelain output, 2020-04-17). At that time, I chose to
ignore the exit code of "finish_connect()" without investigating the
root cause of the abnormal termination of git-receive-pack. That was an
incorrect solution.
The root cause is that an atomic push operation terminates early without
sending a flush packet to git-receive-pack. As a result,
git-receive-pack continues waiting for commands without exiting. By
sending a flush packet at the appropriate location in "send_pack()", we
ensure that the git-receive-pack process closes properly, avoiding an
erroneous exit code for git-push. At the same time, revert the changes
to the "transport.c" file made in commit 7dcbeaa0df.
Reported-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Jiang Xin <zhiyou.jx@alibaba-inc.com> Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
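For readers unfamiliar with the term, a flush packet in Git's pkt-line
framing is the four bytes "0000": a length field of zero with no payload.
A minimal sketch of the framing follows. The helper name format_pkt_line()
and the FLUSH_PKT constant are illustrative and are not taken from
"send-pack.c" or "pkt-line.c".

```c
#include <stdio.h>
#include <string.h>

/* A flush packet is the special four-byte length "0000" with no payload. */
static const char FLUSH_PKT[] = "0000";

/*
 * Encode `payload` as one pkt-line into `out`: four hex digits giving the
 * total length (payload plus the four length bytes), then the payload.
 * Returns the number of bytes written, or -1 if it does not fit.
 * (Illustrative helper; four hex digits cannot express more than 0xffff.)
 */
static int format_pkt_line(char *out, size_t outsz, const char *payload)
{
	size_t len = strlen(payload) + 4;

	if (len > 0xffff || len + 1 > outsz)
		return -1;
	snprintf(out, outsz, "%04x%s", (unsigned int)len, payload);
	return (int)len;
}
```

A receiver that keeps reading command packets treats FLUSH_PKT as the end
of the list, which is why a sender that stops early without emitting it
leaves the peer waiting.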
Add new test cases in t5543 to avoid ignoring the exit code of
git-receive-pack(1) during an atomic push with the "--porcelain" flag.
We'd typically notice this case because the refs would have their error
message set. But there is an edge case when pushing refs succeeds, but
git-receive-pack(1) exits with a non-zero exit code at a later point in
time due to another error. An atomic git-push(1) would ignore that error
code, and consequently it would return successfully and not print any
error message at all.
Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Jiang Xin [Mon, 3 Feb 2025 06:29:36 +0000 (07:29 +0100)]
send-pack: new return code "ERROR_SEND_PACK_BAD_REF_STATUS"
The "push_refs" function in the transport_vtable is the handler for
git-push operation. All the "push_refs" functions for different
transports (protocols) should have the same behavior, but the behavior
of "git_transport_push()" function for builtin_smart_vtable in
"transport.c" (which calls "send_pack()" in "send-pack.c") differs from
the handler of the HTTP protocol.
The "push_refs()" function for the HTTP protocol, which calls the
"push_refs_with_push()" function in "transport-helper.c", will return 0
even when a bad REF_STATUS (such as REF_STATUS_REJECT_NONFASTFORWARD)
was found. But "send_pack()" for Git smart protocol will return -1 for
a bad REF_STATUS.
We cannot ignore bad REF_STATUS directly in the "send_pack()" function,
because the function is also used in "builtin/send-pack.c". So we add a
new non-zero error code "ERROR_SEND_PACK_BAD_REF_STATUS" for "send_pack()".
Ignore the specific error code in the "git_transport_push()" function to
have the same behavior as "push_refs()" for HTTP protocol. Note that
even though we ignore the error here, we'll ultimately still end up
detecting that a subset of refs was not pushed in `transport_push()`
because we eventually call `push_had_errors()` on the remote refs.
Signed-off-by: Jiang Xin <zhiyou.jx@alibaba-inc.com> Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
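The idea of a dedicated return code that one caller deliberately ignores
can be sketched as follows. The functions fake_send_pack() and
fake_transport_push() are hypothetical stand-ins, and the numeric value of
the constant is assumed for illustration. The real code lives in
"send-pack.c" and "transport.c".

```c
/* Value chosen for illustration only. */
#define ERROR_SEND_PACK_BAD_REF_STATUS 1

/*
 * Hypothetical stand-in for send_pack(): -1 for a hard failure, the
 * dedicated code when some ref was rejected, 0 on success.
 */
static int fake_send_pack(int hard_error, int rejected_refs)
{
	if (hard_error)
		return -1;
	if (rejected_refs)
		return ERROR_SEND_PACK_BAD_REF_STATUS;
	return 0;
}

/*
 * Hypothetical stand-in for the smart-transport push handler: it maps
 * the ref-status code to 0 so that it behaves like the HTTP handler,
 * and leaves per-ref failures to be detected later by the caller.
 */
static int fake_transport_push(int hard_error, int rejected_refs)
{
	int ret = fake_send_pack(hard_error, rejected_refs);

	if (ret == ERROR_SEND_PACK_BAD_REF_STATUS)
		ret = 0;
	return ret;
}
```

Other callers of the send function still see a non-zero result for
rejected refs, which is the reason for adding a distinct code instead of
returning 0 unconditionally.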
Add two more test cases exercising git-push(1) with `--porcelain`, one
exercising a non-atomic and one exercising an atomic push.
Based-on-patch-by: Jiang Xin <zhiyou.jx@alibaba-inc.com> Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Jiang Xin [Mon, 3 Feb 2025 06:29:32 +0000 (07:29 +0100)]
t5548: refactor to reuse setup_upstream() function
Refactor the function setup_upstream_and_workbench(), extracting
create_upstream_template() and setup_upstream() from it. The former is
used to create the upstream repository template, while the latter is
used to rebuild the upstream repository and will be reused in subsequent
commits.
To ensure that setup_upstream() works properly in both local and HTTP
protocols, the HTTP settings have been moved to the setup_upstream() and
setup_upstream_and_workbench() functions.
Signed-off-by: Jiang Xin <zhiyou.jx@alibaba-inc.com> Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Ayush Chandekar [Sun, 2 Feb 2025 12:09:26 +0000 (17:39 +0530)]
t6423: fix suppression of Git’s exit code in tests
Some tests in t6423 suppress Git's exit code, which can cause test
failures to go unnoticed. Specifically, using git <subcommand> |
<other-command> masks potential failures of the Git command.
This commit ensures that Git's exit status is correctly propagated by:
- Avoiding pipes that suppress exit codes.
Signed-off-by: Ayush Chandekar <ayu.chandekar@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
David Aguilar [Sat, 1 Feb 2025 21:33:18 +0000 (13:33 -0800)]
help: show the suggested command when help.autocorrect is false
Make the handling of false boolean values for help.autocorrect
consistent with the handling of value 0 by showing the suggested
commands but not running them.
Suggested-by: Junio C Hamano <gitster@pobox.com> Signed-off-by: David Aguilar <davvid@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Junio C Hamano [Mon, 3 Feb 2025 18:23:34 +0000 (10:23 -0800)]
Merge branch 'ps/build-meson-fixes'
More build fixes and enhancements on meson based build procedure.
* ps/build-meson-fixes:
ci: wire up Visual Studio build with Meson
ci: raise error when Meson generates warnings
meson: fix compilation with Visual Studio
meson: make the CSPRNG backend configurable
meson: wire up fuzzers
meson: wire up generation of distribution archive
meson: wire up development environments
meson: fix dependencies for generated headers
meson: populate project version via GIT-VERSION-GEN
GIT-VERSION-GEN: allow running without input and output files
GIT-VERSION-GEN: simplify computing the dirty marker
Junio C Hamano [Mon, 3 Feb 2025 18:23:33 +0000 (10:23 -0800)]
Merge branch 'ps/3.0-remote-deprecation'
Following the procedure we established to introduce breaking
changes for Git 3.0, allow an early opt-in for removing support of
$GIT_DIR/branches/ and $GIT_DIR/remotes/ directories to configure
remotes.
* ps/3.0-remote-deprecation:
remote: announce removal of "branches/" and "remotes/"
builtin/pack-redundant: remove subcommand with breaking changes
ci: repurpose "linux-gcc" job for deprecations
ci: merge linux-gcc-default into linux-gcc
Makefile: wire up build option for deprecated features
Junio C Hamano [Mon, 3 Feb 2025 18:23:33 +0000 (10:23 -0800)]
Merge branch 'jk/combine-diff-cleanup'
Code clean-up for code paths around combined diff.
* jk/combine-diff-cleanup:
tree-diff: make list tail-passing more explicit
tree-diff: simplify emit_path() list management
tree-diff: use the name "tail" to refer to list tail
tree-diff: drop list-tail argument to diff_tree_paths()
combine-diff: drop public declaration of combine_diff_path_size()
tree-diff: inline path_appendnew()
tree-diff: pass whole path string to path_appendnew()
tree-diff: drop path_appendnew() alloc optimization
run_diff_files(): de-mystify the size of combine_diff_path struct
diff: add a comment about combine_diff_path.parent.path
combine-diff: use pointer for parent paths
tree-diff: clear parent array in path_appendnew()
combine-diff: add combine_diff_path_new()
run_diff_files(): delay allocation of combine_diff_path
Junio C Hamano [Mon, 3 Feb 2025 18:23:32 +0000 (10:23 -0800)]
Merge branch 'tb/unsafe-hash-cleanup'
The API around choosing to use unsafe variant of SHA-1
implementation has been updated in an attempt to make it harder to
abuse.
* tb/unsafe-hash-cleanup:
hash.h: drop unsafe_ function variants
csum-file: introduce hashfile_checkpoint_init()
t/helper/test-hash.c: use unsafe_hash_algo()
csum-file.c: use unsafe_hash_algo()
hash.h: introduce `unsafe_hash_algo()`
csum-file.c: extract algop from hashfile_checksum_valid()
csum-file: store the hash algorithm as a struct field
t/helper/test-tool: implement sha1-unsafe helper
Jeff King [Fri, 31 Jan 2025 23:30:15 +0000 (18:30 -0500)]
ci: set CI_JOB_IMAGE for coverity job
The main GitHub Actions workflow switched away from the "$distro"
variable in b133d3071a (github: simplify computation of the job's
distro, 2025-01-10). Since the Coverity job also depends on our
ci/install-dependencies.sh script, it needs to likewise set CI_JOB_IMAGE
to find the correct dependencies (without this patch, we don't install
curl and the build fails).
Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Junio C Hamano [Mon, 3 Feb 2025 17:24:25 +0000 (09:24 -0800)]
Merge branch 'ps/ci-misc-updates' into jk/ci-coverity-update
* ps/ci-misc-updates:
ci: remove stale code for Azure Pipelines
ci: use latest Ubuntu release
ci: stop special-casing for Ubuntu 16.04
gitlab-ci: add linux32 job testing against i386
gitlab-ci: remove the "linux-old" job
github: simplify computation of the job's distro
github: convert all Linux jobs to be containerized
github: adapt containerized jobs to be rootless
t7422: fix flaky test caused by buffered stdout
t0060: fix EBUSY in MinGW when setting up runtime prefix
Seyi Kuforiji [Fri, 31 Jan 2025 22:14:20 +0000 (23:14 +0100)]
t/unit-tests: convert strcmp-offset test to use clar test framework
Adapt strcmp-offset test script to clar framework by using clar
assertions where necessary. Introduce `test_strcmp_offset__empty()` to
verify `check_strcmp_offset()` behavior when both input strings are
empty. This ensures the function correctly handles edge cases and
returns expected values.
Mentored-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Seyi Kuforiji <kuforiji98@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Seyi Kuforiji [Fri, 31 Jan 2025 22:14:18 +0000 (23:14 +0100)]
t/unit-tests: adapt example decorate test to use clar test framework
Introduce `test_example_decorate__initialize()` to explicitly set up
object IDs and retrieve corresponding objects before tests run. This
ensures a consistent and predictable test state without relying on data
from previous tests.
Add `test_example_decorate__cleanup()` to clear decorations after each
test, preventing interference between tests and ensuring each runs in
isolation.
Adapt example decorate test script to clar framework by using clar
assertions where necessary. Previously, tests relied on data written by
earlier tests, leading to unintended dependencies between them. This
explicitly initializes the necessary state within
`test_example_decorate__readd`, ensuring it does not depend on prior
test executions.
Mentored-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Seyi Kuforiji <kuforiji98@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Justin Tobler [Fri, 31 Jan 2025 17:39:38 +0000 (11:39 -0600)]
ci: fix base commit fallback for check-whitespace and check-style
The check-whitespace and check-style CI scripts require a base commit.
In GitLab CI, the base commit can be provided by several different
predefined CI variables depending on the type of pipeline being
performed.
In 30c4f7e350 (check-whitespace: detect if no base_commit is provided,
2024-07-23), the GitLab check-whitespace CI job was modified to support
CI_MERGE_REQUEST_DIFF_BASE_SHA as a fallback base commit if
CI_MERGE_REQUEST_TARGET_BRANCH_SHA was not provided. The same fallback
strategy was also implemented for the GitLab check-style CI job in bce7e52d4e (ci: run style check on GitHub and GitLab, 2024-07-23).
The base commit fallback is implemented using shell parameter expansion
where, if the first variable is unset, the second variable is used as
fallback. In GitLab CI, these variables can be set but null. This has
the unintended effect of selecting an empty first variable which results
in CI jobs providing an invalid base commit and failing.
Fix the issue by defaulting to the fallback variable if the first is
unset or null.
Signed-off-by: Justin Tobler <jltobler@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
global: adapt callers to use generic hash context helpers
Adapt callers to use generic hash context helpers instead of using the
hash algorithm to update them. This makes the callsites easier to reason
about and removes the possibility that the wrong hash algorithm is used
to update the hash context's state. And as a nice side effect this also
gets rid of a bunch of users of `the_hash_algo`.
Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
hash: provide generic wrappers to update hash contexts
The hash context is supposed to be updated via the `git_hash_algo`
structure, which contains a list of function pointers to update, clone
or finalize a hashing context. This requires the callers to track which
algorithm was used to initialize the context and continue to use the
exact same algorithm. If they fail to do that correctly, it can happen
that we start to access context state of one hash algorithm with
functions of a different hash algorithm. The result would typically be a
segfault, as could be seen e.g. in the patches part of 98422943f0 (Merge
branch 'ps/weak-sha1-for-tail-sum-fix', 2025-01-01).
The situation was significantly improved starting with 04292c3796
(hash.h: drop unsafe_ function variants, 2025-01-23) and its parent
commits. These refactorings ensure that it is not possible to mix up
safe and unsafe variants of the same hash algorithm anymore. But in
theory, it is still possible to mix up different hash algorithms with
each other, even though this is a lot less likely to happen.
But still, we can do better: instead of asking the caller to remember
the hash algorithm used to initialize a context, we can instead make the
context itself remember which algorithm it has been initialized with. If
we do so, callers can use a set of generic helpers to update the context
and don't need to be aware of the hash algorithm at all anymore.
Adapt the context initialization functions to store the hash algorithm
in the hashing context and introduce these generic helpers. Callers will
be adapted in the subsequent commit.
Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
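A minimal sketch of a context that remembers its own algorithm is shown
below. All names (toy_hash_algo, toy_hash_ctx, toy_hash_init() and so on)
and both toy "algorithms" are invented for illustration. They are not the
definitions from hash.h.

```c
#include <stddef.h>

struct toy_hash_ctx;

/* Table of function pointers describing one algorithm. */
struct toy_hash_algo {
	const char *name;
	void (*init_fn)(struct toy_hash_ctx *ctx);
	void (*update_fn)(struct toy_hash_ctx *ctx, const void *data, size_t len);
	unsigned long (*final_fn)(struct toy_hash_ctx *ctx);
};

/* A struct wrapping the union, so there is room for the algorithm pointer. */
struct toy_hash_ctx {
	const struct toy_hash_algo *algop; /* remembered at init time */
	union {
		unsigned long sum;     /* state of "toy-sum" */
		unsigned long xor_acc; /* state of "toy-xor" */
	} state;
};

static void sum_init(struct toy_hash_ctx *ctx) { ctx->state.sum = 0; }
static void sum_update(struct toy_hash_ctx *ctx, const void *data, size_t len)
{
	const unsigned char *p = data;
	while (len--)
		ctx->state.sum += *p++;
}
static unsigned long sum_final(struct toy_hash_ctx *ctx) { return ctx->state.sum; }

static void xor_init(struct toy_hash_ctx *ctx) { ctx->state.xor_acc = 0; }
static void xor_update(struct toy_hash_ctx *ctx, const void *data, size_t len)
{
	const unsigned char *p = data;
	while (len--)
		ctx->state.xor_acc ^= *p++;
}
static unsigned long xor_final(struct toy_hash_ctx *ctx) { return ctx->state.xor_acc; }

static const struct toy_hash_algo toy_sum = { "toy-sum", sum_init, sum_update, sum_final };
static const struct toy_hash_algo toy_xor = { "toy-xor", xor_init, xor_update, xor_final };

/* The only place where the caller names an algorithm. */
static void toy_hash_init(struct toy_hash_ctx *ctx, const struct toy_hash_algo *algop)
{
	ctx->algop = algop;
	algop->init_fn(ctx);
}

/* Generic helpers: dispatch through the algorithm stored in the context. */
static void toy_hash_update(struct toy_hash_ctx *ctx, const void *data, size_t len)
{
	ctx->algop->update_fn(ctx, data, len);
}

static unsigned long toy_hash_final(struct toy_hash_ctx *ctx)
{
	return ctx->algop->final_fn(ctx);
}
```

After toy_hash_init(), callers only pass the context around, so there is
no second place where a mismatched algorithm could be supplied.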
We generally avoid using `typedef` in the Git codebase. One exception
though is the `git_hash_ctx`, likely because it used to be a union
rather than a struct until the preceding commit refactored it. But now
that it is a normal `struct` there isn't really a need for a typedef
anymore.
Drop the typedef and adapt all callers accordingly.
Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
The `git_hash_context` is a union containing the different hash-specific
states for SHA1, its unsafe variant as well as SHA256. We know that only
one of these states will ever be in use at the same time because hash
contexts cannot be used for multiple different hashes at the same point
in time.
We're about to extend the structure though to keep track of the hash
algorithm used to initialize the context, which is impossible to do
while the context is a union. Refactor it to instead be a structure that
contains the union of context states.
Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Junio C Hamano [Fri, 31 Jan 2025 18:05:46 +0000 (10:05 -0800)]
Merge branch 'tb/unsafe-hash-cleanup' into ps/hash-cleanup
* tb/unsafe-hash-cleanup:
hash.h: drop unsafe_ function variants
csum-file: introduce hashfile_checkpoint_init()
t/helper/test-hash.c: use unsafe_hash_algo()
csum-file.c: use unsafe_hash_algo()
hash.h: introduce `unsafe_hash_algo()`
csum-file.c: extract algop from hashfile_checksum_valid()
csum-file: store the hash algorithm as a struct field
t/helper/test-tool: implement sha1-unsafe helper
Junio C Hamano [Thu, 30 Jan 2025 22:33:55 +0000 (14:33 -0800)]
Merge branch 'ps/build-meson-fixes' into ps/build-meson-fixes-0130
* ps/build-meson-fixes:
ci: wire up Visual Studio build with Meson
ci: raise error when Meson generates warnings
meson: fix compilation with Visual Studio
meson: make the CSPRNG backend configurable
meson: wire up fuzzers
meson: wire up generation of distribution archive
meson: wire up development environments
meson: fix dependencies for generated headers
meson: populate project version via GIT-VERSION-GEN
GIT-VERSION-GEN: allow running without input and output files
GIT-VERSION-GEN: simplify computing the dirty marker
setup: fix reinit of repos with incompatible GIT_DEFAULT_HASH
The exact same issue as described in the preceding commit also exists
for GIT_DEFAULT_HASH. Thus, reinitializing a repository that e.g. uses
SHA1 with `GIT_DEFAULT_HASH=sha256 git init` will cause the object
format of that repository to change to SHA256. This is of course bogus
as any existing objects and refs will not be converted, thus causing
repository corruption:
$ git init repo
Initialized empty Git repository in /tmp/repo/.git/
$ cd repo/
$ git commit --allow-empty -m message
[main (root-commit) 35a7344] message
$ GIT_DEFAULT_HASH=sha256 git init
Reinitialized existing Git repository in /tmp/repo/.git/
$ git show
fatal: your current branch appears to be broken
Fix the issue by ignoring the environment variable in case the repo has
already been initialized with an object hash.
Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
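The intended precedence can be sketched as a small pure function. The
name choose_object_format() and the "sha1" built-in default are
assumptions made for this sketch. The actual logic in setup.c is not
shown in this message.

```c
#include <stddef.h>

/*
 * Pick the object format for `git init`.
 * `existing` is the format already recorded in the repository, or NULL
 * for a fresh repository. `env_default` is the value of the environment
 * variable, or NULL when it is unset.
 */
static const char *choose_object_format(const char *existing,
					const char *env_default)
{
	if (existing)
		return existing;    /* reinit: keep what the repo already uses */
	if (env_default)
		return env_default; /* fresh init: honor the environment */
	return "sha1";              /* assumed built-in default for this sketch */
}
```

With this ordering, a reinit of a SHA1 repository with the variable set
to "sha256" still yields "sha1".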
setup: fix reinit of repos with incompatible GIT_DEFAULT_REF_FORMAT
The GIT_DEFAULT_REF_FORMAT environment variable can be set to influence
the default ref format that new repositories shall be initialized with.
While this is the expected behaviour when creating a new repository, it
is not when reinitializing a repository: we should retain the ref format
currently used by it in that case.
This doesn't work correctly right now:
$ git init --ref-format=files repo
Initialized empty Git repository in /tmp/repo/.git/
$ GIT_DEFAULT_REF_FORMAT=reftable git init repo
fatal: could not open '/tmp/repo/.git/refs/heads' for writing: Is a directory
Instead of retaining the current ref format, the reinitialization tries
to reinitialize the repository with the different format. This action
fails when git-init(1) tries to write the ".git/refs/heads" stub, which
in the context of the reftable backend is always written as a file so
that we can detect clients which inadvertently try to access the repo
with the wrong ref format. Seems like the protection mechanism works for
this case, as well.
Fix the issue by ignoring the environment variable in case the repo has
already been initialized with a ref storage format.
Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Phillip Wood [Thu, 30 Jan 2025 11:08:30 +0000 (11:08 +0000)]
apply: detect overflow when parsing hunk header
"git apply" uses strtoul() to parse the numbers in the hunk header but
silently ignores overflows. As ULONG_MAX is a legitimate return value for
strtoul() we need to set errno to zero before the call to strtoul() and
check that it is still zero afterwards. The error message we display is
not particularly helpful as it does not say what was wrong. However, it
seems pretty unlikely that users are going to trigger this error in
practice and we can always improve it later if needed.
Signed-off-by: Phillip Wood <phillip.wood@dunelm.org.uk> Signed-off-by: Junio C Hamano <gitster@pobox.com>
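The errno-based check described above can be sketched as follows. The
helper name parse_ulong() is hypothetical and is not the function used in
apply.c.

```c
#include <errno.h>
#include <stdlib.h>

/*
 * Parse a base-10 number with strtoul().
 * Returns 0 and stores the value in `*out` on success. Returns -1 if no
 * digits were consumed or if strtoul() reported an error via errno
 * (ERANGE on overflow).
 */
static int parse_ulong(const char *s, unsigned long *out)
{
	char *end;
	unsigned long val;

	errno = 0; /* must be cleared first: strtoul() never resets it */
	val = strtoul(s, &end, 10);
	if (end == s || errno != 0)
		return -1;
	*out = val;
	return 0;
}
```

Checking the return value alone is not enough, because the value returned
on overflow is also a valid result for an in-range input.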
We don't free the result of `remote_default_branch()`, leading to a
memory leak. This leak is exposed by t9211, but only when run with Meson
with the `-Db_sanitize=leak` option:
Direct leak of 5 byte(s) in 1 object(s) allocated from:
#0 0x5555555cfb93 in malloc (scalar+0x7bb93)
#1 0x5555556b05c2 in do_xmalloc ../wrapper.c:55:8
#2 0x5555556b06c4 in do_xmallocz ../wrapper.c:89:8
#3 0x5555556b0656 in xmallocz ../wrapper.c:97:9
#4 0x5555556b0728 in xmemdupz ../wrapper.c:113:16
#5 0x5555556b07a7 in xstrndup ../wrapper.c:119:9
#6 0x5555555d3a4b in remote_default_branch ../scalar.c:338:14
#7 0x5555555d20e6 in cmd_clone ../scalar.c:493:28
#8 0x5555555d196b in cmd_main ../scalar.c:992:14
#9 0x5555557c4059 in main ../common-main.c:64:11
#10 0x7ffff7a2a1fb in __libc_start_call_main (/nix/store/h7zcxabfxa7v5xdna45y2hplj31ncf8a-glibc-2.40-36/lib/libc.so.6+0x2a1fb) (BuildId: 0a855678aa0cb573cecbb2bcc73ab8239ec472d0)
#11 0x7ffff7a2a2b8 in __libc_start_main@GLIBC_2.2.5 (/nix/store/h7zcxabfxa7v5xdna45y2hplj31ncf8a-glibc-2.40-36/lib/libc.so.6+0x2a2b8) (BuildId: 0a855678aa0cb573cecbb2bcc73ab8239ec472d0)
#12 0x555555592054 in _start (scalar+0x3e054)
DEDUP_TOKEN: __interceptor_malloc--do_xmalloc--do_xmallocz--xmallocz--xmemdupz--xstrndup--remote_default_branch--cmd_clone--cmd_main--main--__libc_start_call_main--__libc_start_main@GLIBC_2.2.5--_start
SUMMARY: LeakSanitizer: 5 byte(s) leaked in 1 allocation(s).
As the `branch` variable may contain a string constant obtained from
parsing command line arguments we cannot free the leaking variable
directly. Instead, introduce a new `branch_to_free` variable that only
ever gets assigned the allocated string and free that one to plug the
leak.
It is unclear why the leak isn't flagged when running the test via our
Makefile.
Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
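The `branch_to_free` pattern can be sketched like this. The functions
lookup_default_branch() and pick_branch() are hypothetical stand-ins and
do not reproduce the code in scalar.c.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a lookup that returns an allocated string. */
static char *lookup_default_branch(void)
{
	char *s = malloc(5);

	if (s)
		memcpy(s, "main", 5);
	return s;
}

/*
 * `requested` may point at memory we must not free (e.g. from argv).
 * Copies the chosen branch name into `out`; returns 0 on success.
 */
static int pick_branch(const char *requested, char *out, size_t outsz)
{
	const char *branch = requested;
	char *branch_to_free = NULL; /* only ever holds allocated memory */
	int ret = -1;

	if (!branch)
		branch = branch_to_free = lookup_default_branch();
	if (branch && strlen(branch) < outsz) {
		strcpy(out, branch);
		ret = 0;
	}
	free(branch_to_free); /* free(NULL) is a no-op */
	return ret;
}
```

Keeping the read-only alias and the owning pointer separate means the
cleanup never has to guess where the string came from.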
When trying to create a Unix socket in a path that exceeds the maximum
socket name length we try to first change the directory into the parent
folder before creating the socket to reduce the length of the name. When
this fails we error out of `unix_sockaddr_init()` with an error code,
which indicates to the caller that the context has not been initialized.
Consequently, they don't release that context.
This leads to a memory leak: when we have already populated the context
with the original directory that we need to chdir(3p) back into, but
then the chdir(3p) into the socket's parent directory fails, then we
won't release the original directory's path. The leak is exposed by
t0301, but only when running tests in a directory hierarchy whose path
is long enough to make the socket name length exceed the maximum socket
name length:
Direct leak of 129 byte(s) in 1 object(s) allocated from:
#0 0x5555555e85c6 in realloc.part.0 lsan_interceptors.cpp.o
#1 0x55555590e3d6 in xrealloc ../wrapper.c:140:8
#2 0x5555558c8fc6 in strbuf_grow ../strbuf.c:114:2
#3 0x5555558cacab in strbuf_getcwd ../strbuf.c:605:3
#4 0x555555923ff6 in unix_sockaddr_init ../unix-socket.c:65:7
#5 0x555555923e42 in unix_stream_connect ../unix-socket.c:84:6
#6 0x55555562a984 in send_request ../builtin/credential-cache.c:46:11
#7 0x55555562a89e in do_cache ../builtin/credential-cache.c:108:6
#8 0x55555562a655 in cmd_credential_cache ../builtin/credential-cache.c:178:3
#9 0x555555700547 in run_builtin ../git.c:480:11
#10 0x5555556ff0e0 in handle_builtin ../git.c:740:9
#11 0x5555556ffee8 in run_argv ../git.c:807:4
#12 0x5555556fee6b in cmd_main ../git.c:947:19
#13 0x55555593f689 in main ../common-main.c:64:11
#14 0x7ffff7a2a1fb in __libc_start_call_main (/nix/store/h7zcxabfxa7v5xdna45y2hplj31ncf8a-glibc-2.40-36/lib/libc.so.6+0x2a1fb) (BuildId: 0a855678aa0cb573cecbb2bcc73ab8239ec472d0)
#15 0x7ffff7a2a2b8 in __libc_start_main@GLIBC_2.2.5 (/nix/store/h7zcxabfxa7v5xdna45y2hplj31ncf8a-glibc-2.40-36/lib/libc.so.6+0x2a2b8) (BuildId: 0a855678aa0cb573cecbb2bcc73ab8239ec472d0)
#16 0x5555555ad1d4 in _start (git+0x591d4)
DEDUP_TOKEN: ___interceptor_realloc.part.0--xrealloc--strbuf_grow--strbuf_getcwd--unix_sockaddr_init--unix_stream_connect--send_request--do_cache--cmd_credential_cache--run_builtin--handle_builtin--run_argv--cmd_main--main--__libc_start_call_main--__libc_start_main@GLIBC_2.2.5--_start
SUMMARY: LeakSanitizer: 129 byte(s) leaked in 1 allocation(s).
Fix this leak.
Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Calvin Wan [Wed, 29 Jan 2025 21:50:44 +0000 (13:50 -0800)]
libgit: add higher-level libgit crate
The C functions exported by libgit-sys do not provide an idiomatic Rust
interface. To make it easier to use these functions via Rust, add a
higher-level "libgit" crate, that wraps the lower-level configset API
with an interface that is more Rust-y.
This combination of $X and $X-sys crates is a common pattern for FFI in
Rust, as documented in "The Cargo Book" [1].
Josh Steadmon [Wed, 29 Jan 2025 21:50:43 +0000 (13:50 -0800)]
libgit-sys: also export some config_set functions
In preparation for implementing a higher-level Rust API for accessing
Git configs, export some of the upstream configset API via libgitpub and
libgit-sys. Since this will be exercised as part of the higher-level API
in the next commit, no tests have been added for libgit-sys.
While we're at it, add git_configset_alloc() and git_configset_free()
functions in libgitpub so that callers can manage config_set structs on
the heap. This also allows non-C external consumers to treat config_sets
as opaque structs.
Co-authored-by: Calvin Wan <calvinwan@google.com> Signed-off-by: Calvin Wan <calvinwan@google.com> Signed-off-by: Josh Steadmon <steadmon@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
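The heap-managed, opaque-struct approach can be sketched on the C side
as follows. The toy_configset_* names and the struct contents are invented
for illustration. They are not the libgitpub functions.

```c
#include <stdlib.h>

/* A public header would expose only this forward declaration. */
struct toy_config_set;

/* The definition stays private to the implementation file. */
struct toy_config_set {
	int nr_entries;
};

/* Allocate a zero-initialized set on the heap; NULL on failure. */
struct toy_config_set *toy_configset_alloc(void)
{
	return calloc(1, sizeof(struct toy_config_set));
}

/* Release a set obtained from toy_configset_alloc(). */
void toy_configset_free(struct toy_config_set *cs)
{
	free(cs);
}

/* Accessor, so that callers never need the struct layout. */
int toy_configset_count(const struct toy_config_set *cs)
{
	return cs->nr_entries;
}
```

Because callers only hold a pointer, a foreign-language binding does not
need to know the size or layout of the struct.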
Junio C Hamano [Wed, 29 Jan 2025 22:05:09 +0000 (14:05 -0800)]
Merge branch 'ja/doc-commit-markup-updates'
Doc updates.
* ja/doc-commit-markup-updates:
doc: migrate git-commit manpage secondary files to new format
doc: convert git commit config to new format
doc: make more direct explanations in git commit options
doc: the mode param of -u of git commit is optional
doc: apply new documentation guidelines to git commit
Junio C Hamano [Wed, 29 Jan 2025 22:05:08 +0000 (14:05 -0800)]
Merge branch 'ds/path-walk-1'
Introduce a new API to visit objects in batches based on a common
path, or by type.
* ds/path-walk-1:
path-walk: drop redundant parse_tree() call
path-walk: reorder object visits
path-walk: mark trees and blobs as UNINTERESTING
path-walk: visit tags and cached objects
path-walk: allow consumer to specify object types
t6601: add helper for testing path-walk API
test-lib-functions: add test_cmp_sorted
path-walk: introduce an object walk by path
Josh Steadmon [Tue, 28 Jan 2025 22:01:38 +0000 (14:01 -0800)]
libgit-sys: introduce Rust wrapper for libgit.a
Introduce libgit-sys, a Rust wrapper crate that allows Rust code to call
functions in libgit.a. This initial patch defines build rules and an
interface that exposes user agent string getter functions as a proof of
concept. This library can be tested with `cargo test`. In later commits,
a higher-level library containing a more Rust-friendly interface will be
added at `contrib/libgit-rs`.
Symbols in libgit can collide with symbols from other libraries such as
libgit2. We avoid this by first exposing library symbols in
public_symbol_export.[ch]. These symbols are prepended with "libgit_" to
avoid collisions and set to visible using a visibility pragma. In
build.rs, Rust builds contrib/libgit-rs/libgit-sys/libgitpub.a, which also
contains libgit.a and other dependent libraries, with
-fvisibility=hidden to hide all symbols within those libraries that
haven't been exposed with a visibility pragma.
Co-authored-by: Kyle Lippincott <spectral@google.com> Co-authored-by: Calvin Wan <calvinwan@google.com> Signed-off-by: Calvin Wan <calvinwan@google.com> Signed-off-by: Kyle Lippincott <spectral@google.com> Signed-off-by: Josh Steadmon <steadmon@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
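A rough sketch of the export technique, assuming GCC or Clang and a build
with -fvisibility=hidden, is shown below. The function names are made up
for illustration and are not the contents of public_symbol_export.[ch].

```c
/*
 * Internal function: with -fvisibility=hidden on the command line it is
 * not exported from the resulting library.
 */
const char *internal_user_agent(void)
{
	return "toy/1.0";
}

/* Declarations inside this region get default (exported) visibility. */
#pragma GCC visibility push(default)
const char *toylib_user_agent(void);
#pragma GCC visibility pop

/* Prefixed wrapper: the only symbol meant to be seen by consumers. */
const char *toylib_user_agent(void)
{
	return internal_user_agent();
}
```

The prefix avoids name clashes with other libraries, while the hidden
default keeps the unprefixed internals out of the exported symbol table.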
Josh Steadmon [Tue, 28 Jan 2025 22:01:37 +0000 (14:01 -0800)]
common-main: split init and exit code into new files
Currently, object files in libgit.a reference common_exit(), which is
contained in common-main.o. However, common-main.o also includes main(),
which references cmd_main() in git.o, which in turn depends on all the
builtin/*.o objects.
We would like to allow external users to link libgit.a without needing
to include so many extra objects. Enable this by splitting common_exit()
and check_bug_if_BUG() into a new file common-exit.c, and add
common-exit.o to LIB_OBJS so that these are included in libgit.a.
This split has previously been proposed ([1], [2]) to support fuzz tests
and unit tests by avoiding conflicting definitions for main(). However,
both of those issues were resolved by other methods of avoiding symbol
conflicts. Now we are trying to make libgit.a more self-contained, so
hopefully we can revisit this approach.
Additionally, move the initialization code out of main() into a new
init_git() function in its own file. Include this in libgit.a as well,
so that external users can share our setup code without calling our
main().
We don't yet have any test coverage for the new zlib-ng backend as part
of our CI. Add it by installing zlib-ng in Alpine Linux, which causes
Meson to pick it up automatically.
Note that we are somewhat limited with regards to where we run that job:
Debian-based distributions don't have zlib-ng in their repositories,
Fedora has it but doesn't run tests, and Alma Linux doesn't have the
package either. Alpine Linux does have it available and is running our
test suite, which is why it was picked.
Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Switch over the "linux-musl" job to use Meson instead of Makefiles. This
is done for multiple reasons:
- It simplifies our CI infrastructure a bit as we don't have to
manually specify a couple of build options anymore.
- It verifies that Meson detects and sets those build options
automatically.
- It makes it easier for us to wire up a new CI job using zlib-ng as
backend.
One platform compatibility that Meson cannot easily detect automatically
is the `GIT_TEST_UTF8_LOCALE` variable used in tests. Wire up a build
option for it, which we set via a new "MESONFLAGS" environment variable.
Note that we also drop the CC variable, which is set to "gcc". We
already default to GCC when CC is unset in "ci/lib.sh", so this is not
needed.
Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
The zlib-ng library is a hard fork of the old and venerable zlib
library. It describes itself as a zlib replacement with optimizations for
"next generation" systems. As such, it contains several implementations
of central algorithms using for example SSE2, AVX2 and other vectorized
CPU intrinsics that supposedly speed up in- and deflating data.
And indeed, compiling Git against zlib-ng leads to a significant speedup
when reading objects. The following benchmark uses git-cat-file(1) with
`--batch --batch-all-objects` in the Git repository:
Benchmark 1: zlib
Time (mean ± σ): 52.085 s ± 0.141 s [User: 51.500 s, System: 0.456 s]
Range (min … max): 52.004 s … 52.335 s 5 runs
Benchmark 2: zlib-ng
Time (mean ± σ): 40.324 s ± 0.134 s [User: 39.731 s, System: 0.490 s]
Range (min … max): 40.135 s … 40.484 s 5 runs
Summary
zlib-ng ran
1.29 ± 0.01 times faster than zlib
So we're looking at a ~1.29x speedup (roughly 23% less runtime) compared to zlib. This is of course
an extreme example, as it makes us read through all objects in the
repository. But regardless, it should be possible to see some sort of
speedup in most commands that end up accessing the object database.
The zlib-ng library provides a compatibility layer that makes it a
proper drop-in replacement for zlib: nothing needs to change in the
build system to support it. Unfortunately though, this mode isn't easy
to use on most systems because distributions do not allow you to install
zlib-ng in that way, as that would mean that the zlib library would be
globally replaced. Instead, many distributions provide a package that
installs zlib-ng without the compatibility layer. This version does
provide effectively the same APIs as zlib does, but all of the symbols
are prefixed with `zng_` to avoid symbol collisions.
Implement a new build option that allows us to link against zlib-ng
directly. If set, we redefine zlib symbols so that we use the `zng_`
prefixed versions thereof provided by that library. Like this, it
becomes possible to install both zlib and zlib-ng (without the compat
layer) and then pick whichever library one wants to link against for
Git.
Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
git-zlib: cast away potential constness of `next_in` pointer
The `struct git_zstream::next_in` variable points to the input data and
is used in combination with `struct z_stream::next_in`. While that
latter field is not marked as a constant in zlib, it is marked as such
in zlib-ng. This causes a couple of compiler errors when we try to
assign these fields to one another due to mismatching constness.
Fix the issue by casting away the potential constness of `next_in`.
Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
compat/zlib: provide stubs for `deflateSetHeader()`
The function `deflateSetHeader()` has been introduced with zlib v1.2.2.1,
so we don't use it when linking against an older version of it. Refactor
the code to instead provide a central stub via "compat/zlib.h" so that
we can adapt it based on whether or not we use zlib-ng in a subsequent
commit.
Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
compat/zlib: provide `deflateBound()` shim centrally
The `deflateBound()` function has only been introduced with zlib 1.2.0.
When linking against a zlib version older than that we thus provide our
own compatibility shim. Move this shim into "compat/zlib.h" so that we
can adapt it based on whether or not we use zlib-ng in a subsequent
commit.
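
As an illustration of what such a shim can look like, the following
sketch computes a conservative upper bound on the deflated size from
the input length alone. The name `fallback_deflate_bound()` is
hypothetical, and the formula is an assumption modelled on the
conservative estimate zlib uses for a zlib-wrapped stream; it is not a
quote of "compat/zlib.h".

```c
#include <stddef.h>

/*
 * Hypothetical fallback: worst-case deflated size for `size` input
 * bytes. Assumed formula: the input length, plus ceil(size / 8) and
 * ceil(size / 64) bytes of block overhead, plus a fixed 11 bytes.
 */
static size_t fallback_deflate_bound(size_t size)
{
	return size + ((size + 7) >> 3) + ((size + 63) >> 6) + 11;
}
```

A caller would use the result only to size an output buffer, so
overestimating is harmless while underestimating is not.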
Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
git-compat-util: move include of "compat/zlib.h" into "git-zlib.h"
We include "compat/zlib.h" in "git-compat-util.h", which is
unnecessarily broad given that we only have a small handful of files
that use the zlib library. Move the header into "git-zlib.h" instead and
adapt users of zlib to include that header.
One exception is the reftable library, as we don't want to use the
Git-specific wrapper of zlib there, so we include "compat/zlib.h"
instead. Furthermore, we move the include into "reftable/system.h" so
that users of the library other than Git can wire up zlib themselves.
Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Introduce a new "compat/zlib-compat.h" header that we include instead of
including <zlib.h> directly. This will allow us to wire up zlib-ng as an
alternative backend for zlib compression in a subsequent commit.
Note that we cannot just call the file "compat/zlib.h", as that may
otherwise cause us to include that file instead of <zlib.h>.
Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Before including <zlib.h> we explicitly define `z_const` to an empty
value. This has the effect that the `z_const` macro in "zconf.h" itself
will remain empty instead of being defined as `const`, which effectively
adapts a couple of APIs so that their parameters are not marked as being
constants.
It is dubious though whether this is something we actually want: not
marking a parameter as a constant doesn't make it any less constant than
it was. The define was added via 07564773c2 (compat: auto-detect if zlib
has uncompress2(), 2022-01-24), where it was seemingly carried over from
our internal compatibility shim for `uncompress2()` that was removed in
the preceding commit. The commit message doesn't mention why we carry
over the define and make it public, either, and I cannot think of any
reason for why we would want to have it.
Drop the define.
Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Our compat library has an implementation of zlib's `uncompress2()`
function that gets used when linking against an old version of zlib
that doesn't yet have it. The last user of `uncompress2()` got removed
in 15a60b747e (reftable/block: open-code call to `uncompress2()`,
2024-04-08), so the compatibility code is not required anymore. Drop it.
Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Junio C Hamano [Tue, 28 Jan 2025 21:02:24 +0000 (13:02 -0800)]
Merge branch 'ps/reftable-sign-compare'
The reftable/ library code has been made -Wsign-compare clean.
* ps/reftable-sign-compare:
reftable: address trivial -Wsign-compare warnings
reftable/blocksource: adjust `read_block()` to return `ssize_t`
reftable/blocksource: adjust type of the block length
reftable/block: adjust type of the restart length
reftable/block: adapt header and footer size to return a `size_t`
reftable/basics: adjust `hash_size()` to return `uint32_t`
reftable/basics: adjust `common_prefix_size()` to return `size_t`
reftable/record: handle overflows when decoding varints
reftable/record: drop unused `print` function pointer
meson: stop disabling -Wsign-compare
Junio C Hamano [Tue, 28 Jan 2025 21:02:22 +0000 (13:02 -0800)]
Merge branch 'sk/unit-tests'
Move a few more unit tests to the clar test framework.
* sk/unit-tests:
t/unit-tests: convert reftable tree test to use clar test framework
t/unit-tests: adapt priority queue test to use clar test framework
t/unit-tests: convert mem-pool test to use clar test framework
t/unit-tests: handle dashes in test suite filenames
Junio C Hamano [Tue, 28 Jan 2025 21:02:22 +0000 (13:02 -0800)]
Merge branch 'jc/show-usage-help'
The help text from "git $cmd -h" appears on the standard output for
some $cmd and the standard error for others. The built-in commands
have been fixed to show them on the standard output consistently.
* jc/show-usage-help:
builtin: send usage() help text to standard output
oddballs: send usage() help text to standard output
builtins: send usage_with_options() help text to standard output
usage: add show_usage_if_asked()
parse-options: add show_usage_with_options_if_asked()
t0012: optionally check that "-h" output goes to stdout
Derrick Stolee [Mon, 27 Jan 2025 19:02:34 +0000 (19:02 +0000)]
pack-objects: prevent name hash version change
When the --name-hash-version option is used in 'git pack-objects', it
can change from the initial assignment to when it is used based on
interactions with other arguments. Specifically, when writing or reading
bitmaps, we must force version 1 for now. This could change in the
future when the bitmap format can store a name hash version value,
indicating which was used during the writing of the packfile.
Protect the 'git pack-objects' process from getting confused by failing
with a BUG() statement if the value of the name hash version changes
between calls to pack_name_hash_fn().
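
The guard described above can be sketched as follows. This is a minimal
illustration, not the code from the patch: the helper name
`check_name_hash_version_stable()` is hypothetical, the global is a
stand-in for the parsed option value, and the sketch returns an error
where the real code is said to fail with BUG().

```c
#include <stdio.h>

/* Hypothetical stand-in for the process-wide option value. */
static int name_hash_version = 1;

/*
 * Returns 0 while the version matches the one seen on the first
 * call and -1 once it differs. Per the commit message, the real
 * code aborts via BUG(); an error return keeps this sketch testable.
 */
static int check_name_hash_version_stable(void)
{
	static int seen_version = -1;

	if (seen_version < 0) {
		seen_version = name_hash_version;
	} else if (seen_version != name_hash_version) {
		fprintf(stderr, "name hash version changed from %d to %d\n",
			seen_version, name_hash_version);
		return -1;
	}
	return 0;
}
```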
Signed-off-by: Derrick Stolee <stolee@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Derrick Stolee [Mon, 27 Jan 2025 19:02:33 +0000 (19:02 +0000)]
test-tool: add helper for name-hash values
Add a new test-tool helper, name-hash, to output the value of the
name-hash algorithms for the input list of strings, one per line.
Since the name-hash values can be stored in the .bitmap files, it is
important that these hash functions do not change across Git versions.
Add a simple test to t5310-pack-bitmaps.sh to provide some testing of
the current values. Due to how these functions are implemented, it would
be difficult to change them without disturbing these values. The paths
used for this test are carefully selected to demonstrate some of the
behavior differences of the two current name hash versions, including
which conditions will cause them to collide.
Create a performance test that uses test_size to demonstrate how
collisions occur for these hash algorithms. This test helps inform
someone as to the behavior of the name-hash algorithms for their repo
based on the paths at HEAD.
My copy of the Git repository shows modest statistics around the
collisions of the default name-hash algorithm:
Test this tree
--------------------------------------------------
5314.1: paths at head 4.5K
5314.2: distinct hash value: v1 4.1K
5314.3: maximum multiplicity: v1 13
5314.4: distinct hash value: v2 4.2K
5314.5: maximum multiplicity: v2 9
Here, the maximum collision multiplicity is 13, but around 10% of paths
have a collision with another path.
In a more interesting example, the microsoft/fluentui [1] repo had these
statistics at time of committing:
Test this tree
--------------------------------------------------
5314.1: paths at head 19.5K
5314.2: distinct hash value: v1 8.2K
5314.3: maximum multiplicity: v1 279
5314.4: distinct hash value: v2 17.8K
5314.5: maximum multiplicity: v2 44
[1] https://github.com/microsoft/fluentui
That demonstrates that the nearly twenty thousand path names are
assigned only around eight thousand distinct values. 279 paths are
assigned to a single value, leading the packing algorithm to sort
objects from those paths together, by size.
With the v2 name hash function, the maximum multiplicity lowers to 44,
leaving some room for further improvement.
In a more extreme example, an internal monorepo had a much worse
collision rate:
Test this tree
--------------------------------------------------
5314.1: paths at head 227.3K
5314.2: distinct hash value: v1 72.3K
5314.3: maximum multiplicity: v1 14.4K
5314.4: distinct hash value: v2 166.5K
5314.5: maximum multiplicity: v2 138
Here, we can see that the v2 name hash function provides some
improvements, but there are still a number of collisions that could lead
to repacking problems at this scale.
Signed-off-by: Derrick Stolee <stolee@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Derrick Stolee [Mon, 27 Jan 2025 19:02:32 +0000 (19:02 +0000)]
p5313: add size comparison test
As custom options are added to 'git pack-objects' and 'git repack' to
adjust how compression is done, use this new performance test script to
demonstrate their effectiveness in performance and size.
The recently-added --name-hash-version option allows for testing
different name hash functions. Version 2 intends to preserve some of the
locality of version 1 while more often breaking collisions due to long
filenames.
Distinguishing objects by more of the path is critical when there are
many name hash collisions and several versions of the same path in the
full history, giving a significant boost to the full repack case. The
locality of the hash function is critical to compressing something like
a shallow clone or a thin pack representing a push of a single commit.
This can be seen by running p5313 on the open source fluentui
repository [1]. Most commits will have this kind of output for the thin
and big pack cases, though certain commits (such as [2]) will have
problematic thin pack size for other reasons.
Checked out at the parent of [2], I see the following statistics:
Test HEAD
---------------------------------------------------------------
5313.2: thin pack with version 1 0.37(0.44+0.02)
5313.3: thin pack size with version 1 1.2M
5313.4: big pack with version 1 2.04(7.77+0.23)
5313.5: big pack size with version 1 20.4M
5313.6: shallow fetch pack with version 1 1.41(2.94+0.11)
5313.7: shallow pack size with version 1 34.4M
5313.8: repack with version 1 95.70(676.41+2.87)
5313.9: repack size with version 1 439.3M
5313.10: thin pack with version 2 0.12(0.12+0.06)
5313.11: thin pack size with version 2 22.0K
5313.12: big pack with version 2 2.80(5.43+0.34)
5313.13: big pack size with version 2 25.9M
5313.14: shallow fetch pack with version 2 1.77(2.80+0.19)
5313.15: shallow pack size with version 2 33.7M
5313.16: repack with version 2 33.68(139.52+2.58)
5313.17: repack size with version 2 160.5M
To make comparisons easier, I will reformat this output into a different
table style:
| Test | V1 Time | V2 Time | V1 Size | V2 Size |
|--------------|---------|---------|---------|---------|
| Thin Pack | 0.37 s | 0.12 s | 1.2 M | 22.0 K |
| Big Pack | 2.04 s | 2.80 s | 20.4 M | 25.9 M |
| Shallow Pack | 1.41 s | 1.77 s | 34.4 M | 33.7 M |
| Repack | 95.70 s | 33.68 s | 439.3 M | 160.5 M |
The v2 hash function successfully differentiates the CHANGELOG.md files
from each other, which leads to significant improvements in the thin
pack (simulating a push of this commit) and the full repack. There is
some bloat in the "big pack" scenario and essentially the same results
for the shallow pack.
In the case of the Git repository, these numbers show some of the issues
with this approach:
| Test | V1 Time | V2 Time | V1 Size | V2 Size |
|--------------|---------|---------|---------|---------|
| Thin Pack | 0.02 s | 0.02 s | 1.1 K | 1.1 K |
| Big Pack | 1.69 s | 1.95 s | 13.5 M | 14.5 M |
| Shallow Pack | 1.26 s | 1.29 s | 12.0 M | 12.2 M |
| Repack | 29.51 s | 29.01 s | 237.7 M | 238.2 M |
Here, the attempts to remove conflicts in the v2 function seem to cause
slight bloat to these sizes. This shows that the Git repository benefits
a lot from cross-path delta pairs.
The results are similar with the nodejs/node repo:
| Test | V1 Time | V2 Time | V1 Size | V2 Size |
|--------------|---------|---------|---------|---------|
| Thin Pack | 0.02 s | 0.02 s | 1.6 K | 1.6 K |
| Big Pack | 4.61 s | 3.26 s | 56.0 M | 52.8 M |
| Shallow Pack | 7.82 s | 7.51 s | 104.6 M | 107.0 M |
| Repack | 88.90 s | 73.75 s | 740.1 M | 764.5 M |
Here, the v2 name-hash causes some size bloat more often than it reduces
the size, but it also universally improves performance time, which is an
interesting reversal. This must mean that it is helping to short-circuit
some delta computations even if it is not finding the most efficient
ones. The performance improvement cannot be explained only due to the
I/O cost of writing the resulting packfile.
The Linux kernel repository was the initial target of the default name
hash value, and its naming conventions are practically built to take the
most advantage of the default name hash values:
| Test | V1 Time | V2 Time | V1 Size | V2 Size |
|--------------|----------|----------|---------|---------|
| Thin Pack | 0.17 s | 0.07 s | 4.6 K | 4.6 K |
| Big Pack | 17.88 s | 12.35 s | 201.1 M | 159.1 M |
| Shallow Pack | 11.05 s | 22.94 s | 269.2 M | 273.8 M |
| Repack | 727.39 s | 566.95 s | 2.5 G | 2.5 G |
Here, the thin and big packs gain some performance boosts in time, with
a modest gain in the size of the big pack. The shallow pack, however, is
more expensive to compute, likely because similarly-named files across
different directories are farther apart in the name hash ordering in v2.
The repack also gains benefits in computation time but no meaningful
change to the full size.
Finally, an internal Javascript repo of moderate size shows significant
gains when repacking with --name-hash-version=2 due to it having many name
hash collisions. However, it's worth noting that only the full repack
case has significant differences from the v1 name hash:
| Test | V1 Time | V2 Time | V1 Size | V2 Size |
|-----------|-----------|----------|---------|---------|
| Thin Pack | 8.28 s | 7.28 s | 16.8 K | 16.8 K |
| Big Pack | 12.81 s | 11.66 s | 29.1 M | 29.1 M |
| Shallow | 4.86 s | 4.06 s | 42.5 M | 44.1 M |
| Repack | 3126.50 s | 496.33 s | 6.2 G | 855.6 M |
Signed-off-by: Derrick Stolee <stolee@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Derrick Stolee [Mon, 27 Jan 2025 19:02:31 +0000 (19:02 +0000)]
pack-objects: add GIT_TEST_NAME_HASH_VERSION
Add a new environment variable to opt-in to different values of the
--name-hash-version=<n> option in 'git pack-objects'. This allows for
extra testing of the feature without repeating all of the test
scenarios. Unlike many GIT_TEST_* variables, we are choosing to not add
this to the linux-TEST-vars CI build as that test run is already
overloaded. The behavior exposed by this test variable is of low risk
and should be sufficient to allow manual testing when an issue arises.
But this option isn't free. There are a few tests that change behavior
with the variable enabled.
First, there are a few tests that are very sensitive to certain delta
bases being picked. These are both involving the generation of thin
bundles and then counting their objects via 'git index-pack --fix-thin'
which pulls the delta base into the new packfile. For these tests,
disable the option as a decent long-term option.
Second, there are some tests that compare the exact output of a 'git
pack-objects' process when using bitmaps. The warning that ignores the
--name-hash-version=2 and forces version 1 causes these tests to fail.
Disable the environment variable to get around this issue.
Signed-off-by: Derrick Stolee <stolee@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Derrick Stolee [Mon, 27 Jan 2025 19:02:30 +0000 (19:02 +0000)]
repack: add --name-hash-version option
The new '--name-hash-version' option for 'git repack' is a simple
pass-through to the underlying 'git pack-objects' subcommand. However,
this subcommand may have other options and a temporary filename as part
of the subcommand execution that may not be predictable or could change
over time.
The existing test_subcommand method requires an exact list of arguments
for the subcommand. This is too rigid for our needs here, so create a
new method, test_subcommand_flex. Use it to check that the
--name-hash-version option is passing through.
Since we are modifying the 'git repack' command, let's bring its usage
in line with the Documentation's synopsis. This removes it from the
allow list in t0450 so it will remain in sync in the future.
Signed-off-by: Derrick Stolee <stolee@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Derrick Stolee [Mon, 27 Jan 2025 19:02:29 +0000 (19:02 +0000)]
pack-objects: add --name-hash-version option
The previous change introduced a new pack_name_hash_v2() function that
intends to satisfy much of the hash locality features of the existing
pack_name_hash() function while also distinguishing paths with similar
final components of their paths.
This change adds a new --name-hash-version option for 'git pack-objects'
to allow users to select their preferred function version. This use of
an integer version allows for future expansion and a direct way to later
store a name hash version in the .bitmap format.
For now, let's consider how effective this mechanism is when repacking a
repository with different name hash versions. Specifically, we will
execute 'git pack-objects' the same way a 'git repack -adf' process
would, except we include --name-hash-version=<n> for testing.
On the Git repository, we do not expect much difference. All path names
are short. This is backed by our results:
This example demonstrates how there is some natural overhead coming from
the cloned copy because the server is hosting many forks and has not
optimized for exactly this set of reachable objects. But the full repack
has similar characteristics for both versions.
Let's consider some repositories that are hitting too many collisions
with version 1. First, let's explore the kinds of paths that are
commonly causing these collisions:
* "/CHANGELOG.json" is 15 characters, and is created by the beachball
[1] tool. Only the final character of the parent directory can
differentiate different versions of this file, but also only the two
most-significant digits. If that character is a letter, then this is
always a collision. Similar issues occur with the similar
"/CHANGELOG.md" path, though there is more opportunity for
differences in the parent directory.
* Localization files frequently have common filenames but
are differentiated via parent directories. In C#, the name
"/strings.resx.lcl" is used for these localization files and they
will all collide in name-hash.
[1] https://github.com/microsoft/beachball
I've come across many other examples where some internal tool uses a
common name across multiple directories and is causing Git to repack
poorly due to name-hash collisions.
One open-source example is the fluentui [2] repo, which uses beachball
to generate CHANGELOG.json and CHANGELOG.md files, and these files have
very poor delta characteristics when comparing against versions across
parent directories.
In this example, we see significant gains in the compressed packfile
size as well as the time taken to compute the packfile.
Using a collection of repositories that use the beachball tool, I was
able to make similar comparisons with dramatic results. While the
fluentui repo is public, the others are private so cannot be shared for
reproduction. The results are so significant that I find it important to
share here:
Future changes could include making --name-hash-version implied by a config
value or even implied by default during a full repack.
It is important to point out that the name hash value is stored in the
.bitmap file format, so we must force --name-hash-version=1 when bitmaps
are being read or written. Later, the bitmap format could be updated to
be aware of the name hash version so deltas can be quickly computed
across the bitmapped/not-bitmapped boundary. To promote the safety of
this parameter, the validate_name_hash_version() method will die() if
the given name-hash version is incorrect and will disable newer versions
if not yet compatible with other features, such as --write-bitmap-index.
Signed-off-by: Derrick Stolee <stolee@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Jonathan Tan [Mon, 27 Jan 2025 19:02:28 +0000 (19:02 +0000)]
pack-objects: create new name-hash function version
As we will explore in later changes, the default name-hash function used
in 'git pack-objects' has a tendency to cause collisions and cause poor
delta selection. This change creates an alternative that avoids some
collisions while preserving some amount of hash locality.
The pack_name_hash() method has not been materially changed since it was
introduced in ce0bd64 (pack-objects: improve path grouping
heuristics., 2006-06-05). The intention here is to group objects by path
name, but also attempt to group similar file types together by making
the most-significant digits of the hash be focused on the final
characters.
Here's the crux of the implementation:
    /*
     * This effectively just creates a sortable number from the
     * last sixteen non-whitespace characters. Last characters
     * count "most", so things that end in ".c" sort together.
     */
    while ((c = *name++) != 0) {
        if (isspace(c))
            continue;
        hash = (hash >> 2) + (c << 24);
    }
As the comment mentions, this only cares about the last sixteen
non-whitespace characters. This causes some filenames to collide more than
others. This collision is somewhat by design in order to promote hash
locality for files that have similar types (.c, .h, .json) or could be the
same file across a directory rename (a/foo.txt to b/foo.txt). This leads to
decent cross-path deltas in cases like shallow clones or packing a
repository with very few historical versions of files that share common data
with other similarly-named files.
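
To experiment with this behavior outside of Git, the loop above can be
wrapped into a standalone function. This is a sketch: the name
`name_hash_v1()` is hypothetical, and the cast to `unsigned char` is an
assumption made here so that non-ASCII bytes are not sign-extended.

```c
#include <ctype.h>
#include <stdint.h>

/* Standalone wrapper around the loop quoted above. */
static uint32_t name_hash_v1(const char *name)
{
	uint32_t c, hash = 0;

	if (!name)
		return 0;
	while ((c = (unsigned char)*name++) != 0) {
		if (isspace(c))
			continue;
		/* Older characters drift toward the low bits. */
		hash = (hash >> 2) + (c << 24);
	}
	return hash;
}
```

With this, "x.c" and "y.c" share the same top byte and differ only in
lower bits, which is what makes files with the same ending sort next to
each other.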
However, when the name-hash instead leads to a large number of name-hash
collisions for otherwise unrelated files, this can lead to confusing the
delta calculation to prefer cross-path deltas over previous versions of the
same file.
The new pack_name_hash_v2() function attempts to fix this issue by
taking more of the directory path into account through its hash
function. Its naming implies that we will later wire up details for
choosing a name-hash function by version.
The first change is to be more careful about paths using non-ASCII
characters. With these characters in mind, reverse the bits in the byte
as the least-significant bits have the highest entropy and we want to
maximize their influence. This is done with some bit manipulation that
swaps the two halves, then the quarters within those halves, and then
the bits within those quarters.
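
The three swaps described above can be sketched for a single byte as
follows. The helper name `reverse_bits8()` is hypothetical; the masks
are the standard ones for this technique and are not copied from the
patch.

```c
#include <stdint.h>

/* Reverse the bit order of one byte using three mask-and-shift swaps. */
static uint8_t reverse_bits8(uint8_t c)
{
	/* Swap the two 4-bit halves. */
	c = (uint8_t)(((c & 0xF0) >> 4) | ((c & 0x0F) << 4));
	/* Swap the 2-bit quarters within each half. */
	c = (uint8_t)(((c & 0xCC) >> 2) | ((c & 0x33) << 2));
	/* Swap the single bits within each quarter. */
	c = (uint8_t)(((c & 0xAA) >> 1) | ((c & 0x55) << 1));
	return c;
}
```

For example, 'a' (0x61, binary 01100001) becomes 0x86 (binary
10000110), so its low bits end up in the most significant positions.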
The second change is to perform hash composition operations at every
level of the path. This is done by storing a 'base' hash value that
contains the hash of the parent directory. When reaching a directory
boundary, we XOR the current level's name-hash value with a downshift of
the previous level's hash. This perturbation intends to create low-bit
distinctions for paths with the same final 16 bytes but distinct parent
directory structures.
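
Putting both changes together, a function along these lines matches the
description. It is a sketch under stated assumptions rather than the
patch itself: the name `name_hash_v2()` is hypothetical, and the
downshift of 6 bits is an assumed value, since the text above only
says "a downshift".

```c
#include <ctype.h>
#include <stdint.h>

static uint32_t name_hash_v2(const char *name)
{
	const unsigned char *p = (const unsigned char *)name;
	uint32_t hash = 0, base = 0, c;

	if (!name)
		return 0;
	while ((c = *p++) != 0) {
		if (isspace(c))
			continue;
		if (c == '/') {
			/* Directory boundary: fold this level into the base. */
			base = (base >> 6) ^ hash; /* shift of 6 is assumed */
			hash = 0;
		} else {
			/* Reverse the byte's bits, then mix as in v1. */
			c = ((c & 0xF0) >> 4) | ((c & 0x0F) << 4);
			c = ((c & 0xCC) >> 2) | ((c & 0x33) << 2);
			c = ((c & 0xAA) >> 1) | ((c & 0x55) << 1);
			hash = (hash >> 2) + (c << 24);
		}
	}
	/* Perturb the final component's hash with the parent levels. */
	return (base >> 6) ^ hash;
}
```

With this sketch, "a/x" and "b/x" produce different values even though
their final components are identical, because the parent directory
perturbs the result.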
The collision rate and effectiveness of this hash function will be
explored in later changes as the function is integrated with 'git
pack-objects' and 'git repack'.
Signed-off-by: Jonathan Tan <jonathantanmy@google.com> Signed-off-by: Derrick Stolee <stolee@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Karthik Nayak [Mon, 27 Jan 2025 09:44:08 +0000 (10:44 +0100)]
refs/reftable: fix uninitialized memory access of `max_index`
When migrating reflogs between reference backends, maintaining the
original order of the reflog entries is crucial. To achieve this, an
`index` field is stored within the `ref_update` struct that encodes the
relative order of reflog entries. This field is used by the reftable
backend as update index for the respective reflog entries to maintain
that ordering.
These update indices must be respected when writing table headers, which
encode the minimum and maximum update index of contained records in the
header and footer. This logic was added in commit bc67b4ab5f (reftable:
write correct max_update_index to header, 2025-01-15), which started to
use `reftable_writer_set_limits()` to propagate the minimum and maximum
update index of all records contained in a ref transaction.
However, we only set the maximum update index for the first transaction
argument, even though there can be multiple such arguments. This is the
case when we write to multiple stacks in a single transaction, e.g. when
updating references in two different worktrees at once. Consequently,
the update index for all but the first argument remains uninitialized,
which may cause undefined behaviour.
Fix this by moving the assignment of the maximum update index in
`reftable_be_transaction_finish()` inside the loop, which ensures that
all elements of the array are correctly initialized.
Furthermore, initialize the `max_index` field to 0 when queueing a new
transaction argument. This is not strictly necessary, as all elements of
`write_transaction_table_arg.max_index` are now assigned correctly.
However, this initialization is added for consistency and to safeguard
against potential future changes that might inadvertently introduce
uninitialized memory access.
Reported-by: Johannes Schindelin <Johannes.Schindelin@gmx.de> Signed-off-by: Karthik Nayak <karthik.188@gmail.com> Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Bence Ferdinandy [Sun, 26 Jan 2025 22:02:11 +0000 (23:02 +0100)]
fetch set_head: fix non-mirror remotes in bare repositories
In b1b713f722 (fetch set_head: handle mirrored bare repositories,
2024-11-22) it was implicitly assumed that all remotes will be mirrors
in a bare repository, thus fetching a non-mirrored remote could lead to
HEAD pointing to a non-existent reference. Make sure we only overwrite
HEAD if we are in a bare repository and fetching from a mirror.
Otherwise, proceed as normal, and create
refs/remotes/<nonmirrorremote>/HEAD instead.
Reported-by: Christian Hesse <list@eworm.de> Signed-off-by: Bence Ferdinandy <bence@ferdinandy.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Bence Ferdinandy [Sun, 26 Jan 2025 22:02:10 +0000 (23:02 +0100)]
fetch set_head: refactor to use remote directly
As a preparatory step to use even more properties from the remote
struct, refactor set_head to take the entire struct as a parameter,
instead of the necessary bits. This also allows consolidating the use of
gtransport->remote in set_head, making the access of the remote's
properties consistent in the function.
Signed-off-by: Bence Ferdinandy <bence@ferdinandy.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Already when introduced in c7a8a16239 (Add bundle transport,
2007-09-10), the `bundle` transport had a bug where it would open a file
descriptor to the bundle file and then close it _twice_: First, the file
descriptor (`data->fd`) is passed to `unbundle()`, which would use it as
the `stdin` of the `index-pack` process, which as a consequence would
close it via `start_command()`. However, `data->fd` would still hold the
numerical value of the file descriptor, and `close_bundle()` would see
that and happily close it again.
This seems not to have caused too many problems in almost two decades,
but I encountered a situation today where it _does_ cause problems: In
i686 variants of Git for Windows, it seems that file descriptors are
reused quickly after they have been closed.
In the particular scenario I faced, `git fetch <bundle> <ref>` gets the
same file descriptor value when opening the bundle file and importing
its embedded packfile (which implicitly closes the file descriptor) and
then when opening a pack file in `fetch_and_consume_refs()` while
looking up an object's header.
Later on, after the bundle has been imported (and the `close_bundle()`
function erroneously closes the file descriptor that has _already_ been
closed when using it as `stdin` for `git index-pack`), the same file
descriptor value has now been reused via `use_pack()`. Now, when either
the recursive fetch (which defaults to "on", unfortunately) or a
commit-graph update needs to `mmap()` the packfile, it fails due to a
now-invalid file descriptor that _should_ point to the pack file but
doesn't anymore.
To fix that, let's invalidate `data->fd` after calling `unbundle()`.
That way, `close_bundle()` does not close a file descriptor that may
have been reused for something different. While at it, document that
`unbundle()` closes the file descriptor, and ensure that it also does
that when failing to verify the bundle.
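
The ownership rule behind this fix can be sketched in isolation. All
names below (`struct bundle_data`, `consume_fd()`, `import_bundle()`)
are hypothetical stand-ins, not Git's actual functions; the sketch
assumes a POSIX system and only shows the pattern of resetting the
stored descriptor to -1 once another function has taken over closing
it.

```c
#include <unistd.h>

struct bundle_data {
	int fd; /* -1 when we no longer own a descriptor */
};

/* Stand-in for a callee that always closes the descriptor it is given. */
static int consume_fd(int fd)
{
	return close(fd);
}

static int import_bundle(struct bundle_data *data)
{
	int ret = consume_fd(data->fd);

	/* The callee closed it; forget the now-stale number. */
	data->fd = -1;
	return ret;
}

/* Returns 1 if a descriptor was closed here, 0 if there was none. */
static int close_bundle(struct bundle_data *data)
{
	int closed = 0;

	if (data->fd >= 0) {
		close(data->fd);
		closed = 1;
	}
	data->fd = -1;
	return closed;
}
```

Without the reset in `import_bundle()`, `close_bundle()` would call
`close()` on a number that the process may already have reused for an
unrelated file.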
Luckily, this bug does not affect the bundle URI feature, it only
affects the `git fetch <bundle>` code path.
Note that this patch does not _completely_ clarify who is responsible
to close that file descriptor, as `run_command()` may fail _without_
closing `cmd->in`. Addressing this issue thoroughly, however, would
require a rather thorough re-design of the `start_command()` and
`finish_command()` functionality to make it a lot less murky who is
responsible for what file descriptors.
At least this here patch is relatively easy to reason about, and
addresses a hard failure (`fatal: mmap: could not determine filesize`)
at the expense of leaking a file descriptor under very rare
circumstances in which `git fetch` would error out anyway.
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
ZheNing Hu [Fri, 24 Jan 2025 07:49:14 +0000 (07:49 +0000)]
gc: add `--expire-to` option
This commit extends the functionality of `git gc`
by adding a new option, `--expire-to=<dir>`. Previously,
this feature was implemented in 91badeba32 (builtin/repack.c:
implement `--expire-to` for storing pruned objects, 2022-10-24),
which allows users to specify a directory where unreachable
and expired cruft packs are stored during garbage collection.
However, users had to run `git repack --cruft --expire-to=<dir>`
followed by `git prune` to achieve similar results within `git gc`.
By introducing `--expire-to=<dir>` directly into `git gc`,
we simplify the process for users who wish to manage their
repository's cleanup more efficiently. This change involves
passing the `--expire-to=<dir>` parameter through to `git repack`,
making it easier for users to set up a backup location for cruft
packs that will be pruned.
Originally, `git gc --prune=now` deleted all unreachable
objects by passing the `-a` parameter to `git repack`. With the
addition of the `--cruft` and `--expire-to` options, it is necessary
to modify this default behavior: instead of deleting these
unreachable objects, they should be merged into a cruft pack and
collected in a specified directory. Therefore, we do not pass `-a`
to the repack command but instead pass `--cruft`, `--expire-to`,
and `--cruft-expiration=now` to repack.
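The choice of repack arguments described above could be sketched as follows. This is an illustration under stated assumptions, not the code in builtin/gc.c: the function name `repack_args_for_prune_now()` is hypothetical, and only the `--prune=now` case discussed in this message is modelled.

```python
def repack_args_for_prune_now(expire_to=None):
    """Arguments handed to `git repack` for `git gc --prune=now` (sketch)."""
    if expire_to is None:
        # Previous behavior: drop unreachable objects outright.
        return ["-a"]
    # With --expire-to, collect the unreachable objects into a cruft
    # pack and have repack store the pruned objects in <dir>.
    return ["--cruft", "--cruft-expiration=now", f"--expire-to={expire_to}"]
```

For example, `repack_args_for_prune_now("/backup/cruft")` yields the three cruft-related options, with `/backup/cruft` being a placeholder directory.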
Signed-off-by: ZheNing Hu <adlternative@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Julian Prein [Thu, 9 Jan 2025 13:25:42 +0000 (13:25 +0000)]
config.txt: add trailer.* variables
The trailer.* configuration variables are currently only described in
git-interpret-trailers(1) but affect git-commit and git-tag as well.
Move that section into its own config/trailer.txt file and also include
it in git-config(1).
Signed-off-by: Julian Prein <julian@druckdev.xyz> Acked-by: Eric Sesterhenn <eric.sesterhenn@x41-dsec.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
remote: announce removal of "branches/" and "remotes/"
Back when Git was in its infancy, remotes were configured via separate
files in "branches/" (back in 2005). This mechanism was replaced later
that year with the "remotes/" directory. Both mechanisms have eventually
been replaced by config-based remotes, and it is very unlikely that
anybody still uses these directories to configure their remotes.
Both of these directories have been marked as deprecated, one in 2005
and the other one in 2011. Follow through with the deprecation and
finally announce the removal of these features in Git 3.0.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
[jc: with a small tweak to the help message] Signed-off-by: Junio C Hamano <gitster@pobox.com>
When using Python's http.server class, "instaweb" bound only to the
local IP address without "--local" and to all addresses with "--local",
which was the other way around; this has been corrected.
* ak/instaweb-python-port-binding-fix:
instaweb: fix ip binding for the python http.server
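The intended binding can be illustrated with Python's standard `http.server` module. This is a sketch of the expected behavior only, not the script that git-instaweb generates; `bind_address()` and `make_server()` are hypothetical helpers.

```python
from http.server import HTTPServer, SimpleHTTPRequestHandler


def bind_address(local):
    # With --local, listen on the loopback interface only;
    # otherwise listen on all interfaces.
    return "127.0.0.1" if local else "0.0.0.0"


def make_server(local, port=0):
    # Port 0 lets the operating system pick a free port; the server
    # does not handle requests until serve_forever() is called.
    return HTTPServer((bind_address(local), port), SimpleHTTPRequestHandler)
```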
The extended SHA-1 expression parser did not work well when a branch
with an unusual name (e.g. "foo{bar") is involved.
* en/object-name-with-funny-refname-fix:
object-name: be more strict in parsing describe-like output
object-name: fix resolution of object names containing curly braces
Junio C Hamano [Thu, 23 Jan 2025 20:00:40 +0000 (12:00 -0800)]
Merge branch 'ds/path-walk-1' into ds/backfill
* ds/path-walk-1:
path-walk: drop redundant parse_tree() call
path-walk: reorder object visits
path-walk: mark trees and blobs as UNINTERESTING
path-walk: visit tags and cached objects
path-walk: allow consumer to specify object types
t6601: add helper for testing path-walk API
test-lib-functions: add test_cmp_sorted
path-walk: introduce an object walk by path
Taylor Blau [Thu, 23 Jan 2025 17:34:39 +0000 (12:34 -0500)]
csum-file: introduce hashfile_checkpoint_init()
In 106140a99f (builtin/fast-import: fix segfault with unsafe SHA1
backend, 2024-12-30) and 9218c0bfe1 (bulk-checkin: fix segfault with
unsafe SHA1 backend, 2024-12-30), we observed the effects of failing to
initialize a hashfile_checkpoint with the same hash function
implementation as is used by the hashfile it is used to checkpoint.
While both 106140a99f and 9218c0bfe1 work around the immediate crash,
changing the hash function implementation within the hashfile API to,
for example, the non-unsafe variant would re-introduce the crash. This
is a result of the tight coupling between initializing hashfiles and
hashfile_checkpoints.
Introduce and use a new function which ensures that both parts of a
hashfile and hashfile_checkpoint pair use the same hash function
implementation to avoid such crashes.
A few things worth noting:
  - In the change to builtin/fast-import.c::stream_blob(), we can see
    that by removing the explicit reference to
    'the_hash_algo->unsafe_init_fn()', we are hardened against the
    hashfile API changing away from the_hash_algo (or its unsafe
    variant) in the future.
  - The bulk-checkin code no longer needs to explicitly zero-initialize
    the hashfile_checkpoint, since it is now done as a result of calling
    'hashfile_checkpoint_init()'.
  - Also in the bulk-checkin code, we add an additional call to
    prepare_to_stream() outside of the main loop in order to initialize
    'state->f' so we know which hash function implementation to use when
    calling 'hashfile_checkpoint_init()'.
    This is OK, since subsequent 'prepare_to_stream()' calls are noops.
    However, we only need to call 'prepare_to_stream()' when we have the
    HASH_WRITE_OBJECT bit set in our flags. Without that bit, calling
    'prepare_to_stream()' does not assign 'state->f', so we have nothing
    to initialize.
  - Other uses of the 'checkpoint' in 'deflate_blob_to_pack()' are
    appropriately guarded.
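The coupling that 'hashfile_checkpoint_init()' enforces can be modelled in a few lines. The following is a rough sketch, assuming Python's `hashlib` in place of Git's hash backends; the classes and functions only borrow names from csum-file for readability and do not reproduce its API.

```python
import hashlib


class Hashfile:
    """Hypothetical model of a hashfile: a running hash plus a byte count."""

    def __init__(self, algo_name):
        self.algo_name = algo_name
        self.ctx = hashlib.new(algo_name)
        self.total = 0

    def write(self, data):
        self.ctx.update(data)
        self.total += len(data)


class HashfileCheckpoint:
    def __init__(self):
        self.offset = 0
        self.algo_name = None
        self.ctx = None


def hashfile_checkpoint_init(f, checkpoint):
    # Derive the checkpoint's hash implementation from the hashfile
    # itself, so the two can never disagree.
    checkpoint.offset = 0
    checkpoint.algo_name = f.algo_name
    checkpoint.ctx = hashlib.new(f.algo_name)


def hashfile_checkpoint(f, checkpoint):
    # Refuse to save state into a checkpoint that was not initialized
    # for this hashfile's implementation.
    if checkpoint.algo_name != f.algo_name:
        raise ValueError("checkpoint not initialized for this hashfile")
    checkpoint.offset = f.total
    checkpoint.ctx = f.ctx.copy()


def hashfile_truncate(f, checkpoint):
    # Roll the hashfile back to the saved state.
    f.total = checkpoint.offset
    f.ctx = checkpoint.ctx.copy()
```

Because the checkpoint takes its implementation from the hashfile, changing the algorithm passed to `Hashfile()` needs no matching change at the call sites that create checkpoints.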
Helped-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>