The following commit will introduce a case where we write a MIDX bitmap
over packs that do not themselves have on-disk *.rev files.
This case is supported within Git, and we will simply fall back to
generating the revindex in memory. But we don't ever release that
memory, causing a leak that is exposed by a test introduced in the
following commit.
(As far as I could find, we never free()'d memory allocated as a
byproduct of creating an in-memory revindex, likely because that code
predates the leak-checking niceties we have in the test suite now.)
Rectify this by calling `FREE_AND_NULL()` on the `p->revindex` field
when calling `close_pack_revindex()`.
Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Taylor Blau [Tue, 19 May 2026 15:58:13 +0000 (11:58 -0400)]
builtin/repack.c: convert `--write-midx` to an `OPT_CALLBACK`
Change the --write-midx (-m) flag from an OPT_BOOL to an OPT_CALLBACK
that accepts an optional mode argument. Introduce an enum with
REPACK_WRITE_MIDX_NONE and REPACK_WRITE_MIDX_DEFAULT to distinguish
between the two states, and update all existing boolean checks
accordingly.
For now, passing no argument (or just `-m`) selects the default mode,
preserving existing behavior. A subsequent commit will add a new mode
for writing incremental MIDXs.
Extract repack_write_midx() as a dispatcher that selects the
appropriate MIDX-writing implementation based on the mode.
Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Taylor Blau [Tue, 19 May 2026 15:58:10 +0000 (11:58 -0400)]
repack-geometry: prepare for incremental MIDX repacking
Teach `pack_geometry_init()` to optionally restrict the set of
repacking candidates to only packs in the tip MIDX layer when a
`midx_layer_threshold` is configured. If the tip layer has fewer packs
than the threshold, those packs are excluded entirely; otherwise only
packs in that layer participate in the geometric repack.
Also track whether any tip-layer packs were included in the rollup
(`midx_tip_rewritten`), which a subsequent commit will use to decide
how to update the MIDX chain after repacking.
Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
The function `write_midx_included_packs()` manages the lifecycle of
writing packs to stdin when running `git multi-pack-index write` as a
child process.
Extract a standalone `repack_fill_midx_stdin_packs()` helper, which
handles `--stdin-packs` argument setup, starting the command, writing
pack names to its standard input, and finishing the command.
This simplifies `write_midx_included_packs()` and prepares for a
subsequent commit where the same helper is called with `cmd->out = -1`
to capture the MIDX's checksum from the command's standard output,
which is needed when writing MIDX layers with `--no-write-chain-file`.
No functional changes are included in this patch.
Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Taylor Blau [Tue, 19 May 2026 15:58:03 +0000 (11:58 -0400)]
repack-midx: factor out `repack_prepare_midx_command()`
The `write_midx_included_packs()` function assembles and executes a
`git multi-pack-index write` command, constructing the argument list
inline.
Future commits will introduce additional callers that need to construct
similar `git multi-pack-index` commands (for both `write` and `compact`
subcommands), so extract the common portions of the command setup into a
reusable `repack_prepare_midx_command()` helper.
The extracted helper sets `git_cmd`, pushes `multi-pack-index` and a
subcommand, and handles `--progress`/`--no-progress` and `--bitmap`
flags. The remaining arguments that are specific to the `write`
subcommand (such as `--stdin-packs`) are left to the caller.
No functional changes are included in this patch.
Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Taylor Blau [Tue, 19 May 2026 15:58:00 +0000 (11:58 -0400)]
midx: expose `midx_layer_contains_pack()`
Rename the function `midx_contains_pack_1()` to instead be called
`midx_layer_contains_pack()` and make it accessible. Unlike
`midx_contains_pack()` (which recurses through the entire chain), this
function checks only a single MIDX layer.
This will be used by a subsequent commit to determine whether a given
pack belongs to the tip MIDX layer specifically, rather than to any
layer in the chain.
No functional changes are present in this commit.
Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Taylor Blau [Tue, 19 May 2026 15:57:57 +0000 (11:57 -0400)]
repack: track the ODB source via existing_packs
Store the ODB source in the `existing_packs` struct and use that in
place of the raw `repo->objects->sources` access within `cmd_repack()`.
The source used is still assigned from the first source in the list, so
there are no functional changes in this commit. The changes instead
serve two purposes (one immediate, one not):
- The incremental MIDX-based repacking machinery will need to know what
source is being used to read the existing MIDX/chain (should one
exist).
- In the future, if "git repack" is taught how to operate on other
object sources, this field will serve as the authoritative value for
that source.
Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Taylor Blau [Tue, 19 May 2026 15:57:54 +0000 (11:57 -0400)]
midx: support custom `--base` for incremental MIDX writes
Both `compact` and `write --incremental` fix the base of the resulting
MIDX layer: `compact` always places the compacted result on top of
"from's" immediate parent in the chain, and `write --incremental` always
appends a new layer to the existing tip. In both cases the base is not
configurable.
Future callers need additional flexibility. For instance, the incremental
MIDX-based repacking code may wish to write a layer based on some
intermediate ancestor rather than the current tip, or produce a root
layer when replacing the bottommost entries in the chain.
Introduce a new `--base` option for both subcommands to specify the
checksum of the MIDX layer to use as the base. The given checksum must
refer to a valid layer in the MIDX chain that is an ancestor of the
topmost layer being written or compacted.
The special value "none" is accepted to produce a root layer with no
parent. This will be needed when the incremental repacking machinery
determines that the bottommost layers of the chain should be replaced.
If no `--base` is given, behavior is unchanged: `compact` uses "from's"
immediate parent in the chain, and `write` appends to the existing tip.
For the `write` subcommand, `--base` requires `--no-write-chain-file`. A plain
`write --incremental` appends a new layer to the live chain tip with no
mechanism to atomically replace it; overriding the base would produce a
layer that does not extend the tip, breaking chain invariants. With
`--no-write-chain-file` the chain is left unmodified and the caller is
responsible for assembling a valid chain.
For `compact`, no such restriction applies. The compaction operation
atomically replaces the compacted range in the chain file, so writing
the result on top of any valid ancestor preserves chain invariants.
Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Taylor Blau [Tue, 19 May 2026 15:57:51 +0000 (11:57 -0400)]
midx: introduce `--no-write-chain-file` for incremental MIDX writes
When writing an incremental MIDX layer, the MIDX machinery writes the
new layer into the multi-pack-index.d directory and then updates the
multi-pack-index-chain file to include the freshly written layer.
Future callers however may not wish to immediately update the MIDX chain
itself, preferring instead to write out new layer(s) themselves before
atomically updating the chain. Concretely, the new incremental
MIDX-based repacking strategy will want to do exactly this (that is,
assemble the new MIDX chain itself before writing a new chain file and
atomically linking it into place).
Introduce a `--no-write-chain-file` flag that:
* writes the new MIDX layer into the multi-pack-index.d directory
* prints its checksum
* does not update the multi-pack-index-chain file.
The MIDX chain file (and thus, the lock protecting it) remain untouched,
allowing callers to assemble the chain themselves. This flag requires
`--incremental`, since the notion of a separate layer only makes sense
for incremental MIDXs.
Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Taylor Blau [Tue, 19 May 2026 15:57:48 +0000 (11:57 -0400)]
midx: use `strvec` for `keep_hashes`
The `keep_hashes` array in `write_midx_internal()` accumulates the
checksums of MIDX files that should be retained when pruning stale
entries from the MIDX chain. For similar reasons as in a previous
commit, rewrite this using a strvec, requiring us to pass one fewer
parameter.
Unlike the aforementioned previous commit, use a `strvec` instead of a
`string_list`, which provides a more ergonomic interface to adjust the
values at a particular index. The ordering is important here, as this
value is used to determine the contents of the resulting
`multi-pack-index-chain` file when writing with "--incremental".
Since the previous commit already builds the array in forward order, the
conversion is straightforward: replace indexed assignments with
`strvec_push()`, drop the pre-counting and `CALLOC_ARRAY()`, and
simplify cleanup via `strvec_clear()`.
Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Taylor Blau [Tue, 19 May 2026 15:57:45 +0000 (11:57 -0400)]
midx: build `keep_hashes` array in order
Instead of filling the keep_hashes array using reverse indexing (e.g.,
`keep_hashes[count - i - 1]`) while traversing linked lists forward,
collect linked list nodes into a temporary `layers` array and then
iterate it backwards to fill `keep_hashes` sequentially.
This makes the filling logic easier to follow, since each segment of the
array is filled with a simple forward-marching index. Moreover, this
change prepares us for a subsequent commit that will switch to using a
`strvec`.
Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Taylor Blau [Tue, 19 May 2026 15:57:42 +0000 (11:57 -0400)]
midx: use `strset` for retained MIDX files
Both `clear_midx_files_ext()` and `clear_incremental_midx_files_ext()`
build a list of filenames to keep while pruning stale MIDX files. Today
they hand-roll an array instead of using a `strset`, thus requiring us
to pass an additional length parameter, and makes lookups linear.
Replace the bare array with a `strset` which can be passed around as a
single parameter. Though it improves lookup performance, the difference
is likely immeasurable given how small the keep_hashes array typically
is.
Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Taylor Blau [Tue, 19 May 2026 15:57:39 +0000 (11:57 -0400)]
midx-write: handle noop writes when converting incremental chains
When updating a MIDX, we optimize out writes that will result in an
identical MIDX as the one we already have on disk. See b3bab9d2729
(midx-write: extract function to test whether MIDX needs updating,
2025-12-10) for more details on exactly which writes are optimized out.
If `midx_needs_update()` can't rule out any of the obvious cases (e.g.,
the checksum is invalid, we're requesting a different version, or
performing compaction which always requires an update), then we compare
the packs we're writing to the packs we already know about. If there are
an equal number of packs being written as there are in any existing
MIDX layer(s), then we compare the packs by their name.
This comparison fails when we have an incremental MIDX chain with
at least two layers, since we do not recursively peel through earlier
layers, instead treating the `->pack_names` array of the tip MIDX layer
as containing all `m->num_packs + m->num_packs_in_base` packs.
Adjust this to instead look through the MIDX layers one by one when
comparing pack names. While we're at it, fix a typo above in the same
function.
Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Junio C Hamano [Wed, 20 May 2026 01:30:57 +0000 (10:30 +0900)]
Merge branch 'ps/history-fixup'
"git history" learned "fixup" command.
* ps/history-fixup:
builtin/history: introduce "fixup" subcommand
builtin/history: generalize function to commit trees
replay: allow callers to control what happens with empty commits
Some tests assume that bare repository accesses are by default
allowed; rewrite some of them to avoid the assumption, rewrite
others to explicitly set safe.bareRepository to allow them.
* js/adjust-tests-to-explicitly-access-bare-repo:
safe.bareRepository: default to "explicit" with WITH_BREAKING_CHANGES
status tests: filter `.gitconfig` from status output
ls-files tests: filter `.gitconfig` from `--others` output
t5601: restore `.gitconfig` after includeIf test
t1305: use `--git-dir=.` for bare repo in include cycle test
t1300: remove global config settings injected by test-lib.sh
t7900: do not let `$HOME/.gitconfig` interfere with XDG tests
test-lib: allow bare repository access when breaking changes are enabled
Junio C Hamano [Wed, 20 May 2026 01:30:56 +0000 (10:30 +0900)]
Merge branch 'en/diffstat-utf8-truncation-fix'
The computation to shorten the filenames shown in diffstat measured
width of individual UTF-8 characters to add up, but forgot to take
into account error cases (e.g., an invalid UTF-8 sequence, or a
control character).
* en/diffstat-utf8-truncation-fix:
diff: fix out-of-bounds reads and NULL deref in diffstat UTF-8 truncation
Junio C Hamano [Wed, 20 May 2026 01:30:56 +0000 (10:30 +0900)]
Merge branch 'js/mingw-no-nedmalloc'
Stop using unmaintained custom allocator in Windows build which was
the last user of the code.
* js/mingw-no-nedmalloc:
mingw: remove the vendored compat/nedmalloc/ subtree
mingw: drop the build-system plumbing for nedmalloc
mingw: stop using nedmalloc
Update code paths that assumed "unsigned long" was long enough for
"size_t".
* js/objects-larger-than-4gb-on-windows:
ci: run expensive tests on push builds to integration branches
t5608: mark >4GB tests as EXPENSIVE
test-tool synthesize: add precomputed SHA-256 pack for 4 GiB + 1
test-tool synthesize: precompute pack for 4 GiB + 1
test-tool synthesize: use the unsafe hash for speed
t5608: add regression test for >4GB object clone
test-tool: add a helper to synthesize large packfiles
delta, packfile: use size_t for delta header sizes
odb, packfile: use size_t for streaming object sizes
git-zlib: handle data streams larger than 4GB
index-pack, unpack-objects: use size_t for object size
Stop using `the_repository` in `init_db()` and instead accept
the repository as a parameter. The injection of `the_repository` is thus
bumped one level higher, where callers now pass it in explicitly.
Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
setup: stop using `the_repository` in `create_reference_database()`
Stop using `the_repository` in `create_reference_database()` and instead
accept the repository as a parameter. The injection of `the_repository`
is thus bumped one level higher, where callers now pass it in
explicitly.
Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
setup: stop using `the_repository` in `initialize_repository_version()`
Stop using `the_repository` in `initialize_repository_version()` and
instead accept the repository as a parameter. The injection of
`the_repository` is thus bumped one level higher, where callers now pass
it in explicitly.
Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
setup: stop using `the_repository` in `check_repository_format()`
Stop using `the_repository` in `check_repository_format()` and instead
accept the repository as a parameter. The injection of `the_repository`
is thus bumped one level higher, where callers now pass it in
explicitly.
Furthermore, the function is never used outside "setup.c". Drop its
declaration in "setup.h" and make it static. Note that this requires us
to reorder the function.
Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
setup: stop using `the_repository` in `upgrade_repository_format()`
Stop using `the_repository` in `upgrade_repository_format()` and instead
accept the repository as a parameter. The injection of `the_repository`
is thus bumped one level higher, where callers now pass it in
explicitly.
Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
setup: stop using `the_repository` in `setup_git_directory()`
Stop using `the_repository` in `setup_git_directory()` and instead
accept the repository as a parameter. The injection of `the_repository`
is thus bumped one level higher, where callers now pass it in
explicitly.
Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
setup: stop using `the_repository` in `setup_git_directory_gently()`
Stop using `the_repository` in `setup_git_directory_gently()` and
instead accept the repository as a parameter. The injection of
`the_repository` is thus bumped one level higher, where callers now pass
it in explicitly.
Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
setup: stop using `the_repository` in `setup_git_env()`
Stop using `the_repository` in `setup_git_env()` and instead accept the
repository as a parameter. The injection of `the_repository` is thus
bumped one level higher, where callers now pass it in explicitly.
Furthermore, the function is never used outside of "setup.c". Drop the
declaration in "environment.h" and make it static.
Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
setup: stop using `the_repository` in `set_git_work_tree()`
Stop using `the_repository` in `set_git_work_tree()` and instead accept
the repository as a parameter. The injection of `the_repository` is thus
bumped one level higher, where callers now pass it in explicitly.
Similar as with the preceding commit, we track whether the worktree has
been initialized already via a global variable so that we can die in
case the repository is re-initialized with a different worktree path.
Store this info in the `struct repository` instead so that we correctly
handle this per repository.
Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
setup: stop using `the_repository` in `setup_work_tree()`
Stop using `the_repository` in `setup_work_tree()` and instead accept
the repository as a parameter. The injection of `the_repository` is thus
bumped one level higher, where callers now pass it in explicitly.
Note that the function tracks two bits of information via global
variables. This of course doesn't make much sense anymore now that we
can set up worktrees for arbitrary repositories:
- We track whether the worktree has already been initialized and, if
so, we skip the call to `chdir_notify()` and setenv(3p). It does not
make much sense to store this info in the repository, as we _would_
want to update the environment when switching between worktrees back
and forth.
So instead of storing this info in the repository, we drop this
state entirely and live with the fact that we may execute the logic
twice. It should ultimately be idempotent though and thus not be
much of a problem.
- We track whether the worktree configuration is bogus. If so, and if
later on some caller tries to setup the worktree, then we'll die
instead. This is indeed information that we can move into the
repository itself.
Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
setup: stop using `the_repository` in `enter_repo()`
Stop using `the_repository` in `enter_repo()` and instead accept the
repository as a parameter. The injection of `the_repository` is thus
bumped one level higher, where callers now pass it in explicitly.
Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
setup: stop using `the_repository` in `verify_non_filename()`
Stop using `the_repository` in `verify_non_filename()` and instead
accept the repository as a parameter. The injection of `the_repository`
is thus bumped one level higher, where callers now pass it in
explicitly.
Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
setup: stop using `the_repository` in `verify_filename()`
Stop using `the_repository` in `verify_filename()` and instead accept
the repository as a parameter. The injection of `the_repository` is thus
bumped one level higher, where callers now pass it in explicitly.
Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
setup: stop using `the_repository` in `path_inside_repo()`
Stop using `the_repository` in `path_inside_repo()` and instead accept
the repository as a parameter. The injection of `the_repository` is thus
bumped one level higher, where callers now pass it in explicitly.
Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
setup: stop using `the_repository` in `prefix_path()`
Stop using `the_repository` in `prefix_path()` and instead accept the
repository as a parameter. The injection of `the_repository` is thus
bumped one level higher, where callers now pass it in explicitly.
Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
setup: stop using `the_repository` in `is_inside_work_tree()`
Similar as with the preceding commit, `is_inside_work_tree()` determines
whether the current working directory is located inside the worktree of
`the_repository`. Perform the same refactoring by dropping the caching
mechanism and injecting the repository that shall be checked.
Note that, same as in the preceding commit, we're also resolving the
worktree path via `realpath()`. In theory this step is not necessary as
we always set the worktree path via `repo_set_worktree()`, and that
function already resolves the path for us. But resolving the path a
second time is unlikely to matter performance-wise, and it feels fragile
to rely on the repository's worktree path being absolute. We thus
perform the same extra step even though it's ultimately not required.
Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
setup: stop using `the_repository` in `is_inside_git_dir()`
The function `is_inside_git_dir()` verifies whether or not the current
working directory is located inside the gitdir of `the_repository`. This
is done by taking the gitdir path and verifying that it's a prefix of
the current working directory.
This information is cached so that we don't have to re-do this change
multiple times. Furthermore, we proactively set the value in multiple
locations so that we don't even have to perform the check when we have
discovered the repository.
While we could simply move the caching variable into the repository, the
current layout doesn't really feel sensible in the first place:
- It can easily lead to false positives or negatives if at any point
in time we may switch the current working directory.
- We don't call the function in a hot loop, and neither is it overly
expensive to compute.
Drop the caching infrastructure and instead compute the property ad-hoc
via an injected repository.
Note that there is one small gotcha: we often end up with relative
gitdir paths, and if so `is_inside_dir()` might fail. This wasn't an
issue before because of how we proactively set the cached value during
repository discovery. Now that we stop doing that it becomes a problem
though, which we work around by resolving the gitdir via `realpath()`.
Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
setup: replace use of `the_repository` in static functions
Replace the use of `the_repository` in "setup.c" for all static
functions. For now, we simply add `the_repository` to invocations of
these functions. This will be addressed in subsequent commits, where
we'll move up `the_repository` one more layer to callers of "setup.c".
Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Jeff King [Tue, 19 May 2026 05:22:19 +0000 (01:22 -0400)]
connect: use "service" enum for "name" argument
The git_connect() function takes a "name" argument which is a bit
confusing. It is _not_ the program to run on the remote repo, which is
specified by the "prog" argument. It should instead be one of a few
well-known strings specifying the type of operation (e.g.,
"git-upload-pack"). But to add to the confusion, unless otherwise
configured, those well-known strings will also be the same as the
programs we run, making it easy to mistake which variable is which.
This confusion comes from eaa0fd6584 (git_connect(): fix corner cases in
downgrading v2 to v0, 2023-03-17), though in its defense, the term
"name" and the use of a string are found in other connect code, going
all the way back to b236752a87 (Support remote archive from all smart
transports, 2009-12-09).
But let's see if we can clean things up a bit. The term "name" is overly
vague. We use "service" in other places, including in the smart-http
protocol, so let's use it here, too.
Using a string invites the notion that it can be anything, not one of a
defined set. Let's instead introduce an enum, which has the added bonus
that the compiler can catch typos for us, rather than quietly choosing
the wrong service from an unexpected strcmp() result.
We do still have to turn our enum into those well-known strings to pass
along in the remote-helper protocol (e.g., for a stateless-connect
directive). But now we do so explicitly and in a way that I think is
much more obvious to follow.
This is a pure cleanup; there should be no behavior change.
Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Jeff King [Tue, 19 May 2026 01:20:59 +0000 (21:20 -0400)]
quote: simplify internals of dequoting
Our sq_dequote_to_argv_internal() helper was wrapped by the to_argv()
and to_strvec() forms. Now that we have only the latter, we can stop
wrapping it and drop the argv-only bits.
Note that in theory sq_dequote_to_strvec() could take a const input
string, which would be friendlier to its callers. We couldn't do that
with the to_argv() form because it reused the input string to hold the
output elements. But since we're built on sq_dequote_step(), which
munges the input, we'd have to rework the parser. Since no callers care
about it currently, we'll leave that for another day.
Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Jeff King [Tue, 19 May 2026 01:19:34 +0000 (21:19 -0400)]
quote: drop sq_dequote_to_argv()
The last caller went away in f9dbb64fad (config: parse more robust
format in GIT_CONFIG_PARAMETERS, 2021-01-12), when we switched to using
sq_dequote_step().
The "to_argv()" form is not a great interface. If you care about raw
speed, then sq_dequote_step() lets you work incrementally without extra
allocations. If you care about simplicity, then sq_dequote_to_strvec()
puts the result in an encapsulated data structure. With sq_dequote_to_argv(),
you have a data dependency on the original string but still have to
remember to manually free the argv array itself (but not its elements).
So it's sort of a worst-of-both-worlds middle ground. Let's get rid of
it.
Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Jeff King [Tue, 19 May 2026 01:19:01 +0000 (21:19 -0400)]
quote.h: bump strvec forward declaration to the top
We usually put forward declarations at the top of header files, rather
than next to the functions that need them. In theory placing it next to
the function has some explanatory value, but it's also just as likely to
become stale if other uses are added.
Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
"git rebase --update-refs", when used with an rebase.instructionFormat
with "%d" (describe) in it, tried to update local branch HEAD by
mistake, which has been corrected.
Junio C Hamano [Tue, 19 May 2026 00:57:44 +0000 (09:57 +0900)]
Merge branch 'kh/name-rev-custom-format'
A new builtin "git format-rev" is introduced for pretty formatting
one revision expression per line or commit object names found in
running text.
* kh/name-rev-custom-format:
format-rev: introduce builtin for on-demand pretty formatting
name-rev: make dedicated --annotate-stdin --name-only test
name-rev: factor code for sharing with a new command
name-rev: run clang-format before factoring code
name-rev: wrap both blocks in braces
Junio C Hamano [Tue, 19 May 2026 00:57:43 +0000 (09:57 +0900)]
Merge branch 'en/xdiff-cleanup-3'
Preparation of the xdiff/ codebase to work with Rust.
* en/xdiff-cleanup-3:
xdiff/xdl_cleanup_records: make execution of action easier to follow
xdiff/xdl_cleanup_records: make setting action easier to follow
xdiff/xdl_cleanup_records: make limits more clear
xdiff/xdl_cleanup_records: use unambiguous types
xdiff: use unambiguous types in xdl_bogo_sqrt()
xdiff/xdl_cleanup_records: delete local recs pointer
Junio C Hamano [Tue, 19 May 2026 00:57:43 +0000 (09:57 +0900)]
Merge branch 'mc/http-emptyauth-negotiate-fix'
The 'http.emptyAuth=auto' configuration now correctly attempts
Negotiate authentication before falling back to manual credentials.
This allows seamless Kerberos ticket-based authentication without
requiring users to explicitly set 'http.emptyAuth=true'.
* mc/http-emptyauth-negotiate-fix:
doc: clarify http.emptyAuth values
t5563: add tests for http.emptyAuth with Negotiate
http: attempt Negotiate auth in http.emptyAuth=auto mode
http: extract http_reauth_prepare() from retry paths
René Scharfe [Mon, 18 May 2026 20:25:02 +0000 (22:25 +0200)]
use __builtin_add_overflow() in st_add() with Clang
Clang and GCC optimize away comparisons of overflow checks by checking
the carry flag on x64. GCC does the same on ARM64, but Clang currently
(version 22.1) doesn't.
It does this optimization for overflow checks that use its builtin
function __builtin_add_overflow(), though. Provide a non-generic
lookalike for size_t that does the same checks as before as a fallback
and use the original with Clang. Use it on all platforms for simplicity.
On an Apple M1 I get a nice speedup for a command that builds lots of
strings using a strbuf, which exercises the st_add3() in strbuf_grow()
for every line of output:
Benchmark 1: ./git_main cat-file --batch-all-objects --batch-check='%(objectname)'
Time (mean ± σ): 120.4 ms ± 0.2 ms [User: 113.8 ms, System: 6.0 ms]
Range (min … max): 120.1 ms … 121.1 ms 24 runs
Benchmark 2: ./git cat-file --batch-all-objects --batch-check='%(objectname)'
Time (mean ± σ): 115.5 ms ± 0.1 ms [User: 108.6 ms, System: 5.8 ms]
Range (min … max): 115.2 ms … 115.8 ms 25 runs
Summary
./git cat-file --batch-all-objects --batch-check='%(objectname)' ran
1.04 ± 0.00 times faster than ./git_main cat-file --batch-all-objects --batch-check='%(objectname)'
Suggested-by: Jeff King <peff@peff.net> Signed-off-by: René Scharfe <l.s.r@web.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
René Scharfe [Mon, 18 May 2026 20:25:01 +0000 (22:25 +0200)]
strbuf: use st_add3() in strbuf_grow()
Simplify the code by calling st_add3() to do overflow checks instead of
open-coding it. This changes the error message to include the offending
summands, which can be helpful when tracking down the cause.
Signed-off-by: René Scharfe <l.s.r@web.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Karthik Nayak [Sun, 17 May 2026 17:32:05 +0000 (19:32 +0200)]
refs/files: skip lock files during consistency checks
Consistency checks in the files reference backend involve two steps:
1. Iterate over all entries within the 'refs/' directory and call
`files_fsck_ref()` on each.
2. Iterate over all root refs via `for_each_root_ref()` and call
`files_fsck_ref()` on each.
`files_fsck_ref()` then runs all fsck checks defined in
`fsck_refs_fn[]`. Step 2 goes through the refs API and only sees valid
refs, but step 1 iterates the directory directly and may also encounter
intermediate '*.lock' files.
Currently, `files_fsck_refs_name()`, one of the functions in
`fsck_refs_fn[]`, filters out lock files itself. The other function,
`files_fsck_refs_content()`, has no such check and would parse the lock
file. Any new function added to `fsck_refs_fn[]` would have the same
problem.
Move the filter up into `files_fsck_refs_dir()`, where the directory
iteration happens. Since step 2 cannot produce lock files, this is the
only site where the filter is needed, and individual checks no longer
have to re-implement it.
Signed-off-by: Karthik Nayak <karthik.188@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Junio C Hamano [Sun, 17 May 2026 13:58:30 +0000 (22:58 +0900)]
Merge branch 'hn/git-checkout-m-with-stash'
"git checkout -m another-branch" was invented to deal with local
changes to paths that are different between the current and the new
branch, but it gave only one chance to resolve conflicts. The command
was taught to create a stash to save the local changes.
* hn/git-checkout-m-with-stash:
checkout -m: autostash when switching branches
checkout: rollback lock on early returns in merge_working_tree
sequencer: teach autostash apply to take optional conflict marker labels
sequencer: allow create_autostash to run silently
stash: add --label-ours, --label-theirs, --label-base for apply
Junio C Hamano [Sun, 17 May 2026 13:58:30 +0000 (22:58 +0900)]
Merge branch 'ss/t7004-unhide-git-failures'
Test clean-up.
* ss/t7004-unhide-git-failures:
t7004: avoid subshells to capture git exit codes
t7004: dynamically grab expected state in tests
t7004: drop hardcoded tag count for state verification
Junio C Hamano [Sun, 17 May 2026 13:58:29 +0000 (22:58 +0900)]
Merge branch 'en/backfill-fixes-and-edges'
The 'git backfill' command now rejects revision-limiting options that
are incompatible with its operation, uses standard documentation for
revision ranges, and includes blobs from boundary commits by default
to improve performance of subsequent operations.
* en/backfill-fixes-and-edges:
backfill: default to grabbing edge blobs too
backfill: document acceptance of revision-range in more standard manner
backfill: reject rev-list arguments that do not make sense
Pushkar Singh [Sat, 16 May 2026 18:33:48 +0000 (18:33 +0000)]
stash: add coverage for show --include-untracked
Add a test for 'git stash show --include-untracked' to
cover the case where untracked files saved in the stash
are included in the output.
While stash creation and restoration of untracked files
are already tested, there is currently no explicit test
covering the output behavior of 'stash show
--include-untracked'.
Signed-off-by: Pushkar Singh <pushkarkumarsingh1970@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
commit-reach: use object flags for tips_reachable_from_bases()
tips_reachable_from_bases() walks the commit graph from a set of base
commits to find which tip commits are reachable. The inner loop does
a linear scan over the tips array to check whether each visited commit
is a tip, making the overall cost O(C * T) where C is commits walked
and T is the number of tips.
Use the RESULT object flag to mark tip commits, replacing the linear
scan with a single flag test per visited commit. This reduces the
per-commit tip check from O(T) to O(1) and the overall cost from
O(C * T) to O(C + T).
When multiple refs point to the same commit, the shared object gets
the flag once, so all duplicates are handled automatically. The
early-termination advancement loop checks the flag on the sorted
commits array directly, which naturally handles duplicates since the
flag is on the shared commit object.
This also removes the index field from struct commit_and_index, since
the indirection through the original tips array is no longer needed.
This function is called by `git for-each-ref --merged` and
`git branch/tag --contains/--no-contains` via reach_filter() in
ref-filter.c.
Benchmark on a merge-heavy monorepo (2.3M commits, 10,000 refs):
Command Before After Speedup
for-each-ref --merged HEAD 6.57s 1.59s 4.1x
for-each-ref --no-merged HEAD 6.67s 1.66s 4.0x
branch --merged HEAD 0.68s 0.61s 10%
branch --no-merged HEAD 0.65s 0.61s 8%
tag --merged HEAD 0.12s 0.12s -
On linux.git with 10,000 synthetic branches at the root commit (worst
case for the DFS walk):
Command Before After Speedup
for-each-ref --merged HEAD 1.35s 0.35s 3.9x
for-each-ref --no-merged HEAD 1.82s 0.31s 5.9x
The large speedup for for-each-ref is because it checks all 10,000
refs as tips, making the O(T) inner loop expensive. The branch
subcommand only checks local branches (fewer tips), so the improvement
is smaller.
Signed-off-by: Kristofer Karlsson <krka@spotify.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Philippe Blain [Fri, 15 May 2026 15:48:11 +0000 (15:48 +0000)]
diff-format.adoc: mode and hash are 0* for unmerged paths from index only
In the "Raw output format" section, we mention that the 'mode' and
'sha1' for "src" and "dst" are 0* if "(creation|deletion) or unmerged".
For unmerged entries, 'mode' and 'sha1' are in fact 0* only when we are
looking at the index, i.e. on the left side for 'git diff-files' and on
the right side for 'git diff-index --cached'. Be more precise by
mentioning this, and while at it uniformize the wording of the "work
tree out of sync with the index" case.
Signed-off-by: Philippe Blain <levraiphilippeblain@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Philippe Blain [Fri, 15 May 2026 15:48:10 +0000 (15:48 +0000)]
diff-format.adoc: 'git diff-files' prints two lines for unmerged files
Since 10637b84d9 (diff-files: -1/-2/-3 to diff against unmerged stage.,
2005-11-29), for unmerged entries 'git diff-files' print both an
"unmerged" line ('U'), as well as an "in-place edit" line ('M')
comparing stage 2 (by default) with the working tree. The "Raw output
format" documentation however mentions that all commands print a single
line per changed file. Adjust diff-format.adoc to also mention this
special case, for completeness.
Signed-off-by: Philippe Blain <levraiphilippeblain@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Philippe Blain [Fri, 15 May 2026 15:48:09 +0000 (15:48 +0000)]
diff-format.adoc: remove mention of diff-tree specific output
In the "Raw output format" section, we start by mentioning that 'git
diff-tree' prints the hashes of what is being compared. This is only
true in --stdin mode, and is already mentioned in the description of
'--stdin' in git-diff-tree.adoc. Remove this sentence such that we only
focus on the common output between diff-tree, diff-index, diff-files and
These rules exist to ensure Make won't fail with the following error if
one of the .adoc files is renamed or removed:
make: *** No rule to make target 'Documentation/config.adoc', needed by 'config-list.h'.
With the no-op targets defined in `config-list.h.d`, Make knows there's
no work to be done to generate these files, so it doesn't error out if
it doesn't exist.
For the Makefile build system this works great. And since ebeea3c471 (build: regenerate config-list.h when Documentation changes,
2026-02-24) this script is also called from the Meson build system.
Nevertheless, on AlmaLinux 8 the following build failure is seen:
This version of this distro uses Ninja 1.8.2 and it seems to have some
issues with the format of the `config-list.h.d` file.
Ninja versions before 1.10.0 do not reset the depfile parser state on
newlines. This causes issues when the depfile has one dependency per
line, like we have in `config-list.h.d`:
The parser only recognizes the first "config-list.h:" as a target. On
subsequent lines it is still in dependency-parsing mode, so the repeated
output name is recorded as an input. This causes the error mentioned
above.
The bug in Ninja is fixed in 1.10, with commit
ninja-build/ninja@1daa7470ab7e (depfile_parser: remove restriction on
multiple outputs, 2019-11-20).
To be compatible with older versions of Ninja, collapse the dependencies
for `config-list.h` into a single line like:
This works around the bug in older versions of Ninja, and is fully
compatible Make and with more recent versions of Ninja. And while the
no-op targets are not needed for Ninja, they also don't do any harm.
Helped-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Toon Claes <toon@iotcl.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Ethan Dickson [Fri, 15 May 2026 06:39:54 +0000 (06:39 +0000)]
connected: close err_fd in promisor fast-path
connected.h documents that err_fd is closed before check_connected()
returns. It is, on three of four exit paths. The promisor-pack fast
path added in 50033772d (connected: verify promisor-ness of partial
clone, 2020-01-30) returns 0 without closing it.
receive-pack uses err_fd as the write end of an async sideband
muxer's pipe, and the muxer thread waits for EOF. The same omission
has caused deadlocks there twice before: 49ecfa13f (receive-pack:
close sideband fd on early pack errors, 2013-04-19) and 6cdad1f13
(receive-pack: fix deadlock when we cannot create tmpdir,
2017-03-07).
Signed-off-by: Ethan Dickson <ethanndickson@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
René Scharfe [Fri, 15 May 2026 07:33:53 +0000 (09:33 +0200)]
trailer: change strbuf in-place in unfold_value()
Avoid an allocation by doing s/\n\s*/ /g (replacing NL and any following
whitespace with a SP) right in the strbuf instead of copying the result
to a temporary one and swapping them in the end. We can safely do that
because the replacement is never longer than the original string.
Signed-off-by: René Scharfe <l.s.r@web.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Jeff King [Sat, 16 May 2026 02:23:10 +0000 (22:23 -0400)]
commit: handle large commit messages in utf8 verification
Running t4205 under UBSan with the EXPENSIVE prereq enabled triggers an
error when we try to create a commit message that is over 2GB:
commit.c:1574:6: runtime error: signed integer overflow:
-2147483648 - 1 cannot be represented in type 'int'
The problem is that find_invalid_utf8() is not prepared to handle
large buffers, as it uses an "int" to represent buffer sizes and
offsets.
We can fix this with a few changes:
1. We'll take in "len" as a size_t (which is what the caller has
anyway, since it's working with a strbuf).
2. We need to return a size_t to give the offset to the invalid utf8,
but we also need a sentinel value for "no invalid value"
(previously "-1"). Let's split these to return a bool for "found
invalid utf8" and then pass back the offset as an out-parameter.
We'll switch the function name to match the new semantics.
3. The caller in verify_utf8() uses a "long" to store buffer
positions, which is a bit funny. This goes back to 08a94a145c
(commit/commit-tree: correct latin1 to utf-8, 2012-06-28) and is
perhaps trying to match our use of "unsigned long" for object sizes
(though we don't care about it ever becoming negative here). This
should be a size_t, too, as some platforms (like Windows) still use
a 32-bit long on machines with 64-bit pointers.
4. The "bytes" field within find_invalid_utf() does not have range
problems. It is the number of bytes the utf8 sequence claims to
have, so is limited by how many bits can be set in a single 8-bit
byte. However, if we leave it as an "int" then the compiler will
complain about the sign mismatch when comparing it to "len". So
let's make it unsigned, too.
All of this is a little silly, of course, because 2GB text commit
messages are clearly nonsense. So we might consider rejecting them
outright, but it is easy enough to make these helper functions more
robust in the meantime.
Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Jeff King [Sat, 16 May 2026 02:16:22 +0000 (22:16 -0400)]
apply: plug leak on "patch too large" error
In apply_patch(), we return immediately if read_patch_file() returns an
error. Traditionally this was OK, since an error from strbuf_read()
would restore the strbuf to its unallocated state.
But since f1c0e3946e (apply: reject patches larger than ~1 GiB,
2022-10-25), we may also return an error if we successfully read the
patch but it is too large. In this case we leak the strbuf contents when
apply_patch() returns.
You can see it in action by running t4141 under LSan with the EXPENSIVE
prereq enabled.
We can fix this in one of two places:
1. In read_patch_file(), we could release the buffer before returning
the error, behaving more like a raw strbuf_read() call.
2. In apply_patch(), we can release the strbuf ourselves before
returning.
I picked the latter, since it future proofs us against read_patch_file()
getting new error modes. We also have a cleanup label in that function
already, so now our error handling at this spot matches the rest of the
function (and all of the variables are initialized such that the rest of
the cleanup is correctly a noop at this point).
Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
t/unit-tests: add tests for the in-memory object source
While the in-memory object source is a full-fledged source, our code
base only exercises parts of its functionality because we only use it in
git-blame(1). Implement unit tests to verify that the yet-unused
functionality of the backend works as expected.
Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
The in-memory source stores its objects in a simple array that we grow as
needed. This has a couple of downsides:
- The object lookup is O(n). This doesn't matter in practice because
we only store a small number of objects.
- We don't have an easy way to iterate over all objects in
lexicographic order.
- We don't have an easy way to compute unique object ID prefixes.
Refactor the code to use an oidtree instead. This is the same data
structure used by our loose object source, and thus it means we get a
bunch of functionality for free.
Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
The oidtree data structure is currently only used to store object IDs,
without any associated data. So consequently, it can only really be used
to track which object IDs exist, and we can use the tree structure to
efficiently operate on OID prefixes.
But there are valid use cases where we want to both:
- Store object IDs in a sorted order.
- Associated arbitrary data with them.
Refactor the oidtree interface so that it allows us to store arbitrary
payloads within the respective nodes. This will be used in the next
commit.
Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
cbtree: allow using arbitrary wrapper structures for nodes
The cbtree subsystem allows the user to store arbitrary data in a
prefix-free set of strings. This is used by us to store object IDs in a
way that we can easily iterate through them in lexicograph order, and so
that we can easily perform lookups with shortened object IDs.
In its current form, it is not easily possible to store arbitrary data
with the tree nodes. There are a couple of approaches such a caller
could try to use, but none of them really work:
- One may embed the `struct cb_node` in a custom structure. This does
not work though as `struct cb_node` contains a flex array, and
embedding such a struct in another struct is forbidden.
- One may use a `union` over `struct cb_node` and ones own data type,
which _is_ allowed even if the struct contains a flex array. This
does not work though, as the compiler may align members of the
struct so that the node key would not immediately start where the
flex array starts.
- One may allocate `struct cb_node` such that it has room for both its
key and the custom data. This has the downside though that if the
custom data is itself a pointer to allocated memory, then the leak
checker will not consider the pointer to be alive anymore.
Refactor the cbtree to drop the flex array and instead take in an
explicit offset for where to find the key, which allows the caller to
embed `struct cb_node` is a wrapper struct.
Note that this change has the downside that we now have a bit of padding
in our structure, which grows the size from 60 to 64 bytes on a 64 bit
system. On the other hand though, it allows us to get rid of the memory
copies that we previously had to do to ensure proper alignment. This
seems like a reasonable tradeoff.
Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
odb: fix unnecessary call to `find_cached_object()`
The function `odb_pretend_object()` writes an object into the in-memory
object database source. The effect of this is that the object will now
become readable, but it won't ever be persisted to disk.
Before storing the object, we first verify whether the object already
exists. This is done by calling `odb_has_object()` to check all sources,
followed by `find_cached_object()` to check whether we have already
stored the object in our in-memory source.
This is unnecessary though, as `odb_has_object()` already checks the
in-memory source transitively via:
Implement the `free()` callback function for the "in-memory" source.
Note that this requires us to define `struct cached_object_entry` in
"odb/source-inmemory.h", as it is accessed in both "odb.c" and
"odb/source-inmemory.c" now. This will be fixed in subsequent commits
though.
Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Next to our typical object database sources, each object database also
has an implicit source of "cached" objects. These cached objects only
exist in memory and some use cases:
- They contain evergreen objects that we expect to always exist, like
for example the empty tree.
- They can be used to store temporary objects that we don't want to
persist to disk, which is used by git-blame(1) to create a fake
worktree commit.
Overall, their use is somewhat restricted though. For example, we don't
provide the ability to use it as a temporary object database source that
allows the user to write objects, but discard them after Git exists. So
while these cached objects behave almost like a source, they aren't used
as one.
This is about to change over the following commits, where we will turn
cached objects into a new "in-memory" source. This will allow us to use
it exactly the same as any other source by providing the same common
interface as the "files" source.
For now, the in-memory source only hosts the cached objects and doesn't
provide any logic yet. This will change with subsequent commits, where
we move respective functionality into the source.
Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Justin Tobler [Thu, 14 May 2026 18:37:40 +0000 (13:37 -0500)]
odb/transaction: make `write_object_stream()` pluggable
How an ODB transaction handles writing objects is expected to vary
between implementations. Introduce a new `write_object_stream()`
callback in `struct odb_transaction` to make this function pluggable.
Rename `index_blob_packfile_transaction()` to
`odb_transaction_files_write_object_stream()` and wire it up for use
with `struct odb_transaction_files` accordingly.
Signed-off-by: Justin Tobler <jltobler@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Justin Tobler [Thu, 14 May 2026 18:37:39 +0000 (13:37 -0500)]
object-file: generalize packfile writes to use odb_write_stream
The `index_blob_packfile_transaction()` function streams blob data
directly from an fd. This makes it difficult to reuse as part of a
generic transactional object writing interface.
Refactor the packfile write path to operate on a `struct
odb_write_stream`, allowing callers to supply data from arbitrary
sources.
Signed-off-by: Justin Tobler <jltobler@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Justin Tobler [Thu, 14 May 2026 18:37:38 +0000 (13:37 -0500)]
object-file: avoid fd seekback by checking object size upfront
In certain scenarios, Git handles writing blobs that exceed
"core.bigFileThreshold" differently by streaming the object directly
into a packfile. When there is an active ODB transaction, these blobs
are streamed to the same packfile instead of using a separate packfile
for each. If "pack.packSizeLimit" is configured and streaming another
object causes the packfile to exceed the configured limit, the packfile
is truncated back to the previous object and the object write is
restarted in a new packfile.
This works fine, but requires the fd being read from to save a
checkpoint so it becomes possible to rewind the input source via seeking
back to a known offset at the beginning. In a subsequent commit, blob
streaming is converted to use `struct odb_write_stream` as a more
generic input source instead of an fd which doesn't provide a mechanism
for rewinding.
For this use case though, rewinding the fd is not strictly necessary
because the inflated size of the object is known and can be used to
approximate whether writing the object would cause the packfile to
exceed the configured limit prior to writing anything. These blobs
written to the packfile are never deltified thus the size difference
between what is written versus the inflated size is due to zlib
compression. While this does prevent packfiles from being filled to the
potential maximum is some cases, it should be good enough and still
prevents the packfile from exceeding any configured limit.
Use the inflated blob size to determine whether writing an object to a
packfile will exceed the configured "pack.packSizeLimit".
Signed-off-by: Justin Tobler <jltobler@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Justin Tobler [Thu, 14 May 2026 18:37:37 +0000 (13:37 -0500)]
object-file: remove flags from transaction packfile writes
The `index_blob_packfile_transaction()` function handles streaming a
blob from an fd to compute its object ID and conditionally writes the
object directly to a packfile if the INDEX_WRITE_OBJECT flag is set. A
subsequent commit will make these packfile object writes part of the
transaction interface. Consequently, having the object write be
conditional on this flag is a bit awkward.
In preparation for this change, introduce a dedicated
`hash_blob_stream()` helper that only computes the OID from a `struct
odb_write_stream`. This is invoked by `index_fd()` instead when the
INDEX_WRITE_OBJECT is not set. The object write performed via
`index_blob_packfile_transaction()` is made unconditional accordingly.
Signed-off-by: Justin Tobler <jltobler@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
The `read()` callback used by `struct odb_write_stream` currently
returns a pointer to an internal buffer along with the number of bytes
read. This makes buffer ownership unclear and provides no way to report
errors.
Update the interface to instead require the caller to provide a buffer,
and have the callback return the number of bytes written to it or a
negative value on error. While at it, also move the `struct
odb_write_stream` definition to "odb/streaming.h". Call sites are
updated accordingly.
Signed-off-by: Justin Tobler <jltobler@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>