streaming: make the `odb_read_stream` definition public
Subsequent commits will move the backend-specific logic of setting up an
object read stream into the specific subsystems. As the backends are now the ones responsible for allocating the stream, they'll need to have the stream definition available to them.
Make the stream definition public to prepare for this.
Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Subsequent commits will move the backend-specific logic of object
streaming into their respective subsystems. These subsystems have gotten
rid of `the_repository` already, but we still use it in two locations in
the streaming subsystem.
Prepare for the move by fixing those two cases. Converting the logic in `open_istream_pack_non_delta()` is trivial, as that function already receives the object database as input.
But for `stream_blob_to_fd()` we have to add a new parameter to make the object database accessible. So, as we already have to adjust all callers anyway, rename
the function to `odb_stream_blob_to_fd()` to indicate it's part of the
object subsystem.
Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
streaming: rely on object sources to create object stream
When creating an object stream we first look up the object info and, if
it's present, we call into the respective backend that contains the
object to create a new stream for it.
This has the consequence that, for the loose object source, we basically iterate through the object sources twice: we first discover that the object exists as a loose object by iterating through all sources, and then, once we have discovered it, we walk through all sources again to try and map the object. The same issue will eventually also surface once the packfile store becomes per-object-source.
Furthermore, it feels rather pointless to first look up the object only
to then try and read it.
Refactor the logic to be centered around sources instead. Instead of
first reading the object, we immediately ask the source to create the
object stream for us. If the object exists we get a stream; otherwise we'll try the next source.
Like this we only have to iterate through sources once. But even more
importantly, this change also helps us to make the whole logic
pluggable. The object read stream subsystem does not need to be aware of
the different source backends anymore, but eventually it'll only have to
call the source's callback function.
Note that at the current point in time we aren't fully there yet:
- The packfile store still sits on the object database level and is
thus agnostic of the sources.
- We still have to call into both the packfile store and the loose
object source.
But both of these issues will soon be addressed.
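A minimal sketch of the per-source loop described above; the helper names are assumptions, not the actual API:

    /* Iterate sources once; each source tries to create the stream itself. */
    struct odb_read_stream *stream = NULL;
    struct odb_source *source;

    for (source = odb->sources; !stream && source; source = source->next) {
            /* Within a single source, packed still wins over loose. */
            stream = source_open_stream_packed(source, oid); /* hypothetical */
            if (!stream)
                    stream = source_open_stream_loose(source, oid); /* hypothetical */
    }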
This refactoring results in a slight change to semantics: previously, it
was `odb_read_object_info_extended()` that picked the source for us, and
it would have favored packed (non-deltified) objects over loose objects.
And while we still favor packed over loose objects for a single source
with the new logic, we'll now favor a loose object from an earlier
source over a packed object from a later source.
Ultimately this shouldn't matter though: the stream doesn't indicate to
the caller which source it is from and whether it was created from a
packed or loose object, so such details are opaque to the caller. And
other than that we should be able to assume that two objects with the
same object ID should refer to the same content, so the streamed data
would be the same, too.
Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
packfile: introduce function to read object info from a store
Extract the logic to read object info for a packed object from `do_oid_object_info_extended()` into a standalone function that operates on the packfile store. This function will be used in a subsequent commit.
Note that this change allows us to make `find_pack_entry()` an internal
implementation detail. As a consequence though we have to move around
`packfile_store_freshen_object()` so that it is defined after that
function.
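The extracted helper might be shaped roughly like this (name and signature are assumptions):

    int packfile_store_read_object_info(struct packfile_store *store,
                                        const struct object_id *oid,
                                        struct object_info *oi,
                                        unsigned flags);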
Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
streaming: create structure for filtered object streams
As explained in a preceding commit, we want to get rid of the union of
stream-type specific data in `struct odb_read_stream`. Create a new
structure for filtered object streams to move towards this design.
Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
streaming: create structure for packed object streams
As explained in a preceding commit, we want to get rid of the union of
stream-type specific data in `struct odb_read_stream`. Create a new
structure for packed object streams to move towards this design.
Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
streaming: create structure for loose object streams
As explained in a preceding commit, we want to get rid of the union of
stream-type specific data in `struct odb_read_stream`. Create a new
structure for loose object streams to move towards this design.
Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
streaming: create structure for in-core object streams
As explained in a preceding commit, we want to get rid of the union of
stream-type specific data in `struct odb_read_stream`. Create a new
structure for in-core object streams to move towards this design.
Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
streaming: allocate stream inside the backend-specific logic
When creating a new stream we first allocate it and then call into
backend-specific logic to populate the stream. This design requires that
the stream itself contains a `union` with backend-specific members that
then ultimately get populated by the backend-specific logic.
This works, but it's awkward in the context of pluggable object
databases. Each backend will need its own member in that union, and as the structure itself is completely opaque (it's only defined in "streaming.c"), it also has the consequence that the backend-specific logic must live in "streaming.c".
Ideally though, the infrastructure would be reversed: we have a generic
`struct odb_read_stream` and some helper functions in "streaming.c",
whereas the backend-specific logic sits in the backend's subsystem
itself.
This can be realized by using a design that is similar to how we handle
reference databases: instead of having a union of members, we instead
have backend-specific structures with a `struct odb_read_stream base`
as its first member. The backends would thus hand out the pointer to the
base, but internally they know to cast back to the backend-specific
type.
This means though that we need to allocate different structures
depending on the backend. To prepare for this, move allocation of the
structure into the backend-specific functions that open a new stream.
Subsequent commits will then create those new backend-specific structs.
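A sketch of the resulting layout, mirroring the ref-backend pattern described above; member and function bodies are illustrative only:

    struct odb_read_stream_loose {
            struct odb_read_stream base; /* must be the first member */
            /* ... loose-specific state, e.g. the zlib stream ... */
    };

    static struct odb_read_stream *open_istream_loose(struct odb_source *source,
                                                      const struct object_id *oid)
    {
            struct odb_read_stream_loose *st = xcalloc(1, sizeof(*st));
            /* ... populate st->base and the loose-specific members ... */
            return &st->base; /* hand out the pointer to the base */
    }

    static ssize_t read_istream_loose(struct odb_read_stream *stream,
                                      char *buf, size_t sz)
    {
            /* The backend knows to cast back to its own type. */
            struct odb_read_stream_loose *st =
                    (struct odb_read_stream_loose *)stream;
            /* ... */
            return 0; /* placeholder */
    }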
Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
streaming: explicitly pass packfile info when streaming a packed object
When streaming a packed object we first populate the stream with
information about the pack that contains the object before calling
`open_istream_pack_non_delta()`. This is done because we have already
looked up both the pack and the object's offset, so it would be a waste
of time to look up this information again.
But the way this is done makes for a somewhat awkward calling interface,
as the caller now needs to be aware of how exactly the function itself
behaves.
Refactor the code so that we instead explicitly pass the packfile info
into `open_istream_pack_non_delta()`. This makes the calling convention
explicit, but more importantly this allows us to refactor the function
so that it becomes its responsibility to allocate the stream itself in a
subsequent patch.
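The explicit calling convention might look roughly like this (parameter names are assumptions):

    /* The caller hands in the pack and offset it has already looked up. */
    static int open_istream_pack_non_delta(struct odb_read_stream *st,
                                           struct packed_git *pack,
                                           off_t offset);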
Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
streaming: propagate final object type via the stream
When opening the read stream for a specific object the caller is also
expected to pass in a pointer to the object type. This type is passed
down via multiple levels and will eventually be populated with the type
of the looked-up object.
The way we propagate down the pointer though is somewhat non-obvious.
While `istream_source()` still expects the pointer and looks it up via
`odb_read_object_info_extended()`, we also pass it down even further
into the format-specific callbacks that perform another lookup. This is
quite confusing overall.
Refactor the code so that the responsibility to populate the object type
rests solely with the format-specific callbacks. This will allow us to
drop the call to `odb_read_object_info_extended()` in `istream_source()`
entirely in a subsequent patch.
Furthermore, instead of propagating the type via an out-pointer, we now propagate it via a new field in the object stream. The stream already has a `size` field, so it's only natural to have a second field that contains the object type.
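Sketched as a struct, the stream would then carry the type right next to the size (exact layout is an assumption):

    struct odb_read_stream {
            enum object_type type; /* populated by the format-specific open */
            unsigned long size;
            /* ... */
    };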
Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
When creating a read stream we first populate the structure with the open callback function and then call that function. This layout is somewhat weird though:
- The structure needs to be allocated and partially populated with the
open function before we can properly initialize it.
- We only ever call the `open()` callback function right after having populated the `struct odb_read_stream::open` member, and it is never called again. So it is somewhat pointless to store the callback in the first place.
Especially the first point creates a problem for us. In subsequent commits we'll want to fully move construction of the read stream into the respective object sources. E.g., the loose object source will be the one that is responsible for creating the structure. But this creates a problem: if we first need to create the structure so that we can call the source-specific callback, we cannot fully handle creation of the structure in the source itself.
We could of course work around that and have the loose object source create the structure and populate only its `open()` callback. But this doesn't really buy us anything due to the second bullet point above.
Instead, drop the callback entirely and refactor `istream_source()` so
that we open the streams immediately. This unblocks a subsequent step,
where we'll also start to allocate the structure in the source-specific
logic.
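A sketch of how `istream_source()` might open the streams directly; the details are assumptions:

    /* inside istream_source(), after the object info lookup: */
    switch (oi.whence) {
    case OI_LOOSE:
            return open_istream_loose(source, oid);
    case OI_PACKED:
            return open_istream_pack_non_delta(oi.u.packed.pack,
                                               oi.u.packed.offset);
    default:
            return NULL;
    }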
Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
streaming: rename `git_istream` into `odb_read_stream`
In the following patches we are about to make the `git_istream` more
generic so that it becomes fully controlled by the specific object
source that wants to create it. As part of these refactorings we'll
fully move the structure into the object database subsystem.
Prepare for this change by renaming the structure from `git_istream`
to `odb_read_stream`. This mirrors the `odb_write_stream` structure that
we already have.
Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Junio C Hamano [Mon, 17 Nov 2025 15:00:12 +0000 (07:00 -0800)]
Merge branch 'jc/ci-use-arm64-p4-on-macos'
We replaced the deprecated macos-13 image with the macos-14 image in GitHub Actions CI, but we forgot that the new image is arm64-based. We have been seeing a lot of test failures ever since. Switch to the arm64 binary for Perforce tests.
* jc/ci-use-arm64-p4-on-macos:
Use Perforce arm64 binary on macOS CI jobs
Junio C Hamano [Sun, 16 Nov 2025 23:10:28 +0000 (15:10 -0800)]
Use Perforce arm64 binary on macOS CI jobs
The previous step replaced the deprecated macos-13 image with the macos-14 image on GitHub Actions CI. While x86-64 binaries can work there, macos-14 images are arm64-based (we could have replaced it with macos-14-large, which is x86-64), so it makes more sense to use the arm64 binary. Without this change, we have been getting an unusually high rate of failures from random macOS CI jobs failing to run the t98xx series of tests.
Helped-by: Koji Nakamaru <koji.nakamaru@gree.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
A recently added configuration variable and command line option syntax ":(optional)" for values of filename type behaved inconsistently on an empty file (configuration took it happily, while the command line option pretended the file did not exist), which has been corrected.
* dk/parseopt-optional-filename-fixes:
parseopt: remove unreachable code
parseopt: restore const qualifier to parsed filename
config: use boolean type for a simple flag
parseopt: use boolean type for a simple flag
doc: clarify command equivalence comment
parseopt: fix :(optional) at command line to only ignore missing files
Junio C Hamano [Thu, 6 Nov 2025 23:17:01 +0000 (15:17 -0800)]
Merge branch 'cc/fast-import-export-i18n-cleanup'
Messages from fast-import/export are now marked for i18n.
* cc/fast-import-export-i18n-cleanup:
gpg-interface: mark a string for translation
fast-import: mark strings for translation
fast-export: mark strings for translation
gpg-interface: use left shift to define GPG_VERIFY_*
gpg-interface: simplify ssh fingerprint parsing
Bumps `actions/upload-artifact` from 4 to 5.
- [Release notes](https://github.com/actions/upload-artifact/releases)
- [Commits](https://github.com/actions/upload-artifact/compare/v4...v5)
Bumps `actions/download-artifact` from 5 to 6.
- [Release notes](https://github.com/actions/download-artifact/releases)
- [Commits](https://github.com/actions/download-artifact/compare/v5...v6)
Originally-authored-by: dependabot[bot] <support@github.com> Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
D. Ben Knoble [Tue, 4 Nov 2025 13:58:29 +0000 (08:58 -0500)]
meson: make GIT_HTML_PATH configurable
Makefile-based builds can configure Git's internal HTML_PATH by defining
htmldir, which is useful for packagers that put documentation in
different locations. Gentoo, for example, uses version-suffixed
directories like ${prefix}/share/doc/git-2.51 and puts the HTML
documentation in an 'html' subdirectory of the same.
Propagate the same configuration knob to Meson-based builds so that
"git --html-path" on such systems can be configured to output the
correct directory.
Signed-off-by: D. Ben Knoble <ben.knoble+github@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
D. Ben Knoble [Tue, 4 Nov 2025 18:14:57 +0000 (13:14 -0500)]
perl: also mark git-contacts executable
When installing git-contacts with Meson via -Dcontrib=contacts, the default
Perl generation fails to mark it executable. As a result, "git contacts"
reports "'contacts' is not a git command."
Unlike generate-script.sh, we aren't testing the basename here, so glob the script name in the case arm to match wherever the input comes from.
Signed-off-by: D. Ben Knoble <ben.knoble+github@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Thomas Uhle [Wed, 5 Nov 2025 19:55:19 +0000 (20:55 +0100)]
wincred: align Makefile with other Makefiles in contrib
* Replace $(LOADLIBES) because it has long been deprecated and is used nowhere else in the Git project.
* Use $(gitexecdir) instead of $(libexecdir) because config.mak defines $(libexecdir) as $(prefix)/libexec, not as $(prefix)/libexec/git-core.
* Similar to other Makefiles, let the install target rule create $(gitexecdir) to make sure the directory exists before copying the executable, and also let it respect $(DESTDIR).
* Shuffle the lines for the default settings to align them with the other Makefiles in contrib/credential.
* Define .PHONY for all special targets (all, install, clean).
Signed-off-by: Thomas Uhle <thomas.uhle@mailbox.tu-dresden.de> Acked-by: Johannes Schindelin <johannes.schindelin@gmx.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Junio C Hamano [Tue, 4 Nov 2025 17:34:20 +0000 (09:34 -0800)]
parseopt: remove unreachable code
At this point in the code after running skip_prefix() on the
variable and receiving the result in the same variable, the contents
of the variable can never be NULL. The function either (1) updates
the variable to point at a later part of the string it originally
pointed at, or (2) leaves it intact if the string does not have the
prefix. (1) will never make the variable NULL, and (2) cannot be
the source of NULL, because the variable cannot be NULL before
calling skip_prefix(), which would die immediately by dereferencing
the NULL pointer in that case.
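For illustration, the pattern in question looks like this (variable names are hypothetical):

    const char *arg = optarg; /* cannot be NULL here */
    skip_prefix(arg, ":(optional)", &arg);
    /*
     * skip_prefix() either advances `arg` past the prefix or leaves it
     * untouched; neither outcome can produce NULL. And if `arg` had been
     * NULL to begin with, skip_prefix() itself would already have crashed
     * dereferencing it. So a later `if (!arg)` check is dead code.
     */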
Helped-by: Phillip Wood <phillip.wood@dunelm.org.uk> Signed-off-by: Junio C Hamano <gitster@pobox.com>
D. Ben Knoble [Sun, 2 Nov 2025 16:17:48 +0000 (11:17 -0500)]
parseopt: restore const qualifier to parsed filename
This was unintentionally dropped in ccfcaf399f (parseopt: values of pathname type can be prefixed with :(optional), 2025-09-28). Notably, continue dropping the const qualifier when free'ing the value; see 4049b9cfc0 (fix const issues with some functions, 2007-10-16) or 83838d5c1b (cast variable in call to free() in builtin/diff.c and submodule.c, 2011-11-06) for more details on why.
Suggested-by: Phillip Wood <phillip.wood@dunelm.org.uk> Signed-off-by: D. Ben Knoble <ben.knoble+github@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
D. Ben Knoble [Sun, 2 Nov 2025 16:17:47 +0000 (11:17 -0500)]
config: use boolean type for a simple flag
Suggested-by: Phillip Wood <phillip.wood@dunelm.org.uk> Signed-off-by: D. Ben Knoble <ben.knoble+github@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
D. Ben Knoble [Sun, 2 Nov 2025 16:17:46 +0000 (11:17 -0500)]
parseopt: use boolean type for a simple flag
Suggested-by: Phillip Wood <phillip.wood@dunelm.org.uk> Signed-off-by: D. Ben Knoble <ben.knoble+github@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
D. Ben Knoble [Sun, 2 Nov 2025 16:17:45 +0000 (11:17 -0500)]
doc: clarify command equivalence comment
Documentation of command parsing for :(optional) includes a terse
comment; expand it to be clearer to readers.
Suggested-by: Phillip Wood <phillip.wood@dunelm.org.uk> Signed-off-by: D. Ben Knoble <ben.knoble+github@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
D. Ben Knoble [Sun, 2 Nov 2025 16:17:44 +0000 (11:17 -0500)]
parseopt: fix :(optional) at command line to only ignore missing files
Unlike the configuration option magic, the parseopt code also ignores empty files: compare the implementations from ccfcaf399f (parseopt: values of pathname type can be prefixed with :(optional), 2025-09-28) and 749d6d166d (config: values of pathname type can be prefixed with :(optional), 2025-09-28).
Unify the two by not ignoring empty files, which is less surprising and the intended semantics from the first patch for config.
Suggested-by: Phillip Wood <phillip.wood@dunelm.org.uk> Signed-off-by: D. Ben Knoble <ben.knoble+github@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Junio C Hamano [Tue, 4 Nov 2025 15:48:07 +0000 (07:48 -0800)]
Merge branch 'qj/doc-my1stcontrib-email-verify'
The "MyFirstContribution" tutorial tells the reader how to send out
their patches; the section gained a hint to verify the message
reached the mailing list.
* qj/doc-my1stcontrib-email-verify:
MyFirstContribution: add note on confirming patches
Junio C Hamano [Tue, 4 Nov 2025 15:48:07 +0000 (07:48 -0800)]
Merge branch 'tz/test-prepare-gnupghome'
Tests did not set up GNUPGHOME correctly, which has been fixed, but this exposed some flaky tests in t1016 that need to be addressed before this topic can move forward.
* tz/test-prepare-gnupghome:
t/lib-gpg: call prepare_gnupghome() in GPG2 prereq
t/lib-gpg: add prepare_gnupghome() to create GNUPGHOME dir
object-file: refactor writing objects via a stream
We have two different ways to write an object into the database:
- We either provide the full buffer and write the object all at once.
- Or we provide an input stream that has a `read()` function so that we can write the object in chunks.
The latter is especially used for large objects, where it may be too
expensive to hold the complete object in memory all at once.
While we already have `odb_write_object()` at the ODB-layer, we don't
have an equivalent for streaming an object. Introduce a new function
`odb_write_object_stream()` to address this gap so that callers don't
have to be aware of the inner workings of how to stream an object to
disk with a specific object source.
Rename `stream_loose_object()` to `odb_source_loose_write_stream()` to
clarify its scope. This matches our modern best practices around how to
name functions.
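The new entry point might be shaped like this; the signature is an assumption, with `struct input_stream` being the read-callback abstraction mentioned above:

    /* Hypothetical prototype, mirroring odb_write_object() at the ODB layer. */
    int odb_write_object_stream(struct object_database *odb,
                                struct input_stream *in_stream,
                                size_t len, struct object_id *oid);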
Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Rename `write_object_file()` to `odb_source_loose_write_object()` so
that it becomes clear that this is tied to a specific loose object
source. This matches our modern naming schema for functions.
Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
When writing an object that already exists in our object database we
skip the write and instead only update mtimes of the object, either in
its packed or loose object format. This logic is wholly contained in
"object-file.c", but that file is really only concerned with loose
objects. So it does not really make sense that it also contains the
logic to freshen a packed object.
Introduce a new `odb_freshen_object()` function that sits at the object database level, as well as two format-specific functions, `packfile_store_freshen_object()` and `odb_source_loose_freshen_object()`. Like this, the format-specific functions can be part of their respective subsystems, while the backend-agnostic function to freshen an object sits at the object database layer.
Note that this change also moves the logic that iterates through object
sources from the object source layer into the object database layer.
This change is intentional: object sources should ideally only have to
worry about themselves, and coordination of different sources should be
handled on the object database level.
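A condensed sketch of the resulting dispatch; the loop shape follows the text above, while member names and signatures are assumptions:

    int odb_freshen_object(struct object_database *odb,
                           const struct object_id *oid)
    {
            struct odb_source *source;

            /* The packfile store still sits at the ODB level for now. */
            if (packfile_store_freshen_object(odb->packfiles, oid))
                    return 1;
            for (source = odb->sources; source; source = source->next)
                    if (odb_source_loose_freshen_object(source, oid))
                            return 1;
            return 0;
    }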
Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Rename `has_loose_object()` to `odb_source_loose_has_object()` so that
it becomes clear that this is tied to a specific loose object source.
This matches our modern naming schema for functions.
Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
object-file: read objects via the loose object source
When reading an object via `loose_object_info()` or `map_loose_object()`
we hand in the whole repository. We then iterate through each of the
object sources to figure out whether that source has the object in
question.
This logic reverses responsibility though: a specific backend should only care about one specific source, while the object sources themselves are managed by the object database.
Refactor the code accordingly by passing an object source to both of these functions instead. The different sources are then handled by `do_oid_object_info_extended()`, which sits at the object database level, and by `open_istream_loose()`. The latter function arguably is still at the wrong level, but this will be cleaned up at a later point in time.
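The adjusted signatures might then look as follows (sketch; exact parameter lists are assumptions):

    static int loose_object_info(struct odb_source *source,
                                 const struct object_id *oid,
                                 struct object_info *oi, int flags);
    static void *map_loose_object(struct odb_source *source,
                                  const struct object_id *oid,
                                  unsigned long *size);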
Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
object-file: move loose object map into loose source
The loose object map is used to map from the repository's canonical
object hash to the compatibility hash. As the name indicates, this map
is only used for loose objects, and as such it is tied to a specific
loose object source.
Same as with preceding commits, move this map into the loose object
source accordingly.
Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
object-file: hide internals when we need to reprepare loose sources
There are two different situations where we have to clear the cache of
loose objects:
- When freeing the loose object source itself to avoid memory leaks.
- When repreparing the loose object source so that any potentially stale data gets evicted from the cache.
The former is already handled by `odb_source_loose_free()`. But the latter case is still done manually in `odb_reprepare()`, so we are leaking internals into that code.
Introduce a new `odb_source_loose_reprepare()` function as an equivalent
to `packfile_store_prepare()` to hide these implementation details.
Furthermore, while at it, rename the function `odb_clear_loose_cache()`
to `odb_source_loose_clear()`.
Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
object-file: move loose object cache into loose source
Our loose objects use a cache that (optionally) stores all objects for
each of the opened sharding directories. This cache is located in the
`struct odb_source`, but now that we have `struct odb_source_loose` it
makes sense to move it into the latter structure so that all state that
relates to loose objects is entirely self-contained.
Do so. While at it, rename corresponding functions to have a prefix that
relates to `struct odb_source_loose`.
Note that despite this prefix, the functions still accept a `struct
odb_source` as input. This is done intentionally: once we introduce
pluggable object databases, we will continue to accept this struct but
then do a cast inside these functions to `struct odb_source_loose`. This
design is similar to how we do it for our ref backends.
Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Currently, all state that relates to loose objects is held directly by
the `struct odb_source`. Introduce a new `struct odb_source_loose` to
hold the state instead so that it is entirely self-contained.
This structure will eventually morph into the backend for accessing
loose objects. As such, this is part of the refactorings to introduce
pluggable object databases.
Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
The `fetch_if_missing` global variable is declared in "object-file.h"
but defined in "odb.c". The variable relates to the whole object
database instead of only loose objects, so move the declaration into
"odb.h" accordingly.
Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
The functions `free_object_directory()` and `free_object_directories()`
are responsible for freeing a single object source or all object sources
connected to an object database, respectively. The associated structure
has been renamed from `struct object_directory` to `struct odb_source`
in a1e2581a1e (object-store: rename `object_directory` to `odb_source`,
2025-07-01) though, so the names are somewhat stale nowadays.
Rename them to mention the new struct name instead. Furthermore, while
at it, adapt them to our modern naming schema where we first have the
subject followed by a verb.
Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
odb: fix subtle logic to check whether an alternate is usable
When adding an alternate to the object database we first check whether
or not the path is usable. A path is usable if:
- It actually exists.
- We don't have it in our object sources yet.
While the former check is trivial enough, the latter part is somewhat subtle and prone to bugs. This is because the function doesn't only check whether or not the given path is usable: if it _is_ usable, it also stores that path in the map of object sources immediately.
The tricky part here is that the path that gets stored in the map is
_not_ copied. Instead, we rely on the fact that subsequent code uses
`strbuf_detach()` to store the exact same allocated memory in the
created object source. Consequently, the memory is owned by the source
but _also_ stored in the map. This subtlety is easy to miss, so if one
decides to refactor this code one can easily end up breaking this
mechanism.
Make the relationship more explicit by not storing the path as part of
`alt_odb_usable()`. Instead, store the path after we have created the
source so that we can use the source's path pointer directly.
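In outline, the fixed ordering looks like this (the map helper is a hypothetical stand-in):

    struct odb_source *source = xcalloc(1, sizeof(*source));
    source->path = strbuf_detach(&pathbuf, NULL); /* the source owns the string */
    /* Only now record the path in the map, reusing the source-owned pointer: */
    register_source_path(odb, source->path, source); /* hypothetical helper */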
Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Toon Claes [Thu, 23 Oct 2025 07:50:14 +0000 (09:50 +0200)]
last-modified: implement faster algorithm
The current implementation of git-last-modified(1) works by doing a
revision walk, and inspecting the diff at each level of that walk to
annotate entries remaining in the hashmap of paths. In other words, if
the diff at some level touches a path which has not yet been associated
with a commit, then that commit becomes associated with the path.
While a perfectly reasonable implementation, it can perform poorly in
either one of two scenarios:
1. There are many entries of interest, in which case there is simply
a lot of work to do.
2. Or, there are (even a few) entries which have not been updated in a
long time, and so we must walk through a lot of history in order to
find a commit that touches that path.
This patch rewrites the last-modified implementation in a way that addresses the second point. The idea behind the algorithm is to propagate a set of
'active' paths (a path is 'active' if it does not yet belong to a
commit) up to parents and do a truncated revision walk.
The walk is truncated because it does not produce a revision for every
change in the original pathspec, but rather only for active paths.
More specifically, consider a priority queue of commits sorted by
generation number. First, enqueue the set of boundary commits with all
paths in the original spec marked as interesting.
Then, while the queue is not empty, do the following:
1. Pop an element, say, 'c', off of the queue, making sure that 'c'
isn't reachable by anything in the '--not' set.
2. For each parent 'p' (with index 'parent_i') of 'c', do the
following:
a. Compute the diff between 'c' and 'p'.
b. Pass any active paths that are TREESAME from 'c' to 'p'.
c. If 'p' has any active paths, push it onto the queue.
3. Any path that remains active on 'c' is associated to that commit.
This ends up being equivalent to doing something like 'git log -1 --
$path' for each path simultaneously. But, it allows us to go much faster
than the original implementation by limiting the number of diffs we
compute, since we can avoid parts of history that would have been
considered by the revision walk in the original implementation, but are
known to be uninteresting to us because we have already marked all paths
in that area to be inactive.
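A condensed sketch of the walk described above; every helper below is a hypothetical stand-in for the real implementation:

    struct commit; /* opaque for this sketch */
    extern int queue_empty(void);
    extern struct commit *queue_pop(void);  /* pops by generation number */
    extern void queue_push(struct commit *);
    extern int reachable_from_not_set(struct commit *);
    extern int parent_count(struct commit *);
    extern struct commit *parent(struct commit *, int);
    extern void pass_treesame_active_paths(struct commit *, struct commit *);
    extern int has_active_paths(struct commit *);
    extern void associate_active_paths(struct commit *);

    static void truncated_walk(void)
    {
            while (!queue_empty()) {
                    struct commit *c = queue_pop();              /* step 1 */
                    if (reachable_from_not_set(c))
                            continue;
                    for (int i = 0; i < parent_count(c); i++) {  /* step 2 */
                            struct commit *p = parent(c, i);
                            pass_treesame_active_paths(c, p);    /* 2a + 2b */
                            if (has_active_paths(p))
                                    queue_push(p);               /* 2c */
                    }
                    associate_active_paths(c);                   /* step 3 */
            }
    }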
To avoid computing many first-parent diffs, add another trick on top of
this and check if all paths active in 'c' are DEFINITELY NOT in c's
Bloom filter. Since the commit-graph only stores first-parent diffs in
the Bloom filters, we can only apply this trick to first-parent diffs.
Comparing the performance of this new algorithm shows about a 2.5x
improvement on git.git:
Benchmark 1: master no bloom
Time (mean ± σ): 2.868 s ± 0.023 s [User: 2.811 s, System: 0.051 s]
Range (min … max): 2.847 s … 2.926 s 10 runs
Benchmark 2: master with bloom
Time (mean ± σ): 949.9 ms ± 15.2 ms [User: 907.6 ms, System: 39.5 ms]
Range (min … max): 933.3 ms … 971.2 ms 10 runs
Benchmark 3: HEAD no bloom
Time (mean ± σ): 782.0 ms ± 6.3 ms [User: 740.7 ms, System: 39.2 ms]
Range (min … max): 776.4 ms … 798.2 ms 10 runs
Benchmark 4: HEAD with bloom
Time (mean ± σ): 307.1 ms ± 1.7 ms [User: 276.4 ms, System: 29.9 ms]
Range (min … max): 303.7 ms … 309.5 ms 10 runs
Summary
HEAD with bloom ran
2.55 ± 0.02 times faster than HEAD no bloom
3.09 ± 0.05 times faster than master with bloom
9.34 ± 0.09 times faster than master no bloom
In short, the existing implementation is comparably fast *with* Bloom
filters as the new implementation is *without* Bloom filters. So, most
repositories should get a dramatic speed-up by just deploying this (even
without computing Bloom filters), and all repositories should get faster
still when computing Bloom filters.
When comparing a more extreme example, `git last-modified -- COPYING t`, the improvement is even larger, about 5x:
Benchmark 1: master
Time (mean ± σ): 4.372 s ± 0.057 s [User: 4.286 s, System: 0.062 s]
Range (min … max): 4.308 s … 4.509 s 10 runs
Benchmark 2: HEAD
Time (mean ± σ): 826.3 ms ± 22.3 ms [User: 784.1 ms, System: 39.2 ms]
Range (min … max): 810.6 ms … 881.2 ms 10 runs
Summary
HEAD ran
5.29 ± 0.16 times faster than master
As an added benefit, results are more consistent now: with the changes in this patch the results of git-last-modified(1) always match those of `git log --max-count=1`.
One thing to note though: the results might be output in a different order than before. This is not considered to be an issue because the order is not documented to be guaranteed anywhere.
Based-on-patches-by: Derrick Stolee <stolee@gmail.com> Based-on-patches-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Toon Claes <toon@iotcl.com> Acked-by: Taylor Blau <me@ttaylorr.com>
[jc: tweaked use of xcalloc() to unbreak coccicheck] Signed-off-by: Junio C Hamano <gitster@pobox.com>
Junio C Hamano [Mon, 3 Nov 2025 14:49:55 +0000 (06:49 -0800)]
Merge branch 'jk/diff-patch-dry-run-cleanup'
Finishing touches to fixes for the recent regression in "git diff -w --quiet" and anything that needs to internally generate a patch to see if it turns out empty.
* jk/diff-patch-dry-run-cleanup:
diff: simplify run_external_diff() quiet logic
diff: drop dry-run redirection to /dev/null
diff: replace diff_options.dry_run flag with NULL file
diff: drop save/restore of color_moved in dry-run mode
diff: send external diff output to diff_options.file
Junio C Hamano [Mon, 3 Nov 2025 14:49:55 +0000 (06:49 -0800)]
Merge branch 'ps/maintenance-geometric'
"git maintenance" command learns the "geometric" strategy where it
avoids doing maintenance tasks that rebuilds everything from
scratch.
* ps/maintenance-geometric:
t7900: fix a flaky test due to git-repack always regenerating MIDX
builtin/maintenance: introduce "geometric" strategy
builtin/maintenance: make "gc" strategy accessible
builtin/maintenance: extend "maintenance.strategy" to manual maintenance
builtin/maintenance: run maintenance tasks depending on type
builtin/maintenance: improve readability of strategies
builtin/maintenance: don't silently ignore invalid strategy
builtin/maintenance: make the geometric factor configurable
builtin/maintenance: introduce "geometric-repack" task
builtin/gc: make `too_many_loose_objects()` reusable without GC config
builtin/gc: remove global `repack` variable
Junio C Hamano [Mon, 3 Nov 2025 14:49:54 +0000 (06:49 -0800)]
Merge branch 'rs/add-patch-quit'
The 'q'(uit) command in "git add -p" has been improved to quit
without doing any meaningless work before leaving, and giving EOF
(typically control-D) to the prompt is made to behave the same way.
* rs/add-patch-quit:
add-patch: quit on EOF
add-patch: quit without skipping undecided hunks
Junio C Hamano [Thu, 30 Oct 2025 15:00:20 +0000 (08:00 -0700)]
Merge branch 'kf/log-shortlog-completion-fix'
"git shortlog" knows "--committer" and "--author" options, which
the command line completion (in contrib/) did not handle well,
which has been corrected.
* kf/log-shortlog-completion-fix:
completion: complete some 'git log' options