hash: convert `oidcmp()` and `oideq()` to compare whole hash
With the preceding commit, the hash array of object IDs is now fully
zero-padded even when the hash algorithm's output is smaller than the
array length. With that, we can now adapt both `oidcmp()` and `oideq()`
to unconditionally memcmp(3P) the whole array instead of depending on
the hash size.
While it may feel inefficient to compare unused bytes for e.g. SHA-1, in
practice the compiler should now be able to produce code that is better
optimized both because we have no branch anymore, but also because the
size to compare is now known at compile time. Goldbolt spits out the
following assembly on an x86_64 platform with GCC 14.1 for the old and
new implementations of `oidcmp()`:
Interestingly, the compiler decides to split the comparisons into two so
that it first compares the lower half of the object ID for equality and
then the upper half. If the first check shows a difference, then we
wouldn't even end up comparing the second half.
In both cases, the new generated code is significantly shorter and has
way less branches. While I didn't benchmark the change, I'd be surprised
if the new code was slower.
Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
The `oidcmp()` and `oideq()` functions only compare the prefix length as
specified by the given hash algorithm. This mandates that the object IDs
have a valid hash algorithm set, or otherwise we wouldn't be able to
figure out that prefix. As we do not have a hash algorithm in many
cases, for example when handling null object IDs, this assumption cannot
always be fulfilled. We thus have a fallback in place that instead uses
`the_repository` to derive the hash function. This implicit dependency
is hidden away from callers and can be quite surprising, especially in
contexts where there may be no repository.
In theory, we can adapt those functions to always memcmp(3P) the whole
length of their hash arrays. But there exist a couple of sites where we
populate `struct object_id`s such that only the prefix of its hash that
is actually used by the hash algorithm is populated. The remaining bytes
are left uninitialized. The fact that those bytes are uninitialized also
leads to warnings under Valgrind in some places where we copy those
bytes.
Refactor callsites where we populate object IDs to always initialize all
bytes. This also allows us to get rid of `oidcpy_with_padding()`, for
one because the input is now fully initialized, and because `oidcpy()`
will now always copy the whole hash array.
Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
hash: require hash algorithm in `oidread()` and `oidclr()`
Both `oidread()` and `oidclr()` use `the_repository` to derive the hash
function that shall be used. Require callers to pass in the hash
algorithm to get rid of this implicit dependency.
Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
hash: require hash algorithm in `hasheq()`, `hashcmp()` and `hashclr()`
Many of our hash functions have two variants, one receiving a `struct
git_hash_algo` and one that derives it via `the_repository`. Adapt all
of those functions to always require the hash algorithm as input and
drop the variants that do not accept one.
As those functions are now independent of `the_repository`, we can move
them from "hash.h" to "hash-ll.h".
Note that both in this and subsequent commits in this series we always
just pass `the_repository->hash_algo` as input even if it is obvious
that there is a repository in the context that we should be using the
hash from instead. This is done to be on the safe side and not introduce
any regressions. All callsites should eventually be amended to use a
repo passed via parameters, but this is outside the scope of this patch
series.
Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
hash: drop (mostly) unused `is_empty_{blob,tree}_sha1()` functions
The functions `is_empty_{blob,tree}_sha1()` are mostly unused, except
for a single callsite in "read-cache.c". Most callsites have long since
been converted to use the equivalents that accept a `struct object_id`
instead of a string.
Adapt the remaining callsite and drop those functions.
Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Junio C Hamano [Thu, 13 Jun 2024 16:39:08 +0000 (09:39 -0700)]
Merge branch 'ps/ref-storage-migration' into ps/use-the-repository
* ps/ref-storage-migration:
builtin/refs: new command to migrate ref storage formats
refs: implement logic to migrate between ref storage formats
refs: implement removal of ref storages
worktree: don't store main worktree twice
reftable: inline `merged_table_release()`
refs/files: fix NULL pointer deref when releasing ref store
refs/files: extract function to iterate through root refs
refs/files: refactor `add_pseudoref_and_head_entries()`
refs: allow to skip creation of reflog entries
refs: pass storage format to `ref_store_init()` explicitly
refs: convert ref storage format to an enum
setup: unset ref storage when reinitializing repository version
Junio C Hamano [Wed, 12 Jun 2024 20:37:14 +0000 (13:37 -0700)]
Merge branch 'cp/reftable-unit-test'
Basic unit tests for reftable have been reimplemented under the
unit test framework.
* cp/reftable-unit-test:
t: improve the test-case for parse_names()
t: add test for put_be16()
t: move tests from reftable/record_test.c to the new unit test
t: move tests from reftable/stack_test.c to the new unit test
t: move reftable/basics_test.c to the unit testing framework
t/: migrate helper/test-oidtree.c to unit-tests/t-oidtree.c
helper/test-oidtree.c along with t0069-oidtree.sh test the oidtree.h
library, which is a wrapper around crit-bit tree. Migrate them to
the unit testing framework for better debugging and runtime
performance. Along with the migration, add an extra check for
oidtree_each() test, which showcases how multiple expected matches can
be given to check_each() helper.
To achieve this, introduce a new library called 'lib-oid.h'
exclusively for the unit tests to use. It currently mainly includes
utility to generate object_id from an arbitrary hex string
(i.e. '12a' -> '12a0000000000000000000000000000000000000'). This also
handles the hash algo selection based on GIT_TEST_DEFAULT_HASH.
This library will also be helpful when we port other unit tests such
as oid-array, oidset etc.
Helped-by: Junio C Hamano <gitster@pobox.com> Mentored-by: Christian Couder <chriscool@tuxfamily.org> Mentored-by: Kaartic Sivaraam <kaartic.sivaraam@gmail.com> Signed-off-by: Ghanshyam Thakkar <shyamthakkar001@gmail.com>
[jc: small fixlets squashed in] Signed-off-by: Junio C Hamano <gitster@pobox.com>
Junio C Hamano [Mon, 10 Jun 2024 17:30:39 +0000 (10:30 -0700)]
Merge branch 'jk/leakfixes'
Memory leaks in "git mv" has been plugged.
* jk/leakfixes:
mv: replace src_dir with a strvec
mv: factor out empty src_dir removal
mv: move src_dir cleanup to end of cmd_mv()
t-strvec: mark variable-arg helper with LAST_ARG_MUST_BE_NULL
t-strvec: use va_end() to match va_start()
Junio C Hamano [Fri, 7 Jun 2024 17:57:23 +0000 (10:57 -0700)]
Merge branch 'tb/midx-write-cleanup'
Code clean-up around writing the .midx files.
* tb/midx-write-cleanup:
pack-bitmap.c: reimplement `midx_bitmap_filename()` with helper
midx: replace `get_midx_rev_filename()` with a generic helper
midx-write.c: support reading an existing MIDX with `packs_to_include`
midx-write.c: extract `fill_packs_from_midx()`
midx-write.c: extract `should_include_pack()`
midx-write.c: pass `start_pack` to `compute_sorted_entries()`
midx-write.c: reduce argument count for `get_sorted_entries()`
midx-write.c: tolerate `--preferred-pack` without bitmaps
Junio C Hamano [Thu, 6 Jun 2024 19:49:23 +0000 (12:49 -0700)]
Merge branch 'ps/leakfixes'
Leakfixes.
* ps/leakfixes:
builtin/mv: fix leaks for submodule gitfile paths
builtin/mv: refactor to use `struct strvec`
builtin/mv duplicate string list memory
builtin/mv: refactor `add_slash()` to always return allocated strings
strvec: add functions to replace and remove strings
submodule: fix leaking memory for submodule entries
commit-reach: fix memory leak in `ahead_behind()`
builtin/credential: clear credential before exit
config: plug various memory leaks
config: clarify memory ownership in `git_config_string()`
builtin/log: stop using globals for format config
builtin/log: stop using globals for log config
convert: refactor code to clarify ownership of check_roundtrip_encoding
diff: refactor code to clarify memory ownership of prefixes
config: clarify memory ownership in `git_config_pathname()`
http: refactor code to clarify memory ownership
checkout: clarify memory ownership in `unique_tracking_name()`
strbuf: fix leak when `appendwholeline()` fails with EOF
transport-helper: fix leaking helper name
builtin/refs: new command to migrate ref storage formats
Introduce a new command that allows the user to migrate a repository
between ref storage formats. This new command is implemented as part of
a new git-refs(1) executable. This is due to two reasons:
- There is no good place to put the migration logic in existing
commands. git-maintenance(1) felt unwieldy, and git-pack-refs(1) is
not the correct place to put it, either.
- I had it in my mind to create a new low-level command for accessing
refs for quite a while already. git-refs(1) is that command and can
over time grow more functionality relating to refs. This should help
discoverability by consolidating low-level access to refs into a
single executable.
As mentioned in the preceding commit that introduces the ref storage
format migration logic, the new `git refs migrate` command still has a
bunch of restrictions. These restrictions are documented accordingly.
Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
refs: implement logic to migrate between ref storage formats
With the introduction of the new "reftable" backend, users may want to
migrate repositories between the backends without having to recreate the
whole repository. Add the logic to do so.
The implementation is generic and works with arbitrary ref storage
formats so that a backend does not need to implement any migration
logic. It does have a few limitations though:
- We do not migrate repositories with worktrees, because worktrees
have separate ref storages. It makes the overall affair more complex
if we have to migrate multiple storages at once.
- We do not migrate reflogs, because we have no interfaces to write
many reflog entries.
- We do not lock the repository for concurrent access, and thus
concurrent writes may end up with weird in-between states. There is
no way to fully lock the "files" backend for writes due to its
format, and thus we punt on this topic altogether and defer to the
user to avoid those from happening.
In other words, this version is a minimum viable product for migrating a
repository's ref storage format. It works alright for bare repos, which
often have neither worktrees nor reflogs. But it will not work for many
other repositories without some preparations. These limitations are not
set into stone though, and ideally we will eventually address them over
time.
The logic is not yet used by anything, and thus there are no tests for
it. Those will be added in the next commit.
Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
We're about to introduce logic to migrate ref storages. One part of the
migration will be to delete the files that are part of the old ref
storage format. We don't yet have a way to delete such data generically
across ref backends though.
Implement a new `delete` callback and expose it via a new
`ref_storage_delete()` function.
Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
In `get_worktree_ref_store()` we either return the repository's main ref
store, or we look up the ref store via the map of worktree ref stores.
Which of these worktrees gets picked depends on the `is_current` bit of
the worktree, which indicates whether the worktree is the one that
corresponds to `the_repository`.
The bit is getting set in `get_worktrees()`, but only after we have
computed the list of all worktrees. This is too late though, because at
that time we have already called `get_worktree_ref_store()` on each of
the worktrees via `add_head_info()`. The consequence is that the current
worktree will not have been marked accordingly, which means that we did
not use the main ref store, but instead created a new ref store. We thus
have two separate ref stores now that map to the same ref database.
Fix this by setting `is_current` before we call `add_head_info()`.
Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
The function `merged_table_release()` releases a merged table, whereas
`reftable_merged_table_free()` releases a merged table and then also
free's its pointer. But all callsites of `merged_table_release()` are in
fact followed by `reftable_merged_table_free()`, which is redundant.
Inline `merged_table_release()` into `reftable_merged_table_free()` to
get rid of this redundance.
Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
refs/files: fix NULL pointer deref when releasing ref store
The `free_ref_cache()` function is not `NULL` safe and will thus
segfault when being passed such a pointer. This can easily happen when
trying to release a partially initialized "files" ref store. Fix this.
Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
refs/files: extract function to iterate through root refs
Extract a new function that can be used to iterate through all root refs
known to the "files" backend. This will be used in the next commit,
where we start to teach ref backends to remove themselves.
Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
The `add_pseudoref_and_head_entries()` function accepts both the ref
store as well as a directory name as input. This is unnecessary though
as the ref store already uniquely identifies the root directory of the
ref store anyway.
Furthermore, the function is misnamed now that we have clarified the
meaning of pseudorefs as it doesn't add pseudorefs, but root refs.
Rename it accordingly.
Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
The ref backends do not have any way to disable the creation of reflog
entries. This will be required for upcoming ref format migration logic
so that we do not create any entries that didn't exist in the original
ref database.
Provide a new `REF_SKIP_CREATE_REFLOG` flag that allows the caller to
disable reflog entry creation.
Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
refs: pass storage format to `ref_store_init()` explicitly
We're about to introduce logic to migrate refs from one storage format
to another one. This will require us to initialize a ref store with a
different format than the one used by the passed-in repository.
Prepare for this by accepting the desired ref storage format as
parameter.
Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
The ref storage format is tracked as a simple unsigned integer, which
makes it harder than necessary to discover what that integer actually is
or where its values are defined.
Convert the ref storage format to instead be an enum.
Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
setup: unset ref storage when reinitializing repository version
When reinitializing a repository's version we may end up unsetting the
hash algorithm when it matches the default hash algorithm. If we didn't
do that then the previously configured value might remain intact.
While the same issue exists for the ref storage extension, we don't do
this here. This has been fine for most of the part because it is not
supported to re-initialize a repository with a different ref storage
format anyway. We're about to introduce a new command to migrate ref
storages though, so this is about to become an issue there.
Prepare for this and unset the ref storage format when reinitializing a
repository with the "files" format.
Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Jeff King [Tue, 4 Jun 2024 10:13:40 +0000 (06:13 -0400)]
sparse-checkout: free duplicate hashmap entries
In insert_recursive_pattern(), we create a new pattern_entry to insert
into the parent_hashmap. If we find that the same entry already exists
in the hashmap, we skip adding the new one. But we forget to free the new
one, creating a leak.
We can fix it by cleaning up the discarded entry. It would probably be
possible to avoid creating it in the first place, but it's non-trivial.
We'd have to define a "keydata" struct that lets us compare the existing
entries to the broken-out fields. It's probably not worth the
complexity, so we'll punt on that for now.
There is one subtlety here: our insertion is happening in a loop, with
each iteration looking at the pattern we just inserted (hence the
"recursive" in the name). So if we skip insertion, what do we look at?
The obvious answer is that we should remember the existing duplicate we
found and use that. But I _think_ in that case, we probably already have
all of the recursive bits already (from when the original entry was
added). And so just breaking out of the loop would be correct. But I'm
not 100% sure on that; after all, the original leaky code could have
done the same break, but it didn't.
So I went with the "obvious answer" above, which has no chance of
changing the behavior aside from fixing the leak.
With this patch, t1091 can now be marked leak-free.
Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Jeff King [Tue, 4 Jun 2024 10:13:35 +0000 (06:13 -0400)]
sparse-checkout: free pattern list in sparse_checkout_list()
In sparse_checkout_list(), we create a pattern_list that needs to
eventually be cleared. We remember to do so in the regular code path,
but the cone-mode path does an early return, and forgets to clean up.
We could fix the leak by adding a new call to clear_pattern_list(). But
we can simplify even further by just skipping the early return, pushing
the other code path (which consists now of only one line!) into an else
block. That also matches the same cone/non-cone if/else used in some
other functions.
This fixes 15 leaks found in t1091.
Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Jeff King [Tue, 4 Jun 2024 10:13:32 +0000 (06:13 -0400)]
sparse-checkout: free sparse_filename after use
We allocate a heap buffer via get_sparse_checkout_filename(). Most calls
remember to free it, but sparse_checkout_init() forgets to, causing a
leak. Ironically, it remembers to do so in the error return paths, but
not in the path that makes it all the way to the function end!
Fixing this clears up 6 leaks from t1091.
Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
In update_working_directory(), we take in a pattern_list, attach it to
the repository index by assigning it to index->sparse_checkout_patterns,
and then call unpack_trees. Afterwards, we remove it by setting
index->sparse_checkout_patterns back to NULL.
But there are two possible leaks here:
1. If the index already had a populated sparse_checkout_patterns,
we've obliterated it. We can fix this by saving and restoring it,
rather than always setting it back to NULL.
2. We may call the function with a NULL pattern_list, expecting it to
use the on-disk sparse file. In that case, the index routines will
lazy-load the sparse patterns automatically. But now at the end of
the function when we restore the patterns, we'll leak those
lazy-loaded ones!
We can fix this by freeing the pattern list before overwriting its
pointer whenever it does not match what was passed in (in practice
this should only happen when the passed-in list is NULL, but this
is erring on the defensive side).
Together these remove 48 indirect leaks found in t1091.
Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Jeff King [Tue, 4 Jun 2024 10:13:27 +0000 (06:13 -0400)]
sparse-checkout: always free "line" strbuf after reading input
In add_patterns_from_input(), we may read lines from a file with a loop
like this:
while (!strbuf_getline(&line, file)) {
...
strbuf_to_cone_pattern(&line, pl);
}
/* we don't strbuf_release(&line) here! */
This generally is OK because strbuf_to_cone_pattern() consumes the
buffer via strbuf_detach(). But we can leak in a few cases:
1. We don't always consume the buffer! If the line ends up empty after
trimming, we leave strbuf_to_cone_pattern() without detaching. In
most cases this is OK, because a subsequent getline() call will use
the same buffer. But if you had an empty line at the end of file,
for example, it would leak.
2. Even if strbuf_to_cone_pattern() always consumed the buffer,
there's a subtle issue with strbuf_getline(). As we saw in 94e2aa555e (strbuf: fix leak when `appendwholeline()` fails with
EOF, 2024-05-27), it's possible for it to return EOF with an
allocated buffer (e.g., if the underlying getdelim() call saw an
error). So we should always strbuf_release() after finishing a read
loop like this.
Note that even the code to read patterns from argv has the same problem.
Because that also uses strbuf_to_cone_pattern(), we stuff each argv
entry into a strbuf. It uses the same "line" strbuf as the getline code,
but we should position the strbuf_release() to cover both code paths.
This fixes at least 9 leaks found in t1091.
Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Jeff King [Tue, 4 Jun 2024 10:13:25 +0000 (06:13 -0400)]
sparse-checkout: reuse --stdin buffer when reading patterns
When we read patterns from --stdin, we loop on strbuf_getline(), and
detach each line we read to pass into add_pattern(). This used to be
necessary because add_pattern() required that the pattern strings remain
valid while the pattern_list was in use. But it also created a leak,
since we didn't record the detached buffers anywhere else.
Now that add_pattern() has been modified to make its own copy of the
strings, we can stop detaching and fix the leak. This fixes 4 leaks
detected in t1091.
Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Jeff King [Tue, 4 Jun 2024 10:13:22 +0000 (06:13 -0400)]
dir.c: always copy input to add_pattern()
The add_pattern() function has a subtle and undocumented gotcha: the
pattern string you pass in must remain valid as long as the pattern_list
is in use (and nor do we take ownership of it). This is easy to get
wrong, causing either subtle bugs (because you free or reuse the string
buffer) or leaks (because you copy the string, but don't track ownership
separately).
All of this "pattern" code was originally the "exclude" mechanism. So
this _usually_ works OK because you add entries in one of two ways:
1. From the command-line (e.g., "--exclude"), in which case we're
pointing to an argv entry which remains valid for the lifetime of
the program.
2. From a file (e.g., ".gitignore"), in which case we read the whole
file into a buffer, attach it to the pattern_list's "filebuf"
entry, then parse the buffer in-place (adding NULs). The strings
point into the filebuf, which is cleaned up when the whole
pattern_list goes away.
But other code, like sparse-checkout, reads individual lines from stdin
and passes them one by one to add_pattern(), leaking each. We could fix
this by refactoring it to take in the whole buffer at once, like (2)
above, and stuff it in "filebuf". But given how subtle the interface is,
let's just fix it to always copy the string.
That seems at first like we'd be wasting extra memory, but we can
mitigate that:
a. The path_pattern struct already uses a FLEXPTR, since we sometimes
make a copy (when we see "foo/", we strip off the trailing slash,
requiring a modifiable copy of the string).
Since we'll now always embed the string inside the struct, we can
switch to the regular FLEX_ARRAY pattern, saving us 8 bytes of
pointer. So patterns with a trailing slash and ones under 8 bytes
actually get smaller.
b. Now that we don't need the original string to hang around, we can
get rid of the "filebuf" mechanism entirely, and just free the file
contents after parsing. Since files are the sources we'd expect to
have the largest pattern sets, we should mostly break even on
stuffing the same data into the individual structs.
This patch just adjusts the add_pattern() interface; it doesn't fix any
leaky callers yet.
Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Jeff King [Wed, 5 Jun 2024 08:03:08 +0000 (04:03 -0400)]
dir.c: reduce max pattern file size to 100MB
In a2bc523e1e (dir.c: skip .gitignore, etc larger than INT_MAX,
2024-05-31) we put capped the size of some files whose parsing code and
data structures used ints. Setting the limit to INT_MAX was a natural
spot, since we know the parsing code would misbehave above that.
But it also leaves the possibility of overflow errors when we multiply
that limit to allocate memory. For instance, a file consisting only of
"a\na\n..." could have INT_MAX/2 entries. Allocating an array of
pointers for each would need INT_MAX*4 bytes on a 64-bit system, enough
to overflow a 32-bit int.
So let's give ourselves a bit more safety margin by giving a much
smaller limit. The size 100MB is somewhat arbitrary, but is based on the
similar value for attribute files added by 3c50032ff5 (attr: ignore
overly large gitattributes files, 2022-12-01).
There's no particular reason these have to be the same, but the idea is
that they are in the ballpark of "so huge that nobody would care, but
small enough to avoid malicious overflow". So lacking a better guess, it
makes sense to use the same value. The implementation here doesn't share
the same constant, but we could change that later (or even give it a
runtime config knob, though nobody has complained yet about the
attribute limit).
And likewise, let's add a few tests that exercise the limits, based on
the attr ones. In this case, though, we never read .gitignore from the
index; the blob code is exercised only for sparse filters. So we'll
trigger it that way.
Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Junio C Hamano [Tue, 4 Jun 2024 18:46:10 +0000 (11:46 -0700)]
imap-send: minimum leakfix
EVen with the minimum "no-op" invocation t1517 makes, "git imap-send"
leaks an empty strbuf it used to read a 0-byte string into.
There are a few other topics cooking in 'next' that plugs many
other leaks in this program, so let's minimally fix this one, barely
enough to make CI pass, leaving the rest for the other topic.
Helped-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
In add_pattern_to_hashsets(), we remove entries from the
recursive_hashmap when adding similar ones to the parent_hashmap. I
won't pretend to understand all of what's going on here, but there's an
obvious leak: whatever we removed from recursive_hashmap is not
referenced anywhere else, and is never free()d.
We can easily fix this by asking the hashmap to return a pointer to the
old entry. This makes t7002 now completely leak-free.
Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Jeff King [Tue, 4 Jun 2024 10:13:17 +0000 (06:13 -0400)]
sparse-checkout: clear patterns when init() sees existing sparse file
In sparse_checkout_init(), we first try to load patterns from an
existing file. If we found any, we return immediately, but end up
leaking the patterns we parsed. Fixing this reduces the number of leaks
in t7002 from 9 down to 5.
Note that there are two other exits from the function, but they don't
need the same treatment:
- if we can't resolve HEAD, we write out a hard-coded sparse file and
return. But we know the pattern list is empty there, since we didn't
find any in the on-disk file and we haven't yet added any of our
own.
- otherwise, we do populate the list and then tail-call into
write_patterns_and_update(). But that function frees the
pattern_list itself, so we don't need to.
Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Jeff King [Tue, 4 Jun 2024 10:13:14 +0000 (06:13 -0400)]
dir.c: free strings in sparse cone pattern hashmaps
The pattern_list structs used for cone-mode sparse lookups use a few
extra hashmaps. These store pattern_entry structs, each of which has its
own heap-allocated pattern string. When we clean up the hashmaps, we
free the individual pattern_entry structs, but forget to clean up the
embedded strings, causing memory leaks.
We can fix this by iterating over the hashmaps to free the extra
strings. This reduces the numbers of leaks in t7002 from 22 to 9.
One alternative here would be to make the string a FLEX_ARRAY member of
the pattern_entry. Then there's no extra free() required, and as a bonus
it would be a little more efficient. However, some of the refactoring
gets awkward, as we are often assigning strings allocated by helper
functions. So let's just fix the leak for now, and we can explore bigger
refactoring separately.
Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Jeff King [Tue, 4 Jun 2024 10:13:10 +0000 (06:13 -0400)]
sparse-checkout: pass string literals directly to add_pattern()
The add_pattern() function takes a pattern string, but neither makes a
copy of it nor takes ownership of the memory. So it is the caller's
responsibility to make sure the string hangs around as long as the
pattern_list which references it.
There are a few cases in sparse-checkout where we use string literal
patterns by stuffing them into a strbuf, detaching the buffer, and then
passing the result into add_pattern(). This creates a leak when the
pattern_list is eventually cleared, since we don't retain a copy of the
detached buffer to free.
But we can observe that the whole strbuf dance is unnecessary. The point
was presumably[1] to satisfy the lifetime requirement of the string. But
string literals have static duration; we can count on them lasting for
the whole program.
So we can fix the leak by just passing them directly. And as a bonus,
that simplifies the code. The leaks can be seen in t7002, which drops
from 25 leaks to 22 with this patch. It also makes t3602 and t1090
leak-free.
In the long run, we will also want to clean up this (undocumented!)
memory lifetime requirement of add_pattern(). But that can come in a
later patch; passing the string literals directly will be the right
thing either way.
[1] The code in question comes from 416adc8711 (sparse-checkout: update
working directory in-process for 'init', 2019-11-21) and 99dfa6f970
(sparse-checkout: use in-process update for disable subcommand,
2019-11-21), but I didn't see anything in their commit messages or
on the list explaining the strbufs.
Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
When "git push" notices that the commit at the tip of the ref on
the other side it is about to overwrite does not exist locally, it
used to first try fetching it if the local repository is a partial
clone. The command has been taught not to do so and immediately
fail instead.
* th/push-local-ff-check-without-lazy-fetch:
push: don't fetch commit object when checking existence
Prior changes have made the documentation around the internals of the
alias command execution clearer, but I have still found this detailed
view of the aliased command being run helpful for debugging purposes.
A test case is added to ensure the full command output is present in
the execution flow.
Signed-off-by: Ian Wienand <iwienand@redhat.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Ian Wienand [Mon, 27 May 2024 00:30:48 +0000 (10:30 +1000)]
Documentation: alias: add notes on shell expansion
When writing inline shell for shell-expansion aliases (i.e. prefixed
with "!"), there are some caveats around argument parsing to be aware
of. This series of notes attempts to explain what is happening more
clearly.
Signed-off-by: Ian Wienand <iwienand@redhat.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Jeff King [Fri, 31 May 2024 12:00:34 +0000 (08:00 -0400)]
dir.c: skip .gitignore, etc larger than INT_MAX
We use add_patterns() to read .gitignore, .git/info/exclude, etc, as
well as other pattern-like files like sparse-checkout. The parser for
these uses an "int" as an index, meaning that files over 2GB will
generally cause signed integer overflow and out-of-bounds access.
This is unlikely to happen in any real files, but we do read .gitignore
files from the tree. A malicious tree could cause an out-of-bounds read
and segfault (we also write NULs over newlines, so in theory it could be
an out-of-bounds write, too, but as we go char-by-char, the first thing
that happens is trying to read a negative 2GB offset).
We could fix the most obvious issue by replacing one "int" with a
"size_t". But there are tons of "int" sprinkled throughout this code for
things like pattern lengths, number of patterns, and so on. Since nobody
would actually want a 2GB .gitignore file, an easy defensive measure is
to just refuse to parse them.
The "int" in question is in add_patterns_from_buffer(), so we could
catch it there. But by putting the checks in its two callers, we can
produce more useful error messages.
Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Junio C Hamano [Fri, 31 May 2024 20:51:15 +0000 (13:51 -0700)]
Post 2.45.2 updates
Merge down a handful of topics to adjust tests and CI to make them
work better, without changing Git itself, and a bit of developer
docs update:
* Tests that try to corrupt in-repository files in chunked format did
not work well on macOS due to its broken "mv", which has been
worked around.
* Unbreak CI jobs so that we do not attempt to use Python 2 that has
been removed from the platform.
* Git 2.43 started using the tree of HEAD as the source of attributes
in a bare repository, which has severe performance implications.
For now, revert the change, without ripping out a more explicit
support for the attr.tree configuration variable.
* Windows CI running in GitHub Actions started complaining about the
order of arguments given to calloc(); the imported regex code uses
the wrong order almost consistently, which has been corrected.
Junio C Hamano [Fri, 31 May 2024 22:28:21 +0000 (15:28 -0700)]
Merge branch 'jk/ci-macos-gcc13-fix' into maint-2.45
CI fix.
* jk/ci-macos-gcc13-fix:
ci: stop installing "gcc-13" for osx-gcc
ci: avoid bare "gcc" for osx-gcc job
ci: drop mention of BREW_INSTALL_PACKAGES variable
Junio C Hamano [Fri, 31 May 2024 22:28:19 +0000 (15:28 -0700)]
Merge branch 'jc/compat-regex-calloc-fix' into maint-2.45
Windows CI running in GitHub Actions started complaining about the
order of arguments given to calloc(); the imported regex code uses
the wrong order almost consistently, which has been corrected.
* jc/compat-regex-calloc-fix:
compat/regex: fix argument order to calloc(3)
Junio C Hamano [Fri, 31 May 2024 22:28:19 +0000 (15:28 -0700)]
Merge branch 'jc/no-default-attr-tree-in-bare' into maint-2.45
Git 2.43 started using the tree of HEAD as the source of attributes
in a bare repository, which has severe performance implications.
For now, revert the change, without ripping out a more explicit
support for the attr.tree configuration variable.
* jc/no-default-attr-tree-in-bare:
stop using HEAD for attributes in bare repository by default
Junio C Hamano [Fri, 31 May 2024 15:55:34 +0000 (08:55 -0700)]
Merge branch 'jk/leakfixes' into jk/sparse-leakfix
* jk/leakfixes:
mv: replace src_dir with a strvec
mv: factor out empty src_dir removal
mv: move src_dir cleanup to end of cmd_mv()
t-strvec: mark variable-arg helper with LAST_ARG_MUST_BE_NULL
t-strvec: use va_end() to match va_start()
Junio C Hamano [Fri, 31 May 2024 08:43:27 +0000 (01:43 -0700)]
t1517: more coverage for commands that work without repository
While most of the commands in Git suite are designed to do useful
things in Git repositories, some commands are also usable outside
any repository. Building on top of an earlier work abece6e9 (t1517:
test commands that are designed to be run outside repository,
2024-05-20) that adds tests for such commands, let's give coverage
to some more commands.
This patch covers commands whose code has hits for
$ git grep setup_git_directory_gently
and passes a pointer to nongit_ok variable it uses to allow it to
run outside a Git repository, but mostly they are tested only to see
that they start up (as opposed to dying with "not in a git
repository" complaint). We may want to update them to actually do
something useful later, but this would at least help us catch
regressions by mistake.
Junio C Hamano [Fri, 31 May 2024 00:17:21 +0000 (17:17 -0700)]
Merge branch 'jc/fix-2.45.1-and-friends-for-maint' into maint-2.45
* jc/fix-2.45.1-and-friends-for-maint:
Revert "fsck: warn about symlink pointing inside a gitdir"
Revert "Add a helper function to compare file contents"
clone: drop the protections where hooks aren't run
tests: verify that `clone -c core.hooksPath=/dev/null` works again
Revert "core.hooksPath: add some protection while cloning"
init: use the correct path of the templates directory again
hook: plug a new memory leak
ci: stop installing "gcc-13" for osx-gcc
ci: avoid bare "gcc" for osx-gcc job
ci: drop mention of BREW_INSTALL_PACKAGES variable
send-email: avoid creating more than one Term::ReadLine object
send-email: drop FakeTerm hack
Junio C Hamano [Fri, 31 May 2024 00:11:02 +0000 (17:11 -0700)]
Merge branch 'fixes/2.45.1/2.44' into maint-2.44
* fixes/2.45.1/2.44:
Revert "fsck: warn about symlink pointing inside a gitdir"
Revert "Add a helper function to compare file contents"
clone: drop the protections where hooks aren't run
tests: verify that `clone -c core.hooksPath=/dev/null` works again
Revert "core.hooksPath: add some protection while cloning"
init: use the correct path of the templates directory again
hook: plug a new memory leak
ci: stop installing "gcc-13" for osx-gcc
ci: avoid bare "gcc" for osx-gcc job
ci: drop mention of BREW_INSTALL_PACKAGES variable
send-email: avoid creating more than one Term::ReadLine object
send-email: drop FakeTerm hack
Junio C Hamano [Fri, 31 May 2024 00:04:37 +0000 (17:04 -0700)]
Merge branch 'fixes/2.45.1/2.43' into maint-2.43
* fixes/2.45.1/2.43:
Revert "fsck: warn about symlink pointing inside a gitdir"
Revert "Add a helper function to compare file contents"
clone: drop the protections where hooks aren't run
tests: verify that `clone -c core.hooksPath=/dev/null` works again
Revert "core.hooksPath: add some protection while cloning"
init: use the correct path of the templates directory again
hook: plug a new memory leak
ci: stop installing "gcc-13" for osx-gcc
ci: avoid bare "gcc" for osx-gcc job
ci: drop mention of BREW_INSTALL_PACKAGES variable
send-email: avoid creating more than one Term::ReadLine object
send-email: drop FakeTerm hack
Junio C Hamano [Fri, 31 May 2024 00:00:57 +0000 (17:00 -0700)]
Merge branch 'fixes/2.45.1/2.42' into maint-2.42
* fixes/2.45.1/2.42:
Revert "fsck: warn about symlink pointing inside a gitdir"
Revert "Add a helper function to compare file contents"
clone: drop the protections where hooks aren't run
tests: verify that `clone -c core.hooksPath=/dev/null` works again
Revert "core.hooksPath: add some protection while cloning"
init: use the correct path of the templates directory again
hook: plug a new memory leak
ci: stop installing "gcc-13" for osx-gcc
ci: avoid bare "gcc" for osx-gcc job
ci: drop mention of BREW_INSTALL_PACKAGES variable
send-email: avoid creating more than one Term::ReadLine object
send-email: drop FakeTerm hack
Junio C Hamano [Thu, 30 May 2024 23:58:12 +0000 (16:58 -0700)]
Merge branch 'fixes/2.45.1/2.41' into maint-2.41
* fixes/2.45.1/2.41:
Revert "fsck: warn about symlink pointing inside a gitdir"
Revert "Add a helper function to compare file contents"
clone: drop the protections where hooks aren't run
tests: verify that `clone -c core.hooksPath=/dev/null` works again
Revert "core.hooksPath: add some protection while cloning"
init: use the correct path of the templates directory again
hook: plug a new memory leak
ci: stop installing "gcc-13" for osx-gcc
ci: avoid bare "gcc" for osx-gcc job
ci: drop mention of BREW_INSTALL_PACKAGES variable
send-email: avoid creating more than one Term::ReadLine object
send-email: drop FakeTerm hack
Junio C Hamano [Thu, 30 May 2024 23:54:42 +0000 (16:54 -0700)]
Merge branch 'fixes/2.45.1/2.40' into maint-2.40
* fixes/2.45.1/2.40:
Revert "fsck: warn about symlink pointing inside a gitdir"
Revert "Add a helper function to compare file contents"
clone: drop the protections where hooks aren't run
tests: verify that `clone -c core.hooksPath=/dev/null` works again
Revert "core.hooksPath: add some protection while cloning"
init: use the correct path of the templates directory again
hook: plug a new memory leak
ci: stop installing "gcc-13" for osx-gcc
ci: avoid bare "gcc" for osx-gcc job
ci: drop mention of BREW_INSTALL_PACKAGES variable
send-email: avoid creating more than one Term::ReadLine object
send-email: drop FakeTerm hack
Junio C Hamano [Thu, 30 May 2024 23:38:58 +0000 (16:38 -0700)]
Merge branch 'jc/fix-2.45.1-and-friends-for-2.39' into maint-2.39
* jc/fix-2.45.1-and-friends-for-2.39:
Revert "fsck: warn about symlink pointing inside a gitdir"
Revert "Add a helper function to compare file contents"
clone: drop the protections where hooks aren't run
tests: verify that `clone -c core.hooksPath=/dev/null` works again
Revert "core.hooksPath: add some protection while cloning"
init: use the correct path of the templates directory again
hook: plug a new memory leak
ci: stop installing "gcc-13" for osx-gcc
ci: avoid bare "gcc" for osx-gcc job
ci: drop mention of BREW_INSTALL_PACKAGES variable
send-email: avoid creating more than one Term::ReadLine object
send-email: drop FakeTerm hack
Adjust jc/fix-2.45.1-and-friends-for-2.39 for more recent
maintenance track.
* jc/fix-2.45.1-and-friends-for-maint:
Revert "fsck: warn about symlink pointing inside a gitdir"
Revert "Add a helper function to compare file contents"
clone: drop the protections where hooks aren't run
tests: verify that `clone -c core.hooksPath=/dev/null` works again
Revert "core.hooksPath: add some protection while cloning"
init: use the correct path of the templates directory again
hook: plug a new memory leak
ci: stop installing "gcc-13" for osx-gcc
ci: avoid bare "gcc" for osx-gcc job
ci: drop mention of BREW_INSTALL_PACKAGES variable
send-email: avoid creating more than one Term::ReadLine object
send-email: drop FakeTerm hack
Junio C Hamano [Thu, 30 May 2024 21:15:15 +0000 (14:15 -0700)]
Merge branch 'es/chainlint-ncores-fix'
The chainlint script (invoked during "make test") did nothing when
it failed to detect the number of available CPUs. It now falls
back to 1 CPU to avoid the problem.
* es/chainlint-ncores-fix:
chainlint.pl: latch CPU count directly reported by /proc/cpuinfo
chainlint.pl: fix incorrect CPU count on Linux SPARC
chainlint.pl: make CPU count computation more robust
The base topic started to make it an error for a command to leave
the hash algorithm unspecified, which revealed a few commands that
were not ready for the change. Give users a knob to revert back to
the "default is sha-1" behaviour as an escape hatch, and start
fixing these breakages.
* jc/undecided-is-not-necessarily-sha1-fix:
apply: fix uninitialized hash function
builtin/hash-object: fix uninitialized hash function
builtin/patch-id: fix uninitialized hash function
t1517: test commands that are designed to be run outside repository
setup: add an escape hatch for "no more default hash algorithm" change
Junio C Hamano [Thu, 30 May 2024 21:15:12 +0000 (14:15 -0700)]
Merge branch 'ps/reftable-reusable-iterator'
Code clean-up to make the reftable iterator closer to be reusable.
* ps/reftable-reusable-iterator:
reftable/merged: adapt interface to allow reuse of iterators
reftable/stack: provide convenience functions to create iterators
reftable/reader: adapt interface to allow reuse of iterators
reftable/generic: adapt interface to allow reuse of iterators
reftable/generic: move seeking of records into the iterator
reftable/merged: simplify indices for subiterators
reftable/merged: split up initialization and seeking of records
reftable/reader: set up the reader when initializing table iterator
reftable/reader: inline `reader_seek_internal()`
reftable/reader: separate concerns of table iter and reftable reader
reftable/reader: unify indexed and linear seeking
reftable/reader: avoid copying index iterator
reftable/block: use `size_t` to track restart point index
Junio C Hamano [Thu, 30 May 2024 21:15:11 +0000 (14:15 -0700)]
Merge branch 'ps/reftable-write-options'
The knobs to tweak how reftable files are written have been made
available as configuration variables.
* ps/reftable-write-options:
refs/reftable: allow configuring geometric factor
reftable: make the compaction factor configurable
refs/reftable: allow disabling writing the object index
refs/reftable: allow configuring restart interval
reftable: use `uint16_t` to track restart interval
refs/reftable: allow configuring block size
reftable/dump: support dumping a table's block structure
reftable/writer: improve error when passed an invalid block size
reftable/writer: drop static variable used to initialize strbuf
reftable: pass opts as constant pointer
reftable: consistently refer to `reftable_write_options` as `opts`
Before discovering the repository details, We used to assume SHA-1
as the "default" hash function, which has been corrected. Hopefully
this will smoke out codepaths that rely on such an unwarranted
assumptions.
* ps/undecided-is-not-necessarily-sha1:
repository: stop setting SHA1 as the default object hash
oss-fuzz/commit-graph: set up hash algorithm
builtin/shortlog: don't set up revisions without repo
builtin/diff: explicitly set hash algo when there is no repo
builtin/bundle: abort "verify" early when there is no repository
builtin/blame: don't access potentially unitialized `the_hash_algo`
builtin/rev-parse: allow shortening to more than 40 hex characters
remote-curl: fix parsing of detached SHA256 heads
attr: fix BUG() when parsing attrs outside of repo
attr: don't recompute default attribute source
parse-options-cb: only abbreviate hashes when hash algo is known
path: move `validate_headref()` to its only user
path: harden validation of HEAD with non-standard hashes
Taylor Blau [Wed, 29 May 2024 22:55:42 +0000 (18:55 -0400)]
midx: replace `get_midx_rev_filename()` with a generic helper
Commit f894081deae (pack-revindex: read multi-pack reverse indexes,
2021-03-30) introduced the `get_midx_rev_filename()` helper (later
modified by commit 60980aed786 (midx.c: write MIDX filenames to
strbuf, 2021-10-26)).
This function returns the location of the classic ".rev" files we used
to write for MIDXs (prior to 95e8383bac1 (midx.c: make changing the
preferred pack safe, 2022-01-25)), which is always of the form:
$GIT_DIR/objects/pack/multi-pack-index-$HASH.rev
Replace this function with a generic helper that populates a strbuf with
the above form, replacing the ".rev" extension with a caller-provided
argument.
This will allow us to remove a similarly-defined function in the
pack-bitmap code (used to determine the location of a MIDX .bitmap file)
by reimplementing it in terms of `get_midx_filename_ext()`.
Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Taylor Blau [Wed, 29 May 2024 22:55:39 +0000 (18:55 -0400)]
midx-write.c: support reading an existing MIDX with `packs_to_include`
Avoid unconditionally copying all packs from an existing MIDX into a new
MIDX by checking that packs added via `fill_packs_from_midx()` don't
appear in the `to_include` set, if one was provided.
Do so by calling `should_include_pack()` from both `add_pack_to_midx()`
and `fill_packs_from_midx()`.
In order to make this work, teach `should_include_pack()` a new
"exclude_from_midx" parameter, which allows skipping the first check.
This is done so that the caller in `fill_packs_from_midx()` doesn't
reject all of the packs it provided since they appear in an existing
MIDX by definition.
The sum total of this change is that we are now able to read and
reference objects in an existing MIDX even when given a non-NULL
`packs_to_include`. This is a prerequisite step for incremental MIDXs,
which need to load any existing MIDX (if one is present) in order to
determine whether or not an object already appears in an earlier portion
of the MIDX to avoid duplicating it across multiple portions.
Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>