In `process_ranges_merge_commit()` we try to figure out which of the
parents can be blamed for the given line changes. When we figure out
that none of the files in the line-log have changed we assign the
complete blame to that commit and rewrite the parents of the current
commit to only use that single parent.
This is done via `commit_list_append()`, which is misleadingly _not_
appending to the list of parents. Instead, we overwrite the parents with
the blamed parent. This makes us lose track of the old pointers,
creating a memory leak.
Fix this issue by freeing the parents before we overwrite them.
Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
bisect: fix leaking commit list items in `check_merge_base()`
While we free the result commit list at the end of `check_merge_base()`,
we forget to free any items that we have already iterated over. Fix this
by using a separate variable to iterate through them.
This leak is exposed by t6030, but plugging it does not make the whole
test suite pass.
Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
There are multiple leaks in `bisect_next_all()`. For one we don't free
the `tried` commit list. Second, one of the branches uses a direct
return instead of jumping to the cleanup code.
Fix these by freeing the commit list and converting the return to a
goto.
Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
When reading bisect refs we read the reference mapping to the "bad" term
into the global `current_bad_oid` variable. This is an allocated string,
but because it is global we never have to free it. This changes though
when `register_ref()` is being called multiple times, at which point
we'll overwrite the previous pointer and thus make it unreachable.
Fix this issue by freeing the previous value. This leak is exposed by
t6030, but plugging it does not make the whole test suite pass.
Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
bisect: fix leaking string in `handle_bad_merge_base()`
When handling a bad merge base we print an error, which includes the set
of good revisions joined by spaces. This string is allocated, but never
freed.
Fix this memory leak. Note that the local `bad_hex` varible also looks
like a string that we should free. But in fact, `oid_to_hex()` returns
an address to a static variable even though it is declared to return a
non-constant string. The function signature is thus quite misleading and
really should be fixed, but doing so is outside of the scope of this
patch series.
Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
bisect: fix leaking good/bad terms when reading multipe times
Even though `read_bisect_terms()` is declared as assigning string
constants, it in fact assigns allocated strings to the `read_bad` and
`read_good` out parameters. The only callers of this function assign the
result to global variables and thus don't have to free them in order to
be leak-free. But that changes when executing the function multiple
times because we'd then overwrite the previous value and thus make it
unreachable.
Fix the function signature and free the previous values. This leak is
exposed by t0630, but plugging it does not make the whole test suite
pass.
Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
builtin/blame: fix leaking blame entries with `--incremental`
When passing `--incremental` to git-blame(1) we exit early by jumping to
the `cleanup` label. But some of the cleanups we perform are handled
between the `goto` and its label, and thus we leak the data.
Move the cleanups after the `cleanup` label. While at it, move the logic
to free the scoreboard's `final_buf` into `cleanup_scoreboard()` and
drop its `const` declaration.
Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
shejialuo [Wed, 20 Nov 2024 11:52:18 +0000 (19:52 +0800)]
ref: add symlink ref content check for files backend
Besides the textual symref, we also allow symbolic links as the symref.
So, we should also provide the consistency check as what we have done
for textual symref. And also we consider deprecating writing the
symbolic links. We first need to access whether symbolic links still
be used. So, add a new fsck message "symlinkRef(INFO)" to tell the
user be aware of this information.
We have already introduced "files_fsck_symref_target". We should reuse
this function to handle the symrefs which use legacy symbolic links. We
should not check the trailing garbage for symbolic refs. Add a new
parameter "symbolic_link" to disable some checks which should only be
executed for textual symrefs.
And we need to also generate the "referent" parameter for reusing
"files_fsck_symref_target" by the following steps:
1. Use "strbuf_add_real_path" to resolve the symlink and get the
absolute path "ref_content" which the symlink ref points to.
2. Generate the absolute path "abs_gitdir" of "gitdir" and combine
"ref_content" and "abs_gitdir" to extract the relative path
"relative_referent_path".
3. If "ref_content" is outside of "gitdir", we just set "referent" with
"ref_content". Instead, we set "referent" with
"relative_referent_path".
Mentored-by: Patrick Steinhardt <ps@pks.im> Mentored-by: Karthik Nayak <karthik.188@gmail.com> Signed-off-by: shejialuo <shejialuo@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
shejialuo [Wed, 20 Nov 2024 11:52:09 +0000 (19:52 +0800)]
ref: check whether the target of the symref is a ref
Ideally, we want to the users use "git symbolic-ref" to create symrefs
instead of writing raw contents into the filesystem. However, "git
symbolic-ref" is strict with the refname but not strict with the
referent. For example, we can make the "referent" located at the
"$(gitdir)/logs/aaa" and manually write the content into this where we
can still successfully parse this symref by using "git rev-parse".
We may need to add some restrictions for "referent" parameter when using
"git symbolic-ref" to create symrefs because ideally all the
nonpseudo-refs should be located under the "refs" directory and we may
tighten this in the future.
In order to tell the user we may tighten the above situation, create
a new fsck message "symrefTargetIsNotARef" to notify the user that this
may become an error in the future.
Mentored-by: Patrick Steinhardt <ps@pks.im> Mentored-by: Karthik Nayak <karthik.188@gmail.com> Signed-off-by: shejialuo <shejialuo@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
shejialuo [Wed, 20 Nov 2024 11:52:00 +0000 (19:52 +0800)]
ref: add basic symref content check for files backend
We have code that checks regular ref contents, but we do not yet check
the contents of symbolic refs. By using "parse_loose_ref_content" for
symbolic refs, we will get the information of the "referent".
We do not need to check the "referent" by opening the file. This is
because if "referent" exists in the file system, we will eventually
check its correctness by inspecting every file in the "refs" directory.
If the "referent" does not exist in the filesystem, this is OK as it is
seen as the dangling symref.
So we just need to check the "referent" string content. A regular ref
could be accepted as a textual symref if it begins with "ref:", followed
by zero or more whitespaces, followed by the full refname, followed only
by whitespace characters. However, we always write a single SP after
"ref:" and a single LF after the refname. It may seem that we should
report a fsck error message when the "referent" does not apply above
rules and we should not be so aggressive because third-party
reimplementations of Git may have taken advantage of the looser syntax.
Put it more specific, we accept the following contents:
When introducing the regular ref content checks, we created two fsck
infos "refMissingNewline" and "trailingRefContent" which exactly
represents above situations. So we will reuse these two fsck messages to
write checks to info the user about these situations.
But we do not allow any other trailing garbage. The followings are bad
symref contents which will be reported as fsck error by "git-fsck(1)".
And we introduce a new "badReferentName(ERROR)" fsck message to report
above errors by using "is_root_ref" and "check_refname_format" to check
the "referent". Since both "is_root_ref" and "check_refname_format"
don't work with whitespaces, we use the trimmed version of "referent"
with these functions.
In order to add checks, we will do the following things:
1. Record the untrimmed length "orig_len" and untrimmed last byte
"orig_last_byte".
2. Use "strbuf_rtrim" to trim the whitespaces or newlines to make sure
"is_root_ref" and "check_refname_format" won't be failed by them.
3. Use "orig_len" and "orig_last_byte" to check whether the "referent"
misses '\n' at the end or it has trailing whitespaces or newlines.
Mentored-by: Patrick Steinhardt <ps@pks.im> Mentored-by: Karthik Nayak <karthik.188@gmail.com> Signed-off-by: shejialuo <shejialuo@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
shejialuo [Wed, 20 Nov 2024 11:51:49 +0000 (19:51 +0800)]
ref: add more strict checks for regular refs
We have already used "parse_loose_ref_contents" function to check
whether the ref content is valid in files backend. However, by
using "parse_loose_ref_contents", we allow the ref's content to end with
garbage or without a newline.
Even though we never create such loose refs ourselves, we have accepted
such loose refs. So, it is entirely possible that some third-party tools
may rely on such loose refs being valid. We should not report an error
fsck message at current. We should notify the users about such
"curiously formatted" loose refs so that adequate care is taken before
we decide to tighten the rules in the future.
And it's not suitable either to report a warn fsck message to the user.
We don't yet want the "--strict" flag that controls this bit to end up
generating errors for such weirdly-formatted reference contents, as we
first want to assess whether this retroactive tightening will cause
issues for any tools out there. It may cause compatibility issues which
may break the repository. So, we add the following two fsck infos to
represent the situation where the ref content ends without newline or
has trailing garbages:
1. refMissingNewline(INFO): A loose ref that does not end with
newline(LF).
2. trailingRefContent(INFO): A loose ref has trailing content.
It might appear that we can't provide the user with any warnings by
using FSCK_INFO. However, in "fsck.c::fsck_vreport", we will convert
FSCK_INFO to FSCK_WARN and we can still warn the user about these
situations when using "git refs verify" without introducing
compatibility issues.
Mentored-by: Patrick Steinhardt <ps@pks.im> Mentored-by: Karthik Nayak <karthik.188@gmail.com> Signed-off-by: shejialuo <shejialuo@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
shejialuo [Wed, 20 Nov 2024 11:51:42 +0000 (19:51 +0800)]
ref: port git-fsck(1) regular refs check for files backend
"git-fsck(1)" implicitly checks the ref content by passing the
callback "fsck_handle_ref" to the "refs.c::refs_for_each_rawref".
Then, it will check whether the ref content (eventually "oid")
is valid. If not, it will report the following error to the user.
And it will also report above errors when there are dangling symrefs
in the repository wrongly. This does not align with the behavior of
the "git symbolic-ref" command which allows users to create dangling
symrefs.
As we have already introduced the "git refs verify" command, we'd better
check the ref content explicitly in the "git refs verify" command thus
later we could remove these checks in "git-fsck(1)" and launch a
subprocess to call "git refs verify" in "git-fsck(1)" to make the
"git-fsck(1)" more clean.
Following what "git-fsck(1)" does, add a similar check to "git refs
verify". Then add a new fsck error message "badRefContent(ERROR)" to
represent that a ref has an invalid content.
Mentored-by: Patrick Steinhardt <ps@pks.im> Mentored-by: Karthik Nayak <karthik.188@gmail.com> Signed-off-by: shejialuo <shejialuo@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
shejialuo [Wed, 20 Nov 2024 11:51:32 +0000 (19:51 +0800)]
ref: support multiple worktrees check for refs
We have already set up the infrastructure to check the consistency for
refs, but we do not support multiple worktrees. However, "git-fsck(1)"
will check the refs of worktrees. As we decide to get feature parity
with "git-fsck(1)", we need to set up support for multiple worktrees.
Because each worktree has its own specific refs, instead of just showing
the users "refs/worktree/foo", we need to display the full name such as
"worktrees/<id>/refs/worktree/foo". So we should know the id of the
worktree to get the full name. Add a new parameter "struct worktree *"
for "refs-internal.h::fsck_fn". Then change the related functions to
follow this new interface.
The "packed-refs" only exists in the main worktree, so we should only
check "packed-refs" in the main worktree. Use "is_main_worktree" method
to skip checking "packed-refs" in "packed_fsck" function.
Then, enhance the "files-backend.c::files_fsck_refs_dir" function to add
"worktree/<id>/" prefix when we are not in the main worktree.
Last, add a new test to check the refname when there are multiple
worktrees to exercise the code.
Mentored-by: Patrick Steinhardt <ps@pks.im> Mentored-by: Karthik Nayak <karthik.188@gmail.com> Signed-off-by: shejialuo <shejialuo@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
shejialuo [Wed, 20 Nov 2024 11:51:24 +0000 (19:51 +0800)]
ref: initialize ref name outside of check functions
We passes "refs_check_dir" to the "files_fsck_refs_name" function which
allows it to create the checked ref name later. However, when we
introduce a new check function, we have to allocate redundant memory and
re-calculate the ref name. It's bad for us to allocate redundant memory
and duplicate logic. Instead, we should allocate and calculate it only
once and pass the ref name to the check functions.
In order not to do repeat calculation, rename "refs_check_dir" to
"refname". And in "files_fsck_refs_dir", create a new strbuf "refname",
thus whenever we handle a new ref, calculate the name and call the check
functions one by one.
Mentored-by: Patrick Steinhardt <ps@pks.im> Mentored-by: Karthik Nayak <karthik.188@gmail.com> Signed-off-by: shejialuo <shejialuo@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
shejialuo [Wed, 20 Nov 2024 11:51:16 +0000 (19:51 +0800)]
ref: check the full refname instead of basename
In "files-backend.c::files_fsck_refs_name", we validate the refname
format by using "check_refname_format" to check the basename of the
iterator with "REFNAME_ALLOW_ONELEVEL" flag.
However, this is a bad implementation. Although we doesn't allow a
single "@" in ".git" directory, we do allow "refs/heads/@". So, we will
report an error wrongly when there is a "refs/heads/@" ref by using one
level refname "@".
Because we just check one level refname, we either cannot check the
other parts of the full refname. And we will ignore the following
errors:
In order to fix the above problem, enhance "files_fsck_refs_name" to use
the full name for "check_refname_format". Then, replace the tests which
are related to "@" and add tests to exercise the above situations using
for loop to avoid repetition.
Mentored-by: Patrick Steinhardt <ps@pks.im> Mentored-by: Karthik Nayak <karthik.188@gmail.com> Signed-off-by: shejialuo <shejialuo@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
shejialuo [Wed, 20 Nov 2024 11:51:08 +0000 (19:51 +0800)]
ref: initialize "fsck_ref_report" with zero
In "fsck.c::fsck_refs_error_function", we need to tell whether "oid" and
"referent" is NULL. So, we need to always initialize these parameters to
NULL instead of letting them point to anywhere when creating a new
"fsck_ref_report" structure.
The original code explicitly initializes the "path" member in the
"struct fsck_ref_report" to NULL (which implicitly 0-initializes other
members in the struct). It is more customary to use "{ 0 }" to express
that we are 0-initializing everything. In order to align with the
codebase, initialize "fsck_ref_report" with zero.
Mentored-by: Patrick Steinhardt <ps@pks.im> Mentored-by: Karthik Nayak <karthik.188@gmail.com> Signed-off-by: shejialuo <shejialuo@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
reftable/block: optimize allocations by using scratch buffer
The block writer needs to compute the key for every record that one adds
to the writer. The buffer for this key is stored on the stack and thus
reallocated on every call to `block_writer_add()`, which is inefficient.
Refactor the code so that we store the buffer in the `block_writer`
struct itself so that we can reuse it. This reduces the number of
allocations when writing many refs, e.g. when migrating one million refs
from the "files" backend to the "reftable backend. Before this change:
HEAP SUMMARY:
in use at exit: 80,048 bytes in 49 blocks
total heap usage: 3,025,864 allocs, 3,025,815 frees, 372,746,291 bytes allocated
After this change:
HEAP SUMMARY:
in use at exit: 80,048 bytes in 49 blocks
total heap usage: 2,013,250 allocs, 2,013,201 frees, 347,543,583 bytes allocated
Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Adapt the name of the `block_writer::buf` variable to instead be called
`block`. This aligns it with the existing `block_len` variable, which
tracks the length of this buffer, and is generally a bit more tied to
the actual context where this variable gets used.
Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
reftable/writer: optimize allocations by using a scratch buffer
Both `writer_add_record()` and `reftable_writer_add_ref()` get executed
for every single ref record we're adding to the reftable writer. And as
both functions use a local buffer to write data, the allocations we have
to do here add up during larger transactions.
Refactor the code to use a scratch buffer part of the `reftable_writer`
itself such that we can reuse it. This signifcantly reduces the number
of allocations during large transactions, e.g. when migrating refs from
the "files" backend to the "reftable" backend. Before this change:
HEAP SUMMARY:
in use at exit: 80,048 bytes in 49 blocks
total heap usage: 5,032,171 allocs, 5,032,122 frees, 418,792,092 bytes allocated
After this change:
HEAP SUMMARY:
in use at exit: 80,048 bytes in 49 blocks
total heap usage: 3,025,864 allocs, 3,025,815 frees, 372,746,291 bytes allocated
This also translate into a small speedup:
Benchmark 1: migrate files:reftable (refcount = 1000000, revision = HEAD~)
Time (mean ± σ): 827.2 ms ± 16.5 ms [User: 689.4 ms, System: 124.9 ms]
Range (min … max): 809.0 ms … 924.7 ms 50 runs
Benchmark 2: migrate files:reftable (refcount = 1000000, revision = HEAD)
Time (mean ± σ): 813.6 ms ± 11.6 ms [User: 679.0 ms, System: 123.4 ms]
Range (min … max): 786.7 ms … 833.5 ms 50 runs
Summary
migrate files:reftable (refcount = 1000000, revision = HEAD) ran
1.02 ± 0.02 times faster than migrate files:reftable (refcount = 1000000, revision = HEAD~)
Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
refs: don't normalize log messages with `REF_SKIP_CREATE_REFLOG`
When the `REF_SKIP_CREATE_REFLOG` flag is set we skip the creation of
the reflog entry, but we still normalize the reflog message when we
queue the update. This is a waste of resources as the normalized message
will never get used in the first place.
Fix this issue by skipping the normalization in case the flag is set.
This leads to a surprisingly large speedup when migrating from the
"files" to the "reftable" backend:
Benchmark 1: migrate files:reftable (refcount = 1000000, revision = HEAD~)
Time (mean ± σ): 878.5 ms ± 14.9 ms [User: 726.5 ms, System: 139.2 ms]
Range (min … max): 858.4 ms … 941.3 ms 50 runs
Benchmark 2: migrate files:reftable (refcount = 1000000, revision = HEAD)
Time (mean ± σ): 831.1 ms ± 10.5 ms [User: 694.1 ms, System: 126.3 ms]
Range (min … max): 812.4 ms … 851.4 ms 50 runs
Summary
migrate files:reftable (refcount = 1000000, revision = HEAD) ran
1.06 ± 0.02 times faster than migrate files:reftable (refcount = 1000000, revision = HEAD~)
And an ever larger speedup when migrating the other way round:
Benchmark 1: migrate reftable:files (refcount = 1000000, revision = HEAD~)
Time (mean ± σ): 923.6 ms ± 11.6 ms [User: 705.5 ms, System: 208.1 ms]
Range (min … max): 905.3 ms … 946.5 ms 50 runs
Benchmark 2: migrate reftable:files (refcount = 1000000, revision = HEAD)
Time (mean ± σ): 818.5 ms ± 9.0 ms [User: 627.6 ms, System: 180.6 ms]
Range (min … max): 802.2 ms … 842.9 ms 50 runs
Summary
migrate reftable:files (refcount = 1000000, revision = HEAD) ran
1.13 ± 0.02 times faster than migrate reftable:files (refcount = 1000000, revision = HEAD~)
Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
refs: skip collision checks in initial transactions
Reference transactions use `refs_verify_refname_available()` to check
for colliding references. This check consists of two parts:
- Checks for whether multiple ref updates in the same transaction
conflict with each other.
- Checks for whether existing refs conflict with any refs part of the
transaction.
While we generally cannot avoid the first check, the second check is
superfluous in cases where the transaction is an initial one in an
otherwise empty ref store. The check results in multiple ref reads as
well as the creation of a ref iterator for every ref we're checking,
which adds up quite fast when performing the check for many refs.
Introduce a new flag that allows us to skip this check and wire it up in
such that the backends pass it when running an initial transaction. This
leads to significant speedups when migrating ref storage backends. From
"files" to "reftable":
Benchmark 1: migrate files:reftable (refcount = 100000, revision = HEAD~)
Time (mean ± σ): 472.4 ms ± 6.7 ms [User: 175.9 ms, System: 285.2 ms]
Range (min … max): 463.5 ms … 483.2 ms 10 runs
Benchmark 2: migrate files:reftable (refcount = 100000, revision = HEAD)
Time (mean ± σ): 86.1 ms ± 1.9 ms [User: 67.9 ms, System: 16.0 ms]
Range (min … max): 82.9 ms … 90.9 ms 29 runs
Summary
migrate files:reftable (refcount = 100000, revision = HEAD) ran
5.48 ± 0.15 times faster than migrate files:reftable (refcount = 100000, revision = HEAD~)
And from "reftable" to "files":
Benchmark 1: migrate reftable:files (refcount = 100000, revision = HEAD~)
Time (mean ± σ): 452.7 ms ± 3.4 ms [User: 209.9 ms, System: 235.4 ms]
Range (min … max): 445.9 ms … 457.5 ms 10 runs
Benchmark 2: migrate reftable:files (refcount = 100000, revision = HEAD)
Time (mean ± σ): 95.2 ms ± 2.2 ms [User: 73.6 ms, System: 20.6 ms]
Range (min … max): 91.7 ms … 100.8 ms 28 runs
Summary
migrate reftable:files (refcount = 100000, revision = HEAD) ran
4.76 ± 0.11 times faster than migrate reftable:files (refcount = 100000, revision = HEAD~)
Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
refs: use "initial" transaction semantics to migrate refs
Until now, we couldn't use "initial" transaction semantics to migrate
refs because the "files" backend only supported writing regular refs via
the initial transaction because it simply mapped the transaction to a
"packed-refs" transaction. But with the preceding commit, the "files"
backend has learned to also write symbolic and root refs in the initial
transaction by creating a second transaction for all refs that need to
be written as loose refs.
Adapt the code to migrate refs to commit the transaction as an initial
transaction. This results in a signiticant speedup when migrating many
refs:
Benchmark 1: migrate reftable:files (refcount = 100000, revision = HEAD~)
Time (mean ± σ): 3.247 s ± 0.034 s [User: 0.485 s, System: 2.722 s]
Range (min … max): 3.216 s … 3.309 s 10 runs
Benchmark 2: migrate reftable:files (refcount = 100000, revision = HEAD)
Time (mean ± σ): 453.6 ms ± 1.9 ms [User: 214.6 ms, System: 230.5 ms]
Range (min … max): 451.5 ms … 456.4 ms 10 runs
Summary
migrate reftable:files (refcount = 100000, revision = HEAD) ran
7.16 ± 0.08 times faster than migrate reftable:files (refcount = 100000, revision = HEAD~)
As the reftable backend doesn't (yet) special-case initial transactions
there is no comparable speedup for that backend.
Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
refs/files: support symbolic and root refs in initial transaction
The "files" backend has implemented special logic when committing
the first transactions in an otherwise empty ref store: instead of
writing all refs as separate loose files, it instead knows to write them
all into a "packed-refs" file directly. This is significantly more
efficient than having to write each of the refs as separate "loose" ref.
The only user of this optimization is git-clone(1), which only uses this
mechanism to write regular refs. Consequently, the implementation does
not know how to handle both symbolic and root refs. While fine in the
context of git-clone(1), this keeps us from using the mechanism in more
cases.
Adapt the logic to also support symbolic and root refs by using a second
transaction that we use for all of the refs that need to be written as
loose refs.
Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
There are two different ways to commit a transaction:
- `ref_transaction_commit()` can be used to commit a regular
transaction and is what almost every caller wants.
- `initial_ref_transaction_commit()` can be used when it is known that
the ref store that the transaction is committed for is empty and
when there are no concurrent processes. This is used when cloning a
new repository.
Implementing this via two separate functions has a couple of downsides.
First, every reference backend needs to implement a separate callback
even in the case where they don't special-case the initial transaction.
Second, backends are basically forced to reimplement the whole logic for
how to commit the transaction like the "files" backend does, even though
backends may wish to only tweak certain behaviour of a "normal" commit.
Third, it is awkward that callers must never prepare the transaction as
this is somewhat different than how a transaction typically works.
Refactor the code such that we instead mark initial transactions via a
separate flag when starting the transaction. This addresses all of the
mentioned painpoints, where the most important part is that it will
allow backends to have way more leeway in how exactly they want to
handle the initial transaction.
Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
refs/files: move logic to commit initial transaction
Move the logic to commit initial transactions such that we can start to
call it in `files_transaction_finish()` in a subsequent commit without
requiring a separate function declaration.
Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
refs: allow passing flags when setting up a transaction
Allow passing flags when setting up a transaction such that the
behaviour of the transaction itself can be altered. This functionality
will be used in a subsequent patch.
Adapt callers accordingly.
Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Junio C Hamano [Wed, 20 Nov 2024 05:47:16 +0000 (14:47 +0900)]
Merge branch 'jt/repack-local-promisor'
"git gc" discards any objects that are outside promisor packs that
are referred to by an object in a promisor pack, and we do not
refetch them from the promisor at runtime, resulting an unusable
repository. Work it around by including these objects in the
referring promisor pack at the receiving end of the fetch.
* jt/repack-local-promisor:
index-pack: repack local links into promisor packs
t5300: move --window clamp test next to unclamped
t0410: use from-scratch server
t0410: make test description clearer
Junio C Hamano [Wed, 20 Nov 2024 05:42:58 +0000 (14:42 +0900)]
Merge branch 'ps/cache-tree-w-broken-index-entry' into maint-2.47
Fail gracefully instead of crashing when attempting to write the
contents of a corrupt in-core index as a tree object.
* ps/cache-tree-w-broken-index-entry:
unpack-trees: detect mismatching number of cache-tree/index entries
cache-tree: detect mismatching number of index entries
cache-tree: refactor verification to return error codes
Junio C Hamano [Wed, 20 Nov 2024 05:42:57 +0000 (14:42 +0900)]
Merge branch 'jk/fsmonitor-event-listener-race-fix' into maint-2.47
On macOS, fsmonitor can fall into a race condition that results in
a client waiting forever to be notified for an event that have
already happened. This problem has been corrected.
* jk/fsmonitor-event-listener-race-fix:
fsmonitor: initialize fs event listener before accepting clients
simple-ipc: split async server initialization and running
Jonathan Tan [Tue, 19 Nov 2024 20:10:15 +0000 (12:10 -0800)]
index-pack: teach --promisor to forbid pack name
Currently,
- Running "index-pack --promisor" outside a repo segfaults.
- It may be confusing to a user that running "index-pack --promisor"
within a repo may make changes to the repo's object DB, especially
since the packs indexed by the index-pack invocation may not even be
related to the repo.
As discussed in [1] and [2], teaching --promisor to forbid a packfile
name solves both these problems. This combination of arguments requires
a repo (since we are writing the resulting .pack and .idx to it) and it
is clear that the files are related to the repo.
Currently, Git uses "index-pack --promisor" only when fetching into
a repo, so it could be argued that we should teach "index-pack" a
new argument (say, "--fetching-mode") instead of tying --promisor to
a generic argument like the packfile name. However, this --promisor
feature could conceivably be used whenever we have a packfile that is
known to come from the promisor remote (whether obtained through Git's
fetch protocol or through other means) so not using a new argument seems
reasonable - one could envision a user-made script obtaining a packfile
and then running "index-pack --promisor --stdin", for example. In fact,
it might be possible to relax the restriction further (say, by also
allowing --promisor when indexing a packfile that is in the object DB),
but relaxing the restriction is backwards-compatible so we can revisit
that later.
One thing to watch out for is the possibility of a future Git feature
that indexes a pack in the context of a repo, but does not necessarily
write the resulting pack to it (and does not necessarily desire to
make any changes to the object DB). One such feature would be fetch
quarantine, which might need the repo context in order to detect
hash collisions, but would also need to ensure that the object DB
is undisturbed in case the fetch fails for whatever reason, even if
the reason occurs only after the indexing is complete. It may not be
obvious to the implementer of such a feature that "index-pack" could
sometimes write packs other than the indexed pack to the object DB,
but there are already other ways that "fetch" could write to the object
DB (in particular, packfile URIs and bundle URIs), so hopefully the
implementation of this future feature would already include a test that
the object DB be undisturbed.
This change requires the change to t5300 by 1f52cdfacb (index-pack:
document and test the --promisor option, 2022-03-09) to be undone.
(--promisor is already tested indirectly, so we don't need the explicit
test here any more.)
builtin/gc: provide hint when maintenance hits a stale schedule lock
When running scheduled maintenance via `git maintenance start`, we
acquire a lockfile to ensure that no other scheduled maintenance task is
running in the repository concurrently. If so, we do provide an error to
the user hinting that another process seems to be running in this repo.
There are two important cases why such a lockfile may exist:
- An actual git-maintenance(1) process is still running in this
repository.
- An earlier process may have crashed or was interrupted part way
through and has left a stale lockfile behind.
In c95547a394 (builtin/gc: fix crash when running `git maintenance
start`, 2024-10-10), we have fixed an issue where git-maintenance(1)
would crash with the "start" subcommand, and the underlying bug causes
the second scenario to trigger quite often now.
Most users don't know how to get out of that situation again though.
Ideally, we'd be removing the stale lock for our users automatically.
But in the context of repository maintenance this is rather risky, as it
can easily run for hours or even days. So finding a clear point where we
know that the old process has exited is basically impossible.
We have the same issue in other subsystems, e.g. when locking refs. Our
lockfile interfaces thus provide the `unable_to_lock_message()` function
for exactly this purpose: it provides a nice hint to the user that
explains what is going on and how to get out of that situation again by
manually removing the file.
Adapt git-maintenance(1) to print a similar hint. While we could use the
above function, we can provide a bit more context as we know exactly
what kind of process would create the lockfile.
Reported-by: Miguel Rincon Barahona <mrincon@gitlab.com> Reported-by: Kev Kloss <kkloss@gitlab.com> Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Jean-Noël Avila [Mon, 18 Nov 2024 22:05:49 +0000 (22:05 +0000)]
doc: git-diff: apply new documentation guidelines
The documentation for git-diff has been updated to follow the new
documentation guidelines. The following changes have been applied to
the series of patches:
- switching the synopsis to a synopsis block which will automatically
format placeholders in italics and keywords in monospace
- use _<placeholder>_ instead of <placeholder> in the description
- use `backticks for keywords and more complex option
descriptions`. The new rendering engine will apply synopsis rules to
these spans.
- prevent git-diff from self-referencing itself via gitlink macro when
the generated link would point to the same page.
Signed-off-by: Jean-Noël Avila <jn.avila@free.fr> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Junio C Hamano [Tue, 19 Nov 2024 03:24:33 +0000 (12:24 +0900)]
Merge branch 'ps/reftable-detach' into ps/reftable-iterator-reuse
* ps/reftable-detach:
reftable/system: provide thin wrapper for lockfile subsystem
reftable/stack: drop only use of `get_locked_file_path()`
reftable/system: provide thin wrapper for tempfile subsystem
reftable/stack: stop using `fsync_component()` directly
reftable/system: stop depending on "hash.h"
reftable: explicitly handle hash format IDs
reftable/system: move "dir.h" to its only user
reftable/system: provide thin wrapper for lockfile subsystem
We use the lockfile subsystem to write lockfiles for "tables.list". As
with the tempfile subsystem, the lockfile subsystem also hooks into our
infrastructure to prune stale locks via atexit(3p) or signal handlers.
Furthermore, the lockfile subsystem also handles locking timeouts, which
do add quite a bit of logic. Having to reimplement that in the context
of Git wouldn't make a whole lot of sense, and it is quite likely that
downstream users of the reftable library may have a better idea for how
exactly to implement timeouts.
So again, provide a thin wrapper for the lockfile subsystem instead such
that the compatibility shim is fully self-contained.
Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
reftable/stack: drop only use of `get_locked_file_path()`
We've got a single callsite where we call `get_locked_file_path()`. As
we're about to convert our usage of the lockfile subsystem to instead be
used via a compatibility shim we'd have to implement more logic for this
single callsite. While that would be okay if Git was the only supposed
user of the reftable library, it's a bit more awkward when considering
that we have to reimplement this functionality for every user of the
library eventually.
Refactor the code such that we don't call `get_locked_file_path()`
anymore.
Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
reftable/system: provide thin wrapper for tempfile subsystem
We use the tempfile subsystem to write temporary tables, but given that
we're in the process of converting the reftable library to become
standalone we cannot use this subsystem directly anymore. While we could
in theory convert the code to use mkstemp(3p) instead, we'd lose access
to our infrastructure that automatically prunes tempfiles via atexit(3p)
or signal handlers.
Provide a thin wrapper for the tempfile subsystem instead. Like this,
the compatibility shim is fully self-contained in "reftable/system.c".
Downstream users of the reftable library would have to implement their
own tempfile shims by replacing "system.c" with a custom version.
Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
reftable/stack: stop using `fsync_component()` directly
We're executing `fsync_component()` directly in the reftable library so
that we can fsync data to disk depending on "core.fsync". But as we're
in the process of converting the reftable library to become standalone
we cannot use that function in the library anymore.
Refactor the code such that users of the library can inject a custom
fsync function via the write options. This allows us to get rid of the
dependency on "write-or-die.h".
Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
We include "hash.h" in "reftable/system.h" such that we can use hash
format IDs as well as the raw size of SHA1 and SHA256. As we are in the
process of converting the reftable library to become standalone we of
course cannot rely on those constants anymore.
Introduce a new `enum reftable_hash` to replace internal uses of the
hash format IDs and new constants that replace internal uses of the hash
size. Adapt the reftable backend to set up the correct hash function.
Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
The hash format IDs are used for two different things across the
reftable codebase:
- They are used as a 32 bit unsigned integer when reading and writing
the header in order to identify the hash function.
- They are used internally to identify which hash function is in use.
When one only considers the second usecase one might think that one can
easily change the representation of those hash IDs. But because those
IDs end up in the reftable header and footer on disk it is important
that those never change.
Create separate constants `REFTABLE_FORMAT_ID_*` and use them in
contexts where we read or write reftable headers. This serves multiple
purposes:
- It allows us to more easily discern cases where we actually use
those constants for the on-disk format.
- It detangles us from the same constants that are defined in
libgit.a, which is another required step to convert the reftable
library to become standalone.
- It makes the next step easier where we stop using `GIT_*_FORMAT_ID`
constants in favor of a custom enum.
Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
We still include "dir.h" in "reftable/system.h" even though it is not
used by anything but by a single unit test. Move it over into that unit
test so that we don't accidentally use any functionality provided by it
in the reftable codebase.
Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
and CMD_TO_TWEAK_THE_STREAM undoes the modifications that the
replacement did, then at the end you'd get a replace ref that points to
itself. For example:
Avoid such problems by deleting replace refs that will simply end up
pointing to themselves at the end of our writing. Unless users specify
--quiet, warn them when we delete such a replace ref.
Two notes about this patch:
* We are not ignoring the problematic update of the replace ref
(turning it into a no-op), we are replacing the update with a delete.
The logic here is that if the repository had a value for the replace
ref before fast-import was run, and the replace ref was explicitly
named in the fast-import stream, we don't want the replace ref to be
left with a pre-fast-import value.
* While loops with more than one element (e.g. refs/replace/A points
to B, and refs/replace/B points to A) are possible, they seem much
less plausible. It is pretty easy to create a sequence of
git-filter-repo commands that will trigger a self-referencing replace
ref, but I do not know how to trigger a scenario with a cycle length
greater than 1.
Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Jeff King [Mon, 18 Nov 2024 09:55:22 +0000 (04:55 -0500)]
object-file: inline empty tree and blob literals
We define macros with the bytes of the empty trees and blobs for sha1
and sha256. But since e1ccd7e2b1 (sha1_file: only expose empty object
constants through git_hash_algo, 2018-05-02), those are used only for
initializing the git_hash_algo entries. Any other code using the macros
directly would be suspicious, since a hash_algo pointer is the level of
indirection we use to make everything work with both sha1 and sha256.
So let's future proof against code doing the wrong thing by dropping the
macros entirely and just initializing the structs directly.
Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Jeff King [Mon, 18 Nov 2024 09:55:19 +0000 (04:55 -0500)]
object-file: treat cached_object values as const
The cached-object API maps oids to in-memory entries. Once inserted,
these entries should be immutable. Let's return them from the
find_cached_object() call with a const tag to make this clear.
Suggested-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Jeff King [Mon, 18 Nov 2024 09:55:15 +0000 (04:55 -0500)]
object-file: drop oid field from find_cached_object() return value
The pretend_object_file() function adds to an array mapping oids to
object contents, which are later retrieved with find_cached_object().
We naturally need to store the oid for each entry, since it's the lookup
key.
But find_cached_object() also returns a hard-coded empty_tree object.
There we don't care about its oid field and instead compare against
the_hash_algo->empty_tree. The oid field is left as all-zeroes.
This all works, but it means that the cached_object struct we return
from find_cached_object() may or may not have a valid oid field, depend
whether it is the hard-coded tree or came from pretend_object_file().
Nobody looks at the field, so there's no bug. But let's future-proof it
by returning only the object contents themselves, not the oid. We'll
continue to call this "struct cached_object", and the array entry
mapping the key to those contents will be a "cached_object_entry".
This would also let us swap out the array for a better data structure
(like a hashmap) if we chose, but there's not much point. The only code
that adds an entry is git-blame, which adds at most a single entry per
process.
Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Jeff King [Mon, 18 Nov 2024 09:55:11 +0000 (04:55 -0500)]
object-file: move empty_tree struct into find_cached_object()
The fake empty_tree struct is a static global, but the only code that
looks at it is find_cached_object(). The struct itself is a little odd,
with an invalid "oid" field that is handled specially by that function.
Since it's really just an implementation detail, let's move it to a
static within the function. That future-proofs against other code trying
to use it and seeing the weird oid value.
Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Jeff King [Mon, 18 Nov 2024 09:55:07 +0000 (04:55 -0500)]
object-file: drop confusing oid initializer of empty_tree struct
We treat the empty tree specially, providing an in-memory "cached" copy,
which allows you to diff against it even if the object doesn't exist in
the repository. This is implemented as part of the larger cached_object
subsystem, but we use a stand-alone empty_tree struct.
We initialize the oid of that struct using EMPTY_TREE_SHA1_BIN_LITERAL.
At first glance, that seems like a bug; how could this ever work for
sha256 repositories?
The answer is that we never look at the oid field! The oid field is used
to look up entries added by pretend_object_file() to the cached_objects
array. But for our stand-alone entry, we look for it independently using
the_hash_algo->empty_tree, which will point to the correct algo struct
for the repository.
This happened in 62ba93eaa9 (sha1_file: convert cached object code to
struct object_id, 2018-05-02), which even mentions that this field is
never used. Let's reduce confusion for anybody reading this code by
replacing the sha1 initializer with a comment. The resulting field will
be all-zeroes, so any violation of our assumption that the oid field is
not used will break equally for sha1 and sha256.
Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
and then use it to initialize the hash field of an object_id struct.
That hash field is exactly 32 bytes long (the size we need for sha256).
But the string literal above is actually 33 bytes long due to the NUL
terminator. This is legal in C, and the NUL is ignored.
Side note on legality: in general excess initializer elements are
forbidden, and gcc will warn on both of these:
I couldn't find specific language in the standard allowing
initialization from a string literal where _just_ the NUL is ignored,
but C99 section 6.7.8 (Initialization), paragraph 32 shows this exact
case as "example 8".
However, the upcoming gcc 15 will start warning for this case (when
compiled with -Wextra via DEVELOPER=1):
CC object-file.o
object-file.c:52:9: warning: initializer-string for array of ‘unsigned char’ is too long [-Wunterminated-string-initialization]
52 | "\x6e\xf1\x9b\x41\x22\x5c\x53\x69\xf1\xc1" \
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
object-file.c:79:17: note: in expansion of macro ‘EMPTY_TREE_SHA256_BIN_LITERAL’
which is understandable. Even though this is not a bug for us, since we
do not care about the NUL terminator (and are just using the literal as
a convenient format), it would be easy to accidentally create an array
that was mistakenly unterminated.
We can avoid this warning by switching the initializer to an actual
array of unsigned values. That arguably demonstrates our intent more
clearly anyway.
Reported-by: Sam James <sam@gentoo.org> Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Makefile: let clar header targets depend on their scripts
The targets that generate clar headers depend on their source files, but
not on the script that is actually generating the output. Fix the issue
by adding the missing dependencies.
Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
cmake: use verbatim arguments when invoking clar commands
Pass the VERBATIM option to `add_custom_command()`. Like this, all
arguments to the commands will be escaped properly for the build tool so
that the invoked command receives each argument unchanged.
Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
In 30bf9f0aaa (cmake: set up proper dependencies for generated clar
headers, 2024-10-21), we have deduplicated the logic to generate our
clar headers by reusing the same scripts that our Makefile does. Despite
the deduplication, this refactoring also made us rebuild the headers in
case the source files change, which didn't happen previously.
The commit also introduced an issue though: we execute the scripts
directly, so when the host does not have "/bin/sh" available they will
fail. This is for example the case on Windows when importing the CMake
project into Microsoft Visual Studio.
Address the issue by invoking the scripts with `SH_EXE`, which contains
the discovered path of the shell interpreter.
While at it, wrap the overly long lines in the CMake build instructions.
Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
t/unit-tests: convert "clar-generate.awk" into a shell script
Convert "clar-generate.awk" into a shell script that invokes awk(1).
This allows us to avoid the shell redirect in the build system, which
may otherwise be a problem with build systems on platforms that use a
different shell.
While at it, wrap the overly long lines in the CMake build instructions.
Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
It might be naïve to think that those who need this education would end
up here in the first place. But I think it’s good to mention this
high-level concept here on a command which provides a backup strategy.
Signed-off-by: Kristoffer Haugsbakk <code@khaugsbakk.name> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Documentation/git-bundle.txt: mention full backup example
Provide an example about how to make a “full backup” with caveats about
what that means in this case.
This is a requested use-case.[1] But the doc is a bit unassuming
about it:
If you want to match `git clone --mirror`, which would include your
refs such as `refs/remotes/*`, use `--all`.
The user cannot be expected to formulate “I want a full backup” as “I
want to match `git clone --mirror`” for a bundle file or something.
Let’s drop this mention of `--all` later in the doc and frontload it.
Helped-by: Junio C Hamano <gitster@pobox.com> Signed-off-by: Kristoffer Haugsbakk <code@khaugsbakk.name> Signed-off-by: Junio C Hamano <gitster@pobox.com>
brian m. carlson [Sun, 17 Nov 2024 01:31:49 +0000 (01:31 +0000)]
reflog: rename unreachable
In C23, "unreachable" is a macro that invokes undefined behavior if it
is invoked. To make sure that our code compiles on a variety of C
versions, rename unreachable to "is_unreachable".
Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
brian m. carlson [Sun, 17 Nov 2024 01:31:48 +0000 (01:31 +0000)]
index-pack: rename struct thread_local
"thread_local" is a keyword in C23. To make sure that our code compiles
on a wide variety of C versions, rename struct thread_local to "struct
thread_local_data" to avoid a conflict.
Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Reported-by: Junio C Hamano <gitster@pobox.com> Signed-off-by: Kristoffer Haugsbakk <code@khaugsbakk.name> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Junio C Hamano [Fri, 15 Nov 2024 17:27:40 +0000 (02:27 +0900)]
Clean up RelNotes for 2.48
There somehow ended up too many bogus "merge X later to maint"
comments for topics that cannot be merged ever down to 'maint'
because they were forked from more recent integration branches
in the draft release notes. Remove them, as they are inviting
for mistakes later.
brian m. carlson [Fri, 15 Nov 2024 00:54:04 +0000 (00:54 +0000)]
Allow cloning from repositories owned by another user
Historically, Git has allowed users to clone from an untrusted
repository, and we have documented that this is safe to do so:
`upload-pack` tries to avoid any dangerous configuration options or
hooks from the repository it's serving, making it safe to clone an
untrusted directory and run commands on the resulting clone.
However, this was broken by f4aa8c8bb1 ("fetch/clone: detect dubious
ownership of local repositories", 2024-04-10) in an attempt to make
things more secure. That change resulted in a variety of problems when
cloning locally and over SSH, but it did not change the stated security
boundary. Because the security boundary has not changed, it is safe to
adjust part of the code that patch introduced.
To do that and restore the previous functionality, adjust enter_repo to
take two flags instead of one.
The two bits are
- ENTER_REPO_STRICT: callers that require exact paths (as opposed
to allowing known suffixes like ".git", ".git/.git" to be
omitted) can set this bit. Corresponds to the "strict" parameter
that the flags word replaces.
- ENTER_REPO_ANY_OWNER_OK: callers that are willing to run without
ownership check can set this bit.
The former is --strict-paths option of "git daemon". The latter is
set only by upload-pack, which honors the claimed security boundary.
Note that local clones across ownership boundaries require --no-local so
that upload-pack is used. Document this fact in the manual page and
provide an example.
This patch was based on one written by Junio C Hamano.
Taylor Blau [Thu, 14 Nov 2024 13:42:12 +0000 (08:42 -0500)]
pack-objects: only perform verbatim reuse on the preferred pack
When reusing objects from source pack(s), write_reused_pack_verbatim()
is responsible for reusing objects whole eword_t's at a time. It works
by taking the longest continuous run of objects from the beginning of
each source pack that the caller wants, and reuses the entirety of that
section from each pack.
This is based on the assumption that we don't have any gaps within the
region. This assumption relieves us from having to patch any
OFS_DELTAs, since we know that there aren't any gaps between any delta
and its base in that region.
To illustrate why this assumption is necessary, suppose we have some
pack P, which has objects X, Y, and Z. If the MIDX's copy of Y was
selected from a pack other than P, then the bit corresponding to object
Y will appear earlier in the bitmap than the bits corresponding to X and
Z.
If pack-objects already has or will use the copy of Y from the pack it
was selected from in the MIDX, then it is an error to reuse all objects
between X and Z in the source pack. Doing so will cause us to reuse Y
from a different pack than the one which represents Y in the MIDX,
causing us to either:
- include the object twice, assuming that the caller wants Y in the
pack, or
- include the object once, resulting in us packing more objects than
necessary.
This regression comes from ca0fd69e37 (pack-objects: prepare
`write_reused_pack_verbatim()` for multi-pack reuse, 2023-12-14), which
incorrectly assumed that there would be no gaps in reusable regions of
non-preferred packs.
Instead, we can only safely perform the whole-word reuse optimization on
the preferred pack, where we know with certainty that no gaps exist in
that region of the bitmap. We can still reuse objects from non-preferred
packs, but we have to inspect them individually in write_reused_pack()
to ensure that any gaps that may exist are accounted for.
This allows us to simplify the implementation of
write_reused_pack_verbatim() back to almost its pre-multi-pack reuse
form, since we can now assume that the beginning of the pack appears at
the beginning of the bitmap, meaning that we don't have to account for
any bits up to the first word boundary (like we had to special case in ca0fd69e37).
The only significant changes from the pre-ca0fd69e37 implementation are:
- that we can no longer inspect words up to the end of
reuse_packfile_bitmap->word_alloc, since we only want to look at
words whose bits all correspond to objects in the given packfile, and
- that we return early when given a reuse_packfile which is not
preferred, making the call a noop.
In the future, it might be possible to restore this optimization if we
could guarantee that some reuse packs don't contain any gaps by
construction (similar to the "disjoint packs" idea in very early
versions of multi-pack reuse).
Helped-by: Jeff King <peff@peff.net> Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
The former is responsible for copying the bytes for a single object out
of an existing source pack. The latter does the same but for a region of
objects aligned at eword_t boundaries.
Demonstrate a bug whereby write_reused_pack_verbatim() can be tricked
into writing out objects from some source pack, even when those objects
were selected from a different source pack in the MIDX bitmap.
When the caller wants at least one of the objects in that region,
pack-objects will write the same object twice as a result of this bug.
In the other case where the caller doesn't want any of the objects in
the region of interest, we will write out objects that weren't
requested.
Demonstrate this bug by creating two packs, where the preferred one of
those packs contains a single object which also appears in the main
(non-preferred) pack. A separate bug[^1] prevents us from triggering the
main bug when the duplicated object is the last one in the main pack,
but any earlier object will suffice.
We could fix that separate bug, but the following commit will simplify
write_reused_pack_verbatim() and only call it on the preferred pack, so
doing so would have little point.
[^1]: Because write_reused_pack_verbatim() only reuses bits in the range
/* We're recording one chunk, not one object. */
record_reused_object(pack_start_off,
pack_start_off - (hashfile_total(out) - pack_start));
, or in other words excluding the object beginning at position 'pos -
reuse_packfile->bitmap_pos' in the source pack. But since
reuse_packfile->bitmap_pos is '1' in the non-preferred pack
(accounting for the single-object pack which is preferred), we don't
actually copy the bytes from the last object.
Helped-by: Jeff King <peff@peff.net> Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Karthik Nayak [Thu, 14 Nov 2024 09:58:35 +0000 (10:58 +0100)]
refs: don't invoke reference-transaction hook for reflogs
The reference-transaction hook is invoked whenever there is a reference
update being performed. For each state of the transaction, we iterate
over the updates present and pass this information to the hook.
The `ref_update` structure is used to hold these updates within a
`transaction`. We use the same structure for holding reflog updates too.
Which means that the reference transaction hook is also obtaining
information about a reflog update. This is a bug, since:
- The hook is designed to work with reference updates and reflogs
updates are different.
- The hook doesn't have the required information to distinguish
reference updates from reflog updates.
This is particularly evident when the default branch (pointed by HEAD)
is updated, we see that the hook also receives information about HEAD
being changed. In reality, we only add a reflog update for HEAD, while
HEAD's values remains the same.
Signed-off-by: Karthik Nayak <karthik.188@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Jeff King [Thu, 14 Nov 2024 01:39:12 +0000 (20:39 -0500)]
test-lib: move malloc-debug setup after $PATH setup
Originally, the conditional definition of the setup/teardown functions
for malloc checking could be run at any time, because they depended only
on command-line options and the system getconf function.
But since 02d900361c (test-lib: check malloc debug LD_PRELOAD before
using, 2024-11-11), we probe the system by running "git version". Since
this code runs before we've set $PATH to point to the version of Git we
intend to test, we actually run the system version of git.
This mostly works, since what we really care about is whether the
LD_PRELOAD works, and it should work the same with any program. But
there are some corner cases:
1. You might not have a system git at all, in which case the preload
will appear to fail, even though it could work with the actual
built version of git.
2. Your system git could be linked in a different way. For example, if
it was built statically, then it will ignore LD_PRELOAD entirely,
and we might assume that the preload works, even though it might
not when used with a dynamic build.
We could give a more complete path to the version of Git we intend to
test, but features like GIT_TEST_INSTALLED make that not entirely
trivial. So instead, let's just bump the setup until after we've set up
the $PATH. There's no need for us to do it early, as long as it is done
before the first test runs.
Reported-by: Toon Claes <toon@iotcl.com> Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Junio C Hamano [Tue, 12 Nov 2024 23:35:34 +0000 (08:35 +0900)]
Merge branch 'ps/mingw-rename'
The MinGW compatibility layer has been taught to support POSIX
semantics for atomic renames when other process(es) have a file
opened at the destination path.
* ps/mingw-rename:
compat/mingw: support POSIX semantics for atomic renames
compat/mingw: allow deletion of most opened files
compat/mingw: share file handles created via `CreateFileW()`
Simon Marchi [Tue, 12 Nov 2024 16:22:57 +0000 (11:22 -0500)]
builtin/difftool: intialize some hashmap variables
When running a dir-diff command that produces no diff, variables
`wt_modified` and `tmp_modified` are used while uninitialized, causing:
$ /home/smarchi/src/git/git-difftool --dir-diff master
free(): invalid pointer
[1] 334004 IOT instruction (core dumped) /home/smarchi/src/git/git-difftool --dir-diff master
$ valgrind --track-origins=yes /home/smarchi/src/git/git-difftool --dir-diff master
...
Invalid free() / delete / delete[] / realloc()
at 0x48478EF: free (vg_replace_malloc.c:989)
by 0x422CAC: hashmap_clear_ (hashmap.c:208)
by 0x283830: run_dir_diff (difftool.c:667)
by 0x284103: cmd_difftool (difftool.c:801)
by 0x238E0F: run_builtin (git.c:484)
by 0x2392B9: handle_builtin (git.c:750)
by 0x2399BC: cmd_main (git.c:921)
by 0x356FEF: main (common-main.c:64)
Address 0x1ffefff180 is on thread 1's stack
in frame #2, created by run_dir_diff (difftool.c:358)
...
If taking any `goto finish` path before these variables are initialized,
`hashmap_clear_and_free()` operates on uninitialized data, sometimes
causing a crash.
This regression was introduced in 7f795a1715 (builtin/difftool: plug
several trivial memory leaks, 2024-09-26).
Fix it by initializing those variables with the `HASHMAP_INIT` macro.
Add a test comparing the main branch to itself, resulting in no diff.
Signed-off-by: Simon Marchi <simon.marchi@efficios.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Jeff King [Tue, 12 Nov 2024 08:39:37 +0000 (03:39 -0500)]
refspec: store raw refspecs inside refspec_item
The refspec struct keeps two matched arrays: one for the refspec_item
structs and one for the original raw refspec strings. The main reason
for this is that there are other users of refspec_item that do not care
about the raw strings. But it does make managing the refspec struct
awkward, as we must keep the two arrays in sync. This has led to bugs in
the past (both leaks and double-frees).
Let's just store a copy of the raw refspec string directly in each
refspec_item struct. This simplifies the handling at a small cost:
1. Direct callers of refspec_item_init() will now get an extra copy of
the refspec string, even if they don't need it. This should be
negligible, as the struct is already allocating two strings for the
parsed src/dst values (and we tend to only do it sparingly anyway
for things like the TAG_REFSPEC literal).
2. Users of refspec_appendf() will now generate a temporary string,
copy it, and then free the result (versus handing off ownership of
the temporary string). We could get around this by having a "nodup"
variant of refspec_item_init(), but it doesn't seem worth the extra
complexity for something that is not remotely a hot code path.
Code which accesses refspec->raw now needs to look at refspec->item.raw.
Other callers which just use refspec_item directly can remain the same.
We'll free the allocated string in refspec_item_clear(), which they
should be calling anyway to free src/dst.
One subtle note: refspec_item_init() can return an error, in which case
we'll still have set its "raw" field. But that is also true of the "src"
and "dst" fields, so any caller which does not _clear() the failed item
is already potentially leaking. In practice most code just calls die()
on an error anyway, but you can see the exception in valid_fetch_refspec(),
which does correctly call _clear() even on error.
Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Jeff King [Tue, 12 Nov 2024 08:36:10 +0000 (03:36 -0500)]
refspec: drop separate raw_nr count
A refspec struct contains zero or more refspec_item structs, along with
matching "raw" strings. The items and raw strings are kept in separate
arrays, but those arrays will always have the same length (because we
write them only via refspec_append_nodup(), which grows both). This can
lead to bugs when manipulating the array, since the arrays and lengths
must be modified in lockstep. For example, the bug fixed in the previous
commit, which forgot to decrement raw_nr.
So let's get rid of "raw_nr" and have only "nr", making this kind of bug
impossible (and also making it clear that the two are always matched,
something that existing code already assumed but was not guaranteed by
the interface).
Even though we'd expect "alloc" and "raw_alloc" to likewise move in
lockstep, we still need to keep separate counts there if we want to
continue to use ALLOC_GROW() for both.
Conceptually this would all be simpler if refspec_item just held onto
its own raw string, and we had a single array. But there are callers
which use refspec_item outside of "struct refspec" (and so don't hold on
to a matching "raw" string at all), which we'd possibly need to adjust.
So let's not worry about refactoring that for now, and just get rid of
the redundant count variable. That is the first step on the road to
combining them anyway.
Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Jeff King [Tue, 12 Nov 2024 08:34:00 +0000 (03:34 -0500)]
fetch: adjust refspec->raw_nr when filtering prefetch refspecs
In filter_prefetch_refspecs(), we may remove one or more refspecs if
they point into refs/tags/. When we do, we remove the item from the
refspec->items array, shifting subsequent items down, and then decrement
the refspec->nr count.
We also remove the item from the refspec->raw array, but fail to
decrement refspec->raw_nr. This leaves us with a count that is too high,
and anybody looking at the "raw" array will erroneously see either:
1. The removed entry, if there were no subsequent items to shift down.
2. A duplicate of the final entry, as everything is shifted down but
there was nothing to overwrite the final item.
The obvious culprit to run into this is calling refspec_clear(), which
will try to free the removed entry (case 1) or double-free the final
entry (case 2). But even though the bug has existed since the function
was added in 2e03115d0c (fetch: add --prefetch option, 2021-04-16), we
did not trigger it in the test suite. The --prefetch option is normally
only used with configured refspecs, and we never bother to call
refspec_clear() on those (they are stored as part of a struct remote,
which is held in a global variable).
Ironically you couldn't trigger case 1, because the code accidentally
leaked the string in the raw array, and the two bugs (the leak and the
double-free) cancelled out. But when we fixed the leak in ea4780307c
(fetch: free "raw" string when shrinking refspec, 2024-09-24), it became
possible to trigger that, too, with a single item:
git fetch --prefetch . refs/tags/foo
We can fix both cases by just correctly decrementing "raw_nr" when we
shrink the array. Even though we don't expect people to use --prefetch
with command-line refspecs, we'll add a test to make sure it behaves
well (like the test just before it, we're just confirming that the
filtered prefetch succeeds at all).
Reported-by: Eric Mills <ermills@epic.com> Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Jonathan Tan [Fri, 1 Nov 2024 20:11:48 +0000 (13:11 -0700)]
index-pack: repack local links into promisor packs
Teach index-pack to, when processing the objects in a pack with
--promisor specified on the CLI, repack local objects (and the local
objects that they refer to, recursively) referenced by these objects
into promisor packs.
This prevents the situation in which, when fetching from a promisor
remote, we end up with promisor objects (newly fetched) referring
to non-promisor objects (locally created prior to the fetch). This
situation may arise if the client had previously pushed objects to the
remote, for example. One issue that arises in this situation is that,
if the non-promisor objects become inaccessible except through promisor
objects (for example, if the branch pointing to them has moved to
point to the promisor object that refers to them), then GC will garbage
collect them. There are other ways to solve this, but the simplest
seems to be to enforce the invariant that we don't have promisor objects
referring to non-promisor objects.
This repacking is done from index-pack to minimize the performance
impact. During a fetch, the only time most objects are fully inflated
in memory is when their object ID is computed, so we also scan the
objects (to see which objects they refer to) during this time.
Also to minimize the performance impact, an object is calculated to be
local if it's a loose object or present in a non-promisor pack. (If it's
also in a promisor pack or referred to by an object in a promisor pack,
it is technically already a promisor object. But a misidentification
of a promisor object as a non-promisor object is relatively benign
here - we will thus repack that promisor object into a promisor pack,
duplicating it in the object store, but there is no correctness issue,
just an issue of inefficiency.)
Signed-off-by: Jonathan Tan <jonathantanmy@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Jeff King [Mon, 11 Nov 2024 07:01:34 +0000 (02:01 -0500)]
test-lib: check malloc debug LD_PRELOAD before using
This fixes test failures across the suite on glibc platforms that don't
have libc_malloc_debug.so.0.
We added support for glibc's malloc checking routines long ago in a731fa916e (Add MALLOC_CHECK_ and MALLOC_PERTURB_ libc env to the test
suite for detecting heap corruption, 2012-09-14). Back then we didn't
need to do any checks to see if the platform supported it. We were just
setting some environment variables which would either enable it or not.
That changed in 131b94a10a (test-lib.sh: Use GLIBC_TUNABLES instead of
MALLOC_CHECK_ on glibc >= 2.34, 2022-03-04). Now that glibc split this
out into libc_malloc_debug.so, we have to add it to LD_PRELOAD. We only
do that when we detect glibc, but it's possible to have glibc but not
the malloc debug library. In that case LD_PRELOAD will complain to
stderr, and tests which check for an empty stderr will fail.
You can work around this by setting TEST_NO_MALLOC_CHECK, which disables
the feature entirely. But it's not obvious to know you need to do that.
Instead, since this malloc checking is best-effort anyway, let's just
automatically disable it when the LD_PRELOAD appears not to work. We can
check it by running something simple that should work (and produce
nothing on stderr) like "git version".
Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Abhijeet Sonar [Sat, 9 Nov 2024 09:27:39 +0000 (14:57 +0530)]
t5300: add test for 'show-index --object-format'
In 88a09a557c (builtin/show-index: provide options to determine hash
algo), the flag --object-format was added to show-index builtin as a way
to provide a hash algorithm explicitly. However, we do not have tests in
place for that functionality. Add them.
Signed-off-by: Abhijeet Sonar <abhijeet.nkt@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Abhijeet Sonar [Sat, 9 Nov 2024 09:27:38 +0000 (14:57 +0530)]
show-index: fix uninitialized hash function
In c8aed5e8da (repository: stop setting SHA1 as the default object
hash), we got rid of the default hash algorithm for the_repository.
Due to this change, it is now the responsibility of the callers to set
their own default when this is not present.
As stated in the docs, show-index should use SHA1 as the default hash
algorithm when run outside a repository. Make sure this promise is
met by falling back to SHA1 when the_hash_algo is not present (i.e.
when the command is run outside a repository). Also add a test that
verifies this behavior.
Signed-off-by: Abhijeet Sonar <abhijeet.nkt@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Junio C Hamano [Fri, 8 Nov 2024 03:56:27 +0000 (12:56 +0900)]
Merge branch 'ps/upgrade-clar'
Buildfix and upgrade of Clar to a newer version.
* ps/upgrade-clar:
cmake: set up proper dependencies for generated clar headers
cmake: fix compilation of clar-based unit tests
Makefile: extract script to generate clar declarations
Makefile: adjust sed command for generating "clar-decls.h"
t/unit-tests: update clar to 206accb