We don't run style checks on our CI, even though we have a
'.clang-format' setup in the repository. Let's add one: the job will
validate only the newly added commits and will run only on merge
requests. Since we're introducing it for the first time, let's allow
the job to fail, so we can evaluate whether it is useful and eventually
enforce it.
For GitHub, we allow the job to pass by adding 'continue-on-error: true'
to the workflow. This means the job will show as passed even if the
style check failed; to know the actual status of the check, users have
to inspect the logs manually.
For GitLab, we allow the job to pass by adding 'allow_failure: true' to
the job. Unlike GitHub, here the job will show as failed with a yellow
warning symbol, but the pipeline will still show as passed.
Also for GitLab, we use the 'CI_MERGE_REQUEST_TARGET_BRANCH_SHA'
variable by default to obtain the base SHA, since it is only available
in merged results pipelines [1]. Otherwise we fall back to the
'CI_MERGE_REQUEST_DIFF_BASE_SHA' variable.
There are some spacing rules that we follow in the project, so it makes
sense to formalize them (a short illustration follows the list):
* Ensure there is no space inserted after the logical not '!' operator.
* Ensure there is no space before the case statement's colon.
* Ensure there is no space before the first bracket '[' of an array.
* Ensure there is no space in empty blocks.
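A hypothetical C fragment (not taken from the Git tree) following these
rules, which presumably map to clang-format's SpaceAfterLogicalNot,
SpaceBeforeCaseColon, SpaceBeforeSquareBrackets and SpaceInEmptyBlock
options, would look like this:

    static int demo(void *p, int flag)
    {
            int values[2] = { 0, 1 };  /* no space before the '[' of an array */

            if (!p)                    /* no space after the logical not '!' */
                    return -1;

            switch (flag) {
            case 0:                    /* no space before the case colon */
                    return values[0];
            default:
                    break;
            }

            while (0) {}               /* no space in empty blocks */
            return values[1];
    }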
Signed-off-by: Karthik Nayak <karthik.188@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The spacing around bit-field colons is currently not standardized, and
we have the following practices in our code base:
- Spacing around the colon `int bf : 1`: 146 instances
- No spacing around the colon `int bf:1`: 148 instances
- Spacing before the colon `int bf :1`: 6 instances
- Spacing after the colon `int bf: 1`: 12 instances
Let's formalize this by picking the most followed pattern and adding
the corresponding style to '.clang-format'.
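For illustration, the chosen style looks like this (a made-up struct; in
clang-format this presumably corresponds to 'BitFieldColonSpacing: None'):

    struct entry_flags {
            unsigned int dirty:1;    /* no space on either side of the colon */
            unsigned int cached:1;
            unsigned int mode:4;
    };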
Signed-off-by: Karthik Nayak <karthik.188@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
clang-format: indent preprocessor directives after hash
We do not have a rule around the indentation of preprocessor directives.
This was also discussed on the list [1], noting how there is often
inconsistency in the styling. While there was discussion, there was no
conclusion around what is the preferred style here. One style being
indenting after the hash:
#if FOO
#  if BAR
#    include <foo>
#  endif
#endif
The other being before the hash:
#if FOO
  #if BAR
    #include <foo>
  #endif
#endif
Let's pick the former and add the 'IndentPPDirectives: AfterHash' value
to our '.clang-format'. There is no clear reason to pick one over the
other, but it would definitely be nicer to be consistent.
Junio C Hamano [Mon, 8 Jul 2024 21:53:09 +0000 (14:53 -0700)]
Merge branch 'tb/path-filter-fix'
The Bloom filter used for path limited history traversal was broken
on systems whose "char" is unsigned; update the implementation and
bump the format version to 2.
* tb/path-filter-fix:
bloom: introduce `deinit_bloom_filters()`
commit-graph: reuse existing Bloom filters where possible
object.h: fix mis-aligned flag bits table
commit-graph: new Bloom filter version that fixes murmur3
commit-graph: unconditionally load Bloom filters
bloom: prepare to discard incompatible Bloom filters
bloom: annotate filters with hash version
repo-settings: introduce commitgraph.changedPathsVersion
t4216: test changed path filters with high bit paths
t/helper/test-read-graph: implement `bloom-filters` mode
bloom.h: make `load_bloom_filter_from_graph()` public
t/helper/test-read-graph.c: extract `dump_graph_info()`
gitformat-commit-graph: describe version 2 of BDAT
commit-graph: ensure Bloom filters are read with consistent settings
revision.c: consult Bloom filters for root commits
t/t4216-log-bloom.sh: harden `test_bloom_filters_not_used()`
Junio C Hamano [Mon, 8 Jul 2024 21:53:08 +0000 (14:53 -0700)]
Merge branch 'rj/pager-die-upon-exec-failure'
When GIT_PAGER failed to spawn, depending on the code path taken,
we either failed immediately (correct) or just spewed the payload to
the standard output (incorrect). The code now always fails immediately
when GIT_PAGER fails.
* rj/pager-die-upon-exec-failure:
pager: die when paging to non-existing command
"git archive --add-virtual-file=<path>:<contents>" never paid
attention to the --prefix=<prefix> option but the documentation
said it would. The documentation has been corrected.
* jc/archive-prefix-with-add-virtual-file:
archive: document that --add-virtual-file takes full path
Junio C Hamano [Mon, 8 Jul 2024 05:50:37 +0000 (22:50 -0700)]
Merge https://github.com/j6t/git-gui
* https://github.com/j6t/git-gui:
git-gui: fix inability to quit after closing another instance
git-gui: sv.po: Update Swedish translation (576t0f0u)
git-gui: note the new maintainer
Makefile(s): do not enforce "all indents must be done with tab"
Makefile(s): avoid recipe prefix in conditional statements
doc: switch links to https
doc: update links to current pages
git-gui: po: fix typo in French "aperçu"
Johannes Sixt [Sun, 7 Jul 2024 12:00:23 +0000 (14:00 +0200)]
Merge branch 'os/catch-rename'
The problem can be reproduced on Linux with this sequence:
1. Run git gui from a terminal.
2. Edit the commit message and wait for at least 2 seconds.
3. Terminate the instance from the terminal, for example with Ctrl-C,
to simulate a crash. This leaves the file .git/GITGUI_BCK behind.
4. Start two instances of git gui &.
At this point the first instance can be closed (it renames
.git/GITGUI_BCK to .git/GITGUI_MSG), but the second brings up an error
message about the absent file, cannot be closed thereafter, and must
be killed from the command line.
The renaming that happens by the first instance is the correct action
and need not be repeated by the second instance. It is the correct
action to ignore the failed renaming.
On the other hand, the second instance could just edit the commit
message again, wait 2 seconds for GITGUI_BCK to be written, and then be
closed without failing. At this point, since the user has edited the
message, it is again correct to preserve the edited version in
GITGUI_MSG.
* os/catch-rename:
git-gui: fix inability to quit after closing another instance
Junio C Hamano [Tue, 2 Jul 2024 16:59:01 +0000 (09:59 -0700)]
Merge branch 'jk/remote-wo-url'
Memory ownership rules for the in-core representation of
remote.*.url configuration values have been straightened out, which
resulted in a few leak fixes and code clarification.
* jk/remote-wo-url:
remote: drop checks for zero-url case
remote: always require at least one url in a remote
t5801: test remote.*.vcs config
t5801: make remote-testgit GIT_DIR setup more robust
remote: allow resetting url list
config: document remote.*.url/pushurl interaction
remote: simplify url/pushurl selection
remote: use strvecs to store remote url/pushurl
remote: transfer ownership of memory in add_url(), etc
remote: refactor alias_url() memory ownership
archive: fix check for missing url
Junio C Hamano [Tue, 2 Jul 2024 16:59:01 +0000 (09:59 -0700)]
Merge branch 'jc/fuzz-sans-curl'
CI job to build minimum fuzzers learned to pass NO_CURL=NoThanks to
the build procedure, as its build environment does not offer cURL,
nor does the rest of the build need it.
Junio C Hamano [Tue, 2 Jul 2024 16:59:00 +0000 (09:59 -0700)]
Merge branch 'rb/build-options-w-lib-versions'
"git version --build-options" reports the version information of
OpenSSL and other libraries (if used) in the build.
* rb/build-options-w-lib-versions:
version: teach --build-options to report zlib version information
version: teach --build-options to report libcurl version information
version: --build-options reports OpenSSL version information
Junio C Hamano [Tue, 2 Jul 2024 16:59:00 +0000 (09:59 -0700)]
Merge branch 'ps/use-the-repository'
A CPP macro USE_THE_REPOSITORY_VARIABLE is introduced to help
transition the codebase to rely less on the availability of the
singleton the_repository instance.
* ps/use-the-repository:
hex: guard declarations with `USE_THE_REPOSITORY_VARIABLE`
t/helper: remove dependency on `the_repository` in "proc-receive"
t/helper: fix segfault in "oid-array" command without repository
t/helper: use correct object hash in partial-clone helper
compat/fsmonitor: fix socket path in networked SHA256 repos
replace-object: use hash algorithm from passed-in repository
protocol-caps: use hash algorithm from passed-in repository
oidset: pass hash algorithm when parsing file
http-fetch: don't crash when parsing packfile without a repo
hash-ll: merge with "hash.h"
refs: avoid include cycle with "repository.h"
global: introduce `USE_THE_REPOSITORY_VARIABLE` macro
hash: require hash algorithm in `empty_tree_oid_hex()`
hash: require hash algorithm in `is_empty_{blob,tree}_oid()`
hash: make `is_null_oid()` independent of `the_repository`
hash: convert `oidcmp()` and `oideq()` to compare whole hash
global: ensure that object IDs are always padded
hash: require hash algorithm in `oidread()` and `oidclr()`
hash: require hash algorithm in `hasheq()`, `hashcmp()` and `hashclr()`
hash: drop (mostly) unused `is_empty_{blob,tree}_sha1()` functions
Junio C Hamano [Tue, 2 Jul 2024 16:27:57 +0000 (09:27 -0700)]
Merge branch 'jc/no-default-attr-tree-in-bare' into maint-2.45
Earlier we stopped using the tree of HEAD as the default source of
attributes in a bare repository, but failed to document it. This
has been corrected.
* jc/no-default-attr-tree-in-bare:
attr.tree: HEAD:.gitattributes is no longer the default in a bare repo
Junio C Hamano [Tue, 2 Jul 2024 16:27:56 +0000 (09:27 -0700)]
Merge branch 'tb/precompose-getcwd' into maint-2.45
We forgot to normalize the result of getcwd() to NFC on macOS where
all other paths are normalized, which has been corrected. This still
does not address the case where core.precomposeUnicode configuration
is not defined globally.
* tb/precompose-getcwd:
macOS: ls-files path fails if path of workdir is NFD
Junio C Hamano [Tue, 2 Jul 2024 16:27:56 +0000 (09:27 -0700)]
Merge branch 'pw/rebase-i-error-message' into maint-2.45
When the user adds to "git rebase -i" instruction to "pick" a merge
commit, the error experience is not pleasant. Such an error is now
caught earlier in the process that parses the todo list.
* pw/rebase-i-error-message:
rebase -i: improve error message when picking merge
rebase -i: pass struct replay_opts to parse_insn_line()
Junio C Hamano [Tue, 2 Jul 2024 16:27:55 +0000 (09:27 -0700)]
Merge branch 'ds/format-patch-rfc-and-k' into maint-2.45
The "-k" and "--rfc" options of "format-patch" will now error out
when used together, as one tells us not to add anything to the
title of the commit, and the other one tells us to add "RFC" in
addition to "PATCH".
* ds/format-patch-rfc-and-k:
format-patch: ensure that --rfc and -k are mutually exclusive
Orgad Shaneh [Tue, 7 Feb 2023 07:43:17 +0000 (09:43 +0200)]
git-gui: fix inability to quit after closing another instance
If you open 2 git gui instances in the same directory, then close one
of them and try to close the other, an error message pops up, saying:
'error renaming ".git/GITGUI_BCK": no such file or directory', and it
is no longer possible to close the window ever.
Fix by catching this error, and proceeding even if the file no longer
exists.
Junio C Hamano [Fri, 28 Jun 2024 22:53:18 +0000 (15:53 -0700)]
Merge branch 'jc/varargs-attributes' into maint-2.45
Varargs functions that are unannotated as printf-like or execl-like
have been annotated as such.
* jc/varargs-attributes:
__attribute__: add a few missing format attributes
__attribute__: mark some functions with LAST_ARG_MUST_BE_NULL
__attribute__: remove redundant attribute declaration for git_die_config()
__attribute__: trace2_region_enter_printf() is like "printf"
Junio C Hamano [Fri, 28 Jun 2024 22:53:15 +0000 (15:53 -0700)]
Merge branch 'es/chainlint-ncores-fix' into maint-2.45
The chainlint script (invoked during "make test") did nothing when
it failed to detect the number of available CPUs. It now falls
back to 1 CPU to avoid the problem.
* es/chainlint-ncores-fix:
chainlint.pl: latch CPU count directly reported by /proc/cpuinfo
chainlint.pl: fix incorrect CPU count on Linux SPARC
chainlint.pl: make CPU count computation more robust
Junio C Hamano [Fri, 28 Jun 2024 22:53:08 +0000 (15:53 -0700)]
Merge branch 'bc/zsh-compatibility' into maint-2.45
zsh can pretend to be a normal shell pretty well except for some
glitches that we tickle in some of our scripts. Work them around
so that "vimdiff" and our test suite work well enough with it.
* bc/zsh-compatibility:
vimdiff: make script and tests work with zsh
t4046: avoid continue in &&-chain for zsh
Junio C Hamano [Fri, 28 Jun 2024 22:53:08 +0000 (15:53 -0700)]
Merge branch 'js/for-each-repo-keep-going' into maint-2.45
A scheduled "git maintenance" job is expected to work on all
repositories it knows about, but it stopped at the first one that
errored out. Now it keeps going.
* js/for-each-repo-keep-going:
maintenance: running maintenance should not stop on errors
for-each-repo: optionally keep going on an error
Junio C Hamano [Fri, 28 Jun 2024 22:53:06 +0000 (15:53 -0700)]
Merge branch 'pw/rebase-m-signoff-fix' into maint-2.45
"git rebase --signoff" used to forget that it needs to add a
sign-off to the resulting commit when told to continue after a
conflict stops its operation.
* pw/rebase-m-signoff-fix:
rebase -m: fix --signoff with conflicts
sequencer: store commit message in private context
sequencer: move current fixups to private context
sequencer: start removing private fields from public API
sequencer: always free "struct replay_opts"
Derrick Stolee [Fri, 28 Jun 2024 12:43:25 +0000 (12:43 +0000)]
sparse-index: improve lstat caching of sparse paths
The clear_skip_worktree_from_present_files() method was first introduced
in af6a51875a (repo_read_index: clear SKIP_WORKTREE bit from files
present in worktree, 2022-01-14) to allow better interaction with the
working directory in the presence of paths outside of the
sparse-checkout. The initial implementation would lstat() every single
SKIP_WORKTREE path to see if it existed; if it ran across a sparse
directory that existed (when a sparse index was in use), then it would
expand the index and then check every SKIP_WORKTREE path.
Since these lstat() calls were very expensive, this was improved in d79d299352 (Accelerate clear_skip_worktree_from_present_files() by
caching, 2022-01-14) by caching directories that do not exist so it
could avoid lstat()ing any files under such directories. However, there
are some inefficiencies in that caching mechanism.
The caching mechanism stored only the immediate parent directory as not
existing, even if a higher-level ancestor directory also does not exist.
This means that wasted lstat() calls occur when consecutive paths passed
to path_found() have different immediate parent directories but share
the same missing ancestor directory.
To create an example repository that demonstrates this problem, it helps
to have a directory outside of the sparse-checkout that contains many
deep paths. In particular, the first paths (in lexicographic order)
underneath the sparse directory should have deep directory structures,
maximizing the difference between the old caching algorithm that looks
to a single parent and the new caching algorithm that looks to the
top-most missing directory.
The performance test script p2000-sparse-operations.sh takes the sample
repository and copies its HEAD to several copies nested in directories
of the form f<i>/f<j>/f<k> where i, j, and k are numbers from 1 to 4.
The sparse-checkout cone is then selected as "f2/f4/". Creating "f1/f1/"
will trigger the behavior and also lead to some interesting cases for
the caching algorithm since "f1/f1/" exists but "f1/f2/" and "f3/" do
not.
This is difficult to notice when running performance tests using the Git
repository (or a blow-up of the Git repository, as in
p2000-sparse-operations.sh) because Git has a very shallow directory
structure.
This change reorganizes the caching algorithm to focus on storing the
highest level leading directory that does not exist; specifically this
means that that directory's parent _does_ exist. By doing a little extra
work on a path passed to path_found(), we can short-circuit all of the
paths passed to path_found() afterwards that match a prefix with that
non-existing directory. When in a repository where the first sparse file
is likely to have a much deeper path than the first non-existing
directory, this can realize significant gains.
The details of this algorithm require careful attention, so the new
implementation of path_found() has detailed comments, including the use
of a new max_common_dir_prefix() method that may be of independent
interest.
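As a rough illustration of the idea only (the actual implementation in
sparse-index.c differs), such a helper could compute the longest common
prefix of two paths that ends on a directory boundary:

    #include <stddef.h>

    /*
     * Illustrative sketch: return the length of the longest common prefix
     * of 'a' and 'b' that ends in a '/', i.e. the deepest directory the
     * two paths share.
     */
    static size_t max_common_dir_prefix(const char *a, const char *b)
    {
            size_t i, common = 0;

            for (i = 0; a[i] && a[i] == b[i]; i++) {
                    if (a[i] == '/')
                            common = i + 1;    /* keep the trailing slash */
            }
            return common;
    }

With such a helper, once 'f1/f2/x/y/z' is known to be missing at 'f1/f2/',
any later path that shares the 'f1/f2/' prefix can be skipped without
issuing another lstat().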
It's worth noting that this is not universally positive, since we are
doing extra lstat() calls to establish the exact path to cache. In the
blow-up of the Git repository, we can see that the lstat count
_increases_ from 28 to 31. However, these numbers were already
artificially low.
Contributor Elijah Newren created a publicly-available test repository
that demonstrates the difference in these caching algorithms in the most
extreme way. To test, follow these steps:
git clone --sparse https://github.com/newren/gvfs-like-git-bomb
cd gvfs-like-git-bomb
./runme.sh # NOTE: check scripts before running!
At this point, assuming you do not have index.sparse=true set globally,
the index has one million paths with the SKIP_WORKTREE bit and they will
all be sent to path_found() in the sparse loop. You can measure this by
running 'git status' with GIT_TRACE2_PERF=1:
Sparse files in the index: 1,000,000
sparse_lstat_count (before): 200,000
sparse_lstat_count (after): 2
And here are the performance numbers:
Benchmark 1: old
Time (mean ± σ): 397.5 ms ± 4.1 ms
Range (min … max): 391.2 ms … 404.8 ms 10 runs
Benchmark 2: new
Time (mean ± σ): 252.7 ms ± 3.1 ms
Range (min … max): 249.4 ms … 259.5 ms 11 runs
Summary
'new' ran
1.57 ± 0.02 times faster than 'old'
By modifying this example further, we can demonstrate a more realistic
example and include the sparse index expansion. Continue by creating
this directory, confusing both caching algorithms somewhat:
mkdir -p bomb/d/e/f/a/a
Then re-run the 'git status' tests to see these statistics:
Sparse files in the index: 1,000,000
sparse_lstat_count (before): 724,010
sparse_lstat_count (after): 106
Benchmark 1: old
Time (mean ± σ): 753.0 ms ± 3.5 ms
Range (min … max): 749.7 ms … 760.9 ms 10 runs
Benchmark 2: new
Time (mean ± σ): 201.4 ms ± 3.2 ms
Range (min … max): 196.0 ms … 207.9 ms 14 runs
Summary
'new' ran
3.74 ± 0.06 times faster than 'old'
Note that if this repository had a sparse index enabled, the additional
cost of expanding the sparse index affects the total time of these
commands by over four seconds, significantly diminishing the benefit of
the caching algorithm. Having existing paths outside of the
sparse-checkout is a known performance issue for the sparse index and is
a known trade-off for the performance benefits given when no such paths
exist.
Using an internal monorepo with over two million paths at HEAD and a
typical sparse-checkout cone such that the sparse index contains
~190,000 entries (including over two thousand sparse trees), I was able
to measure these lstat counts when one sparse directory actually exists
on disk:
Derrick Stolee [Fri, 28 Jun 2024 12:43:24 +0000 (12:43 +0000)]
sparse-index: count lstat() calls
The clear_skip_worktree.. methods already report some statistics about
how many cache entries are checked against path_found() due to having
the skip-worktree bit set. However, due to path_found() performing some
caching, this isn't the only information that would be helpful to
report.
Add a new lstat_count member to the path_found_data struct to count the
number of times path_found() calls lstat(). This will help explain
performance problems in this method as well as demonstrate
future changes to the caching algorithm in a more concrete way than
end-to-end timings.
Signed-off-by: Derrick Stolee <stolee@gmail.com>
Reviewed-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Derrick Stolee [Fri, 28 Jun 2024 12:43:23 +0000 (12:43 +0000)]
sparse-index: use strbuf in path_found()
The path_found() method previously reused strings from the cache entries
the calling methods were using. This prevented in-place string
manipulation and caused some odd reallocation before the final lstat()
call in the method.
Refactor the method to use strbufs and copy the path into the strbuf,
but also only the parent directory and not the whole path. This looks
like extra copying when assigning the path to the strbuf, but we save an
allocation by dropping the 'tmp' string, and we are "reusing" the copy
from 'tmp' to put the data in the strbuf.
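A hedged sketch of the idea (not the actual path_found() code; this only
builds inside the Git tree, which provides "strbuf.h"):

    #include <string.h>
    #include <sys/stat.h>
    #include "strbuf.h"

    /* Copy only the parent directory of 'path' into a strbuf and lstat() it. */
    static int parent_dir_exists(const char *path)
    {
            struct strbuf parent = STRBUF_INIT;
            const char *slash = strrchr(path, '/');
            struct stat st;
            int ret;

            if (!slash)
                    return 1;    /* top-level path: the parent is the root */
            strbuf_add(&parent, path, slash - path);  /* directory part only */
            ret = !lstat(parent.buf, &st);
            strbuf_release(&parent);
            return ret;
    }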
Signed-off-by: Derrick Stolee <stolee@gmail.com>
Reviewed-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Derrick Stolee [Fri, 28 Jun 2024 12:43:22 +0000 (12:43 +0000)]
sparse-index: refactor path_found()
In advance of changing the behavior of path_found(), take all of the
intermediate data values and group them into a single struct. This
simplifies the method prototype as well as the initialization. Future
changes can be made directly to the struct and method without changing
the callers with this approach.
Note that the clear_path_found_data() method is currently empty, as
there is nothing to free. This method is a placeholder for future
changes that require a non-trivial implementation. Its stub is created
now so callers can start using it and will not need to change later.
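An illustrative shape of that grouping (the member names in
sparse-index.c may well differ):

    /* Intermediate state shared across calls to path_found(). */
    struct path_found_data {
            const char *dir;    /* borrowed: last parent directory checked */
            size_t dir_len;
            int dir_found;      /* did that directory exist on disk? */
    };

    static void clear_path_found_data(struct path_found_data *data)
    {
            /* Nothing to free yet: all members are borrowed or scalar. */
    }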
Signed-off-by: Derrick Stolee <stolee@gmail.com>
Reviewed-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The clear_skip_worktree_from_present_files() method was introduced in af6a51875a (repo_read_index: clear SKIP_WORKTREE bit from files present
in worktree, 2022-01-14) to help cases where sparse-checkout is enabled
but some paths outside of the sparse-checkout also exist on disk. This
operation can be slow as it needs to check path existence in a way not
stored in the index, so caching was introduced in d79d299352 (Accelerate
clear_skip_worktree_from_present_files() by caching, 2022-01-14).
This check is particularly confusing in the presence of a sparse index,
as a sparse tree entry corresponding to an existing directory must first
be expanded to a full index before examining the paths within. This is
currently implemented using a 'goto' and a boolean variable to ensure we
restart only once.
Even with that caching, it was noticed that this could take a long time
to execute. 89aaab11a3 (index: add trace2 region for clear skip
worktree, 2022-11-03) introduced trace2 regions to measure this time.
Further, the way the loop repeats itself was slightly confusing and
prone to breakage, so a BUG() statement was added in 8c7abdc596 (index:
raise a bug if the index is materialised more than once, 2022-11-03) to
be sure that the second run of the loop does not hit any sparse trees.
One thing that can be confusing about the current setup is that the
trace2 regions nest and it is not clear that a second loop is running
after a sparse index is expanded. Here is an example of what the regions
look like in a typical case:
One thing that is particularly difficult to understand about these
regions is that most of the time is spent between the close of the
ensure_full_index region and the reporting of the end data. This is
because of the restart of the loop being within the same region as the
first iteration of the loop.
This change refactors the method into two separate methods that are
traced separately. This will be more important later when we change
other features of the methods, but for now the only functional change is
the difference in the structure of the trace regions.
After this change, the same telemetry section is split into three
distinct chunks:
Here, we see the sparse loop terminating early with its first sparse
path being a sparse directory containing a file. Then, that loop's
region terminates before ensure_full_index begins (in this case, the
cache-tree must also be computed). Then, _after_ the index is expanded,
the full loop begins with its own region.
Signed-off-by: Derrick Stolee <stolee@gmail.com>
Reviewed-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
When "git merge" sees that the index cannot be refreshed (e.g. due
to another process doing the same in the background), it died but
after writing MERGE_HEAD etc. files, which was useless for the
purpose to recover from the failure.
* kz/merge-fail-early-upon-refresh-failure:
merge: avoid write merge state when unable to write index
Junio C Hamano [Fri, 14 Jun 2024 18:40:57 +0000 (11:40 -0700)]
archive: document that --add-virtual-file takes full path
Tom Scogland noticed that the `--add-virtual-file` option uses the path
specified as its value as-is, without prepending any value given to
the `--prefix` option like `--add-file` does.
The behaviour has always been that way since the option was
introduced, but the documentation has always been wrong and said
that it would use the value of `--prefix` just like `--add-file`
does.
We could modify the behaviour to make it literally work like the
documentation said, but doing so would break existing scripts that
users rely on.
Noticed-by: Tom Scogland <scogland1@llnl.gov>
Acked-by: René Scharfe <l.s.r@web.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Darcy Burke [Tue, 25 Jun 2024 23:12:48 +0000 (16:12 -0700)]
date: detect underflow/overflow when parsing dates with timezone offset
Overriding the date of a commit to be close to "1970-01-01 00:00:00"
with a large enough positive timezone offset for the equivalent GMT time
to be before the epoch is considered valid by `parse_date_basic`. Similar
behaviour occurs when using a date close to "2099-12-31 23:59:59" (the
maximum date allowed by `tm_to_time_t`) with a large enough negative
timezone offset.
This leads to an integer underflow or overflow, respectively, in the
commit timestamp, which is not caught by `git-commit`, but will cause
other services to fail, such as `git-fsck`, which, for the first case,
reports "badDateOverflow: invalid author/committer line - date causes
integer overflow".
Instead check the timezone offset and fail if the resulting time comes
before the epoch "1970-01-01T00:00:00Z" or after the maximum date
"2099-12-31T23:59:59Z".
Using the REQUIRE_64BIT_TIME prerequisite, make sure that the tests
near the end of Git time (aka end of year 2099) are not attempted on
purely 32-bit systems, as they cannot express timestamp beyond 2038
anyway.
Signed-off-by: Darcy Burke <acednes@gmail.com>
[jc: fixups for 32-bit platforms]
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Junio C Hamano [Tue, 25 Jun 2024 23:12:47 +0000 (16:12 -0700)]
t0006: simplify prerequisites
The system must support 64-bit time and its time_t must be 64-bit
wide to pass these tests. Combine these two prerequisites together
to simplify the tests. In theory, they could be fulfilled
independently and tests could require only one without the other,
but in practice, these must come hand-in-hand.
Update the "check_parse" test helper to pay attention to the
REQUIRE_64BIT_TIME variable, which can be set to the HAVE_64BIT_TIME
prerequisite so that a parse test can be skipped on 32-bit systems.
This will be used in the next step to skip tests for timestamps near
the end of year 2099, as 32-bit systems will not be able to express
a timestamp beyond 2038 anyway.
Taylor Blau [Tue, 25 Jun 2024 17:40:15 +0000 (13:40 -0400)]
bloom: introduce `deinit_bloom_filters()`
After we are done using Bloom filters, we do not currently clean up any
memory allocated by the commit slab used to store those filters.
Besides the bloom_filter structures themselves, there is mostly nothing
to free() in the first place, since in the read-only path all Bloom
filters' `data` members point to a memory mapped region in the
commit-graph file itself.
But when generating Bloom filters from scratch (or initializing
truncated filters) we allocate additional memory to store the filter's
data.
Keep track of when we need to free() this additional chunk of memory by
using an extra pointer `to_free`. Most of the time this will be NULL
(indicating that we are representing an existing Bloom filter stored in
a memory mapped region). When it is non-NULL, free it before discarding
the Bloom filters slab.
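A self-contained illustration of the 'to_free' idea (not Git's actual
bloom.c; the struct layout is simplified):

    #include <stdlib.h>
    #include <string.h>

    struct bloom_filter {
            unsigned char *data;      /* filter bits, possibly mmap-backed */
            size_t len;
            unsigned char *to_free;   /* owned allocation, or NULL */
    };

    static void bloom_filter_init_owned(struct bloom_filter *bf, size_t len)
    {
            bf->to_free = calloc(len, 1);    /* we own this memory ... */
            bf->data = bf->to_free;          /* ... and 'data' aliases it */
            bf->len = len;
    }

    static void bloom_filter_deinit(struct bloom_filter *bf)
    {
            free(bf->to_free);               /* NULL for mmap-backed filters */
            memset(bf, 0, sizeof(*bf));
    }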
Suggested-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Taylor Blau [Tue, 25 Jun 2024 17:40:11 +0000 (13:40 -0400)]
commit-graph: reuse existing Bloom filters where possible
In an earlier commit, a bug was described where it's possible for Git to
produce non-murmur3 hashes when the platform's "char" type is signed,
and there are paths with characters whose highest bit is set (i.e. all
characters >= 0x80).
That patch allows the caller to control which version of Bloom filters
are read and written. However, even on platforms with a signed "char"
type, it is possible to reuse existing Bloom filters if and only if
there are no changed paths in any commit's first parent tree-diff whose
characters have their highest bit set.
When this is the case, we can reuse the existing filter without having
to compute a new one. This is done by marking trees which are known to
have (or not have) any such paths. When a commit's root tree is verified
to not have any such paths, we mark it as such and declare that the
commit's Bloom filter is reusable.
Note that this heuristic only goes in one direction. If neither a commit
nor its first parent have any paths in their trees with non-ASCII
characters, then we know for certain that a path with non-ASCII
characters will not appear in a tree-diff against that commit's first
parent. The reverse isn't necessarily true: just because the tree-diff
doesn't contain any such paths does not imply that no such paths exist
in either tree.
So we end up recomputing some Bloom filters that we don't strictly have
to (i.e. their bits are the same no matter which version of murmur3 we
use). But culling these out is impossible, since we'd have to perform
the full tree-diff, which is the same effort as computing the Bloom
filter from scratch.
But because we can cache our results in each tree's flag bits, we can
often avoid recomputing many filters, thereby reducing the time it takes
to run 'git commit-graph write'.
To benchmark this, let's generate a commit-graph in linux.git with v1
changed-paths in generation order[^1]:
$ git clone git@github.com:torvalds/linux.git
$ cd linux
$ git commit-graph write --reachable --changed-paths
$ graph=".git/objects/info/commit-graph"
$ mv $graph{,.bak}
Then let's time how long it takes to go from v1 to v2 filters (with and
without the upgrade path enabled), resetting the state of the
commit-graph each time:
On linux.git (where there aren't any non-ASCII paths), the timings
indicate that this patch represents a speed-up over recomputing all
Bloom filters from scratch:
Benchmark 1: GIT_TEST_UPGRADE_BLOOM_FILTERS=0 git.compile commit-graph write --reachable --changed-paths
Time (mean ± σ): 124.873 s ± 0.316 s [User: 124.081 s, System: 0.643 s]
Range (min … max): 124.621 s … 125.227 s 3 runs
Benchmark 2: GIT_TEST_UPGRADE_BLOOM_FILTERS=1 git.compile commit-graph write --reachable --changed-paths
Time (mean ± σ): 79.271 s ± 0.163 s [User: 74.611 s, System: 4.521 s]
Range (min … max): 79.112 s … 79.437 s 3 runs
Summary
'GIT_TEST_UPGRADE_BLOOM_FILTERS=1 git.compile commit-graph write --reachable --changed-paths' ran
1.58 ± 0.01 times faster than 'GIT_TEST_UPGRADE_BLOOM_FILTERS=0 git.compile commit-graph write --reachable --changed-paths'
On git.git, we do have some non-ASCII paths, giving us a more modest
improvement from 4.163 seconds to 3.348 seconds, for a 1.24x speed-up.
On my machine, the stats for git.git are:
- 8,285 Bloom filters computed from scratch
- 10 Bloom filters generated as empty
- 4 Bloom filters generated as truncated due to too many changed paths
- 65,114 Bloom filters were reused when transitioning from v1 to v2.
[^1]: Note that this is important, since `--stdin-packs` or
`--stdin-commits` orders commits in the commit-graph by their pack
position (with `--stdin-packs`) or by their order in the raw input
(with `--stdin-commits`).
Since we compute Bloom filters in the same order that commits appear
in the graph, we must see a commit's (first) parent before we process
the commit itself. This is only guaranteed to happen when sorting
commits by their generation number.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Taylor Blau [Tue, 25 Jun 2024 17:40:04 +0000 (13:40 -0400)]
commit-graph: new Bloom filter version that fixes murmur3
The murmur3 implementation in bloom.c has a bug when converting a series
of 4 bytes into network-order integers when char is signed (the
signedness of char is controllable by a compiler option, and its default
is platform-specific). When a string contains characters with the high
bit set, this bug causes results that, although internally consistent
within Git, do not accord with other implementations of murmur3 (thus,
the changed path filters wouldn't be readable by other off-the-shelf
implementations of murmur3) and even differ between Git binaries that
were compiled with different signedness of char. This bug affects both
how Git writes changed path filters to disk and how Git interprets
changed path filters on disk.
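The pitfall can be illustrated with a small stand-alone example (this is
not the exact code in bloom.c, just the same class of bug):

    #include <stdint.h>

    /* Broken: if char is signed and a byte has its high bit set (say 0xc3),
     * the cast to uint32_t sign-extends it to 0xffffffc3 and the ORs smear
     * set bits across the other bytes of the block. */
    static uint32_t block_broken(const char *data)
    {
            return (uint32_t)data[0] |
                   ((uint32_t)data[1] << 8) |
                   ((uint32_t)data[2] << 16) |
                   ((uint32_t)data[3] << 24);
    }

    /* Fixed: casting through unsigned char keeps each byte in 0..255. */
    static uint32_t block_fixed(const char *data)
    {
            return (uint32_t)(unsigned char)data[0] |
                   ((uint32_t)(unsigned char)data[1] << 8) |
                   ((uint32_t)(unsigned char)data[2] << 16) |
                   ((uint32_t)(unsigned char)data[3] << 24);
    }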
Therefore, introduce a new version (2) of changed path filters that
corrects this problem. The existing version (1) is still supported and
is still the default, but users should migrate away from it as soon
as possible.
Because this bug only manifests with characters that have the high bit
set, it may be possible that some (or all) commits in a given repo would
have the same changed path filter both before and after this fix is
applied. However, in order to determine whether this is the case, the
changed paths would first have to be computed, at which point it is not
much more expensive to just compute a new changed path filter.
So this patch does not include any mechanism to "salvage" changed path
filters from repositories. There is also no "mixed" mode - for each
invocation of Git, reading and writing changed path filters are done
with the same version number; this version number may be explicitly
stated (typically if the user knows which version they need) or
automatically determined from the version of the existing changed path
filters in the repository.
There is a change in write_commit_graph(). graph_read_bloom_data()
makes it possible for chunk_bloom_data to be non-NULL but
bloom_filter_settings to be NULL, which causes a segfault later on. I
produced such a segfault while developing this patch, but couldn't find
a way to reproduce it with the complete patch (nor without it), but in
any case it seemed like a good thing to include that might help
future patch authors.
The value in t0095 was obtained from another murmur3 implementation
using the following Go source code:
Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Taylor Blau [Tue, 25 Jun 2024 17:40:00 +0000 (13:40 -0400)]
commit-graph: unconditionally load Bloom filters
In an earlier commit, we began ignoring the Bloom data ("BDAT") chunk
for commit-graphs whose Bloom filters were computed using a hash version
incompatible with the value of `commitGraph.changedPathsVersion`.
Now that the Bloom API has been hardened to discard these incompatible
filters (with the exception of low-level APIs), we can safely load these
Bloom filters unconditionally.
We no longer want to return early from `graph_read_bloom_data()`, and
similarly do not want to set the bloom_settings' `hash_version` field as
a side-effect. The latter is because we want to wait until we know which
Bloom settings we're using (either the defaults, from the GIT_TEST
variables, or from the previous commit-graph layer) before deciding what
hash_version to use.
If we detect an existing BDAT chunk, we'll infer the rest of the
settings (e.g., number of hashes, bits per entry, and maximum number of
changed paths) from the earlier graph layer. The hash_version will be
inferred from the previous layer as well, unless one has already been
specified via configuration.
Once all of that is done, we normalize the value of the hash_version to
either "1" or "2".
Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Taylor Blau [Tue, 25 Jun 2024 17:39:57 +0000 (13:39 -0400)]
bloom: prepare to discard incompatible Bloom filters
Callers use the inline `get_bloom_filter()` implementation as a thin
wrapper around `get_or_compute_bloom_filter()`. The former calls the
latter with a value of "0" for `compute_if_not_present`, making
`get_bloom_filter()` the default read-only path for fetching an existing
Bloom filter.
Callers expect that the value returned from `get_bloom_filter()` is
usable, that is, that it's compatible with the configured value
corresponding to `commitGraph.changedPathsVersion`.
This is OK, since the commit-graph machinery only initializes its BDAT
chunk (thereby enabling it to service Bloom filter queries) when the
Bloom filter hash_version is compatible with our settings. So any value
returned by `get_bloom_filter()` is trivially usable.
However, subsequent commits will load the BDAT chunk even when the Bloom
filters are built with incompatible hash versions. Prepare to handle
this by teaching `get_bloom_filter()` to discard filters that are
incompatible with the configured hash version.
Callers who wish to read incompatible filters (e.g., for upgrading
filters from v1 to v2) may use the lower level routine,
`get_or_compute_bloom_filter()`.
Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Taylor Blau [Tue, 25 Jun 2024 17:39:54 +0000 (13:39 -0400)]
bloom: annotate filters with hash version
In subsequent commits, we will want to load existing Bloom filters out
of a commit-graph, even when the hash version they were computed with
does not match the value of `commitGraph.changedPathsVersion`.
In order to differentiate between the two, add a "version" field to each
Bloom filter.
Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
A subsequent commit will introduce another version of the changed-path
filter in the commit graph file. In order to control which version to
write (and read), a config variable is needed.
Therefore, introduce this config variable. For forwards compatibility,
teach Git to not read commit graphs when the config variable
is set to an unsupported version. Because we teach Git this,
commitgraph.readChangedPaths is now redundant, so deprecate it and
define its behavior in terms of the config variable we introduce.
This commit does not change the behavior of writing (Git writes changed
path filters when explicitly instructed regardless of any config
variable), but a subsequent commit will restrict Git such that it will
only write when commitgraph.changedPathsVersion is a recognized value.
Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Taylor Blau [Tue, 25 Jun 2024 17:39:47 +0000 (13:39 -0400)]
t4216: test changed path filters with high bit paths
Subsequent commits will teach Git another version of changed path
filter that has different behavior with paths that contain at least
one character with its high bit set, so test the existing behavior as
a baseline.
Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Implement a mode of the "read-graph" test helper to dump out the
hexadecimal contents of the Bloom filter(s) contained in a commit-graph.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Taylor Blau [Tue, 25 Jun 2024 17:39:41 +0000 (13:39 -0400)]
bloom.h: make `load_bloom_filter_from_graph()` public
Prepare for a future commit to use the load_bloom_filter_from_graph()
function directly to load specific Bloom filters out of the commit-graph
for manual inspection (to be used during tests).
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Prepare for the 'read-graph' test helper to perform other tasks besides
dumping high-level information about the commit-graph by extracting its
main routine into a separate function.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Jonathan Tan [Tue, 25 Jun 2024 17:39:34 +0000 (13:39 -0400)]
gitformat-commit-graph: describe version 2 of BDAT
The code change to Git to support version 2 will be done in subsequent
commits.
Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Taylor Blau [Tue, 25 Jun 2024 17:39:16 +0000 (13:39 -0400)]
commit-graph: ensure Bloom filters are read with consistent settings
The changed-path Bloom filter mechanism is parameterized by a couple of
variables, notably the number of bits per hash (typically "m" in Bloom
filter literature) and the number of hashes themselves (typically "k").
It is critically important that filters are read with the Bloom filter
settings that they were written with. Failing to do so would mean that
each query is liable to compute different fingerprints, meaning that the
filter itself could return a false negative. This goes against a basic
assumption of using Bloom filters (that they may return false positives,
but never false negatives) and can lead to incorrect results.
We have some existing logic to carry forward existing Bloom filter
settings from one layer to the next. In `write_commit_graph()`, we have
something like:
    if (!(flags & COMMIT_GRAPH_NO_WRITE_BLOOM_FILTERS)) {
            struct commit_graph *g = ctx->r->objects->commit_graph;
            /* We have changed-paths already. Keep them in the next graph */
            if (g && g->chunk_bloom_data) {
                    ctx->changed_paths = 1;
                    ctx->bloom_settings = g->bloom_filter_settings;
            }
    }
, which drags forward Bloom filter settings across adjacent layers.
This doesn't quite address all cases, however, since it is possible for
intermediate layers to contain no Bloom filters at all. For example,
suppose we have two layers in a commit-graph chain, say, {G1, G2}. If G1
contains Bloom filters, but G2 doesn't, a new G3 (whose base graph is
G2) may be written with arbitrary Bloom filter settings, because we only
check the immediately adjacent layer's settings for compatibility.
This behavior has existed since the introduction of changed-path Bloom
filters. But in practice, this is not such a big deal, since the only
way up until this point to modify the Bloom filter settings at write
time is with the undocumented environment variables:
(it is still possible to tweak MAX_CHANGED_PATHS between layers, but
this does not affect reads, so is allowed to differ across multiple
graph layers).
But in future commits, we will introduce another parameter to change the
hash algorithm used to compute Bloom fingerprints itself. This will be
exposed via a configuration setting, making this foot-gun easier to use.
To prevent this potential issue, validate that all layers of a split
commit-graph have compatible settings with the newest layer which
contains Bloom filters.
Reported-by: SZEDER Gábor <szeder.dev@gmail.com>
Original-test-by: SZEDER Gábor <szeder.dev@gmail.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Taylor Blau [Tue, 25 Jun 2024 17:39:13 +0000 (13:39 -0400)]
revision.c: consult Bloom filters for root commits
The commit-graph stores changed-path Bloom filters which represent the
set of paths included in a tree-level diff between a commit's root tree
and that of its parent.
When a commit has no parents, the tree-diff is computed against that
commit's root tree and the empty tree. In other words, every path in
that commit's tree is stored in the Bloom filter (since they all appear
in the diff).
Consult these filters during pathspec-limited traversals in the function
`rev_same_tree_as_empty()`. Doing so yields a performance improvement
where we can avoid enumerating the full set of paths in a parentless
commit's root tree when we know that the path(s) of interest were not
listed in that commit's changed-path Bloom filter.
Suggested-by: SZEDER Gábor <szeder.dev@gmail.com>
Original-patch-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The existing implementation of test_bloom_filters_not_used() asserts
that the Bloom filter sub-system has not been initialized at all, by
checking for the absence of any data from it in the trace2 output.
In the following commit, it will become possible to load Bloom filters
without using them (e.g., because the `commitGraph.changedPathsVersion`
introduced later in this series is incompatible with the hash version
with which the commit-graph's Bloom filters were written).
When that is the case, it's possible to initialize the Bloom filter
sub-system while still not using any Bloom filters. When that happens,
check that the data dump from the Bloom sub-system is all zeros,
indicating that no filters were used.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Rubén Justo [Sun, 23 Jun 2024 07:09:02 +0000 (09:09 +0200)]
pager: die when paging to non-existing command
When trying to execute a non-existent program from GIT_PAGER, we display
an error. However, we also send the complete text to the terminal
and return a successful exit code. This can be confusing for the user
and the displayed error could easily become obscured by lengthy
output.
For example, here the error message would be very far above after
sending 50 MB of text:
$ GIT_PAGER=non-existent t/test-terminal.perl git log | wc -c
error: cannot run non-existent: No such file or directory
50314363
Let's make the error clear by aborting the process and returning an
error so that the user can easily correct their mistake.
This will be the result of the change:
$ GIT_PAGER=non-existent t/test-terminal.perl git log | wc -c
error: cannot run non-existent: No such file or directory
fatal: unable to execute pager 'non-existent'
0
The behavior change we're introducing in this commit affects two tests
in t7006, which is a good sign regarding test coverage and requires us
to address it.
The first test is 'git skips paging non-existing command'. This test
comes from f7991f01f2 (t7006: clean up SIGPIPE handling in trace2 tests,
2021-11-21), where a modification was made to a test that was originally
introduced in c24b7f6736 (pager: test for exit code with and without
SIGPIPE, 2021-02-02). That original test was, IMHO, going in the same
direction we are taking in this commit.
At any rate, this test obviously needs to be adjusted to check the new
behavior we are introducing. Do it.
The second test being affected is: 'non-existent pager doesnt cause
crash', introduced in f917f57f40 (pager: fix crash when pager program
doesn't exist, 2021-11-24). As its name states, it has the intention of
checking that we don't introduce a regression that produces a crash when
GIT_PAGER points to a nonexistent program.
This test could be considered redundant nowadays, due to us already
having several tests checking implicitly what a non-existent command in
GIT_PAGER produces. However, let's maintain a good belt-and-suspenders
strategy; adapt it to the new world.
Finally, it's worth noting that we are not changing the behavior if the
command specified in GIT_PAGER is a shell command. In such cases, it
is:
$ GIT_PAGER=:\;non-existent t/test-terminal.perl git log
:;non-existent: 1: non-existent: not found
died of signal 13 at t/test-terminal.perl line 33.
Signed-off-by: Rubén Justo <rjusto@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Junio C Hamano [Mon, 24 Jun 2024 23:39:14 +0000 (16:39 -0700)]
Merge branch 'jc/add-i-retire-usebuiltin-config'
For over a year, setting the add.interactive.useBuiltin configuration
variable did nothing but give a "this does not do anything" warning.
Finally remove it.
Junio C Hamano [Mon, 24 Jun 2024 23:39:14 +0000 (16:39 -0700)]
Merge branch 'jc/no-default-attr-tree-in-bare'
Earlier we stopped using the tree of HEAD as the default source of
attributes in a bare repository, but failed to document it. This
has been corrected.
* jc/no-default-attr-tree-in-bare:
attr.tree: HEAD:.gitattributes is no longer the default in a bare repo