Johannes Sixt [Tue, 20 May 2025 06:54:24 +0000 (08:54 +0200)]
Merge branch 'ml/replace-auto-execok'
This addresses CVE-2025-46334, Git GUI malicious command injection on
Windows.
A malicious repository can ship versions of sh.exe or typical textconv
filter programs such as astextplain. Due to the unfortunate design of
Tcl on Windows, the search path when looking for an executable always
includes the current directory. The mentioned programs are invoked when
the user selects "Git Bash" or "Browse Files" from the menu.
Johannes Sixt [Wed, 14 May 2025 17:56:27 +0000 (19:56 +0200)]
Merge branch 'js/fix-open-exec'
This addresses CVE-2025-27613, Gitk can create and truncate a user's
files:
When a user clones an untrusted repository and runs gitk without
additional command arguments, files for which the user has write
permission can be created and truncated. The option "Support per-file
encoding" must have been enabled before in Gitk's Preferences. This
option is disabled by default.
The same happens when "Show origin of this line" is used in the main
window (regardless of whether "Support per-file encoding" is enabled or
not).
Johannes Sixt [Wed, 14 May 2025 16:27:05 +0000 (18:27 +0200)]
Merge branch 'ah/fix-open-with-stdin'
This addresses CVE-2025-27614, Arbitrary command execution with Gitk:
A Git repository can be crafted in such a way that with some social
engineering a user who has cloned the repository can be tricked into
running any script (e.g., Bourne shell, Perl, Python, ...) supplied by
the attacker by invoking `gitk filename`, where `filename` has a
particular structure. The script is run with the privileges of the user.
Johannes Sixt [Sun, 4 May 2025 19:59:19 +0000 (21:59 +0200)]
git-gui: do not mistake command arguments as redirection operators
Tcl 'open' assigns special meaning to its arguments when they begin with
a redirection, pipe or background operator. There are many calls of the
'open' variant that runs a process which construct arguments that are
taken from the Git repository or are user input. However, when file
names or ref names are taken from the repository, it is possible to
find names that have these special forms. They must not be interpreted
by 'open' lest it redirects input or output, or attempts to build a
pipeline using a command name controlled by the repository.
Use the helper function make_arglist_safe, which identifies such
arguments and prepends "./" to force such a name to be regarded as a
relative file name.
After this change the following 'open' calls that start a process do not
apply the argument processing:
In all cases, the command arguments are constant strings (or begin with
a constant string) that are of a form that would not be affected by the
processing anyway.
Signed-off-by: Johannes Sixt <j6t@kdbg.org> Signed-off-by: Taylor Blau <me@ttaylorr.com>
Johannes Sixt [Sun, 4 May 2025 18:26:11 +0000 (20:26 +0200)]
git-gui: introduce function git_redir for git calls with redirections
Proc git invokes git and collects all output, which it returns.
We are going to treat command arguments and redirections differently to
avoid passing arguments that look like redirections to the command
accidentally. A few invocations also pass redirection operators as
command arguments deliberately. Rewrite these cases to use a new
function git_redir that takes two lists, one for the regular command
arguments and one for the redirection operations.
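As a rough illustration only (git-gui is written in Tcl; this Python
model, the name build_git_command, and the operator pattern are
assumptions, not the actual code), keeping the two lists apart lets the
same string be treated differently depending on its role:

```python
import re

# Assumed pattern for words that Tcl 'exec'/'open' would read as a
# redirection, pipe, or background operator.
_SPECIAL = re.compile(r'^(?:[<>|&]|2>)')

def build_git_command(args, redirs=()):
    """Return the word list for a git call (hypothetical helper).

    Regular arguments that look like operators are neutralised by
    prepending "./"; intentional redirections are appended verbatim.
    """
    safe_args = ['./' + a if _SPECIAL.match(a) else a for a in args]
    return ['git'] + safe_args + list(redirs)

# '2>@1' as a file name is neutralised; as a redirection it is kept.
print(build_git_command(['ls-files', '--', '2>@1'], ['2>@1']))
```

With a single mixed list the two occurrences of 2>@1 could not be told
apart, which is why the call sites are rewritten.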
Signed-off-by: Johannes Sixt <j6t@kdbg.org> Signed-off-by: Taylor Blau <me@ttaylorr.com>
Johannes Sixt [Sun, 4 May 2025 13:39:03 +0000 (15:39 +0200)]
git-gui: pass redirections as separate argument to git_read
We are going to treat command arguments and redirections differently to
avoid passing arguments that look like redirections to the command
accidentally. To do so, it will be necessary to know which arguments
are intentional redirections. Rewrite direct call sites of git_read
to pass intentional redirections as a second (optional) argument.
git_read defers to safe_open_command, but we cannot make it safe, yet,
because one of the callers of git_read is proc git, which does not yet
know which of its arguments are redirections. This is the topic of the
next commit.
Signed-off-by: Johannes Sixt <j6t@kdbg.org> Signed-off-by: Taylor Blau <me@ttaylorr.com>
Johannes Sixt [Sun, 4 May 2025 13:06:11 +0000 (15:06 +0200)]
git-gui: pass redirections as separate argument to _open_stdout_stderr
We are going to treat command arguments and redirections differently to
avoid passing arguments that look like redirections to the command
accidentally. To do so, it will be necessary to know which arguments
are intentional redirections. Rewrite direct callers of
_open_stdout_stderr to pass intentional redirections as a second
(optional) argument.
Passing arbitrary arguments is not safe right now, but we rename it
to safe_open_command anyway to avoid having to touch the call sites
again later when we make it actually safe.
We cannot make the function safe right away because one caller is
git_read, which does not yet know which of its arguments are
redirections. This is the topic of the next commit.
Signed-off-by: Johannes Sixt <j6t@kdbg.org> Signed-off-by: Taylor Blau <me@ttaylorr.com>
Johannes Sixt [Sat, 3 May 2025 11:24:48 +0000 (13:24 +0200)]
git-gui: convert git_read*, git_write to be non-variadic
We are going to treat command arguments and redirections differently to
avoid passing arguments that look like redirections to the command
accidentally. To do so, it will be necessary to know which arguments
are intentional redirections. As a preparation, convert git_read,
git_read_nice, and git_write to take just a single argument that is
the command in a list. Adjust all call sites accordingly.
In the future, this argument will be the regular command arguments and
a second argument will be the redirection operations.
Signed-off-by: Johannes Sixt <j6t@kdbg.org> Signed-off-by: Taylor Blau <me@ttaylorr.com>
Mark Levedahl [Fri, 11 Apr 2025 14:58:20 +0000 (10:58 -0400)]
git-gui: override exec and open only on Windows
Since aae9560a355d (Work around Tcl's default `PATH` lookup,
2022-11-23), git-gui overrides exec and open on all platforms. But,
this was done in response to Tcl adding elements to $PATH on Windows,
while exec, open, and auto_execok honor $PATH as given on all other
platforms.
Let's do the override only on Windows, restoring others to using their
native exec and open. These honor the sanitized $PATH as that is written
out to env(PATH) in a previous commit. auto_execok is also safe on these
platforms, so can be used for _which.
Signed-off-by: Mark Levedahl <mlevedahl@gmail.com> Signed-off-by: Johannes Sixt <j6t@kdbg.org> Signed-off-by: Taylor Blau <me@ttaylorr.com>
The previous commit bb5cb23daf75 (gitk: prevent overly long command
lines, 2023-01-24) rewrote a set of the 'open' calls substantially.
These were then later updated by 7dd272eca153 (gitk: escape file paths
before piping to git log, 2023-01-24) and d5d1b91e5327 (gitk: encode
arguments correctly with "open", 2025-03-07). In the preceding merge,
the conversions to a safe_open variant were undone to ensure that the
principal operation of the new 'open' calls is not modified by accident.
Since the 'open' calls now pass a redirection from a Tcl string as
stdin, convert the calls to 'safe_open_command_redirect'.
Signed-off-by: Johannes Sixt <j6t@kdbg.org> Signed-off-by: Taylor Blau <me@ttaylorr.com>
Johannes Sixt [Sat, 3 May 2025 17:21:53 +0000 (19:21 +0200)]
git-gui: use git_read in githook_read
0730a5a3a5e6 ("git-gui - use git-hook, honor core.hooksPath", 2023-09-17)
rewrote githook_read to use `git hook` to run a hook script. The code
that was replaced discovered the hook script file manually and invoked
it using function _open_stdout_stderr. After the rewrite, this function
is still invoked, but it calls into `git` instead of the hook scripts.
Notice though, that we have function git_read that invokes git and
prepares a pipe for the caller to read from. Replace the implementation
of githook_read to be just a wrapper around git_read. This unifies the
way in which the git executable is invoked. git_read ultimately also
calls into _open_stdout_stderr, but it modifies the path to the git
executable before doing so.
Signed-off-by: Johannes Sixt <j6t@kdbg.org> Signed-off-by: Taylor Blau <me@ttaylorr.com>
Mark Levedahl [Fri, 11 Apr 2025 14:47:04 +0000 (10:47 -0400)]
git-gui: sanitize $PATH on all platforms
Since 8f23432b38d9 (windows: ignore empty `PATH` elements, 2022-11-23),
git-gui removes empty elements from $PATH, and a prior commit made this
remove all non-absolute elements from $PATH. But, this happens only on
Windows. Unsafe elements in $PATH are possible on all platforms.
Let's sanitize $PATH on all platforms to have consistent behavior. If a
user really wants the current repository on $PATH, they can add its
absolute name to $PATH.
Signed-off-by: Mark Levedahl <mlevedahl@gmail.com> Signed-off-by: Johannes Sixt <j6t@kdbg.org> Signed-off-by: Taylor Blau <me@ttaylorr.com>
Johannes Sixt [Sat, 3 May 2025 11:11:21 +0000 (13:11 +0200)]
git-gui: break out a separate function git_read_nice
There are two callers of git_read that request special treatment using
option --nice. Rewrite them to call a new function git_read_nice that
does the special treatment. Now we can remove all option treatment from
git_read.
git_write has the same capability, but there are no callers that
request --nice. Remove the feature without substitution.
This is a preparation for a later change where we want to make git_read
and friends non-variadic. Then it cannot have optional arguments.
Signed-off-by: Johannes Sixt <j6t@kdbg.org> Signed-off-by: Taylor Blau <me@ttaylorr.com>
Mark Levedahl [Fri, 11 Apr 2025 14:08:52 +0000 (10:08 -0400)]
git-gui: assure PATH has only absolute elements.
Since 8f23432b38d9 (windows: ignore empty `PATH` elements, 2022-11-23),
git-gui excises all empty paths from $PATH, but still allows '.' or
other relative paths, which can also allow executing code from the
repository. Let's remove anything except absolute elements. While here,
let's remove duplicated elements, which are very common on Windows:
only the first such item can do anything except waste time repeating a
search.
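The filtering described above can be sketched in Python (a model only;
git-gui does this in Tcl, and the function name sanitize_path is made up
for this illustration):

```python
import os

def sanitize_path(path_value, sep=os.pathsep):
    """Keep only absolute PATH elements and drop duplicates.

    The first occurrence of a directory wins, so lookup order is kept.
    """
    seen = set()
    kept = []
    for element in path_value.split(sep):
        if not os.path.isabs(element):
            # Drops empty elements, '.', and any other relative path.
            continue
        if element in seen:
            # A repeated directory would only repeat the same search.
            continue
        seen.add(element)
        kept.append(element)
    return sep.join(kept)

print(sanitize_path(os.environ.get('PATH', '')))
```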
Signed-off-by: Mark Levedahl <mlevedahl@gmail.com> Signed-off-by: Johannes Sixt <j6t@kdbg.org> Signed-off-by: Taylor Blau <me@ttaylorr.com>
Johannes Sixt [Sat, 3 May 2025 09:52:35 +0000 (11:52 +0200)]
git-gui: remove option --stderr from git_read
Some callers of git_read want to redirect stderr of the invoked command
to stdout. The function offers option --stderr for this purpose.
However, the option only appends 2>@1 to the commands. The callers can
do that themselves. In lib/console.tcl we even have a caller that
already knew implicitly what --stderr does behind the scenes.
This is a preparation for a later change where we want to make git_read
non-variadic. Then it cannot have optional leading arguments.
Signed-off-by: Johannes Sixt <j6t@kdbg.org> Signed-off-by: Taylor Blau <me@ttaylorr.com>
Mark Levedahl [Mon, 7 Apr 2025 21:12:56 +0000 (17:12 -0400)]
git-gui: cleanup git-bash menu item
git-gui on Git for Windows creates a menu item to start a git-bash
session for the current repository. This menu-item works as desired when
git-gui is installed in the Git for Windows (g4w) distribution, but
not when run from a different location such as normally done in
development. The reason is that git-bash's location is known to be
'/git-bash' in the Unix pathname space known to MSYS, but this is not
known in the Windows pathname space. Instead, git-gui derives a pathname
for git-bash assuming it is at a known relative location.
If git-gui is run from a different directory than assumed in g4w, the
relative location changes, and git-gui resorts to running a generic bash
login session in a Windows console.
But, the MSYS system underlying Git for Windows includes the 'cygpath'
utility to convert between Unix and Windows pathnames. Let's use this so
git-bash's Windows pathname is determined directly from /git-bash.
Signed-off-by: Mark Levedahl <mlevedahl@gmail.com> Signed-off-by: Johannes Sixt <j6t@kdbg.org> Signed-off-by: Taylor Blau <me@ttaylorr.com>
Johannes Sixt [Sat, 26 Apr 2025 16:46:06 +0000 (18:46 +0200)]
git-gui: sanitize 'exec' arguments: background
As in the previous commits, introduce a function that sanitizes
arguments intended for the process, but runs the process in the
background. Convert 'exec' calls to use this new function.
Signed-off-by: Johannes Sixt <j6t@kdbg.org> Signed-off-by: Taylor Blau <me@ttaylorr.com>
Mark Levedahl [Thu, 3 Apr 2025 04:37:08 +0000 (00:37 -0400)]
git-gui: avoid auto_execok in do_windows_shortcut
git-gui on Windows uses auto_execok to locate git-gui.exe,
which performs the same flawed search as does the builtin exec.
Use _which instead, performing a safe PATH lookup.
Signed-off-by: Mark Levedahl <mlevedahl@gmail.com> Signed-off-by: Johannes Sixt <j6t@kdbg.org> Signed-off-by: Taylor Blau <me@ttaylorr.com>
Johannes Sixt [Mon, 21 Apr 2025 16:14:54 +0000 (18:14 +0200)]
git-gui: sanitize 'exec' arguments: simple cases
Tcl 'exec' assigns special meaning to its arguments when they begin with
a redirection, pipe or background operator. There are a number of
invocations of 'exec' which construct arguments that are taken from the
Git repository or a user input. However, when file names or ref names
are taken from the repository, it is possible to find names that have
these special forms. They must not be interpreted by 'exec' lest it
redirects input or output, or attempts to build a pipeline using a
command name controlled by the repository.
Introduce a helper function that identifies such arguments and prepends
"./" to force such a name to be regarded as a relative file name.
Convert those 'exec' calls where the arguments can simply be packed
into a list.
Note that most commands containing the word 'exec' route through
console::exec or console::chain, which we will treat in another commit.
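A minimal sketch of such a helper, written in Python rather than Tcl.
The set of prefixes is an assumption derived from the operators that
Tcl's 'exec' documents (<, >, |, & and the 2> family); the real helper
may test for them differently:

```python
import re

# Assumed prefixes: every 'exec' operator starts with one of
# '<', '>', '|', '&', or with '2>'.
SPECIAL_PREFIX = re.compile(r'^(?:[<>|&]|2>)')

def make_arg_safe(arg):
    """Prepend './' so the word can only be read as a relative file name."""
    if SPECIAL_PREFIX.match(arg):
        return './' + arg
    return arg

def make_arglist_safe(args):
    """Apply make_arg_safe to every word of a command list."""
    return [make_arg_safe(a) for a in args]

# A ref or file name chosen by the repository cannot start a pipeline.
print(make_arglist_safe(['show', '|evil', '>out', '2>err', 'main']))
```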
Signed-off-by: Johannes Sixt <j6t@kdbg.org> Signed-off-by: Taylor Blau <me@ttaylorr.com>
Mark Levedahl [Wed, 2 Apr 2025 21:37:27 +0000 (17:37 -0400)]
git-gui: avoid auto_execok for git-bash menu item
On Windows, git-gui offers to open a git-bash session for the current
repository from the menu, but uses [auto_execok start] to get the
command to actually run that shell.
The code for auto_execok, in /usr/share/tcl8.6/init.tcl, has 'start' in
the 'shellBuiltins' list for cmd.exe on Windows: as a result,
auto_execok does not actually search for start, meaning this usage is
technically ok with auto_execok now. However, leaving this use of
auto_execok in place will just induce confusion about why a known unsafe
function is being used on Windows. Instead, let's switch to using our
known safe _which function that looks only in $PATH, excluding the
current working directory.
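For illustration, a lookup that consults only the absolute directories
listed in $PATH might look as follows in Python. This is a sketch under
that assumption, not the Tcl code of _which, and the name safe_which is
hypothetical:

```python
import os
import sys

def safe_which(program, path_value, sep=os.pathsep):
    """Return the first executable `program` found in the absolute
    directories of `path_value`, or None.

    The current directory is never searched unless it is listed by its
    absolute name.
    """
    for directory in path_value.split(sep):
        if not os.path.isabs(directory):
            continue
        candidate = os.path.join(directory, program)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None

# Usage: find the running interpreter by searching only its directory.
interp_dir, interp_name = os.path.split(sys.executable)
print(safe_which(interp_name, interp_dir))
```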
Signed-off-by: Mark Levedahl <mlevedahl@gmail.com> Signed-off-by: Johannes Sixt <j6t@kdbg.org> Signed-off-by: Taylor Blau <me@ttaylorr.com>
Johannes Sixt [Mon, 21 Apr 2025 15:07:10 +0000 (17:07 +0200)]
git-gui: treat file names beginning with "|" as relative paths
The Tcl 'open' function has a very wide interface. It can open files as
well as pipes to external processes. The difference is made only by the
first character of the file name: if it is "|", a process is spawned.
We have a number of calls of Tcl 'open' that take a file name from the
environment in which Git GUI is running. Be prepared that insane values
are injected. In particular, when we intend to open a file, do not take
a file name that happens to begin with "|" as a request to run a process.
Signed-off-by: Johannes Sixt <j6t@kdbg.org> Signed-off-by: Taylor Blau <me@ttaylorr.com>
Mark Levedahl [Fri, 4 Apr 2025 20:55:59 +0000 (16:55 -0400)]
git-gui: remove unused proc is_shellscript
Commit 7d076d56757c (git-gui: handle shell script text filters when
loading for blame, 2011-12-09) added is_shellscript to test if a file
is executable by the shell, used only when searching for textconv
filters. The previous commit rearranged the tests for finding such
filters, and removed the only user of is_shellscript. Remove this
function.
Signed-off-by: Mark Levedahl <mlevedahl@gmail.com> Signed-off-by: Johannes Sixt <j6t@kdbg.org> Signed-off-by: Taylor Blau <me@ttaylorr.com>
Johannes Sixt [Sat, 3 May 2025 11:37:35 +0000 (13:37 +0200)]
git-gui: remove git config --list handling for git < 1.5.3
git-gui uses `git config --null --list` to parse configuration. Git
versions prior to 1.5.3 do not have --null and need different treatment.
Nobody should be using such an old version anymore. (Moreover, since
0730a5a3a, git-gui requires git v2.36 or later). Keep only the code for
modern Git.
Signed-off-by: Johannes Sixt <j6t@kdbg.org> Signed-off-by: Taylor Blau <me@ttaylorr.com>
Johannes Sixt [Sun, 18 May 2025 14:08:06 +0000 (16:08 +0200)]
git-gui: remove special treatment of Windows from open_cmd_pipe
Commit 7d076d56757c (git-gui: handle shell script text filters when
loading for blame, 2011-12-09) added open_cmd_pipe to run text
conversion in support of blame, with special handling for shell
scripts on Windows. To determine whether the command is a shell
script, 'lindex' is used to pick off the first token from the command.
However, cmd is actually a command string taken from .gitconfig
literally and is not necessarily a syntactically correct Tcl list.
Hence, it cannot be processed by 'lindex' and 'lrange' reliably.
Pass the command string to the shell just like on non-Windows
platforms to avoid the potentially incorrect treatment.
A use of 'auto_execok' is removed by this change. This function is
dangerous on Windows, because it searches programs in the current
directory. Delegating the path lookup to the shell is safe, because
/bin/sh and /bin/bash follow POSIX on all platforms, including the
Git for Windows port.
A possible regression is that the old code, given filter command of
'foo', could find 'foo.bat' as a script, and not just bare 'foo', or
'foo.exe'. This rewrite requires explicitly giving the suffix if it is
not .exe.
This part of Git GUI can be exercised using
    git gui blame -- some.file
while some.file has a textconv filter configured and has unstaged
modifications.
Helped-by: Mark Levedahl <mlevedahl@gmail.com> Signed-off-by: Johannes Sixt <j6t@kdbg.org> Signed-off-by: Taylor Blau <me@ttaylorr.com>
Mark Levedahl [Fri, 2 May 2025 15:39:55 +0000 (11:39 -0400)]
git-gui: remove HEAD detachment implementation for git < 1.5.3
git-gui provides an implementation to detach HEAD on Git versions prior
to 1.5.3. Nobody should be using such an old version anymore.
(Moreover, since 0730a5a3a, git-gui requires git v2.36 or later).
Keep only the code for modern Git.
Signed-off-by: Mark Levedahl <mlevedahl@gmail.com>
[j6t: message tweaked] Signed-off-by: Johannes Sixt <j6t@kdbg.org> Signed-off-by: Taylor Blau <me@ttaylorr.com>
Mark Levedahl [Sun, 6 Apr 2025 22:20:14 +0000 (18:20 -0400)]
git-gui: use only the configured shell
git-gui has a few places where a bare "sh" is passed to exec, meaning
that the first instance of "sh" on $PATH will be used rather than the
shell configured. This violates expectations that the configured shell
is being used. Let's use [shellpath] everywhere.
Signed-off-by: Mark Levedahl <mlevedahl@gmail.com> Signed-off-by: Johannes Sixt <j6t@kdbg.org> Signed-off-by: Taylor Blau <me@ttaylorr.com>
Mark Levedahl [Wed, 20 Sep 2023 21:56:14 +0000 (17:56 -0400)]
git-gui: remove Tcl 8.4 workaround on 2>@1 redirection
Since b792230 ("git-gui: Show a progress meter for checking out files",
2007-07-08), git-gui includes a workaround for Tcl that does not support
using 2>@1 to redirect stderr to stdout. Tcl added such support in
8.4.7, released in 2004, and this is fully supported in all 8.5
releases.
As git-gui has a hard-coded requirement for Tcl >= 8.5, the workaround
is no longer needed. Delete it.
Signed-off-by: Mark Levedahl <mlevedahl@gmail.com> Signed-off-by: Johannes Sixt <j6t@kdbg.org> Signed-off-by: Taylor Blau <me@ttaylorr.com>
Mark Levedahl [Tue, 1 Apr 2025 15:45:06 +0000 (11:45 -0400)]
git-gui: make _shellpath usable on startup
Since commit d5257fb3c1de (git-gui: handle textconv filter on
Windows and in development, 2010-08-07), git-gui will search for a
usable shell if _shellpath is not configured, and on Windows may
resort to using auto_execok to find 'sh'. While this was intended for
development use, checks are insufficient to assure a proper
configuration when deployed where _shellpath is always set, but might
not give a usable shell.
Let's make this more robust by only searching if _shellpath was not
defined, and then using only our restricted search functions.
Furthermore, we should convert to a Windows path on Windows. Always
check for a valid shell on startup, meaning an absolute path to an
executable, aborting if these conditions are not met.
Signed-off-by: Mark Levedahl <mlevedahl@gmail.com> Signed-off-by: Johannes Sixt <j6t@kdbg.org> Signed-off-by: Taylor Blau <me@ttaylorr.com>
Mark Levedahl [Wed, 2 Apr 2025 15:23:03 +0000 (11:23 -0400)]
git-gui: use [is_Windows], not bad _shellpath
Commit 7d076d56757c (git-gui: handle shell script text filters when
loading for blame, 2011-12-09) added open_cmd_pipe, with special
handling for Windows detected by seeing that _shellpath does not
point to an executable shell. That is bad practice, and is broken by
the next commit that assures _shellpath is valid on all platforms.
Fix this by using [is_Windows] as done for all Windows specific code.
Signed-off-by: Mark Levedahl <mlevedahl@gmail.com> Signed-off-by: Johannes Sixt <j6t@kdbg.org> Signed-off-by: Taylor Blau <me@ttaylorr.com>
Mark Levedahl [Thu, 3 Apr 2025 14:26:21 +0000 (10:26 -0400)]
git-gui: _which, only add .exe suffix if not present
The _which function finds executables on $PATH, and adds .exe on Windows
unless -script was given. However, win32.tcl executes "wscript.exe"
and "cscript.exe", both of which fail as _which adds .exe to both. This
is already fixed in git-gui released by Git for Windows. Do so here.
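The intended rule can be modelled in a few lines of Python. The name
exe_name and the case-insensitive comparison are assumptions made for
this sketch, not details taken from git-gui:

```python
def exe_name(program, is_script=False):
    """Name to look up on Windows (hypothetical helper).

    '.exe' is appended only when the caller did not ask for a script
    and the suffix is not already there.
    """
    if is_script or program.lower().endswith('.exe'):
        return program
    return program + '.exe'

for name in ('git', 'wscript.exe', 'cscript.exe'):
    print(name, '->', exe_name(name))
```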
Signed-off-by: Mark Levedahl <mlevedahl@gmail.com> Signed-off-by: Johannes Sixt <j6t@kdbg.org> Signed-off-by: Taylor Blau <me@ttaylorr.com>
Taylor Blau [Fri, 23 May 2025 21:04:21 +0000 (17:04 -0400)]
Merge branch 'js/fix-open-exec-2.40.0' into js/fix-open-exec
Branch js/fix-open-exec-2.40.0 converts `open` and `exec` calls to call
wrappers that sanitize the command arguments. This side branch updates
three `open` calls that are in conflict with the fix in the preceding
commit. To keep the intended operation of the 'open' calls, this merge
does not try to merge and resolve the conflicts, but ignores the
conversions that are brought in by the side branch, taking "ours" side
of the code in these three cases.
New fixes are the topic of the next commit.
Signed-off-by: Johannes Sixt <j6t@kdbg.org> Signed-off-by: Taylor Blau <me@ttaylorr.com>
While "exec" uses a normal arguments list which is applied as
command + arguments (and redirections, etc), "open" uses a single
argument which is this command+arguments, where the command and
arguments are a list inside this one argument to "open".
Commit bb5cb23 (gitk: prevent overly long command lines 2023-05-08)
changed several values from individual arguments in that list (hashes
and file names), to a single value which is fed to git via redirection
to its stdin using "open" [1].
However, it didn't ensure correctly that this aggregate value in this
string is interpreted as a single element in this command+args list.
It did just enough so that newlines (which is how these elements are
concatenated) don't split this single list element.
A followup commit in the same patch set: 7dd272e (gitk: escape file
paths before piping to git log 2023-05-08) added a bit more, by
escaping backslashes and spaces in the file names, so that at least
it doesn't break when such file names get used there.
But these are not enough. At the very least tab is missing, and more,
and trying to manually escape every possible thing which can affect
how this string is interpreted in a list is a sub-par approach.
The solution is simply to tell Tcl "this is a single list element",
which we can do by aggregating this value completely normally (hashes
and files separated by newlines), and then do [list $value].
So this is what this commit does, for all 3 places where bb5cb23
changed individual elements into an aggregate value.
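The difference between escaping selected characters by hand and asking
the language to quote a whole value can be shown by analogy in Python.
Note that this uses POSIX shell word splitting via shlex, not Tcl list
parsing, so it only mirrors the idea; manual_escape is a made-up
stand-in for escaping just backslashes and spaces:

```python
import shlex

def manual_escape(value):
    # Escapes only backslashes and spaces; a tab is left alone.
    return value.replace('\\', '\\\\').replace(' ', '\\ ')

name = 'a\tb c'

# The tab still splits the value into two words.
print(shlex.split('git log ' + manual_escape(name)))

# Quoting the whole value keeps it as a single word.
print(shlex.split('git log ' + shlex.quote(name)))
```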
[1]
That was not a fully accurate description. The accurate version
is that this string originally included two lists: hashes and files.
When used with "open" these lists correctly become the individual
elements of these lists, even if they contain spaces etc, so the
arguments which were used in these "git" commands were correct.
Commit bb5cb23 couldn't use these two lists as-is, because it needed
to process the individual elements in them (one element per line of
the aggregate value), and the issue is that ensuring this aggregate
is indeed interpreted as a single list element was sub-par.
Note: all the (double) quotes before/after the modification are not
required and with zero effect, even for \n. But this commit preserves
the original quoting form intentionally. It can be cleaned up later.
Signed-off-by: Avi Halachmi (:avih) <avihpit@yahoo.com> Signed-off-by: Johannes Sixt <j6t@kdbg.org> Signed-off-by: Taylor Blau <me@ttaylorr.com>
Johannes Sixt [Sun, 23 Mar 2025 21:34:11 +0000 (22:34 +0100)]
gitk: collect construction of blameargs into a single conditional
The command line to invoke 'git blame' for a single line is constructed
using several if-conditionals, each with the same condition
{$from_index ne {}}. Merge all of them into a single conditional.
This requires duplicating significant parts of the command, but it
helps the next change, where we will have to deal with a nested list
structure.
Signed-off-by: Johannes Sixt <j6t@kdbg.org> Signed-off-by: Taylor Blau <me@ttaylorr.com>
Johannes Sixt [Thu, 20 Mar 2025 19:00:57 +0000 (20:00 +0100)]
gitk: sanitize 'open' arguments: simple commands with redirections
As in the previous commits, introduce a function that sanitizes
arguments intended for the process and in addition allows passing
redirections, which are passed to Tcl's 'open' verbatim.
Signed-off-by: Johannes Sixt <j6t@kdbg.org> Signed-off-by: Taylor Blau <me@ttaylorr.com>
Johannes Sixt [Thu, 20 Mar 2025 18:32:56 +0000 (19:32 +0100)]
gitk: sanitize 'open' arguments: simple commands
Tcl 'open' treats the second argument as a command when it begins
with |. The remainder of the argument is a list comprising the command
and its arguments. It assigns special meaning to these arguments when
they begin with a redirection, pipe or background operator. There are a
number of invocations of 'open' which construct arguments that are
taken from the Git repository or a user input. However, when file names
or ref names are taken from the repository, it is possible to find
names which have these special forms. They must not be interpreted by
'open' lest it redirects input or output, or attempts to build a
pipeline using a command name controlled by the repository.
Introduce a helper function that identifies such arguments and prepends
"./" to force such a name to be regarded as a relative file name.
Signed-off-by: Johannes Sixt <j6t@kdbg.org> Signed-off-by: Taylor Blau <me@ttaylorr.com>
Johannes Sixt [Sat, 29 Mar 2025 16:35:19 +0000 (17:35 +0100)]
gitk: sanitize 'exec' arguments: redirect to process
Convert one 'exec' call that sends output to a process (pipeline).
Fortunately, the command does not contain any variables. For this
reason, just treat it as a "redirection".
Signed-off-by: Johannes Sixt <j6t@kdbg.org> Signed-off-by: Taylor Blau <me@ttaylorr.com>
Johannes Sixt [Sat, 29 Mar 2025 16:21:27 +0000 (17:21 +0100)]
gitk: sanitize 'exec' arguments: redirections and background
Convert 'exec' calls that both redirect output to a file and run the
process in the background. 'safe_exec_redirect' can take both these
"redirections" in the second argument simultaneously.
Signed-off-by: Johannes Sixt <j6t@kdbg.org> Signed-off-by: Taylor Blau <me@ttaylorr.com>
Johannes Sixt [Sat, 29 Mar 2025 16:01:54 +0000 (17:01 +0100)]
gitk: sanitize 'exec' arguments: redirections
As in the previous commits, introduce a function that sanitizes
arguments intended for the process and in addition allows passing
redirections verbatim, which are interpreted by Tcl's 'exec'.
Redirections can include the background operator '&'.
Signed-off-by: Johannes Sixt <j6t@kdbg.org> Signed-off-by: Taylor Blau <me@ttaylorr.com>
Johannes Sixt [Sat, 29 Mar 2025 15:51:29 +0000 (16:51 +0100)]
gitk: sanitize 'exec' arguments: 'eval exec'
Convert calls of 'exec' where the arguments are already available in
a list and 'eval' is used to unpack the list. Use 'concat' to unite
the arguments into a single list before passing them to 'safe_exec'.
Signed-off-by: Johannes Sixt <j6t@kdbg.org> Signed-off-by: Taylor Blau <me@ttaylorr.com>
Johannes Sixt [Mon, 17 Mar 2025 21:59:27 +0000 (22:59 +0100)]
gitk: sanitize 'exec' arguments: simple cases
Tcl 'exec' assigns special meaning to its arguments when they begin with
a redirection, pipe or background operator. There are a number of
invocations of 'exec' which construct arguments that are taken from the
Git repository or a user input. However, when file names or ref names
are taken from the repository, it is possible to find names which have
these special forms. They must not be interpreted by 'exec' lest it
redirects input or output, or attempts to build a pipeline using a
command name controlled by the repository.
Introduce a helper function that identifies such arguments and prepends
"./" to force such a name to be regarded as a relative file name.
Convert those 'exec' calls where the arguments can simply be packed
into a list.
Signed-off-by: Johannes Sixt <j6t@kdbg.org> Signed-off-by: Taylor Blau <me@ttaylorr.com>
Johannes Sixt [Mon, 17 Mar 2025 20:39:58 +0000 (21:39 +0100)]
gitk: have callers of diffcmd supply pipe symbol when necessary
Function 'diffcmd' derives which of git diff-files, git diff-index, or
git diff-tree must be invoked depending on the ids provided. It puts
the pipe symbol as the first element of the returned command list.
Note though that of the four callers only two use the command with
Tcl 'open' and need the pipe symbol. The other two callers pass the
command to Tcl 'exec' and must remove the pipe symbol.
Do not include the pipe symbol in the constructed command list, but let
the call sites decide whether to add it or not. Note that Tcl 'open'
inspects only the first character of the command list, which is also
the first character of the first element in the list. For this reason,
it is valid to just tack on the pipe symbol with |$cmd and it is not
necessary to use [concat | $cmd].
Signed-off-by: Johannes Sixt <j6t@kdbg.org> Signed-off-by: Taylor Blau <me@ttaylorr.com>
Johannes Sixt [Mon, 17 Mar 2025 19:36:04 +0000 (20:36 +0100)]
gitk: treat file names beginning with "|" as relative paths
The Tcl 'open' function has a very wide interface. It can open files as
well as pipes to external processes. The difference is made only by the
first character of the file name: if it is "|", a process is spawned.
We have a number of calls of Tcl 'open' that take a file name from the
environment in which Gitk is running. Be prepared that insane values are
injected. In particular, when we intend to open a file, do not mistake
a file name that happens to begin with "|" as a request to run a process.
Signed-off-by: Johannes Sixt <j6t@kdbg.org> Signed-off-by: Taylor Blau <me@ttaylorr.com>
Phillip Wood [Thu, 22 May 2025 15:55:23 +0000 (16:55 +0100)]
midx docs: clarify tie breaking
Clarify what happens when an object exists in more than one pack, but
not in the preferred pack. "git multi-pack-index repack" relies on ties
for objects that are not in the preferred pack being resolved in favor
of the newest pack that contains a copy of the object. If ties were
resolved in favor of the oldest pack as the current documentation
suggests the multi-pack index would not reference any of the objects in
the pack created by "git multi-pack-index repack".
Helped-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Phillip Wood <phillip.wood@dunelm.org.uk> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Phillip Wood [Thu, 22 May 2025 15:55:22 +0000 (16:55 +0100)]
midx: avoid negative array index
nth_midxed_pack_int_id() returns the index of the pack file in the multi
pack index's list of packfiles that contains the specified object. The index is
returned as a uint32_t. Storing this in an int will make the index
negative if the most significant bit is set. Fix this by using uint32_t
as the rest of the code does. This is unlikely to be a practical problem
as it requires the multipack index to reference 2^31 packfiles.
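A minimal sketch of the sign problem, assuming a 32-bit two's-complement
'int' (the usual case). The function names are made up for the
illustration and do not appear in midx.c.

```c
#include <stdint.h>

/*
 * Illustration only: what a pack index looks like once it has been
 * stored in a signed int. The conversion is implementation-defined in
 * C; on common targets a value with the top bit set becomes negative.
 */
int index_stored_as_int(uint32_t pack_int_id)
{
	return (int)pack_int_id;
}

/* Keeping the value unsigned preserves it over the full 32-bit range. */
uint32_t index_stored_as_uint32(uint32_t pack_int_id)
{
	return pack_int_id;
}
```

With an input of 2^31 the first function yields a negative number, which
would be an invalid array index, while the second returns 2^31 unchanged.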
Signed-off-by: Phillip Wood <phillip.wood@dunelm.org.uk> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Phillip Wood [Thu, 22 May 2025 15:55:21 +0000 (16:55 +0100)]
midx repack: avoid potential integer overflow on 64 bit systems
On a 64 bit system the calculation
p->pack_size * pack_info[i].referenced_objects
could overflow. If a pack file contains 2^28 objects with an average
compressed size of 1KB then the pack size will be 2^38B. If all of the
objects are referenced by the multi-pack index the product above will
overflow. Avoid this by using shifted integer arithmetic and changing
the order of the calculation so that the pack size is divided by the
total number of objects in the pack before multiplying by the number of
objects referenced by the multi-pack index. Using a shift of 14 bits
should give reasonable accuracy while avoiding overflow for pack sizes
less than 1PB.
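The reordered calculation can be sketched like this. The function and
parameter names are illustrative assumptions, not the identifiers used
in the patch; only the 14-bit shift and the divide-before-multiply
order come from the description above.

```c
#include <stdint.h>

#define SIZE_SHIFT 14 /* number of fractional bits, as named above */

/*
 * Estimate how many bytes of a pack are referenced by the index.
 * The pack size is scaled up by 2^14 and divided by the object count
 * first, so the later multiplication stays below pack_size << 14.
 * 'num_objects' must be non-zero and 'pack_size' below 2^50 (1PB).
 */
uint64_t estimate_referenced_size(uint64_t pack_size,
				  uint32_t num_objects,
				  uint32_t referenced_objects)
{
	uint64_t per_object = (pack_size << SIZE_SHIFT) / num_objects;

	return (per_object * referenced_objects) >> SIZE_SHIFT;
}
```

For the example in the message (a 2^38 byte pack with 2^28 objects, all
referenced) the intermediate values peak at 2^52, well inside 64 bits.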
Signed-off-by: Phillip Wood <phillip.wood@dunelm.org.uk> Signed-off-by: Junio C Hamano <gitster@pobox.com>
The calculation of the estimated size of the objects in the pack referenced
by the multi-pack-index uses st_mult() to multiply the pack size by the
number of referenced objects before dividing by the total number of
objects in the pack. As size_t is 32 bits on 32 bit systems this
calculation easily overflows. Fix this by using 64-bit arithmetic instead.
Also fix a potential overflow when calculating the total size of the
objects referenced by the multipack index with a batch size larger
than SIZE_MAX / 2. In that case
total_size += estimated_size
can overflow as both total_size and estimated_size can be greater than
SIZE_MAX / 2. This is addressed by using saturating arithmetic for the
addition. Although estimated_size is of type uint64_t by the time we
reach this sum it is bounded by the batch size which is of type size_t
and so casting estimated_size to size_t does not truncate the value.
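Saturating addition on size_t can be sketched as below. The helper name
is an assumption made for this example; the patch may spell the
operation differently.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Add two sizes, clamping the result to SIZE_MAX instead of wrapping
 * around. The check is done before the addition so that no overflow
 * ever takes place.
 */
size_t saturating_add(size_t a, size_t b)
{
	return (b > SIZE_MAX - a) ? SIZE_MAX : a + b;
}
```

When both operands exceed SIZE_MAX / 2 the result is SIZE_MAX, which
still compares as "at least the batch size" in a later check.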
Signed-off-by: Phillip Wood <phillip.wood@dunelm.org.uk> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Jacob Keller [Wed, 21 May 2025 23:29:17 +0000 (16:29 -0700)]
diff --no-index: support limiting by pathspec
The --no-index option of git-diff enables using the diff machinery from
git while operating outside of a repository. This mode of git diff is
able to compare directories and produce a diff of their contents.
When operating git diff in a repository, git has the notion of
"pathspecs" which can specify which files to compare. In particular,
when using git to diff two trees, you might invoke:
$ git diff-tree -r <treeish1> <treeish2>
where the treeish could point to a subdirectory of the repository.
When invoked this way, users can limit the selected paths of the tree by
using a pathspec, either by providing a list of paths to accept or by
removing paths via a negative pathspec.
The git diff --no-index mode does not support pathspecs, and cannot
limit the diff output in this way. Other diff programs such as GNU
diffutils have options for excluding paths based on a pattern match.
However, using git diff as a diff replacement has several advantages
over many popular diff tools, including coloring of moved lines, rename
detection, and similar.
Teach git diff --no-index how to handle pathspecs to limit the
comparisons. This will only be supported if both provided paths are
directories.
For comparisons where one path isn't a directory, the --no-index mode
already has some DWIM shortcuts implemented in the fixup_paths()
function.
Modify the fixup_paths function to return 1 if both paths are
directories. If this is the case, interpret any extra arguments to git
diff as pathspecs via parse_pathspec.
Use parse_pathspec to load the remaining arguments (if any) to git diff
--no-index as pathspec items. Disable PATHSPEC_ATTR support since we do
not have a repository to do attribute lookup. Disable PATHSPEC_FROMTOP
since we do not have a repository root. All pathspecs are treated as
rooted at the provided comparison paths.
After loading the pathspec data, calculate skip offsets for skipping
past the root portion of the paths. This is required to ensure that
pathspecs start matching from the provided path, rather than matching
from the absolute path. We could instead pass the paths as prefix values
to parse_pathspec. This is slightly problematic because the paths come
from the command line and don't necessarily have the proper trailing
slash. Additionally, that would require parsing pathspecs multiple
times.
Pass the pathspec object and the skip offsets into queue_diff, which
in turn must pass them along to read_directory_contents.
Modify read_directory_contents to check against the pathspecs when
scanning the directory. Use the skip offset to skip past the initial
root of the path, and only match against portions that are below the
intended directory structure being compared.
The search algorithm for finding paths is recursive with read_dir. To
make pathspec matching work properly, we must set both
DO_MATCH_DIRECTORY and DO_MATCH_LEADING_PATHSPEC.
Without DO_MATCH_DIRECTORY, paths like "a/b/c/d" will not match against
pathspecs like "a/b/c". This is usually achieved by setting the is_dir
parameter of match_pathspec.
Without DO_MATCH_LEADING_PATHSPEC, paths like "a/b/c" would not match
against pathspecs like "a/b/c/d". This is crucial because we recursively
iterate down the directories. We could simply avoid checking pathspecs
at subdirectories, but this would force recursion down directories
which would simply be skipped.
If we always passed DO_MATCH_LEADING_PATHSPEC, then we would
incorrectly match in certain cases such as matching 'a/c' against
':(glob)**/d'. The match logic will see that a matches the leading part
of the **/ and accept this even though c doesn't match.
To avoid this, use the match_leading_pathspec() variant recently
introduced. This sets both flags when is_dir is set, but leaves them
both cleared when is_dir is 0.
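The "leading" rule can be sketched with a toy matcher for literal
pathspecs only. This is not Git's match_leading_pathspec(); the
function names are made up, and globs and pathspec magic are ignored
on purpose.

```c
#include <string.h>

/* Return 1 if 'dir' is a strict leading directory of 'pathspec'. */
int leads_to_pathspec(const char *dir, const char *pathspec)
{
	size_t n = strlen(dir);

	return !strncmp(dir, pathspec, n) && pathspec[n] == '/';
}

/*
 * Toy decision made while recursing: an exact match always counts,
 * and a directory additionally counts when the pathspec points
 * somewhere below it, so that the recursion keeps descending.
 * Plain files never get the leading match.
 */
int toy_match(const char *path, int is_dir, const char *pathspec)
{
	if (!strcmp(path, pathspec))
		return 1;
	return is_dir && leads_to_pathspec(path, pathspec);
}
```

Here the directory "a/b/c" matches the pathspec "a/b/c/d", but a file of
the same name does not, which mirrors the is_dir distinction above.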
Add test cases and documentation covering the new functionality. Note
for the documentation I opted not to move the placement of '--' which is
sometimes used to disambiguate arguments. The diff --no-index mode
requires exactly 2 arguments determining what to compare. Any additional
arguments are interpreted as pathspecs and must come afterwards. Use of
'--' would not actually disambiguate anything, since there will never be
ambiguity over which arguments represent paths or pathspecs.
Signed-off-by: Jacob Keller <jacob.keller@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Jacob Keller [Wed, 21 May 2025 23:29:16 +0000 (16:29 -0700)]
pathspec: add flag to indicate operation without repository
A following change will add support for pathspecs to the git diff
--no-index command. This mode of git diff does not load any repository.
Add a new PATHSPEC_NO_REPOSITORY flag indicating that we're parsing
pathspecs without a repository.
Both PATHSPEC_ATTR and PATHSPEC_FROMTOP require a repository to
function. Thus, verify that both of these are set in magic_mask to
ensure they won't be accepted when PATHSPEC_NO_REPOSITORY is set.
Check PATHSPEC_NO_REPOSITORY when warning about paths outside the
directory tree. When the flag is set, do not look for a git repository
when generating the warning message.
Finally, add a BUG in match_pathspec_item if the istate is NULL but the
pathspec has PATHSPEC_ATTR set. Callers which support PATHSPEC_ATTR
should always pass a valid istate, and callers which don't pass a valid
istate should have set PATHSPEC_ATTR in the magic_mask field to disable
support for attribute-based pathspecs.
Signed-off-by: Jacob Keller <jacob.keller@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Jacob Keller [Wed, 21 May 2025 23:29:15 +0000 (16:29 -0700)]
pathspec: add match_leading_pathspec variant
The do_match_pathspec() function has the DO_MATCH_LEADING_PATHSPEC
option to allow pathspecs to match when matching "src" against a
pathspec like "src/path/...". This support is not exposed by
match_pathspec, and the internal flags to do_match_pathspec are not
exposed outside of dir.c.
The upcoming support for pathspecs in git diff --no-index needs the
LEADING matching behavior when iterating down through a directory with
readdir.
We could try to expose the match_pathspec_with_flags to the public API.
However, DO_MATCH_EXCLUDES really shouldn't be public, and it's a bit
weird to only have a few of the flags become public.
Instead, add match_leading_pathspec() as a function which sets both
DO_MATCH_DIRECTORY and DO_MATCH_LEADING_PATHSPEC when is_dir is true.
This will be used in a following change to support pathspec matching in
git diff --no-index.
Signed-off-by: Jacob Keller <jacob.keller@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Alex Mironov [Wed, 21 May 2025 21:29:31 +0000 (21:29 +0000)]
name-hash: don't add sparse directories in threaded lazy init
Ensure that logic added in 5f11669586 (name-hash: don't add directories
to name_hash, 2021-04-12) also applies in multithreaded hashtable init
path.
As per the original single-threaded change above: sparse directory entries
represent a directory that is outside the sparse-checkout definition.
These are not paths to blobs, so should not be added to the name_hash
table. Instead, they should be added to the directory hashtable when
'ignore_case' is true.
Add a condition to avoid placing sparse directories into the name_hash
hashtable. This avoids filling the table with extra entries that will
never be queried.
Signed-off-by: Alex Mironov <alexandrfox@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Karthik Nayak [Tue, 20 May 2025 14:40:12 +0000 (16:40 +0200)]
t: remove unexpected SANITIZE_LEAK variables
As of 1fc7ddf35b (test-lib: unconditionally enable leak checking,
2024-11-20), both the `GIT_TEST_PASSING_SANITIZE_LEAK` and
`TEST_PASSES_SANITIZE_LEAK` variables no longer have any meaning; the
leak checks are enabled by default. However, some newly added tests
include them by mistake. Let's clean this up.
Signed-off-by: Karthik Nayak <karthik.188@gmail.com> Acked-by: Justin Tobler <jltobler@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Justin Tobler [Tue, 20 May 2025 16:32:18 +0000 (11:32 -0500)]
builtin/receive-pack: add option to skip connectivity check
During git-receive-pack(1), connectivity of the object graph is
validated to ensure that the received packfile does not leave the
repository in a broken state. This is done via git-rev-list(1) and
walking the objects, which can be expensive for large repositories.
Generally, this check is critical to avoid an incomplete received
packfile from corrupting a repository. Server operators may have
additional knowledge though around exactly how Git is being used on the
server-side which can be used to facilitate more efficient connectivity
computation of incoming objects.
For example, if it can be ensured that all objects in a repository are
connected and do not depend on any missing objects, the connectivity of
newly written objects can be checked by walking the object graph
containing only the new objects from the updated tips and identifying
the missing objects which represent the boundary between the new objects
and the repository. These boundary objects can be checked in the
canonical repository to ensure the new objects connect as expected and
thus avoid walking the rest of the object graph.
Git itself cannot make the guarantees required for such an optimization
as it is possible for a repository to contain an unreachable object that
references a missing object without the repository being considered
corrupt.
Introduce the --skip-connectivity-check option for git-receive-pack(1)
which bypasses this connectivity check to give more control to the
server-side. Note that without proper server-side validation of newly
received objects handled outside of Git, usage of this option risks
corrupting a repository.
Signed-off-by: Justin Tobler <jltobler@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Justin Tobler [Tue, 20 May 2025 16:32:17 +0000 (11:32 -0500)]
t5410: test receive-pack connectivity check
As part of git-receive-pack(1), the connectivity of objects is checked.
Add a test validating that git-receive-pack(1) fails due to an incoming
packfile that would leave the repository with missing objects. Instead
of creating a new test file, "t5410" is generalized for receive-pack
testing.
Signed-off-by: Justin Tobler <jltobler@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Junio C Hamano [Mon, 19 May 2025 23:02:45 +0000 (16:02 -0700)]
Merge branch 'ag/doc-send-email'
The `send-email` documentation has been updated with OAuth2.0
related examples.
* ag/doc-send-email:
docs: add credential helper for outlook and gmail in OAuth list of helpers
docs: improve send-email documentation
send-mail: improve checks for valid_fqdn
Bundle-URI feature did not use refs recorded in the bundle other
than normal branches as anchoring points to optimize the follow-up
fetch during "git clone"; now it is told to utilize all.
* sc/bundle-uri-use-all-refs-in-bundle:
bundle-uri: add test for bundle-uri clones with tags
bundle-uri: copy all bundle references ino the refs/bundle space
Ramsay Jones [Mon, 19 May 2025 16:25:23 +0000 (17:25 +0100)]
configure.ac: upgrade to a compilation check for sysinfo
Commit f5e3c6c57d ("meson: do a full usage-based compile check for
sysinfo", 2025-04-25) updated the 'sysinfo()' check, as part of the
meson build, due to the failure of the check on Solaris. Prior to
that commit, the meson build only checked the availability of the
'<sys/sysinfo.h>' header file. On Solaris, both the header and the
'sysinfo()' function exist, but are completely unrelated to the same
function on Linux (and cygwin).
Commit 50dec7c566 ("config.mak.uname: add sysinfo() configuration for
cygwin", 2025-04-17) added a similar 'sysinfo()' check to the autoconf
build. This check looked for the 'sysinfo()' function itself, rather
than just the header, but it will fail (incorrectly set HAVE_SYSINFO)
for the same reason.
In order to correctly identify the 'sysinfo()' function we require as
part of 'git-gc' (used in the 'total_ram()' function), we also upgrade
to a compilation check, in a similar way to the meson commit. Note that
since commit c9a51775a3 ("builtin/gc.c: correct RAM calculation when
using sysinfo", 2025-04-17) both the 'totalram' and 'mem_unit' fields
of the 'struct sysinfo' are used, so the new check includes both of
those fields in the compile check.
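A usage-based probe of this kind might look like the following sketch.
It assumes the Linux-style interface from '<sys/sysinfo.h>' and is not
the text of the actual configure.ac check; the function name is
hypothetical.

```c
#include <stdint.h>
#include <sys/sysinfo.h>

/*
 * Compiles only where 'struct sysinfo' has both the 'totalram' and
 * 'mem_unit' fields and sysinfo() takes a pointer to that struct,
 * i.e. the Linux-style interface. Returns 0 if the call fails.
 */
uint64_t total_ram_bytes(void)
{
	struct sysinfo si;

	if (sysinfo(&si))
		return 0;
	return (uint64_t)si.totalram * si.mem_unit;
}
```

On a platform whose 'sysinfo()' is an unrelated function, such as
Solaris, this source is expected to fail to compile, which is the
signal a compilation check relies on.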
Signed-off-by: Ramsay Jones <ramsay@ramsayjones.plus.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Ramsay Jones [Mon, 19 May 2025 16:25:22 +0000 (17:25 +0100)]
meson.build: correct setting of GIT_EXEC_PATH
For the non-'runtime prefix' case, the meson build sets the GIT_EXEC_PATH
build variable to an absolute path equivalent to <prefix>/libexec/git-core.
In comparison, the default make build sets it to a relative path equivalent
to 'libexec/git-core'. Indeed, the make build requires the use of some
means outside of the Makefile (eg. config.mak[.*] or the command-line)
to set GIT_EXEC_PATH to anything other than 'libexec/git-core'.
For example, the make invocation:
$ make gitexecdir=/some/other/bin all install
will build git with GIT_EXEC_PATH set to '/some/other/bin' and install
the 'library' executables to that location. However, without setting the
'gitexecdir' make variable, irrespective of the 'runtime prefix' setting,
the GIT_EXEC_PATH is always set to 'libexec/git-core'.
The meson built-in 'libexecdir' option can be used to provide a similar
configurability. The default value for the option is 'libexec'. Attempting
to set the option to '' on the command-line, will reset it to the '.'
string, presumably to ensure a relative path value.
This commit allows the meson build, similar to the above, to configure the
project like:
so that the GIT_EXEC_PATH is set to '/some/other/bin'. Absent the
-Dlibexecdir argument, the GIT_EXEC_PATH is set to 'libexec/git-core'.
In order to correct the value of GIT_EXEC_PATH, default the value to the
static string value 'libexec/git-core', and only override if the value
of the 'libexecdir' option has a value different to 'libexec' or '.'.
Also, like the Makefile, add a check for an absolute path when the
runtime prefix option is true (and if so, error out).
Signed-off-by: Ramsay Jones <ramsay@ramsayjones.plus.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Ramsay Jones [Mon, 19 May 2025 16:25:21 +0000 (17:25 +0100)]
meson: correct path to system config/attribute files
The paths to the system-wide config and attributes files are not being
set correctly in the meson build. Unless explicitly overridden on the
command line during setup, the 'gitconfig' and 'gitattributes' options
are defaulting to absolute paths in the '/etc' system directory. This
is only appropriate if the <prefix> is set specifically to '/usr'.
The directory in which these files are placed is generally referred to
as the 'system configuration directory' or 'sysconfdir' for short. When
the prefix is '/usr' then the sysconfdir is usually set to '/etc', but
any other value for prefix results in the relative directory value 'etc'
instead. (eg if prefix is '/usr/local', then the 'etc' relative value
results in a system configuration directory of '/usr/local/etc'). When
setting the 'sysconfdir' builtin option value, the meson system uses
exactly this algorithm, so we can use get_option('sysconfdir') directly
when setting the (non-overridden) build variables.
In order to allow for overriding from the command line, remove the
default values specified for the 'gitconfig' and 'gitattributes' options
in the 'meson_options.txt' file. This allows the user to specify any
pathname for those options, while being able to test for the unset
(empty) value. An absolute pathname will be used unchanged and a relative
pathname will be appended to '<prefix>/'. These values are then used to
set the 'ETC_GITCONFIG' and 'ETC_GITATTRIBUTES' build variables which are,
in turn, passed to the compiler as '-D' arguments.
When the 'gitconfig' or 'gitattributes' options are not used, then use
the built-in 'sysconfdir' and set the ETC_GITCONFIG build variable to
the string "<sysconfdir>/gitconfig". Similarly, set ETC_GITATTRIBUTES to
"<sysconfdir>/gitattributes".
Signed-off-by: Ramsay Jones <ramsay@ramsayjones.plus.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Ramsay Jones [Mon, 19 May 2025 16:25:20 +0000 (17:25 +0100)]
meson: correct install location of YAML.pm
When executing a 'meson install' the YAML.pm file is incorrectly
placed in the <prefix>/share/perl5/Git/SVN directory. The YAML.pm
file should be placed in a 'Memoize' subdirectory instead. In order
to correct the location, update the 'install_dir' of the relevant
target in the 'perl/Git/SVN/Memoize/meson.build' file.
Signed-off-by: Ramsay Jones <ramsay@ramsayjones.plus.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Ramsay Jones [Mon, 19 May 2025 16:25:19 +0000 (17:25 +0100)]
meson.build: quote the GITWEBDIR build configuration
The build configuration options with (non-empty) values, for example
filesystem paths potentially containing spaces, have been set using
the '.set_quoted()' method. However, the GITWEBDIR value has been
set using the '.set()' method instead. In order to correctly quote
the GITWEBDIR value, replace the '.set()' method with '.set_quoted()'.
Signed-off-by: Ramsay Jones <ramsay@ramsayjones.plus.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Eli Schwartz [Mon, 19 May 2025 17:09:42 +0000 (13:09 -0400)]
meson: reformat default options to workaround bug in `meson configure`
Since 13cb20fc46 ("meson: fix compilation with Visual Studio",
2025-01-22) it has not been possible to list build options via `meson
configure`. This is due to Meson's static analysis of build options
failing to handle constant folding, and thinking we set a totally
invalid default `-std=`.
This is reported upstream but we anyways need to work with existing
versions. It turns out there is a simple solution: turn the entire
default option into a conditional branch, which means Meson sees either
nothing, or everything.
As a result, Git users can once again see pretty-printed options before
building.
Reported-by: Ramsay Jones <ramsay@ramsayjones.plus.com>
Bug: https://github.com/mesonbuild/meson/issues/14623 Signed-off-by: Eli Schwartz <eschwartz@gentoo.org> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Karthik Nayak [Mon, 19 May 2025 09:58:09 +0000 (11:58 +0200)]
receive-pack: use batched reference updates
The reference updates performed as a part of 'git-receive-pack(1)', take
place one at a time. For each reference update, a new transaction is
created and committed. This is necessary to ensure we can allow
individual updates to fail without failing the entire command. The
command also supports an 'atomic' mode, which uses a single transaction
to update all of the references. But this mode has an all-or-nothing
approach, where if a single update fails, all updates would fail.
In 23fc8e4f61 (refs: implement batch reference update support,
2025-04-08), we introduced a new mechanism to batch reference updates.
Under the hood, this uses a single transaction to perform a batch of
reference updates, while allowing only individual updates to fail.
Utilize this newly introduced batch update mechanism in
'git-receive-pack(1)'. This provides a significant bump in performance,
especially when dealing with repositories with a large number of
references.
With the reftable backend there is an 18x performance improvement, when
performing receive-pack with 10000 refs:
Benchmark 1: receive: many refs (refformat = reftable, refcount = 10000, revision = master)
Time (mean ± σ): 4.276 s ± 0.078 s [User: 0.796 s, System: 3.318 s]
Range (min … max): 4.185 s … 4.430 s 10 runs
Benchmark 2: receive: many refs (refformat = reftable, refcount = 10000, revision = HEAD)
Time (mean ± σ): 235.4 ms ± 6.9 ms [User: 75.4 ms, System: 157.3 ms]
Range (min … max): 228.5 ms … 254.2 ms 11 runs
Summary
receive: many refs (refformat = reftable, refcount = 10000, revision = HEAD) ran
18.16 ± 0.63 times faster than receive: many refs (refformat = reftable, refcount = 10000, revision = master)
In similar conditions, the files backend sees a 1.21x performance
improvement:
Benchmark 1: receive: many refs (refformat = files, refcount = 10000, revision = master)
Time (mean ± σ): 1.121 s ± 0.021 s [User: 0.128 s, System: 0.975 s]
Range (min … max): 1.097 s … 1.156 s 10 runs
Benchmark 2: receive: many refs (refformat = files, refcount = 10000, revision = HEAD)
Time (mean ± σ): 927.9 ms ± 22.6 ms [User: 99.0 ms, System: 815.2 ms]
Range (min … max): 903.1 ms … 978.0 ms 10 runs
Summary
receive: many refs (refformat = files, refcount = 10000, revision = HEAD) ran
1.21 ± 0.04 times faster than receive: many refs (refformat = files, refcount = 10000, revision = master)
As using batched updates requires the error handling to be moved to the
end of the flow, create and use a 'struct strset' to track the failed
refs and attribute the correct errors to them.
This change also uncovers an issue when a client provides multiple
updates to the same reference. For example:
$ git send-pack remote.git A:foo B:foo
Enumerating objects: 3, done.
Counting objects: 100% (3/3), done.
Delta compression using up to 20 threads
Compressing objects: 100% (2/2), done.
Writing objects: 100% (3/3), 226 bytes | 226.00 KiB/s, done.
Total 3 (delta 1), reused 0 (delta 0), pack-reused 0 (from 0)
remote: error: cannot lock ref 'refs/heads/foo': reference already exists
To remote.git
! [remote rejected] A -> foo (failed to update ref)
! [remote failure] B -> foo (remote failed to report status)
As you can see, the remote runs into an error because it cannot lock the
target reference for the second update. Furthermore, the remote complains
that the first update has been rejected whereas the second update didn't
receive any status update because we failed to lock it. Reading this status
message alone a user would probably expect that `foo` has not been updated
at all. But that's not the case: while we claim that the ref wasn't updated,
it surprisingly points to `A` now.
One could argue that this is merely an error in how we report the result of
this push. But ultimately, the user's request itself is already broken and
doesn't make any sense in the first place and cannot ever lead to a sensible
outcome that honors the full request.
The conversion to batched transactions fixes the issue because we now try to
queue both updates in the same transaction. As such, the transaction itself
will notice this conflict and refuse the update altogether before we commit
any of the values.
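The kind of conflict check involved can be sketched as below. The
struct and function are hypothetical and far simpler than Git's ref
transaction code; they only show why queueing both updates together
makes the clash visible before anything is written.

```c
#include <stddef.h>
#include <string.h>

struct toy_update {
	const char *refname;   /* reference to be updated */
	const char *new_value; /* value requested by the client */
};

/*
 * Return the index of the first queued update whose refname was
 * already queued earlier, or -1 if every target is distinct. A caller
 * would run this before applying any update from the queue.
 */
int find_duplicate_target(const struct toy_update *queue, size_t nr)
{
	for (size_t i = 1; i < nr; i++)
		for (size_t j = 0; j < i; j++)
			if (!strcmp(queue[i].refname, queue[j].refname))
				return (int)i;
	return -1;
}
```

For the 'A:foo B:foo' request the second entry is reported as a
duplicate, so neither value needs to be committed.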
Note that this requires changes to a couple of tests in t5408 that happened
to exercise this behaviour. Given that the generated output is misleading
and given that the user request cannot ever be fully honored this really
feels more like a bug than properly designed behaviour. As such, changing
the behaviour feels like the right thing to do.
Since now reference updates are batched, the 'reference-transaction'
hook will be invoked with all updates together. Currently git will 'die'
when the hook returns with a non-zero exit status in the 'prepared'
stage. For 'git-receive-pack(1)', this allowed users to reject an
individual reference update: git would have applied the previous updates
but immediately aborted further execution. This is definitely an incorrect
usage of this hook, since the right place to do this would be the
'update' hook. This patch retains the latter behavior, but the
'reference-transaction' hook now changes to an all-or-nothing behavior
when a non-zero exit status is returned in the 'prepared' stage, since
batch updates use a transaction under the hood. This explains the change
in 't1416'.
Helped-by: Jeff King <peff@peff.net> Helped-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Karthik Nayak <karthik.188@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Karthik Nayak [Mon, 19 May 2025 09:58:08 +0000 (11:58 +0200)]
send-pack: fix memory leak around duplicate refs
'git-send-pack(1)' allows users to push objects to a remote
repository and explicitly list the references to be pushed. The status
of each reference pushed is captured into a list mapped by refname.
If a reference fails to be updated, its error message is captured in the
`ref->remote_status` field. While the command allows duplicate ref
inputs, the list doesn't accommodate this behavior as a particular
refname is linked to a single `struct ref*` element. So if the user
inputs a reference twice like:
git send-pack remote.git A:foo B:foo
where the user is trying to update the same reference 'foo' twice and
the reference fails to be updated, we first fill `ref->remote_status`
with the error message for the input 'A:foo' then we overwrite the same field
with the error message for 'B:foo'. This overwrite happens without first
free'ing the previous value. Fix this leak.
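The pattern behind the fix is to release the old string before storing
the new one. A minimal sketch, using a stand-in struct rather than
Git's 'struct ref' and a hypothetical setter name:

```c
#include <stdlib.h>
#include <string.h>

struct toy_ref {
	char *remote_status; /* owned by the struct; NULL when unset */
};

/*
 * Store a copy of 'msg' as the status. The previous message, if any,
 * is freed first so that repeated calls for the same ref do not leak.
 * Returns 0 on success, -1 if allocation fails (old value is kept).
 */
int set_remote_status(struct toy_ref *ref, const char *msg)
{
	size_t len = strlen(msg) + 1;
	char *copy = malloc(len);

	if (!copy)
		return -1;
	memcpy(copy, msg, len);
	free(ref->remote_status); /* free(NULL) is a no-op */
	ref->remote_status = copy;
	return 0;
}
```

Calling the setter twice for the same ref leaves only the second
message allocated.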
The current tests already incorporate the above example, but in the test
'A:foo' succeeds while 'B:foo' fails, meaning that the memory leak isn't
triggered. Add a new test with multiple duplicates.
Signed-off-by: Karthik Nayak <karthik.188@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Karthik Nayak [Mon, 19 May 2025 09:58:07 +0000 (11:58 +0200)]
fetch: use batched reference updates
The reference updates performed as a part of 'git-fetch(1)', take place
one at a time. For each reference update, a new transaction is created
and committed. This is necessary to ensure we can allow individual
updates to fail without failing the entire command. The command also
supports an '--atomic' mode, which uses a single transaction to update
all of the references. But this mode has an all-or-nothing approach,
where if a single update fails, all updates would fail.
In 23fc8e4f61 (refs: implement batch reference update support,
2025-04-08), we introduced a new mechanism to batch reference updates.
Under the hood, this uses a single transaction to perform a batch of
reference updates, while allowing only individual updates to fail.
Utilize this newly introduced batch update mechanism in 'git-fetch(1)'.
This provides a significant bump in performance, especially when dealing
with repositories with a large number of references.
Adding support for batched updates is simply modifying the flow to also
create a batch update transaction in the non-atomic flow.
With the reftable backend there is a 22x performance improvement, when
performing 'git-fetch(1)' with 10000 refs:
Benchmark 1: fetch: many refs (refformat = reftable, refcount = 10000, revision = master)
Time (mean ± σ): 3.403 s ± 0.775 s [User: 1.875 s, System: 1.417 s]
Range (min … max): 2.454 s … 4.529 s 10 runs
Benchmark 2: fetch: many refs (refformat = reftable, refcount = 10000, revision = HEAD)
Time (mean ± σ): 154.3 ms ± 17.6 ms [User: 102.5 ms, System: 56.1 ms]
Range (min … max): 145.2 ms … 220.5 ms 18 runs
Summary
fetch: many refs (refformat = reftable, refcount = 10000, revision = HEAD) ran
22.06 ± 5.62 times faster than fetch: many refs (refformat = reftable, refcount = 10000, revision = master)
In similar conditions, the files backend sees a 1.25x performance
improvement:
Benchmark 1: fetch: many refs (refformat = files, refcount = 10000, revision = master)
Time (mean ± σ): 605.5 ms ± 9.4 ms [User: 117.8 ms, System: 483.3 ms]
Range (min … max): 595.6 ms … 621.5 ms 10 runs
Benchmark 2: fetch: many refs (refformat = files, refcount = 10000, revision = HEAD)
Time (mean ± σ): 485.8 ms ± 4.3 ms [User: 91.1 ms, System: 396.7 ms]
Range (min … max): 477.6 ms … 494.3 ms 10 runs
Summary
fetch: many refs (refformat = files, refcount = 10000, revision = HEAD) ran
1.25 ± 0.02 times faster than fetch: many refs (refformat = files, refcount = 10000, revision = master)
With this we'll either be using a regular transaction or a batch update
transaction. This helps clean up some code which is no longer needed as
we'll now always have some type of 'ref_transaction' object being
propagated.
One big change is that earlier, each individual update would propagate a
failure. Whereas now, the `ref_transaction_for_each_rejected_update`
function is called at the end of the flow to capture the exit status for
'git-fetch(1)' and also to print F/D conflict errors. This does change
the order of the errors being printed, but the behavior stays the same.
Since transaction errors are now explicitly defined as part of
76e760b999 (refs: introduce enum-based transaction error types,
2025-04-08), utilize them and get rid of custom errors defined within
'builtin/fetch.c'.
Signed-off-by: Karthik Nayak <karthik.188@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Karthik Nayak [Mon, 19 May 2025 09:58:06 +0000 (11:58 +0200)]
refs: add function to translate errors to strings
The commit 76e760b999 (refs: introduce enum-based transaction error
types, 2025-04-08) introduced enum-based transaction error types. The
refs transaction logic was also modified to propagate these errors. For
clients of the ref transaction system, it would be beneficial to provide
human readable messages for these errors.
There is already an existing mapping in 'builtin/update-ref.c', move it
to 'refs.c' as `ref_transaction_error_msg()` and use the same within the
'builtin/update-ref.c'.
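A mapping of this kind might look roughly like the sketch below. The enum
members and the message strings are illustrative assumptions only; the real
enum and `ref_transaction_error_msg()` live in Git's refs code and may use
different names and wording.

```c
#include <stddef.h>

/* Illustrative error codes; assumed names, not copied from Git. */
enum txn_error {
	TXN_ERR_GENERIC = -1,
	TXN_ERR_NAME_CONFLICT = -2,
	TXN_ERR_CREATE_EXISTS = -3,
	TXN_ERR_NONEXISTENT_REF = -4,
};

/* Translate an error code into a human-readable message. */
static const char *txn_error_msg(enum txn_error err)
{
	switch (err) {
	case TXN_ERR_NAME_CONFLICT:
		return "refname conflict";
	case TXN_ERR_CREATE_EXISTS:
		return "reference already exists";
	case TXN_ERR_NONEXISTENT_REF:
		return "reference does not exist";
	case TXN_ERR_GENERIC:
	default:
		return "generic error";
	}
}
```

Keeping the table in one shared function means every client of the ref
transaction system prints the same text for the same error.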
Helped-by: Junio C Hamano <gitster@pobox.com> Signed-off-by: Karthik Nayak <karthik.188@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
K Jayatheerth [Sun, 18 May 2025 07:43:17 +0000 (13:13 +0530)]
docs: replace git_config to repo_config
Since this document was written, the built-in API has been
updated a few times, but the document was left stale.
Adjust to the current best practices by calling repo_config() on the
repository instance the subcommand implementation receives as a
parameter, instead of calling git_config() that used to be the
common practice.
Signed-off-by: K Jayatheerth <jayatheerthkulkarni2005@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
K Jayatheerth [Sun, 18 May 2025 07:43:16 +0000 (13:13 +0530)]
docs: clarify cmd_psuh signature and explain UNUSED macro
The sample program, as written, would no longer build for at least two
reasons:
- Since this document was first written, the convention to call a
subcommand implementation has changed, and cmd_psuh() now needs
to accept the fourth parameter, repository.
- These days, compiler warning options for developers include one
that detects and complains about unused parameters, so ones that
are deliberately unused have to be marked as such.
Update the old-style examples to adjust to the current practices,
with explanations as needed.
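For illustration, a four-parameter entry point with deliberately unused
parameters could look like the sketch below. The UNUSED definition here is
a simplified stand-in (an assumption) for the macro Git provides, and
`struct repository` is only forward-declared so the snippet compiles on
its own.

```c
#include <stddef.h>

/* Simplified stand-in for Git's UNUSED macro (GCC/Clang attribute). */
#if defined(__GNUC__)
#define UNUSED __attribute__((unused))
#else
#define UNUSED
#endif

struct repository; /* opaque here; the real definition comes from Git */

/* Subcommand entry point taking the repository as its fourth parameter. */
int cmd_psuh(int argc UNUSED, const char **argv UNUSED,
	     const char *prefix UNUSED, struct repository *repo UNUSED)
{
	return 0;
}
```

Once a parameter starts being used, its UNUSED marker should be dropped
again so the annotation keeps matching the code.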
Signed-off-by: K Jayatheerth <jayatheerthkulkarni2005@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
K Jayatheerth [Sun, 18 May 2025 07:43:15 +0000 (13:13 +0530)]
docs: remove unused mentoring mailing list reference
The git-mentoring group was initially created to help newcomers
with their development itches. However, in practice,
most of their questions were already being addressed
directly on the mailing list, and contributors consistently
received helpful responses there.
Remove the mentoring group details from the Documentation.
Signed-off-by: K Jayatheerth <jayatheerthkulkarni2005@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Elijah Newren [Fri, 16 May 2025 20:04:18 +0000 (20:04 +0000)]
merge-tree: add a new --quiet flag
Git Forges may be interested in whether two branches can be merged while
not being interested in what the resulting merge tree is nor which files
conflicted. For such cases, add a new --quiet flag which
will make use of the new mergeability_only flag added to merge-ort in
the previous commit. This option allows the merge machinery to, in the
outer layer of the merge:
* exit early when a conflict is detected
* avoid writing (most) merged blobs/trees to the object store
Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Elijah Newren [Fri, 16 May 2025 20:04:17 +0000 (20:04 +0000)]
merge-ort: add a new mergeability_only option
Git Forges may be interested in whether two branches can be merged while
not being interested in what the resulting merge tree is nor which files
conflicted. For such cases, add a new mergeability_only option. This
option allows the merge machinery to, in the "outer layer" of the merge:
* exit upon first[-ish] conflict
* avoid (not prevent) writing merged blobs/trees to the object store
I have a number of qualifiers there, so let me explain each:
"outer layer":
Note that since the recursive merge of merge bases (corresponding to
call_depth > 0) can conflict without the outer final merge
(corresponding to call_depth == 0) conflicting, we can't short-circuit
nor avoid writing merged blobs/trees to the object store during those
inner merges.
"first-ish conflict":
The current patch only exits early from process_entries() on the first
conflict it detects, but conflicts could have been detected in a
previous function call, namely detect_and_process_renames(). However:
* conflicts detected by detect_and_process_renames() are quite rare
conflict types
* the detection would still come after regular rename detection
(which is the expensive part of detect_and_process_renames()), so
it is not saving us much in computation time given that
process_entries() directly follows detect_and_process_renames()
* [this overlaps with the next bullet point] process_entries() is the
place where virtually all object writing occurs (object writing is
sometimes more of a concern for Forges than computation time), so
exiting early here isn't saving us much in object writes either
* the code changes needed to handle an earlier exit are slightly
more invasive in detect_and_process_renames() than for
process_entries().
Given the rareness of the even earlier conflicts, the limited savings
we'd get from exiting even earlier, and in an attempt to keep this
patch simpler, we don't guarantee that we actually exit on the first
conflict detected. We can always revisit this decision later if we
decide that a further micro-optimization to exit slightly earlier in
rare cases is worthwhile.
"avoid (not prevent) writing objects":
The detect_and_process_renames() call can also write objects to the
object store, when rename/rename conflicts involve one (or more) files
that have also been modified on both sides. Because of this alternate
call path leading to handle_content_merges(), our "early exit" does not
prevent writing objects entirely, even within the "outer layer"
(i.e. even within call_depth == 0). I figure that's fine though, since
we're already writing objects for the inner merges (i.e. for call_depth
> 0), which are likely going to represent vastly more objects than files
involved in rename/rename+modify/modify cases in the outer merge, on
average.
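The "outer layer only" early exit can be sketched as below. This is a toy
model under stated assumptions: `struct entry`, `struct merge_opts` and
`process_entries_sketch()` are hypothetical names, and the real
process_entries() does far more than scan a flag.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the merge machinery's data. */
struct entry {
	const char *path;
	int conflicted;
};

struct merge_opts {
	int mergeability_only; /* caller only wants a yes/no answer */
	int call_depth;        /* 0 for the outer merge, > 0 for merge bases */
};

/*
 * Returns 1 if the merge is clean, 0 otherwise. *processed receives the
 * number of entries examined before returning.
 */
static int process_entries_sketch(const struct merge_opts *opts,
				  const struct entry *entries, size_t nr,
				  size_t *processed)
{
	int clean = 1;
	size_t i;

	for (i = 0; i < nr; i++) {
		if (!entries[i].conflicted)
			continue;
		clean = 0;
		/* Only the outer merge may stop at the first conflict. */
		if (opts->mergeability_only && opts->call_depth == 0) {
			i++; /* count the conflicting entry itself */
			break;
		}
	}
	*processed = i;
	return clean;
}
```

With call_depth > 0 the loop always runs to completion, which mirrors the
point that inner merges of merge bases cannot be short-circuited.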
Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Elijah Newren [Fri, 16 May 2025 16:26:26 +0000 (16:26 +0000)]
sequencer: make it clearer that commit descriptions are just comments
Every once in a while, users report that editing the commit summaries
in the todo list does not get reflected in the rebase operation,
suggesting that users are (a) only using one-line commit messages, and
(b) not understanding that the commit summaries are merely helpful
comments to help them find the right hashes.
It may be difficult to correct users' poor commit messages, but we can
at least try to make it clearer that the commit summaries are not
directives of some sort by inserting a comment character. Hopefully
that leads to them looking a little further and noticing the hints at
the bottom to use 'reword' or 'edit' directives.
Yes, this change may look funny at first since it hardcodes '#' rather
than using comment_line_str. However:
* comment_line_str exists to allow disambiguation between lines in
a commit message and lines that are instructions to users editing
the commit message. No such disambiguation is needed for these
comments that occur on the same line after existing directives
* the exact "comment" character(s) used on regular pick lines aren't
actually important; I could have used anything, including completely
random variable length text for each line and it'd work because we
ignore everything after 'pick' and the hash.
* The whole point of this change is to signal to users that they
should NOT be editing any part of the line after the hash (and if
they do so, their edits will be ignored), while the whole point of
comment_line_str is to allow highly flexible editing. So making
it more general by using comment_line_str actually feels
counterproductive.
* The character for merge directives absolutely must be '#'; that
has been deeply hardcoded for a long time (see below), and will
break if some other comment character is used instead. In a
desire to have pick and merge directives be similar, I use the
same comment character for both.
* Perhaps merge directives could be fixed to not be inflexible about
the comment character used, if someone feels highly motivated, but
I think that should be done in a separate follow-on patch.
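To illustrate why the text after the hash does not matter on pick lines,
here is a rough sketch of a parser that stops looking once it has the hash.
`parse_pick_line()` is a hypothetical helper written for this example, not
the sequencer's actual parsing code.

```c
#include <stddef.h>
#include <string.h>

/*
 * Extract the hash from a "pick <hash> [anything...]" line.
 * Returns 0 on success, -1 if the line is not a usable pick line.
 */
static int parse_pick_line(const char *line, char *hash, size_t hash_sz)
{
	size_t len;

	if (strncmp(line, "pick ", 5))
		return -1;
	line += 5;
	line += strspn(line, " \t");   /* skip extra blanks before the hash */
	len = strcspn(line, " \t\n");  /* the hash ends at the next blank */
	if (!len || len >= hash_sz)
		return -1;
	memcpy(hash, line, len);
	hash[len] = '\0';
	return 0; /* the rest of the line is never inspected */
}
```

Under this model, "pick abc1234 # fix typo" and "pick abc1234 fix typo"
yield the same hash and therefore the same operation.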
Here are (some of?) the locations where '#' has already been hardcoded
for a long time for merges:
1) In check_label_or_ref_arg():
    case TODO_LABEL:
        /*
         * '#' is not a valid label as the merge command uses it to
         * separate merge parents from the commit subject.
         */
2) In do_merge():
    /*
     * For octopus merges, the arg starts with the list of revisions to be
     * merged. The list is optionally followed by '#' and the oneline.
     */
    merge_arg_len = oneline_offset = arg_len;
    for (p = arg; p - arg < arg_len; p += strspn(p, " \t\n")) {
        if (!*p)
            break;
        if (*p == '#' && (!p[1] || isspace(p[1]))) {
3) In label_oid():
    if ((buf->len == the_hash_algo->hexsz &&
         !get_oid_hex(label, &dummy)) ||
        (buf->len == 1 && *label == '#') ||
        hashmap_get_from_hash(&state->labels,
                              strihash(label), label)) {
        /*
         * If the label already exists, or if the label is a
         * valid full OID, or the label is a '#' (which we use
         * as a separator between merge heads and oneline), we
         * append a dash and a number to make it unique.
         */
Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Derrick Stolee [Fri, 16 May 2025 18:12:03 +0000 (18:12 +0000)]
pack-objects: allow --shallow and --path-walk
There does not appear to be anything particularly incompatible about the
--shallow and --path-walk options of 'git pack-objects'. If shallow
commits are to be handled differently, then it is by the revision walk
that defines the commit set and which are interesting or uninteresting.
However, before the previous change, a trivial removal of the warning
would cause a failure in t5500-fetch-pack.sh when
GIT_TEST_PACK_PATH_WALK is enabled. The shallow fetch would provide more
objects than we desired, due to some incorrect behavior of the path-walk
API, especially around walking uninteresting objects.
The recently-added tests in t5538-push-shallow.sh help to confirm this
behavior is working with the --path-walk option if
GIT_TEST_PACK_PATH_WALK is enabled. These tests passed previously due to
the --path-walk feature being disabled in the presence of a shallow
clone.
Signed-off-by: Derrick Stolee <stolee@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Derrick Stolee [Fri, 16 May 2025 18:12:02 +0000 (18:12 +0000)]
path-walk: add new 'edge_aggressive' option
In preparation for allowing both the --shallow and --path-walk options
in the 'git pack-objects' builtin, create a new 'edge_aggressive' option
in the path-walk API. This option will help walk the boundary more
thoroughly and help avoid sending extra objects during fetches and
pushes.
The only use of the 'edge_hint_aggressive' option in the revision API is
within mark_edges_uninteresting(), which is usually called after
prepare_revision_walk() and before visiting commits with
get_revision(). In prepare_revision_walk(), the UNINTERESTING commits
are walked until a boundary is found.
Signed-off-by: Derrick Stolee <stolee@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Derrick Stolee [Fri, 16 May 2025 18:12:01 +0000 (18:12 +0000)]
pack-objects: thread the path-based compression
Adapting the implementation of ll_find_deltas(), create a threaded
version of the --path-walk compression step in 'git pack-objects'.
This involves adding a 'regions' member to the thread_params struct,
allowing each thread to own a section of paths. We can simplify the way
jobs are split because there is no value in extending the batch based on
name-hash, which is how the existing code tries to group sections of the
object entry array. We re-use the 'list_size' and 'remaining' items for
the purpose of borrowing work in progress from other "victim" threads
when a thread has finished its batch of work more quickly.
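The borrowing of work from a "victim" thread can be sketched roughly as
follows. The `struct worker` layout and both helpers are assumptions made
for this example; the real code keeps this state in thread_params and must
hold a lock while it inspects and adjusts another thread's queue.

```c
#include <stddef.h>

/* Hypothetical per-thread bookkeeping for a contiguous run of regions. */
struct worker {
	size_t start;     /* index of the next region this worker handles */
	size_t remaining; /* regions still queued for this worker */
};

/* Pick the worker with the most queued regions, or -1 if none can share. */
static int pick_victim(const struct worker *w, int nr)
{
	int victim = -1;
	size_t best = 1; /* a victim needs at least two regions to split */

	for (int i = 0; i < nr; i++) {
		if (w[i].remaining > best) {
			best = w[i].remaining;
			victim = i;
		}
	}
	return victim;
}

/* Move the back half of the victim's queue to the idle thief. */
static size_t steal_half(struct worker *thief, struct worker *victim)
{
	size_t take = victim->remaining / 2;

	victim->remaining -= take;
	thief->start = victim->start + victim->remaining;
	thief->remaining = take;
	return take;
}
```

Taking the back half leaves the victim free to continue with the region it
is currently working on.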
Using the Git repository as a test repo, the p5313 performance test
shows that the resulting size of the repo is the same, but the threaded
implementation gives gains of varying degrees depending on the number of
objects being packed. (This was tested on a 16-core machine.)
    Test                      HEAD~1     HEAD
    ---------------------------------------------------
    5313.20: big pack           2.38     1.99  -16.4%
    5313.21: big pack size     16.1M    16.0M   -0.2%
    5313.24: repack           107.32    45.41  -57.7%
    5313.25: repack size      213.3M   213.2M   -0.0%
(Test output is formatted to better fit in message.)
This ~60% reduction in 'git repack --path-walk' time is typical across
all repos I used for testing. What is interesting is to compare when the
overall time improves enough to outperform the --name-hash-version=1
case. These time improvements correlate with repositories with data
shapes that significantly improve their data size as well. The
--path-walk feature frequently takes longer than --name-hash-version=2,
trading some extra computation for some additional compression. The
natural place where this additional computation comes from is the two
compression passes that --path-walk takes, though the first pass is
naturally faster due to the path boundaries avoiding a number of delta
compression attempts.
For example, the microsoft/fluentui repo has significant size reduction
from --name-hash-version=1 to --name-hash-version=2 followed by further
improvements with --path-walk. The threaded computation makes
--path-walk more competitive in time compared to --name-hash-version=2,
though still ~31% more expensive in that metric.
    Repack Method       Pack Size       Time
    ------------------------------------------
    Hash v1                439.4M     87.24s
    Hash v2                161.7M     21.51s
    Path Walk (Before)     142.5M     81.29s
    Path Walk (After)      142.5M     28.16s
Similar results hold for the Git repository:
    Repack Method       Pack Size       Time
    ------------------------------------------
    Hash v1                248.8M     30.44s
    Hash v2                249.0M     30.15s
    Path Walk (Before)     213.2M    142.50s
    Path Walk (After)      213.3M     45.41s
...as well as the nodejs/node repository:
    Repack Method       Pack Size       Time
    ------------------------------------------
    Hash v1                739.9M     71.18s
    Hash v2                764.6M     67.82s
    Path Walk (Before)     698.1M    208.10s
    Path Walk (After)      698.0M     75.10s
Finally, the Linux kernel repository is a good test for this repacking
time change, even though the space savings are more subtle:
    Repack Method       Pack Size       Time
    ------------------------------------------
    Hash v1                  2.5G    554.41s
    Hash v2                  2.5G    549.62s
    Path Walk (Before)       2.2G   1562.36s
    Path Walk (After)        2.2G    559.00s
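The percentages quoted in this message can be rechecked from the table
values with two small helpers. These function names are made up for this
example.

```c
/* Percentage by which `after` is smaller than `before`. */
static double pct_reduction(double before, double after)
{
	return 100.0 * (before - after) / before;
}

/* Percentage by which `a` exceeds `b`. */
static double pct_more(double a, double b)
{
	return 100.0 * (a - b) / b;
}
```

For instance, pct_reduction(81.29, 28.16) covers the fluentui Path Walk
times before and after threading, and pct_more(28.16, 21.51) compares the
threaded Path Walk time against Hash v2 for the same repository.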
Signed-off-by: Derrick Stolee <stolee@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Derrick Stolee [Fri, 16 May 2025 18:12:00 +0000 (18:12 +0000)]
pack-objects: refactor path-walk delta phase
Previously, the --path-walk option to 'git pack-objects' would compute
deltas inline with the path-walk logic. This would make the progress
indicator look like it is taking a long time to enumerate objects, and
then very quickly compute deltas.
Instead of computing deltas on each region of objects organized by tree,
store a list of regions corresponding to these groups. These can later
be pulled from the list for delta compression before doing the "global"
delta search.
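Recording regions during the walk and consuming them later could be
sketched like this. `struct region`, `struct region_list` and the helpers
are hypothetical names for this example, and the summing function merely
stands in for the later delta-compression pass.

```c
#include <stddef.h>
#include <stdlib.h>

/* A contiguous run of objects that share one path (assumed layout). */
struct region {
	size_t start; /* index of the first object in the run */
	size_t nr;    /* number of objects in the run */
};

struct region_list {
	struct region *items;
	size_t nr, alloc;
};

/* Called during the walk: remember the region instead of compressing it. */
static int region_list_append(struct region_list *l, size_t start, size_t nr)
{
	if (l->nr == l->alloc) {
		size_t alloc = l->alloc ? 2 * l->alloc : 16;
		struct region *items = realloc(l->items, alloc * sizeof(*items));

		if (!items)
			return -1;
		l->items = items;
		l->alloc = alloc;
	}
	l->items[l->nr].start = start;
	l->items[l->nr].nr = nr;
	l->nr++;
	return 0;
}

/* Called after the walk: visit each stored region (here, just count). */
static size_t region_list_total_objects(const struct region_list *l)
{
	size_t total = 0;

	for (size_t i = 0; i < l->nr; i++)
		total += l->items[i].nr;
	return total;
}
```

Separating the two phases is what lets each one have its own progress
indicator.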
This presents a new progress indicator that can be used in tests to
verify that this stage is happening.
The current implementation is not integrated with threads, but we are
setting it up to arrive in the next change.
Since we do not attempt to sort objects by size until after exploring
all trees, we can remove the previous change to t5530 due to a different
error message appearing first.
Signed-off-by: Derrick Stolee <stolee@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Derrick Stolee [Fri, 16 May 2025 18:11:59 +0000 (18:11 +0000)]
scalar: enable path-walk during push via config
Repositories registered with Scalar are expected to be client-only
repositories that are rather large. This means that they are more likely to
be good candidates for using the --path-walk option when running 'git
pack-objects', especially under the hood of 'git push'. Enable this config
in Scalar repositories.
Signed-off-by: Derrick Stolee <stolee@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Derrick Stolee [Fri, 16 May 2025 18:11:58 +0000 (18:11 +0000)]
pack-objects: enable --path-walk via config
Users may want to enable the --path-walk option for 'git pack-objects' by
default, especially underneath commands like 'git push' or 'git repack'.
This should be limited to client repositories, since the --path-walk option
disables bitmap walks, so it would be bad to enable on Git servers when
serving fetches and clones. It may still be worth considering when
repacking the repository, to take advantage of improved deltas
across historical versions of the same files.
Much like how "pack.useSparse" was introduced and included in
"feature.experimental" before being enabled by default, use the repository
settings infrastructure to make the new "pack.usePathWalk" config enabled by
"feature.experimental" and "feature.manyFiles".
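One way to picture how the setting is resolved is the sketch below. It
assumes that an explicitly configured "pack.usePathWalk" value takes
precedence over the feature.* defaults; the struct and function are
hypothetical and not the repository settings code itself.

```c
/* Tri-state config values for this sketch: -1 unset, 0 false, 1 true. */
struct path_walk_config {
	int pack_use_path_walk;   /* pack.usePathWalk */
	int feature_experimental; /* feature.experimental */
	int feature_many_files;   /* feature.manyFiles */
};

/* Resolve the effective setting (assumed precedence, see lead-in). */
static int resolve_use_path_walk(const struct path_walk_config *c)
{
	if (c->pack_use_path_walk >= 0)
		return c->pack_use_path_walk;
	return c->feature_experimental > 0 || c->feature_many_files > 0;
}
```

With everything unset the result is 0, so nothing changes for users who
have not opted in to either feature set.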
In order to test that this config works, add a new trace2 region around
the path walk code that can be checked by a 'git push' command.
Signed-off-by: Derrick Stolee <stolee@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Derrick Stolee [Fri, 16 May 2025 18:11:57 +0000 (18:11 +0000)]
repack: add --path-walk option
Since 'git pack-objects' supports a --path-walk option, allow passing it
through in 'git repack'. This presents interesting testing opportunities for
comparing the different repacking strategies against each other.
Add the --path-walk option to the performance tests in p5313.
For the microsoft/fluentui repo [1] checked out at a specific commit [2],
the --path-walk tests in p5313 look like this:
    Test                                            this tree
    -------------------------------------------------------------------------
    5313.18: thin pack with --path-walk             0.08(0.06+0.02)
    5313.19: thin pack size with --path-walk        18.4K
    5313.20: big pack with --path-walk              2.10(7.80+0.26)
    5313.21: big pack size with --path-walk         19.8M
    5313.22: shallow fetch pack with --path-walk    1.62(3.38+0.17)
    5313.23: shallow pack size with --path-walk     33.6M
    5313.24: repack with --path-walk                81.29(96.08+0.71)
    5313.25: repack size with --path-walk           142.5M
Along with the earlier tests in p5313, I'll instead reformat the
comparison as follows:
    Repack Method    Pack Size       Time
    ---------------------------------------
    Hash v1             439.4M     87.24s
    Hash v2             161.7M     21.51s
    Path Walk           142.5M     81.29s
There are a few things to notice here:
1. The benefits of --name-hash-version=2 over --name-hash-version=1 are
significant, but --path-walk still compresses better than that
option.
2. The --path-walk command is still using --name-hash-version=1 for the
second pass of delta computation, using the increased name hash
collisions as a potential method for opportunistic compression on
top of the path-focused compression.
3. The --path-walk algorithm is currently sequential and does not use
multiple threads for delta compression. Threading will be
implemented in a future change so the computation time will improve
to better compete in this metric.
There are small benefits in size for my copy of the Git repository:
    Repack Method    Pack Size       Time
    ---------------------------------------
    Hash v1             248.8M     30.44s
    Hash v2             249.0M     30.15s
    Path Walk           213.2M    142.50s
As well as in the nodejs/node repository [3]:
    Repack Method    Pack Size       Time
    ---------------------------------------
    Hash v1             739.9M     71.18s
    Hash v2             764.6M     67.82s
    Path Walk           698.1M    208.10s
[3] https://github.com/nodejs/node
This benefit also repeats in my copy of the Linux kernel repository:
    Repack Method    Pack Size       Time
    ---------------------------------------
    Hash v1               2.5G    554.41s
    Hash v2               2.5G    549.62s
    Path Walk             2.2G   1562.36s
It is important to see that even when the repository shape does not have
many name-hash collisions, there is a slight space boost to be found
using this method.
As this repacking strategy was released in Git for Windows 2.47.0, some
users have reported cases where the --path-walk compression is slightly
worse than the --name-hash-version=2 option. In those cases, it may be
beneficial to combine the two options. However, there has not been a
released version of Git that has both options and I don't have access to
these repos for testing.
Signed-off-by: Derrick Stolee <stolee@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Derrick Stolee [Fri, 16 May 2025 18:11:56 +0000 (18:11 +0000)]
t5538: add tests to confirm deltas in shallow pushes
It can be notoriously difficult to detect if delta bases are being
computed properly during 'git push'. Construct an example where it will
make a kilobyte worth of difference when a delta base is not found. We
can then use the progress indicators to distinguish between bytes and
KiB depending on whether the delta base is found and used.
Signed-off-by: Derrick Stolee <stolee@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Derrick Stolee [Fri, 16 May 2025 18:11:55 +0000 (18:11 +0000)]
pack-objects: introduce GIT_TEST_PACK_PATH_WALK
There are many tests that validate whether 'git pack-objects' works as
expected. Instead of duplicating these tests, add a new test environment
variable, GIT_TEST_PACK_PATH_WALK, that implies --path-walk by default
when specified.
This was useful in testing the --path-walk implementation, helping to
find tests that are overly specific to the default object walk. These
include:
- t0411-clone-from-partial.sh : One test fetches from a repo that does
not have the boundary objects. This causes the path-based walk to
fail. Disable the variable for this test.
- t5306-pack-nobase.sh : Similar to t0411, one test fetches from a repo
without a boundary object.
- t5310-pack-bitmaps.sh : One test compares the case when packing with
bitmaps to the case when packing without them. Since we disable the
test variable when writing bitmaps, this causes a difference in the
object list (the --path-walk option adds an extra object). Specify
--no-path-walk in both processes for the comparison. Another test
checks for a specific delta base, but when computing dynamically
without using bitmaps, the base object is too small to be considered
in the delta calculations so no base is used.
- t5316-pack-delta-depth.sh : This script cares about certain delta
choices and their chain lengths. The --path-walk option changes how
these chains are selected, and thus changes the results of this test.
- t5322-pack-objects-sparse.sh : This demonstrates the effectiveness of
the --sparse option and how it combines with --path-walk.
- t5332-multi-pack-reuse.sh : This test verifies that the preferred
pack is used for delta reuse when possible. The --path-walk option is
not currently aware of the preferred pack at all, so finds a
different delta base.
- t7406-submodule-update.sh : When using the variable, the --depth
option collides with the --path-walk feature, resulting in a warning
message. Disable the variable so this warning does not appear.
I want to call out one specific test change that is only temporary:
- t5530-upload-pack-error.sh : One test cares specifically about an
"unable to read" error message. Since the current implementation
performs delta calculations within the path-walk API callback, a
different "unable to get size" error message appears. When this
is changed in a future refactoring, this test change can be reverted.
Similar to GIT_TEST_NAME_HASH_VERSION, we do not add this option to the
linux-TEST-vars CI build as that's already an overloaded build.
Signed-off-by: Derrick Stolee <stolee@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Derrick Stolee [Fri, 16 May 2025 18:11:54 +0000 (18:11 +0000)]
p5313: add performance tests for --path-walk
The previous change added a --path-walk option to 'git pack-objects'.
Create a performance test that demonstrates the time and space benefits
of the feature.
In order to get an appropriate comparison, we need to avoid reusing
deltas and recompute them from scratch.
Compare the creation of a thin pack representing a small push and the
creation of a relatively large non-thin pack.
Running on my copy of the Git repository results in this data (removing
the repack tests for --name-hash-version):
    Test                                                      this tree
    ------------------------------------------------------------------------
    5313.2: thin pack with --name-hash-version=1              0.02(0.01+0.01)
    5313.3: thin pack size with --name-hash-version=1         1.6K
    5313.4: big pack with --name-hash-version=1               2.55(4.20+0.26)
    5313.5: big pack size with --name-hash-version=1          16.4M
    5313.6: shallow fetch pack with --name-hash-version=1     1.24(2.03+0.08)
    5313.7: shallow pack size with --name-hash-version=1      12.2M
    5313.10: thin pack with --name-hash-version=2             0.03(0.01+0.01)
    5313.11: thin pack size with --name-hash-version=2        1.6K
    5313.12: big pack with --name-hash-version=2              1.91(3.23+0.20)
    5313.13: big pack size with --name-hash-version=2         16.4M
    5313.14: shallow fetch pack with --name-hash-version=2    1.06(1.57+0.10)
    5313.15: shallow pack size with --name-hash-version=2     12.5M
    5313.18: thin pack with --path-walk                       0.03(0.01+0.01)
    5313.19: thin pack size with --path-walk                  1.6K
    5313.20: big pack with --path-walk                        2.05(3.24+0.27)
    5313.21: big pack size with --path-walk                   16.3M
    5313.22: shallow fetch pack with --path-walk              1.08(1.66+0.07)
    5313.23: shallow pack size with --path-walk               12.4M
Note that the timing is slower because there is no threading in the
--path-walk case (yet). Also, the shallow pack cases are really not
using the --path-walk logic right now because it is disabled until some
additions are made to the path walk API.
The --path-walk option really shines when the default
name-hash is overwhelmed with unhelpful collisions. An open source
example can be found in the microsoft/fluentui repo [1] at a certain
commit [2].
Notice in particular that in the small thin pack, the time performance
has improved from 0.36s for --name-hash-version=1 to 0.08s and this is
likely due to the improved size of the resulting pack: 18.4K instead of
1.2M. The relatively new --name-hash-version=2 is competitive with
--path-walk (0.12s and 22.0K) but not quite as successful.
Finally, running this on a copy of the Linux kernel repository results
in these data points:
Derrick Stolee [Fri, 16 May 2025 18:11:53 +0000 (18:11 +0000)]
pack-objects: update usage to match docs
The t0450 test script verifies that builtin usage matches the synopsis
in the documentation. Adjust the builtin to match and then remove 'git
pack-objects' from the exception list.
Signed-off-by: Derrick Stolee <stolee@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Derrick Stolee [Fri, 16 May 2025 18:11:52 +0000 (18:11 +0000)]
pack-objects: add --path-walk option
In order to more easily compute delta bases among objects that appear at
the exact same path, add a --path-walk option to 'git pack-objects'.
This option will use the path-walk API instead of the object walk given
by the revision machinery. Since objects will be provided in batches
representing a common path, those objects can be tested for delta bases
immediately instead of waiting for a sort of the full object list by
name-hash. This has multiple benefits, including avoiding collisions by
name-hash.
The objects marked as UNINTERESTING are included in these batches, so we
are guaranteeing some locality to find good delta bases.
After the individual passes are done on a per-path basis, the default
name-hash is used to find other opportunistic delta bases that did not
match exactly by the full path name.
The current implementation performs delta calculations while walking
objects, which is not ideal for a few reasons. First, this will cause
the "Enumerating objects" phase to be much longer than usual. Second, it
does not take advantage of threading during the path-scoped delta
calculations. Even with this lack of threading, the path-walk option is
sometimes faster than the usual approach. Future changes will refactor
this code to allow for threading, but that complexity is deferred until
later to keep this patch as simple as possible.
This new walk is incompatible with some features and is ignored by
others:
* Object filters are not currently integrated with the path-walk API,
such as sparse-checkout or tree depth. A blobless packfile could be
integrated easily, but that is deferred for later.
* Server-focused features such as delta islands, shallow packs, and
using a bitmap index are incompatible with the path-walk API.
* The path walk API is only compatible with the --revs option, not
taking object lists or pack lists over stdin. These alternative ways
to specify the objects currently ignore the --path-walk option
without even a warning.
Future changes will create performance tests that demonstrate the power
of this approach.
Signed-off-by: Derrick Stolee <stolee@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Derrick Stolee [Fri, 16 May 2025 14:55:30 +0000 (14:55 +0000)]
p2000: add performance test for patch-mode commands
The previous three changes contributed performance improvements to 'git
apply', 'git add -p', and 'git reset -p' when using a sparse index. The
improvement to 'git apply' also improved 'git checkout -p'. Add
performance tests to demonstrate this (and to help validate that
performance remains good in the future).
In the truncated test output below, we see that the full checkout
performance changes within noise expectations, but the sparse index
cases improve 33% and then 96% for 'git add -p' and 41% and then 95% for
'git reset -p'. 'git checkout -p' improves immediately by 91% because it
does not need any change to its builtin.
It is worth noting that if our test was more involved and had multiple
hunks to evaluate, then the time spent in 'git apply' would dominate due
to multiple index loads and writes. As it stands, we need the sparse
index improvement in 'git add -p' itself to confirm this performance
improvement.
Since the change for 'git add -i' is identical, we avoid a second test
case for that similar operation.
Signed-off-by: Derrick Stolee <stolee@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>