git.ipfire.org Git - thirdparty/git.git/log

]> git.ipfire.org Git - thirdparty/git.git/log

Elijah Newren [Wed, 6 Aug 2025 23:15:22 +0000 (23:15 +0000)]

merge-ort: fix directory rename on top of source of other rename/delete

At GitHub, we've got a real-world repository that has been triggering
failures of the form:

    git: merge-ort.c:3007: process_renames: Assertion `newinfo && !newinfo->merged.clean' failed.

which comes from the line:

    VERIFY_CI(newinfo);

Unfortunately, this one has been quite complex to unravel, and is a
bit complex to explain.  So, I'm going to carefully try to explain each
relevant piece needed to understand the fix, then carefully build up
from a simple testcase to some of the relevant testcases.

== New special case we need to consider ==

Rename pairs in the diffcore machinery connect the source path of a
rename with the destination path of a rename.  Since we have rename
pairs to consider on both sides of history since the merge base,
merging has to consider a few special cases of possible overlap:

  A) two rename pairs having the same target path
  B) two rename pairs having the same source path
  C) the source path of one rename pair being the target path of a
     different rename pair

Some of these came up often enough that we gave them names:
  A) a rename/rename(2to1) conflict (looks similar to an add/add conflict)
  B) a rename/rename(1to2) conflict, which represents the same path being
     renamed differently on the two sides of history
  C) not yet named

merge-ort is well-prepared to handle cases (A) and (B), as was
merge-recursive (which was merge-ort's predecessor).  Case (C) was
briefly considered during the years of merge-recursive maintenance,
but the full extent of support it got was a few FIXME/TODO comments
littered around the code highlighting some of the places that would
probably need to be fixed to support it.  When I wrote merge-ort I
ignored case (C) entirely, since I believed that case (C) was only
possible if we were to support break detection during merges.  Not
only had break detection never been supported by any merge algorithm,
I thought break detection wasn't worth the effort to support in a
merge algorithm.  However, it turns out that case (C) can be triggered
without break detection, if there's enough moving pieces.

Before I dive into how to trigger case (C) with directory renames plus
other renames, it might be helpful to use a simpler example with break
detection first.  And before we get to that it may help to explain
some more basics of handling renames in the merge algorithm.  So, let
me first backup and provide a quick refresher on each of

  * handling renames
  * what break detection would mean, if supported in merging
  * handling directory renames

From there, I'll build up from a basic directory rename detection case
to one that triggers a failure currently.

== Handling renames ==

In the merge machinery when we have a rename of a path A -> B,
processing that rename needs to remove path A, and make sure that path B
has the relevant information.  Note that if the content was also
modified on both sides, this may mean that we have 3 different stages
that need to be stored at path B instead of having some stored at path
A.

Having all stages stored at path B makes it much easier for users to
investigate and resolve the content conflict associated with a renamed
path.  For example:
  * "git status" doesn't have to figure out how to list paths A & B and
    attempt to connect them for users; it can just list path B.
  * Users can use "git ls-files -u B" (instead of trying to find the
    previous name of the file so they can list both, i.e. "git ls-files
    -u A B")
  * Users can resolve via "git add B" (without needing to "git rm A")

== What break detection would mean ==

If break detection were supported, we might have cases where A -> B
*and* C -> A, meaning that both rename pairs might believe they need to
update A.  In particular, the processing of A -> B would need to be
careful to not clear out all stages of A and mark it resolved, while
both renames would need to figure out which stages of A belong with A
and which belong with B, so that both paths have the right stages
associated with them.

merge-ort (like merge-recursive before it) makes no attempt to handle
break detection; it runs with break detection turned off.  It would
need to be retrofitted to handle such cases.

== Directory rename detection ==

If one side of history renames directory D/ -> E/, and the other side of
history adds new files to D/, then directory rename detection notices
and suggests moving those new files to E/.  A similar thing is done for
paths renamed into D/, causing them to be transitively renamed into E/.

The default in the merge machinery is to report a conflict whenever a
directory rename might modify the location of a path, so that users can
decide whether they wanted the original path or the
directory-rename-induced location.  However, that means the default
codepath still runs through all the directory rename detection logic, it
just supplements it with providing conflict notices when it is done.

== Building up increasingly complex testcases ==

I'll start with a really simple directory rename example, and then
slowly add twists that explain new pieces until we get to the
problematic cases:

=== Testcase 1 ===

Let's start with a concrete example, where particular files/directories of
interest that exist or are changed on each side are called out:

  Original:   <nothing of note>
  our side:   rename B/file -> C/file
  their side: rename C/     -> A/

For this case, we'd expect to see the original B/file appear not at
C/file but at A/file.

(We would also expect a conflict notice that the user will want to
choose between C/file and A/file, but I'm going to ignore conflict
notices from here on by assuming merge.directoryRenames is set to
`true` rather than `conflict`; the only difference that assumption
makes is whether that makes the merge be considered to be conflicted
and whether it prints a conflict notice; what is written to the index
or working directory is unchanged.)

=== Testcase 2 ===

Modify testcase 1 by having A/file exist from the start:

  Original:   A/file exists
  our side:   rename B/file -> C/file
  their side: rename C/     -> A/

In such a case, to avoid user confusion at what looks kind of like an
add/add conflict (even though the original path at A/file was not added
by either side of the merge), we turn off directory rename detection for
this path and print a "in the way" warning to the user:
    CONFLICT (implicit dir rename): Existing file/dir ... in the way ...
The testcases in section 5 of t6423 explore these in more detail.

=== Testcase 3 ===

Let's modify testcase 1 in a slightly different way: have A/file be
added by their side rather than it already existing.

  Original:   <nothing of note>
  our side:   rename B/file -> C/file
  their side: rename C/     -> A/
              add A/file

In this case, the directory rename detection basically transforms our
side's original B/file -> C/file into a B/file -> A/file, and so we
get a rename/add conflict, with one version of A/file coming from the
renamed file, and another coming from the new A/file, each stored as
stages 2 and 3 in conflicts.  This kind of add/add conflict is perhaps
slightly more complex than a regular add/add conflict, but with the
printed messages it makes sense where it came from and we have
different stages of the file to work with to resolve the conflict.

=== Testcase 4 ===

Let's do something similar to testcase 3, but have the opposite side of
history add A/file:

  Original:   <nothing of note>
  our side:   rename B/file -> C/file
              add    A/file
  their side: rename C/     -> A/

Now if we allow directory rename detection to modify C/file to A/file,
then we also get a rename/add conflict, but in this case we'd need both
higher order stages being recorded on side 2, which makes no sense.  The
index can't store multiple stage 2 entries, and even if we could, it
would probably be confusing for users to work with.  So, similar to what
we do when there was an A/file in the original version, we simply turn
off directory rename detection for cases like this and provide the "in
the way" CONFLICT notice to the user.

=== Testcase 5 ===

We're slowly getting closer.  Let's mix it up by having A/file exist at
the beginning but not exist on their side:

  original:   A/file exists
  our side:   rename B/file -> C/file
  their side: rename C/     -> A/
              rename A/file -> D/file

For this case, you could say that since A/file -> D/file, it's no longer
in the way of C/file being moved by directory rename detection to
A/file.  But that would give us a case where A/file is both the source
and the target of a rename, similar to break detection, which the code
isn't currently equipped to handle.

This is not yet the case that causes current failures; to the current
code, this kind of looks like testcase 4 in that A/file is in the way
on our side (since A/file was in the original and was umodified by our
side).  So, it results in a "in the way" notification with directory
rename detection being turned off for A/file so that B/file ends up at
C/file.

Perhaps the resolution could be improved in the future, but our "in
the way" checks prevented such problems by noticing that A/file exists
on our side and thus turns off directory rename detection from
affecting C/file's location.  So, while the merge result could be
perhaps improved, the fact that this is currently handled by giving
the user an "in the way" message gives the user a chance to resolve
and prevents the code from tripping itself up.

=== Testcase 6 ===

Let's modify testcase 5 a bit more, to also delete A/file on our side:

  original:   A/file exists
  our side:   rename B/file -> C/file
              delete A/file
  their side: rename C/     -> A/
              rename A/file -> D/file

Now the "in the way" logic doesn't detect that there's an A/file in
the way (neither side has an A/file anymore), so it's fine to
transitively rename C/file further to A/file...except that we end up
with A/file being both the source of one rename, and the target of a
different rename.  Each rename pair tries to handle the resolution of
the source and target paths of its own rename.  But when we go to
process the second rename pair in process_renames(), we do not expect
either the source or the destination to be marked as handled already;
so, when we hit the sanity checks that these are not handled:

    VERIFY_CI(oldinfo);
    VERIFY_CI(newinfo);

then one of these is going to throw an assertion failure since the
previous rename pair already marked both of its paths as handled.
This will give us an error of the form:

    git: merge-ort.c:3007: process_renames: Assertion `newinfo && !newinfo->merged.clean' failed.

This is the failure we're currently triggering, and it fundamentally
depends on:
  * a path existing in the original
  * that original path being removed or renamed on *both* sides
  * some kind of directory rename moving some *other* path into that
    original path

This was added as testcase 12q in t6423.

=== Testcase 7 ===

Bonus bug found while investigating!

Let's go back to the comparison between testcases 2 & 3, and set up a
file present on their side that we need to consider:

  Original:   A/file exists
  our side:   rename B/file -> C/file
              rename A/file -> D/file
  their side: rename C/     -> A/

Here, there is no A/file in the way on our side like testcase 4.
There is an A/file present on their side like testcase 3, which was an
add/add conflict, but that's associated with the file be renamed to
D/file.  So, that really shouldn't be an add/add conflict because we
instead want all modes of the original A/file to be transported to
D/file.

Unfortunately, the current code kind of treats it like an add/add
conflict instead...but even worse.  There is also a valid mode for
A/file in the original, which normally goes to stage 1.  However, an
add/add conflict should be represented in the index with no mode at
stage 1 (for the original side), only modes at stages 2 and 3 (for our
and their side), so for an add/add we'd expect that mode for A/file in
the original version to be cleared out (or be transported to D/file).

Unfortunately, the code currently leaves not only the stage 3 entry
for A/file intact, it also leaves the stage 1 entry for A/file.  This
results in `git ls-files -u A/file` output of the form:

    100644 d00491fd7e5bb6fa28c517a0bb32b8b506539d4d 1 A/file
    100644 0cfbf08886fca9a91cb753ec8734c84fcbe52c9f 2 A/file
    100644 d00491fd7e5bb6fa28c517a0bb32b8b506539d4d 3 A/file

This would likely cause users to believe this isn't an add/add
conflict; rather, this would lead them to believe that A/file was only
modified on our side and that therefore it should not have been a
conflict in the first place.  And while resolving the conflict in
favor of our side is the correct resolution (because stages 1 and 3
should have been cleared out in the first place), this is certainly
likely to cause confusion for anyone attempting to investigate why
this path was marked as conflicted.

This was added as testcase 12p in t6423.

== Attempted solutions that I discarded ==

1) For each side of history, create a strset of the sources of each
   rename on the other side of history.  Then when using directory
   renames to modify existing renames, verify that we aren't renaming
   to a source of another rename.

   Unfortunately, the "relevant renames" optimization in merge-ort
   means we often don't detect renames -- we just see a delete and an
   add -- which is easy to forget and makes debugging testcases harder,
   but it also turns out that this solution in insufficient to solve
   the related problems in the area (more on that below).

2) Modify the code to be aware of the possibility of renaming to
   the source of another side's rename, and make all the conflict
   resolution logic for each case (including existing
   rename/rename(2to1) and rename/rename(1to2) cases) handle the
   additional complexity.  It turns out there was much more code to
   audit than I wanted, for a really niche case.  I didn't like how
   many changes were needed, and aborted.

== Solution ==

We do not want the stages of unrelated files appearing at the same path
in the index except when dealing with an add/add conflict.  While we
previously handled this for stages 2 & 3, we also need to worry about
stage 1.  So check for a stage 1 index entry being in the way of a
directory rename.

However, if we can detect that the stage 1 index entry is actually from
a related file due to a directory-rename-causes-rename-to-self
situation, then we can allow the stage 1 entry to remain.

From this wording, you may note that it's not just rename cases that
are a problem; bugs could be triggered with directory renames vs simple
adds.  That leads us to...

== Testcases 8+ ==

Another bonus bug, found via understanding our final solutions (and the
failure of our first attempted solution)!

Let's tweak testcase 7 a bit:

  Original:   A/file exists
  our side:   delete A/file
              add -> C/file
  their side: delete A/file
              rename C/     -> A/

Here, there doesn't seem to be a big problem.  Sure C/file gets modified
via the directory rename of C/ -> A/ so that it becomes A/file, but
there's no file in the way, right?  Actually, here we have a problem
that the stage 1 entry of A/file would be combined with the stage 2
entry of C/file, and make it look like a modify/delete conflict.
Perhaps there is some extra checking that could be added to the code to
make it attempt to clear out the stage 1 entry of A/file, but the
various rename-to-self-via-directory-rename testcases make that a bit
more difficult.  For now, it's easier to just treat this as a
path-in-the-way situation and not allow the directory rename to modify
C/file.

That sounds all well and good, but it does have an interesting side
effect.  Due to the "relevant renames" optimizations in merge-ort (i.e.
only detect the renames you need), 100% renames whose files weren't
modified on the other side often go undetected.  This means that if we
modify this testcase slightly to:

  Original:   A/file exists
  our side:   A/file -> C/file
  their side: rename C/ -> A/

Then although this looks like where the directory rename just moves
C/file back to A/file and there's no problem, we may not detect the
A/file -> C/file rename.  Instead it will look like a deletion of A/file
and an addition of C/file.  The directory rename then appears to be
moving C/file to A/file, which is on top of an "unrelated" file (or at
least a file it doesn't know is related).  So, we will report
path-in-the-way conflicts now in cases where we didn't before.  That's
better than silently and accidentally combining stages of unrelated
files and making them look like a modify/delete; users can investigate
the reported conflict and simply resolve it.

This means we tweak the expected solution for testcases 12i, 12j, and
12k.  (Those three tests are basically the same test repeated three
times, but I was worried when I added those that subtle differences in
parent/child, sibling/sibling, and toplevel directories might mess up
how rename-to-self testcases actually get handled.)

Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>