tl;dr git destroyed my data; my team now has severe trust issues with git
We ask a lot from our source control systems. We want them to be flexible, fast, distributed, clever and even easy to use. But the number one thing we should demand from a source control system is that it doesn’t destroy our data. Most importantly, it should never lose anything that has been committed; just behind that, it shouldn’t destroy data in our working directory.
When you find out that your source control system has lost your data you end up in a very bad place. The moment your source control system destroys your data, there is a severe breakdown of trust between you and your tool. You revert to using cp -R to create backups before doing anything with the tool, just in case it destroys your data again.
Development was proceeding at a cracking pace, and I was getting ready to commit a series of important changes. Before doing so, I wanted to merge in the recent changes from the remote master, so I did the familiar git pull. It complained about some files that would be overwritten by the merge, so I saved a backup of my changes, then reverted my changes in those specific files, and proceeded.
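That sequence looked roughly like this. This is a self-contained sketch with a made-up repository and file name, not the actual project:

```shell
set -e
cd "$(mktemp -d)"
export GIT_AUTHOR_NAME=me GIT_AUTHOR_EMAIL=me@example.com
export GIT_COMMITTER_NAME=me GIT_COMMITTER_EMAIL=me@example.com

# Stand-in repo: one committed file, plus a local uncommitted edit.
git init -q .
echo upstream > notes.txt
git add notes.txt && git commit -qm 'initial'
echo 'local work' >> notes.txt

# What I did when the pull refused to overwrite my changes:
cp notes.txt notes.txt.bak     # save a backup of the local edits
git checkout -- notes.txt      # revert the file so the merge can proceed
```

After the checkout the file is back at its committed state, and the backup still holds the local edit, so nothing is lost before the merge runs.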
The merge went fine, and pulled in all the remote changes to the working directory. Getting ready for the commit, I did a git status and started to get a little concerned; one of the files I had been heavily editing didn’t show up in the status list. I cat’ed the file to see what was going on; it seemed none of my changes were there. Now I was getting concerned. Maybe I was going slightly crazy after 3 days straight of hacking, but I was sure I had made changes to this file. I scrolled up the terminal backlog to the git status I had done before the pull. Sure enough, the file was marked as changed there, but not after the merge. I carefully read the full details of the merge; my file wasn’t listed as being touched there. Now I was really worried. git had just gone and destroyed the last 5 or 6 hours’ worth of work. Not happy!
Luckily, I was able to reconstruct most of the work from editor buffers I still had open.
But now I was worried. Why the fuck did git decide to destroy data in my working directory, without even telling me? Did I do something wrong? Is this a case I should know about? I had to investigate.
So, I took a snapshot of my repository, rolled back to the revision before the merge, made some minor modifications to my file, then ran the merge again. And, again, git destroyed the change in the working directory. This isn’t normal behaviour; something was really fucked. The only thing slightly interesting about the file in question is that it had recently been renamed. Somehow this rename had confused the merge, and the merge was silently overwriting files.
Now, git has a few different merge strategies, so I tried out some different ones. This was a pretty simple merge with 2 heads, so the options were really recursive or resolve. git picks recursive by default, so I tried resolve instead. This worked fine. Surprisingly, this made me feel a little better: I wasn’t completely crazy, silently updating files in my working directory wasn’t intended behaviour, and there had to be something wrong in the recursive merge.
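For reference, picking a strategy explicitly is just the -s flag on git merge. Here is a minimal, self-contained sketch of a 2-head merge run with the resolve strategy; the toy repo, branch and file names are made up for illustration:

```shell
set -e
cd "$(mktemp -d)"
export GIT_AUTHOR_NAME=me GIT_AUTHOR_EMAIL=me@example.com
export GIT_COMMITTER_NAME=me GIT_COMMITTER_EMAIL=me@example.com

git init -q .
echo base > a.txt
git add a.txt && git commit -qm 'base'

git checkout -qb side              # one head: adds b.txt
echo side > b.txt
git add b.txt && git commit -qm 'side'

git checkout -q -                  # back to the other head: adds c.txt
echo main > c.txt
git add c.txt && git commit -qm 'main'

# A simple 2-head merge; -s selects the strategy (resolve instead of
# the default recursive).
git merge -q -s resolve -m 'merge side' side
```

After the merge both heads’ files are present in the working directory, which is all a correct merge of this trivial case should do.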
So, I updated to the latest version from Homebrew. Same problem.
Then it was time to start debugging git for real. So I downloaded the source (using git, of course) and started having a look through merge-recursive.c. It didn’t look too bad, but there was clearly going to be a lot to learn if I was going to debug this. Before I started littering the code with prints, I thought I’d better check whether HEAD had the same problem. Lo and behold, HEAD worked! OK, cool, they fixed the bug. But that isn’t really a satisfying answer. Just for fun, I checked out some random versions to try to narrow down when the bug was fixed. In doing so I found that it actually worked in some old versions, then didn’t, and then finally worked again in the very latest. Here are my raw notes:
1.7.1 => good
1.7.2 => good
1.7.3 => good
1.7.4 => bad
1.7.5 => bad
1.7.6.1 (installed) => bad
1.7.6.1 (checkout) => bad
1.7.6.4 => bad
1.7.7-rc0 => fail
1.7.7-rc1 => pass
1.7.7-rc3 => pass
OK, this is getting more interesting. So somewhere between 1.7.3 and 1.7.4 this bug was introduced. I started using git bisect to narrow things down. I quickly got bored of manually doing git bisect good and git bisect bad; luckily, I stumbled upon git bisect run, which automates the whole process. After about 20 minutes of compiling and testing, it found the bad commit.
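git bisect run drives the whole search from a script’s exit status: 0 means the commit is good, non-zero means bad. Here is the idea in miniature, on a self-contained toy repo (not the git source) where a fake “bug” is introduced at a known commit:

```shell
set -e
cd "$(mktemp -d)"
export GIT_AUTHOR_NAME=me GIT_AUTHOR_EMAIL=me@example.com
export GIT_COMMITTER_NAME=me GIT_COMMITTER_EMAIL=me@example.com

# Ten commits; the "bug" appears at commit 7 and never goes away.
git init -q .
for i in 1 2 3 4 5 6 7 8 9 10; do
    echo "$i" > n.txt
    if [ "$i" -ge 7 ]; then echo bug > state; else echo ok > state; fi
    git add n.txt state && git commit -qm "commit $i"
done

# bad = HEAD (commit 10), good = HEAD~9 (commit 1). The "test script"
# here is just: does the tree still look ok?
git bisect start HEAD HEAD~9
git bisect run sh -c 'grep -q ok state'

# refs/bisect/bad now names the first bad commit.
git show -s --format=%s refs/bisect/bad
```

In the real hunt, the script compiled git at each step and replayed the failing merge, which is why it took about 20 minutes.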
commit 882fd11aff6f0e8add77e75924678cce875a0eaf
Author: Elijah Newren
Date:   Mon Sep 20 02:29:03 2010 -0600

    merge-recursive: Delay content merging for renames

    Move the handling of content merging for renames from process_renames()
    to process_df_entry().

    Signed-off-by: Elijah Newren
    Signed-off-by: Junio C Hamano
OK, lots of talk about merge-recursive and renames. That sounds like it makes sense; at least there is a specific bit of code that I can blame for my data destruction. Maybe I don’t have to distrust the whole tool.
But to be really confident, I wanted to check that the fix wasn’t just something random, and was actually done to fix this problem. So I switched the return code in my test script and ran git bisect again to find when the bug was fixed. Eventually it found this commit:
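The return-code trick in miniature: invert the test so that “bug still present” counts as good, and bisect then hunts for the first commit where the bug is gone, i.e. the fix. Again a self-contained toy repo, not the git source:

```shell
set -e
cd "$(mktemp -d)"
export GIT_AUTHOR_NAME=me GIT_AUTHOR_EMAIL=me@example.com
export GIT_COMMITTER_NAME=me GIT_COMMITTER_EMAIL=me@example.com

# Ten commits; a "bug" exists from commit 4 through commit 7 and is
# fixed in commit 8.
git init -q .
for i in 1 2 3 4 5 6 7 8 9 10; do
    echo "$i" > n.txt
    if [ "$i" -ge 4 ] && [ "$i" -lt 8 ]; then
        echo bug > state
    else
        echo ok > state
    fi
    git add n.txt state && git commit -qm "commit $i"
done

# Start from inside the buggy range (commit 4 = HEAD~6) and negate the
# test's exit code: commits where the bug persists are "good", so the
# first "bad" commit is the one that fixed it.
git bisect start HEAD HEAD~6
git bisect run sh -c '! grep -q ok state'
git show -s --format=%s refs/bisect/bad
```

Same machinery, opposite question: instead of “who broke it?”, it answers “who fixed it?”.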
commit 5b448b8530308b1f5a7a721cb1bf0ba557b5c78d
Author: Elijah Newren
Date:   Thu Aug 11 23:20:10 2011 -0600

    merge-recursive: When we detect we can skip an update, actually skip it

    In 882fd11 (merge-recursive: Delay content merging for renames
    2010-09-20), there was code that checked for whether we could skip
    updating a file in the working directory, based on whether the merged
    version matched the current working copy. Due to the desire to handle
    directory/file conflicts that were resolvable, that commit deferred
    content merging by first updating the index with the unmerged entries
    and then moving the actual merging (along with the
    skip-the-content-update check) to another function that ran later in
    the merge process. As part of moving the content merging code, a bug
    was introduced such that although the message about skipping the
    update would be printed (whenever GIT_MERGE_VERBOSITY was sufficiently
    high), the file would be unconditionally updated in the working copy
    anyway.

    When we detect that the file does not need to be updated in the
    working copy, update the index appropriately and then return early
    before updating the working copy.

    Note that there was a similar change in b2c8c0a (merge-recursive:
    When we detect we can skip an update, actually skip it 2011-02-28),
    but it was reverted by 6db4105 (Revert "Merge branch
    'en/merge-recursive'" 2011-05-19) since it did not fix both of the
    relevant types of unnecessary update breakages and, worse, it made
    use of some band-aids that caused other problems.

    The reason this change works is due to the changes earlier in this
    series to (a) record_df_conflict_files instead of just unlinking them
    early, (b) allowing make_room_for_path() to remove D/F entries, (c)
    the splitting of update_stages_and_entry() to have its functionality
    called at different points, and (d) making the pathnames of the files
    involved in the merge available to merge_content().

    Signed-off-by: Elijah Newren
    Signed-off-by: Junio C Hamano
OK, this is good. Looks like they fixed the bug, and it even references the bad commit that I had narrowed things down to.
So I’m a little dismayed that this bug existed for almost a full year before being fixed. I can’t be the only person to have been hit by this problem, can I? I looked at the release notes for v1.7.7. This is what they have to say about the issue:
The recursive merge strategy implementation got a fairly large fix for many corner cases that may rarely happen in real world projects (it has been verified that none of the 16000+ merges in the Linux kernel history back to v2.6.12 is affected with the corner case bugs this update fixes).
OK, so the bug never triggered in 16,000+ Linux kernel merges. Strangely, that doesn’t actually make me feel any better.
So, I don’t think git sucks. All software has bugs, but bugs that destroy data are pretty devastating. It is a little hard to trust git merge operations now. I’ll probably try to make sure I never merge onto a dirty working directory (i.e. stash my changes first, since then they are at least backed up in the object database).
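That defensive workflow, sketched on a toy repo: stash before merging, then pop afterwards. Even if the pop conflicts, the stashed changes live on as commits reachable from refs/stash, so the object database has your back:

```shell
set -e
cd "$(mktemp -d)"
export GIT_AUTHOR_NAME=me GIT_AUTHOR_EMAIL=me@example.com
export GIT_COMMITTER_NAME=me GIT_COMMITTER_EMAIL=me@example.com

git init -q .
echo base > f.txt
git add f.txt && git commit -qm 'base'

echo 'local work' >> f.txt     # uncommitted changes in the working tree

git stash -q                   # working tree is now clean; the changes
                               # are stored as commits under refs/stash
# ... this is where the merge/pull would happen ...
git stash pop -q               # bring the local changes back
```

The key property is that between the stash and the pop, the uncommitted work exists as real objects in the repository rather than only as files a merge could clobber.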
Of course, convincing my colleagues, who were also affected by this bug and didn’t really have any love for git in the first place, that git isn’t completely broken is going to be a tough sell.