Comparing Git Branches with Cherry-Picked Commits
I was dealing with a code base where two branches had diverged quite some time ago and commits had been shared between them through cherry-picking. I wanted to bring the branches back in sync, but that proved harder than the usual case, where you just merge and Git figures it out for you. Because of the cherry-picking, Git has a harder time working out which changes to prioritise.
In particular, this was the situation.
- develop has the code that is being worked on; it is also deployed to a dev environment
- main has the code that runs in the production environment; it is periodically updated by merging develop into main
For $reasons, at some point in the past work continued on develop and a whole lot of changes were ported to main by cherry-picking instead of merging. The branches had not been merged in ages. At a glance, it seemed like develop had changes that main did not have and main had changes that develop did not have.
I wanted to fix this and get them both in sync again, but wading through the mess was tricky. I needed to figure out whether every commit on main was actually already in develop or not. Develop had evolved so much that a simple diff was confusing at best.
Instead, I worked as follows.
First, figure out the last time main and develop were synced up according to Git. This is the commit that Git would use as base if you were to try to merge the two together.
git merge-base main develop
This gives a commit hash. I tagged it (literally, git tag) as merge-base-main-dev for easier access (remember, a tag is just a label pointing to a certain hash). Next, I checked the number of commits on main and develop since that last common point.
git log --oneline --ancestry-path merge-base-main-dev..main | wc -l
git log --oneline --ancestry-path merge-base-main-dev..develop | wc -l
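The tagging and counting steps can be wrapped into one small helper. This is only a sketch: the function name count_since_base is made up for illustration, and it uses git rev-list --count instead of piping git log into wc -l, which should give the same numbers without the extra process.

```shell
# Hypothetical helper: tag the merge base of two branches and count the
# commits on each side since that point.
# Usage: count_since_base LEFT RIGHT TAGNAME
count_since_base() {
    left=$1
    right=$2
    tag=$3

    # The commit Git would use as the base when merging the two branches.
    base=$(git merge-base "$left" "$right") || return 1

    # -f moves the tag if it already exists from an earlier run.
    git tag -f "$tag" "$base" >/dev/null

    # Same range as in the git log commands above, but counted directly.
    echo "$left: $(git rev-list --count --ancestry-path "$tag..$left")"
    echo "$right: $(git rev-list --count --ancestry-path "$tag..$right")"
}
```

Inside a repository, count_since_base main develop merge-base-main-dev prints one line per branch with its commit count.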
In my case I got ~300 and ~600 commits. That is a lot to dig through manually. However, Git can tell you which commits it thinks were cherry-picked between the branches, which are only part of one branch, and which are only part of the other branch.
git log --left-right --oneline main...develop --cherry-mark
This command lists all the commits since merge-base-main-dev in a predefined format. Here is an example from my repository with the commit messages redacted.
> 86be49a Commit description here
> 103d591 Commit description here
= 8413c89 Commit description here
< d3a9909 Commit description here
< dd92fdf Commit description here
> 6147360 Commit description here
> 4eeac33 Commit description here
> 08d9d0a Commit description here
> ac18815 Commit description here
= bdbe96e Commit description here
The first character indicates where the commit was found.
- > The commit was only found on develop (the right branch)
- < The commit was only found on main (the left branch)
- = A commit was found in main and develop that looks like it was cherry-picked, i.e., the hash of the change was the same. I think this is only a hash of the code diff (what Git calls a patch ID), not of the message, otherwise the approach would be ruined by a cherry-picked commit that says “Cherry-picked from xxxxx”. I did not test this assumption.
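One way to check the assumption about the commit message is a throwaway repository. The sketch below is not taken from the original repository: the file names and commit messages are invented, and it assumes git is installed and that mktemp -d works on your system. It cherry-picks with -x, which appends a "(cherry picked from commit ...)" line to the message, and then runs the same log command as above.

```shell
#!/bin/sh
# Throwaway repository: does --cherry-mark still pair up a cherry-pick
# whose commit message differs from the original?
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main   # name the initial branch main
git config user.email "demo@example.com"
git config user.name "Demo"

echo base > file.txt
git add file.txt
git commit -q -m "base"
git checkout -q -b develop

echo feature > feature.txt
git add feature.txt
git commit -q -m "add feature"
picked=$(git rev-parse HEAD)

echo wip > wip.txt
git add wip.txt
git commit -q -m "develop-only work"

git checkout -q main
echo hotfix > hotfix.txt
git add hotfix.txt
git commit -q -m "main-only hotfix"

# -x changes the commit message but not the diff.
git cherry-pick -x "$picked" >/dev/null

marks=$(git log --left-right --cherry-mark --oneline --no-decorate main...develop)
echo "$marks"
```

If the comparison really ignores the message, "add feature" should be listed with =, the hotfix with <, and the develop-only commit with >.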
I ended up with just short of 100 commits that existed only on main, just short of 400 only on develop, and over 400 on both. Now, my goal is to figure out commits on main that might need to be ported to develop (yes, that direction, I just wanted to make sure that every addition to main also exists on develop, then I could start clean from there). So from that list, I decided to ignore all the commits starting with > or with =.
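Dropping the > and = lines does not have to be done by eye. A minimal sketch, assuming the list comes from the git log command shown earlier; the function name left_only is made up for illustration.

```shell
# Hypothetical filter: keep only the "<" lines (left side, here main)
# from --left-right --cherry-mark output read on stdin.
left_only() {
    # grep exits non-zero when nothing matches; treat that as an empty list.
    grep '^<' || true
}

# Usage inside a repository (not run here):
#   git log --left-right --cherry-mark --oneline main...develop | left_only
#
# Git can also do the filtering itself. --cherry-pick omits the "=" commits
# and --left-only keeps the left side only:
#   git log --left-only --cherry-pick --oneline main...develop
```

Either way, what remains is the list of commits on main that Git could not match to anything on develop.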
I manually went through that list and found out that the vast majority of the commits were cherry-picked, but there was some small difference that made Git unable to make the match. Some examples:
- Develop had seen some intermediary change. That change was incomplete or just testing out an idea. The follow up change that implemented it properly was cherry-picked to main and the intermediary step was skipped. So the before version of the diff on main is different from the before version of the diff on develop, but the after version of both is the same. This one happened quite often.
- Develop added some extra logging in the same diff.
Doing this manually is hard: you are staring at blocks of diffed code and hoping that your eyes and brain are good enough at picking out small differences.
Turns out Git has another command to help with this particular problem: range-diff. This command expects ranges, but I am only interested in comparing one commit to another commit. So just make an ad-hoc range using ~ (somecommit~ means the commit before somecommit, though that explanation gets trickier if you are using it on a merge commit). For example, say I have COMMITONMAIN that I think was cherry-picked from COMMITONDEV.
git range-diff COMMITONMAIN~..COMMITONMAIN COMMITONDEV~..COMMITONDEV
This gets the diff between COMMITONMAIN and the one before it. It gets the diff between COMMITONDEV and the one before it. Finally, it gives you a diff of these diffs of the code. When looking at this in a terminal, the highlighted - means a part of the diff of the code that was in main, but not in develop. The highlighted + means a part of the diff of the code that was in develop, but not in main. Do not confuse them with the simply coloured - and + which are the regular ones from the regular diff. Also note that there can be a difference in the context of the diff, not the changed lines themselves. Then you will see a highlighted -/+ not followed by a regularly coloured -/+.
Yes, that sounds confusing, I reread it and can barely parse it. Honestly, if you ever get in this situation and to this point, try it out, and I think it will become clear. Just keep your diff of diffs of the code apart from your diffs of the code :)
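To try it out without touching a real repository, here is a sketch of the "develop added some extra logging" case. Everything in it is invented for illustration (the app.sh file, the greet function, the log line), and it assumes a Git version that ships range-diff. The same change lands on both branches, but the develop version carries one extra line.

```shell
#!/bin/sh
# Throwaway repository: the same change on main and develop, where the
# develop version has one extra logging line.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main   # name the initial branch main
git config user.email "demo@example.com"
git config user.name "Demo"

printf '%s\n' '#!/bin/sh' 'echo start' 'echo end' > app.sh
git add app.sh
git commit -q -m "base"
git branch develop

# Write app.sh with the new function; $1 is an optional extra line.
write_app() {
    {
        echo '#!/bin/sh'
        echo 'echo start'
        echo 'greet() {'
        echo '    name=$1'
        if [ -n "$1" ]; then echo "    $1"; fi
        echo '    echo "hello $name"'
        echo '}'
        echo 'greet world'
        echo 'greet again'
        echo 'echo end'
    } > app.sh
}

write_app ""
git commit -q -am "add greet"

git checkout -q develop
write_app 'echo "log: greeting called" >&2'
git commit -q -am "add greet"

# Diff of diffs: the main commit on the left, the develop commit on the right.
out=$(git range-diff main~..main develop~..develop)
echo "$out"
```

When the output is not coloured, the diff-of-diffs marker is the first of the two characters in front of a line and the regular diff marker is the second, so the extra logging line should appear with a + in both positions.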
This seems to be working out for me. I take a commit. I locate the one I think it is cherry-picked from, helped along by either similar looking commit messages or based on the changes I see. I run range-diff as mentioned. I use the diff of diffs to check each diff of the code along that part. I decide what, if anything, has to be added from main to develop.
Eventually, for every commit on main, I have it traced back to something in develop or I have added a new change to develop. So by the end of this exercise, develop subsumes main.
Now, the original goal was to bring main back up to date with develop, but I went the opposite way: develop is up to date with main (and then some). What gives? Well, merging develop into main would still be a pain, because Git has trouble figuring out which changes to prioritise. To solve things, I renamed the main branch to old-main. I then started a new main branch from the develop branch. A branch is just a pointer to a certain commit, so renaming and creating a new one are cheap operations.
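The rename and restart can be sketched with two git branch calls. The example below runs in a throwaway repository with made-up commits and only covers the local branches; whatever is needed to update a shared remote is not shown here.

```shell
#!/bin/sh
# Throwaway repository: rename main to old-main and start a new main
# from develop.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main   # name the initial branch main
git config user.email "demo@example.com"
git config user.name "Demo"
git commit -q --allow-empty -m "base"
git checkout -q -b develop
git commit -q --allow-empty -m "work on develop"

# Keep the old history reachable under a new name.
git branch -m main old-main
# Start a fresh main that points at the tip of develop.
git branch main develop

git branch --list
```

Both operations only move labels around; no commits are rewritten.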