Comparing Git Branches with Cherry-Picked Commits

I was dealing with a code base where two branches had diverged quite some time ago and commits had been shared between them through cherry-picking. I wanted to bring the branches back in sync, and that proved harder than the usual case, where you can just merge and Git figures it out for you. A cherry-picked commit gets a new hash, so Git does not treat it as shared history and has a harder time figuring out which side's changes to prioritise.

In particular, this was the situation.

develop has the code that is being worked on; it is also deployed to a dev environment
main has the code that runs in the production environment; it is periodically updated by merging develop into main

For $reasons, at some point in the past develop was being worked on and a whole lot of changes were ported to main by means of cherry-picking instead of merging. The branches had not been merged in ages. At a glance, it seemed like develop had changes that main did not have and main had changes that develop did not have.

I wanted to fix this and get them both in sync again, but wading through the mess was tricky. I needed to figure out, for every commit on main, whether it was actually already in develop or not. Develop had evolved so much that a simple diff was confusing at best.

Instead, I worked as follows.

First, figure out the last time main and develop were synced up according to Git. This is the commit that Git would use as the base if you were to try to merge the two together.

git merge-base main develop

This gives a commit hash. I tagged it (literally, git tag) as merge-base-main-dev for easier access (remember, a tag is just a label pointing to a certain hash). Next, I checked the number of commits on main and develop since that last common point.

git log --oneline --ancestry-path merge-base-main-dev..main | wc -l
git log --oneline --ancestry-path merge-base-main-dev..develop | wc -l
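
As a minimal sketch of these first two steps, the script below builds a tiny throwaway repository (the commit messages and identity are made up for illustration), tags the merge base, and counts the commits on each side. It uses git rev-list --count instead of piping git log into wc -l; that is a swapped-in alternative, and without --ancestry-path the numbers can differ slightly from the commands above when side branches were merged in.

```shell
set -eu
# Scratch repository so the example is self-contained (hypothetical content).
repo="$(mktemp -d)"
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example"
git config user.email "example@example.com"

git commit -q --allow-empty -m "shared history"
git branch develop

git commit -q --allow-empty -m "main only 1"
git checkout -q develop
git commit -q --allow-empty -m "develop only 1"
git commit -q --allow-empty -m "develop only 2"

# Tag the merge base so it is easy to refer to later.
git tag merge-base-main-dev "$(git merge-base main develop)"

# Count the commits on each branch since the tagged commit.
main_count="$(git rev-list --count merge-base-main-dev..main)"
develop_count="$(git rev-list --count merge-base-main-dev..develop)"
echo "main: $main_count, develop: $develop_count"
```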

In my case I got ~300 and ~600 commits. That is a lot to dig through manually. However, Git can tell you which commits it thinks were cherry-picked between the branches, which are only part of one branch, and which are only part of the other branch.

git log --left-right --oneline main...develop --cherry-mark

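This command lists all the commits that are on one of the two branches but not reachable from the other, i.e., everything since merge-base-main-dev on either side, one commit per line with a marker in front. Here is an example from my repository with the commit messages redacted.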

> 86be49a Commit description here
> 103d591 Commit description here
= 8413c89 Commit description here
< d3a9909 Commit description here
< dd92fdf Commit description here
> 6147360 Commit description here
> 4eeac33 Commit description here
> 08d9d0a Commit description here
> ac18815 Commit description here
= bdbe96e Commit description here

The first character indicates where the commit was found.

> The commit was only found on develop (the right branch)
< The commit was only found on main (the left branch)
= A commit was found on both main and develop that looks like it was cherry-picked, i.e., both commits have the same patch ID. The commit hashes themselves differ. The patch ID is a hash of the code diff only, not of the message, so the approach is not ruined by a cherry-picked commit that says “Cherry-picked from xxxxx”.
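
To check how the = marker behaves, here is a small sketch in a throwaway repository (file name, messages and the patch_id helper are made up for illustration). It cherry-picks a commit with -x, which appends a "cherry picked from commit" line to the message, and then compares the output of git patch-id for both commits.

```shell
set -eu
repo="$(mktemp -d)"
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example"
git config user.email "example@example.com"

echo "base" > app.txt
git add app.txt
git commit -q -m "shared history"
git branch develop

git checkout -q develop
echo "feature" >> app.txt
git commit -q -am "Add feature"

# Port the commit to main; -x changes the commit message.
git checkout -q main
git cherry-pick -x develop >/dev/null

# Hypothetical helper: print the patch ID of a single commit.
patch_id() { git show "$1" | git patch-id --stable | cut -d' ' -f1; }
id_main="$(patch_id main)"
id_develop="$(patch_id develop)"

# First character of every line of the cherry-mark listing.
marks="$(git log --left-right --cherry-mark --oneline main...develop | cut -c1)"
echo "patch IDs: $id_main $id_develop"
echo "markers: $(echo $marks)"
```

In this setup the two commits have different hashes and different messages, yet the same patch ID, and both show up with the = marker.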

I ended up with just short of 100 commits that existed only on main, just short of 400 only on develop, and over 400 on both. Now, my goal was to figure out which commits on main might need to be ported to develop (yes, that direction: I just wanted to make sure that every addition to main also exists on develop, so that I could start clean from there). So from that list, I decided to ignore all the commits starting with > or with =.
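
Rather than filtering the list by eye, Git can also do the filtering. The sketch below is not what I ran at the time; it is an alternative using the --left-only and --cherry-pick options of git log, shown on a throwaway repository with made-up file names and messages. --cherry-pick drops the commits that would be marked with =, and --left-only keeps only the main side.

```shell
set -eu
repo="$(mktemp -d)"
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example"
git config user.email "example@example.com"

echo "base" > app.txt
git add app.txt
git commit -q -m "shared history"
git branch develop

git checkout -q develop
echo "feature" > feature.txt
git add feature.txt
git commit -q -m "Add feature"
echo "dev" > dev.txt
git add dev.txt
git commit -q -m "Develop-only work"

git checkout -q main
echo "hotfix" > hotfix.txt
git add hotfix.txt
git commit -q -m "Hotfix on main"
# Port "Add feature" from develop to main.
git cherry-pick develop~1 >/dev/null

# Only the commits on main that have no patch-equivalent commit on develop.
main_only="$(git log --left-only --cherry-pick --format=%s main...develop)"
echo "$main_only"
```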

I manually went through that list and found out that the vast majority of the commits were cherry-picked, but there was some small difference that made Git unable to make the match. Some examples:

Develop had seen some intermediary change. That change was incomplete or just testing out an idea. The follow-up change that implemented it properly was cherry-picked to main and the intermediary step was skipped. So the before version of the diff on main is different from the before version of the diff on develop, but the after version of both is the same. This one happened quite often.
Develop added some extra logging in the same diff.

Doing this manually is hard: you are staring at blocks of diffed code and hoping that your eyes and brain are good enough at picking out small differences.

Turns out Git has another command to help with this particular problem: range-diff. This command expects ranges, but I am only interested in comparing one commit to another commit. So just make an ad-hoc range using ~ (somecommit~ means the commit before somecommit, though that explanation gets trickier if you are using it on a merge commit). For example, say I have COMMITONMAIN that I think was cherry-picked from COMMITONDEV.

git range-diff COMMITONMAIN~..COMMITONMAIN COMMITONDEV~..COMMITONDEV

This gets the diff between COMMITONMAIN and the commit before it, and the diff between COMMITONDEV and the commit before it. Finally, it gives you a diff of these two diffs of the code. When looking at this in a terminal, a highlighted - marks a part of the diff of the code that is in main but not in develop, and a highlighted + marks a part of the diff of the code that is in develop but not in main. Do not confuse them with the simply coloured - and +, which are the regular ones from the regular diff. Also note that the difference can be in the context lines of the diff rather than in the changed lines themselves; then you will see a highlighted -/+ that is not followed by a regularly coloured -/+. Yes, that sounds confusing; I reread it and can barely parse it. Honestly, if you ever get into this situation and to this point, try it out, and I think it will become clear. Just keep your diff of diffs of the code apart from your diffs of the code :)
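
If you want to try this without touching a real repository, the following sketch sets up the "extra logging in the same diff" case in a throwaway repository and runs range-diff on it. The file content and messages are invented; COMMITONMAIN and COMMITONDEV are simply the two branch tips here.

```shell
set -eu
repo="$(mktemp -d)"
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example"
git config user.email "example@example.com"

printf 'line %s\n' 1 2 3 4 5 > app.txt
git add app.txt
git commit -q -m "shared history"
git branch develop

# develop: the feature plus one extra logging line.
git checkout -q develop
printf 'feature %s\n' 1 2 3 4 5 6 7 8 9 10 11 12 >> app.txt
echo 'log("feature enabled")' >> app.txt
git commit -q -am "Add feature"

# main: the same feature, ported without the logging line.
git checkout -q main
printf 'feature %s\n' 1 2 3 4 5 6 7 8 9 10 11 12 >> app.txt
git commit -q -am "Add feature"

COMMITONMAIN="$(git rev-parse main)"
COMMITONDEV="$(git rev-parse develop)"

# Diff of the two single-commit diffs.
result="$(git range-diff "${COMMITONMAIN}~..${COMMITONMAIN}" "${COMMITONDEV}~..${COMMITONDEV}")"
printf '%s\n' "$result"
```

The output is captured in a variable here, so it is not coloured; run the range-diff line directly in a terminal to see the highlighting described above.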

This seems to be working out for me. I take a commit on main. I locate the commit on develop I think it was cherry-picked from, helped along by similar-looking commit messages or by the changes I see. I run range-diff as mentioned. I use the diff of diffs to see exactly where the two code diffs differ. I decide what, if anything, has to be added from main to develop.
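
The "locate the one I think it is cherry-picked from" step can be partly scripted when the commit messages were kept. The sketch below is one possible helper loop, not something I used as-is: for every commit that exists only on main, it searches develop for commits whose message contains the same subject line and prints the range-diff command to run for each candidate pair. The repository content is made up, and matching on the subject alone is an assumption that will miss reworded commits.

```shell
set -eu
repo="$(mktemp -d)"
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example"
git config user.email "example@example.com"

printf 'line %s\n' 1 2 3 > app.txt
git add app.txt
git commit -q -m "shared history"
git branch develop

# develop: the feature plus an extra logging line.
git checkout -q develop
printf 'feature\nlog("feature enabled")\n' >> app.txt
git commit -q -am "Add feature"

# main: same feature without the logging line, so the patch IDs differ.
git checkout -q main
printf 'feature\n' >> app.txt
git commit -q -am "Add feature"

base="$(git merge-base main develop)"
candidates=""
# Every commit on main without a patch-equivalent commit on develop.
for c in $(git rev-list --left-only --cherry-pick main...develop); do
  subject="$(git log -1 --format=%s "$c")"
  # Commits on develop whose message contains the same subject line.
  for d in $(git log --format=%H --fixed-strings --grep="$subject" "$base..develop"); do
    candidates="${candidates}${c} ${d}
"
    echo "git range-diff ${c}~..${c} ${d}~..${d}"
  done
done
```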

Eventually, for every commit on main, I have it traced back to something in develop or I have added a new change to develop. So by the end of this exercise, develop subsumes main.

Now, the original goal was to bring main back up to date with develop, but I went the opposite way: develop is up to date with main (and then some). What gives? Well, merging develop into main would still be a pain, because Git has trouble figuring out which changes to prioritise. To solve things, I renamed the main branch to old-main and then started a new main branch from the develop branch. A branch is just a pointer to a certain commit, so renaming one and creating a new one are cheap operations.
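
For completeness, here is a sketch of that last step on a throwaway repository with made-up commits. It covers only the local branches; whatever is needed to update a shared remote is not part of this sketch.

```shell
set -eu
repo="$(mktemp -d)"
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example"
git config user.email "example@example.com"

git commit -q --allow-empty -m "shared history"
git branch develop
git commit -q --allow-empty -m "main only"
git checkout -q develop
git commit -q --allow-empty -m "develop only"
git checkout -q main

# Keep the old production history around under a new name.
# Because main is checked out, HEAD follows the rename to old-main.
git branch -m main old-main

# Start a fresh main that points at the same commit as develop.
git branch main develop

git log --oneline -1 main
```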