Stacked Diffs Versus Pull Requests

People who have worked with Phabricator using a ‘stacked diff’ workflow generally love it and seek it out wherever they go next. People who have never used it and only know Pull Requests on GitHub or GitLab generally don’t understand what the fuss is all about. How can code review be *sooo* much better using some obscure tool with hilarious documentation? Well, hold on to your butts, because I’m about to break it down. This post is going to focus solely on the engineering workflow for committing code and getting reviews. I’ll probably do a second post about the details of code review between the two tools.

Before I dig deeply, let me just say I’ve created and merged hundreds of Pull Requests and landed thousands of Diffs. I know both of these workflows in and out. I’m not just some ex-Facebook, Phabricator fanboy pining for the ‘good old days.’ I’ve worked on engineering teams using CVS (oh yes), SVN, Git, and Mercurial. GitLab, GitHub, Gerrit, and Phabricator. I’ll happily acknowledge that you can get a lot of good work done using any of these. Now, if you want to talk about how to get the most work done — the most productivity per engineer — that’s where I have a strong opinion informed by lots of experience.

What are stacked diffs?

Many folks reading this post won’t actually have a clue what “stacked diffs” are all about anyway. This is understandable. Feature branches and Pull Requests (PRs) are fairly ubiquitous and (sort of) well-understood. For the uninitiated, I’ll outline how it works.

First, Pull Requests

In PR-based development, you start by branching master, then adding one or more commits, which you submit as a ‘Pull Request’ in the GitHub UI. A Pull Request is (or at least should be) an atomic unit of code for review. When someone requests changes, you address them by adding additional commits to the pull request until the sum of these changes satisfies the reviewers’ demands.

The really important thing about this is that the state of your local repository is dictated by the review process. If you want to have your code reviewed, you first have to branch master, then commit to that branch, then push it remotely, then create a PR. If you get review feedback, you have to commit more code onto the same branch and either a) push it to create a longer, less coherent commit history or b) squash your local commits and force-push to the branch. This also means that you can’t have a local checkout of the repository that looks different from the remote. This is a really, really important point that I’ll come back to again and again.

So, Stacked Diffs

The basic idea of stacked diffs is that you have a local checkout of the repository which you can mangle to your heart’s content. The only thing that the world needs to care about is what you want to push out for review. This means you decide what view of your local checkout the reviewers see. You present something that can be ‘landed’ on top of master. It may be helpful to skip down to the Case Studies section below to get a more intuitive feel about how this works.

The typical workflow is to work right on top of master, committing to master as you go. For each of the commits, you then use the Phabricator command line tool to create a ‘Diff’ which is the Phabricator equivalent of a Pull Request. Unlike Pull Requests, Diffs are usually based on exactly one commit and instead of pushing updates as additional commits, you update the single commit in place and then tell Phabricator to update the remote view. When a Diff gets reviewed and approved, you can “land” it onto remote master. Your local copy and master don’t have to be in perfect sync in order to do this. You can think of this as the remote master cherry-picking the specific commit from your git history.
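Concretely, the loop looks something like this with Phabricator’s arc command line tool (the commit message is made up, and with several commits stacked up you point arc at one commit at a time, e.g. arc diff HEAD^):

$ git commit -m "Make awesome user-facing change"   # commit straight onto local master
$ arc diff                                          # create a Diff for review from the commit
# ...review feedback arrives...
$ git commit --amend                                # update the same commit in place
$ arc diff                                          # push the updated version to the same Diff
# ...the Diff is accepted...
$ arc land                                          # land the commit onto remote master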

That’s right, I said it. You can probably commit everything to master. Sound terrifying? It’s mostly… well… just, not a problem at all. It’s fine like 93% of the time. In fact, this approach gives you the ability to do things that branches alone just can’t (more on that below). The anxiety many engineers feel about committing ahead of master is a lot like the fear that if you fly at lightspeed, you’ll crash into a star. Popularly held, theoretically true, and practically completely wrong.

In practice, engineers tend to work on problems whose chunks don’t easily divide into units of code review that make sense as a branch-per-unit-of-review. In fact, most engineers don’t know exactly how their work decomposes when they start working on a problem. Maybe they could commit everything to master. Maybe they need a branch per commit. Maybe it’s somewhere in between. If the rules for how to get code reviewed and how to commit code are defined for you ahead of time, you don’t get to choose, which in many cases means a net loss in productivity.

The big “aha!” idea here is that units of code review are individual commits and that those commits can stack arbitrarily, because they’re all on one branch. You can have 17 local commits all stacked ahead of master and life is peachy. Each one of them can have a proper, unique commit message (i.e. title, description, test plan, etc.). Each of them can be a unit out for code review. Most importantly, each one of them can have a single thesis. This matters *so* much more than most engineering teams realize.

Yes, basically every commit can be on top of master

“But that’s marmot floofing crazy!” I hear you say at your computer, reading this months after the blog post was written. Is it? Is it, really?! You may be surprised to learn that many engineers, who make a fantastic amount of money from some of the best companies in the world, commit directly to master all of the time, unless they have a reason not to.

To enable this, the mental model is different. A unit of code review is a single commit, not a branch. The heuristic for whether or not to branch is this: ‘Am I going to generate many units of code review for this overall change?’ If the answer is yes, you might create a branch to house the many units of code review the overall change will require. In this model, a branch is just a utility for organising many units of code review, not something forced on you *as* the mechanism of code review.

If you adopt this approach, you can use master as much as you want. You can branch when/if you want. You, the engineer, decide when/if to branch and how much to branch.

In this model, every commit must pass lint. It must pass unit tests. It must build. Every commit should have a test plan. A description. A meaningful title. Every. Single. Commit. This level of discipline means the code quality bar is fundamentally higher than in the Pull Request world (especially if you rely on Squash Merge to save you). Because every commit builds, you can bisect. You can expect that reading the raw git log is actually meaningful. In fact, in this model every single commit is like the top commit from a Pull Request. Every commit has a link to the code review that allowed the commit to land. You can see who wrote it and who reviewed it at a glance.

For clarity, let me describe the extreme case where you only commit to master. I’ll outline things that are simpler because of this. I’m starting in order of least-important to most-important just to build the drama.

#1 Rebasing against master

With Pull Requests, if you want to catch up your local branch to master, you have to do the following:

  1. Fetch the changes from remote
  2. Rebase your branch on top of master
  3. Merge any conflicts that arise

That doesn’t sound so bad, but what about when you have a branch off a branch off of master? Then you have to repeat the last two steps for each branch, every time.

By contrast, if you only worked from master, you only have to do a git pull --rebase and you get to skip the cascading rebases, every time. You get to do just the work that you care about. All the branch jumping falls away without any cost. Might seem minor, but if you do the math on how often you have to do this, it adds up.
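In commands (branch names here are hypothetical), that’s the difference between this, for every branch in the chain:

$ git fetch origin
$ git checkout feature-v
$ git rebase origin/master
$ git checkout feature-w          # branched off of feature-v
$ git rebase feature-v
# ...and so on, all the way up the chain...

and this:

$ git pull --rebase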

#2 Doing unrelated tasks

Many of us wear a lot of hats in our jobs. I’m the owner of a user-facing product codebase, which is many tens of thousands of lines, separated into dozens of features. That means I often jump between, for instance, refactoring big chunks of crufty old JavaScript (e.g. hundreds of lines of code across dozens of files) and working out small, nuanced bugs that relate to individual lines of code in a single file.

In the Pull Request world, this might mean I switch branches a dozen times per day. In most cases, that’s not really necessary because many of these changes would never conflict. Yet, most highly productive people do half a dozen or more unrelated changes in a day. This means that all the time spent branching and merging is wasted because those changes would never have conflicted anyway. This is evidenced by the fact that the majority of changes can be merged from the GitHub Pull Request UI without any manual steps at all. If the changes would never have conflicted, why are you wasting your time branching and merging? Surely you should be able to choose when/if to branch.

#3 Doing related tasks

One of the most time-destroying aspects of the Pull Request workflow is when you have multiple, dependent units of work. We all do this all the time. I want to achieve Z, which requires doing V, W, X, and Y.

Sound far-fetched? Well, just recently, I wanted to fix a user-facing feature. However, the UI code was all wrong. It needed to have a bunch of bad XHR code abstracted out first. Then, the UI code I wanted to change would be isolated enough to work on. The UI change required two server-side changes as well — one to alter the existing REST API and one to change the data representation. In order to properly test this, I’d need all three changes together. But none of these changes required the same reviewer and they could all land independently, apart from the XHR and feature changes.

In the stacked diff world, this looks like this:

$ git log
commit c1e3cc829bcf05790241b997e81e678b3b309cc8 (HEAD -> master)
Author: Jackson Gabbard <madeup@email.com>
Date:   Sat Sep 29 16:43:22 2018 +0100

    Alter API to enable my-sweet-feature-change

commit 6baac280353eb3c69056d90202bebef5de963afe
Author: Jackson Gabbard <madeup@email.com>
Date:   Sat Sep 29 16:44:27 2018 +0100

    Alter the database schema representation to enable my-sweet-feature

commit a16589b0fec54a2503c18ef6ece50f63214fa553
Author: Jackson Gabbard <madeup@email.com>
Date:   Sat Sep 29 16:42:28 2018 +0100

    Make awesome user-facing change

commit cd2e43210bb48158a1c5eddb7c178070a8572e4d
Author: Jackson Gabbard <madeup@email.com>
Date:   Sat Sep 29 16:41:26 2018 +0100

    Add an XHR library to abstract redundant calls

commit 5c63f48334a5879fffee3a29bf12f6ecd1c6a1dc (origin/master, origin/HEAD)
Author: Some Other Engineer <madeup-2@email.com>
Date:   Sat Sep 29 16:40:16 2018 +0100

    Did some work on some things

The equivalent branch-based Git configuration might look like this:

$ git log
commit 55c9fc3be10ebfe642b8d3ac3b30fa60a1710f0a (HEAD -> api-changes)
Author: Jackson Gabbard <madeup@email.com>
Date:   Sat Sep 29 17:02:48 2018 +0100

    Alter API to enable my-sweet-feature-change

commit b4dd1715cb47ace52bc773312544eb5da3b08038 (data-model-change)
Author: Jackson Gabbard <madeup@email.com>
Date:   Sat Sep 29 17:03:25 2018 +0100

    Alter the database schema representation to enable my-sweet-feature

commit 532e86c9042b54c881c955b549634b81af6cdd2b (my-sweet-feature)
Author: Jackson Gabbard <madeup@email.com>
Date:   Sat Sep 29 17:02:02 2018 +0100

    Make awesome user-facing change

commit d2383f17db1692708ed854735caf72a88ee16e46 (xhr-changes)
Author: Jackson Gabbard <madeup@email.com>
Date:   Sat Sep 29 17:01:29 2018 +0100

    Add an XHR library to abstract out redundant calls

commit ba28b0c843a863719d0ac489b933add61303a141 (master)
Author: Some Other Engineer <madeup-2@email.com>
Date:   Sat Sep 29 17:00:56 2018 +0100

    Did some work on some things

Realistically though, in the Pull Request world, this commonly goes one of two ways:

  1. You care massively about code quality so you diligently create a branch off of master for V, then a branch off of V for W, then a branch off of W for X, then a branch off of X for Y. You create a pull request for each one (as above).
  2. You actually want to get work done so you create one big-ass Pull Request that has commits for V, W, X, and Y.

In either case, someone loses.

For Case #1, what happens when someone requests changes to V? Simple, right? You make those changes in branch V and push them, updating your PR. Then you switch to W and rebase. Then you switch to X and rebase. Then you switch to Y and rebase. And when you’re done, you go to the orthopedist to get a walker because you’re literally elderly now. You’ve wasted your best years rebasing branches, but hey — the commit history is clean AF.
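For the record, here’s that dance in Git terms after updating V (branch names hypothetical):

$ git checkout v && git push --force-with-lease   # update V's PR
$ git checkout w && git rebase v && git push --force-with-lease
$ git checkout x && git rebase w && git push --force-with-lease
$ git checkout y && git rebase x && git push --force-with-lease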

Importantly, woe be unto you if you happened to miss a branch rebase in the middle somewhere. This also means that when it comes time to merge, you have to remember which destination branch to select in the GitHub UI. If you mess that up and merge from X to W after W was merged to master, you’ve got an exciting, life-shortening mess to clean up. Yay!

For Case #2, everyone else loses because people just don’t feel the same burden of quality per-commit in a PR. You don’t have a test plan for every commit. You don’t bother with good documentation on each individual commit, because you’re thinking in terms of a PR.

In this case, when different reviewers request changes to the code for theses V and W, you just slap commits Y++ and Y++++ onto the end of the Pull Request to address the feedback across all of the commits. This means that the coherence of the codebase goes down over time.

You can’t intelligently squash-merge the aspects of the various commits in the Pull Request that are actually related. The tool doesn’t work that way, so people don’t work or think that way. I can’t tell you the number of times I’ve seen the last two or three commits to a PR titled “Addresses feedback” or “tweaks” and nothing else. Those commits tend to be among the sloppiest and least coherent. In the context of the PR, that *seems* fine. But fast-forward 6 months: you’re trying to figure out why some code doesn’t do what it’s supposed to, all you have is a stack of 20 commits from a seemingly unrelated PR, and git blame shows that the offending line comes from a commit titled “nits” with no other context. Life is just harder.

#4 Doing multiple sets of related tasks

If you happen to be one of the rare engineers who is so productive that you work on multiple, distinct problems at the same time — you still probably want a branch per thing, even in the stacked diff workflow. In this case, you create a branch per distinct problem, but put out multiple units of code review on each branch.

For the mortals amongst us, let’s imagine the case where an amazing engineer is working on three different hard problems at once — three strands of work, each of which requires many commits and review by many different people. This person might generate conflicts between their branches, but they’re also clever enough and productive enough to manage that. Let’s assume that each of this person’s branches includes an average of 5 or more units of code review in solving each of the 3 distinct problems.

In the Pull Request model, this means that person will have to create 3 branches off of master and then a cascade of branches-of-branches five deep for each. Alternatively, this person will create 3 Pull Requests, each of which is stacked 5 commits deep with commits that only go together because of a very high-level problem, not because it actually makes sense for code review. Those 5 commits may not require the same reviewer. Yet, the Pull Request model is going to put the onus on a single reviewer, because that’s how the tool works.

The Stacked Diff model allows that amazing engineer to choose how/if to branch any commit. That person can decide if their work requires 3 branches and 15 units of code review or if their work requires 15 branches and 15 units of code review or something different.

This is more important than many people realize. Engineering managers know that allowing their most productive people to be as productive as possible amounts to big chunks of the team’s total output. Why on earth would you saddle your most productive engineers with a process that eats away at their productivity?

Thoughtless commits are bad commits

Every single commit that hits a codebase means more shit to trawl through trying to fix a production bug while your system is melting. Every merge commit. Every junk mid-PR commit that still doesn’t build but kinda gets your change closer to working. Every time you smashed two or three extra things into the PR because it was too much bother to create a separate PR. These things add up. These things make a codebase harder to wrangle, month after month, engineer after engineer.

How do you git bisect a codebase where every 6th commit doesn’t build because it was jammed into the middle of a Pull Request?

How much harder is it to audit a codebase where many times the blame is some massive merge commit?

How much more work is it to figure out what a commit actually does by reading the code because the blame commit message was “fixes bugs” and the pull request was 12 commits back?

The answer is *a lot harder*. Specifically because Pull Requests set you up for way more, way lower quality commits. It’s just the default mode of the workflow. That is what happens in practice, in codebases all over the world, every day. I’ve seen it in five different companies now on two continents in massively different technical domains.

Make the default mode a good one

You can make the argument that none of this is the fault of Pull Requests. Hi, thanks for your input. You’re technically correct. To you, I’d like to offer the Tale of the Tree Icon. When Facebook re-launched Facebook Groups in 2011, I was the engineer who implemented the New User Experience. I worked directly with the designer who implemented the Group Icons, which show up in the left navigation of the site. Weeks after launch, we noticed that almost all the groups had their icon set to… a tree. It was a gorgeous icon designed by the truly exceptional Soleio Cuervo. But… a tree? Why?

Because it was the first thing in the list.

People choose whatever is easiest. Defaults matter. So much. Even we demi-god-like engineers are subject to the trappings of default behaviour. Which is why Pull Requests are terrible for code quality. The easiest behaviour is shoehorning in a bunch of shit under one PR because it’s just so much work to get code out for review.

This is where Stacked Diffs win out, no question. It’s not even close. The default behaviour is to be able to create a unit of code review for any change, no matter how minor. This means that you can get the dozens of uninteresting changes that come along with any significant work approved effortlessly. The changes that are actually controversial can be easily separated from the hum-drum, iterative code that we all write every day. Pull Requests encourage exactly the opposite — pounding in all of the changes into one high-level thesis and leaving the actual commit history a shambles.

Coding as a queue

The fundamental shift that the Stacked Diff workflow enables is moving from the idea that every change is a branch off of master to a world where your work is actually a queue of changes ahead of master. If you’re a productive engineer, you’ll pretty much always have five or more changes out for review. They’ll get reviewed in some order and committed in some order. With Pull Requests, the work queue is hidden behind the cruft of juggling branches and trying to treat each change like a clean room separated from your other work. With Stacked Diffs, the queue is obvious — it’s a stack of commits ahead of master. You put new work on the end of the queue. Work that is ready to land gets bumped to the front of the queue and landed onto master. It’s a much, much simpler mental model than a tangle of dependent branches and much more flexible than moving every change into the clean room of a new branch.

(For the pedantic few out there, yes, I just said stacked diffs are like a queue. Yeah… I didn’t name the workflow. Don’t hurl the rotten tomatoes at me.)


By now, you’re probably sick of this theoretical/rhetorical discussion of what good engineering looks like. Let’s switch gears and talk about this in practical, day to day terms.

Case Study #1: The Highly Productive Coder

In this case study, we take a look at Suhair. Suhair is a really productive coder. Suhair produces 10 or more high quality commits every day.

With Pull Requests

Suhair starts the day fixing a bug. Creates a branch, makes changes. Commits them. Suhair then pushes to the remote branch. Then navigates away from the terminal to the GitHub UI to create a pull request.

Next, Suhair switches back to master, pulls, and creates a new branch to work on a new feature. Commits code. This code is completely unrelated to the bug fix. In fact, they would never generate merge conflicts. Still, Suhair sticks to branches. Works on the feature. Gets it to a good RFC state. Suhair pushes the changes. In GitHub, Suhair creates a pull request.

Next, Suhair starts working on another feature improvement. Switches to master. Pulls. Branches. But… uh oh. This change depends on his bug fix from earlier. What to do? He goes to the bug fix PR to see if there are any comments. One person left some passing comments, but the person Suhair needs to review it hasn’t commented.

So Suhair decides it’s too much work to create a branch off his bug fix branch and decides to do something else in the interim. Suhair pings the needed person, begging for code review, interrupting their flow and then starts working on something else.

With Stacked Diffs

Suhair pulls master in the morning to get the latest changes. Makes the first bug fix, commits it, creates a Diff to be reviewed, entirely from the command line. Suhair then works on the unrelated feature. Commits. Creates a Diff from the command line. Then starts working on the bug-fix-dependent feature improvement. Because Suhair never left master, the bug fix is still in the stack. So, Suhair can proceed with the feature improvement uninterrupted. So, Suhair does the work. Commits it. Creates a Diff for review.

By now, the person who should have reviewed the initial bug fix actually got around to it. They give Suhair some feedback, which Suhair incorporates via interactive rebase. Suhair rebases the changes on top of the updated bug fix, which generates a small merge conflict with the feature improvement, which Suhair fixes. Then Suhair lands the updated bug fix. On the next git pull --rebase against remote master, the local commit disappears because remote master already has an identical change, and Suhair’s queue of commits ahead of master decreases by one.
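For the curious, fixing up a commit that’s buried mid-stack looks roughly like this (assuming Phabricator’s arc tool):

$ git rebase -i origin/master   # mark the bug fix commit as 'edit'
# ...apply the review feedback...
$ git commit --amend            # fold the fixes into the original commit
$ arc diff                      # send the updated commit to the existing Diff
$ git rebase --continue         # replay the commits stacked on top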

As a bonus for Suhair today, the same reviewer who approved the bug fix is also the reviewer needed for his feature improvement. That person approved his tweaks right after they reviewed his bug fix. So, Suhair rebases those changes to be at the top of the commit stack, then lands them. Suhair never switches branches. At the end of the day, only the feature work is left in his local repo; everything else is landed on top of master.

The next day, Suhair comes in, runs git pull --rebase and starts working without any branch juggling.

Case Study #2: The Free Spirited Hacker

Charlie is a productive, energetic, somewhat amoral hacker who just wants to get work done as fast as possible. Charlie knows the product better than anyone, but doesn’t really care about code quality. Charlie is best paired with a senior tech lead (or two) who can rein in the chaos a bit.

With Pull Requests

Charlie starts the day by branching master and spamming the branch with five commits that are only vaguely related. Charlie’s commits are big, chunky commits that don’t make a lot of sense. They tend to be a bunch of things all crammed together. Reviewers of Charlie’s code always know they’ll have a lot of work ahead of them to make sense of the tangle of ideas. Because of this, they tend to put off reviewing. Today, a senior tech lead takes 45 minutes to read through all these changes, giving detailed feedback and explaining how to improve the various strands of the change. Charlie commits more changes onto the PR, addressing feedback and also making random “fixes” along the way. In the end, the PR is probably okay, but it’s certainly not coherent and may the Mighty Lobster on High protect those who have to make sense of the code in the coming months.

During this laborious back-and-forth, Charlie’s best option is to keep piling things on this PR because all the related changes are in it. The tech lead doesn’t have a reasonable alternative to offer Charlie.

With Stacked Diffs

Charlie blasts out five commits and five Diffs back to back. Each one addresses something specific. Each one goes to a different reviewer because Charlie happens to be making a sweeping change to the codebase. Charlie knows how it all fits together and the tech leads can make sure that the individual changes aren’t going to ruin everything.

Because the changes are smaller and more coherent, they get much better review. A tech lead points out that one of the changes is clearly two separate theses that happen to touch the same set of files, and pushes back on Charlie: these should actually be two separate commits. Unfazed, Charlie abandons the Diff. Using interactive rebase to rewind history to that troublesome commit, Charlie uses git reset to uncommit the single commit that has two theses.

At this point, Charlie’s local master is two commits ahead of remote master and has a bag of uncommitted changes that Charlie is currently hacking on. There are two more changes that are in the future, waiting to be added back to the local commit history by Git when Charlie is done rebasing interactively.

So, Charlie uses git add -p to separate out one change from the other and creates two new commits and two new Diffs for them separately. They each get a title, a description, and a test plan. Charlie then runs git rebase --continue to fast-forward time and bring back the later changes. Now, Charlie’s local master is six commits ahead of remote master. There are six Diffs out for review. Charlie never switched branches.
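In commands, Charlie’s splitting maneuver looks roughly like this:

$ git rebase -i origin/master   # mark the two-thesis commit as 'edit'
$ git reset HEAD^               # uncommit it; the changes stay in the working tree
$ git add -p                    # stage only the hunks belonging to thesis one
$ git commit                    # commit thesis one, with its own message and test plan
$ git commit -a                 # commit thesis two with the remaining changes
$ git rebase --continue         # replay the later commits on top of the split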

Case Study #3: The Engineer with a Bad Neighbour

Yang is a great engineer working in a fun part of the infrastructure. Unfortunately, Yang has a bad neighbour. This other engineer constantly lands the team in trouble. Today, Yang has found that the build is broken due to yet another incomprehensible change. The neighbour has a “fix” out for a review, but no one trusts it and several knowledgeable people are picking through the code in a very contentious code review. Yang just wants to get work done, but can’t because the bug is blocking everything.

With Pull Requests

Yang will check out the remote branch with the “fix”. Next, Yang will branch off of that branch in order to get a sort-of-working codebase. Yang gets to work. Midday, the bad neighbour pushes a big update to the “fix”. Yang has to switch to that branch, pull, then switch back to the branch Yang has been working on, rebase, and then push the branch for review. Yang then switches gears to refactor a class nearby in the codebase. So, Yang has to go back to the bug “fix” branch, branch off it, start the refactor, and push the commit remotely for review. The next day, Yang wants to merge the changes, but the “fix” has changed and needs rebasing again. Yang switches to the bug fix branch, pulls. Switches to the first branch, rebases. Pushes. Switches to GitHub to do the merge, carefully selecting to merge onto master rather than the bug fix branch. Then Yang goes back to the terminal, switches to the second feature, rebases, pushes, goes to GitHub, selects to merge to master, and merges. Then, Yang applies for AARP, because Yang is now in geriatric care.

With Stacked Diffs

Yang sees that the Diff for the “fix” is out for review. Yang uses the Phabricator command line tool to patch that commit on top of master. This means that it’s not a branch. It’s just a throwaway local commit. Yang then starts working on the first change. Yang submits a Diff for review from the command line. Later, the “fix” has changed, so Yang drops the patch of the old version from the Git history and patches in the updated one via interactive rebase. Yang then starts working on the second change, submits a Diff for review. The next day, Yang is ready to land both changes. First, Yang dumps the previous patch of the fix and repatches the update to make sure everything works. Then, Yang uses the command line to land both of the changes without ever switching branches or leaving the terminal. Later, the fix lands, so Yang does a git pull --rebase and the local patch falls off because it’s already in master. Then Yang goes skydiving, because Yang is still young and vital.
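The key trick is Phabricator’s patch command (the Diff ID here is made up, and I believe --nobranch is the flag that applies the patch without creating a branch):

$ arc patch --nobranch D1234    # apply the "fix" Diff as a throwaway commit on local master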

In Conclusion

As you can see from the Case Studies, you can definitely get good work done no matter what toolchain you use. I think it’s also quite clear that Stacked Diffs make life easier and work faster. Many engineers reading this will say the cost of switching is too high. That’s to be expected. It’s a thing called the Sunk Cost Fallacy. Everyone prefers the thing they feel they have invested in, even if there is an alternative that is provably more valuable. The stacked diff workflow is a clearly higher-throughput workflow that gives significantly more power to engineers to develop code as they see fit.

Inside Facebook, engineering used the branch-oriented workflow for years. They eventually replaced it with the stacked diff workflow because it made engineers more productive in very concrete terms. It also encourages good engineering practices in a way exactly opposite to branching and Pull Requests.

Something I haven’t touched on at all is the actual work of reviewing code. As it turns out, Phabricator also happens to offer better code review tools, but I’ll save that for another post.

Communicating via AES 256 GCM between NodeJS and Go(lang)

I recently had the mixed fortune of needing to send encrypted payloads back and forth between a service running in NodeJS and a service built in Golang. It was not a straightforward thing to do, which made me appreciate just how particular and difficult these libraries are to use, doubly so when communicating across services written in different languages. Hence, I decided to write a blog post about the intricacies of the process.

First Issue: Transport Encoding

Enciphered payloads are just strings of bits. These bit strings are not ASCII or UTF-8, though. You can’t output them to your terminal, for instance, because they can completely destroy the terminal due to byte sequences that the terminal interprets as instructions. Nor can you include them as HTTP request bodies, for the same reason. They can include characters that are interpreted by HTTP parsers in weird ways, most likely causing message truncation.

So, you need a transport encoding for sending ciphertext around. Base64 works great for this… with a catch.

Golang has multiple Base64 encodings. Specifically, StdEncoding and URLEncoding. URLEncoding does what it says on the tin — it creates Base64 output that is safe for use in a URL. Standard Base64 includes characters like “+” and “/”, which require escaping in order to work as part of a URL. (It’s worth noting that percent-escaping standard Base64 also makes it longer; URLEncoding output stays the same length because it just swaps in a URL-safe alphabet.)
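A quick sketch of the difference; note that the two encoders produce output of the same length, just with different alphabets:

package main

import (
	"encoding/base64"
	"fmt"
)

func main() {
	raw := []byte{0xfb, 0xef, 0xff}
	// StdEncoding uses '+' and '/', which are not URL-safe.
	fmt.Println(base64.StdEncoding.EncodeToString(raw)) // ++//
	// URLEncoding swaps in '-' and '_' instead.
	fmt.Println(base64.URLEncoding.EncodeToString(raw)) // --__
}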

I’d written a crypto wrapper library for my Golang service that was using URLEncoding internally. However, NodeJS doesn’t have native support for URL-safe Base64. There are NPM packages for this, but if you’re like me and you hate adding a million dependencies, you just want this to work out of the box. So, I had to switch my existing Go library to use StdEncoding. Not a big deal, but it’s a Gotcha™ that tripped me up because I just didn’t expect that NodeJS wouldn’t support this. Ho hum.

Second Issue: Terminology

If you read the NodeJS crypto library documentation, it’s a) sparse, b) full of unexplained jargon and initials, and c) uses different words from the Golang library. Because of course they wouldn’t be the same.

So, what NodeJS calls an “IV,” which is a shortening of the term Initialization Vector, Golang refers to as a nonce. Nonce <sarcasm>*obviously*</sarcasm> is a mash-word for “number used once.” And, <sarcasm>as all of us crypto experts know</sarcasm>, you should never use the same Initialization Vector more than once. Don’t we all feel better, wiser, and superior to the mere mortals who don’t know all this already.

In truth, there is a subtle difference between the two terms. An initialisation vector means “choose some random bytes, used to lock in the security of the encryption algorithm.” A nonce, on the other hand, means “choose a random number with the correct number of bytes, used to lock in the security of the encryption algorithm.” Initialization vectors are usually sent along with the message, meaning your ciphertext is slightly longer than your message. A nonce (in theory) can be derived from context. In practice, however, it seems the two are used interchangeably.

For AES 256 GCM, your nonce/IV/initialization vector ought to be 12 bytes long (i.e. 96 bits). This has to do with the size of the blocks in the block cipher. An IV needs to be sized correctly for the cipher’s block size and for the specific counter method used. For GCM with AES-256, 12 bytes is standard.

This is important to get right for security and also because if you try to use the wrong sized nonce, you’ll get obscure errors like:

Error: Unsupported state or unable to authenticate data

Super helpful, that. Especially because you can get this error a bunch of different ways. In truth, ambiguous errors when attempting to decrypt a ciphertext are actually an important part of the security of the algorithm. If you can get the algorithm to output different errors by tweaking the ciphertext, you can use chosen ciphertext attacks to infer details about the plaintext. In some cases, you can even decrypt the ciphertext this way.

Unrelatedly, if you’re using a different cipher algorithm (e.g. aes256 rather than aes-256-gcm), you might get this error instead:

Error: Invalid IV length

My solution to this problem was to generate 12 bytes of randomness to use as the nonce. I chose to prepend these bytes to the start of my encoded blob.
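In Go terms, my packing code looked roughly like this (a sketch with my own names, not the exact original):

package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"encoding/base64"
	"fmt"
	"io"
)

// encryptToBundle seals the plaintext with AES-256-GCM and returns
// Base64(nonce || sealed output). Seal also quietly appends its own
// authentication data to the output; more on that below.
func encryptToBundle(key, plaintext []byte) (string, error) {
	block, err := aes.NewCipher(key) // key must be 32 bytes for AES-256
	if err != nil {
		return "", err
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		return "", err
	}
	nonce := make([]byte, gcm.NonceSize()) // 12 bytes for GCM
	if _, err := io.ReadFull(rand.Reader, nonce); err != nil {
		return "", err
	}
	sealed := gcm.Seal(nil, nonce, plaintext, nil)
	return base64.StdEncoding.EncodeToString(append(nonce, sealed...)), nil
}

func main() {
	key := make([]byte, 32) // all-zero demo key; use a real random key in practice
	bundle, err := encryptToBundle(key, []byte("hello"))
	if err != nil {
		panic(err)
	}
	fmt.Println(bundle)
}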

Lastly, secure encryption requires a mechanism for message integrity. This is done by one of many algorithms which calculate a MAC, or Message Authentication Code. I’m not going to attempt to discuss the details of this here, but there is a terminology hurdle with this notion. What some algorithms and papers refer to as a MAC, others will refer to as an Auth Tag, which is short for Authentication Tag, which is just one of an endless string of unexplained synonyms in the crypto world. Now you know.

Third Issue: Some disassembly required

Here’s where things get really gnarly. If you look at the Golang API for sealing an AES 256 GCM-enciphered plaintext message, you’ll note that it deals only with a nonce, a plaintext, and the mysterious “additionalData”:
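For reference, the sealing method on Go’s cipher.AEAD interface looks like this:

Seal(dst, nonce, plaintext, additionalData []byte) []byte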

What’s important about this is that there is something missing.

If we look at NodeJS’s implementation of the same thing, we’ve got an extra piece of required information — the AuthTag.
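In NodeJS, you have to hand the tag over explicitly before finalizing. A minimal sketch (key, nonce, ciphertext, and authTag are assumed to be Buffers you already have):

const crypto = require('crypto');

function decryptGcm(key, nonce, ciphertext, authTag) {
  const decipher = crypto.createDecipheriv('aes-256-gcm', key, nonce);
  decipher.setAuthTag(authTag); // no equivalent parameter exists on the Go side
  return Buffer.concat([decipher.update(ciphertext), decipher.final()]);
}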

So that’s an interesting little mystery. Where is this AuthTag component in the Golang library?

Also, note that the deciphering will not work in NodeJS without the AuthTag. You’ll get the extremely useful, highly unique error message:

Error: Unsupported state or unable to authenticate data

Ugh.

To resolve this mystery, I had to dig into the actual source code of the Golang library for crypto. If we crawl into the source a bit, we can see that they’re helping us out by hiding the auth tag from us entirely:
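Rather than paste the library internals here, a demonstration of the effect from the outside (this is my code, not the stdlib source): Seal quietly appends the 16-byte auth tag to the ciphertext it returns.

package main

import (
	"crypto/aes"
	"crypto/cipher"
	"fmt"
)

func main() {
	key := make([]byte, 32)   // all-zero key, demo only
	nonce := make([]byte, 12) // all-zero nonce, demo only; never reuse a real nonce
	block, _ := aes.NewCipher(key)
	gcm, _ := cipher.NewGCM(block)

	sealed := gcm.Seal(nil, nonce, []byte("hello"), nil)

	// 5 bytes of plaintext in, 21 bytes out: the auth tag is
	// silently glued onto the end of the ciphertext.
	fmt.Println(len(sealed), gcm.Overhead()) // 21 16
}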

Here you can see that Golang is using/anticipating its own structure for the ciphertext and the AuthTag. This isn’t documented in the Golang docs, of course. However, once you know the structure of the ciphertext produced by Golang’s GCM sealing code, you can easily write your own code to splice out these bits. My code looks like this:
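// A reconstruction of that splice (not my original source verbatim; the
// callback-before-params signature is explained below):
const crypto = require('crypto');

function decryptBundle(onSuccess, onError, bundle, key) {
  const raw = Buffer.from(bundle, 'base64');
  const nonce = raw.slice(0, 12);                    // first 12 bytes: the nonce
  const authTag = raw.slice(raw.length - 16);        // last 16 bytes: the AuthTag
  const ciphertext = raw.slice(12, raw.length - 16); // everything in between

  try {
    const decipher = crypto.createDecipheriv('aes-256-gcm', key, nonce);
    decipher.setAuthTag(authTag);
    onSuccess(Buffer.concat([decipher.update(ciphertext), decipher.final()]));
  } catch (err) {
    onError(err);
  }
}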

The assumption here is that the bundle is a Base64-encoded string of bits with the first 12 bytes being the nonce, the last 16 bytes being the AuthTag, and all the bytes in the middle being the ciphertext. Golang auto-appends the AuthTag to the end, and I wrote a bit of code on the Go side to assemble the bundle, prepending the nonce. With each of these pieces extracted from the Buffer, performing the decryption is finally possible.

(You might note the weird ordering of parameters to this function. It is strange to have callbacks followed by additional params. This was done because this particular callback is used in a chain of callbacks and I curried some of the arguments. Without doing so, it would have meant a lot more forwarding of arguments along the chain, which results in messy code. This way, the function that decrypts the AES key only needs to call onKeySuccess with a single argument, because the other three were already bound to the onKeySuccess function. Yay JavaScript.)

For the Golang code that does Encryption, check out the example code here. Mine is quite similar.

Conclusion

Writing cross-service, secure communication is hard. Different libraries choose to implement the ciphertext and AuthTag packaging differently. Golang is particularly problematic because it does not tell you it has hidden away some of the details. Because it’s not part of the public API, relying on this particular implementation is unsafe. They might choose to put it the other way around without telling you, because you were never supposed to reach inside the black box in the first place. Alas. Alack.

Links

Basics of what these different pieces of the puzzle actually do:

https://crypto.stackexchange.com/questions/3965/what-is-the-main-difference-between-a-key-an-iv-and-a-nonce

A fully-baked NodeJS encryption example:

http://lollyrock.com/articles/nodejs-encryption/

 

How not to join the wrong company

I recently had the experience of joining and then leaving a startup. I won’t get into the details of the company, but let’s just say it rhymes with “kitty slapper.” I stayed a month, which was long enough to understand the culture of the company, the mission, and the people. In that time, it became really apparent to me that I’d joined the wrong company for me. Once I knew, I didn’t waste any time. I let the team know and transitioned out.

So, on one front, I think I did okay. When I knew it wasn’t going to work, I didn’t belabour it. I didn’t foolishly stick to my plan despite the evidence I had that it wasn’t going to work. A younger, more driven, and more naïve me would have pushed ahead trying to make it work at all costs. I did that at Facebook and mostly just gave myself more grey hairs.

Now, I wouldn’t pretend that I left Facebook in the most mature, conscientious way. No, in truth I was really frustrated and hurt when I left. Things weren’t adding up and my two separate chains of accountability were giving me incongruous feedback. If I’m honest though, I think there was a way forward for me there. It would have just required a lot of letting go of hard feelings and finding a new team to call home.  I wasn’t willing to let go of my goals or area of focus at Facebook. So, instead of moving away from security or quitting when it seemed like there was an impasse, I just kept grinding. I gritted my teeth and pushed forward. The result? I still quit a year later, even more frustrated and demoralised than the year before.

Okay, so I didn’t repeat that mistake this time. I recognised the way was blocked by unresolvable issues and I walked away.

But there’s room for improvement here. The question I keep coming back to since I left a couple weeks ago is how did I miss these signals in the first place? I’m a very, very seasoned interviewer. So much so that when someone is interviewing me, it typically works more like I’m interviewing them. So how the hell did I miss majorly important signals that would’ve prevented me from joining that company in the first place?

After another week of self-reflection, I think it comes down to just a few things.

#1 Don’t fall subject to the authority bias

One of my favourite books is The Art of Thinking Clearly by Rolf Dobelli. The chapter devoted to the Authority Bias poignantly sums up one place where I went wrong in joining the startup I just left:

The first book of the Bible explains what happens when we disobey a great authority: we get ejected from paradise. This is also what less celestial authorities would have us believe.

When I went for one of several interviews with the CEO, I noted a clear authoritarian streak in his communication. Shockingly (at least to myself in retrospect), rather than investigate this, I bowed down. I was excited about the company and scared of not getting to join it. So, even though I had signal that *already* was telling me something wasn’t right, I ignored it. This is probably a bit of confirmation bias on my part as well, but the fact that I didn’t take the opportunity to ask the hard questions I should have is something different. It’s the authority-fearing part of me (and I guess part of everyone) who is afraid of causing trouble — even when the consequences of *not* causing that trouble turn out to be much worse.

#2 Don’t let “winning” override making a smart decision

In the interview phase of this company, I sent over some outlines that explained how I would run a team doing the work I had been discussing with the leadership team. It was very technical (as one might imagine it ought to be if I’m being interviewed to be a lead engineer). Now, from the technical people, I got a thumbs up. In fact, after a month at the company, I can see that the outline I sent over was basically spot on. Yay me. The important part about this is that I also got told that my communication “wasn’t a successful one for informing business people.”

At this point, if I’m honest with myself, I let my ego get the better of me. Incredulity overtook me and I got caught up in a haughty, “Why would business people expect to be able to understand an outline explaining how to tackle a hard technical problem?” train of thought. Now — that question is actually the perfect question to ask. However, instead of facing the reality that I was having to ask such a ridiculous thing and walking away, I tried to “win” the debate. I should have just accepted that at this particular company the top of the food chain was obviously business people rather than engineers. I should have owned the reality that their expectation would be that I would communicate extremely complex things upwardly in some sort of digest format. That if I didn’t summarise complexity digestibly enough, the communication would be a failure and blame would be assigned to me.

But I didn’t do any of that. I got caught in the trap of “No way, I’m a great communicator — I’ll prove it!” I can see that now, but it was not at all obvious to me then. In fact, after being told that my hours of free effort devising a way to tackle a hard problem with a team were wasted because a business person didn’t understand it, I was nervous. I worried that I had screwed up my prospects of getting hired. I spent one night tossing and turning hoping that my failure hadn’t undone all the positive interactions I’d had prior. Silly me.

#3 Pay attention to all the signals you have

Going into this company, I had several reports that all was not well in the state of Denmark. I had some direct feedback from current employees and I had some very telling Glassdoor reviews. Did I weigh these correctly? Nope! Again, I let confirmation bias take over my reasoning. Rather than asking the hard questions about these things, I instead asked the questions that I knew gave my interviewers enough rhetorical space to give answers I’d accept.

For instance, I should have asked, “I read on Glassdoor that X, Y, and Z appears to happen at your company. Is that true? If so, how often? Has anyone left the company as a result?” That question leaves no room for wriggling around. Instead of using the signal I had usefully, I instead asked questions like, “You guys have a bit of a reputation. How does that play out in the day-to-day running of the company?” This question gives *plenty* of room for subjectivity and soft answers. This is silly of me. Wasting a strong signal about a company — especially one that invalidates the assumption that it’s a good one to join — is about as rookie a mistake as one can make. 10+ years into my career, here I am making exactly that mistake.

#4 Checksum your intuitions about how your job will work

As I look back at my interview performances, I realise I didn’t bother to do something really important. I didn’t bother to check if the company worked in a sensible way relative to my expectations. I think my years at Facebook set me up perfectly for this. At a company like Facebook, even a “meh” team still produces a lot of great work, has amazingly talented people on it, and has plenty of remit to do their jobs well. It never occurred to me that these things might not be true at a different company. So I never thought to ask straightforward questions like:

  • If I need to coordinate with other teams in order to do my job well, what might that look like?
  • What does the reporting structure look like for engineers?
  • If I see an opportunity to improve processes across the engineering team, how can I push that forward?
  • If I see an opportunity to improve processes across the company, how can I make them happen?

If I’d asked these questions and pushed for authentic answers, I would have known that I was barking up the wrong tree. But, I didn’t. I just assumed that every company works as well as Facebook by default.

Nope.

Do you need strong mental math to pass an engineering interview?

I got a great question from a viewer of the Intro to Architecture and Systems Design Interviews video I created (https://youtu.be/ZgdS0EUmn70). The question is: If my mental math is really weak, is it OK to whip out a calculator app?

Below is my answer, republished here so everyone can (I hope) benefit from it:

For mental math vs. busting out a calculator, I think it’s probably inconsistent from one interviewer to the next. In no cases do I think it alone could possibly cost you the interview unless you’re interviewing at some extremely mathsy place (like an algorithmic trading company, for instance).

I’ve seen a spectrum in terms of judgement towards lack of mental math. One end has people like me, who couldn’t care less if you use a calculator. I would probably award you points for self knowledge and for using tools to make you more effective.

The other end of the spectrum would be someone who is very strong in maths and who doesn’t think the problems you’re tasked with are of sufficient complexity to warrant a calculator. With an interviewer on that end of the spectrum, you might lose some love, but it wouldn’t counterbalance an overall strong performance. For instance, in hundreds of decision-making discussions, I’ve never seen lack of mental math come up as a deal breaker.

There is one caveat to all of this. If you’re using a calculator for numbers that engineers *should* know, that could hurt you*. For instance, if 2^5 comes up and you have to bang it out on your TI-83, you’re probably going to lose real points. Even as a more math-lenient interviewer, I would have some serious questions about someone if powers of two don’t seem familiar. Likewise powers of two on the big end. 2^16 and 2^32 (65,536 and roughly 4.3 billion) are both important numbers that should be in your mind as a programmer.

Also sums and differences of common powers of two. If you need a calculator to sum 4096 and 4096 (it’s just 2^13, or 8192), I would knock you down a rung in my estimation.

Hope that helps!

 

* Re: Numbers engineers *should* know, here are some helpful links:

Architecture and Systems Design Interviews

Welcome back for Episode 06 of The Unqualified Engineer! I wanted to switch things up a bit this time. On an earlier episode, one of our viewers, Vahid Noormofidi, asked about system design interviews, so today we’re going to go there.

We’ve talked a lot about coding skills on this channel so far. We’ve walked through a bunch of examples of coding questions you might face in a coding interview. Some good, some bad. Some completely shit and unfair. While coding interviews are super important to succeed at in order to get a job offer, a less well-known detail of Silicon Valley style interviews is that your ‘level’ comes primarily from your performance in a design and architecture interview.

For those who don’t already know, Silicon Valley interviews have about five or six common forms, with three being ubiquitous. There’s the coding interview, which we’ve covered a lot. There’s the background/behaviour interview, which tries to determine if you’re a good person to work with and someone who will be a good addition to the team. There’s also the design & architecture interview, which focuses on your ability to take a big fuzzy problem and come up with a broad, detailed plan for solving it.

Broad and detailed? WTF?

If it feels weird to hear ‘broad *and* detailed,’ it should. Design and architecture interviews are impossibly big problems that definitely cannot be solved in 45 minutes. The goal isn’t to come up with a bullet-proof, 100% complete solution to a massive problem. Rather, it’s designed to give you the chance to show the interviewer what aspects of the problem you think about, what solutions you can come up with, and how much technical depth and diversity you’re bringing to the table. This is why it so strongly influences what ‘level’ you get hired at.

If you’re wondering what I mean about “levels,” it’s probably because you haven’t worked at a company that is as structured as a Google or a Facebook. The basic idea is that engineers (well, all employees actually) get broken into numbered groups that map coarsely to a scale of a person’s ability to contribute value to the company.

For instance, at Facebook, a new graduate from university might get hired in at Level 3. Why Level 3? Fuck if I know. I think it’s probably just self importance — we’re all just *too* special and *too* awesome to start at Level 1… Anyway, that’s how it works. Your level determines your compensation, it determines what equity stake you might get in the company, and it determines what amount of value you have to add to the company in order to keep your job. Brutal, yes, but also somewhat reasonable in practice.

So anyway, back to design and architecture interviews. They have a few subforms like product design interviews, database/data storage design interviews, etc. The most common type is a generic systems design interview.

A typical question for an interview like this has a really common form. You’ll be presented with a big problem that has only a few concrete details. For instance, in a Facebook interview, you might get asked a question like “Design a mobile app logging infrastructure that can handle a user base of 500 million monthly active users. The system needs to handle any sort of logs from mobile devices like application start, stop, errors, etc.”

Breadth

If you don’t come from a systems background, you might initially think an interview question like this is grossly unfair. And, well, you’re kind of right to feel that way. A question like this is much easier for someone with experience designing systems like this, at that scale. It’s harder for someone who hasn’t. Before you self-cycle into existential rage, try to remember that’s exactly the point of the interview.

On top of that, bear in mind that there’s no correct answer to a question like this.

There are certainly incorrect answers (i.e. don’t say “We can solve this with Mongo DB, because it’s web scale.”). However, the universe of good answers is vast. Good interviewers are going to look for you to use your experience to give the strongest performance in the domain you should be strongest in.

For instance, given the question above, let’s say you’re a mobile product engineer. The interviewer is going to be looking for you to think about the product impact of a system like this. When should you send logs from a device? What will the performance impact be of collecting logs on the device? What about supporting iOS and Android and mobile web? Will application engineers be able to log arbitrary data or is it a strictly defined log set? Those are the kinds of details your interviewer should expect you to be strong with.

On the other hand, if you’re a distributed filesystem engineer, your interviewer is going to expect you to be amazingly good at describing how to handle the scale of the problem on the server side and to think about strategies for reducing the amount of time it takes for new log data to become available. The interviewer probably wouldn’t expect you to think about the battery life implications in the same detail.

If you can deliver awesome answers to both the client and the server side of the problem, all the better. Some people can.

The important part is to play to your strengths while also showing reasonable technical breadth. If a mobile app engineer answered this question for me and kicked ass with the client-side aspects but literally didn’t address the server side at all, it would be tough for me to feel confident in hiring her. On the other hand, if she thought through as much detail as she could on the server side and gave me clear signals about what she does and doesn’t know, that’s better. Maybe she wouldn’t invent a server-side solution as well as a distributed filesystem engineer, but that’s OK. That makes sense given her skills and specialisation as a mobile engineer.

Break the problem down

One of the most critical things for you to do in this interview is to break the problem down even more coherently than the interviewer presented it. Going back to the example problem I proposed, you might have noticed that some of the details aren’t in a usable form. If you did, good. If you didn’t, you’ve got some work to do.

For instance, I said “500 million monthly active users.” What does that even mean? Can you use a *monthly* active number to design a real-time system? The answer is yes, but not directly. You need to break it down.

The rookie might do this by taking 500 million, dividing it by 30 days in a month, and then dividing that number by 86,400 (the number of seconds in a day) to figure out the concurrent users. The reason this is a naive way to approach the problem is that people live near each other and live life in daily cycles. That means that you can’t just evenly split the user base by seconds in a month as though all people are evenly spread across the world and active, using your app evenly 24 hours a day.
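Spelled out, that naive arithmetic gives:

500,000,000 monthly users / 30 days     ≈ 16.7 million users per day
16,700,000 users / 86,400 seconds       ≈ 193 users per second

Compare that to the peak-aware estimate we’ll land on below; the naive number is off by more than two orders of magnitude.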

So how do you capacity plan? If you don’t have an immediate answer, you should start by interacting with the interviewer. Ask about the geographic distribution of users. Ask about the biggest markets for the app. What timezones are most of the users in?

You should be able to arrive at the notion that (assuming the app is doing well!), some large fraction of the users will be active every single day and mostly during specific times of day. You can reasonably smoke up these numbers. Maybe you would estimate that 70% of the monthly active users are also daily actives and that 70% of those people are active at roughly the same time.

Why? Well, it’s very likely that night time in the big population centers of the world is going to be much quieter than the day time. A key insight you need to have is that to handle ‘the user base,’ you really need to worry about the peak-time traffic. So, if 70% of the users are going to be using your app every day and 70% of that 70% will be using the app all at once during peak times, your 500 million users is actually only 245 million at peak times. If we assume that means 245 million people in an hour, then we can start doing even finer-grained estimation. 245M / 60 minutes / 60 seconds nets out to about 68K second-to-second active users. That number is much, much more reasonable to plan for than a hazy “500 million.” *How many servers do you need to handle 500 million people?* is almost meaningless compared to the real question — *How many servers do you need to handle 70K concurrent users during peak hours?*
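The back-of-the-envelope version:

0.70 × 500M monthly actives = 350M daily actives
0.70 × 350M daily actives   = 245M users active during the peak hour
245,000,000 / 3,600 seconds ≈ 68,000 concurrent users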

Meta point about breaking the problem down

Did I just make up a bunch of numbers based on no data? Yes. Could they be “wrong” numbers? Well yes, in fact they probably are. Does that matter? Not as much as you might think.

The interviewer is not trying to figure out if you can psychically derive the exact user stats and behaviours of an imaginary, complex system. Rather, the interview is designed to see if you can handle a big, fuzzy need in a reasonable way. Can you take a big, insane requirement and grind through it to figure out what sort of actual technical problem it poses?

Veteran systems engineers, devops engineers, and people who have had the chance to work on large-scale problems before will have a leg up in this domain. That might seem unfair, but it’s really not. Remember, we’re not talking about school where you can study the hardest and get the best grade. This is real life. Valuable skills are valuable, and who has them and who doesn’t is not “fair,” it’s reality. If you’re expecting it to be fair, you’re going to spend a lot of time being frustrated.

Breaking the problem down continued

So, back to the problem — we’ve decided we’re talking about 70K concurrent users. Now we have to think about what that means in terms of actual throughput. If you were trying to solve this problem in the real world, your bottom line for the server side will be *How much log data are we actually talking about here?*

Right now, you might be thinking — it’s impossible to know! If you are, you’re either not thinking critically about mobile use cases or you’re not experienced enough in this domain to have an intuition about this. The second is better than the first. Let’s think about some real-world bounds. *How much network bandwidth does a phone have?* *How much storage space?* If we assume that the logs are going to be generated by user actions — remember the problem details “application start, application end, errors, etc.” — then we have an excellent frame of reference for breaking it down further. How many actions per minute can a person do on their phone? Maybe the average person could send and receive fifteen messages. Or maybe they could consume twenty photos on Instagram. Maybe they can switch apps roughly four times. In any of those cases, we’re not looking at a massive amount of data here.

The novice will assume that we just need to guess some reasonable average number and stick with it (e.g., “half a megabyte, tops”). The better engineer will grind through some concrete estimations based on the size of the data. Maybe we’re talking about 20+ 32-bit integers that refer to entity IDs (photo IDs, user IDs, etc.) at a minimum, probably some overhead for a data encoding like JSON, probably some strings for things like event types, and in the error case hopefully something like a stack trace to aid in debugging. So, at least a few hundred bytes per minute, but with an upper bound that could be pretty big in error cases (them stack traces, yo).
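If you want to see how those guesses turn into “a few hundred bytes per minute,” here’s a sketch. Every count in it is an illustrative assumption, not a measurement:

```python
INT_FIELD_BYTES = 4        # one 32-bit entity ID (photo ID, user ID, ...)
num_int_fields = 20        # the "20+ 32-bit integers" per minute
events_per_minute = 10     # assumed user actions logged per minute
event_type_bytes = 15      # e.g. the string "application_start"
json_overhead = 2.5        # keys, quotes, and braces roughly 2-3x the raw size

raw = num_int_fields * INT_FIELD_BYTES + events_per_minute * event_type_bytes
encoded = raw * json_overhead
print(f"~{encoded:.0f} bytes/minute in the happy path")   # a few hundred bytes

# Error cases blow the estimate up: a single stack trace can be kilobytes.
stack_trace_bytes = 4_000
print(f"~{encoded + stack_trace_bytes:.0f} bytes/minute with one error")
```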

The best way to answer here — in my opinion — is to have an opinion and to make intentional, reasonable tradeoffs. Burning through hundreds of kilobytes of data per minute is a guaranteed shitty and expensive experience for someone using your product. So, be an empathetic system designer. Come up with a guideline and set some numbers based on giving a shit about the user experience. Perhaps that number is 5KB of log data per minute as a maximum; anything more gets discarded. That means that if a person uses your product for an hour, they’ll only rack up 300KB of log data that they’re paying for out of their own hard-earned money. That’s like asking them to download an extra image every hour or so in order to give your company the data you need to make sure the app is working and to understand why it’s not. A reasonable tradeoff.
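If you wanted to show the interviewer what that guideline looks like in practice, a client-side budget is only a few lines. This is a minimal sketch assuming the 5KB-per-minute cap above; the class and method names are hypothetical, not any real client API:

```python
import time

class LogBudget:
    """Discard log entries once the per-minute byte budget is spent."""

    def __init__(self, max_bytes_per_minute: int = 5 * 1024):
        self.max_bytes = max_bytes_per_minute
        self.window_start = time.monotonic()
        self.bytes_used = 0

    def try_log(self, entry: bytes) -> bool:
        now = time.monotonic()
        if now - self.window_start >= 60:
            # A new minute has started: reset the budget window.
            self.window_start = now
            self.bytes_used = 0
        if self.bytes_used + len(entry) > self.max_bytes:
            return False   # over budget: drop the entry on the floor
        self.bytes_used += len(entry)
        return True        # caller can queue the entry for upload

budget = LogBudget()
budget.try_log(b'{"event": "application_start"}')
```

Dropping data on the floor sounds scary, but it’s the whole point of the tradeoff: the user’s data plan matters more than your hundredth redundant log line.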

And again, yes, this is a made-up number. The specific number is not as important as why you chose it and whether or not that choice was reasonable. I could have said 500KB per minute (I wouldn’t, because not even Facebook’s bloated apps use that much bandwidth for metadata), but the most important part is the *why* behind that number.

The reason I think taking a strong stance on how a system should work is the better option is that it shows experience, leadership, and ownership. Sure, I could just let the users’ activity dictate some reasonable number of logs per minute, but what if my estimation leads me to some big number like 5MB per hour? My inner nurturer needs to kick in and tell me that logging that aggressively is abusive of user trust. My inner data analyst should be telling me that 5MB of data per hour per user is a glut of data that will mostly be giving redundant signal. User empathy, experience, and technical leadership should be the guiding light here over a raw calculation.

In a real company doing real work, that’s what actually happens, after all. Product decisions are made by people trying to make the smartest tradeoffs. There are no right answers. The only wrong answers are ones that drive away users or lead the company to fail. The right answers are many and varied. This is why design and architecture interviews are so important for determining people’s levels.

Keep Going

This discussion is already massive and wide-ranging, yet we haven’t even talked about the concrete details of a client implementation or a server implementation. We haven’t talked about client-side caching. About mobile networks versus WiFi delivery of logs. We haven’t talked about data-saving strategies like user sampling (sketched below). We haven’t talked about how to collect log data on the server side or store it long-term. We haven’t talked about how many people it would take to build a system like this. We haven’t talked about how many servers we’ll need to handle the peak traffic. We haven’t talked about how much that will cost. We haven’t talked about countless absolutely critical aspects of the problem. Nor could we in 45 minutes. That’s why you, as the interviewee, must keep going.
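As an aside, user sampling, one of those untouched topics, is the kind of thing you can sketch in thirty seconds if asked: hash the user ID so the same users are always in or out of the sample, then only collect logs from a stable fraction of the user base. A hypothetical sketch:

```python
import hashlib

def in_logging_sample(user_id: int, sample_rate: float = 0.01) -> bool:
    # Hash rather than modulo a raw ID: IDs are rarely uniformly distributed.
    digest = hashlib.sha256(str(user_id).encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64   # uniform in [0, 1)
    return bucket < sample_rate

# At a 1% sample, ~68K logging users per second at peak becomes ~680,
# at the cost of coarser visibility into rare bugs.
```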

At any point, if you’re looking to your interviewer for guidance about how to proceed, you’re losing points. Now, don’t get me wrong on this point. Needing to bounce ideas off the interviewer or get clarification is fine. Also, you might be legitimately stuck on the problem. If so, by god man — ask for a pointer. I’ve seen many interviews that resulted in a job offer include feedback like “The candidate started off strong but needed a small pointer on X. After that, she did really well.” Don’t be afraid to ask for help. Still, you should bear in mind that the more help you need, the more likely it is that you’re showing a lack of experience, creativity, and/or technical leadership.

I don’t know anything about image processing pipelines, for instance. But, if you give me a problem related to image processing pipelines at scale, I have high confidence that I can keep exploring the problem space making meaningful progress for 45 minutes. In all honesty, I love these kinds of challenges and find them really energising. Not everyone does, but everyone who is a strong technical leader *can* do these kinds of explorations.

Failing this interview

One interesting thing to consider on this point is that Facebook, at least, doesn’t even do design/architecture interviews for new graduates. They used to, but consistently found that the candidates just didn’t have enough experience with applied systems problems to even begin solving them.

Not having enough experience to digest a big problem like this doesn’t always mean you’re fucked. It is an important signal, though.

For instance, I was once on an interview loop where we all decided to hire the candidate even though he fared poorly in the architecture interview. His coding skills were strong, and he seemed passionate about our company as well as a good person to work with. We didn’t end up hiring him, though.

When the interview feedback made it up to the director level, the director stopped us and rejected the candidate. I was confused. Turns out the director did some mental math that I didn’t have the foresight to do. Yes, the coding skills were good, and the candidate was a reasonably good communicator, team player, etc. However, he was lacking in design/architecture ability. This on its own is no failure, but the candidate had 8 years of industry experience. He had worked at a strongly technical company for that long without ever taking on enough leadership to get good at reasoning about large-scale systems. That is a red flag.

Like Bob Dylan says, you’re either busy being born or busy dying. In this case, our candidate had plateaued in his career — so, he was busy dying. It could have been for a million reasons, but for the director making the call, it meant making a tough decision about whether or not we could expect this candidate to keep advancing in his career.

At Facebook, at least, you have to keep advancing until at least level 5, which is roughly equivalent to “senior engineer” or “tech lead” in other environments. After that, you’re doing valuable enough work to stick around without getting promoted again. This candidate would not have gotten a level 5 offer, and since he was already 8 years into his career, it’s unlikely he ever would have. So, we would have been sentencing him to a short-term career at Facebook at best. It was definitely the right call.

Final points

Architecture interviews are formidable, open-ended problems that you definitely cannot exhaustively solve in the time allotted. If you have no idea how to solve these kinds of problems, you might start by checking out the engineering blogs of companies like Google, Apple, Dropbox, and others. The amazing thing about architecture is that most of the best companies are sharing all their work.

Even if you have no background in the work, you can familiarise yourself with the common patterns of system design by reading diversely from the blogs on the topic, watching YouTube videos of tech talks from conferences, etc. If that feels like cheating, it shouldn’t. After all, the reason engineers at big tech companies are good at solving these kinds of problems isn’t usually because they do it all day, every day. Rather, it’s because they get exposure to the solutions in internal tech talks and write-ups. The information is available. Go turn it into knowledge.