Choosing your tech stack, 2020 Edition

Part 2/3: Server Side Tech Picks

This is the second part of a three-part series about choosing a technology stack for a startup in the modern (at least as of the time of this writing) world. The last post focussed on building a client side web app using modern tech. For this post, we’ll dive into the server side tech choices.

This post is going to focus specifically on implementing a server and database for our use case. I’ll touch briefly on things deeper in the stack than these components, but our devops choices will come in part 3 of this series.

Our Constraints

As we discussed before, we’re building a tool that enhances other tools. We want to make any existing SaaS tool into a dynamic, multi-player tool that feels connected: to turn the experience of working across many SaaS tools into a coherent whole rather than a series of disconnected silos. We’re building features that help a team feel like they’re working together, even if they have never set foot in the same room. Our application architecture demands a mixture of real-time and asynchronous features.

These constraints dictate many key pieces of our server-side product infrastructure. For one, we need authentication and sessions. We also need a way to communicate between the client and the server that works both synchronously and asynchronously. In particular, we need a server that can push updates to the client in real time.

The pivotal constraint for us is that we want a blisteringly fast pace of development for the product. Engineers working in our stack should be able to ship features as quickly as possible, experiment, throw things away, and redeploy changes in minutes. Our current continuous integration runs in about 3 minutes.

Why this need for speed? Well, for one, it just feels great to work in a high-speed tech stack. For two, we’re still early in our product’s evolution. Scaling our server-side infra is not a constraint yet. In fact, if we focussed on building an infinitely scalable server, we’d be wasting time. Right now, we’re incrementally transitioning from early user tests to true product market fit. For our stage of life, things like maximum I/O parallelisation or complete database race condition safety just don’t matter. What does matter is being able to iterate on product features, test them with real users, understand if they work or not, and then iterate again. So, our tech stack choices reflect the need for this flexibility over alternative choices (e.g. choosing a “high performance” language like Go or Rust; division of responsibilities like microservices; etc.).

The Picks

NodeJS

One of the very first questions to answer when designing a server architecture is ‘What language/run-time should the server be built in?’ For us, this answer was trivially easy to reach — NodeJS. While server-less architectures using AWS Lambda or Firebase are becoming quite trendy, they suffer from some really unfortunate deployment pathologies. Our philosophy is that it should be stupidly easy to know what is deployed — from the server to the database. Because of this, we opted for a TypeScript monorepo which deploys atomically as a NodeJS server. Sure, this means we can’t re-deploy individual functions the way we could if this were a Firebase app. Honestly though, the server-less trend seems like a really false economy in practice. With a monolith, we actually know what versions of the code are running at any given moment and we can trivially guarantee that all the pieces fit together properly all the way from the database schema to the client-side React UI components. For similar reasons, we did not build a microservices-based backend.

We could have chosen Golang for the backend, or something built in Python using Flask, or another hand-rolled server of some sort. Golang would give us obvious performance and concurrency benefits. Python has a fantastic ecosystem of pre-fab libraries. However, both of them share one huge drawback: they aren’t JavaScript. Try as we might, no one has yet killed the warty-but-loveable beast that is JavaScript. They came with their torches and pitchforks and ActionScript in the 2000s. They came with their big promises and Dart in the 2010s. They now sing siren songs of WebAssembly in the 2020s. Yet, JavaScript remains. It is inevitable, like Thanos, but like… actually.

As such, since we’re going to have to build a JavaScript app no matter what, the best we can do to make the overall development experience really, really good is build everything in JavaScript. Sorry Golang. Sorry Rust. Sorry Python. Sorry Haskell. Sorry Erlang. Building an application in multiple languages quickly is much, much harder than building an application in one language quickly. We want our engineers to be able to deliver value across the entire tech stack deftly. We want to benefit from shared libraries and well-crafted language idioms whether they’re building a complex client-side UI component or they’re refactoring our server-side data caching logic.

Using multiple languages creates a surprising number of barriers to contribution, the first and most obvious of which is that not everyone will know all the languages. On top of that, you instantly lose the ability for one engineer to make sweeping changes across the entire codebase with a single commit.

Often referred to as “codemods” (see https://github.com/facebook/codemod), these sorts of changes enable a small number of engineers to maintain massive amounts of code. If we’d chosen the industry standard practice of multiple repositories hosting multiple projects in different languages, such codemods would be impossible. They’d be relegated to at most one repository at a time and likely even worse than that because not all the repos would be in the same language. This leads to differing code styling standards, differing code review standards, differing CI pipelines, differing linters. And infinite — I mean — infinite bike shedding about the right way to do things. Ain’t nobody got time for that.
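
To make that concrete, here’s a minimal sketch of a codemod written as a jscodeshift transform (jscodeshift being the JavaScript-ecosystem cousin of the tool linked above). The function rename here is hypothetical:

// A hypothetical codemod: rename a function across the whole codebase.
// Run with: jscodeshift -t rename-transform.ts src/
import { API, FileInfo } from 'jscodeshift';

export default function transformer(file: FileInfo, api: API) {
  const j = api.jscodeshift;
  return j(file.source)
    // Find every identifier with the old name...
    .find(j.Identifier, { name: 'fetchUserDeprecated' })
    // ...and rename it in place.
    .forEach((path) => {
      path.node.name = 'fetchUser';
    })
    .toSource();
}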

TypeScript

If you read Part 1 of this series, you won’t be surprised to see TypeScript here. Most of the mixed blessings of TypeScript apply on the server-side as much as they did on the client, but for NodeJS application development, TypeScript offers a lot of value. Server-side development tends to be a bit less messy with respect to types than client-side. Part of the fiddly-ness of TypeScript in the browser is the DOM API itself. For server side development, you just don’t have to fool with it, which is pretty great.

Postgres

For data storage, we’ve opted for a tried and true, fast, stable, relational database. Server-less apps running against schema-free databases are all the rage because they make for ultrafast prototyping. If you’re creating a build-and-forget backend for a hackathon, that stack choice is totally reasonable. However, if you need to maintain your data over a long period of time, schema-free databases like DynamoDB or MongoDB might trip you up unexpectedly. I can’t tell you how many teams of engineers I’ve interacted with who don’t properly migrate their schema-free database because “but it’s schema-free, you don’t have to know what fields you’ve set on every record!” That’s great if you don’t want to be troubled by keeping track of what you’ve stored…

However, if you want your data to match the code that operates on it, you need to manage this relationship. There’s nothing more difficult to debug than a random bug report from a user in the wild: some poor user wrote records to the database before you added (or after you removed) some field, and now when she loads the app, it crashes because you can’t read some property of undefined. So, so many engineer lifetimes have been wasted debugging problems that boiled down to “Oh, wait, for this one random user the data is actually corrupt because they happened to hit the save button the day before we added the spline-reticulation-rate field, so her app can never load without errors. Fml.” Yet, these kinds of bugs happen all the time in modern codebases, specifically because the engineers don’t operate with some discipline about their data. Ho hum. Not us.

But I agree… maintaining the schema of your data is a chore. That’s why we use a migration framework which handles upgrading and downgrading our database for us. All our engineers have to do is write correct SQL migrations that add and remove fields or tables and correctly transform the existing data. After that, our CI tooling handles the rest across our environments. Easy peasy, lemon squeezy.

Sequelize as an ORM

Using an ORM in a large-scale, high-performance application is usually a fool’s errand. Trying to get performance out of queries you don’t have direct control over is a bit like trying to assemble a ship-in-a-bottle while wearing oven mitts. Fortunately for us, we’re not a large-scale, high-performance application — yet. For now, speed of development is the dominant factor in the Big-O equation of our company’s success. We will almost certainly rip out Sequelize’s ORM functionality over time, replacing it with carefully designed indexes and queries. For now though, we can add types, find, update, and delete stuff, and generally crack on with the work without stressing about every join clause or order by. I haven’t written a single raw SQL query apart from a CREATE / ALTER / DROP TABLE query in months.
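
For a flavour of the day-to-day, here’s a minimal sketch (the User model and its fields are hypothetical, not our actual schema):

import { Sequelize, Model, DataTypes } from 'sequelize';

const sequelize = new Sequelize(process.env.DATABASE_URL!);

// A hypothetical model, purely for illustration.
class User extends Model {
  public id!: number;
  public email!: string;
}

User.init(
  {
    id: { type: DataTypes.INTEGER, autoIncrement: true, primaryKey: true },
    email: { type: DataTypes.STRING, allowNull: false },
  },
  { sequelize, tableName: 'users' },
);

async function demo() {
  // Typed creates, reads, updates, and deletes with zero hand-written SQL.
  const user = await User.create({ email: 'new.hire@example.com' });
  const recent = await User.findAll({ order: [['id', 'DESC']], limit: 10 });
  await user.update({ email: 'new.hire@getradical.co' });
  await user.destroy();
  return recent;
}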

For the TypeScript-savvy folks, you might be surprised to see Sequelize as our pick for ORM rather than TypeORM, given that TypeORM is… well… typed. We started with TypeORM and, within a month, hit so many issues and so many known-but-unfixed problems with TypeORM that we ended up ripping it out. As of the time of this writing, the TypeORM project has more than 2k open issues, which is actually staggering for a codebase with only 4k commits in it. Your mileage may vary, but TypeORM was not the ORM for us.

Sequelize for migrations

Sequelize also offers a really nice migration framework, which we’re likely to keep well into the future. I’ve worked previously with schema-free codebases (like Mongo) and codebases where the schema was maintained independently of the codebase. Having migrations tied to version control is an absurdly better world to be in. I’ll never go back. Don’t make me go back.
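
Here’s a sketch of what one of those migrations can look like with Sequelize’s QueryInterface, reusing the spline-reticulation-rate field from the horror story above:

import { QueryInterface, DataTypes } from 'sequelize';

export default {
  // Runs when upgrading the database.
  async up(queryInterface: QueryInterface) {
    await queryInterface.addColumn('users', 'spline_reticulation_rate', {
      type: DataTypes.FLOAT,
      allowNull: false,
      // Backfill existing rows so no user's app ever loads against
      // records that predate the field.
      defaultValue: 0,
    });
  },
  // Runs when downgrading, so environments can roll back cleanly.
  async down(queryInterface: QueryInterface) {
    await queryInterface.removeColumn('users', 'spline_reticulation_rate');
  },
};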

GraphQL with Apollo

If you read Part 1 of this series, you’ll know I have mixed feelings about GraphQL. It’s heavy. Like heeeeaaaavvvvy. You have to do a heap of work just to get the basic value of the abstraction. If the entities in your system don’t have a lot of complex relationships, it’s probably not for you. However, in our case, we have lots and lots of relationships in our schema. For instance, an organization has users. Also, users have messages. And messages can have mentions, which themselves can have users. In this sort of application, GraphQL begins to show its value peeking out over the mountaintops of extra work you need to do to get it up and running.

Why? Well, because a user is such a core type in our system, we get an economy of scale out of defining it once in GraphQL. Once we’ve done the heavy lifting of GraphQL-ifying the user type, we can use it over and over again everywhere that it comes up. This is because GraphQL makes entity composition trivial. The more types you have and the more relationships they share, the more valuable GraphQL becomes.
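
Here’s a sketch of that schema shape, with illustrative type and field names rather than our real ones:

import { gql } from 'apollo-server';

export const typeDefs = gql`
  type Organization {
    id: ID!
    users: [User!]!
  }

  # Define User once; every other type that touches users reuses it.
  type User {
    id: ID!
    name: String!
  }

  type Message {
    id: ID!
    content: String!
    sender: User!
    mentions: [Mention!]!
  }

  type Mention {
    id: ID!
    user: User!
  }

  type Query {
    message(id: ID!): Message
  }
`;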

Let’s compare that with a vanilla JSON API. If we started both versions of the server on the same day, Team Vanilla JSON API would easily race ahead at the start. Then, as the number of entities and the number of relationships between them grows, Team Vanilla JSON API would start to become ponderous — needing to reassemble the same queries and transformations again and again. Team GraphQL, by contrast, would get left on the starting line, toiling away trying to get their resolvers and mutations in order. However, as the race wore on, Team GraphQL would start to make huge leaps. Once the core types were defined and GraphQL-ified, Team GraphQL would basically never touch them again and would instead only be building the new entities. They wouldn’t have to worry about which fields of the sub-sub-queries to expose, because their earlier work would already account for that. Over the course of a year, despite taking longer to lurch into motion, Team GraphQL would have enabled way more product iteration.

A basic example would be the @-mentions in our chat. To render a message with an @-mention in the UI, you have to fetch: the message, the message sender’s user data, the mention data, and the user data of the user that was mentioned. Since we have a well-defined User GraphQL entity, we don’t have to do any extra work to fetch the data required to display both the sender’s user info and the mentioned user’s info. They leverage the same GraphQL entity. To achieve the same value with a JSON API, we’d have to write custom server-side code to fetch the message, then fetch the sender’s user info, decide which fields to expose, then parse out the @-mention from the message, then fetch the mentioned user’s info, then decide which of those fields to expose in the response, then assemble a JSON payload that includes all of this information. We’d also have to think about things like batching these fetches if we’re rendering a list of messages. And if we later modified the User entity to have some new fields, we’d have to maintain all that custom JSON API endpoint code, too. What a pain. While GraphQL slows us down at first, it speeds us up shortly thereafter in a way that vanilla JSON APIs can’t catch up with.
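
Against a schema like the sketch above, that whole fetch collapses into a single query, reusing the same User selection for the sender and for each mention (field names illustrative):

import gql from 'graphql-tag';

export const MESSAGE_WITH_MENTIONS = gql`
  query MessageWithMentions($id: ID!) {
    message(id: $id) {
      id
      content
      sender {
        id
        name
      }
      mentions {
        user {
          id
          name
        }
      }
    }
  }
`;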

But let me be real with you — the itch to just rip all this GraphQL out and replace it with a dead simple JSON API that stringifies data pulled straight from a raw SELECT * FROM ... statement is… strong. When working with a GraphQL backend, you feel the pain of all the overhead more acutely than when working on the client. Again, this is a reasonable choice for us because we want to be able to iterate on the product as fast as possible. Whatever form our UIs take, the backend data types are quite stable. So, GraphQL unlocks a lot of value for present and future product development. If I had it to do all over again, I’d probably choose it again, begrudgingly. I really wish it were less heavyweight. Codegen can theoretically save you some of the pain of defining all these JS-to-GraphQL mappings, but then you have both GraphQL and codegen to worry about. There’s no free lunch there.

WebSockets

In our application, real-time interactions are hugely important. We’re building a product that makes people feel connected and present, even when they’re not in the same room together. Surprising moments like “Oh, look, my teammate and I are on the same page at the same time” make the experience of working together-but-apart much more serendipitous and rewarding, like being in the office together. To enable that, we have loads of real-time, in-memory state that reflects the ephemera of day-to-day work. Without something real-time like WebSockets, we would struggle to keep entire teams in sync. If I add an annotation to a page in our shared SaaS tool, my teammate needs to know about it instantly. We can’t wait for the teammate to reload the page, whenever that might next happen.

Apollo helps a lot with this. Using GraphQL subscriptions with Apollo means we have a really clean API for triggering data pushes down to all the currently connected clients. We also have the flexibility to put these pushes before, after, or concurrently with database transactions, depending on the desired user experience of the particular real-time feature. Lovely.
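
A minimal sketch of the pattern, assuming a hypothetical messageAdded subscription (the real thing involves auth, filtering, and so on):

import { PubSub } from 'apollo-server';

const pubsub = new PubSub();
const MESSAGE_ADDED = 'MESSAGE_ADDED';

export const resolvers = {
  Mutation: {
    async sendMessage(_parent: unknown, args: { content: string }) {
      const message = { id: 'm1', content: args.content };
      // Persist to Postgres here, then push to every connected client
      // over the WebSocket transport. The publish can happen before,
      // after, or alongside the database transaction.
      await pubsub.publish(MESSAGE_ADDED, { messageAdded: message });
      return message;
    },
  },
  Subscription: {
    messageAdded: {
      subscribe: () => pubsub.asyncIterator([MESSAGE_ADDED]),
    },
  },
};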

Tradeoffs

The dominant tradeoff of our server-side codebase is product development velocity over literally everything else. We haven’t chosen the most strongly typed language. TypeScript is a fairly good type checker, but it’s far from the most opinionated type system. We certainly haven’t chosen the most high-performance programming language. A Golang implementation of our backend would likely be 10 to 100 times more efficient. However, with Golang, we’d also pay the bill of having two languages, two sets of frameworks, two sets of idioms, two sets of code review standards, two different development environments to maintain, etc. Each one of these small frictions adds up to lost developer velocity.

We’re a pre-product market fit company. We don’t yet know what features will be the winning ones that shape the company’s future. We don’t yet know what the customer growth trajectory will look like. We also don’t know how much we’ll need to scale the engineering team to match the adoption we get once we launch. Success for us right now is definable as building out product features as fast as possible, testing them with users, and honing our product vision based on those outcomes. Anything that gets in the way of that iteration loop is a drag on our success.

One view of how to proceed would be to cobble together as many open source libraries as possible to build some vaguely product-shaped Frankenstein’s monster just to test the product in front of users. Many, many companies take this approach. Some good, most of them… bad. You often see this approach led by a non-technical founder who is outsourcing development to an inexpensive, remote dev team. This approach is deeply unwise unless you never need to scale the product or infrastructure. It’s almost always the wrong tradeoff because it trades quick-and-dirty features today against your ability to grow with your success tomorrow. What happens if your product gains any traction at all? If you take this approach, the first thing that happens is that the application falls apart and you have to scramble to find great developers to undo and rebuild all the work you already paid for once — if you can get them at all. Most talented engineers will take one look at a tech stack in such a poor state and keep looking for a better opportunity. What you end up with is contractors who are happy to saunter leisurely through your technical debt for a handsome day rate. That’s penny-wise, pound-foolish thinking.

Another view of how to proceed is to jump straight to a perfect implementation of the product and infrastructure on the first pass. If you’re literally psychic or luckier than Harry Potter high as a kite on Felix Felicis, you can make this tradeoff. It requires more time, more money, and more prescience than most startups have, certainly more than we do. For the mere mortals and muggles among us, building the “perfect” infrastructure is basically meaningless. Until you have real users using your product, you don’t know what your product is. You don’t know what usage patterns will create what pressures in your backend requiring optimisation. You don’t know what killer features you’ll discover when someone in a user interview casually opines that they “just wish the product did X” — where X is some unbelievably brilliant insight about the user experience that your team never had. So, this approach is infeasible and unwise, despite being extremely tempting to many tech teams, especially ones filled with PhD engineers or engineers from FAANG-esque companies who want to flex.

The wisest approach is a pragmatic compromise: move as fast as possible while building the application in a way that doesn’t block you from changing and optimising it later. That’s where we’ve landed. We have a really fast pace of product iteration and a super tight feedback loop between user testing and product development. Our stack is optimised exactly for this purpose. It hinges on the fact that a single engineer can make changes across the full stack in a single commit and see the outcome in production 3 minutes later. No waiting for code review from 5 different people, wrangling 5 different pull requests to 5 different repos, all written in different languages. This is the way.

Controversial Hot Takes

Monolith versus microservices versus server-less

We chose a monolithic backend. This means everything is in-memory on the same machine until it can’t be due to scalability constraints. Of course we will eventually have services that get sharded out from the monolith. However, we won’t call them microservices. We’ll call them services. And they won’t be arbitrarily small function-calls-over-REST like many microservices tend to be in practice. Instead, they’ll be whatever size they need to be to coherently solve the problem they need to solve. Everything that has similar performance and/or storage needs will get sharded out as a service.

As of the time of this writing, the most likely first candidate for this is a WebSocket termination service/load balancer that will act as a façade to our backend, allowing WebSocket connections to stay connected between service restarts.

What will the next service be? I don’t know. Ask me again when we’ve had sufficient growth and user adoption to know what parts of our monolith end up dominating any of: CPU, memory, or network. When one piece of the application is behaving sufficiently differently from the rest of the application that they no longer make sense running in the same process, that’s when we’ll break it out.

Server-less architecture is a non-starter for us. Because our application hinges on real-time, push-to-client interactions, there is definitely going to be a server. We could use some pre-baked WebSocket-as-a-service utility from an existing provider, but that’s just paying someone else to handle complexity that our app definitely has. The lock-in that these sorts of tradeoffs create is very likely to kill our flexibility down the road. Right now, we can choose literally any path we like for how to tweak the server to handle larger-scale traffic. If we were paying Google Cloud for the privilege of using their real-time infrastructure, we’d be building our application around that constraint. We’ve got a sufficiently talented engineering team to manage that complexity ourselves, and the tradeoff of getting to shape our application around our actual constraints rather than the arbitrary constraints imposed by a 3rd party service is absolutely worth it.

Postgres versus Mongo/DynamoDB/etc.

The MERN stack gets a lot of hype these days. In many coding bootcamps, it’s taught as the de facto standard stack for modern development. Call us crazy, but we like doing far out things like selecting data by date ranges. Or joining data by foreign keys. These arcane arts are not lost, though their power has been largely forgotten by recent generations of developers. We don’t currently and likely never will use MongoDB in our stack. There is some potential for a caching layer like Redis to benefit us, but at the moment we can do everything flawlessly in Postgres.

I talked a lot about the downsides of schema-free databases earlier in the post, and the same holds true here. We want to know exactly what our data looks like. We want to be able to shape it commit-by-commit with clearly defined migrations. Schema-free databases are easier at the start, but then tie your hands with respect to what features you can build later without writing your own data processing to create intermediate data representations and caches. And woe be unto you if you write data to the DB through some path that doesn’t know to update the aggregations and lists stored elsewhere. No thanks. We’ll take the burden of a relational datastore any day.

Choosing your tech stack, 2020 Edition

Part 1/3: Client Side Tech Picks

We started Radical in 2020 in the middle of a pandemic. Our mission is building absolutely killer real-time and asynchronous collaboration features that work with any existing SaaS tools.

In other words…

Collaboratify all the SaaSes!

This idea sounds simple enough, but when you dig into what’s required to deliver it, you start to see a lot of unique constraints emerge, which lead us to some very specific technical choices.

TL;DR: If you just want to know what tech stack we’re running, here’s a quick rundown. For more detail, see below. We’ve chosen TypeScript for both server and client; modern React with Hooks for the client UIs; GraphQL for the client-server interaction; Apollo on the client and server using WebSockets; React-JSS for managing styles; Webpack + Babel for transpilation.

Our Constraints

We’re building tools that enhance tools. That in itself creates some super interesting constraints for how we have to structure our tech stack. For instance, we can’t expect to have control of the page our code runs in. Our implementation has to play very nicely with our neighbours.

We have a heavy focus on both real-time and asynchronous features. On one extreme end, we want to be able to reflect up-to-the-second information like whether or not your teammate is typing a message. On the other end, we want to be able to reflect an entire conversation history in a tool quickly. This means we have to support short-term storage features, a lot like how WhatsApp works, and we have to support long-term storage features like complete-history retrieval, like Slack.

The Picks

TypeScript

TypeScript is a mixed blessing everywhere you take it. It means that writing the code in the first place will be slower and more annoying. There’s definitely a learning curve to all the finicky interactions between React and TypeScript. Want to create an input event listener outside of your JSX markup? Cool, just learn the magical incantation e: React.ChangeEvent<HTMLInputElement>. Dead simple?! Right!!? Well, yes, once you’ve scratched your head and read docs for a while. Still, once you’ve paid the bill to create the TypeScript code correctly in the first place, it becomes indispensable for maintaining your codebase. I’ve now bounced between TypeScript and non-TypeScript codebases a few times and I miss TypeScript annotations every time.
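
Here’s a sketch showing both halves of that: the magical event type and a component with typed props (names all hypothetical):

import React, { useState } from 'react';

interface GreetingProps {
  name: string;
  excited?: boolean;
}

const Greeting: React.FC<GreetingProps> = ({ name, excited }) => {
  const [draft, setDraft] = useState('');

  // The incantation: an input listener defined outside the JSX markup.
  const onChange = (e: React.ChangeEvent<HTMLInputElement>) => {
    setDraft(e.target.value);
  };

  return (
    <div>
      <p>Hello, {name}{excited ? '!' : '.'}</p>
      <input value={draft} onChange={onChange} />
    </div>
  );
};

export default Greeting;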

Wow, no mystery about what properties this component supports!

The downside of the React + TypeScript combo is that you’ve pretty much guaranteed your compilation is going to feel sluggish. If you work locally on a reasonably modern laptop, it’s tolerable. Still, even after some optimisations, we’re seeing warm-recompile times in the low single-digit seconds. That’s just slow enough to feel sluggish. No whizbang tech is free. Come on Deno!

Choosing TypeScript has huge implications for the rest of the stack, so I’ve listed it first. If you start off with TypeScript, you have to think about the TypeScript implications of every next technical choice. DefinitelyTyped will save you some of the time, but some projects are not TypeScript friendly and you’ll be incentivised not to choose them.

Visual Studio Code

It may seem strange to include VS Code in the Tech Stack discussion, but the raw truth is that if you’re using TypeScript, you’re using VS Code. I grew up in the generation of developers who love to hate Microsoft. But I have to admit VS Code is the best IDE I’ve ever used. Ugh. I hate saying that. The ultra rich support for TypeScript is a killer feature. I still work in Vim/iTerm most of the time, but when I’m hacking React, I’m in VS Code (with Vim bindings!).

React

React was the easiest of the choices for this project. Especially since Hooks have come out, React is just the clear, clear winner over everything else out there right now. In my last role, I ported an existing PolymerJS app into modern React and generally had an excellent experience with React. The tooling support is excellent. The React core team are beasts with incredible velocity. Couldn’t be happier with this choice.

Runners-up are things like AngularJS or VueJS, but honestly, it’s not a remotely close race.

GraphQL with Apollo

GraphQL. Ooph. Where to start. Is it good? Yes. It’s really good. Especially with all the nice things you get from Apollo. Having pre-built hooks for queries and mutations is awesome. Being able to create queries that get you exactly what you need. Killer. When combined with TypeScript, this means that you get back typed, well structured data from your queries, which is pretty great. The downside is that GraphQL isn’t JavaScript or TypeScript. It’s GraphQL. Yet another language to support in a project that is, at least theoretically, JavaScript.
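
A sketch of the typed-query pattern, with hypothetical hand-written types standing in for what codegen can produce:

import React from 'react';
import { useQuery } from '@apollo/react-hooks';
import gql from 'graphql-tag';

const GET_TEAMMATES = gql`
  query GetTeammates {
    teammates {
      id
      name
    }
  }
`;

// Hypothetical hand-written types; codegen can generate these instead.
interface Teammate {
  id: string;
  name: string;
}

interface GetTeammatesData {
  teammates: Teammate[];
}

const TeammateList: React.FC = () => {
  // data is typed as GetTeammatesData | undefined. No guessing.
  const { data, loading, error } = useQuery<GetTeammatesData>(GET_TEAMMATES);

  if (loading) return <p>Loading…</p>;
  if (error || !data) return <p>Something went wrong.</p>;

  return (
    <ul>
      {data.teammates.map((t) => (
        <li key={t.id}>{t.name}</li>
      ))}
    </ul>
  );
};

export default TeammateList;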

The GraphQL query UI is fantastic. Super useful for debugging and for composing exactly the right query. No Swagger or Postman UIs for our frontend engineers.

React-JSS

Oh my goodness. Someone took CSS and found a good way to represent it in JavaScript? Shut up and take my admiration. This one is an easy yes. CSS-in-JS for life. Remember all the hype about CSS variables? Guess what’s better than CSS variables. Actual TypeScript variables!

Pro Tip™: Always alphabetise CSS declarations.
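
For instance, a minimal sketch (declarations dutifully alphabetised):

import React from 'react';
import { createUseStyles } from 'react-jss';

// An actual TypeScript variable, not a CSS variable.
const accent = 'rebeccapurple';

const useStyles = createUseStyles({
  button: {
    backgroundColor: accent,
    border: 'none',
    borderRadius: 4,
    color: 'white',
    cursor: 'pointer',
    padding: '8px 16px',
  },
});

const FancyButton: React.FC<{ label: string }> = ({ label }) => {
  const classes = useStyles();
  return <button className={classes.button}>{label}</button>;
};

export default FancyButton;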

Webpack

An oldie but a goodie. I’ve encountered the Metro bundler more recently in interacting with some ReactNative/Expo code and I’m curious how it compares performance-wise with Webpack. Still, I wanted something I knew and something that is reliable. The Webpack plugin ecosystem is extremely hit-or-miss, but the core tech is solid.

Still, when it slows down (which it will with all these bells and whistles and spinning rims), there is a huge body of well-documented Webpack optimisations to employ to reclaim that performance. I’m happy with Webpack for now, but I’d happily jump on something faster and simpler.
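
For reference, here’s a stripped-down sketch of the Webpack + Babel arrangement from the TL;DR (paths and options illustrative):

import path from 'path';
import { Configuration } from 'webpack';

const config: Configuration = {
  entry: './src/index.tsx',
  resolve: {
    extensions: ['.ts', '.tsx', '.js'],
  },
  module: {
    rules: [
      {
        test: /\.tsx?$/,
        exclude: /node_modules/,
        // Babel (with @babel/preset-typescript) strips the types;
        // tsc runs separately as the actual type checker.
        use: 'babel-loader',
      },
    ],
  },
  output: {
    filename: 'bundle.js',
    path: path.resolve(__dirname, 'dist'),
  },
};

export default config;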

Tradeoffs

The biggest negative tradeoff of this whole setup is that it has so many damn languages in it. We have to have HTML, CSS, and JavaScript. It’s the web. Those are table stakes. React-JSS gives us the ability to fold the CSS into the JS, which is awesome. So, that means there are only two languages, right? Wrong. We’re using TypeScript on top of JavaScript. And we’re using GraphQL in addition to that. Oh, and don’t forget React’s JSX language. So, this means we have HTML, JSX, JavaScript, TypeScript, and GraphQL all in a codebase that compiles into pure JavaScript. All this complexity isn’t free. It’s actually super expensive. Onboarding junior engineers or interns shows the cost. Our designer is a confident CSS hacker and he feels cautious around all this complexity. There is a lot of cognitive load to manage in understanding how the pieces fit together.

The flip side of this negative is that once it clicks, the codebase is pretty great. Having TypeScript underneath everything means that you can hover over just about any variable in VS Code and get a massive amount of context on what it is, where it comes from, what its types are. The constant sense of mystery you feel exploring a big JavaScript codebase isn’t something we feel. In fact, I can navigate this codebase faster than any JavaScript codebase I’ve ever worked in.

Another subtle tradeoff is that we’ve started from scratch. The upside of this choice is that we’re running the best-of-breed everything. Our codebase doesn’t suffer from any legacy ailments or poor technical choices (at least not that we know about yet!). No tech debt to pay off. Clean, linear commit history thanks to the Stacked Diff Workflow. The downside is that we’ve had to fit all the pieces together ourselves. We didn’t start with any boilerplate project helper or create-react-app. So that cost us some time. Still, this tradeoff seems easily worth it.

Like what you’re reading? Come join us! We’re currently hiring for a small number of great people to join our merry band. We’re going to change the way people work. Help us get there. Email me — jack@getradical.co.

Debugging Elastic Beanstalk

Recently, I’ve had the extremely mixed blessing of working with AWS and Elastic Beanstalk to deploy an app. This app has continuous delivery set up using the Code* Suite. While not wonderful, CodePipeline/CodeBuild are at least straightforward.

Enter Elastic Beanstalk.

I’ve been attempting to use the CLI to trigger Elastic Beanstalk version updates. This would seem quite straightforward:

  • Elastic Beanstalk has Applications
  • Applications have Environments
  • Applications have Versions
  • Environments have deployed versions (which are called “Running version” in the AWS console but “version label” or “version-label” or “VersionLabel” in the CLI/API)

So, to update an Application, you just need to create a new Application Version and then update its Environment with a new Version Label. Bish bash bosh. In fact, that’s exactly what the eb deploy command does (https://github.com/aws/aws-elastic-beanstalk-cli/blob/master/ebcli/operations/deployops.py#L23).
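
In SDK terms (here via the AWS SDK for JavaScript, with illustrative names), those two steps look roughly like this:

import ElasticBeanstalk from 'aws-sdk/clients/elasticbeanstalk';

const eb = new ElasticBeanstalk({ region: 'us-east-1' });

async function deploy(versionLabel: string) {
  // 1. Create a new Application Version from a bundle already in S3.
  await eb.createApplicationVersion({
    ApplicationName: 'my-app',
    VersionLabel: versionLabel,
    SourceBundle: { S3Bucket: 'my-bucket', S3Key: `bundles/${versionLabel}.zip` },
  }).promise();

  // 2. Point the Environment at the new Version Label.
  await eb.updateEnvironment({
    EnvironmentName: 'my-app-prod',
    VersionLabel: versionLabel,
  }).promise();
}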

My problems started here. All of the above worked fine. However, if the app was already deployed, it would (and as of the most recent update of this post — will) not redeploy the app. In the UI, this manifests as a “degraded” state where the “Running version” and the deployed version are not the same. Annoyingly, Elastic Beanstalk will want to revert to the previously successful version and it will tell you that its most recently deployed version is unexpected. I mean, if the revert worked, okay, fair enough, but since the revert is failing, that just leaves things in an even more confusing state.

Naturally, what I try to do is redeploy the correct version manually. That’s where the wheels really come off. Manually redeploying either in the console or via the CLI or via continuous delivery results in an unsuccessful command execution and the entire environment becomes unrecoverable.

So, what on earth is going wrong here? Well, in the AWS Console, I don’t get much:

ERROR During an aborted deployment, some instances may have deployed the new application version. To ensure all instances are running the same version, re-deploy the appropriate application version.

ERROR Failed to deploy application.

ERROR Unsuccessful command execution on instance id(s) 'i-XXXXXXXXXXXXXXXXX'. Aborting the operation.

INFO Command execution completed on all instances. Summary: [Successful: 0, Failed: 1].

ERROR [Instance: i-XXXXXXXXXXXXXXXXX] Command failed on instance. An unexpected error has occurred [ErrorCode: 0000000001].

Not much help there.

Out of curiosity, I SSH’d into one of the EC2 instances.

Running sudo docker ps, I could see there were no containers running. So, then I turned to /var/log/ to see if I could find anything. I did indeed.

Digging into /var/log/eb-engine.log on the EC2 instance, I can see that Beanstalk is trying to restart nginx, but nginx is putting up a fight:

2020/04/15 00:34:55.708346 [INFO] Executing instruction: register Nginx process
2020/04/15 00:34:55.708425 [INFO] Register process nginx
2020/04/15 00:34:55.708516 [INFO] Running command /bin/sh -c systemctl show -p PartOf nginx.service
2020/04/15 00:34:55.714774 [WARN] Warning: process nginx is already registered...
Deregistering the process ...
2020/04/15 00:34:55.714869 [INFO] Running command /bin/sh -c systemctl show -p PartOf nginx.service
2020/04/15 00:34:55.722615 [INFO] Running command /bin/sh -c systemctl is-active nginx.service
2020/04/15 00:34:55.729960 [INFO] Running command /bin/sh -c systemctl show -p PartOf nginx.service
2020/04/15 00:34:55.737441 [INFO] Running command /bin/sh -c systemctl stop nginx.service
2020/04/15 00:34:56.066516 [ERROR] Job for nginx.service canceled.

2020/04/15 00:34:56.066664 [ERROR] stopProcess Failure: stopping process nginx failed: Command /bin/sh -c systemctl stop nginx.service failed with error exit status 1. Stderr:Job for nginx.service canceled.

2020/04/15 00:34:56.066717 [ERROR] deregisterProcess Failure: process nginx failed to stop:
stopProcess Failure: stopping process nginx failed: Command /bin/sh -c systemctl stop nginx.service failed with error exit status 1. Stderr:Job for nginx.service canceled.


2020/04/15 00:34:56.066810 [ERROR] An error occurred during execution of command [app-deploy] - [register Nginx process]. Stop running the command. Error: register process nginx failed with error deregisterProcess Failure: process nginx failed to stop:
stopProcess Failure: stopping process nginx failed: Command /bin/sh -c systemctl stop nginx.service failed with error exit status 1. Stderr:Job for nginx.service canceled.

What the logs above show is that the Elastic Beanstalk engine itself is trying to stop and restart nginx on the EC2 instance, but those commands are failing for reasons I don’t yet understand. Because these commands are failing, the deployment process halts and the instance gets stuck in a down state.

By running sudo systemctl stop nginx.service, I can get the instance back and re-deploy successfully, but so far this has a 100% failure rate for all new deployments. Weird.

To get more clarity (though I have literally zero idea what the heck this means), I tried running the same command without sudo. I get an… interesting… error:

[ec2-user@ip-XXXXXXXXXXX log]$ systemctl stop nginx.service
Failed to stop nginx.service: The name org.freedesktop.PolicyKit1 was not provided by any .service files
See system logs and 'systemctl status nginx.service' for details.

Wat. But the same command with sudo (as you might expect) works a treat. sudo make me a sandwich to the rescue.

[ec2-user@ip-XXXXXXXXXXX log]$ sudo systemctl stop nginx.service
[ec2-user@ip-XXXXXXXXXXX log]$

Ho hum. I’m still chasing the root cause. Maybe this is the “unreliable deployment” cited as an issue here: https://medium.com/@acamp/elastic-beanstalk-advantages-and-drawbacks-be814615af01.

Update: For anyone working against the same issue, I managed to get around the problem by switching to immutable deployments. While *significantly* slower, immutable deployments have the advantage of always deploying to a clean EC2 instance. Because they always spin up a new instance, they don’t suffer from the restart failures described above.
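
For reference, the setting lives in the aws:elasticbeanstalk:command namespace. A sketch of flipping it via the SDK (environment name illustrative; the same option can be set in .ebextensions or the console):

import ElasticBeanstalk from 'aws-sdk/clients/elasticbeanstalk';

const eb = new ElasticBeanstalk({ region: 'us-east-1' });

async function enableImmutableDeployments() {
  // Switch the environment's deployment policy to Immutable, so every
  // deploy goes to fresh EC2 instances instead of restarting in place.
  await eb.updateEnvironment({
    EnvironmentName: 'my-app-prod',
    OptionSettings: [
      {
        Namespace: 'aws:elasticbeanstalk:command',
        OptionName: 'DeploymentPolicy',
        Value: 'Immutable',
      },
    ],
  }).promise();
}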

Stacked Diffs Versus Pull Requests

Update: This post received quite a lot of healthy discussion on Hacker News. You can follow that conversation here: https://news.ycombinator.com/item?id=18119570.

People who have worked with Phabricator using a ‘stacked diff’ workflow generally love it and seek it out wherever they next go. People who have never used it and only use Pull Requests with GitHub/GitLab generally don’t understand what the fuss is all about. How can code review be *sooo* much better using some obscure tool with hilarious documentation? Well, hold on to your butts, because I’m about to break it down. This post is going to focus solely on the engineering workflow for committing code and getting reviews. I’ll probably do a second post about the details of code review between the two tools.

Before I dig in deeply, let me just say I’ve created and merged hundreds of Pull Requests and landed thousands of Diffs. I know both of these workflows inside and out. I’m not just some ex-Facebook Phabricator fanboy pining for the ‘good old days.’ I’ve worked on engineering teams using CVS (oh yes), SVN, Git, and Mercurial. GitLab, GitHub, Gerrit, and Phabricator. I’ll happily acknowledge that you can get a lot of good work done using any of these. Now, if you want to talk about how to get the most work done — the most productivity per engineer — that’s where I have a strong opinion informed by lots of experience.

What are stacked diffs?

Many folks reading this post won’t actually have a clue what “stacked diffs” are all about anyway. This is understandable. Feature branches and Pull Requests (PRs) are fairly ubiquitous and (sort of) well-understood. For the uninitiated, I’ll outline how it works.

First, Pull Requests

In PR-based development, you start by branching master, then add one or more commits, which you submit as a ‘Pull Request’ in the Github UI. A Pull Request is (or at least should be) an atomic unit of code for review. When someone requests changes, you address them by adding additional commits to the pull request until the sum of these changes satisfies the reviewers’ demands.

The really important thing about this is that the state of your local repository is dictated by the review process. If you want to have your code reviewed, you first have to branch master, then commit to that branch, then push it remotely, then create a PR. If you get review feedback, you have to commit more code onto the same branch and a) push it to create a longer, less coherent commit history or b) squash your local commits and force push to the branch. This also means that you can’t have a local checkout of the repository that looks different from the remote. This is a really, really important point that I’ll come back to again and again.

So, Stacked Diffs

The basic idea of stacked diffs is that you have a local checkout of the repository which you can mangle to your heart’s content. The only thing that the world needs to care about is what you want to push out for review. This means you decide what view of your local checkout the reviewers see. You present something that can be ‘landed’ on top of master. It may be helpful to skip down to the Case Studies section below to get a more intuitive feel about how this works.

The typical workflow is to work right on top of master, committing to master as you go. For each of the commits, you then use the Phabricator command line tool to create a ‘Diff’ which is the Phabricator equivalent of a Pull Request. Unlike Pull Requests, Diffs are usually based on exactly one commit and instead of pushing updates as additional commits, you update the single commit in place and then tell Phabricator to update the remote view. When a Diff gets reviewed and approved, you can “land” it onto remote master. Your local copy and master don’t have to be in perfect sync in order to do this. You can think of this as the remote master cherry-picking the specific commit from your git history.

That’s right, I said it. You can probably commit everything to master. Sound terrifying? It’s mostly… well… just, not a problem at all. It’s fine like 93% of the time. In fact, this approach gives you the ability to do things that branches alone just can’t (more on that below). The anxiety many engineers feel about committing ahead of master is a lot like the fear that if you fly at lightspeed, you’ll crash into a star. Popularly held, theoretically true, and practically completely wrong.

In practice, engineers tend to work on problems whose chunks don’t easily divide into units of code review that make sense as a branch-per-unit-of-review. In fact, most engineers don’t know exactly how their work decomposes when they start working on a problem. Maybe they could commit everything to master. Maybe they need a branch per commit. Maybe it’s somewhere in between. If the rules for how to get code reviewed and how to commit code are defined for you ahead of time, you don’t get to choose, which in many cases means a net loss in productivity.

The big “aha!” idea here is that units of code review are individual commits and that those commits can stack arbitrarily, because they’re all on one branch. You can have 17 local commits all stacked ahead of master and life is peachy. Each one of them can have a proper, unique commit message (i.e. title, description, test plan, etc.). Each of them can be a unit out for code review. Most importantly, each one of them can have a single thesis. This matters *so* much more than most engineering teams realize.

Yes, basically every commit can be on top of master

“But that’s marmot floofing crazy!” I hear you say at your computer, reading this months after the blog post was written. Is it? Is it, really?! You may be surprised to learn that many engineers, who make a fantastic amount of money from some of the best companies in the world, commit directly to master all of the time, unless they have a reason not to.

To enable this, the mental model is different. A unit of code review is a single commit, not a branch. The heuristic for whether or not to branch is this: ‘Am I going to generate many units of code review for this overall change?’ If the answer is yes, you might create a branch to house the many units of code review the overall change will require. In this model, a branch is just a utility for organising many units of code review, not something forced on you *as* the mechanism of code review.

If you adopt this approach, you can use master as much as you want. You can branch when/if you want. You, the engineer, decide when/if to branch and how much to branch.

In this model, every commit must pass lint. It must pass unit tests. It must build. Every commit should have a test plan. A description. A meaningful title. Every. Single. Commit. This level of discipline means the code quality bar is fundamentally higher than in the Pull Request world (especially if you rely on Squash Merge to save you). Because every commit builds, you can bisect. You can expect that reading pure git log is actually meaningful. In fact, in this model every single commit is like the top commit from a Pull Request. Every commit has a link to the code review that allowed the commit to land. You can see who wrote it and who reviewed it at a glance.

For clarity, let me describe the extreme case where you only commit to master. I’ll outline things that are simpler because of this. I’m starting in order of least-important to most-important just to build the drama.

#1 Rebasing against master

With Pull Requests, if you want to catch up your local branch to master, you have to do the following:

  1. Fetch the changes from remote
  2. Rebase your branch on top of master
  3. Merge any conflicts that arise

That doesn’t sound so bad, but what about when you have a branch off a branch off of master? Then you have to repeat the last two steps for each branch, every time.

By contrast, if you only worked from master, you only have to do a git pull --rebase and you get to skip the cascading rebases, every time. You get to do just the work that you care about. All the branch jumping falls away without any cost. Might seem minor, but if you do the math on how often you have to do this, it adds up.

#2 Doing unrelated tasks

Many of us wear a lot of hats in our jobs. I’m the owner of a user-facing product codebase, which is many tens of thousands of lines, separated into dozens of features. That means I often jump between, for instance, refactoring big chunks of crufty old JavaScript (e.g. hundreds of lines of code across dozens of files) and working out small, nuanced bugs that relate to individual lines of code in a single file.

In the Pull Request world, this might mean I switch branches a dozen times per day. In most cases, that’s not really necessary because many of these changes would never conflict. Yet, most highly productive people make half a dozen or more unrelated changes in a day. This means that all that time spent branching and merging is wasted because those changes would never have conflicted anyway. This is evidenced by the fact that the majority of changes can be merged from the Github Pull Request UI without any manual steps at all. If the changes would never have conflicted, why are you wasting your time branching and merging? Surely you should be able to choose when/if to branch.

#3 Doing related tasks

One of the most time-destroying aspects of the Pull Request workflow is when you have multiple, dependent units of work. We all do this all the time. I want to achieve some goal Z, which requires doing V, W, X, and Y.

Sound far-fetched? Well, just recently, I wanted to fix a user-facing feature. However, the UI code was all wrong. It needed to have a bunch of bad XHR code abstracted out first. Then, the UI code I wanted to change would be isolated enough to work on. The UI change required two server-side changes as well — one to alter the existing REST API and one to change the data representation. In order to properly test this, I’d need all three changes together. But none of these changes required the same reviewer and they could all land independently, apart from the XHR and feature changes.

In the stacked diff world, this looks like this:

$ git log
commit c1e3cc829bcf05790241b997e81e678b3b309cc8 (HEAD -> master) 
Author: Jackson Gabbard <madeup@email.com>
Date:   Sat Sep 29 16:43:22 2018 +0100

    Alter API to enable my-sweet-feature-change

commit 6baac280353eb3c69056d90202bebef5de963afe
Author: Jackson Gabbard <madeup@email.com>
Date:   Sat Sep 29 16:44:27 2018 +0100

    Alter the database schema representation to enable my-sweet-feature

commit a16589b0fec54a2503c18ef6ece50f63214fa553
Author: Jackson Gabbard <madeup@email.com>
Date:   Sat Sep 29 16:42:28 2018 +0100

    Make awesome user-facing change

commit cd2e43210bb48158a1c5eddb7c178070a8572e4d
Author: Jackson Gabbard <madeup@email.com>
Date:   Sat Sep 29 16:41:26 2018 +0100

    Add an XHR library to abstract redundant calls

commit 5c63f48334a5879fffee3a29bf12f6ecd1c6a1dc  (origin/master, origin/HEAD)
Author: Some Other Engineer <madeup-2@email.com>
Date:   Sat Sep 29 16:40:16 2018 +0100
    
    Did some work on some things

The equivalent branch-based Git configuration might look like this:

$ git log

commit 55c9fc3be10ebfe642b8d3ac3b30fa60a1710f0a (HEAD -> api-changes)
Author: Jackson Gabbard <madeup@email.com>
Date:   Sat Sep 29 17:02:48 2018 +0100

    Alter API to enable my-sweet-feature-change

commit b4dd1715cb47ace52bc773312544eb5da3b08038 (data-model-change)
Author: Jackson Gabbard <madeup@email.com>
Date:   Sat Sep 29 17:03:25 2018 +0100

    Alter the database schema representation to enable my-sweet-feature

commit 532e86c9042b54c881c955b549634b81af6cdd2b (my-sweet-feature)
Author: Jackson Gabbard <madeup@email.com>
Date:   Sat Sep 29 17:02:02 2018 +0100

    Make awesome user-facing change

commit d2383f17db1692708ed854735caf72a88ee16e46 (xhr-changes)
Author: Jackson Gabbard <madeup@email.com>
Date:   Sat Sep 29 17:01:29 2018 +0100

    Add an XHR library to abstract out redundant calls

commit ba28b0c843a863719d0ac489b933add61303a141 (master)
Author: Some Other Engineer <madeup-2@email.com>
Date:   Sat Sep 29 17:00:56 2018 +0100

    Did some work on some things

Realistically though, in the Pull Request world, this commonly goes one of two ways:

  1. You care massively about code quality so you diligently create a branch off of master for V, then a branch off of V for W, then a branch off of W for X, then a branch off of X for Y. You create a pull request for each one (as above).
  2. You actually want to get work done so you create one big ass Pull Request that has commits for V, W, X, and Y.

In either case, someone loses.

For Case #1, what happens when someone requests changes to V? Simple, right? You make those changes in branch V and push them, updating your PR. Then you switch to W and rebase. Then you switch to X and rebase. Then you switch to Y and rebase. And when you’re done, you go to the orthopedist to get a walker because you’re literally elderly now. You’ve wasted your best years rebasing branches, but hey — the commit history is clean AF.

Importantly, woe be unto you if you happened to miss a branch rebase in the middle somewhere. This also means that when it comes time to commit, you have to remember which destination branch to select in the Github UI. If you mess that up and merge from X to W after W was merged to master, you’ve got an exciting, life-shortening mess to clean up. Yay!

For Case #2, everyone else loses because people just don’t feel the same burden of quality per-commit in a PR. You don’t have a test plan for every commit. You don’t bother with good documentation on each individual commit, because you’re thinking in terms of a PR.

In this case, when different reviewers request changes to the code for theses V and W, you just slap commits Y++ and Y++++ onto the end of the Pull Request to address the feedback across all of the commits. This means that the coherence of the codebase goes down over time.

You can’t intelligently squash merge the aspects of the various commits in the Pull Request that are actually related. The tool doesn’t work that way, so people don’t work or think that way. I can’t tell you the number of times I’ve seen the last two or three commits to a PR titled “Addresses feedback” or “tweaks” and nothing else. Those commits tend to be among the sloppiest and least coherent. In the context of the PR, that *seems* fine. But fast-forward 6 months: you’re trying to figure out why some code doesn’t do what it’s supposed to, all you have is a stack of 20 commits from a seemingly unrelated PR, and the git blame shows that the offending line comes from a commit titled “nits” with no other context. Life is just harder.

#4 Doing multiple sets of related tasks

If you happen to be one of the rare engineers who is so productive that you work on multiple, distinct problems at the same time — you still probably want a branch per-thing, even in the stacked diff workflow. This likely means that you create a branch per-thing (i.e. per distinct problem), but that you put out multiple units of code for review on each branch.

For the mortals amongst us, let’s imagine the case where an amazing engineer is working on three different hard problems at once. This engineer is working on three different strands of work, each of which requires many commits and review by many different people. This person might generate conflicts between their branches, but they’re also clever enough and productive enough to manage that. Let’s assume that each of this person’s branches includes an average of 5 or more units of code review in solving each of the 3 distinct problems.

In the Pull Request model, this means that person will have to create 3 branches off of master and then 5 branches-off-branches. Alternatively, this person will create 3 Pull Requests, each of which is stacked 5 commits deep with commits that only go together because of a very high-level problem, not because it actually makes sense for code review. Those 5 commits may not require the same reviewer. Yet, the pull request model is going to put the onus on a single reviewer, because that’s how the tool works.

The Stacked Diff model allows that amazing engineer to choose how/if to branch any commit. That person can decide if their work requires 3 branches and 15 units of code review or if their work requires 15 branches and 15 units of code review or something different.

This is more important than many people realize. Engineering managers know that allowing their most productive people to be as productive as possible amounts to big chunks of the team’s total output. Why on earth would you saddle your most productive engineers with a process that eats away at their productivity?

Thoughtless commits are bad commits

Every single commit that hits a codebase means more shit to trawl through trying to fix a production bug while your system is melting. Every merge commit. Every junk mid-PR commit that still doesn’t build but kinda gets your change closer to working. Every time you smashed two or three extra things into the PR because it was too much bother to create a separate PR. These things add up. These things make a codebase harder to wrangle, month after month, engineer after engineer.

How do you git bisect a codebase where every 6th commit doesn’t build because it was jammed into the middle of a Pull Request?

How much harder is it to audit a codebase where many times the blame is some massive merge commit?

How much more work is it to figure out what a commit actually does by reading the code because the blame commit message was “fixes bugs” and the pull request was 12 commits back?

The answer is *a lot harder*. Specifically because Pull Requests set you up for way more, way lower quality commits. It’s just the default mode of the workflow. That is what happens in practice, in codebases all over the world, every day. I’ve seen it in five different companies now on two continents in massively different technical domains.

Make the default mode a good one

You can make the argument that none of this is the fault of Pull Requests. Hi, thanks for your input. You’re technically correct. To you, I’d like to offer the Tale of the Tree Icon. When Facebook re-launched Facebook Groups in 2011, I was the engineer who implemented the New User Experience. I worked directly with the designer who implemented the Group Icons, which show up in the left navigation of the site. Weeks after launch, we noticed that almost all the groups had their icon set to… a tree. It was a gorgeous icon designed by the truly exceptional Soleio Cuervo. But… a tree? Why?

Because it was the first thing in the list.

People choose whatever is easiest. Defaults matter. So much. Even we demi-god-like engineers are subject to the trappings of default behaviour. Which is why Pull Requests are terrible for code quality. The easiest behaviour is shoehorning a bunch of shit under one PR because it’s just so much work to get code out for review.

This is where Stacked Diffs win out, no question. It’s not even close. The default behaviour is to be able to create a unit of code review for any change, no matter how minor. This means that the dozens of uninteresting changes that come along with any significant work can get approved effortlessly. The changes that are actually controversial can be easily separated from the hum-drum, iterative code that we all write every day. Pull Requests encourage exactly the opposite — pounding all of the changes into one high-level thesis and leaving the actual commit history a shambles.

Coding as a queue

The fundamental shift that the Stacked Diff workflow enables is moving from the idea that every change is a branch off of master to a world where your work is actually a queue of changes ahead of master. If you’re a productive engineer, you’ll pretty much always have five or more changes out for review. They’ll get reviewed in some order and committed in some order. With Pull Requests, the work queue is hidden behind the cruft of juggling branches and trying to treat each change like a clean room separated from your other work. With Stacked Diffs, the queue is obvious — it’s a stack of commits ahead of master. You put new work on the end of the queue. Work that is ready to land gets bumped to the front of the queue and landed onto master. It’s a much, much simpler mental model than a tangle of dependent branches and much more flexible than moving every change into the clean room of a new branch.
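As a concrete sketch, here’s roughly what that queue looks like locally, assuming a Phabricator-style setup where each commit is one Diff (the hashes, messages, and revision numbers are made up):

    $ git log --oneline origin/master..HEAD
    a1b2c3d Add retry logic to the sync client    # D103, in review
    e4f5a6b Refactor the session cache interface  # D102, in review
    c7d8e9f Fix flaky timestamp test              # D101, approved, front of the queue

Approved work gets rebased to the front and landed onto master; new work goes on the end.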

(For the pedantic few out there, yes, I just said stacked diffs are like a queue. Yeah… I didn’t name the workflow. Don’t hurl the rotten tomatoes at me.)


By now, you’re probably sick of this theoretical/rhetorical discussion of what good engineering looks like. Let’s switch gears and talk about this in practical, day to day terms.

Case Study #1: The Highly Productive Coder

In this case study, we take a look at Suhair. Suhair is a really productive coder. Suhair produces 10 or more high quality commits every day.

With Pull Requests

Suhair starts the day fixing a bug. Creates a branch, makes changes. Commits them. Suhair then pushes to the remote branch. Then navigates away from the terminal to the GitHub UI to create a pull request.

Next, Suhair switches back to master, pulls, and creates a new branch to work on a new feature. Commits code. This code is completely unrelated to the bug fix. In fact, the two would never generate merge conflicts. Still, Suhair sticks to branches. Works on the feature. Gets it to a good RFC state. Suhair pushes the changes. In GitHub, Suhair creates a pull request.

Next Suhair starts working on another feature improvement. Switches to master. Pulls. Branches. But… uh oh. This change depends on the bug fix from earlier. What to do? Suhair goes to the bug fix PR and checks if there are any comments. One person left some passing comments, but the person Suhair needs to review it hasn’t commented.

So Suhair decides it’s too much work to create a branch off the bug fix branch and decides to do something else in the interim. Suhair pings the needed person, begging for code review, interrupting their flow, and then starts working on something else.

With Stacked Diffs

Suhair pulls master in the morning to get the latest changes. Makes the first bug fix, commits it, creates a Diff to be reviewed, entirely from the command line. Suhair then works on the unrelated feature. Commits. Creates a Diff from the command line. Then starts working on the bug-fix-dependent feature improvement. Because Suhair never left master, the bug fix is still in the stack, so Suhair can proceed with the feature improvement uninterrupted. Suhair does the work. Commits it. Creates a Diff for review.

By now, the person who should have reviewed the initial bug fix actually got around to it. They give Suhair some feedback, which Suhair incorporates via interactive rebase. Rebasing the stack on top of the updated bug fix generates a small merge conflict with the feature improvement, which Suhair fixes. Then Suhair lands the change via interactive rebase. On the next git pull --rebase against remote master, the local commit disappears because remote master already has an identical change, and Suhair’s queue of commits ahead of master decreases by one.

As a bonus for Suhair today, the same reviewer who approved the bug fix is also the reviewer needed for the feature improvement. That person approved Suhair’s tweaks right after they reviewed the bug fix. So, Suhair rebases those changes to the top of the commit stack, then lands them. Suhair never switches branches. At the end of the day, only the unreviewed feature work is left in the local repo; everything else has landed on top of master.

The next day, Suhair comes in, runs git pull --rebase, and starts working without any branch juggling.
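In command-line terms, Suhair’s whole day is a loop of commit-and-diff. A sketch, assuming Phabricator’s arc tool; the commit messages are made up:

    git pull --rebase                 # start the day on top of remote master
    # ...fix the bug...
    git commit -am "Fix session timeout bug"
    arc diff HEAD^                    # Diff 1, straight from the terminal
    # ...build the unrelated feature...
    git commit -am "Add CSV export"
    arc diff HEAD^                    # Diff 2
    # ...feature improvement that builds on the bug fix, still on master...
    git commit -am "Reuse session fix in the refresh flow"
    arc diff HEAD^                    # Diff 3
    git rebase -i origin/master       # fold in review feedback, move approved work to the front
    git pull --rebase                 # landed commits vanish from the local stack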

Case Study #2: The Free Spirited Hacker

Charlie is a productive, energetic, somewhat amoral hacker who just wants to get work done as fast as possible. Charlie knows the product better than anyone, but doesn’t really care about code quality. Charlie is best paired with a senior tech lead (or two) who can rein in the chaos a bit.

With Pull Requests

Charlie starts the day by branching master and spamming the branch with five commits that are only vaguely related. Charlie’s commits are big, chunky commits that don’t make a lot of sense. They tend to be a bunch of things all crammed together. Reviewers of Charlie’s code always know they’ll have a lot of work ahead of them to make sense of the tangle of ideas. Because of this, they tend to put off reviewing. Today, a senior tech lead takes 45 minutes to read through all these changes, giving detailed feedback and explaining how to improve the various strands of the change. Charlie commits more changes onto the PR, addressing feedback and making random “fixes” along the way. In the end, the PR is probably okay, but it’s certainly not coherent, and may the Mighty Lobster on High protect those who have to make sense of the code in the coming months.

During this laborious back-and-forth, Charlie’s best option is to keep piling things on this PR because all the related changes are in it. The tech lead doesn’t have a reasonable alternative to offer Charlie.

With Stacked Diffs

Charlie blasts out five commits and five Diffs back to back. Each one addresses something specific. Each one goes to a different reviewer because Charlie happens to be making a sweeping change to the codebase. Charlie knows how it all fits together and the tech leads can make sure that the individual changes aren’t going to ruin everything.

Because the changes are smaller and more coherent, they get much better review. A tech lead reviewing the code notices that one of the changes is clearly two separate theses that happen to touch the same set of files and pushes back on Charlie: these should actually be two separate commits. Unfazed, Charlie abandons the Diff. Using interactive rebase to rewind history to that troublesome commit, Charlie uses git reset to uncommit the single commit that contains two theses.

At this point, Charlie’s local master is two commits ahead of remote master, and the working tree holds a bag of uncommitted changes that Charlie is currently hacking on. There are two more changes in the future, waiting to be added back to the local commit history by Git when Charlie is done rebasing interactively.

So, Charlie uses git add -p to separate out one change from the other and creates two new commits and two new Diffs for them separately. They each get a title, a description, and a test plan. Charlie then runs git rebase --continue to fast-forward time and bring back the later changes. Now, Charlie’s local master is six commits ahead of remote master. There are six Diffs out for review. Charlie never switched branches.
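Concretely, the split looks something like this (commit messages made up):

    git rebase -i origin/master       # mark the two-thesis commit as 'edit'
    git reset HEAD^                   # uncommit it; the changes stay in the working tree
    git add -p                        # stage only the hunks for thesis one
    git commit -m "Split cache invalidation out of sync"
    arc diff HEAD^                    # new Diff, with title/description/test plan
    git add .                         # stage the remainder
    git commit -m "Batch writes in the sync endpoint"
    arc diff HEAD^                    # second new Diff
    git rebase --continue             # replay the two later commits on top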

Case Study #3: The Engineer with a Bad Neighbour

Yang is a great engineer working in a fun part of the infrastructure. Unfortunately, Yang has a bad neighbour. This other engineer constantly lands the team in trouble. Today, Yang has found that the build is broken due to yet another incomprehensible change. The neighbour has a “fix” out for a review, but no one trusts it and several knowledgeable people are picking through the code in a very contentious code review. Yang just wants to get work done, but can’t because the bug is blocking everything.

With Pull Requests

Yang will check out the remote branch with the “fix”. Next, Yang will branch off of that branch in order to get a sort-of-working codebase. Yang gets to work. Midday, the bad neighbour pushes a big update to the “fix”. Yang has to switch to that branch, pull, then switch back to the branch Yang has been working on, rebase, and then push the branch for review. Yang then switches gears to refactor a class nearby in the codebase. So, Yang has to go back to the bug “fix” branch, branch off it, start the refactor, and push the commit remotely for review. The next day, Yang wants to merge the changes, but the “fix” has changed and needs rebasing again. Yang switches to the bug fix branch, pulls. Switches to the first branch, rebases. Pushes. Switches to GitHub to do the merge, carefully selecting to merge onto master rather than the bug fix branch. Then Yang goes back to the terminal, switches to the second feature, rebases, pushes, goes to GitHub, selects to merge to master, and merges. Then, Yang applies for AARP, because Yang is now in geriatric care.

With Stacked Diffs

Yang sees that the Diff for the “fix” is out for review. Yang uses the Phabricator command line tool to patch that commit on top of master. This means that it’s not a branch. It’s just a throwaway local commit. Yang then starts working on the first change. Yang submits a Diff for review from the command line. Later, the “fix” has changed, so Yang drops the patch of the old version from the Git history and patches in the updated one via interactive rebase. Yang then starts working on the second change, submits a Diff for review. The next day, Yang is ready to land both changes. First, Yang dumps the previous patch of the fix and re-patches the update to make sure everything works. Then, Yang uses the command line via interactive rebase to land both of the changes without ever switching branches or leaving the terminal. Later, the fix lands, so Yang does a git pull --rebase and the local patch falls off because it’s already in master. Then Yang goes skydiving, because Yang is still young and vital.
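The same day, as commands (D450 is a made-up revision ID):

    arc patch --nobranch D450         # throwaway local commit of the neighbour's "fix"
    # ...work, commit, arc diff as usual...
    git rebase -i origin/master       # drop the stale patch commit when the "fix" changes
    arc patch --nobranch D450         # re-patch the updated version
    git pull --rebase                 # once the fix lands, the local patch falls away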

In Conclusion

As you can see from the Case Studies, you can definitely get good work done no matter which tool chain you use. I think it’s also quite clear that Stacked Diffs make life easier and work faster. Many engineers reading this will say the cost of switching is too high. This is to be expected. It’s a thing called the Sunk Cost Fallacy. Everyone prefers the thing they feel they have invested in, even if there is an alternative that is provably more valuable. The stacked diff workflow is a clearly higher-throughput workflow that gives significantly more power to engineers to develop code as they see fit.

Inside Facebook, engineering used the branch-oriented workflow for years. They eventually replaced it with the stacked diff workflow because it made engineers more productive in very concrete terms. It also encourages good engineering practices in a way exactly opposite to branching and Pull Requests.

Something I haven’t touched on at all is the actual work of reviewing code. As it turns out, Phabricator also happens to offer better code review tools, but I’ll save that for another post.

Communicating via AES 256 GCM between NodeJS and Go(lang)

I recently had the mixed fortune of needing to send encrypted payloads back and forth between a service running in NodeJS and a service built in Golang. It was not a straightforward thing to do, which made me appreciate just how particular and difficult these crypto libraries are to use, doubly so when communicating across services written in different languages. Hence, I decided to write a blog post about the intricacies of the process.

First Issue: Transport Encoding

Enciphered payloads are just strings of bits. These bit strings are not ASCII or UTF-8, though. You can’t output them to your terminal, for instance, because they can completely destroy the terminal with byte sequences that the terminal interprets as instructions. Nor can you safely include them as raw HTTP request bodies. They can include characters that are interpreted by HTTP parsers in weird ways, most likely causing message truncation.

So, you need a transport encoding for sending ciphertext around. Base64 works great for this… with a catch.

Golang has multiple Base64 encodings available. Specifically, StdEncoding and URLEncoding. URLEncoding does what it says on the tin — it creates Base64 output that is safe for use in a URL. Standard Base64 includes characters like the “=” sign, which require escaping in order to work as part of a URL. (It’s worth noting that percent-escaping standard Base64 for use in a URL also makes the payload longer, since each escaped character balloons into three.)

I’d written a crypto wrapper library for my Golang service that was using URLEncoding internally. However, NodeJS doesn’t have native support for URL-safe Base64. There are NPM packages for this, but if you’re like me and you hate adding a million dependencies, you just want this to work out of the box. So, I had to switch my existing Go library to use StdEncoding. Not a big deal, but it’s a Gotcha™ that tripped me up because I just didn’t expect that NodeJS wouldn’t support this. Ho hum.
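Once both sides agree on StdEncoding, the Node side needs nothing beyond Buffer. A sketch (the payload value is made up):

    // NodeJS speaks standard-alphabet Base64 natively via Buffer.
    const payloadFromGo: string = 'aGVsbG8gd29ybGQ=';       // made-up example payload
    const bundle: Buffer = Buffer.from(payloadFromGo, 'base64');
    const replyB64: string = bundle.toString('base64');     // round-trips cleanly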

Second Issue: Terminology

If you read the NodeJS crypto library documentation, you’ll find it is a) sparse, b) full of unexplained jargon and initials, and c) prone to using different words from the Golang library. Because of course they wouldn’t be the same.

So, what NodeJS calls an “IV,” which is a shortening of the term Initialization Vector, Golang refers to as a nonce. Nonce <sarcasm>*obviously*</sarcasm> is a mash-word for “number used once.” And, <sarcasm>as all of us crypto experts know</sarcasm>, you should never use the same Initialization Vector more than once. Don’t we all feel better, wiser, and superior to the mere mortals who don’t know all this already.

In truth, there is a subtle difference between the two terms. An initialization vector means “choose some random bytes, used to lock in the security of the encryption algorithm.” A nonce, on the other hand, means “choose a random number with the correct number of bytes, used to lock in the security of the encryption algorithm.” Initialization vectors are usually shipped along with the message, meaning your encrypted payload is slightly longer than your message. A nonce (in theory) can be derived from context. In practice, however, it seems the two are used interchangeably.

For AES 256 GCM, your nonce/IV/initialization vector ought to be 12 bytes long (i.e. 96 bits). This has to do with the block size of the cipher and the specific counter method used: GCM fills the 16-byte AES block with your 12-byte nonce plus a 4-byte counter. For GCM with AES-256, 12 bytes is standard.

This is important to get right for security and also because if you try to use the wrong sized nonce, you’ll get obscure errors like:

Error: Unsupported state or unable to authenticate data

Super helpful, that. Especially because you can get this error in a bunch of different ways. In truth, ambiguous errors when attempting to decrypt a ciphertext are actually an important part of the security of the algorithm. If you can get the algorithm to output different errors by tweaking the ciphertext, you can use chosen ciphertext attacks to infer details about the plaintext. In some cases, you can even decrypt the ciphertext this way.

As an aside, if you’re using a different cipher algorithm (e.g. aes256 rather than aes-256-gcm), you might get this error instead:

Error: Invalid IV length

My solution to this problem was to generate 12 bytes of randomness to use as the nonce. I chose to prepend these bytes to the start of my encoded blob.
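In NodeJS terms, my enciphering side looks roughly like this (a sketch; key management and error handling elided):

    import * as crypto from 'crypto';

    // AES-256-GCM encrypt, prepending the random 12-byte nonce to the blob.
    function encrypt(key: Buffer, plaintext: Buffer): string {
      const nonce = crypto.randomBytes(12); // 96-bit nonce; never reuse per key
      const cipher = crypto.createCipheriv('aes-256-gcm', key, nonce);
      const ciphertext = Buffer.concat([cipher.update(plaintext), cipher.final()]);
      const authTag = cipher.getAuthTag(); // 16 bytes; more on this below
      // nonce | ciphertext | authTag, Base64'd for transport
      return Buffer.concat([nonce, ciphertext, authTag]).toString('base64');
    }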

Lastly, secure encryption requires a mechanism for message integrity. This is handled by one of many algorithms which calculate a MAC, or Message Authentication Code. I’m not going to attempt to discuss the details of this here, but there is a terminology hurdle with this notion too. What some algorithms and papers refer to as a MAC, others will refer to as an Auth Tag, short for Authentication Tag, which is just one of an endless string of unexplained synonyms in the crypto world. Now you know.

Third Issue: Some disassembly required

Here’s where things get really gnarly. If you look at the Golang API for sealing an AES 256 GCM-enciphered plaintext message, you’ll note that it deals only with a nonce, a plaintext, and the mysterious “additionalData”:
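    // Go's cipher.AEAD interface, abridged from the standard library docs:
    // Seal encrypts and authenticates plaintext, authenticates the
    // additional data, and appends the result to dst.
    Seal(dst, nonce, plaintext, additionalData []byte) []byte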

What’s important about this is that there is something missing.

If we look at NodeJS’s implementation of the same thing, we’ve got an extra piece of required information — the AuthTag.
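Roughly, on the Node side (a sketch, with key, nonce, and authTag as in the earlier snippet):

    // NodeJS demands the tag be supplied explicitly before final() will succeed.
    const decipher = crypto.createDecipheriv('aes-256-gcm', key, nonce);
    decipher.setAuthTag(authTag); // no AuthTag, no decryption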

So that’s an interesting little mystery. Where is this AuthTag component in the Golang library?

Also, note that the deciphering will not work in NodeJS without the AuthTag. You’ll get the extremely useful, highly unique error message:

Error: Unsupported state or unable to authenticate data

Ugh.

To resolve this mystery, I had to dig into the actual source code of the Golang library for crypto. If we crawl into the source a bit, we can see that they’re helping us out by hiding the auth tag from us entirely:

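    // Paraphrased from crypto/cipher/gcm.go: Seal sizes its output as
    // len(plaintext) + tagSize and writes the 16-byte tag into the tail,
    // immediately after the ciphertext.
    ret, out := sliceForAppend(dst, len(plaintext)+g.tagSize)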

Here you can see that Golang is using/anticipating its own structure for the ciphertext and the AuthTag. This isn’t documented in the Golang docs, of course. However, once you know the structure of the ciphertext produced by Golang’s GCM sealing code, you can easily write your own code to splice out these bits. Mine looks roughly like this (reconstructed as a sketch; the names are illustrative):
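    import * as crypto from 'crypto';

    // Splice a Golang-produced bundle into nonce | ciphertext | authTag and decrypt.
    // The callback comes first so the other arguments can be curried; the decrypted
    // AES key arrives last (see the note on parameter ordering below).
    function onKeySuccess(
      onPlaintext: (plaintext: Buffer) => void, // illustrative names throughout
      nonceLength: number,                      // 12 for AES-256-GCM
      bundleB64: string,                        // StdEncoding-compatible Base64
      key: Buffer                               // supplied last, by the key-decryption step
    ): void {
      const bundle = Buffer.from(bundleB64, 'base64');
      const nonce = bundle.slice(0, nonceLength);                        // first 12 bytes
      const authTag = bundle.slice(bundle.length - 16);                  // last 16 bytes
      const ciphertext = bundle.slice(nonceLength, bundle.length - 16);  // the middle
      const decipher = crypto.createDecipheriv('aes-256-gcm', key, nonce);
      decipher.setAuthTag(authTag);
      onPlaintext(Buffer.concat([decipher.update(ciphertext), decipher.final()]));
    }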

The assumption here is that the bundle is a Base64 encoded string of bits with the first 12 bytes being the nonce, the last 16 bytes being the AuthTag, and all the bytes in the middle being the ciphertext. Golang auto-appends the AuthTag to the end, and I wrote a little blob-assembly code on the Go side to prepend the nonce. With each of these pieces extracted from the Buffer, performing the decryption is finally possible.
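That Go-side assembly is tiny. A sketch, assuming gcm is the cipher.AEAD from the sealing snippet above:

    // Seal appends the AuthTag for us; we only need to prepend the nonce.
    sealed := gcm.Seal(nil, nonce, plaintext, nil)
    bundle := base64.StdEncoding.EncodeToString(append(nonce, sealed...))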

(You might note the weird ordering of parameters to this function. It is strange to have callbacks followed by additional params. This was done because this particular callback is used in a chain of callbacks and I curried some of the arguments. Without doing so, it would have meant a lot more forwarding of arguments along the chain, which results in messy code. This way, the function that decrypts the AES key only needs to call onKeySuccess with a single argument, because the other three were already bound to the onKeySuccess function. Yay JavaScript.)

For the Golang code that does Encryption, check out the example code here. Mine is quite similar.

Conclusion

Writing cross-service, secure communication is hard. Different libraries choose to implement the ciphertext and AuthTag packaging differently. Golang is particularly problematic because it does not tell you it has hidden away some of the details. Because it’s not part of the public API, relying on this particular implementation is unsafe. They might choose to put it the other way around without telling you, because you were never supposed to reach inside the black box in the first place. Alas. Alack.

Links

Basics of what these different pieces of the puzzle actually do:

https://crypto.stackexchange.com/questions/3965/what-is-the-main-difference-between-a-key-an-iv-and-a-nonce

A fully-baked NodeJS encryption example:

http://lollyrock.com/articles/nodejs-encryption/


How not to join the wrong company

I recently had the experience of joining and then leaving a startup. I won’t get into the details of the company, but let’s just say it rhymes with “kitty slapper.” I stayed a month, which was long enough to understand the culture of the company, the mission, and the people. In that time, it became really apparent to me that I’d joined the wrong company for me. Once I knew, I didn’t waste any time. I let the team know and transitioned out.

So, on one front, I think I did okay. When I knew it wasn’t going to work, I didn’t belabour it. I didn’t foolishly stick to my plan despite the evidence I had that it wasn’t going to work. A younger, more driven, and more naïve me would have pushed ahead trying to make it work at all costs. I did that at Facebook and mostly just gave myself more grey hairs.

Now, I wouldn’t pretend that I left Facebook in the most mature, conscientious way. No, in truth I was really frustrated and hurt when I left. Things weren’t adding up and my two separate chains of accountability were giving me incongruous feedback. If I’m honest though, I think there was a way forward for me there. It would have just required a lot of letting go of hard feelings and finding a new team to call home.  I wasn’t willing to let go of my goals or area of focus at Facebook. So, instead of moving away from security or quitting when it seemed like there was an impasse, I just kept grinding. I gritted my teeth and pushed forward. The result? I still quit a year later, even more frustrated and demoralised than the year before.

Okay, so I didn’t repeat that mistake this time. I recognised the way was blocked by unresolvable issues and I walked away.

But there’s room for improvement here. The question I keep coming back to since I left a couple weeks ago is how did I miss these signals in the first place? I’m a very, very seasoned interviewer. So much so that when someone is interviewing me, it typically works more like I’m interviewing them. So how the hell did I miss majorly important signals that would’ve prevented me from joining that company in the first place?

After another week of self-reflection, I think it comes down to just a few things.

#1 Don’t fall subject to the authority bias

One of my favourite books is The Art of Thinking Clearly by Rolf Dobelli. The chapter devoted to the Authority Bias poignantly sums up one place where I went wrong in joining the startup I just left:

The first book of the Bible explains what happens when we disobey a great authority: we get ejected from paradise. This is also what less celestial authorities would have us believe.

When I went for one of several interviews with the CEO, I noted a clear authoritarian streak in his communication. Shockingly (at least to myself in retrospect), rather than investigate this, I bowed down. I was excited about the company and scared of not getting to join it. So, even though I had signal that *already* was telling me something wasn’t right, I ignored it. This is probably a bit of confirmation bias on my part as well, but the fact that I didn’t take the opportunity to ask the hard questions I should have is something different. It’s the authority-fearing part of me (and I guess part of everyone) who is afraid of causing trouble — even when the consequences of *not* causing that trouble turn out to be much worse.

#2 Don’t let “winning” override making a smart decision

In the interview phase of this company, I sent over some outlines that explained how I would run a team doing the work I had been discussing with the leadership team. It was very technical (as one might imagine it ought to be if I’m being interviewed to be a lead engineer). Now, from the technical people, I got a thumbs up. In fact, after a month at the company, I can see that the outline I sent over was basically spot on. Yay me. The important part about this is that I also got told that my communication “wasn’t a successful one for informing business people.”

At this point, if I’m honest with myself, I let my ego get the better of me. Incredulity overtook me and I got caught up in a haughty, “Why would business people expect to be able to understand an outline explaining how to tackle a hard technical problem?” train of thought. Now — that question is actually the perfect question to ask. However, instead of facing the reality that I was having to ask such a ridiculous thing and walking away, I tried to “win” the debate. I should have just accepted that at this particular company the top of the food chain was obviously business people rather than engineers. I should have owned the reality that their expectation would be that I would communicate extremely complex things upwardly in some sort of digest format. That if I didn’t summarise complexity digestibly enough, the communication would be a failure and blame would be assigned to me.

But I didn’t do any of that. I got caught in the trap of “No way, I’m a great communicator — I’ll prove it!” I can see that now, but it was not at all obvious to me then. In fact, after being told that my hours of free effort devising a way to tackle a hard problem with a team were wasted because a business person didn’t understand it, I was nervous. I worried that I had screwed up my prospects of getting hired. I spent one night tossing and turning hoping that my failure hadn’t undone all the positive interactions I’d had prior. Silly me.

#3 Pay attention to all the signals you have

Going into this company, I had several reports that something was rotten in the state of Denmark. I had some direct feedback from current employees and I had some very telling Glassdoor reviews. Did I weigh these correctly? Nope! Again, I let confirmation bias take over my reasoning. Rather than asking the hard questions about these things, I instead asked the questions that I knew gave my interviewers enough rhetorical space to give answers I’d accept.

For instance, I should have asked, “I read on Glassdoor that X, Y, and Z appear to happen at your company. Is that true? If so, how often? Has anyone left the company as a result?” That question leaves no room for wriggling around. Instead of using the signal I had usefully, I instead asked questions like, “You guys have a bit of a reputation. How does that play out in the day-to-day running of the company?” This question gives *plenty* of room for subjectivity and soft answers. This was silly of me. Wasting a strong signal about a company — especially one that invalidates the assumption that it’s a good one to join — is about as rookie a mistake as one can make. 10+ years into my career, here I am making exactly that mistake.

#4 Checksum your intuitions about how your job will work

As I look back at my interview performances, I realise I didn’t bother to do something really important. I didn’t bother to check if the company worked in a sensible way relative to my expectations. I think my years at Facebook set me up perfectly for this. At a company like Facebook, even a “meh” team still produces a lot of great work, has amazingly talented people on it, and has plenty of remit to do their jobs well. It never occurred to me that these things might not be true at a different company. So, I never asked straightforward questions like:

  • If I need to coordinate with other teams in order to do my job well, what might that look like?
  • What does the reporting structure look like for engineers?
  • If I see an opportunity to improve processes across the engineering team, how can I push that forward?
  • If I see an opportunity to improve processes across the company, how can I make them happen?

If I’d asked these questions and pushed for authentic answers, I would have known that I was barking up the wrong tree. But, I didn’t. I just assumed that every company works as well as Facebook by default.

Nope.

Do you need strong mental math to pass an engineering interview?

I got a great question from a viewer of the Intro to Architecture and Systems Design Interviews video I created (https://youtu.be/ZgdS0EUmn70). The question is: If my mental math is really weak, is it OK to whip out a calculator app?

Below is my answer, republished here so everyone can (I hope) benefit from it:

For mental math vs. busting out a calculator, I think it’s probably inconsistent from one interviewer to the next. In no cases do I think it alone could possibly cost you the interview unless you’re interviewing at some extremely mathsy place (like an algorithmic trading company, for instance).

I’ve seen a spectrum in terms of judgement towards lack of mental math. One end has people like me, who couldn’t care less if you use a calculator. I would probably award you points for self knowledge and for using tools to make you more effective.

The other end of the spectrum would be someone who is very strong in maths and who doesn’t think the problems you’re tasked with are of sufficient complexity to warrant a calculator. With an interviewer on that end of the spectrum, you might lose some love, but it wouldn’t counterbalance an overall strong performance. For instance, in hundreds of decision-making discussions, I’ve never seen lack of mental math come up as a deal breaker.

There is one caveat to all of this. If you’re using a calculator for numbers that engineers *should* know, that could hurt you*. For instance, if 2^5 comes up and you have to bang it out on your TI-83, you’re probably going to lose real points. Even as a more math-lenient interviewer, I would have some serious questions about someone if powers of two don’t seem familiar. Likewise powers of two on the big end. 2^16 and 2^32 are both important numbers that should be in your mind as a programmer.

Also sums and differences of common powers of two. If you need a calculator to sum 4096 and 4096, I would knock you down a rung in my estimation.

Hope that helps!

 

* Re: Numbers engineers *should* know, here are some helpful links:

Architecture and Systems Design Interviews

Welcome back for Episode 06 of The Unqualified Engineer! I wanted to switch things up a bit this time. On an earlier episode, one of our viewers Vahid Noormofidi asked about system design interviews, so today we’re going to go there.

We’ve talked a lot about coding skills on this channel so far. We’ve walked through a bunch of examples of coding questions you might face in a coding interview. Some good, some bad. Some completely shit and unfair. While coding interviews are super important to succeed at in order to get a job offer, a less well-known detail of Silicon Valley style interviews is that your ‘level’ comes primarily from your performance in a design and architecture interview.

For those who don’t already know, Silicon Valley interviews have about five or six common forms, with three being ubiquitous. There’s the coding interview, which we’ve covered a lot. There’s the background/behaviour interview, which tries to determine if you’re a good person to work with and someone who will be a good addition to the team. There’s also the design & architecture interview, which focuses on your ability to take a big fuzzy problem and come up with a broad, detailed plan for solving it.

Broad and detailed? WTF?

If it feels weird to hear ‘broad *and* detailed,’ it should. Design and architecture interviews are impossibly big problems that definitely cannot be solved in 45 minutes. The goal isn’t to come up with a bullet-proof, 100% complete solution to a massive problem. Rather, it’s designed to give you the chance to show the interviewer what aspects of the problem you think about, what solutions you can come up with, and how much technical depth and diversity you’re bringing to the table. This is why it so strongly influences what ‘level’ you get hired at.

If you’re wondering what I mean about “levels,” it’s probably because you haven’t worked in a company that is as structured as a Google or a Facebook. The basic idea is that engineers (well, all employees actually) get broken into numbered groups that map coarsely to a scale of a person’s ability to contribute value to the company.

For instance, at Facebook, a new graduate from university might get hired in at Level 3. Why Level 3? Fuck if I know. I think it’s probably just self importance — we’re all just *too* special and *too* awesome to start at Level 1… Anyway, that’s how it works. Your level determines your compensation, it determines what equity stake you might get in the company, and it determines what amount of value you have to add to the company in order to keep your job. Brutal, yes, but also somewhat reasonable in practice.

So anyway, back to design and architecture interviews. They have a few subforms like product design interviews, database/data storage design interviews, etc. The most common type is a generic systems design interview.

A typical question for an interview like this has a really common form. You’ll be presented with a big problem that has only a few concrete details. For instance, in a Facebook interview, you might get asked a question like “Design a mobile app logging infrastructure that can handle a user base of 500 million monthly active users. The system needs to handle any sort of logs from mobile devices like application start, stop, errors, etc.”

Breadth

If you don’t come from a systems background, you might initially think an interview question like this is grossly unfair. And, well, you’re kind of right to feel that way. A question like this is much easier for someone with experience designing systems like this, at this scale, than for someone without it. Before you self-cycle into existential rage, try to remember that’s exactly the point of the interview.

On top of that, bear in mind that there’s no correct answer to a question like this.

There are certainly incorrect answers (e.g. don’t say “We can solve this with Mongo DB, because it’s web scale.”). However, the universe of good answers is vast. Good interviewers are going to look for you to use your experience to give the strongest performance in the domain you should be strongest in.

For instance, given the question above, let’s say you’re a mobile product engineer. The interviewer is going to be looking for you to think about the product impact of a system like this. When should you send logs from a device? What will the performance impact be of collecting logs on the device? What about supporting iOS and Android and mobile web? Will application engineers be able to log arbitrary data or is it a strictly defined log set? Those are the kinds of details your interviewer should expect you to be strong with.

On the other hand, if you’re a distributed filesystem engineer, your interviewer is going to expect you to be amazingly good at describing how to handle the scale of the problem on the server side and to think about strategies for reducing the amount of time it takes for new log data to become available. The interviewer probably wouldn’t expect you to think about the battery life implications in the same detail.

If you can deliver awesome answers to both the client and the server side of the problem, all the better. Some people can.

The important part is to play to your strengths while also showing reasonable technical breadth. If a mobile app engineer answered this question for me and kicked ass with the client-side aspects but literally didn’t address the server side at all, it would be tough for me to feel confident in hiring her. On the other hand, if she thought through as much detail as she could on the server and gave me clear signals about what she does and doesn’t know, that’s better. Maybe she wouldn’t invent a server side solution as well as a distributed file system engineer, but that’s OK. That makes sense given her skills and specialisation as a mobile engineer.

Break the problem down

One of the most critical things for you to do in this interview is to break the problem down even more coherently than the interviewer presented it. Going back to the example problem I proposed, you might have noticed that some of the details aren’t in a usable form. If you did, good. If you didn’t, you’ve got some work to do.

For instance, I said “500 million monthly active users.” What does that even mean? Can you use a *monthly* active number to design a real-time system? The answer is yes, but not directly. You need to break it down.

The rookie might do this by taking 500 million, dividing it by 30 days in a month, and then dividing that number by 86,400 (the number of seconds in a day) to figure out the concurrent users. The reason this is a naive way to approach the problem is that people live near each other and live life in daily cycles. That means you can’t just evenly split the user base by the seconds in a month as though all people are evenly spread across the world and using your app evenly 24 hours a day.

So how do you capacity plan? If you don’t have an immediate answer, you should start by interacting with the interviewer. Ask about the geographic distribution of users. Ask about the biggest markets for the app. What timezones are most of the users in?

You should be able to arrive at the notion that (assuming the app is doing well!) some large fraction of the users will be active every single day, and mostly during specific times of day. You can reasonably smoke up these numbers. Maybe you would estimate that 70% of the monthly active users are also daily actives and that 70% of those people are active at roughly the same time.

Why? Well, it’s very likely that night time in the big population centers of the world is going to be much quieter than the day time. A key insight you need to have is that to handle ‘the user base,’ you really need to worry about the peak-time traffic. So, if 70% of the users are going to be using your app every day and 70% of that 70% will be using the app all at once during peak times, your 500 million users are actually only 245 million at peak times. If we assume that means 245 million people in an hour, then we can start doing even finer-grained estimation. 245M / 60 minutes / 60 seconds nets out to about 68K second-to-second active users. That number is much, much more reasonable to plan for than a hazy “500 million.” *How many servers do you need to handle 500 million people?* is almost meaningless compared to the real question — *How many servers do you need to handle 70K concurrent users during peak hours?*
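Spelled out in code, the whole estimation chain is just a few lines (every input is one of the made-up estimates above):

    // Back-of-envelope: monthly actives down to peak concurrent load.
    const monthlyActives = 500_000_000;
    const dailyFraction = 0.7;   // guess: 70% of monthly actives show up daily
    const peakOverlap = 0.7;     // guess: 70% of those overlap at peak
    const peakHourUsers = monthlyActives * dailyFraction * peakOverlap; // 245M
    const perSecondActives = Math.round(peakHourUsers / 3600);          // ~68K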

Meta point about breaking the problem down

Did I just make up a bunch of numbers based on no data? Yes. Could they be “wrong” numbers? Well yes, in fact they probably are. Does that matter? Not as much as you might think.

The interview is not trying to figure out if you can psychically derive the exact user stats and behaviours of an imaginary, complex system. Rather, it’s designed to see if you can handle a big fuzzy need in a reasonable way. Can you take a big, insane requirement and grind through it to figure out what sort of actual technical problem it poses?

Veteran systems engineers, devops engineers, and people who have had the chance to work on large scale problems before will have a leg up in this domain. That might seem unfair, but it’s really not. Remember, we’re not talking about school, where you can study the hardest and get the best grade. This is real life. Valuable skills are valuable, and who has them and who doesn’t is not “fair,” it’s reality. If you’re expecting it to be fair, you’re going to spend a lot of time being frustrated.

Breaking the problem down continued

So, back to the problem — we’ve decided we’re talking about 70K concurrent users. Now we have to think about what that means in terms of actual throughput. If you were trying to solve this problem in the real world, your bottom line for the server side will be *How much log data are we actually talking about here?*

Right now, you might be thinking — it’s impossible to know! If you are, you’re either not thinking critically about mobile use cases or you’re not experienced enough in this domain to have an intuition about it. The second is better than the first. Let’s think about some real-world bounds. *How much network bandwidth does a phone have?* *How much storage space?* If we assume that the logs are going to be generated by user actions — remember the problem details “application start, application end, errors, etc.” — then we have an excellent frame of reference for breaking it down further. How many actions per minute can a person do on their phone? Maybe the average person could send and receive fifteen messages. Or maybe they could consume twenty photos on Instagram. Maybe they can switch apps roughly four times. In any of those cases, we’re not looking at a massive amount of data here.

The novice will assume that we just need to guess some reasonable average number and stick with that (e.g. “half a megabyte, tops.”). The better engineer will grind through some concrete estimations based on the size of the data. Maybe we’re talking about at least 20 or so 32-bit integers that refer to entity IDs (photo IDs, user IDs, etc.), probably some overhead for a data encoding like JSON, probably some strings for things like event types, and in the error case hopefully something like a stack trace to aid in debugging. So, at least a few hundred bytes per minute, but with an upper bound that could be pretty big in error cases (them stack traces, yo).

The best way to answer here — in my opinion — is to have an opinion and to make intentional, reasonable tradeoffs. Burning through hundreds of kilobytes of data per minute is a guaranteed shitty and expensive experience for someone using your product. So, be an empathetic system designer. Come up with a guideline and set some numbers based on giving a shit about the user experience. Perhaps that number is 5KB of log data per minute as a maximum; anything more gets discarded. That means that if a person uses your product for an hour, they’ll only rack up 300KB of log data that they’re paying for out of their own hard-earned money. That’s like asking them to download an extra image every hour or so in order to give your company the data it needs to make sure the app is working and to understand why it’s not. A reasonable trade off.

And again, yes, this is a made up number. The specific number is not as important as why you chose it and whether or not that choice was reasonable. I could have said 500KB per minute (I wouldn’t because not even Facebook’s bloated apps use that much bandwidth for metadata), but the most important part is *why* that number.

The reason I think taking a strong stance towards how a system should work is the better option is that it shows experience, leadership, and ownership. Sure, I could just let the users’ activity dictate some reasonable number of logs per minute, but what if my estimation leads me to some big number like 5MB per hour? My inner nurturer needs to kick in and tell me that logging that aggressively is abusive of user trust. My inner data analyst should be telling me that 5MB of data per hour per user is a glut of data that will mostly be giving redundant signal. User empathy, experience, and technical leadership should be the guiding light here over a raw calculation.

In a real company doing real work, that’s what actually happens, after all. Product decisions are made by people trying to make the smartest tradeoffs. There are no right answers. The only wrong answers are ones that drive away users or lead the company to fail. The right answers are many and varied. This is why design and architecture interviews are so important for determining people’s levels.

Keep Going

This discussion is already massive and ranging, yet we haven’t even talked about the concrete details of a client implementation or a server implementation. We haven’t talked about client-side caching. About mobile networks versus WiFi delivery of logs. We haven’t talked about data-saving strategies like user sampling. We haven’t talked about how to collect log data on the server side or how to store it long-term. We haven’t talked about how many people it would take to build a system like this. We haven’t talked about how many servers we’ll need to handle the peak traffic. We haven’t talked about how much that will cost. We haven’t talked about countless absolutely critical aspects of the problem. Nor could we in 45 minutes. That’s why you, as the interviewee, must keep going.

At any point if you’re looking to your interviewer for guidance about how to proceed, you’re losing points. Now, don’t get me wrong on this point. Needing to bounce ideas off the interviewer or get clarification is fine. Also, you might be legitimately stuck on the problem. If so, by god man — ask for a pointer. I’ve seen many interviews that resulted in a job offer include feedback like “The candidate started off strong but needed a small pointer on X. After that, she did really well.” Don’t be afraid to ask for help. Still, you should bear in mind that the more help you need, the more likely it is that you’re showing lack of experience, creativity, and/or technical leadership.

I don’t know anything about image processing pipelines, for instance. But, if you give me a problem related to image processing pipelines at scale, I have high confidence that I can keep exploring the problem space making meaningful progress for 45 minutes. In all honesty, I love these kinds of challenges and find them really energising. Not everyone does, but everyone who is a strong technical leader *can* do these kinds of explorations.

Failing this interview

One interesting thing to consider on this point is that Facebook, at least, don’t even do design/architecture interviews for new graduates. They used to, but consistently found that the candidates just didn’t have enough experience with applied systems problems to even begin solving the problem.

Not having experience enough to be able to digest a big problem like this doesn’t always mean you’re fucked. It is an important signal though.

For instance, I was once on an interview loop where we all decided to hire the candidate even though he fared poorly in the architecture interview. His coding skills were strong and he seemed passionate about our company as well as a good person to work with. We didn’t hire him though.

When the interview feedback made it up to the director level, the director stopped us and rejected the candidate. I was confused. Turns out the director did some mental math that I didn’t have the foresight to do. Yes, the coding skills were good and the candidate was a reasonably good communicator, team player, etc. However, he was lacking in design/architecture ability. This on its own is no failure, but the candidate had 8 years of industry experience. He had worked in a strongly technical company for that long without ever taking on enough leadership to get good at reasoning about large scale systems. That is a red flag.

Like Bob Dylan says, you’re either busy being born or busy dying. In this case, our candidate had plateaued in his career — so, he was busy dying. It could have been for a million reasons, but to the director making the call, it meant making a tough decision about whether or not we could expect this candidate to advance in his career.

At Facebook at least, you have to keep advancing at least to level 5, which is roughly equivalent to “senior engineer” or “tech lead” in other environments. After that, you’re doing valuable enough work to stick around without getting promoted again. This candidate would not have gotten a level 5 offer and if he’s already 8 years into his career, it’s unlikely he ever would. So, we would be sentencing him to a short-term career at Facebook at best. It was definitely the right call.

Final points

Architecture interviews are formidable, open-ended problems that you definitely cannot exhaustively solve in the time allotted. If you have no idea how to solve these kinds of problems, you might start by checking out the engineering blogs of companies like Google, Apple, Dropbox, and others. The amazing thing about architecture is that most of the best companies are sharing all their work.

Even if you have no background in the work, you can familiarise yourself with the common patterns of system design by reading diversely from the blogs on the topic, watching YouTube videos of tech talks from conferences, and so on. If that feels like cheating, it shouldn’t. After all, the reason engineers at big tech companies are good at solving these kinds of problems isn’t usually because they do it all day, every day. Rather, it’s because they get exposure to the solutions in internal tech talks and write-ups. The information is available. Go turn it into knowledge.