Architecture and Systems Design Interviews

[youtube https://www.youtube.com/watch?v=ZgdS0EUmn70&w=1280&h=720]

Welcome back for Episode 06 of The Unqualified Engineer! I wanted to switch things up a bit this time. On an earlier episode, one of our viewers, Vahid Noormofidi, asked about system design interviews, so today we’re going to go there.

We’ve talked a lot about coding skills on this channel so far. We’ve walked through a bunch of examples of coding questions you might face in a coding interview. Some good, some bad. Some completely shit and unfair. While coding interviews are super important to succeed at in order to get a job offer, a less well-known detail of Silicon Valley style interviews is that your ‘level’ comes primarily from your performance in a design and architecture interview.

For those who don’t already know, Silicon Valley interviews have about five or six common forms, with three being ubiquitous. There’s the coding interview, which we’ve covered a lot. There’s the background/behaviour interview, which tries to determine if you’re a good person to work with and someone who will be a good addition to the team. There’s also the design & architecture interview, which focuses on your ability to take a big fuzzy problem and come up with a broad, detailed plan for solving it.

Broad and detailed? WTF?

If it feels weird to hear ‘broad *and* detailed,’ it should. Design and architecture interviews are impossibly big problems that definitely cannot be solved in 45 minutes. The goal isn’t to come up with a bullet-proof, 100% complete solution to a massive problem. Rather, it’s designed to give you the chance to show the interviewer what aspects of the problem you think about, what solutions you can come up with, and how much technical depth and diversity you’re bringing to the table. This is why it so strongly influences what ‘level’ you get hired at.

If you’re wondering what I mean about “levels,” it’s probably because you haven’t worked in a company that is as structured as a Google or a Facebook. The basic idea is that engineers (well, all employees actually) get broken into numbered groups that map coarsely to a scale of a person’s ability to contribute value to the company.

For instance, at Facebook, a new graduate from university might get hired in at Level 3. Why Level 3? Fuck if I know. I think it’s probably just self-importance: we’re all just *too* special and *too* awesome to start at Level 1… Anyway, that’s how it works. Your level determines your compensation, it determines what equity stake you might get in the company, and it determines what amount of value you have to add to the company in order to keep your job. Brutal, yes, but also somewhat reasonable in practice.

So anyway, back to design and architecture interviews. They have a few subforms like product design interviews, database/data storage design interviews, etc. The most common type is a generic systems design interview.

A typical question for an interview like this has a really common form. You’ll be presented with a big problem that has only a few concrete details. For instance, in a Facebook interview, you might get asked a question like “Design a mobile app logging infrastructure that can handle a user base of 500 million monthly active users. The system needs to handle any sort of logs from mobile devices like application start, stop, errors, etc.”

Breadth

If you don’t come from a systems background, you might initially think an interview question like this is grossly unfair. And, well, you’re kind of right to feel that way. A question like this is much easier for someone with experience designing systems like this, at that scale. It’s harder for someone who hasn’t. Before you self-cycle into existential rage, try to remember that’s exactly the point of the interview.

On top of that, bear in mind that there’s no correct answer to a question like this.

There are certainly incorrect answers (i.e. don’t say “We can solve this with Mongo DB, because it’s web scale.”). However, the universe of good answers is vast. Good interviewers are going to look for you to use your experience to give the strongest performance in the domain you should be strongest in.

For instance, given the question above, let’s say you’re a mobile product engineer. The interviewer is going to be looking for you to think about the product impact of a system like this. When should you send logs from a device? What will the performance impact be of collecting logs on the device? What about supporting iOS and Android and mobile web? Will application engineers be able to log arbitrary data or is it a strictly defined log set? Those are the kinds of details your interviewer should expect you to be strong with.

On the other hand, if you’re a distributed filesystem engineer, your interviewer is going to expect you to be amazingly good at describing how to handle the scale of the problem on the server side and to think about strategies for reducing the amount of time it takes for new log data to become available. The interviewer probably wouldn’t expect you to think about the battery life implications in the same detail.

If you can deliver awesome answers to both the client and the server side of the problem, all the better. Some people can.

The important part is to play to your strengths while also showing reasonable technical breadth. If a mobile app engineer answered this question for me and kicked ass with the client-side aspects but literally didn’t address the server side at all, it would be tough for me to feel confident in hiring her. On the other hand, if she thought through as much detail as she could on the server and gave me clear signals about what she does and doesn’t know, that’s better. Maybe she wouldn’t invent a server side solution as well as a distributed file system engineer, but that’s OK. That makes sense given her skills and specialisation as a mobile engineer.

Break the problem down

One of the most critical things for you to do in this interview is to break the problem down even more coherently than the interviewer presented it. Going back to the example problem I proposed, you might have noticed that some of the details aren’t in a usable form. If you did, good. If you didn’t, you’ve got some work to do.

For instance, I said “500 million monthly active users.” What does that even mean? Can you use a *monthly* active number to design a real-time system? The answer is yes, but not directly. You need to break it down.

The rookie might do this by taking 500 million, dividing it by 30 days in a month, and then dividing that number by 86,400 (the number of seconds in a day) to figure out the concurrent users. The reason this is a naive way to approach the problem is that people live near each other and live life in daily cycles. That means that you can’t just evenly split the user base by seconds in a month as though all people are evenly spread across the world and active, using your app evenly 24 hours a day.
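To make the rookie math concrete, here is a minimal Python sketch of that even-spread calculation. The constant and function names are my own for illustration; the 500 million, the 30 days, and the 86,400 seconds come from the paragraph above.

```python
# Naive estimate: pretend monthly actives are spread evenly over every second.
MONTHLY_ACTIVE_USERS = 500_000_000
DAYS_PER_MONTH = 30
SECONDS_PER_DAY = 86_400


def naive_users_per_second(monthly_active_users: int) -> float:
    """Assumes (wrongly) that activity is uniform across the whole month."""
    return monthly_active_users / DAYS_PER_MONTH / SECONDS_PER_DAY


print(f"{naive_users_per_second(MONTHLY_ACTIVE_USERS):.0f} users per second")
# prints "193 users per second"
```

That number looks comfortable precisely because it pretends traffic never peaks.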

So how do you capacity plan? If you don’t have an immediate answer, you should start by interacting with the interviewer. Ask about the geographic distribution of users. Ask about the biggest markets for the app. What timezones are most of the users in?

You should be able to arrive at the notion that (assuming the app is doing well!), some large fraction of the users will be active every single day and mostly during specific times of day. You can reasonably smoke up these numbers. Maybe you would estimate that 70% of the monthly active users are also daily actives and that 70% of those people are active at roughly the same time.

Why? Well, it’s very likely that night time in the big population centers of the world is going to be much quieter than the day time. A key insight you need to have is that to handle ‘the user base,’ you really need to worry about the peak-time traffic. So, if 70% of the users are going to be using your app every day and 70% of that 70% will be using the app all at once during peak times, your 500 million users is actually only 245 million at peak times. If we assume that means 245 million people in an hour, then we can start doing even finer-grained estimation. 245M / 60 minutes / 60 seconds nets out to about 68K second-to-second active users, or roughly 70K. That number is much, much more reasonable to plan for than a hazy “500 million.” *How many servers do you need to handle 500 million people?* is almost meaningless compared to the real question: *How many servers do you need to handle 70K concurrent users during peak hours?*
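Here is the same back-of-the-envelope estimate as a small Python sketch. The 70% fractions and the one-hour peak window are the made-up assumptions from the paragraph above, and the function names are hypothetical.

```python
MONTHLY_ACTIVE_USERS = 500_000_000
DAILY_ACTIVE_FRACTION = 0.70    # made-up: share of monthly actives who show up daily
PEAK_FRACTION = 0.70            # made-up: share of daily actives active around peak
PEAK_WINDOW_SECONDS = 60 * 60   # assume the peak crowd is spread across one hour


def peak_users(mau: int, daily_fraction: float, peak_fraction: float) -> float:
    """Users expected to be active during the peak window."""
    return mau * daily_fraction * peak_fraction


def peak_users_per_second(mau: int, daily_fraction: float,
                          peak_fraction: float, window_seconds: int) -> float:
    """Peak users spread evenly across the peak window."""
    return peak_users(mau, daily_fraction, peak_fraction) / window_seconds


at_peak = peak_users(MONTHLY_ACTIVE_USERS, DAILY_ACTIVE_FRACTION, PEAK_FRACTION)
per_second = peak_users_per_second(
    MONTHLY_ACTIVE_USERS, DAILY_ACTIVE_FRACTION, PEAK_FRACTION, PEAK_WINDOW_SECONDS
)
print(f"{at_peak:,.0f} users at peak")      # 245,000,000 users at peak
print(f"{per_second:,.0f} per second")      # 68,056 per second
```

Keeping the assumptions as named constants makes it easy to redo the estimate when the interviewer tells you the real fractions are different.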

Meta point about breaking the problem down

Did I just make up a bunch of numbers based on no data? Yes. Could they be “wrong” numbers? Well yes, in fact they probably are. Does that matter? Not as much as you might think.

The interview is not trying to figure out if you can psychically derive the exact user stats and behaviours of an imaginary, complex system. Rather, it’s designed to see if you can handle a big fuzzy need in a reasonable way. Can you take a big, insane requirement and grind through it to figure out what sort of actual technical problem it poses?

Veteran systems engineers, devops engineers, and people who have had the chance to work on large scale problems before will have a leg up in this domain. That might seem unfair, but it’s really not. Remember, we’re not talking about school where you can study the hardest and get the best grade. This is real life. Valuable skills are valuable and who has them and who doesn’t is not “fair,” it’s reality. If you’re expecting it to be fair, you’re going to spend a lot of time being frustrated.

Breaking the problem down continued

So, back to the problem: we’ve decided we’re talking about 70K concurrent users. Now we have to think about what that means in terms of actual throughput. If you were trying to solve this problem in the real world, your bottom line for the server side will be *How much log data are we actually talking about here?*

Right now, you might be thinking — it’s impossible to know! If you are, you’re either not thinking critically about mobile use cases or you’re not experienced enough in this domain to have an intuition about this. The second is better than the first. Let’s think about some real-world bounds. *How much network bandwidth does a phone have?* *How much storage space?* If we assume that the logs are going to be generated by user actions — remember the problem details “application start, application end, errors, etc.” — then we have an excellent frame of reference for breaking it down further. How many actions per minute can a person do on their phone? Maybe the average person could send and receive fifteen messages. Or maybe they could consume twenty photos on Instagram. Maybe they can switch apps roughly four times. In any of those cases, we’re not looking at a massive amount of data here.

The novice will assume that we just need to guess some reasonable average number and stick with that (e.g. “half a megabyte, tops”). The better engineer will grind through some concrete estimations based on the size of data. Maybe we’re talking about 20+ 32-bit integers that refer to entity IDs (photo IDs, user IDs, etc.) at least, probably some overhead for a data encoding like JSON, probably some strings for things like event types, and in the error case hopefully something like a stack trace to aid in debugging. So, at least a few hundred bytes per minute, but with an upper bound that could be pretty big in error cases (them stack traces, yo).
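One way to ground that estimate is to encode a sample event and measure it. The sketch below is only illustrative: the field names, the event type, and the fake stack trace are all made up, and JSON via Python’s standard `json` module is just one possible encoding.

```python
import json
from typing import Optional

MAX_INT32 = 2**31 - 1  # largest signed 32-bit value, a pessimistic ID size


def sample_event(num_ids: int = 20, stack_trace: Optional[str] = None) -> dict:
    """Build a hypothetical log event carrying `num_ids` 32-bit entity IDs."""
    event = {
        "event_type": "photo_view",   # made-up event name
        "timestamp": 1_700_000_000,   # seconds since the epoch
        "entity_ids": [MAX_INT32 - i for i in range(num_ids)],
    }
    if stack_trace is not None:
        event["stack_trace"] = stack_trace
    return event


def encoded_size(event: dict) -> int:
    """Size in bytes of the event as compact UTF-8 JSON."""
    return len(json.dumps(event, separators=(",", ":")).encode("utf-8"))


raw_id_bytes = 20 * 4  # 20 IDs at 4 bytes each if sent as raw binary
normal = encoded_size(sample_event())

# A fake 40-frame stack trace to stand in for the error case.
fake_trace = "\n".join(
    f"at com.example.Frame{i}.call(Frame{i}.java:{i})" for i in range(40)
)
error = encoded_size(sample_event(stack_trace=fake_trace))

print(f"raw IDs: {raw_id_bytes} B, JSON event: {normal} B, with stack trace: {error} B")
```

A text encoding like JSON costs several times the raw 80 bytes of IDs, and a single stack trace can dwarf a normal event, which is why the error case sets the upper bound.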

The best way to answer here, in my opinion, is to have an opinion and to make intentional, reasonable tradeoffs. Burning through hundreds of kilobytes of data per minute is a guaranteed shitty and expensive experience for someone using your product. So, be an empathetic system designer. Come up with a guideline and set some numbers based on giving a shit about the user experience. Perhaps that number is 5KB of log data per minute as a maximum, anything more gets discarded. That means that if a person uses your product for an hour, they’ll only rack up 300KB of log data that they’re paying for out of their own hard-earned money. That’s like asking them to download an extra image every hour or so in order to give your company the data you need to make sure the app is working and to understand why it’s not. A reasonable tradeoff.
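A cap like this is simple to enforce on the client. What follows is a minimal sketch, not a production rate limiter: the `LogBudget` class and its method names are hypothetical, it uses fixed one-minute windows, and it takes 5KB to mean 5,000 bytes.

```python
class LogBudget:
    """Discards log events once a per-minute byte budget is used up."""

    def __init__(self, max_bytes_per_minute: int = 5_000):
        self.max_bytes_per_minute = max_bytes_per_minute
        self._window = None  # index of the minute currently being counted
        self._used = 0       # bytes accepted so far in that minute

    def allow(self, size_bytes: int, now_seconds: float) -> bool:
        """Return True if an event of `size_bytes` fits in this minute's budget."""
        window = int(now_seconds // 60)
        if window != self._window:
            # A new minute has started, so the budget resets.
            self._window = window
            self._used = 0
        if self._used + size_bytes > self.max_bytes_per_minute:
            return False  # over budget: the caller should drop the event
        self._used += size_bytes
        return True

    def max_bytes_per_hour(self) -> int:
        """Worst case a user pays for in an hour of continuous use."""
        return self.max_bytes_per_minute * 60


budget = LogBudget()
print(budget.allow(3_000, now_seconds=0.0))    # True
print(budget.allow(2_000, now_seconds=10.0))   # True, exactly at the 5,000 byte cap
print(budget.allow(200, now_seconds=20.0))     # False, discarded
print(budget.allow(200, now_seconds=61.0))     # True, new minute
print(budget.max_bytes_per_hour())             # 300000
```

Time is passed in explicitly rather than read from the clock, which keeps the sketch easy to test. A real client would also have to decide which events are worth keeping when the budget runs out, for example by favouring errors over routine events.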

And again, yes, this is a made up number. The specific number is not as important as why you chose it and whether or not that choice was reasonable. I could have said 500KB per minute (I wouldn’t because not even Facebook’s bloated apps use that much bandwidth for metadata), but the most important part is *why* that number.
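Plugging the cap into the earlier peak estimate gives a rough ceiling for the server side. This sketch takes the 70K figure at face value as users logging simultaneously, assumes every one of them hits the 5KB-per-minute cap (a worst case), and treats 1KB as 1,000 bytes. The names are mine.

```python
PEAK_CONCURRENT_USERS = 70_000              # rounded peak estimate from earlier
MAX_LOG_BYTES_PER_USER_PER_MINUTE = 5_000   # the made-up 5KB cap


def peak_ingest_bytes_per_second(users: int, bytes_per_user_per_minute: int) -> float:
    """Worst-case inbound log volume if every peak user logs at the cap."""
    return users * bytes_per_user_per_minute / 60


rate = peak_ingest_bytes_per_second(
    PEAK_CONCURRENT_USERS, MAX_LOG_BYTES_PER_USER_PER_MINUTE
)
print(f"{rate / 1e6:.1f} MB/s")         # 5.8 MB/s
print(f"{rate * 8 / 1e6:.1f} Mbit/s")   # 46.7 Mbit/s
```

Change the cap to 500KB per minute and the same function returns a hundred times as much, which is a quick way to show an interviewer why the number you pick matters.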

The reason I think taking a strong stance towards how a system should work is the better option is because it shows experience, leadership, and ownership. Sure, I could just let the users’ activity dictate some reasonable number of logs per minute but what if my estimation leads me to some big number like 5MB per hour? My inner nurturer needs to kick in and tell me that logging that aggressively is abusive of user trust. My inner data analyst should be telling me that 5MB of data per hour per user is a glut of data that will mostly be giving redundant signal. User empathy, experience and technical leadership should be the guiding light here over a raw calculation.

In a real company doing real work, that’s what actually happens, after all. Product decisions are made by people trying to make the smartest tradeoffs. There are no right answers. The only wrong answers are ones that drive away users or lead the company to fail. The right answers are many and varied. This is why design and architecture interviews are so important for determining people’s levels.

Keep Going

This discussion is already massive and ranging, yet we haven’t even talked about the concrete details of a client implementation or a server implementation. We haven’t talked about client-side caching. About mobile networks versus WiFi delivery of logs. We haven’t talked about data-saving strategies like user sampling. We haven’t talked about how to collect log data from the server side, storing it long-term. We haven’t talked about how many people it would take to build a system like this. We haven’t talked about how many servers we’ll need to handle the peak traffic. We haven’t talked about how much that will cost. We haven’t talked about countless absolutely critical aspects of the problem. Nor could we in 45 minutes. That’s why you, as the interviewee, must keep going.

At any point if you’re looking to your interviewer for guidance about how to proceed, you’re losing points. Now, don’t get me wrong on this point. Needing to bounce ideas off the interviewer or get clarification is fine. Also, you might be legitimately stuck on the problem. If so, by god man — ask for a pointer. I’ve seen many interviews that resulted in a job offer include feedback like “The candidate started off strong but needed a small pointer on X. After that, she did really well.” Don’t be afraid to ask for help. Still, you should bear in mind that the more help you need, the more likely it is that you’re showing lack of experience, creativity, and/or technical leadership.

I don’t know anything about image processing pipelines, for instance. But, if you give me a problem related to image processing pipelines at scale, I have high confidence that I can keep exploring the problem space making meaningful progress for 45 minutes. In all honesty, I love these kinds of challenges and find them really energising. Not everyone does, but everyone who is a strong technical leader *can* do these kinds of explorations.

Failing this interview

One interesting thing to consider on this point is that Facebook, at least, don’t even do design/architecture interviews for new graduates. They used to, but consistently found that the candidates just didn’t have enough experience with applied systems problems to even begin solving the problem.

Not having experience enough to be able to digest a big problem like this doesn’t always mean you’re fucked. It is an important signal though.

For instance, I was once on an interview loop where we all decided to hire the candidate even though he fared poorly in the architecture interview. His coding skills were strong and he seemed passionate about our company as well as a good person to work with. We didn’t hire him though.

When the interview feedback made it up to the director level, the director stopped us and rejected the candidate. I was confused. Turns out the director did some mental math that I didn’t have the foresight to do. Yes, the coding skills were good and the candidate was a reasonably good communicator, team player, etc. However, he was lacking in design/architecture ability. This on its own is no failure, but the candidate had 8 years of industry experience. He had worked in a strongly technical company for that long without ever taking on enough leadership to get good at reasoning about large scale systems. That is a red flag.

Like Bob Dylan says, you’re either busy being born or busy dying. In this case, our candidate had plateaued in his career, so he was busy dying. It could have been for a million reasons, but to the director making the call, it meant making a tough decision about whether we could expect this candidate to advance in his career.

At Facebook at least, you have to keep advancing at least to level 5, which is roughly equivalent to “senior engineer” or “tech lead” in other environments. After that, you’re doing valuable enough work to stick around without getting promoted again. This candidate would not have gotten a level 5 offer and if he’s already 8 years into his career, it’s unlikely he ever would. So, we would be sentencing him to a short-term career at Facebook at best. It was definitely the right call.

Final points

Architecture interviews are formidable, open-ended problems that you definitely cannot exhaustively solve in the time allotted. If you have no idea how to solve these kinds of problems, you might start by checking out the engineering blogs of companies like Google, Apple, Dropbox, and others. The amazing thing about architecture is that most of the best companies are sharing all their work.

Even if you have no background in the work, you can familiarise yourself with the common patterns of system design by reading diversely from the blogs on the topic, watching YouTube videos of tech talks from conferences, etc. If that feels like cheating, it shouldn’t. After all, the reason engineers at big tech companies are good at solving these kinds of problems isn’t usually because they do it all day, every day. Rather, it’s because they get exposure to the solutions in internal tech talks and write-ups. The information is available. Go turn it into knowledge.

6 thoughts on “Architecture and Systems Design Interviews”

    1. It’s a quite complicated answer, really. The shortest form of it is that a) I had been there a long, long time and was thinking about life after; b) I didn’t really need a job any more; and c) there was a confluence of not-so-great things inside the company that made my job a lot less fun, overall. The sum of all of those meant that I was thinking about leaving for more than a year before I finally pulled the trigger.

  1. Why is MongoDB an incorrect answer? Sorry if it’s supposed to be obvious 😦

    “There are certainly incorrect answers (i.e. don’t say “We can solve this with Mongo DB, because it’s web scale.”).”

    1. Hiya, thanks for the comment. I bet if you asked it, a dozen other people thought the same question but were too afraid of looking dumb to speak up.

      I made that joke in the video to riff on the long-running “MongoDB is webscale” meme that started (as far as I know) with this video:

      https://youtu.be/b2F-DItXtZs

  2. Thanks for giving a textual script for the video. I skipped straight to the text, very convenient with no headphones and super well explained! A side thought tho, would giving text scripts pull people away from watching videos and impact your YouTube metrics?
