Part 2/3: Server Side Tech Picks
This is the second part of a three-part series about choosing a technology stack for a startup in the modern (at least as of the time of this writing) world. The last post focussed on building a client-side web app using modern tech. For this post, we’ll dive into the server-side tech choices.
This post is going to focus specifically on implementing a server and database for our use case. I’ll touch briefly on things deeper in the stack than these components, but our devops choices will come through in part 3 of this series.
As we discussed before, we’re building a tool that enhances other tools. We want to make any existing SaaS tool into a dynamic, multi-player tool that feels connected, and to make the experience of working across many SaaS tools connected and coherent rather than a series of disconnected silos. We’re building features that help a team feel like they’re working together, even if they have never set foot in the same room. Our application architecture demands a mixture of real-time and asynchronous features.
These constraints dictate many key pieces of our server-side product infrastructure. For one, we need authentication and sessions. We also need a way to communicate between the client and the server that works both synchronously and asynchronously. In particular, we need a server that can push updates to the client in real time.
The pivotal constraint for us is that we want a blisteringly fast pace of development for the product. Engineers working in our stack should be able to ship features as quickly as possible, experiment, throw things away, and redeploy changes in minutes. Our current continuous integration runs in about 3 minutes.
Why this need for speed? Well, for one, it just feels great to work in a high-speed tech stack. For two, we’re still early in our product’s evolution. Scaling our server-side infra is not a constraint yet. In fact, if we focussed on building an infinitely scalable server, we’d be wasting time. Right now, we’re incrementally transitioning from early user tests to true product market fit. For our stage of life, things like maximum I/O parallelisation or complete database race condition safety just don’t matter. What does matter is being able to iterate on product features, test them with real users, understand if they work or not, and then iterate again. So, our tech stack choices reflect the need for this flexibility over alternative choices (e.g. choosing a “high performance” language like Go or Rust; division of responsibilities like microservices; etc.).
One of the very first questions to answer when designing a server architecture is ‘What language/run-time should the server be built in?’ For us, this answer was trivially easy to reach — NodeJS. While server-less architectures using AWS Lambda or Firebase are becoming quite trendy, they suffer from some really unfortunate deployment pathologies. Our philosophy is that it should be stupidly easy to know what is deployed — from the server to the database. Because of this, we opted for a TypeScript monorepo which deploys atomically as a NodeJS server. Sure, this means we can’t re-deploy individual functions at a time as we could if this were a Firebase app. Honestly though, the server-less trend seems like a really false economy in practice. With a monolith, we actually know what versions of the code are running at any given moment and we can trivially guarantee that all the pieces fit together properly all the way from the database schema to the client-side React UI components. For similar reasons, we did not build a microservices-based backend.
Using multiple languages creates a surprising number of barriers to contribution, the first and most obvious of which is that not everyone will know all the languages. On top of that, you instantly lose the ability for one engineer to make sweeping changes across the entire codebase with a single commit.
Often referred to as “codemods” (see https://github.com/facebook/codemod), these sorts of changes enable a small number of engineers to maintain massive amounts of code. If we’d chosen the industry standard practice of multiple repositories hosting multiple projects in different languages, such codemods would be impossible. They’d be relegated to at most one repository at a time and likely even worse than that because not all the repos would be in the same language. This leads to differing code styling standards, differing code review standards, differing CI pipelines, differing linters. And infinite — I mean — infinite bike shedding about the right way to do things. Ain’t nobody got time for that.
If you read Part 1 of this series, you won’t be surprised to see TypeScript here. Most of the mixed blessings of TypeScript apply on the server-side as much as they did on the client, but for NodeJS application development, TypeScript offers a lot of value. Server-side development tends to be a bit less messy with respect to types than client-side. Part of the fiddly-ness of TypeScript in the browser is the DOM API itself. For server side development, you just don’t have to fool with it, which is pretty great.
For data storage, we’ve opted for a tried and true, fast, stable, relational database. Server-less apps running against schema-free databases are all the rage because they make for ultrafast prototyping. If you’re creating a build-and-forget backend for a hackathon, that stack choice is totally reasonable. However, if you need to maintain your data over a long period of time, schema-free databases like DynamoDB or MongoDB might trip you up unexpectedly. I can’t tell you how many teams of engineers I’ve interacted with who don’t properly migrate their schema-free database because “but it’s schema-free, you don’t have to know what fields you’ve set on every record!” That’s great if you don’t want to be troubled by keeping track of what you’ve stored…
However, if you want your data to match the code that is operating on it, you need to manage this relationship. There’s nothing more difficult to debug than random bug reports that come from users in the wild because some poor user wrote some records to the database before you added some field or before you removed it and now when she loads the app, it crashes because you can’t read some property of undefined. So, so many engineer lifetimes have been wasted debugging problems that boiled down to “Oh, wait, for this one random user the data is actually corrupt because they happened to hit the save button on the day before we added the spline-reticulation-rate field, so her app can never load without errors. Fml.” Yet, these kinds of bugs happen all the time in modern codebases specifically because the engineers don’t operate with some discipline about their data. Ho hum. Not us.
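To make that failure mode concrete, here is a minimal sketch in plain TypeScript. The Settings shape, the splineReticulationRate property name, and the fallback value are all made up for illustration; nothing here comes from our actual codebase.

```typescript
// Hypothetical record shape that the current code expects.
interface Settings {
  splineReticulationRate: { perSecond: number };
}

// A record written before the field existed. A schema-free store hands it
// back as-is, so the static type is a promise the data does not keep.
const legacyRecord = JSON.parse('{"theme": "dark"}') as Settings;

// Naive read: compiles fine, throws a TypeError for legacy records.
function readRateUnsafe(settings: Settings): number {
  return settings.splineReticulationRate.perSecond;
}

// Defensive read: works, but every call site has to remember the fallback.
function readRateDefensive(settings: Partial<Settings>, fallback = 1): number {
  return settings.splineReticulationRate?.perSecond ?? fallback;
}

let crashed = false;
try {
  readRateUnsafe(legacyRecord);
} catch (error) {
  crashed = error instanceof TypeError;
}

console.log(crashed); // true
console.log(readRateDefensive(legacyRecord)); // 1
```

The defensive version is the tax you pay forever when old records are never brought up to date. Backfilling the field once means the naive read is actually safe.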
But I agree… maintaining the schema of your data is a chore. That’s why we use a migration framework which handles upgrading and downgrading our database for us. All our engineers have to do is make sure we’ve written correct SQL migrations to add and remove fields or tables and correctly transform the existing data. After that, our CI tooling handles the rest across our environments. Easy peasy, lemon squeezy.
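As a rough sketch of what a paired upgrade/downgrade migration looks like, consider the following. Sequelize's own migration files export up and down functions; to keep this example self-contained it uses plain lists of SQL strings instead, and the Migration shape, the table and column names, and the two helper functions are hypothetical.

```typescript
// Hypothetical migration shape: paired SQL for upgrading and downgrading.
interface Migration {
  name: string;
  up: string[];
  down: string[];
}

const createSettings: Migration = {
  name: "0001-create-settings",
  up: ["CREATE TABLE settings (id SERIAL PRIMARY KEY, theme TEXT NOT NULL)"],
  down: ["DROP TABLE settings"],
};

const addSplineRate: Migration = {
  name: "0002-add-spline-reticulation-rate",
  up: [
    // Add the column, backfill existing rows, then forbid missing values.
    "ALTER TABLE settings ADD COLUMN spline_reticulation_rate INTEGER",
    "UPDATE settings SET spline_reticulation_rate = 1 WHERE spline_reticulation_rate IS NULL",
    "ALTER TABLE settings ALTER COLUMN spline_reticulation_rate SET NOT NULL",
  ],
  down: ["ALTER TABLE settings DROP COLUMN spline_reticulation_rate"],
};

const migrations: Migration[] = [createSettings, addSplineRate];

// Statements needed to bring a database up to date, in declaration order.
function pendingUp(all: Migration[], applied: string[]): string[] {
  const statements: string[] = [];
  for (const migration of all) {
    if (applied.indexOf(migration.name) === -1) {
      statements.push(...migration.up);
    }
  }
  return statements;
}

// Statements needed to roll back the most recently applied migration.
function rollbackLast(all: Migration[], applied: string[]): string[] {
  for (let i = all.length - 1; i >= 0; i--) {
    if (applied.indexOf(all[i].name) !== -1) return all[i].down;
  }
  return [];
}

// A database that has only run the first migration still needs the second.
console.log(pendingUp(migrations, ["0001-create-settings"]).length); // 3
```

The important part is the backfill step in the middle of the up list: the existing rows get transformed in the same commit that changes the code reading them.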
Sequelize as an ORM
Using an ORM in a large-scale, high-performance application is usually a fool’s errand. Trying to get performance out of queries you don’t have direct control over is a bit like trying to assemble a ship-in-a-bottle while wearing oven mitts. Fortunately for us, we’re not a large-scale, high-performance application. Yet. For now, speed of development is the dominant factor in the Big-O equation of our company’s success. We will almost certainly rip out Sequelize’s ORM functionality over time, replacing it with carefully designed indexes and queries. For now though, we can add types, find, update, and delete stuff, and generally crack on with the work without stressing about every join clause or order by. I haven’t written a single raw SQL query apart from a CREATE / ALTER / DROP TABLE query in months.
For the TypeScript savvy folks, you might be surprised to see Sequelize as our pick for ORM rather than TypeORM, given that TypeORM is… well.. typed. We started with TypeORM and, within a month, hit so many issues and so many known-but-unfixed problems with TypeORM that we ended up ripping it out. As of the time of this writing, the TypeORM project has more than 2k open issues, which is actually staggering for a codebase with only 4k commits in it. Your mileage may vary, but TypeORM was not the ORM for us.
Sequelize for migrations
Sequelize also offers a really nice migration framework, which we’re likely to keep well into the future. I’ve worked previously with schema-free codebases (like Mongo) and codebases where the schema was maintained independently of the codebase. Having migrations tied to version control is an absurdly better world to be in. I’ll never go back. Don’t make me go back.
GraphQL with Apollo
If you read Part 1 of this series, you’ll know I have mixed feelings about GraphQL. It’s heavy. Like heeeeaaaavvvvy. You have to do a heap of work just to get the basic value of the abstraction. If the entities in your system don’t have a lot of complex relationships, it’s probably not for you. However, in our case, we have lots and lots of relationships in our schema. For instance, messages can have mentions, which themselves can have users. In this sort of application, GraphQL begins to show its value peeking out over the mountaintops of extra work you need to do to get it up and running.
Why? Well, because a user is such a core type in our system, we get an economy of scale out of defining it once in GraphQL. Once we’ve done the heavy lifting of GraphQL-ifying the user type, we can use it over and over again everywhere that it comes up. This is because GraphQL makes entity composition trivial. The more types you have and the more relationships they share, the more valuable GraphQL becomes.
Let’s compare that with a vanilla JSON API. If we started both versions of the server on the same day, Team Vanilla JSON API would easily race ahead at the start. Then, as the number of entities and number of relationships between entities grows, the Team Vanilla JSON API would start to become ponderous — needing to reassemble the same queries and transformations again and again. Team GraphQL by contrast would get left on the starting line, toiling away trying to get their resolvers and mutations in order. However, as the race wore on, Team GraphQL would start to make huge leaps. Once the core types were defined and GraphQL-ified, Team GraphQL would basically never touch them and would instead only be building the new entities. They wouldn’t have to worry about which fields of the sub-sub-queries to expose, because their earlier work would already account for that. Over the course of a year with lots of product iteration, Team GraphQL would have enabled way more product iteration, despite taking longer to lurch into motion.
A basic example would be the @-mentions in our chat. To render a message with an @-mention in the UI, you have to fetch: the message, the message sender’s user data, the mention data, and the user data of the user that was mentioned. Since we have a well-defined User GraphQL entity, we don’t have to do any extra work to fetch the data required to display both the sender’s user info and the mentioned user’s user info. They’ll leverage the same GraphQL entity. To achieve this same value with a JSON API, we’d have to write custom server-side code to fetch the message, then fetch its sender’s user info, decide which fields to expose, then parse out the @-mention from the message, then fetch the mentioned user’s info, then decide which of those fields to expose in the response, then assemble a JSON payload that includes all of this information. We’d also have to think about things like batching of these fetches if we’re rendering a list of messages. And if we later on modified the User entity to have some new fields, we’d have to maintain all that custom JSON API endpoint code, too. What a pain. While GraphQL slows us down at first, it speeds us up shortly thereafter in a way that vanilla JSON APIs won’t catch up with.
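The define-once, reuse-everywhere idea can be sketched without any GraphQL library at all. What follows is plain TypeScript that mimics how resolvers compose; it is not Apollo's API, and the row shapes, sample data, and function names are invented for illustration.

```typescript
// Hypothetical rows as they might come back from the database.
interface UserRow { id: string; name: string; email: string }
interface MessageRow { id: string; senderId: string; text: string }
interface MentionRow { id: string; messageId: string; userId: string }

const users: UserRow[] = [
  { id: "u1", name: "Ada", email: "ada@example.com" },
  { id: "u2", name: "Grace", email: "grace@example.com" },
];
const messages: MessageRow[] = [{ id: "m1", senderId: "u1", text: "Hi @Grace" }];
const mentions: MentionRow[] = [{ id: "x1", messageId: "m1", userId: "u2" }];

// The public User type is defined once. Which fields are exposed is decided
// here and nowhere else.
interface User { id: string; name: string }

function resolveUser(id: string): User {
  const row = users.filter((u) => u.id === id)[0];
  if (!row) throw new Error(`Unknown user ${id}`);
  return { id: row.id, name: row.name }; // email is deliberately not exposed
}

// Mention and Message reuse resolveUser instead of re-deciding the shape.
interface Mention { id: string; user: User }
interface Message { id: string; text: string; sender: User; mentions: Mention[] }

function resolveMention(row: MentionRow): Mention {
  return { id: row.id, user: resolveUser(row.userId) };
}

function resolveMessage(id: string): Message {
  const row = messages.filter((m) => m.id === id)[0];
  if (!row) throw new Error(`Unknown message ${id}`);
  return {
    id: row.id,
    text: row.text,
    sender: resolveUser(row.senderId),
    mentions: mentions
      .filter((m) => m.messageId === id)
      .map((m) => resolveMention(m)),
  };
}

const resolved = resolveMessage("m1");
console.log(resolved.sender.name, resolved.mentions[0].user.name);
```

If User later grows a new public field, only resolveUser changes, and both the sender and the mentioned user pick it up. That is the economy of scale in miniature; a real GraphQL server adds field selection and batching on top of it.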
But let me be real with you: the itch to just rip all this GraphQL out and replace it with a dead simple JSON API that stringifies data pulled straight from a raw SELECT * FROM ... statement is… strong. When working with a GraphQL backend, you feel the pain of all the overhead more acutely than working in the client. Again, this is a reasonable choice for us because we want to be able to iterate on the product as fast as possible. Whatever form our UIs take, the backend data types are quite stable. So, GraphQL unlocks a lot of value for present and future product development. If I had it to do all over again, I’d probably choose it again, begrudgingly. I really wish it was less heavyweight. Codegen can theoretically save you some of the pain of defining all these JS-to-GraphQL mappings, but then you also have GraphQL and codegen to worry about. There’s no free lunch there.
In our application, real-time interactions are hugely important. We’re building a product that makes people feel connected and present, even when they’re not in the same room together. Surprising moments like “Oh, look, my teammate and I are on the same page at the same time” make the experience of working together-but-apart much more serendipitous and rewarding, like being in the office together. To enable that, we have loads of real-time, in-memory state that reflects the ephemera of day-to-day work. Without something real-time like WebSockets, we would struggle to keep entire teams in sync. If I add an annotation to a page in our shared SaaS tool, my teammate needs to know about it instantly. We can’t wait for the teammate to reload the page whenever it might next happen.
Apollo helps a lot with this. Using GraphQL subscriptions with Apollo means we have a really clean API for triggering data pushes down to all the currently connected clients. We also have the flexibility to put these pushes before, after, or concurrently with database transactions, depending on the desired user experience of the particular real-time feature. Lovely.
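The ordering choice between pushing and persisting is easier to see in code. This is a minimal in-memory stand-in written in plain TypeScript, not Apollo's subscription API; the AnnotationFeed class, the payload shape, and the array standing in for the database are all assumptions made for the sake of the sketch.

```typescript
// Hypothetical payload pushed to connected clients.
interface AnnotationAdded { pageId: string; text: string }
type Subscriber = (payload: AnnotationAdded) => void;

// Minimal in-memory stand-in for a subscription transport.
class AnnotationFeed {
  private subscribers: Subscriber[] = [];

  subscribe(subscriber: Subscriber): () => void {
    this.subscribers.push(subscriber);
    // The returned function unsubscribes, e.g. when a client disconnects.
    return () => {
      this.subscribers = this.subscribers.filter((s) => s !== subscriber);
    };
  }

  publish(payload: AnnotationAdded): void {
    for (const subscriber of this.subscribers) subscriber(payload);
  }
}

const log: string[] = [];
const feed = new AnnotationFeed();
const saved: AnnotationAdded[] = []; // stands in for the database

function save(payload: AnnotationAdded): void {
  saved.push(payload);
  log.push(`saved:${payload.text}`);
}

// Push after the write: clients only ever see persisted data.
function addAnnotationDurably(payload: AnnotationAdded): void {
  save(payload);
  feed.publish(payload);
}

// Push before the write: lower perceived latency, at the risk of showing
// something that later fails to save.
function addAnnotationOptimistically(payload: AnnotationAdded): void {
  feed.publish(payload);
  save(payload);
}

const unsubscribe = feed.subscribe((p) => log.push(`pushed:${p.text}`));
addAnnotationDurably({ pageId: "p1", text: "first" });
addAnnotationOptimistically({ pageId: "p1", text: "second" });
unsubscribe();
addAnnotationDurably({ pageId: "p1", text: "third" });

console.log(log.join(", "));
// saved:first, pushed:first, pushed:second, saved:second, saved:third
```

Which ordering is right depends on the feature. Ephemeral presence can afford to be optimistic; anything a user expects to find again after a reload probably wants the durable path.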
The dominant tradeoff of our server-side codebase is product development velocity over literally everything else. We haven’t chosen the most strongly typed language. TypeScript is a fairly good type checker, but it’s far from the most opinionated type system. We certainly haven’t chosen the most high-performance programming language. A Golang implementation of our backend would probably be considerably more efficient, perhaps by one or two orders of magnitude. However, with Golang, we’d also pay the bill of having two languages, two sets of frameworks, two sets of idioms, two sets of code review standards, two different development environments to maintain, etc. Each one of these small frictions adds up to lost developer velocity.
We’re a pre-product market fit company. We don’t yet know what features will be the winning ones that shape the company’s future. We don’t yet know what the customer growth trajectory will look like. We also don’t know how much we’ll need to scale the engineering team to match the adoption we get once we launch. Success for us right now is definable as building out product features as fast as possible, testing them with users, and honing our product vision based on those outcomes. Anything that gets in the way of that iteration loop is a drag on our success.
One view of how to proceed would be to cobble together as many open source libraries as possible to build some vaguely product-shaped Frankenstein’s monster just to test the product in front of users. Many, many companies take this approach. Some good, most of them… bad. You often see this approach led by a non-technical founder who is outsourcing development to an inexpensive, remote dev team. This overall approach is deeply unwise except in the case that you never need to scale the product or infrastructure. It’s almost always the wrong tradeoff because it trades quick-and-dirty features today against your ability to grow with your success tomorrow. What happens if your product gains any traction at all? If you take this approach, the first thing that happens is that the application falls apart and you have to scramble to find great developers to undo and rebuild all the work you already paid for once, if you can get them at all. Most talented engineers will take one look at a tech stack in such a poor state and keep looking for a better opportunity. What you end up with is contractors who are happy to saunter leisurely through your technical debt for a handsome day rate. That’s penny-wise, pound-foolish thinking.
Another view of how to proceed is to jump straight to a perfect implementation of the product and infrastructure on the first pass. If you’re literally psychic or luckier than Harry Potter high as a kite on Felix Felicis, you can make this tradeoff. It requires more time, more money, and more prescience than most startups have, certainly more than we do. For the mere mortals and muggles among us, building the “perfect” infrastructure is basically meaningless. Until you have real users using your product, you don’t know what your product is. You don’t know what usage patterns will create what pressures in your backend requiring optimisation. You don’t know what killer features you’ll discover when someone in a user interview casually opines that they “just wish the product did X” — where X is some unbelievably brilliant insight about the user experience that your team never had. So, this approach is infeasible and unwise, despite being extremely tempting to many tech teams, especially ones filled with PhD engineers or engineers from FAANG-esque companies who want to flex.
The wisest approach is a pragmatic compromise: move as fast as possible while building the application in a way that doesn’t block you from changing and optimising it later. That’s where we’ve landed. We have a really fast pace of product iteration and a super tight feedback loop between user testing and product development. Our stack is optimised exactly for this purpose. It hinges on the fact that a single engineer can make changes across the full stack in a single commit and see the outcome in production 3 minutes later. No waiting for code review from 5 different people, working with 5 different pull requests to 5 different repos, all written in different languages. This is the way.
Controversial Hot Takes
Monolith versus microservices versus server-less
We chose a monolithic backend. This means everything is in-memory on the same machine until it can’t be due to scalability constraints. Of course we will eventually have services that get sharded out from the monolith. However, we won’t call them microservices. We’ll call them services. And they won’t be arbitrarily small function-calls-over-REST like many microservices tend to be in practice. Instead, they’ll be whatever size they need to be to coherently solve the problem they need to solve. Everything that has similar performance and/or storage needs will get sharded out as a service.
As of the time of this writing, the most likely first candidate for this is a WebSocket termination service/load balancer that will act as a façade to our backend, allowing WebSocket connections to stay connected between service restarts.
What will the next service be? I don’t know. Ask me again when we’ve had sufficient growth and user adoption to know what parts of our monolith end up dominating any of: CPU, memory, or network. When one piece of the application is behaving sufficiently differently from the rest of the application that they no longer make sense running in the same process, that’s when we’ll break it out.
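One way to keep that option open cheaply is to put a narrow interface around any piece of in-process state that looks like a future service. The sketch below is an assumption about how such a seam might look, not a description of our code; PresenceStore, InMemoryPresence, and describePage are hypothetical names.

```typescript
// Hypothetical boundary around state that might one day become a service.
interface PresenceStore {
  join(pageId: string, userId: string): void;
  leave(pageId: string, userId: string): void;
  whoIsOn(pageId: string): string[];
}

// Today: plain in-process state inside the monolith.
class InMemoryPresence implements PresenceStore {
  private pages: { [pageId: string]: string[] } = {};

  join(pageId: string, userId: string): void {
    const present = this.pages[pageId] || [];
    if (present.indexOf(userId) === -1) present.push(userId);
    this.pages[pageId] = present;
  }

  leave(pageId: string, userId: string): void {
    this.pages[pageId] = (this.pages[pageId] || []).filter((id) => id !== userId);
  }

  whoIsOn(pageId: string): string[] {
    // Return a copy so callers cannot mutate internal state.
    return (this.pages[pageId] || []).slice();
  }
}

// Callers depend on the interface only, which keeps the seam visible.
function describePage(store: PresenceStore, pageId: string): string {
  const present = store.whoIsOn(pageId);
  return present.length === 0
    ? "nobody here"
    : present.join(" and ") + " on this page";
}

const presence = new InMemoryPresence();
presence.join("pricing", "ada");
presence.join("pricing", "grace");
console.log(describePage(presence, "pricing")); // ada and grace on this page
```

A networked implementation would almost certainly need these methods to return promises, so the swap is not entirely free. The point is only that the callers are already funnelled through one place when the day comes.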
Server-less architecture is a non-starter for us. Because our application hinges on real-time, push-to-client interactions, there is definitely going to be a server. We could use some pre-baked WebSocket-as-a-service utility from an existing provider, but that’s just paying someone else to handle complexity that our app definitely has. The lock-in that these sorts of tradeoffs create is very likely to kill our flexibility down the road. Right now, we can choose literally any path we like for how to tweak the server to handle larger scale traffic. If we were paying Google Cloud for the privilege of using their real-time infrastructure, we’d be building our application around that constraint. We’ve got a sufficiently talented engineering team to be able to manage that complexity ourselves, and the tradeoff of getting to shape our application around our actual constraints rather than the arbitrary constraints imposed by a 3rd party service is absolutely worth it.
Postgres versus Mongo/DynamoDB/etc.
The MERN stack gets a lot of hype these days. In many coding bootcamps, it’s taught as the de facto standard stack for modern development. Call us crazy, but we like doing far out things like selecting data by date ranges. Or joining data by foreign keys. These arcane arts are not lost, though their power has been largely forgotten by recent generations of developers. We don’t currently and likely never will use MongoDB in our stack. There is some potential for a caching layer like Redis to benefit us, but at the moment we can do everything flawlessly in Postgres.
I talked a lot about the downside of schema-free databases earlier in the post, but the same holds true here. We want to know exactly what our data looks like. We want to be able to shape it commit-by-commit with clearly defined migrations. Schema-free databases are easier at the start, but then tie your hands with respect to what features you can build later without writing your own data processing to create intermediate data representations and caches. And woe be unto you who writes some data to the DB in a different way that doesn’t know to update the aggregations and lists stored elsewhere. No thanks. We’ll take the burden of a relational datastore any day.