I started a new job recently and am trying to get up to speed with DynamoDB. It’s the second time I have used a NoSQL database (I am counting ElasticSearch) and the first time I have used one as a primary database. Thus far, it feels to me like a net loss to use DynamoDB at a startup over Postgres.1 Doing so involves a lot more work for essentially zero gain.
Some preliminaries/caveats:
These aren’t my final thoughts on the matter, and I’m not trying to sophomorically argue that this applies as a general rule.
What I am saying here might apply specifically to DynamoDB, as I will note below, and not necessarily to Mongo or other NoSQL databases.2
I am thinking specifically about the context of “most startups”—businesses whose software products fall more or less into the category of “CRUD apps”, where end-users have a number of different types of associated widgets, and the widgets are updated, directly or indirectly, via user action, and which are not operating at any kind of “scale” that threatens to degrade Postgres performance.3
Those being stated, here is my current thinking:
Managing Schemas with noSQL Databases is Just Worse
A couple of the supposed advantages of noSQL databases are that, by ostensibly not having a schema, teams can:
Add new fields easily without worrying about migrations
Avoid having to backfill data for schema changes
These feel to me simply like myths people believe about using noSQL databases. I do not think these contentions survive even just a little bit of scrutiny.
You always have a schema, but with most noSQL databases, it tends to be enforced and encoded as types in application code, as opposed to at the DB layer. And, while it is of course very easy to add fields to these types, that isn’t the end of the story—unless you are willing to totally abandon existing data that doesn’t fit the new schema, you must have some way of performing “migrations” and/or “backfilling”.4
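To make that concrete, here is roughly what the “schema in application code” ends up looking like in TypeScript (the `UserItem` type and its fields are invented for illustration):

```typescript
// Illustrative only: the "schema" for a DynamoDB item ends up living in
// application code as a type, not in the database itself.
type UserItem = {
  PK: string; // e.g. "USER#<id>"
  SK: string;
  email: string;
  createdAt: string;
  // Added after launch. Every item written before this field existed lacks
  // it, so the type has to admit its absence until some backfill happens.
  marketingOptIn?: boolean;
};
```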
As far as I currently know, there are 2 broad ways of contending with this issue:
Making every new field nullable and living with the sadge—the clutter in your code for checking whether a field exists and has a value, the unknownness of what’s in your own database, the fundamental unreliability of your system, etc.
Devising one’s own method of “migrating” objects, which can itself go one of 2 ways:
Sometimes, especially early on, you could probably just query all of your objects (which isn’t as easy as it should be in DynamoDB, due to the limit on how much data can be returned per request, which is itself kind of annoying, but let’s ignore that for now), update them, and save them back. This is OK for as long as you can get away with it.5 To be clear, doing this means one has now written not only a migration, but a data migration as well.
You could adopt an update-on-read policy, where the “migration” or “transformation” function lives somewhere in your code. This involves writing custom code to check objects for some condition, invoke the appropriate transformation functions, and so on (a rough sketch of what this looks like follows this list). On the plus side, you don’t need to debase yourself by running a one-off “job”. The primary downside is that you have to write more code, because you are now writing schema migrations, data migrations, and supplemental code to execute them. Moreover, it seems intuitively suboptimal that these could be invoked on any read query instead of being handled in only one place and one way.6
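To give a sense of what the update-on-read approach involves, here is a minimal TypeScript sketch using the AWS SDK v3 document client. The table name, key shape, and `schemaVersion` convention are all hypothetical; the point is how much machinery ends up living in the read path:

```typescript
import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import {
  DynamoDBDocumentClient,
  GetCommand,
  PutCommand,
} from "@aws-sdk/lib-dynamodb";

const docClient = DynamoDBDocumentClient.from(new DynamoDBClient({}));
const TABLE_NAME = "app-table"; // hypothetical single-table design

type UserItem = {
  PK: string; // "USER#<id>"
  SK: string;
  email: string;
  schemaVersion?: number;
  marketingOptIn?: boolean; // field added after launch
};

// The "transformation" function: bring a stale item up to the current shape.
// This is the work a SQL migration would normally own.
function migrateUser(item: UserItem): UserItem {
  if ((item.schemaVersion ?? 0) >= 2) return item;
  return { ...item, marketingOptIn: item.marketingOptIn ?? false, schemaVersion: 2 };
}

// Update-on-read: the read path checks the condition, runs the migration,
// and writes the upgraded item back.
async function getUser(id: string): Promise<UserItem | undefined> {
  const { Item } = await docClient.send(
    new GetCommand({
      TableName: TABLE_NAME,
      Key: { PK: `USER#${id}`, SK: `USER#${id}` },
    })
  );
  if (!Item) return undefined;
  const stored = Item as UserItem;
  const migrated = migrateUser(stored);
  if (migrated !== stored) {
    await docClient.send(new PutCommand({ TableName: TABLE_NAME, Item: migrated }));
  }
  return migrated;
}
```

And every code path that reads users now has to remember to go through `getUser` (or something like it) rather than hitting the table directly.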
Comparatively, there is nothing difficult about adding or dropping columns, or backfilling data, in modern SQL workflows that use common migration libraries for your application framework of choice, many of which provide a clear and logical place to fit a data migration among your schema migrations.
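For comparison, a schema change plus backfill in a typical SQL workflow might look something like the following (using Knex purely as a stand-in for whatever migration library your framework provides; the table and column names are made up):

```typescript
import type { Knex } from "knex";

// Schema change and backfill live together in one versioned migration that
// runs exactly once, in a well-defined place.
export async function up(knex: Knex): Promise<void> {
  await knex.schema.alterTable("users", (table) => {
    table.boolean("marketing_opt_in");
  });
  // The data migration sits right next to the schema migration.
  await knex("users").whereNull("marketing_opt_in").update({ marketing_opt_in: false });
  await knex.schema.alterTable("users", (table) => {
    table.boolean("marketing_opt_in").notNullable().defaultTo(false).alter();
  });
}

export async function down(knex: Knex): Promise<void> {
  await knex.schema.alterTable("users", (table) => {
    table.dropColumn("marketing_opt_in");
  });
}
```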
Moreover, even assuming arguendo that it is meaningfully faster to add or remove fields using noSQL databases, the fact of the matter is that frequent broad schema changes are not the norm.
I have worked at early-stage startups in the seed-to-Series-B range my whole (admittedly rather short) career. I have never encountered a scenario where someone came up with a schema for a table, we put it into production, and then we realized we totally whiffed by getting most of the fields wrong. At most, the kinds of issues that arise are along the lines of “oh, we don’t need this column; we did need this other column; actually, this thing should get its own table instead of just being an enum”.
In contrast, and this is a particularly pernicious and inconvenient fact when it comes to using DynamoDB…
It is queries that are unknown upfront, not the data schema
The common line in DynamoDB is that the first step to designing your schema and setting things up is to understand your application’s query patterns—but, axiomatically, for most record types, one will have only a partial understanding of the required query patterns in the early stages of a project. I think that SQL’s more flexible approach of storing data in a normalized fashion and then making arbitrary queries on it more closely aligns with this reality.
In general, it usually isn’t terribly hard to support a new query in DynamoDB, with the caveat that all of the common noSQL pain-points for queries tend to apply. If you have some attribute `foo` on a schema and you suddenly have to query all Users by `foo`, then you can add a Global Secondary Index for all Users, where the GSI PK is something like `USER#<foo>` and the SK is the id. Maybe it isn’t a big deal to go and add this GSI to all of your existing objects, but it is objectively more work than the SQL version of, at most, adding an index to the `foo` column.
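As a rough sketch, the new access pattern might look like this with the AWS SDK v3 document client (the table name, index name, and `GSI1PK` attribute are hypothetical):

```typescript
import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, QueryCommand } from "@aws-sdk/lib-dynamodb";

const docClient = DynamoDBDocumentClient.from(new DynamoDBClient({}));

// Query all Users with a given value of `foo` via a GSI whose partition key
// is GSI1PK = "USER#<foo>". Every existing item needs GSI1PK/GSI1SK written
// onto it before this query will return it.
async function getUsersByFoo(foo: string) {
  const { Items } = await docClient.send(
    new QueryCommand({
      TableName: "app-table", // hypothetical
      IndexName: "GSI1", // hypothetical index name
      KeyConditionExpression: "GSI1PK = :pk",
      ExpressionAttributeValues: { ":pk": `USER#${foo}` },
    })
  );
  return Items ?? [];
}

// The SQL version is, at most: CREATE INDEX users_foo_idx ON users (foo);
```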
And as your application accumulates more queries, even more complicated data storage patterns with more complicated de-and-re-normalization requirements emerge. Take a look, for example, at the examples discussed in this DynamoDB talk/video.7 I think it’s safe to say that all of those examples are objectively more complicated to understand than the SQL equivalent, and that the things that fall under the category of “advanced data modeling” in DynamoDB require no advanced knowledge or special handling in a SQL database. As more and more such queries arise in one’s application, the amount of work it takes to add and support them grows much faster in DynamoDB than with Postgres. Put differently, the work is front-loaded with DynamoDB; you take the hit of more work upfront in the knowledge that the work will be applicable and relevant at large scale and volume, whereas with a SQL database it’s easy to get going, and you expect to hit a point where you need to start devoting more and more time to DB maintenance once your data exceeds a certain threshold. It is precisely the SQL tradeoff that makes the most sense for most early-stage startups.
The Developer Experience is Poor for Startups that Wish to Move Fast
On some level, a lot of the issues I am identifying could be framed in this light. Using DynamoDB adds more work and mental overhead to everything, and does not make for a fast development loop.
Here are some issues in this regard that are more specifically in the realm of DevEx. I suspect several of these are pain-points that are particularly acute with DynamoDB as opposed to with noSQL databases in general.
First, there is usually no objectively correct way of doing things or modeling data. There are too many choices, and this leads naturally to overthinking and time-wastage at both the individual and team level. Take, for example, the somewhat idiosyncratic choice of PK and SK for User entities above. There are pros and cons to that vis-a-vis doing something like PK=`USER#<id>` and SK=`<some other field>`, an approach where you treat DynamoDB like a giant K-V store for most things. That might sound like an anti-pattern (and the DynamoDB talk linked above comes pretty close to saying so), but it can be appropriate. However, at the beginning of a project or product, there isn’t the clarity of constraint to help teams navigate this ambiguity, and I think it’s easy to waste time thinking about this kind of stuff.8 And whatever you come up with, it will take more cognitive load for engineers to deal with than the standard and obvious ways in which people tend to model data in relational databases.
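Purely as an illustration of the kind of thing that gets debated, here are two equally defensible ways of keying the same User item (all values invented):

```typescript
// Two of many plausible ways to key the same User entity. Neither is
// obviously "correct", which is exactly where the bikeshedding comes from.

// Option A: partition by an attribute you expect to query on.
const userItemA = {
  PK: "USER#premium", // USER#<foo>, keyed by some attribute
  SK: "8f14e45f",     // the user's id
  email: "a@example.com",
};

// Option B: treat the table like a giant K-V store keyed by id.
const userItemB = {
  PK: "USER#8f14e45f", // USER#<id>
  SK: "PROFILE",       // or some other field
  email: "a@example.com",
};
```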
In the several months I have been in my current role, I can confidently say I have spent more time discussing DynamoDB keys and design patterns than I have spent discussing Postgres indices or schemas in aggregate over years of using Postgres. One of the benefits of choosing boring technology is that these are things whose basic usage is clear and generally agreed upon, and which engineers are able to easily grasp and be productive with quickly.9
Second, there are things one just cannot do in DynamoDB that one can do in Postgres, such as enforcing the uniqueness of certain fields at the database level; cascading deletes; querying on complex field types (e.g. you can have a column that’s JSONB in Postgres and index individual fields); etc. Not to mention that you can add extensions to Postgres for things like GIS, enhanced text search, and vector search, whereas if DynamoDB is your first and primary DB, you will find yourself adding additional infrastructure to support those use-cases in the future.
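A sketch of what some of that looks like on the Postgres side, again using Knex as an example and assuming hypothetical `users` and `pets` tables (with `pets.owner_id` already in place):

```typescript
import type { Knex } from "knex";

// A few of the DB-level guarantees mentioned above, expressed as a migration.
export async function up(knex: Knex): Promise<void> {
  await knex.schema.alterTable("users", (table) => {
    table.unique(["email"]); // uniqueness enforced by the database
  });
  await knex.schema.alterTable("pets", (table) => {
    table.foreign("owner_id").references("users.id").onDelete("CASCADE"); // cascading deletes
  });
  // Index a single field inside a JSONB column.
  await knex.schema.raw(
    "CREATE INDEX pets_attributes_breed_idx ON pets ((attributes->>'breed'))"
  );
}

export async function down(knex: Knex): Promise<void> {
  await knex.schema.raw("DROP INDEX IF EXISTS pets_attributes_breed_idx");
  await knex.schema.alterTable("pets", (table) => {
    table.dropForeign(["owner_id"]);
  });
  await knex.schema.alterTable("users", (table) => {
    table.dropUnique(["email"]);
  });
}
```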
Additionally, in DynamoDB, you are not supposed to query on arbitrary fields. There is a mechanism to do this, but, speed-wise, it’s basically equivalent to iterating through ALL relevant objects in an array to filter them. Instead, you are supposed to use defined indices.
If you try to do arbitrary filters, this is worse than “get all of your objects and filter in application code”, because DynamoDB has a limit on how much data it will return per request and because filters are applied only to items returned by the initial Key Query; the Key Query results count toward the request data limit, but it could be that none of them survive the filter, so you end up spending an entire request to retrieve nothing and then moving on to the next “page”.
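Here is a sketch of what an “arbitrary filter” ends up looking like with the AWS SDK v3 document client, assuming (purely for illustration) that the relevant user items share a single partition key. Note that an entire page can come back empty even though it consumed a full request:

```typescript
import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, QueryCommand } from "@aws-sdk/lib-dynamodb";

const docClient = DynamoDBDocumentClient.from(new DynamoDBClient({}));

// Filter on an arbitrary, un-indexed attribute. The per-request data limit is
// applied to the items matched by the key condition *before* the filter runs,
// so a page can come back completely empty and you still have to keep going.
async function getUsersWithArbitraryFilter(foo: string) {
  const results: Record<string, any>[] = [];
  let lastKey: Record<string, any> | undefined;
  do {
    const page = await docClient.send(
      new QueryCommand({
        TableName: "app-table", // hypothetical
        KeyConditionExpression: "PK = :pk",
        FilterExpression: "foo = :foo", // not backed by any index
        ExpressionAttributeValues: { ":pk": "USER", ":foo": foo },
        ExclusiveStartKey: lastKey,
      })
    );
    results.push(...(page.Items ?? [])); // may well be zero items
    lastKey = page.LastEvaluatedKey;
  } while (lastKey);
  return results;
}
```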
As alluded to above, the chances you need to be able to quickly add support for arbitrary queries should be considered fairly high.
Third, the development loop in general is just worse. It is extremely refreshing and easy to have a local SQL database that you query as you develop. You can see exactly what your data looks like, and you can easily delete stuff and retest whatever you are working on. There are some Dynamo GUI clients out there, but they are not as good as the bevy of clients one can use for Postgres, and the lack of a real query language for Dynamo places a ceiling on their utility. This is one of the bigger day-to-day pain points I have, and my assumption is that there are better options in this regard for more “managed” offerings like Mongo.
Fourth, any sort of type or schema you have in DynamoDB is a leaky abstraction. No engineer can query for anything without first checking whether the field they are querying is already a key of some kind. How an object’s storage is implemented—what the PK and SK are, and what, if any, the secondary indices are—is required knowledge for doing basic things in DynamoDB. Again, this just adds mental overhead to the development process.
A couple of points in favor of noSQL
I will end by noting one thing that is easier in DynamoDB and other noSQL databases compared to relational databases: supporting polymorphism.
For example, for a startup whose app involves storing different types of pets, a SQL approach would likely want to have one table for all pets, with a column for “species”. But if different species have rather different attributes (e.g., dogs and cats might both have “breed”, but that doesn’t sound applicable to fish, which might instead have some attribute for fresh-water vs. salt-water), this can become unwieldy. Some approaches include having separate tables for the attributes or storing them in a JSONB column. I am not sure that either approach is as good as what you can do with DynamoDB, which is storing totally different types for each species, but giving them all the same keys/indices (e.g., a PK of “PET#<id>”) so that they can all be queried at once, with the type discrimination happening in application code.
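As a sketch of what I mean, the pet types might be modeled as a discriminated union in application code (all names here are invented), with every variant sharing the same key layout:

```typescript
// A discriminated union of pet types, all sharing the same key layout
// (PK = "PET#<id>") so they can be stored and queried uniformly.
interface BasePet {
  PK: string; // "PET#<id>"
  SK: string;
  name: string;
}

interface Dog extends BasePet {
  species: "dog";
  breed: string;
}

interface Cat extends BasePet {
  species: "cat";
  breed: string;
}

interface Fish extends BasePet {
  species: "fish";
  water: "fresh" | "salt"; // an attribute that simply doesn't exist on dogs or cats
}

type Pet = Dog | Cat | Fish;

// Type discrimination happens in application code after the read.
function describe(pet: Pet): string {
  switch (pet.species) {
    case "dog":
    case "cat":
      return `${pet.name} (${pet.species}, ${pet.breed})`;
    case "fish":
      return `${pet.name} (${pet.water}-water fish)`;
  }
}
```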
I think ultimately, it’s probably reasonable for a startup to use Dynamo for specific entity types like the above that could particularly benefit from certain features or advantages of noSQL databases. Polymorphism is one, though it may not rise to the level of meriting additional, special infrastructure.
However, if certain workflows are particularly latency-sensitive and you need really fast writes, or some table in your SQL database looks like it will be disproportionately higher volume than the rest of your application and would be the only reason you would run into issues scaling Postgres later on, it conceivably makes sense to use Dynamo (or whatever noSQL DB fits) only for those facets of the application. It’s honestly not a heavy piece of infrastructure to manage compared to a SQL database; it’s inherently serverless and pay-as-you-go, and you interact with it over HTTP. In the future, this is probably what I would do: augment a Postgres database with Dynamo as needed. I think such a setup would probably scale very well, with less engineering work and overhead than using either technology alone.
Do people still choose MySQL for new projects? Crazy.
I suspect a lot of my complaints and criticisms here, especially the point about developer tooling I make at the end, might not apply to various other noSQL databases, especially ones that are more managed offerings like MongoDB.
Related note: the choreography of standard System Design interviews seems to assume that “NoSQL” as a category consists of a fairly interchangeable set of technologies that share a set of performance characteristics, and a standard set of tradeoffs vis-a-vis SQL databases, but in practice of course the defining characteristic is just that the databases aren’t SQL databases, which may or may not say all that much relevant to the specific problem at hand.
So I am not talking about, e.g., companies whose product offerings require handling large volumes of realtime data.
Really, in noSQL land it’s all “backfilling”—you have to process objects and add or remove fields, and store them back.
Kicking around the premise of this post with a friend of mine, he pointed out there are people who are willing to abandon/strand stale data, in which case he argues someone might as well consider just making a “v2” table in a SQL database.
I will note here that this luckily isn’t necessary with something like ElasticSearch; the superior pattern there is to use a datastore optimized for persistence and storage, as opposed to search, as the source of truth, and to treat the ES indices as transitory. So if you update the schema for the index, you should nuke the whole thing and recreate it. (And yes, I think the `dynamic` setting for the index should be set to `strict`.)
Required watching for anyone using DynamoDB, and also an example of a very good tech talk.
One counter-argument to my point here is that, early on, engineers are always making blind tradeoffs, and we should all just expect to do a bit of stabbing-in-the-dark that we expect to clean up or fix later. But I think that ignores the point that there is a big advantage to choosing boring, ez tech that absolves you of having to make certain choices at all, in order to save time. There would be no equivalent debate about PKs and SKs with Postgres, for example, although I suppose some people might debate using UUIDs for Postgres ids versus bigint or something.
Though I would probably rephrase the principle as “choose standard technology”—what’s boring in one context might be exotic and kind of self-indulgent in another. You could probably opt to use SQLite as a message broker or cache just out of familiarity, but it likely wouldn’t be as optimal a choice as just using Redis.