Notion ran on a single Postgres monolith until early 2021. They wrote a blog post about sharding their database to improve stability and performance. Notes and comments -
Deciding when to shard
- the user experience became visibly poor, with page loads and other interactions taking a long time
- internal forecasts showed that the monolith would not be able to serve the projected growth in data
- frequent on-call pages for engineers, with CPU spikes and similar issues
- the Postgres vacuum process, which reclaims unused space, began to stall frequently
- there was a risk of transaction ID wraparound in Postgres, which would have led to a full DB outage if vacuuming didn't run properly
Transaction ID wraparound
Postgres assigns each transaction a unique ID from a counter with an upper limit. The vacuum process ensures that counter values which are no longer needed can be reused. With heavy writes and without proper vacuuming, the counter can reach its maximum (~2 billion) and wrap around, which can cause all sorts of issues, as the temporal ordering of transactions (the order in which transactions are applied) is no longer guaranteed.
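To make the wraparound risk concrete, here is a minimal monitoring sketch (not from the Notion post, just an illustration) in TypeScript with the node-postgres client: it reports how far each database's oldest unfrozen transaction ID is from the ~2 billion limit.

```typescript
import { Client } from "pg";

// Hypothetical on-call check: how close is each database to transaction
// ID wraparound? age(datfrozenxid) is the number of transaction IDs
// consumed since vacuum last "froze" old rows in that database.
async function checkWraparoundRisk(connectionString: string) {
  const client = new Client({ connectionString });
  await client.connect();
  try {
    const { rows } = await client.query(
      `SELECT datname, age(datfrozenxid) AS xid_age
       FROM pg_database
       ORDER BY xid_age DESC`
    );
    for (const row of rows) {
      // Postgres forces aggressive vacuums well before the hard limit
      // (autovacuum_freeze_max_age defaults to 200 million), but the
      // hard ceiling is about 2^31 ≈ 2.1 billion transaction IDs.
      const pct = (Number(row.xid_age) / 2 ** 31) * 100;
      console.log(
        `${row.datname}: ${row.xid_age} XIDs old (${pct.toFixed(1)}% of the wraparound limit)`
      );
    }
  } finally {
    await client.end();
  }
}
```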
Sharding Scheme
There are generally two types of sharding schemes -
- the application is aware of the shards, holds connections to them, and hits the right shard for each request.
- a proxy layer in between routes requests to the right shards and returns the consolidated result. This is usually done via a packaged third-party sharding solution (e.g. Citus, Vitess).
Notion chose application-level sharding, as they wanted control over data distribution and a packaged solution would have been a black box with opaque clustering logic.
- the core of Notion's data model is what they call a block. Everything you see in Notion (text, image, list, table, page) is a block. These are like Lego sets, which recursively compose to create something much greater than the sum of its parts.
- each block is a row in the block table, which was the biggest and most important table. It was picked first for sharding.
- a block can also reference other tables (workspace, discussion, comments etc.). To preserve transactional guarantees, every table the block table can reach via foreign keys was sharded alongside it.
- when sharding, you really want to minimise cross-shard joins, as they can be slow. Notion partitioned by workspace ID, since most work happens within a single workspace, which avoids cross-shard joins.
- physical database vs logical shard - a common pattern with sharding is that 100 shards won't necessarily mean 100 physical databases on different machines. If you did that, you would only be designing for today: adding another shard later would leak details into the application layer. You could instead provision 500 physical databases up front, but then you would be wasting hardware. The neat solution, which Postgres schemas allow for, is to create a large number of logical shards spread across a much smaller number of physical machines. The application thinks there are 500 shards from day 0, but you don't need that many physical machines.
- Notion limited a single table to 500 GB and a physical database to 10 TB. When picking the number of logical shards and physical databases, they wanted the shards to divide evenly across the physical machines.
- finally, they went with 480 logical shards over 32 physical databases, i.e. 15 schemas per machine.
- Instagram did something similar by creating thousands of "logical shards" (built with Postgres schemas) on a few physical machines. This helps with further scaling, as you can simply move logical shards to new physical machines without any application-layer changes.
- 480 was chosen because it has a lot of numeric factors, so there are many options for the number of physical databases. For their next jump they can move to 40 machines, and the subsequent factors of 480 are 48, 60, 80, 96, ..., so the jump in the number of physical machines stays reasonable at each stage.
- for the same table, instead of using native partitioning with one partitioned table per database, standalone tables were created in each schema (schema1.block, schema2.block etc.); see the provisioning sketch after this list.
- all routing from workspace ID to logical schema happens in the application; a rough sketch of this routing follows.
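To illustrate the application-level routing and the logical-to-physical mapping described above, here is a minimal TypeScript sketch. The hash-mod routing function, the schema naming, the host names, and the pool layout are all assumptions for illustration; the notes don't specify Notion's exact routing code.

```typescript
import { createHash } from "crypto";
import { Pool } from "pg";

// Assumed layout: 480 logical shards (Postgres schemas) spread evenly
// over 32 physical databases, i.e. 15 schemas per machine.
const LOGICAL_SHARDS = 480;
const PHYSICAL_DBS = 32;
const SCHEMAS_PER_DB = LOGICAL_SHARDS / PHYSICAL_DBS; // 15

// One connection pool per physical database (host names are made up).
const pools: Pool[] = Array.from(
  { length: PHYSICAL_DBS },
  (_, i) => new Pool({ host: `pg-shard-${i}.internal`, database: "notion" })
);

// Map a workspace UUID to a logical shard. A simple hash-mod scheme is
// assumed here; the actual function could differ.
function logicalShardFor(workspaceId: string): number {
  const digest = createHash("md5").update(workspaceId).digest();
  return digest.readUInt32BE(0) % LOGICAL_SHARDS;
}

// The schema name encodes the logical shard, e.g. "schema042".
function schemaFor(shard: number): string {
  return `schema${String(shard).padStart(3, "0")}`;
}

// Physical placement: contiguous runs of 15 schemas per database.
function poolFor(shard: number): Pool {
  return pools[Math.floor(shard / SCHEMAS_PER_DB)];
}

// Example: fetch all blocks for a workspace from the right shard.
async function blocksForWorkspace(workspaceId: string) {
  const shard = logicalShardFor(workspaceId);
  return poolFor(shard).query(
    `SELECT * FROM ${schemaFor(shard)}.block WHERE workspace_id = $1`,
    [workspaceId]
  );
}
```

Because the application only depends on the shard → schema → pool mapping, growing from 32 to 40 physical machines (480 / 40 = 12 schemas per machine) just means moving schemas and updating poolFor; the shard count and hashing stay the same, so nothing else in the data layer changes.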
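For the standalone-tables-per-schema point, here is a hypothetical provisioning sketch: each logical shard is a plain Postgres schema containing its own block table rather than a partition of one parent table. The column list is illustrative, not Notion's actual schema.

```typescript
import { Pool } from "pg";

// Create the schemas assigned to one physical database, each with its
// own standalone block table (names and columns are made up).
async function provisionSchemas(pool: Pool, shardIds: number[]) {
  for (const shard of shardIds) {
    const schema = `schema${String(shard).padStart(3, "0")}`;
    await pool.query(`CREATE SCHEMA IF NOT EXISTS ${schema}`);
    // A standalone table, not a child partition of a parent table.
    await pool.query(`
      CREATE TABLE IF NOT EXISTS ${schema}.block (
        id uuid PRIMARY KEY,
        workspace_id uuid NOT NULL,
        type text NOT NULL,
        properties jsonb
      )`);
    // Partitioning by workspace means most lookups filter on it.
    await pool.query(`
      CREATE INDEX IF NOT EXISTS block_workspace_idx
      ON ${schema}.block (workspace_id)`);
  }
}
```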
Transition Phase
- double writes - write to both the old and the new databases. This was done via an audit log, a table that keeps track of all writes to the old database. A catch-up process iterates over this log and applies each update to the new database (sketched after this list).
- backfill - migrate old data to the new databases after double-writing has started (took around 3 days with 96 CPUs).
- verification - ensure the integrity of the data
- using a verification script, comparing a contiguous range of the UUID space starting from a given value (sampled randomly); a rough sketch follows this list
- dark reads - reading from both databases in the application, comparing the results, and discarding one of them.
- switch-over - the actual migration. They took a scheduled 5 minutes of downtime to let the catch-up script finish running, then switched over.
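A rough sketch of the double-write catch-up idea from the first bullet, assuming an audit-log table with an applied flag; the real audit-log schema and the apply logic for the sharded side aren't described in the notes.

```typescript
import { Pool } from "pg";

const oldDb = new Pool({ host: "pg-monolith.internal", database: "notion" });

// Replay one batch of audit-log entries into the new sharded database.
// `applyToNewDb` is assumed to upsert a block row via the shard-aware
// data layer. Returns the batch size; 0 means we have caught up.
async function catchUpOnce(
  applyToNewDb: (blockRow: Record<string, unknown>) => Promise<void>
): Promise<number> {
  // Pull unapplied audit-log entries in commit order.
  const { rows: entries } = await oldDb.query(
    `SELECT id, block_id FROM audit_log
     WHERE applied = false
     ORDER BY id
     LIMIT 1000`
  );

  for (const entry of entries) {
    // Re-read the current state of the row from the old database and
    // apply it to the new one, then mark the log entry as done.
    const { rows } = await oldDb.query(`SELECT * FROM block WHERE id = $1`, [
      entry.block_id,
    ]);
    if (rows.length > 0) {
      await applyToNewDb(rows[0]);
    }
    await oldDb.query(`UPDATE audit_log SET applied = true WHERE id = $1`, [
      entry.id,
    ]);
  }
  return entries.length;
}
```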
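And a sketch of the verification idea: pick a random UUID, read a contiguous range of block rows starting from it out of both the old and the new databases, and compare. The query helper for the new side is assumed to fan out to the relevant shards and merge the results; table and column names are illustrative.

```typescript
import { randomUUID } from "crypto";
import { Pool } from "pg";

// Compare one contiguous slice of the UUID keyspace between the old
// monolith and the new sharded setup, starting from a random UUID.
async function verifyRange(
  oldDb: Pool,
  queryNewDb: (sql: string, params: unknown[]) => Promise<{ rows: any[] }>,
  rangeSize = 1000
): Promise<string[]> {
  const start = randomUUID(); // randomly sampled starting point
  const sql = `SELECT * FROM block WHERE id >= $1 ORDER BY id LIMIT $2`;

  const [oldRows, newRows] = await Promise.all([
    oldDb.query(sql, [start, rangeSize]).then((r) => r.rows),
    queryNewDb(sql, [start, rangeSize]).then((r) => r.rows),
  ]);

  // Row-by-row comparison; any divergence is reported for investigation.
  const mismatches: string[] = [];
  const len = Math.max(oldRows.length, newRows.length);
  for (let i = 0; i < len; i++) {
    if (JSON.stringify(oldRows[i]) !== JSON.stringify(newRows[i])) {
      mismatches.push(oldRows[i]?.id ?? newRows[i]?.id);
    }
  }
  console.log(`range starting at ${start}: ${mismatches.length} mismatches`);
  return mismatches;
}
```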