Notion ran a Postgres monolith until early 2021. They wrote a blog post about sharding their database to improve stability and performance. Notes and comments -
Deciding when to shard
- the user experience became visibly poor, with page loads and other interactions taking a long time
- internal forecasts showed that data growth would soon outpace what the monolith could serve
- frequent on-call incidents (CPU spikes etc.) for engineers
- the Postgres vacuum process, which reclaims unused space, began to stall frequently
- there was a risk of transaction ID wraparound in Postgres, which would have led to a full DB outage if vacuuming didn't run properly
Transaction ID wraparound
Postgres assigns each transaction a unique ID from a counter with an upper limit. The vacuum process ensures that counter values which are no longer needed can be reused. With heavy writes and without proper vacuuming, the counter can exhaust its ~2 billion usable values and wrap around, which causes all sorts of issues because the temporal ordering of transactions is no longer guaranteed.
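To make the risk concrete, here is a minimal monitoring sketch assuming the node-postgres (`pg`) client: `age(datfrozenxid)` reports how far each database's oldest unfrozen transaction ID lags behind the current counter, and alerting well before the ~2 billion horizon leaves vacuum time to catch up. The alert threshold is an illustrative choice, not anything from the Notion post.

```typescript
// Sketch: monitor XID age per database with node-postgres ("pg").
// The threshold below is illustrative, not a Postgres-mandated value.
import { Client } from "pg";

const ALERT_THRESHOLD = 1_500_000_000; // well below the ~2B wraparound horizon

async function checkXidAge(connectionString: string): Promise<void> {
  const client = new Client({ connectionString });
  await client.connect();
  try {
    // age(datfrozenxid) = how many transaction IDs have been consumed since
    // the database was last fully frozen by (auto)vacuum.
    const res = await client.query(
      "SELECT datname, age(datfrozenxid) AS xid_age FROM pg_database ORDER BY xid_age DESC"
    );
    for (const row of res.rows) {
      if (Number(row.xid_age) > ALERT_THRESHOLD) {
        console.warn(`XID wraparound risk in ${row.datname}: age=${row.xid_age}`);
      }
    }
  } finally {
    await client.end();
  }
}
```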
Sharding Scheme
Notion used application-level sharding, where the application knows the shards exist and routes each request to the right one. The alternative is a packaged third-party sharding solution (e.g. Citus, Vitess). Notion wanted control over data distribution, and those would have been a black box with opaque clustering logic.
- the core of Notion's data model is what they call a block. Everything you see in Notion (text, image, list, table, page) is a block. These are like Lego sets, which recursively compose to create something much greater than the sum of its parts.
- each block is a row in the block table, which was the biggest and most important table; it was picked first for sharding
- a block can also reference other tables (workspace, discussion, comments, etc.); to preserve transactional guarantees, every table the block table can reach via foreign keys is sharded alongside it
- partitioning by workspace ID; most work happens within a single workspace, which minimises cross-shard joins
- limits of 500 GB per table and 10 TB per physical database; the number of logical shards and physical databases should be chosen so that shards divide evenly across databases
- finally: 480 logical shards over 32 physical databases
- Instagram did something similar by creating thousands of "logical shards" (built with Postgres schemas) on a few physical machines. This helps with scaling further, since you can move whole logical shards to new physical machines without having to re-shard the data.
- 480 was chosen because it has many divisors, leaving plenty of options for the number of physical databases
- for a given table, instead of using native partitioning (https://www.postgresql.org/docs/10/ddl-partitioning.html) with one partitioned table per database, standalone tables were created, one per schema (schema1.block, schema2.block, etc.)
- all routing from workspace ID to logical schema happens in the application (sketched below)
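Here is a minimal sketch of what that application-level routing could look like. Only the 480-shard/32-database split and the schemaN.block naming come from the post; the hash function, the contiguous shard-to-database layout, and the instance names are illustrative assumptions.

```typescript
// Sketch of application-level routing: workspace UUID -> logical shard
// -> physical database + schema-qualified table.
import { createHash } from "crypto";

const NUM_LOGICAL_SHARDS = 480;
const NUM_PHYSICAL_DBS = 32;
const SHARDS_PER_DB = NUM_LOGICAL_SHARDS / NUM_PHYSICAL_DBS; // 15

function logicalShardFor(workspaceId: string): number {
  // Stable hash of the workspace UUID, reduced modulo the shard count.
  // (The actual hash Notion uses isn't specified in these notes.)
  const digest = createHash("md5").update(workspaceId).digest();
  return digest.readUInt32BE(0) % NUM_LOGICAL_SHARDS;
}

function routeBlockTable(workspaceId: string): { database: string; table: string } {
  const shard = logicalShardFor(workspaceId);
  const dbIndex = Math.floor(shard / SHARDS_PER_DB); // which physical Postgres instance
  return {
    database: `postgres-${dbIndex}`,   // hypothetical instance name
    table: `schema${shard + 1}.block`, // one standalone block table per schema
  };
}

// Usage: const { database, table } = routeBlockTable(workspace.id);
```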
Transition Phase
- double writes - write to both the old and the new databases. This was done via an audit log, a table that keeps track of all writes to the old database. A catch-up process iterates over this log and applies each update to the new database (a sketch of such a loop appears after this list).
- backfill - migrate old data to the new databases after double-writing has started (this took around three days with 96 CPUs)
- verification - ensure the integrity of the data:
    - a verification script compared a contiguous range of the UUID space, starting from a randomly sampled value (see the verification sketch after this list)
    - dark reads - reading from both databases in the application, comparing the results, and discarding one of them
- switch-over - the actual migration. They took five minutes of scheduled downtime to let the catch-up script finish running, then switched over.
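As a concrete illustration of the double-write catch-up step, here is a minimal sketch assuming the node-postgres (`pg`) client. The audit_log columns, the batching, and the writeToShard helper are hypothetical; the post only says that a catch-up process iterates over the logged writes and applies each one to the new database.

```typescript
// Sketch of a catch-up worker that replays the audit log against the
// new, sharded databases.
import { Client } from "pg";

type AuditEntry = { id: number; block_id: string; workspace_id: string };

async function catchUp(
  oldDb: Client,
  writeToShard: (entry: AuditEntry) => Promise<void> // routed write (see routing sketch above)
): Promise<void> {
  let cursor = 0; // last applied log position; in practice this checkpoint would be persisted
  for (;;) {
    const { rows } = await oldDb.query(
      "SELECT id, block_id, workspace_id FROM audit_log WHERE id > $1 ORDER BY id LIMIT 1000",
      [cursor]
    );
    if (rows.length === 0) break; // caught up; re-run until switch-over
    for (const row of rows) {
      await writeToShard(row as AuditEntry); // re-apply the write on the correct shard
      cursor = row.id;
    }
  }
}
```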
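And a minimal sketch of the range-based verification idea: sample a random UUID, read a contiguous range of rows starting at that value from the old database, and compare each against a routed lookup in the sharded databases. Table and column names, the range size, and the lookupInNew helper are illustrative assumptions.

```typescript
// Sketch of range-based verification between the old and new databases.
import { randomUUID } from "crypto";
import { Client } from "pg";

async function verifyRange(
  oldDb: Client,
  lookupInNew: (id: string) => Promise<{ last_edited_time: Date } | null>
): Promise<string[]> {
  const start = randomUUID(); // randomly sampled starting point in UUID space
  const { rows } = await oldDb.query(
    "SELECT id, last_edited_time FROM block WHERE id >= $1 ORDER BY id LIMIT 10000",
    [start]
  );

  const mismatches: string[] = [];
  for (const row of rows) {
    const other = await lookupInNew(row.id); // routed read against the new databases
    if (!other || other.last_edited_time.getTime() !== row.last_edited_time.getTime()) {
      mismatches.push(row.id); // missing or diverging record
    }
  }
  return mismatches;
}
```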