Excerpt from Grokking System Design
1. Partitioning Methods
a. Horizontal partitioning (Row):
In this scheme, we put different rows into different tables.
b. Vertical Partitioning (Column):
In this scheme, we divide our data to store tables related to a specific feature to their own server.
c. Directory Based Partitioning (Custom Mapping - loosely coupled approach):
Create a lookup service which knows your current partitioning scheme and abstracts it away from the DB access code.
2. Partitioning Criteria
a. Key or Hash-based partitioning
Under this scheme, we apply a hash function to some key attribute of the entity we are storing, that yields the partition number.
CON: The fundamental problem with this approach is that it effectively fixes the total number of DB servers, since adding new servers means changing the hash function which would require redistribution of data and downtime for the service. Workaround: Consistent Hashing
b. List partitioning
In this scheme, each partition is assigned a list of values.
c. Round-robin partitioning
This is a very simple strategy that ensures uniform data distribution. With ‘n’ partitions, the ‘i’ tuple is assigned to partition (i mod n).
d. Composite partitioning:
Under this scheme, we combine any of above partitioning schemes to devise a new scheme.
3. Common Problems of Sharding
a. Joins and Denormalization:
Performing joins on a database which is running on one server is straightforward, but once a database is partitioned and spread across multiple machines it is often not feasible to perform joins that span database shards.
b. Referential integrity:
As we saw that performing a cross-shard query on a partitioned database is not feasible, similarly trying to enforce data integrity constraints such as foreign keys in a sharded database can be extremely difficult.
c. Rebalancing:
There could be many reasons we have to change our sharding scheme:
The data distribution is not uniform
There are a lot of load on a shard
Last updated