(Random) ID Genrator
Requirements
Usually there are a few requirements for ID generators:
They cannot be arbitrarily long. Let’s say we keep it within 64 bits.
ID is incremented by date. This gives the system a lot of flexibility, e.g. you can sort users by ID, which is same as ordering by register date.
Approaches
Single Machine
Start with something simple and keep optimizing it later, which is more true when the question is broad.
How:
In the simplest case, we can just keep incrementing ID from 1, 2, 3 … N, which in fact is one of the most popular ways to generate ID in many real life projects. If user A’s ID is larger than user B, then we know that A registered later.
Pros:
Start with single machine design, not only is this easy to design, but it’s good enough for many of the cases.
Cons:
hard to scale
3rd party service
How:
Keep a single separate server that is only responsible for ID generation. More specifically, when a user registers to the product, for whichever database that handles this request, it’ll connect to the 3rd party server to ask for a random ID.
Pros:
Since all the ID generation is handled in a single server, there’s no risk of generating duplicate IDs.
Cons:
3rd party server will soon become the bottleneck
Multiple machine solution
Example: Open Source ID Generator -- Snowflake from Twitter
How:
Since within a single timestamp there can also be multiple users, we could solve this with two approaches.
We assign a server ID to each ID generation server and the final ID is a combination of timestamp and the server ID.
We can also allow multiple requests within a single timestamp on a single server. We can keep a counter on each server, which indicates how many IDs have been generated in the current timestamp. So the final ID is a combination of timestamp, serverID and the counter.
Limitation for counter: since the ID cannot be arbitarily long, the counter may end up with only 8bits for instance. In this case, the server can only handle 256 requests within a single timestamp at most. -- If it frequently exceeds this limit, we need to add more instances.
Clock synchronization
There’s a hidden assumption that all ID generation servers have the same clock to generate the timestamp, which might not be true in distributed systems. In reality, system clocks can drastically skew in distributed systems
Industry Practices
Before starting out, we listed out what features were essential in our system:
Generated IDs should be sortable by time (so a list of photo IDs, for example, could be sorted without fetching more information about the photos)
IDs should ideally be 64 bits (for smaller indexes, and better storage in systems like Redis)
The system should introduce as few new ‘moving parts’ as possible — a large part of how we’ve been able to scale Instagram with very few engineers is by choosing simple, easy-to-understand solutions that we trust.
Each of our IDs consists of:
41 bits for time in milliseconds (gives us 41 years of IDs with a custom epoch)
13 bits that represent the logical shard ID
10 bits that represent an auto-incrementing sequence, modulus 1024. This means we can generate 1024 IDs, per shard, per millisecond
Example: September 9th, 2011, at 5:00pm, which is 1387263000 milliseconds since the beginning of our epoch
To start our ID, we fill the left-most 41 bits with this value (epoch timestamp in ms) with a left-shift
id = 1387263000 <<(64-41)
Next, take the shard ID for this particular piece of data we’re trying to insert. Let’s say we’re sharding by user ID, and there are 2000 logical shards; if our user ID is 31341, then the shard ID is 31341 % 2000 -> 1341. We fill the next 13 bits with this value:
id |= 1341 <<(64-41-13)
Finally, we take whatever the next value of our auto-increment sequence (this sequence is unique to each table in each schema) and fill out the remaining bits. Let’s say we’d generated 5,000 IDs for this table already; our next value is 5,001, which we take and mod by 1024 (so it fits in 10 bits)
id |= (5001 % 1024)
.
Reference
Last updated