Designing an API Rate Limiter

Questions to Ask:

What is a Rate Limiter?

A rate limiter, at a high-level, limits the number of events an entity (user, device, IP, etc.) can perform in a particular time window.

In general, a rate limiter caps how many requests a sender can issue in a specific time window. It then blocks requests once the cap is reached.

Why do we need API rate limiting?

Protect services for abusive behaviors, brute-force password attempts, security, reduce costs, revenue

How to do Rate Limiting?

To be answered after we figure out the requirements and goals of the system.

Requirements and Goals of the System

Functional Requirements:

  • Limit the number of requests an entity can send to an API within a time window

  • (Distributed Scenario) The APIs are accessible through a cluster, so the rate limit should be considered across different servers

Non-Functional Requirements: (Availability, Latency, Scalability)

  • Highly available

  • Not introducing substantial latencies

Algorithms for Rate Limiting

Leaky Bucket

Leaky bucket (closely related to token bucket) is an algorithm that provides a simple, intuitive approach to rate limiting via a queue which you can think of as a bucket holding the requests.

Also known as a first in first out (FIFO) queue. If the queue is full, then additional requests are discarded (or leaked).


  • smooths out bursts of requests and processes them at an approximately average rate

  • easy to implement on a single server or load balancer

  • memory efficient for each user given the limited queue size


  • a burst of traffic can fill up the queue with old requests and starve more recent requests from being processed

  • no guarantee that requests get processed in a fixed amount of time

  • challenges of distributed environments: must use a policy to coordinate and enforce the limit between servers, if load balancing is used

Fixed Window

In a fixed window algorithm, a window size of n seconds (typically using human-friendly values, such as 60 or 3600 seconds) is used to track the rate. Each incoming request increments the counter for the window. If the counter exceeds a threshold, the request is discarded. The windows are typically defined by the floor of the current timestamp. e.g. 12:00:03 with a 60 second window length, would be in the 12:00:00 window.

In this algorithm, the time window is considered from the start of the time-unit to the end of the time-unit. For example, a period would be considered as 0-60 seconds for a minute irrespective of the time frame at which the API request has been made.


  • ensures more recent requests gets processed without being starved by old requests


  • (boundary conditions) a single burst of traffic that occurs near the boundary of a window can result in twice the rate of requests being processed, because it will allow requests for both the current and next windows within a short time

  • if many consumers wait for a reset window, for example at the top of the hour, then they may stampede your API at the same time.

Possible Implementation Pseudo Algo

It keeps track of the startTime and count for each userId, when new requests coming in, it will check if currentTime - startTime >= definedInterval then reset the startTime to the currentTime and reset count to 1.

Sliding (Rolling) Log

Sliding Log rate limiting involves tracking a time stamped log for each consumer’s request. These logs are usually stored in a hash set or table that is sorted by time. Logs with timestamps beyond a threshold are discarded. When a new request comes in, we calculate the sum of logs to determine the request rate. If the request would exceed the threshold rate, then it is held.


  • does not suffer from the boundary conditions of fixed windows. The rate limit will be enforced precisely

  • sliding log is tracked for each consumer, you don’t have the stampede effect that challenges fixed windows


  • very expensive to store an unlimited number of logs for every request

  • also expensive to compute because each request requires calculating a summation over the consumer’s prior requests, potentially across a cluster of servers

Sliding (Rolling) Window*

Hybrid approach that combines the low processing cost of the fixed window algorithm, and the improved boundary conditions of the sliding log.

Like the fixed window algorithm, we track a counter for each fixed window. Next, we account for a weighted value of the previous window’s request rate based on the current timestamp to smooth out bursts of traffic.

For example, if the current window is 25% through, then we weight the previous window’s count by 75%. The relatively small number of data points needed to track per key allows us to scale and distribute across large clusters.


  • flexibility to scale rate limiting with good performance

  • avoids the starvation problem of leaky bucket

  • avoid the boundary bursting problems of fixed window implementations



  • can not enforce hard throttling (only soft throttling)

*Grokking System Design Interview has a different definition of the Rolling Window Algorithm.

Rate Limiting in Distributed Systems

Global Rate Limit

when you are using a cluster of multiple nodes

Sticky Sessions in Load Balancer

The simplest way to enforce the limit is to set up sticky sessions in your load balancer so that each consumer gets sent to exactly one node.


  • Simple


  • The disadvantages include a lack of fault tolerance and scaling problems when nodes get overloaded.

Centralized Data Store

Use a centralized data store such as Redis or Cassandra, to store the counts for each window and consumer


More flexible for load-balancing rule


  • increased latency making requests to the data store,

  • race conditions

race condition:

One of the largest problems with a centralized data store is the potential for race conditions in high concurrency request patterns.

This happens when you use a naïve “get-then-set” approach, wherein you retrieve the current rate limit counter, increment it, and then push it back to the datastore.

One way to avoid this problem is to put a “lock” around the key in question, preventing any other processes from accessing or writing to the counter. This would quickly become a major performance bottleneck, and does not scale well, particularly when using remote servers like Redis as the backing datastore.

A better approach is to use a “set-then-get” mindset, relying on atomic operators that implement locks in a very performant fashion, allowing you to quickly increment and check counter values without letting the atomic operations get in the way.

High Level System Design


How to Design a Scalable Rate Limiting Algorithm (By Robert Paprocki)

Grokking the System Design Interview: Designing an API Rate Limiter

Last updated