# Design Twitter

Social network service

## Scenario

### User stories, requirements of the system

* Post tweets
* Follow other people
* Favorite tweets
* Timeline consisting of top tweets from people the user follows&#x20;
* Tweets can contain photos and videos.

**Extended Requirements**

1. Searching for tweets.
2. Replying to a tweet.
3. Trending topics – current hot topics/searches.
4. Tagging other users.
5. Tweet Notification.
6. Who to follow? Suggestions?
7. Moments.

### Capacity Estimation

**Estimate**: 200 Million DAU

**Tweets** **View (Read)**: 200M users \* 100 tweets \~ 20B per day

**Storage (Write)**:

* New Tweets (Text): 100M tweets \* (280 + 30 ) bytes per tweets \~ 30GB per day
* Multi-media (Photos, Videos): (100M / 5 photos \* 200KB) + (100M/10 videos \* 2MB) \~= 24TB per day

## Service

### System APIs

```
tweet(api_dev_key, tweet_data, tweet_location, user_location, media_ids)
```

> **Parameters:**
>
> **api\_dev\_key** (string): The API developer key of a registered account. This will be used to, among other things, throttle users based on their allocated quota.
>
> **tweet\_data** (string): The text of the tweet, typically up to 140 characters.
>
> **tweet\_location** (string): Optional location (longitude, latitude) this Tweet refers to. user\_location (string): Optional location (longitude, latitude) of the user adding the tweet.
>
> **media\_ids** (number\[]): Optional list of media\_ids to be associated with the Tweet. (All the media photo, video, etc. need to be uploaded separately).
>
> **Returns:** (string)
>
> A successful post will return the **URL** to access that tweet. Otherwise, an appropriate HTTP error is returned.

### High Level System Component

**Write**: 100M / 86400s = 1150 tweets per second

**Read**: 20B / 86400s = 230K tweets per second

**Read-heavy system**.

This traffic will be distributed unevenly throughout the day, though, at **peak time** we should expect at least **a** **few** **thousand** **write** requests and around **1M read**

## Storage

### Database Schema

```
Tweet
---
PK - TweetID: int
    UserID: int
    Content: varchar(140)
    TweetLatitude: int
    TweetLongitude: int
    CreationDate: datetime
    NumFavorites: int
```

\-

```
User
---
PK - UserID: int
    UserName: varchar(32)
    Email: varchar(32)
    Name: varchar(30)
    DataOfBirth: date
    CreationDate: datetime
    LastLogin: datetime
```

\-

```
UserFollow
---
PK UserID1: int UserID2: int
```

\-

```
Favorite
---
PK TweetID: int UserID: int
    CreationDate: datetime
```

### Data Sharding

#### **Sharding based on UserID:**

We can try storing all the data of a user on one server. While storing, we can pass the UserID to our hash function that will map the user to a database server where we will store all of the user’s tweets, favorites, follows, etc.

**Issues with this approach:**

* **Hot user problem** - lot of queries on the server holding hot user, high load will affect performance of the service
* **Non-uniform distribution of storage** - some user can end up storing a lot more tweets or having a lot more followers
* **What if one user's data can't fit in one shard?** - What if we cannot store all tweets of a user on one shard?
* **Unavailability of all of the user's data on the shard -** if that shard is down or higher latency if it's serving high load for specific user

\--

To recover from these situation: **repartition/redistribute our data** or use **consistent hashing**

#### **Sharding based on TweetID**

Our hash function will map each TweetID to a random server where we will store that Tweet.

**Pros:**

* Solve the problem of hot users

**Cons:**

* In contrast to sharding by UserID, we have to query all database partitions to find tweets of a user, which can result in higher latencies

Store hot tweets in cache in front of database servers
