# Reservoir Sampling

Wikipedia:

> [**Reservoir sampling**](http://en.wikipedia.org/wiki/Reservoir_sampling) **is a family of randomized algorithms for randomly choosing** `k` **samples from a list of** `n` **items, where** `n` **is either a very large or unknown number. Typically** `n` **is large enough that the list doesn’t fit into main memory.**

***O(n)*** **time solution:**

1. Create an array `reservoir[0..k-1]` and copy the first `k` items of `stream[]` into it.
2. Consider the remaining items one by one, from the (k+1)th item to the nth item.
   1. Generate a random integer `j` from 0 to `i` (inclusive), where `i` is the index of the current item in `stream[]`.
   2. If `j` is in the range `0` to `k-1`, replace `reservoir[j]` with `stream[i]`.

**Code**

```java
// An efficient Java program to randomly 
// select k items from a stream of items 
import java.util.Arrays; 
import java.util.Random; 
public class ReservoirSampling 
{ 
    // A function to randomly select k items from stream[0..n-1].
    // Assumes 0 <= k <= n; otherwise the initial copy reads past the end of stream[].
    static void selectKItems(int stream[], int n, int k) 
    { 
        int i; // index for elements in stream[] 

        // reservoir[] is the output array. Initialize it with 
        // first k elements from stream[] 
        int reservoir[] = new int[k]; 
        for (i = 0; i < k; i++) {
            reservoir[i] = stream[i]; 
        }

        Random r = new Random(); 

        // Iterate from the (k+1)th element to nth element 
        for (; i < n; i++) 
        { 
            // Pick a random index from 0 to i. 
            int j = r.nextInt(i + 1); 

            // If the randomly picked index is smaller than k, 
            // then replace the element present at the index 
            // with new element from stream 
            if(j < k) {
                reservoir[j] = stream[i];
            }

        } 

        System.out.println("Following are k randomly selected items"); 
        System.out.println(Arrays.toString(reservoir)); 
    } 

    //Driver Program to test above method 
    public static void main(String[] args) { 
        int stream[] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12}; 
        int n = stream.length; 
        int k = 5; 
        selectKItems(stream, n, k); 
    } 
} 
//This code is contributed by Sumit Ghosh
```

#### How does it work?

To prove: **The probability that any item** `stream[i]` **where** `0 <= i < n` **will be in the final** `reservoir[]` **is** `k/n`**.**

**Case 1: For the last n-k stream items, i.e., for stream\[i] where k <= i < n**

For `stream[n - 1]`:

```
The probability that the last item is in final reservoir 

= The probability that one of the first k indexes is picked for last item 

= k/n (the probability of picking one of the k items from a list of size n)
```

For `stream[n-2]`:

```
The probability that the second-to-last item is in the final reservoir[]

= [Probability that one of the first k indexes is picked in the iteration for stream[n-2]] x
    [Probability that the index picked in the iteration for stream[n-1] is not the same as the index picked for stream[n-2]]

= [k/(n-1)] x [(n-1)/n] = k/n.
```
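
The two examples above generalize to any `stream[i]` with `k <= i < n`. The following derivation is a sketch that uses the same 0-based indexing as the code; the index `m` is introduced here only for illustration. The item enters the reservoir with probability `k/(i+1)`, and each later iteration for `stream[m]` overwrites its slot with probability `1/(m+1)`, because `j` is uniform over `m+1` values and only one of them matches that slot.

```latex
\begin{aligned}
P(\text{stream}[i] \text{ is in the final reservoir})
  &= \frac{k}{i+1} \times \prod_{m=i+1}^{n-1} \left(1 - \frac{1}{m+1}\right) \\
  &= \frac{k}{i+1} \times \frac{i+1}{i+2} \times \frac{i+2}{i+3} \times \cdots \times \frac{n-1}{n} \\
  &= \frac{k}{n}
\end{aligned}
```

For `i = n-1` the product is empty and the result is `k/n` directly; for `i = n-2` it reduces to `[k/(n-1)] x [(n-1)/n]`, matching the two cases worked out above.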

**Case 2: For the first k stream items, i.e., for stream\[i] where 0 <= i < k**

The first k items are initially copied to reservoir\[] and may be replaced later, in the iterations for stream\[k] to stream\[n-1].

```
The probability that an item from stream[0..k-1] is in the final reservoir

= Probability that the item is not replaced when stream[k], stream[k+1], ..., stream[n-1] are considered

= [k/(k+1)] x [(k+1)/(k+2)] x [(k+2)/(k+3)] x ... x [(n-1)/n] = k/n
```
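
The `k/n` claim can also be checked empirically. The sketch below is not part of the original program: the class `ReservoirFrequencyCheck` and its methods `sample` and `frequencies` are hypothetical names chosen for illustration. It repeats the same replacement rule many times over the stream `{0, ..., n-1}` and reports how often each item ends up in the reservoir. It assumes `k <= n`.

```java
import java.util.Random;

// Hypothetical helper class for illustration; not from the original program.
class ReservoirFrequencyCheck {

    // Same replacement rule as selectKItems, but returns the reservoir
    // instead of printing it. Assumes k <= stream.length.
    static int[] sample(int[] stream, int k, Random r) {
        int[] reservoir = new int[k];
        for (int i = 0; i < k; i++) {
            reservoir[i] = stream[i];
        }
        for (int i = k; i < stream.length; i++) {
            int j = r.nextInt(i + 1); // uniform in [0, i]
            if (j < k) {
                reservoir[j] = stream[i];
            }
        }
        return reservoir;
    }

    // Runs `trials` independent samples over the stream {0, ..., n-1} and
    // returns, for each item, the fraction of trials in which it was selected.
    static double[] frequencies(int n, int k, int trials, long seed) {
        Random r = new Random(seed);
        int[] stream = new int[n];
        for (int i = 0; i < n; i++) {
            stream[i] = i;
        }
        int[] counts = new int[n];
        for (int t = 0; t < trials; t++) {
            for (int item : sample(stream, k, r)) {
                counts[item]++;
            }
        }
        double[] freq = new double[n];
        for (int i = 0; i < n; i++) {
            freq[i] = (double) counts[i] / trials;
        }
        return freq;
    }

    public static void main(String[] args) {
        int n = 12, k = 5;
        double[] freq = frequencies(n, k, 200_000, 42L);
        for (int i = 0; i < n; i++) {
            System.out.printf("stream[%d]: %.4f (expected %.4f)%n", i, freq[i], (double) k / n);
        }
    }
}
```

With enough trials, every observed frequency should sit close to `k/n`, for the first `k` items and the later ones alike.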

#### Implementation: Select K Items from A Stream of N element

```java
static void selectKItems(int stream[], int n, int k)
{
    int i; // index for elements in stream[]
    // reservoir[] is the output array. Initialize it with
    // first k elements from stream[]
    int reservoir[] = new int[k];
    for (i = 0; i < k; i++) {
        reservoir[i] = stream[i];
    }
    Random r = new Random();
    // Iterate from the (k+1)th element to nth element
    for (; i < n; i++)
    {
        // Pick a random index from 0 to i.
        int j = r.nextInt(i + 1);
        // If the randomly picked index is smaller than k,
        // then replace the element present at the index
        // with new element from stream
        if(j < k) {
            reservoir[j] = stream[i];
        }
    }
    System.out.println("Following are k randomly selected items"); 
    System.out.println(Arrays.toString(reservoir));
}
```

## Interview Questions

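### Interview Question: Pick One Line from a File with Equal Probability

#### Problem Description

Amazon: A file has so many lines that it cannot be loaded into memory all at once. How do you pick one of its lines at random, with every line equally likely?

Source: <https://www.careercup.com/question?id=13218749>

#### Solution

Take the first line as the current candidate, then scan the file line by line. When you reach the K-th line, it replaces the current candidate with probability 1/K. Use a random function to decide whether that probability is hit: if so, replace the candidate with the current line and continue; if not, move on to the next line. The candidate left after the last line is the answer.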
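
This is reservoir sampling with a reservoir of size 1. A minimal sketch follows, assuming the lines arrive through an `Iterator<String>`; the class `RandomLinePicker` and the method `pickLine` are hypothetical names introduced for illustration. Only the current candidate is held in memory.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.Random;

// Hypothetical helper class for illustration; not from the original notes.
class RandomLinePicker {

    // Returns one line chosen uniformly at random from `lines`,
    // or null if there are no lines at all.
    static String pickLine(Iterator<String> lines, Random random) {
        String candidate = null;
        long seen = 0; // number of lines read so far (K in the text)
        while (lines.hasNext()) {
            String line = lines.next();
            seen++;
            // Replace the candidate with probability 1/seen.
            // For the first line this always succeeds, because nextDouble() < 1.0.
            if (random.nextDouble() * seen < 1.0) {
                candidate = line;
            }
        }
        return candidate;
    }

    public static void main(String[] args) {
        Iterator<String> lines = Arrays.asList("alpha", "beta", "gamma").iterator();
        System.out.println(pickLine(lines, new Random()));
    }
}
```

For a real file, one option is to pass `Files.lines(path).iterator()`, which reads lines lazily instead of loading the whole file.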

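### Interview Question: Pick One Million Chinese Queries from a Google Search Log with Equal Probability

#### Problem Description

You are given a Google search log containing hundreds of millions of search records (queries) in many different languages. Randomly pick 1 million Chinese queries from it. Assume a tool that decides whether a query is Chinese has already been written.

Source: <https://www.careercup.com/question?id=83697>

#### Solution

This is a classic randomized-algorithm problem. At its core it is a data-stream problem: although the question hands you a static file, if your algorithm relies on offline data, the interviewer will almost certainly follow up by asking for an online algorithm, i.e., how to randomly pick 1 million Chinese queries while the queries stream past one at a time.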

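#### What does the online algorithm look like?

The answer is standard and worth memorizing. Suppose you need to pick N queries in total. Set up a buffer of size N to hold the selected queries. For each query that streams past, do the following:

1. If it is not Chinese, skip it.
2. If the buffer is not full, add the query to the buffer directly.
3. If the buffer is full, let M be the number of Chinese queries seen so far (including the current one), and use a random function to keep the current query with probability N / M.

   3.1 If it is not kept, skip it and move on to the next query.

   3.2 If it is kept, use a random function to pick one query in the buffer uniformly at random (each with probability 1 / N), evict it, and put the current query in its place.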

Implementation: [Select K Items from A Stream of N element](#implementation-select-k-items-from-a-stream-of-n-element)
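
The filtering step can be folded into the sampling loop directly. The following is a minimal sketch under stated assumptions: the class `FilteredReservoirSampler`, the method `sample`, and the predicate `looksChinese` are hypothetical names, and `looksChinese` is only a stand-in for the detection tool that the problem says already exists. Queries are assumed to arrive through an `Iterator`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Random;
import java.util.function.Predicate;

// Hypothetical helper class for illustration; not from the original notes.
class FilteredReservoirSampler {

    // Keeps a uniform random sample of up to n items that satisfy `accept`.
    static <T> List<T> sample(Iterator<T> queries, Predicate<T> accept, int n, Random random) {
        List<T> buffer = new ArrayList<>(n);
        long m = 0; // accepted queries seen so far, including the current one
        while (queries.hasNext()) {
            T query = queries.next();
            if (!accept.test(query)) {
                continue; // step 1: skip queries that do not match
            }
            m++;
            if (buffer.size() < n) {
                buffer.add(query); // step 2: buffer not full yet
                continue;
            }
            // Step 3: keep the current query with probability n / m.
            if (random.nextDouble() * m < n) {
                // Step 3.2: evict a uniformly chosen entry and store the query.
                buffer.set(random.nextInt(n), query);
            }
            // Step 3.1: otherwise the query is simply dropped.
        }
        return buffer;
    }

    // Stand-in for the existing language detector (an assumption): treats a
    // query as Chinese if it contains at least one Han character.
    static boolean looksChinese(String query) {
        return query.codePoints()
                .anyMatch(cp -> Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN);
    }

    public static void main(String[] args) {
        // The escaped strings are Chinese words, written as Unicode escapes
        // so the file compiles regardless of source encoding.
        List<String> log = Arrays.asList(
                "weather", "\u4f60\u597d", "news", "\u5929\u6c14", "\u4f60\u597d\u5929\u6c14");
        System.out.println(sample(log.iterator(), FilteredReservoirSampler::looksChinese, 2, new Random()));
    }
}
```

If the stream contains fewer than `n` matching queries, the sketch returns all of them.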

## Reference

* 👍  **Youtube - Reservoir Sampling**: <https://www.youtube.com/watch?v=A1iwzSew5QY>
  * For k = 1, the probability that the i-th item (1-indexed) is the final pick: (1 / i) \* (1 - 1/ (i + 1)) \* (1 - 1/(i + 2)) \* ... \* (1 - 1 / n) = 1/n
* **GeeksforGeeks**: <https://www.geeksforgeeks.org/reservoir-sampling/>
* **Wikipedia**: <https://en.wikipedia.org/wiki/Reservoir_sampling>
