Reservoir Sampling


Reservoir sampling is a family of randomized algorithms for randomly choosing k samples from a list of n items, where n is either a very large or unknown number. Typically n is large enough that the list doesn’t fit into main memory.

O(n) time solution:

  1. Create an array reservoir[0..k-1] and copy the first k items of stream[] into it.

  2. Consider the remaining items one by one, from the (k+1)th item to the nth item.

    1. Generate a random number j from 0 to i (inclusive), where i is the index of the current item in stream[].

    2. If j is in the range 0 to k-1, replace reservoir[j] with stream[i].


// An efficient Java program to randomly
// select k items from a stream of items
import java.util.Arrays;
import java.util.Random;

public class ReservoirSampling {
    // A function to randomly select k items from stream[0..n-1].
    static void selectKItems(int stream[], int n, int k) {
        int i; // index for elements in stream[]

        // reservoir[] is the output array. Initialize it with
        // the first k elements from stream[]
        int reservoir[] = new int[k];
        for (i = 0; i < k; i++) {
            reservoir[i] = stream[i];
        }

        Random r = new Random();

        // Iterate from the (k+1)th element to the nth element
        for (; i < n; i++) {
            // Pick a random index from 0 to i (inclusive).
            int j = r.nextInt(i + 1);

            // If the randomly picked index is smaller than k,
            // replace the element at that index with the
            // new element from the stream
            if (j < k) {
                reservoir[j] = stream[i];
            }
        }

        System.out.println("Following are k randomly selected items");
        System.out.println(Arrays.toString(reservoir));
    }

    // Driver program to test the above method
    public static void main(String[] args) {
        int stream[] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12};
        int n = stream.length;
        int k = 5;
        selectKItems(stream, n, k);
    }
}
// This code is contributed by Sumit Ghosh
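
The program above receives the whole stream as an array whose length n is known up front. As a minimal sketch of the unknown-n case mentioned in the introduction, the same replacement rule can be applied to a java.util.Iterator. The class name StreamingReservoir and the method sample are hypothetical, chosen only for illustration, and the sketch assumes the stream has fewer than 2^31 items so that an int counter suffices.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Random;

// Hypothetical helper for illustration; not part of the original program.
class StreamingReservoir {
    // Returns up to k items chosen uniformly at random from the iterator,
    // without knowing in advance how many items it will yield.
    static <T> List<T> sample(Iterator<T> stream, int k, Random rng) {
        List<T> reservoir = new ArrayList<>(k);
        int seen = 0; // number of items consumed so far (assumed < 2^31)
        while (stream.hasNext()) {
            T item = stream.next();
            seen++;
            if (reservoir.size() < k) {
                // Fill phase: the first k items go straight into the reservoir.
                reservoir.add(item);
            } else {
                // The current item has index seen - 1, so draw j from [0, seen).
                int j = rng.nextInt(seen);
                if (j < k) {
                    reservoir.set(j, item);
                }
            }
        }
        return reservoir;
    }
}
```

If the stream yields fewer than k items, the returned list simply contains all of them in their original order.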

How does it work?

To Prove: The probability that any item stream[i] where 0 <= i < n will be in final reservoir[] is k/n.

Case 1: For last n-k stream items, i.e., for stream[i] where k <= i < n

For stream[n - 1]:

The probability that the last item is in final reservoir 

= The probability that one of the first k indexes is picked for last item 

= k/n (the probability of picking one of the k items from a list of size n)

For stream[n-2]:

The probability that the second last item is in final reservoir[]

= [Probability that one of the first k indexes is picked in the iteration for stream[n-2]] x
    [Probability that the index picked in the iteration for stream[n-1] is not the same as the index picked for stream[n-2]]

= [k/(n-1)]*[(n-1)/n] = k/n.
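
The same argument extends to every item in Case 1. As a sketch: stream[i] with k <= i < n must be selected in its own iteration (probability k/(i+1)) and then survive each later iteration m, in which its slot is overwritten with probability 1/(m+1). The index m is introduced here only for the derivation.

```latex
P(\text{stream}[i] \in \text{reservoir})
  = \frac{k}{i+1} \prod_{m=i+1}^{n-1} \left(1 - \frac{1}{m+1}\right)
  = \frac{k}{i+1} \prod_{m=i+1}^{n-1} \frac{m}{m+1}
  = \frac{k}{i+1} \cdot \frac{i+1}{n}
  = \frac{k}{n}
```

Setting i = n-1 and i = n-2 reproduces the two results worked out above.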

Case 2: For first k stream items, i.e., for stream[i] where 0 <= i < k

The first k items are initially copied to reservoir[] and may be removed later in the iterations for stream[k] to stream[n-1].

The probability that an item from stream[0..k-1] is in final array 

= Probability that the item's slot is not picked for replacement when items stream[k], stream[k+1], ..., stream[n-1] are considered

= [k/(k+1)] x [(k+1)/(k+2)] x [(k+2)/(k+3)] x … x [(n-1)/n] = k/n
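
The k/n result can also be checked empirically. The following is a rough simulation sketch, not a proof: it runs the replacement scheme many times on the indices 0..n-1 and counts how often each index ends up in the final reservoir. The class name ReservoirFrequencyCheck, the method countSelections, and the trial count are arbitrary choices for illustration; the sketch assumes k <= n.

```java
import java.util.Random;

// Hypothetical simulation harness for illustration only.
class ReservoirFrequencyCheck {
    // Runs the sampling scheme `trials` times over the items 0..n-1 and
    // returns, for each item, how many times it was in the final reservoir.
    static int[] countSelections(int n, int k, int trials, Random rng) {
        int[] hits = new int[n];
        int[] reservoir = new int[k];
        for (int t = 0; t < trials; t++) {
            // Copy the first k items into the reservoir.
            for (int i = 0; i < k; i++) {
                reservoir[i] = i;
            }
            // Apply the replacement rule to the remaining items.
            for (int i = k; i < n; i++) {
                int j = rng.nextInt(i + 1);
                if (j < k) {
                    reservoir[j] = i;
                }
            }
            for (int item : reservoir) {
                hits[item]++;
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        int n = 12, k = 5, trials = 200_000;
        int[] hits = countSelections(n, k, trials, new Random());
        for (int i = 0; i < n; i++) {
            System.out.printf("stream[%d]: observed %.4f, expected %.4f%n",
                    i, (double) hits[i] / trials, (double) k / n);
        }
    }
}
```

With enough trials, every observed frequency should land close to k/n regardless of whether the item started in the reservoir (Case 2) or arrived later (Case 1).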

Implementation: Select K Items from a Stream of N Elements

Interview Questions



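Amazon: A file has many lines and cannot be loaded into memory all at once. How do you pick one of its lines at random, with every line equally likely?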



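Start with the first line as the candidate, then scan the file line by line. When you reach line K, that line replaces the current candidate with probability 1/K. Use a random function to decide whether this probability is hit. If it is, replace the candidate with line K and continue; if not, move on to the next line.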
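
This is reservoir sampling with k = 1. A minimal sketch of the scan, assuming the file is exposed as a java.io.Reader and has fewer than 2^31 lines; the class name RandomLinePicker and the method pickLine are hypothetical names used only for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.Random;

// Hypothetical helper for illustration only.
class RandomLinePicker {
    // Returns one line chosen uniformly at random, holding only the current
    // line and the candidate in memory. Returns null if there are no lines.
    static String pickLine(Reader source, Random rng) throws IOException {
        BufferedReader reader = new BufferedReader(source);
        String candidate = null;
        int lineNumber = 0; // K in the description above
        String line;
        while ((line = reader.readLine()) != null) {
            lineNumber++;
            // Line K replaces the candidate with probability 1/K.
            // For K = 1 the draw is always 0, so the first line is always taken.
            if (rng.nextInt(lineNumber) == 0) {
                candidate = line;
            }
        }
        return candidate;
    }
}
```

For a real file, a reader such as java.nio.file.Files.newBufferedReader(path) could be passed in as the source.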



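You are given a Google search log containing hundreds of millions of search records (queries) in different languages. Randomly select 1 million Chinese search records from it. Assume a tool that decides whether a query is Chinese has already been written.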



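This is a classic probabilistic algorithm problem, and at its core it is a data stream problem. Even though the question hands you a "static" file, if your algorithm works on offline data the interviewer will certainly follow up by asking for an online algorithm: how do you randomly select 1 million Chinese search records while the records stream past one at a time?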


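It is enough to remember the answer to this one. Suppose you need to select N queries in total. Set up a buffer of size N to hold the selected queries. For each query that streams past, run the following steps:

  1. If the query is not Chinese, skip it.

  2. If the buffer is not full, add the query directly to the buffer.

  3. If the buffer is full, let M be the number of Chinese queries seen so far (including this one), and use a random function to decide, with probability N / M, whether this query is selected and kept.

    3.1 If it is not selected, skip the query and continue with the next one.

    3.2 If it is selected, use a random function to pick one query from the buffer to discard, each with probability 1 / N, and put the current query in its place.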
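
The steps above can be sketched as follows. The class name FilteredQuerySampler and the method sample are hypothetical, and the isChinese predicate is only a stand-in for the language-detection tool that the question assumes already exists. The sketch also assumes fewer than 2^31 Chinese queries so that an int counter suffices.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Random;
import java.util.function.Predicate;

// Hypothetical helper for illustration only.
class FilteredQuerySampler {
    // Selects up to n queries uniformly at random from those accepted by
    // isChinese, reading the stream once and keeping only n queries in memory.
    static List<String> sample(Iterator<String> queries, int n,
                               Predicate<String> isChinese, Random rng) {
        List<String> buffer = new ArrayList<>(n);
        int m = 0; // Chinese queries seen so far, including the current one
        while (queries.hasNext()) {
            String query = queries.next();
            if (!isChinese.test(query)) {
                continue;                          // step 1: skip non-Chinese queries
            }
            m++;
            if (buffer.size() < n) {
                buffer.add(query);                 // step 2: buffer not yet full
            } else if (rng.nextInt(m) < n) {       // step 3: keep with probability N / M
                buffer.set(rng.nextInt(n), query); // step 3.2: evict a uniformly random slot
            }
            // step 3.1: otherwise the query is simply dropped
        }
        return buffer;
    }
}
```

Only the Chinese queries advance the counter M, so the filter and the sampling compose in a single pass. Steps 3 and 3.2 could also be merged into one random draw from [0, M), as in the Java program earlier in this section.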

