Reservoir sampling is a family of randomized algorithms for randomly choosing k samples from a list of n items, where n is either very large or unknown. Typically n is large enough that the list doesn't fit into main memory.
O(n) time solution:
Create an array reservoir[0..k-1] and copy the first k items of stream[] into it.
Then consider the remaining items one by one, from the (k+1)th item to the nth item.
For the current item at index i in stream[], generate a random number j from 0 to i (inclusive).
If j is in the range 0 to k-1, replace reservoir[j] with stream[i].
Code
// An efficient Java program to randomly
// select k items from a stream of items
import java.util.Arrays;
import java.util.Random;

public class ReservoirSampling
{
    // Randomly selects k items from stream[0..n-1] and prints them.
    // Requires 0 <= k <= n <= stream.length.
    static void selectKItems(int[] stream, int n, int k)
    {
        if (k < 0 || k > n || n > stream.length) {
            throw new IllegalArgumentException("need 0 <= k <= n <= stream.length");
        }

        int i; // index for elements in stream[]

        // reservoir[] is the output array. Initialize it with
        // the first k elements from stream[]
        int[] reservoir = new int[k];
        for (i = 0; i < k; i++) {
            reservoir[i] = stream[i];
        }

        Random r = new Random();

        // Iterate from the (k+1)th element to the nth element
        for (; i < n; i++)
        {
            // Pick a random index from 0 to i (inclusive).
            int j = r.nextInt(i + 1);

            // If the randomly picked index is smaller than k,
            // replace the element at that index with the
            // new element from the stream
            if (j < k) {
                reservoir[j] = stream[i];
            }
        }

        System.out.println("Following are k randomly selected items");
        System.out.println(Arrays.toString(reservoir));
    }

    // Driver program to test the method above
    public static void main(String[] args) {
        int[] stream = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12};
        int n = stream.length;
        int k = 5;
        selectKItems(stream, n, k);
    }
}
//This code is contributed by Sumit Ghosh
How does it work?
To prove: the probability that any item stream[i], where 0 <= i < n, is in the final reservoir[] is k/n.
Case 1: For last n-k stream items, i.e., for stream[i] where k <= i < n
For stream[n - 1]:
The probability that the last item is in the final reservoir[]
= The probability that one of the first k indexes is picked for the last item
= k/n (the index is drawn uniformly from the n values 0 to n-1, and k of them are reservoir slots)
For stream[n-2]:
The probability that the second-to-last item is in the final reservoir[]
= [Probability that one of the first k indexes is picked in the iteration for stream[n-2]] x
[Probability that the index picked in the iteration for stream[n-1] is not the index where stream[n-2] was placed]
= [k/(n-1)] x [(n-1)/n] = k/n.
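The two items worked out above follow one pattern. As a sketch of the general step for any stream[i] with k <= i < n (the iteration index m is a symbol introduced here for illustration): the item must be chosen in its own iteration, which draws from i+1 values, and then its slot must not be overwritten in any later iteration m, each of which hits one particular slot with probability 1/(m+1).

```latex
\begin{aligned}
P\bigl(\text{stream}[i] \text{ is in the final reservoir}\bigr)
  &= \underbrace{\frac{k}{i+1}}_{\text{chosen in iteration } i}
     \times \prod_{m=i+1}^{n-1}
     \underbrace{\frac{m}{m+1}}_{\text{slot survives iteration } m} \\
  &= \frac{k}{i+1} \cdot \frac{i+1}{n} \;=\; \frac{k}{n}.
\end{aligned}
```

Setting i = n-1 gives an empty product and k/n directly; setting i = n-2 gives [k/(n-1)] x [(n-1)/n], matching the two cases above.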
Case 2: For first k stream items, i.e., for stream[i] where 0 <= i < k
The first k items are initially copied to reservoir[] and may be replaced later, in the iterations for stream[k] to stream[n-1].
The probability that an item from stream[0..k-1] is in the final array
= Probability that the item's slot is not picked when items stream[k], stream[k+1], ..., stream[n-1] are considered
= [k/(k+1)] x [(k+1)/(k+2)] x [(k+2)/(k+3)] x ... x [(n-1)/n] = k/n
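The k/n claim can also be checked empirically. The following is a minimal sketch, not part of the original program: the class name ReservoirUniformityCheck, its method names, the trial count, and the seed are all chosen here for illustration. It repeats the sampling many times over the values 0..n-1 and reports how often each value ends up in the reservoir.

```java
import java.util.Random;

class ReservoirUniformityCheck {
    // Runs reservoir sampling over the values 0..n-1 and returns the k kept values.
    // Assumes 0 <= k <= n.
    static int[] sample(int n, int k, Random rng) {
        int[] reservoir = new int[k];
        for (int i = 0; i < k; i++) {
            reservoir[i] = i;
        }
        for (int i = k; i < n; i++) {
            int j = rng.nextInt(i + 1); // uniform in 0..i
            if (j < k) {
                reservoir[j] = i;
            }
        }
        return reservoir;
    }

    // Returns, for each value 0..n-1, the fraction of trials in which it was kept.
    static double[] inclusionFrequencies(int n, int k, int trials, long seed) {
        Random rng = new Random(seed);
        int[] counts = new int[n];
        for (int t = 0; t < trials; t++) {
            for (int value : sample(n, k, rng)) {
                counts[value]++;
            }
        }
        double[] freq = new double[n];
        for (int i = 0; i < n; i++) {
            freq[i] = (double) counts[i] / trials;
        }
        return freq;
    }

    public static void main(String[] args) {
        int n = 12, k = 5;
        double[] freq = inclusionFrequencies(n, k, 200_000, 42L);
        System.out.printf("expected k/n = %.4f%n", (double) k / n);
        for (int i = 0; i < n; i++) {
            System.out.printf("stream[%d]: %.4f%n", i, freq[i]);
        }
    }
}
```

With enough trials, every printed frequency should sit close to k/n, for the first k positions as well as for the later ones.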
Interview Questions
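Interview question: pick one line from a file with equal probability
Problem
Amazon: A file contains many lines and cannot be loaded into memory all at once. How do you pick one of its lines uniformly at random?
Solution
Take the first line as the current candidate, then scan the file line by line. When you reach line K, line K replaces the current candidate with probability 1/K. Use a random number generator to decide whether that probability is hit: if it is, replace the candidate and continue; if not, move on to the next line.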
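The scan described above is reservoir sampling with k = 1. A minimal sketch, assuming the lines arrive through an Iterator<String> and that there are fewer than 2^31 of them (the class name RandomLinePicker and the method pickLine are hypothetical names used only for this example):

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.Random;

class RandomLinePicker {
    // Returns one line chosen uniformly at random, or null if there are no lines.
    // Only the current candidate is held in memory.
    // Assumes fewer than 2^31 lines, because the counter is an int.
    static String pickLine(Iterator<String> lines, Random rng) {
        String candidate = null;
        int seen = 0; // number of lines read so far
        while (lines.hasNext()) {
            String line = lines.next();
            seen++;
            // The K-th line replaces the candidate with probability 1/K.
            // For K = 1 this always succeeds, so the first line becomes the candidate.
            if (rng.nextInt(seen) == 0) {
                candidate = line;
            }
        }
        return candidate;
    }

    public static void main(String[] args) {
        Iterator<String> lines = Arrays.asList("alpha", "beta", "gamma", "delta").iterator();
        System.out.println(pickLine(lines, new Random()));
    }
}
```

For a real file, one option is to pass the iterator of the stream returned by java.nio.file.Files.lines(path), which reads lazily, and to close that stream when done.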
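Interview question: pick one million Chinese queries from a Google search log with equal probability
Problem
You are given a Google search log containing hundreds of millions of search records (queries). The queries are in different languages. Randomly select 1 million of the Chinese queries. Assume a tool that decides whether a query is Chinese has already been written.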
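The source gives no answer for this question. One plausible approach, offered here as an assumption rather than as the intended solution, is to run the same reservoir algorithm with k = 1,000,000 but count only the queries that pass the language check, so that non-Chinese queries never advance the index. In the sketch below, FilteredReservoirSampler, sample, and looksChinese are hypothetical names; looksChinese is only a crude stand-in for the pre-written detection tool.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Random;
import java.util.function.Predicate;

class FilteredReservoirSampler {
    // Keeps a uniform random sample of up to k items among those the filter accepts.
    // Assumes k >= 0 and fewer than 2^31 accepted items, because the counter is an int.
    static <T> List<T> sample(Iterator<T> items, Predicate<T> accept, int k, Random rng) {
        List<T> reservoir = new ArrayList<>(k);
        int matched = 0; // accepted items seen so far, including the current one
        while (items.hasNext()) {
            T item = items.next();
            if (!accept.test(item)) {
                continue; // rejected items do not count toward the index
            }
            matched++;
            if (reservoir.size() < k) {
                reservoir.add(item); // fill phase: the first k accepted items
            } else {
                int j = rng.nextInt(matched); // uniform in 0..matched-1
                if (j < k) {
                    reservoir.set(j, item);
                }
            }
        }
        return reservoir;
    }

    // Hypothetical stand-in for the detection tool assumed by the problem:
    // treats a query as Chinese if it contains at least one Han character.
    static boolean looksChinese(String query) {
        return query.codePoints()
                .anyMatch(cp -> Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN);
    }

    public static void main(String[] args) {
        // The escaped strings are short Chinese queries written as Unicode escapes.
        List<String> log = Arrays.asList(
                "weather today", "\u4eca\u5929\u5929\u6c14",
                "reservoir sampling", "\u5317\u4eac\u5730\u94c1", "\u5929\u6c14");
        List<String> picked =
                sample(log.iterator(), FilteredReservoirSampler::looksChinese, 2, new Random());
        System.out.println(picked);
    }
}
```

If fewer than k queries pass the filter, the returned list simply contains all of them. Memory use is proportional to k, not to the size of the log.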