LintCode & LeetCode
  • Introduction
  • Linked List
    • Sort List
    • Merge Two Sorted Lists
    • Merge k Sorted Lists
    • Linked List Cycle
    • Linked List Cycle II
    • Add Two Numbers II
    • Add Two Numbers
    • Odd Even Linked List
    • Intersection of Two Linked Lists
    • Reverse Linked List
    • Reverse Linked List II
    • Remove Linked List Elements
    • Remove Nth Node From End of List
    • Middle of the Linked List
    • Design Linked List
      • Design Singly Linked List
      • Design Doubly Linked List
    • Palindrome Linked List
    • Remove Duplicates from Sorted List
    • Remove Duplicates from Sorted List II
    • Implement Stack Using Singly Linked List
    • Copy List with Random Pointer
  • Binary Search
    • Search in Rotated Sorted Array
    • Search in Rotated Sorted Array II
    • Search in a Sorted Array of Unknown Size
    • First Bad Version
    • Find Minimum in Rotated Sorted Array
    • Find Minimum in Rotated Sorted Array II
    • Find Peak Element
    • Search for a Range
    • Find K Closest Elements
    • Search Insert Position
    • Peak Index in a Mountain Array
    • Heaters
  • Hash Table
    • Jewels and Stones
    • Single Number
    • Subdomain Visit Count
    • Design HashMap
    • Design HashSet
    • Logger Rate Limiter
    • Isomorphic Strings
    • Minimum Index Sum of Two Lists
    • Contains Duplicate II
    • Contains Duplicate III
    • Longest Consecutive Sequence
    • Valid Sudoku
    • Distribute Candies
    • Shortest Word Distance
    • Shortest Word Distance II
  • String
    • Rotate String
    • Add Binary
    • Implement strStr()
    • Longest Common Prefix
    • Reverse Words in a String
    • Reverse Words in a String II
    • Reverse Words in a String III
    • Valid Word Abbreviation
    • Group Anagrams
    • Unique Email Addresses
    • Next Closest Time
    • License Key Formatting
    • String to Integer - atoi
    • Ransom Note
    • Multiply Strings
    • Text Justification
    • Reorder Log Files
    • Most Common Word
    • Valid Parenthesis String
    • K-Substring with K different characters
    • Find All Anagrams in a String
    • Find the Closest Palindrome
    • Simplify Path
  • Array
    • Partition Array
    • Median of Two Sorted Arrays
    • Intersection of Two Arrays
    • Intersection of Two Arrays II
    • Maximum Subarray Sum
    • Minimum Subarray Sum
    • Maximum Subarray II
    • Maximum Subarray III
    • Subarray Sum Closest
    • Subarray Sum
    • Plus One
    • Maximum Subarray Difference
    • Maximum Subarray IV
    • Subarray Sum Equals K
    • Intersection of Two Arrays
    • Intersection of Two Arrays II
    • Find Pivot Index
    • Rotate Array
    • Get Smallest Nonnegative Integer Not In The Array
    • Maximize Distance to Closest Person
    • Sort Colors
    • Next Permutation
    • Rotate Image
    • Pour Water
    • Prison Cells After N Days
    • Majority Element
    • Can Place Flowers
    • Candy
  • Matrix
    • Spiral Matrix
    • Set Matrix Zeroes
    • Diagonal Traverse
  • Queue
    • Design Circular Queue
    • Implement Queue using Stacks
    • Implement Queue by Two Stacks
    • Implement Stack using Queues
    • Moving Average from Data Stream
    • Walls and Gates
    • Open the Lock
    • Sliding Window Maximum
    • Implement Queue Using Fixed Length Array
    • Animal Shelter
  • Stack
    • Valid Parentheses
    • Longest Valid Parentheses
    • Min Stack
    • Max Stack
    • Daily Temperatures
    • Evaluate Reverse Polish Notation
    • Next Greater Element I
    • Next Greater Element II
    • Next Greater Element III
    • Largest Rectangle in Histogram
    • Maximal Rectangle
    • Car Fleet
  • Heap
    • Trapping Rain Water II
    • The Skyline Problem
    • Top K Frequent Words
    • Top K Frequent Words II
    • Top K Frequent Elements
    • Top k Largest Numbers
    • Top k Largest Numbers II
    • Minimum Cost to Hire K Workers
    • Kth Largest Element in an Array
    • Kth Smallest Number in Sorted Matrix
    • Kth Smallest Sum In Two Sorted Arrays
    • K Closest Points to the Origin
    • Merge K Sorted Lists
    • Merge K Sorted Arrays
    • Top K Frequent Words - Map Reduce
  • Data Structure & Design
    • Hash Function
    • Heapify
    • LRU Cache
    • LFU Cache
    • Rehashing
    • Stack Sorting
    • Animal Shelter
    • Sliding Window Maximum
    • Moving Average from Data Stream
    • Find Median from Data Stream
    • Sliding Window Median
    • Design Hit Counter
    • Read N Characters Given Read4 II - Call multiple times
    • Read N Characters Given Read4
    • Flatten 2D Vector
    • Flatten Nested List Iterator
    • Design Search Autocomplete System
    • Time Based Key-Value Store
    • Design Tic-Tac-Toe
    • Insert Delete GetRandom O(1)
  • Union Find
    • Find the Connected Component in the Undirected Graph
    • Find the Weak Connected Component in the Directed Graph
    • Graph Valid Tree
    • Number of Islands
    • Number of Islands II
    • Surrounded Regions
    • Most Stones Removed with Same Row or Column
    • Redundant Connection
  • Trie
    • Implement Trie
    • Add and Search Word
    • Word Search II
    • Longest Word in Dictionary
    • Palindrome Pairs
    • Trie Serialization
    • Trie Service
    • Design Search Autocomplete System
    • Typeahead
  • Trees
    • Binary Tree Inorder Traversal
    • Binary Tree Postorder Traversal
    • Binary Tree Preorder Traversal
    • Binary Tree Level Order Traversal
    • Binary Tree Zigzag Level Order Traversal
    • Binary Tree Vertical Order Traversal
    • N-ary Tree Level Order Traversal
    • N-ary Tree Preorder Traversal
    • N-ary Tree Postorder Traversal
    • Construct Binary Tree from Preorder and Inorder Traversal
    • Populating Next Right Pointers in Each Node
    • Populating Next Right Pointers in Each Node II
    • Maximum Depth of Binary Tree
    • Symmetric Tree
    • Validate Binary Search Tree
    • Convert Sorted Array to Binary Search Tree
    • Path Sum
    • Path Sum II
    • Path Sum III
    • Binary Tree Maximum Path Sum
    • Kth Smallest Element in a BST
    • Same Tree
    • Lowest Common Ancestor of a Binary Tree
    • Lowest Common Ancestor of a Binary Search Tree
    • Nested List Weight Sum II
    • BST Node Distance
    • Minimum Distance (Difference) Between BST Nodes
    • Closet Common Manager
    • N-ary Tree Postorder Traversal
    • Serialize and Deserialize Binary Tree
    • Serialize and Deserialize N-ary Tree
    • Diameter of a Binary Tree
    • Print Binary Trees
  • Segment Tree
    • Segment Tree Build
    • Range Sum Query - Mutable
  • Binary Indexed Tree
  • Graph & Search
    • Clone Graph
    • N Queens
    • Six Degrees
    • Number of Islands
    • Number of Distinct Islands
    • Word Search
    • Course Schedule
    • Course Schedule II
    • Word Ladder
    • Redundant Connection
    • Redundant Connection II
    • Longest Increasing Path in a Matrix
    • Reconstruct Itinerary
    • The Maze
    • The Maze II
    • The Maze III
    • Topological Sorting
    • Island Perimeter
    • Flood Fill
    • Cheapest Flights Within K Stops
    • Evaluate Division
    • Alien Dictionary
    • Cut Off Trees for Golf Event
    • Jump Game II
    • Most Stones Removed with Same Row or Column
  • Backtracking
    • Subsets
    • Subsets II
    • Letter Combinations of a Phone Number
    • Permutations
    • Permutations II
    • Combinations
    • Combination Sum
    • Combination Sum II
    • Combination Sum III
    • Combination Sum IV
    • N-Queens
    • N-Queens II
    • Generate Parentheses
    • Subsets of Size K
  • Two Pointers
    • Two Sum II
    • Triangle Count
    • Trapping Rain Water
    • Container with Most Water
    • Minimum Size Subarray Sum
    • Minimum Window Substring
    • Longest Substring Without Repeating Characters
    • Longest Substring with At Most K Distinct Characters
    • Longest Substring with At Most Two Distinct Characters
    • Fruit Into Baskets
    • Nuts & Bolts Problem
    • Valid Palindrome
    • The Smallest Difference
    • Reverse String
    • Remove Element
    • Max Consecutive Ones
    • Max Consecutive Ones II
    • Remove Duplicates from Sorted Array
    • Remove Duplicates from Sorted Array II
    • Move Zeroes
    • Longest Repeating Character Replacement
    • 3Sum With Multiplicity
    • Merge Sorted Array
    • 3Sum Smaller
    • Backspace String Compare
  • Mathematics
    • Ugly Number
    • Ugly Number II
    • Super Ugly Number
    • Sqrt(x)
    • Random Number 1 to 7 With Equal Probability
    • Pow(x, n)
    • Narcissistic Number
    • Rectangle Overlap
    • Happy Number
    • Add N Days to Given Date
    • Reverse Integer
    • Greatest Common Divisor or Highest Common Factor
  • Bit Operation
    • IP to CIDR
  • Random
    • Random Pick with Weight
    • Random Pick Index
    • Linked List Random Node
  • Dynamic Programming
    • House Robber
    • House Robber II
    • House Robber III
    • Longest Increasing Continuous Subsequence
    • Longest Increasing Continuous Subsequence II
    • Coins in a Line
    • Coins in a Line II
    • Coins in a Line III
    • Maximum Product Subarray
    • Longest Palindromic Substring
    • Stone Game
    • Burst Balloons
    • Perfect Squares
    • Triangle
    • Pascal's Triangle
    • Pascal's Triangle II
    • Min Cost Climbing Stairs
    • Climbing Stairs
    • Unique Paths
    • Unique Paths II
    • Minimum Path Sum
    • Word Break
    • Word Break II
    • Range Sum Query - Immutable
    • Decode Ways
    • Edit Distance
    • Unique Binary Search Trees
    • Unique Binary Search Trees II
    • Maximal Rectangle
    • Maximal Square
    • Regular Expression Matching
    • Wildcard Matching
    • Flip Game II
    • Longest Increasing Subsequence
    • Target Sum
    • Partition Equal Subset Sum
    • Coin Change
    • Jump Game
    • Can I Win
    • Maximum Sum Rectangle in a 2D Matrix
    • Cherry Pick
  • Knapsack
    • Backpack
    • Backpack II
    • Backpack III
    • Backpack IV
    • Backpack V
    • Backpack VI
    • Backpack VII
    • Coin Change
    • Coin Change II
  • High Frequency
    • 2 Sum Closest
    • 3 Sum
    • 3 Sum Closest
    • Sort Colors II
    • Majority Number
    • Majority Number II
    • Majority Number III
    • Best Time to Buy and Sell Stock
    • Best Time to Buy and Sell Stock II
    • Best Time to Buy and Sell Stock III
    • Best Time to Buy and Sell Stock IV
    • Two Sum
    • Two Sum II - Input array is sorted
    • Two Sum III - Data structure design
    • Two Sum IV - Input is a BST
    • 4 Sum
    • 4 Sum II
  • Sorting
  • Greedy
    • Jump Game II
    • Remove K Digits
  • Minimax
    • Nim Game
    • Can I Win
  • Sweep Line & Interval
    • Meeting Rooms
    • Meeting Rooms II
    • Merge Intervals
    • Insert Interval
    • Number of Airplanes in the Sky
    • Exam Room
    • Employee Free Time
    • Closest Pair of Points
    • My Calendar I
    • My Calendar II
    • My Calendar III
    • Add Bold Tag in String
  • Other Algorithms and Data Structure
    • Huffman Coding
    • Reservoir Sampling
    • Bloom Filter
    • External Sorting
    • Construct Quad Tree
  • Company Tag
    • Google
      • Guess the Word
      • Raindrop on Sidewalk
    • Airbnb
      • Display Pages (Pagination)
    • Amazon
  • Problem Solving Summary
    • String or Array Rotation
    • Tips for Avoiding Bugs
    • Substring or Subarray Search
    • Sliding Window
    • K Sums
    • Combination Sum Series
    • Knapsack Problems
    • Depth-first Search
    • Large Number Operation
    • Implementation - Simulation
    • Monotonic Stack & Queue
    • Top K Problems
    • Java Interview Tips
      • OOP in Java
      • Conversion in Java
      • Data Structures in Java
    • Algorithm Optimization Tips
  • Reference
Powered by GitBook
On this page
  • Interview Questions
  • 面试题:等概率挑出文件中的一行
  • 面试题:等概率的挑选Google搜索记录日志中的一百万条中文搜索记录
  • Reference

Was this helpful?

  1. Other Algorithms and Data Structure

Reservoir Sampling

PreviousHuffman CodingNextBloom Filter

Last updated 5 years ago

Was this helpful?

Wikipedia:

is a family of randomized algorithms for randomly choosing k samples from a list of n items, where n is either a very large or unknown number. Typically n is large enough that the list doesn’t fit into main memory.

O(n) time solution:

  1. Create an array reservoir[0..k-1] and copy first k items of stream[] to it.

  2. Now one by one consider all items from (k+1)th item to nth item.

    1. Generate a random number from 0 to i where i is index of current item in stream[]. Let the generated random number is j.

    2. If j is in range 0 to k-1, replace reservoir[j] with arr[i]

Code

// An efficient Java program to randomly 
// select k items from a stream of items 
import java.util.Arrays; 
import java.util.Random; 
public class ReservoirSampling 
{ 
    // A function to randomly select k items from stream[0..n-1]. 
    static void selectKItems(int stream[], int n, int k) 
    { 
        int i; // index for elements in stream[] 

        // reservoir[] is the output array. Initialize it with 
        // first k elements from stream[] 
        int reservoir[] = new int[k]; 
        for (i = 0; i < k; i++) {
            reservoir[i] = stream[i]; 
        }

        Random r = new Random(); 

        // Iterate from the (k+1)th element to nth element 
        for (; i < n; i++) 
        { 
            // Pick a random index from 0 to i. 
            int j = r.nextInt(i + 1); 

            // If the randomly picked index is smaller than k, 
            // then replace the element present at the index 
            // with new element from stream 
            if(j < k) {
                reservoir[j] = stream[i];
            }

        } 

        System.out.println("Following are k randomly selected items"); 
        System.out.println(Arrays.toString(reservoir)); 
    } 

    //Driver Program to test above method 
    public static void main(String[] args) { 
        int stream[] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12}; 
        int n = stream.length; 
        int k = 5; 
        selectKItems(stream, n, k); 
    } 
} 
//This code is contributed by Sumit Ghosh

How does it work?

To Prove: The probability that any item stream[i] where 0 <= i < n will be in final reservoir[] is k/n.

Case 1: For last n-k stream items, i.e., for stream[i] where k <= i < n

For stream[n - 1]:

The probability that the last item is in final reservoir 

= The probability that one of the first k indexes is picked for last item 

= k/n (the probability of picking one of the k items from a list of size n)

For stream[n-2]:

The probability that the second last item is in final reservoir[]

= [Probability that one of the first k indexes is picked in iteration for stream[n-2]] X 
    [Probability that the index picked in iteration for stream[n-1] is not same as index picked for stream[n-2] ] 

= [k/(n-1)]*[(n-1)/n] = k/n.

Case 2: For first k stream items, i.e., for stream[i] where 0 <= i < k

The first k items are initially copied to reservoir[] and may be removed later in iterations for stream[k] to stream[n].

The probability that an item from stream[0..k-1] is in final array 

= Probability that the item is not picked when items stream[k], stream[k+1], …. stream[n-1] are considered 

= [k/(k+1)] x [(k+1)/(k+2)] x [(k+2)/(k+3)] x … x [(n-1)/n] = k/n

Implementation: Select K Items from A Stream of N element

static void selectKItems(int stream[], int n, int k)
{
    int i; // index for elements in stream[]
    // reservoir[] is the output array. Initialize it with
    // first k elements from stream[]
    int reservoir[] = new int[k];
    for (i = 0; i < k; i++) {
        reservoir[i] = stream[i];
    }
    Random r = new Random();
    // Iterate from the (k+1)th element to nth element
    for (; i < n; i++)
    {
        // Pick a random index from 0 to i.
        int j = r.nextInt(i + 1);
        // If the randomly picked index is smaller than k,
        // then replace the element present at the index
        // with new element from stream
        if(j < k) {
            reservoir[j] = stream[i];
        }
    }
    System.out.println("Following are k randomly selected items"); 
    System.out.println(Arrays.toString(reservoir));
}

Interview Questions

面试题:等概率挑出文件中的一行

问题描述

Amazon: 一个文件中有很多行,不能全部放到内存中,如何等概率的随机挑出其中的一行?

问题解答

先将第一行设为候选的被选中的那一行,然后一行一行的扫描文件。假如现在是第 K 行,那么第 K 行被选中踢掉现在的候选行成为新的候选行的概率为 1/K。用一个随机函数看一下是否命中这个概率即可。命中了,就替换掉现在的候选行然后继续,没有命中就继续看下一行。

面试题:等概率的挑选Google搜索记录日志中的一百万条中文搜索记录

问题描述

给你一个 Google 搜索日志记录,存有上亿挑搜索记录(Query)。这些搜索记录包含不同的语言。随机挑选出其中的 100 万条中文搜索记录。假设判断一条 Query 是不是中文的工具已经写好了。

问题解答

这个题是一个经典的概率算法问题。这个问题的本质是一个数据流问题,虽然题目跟你说的是给了你一个“死”文件,但如果你的算法是基于 Offline 的数据的话,面试官也一定会追问一个 Online 的算法,即如何在一条一条的搜索记录飞驰而过的过程中,随机挑选出 100 万条中文搜索记录。

那在线算法是怎样的?

这个方法你记住答案即可:假设你一共要挑选 N 个 Queries,设置一个 N 的 Buffer,用于存放你选中的 Queries。对于每一条飞驰而过的 Query,按照如下步骤执行你的算法:

  1. 如果非中文,直接跳过

  2. 如果 Buffer 不满,将这条 Query 直接加入 Buffer 中

  3. 如果 Buffer 满了,假设当前一共出了过 M 条中文 Queries,用一个随机函数,以 N / M 的概率来决定这条 Query 是否能被选中留下。

    3.1 如果没有选中,则跳过该 Query,继续处理下一条 Query

    3.2 如果选中了,则用一个随机函数,以 1 / N 的概率从 Buffer 中随机挑选一个 Query 来丢掉,让当前的 Query 放进去。

Reference

    • (1 / i) * (1 - 1/ (i + 1)) * (1 - 1/(i + 2)) * ... * (1 - 1 / n) = 1/n

题目来源:

题目来源:

Implementation:

👍 Youtube - Reservoir Sampling:

GeeksforGeeks:

Wikipedia:

Reservoir sampling
https://www.careercup.com/question?id=13218749
https://www.careercup.com/question?id=83697
https://www.youtube.com/watch?v=A1iwzSew5QY
https://www.geeksforgeeks.org/reservoir-sampling/
https://en.wikipedia.org/wiki/Reservoir_sampling
Select K Items from A Stream of N element