Bloom Filter


Last updated 5 years ago


Source: Jiuzhang's Tutorial: https://www.jiuzhang.com/tutorial/big-data-interview-questions/238

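A Bloom Filter (BF) is a more space-efficient alternative to a hash table. Problems that process massive data sets often call for a hash table, and they often run out of memory. In those cases a BF is a good choice.

A Bloom Filter generally serves two purposes:

  1. Checking whether an element is in a set

  2. Counting how many times an element occurs

For these tasks a BF does what a hash table does, but it uses less storage. The saved space comes with a side effect: false positives.

What is a false positive? Put simply, when a hash table says an element is in the set, it is in the set. When a BF says an element is in the set, the element may in fact be absent.

A BF never produces a false negative, however. If a BF says an element is not in the set, it is definitely not in the set.

Depending on the problem to be solved, Bloom Filters come in two kinds:

  1. Standard Bloom Filter (SBF), the counterpart of Java's HashSet

  2. Counting Bloom Filter (CBF), the counterpart of Java's HashMap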

k Independent Hash Functions

Different hash functions can be obtained from different algorithms. A fairly general way to write one is:

def hashfunc(string, hashsize):
    code = 0
    for c in string:
        code = code * 31 + ord(c)
        code = code % hashsize

    return code

To design k independent hash functions, simply change the magic number 31 in the function above, for example to 37 or 41.
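As a minimal sketch of that idea, the multiplier can be turned into a parameter so that each value yields its own hash function. The name `make_hashfunc`, the table size 1000, and the chosen multipliers are illustrative assumptions, not part of the original tutorial. Functions that differ only in their multiplier are not independent in a strict statistical sense; this only mirrors the simple approach described above.

```python
def make_hashfunc(multiplier, hashsize):
    """Return a one-argument hash function that uses the given multiplier."""
    def hashfunc(string):
        code = 0
        for c in string:
            # Same scheme as above, with 31 replaced by `multiplier`.
            code = (code * multiplier + ord(c)) % hashsize
        return code
    return hashfunc


# Hypothetical table size and multipliers, for illustration only.
HASHSIZE = 1000
hash_functions = [make_hashfunc(m, HASHSIZE) for m in (31, 37, 41)]

# One position per hash function for the same key.
positions = [f("bloom") for f in hash_functions]
```

Each function takes only the key, which matches how `func(key)` is called in the pseudo code later on this page.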

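What is the magic number 31?

The algorithm above treats the string as a number written in base 31 and converts it to an integer. It takes the result modulo hashsize at every step of the conversion to avoid overflow.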
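A small worked example may make the base-31 reading concrete. The helper `base31_value` is a hypothetical name introduced here; it evaluates the string as a base-31 number with no modulo, and the check shows that reducing at every step gives the same result as reducing once at the end. Python integers do not overflow, so the step-by-step modulo matters mainly in languages with fixed-width integers.

```python
def hashfunc(string, hashsize):
    code = 0
    for c in string:
        code = (code * 31 + ord(c)) % hashsize
    return code


def base31_value(string):
    # Interpret the string as a base-31 number, without any modulo.
    value = 0
    for c in string:
        value = value * 31 + ord(c)
    return value


# "abc" -> 97 * 31**2 + 98 * 31 + 99 = 96354
full = base31_value("abc")

# Reducing at each step agrees with reducing once at the end.
agrees = hashfunc("abc", 1000) == full % 1000  # True
```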

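31 is not the only possible choice, but a few basic rules should be followed:

  1. It must not be too small. With a small multiplier, the values hashfunc produces for short strings tend to cluster together, which raises the chance of hash collisions.

  2. It must not be too large. A large multiplier hurts computation efficiency.

  3. It should preferably not be a composite number. A composite multiplier may also raise the chance of hash collisions.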
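The first rule can be checked with a small experiment. The sketch below hashes all 676 two-letter lowercase strings and counts the distinct hash values for a given multiplier. The function name `distinct_hashes` and the table size 1_000_003 are assumptions chosen for illustration; the table is large enough that the modulo never causes a collision here.

```python
from itertools import product
from string import ascii_lowercase


def distinct_hashes(multiplier, hashsize=1_000_003):
    """Count distinct hash values over all two-letter lowercase strings."""
    seen = set()
    for pair in product(ascii_lowercase, repeat=2):
        code = 0
        for c in pair:
            code = (code * multiplier + ord(c)) % hashsize
        seen.add(code)
    return len(seen)


small = distinct_hashes(2)    # 76: the 676 strings crowd into few values
large = distinct_hashes(31)   # 676: every string gets its own value
```

With multiplier 2 the values all fall between 291 and 366, so many strings share a hash. With 31 the multiplier exceeds the spread of the letter codes, so no two of these strings collide.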

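Standard Bloom Filter

A standard Bloom Filter plays the role of a HashSet. It is a data structure that supports the following operations:

  1. Adding an element to the set

  2. Checking whether an element is in the set (false positives are allowed)

Implementation Steps

  1. Initialization: allocate a sufficiently large boolean array with every entry set to false.

  2. Inserting an element: compute the element's k hash values with the k hash functions, take each modulo the length of the boolean array, and set the values at the k resulting positions to true.

  3. Querying an element: using the same k hash functions, read the values at the k corresponding positions of the boolean array. If all of them are true, the element may be present. If any of them is false, the element is definitely absent.

Pseudo Code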

class StandardBloomFilter:

    def __init__(self, capacity, hash_functions):
        # capacity is the size of the boolean array; it should be
        # large enough to hold all of the keys
        self.capacity = capacity
        self.bitset = [False] * capacity 

        # k hash functions
        self.hash_functions = hash_functions

    def add(self, key):
        for func in self.hash_functions:
            position = func(key) % self.capacity
            self.bitset[position] = True

    def contains(self, key):
        for func in self.hash_functions:
            position = func(key) % self.capacity
            if self.bitset[position] is False:
                return False

        return True

Space Optimization

In a real implementation, bit operations can replace the boolean array to save more space. In Java, the BitSet class can be used directly.
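The bit-level idea can be sketched in Python with a `bytearray` that packs eight flags into each byte. The class name `BitArray` and its `set`/`get` methods are hypothetical names chosen here. This is one possible layout under those assumptions, not the tutorial's own implementation.

```python
class BitArray:
    """Fixed-size array of bits, packed eight per byte."""

    def __init__(self, size):
        self.size = size
        # Round up so that every bit index has a byte to live in.
        self.data = bytearray((size + 7) // 8)

    def set(self, i):
        # i >> 3 selects the byte, i & 7 selects the bit inside it.
        self.data[i >> 3] |= 1 << (i & 7)

    def get(self, i):
        return bool((self.data[i >> 3] >> (i & 7)) & 1)


bits = BitArray(100)   # 100 flags stored in 13 bytes
bits.set(3)
bits.set(64)
```

A Bloom Filter could use such a structure in place of `[False] * capacity`, calling `set(position)` when adding and `get(position)` when querying.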

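Q: What if the space runs out? If the boolean array allocated at the start is too small and every entry ends up set to true, won't contains return true every time?

A: In practice we usually estimate in advance roughly how much space will be needed and how large an array is appropriate. Another solution is an Extended Bloom Filter. When a Bloom Filter is full, allocate a new one with a larger capacity (twice as large) and keep the original. Insertions always go into the newest Bloom Filter, while a query has to check every Bloom Filter.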
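The chaining scheme in the answer above could be sketched roughly as follows. The class names, the four multipliers, and the "full" test (a fixed number of array slots per stored key, passed as `bits_per_key`) are all assumptions made for this illustration.

```python
class SimpleBloomFilter:
    def __init__(self, capacity, multipliers=(31, 37, 41, 43)):
        self.capacity = capacity
        self.bits = [False] * capacity
        self.multipliers = multipliers
        self.count = 0  # number of add() calls so far

    def _positions(self, key):
        # One position per multiplier, using the base-m string hash.
        for m in self.multipliers:
            code = 0
            for c in key:
                code = (code * m + ord(c)) % self.capacity
            yield code

    def add(self, key):
        for p in self._positions(key):
            self.bits[p] = True
        self.count += 1

    def contains(self, key):
        return all(self.bits[p] for p in self._positions(key))


class ExtendedBloomFilter:
    def __init__(self, capacity, bits_per_key=40):
        self.bits_per_key = bits_per_key
        self.filters = [SimpleBloomFilter(capacity)]

    def add(self, key):
        newest = self.filters[-1]
        # Assumed fullness rule: too few array slots per stored key.
        if newest.count * self.bits_per_key >= newest.capacity:
            newest = SimpleBloomFilter(newest.capacity * 2)
            self.filters.append(newest)
        newest.add(key)

    def contains(self, key):
        # Old filters are kept, so every one of them must be checked.
        return any(f.contains(key) for f in self.filters)
```

Because old filters are never discarded, a key inserted earlier is still found, at the cost of queries that get slower as the chain grows.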

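Q: How do we decide whether a Bloom Filter is full?

A: With 4 hash functions, a ratio of about 40:1 between the size of the boolean array and the number of elements it can actually hold is appropriate. This is an empirical value.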
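To get a feel for what that ratio implies, the commonly used approximation for the false positive rate of a Bloom Filter, (1 - e^(-kn/m))^k for m array slots, n stored elements, and k hash functions, can be evaluated directly. This formula is not part of the original tutorial, and it assumes hash functions that spread keys uniformly and independently, so the number is a rough estimate only.

```python
import math


def false_positive_rate(m, n, k):
    """Approximate false positive rate for m slots, n keys, k hash functions."""
    return (1 - math.exp(-k * n / m)) ** k


# 40 slots per stored element with 4 hash functions.
p = false_positive_rate(m=40, n=1, k=4)  # roughly 8e-5
```

Under these assumptions, roughly 8 queries in 100,000 for absent keys would be answered "possibly present".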

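Counting Bloom Filter

A small change to the standard Bloom Filter, replacing the boolean array used for storage with an int array, produces a Bloom Filter that can count: the Counting Bloom Filter (CBF). This data structure resembles Java's HashMap, but it can only be used for counting. It supports the following operations:

  1. Adding an element to the collection in O(1) time

  2. Counting how many times an element occurs in the collection in O(1) time, although the result may be somewhat larger than the actual count

Implementation Steps

  1. Initialization: allocate a sufficiently large int array with every entry set to 0.

  2. Inserting an element: compute the element's k hash values with the k hash functions, take each modulo the length of the int array, and add one to the values at the k resulting positions.

  3. Querying an element's count: using the same k hash functions, read the values at the k corresponding positions of the int array, and take the smallest of them as the estimated count of the element.

Pseudo Code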

import sys


class CountingBloomFilter:

    def __init__(self, capacity, hash_functions):
        # capacity is the size of the counter array; it should be
        # large enough to hold all of the keys
        self.capacity = capacity
        self.bitset = [0] * capacity  # int counters rather than bits

        # k hash functions
        self.hash_functions = hash_functions

    def add(self, key):
        for func in self.hash_functions:
            position = func(key) % self.capacity
            self.bitset[position] += 1

    def contains(self, key):
        # returns the estimated count: the minimum of the k counters
        count = sys.maxsize  # sys.maxint does not exist in Python 3
        for func in self.hash_functions:
            position = func(key) % self.capacity
            count = min(count, self.bitset[position])

        return count

LintCode Practice Links

Q & A

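Q: Why take the minimum?

A: Suppose we use two hash functions. key1 hashes to indices 0 and 1, and key2 hashes to indices 1 and 2. Then counts[1] equals 2, because both keys affected it. Taking the minimum keeps the estimate as close as possible to the true count.

Q: Why is it called an estimated count rather than an exact count?

A: Continuing from the previous answer, suppose there is also a key3 that hashes to indices 0 and 2. Then counts[0] through counts[2] all equal 2, so querying any of key1, key2, or key3 returns 2, which is larger than the actual count of 1.

Q: Can the count computed by a CBF be smaller than the actual count?

A: No. Every insertion of a key increments all k of its counters, so each counter is at least the key's true count, and so is their minimum.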
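The key1, key2, key3 scenario from the answers above can be replayed in a few lines. To make the indices match the text exactly, this sketch uses a hypothetical `TinyCountingFilter` that looks positions up in a fixed table instead of computing real hash functions.

```python
class TinyCountingFilter:
    def __init__(self, capacity, positions_of):
        self.counts = [0] * capacity
        # Maps each key to its fixed indices, standing in for k hash functions.
        self.positions_of = positions_of

    def add(self, key):
        for p in self.positions_of[key]:
            self.counts[p] += 1

    def count(self, key):
        return min(self.counts[p] for p in self.positions_of[key])


positions_of = {"key1": [0, 1], "key2": [1, 2], "key3": [0, 2]}
cbf = TinyCountingFilter(3, positions_of)

cbf.add("key1")
cbf.add("key2")
after_two = cbf.count("key1")    # 1: counts is [1, 2, 1], the minimum is exact

cbf.add("key3")
after_three = cbf.count("key1")  # 2: counts is [2, 2, 2], an overestimate
```

The estimate for key1 rises from 1 to 2 even though key1 was added only once, and it never drops below the true count.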

Practice this topic on LintCode:

https://www.jiuzhang.com/tutorial/big-data-interview-questions/238
http://www.lintcode.com/problem/hash-function/
http://www.lintcode.com/problem/standard-bloom-filter/
http://www.lintcode.com/problem/counting-bloom-filter/