LintCode & LeetCode
  • Introduction
  • Linked List
    • Sort List
    • Merge Two Sorted Lists
    • Merge k Sorted Lists
    • Linked List Cycle
    • Linked List Cycle II
    • Add Two Numbers II
    • Add Two Numbers
    • Odd Even Linked List
    • Intersection of Two Linked Lists
    • Reverse Linked List
    • Reverse Linked List II
    • Remove Linked List Elements
    • Remove Nth Node From End of List
    • Middle of the Linked List
    • Design Linked List
      • Design Singly Linked List
      • Design Doubly Linked List
    • Palindrome Linked List
    • Remove Duplicates from Sorted List
    • Remove Duplicates from Sorted List II
    • Implement Stack Using Singly Linked List
    • Copy List with Random Pointer
  • Binary Search
    • Search in Rotated Sorted Array
    • Search in Rotated Sorted Array II
    • Search in a Sorted Array of Unknown Size
    • First Bad Version
    • Find Minimum in Rotated Sorted Array
    • Find Minimum in Rotated Sorted Array II
    • Find Peak Element
    • Search for a Range
    • Find K Closest Elements
    • Search Insert Position
    • Peak Index in a Mountain Array
    • Heaters
  • Hash Table
    • Jewels and Stones
    • Single Number
    • Subdomain Visit Count
    • Design HashMap
    • Design HashSet
    • Logger Rate Limiter
    • Isomorphic Strings
    • Minimum Index Sum of Two Lists
    • Contains Duplicate II
    • Contains Duplicate III
    • Longest Consecutive Sequence
    • Valid Sudoku
    • Distribute Candies
    • Shortest Word Distance
    • Shortest Word Distance II
  • String
    • Rotate String
    • Add Binary
    • Implement strStr()
    • Longest Common Prefix
    • Reverse Words in a String
    • Reverse Words in a String II
    • Reverse Words in a String III
    • Valid Word Abbreviation
    • Group Anagrams
    • Unique Email Addresses
    • Next Closest Time
    • License Key Formatting
    • String to Integer - atoi
    • Ransom Note
    • Multiply Strings
    • Text Justification
    • Reorder Log Files
    • Most Common Word
    • Valid Parenthesis String
    • K-Substring with K different characters
    • Find All Anagrams in a String
    • Find the Closest Palindrome
    • Simplify Path
  • Array
    • Partition Array
    • Median of Two Sorted Arrays
    • Intersection of Two Arrays
    • Intersection of Two Arrays II
    • Maximum Subarray Sum
    • Minimum Subarray Sum
    • Maximum Subarray II
    • Maximum Subarray III
    • Subarray Sum Closest
    • Subarray Sum
    • Plus One
    • Maximum Subarray Difference
    • Maximum Subarray IV
    • Subarray Sum Equals K
    • Intersection of Two Arrays
    • Intersection of Two Arrays II
    • Find Pivot Index
    • Rotate Array
    • Get Smallest Nonnegative Integer Not In The Array
    • Maximize Distance to Closest Person
    • Sort Colors
    • Next Permutation
    • Rotate Image
    • Pour Water
    • Prison Cells After N Days
    • Majority Element
    • Can Place Flowers
    • Candy
  • Matrix
    • Spiral Matrix
    • Set Matrix Zeroes
    • Diagonal Traverse
  • Queue
    • Design Circular Queue
    • Implement Queue using Stacks
    • Implement Queue by Two Stacks
    • Implement Stack using Queues
    • Moving Average from Data Stream
    • Walls and Gates
    • Open the Lock
    • Sliding Window Maximum
    • Implement Queue Using Fixed Length Array
    • Animal Shelter
  • Stack
    • Valid Parentheses
    • Longest Valid Parentheses
    • Min Stack
    • Max Stack
    • Daily Temperatures
    • Evaluate Reverse Polish Notation
    • Next Greater Element I
    • Next Greater Element II
    • Next Greater Element III
    • Largest Rectangle in Histogram
    • Maximal Rectangle
    • Car Fleet
  • Heap
    • Trapping Rain Water II
    • The Skyline Problem
    • Top K Frequent Words
    • Top K Frequent Words II
    • Top K Frequent Elements
    • Top k Largest Numbers
    • Top k Largest Numbers II
    • Minimum Cost to Hire K Workers
    • Kth Largest Element in an Array
    • Kth Smallest Number in Sorted Matrix
    • Kth Smallest Sum In Two Sorted Arrays
    • K Closest Points to the Origin
    • Merge K Sorted Lists
    • Merge K Sorted Arrays
    • Top K Frequent Words - Map Reduce
  • Data Structure & Design
    • Hash Function
    • Heapify
    • LRU Cache
    • LFU Cache
    • Rehashing
    • Stack Sorting
    • Animal Shelter
    • Sliding Window Maximum
    • Moving Average from Data Stream
    • Find Median from Data Stream
    • Sliding Window Median
    • Design Hit Counter
    • Read N Characters Given Read4 II - Call multiple times
    • Read N Characters Given Read4
    • Flatten 2D Vector
    • Flatten Nested List Iterator
    • Design Search Autocomplete System
    • Time Based Key-Value Store
    • Design Tic-Tac-Toe
    • Insert Delete GetRandom O(1)
  • Union Find
    • Find the Connected Component in the Undirected Graph
    • Find the Weak Connected Component in the Directed Graph
    • Graph Valid Tree
    • Number of Islands
    • Number of Islands II
    • Surrounded Regions
    • Most Stones Removed with Same Row or Column
    • Redundant Connection
  • Trie
    • Implement Trie
    • Add and Search Word
    • Word Search II
    • Longest Word in Dictionary
    • Palindrome Pairs
    • Trie Serialization
    • Trie Service
    • Design Search Autocomplete System
    • Typeahead
  • Trees
    • Binary Tree Inorder Traversal
    • Binary Tree Postorder Traversal
    • Binary Tree Preorder Traversal
    • Binary Tree Level Order Traversal
    • Binary Tree Zigzag Level Order Traversal
    • Binary Tree Vertical Order Traversal
    • N-ary Tree Level Order Traversal
    • N-ary Tree Preorder Traversal
    • N-ary Tree Postorder Traversal
    • Construct Binary Tree from Preorder and Inorder Traversal
    • Populating Next Right Pointers in Each Node
    • Populating Next Right Pointers in Each Node II
    • Maximum Depth of Binary Tree
    • Symmetric Tree
    • Validate Binary Search Tree
    • Convert Sorted Array to Binary Search Tree
    • Path Sum
    • Path Sum II
    • Path Sum III
    • Binary Tree Maximum Path Sum
    • Kth Smallest Element in a BST
    • Same Tree
    • Lowest Common Ancestor of a Binary Tree
    • Lowest Common Ancestor of a Binary Search Tree
    • Nested List Weight Sum II
    • BST Node Distance
    • Minimum Distance (Difference) Between BST Nodes
    • Closet Common Manager
    • N-ary Tree Postorder Traversal
    • Serialize and Deserialize Binary Tree
    • Serialize and Deserialize N-ary Tree
    • Diameter of a Binary Tree
    • Print Binary Trees
  • Segment Tree
    • Segment Tree Build
    • Range Sum Query - Mutable
  • Binary Indexed Tree
  • Graph & Search
    • Clone Graph
    • N Queens
    • Six Degrees
    • Number of Islands
    • Number of Distinct Islands
    • Word Search
    • Course Schedule
    • Course Schedule II
    • Word Ladder
    • Redundant Connection
    • Redundant Connection II
    • Longest Increasing Path in a Matrix
    • Reconstruct Itinerary
    • The Maze
    • The Maze II
    • The Maze III
    • Topological Sorting
    • Island Perimeter
    • Flood Fill
    • Cheapest Flights Within K Stops
    • Evaluate Division
    • Alien Dictionary
    • Cut Off Trees for Golf Event
    • Jump Game II
    • Most Stones Removed with Same Row or Column
  • Backtracking
    • Subsets
    • Subsets II
    • Letter Combinations of a Phone Number
    • Permutations
    • Permutations II
    • Combinations
    • Combination Sum
    • Combination Sum II
    • Combination Sum III
    • Combination Sum IV
    • N-Queens
    • N-Queens II
    • Generate Parentheses
    • Subsets of Size K
  • Two Pointers
    • Two Sum II
    • Triangle Count
    • Trapping Rain Water
    • Container with Most Water
    • Minimum Size Subarray Sum
    • Minimum Window Substring
    • Longest Substring Without Repeating Characters
    • Longest Substring with At Most K Distinct Characters
    • Longest Substring with At Most Two Distinct Characters
    • Fruit Into Baskets
    • Nuts & Bolts Problem
    • Valid Palindrome
    • The Smallest Difference
    • Reverse String
    • Remove Element
    • Max Consecutive Ones
    • Max Consecutive Ones II
    • Remove Duplicates from Sorted Array
    • Remove Duplicates from Sorted Array II
    • Move Zeroes
    • Longest Repeating Character Replacement
    • 3Sum With Multiplicity
    • Merge Sorted Array
    • 3Sum Smaller
    • Backspace String Compare
  • Mathematics
    • Ugly Number
    • Ugly Number II
    • Super Ugly Number
    • Sqrt(x)
    • Random Number 1 to 7 With Equal Probability
    • Pow(x, n)
    • Narcissistic Number
    • Rectangle Overlap
    • Happy Number
    • Add N Days to Given Date
    • Reverse Integer
    • Greatest Common Divisor or Highest Common Factor
  • Bit Operation
    • IP to CIDR
  • Random
    • Random Pick with Weight
    • Random Pick Index
    • Linked List Random Node
  • Dynamic Programming
    • House Robber
    • House Robber II
    • House Robber III
    • Longest Increasing Continuous Subsequence
    • Longest Increasing Continuous Subsequence II
    • Coins in a Line
    • Coins in a Line II
    • Coins in a Line III
    • Maximum Product Subarray
    • Longest Palindromic Substring
    • Stone Game
    • Burst Balloons
    • Perfect Squares
    • Triangle
    • Pascal's Triangle
    • Pascal's Triangle II
    • Min Cost Climbing Stairs
    • Climbing Stairs
    • Unique Paths
    • Unique Paths II
    • Minimum Path Sum
    • Word Break
    • Word Break II
    • Range Sum Query - Immutable
    • Decode Ways
    • Edit Distance
    • Unique Binary Search Trees
    • Unique Binary Search Trees II
    • Maximal Rectangle
    • Maximal Square
    • Regular Expression Matching
    • Wildcard Matching
    • Flip Game II
    • Longest Increasing Subsequence
    • Target Sum
    • Partition Equal Subset Sum
    • Coin Change
    • Jump Game
    • Can I Win
    • Maximum Sum Rectangle in a 2D Matrix
    • Cherry Pick
  • Knapsack
    • Backpack
    • Backpack II
    • Backpack III
    • Backpack IV
    • Backpack V
    • Backpack VI
    • Backpack VII
    • Coin Change
    • Coin Change II
  • High Frequency
    • 2 Sum Closest
    • 3 Sum
    • 3 Sum Closest
    • Sort Colors II
    • Majority Number
    • Majority Number II
    • Majority Number III
    • Best Time to Buy and Sell Stock
    • Best Time to Buy and Sell Stock II
    • Best Time to Buy and Sell Stock III
    • Best Time to Buy and Sell Stock IV
    • Two Sum
    • Two Sum II - Input array is sorted
    • Two Sum III - Data structure design
    • Two Sum IV - Input is a BST
    • 4 Sum
    • 4 Sum II
  • Sorting
  • Greedy
    • Jump Game II
    • Remove K Digits
  • Minimax
    • Nim Game
    • Can I Win
  • Sweep Line & Interval
    • Meeting Rooms
    • Meeting Rooms II
    • Merge Intervals
    • Insert Interval
    • Number of Airplanes in the Sky
    • Exam Room
    • Employee Free Time
    • Closest Pair of Points
    • My Calendar I
    • My Calendar II
    • My Calendar III
    • Add Bold Tag in String
  • Other Algorithms and Data Structure
    • Huffman Coding
    • Reservoir Sampling
    • Bloom Filter
    • External Sorting
    • Construct Quad Tree
  • Company Tag
    • Google
      • Guess the Word
      • Raindrop on Sidewalk
    • Airbnb
      • Display Pages (Pagination)
    • Amazon
  • Problem Solving Summary
    • String or Array Rotation
    • Tips for Avoiding Bugs
    • Substring or Subarray Search
    • Sliding Window
    • K Sums
    • Combination Sum Series
    • Knapsack Problems
    • Depth-first Search
    • Large Number Operation
    • Implementation - Simulation
    • Monotonic Stack & Queue
    • Top K Problems
    • Java Interview Tips
      • OOP in Java
      • Conversion in Java
      • Data Structures in Java
    • Algorithm Optimization Tips
  • Reference
Powered by GitBook
On this page
  • 外排序算法的基本步骤
  • 面试题
  • 求两个超大文件中 URLs 的交集

Was this helpful?

  1. Other Algorithms and Data Structure

External Sorting

PreviousBloom FilterNextConstruct Quad Tree

Last updated 5 years ago

Was this helpful?

Source: Jiuzhang's Tutorial:

所谓的外排序算法(External Sorting),是指在内存不够的情况下,如何对存储在一个或者多个大文件中的数据进行排序的算法。

参考资料:

外排序算法通常是解决一些大数据处理问题的第一个步骤,或者是面试官所会考察的算法基本功。

外排序算法的基本步骤

外排序算法分为两个基本步骤:

  1. 将大文件切分为若干个个小文件,并分别使用内存排好序

  2. 使用K路归并算法将若干个排好序的小文件合并到一个大文件中

第一步:文件拆分

根据内存的大小,尽可能多的分批次的将数据 Load 到内存中,并使用系统自带的内存排序函数(或者自己写个快速排序算法),将其排好序,并输出到一个个小文件中。比如一个文件有1T,内存有1G,那么我们就这个大文件中的内容按照 1G 的大小,分批次的导入内存,排序之后输出得到 1024 个 1G 的小文件。

第二步:K路归并算法

K路归并算法使用的是数据结构堆(Heap)来完成的,使用 Java 或者 C++ 的同学可以直接用语言自带的 PriorityQueue(C++中叫priority_queue)来代替。

我们将 K 个文件中的第一个元素加入到堆里,假设数据是从小到大排序的话,那么这个堆是一个最小堆(Min Heap)。每次从堆中选出最小的元素,输出到目标结果文件中,然后如果这个元素来自第 x 个文件,则从第 x 个文件中继续读入一个新的数进来放到堆里,并重复上述操作,直到所有元素都被输出到目标结果文件中。

Follow up: 一个个从文件中读入数据,一个个输出到目标文件中操作很慢,如何优化?

如果我们每个文件只读入1个元素并放入堆里的话,总共只用到了 1024 个元素,这很小,没有充分的利用好内存。另外,单个读入和单个输出的方式也不是磁盘的高效使用方式。因此我们可以为输入和输出都分别加入一个缓冲(Buffer)。假如一个元素有10个字节大小的话,1024 个元素一共 10K,1G的内存可以支持约 100K 组这样的数据,那么我们就为每个文件设置一个 100K 大小的 Buffer,每次需要从某个文件中读数据,都将这个 Buffer 装满。当然 Buffer 中的数据都用完的时候,再批量的从文件中读入。输出同理,设置一个 Buffer 来避免单个输出带来的效率缓慢。

面试题

求两个超大文件中 URLs 的交集

问题描述

给定A、B两个文件,各存放50亿个URLs,每个 URL 各占 64 字节,内存限制是 4G,让你找出A、B文件共同的 URLs?

问题分析

首先需要跟面试官澄清一个问题:

这两个文件各自是否已经没有重复?

通常面试官会先让你假设没有重复,然后再来看有重复的情况怎么处理。那我们就先来看没有重复的情况。

A,B各自没有重复 URLs

方法1:文件拆分 Sharding(也可以叫 Partitioning) 50亿,每个 URLs 64 字节,也就是每个文件 320G 的大小。很显然我们不能直接全部 Load 到内存中去处理。这种内存不够的问题,通常我们的解决方法都可以是使用 hash function 来将大文件拆分为若干个小文件。比如按照hashfunc(url) % 200进行拆分的话,可以拆分成为,200 个小文件 —— 也就是如果 hashfunc(url) % 200 = 1 就把这个 url 放到 1 号文件里。每个小文件理想状况下,大小约是 1.6 G,完全可以 Load 到内存里。

这种方法的好处在于,因为我们的目标是要去重,那么那些A和B中重复的 URLs,会被hashfunc(url) % 200映射到同一个文件中。这样在这个小文件中,来自 A 和 B 的 URls 在理想状况下一共 3.2G,可以全部导入内存进入重复判断筛选出有重复的 URLs。

Q: 刚才一直在说理想情况下,那么不理想情况下是什么样的?该怎么处理?

A: 不理想的情况下,如果 hashfunc(url) % 200 的结果比较集中,那么有可能会造成不同的 URLs 在同一个文件中扎堆的情况。这种情况下,有一些文件的大小可能会超过 4G。对于这种情况,处理的办法是进行二次拆分,把这些仍然比较大的小文件,用一个新的 hashfunc 进行拆分:hashfunc'(url) % X。这里再拆成多少个文件,可以根据文件的实际大小来定。如果二次拆分之后还是存在很大的文件,就进行三次拆分。直到每个小文件都小于 4G。

方法2:BloomFilter 我们可以使用一个 4G 的 Bloom Filter,它大概包含 320 亿 个 bit。把 A 文件的 50亿 个 URLs 丢入 BF 中,然后查询 B 文件的 每个 URL 是否在 BF 里。这种方法的缺点在于,320 亿个 bit 的 BF 里存 50 亿个 URLs 实在是太满了(要考虑到BF可能会用4个哈希函数),错误率会很高。因此仍然还需需要方法1中的文件拆分来分批处理。

方法3:外排序算法

将A,B文件分别拆分为80个小文件,每个小文件4G。每个文件在拆分的时候,每4G的数据在内存中做快速排序并将有序的URLs输出到小文件中。

用多路归并算法,将这160个小文件进行归并,在归并的过程中,即可知道哪些是重复的 URLs。只需将重复的 URLs 记录下来即可。

Follow up: A,B各自有重复的 URLs

当 A, B 各自有重复的 URLs 的时候,比如最坏情况下,A里的50亿个URLs 全部一样。B里也是。这样采用方法1这种比较容易想到的 Sharding 方法,是不奏效的,因为所有 URLs 的 hashcode 都一样,就算换不同的 hashfunc 也一样。这种情况下,需要先对两个文件进行单独的去重,方法是每 4G 的数据,放到内存中用简单的哈希表进行去重。这样,在最坏情况下,总共 320G 的数据里,一个 URLs 最多重复 8次,则不会出现太严重的扎堆情况了。算法上唯一需要稍微改动的地方是,由于 A 存在多个重复的 URLs,所以当和 B 的 URLs 被sharding 到同一个文件里的时候,需要标记一下这个 URLs 来自哪个文件,这样才能知道是否在A和B中同时出现过。

另外,使用外排序的方法,是无需对两个文件进行单独去重的步骤的。

https://www.jiuzhang.com/tutorial/big-data-interview-questions/271
CNBLOG
GeeksForGeeks
WIKI