哈希表

1. Introduction

Hash Table is a data structure to map key to values (also called Table or Map Abstract Data Type/ADT). It uses a hash function to map large or even non-integer keys into a small range of integer indices (typically [0..hash_table_size-1]).

The probability of two distinct keys colliding into the same index is relatively high and each of this potential collision needs to be resolved to maintain data integrity.

There are several collision resolution strategies that will be highlighted in this visualization: Open Addressing (Linear Probing, Quadratic Probing, and Double Hashing) and Closed Addressing (Separate Chaining). Try clicking Search(7) for a sample animation of searching a specific value 7 in a randomly created Hash Table using Separate Chaining technique (duplicates are allowed).

2. 动机

Hashing is an algorithm (via a hash function) that maps large data sets of variable length, called keys, not necessarily integers, into smaller integer data sets of a fixed length.

A Hash Table is a data structure that uses a hash function to efficiently map keys to values (Table or Map ADT), for efficient search/retrieval, insertion, and/or removals.

Hash Table is widely used in many kinds of computer software, particularly for associative arrays, database indexing, caches, and sets.

In this e-Lecture, we will digress to Table ADT, the basic ideas of Hashing, the discussion of Hash Functions before going into the details of Hash Table data structure itself.

2-1. 表ADT

表ADT必须至少支持以下三种操作，并且尽可能高效：

搜索（v） - 确定v是否存在于ADT中，
插入（v） - 将v插入ADT，
删除（v） - 从ADT中删除v。

哈希表是这个表ADT的一个可能的好实现（另一个是这个）。

PS1：对于Table ADT的两个较弱的实现，您可以单击相应的链接：未排序数组或排序数组来阅读详细讨论。

PS2：在现场课程中，您可能想要比较Table ADT和List ADT的要求。

2-2. 直接寻址表

When the range of the integer keys is small, e.g., [0..M-1], we can use an initially empty (Boolean) array A of size M and implement the following Table ADT operations directly:

Search(v): Check if A[v] is true (filled) or false (empty),
Insert(v): Set A[v] to be true (filled),
Remove(v): Set A[v] to be false (empty).

That's it, we use the small integer key itself to determine the address in array A, hence the name Direct Addressing. It is clear that all three major Table ADT operations are O(1).

PS: This idea is also used elsewhere, e.g., in Counting Sort.

2-3. DAT的例子

In Singapore (as of Mar 2026), bus routes are numbered from [2..993].

Not all integers between [2..993] are currently used, e.g., there is no bus route 989 — Search(989) should return false. A new bus route x may be introduced, i.e., Insert(x) or an existing bus route y may be discontinued, i.e., Remove(y).

As the range of possible bus routes is small, to record the data whether a bus route number exists or not, we can use a DAT of Boolean array of size 1000 (generally, it is useful to give a few extra buffer cells on top of the current largest bus number of 993).

PS: The closest prime number to 1000 is 997 - but using prime table size is not important for DAT.

2-4. 带数据的DAT示例

请注意，我们总是可以添加卫星数据，而不仅仅是使用布尔数组来记录键的存在。

例如，我们可以使用关联字符串数组A来将公交路线号映射到其运营商名称，例如，

A[2] = "Go-Ahead Singapore",
A[10] = "SBS Transit",
A[183] = "Tower Transit Singapore",
A[188] = "SMRT Buses", 等等。

讨论：你能想到其他一些现实生活中的DAT例子吗？

2-5. 答案

[This is a hidden slide]

2-6. DAT的局限性

The keys must be (or can be easily mapped to) non-negative integer values. Note that basic DAT has problem in the full version of the example in the previous few slides as there are actually variations of bus route numbers in Singapore, e.g., 96B, 151A, NR10, etc.

The range of keys must be small.
The memory usage will be (insanely) large if we have (insanely) large range.

The keys must be dense, i.e., not many gaps in the key values.
DAT will contain too many empty (and wasted) cells otherwise.

We will overcome these restrictions with hashing.

3. 散列：想法

Using hashing, we can:

Map (some) non-integer keys (e.g., strings) to integers keys,
Map large integers to smaller integers.

3-1. 电话号码示例

例如，我们有 N = 400个新加坡电话号码（新加坡电话号码有8位数字，所以新加坡最多有10^8 = 1亿个可能的电话号码）。

我们可以使用以下简单的哈希函数 h(v) = v%997，而不是使用DAT并使用一个最大为 M = 1亿的巨大数组。

这样，我们将8位电话号码 6675 2378 和 6874 4483 映射到最多3位的 h(6675 2378) = 237 和 h(6874 4483) = 336。因此，我们只需要准备一个大小为 M = 997（997是一个质数）的数组，而不是 M = 1亿的数组。

3-2. 哈希表预览

With hashing, we can now implement the following Table ADT operations using integer array (instead of Boolean array) as follows:

Search(v): Check if A[h(v)] != -1 (we use -1 for an empty cell assuming v ≥ 0),
Insert(v): Set A[h(v)] = v (we hash v into h(v) so we need to somehow record key v),
Remove(v): Set A[h(v)] = -1 — to be elaborated further.

3-3. 带数据的哈希表

If we have keys that map to satellite data and we want to record the original keys too, we can implement the Hash Table using pair of (integer, satellite-data-type) array as follows:

Search(v): Return A[h(v)], which is a pair (v, satellite-data), possibly empty,
Insert(v, satellite-data): Set A[h(v)] = pair (v, satellite-data),
Remove(v): Set A[h(v)] = (empty pair) — to be elaborated further.

However, by now you should notice that something is incomplete...

3-4. 冲突

A hash function may, and quite likely, map different keys (integer or not) into the same integer slot, i.e., a many-to-one mapping instead of one-to-one mapping.

For example, h(6675 2378) = 237 from three slides earlier and if we want to insert another phone number 6675 4372, we will have a problem as h(6675 4372) = 237 too.

This situation is called a collision, i.e., two (or more) keys have the same hash value.

3-5. 冲突的概率

生日（冯米塞斯）悖论问：'在一个大小为365个座位（单元）的房间（哈希表）中，必须有多少人（键的数量），才能有人同一天生日（碰撞，两个键被哈希到同一个单元）的概率变得>50%（即比不发生更有可能）？

Reveal这个答案对我们中的一些人来说可能令人惊讶。
我们来做一些计算吧。

3-6. 计算

Let Q(n) be the probability of unique birthday for n people in a room.
Q(n) = 365/365 × 364/365 × 363/365 × ... × (365-n+1)/365,
i.e., the first person's birthday can be any of the 365 days, the second person's birthday can be any of the 365 days except the first person's birthday, and so on.

Let P(n) be the probability of same birthday (collision) for n people in a room.
P(n) = 1-Q(n).

We compute that P(23) = 0.507 > 0.5 (50%).

Thus, we only need 23 people (a small amount of keys) in the room (Hash Table) of size 365 seats (cells) for a (more than) 50% chance collision to happen (the birthday of two different people in that room is one of 365 days/slots).

A good heuristic: we only need sqrt(n) keys in a hash Table of n cells to have more than 50% chance of collision [derivation omitted, but it is based on this birthday paradox].

3-7. 两个重要的问题

Issue 1: We have seen a simple hash function like the h(v) = v%997 used in Phone Numbers example that maps large range of integer keys into a smaller range of integer keys, but how about non integer keys? How to do such hashing efficiently?

Issue 2: We have seen that by hashing, or mapping, large range into smaller range, there will very likely be a collision. How to deal with them?

4. 哈希函数

如何用这些理想的属性创建一个好的散列函数？

快速计算，即在O（1）中，
尽可能使用最小插槽/散列表的大小M，
尽可能均匀地将键分散到不同的基地址∈[0..M-1]，
尽可能减少碰撞。

4-1. 初步措施

Suppose we have a hash table of size M where keys are used to identify the satellite-data and a specific hash function is used to compute a hash value.

A hash value/hash code of key v is computed from the key v with the use of a hash function to get an integer in the range 0 to M-1. This hash value is used as the base/home index/address of the Hash Table entry for the satellite-data.

4-2. 不好的哈希函数的例子

使用电话号码示例，如果我们定义h(v) = floor(v/1 000 000)，
即，我们选择电话号码的前两位数字。

h(66 75 2378) = 66
h(68 74 4483) = 68

讨论：当你使用这个哈希函数时会发生什么？提示：参见这个。

4-3. 答案

[This is a hidden slide]

4-4. 完美的哈希函数

在讨论现实之前，让我们讨论理想的情况：完美的散列函数。
完美的散列函数是键和散列值之间的一对一映射，即根本不存在冲突。如果事先知道所有的键是可能的。例如，编译器/解释器搜索保留关键字。但是，这种情况很少见。
当表格大小与提供的关键字数量相同时，实现最小的完美散列函数。这种情况更为罕见。
如果你有兴趣，你可以探索GNU gperf，这是一个用C++编写的免费可用的完美哈希函数生成器，可以从用户提供的关键字列表中自动构建完美的函数（C++程序）。

4-5. 散列整数 - 最佳实践

People has tried various ways to hash a large range of integers into a smaller range of integers as uniformly as possible. In this e-Lecture, we jump directly to one of the best and most popular version: h(v) = v%M, i.e., map v into Hash Table of size M slots. The (%) is a modulo operator that gives the remainder after division. This is clearly fast, i.e., O(1) assuming that v does not exceed the natural integer data type limit.

The Hash Table size M is set to be a reasonably large prime not near a power of 2, about 2+ times larger than the expected number of keys N that will ever be used in the Hash Table. This way, the load factor α = N/M < 0.5 — we shall see later that having low load factor, thereby sacrificing empty spaces, help improving Hash Table performance.

Discuss: What if we set M to be a power of 10 (decimal) or power of 2 (binary)?

4-6. 答案

[This is a hidden slide]

4-7. 散列串 - 最佳实践

People has also tried various ways to hash Strings into a small range of integers as uniformly as possible. In this e-Lecture, we jump directly to one of the best and most popular version, shown below:

int hash_function(string v) { // note 1: v uses ['A'..'Z'] only
  int sum = 0;                // note 2: v is a short string
  for (auto& c : v) // for each character c in v
    sum = ((sum*26)%M + (c-'A'+1))%M; // M is table size
  return sum;
}

Interactive (M = ∞), i.e., the modulo operation has no effect
v = , hash_string(v) = 0.

Discussion: In real life class, discuss the components of the hash function above, e.g., why loop through all characters?, will that be slower than O(1)?, why multiply with 26?, what if the string v uses more than just UPPERCASE chars?, etc.

4-8. 答案

[This is a hidden slide]

5. 冲突解决方案

There are two major ideas: Closed Addressing versus Open Addressing method.

In Closed Addressing, the Hash Table looks like an Adjacency List (a graph data structure). The hash code of a key gives its fixed/closed base address. Collision is resolved by appending the collided keys inside an auxiliary data structure (usually any form of List ADT) identified by the base address.

In Open Addressing, all hashed keys are located in a single array. The hash code of a key gives its base address. Collision is resolved by checking/probing multiple alternative addresses (hence the name open) in the table based on a certain rule.

5-1. 分离链接法（SC）

Separate Chaining (SC) collision resolution technique is simple. We use M copies of auxiliary data structures, usually Doubly Linked Lists (PS: simpler auxiliary data structures exist). If two keys a and b both have the same hash value i, both will be appended to the (front/back) of Doubly Linked List i (in this visualization, we append to the back in O(1) with help of tail pointer). That's it, where the keys will be slotted in is completely dependent on the hash function itself, hence we also call Separate Chaining as Closed Addressing collision resolution technique.

If we use Separate Chaining, the load factor α = N/M is the average length of the M lists (unlike in Open Addressing, α can be "slightly over 1.0") and it will determine the performance of Search(v) as we may have to explore α elements on average. As Remove(v) also requires Search(v), its performance is similar as Search(v). Insert(v) is clearly O(1).

If we can bound α to be a small constant (true if we know the expected largest N in our Hash Table application so that we can set up M accordingly), then all Search(v), Insert(v), and Remove(v) operations using Separate Chaining will be O(1).

5-2. 开放寻址法（OA）

在这个可视化中，我们讨论了三种开放寻址 (OA) 冲突解决技术：线性探测 (LP)，二次探测 (QP) 和双重哈希 (DH)。

要在这三种模式之间切换，请点击相应的标题。

设：
M = HT.length = 当前哈希表的大小，
base = (key%HT.length)，
step = 当前的探测步骤，
secondary = smaller_prime - key%smaller_prime (为了避免零 —— 很快会详细说明)

我们很快就会看到，这三种模式的探测序列是：
线性探测：i=(base+step*1) % M，
二次探测：i=(base+step*step) % M，和
双重哈希：i=(base+step*secondary) % M。

所有三种 OA 技术都要求负载因子 α = N/M < 1.0 (否则无法进行更多的插入)。如果我们可以将 α 限制在一个小的常数 (如果我们知道我们的哈希表应用中预期的最大 N，以便我们可以相应地设置 M，最好 < 0.5 对于大多数 OA 变体)，那么所有使用开放寻址的 Search(v)，Insert(v) 和 Remove(v) 操作都将是 O(1) —— 详细信息省略。

6. 可视化

View the visualization of Hash Table above.

In this visualization, we allow the insertion of duplicate keys (i.e., a multiset). Since a multiset is more general than a set, simply just insert distinct integers in this visualization if you want to see how Hash Table works on distict integer keys only.

Due to limited screen space, we switch from default (1.0x) scale to 0.5x scale whenever you want to visualize Hash Table size M ∈ [46..90] for OA techniques. The limit is a bit lower, i.e., M ∈ [20..31] for SC technique.

The Hash Table is visualized horizontally like an array where index 0 is placed at the leftmost of the first row and index M-1 is placed at the rightmost of the last row but the details are different when we are visualizing Separate Chaining (only the top row) versus Open Addressing (usually spans multiple rows) collision resolution techniques.

6-1. 分离链接版本

对于分离链接 (SC) 冲突解决技术，第一行包含 M 个 "H" (头) 指针，这些指针指向 M 个双向链表。

然后，每个双向链表 i 包含所有被哈希到 i 的键，顺序是任意的（在0.5x的比例下，顶点标签显示在较小的黑点上方）。从数学上讲，所有可以表示为 i (mod M) 的键 — 包括所有 i 的重复项 — 都被哈希到 DLL i。再次强调，我们在这个可视化中不存储任何卫星数据。

6-2. 开址法版本

在这个可视化中，我们讨论了三种开放寻址冲突解决技术：线性探测 (LP)，二次探测 (QP) 和双重哈希 (DH)。

对于所有三种技术，每个哈希表单元格都显示为一个顶点，单元格值 [0..99] 显示为顶点标签（在 0.5x 缩放中，顶点标签显示在较小的黑点上方）。在不失一般性的情况下，我们在这个可视化中不显示任何卫星数据，因为我们只关注键的排列。我们保留值 -1 来表示一个 '空单元格'（可视化为一个空白顶点）和 -2 来表示一个 '已删除单元格'（可视化为一个带有缩写标签 "DEL" 的顶点）。范围从 [0..M-1] 的单元格索引显示为每个顶点下方的红色标签（在 1.0x 缩放中为 15 行索引，或在 0.5x 缩放中为 25 行索引）。

7. 分离链接（SC）

Try Insert([9,16,23,30,37,44]) to see how Insert(v) operation works if we use Separate Chaining as collision resolution technique. On such random insertions, the performance is good and each insertion is clearly O(1).

However if we try Insert([68,90]), notice that all integers {68,90} are 2 (modulo 11) so all of them will be appended into the (back of) Doubly Linked List 2. We will have a long chain in that list. Note that due to the screen limitation, we limit the length of each Doubly Linked List to be at maximum 6.

7-1. Search(35) 和 Remove(35)

尝试 Search(35)，可以看到 Search(v) 的运行时间可以达到 O(1+α)。

尝试 Remove(35)，可以看到 Remove(v) 的运行时间也可以达到 O(1+α)。

如果 α 很大，分离链接的性能实际上并不是 O(1)。然而，如果我们大致知道我们的应用程序可能会使用的最大键值数量 N，那么我们可以相应地设置表大小 M，使得 α = N/M 是一个非常低的正（浮点）数，从而使分离链接的性能预期为 O(1)。

8. 线性探测

在线性探测冲突解决方法中，每当发生冲突时，我们会一次扫描一个索引，以查找下一个空/被删除的插槽（当我们到达最后一个插槽时再重新环绕）。
例如，我们假设我们以表格大小M = HT.length = 7的空的散列表 HT开始，如上所示，使用索引0到 M-1 = 7-1 = 6。请注意，7是素数。（主）散列函数很简单，h（v）= v％M。
此演练将向您展示使用线性探测作为冲突解决技术时Insert（v），Search（v）和Remove（v）操作所采取的步骤。

8-1. 插入（[18，14，21])

Now click Insert([18,14,21]) — three individual insertions in one command.

Recap (to be shown after you click the button above).

Formally, we describe Linear Probing index i as i = (base+step*1) % M where base is the (primary) hash value of key v, i.e., h(v) and step is the Linear Probing step starting from 1.

Tips: To do a quick mental calculation of a (small) integer V modulo M, we simply subtract V with the largest multiple of M ≤ V, e.g., 18%7 = 18-14 = 4, as 14 is the largest multiple of 7 that is ≤ 18.

8-2. 插入([1,35])

现在点击Insert([1,35])（在上一张幻灯片中插入的前三个值的顶部）。
回顾（在点击上面的按钮后显示）

8-3. 搜索（35）和搜索（8）

现在我们说明使用线性探测作为碰撞解析技术的Search（v）操作。所采取的步骤与 Insert（v）操作非常相似，即我们从（主）散列键值开始，检查是否已找到v，否则我们一次向前移动一个索引（如果遇到结尾则从开头重新开始）并重新检查。当我们遇到一个空单元时，我们就停止，这意味着v完全不在哈希表中（因为之前的Insert（v）操作会将v放在那里）。

现在点击Search(35) - 你应该看到探测序列[0,1,2,3（找到键35）]。

现在单击Search(7) - [1,2,3,4,5（空单元格，因此在哈希表中找不到键8）]。

8-4. 删除（v） - 初步措施

现在我们来讨论移除（v）操作。
如果我们直接设置HT [i] = EMPTY单元，其中i是包含v的索引（在必要时进行线性探测之后），您是否意识到我们会导致问题？为什么？

提示：查看以前的三张幻灯片，了解插入（v）和搜索（v）的行为。

8-5. 答案

[This is a hidden slide]

8-6. 删除（21）

现在让我们看看完整的删除（v）。如果我们在索引 i 处找到 v（必要时在线性探测之后），我们必须设置 HT [i] = DELETED（在此可视化中缩写为DEL），其中DEL是一个特殊符号（通常你只能在你的应用程序种没有使用过的使用一个特殊符号）指示该单元格可以在将来的搜索（v）需要时绕过，但可以被将来的插入（w）覆盖。这种策略被称为懒惰删除。
现在点击Remove(21) - [0,1（找到关键21，我们设置H [1] = DEL）]。
然后请继续讨论下一张幻灯片。

8-7. 再次搜索（35）

现在点击Search(35) - [0,1（绕过那个DELETED cell），2,3（找到键35）]。
想象一下，如果我们错误地设置H [1] = EMPTY，会发生什么。

8-8. 插入（28） - 覆盖DEL

现在单击Insert(28) - 您应该看到探测序列[0,1（使用DEL符号找到一个单元格）]，因此，实际上可以用新值覆盖，而不会影响将来搜索（v）的正确性。因此，我们把28放在索引1中。

8-9. 初级聚类，第一部分

尽管我们可以用线性探测来解决冲突，但这并不是最有效的方法。

我们定义一个簇为一组连续的占用槽。覆盖键的基地址的簇被称为键的主簇。

现在注意到线性探测可以创建大的主簇，这将使Search(v)/Insert(v)/Remove(v)操作的运行时间超过宣传的O(1)。

看看上面的一个例子，其中M = 31，我们已经插入了15个键[0..14]，使它们占用单元格[0..14] (α = 15/31 < 0.5)。现在看看 Insert(31) (第16个键)有多么"慢"。

8-10. 线性探测顺序

The probe sequence of Linear Probing can be formally described as follows:

 h(v) // base address
(h(v) + 1*1) % M // 1st probe if there is a collision
(h(v) + 2*1) % M // 2nd probe if there is still a collision
(h(v) + 3*1) % M // 3rd probe if there is still a collision
...
(h(v) + k*1) % M // k-th probe, etc...

During Insert(v), if there is a collision but there is an empty (or DEL) slot remains in the Hash Table, we are sure to find it after at most M Linear Probing steps, i.e., in O(M). And when we do, the collision will be resolved, but the primary cluster of the key v is expanded as a result and future Hash Table operations will get slower too. Try the slow Search(31) on the same Hash Table as in the previous slide but with many DEL markers (suppose {4, 5, 8, 9, 10, 12, 14} have just been deleted).

8-11. 初级聚类，第二部分

在上一张幻灯片中（主要聚类，第1部分），我们打破了哈希函数应该将键均匀分布在 [0..M-1] 周围的假设。在下一个例子中，我们将展示即使哈希函数将键分布到 [0..M-1] 周围的几个相对较短的主要聚类中，主要聚类的问题仍然可能发生。

在屏幕上，看到 M = 31，插入了15个在 [0..99] 之间的随机整数（有几个随机但短的主要聚类）。如果我们接着插入这4个键 {2, 9, 12, 1}，前三个键将“插入”三个空单元格，并意外地合并那些相邻的（但以前是分离的）聚类成为一个（非常）长的主要聚类。因此，下一个插入键1的操作将在这个长的主要聚类的开始处进行，最终需要进行几乎 O(M) 的探测步骤才能找到一个空单元格。尝试 Insert([2,9,12,1])。

9. 二次探测（QP）

To reduce primary clustering, we can modify the probe sequence to:

 h(v) // base address
(h(v) + 1*1) % M // 1st probe if there is a collision
(h(v) + 2*2) % M // 2nd probe if there is still a collision
(h(v) + 3*3) % M // 3rd probe if there is still a collision
...
(h(v) + k*k) % M // k-th probe, etc...

That's it, the probe jumps quadratically, wrapping around the Hash Table as necessary.

A very common mistake as this is a different kind of Quadratic Probing:
Doing h(v), (h(v)+1) % M, (h(v)+1+4) % M, (h(v)+1+4+9) % M, ...

9-1. 插入(38)

假设我们已经调用 Insert（18）和Insert（10）到大小为M = HT.length = 7的初始空Hash表中。由于18％7 = 4和10％7 = 3,18和3不会相互碰撞，如上所示分别位于指数4和3中。
现在，我们点击Insert(38)。
回顾（点击上面的按钮后显示）。

9-2. 移除（18），再次搜索（38）

删除（x）和搜索（y）操作的定义类似。只是这次我们使用二次探测而不是线性探测。
例如，假设我们在上一张幻灯片之后调用了Remove（18），并且标记了HT [4] = DEL。如果我们调用Search(38)，我们将使用与上一张幻灯片相同的二次探测序列，但是会通过标记为DELETED的HT [4]。

9-3. 比线性探测更好？

一目了然，二次探测（Quadratic Probing）能够快速解决我们之前对线性探测的主要聚类问题，但它是完美的碰撞解决技术吗？
尝试Insert([12,17])。
你意识到刚刚发生了什么吗？

9-4. 细节

我们可以很容易地插入12，因为h（12）= 12％7 = 5 以前是空的（见上文）。
但是，即使我们仍然有3个空槽，我们仍然存在主要问题：h（17）= 17％7 = 3已被键10占用，（3 + 1 * 1）％7 = 4已被占用（3 + 2 * 2）％7 = 0已被键38占用，（3 + 3 * 3）％7 = 5已被键12占用，（3 + 4 * 4）％7 = 5 （3 + 5 * 5）％7 = 0再次被键38占据，（3 + 6 * 6）％7 = 4又被键18占据，（3 + 7 * 7）％7 = 3再次被键10占用，如果我们继续二次探测，它将永远循环...
尽管我们仍然有几个（3）空单元格，但我们无法将此新值17插入到哈希表中。

9-5. 定理

如果 α < 0.5 并且 M 是一个质数（> 3），那么我们总是可以使用（这种形式的）二次探测找到一个空的插槽。回忆：α 是负载因子，M 是哈希表的大小（HT.length）。

如果满足上述两个要求，我们可以证明，包括基地址 h(v) 在内的前 M/2 个二次探测索引都是不同且唯一的。

但是超过这个范围就不能保证了。因此，如果我们想使用二次探测，我们需要确保 α < 0.5（在这个可视化中没有强制执行，但我们在 M 步后会中断循环以防止无限循环）。

9-6. 证明

We will use proof by contradiction. We first assume that two Quadratic Probing steps:
x and y, x != y (let's say x < y), can yield the same address modulo M.

h(v) + x*x = h(v) + y*y (mod M)
x*x = y*y (mod M) // strike out h(v) from both sides
x*x - y*y = 0 (mod M) // move y*y to LHS
(x-y)*(x+y) = 0 (mod M) // rearrange the formula

Now, either (x-y) or (x+y) has to be equal to zero.
As our assumption says x != y, then (x-y) cannot be 0.
As 0 ≤ x < y ≤ (M/2) and M is a prime > 3 (an odd integer),
then (x+y) also cannot be 0 modulo M.

Contradiction!

So the first M/2 Quadratic Probing steps cannot yield the same address modulo M
(if we set M to be a prime number greater than 3).

Discussion: Can we make Quadratic Probing able to use the other ~50% of the table cells?

9-7. 更好的二次探测

[This is a hidden slide]

9-8. 二次聚类

在二次探测中，集群是沿着探测路径形成的，而不是像线性探测那样围绕基地址形成的。这些集群被称为次级集群，与困扰线性探测的主要集群相比，它们“不那么明显”。

次级集群是由于碰撞键使用相同的探测模式而形成的，即，如果两个不同的键有相同的基地址，那么它们的二次探测序列将会是相同的。

为了说明这一点，看看屏幕上M = 31。我们只用10个键填充了这个哈希表（所以负载因子 α = 10/31 ≤ 0.5），哈希表看起来“稀疏足够”（没有明显的大的主要集群）。然而，如果我们接着插入 Insert(62,93)，尽管有很多（31-10 = 21）空单元格和62 != 93（最终散列到索引0的不同键），我们最终在这个“不那么明显”的次要聚集中进行了10次探测步骤（注意，{62, 93}都遵循类似的二次探测序列）。

二次探测中的次级集群并不像线性探测中的主要集群那么糟糕，因为一个好的哈希函数理论上应该将键分散到不同的基地址∈ [0..M-1]中。

10. 双倍散列（DH）

To reduce primary and secondary clustering, we can modify the probe sequence to:

 h(v) // base address
(h(v) + 1*h2(v)) % M // 1st probe if there is a collision
(h(v) + 2*h2(v)) % M // 2nd probe if there is still a collision
(h(v) + 3*h2(v)) % M // 3rd probe if there is still a collision
...
(h(v) + k*h2(v)) % M // k-th probe, etc...

That's it, the probe jumps according to the value of the second hash function h2(v), wrapping around the Hash Table as necessary.

10-1. 次要散列函数h2（v）

If h2(v) = 1, then Double Hashing works exactly the same as Linear Probing.
So we generally wants h2(v) > 1 to avoid primary clustering.

If h2(v) = 0, then Double Hashing does not work for an obvious reason as any probing step multiplied by 0 remains 0, i.e. we stay at the base address forever during a collision. We need to avoid this.

Usually (for integer keys), h2(v) = M' - v%M' where M' is a smaller prime than M.
This makes h2(v) ∈ [1..M'], which is diverse enough to avoid secondary clustering.

The usage of the secondary hash function makes it theoretically hard to have either primary or secondary clustering issue.

10-2. 插入（[35,42])

点击Insert([35,42])插入35，然后插入42到上面的当前哈希表。
回顾（点击上面的按钮后显示）。

10-3. 移除（17），再次搜索（35）

删除（x）和搜索（y）操作的定义类似。只是这次我们使用双散列（Double Hashing）而不是线性探测或二次探测。
例如，假设我们在上一张幻灯片之后调用了Remove（17），并且标记了HT [3] = DEL。如果我们调用Search(35)，我们将使用与前一张幻灯片相同的双哈希序列，但是会通过标记为DELETED的HT [3]。

10-4. 好的OA冲突解决方案

In summary, a good Open Addressing collision resolution technique needs to:

Always find an empty slot if it exists,
Minimize clustering (of any kind),
Give different probe sequences when 2 different keys collide,
Fast, O(1).

Now, let's see the same test case that plagues Quadratic Probing earlier. Now try Insert(62,93) again. Although h(62) = h(93) = 0 and their collide with 31 that already occupy index 0, their probing steps are not the same: h2(62) = 29-62%29 = 25 is not the same as h2(93) = 29-93%29 = 23.

Discussion: Double Hashing seems to fit the bill. But... Is Double Hashing strategy flexible enough to be used as the default library implementation of a Hash Table? Or in a more general sense, which of the two collision resolution technique (Separate Chaining versus Open Addressing: Double Hashing) is the better one?

10-5. The (Current) Answer

[This is a hidden slide]

11. 额外的

您已经完成了这个哈希表数据结构的基本工作，我们鼓励您在探索模式中进一步探索。

但是，我们仍然会为您提供一些本节中概述的更有趣的哈希表的挑战。

11-1. （重新散列）Rehash

当负载因子α变高时，哈希表的性能会降低。对于（标准）二次探测冲突解决方法，当哈希表的α> 0.5时，插入可能失败。
如果发生这种情况，我们可以重新散列（rehash）。我们用一个新的散列函数构建另一个大约两倍的散列表。我们遍历原始哈希表中的所有键，重新计算新的哈希值，然后将键（及其卫星数据）重新插入新的更大的哈希表中，最后删除较早的较小哈希表。
一个经验法则是，如果使用开放寻址，并且当α>小的常数（根据需要接近1.0），如果使用单独链接，当α≥0.5时重新散列。
如果我们知道可能的最大键数，我们可以始终将α作为一个低的数字。

11-2. 哈希表的实现

However, if you ever need to implement a Hash Table in C++, Python, or Java, and your keys are either integers or strings, you can use the built-in C++ STL, Python standard library, or Java API, respectively. They already have good built-in implementation of default hash functions for integers or strings.

See C++ STL std::unordered_map, std::unordered_set, Python dict, set, or Java HashMap, HashSet.

For C++, note that the std::multimap/std::multiset implementations are also exist where duplicate keys are allowed.

For OCaml, we can use Hashtbl.

However, here is our take of a simple Separate Chaining implementation: HashTableDemo.cpp | py | java.

11-3. 数据结构组合？

[This is a hidden slide]

11-4. 对于表ADT的备选数据结构

Hash Table is an extremely good data structure to implement Table ADT if the (integer or string) keys only need to be mapped to satellite-data, with O(1) performance for Search(v), Insert(v), and Remove(v) operations if the Hash Table is set up properly.

However, if we need to do much more with the keys, we may need to use an alternative data structure.

11-5. 在线小测试

关于这个数据结构的一些更有趣的问题，请在哈希表培训模块上练习（不需要登录，但是只需短暂且中等难度设置）。
但是，对于注册用户，您应该登录，然后转到主要培训页面以正式清除此模块，这些成果将记录在您的用户帐户中。

11-6. 在线评判练习

尝试解决一些基本的编程问题，就很可能需要使用哈希表（特别是如果输入大小很大的时候）：

Kattis - cd（输入已经被排序，因此存在替代的非哈希表解决方案; 如果未对输入进行排序，则在哈希表的帮助下最好解决此组交集问题），
Kattis - oddmanout（我们可以将较大的邀请代码映射到较小范围的整数;这是整数散列（大范围）的一种做法），
Kattis - whatdoesthefoxsay（我们把不是狐狸的声音放入无序集合;这是散列字符串的做法）。

哈希表

1. Introduction

2. 动机

2-1. 表ADT

2-2. 直接寻址表

2-3. DAT的例子

2-4. 带数据的DAT示例

2-5. 答案

2-6. DAT的局限性

3. 散列：想法

3-1. 电话号码示例

3-2. 哈希表预览

3-3. 带数据的哈希表

3-4. 冲突

3-5. 冲突的概率

3-6. 计算

3-7. 两个重要的问题

4. 哈希函数

4-1. 初步措施

4-2. 不好的哈希函数的例子

4-3. 答案

4-4. 完美的哈希函数

4-5. 散列整数 - 最佳实践

4-6. 答案

4-7. 散列串 - 最佳实践

4-8. 答案

5. 冲突解决方案

5-1. 分离链接法（SC）

5-2. 开放寻址法（OA）

6. 可视化

6-1. 分离链接版本

6-2. 开址法版本

7. 分离链接（SC）

7-1. Search(35) 和 Remove(35)

8. 线性探测

8-1. 插入（[18，14，21])

8-2. 插入([1,35])

8-3. 搜索（35）和搜索（8）

8-4. 删除（v） - 初步措施

8-5. 答案

8-6. 删除（21）

8-7. 再次搜索（35）

8-8. 插入（28） - 覆盖DEL

8-9. 初级聚类，第一部分

8-10. 线性探测顺序

8-11. 初级聚类，第二部分

9. 二次探测（QP）

9-1. 插入(38)

9-2. 移除（18）， 再次搜索（38）

9-3. 比线性探测更好？

9-4. 细节

9-5. 定理

9-6. 证明

9-7. 更好的二次探测

9-8. 二次聚类

10. 双倍散列（DH）

10-1. 次要散列函数h2（v）

10-2. 插入（[35,42])

10-3. 移除（17）， 再次搜索（35）

10-4. 好的OA冲突解决方案

10-5. The (Current) Answer

11. 额外的

11-1. （重新散列）Rehash

11-2. 哈希表的实现

11-3. 数据结构组合？

11-4. 对于表ADT的 备选数据结构

11-5. 在线小测试

11-6. 在线评判练习

9-2. 移除（18），再次搜索（38）

10-3. 移除（17），再次搜索（35）

11-4. 对于表ADT的备选数据结构