>

>
1x
go to beginning previous frame pause play next frame go to end

Hash Table is a data structure to map key to values (also called Table or Map Abstract Data Type/ADT). It uses a hash function to map large or even non-integer keys into a small range of integer indices (typically [0..hash_table_size-1]).


The probability of two distinct keys colliding into the same index is relatively high and each of this potential collision needs to be resolved to maintain data integrity.


There are several collision resolution strategies that will be highlighted in this visualization: Open Addressing (Linear Probing, Quadratic Probing, and Double Hashing) and Closed Addressing (Separate Chaining). Try clicking Search(7) for a sample animation of searching a specific value 7 in a randomly created Hash Table using Separate Chaining technique (duplicates are allowed).


Remarks: By default, we show e-Lecture Mode for first time (or non logged-in) visitor.
If you are an NUS student and a repeat visitor, please login.

🕑

Hashing is an algorithm (via a hash function) that maps large data sets of variable length, called keys, not necessarily integers, into smaller integer data sets of a fixed length.


A Hash Table is a data structure that uses a hash function to efficiently map keys to values (Table or Map ADT), for efficient search/retrieval, insertion, and/or removals.


Hash Table is widely used in many kinds of computer software, particularly for associative arrays, database indexing, caches, and sets.


In this e-Lecture, we will digress to Table ADT, the basic ideas of Hashing, the discussion of Hash Functions before going into the details of Hash Table data structure itself.


Pro-tip 1: Since you are not logged-in, you may be a first time visitor (or not an NUS student) who are not aware of the following keyboard shortcuts to navigate this e-Lecture mode: [PageDown]/[PageUp] to go to the next/previous slide, respectively, (and if the drop-down box is highlighted, you can also use [→ or ↓/← or ↑] to do the same),and [Esc] to toggle between this e-Lecture mode and exploration mode.

🕑

表ADT必须至少支持以下三种操作,并且尽可能高效:

  1. 搜索(v) - 确定v是否存在于ADT中,
  2. 插入(v) - 将v插入ADT,
  3. 删除(v) - 从ADT中删除v。

哈希表是这个表ADT的一个可能的好实现(另一个是这个)。


PS1:对于Table ADT的两个较弱的实现,您可以单击相应的链接:未排序数组排序数组来阅读详细讨论。


PS2:在现场课程中,您可能想要比较Table ADT和List ADT的要求。


Pro-tip 2: We designed this visualization and this e-Lecture mode to look good on 1366x768 resolution or larger (typical modern laptop resolution in 2021). We recommend using Google Chrome to access VisuAlgo. Go to full screen mode (F11) to enjoy this setup. However, you can use zoom-in (Ctrl +) or zoom-out (Ctrl -) to calibrate this.

🕑

When the range of the integer keys is small, e.g., [0..M-1], we can use an initially empty (Boolean) array A of size M and implement the following Table ADT operations directly:

  1. Search(v): Check if A[v] is true (filled) or false (empty),
  2. Insert(v): Set A[v] to be true (filled),
  3. Remove(v): Set A[v] to be false (empty).

That's it, we use the small integer key itself to determine the address in array A, hence the name Direct Addressing. It is clear that all three major Table ADT operations are O(1).


PS: This idea is also used elsewhere, e.g., in Counting Sort.


Pro-tip 3: Other than using the typical media UI at the bottom of the page, you can also control the animation playback using keyboard shortcuts (in Exploration Mode): Spacebar to play/pause/replay the animation, / to step the animation backwards/forwards, respectively, and -/+ to decrease/increase the animation speed, respectively.

🕑

In Singapore (as of Mar 2026), bus routes are numbered from [2..993].


Not all integers between [2..993] are currently used, e.g., there is no bus route 989 — Search(989) should return false. A new bus route x may be introduced, i.e., Insert(x) or an existing bus route y may be discontinued, i.e., Remove(y).


As the range of possible bus routes is small, to record the data whether a bus route number exists or not, we can use a DAT of Boolean array of size 1000 (generally, it is useful to give a few extra buffer cells on top of the current largest bus number of 993).


PS: The closest prime number to 1000 is 997 - but using prime table size is not important for DAT.

🕑

请注意,我们总是可以添加卫星数据,而不仅仅是使用布尔数组来记录键的存在。


例如,我们可以使用关联字符串数组A来将公交路线号映射到其运营商名称,例如,

A[2] = "Go-Ahead Singapore",
A[10] = "SBS Transit",
A[183] = "Tower Transit Singapore",
A[188] = "SMRT Buses", 等等。

讨论:你能想到其他一些现实生活中的DAT例子吗?

🕑

The content of this interesting slide (the answer of the usually intriguing discussion point from the earlier slide) is hidden and only available for legitimate CS lecturer worldwide. This mechanism is used in the various flipped classrooms in NUS.


If you are really a CS lecturer (or an IT teacher) (outside of NUS) and are interested to know the answers, please drop an email to stevenhalim at gmail dot com (show your University staff profile/relevant proof to Steven) for Steven to manually activate this CS lecturer-only feature for you.


FAQ: This feature will NOT be given to anyone else who is not a CS lecturer.

🕑

The keys must be (or can be easily mapped to) non-negative integer values. Note that basic DAT has problem in the full version of the example in the previous few slides as there are actually variations of bus route numbers in Singapore, e.g., 96B, 151A, NR10, etc.


The range of keys must be small.
The memory usage will be (insanely) large if we have (insanely) large range.


The keys must be dense, i.e., not many gaps in the key values.
DAT will contain too many empty (and wasted) cells otherwise.


We will overcome these restrictions with hashing.

🕑

Using hashing, we can:

  1. Map (some) non-integer keys (e.g., strings) to integers keys,
  2. Map large integers to smaller integers.
🕑

例如,我们有 N = 400个新加坡电话号码(新加坡电话号码有8位数字,所以新加坡最多有10^8 = 1亿个可能的电话号码)。


我们可以使用以下简单的哈希函数 h(v) = v%997,而不是使用DAT并使用一个最大为 M = 1亿的巨大数组。


这样,我们将8位电话号码 6675 23786874 4483 映射到最多3位的 h(6675 2378) = 237h(6874 4483) = 336。因此,我们只需要准备一个大小为 M = 997(997是一个质数)的数组,而不是 M = 1亿的数组。

🕑

With hashing, we can now implement the following Table ADT operations using integer array (instead of Boolean array) as follows:

  1. Search(v): Check if A[h(v)] != -1 (we use -1 for an empty cell assuming v ≥ 0),
  2. Insert(v): Set A[h(v)] = v (we hash v into h(v) so we need to somehow record key v),
  3. Remove(v): Set A[h(v)] = -1 — to be elaborated further.
🕑

If we have keys that map to satellite data and we want to record the original keys too, we can implement the Hash Table using pair of (integer, satellite-data-type) array as follows:

  1. Search(v): Return A[h(v)], which is a pair (v, satellite-data), possibly empty,
  2. Insert(v, satellite-data): Set A[h(v)] = pair (v, satellite-data),
  3. Remove(v): Set A[h(v)] = (empty pair) — to be elaborated further.

However, by now you should notice that something is incomplete...

🕑

A hash function may, and quite likely, map different keys (integer or not) into the same integer slot, i.e., a many-to-one mapping instead of one-to-one mapping.


For example, h(6675 2378) = 237 from three slides earlier and if we want to insert another phone number 6675 4372, we will have a problem as h(6675 4372) = 237 too.


This situation is called a collision, i.e., two (or more) keys have the same hash value.

🕑
生日(冯米塞斯)悖论问:'在一个大小为365个座位(单元)的房间(哈希表)中,必须有多少人(键的数量),才能有人同一天生日(碰撞,两个键被哈希到同一个单元)的概率变得>50%(即比不发生更有可能)?

Reveal这个答案对我们中的一些人来说可能令人惊讶。
我们来做一些计算吧。
🕑

Let Q(n) be the probability of unique birthday for n people in a room.
Q(n) = 365/365 × 364/365 × 363/365 × ... × (365-n+1)/365,
i.e., the first person's birthday can be any of the 365 days, the second person's birthday can be any of the 365 days except the first person's birthday, and so on.


Let P(n) be the probability of same birthday (collision) for n people in a room.
P(n) = 1-Q(n).


We compute that P(23) = 0.507 > 0.5 (50%).


Thus, we only need 23 people (a small amount of keys) in the room (Hash Table) of size 365 seats (cells) for a (more than) 50% chance collision to happen (the birthday of two different people in that room is one of 365 days/slots).


A good heuristic: we only need sqrt(n) keys in a hash Table of n cells to have more than 50% chance of collision [derivation omitted, but it is based on this birthday paradox].

🕑

Issue 1: We have seen a simple hash function like the h(v) = v%997 used in Phone Numbers example that maps large range of integer keys into a smaller range of integer keys, but how about non integer keys? How to do such hashing efficiently?


Issue 2: We have seen that by hashing, or mapping, large range into smaller range, there will very likely be a collision. How to deal with them?

🕑
如何用这些理想的属性创建一个好的散列函数?
  1. 快速计算,即在O(1)中,
  2. 尽可能使用最小插槽/散列表的大小M,
  3. 尽可能均匀地将键分散到不同的基地址∈[0..M-1],
  4. 尽可能减少碰撞。
🕑

Suppose we have a hash table of size M where keys are used to identify the satellite-data and a specific hash function is used to compute a hash value.


A hash value/hash code of key v is computed from the key v with the use of a hash function to get an integer in the range 0 to M-1. This hash value is used as the base/home index/address of the Hash Table entry for the satellite-data.

🕑

使用电话号码示例,如果我们定义h(v) = floor(v/1 000 000)
即,我们选择电话号码的前两位数字。

h(66 75 2378) = 66
h(68 74 4483) = 68

讨论:当你使用这个哈希函数时会发生什么?提示:参见这个

🕑

The content of this interesting slide (the answer of the usually intriguing discussion point from the earlier slide) is hidden and only available for legitimate CS lecturer worldwide. This mechanism is used in the various flipped classrooms in NUS.


If you are really a CS lecturer (or an IT teacher) (outside of NUS) and are interested to know the answers, please drop an email to stevenhalim at gmail dot com (show your University staff profile/relevant proof to Steven) for Steven to manually activate this CS lecturer-only feature for you.


FAQ: This feature will NOT be given to anyone else who is not a CS lecturer.

🕑

在讨论现实之前,让我们讨论理想的情况:完美的散列函数
完美的散列函数是键和散列值之间的一对一映射,即根本不存在冲突。 如果事先知道所有的键是可能的。 例如,编译器/解释器搜索保留关键字。 但是,这种情况很少见。
当表格大小与提供的关键字数量相同时,实现最小的完美散列函数。 这种情况更为罕见。
如果你有兴趣,你可以探索GNU gperf,这是一个用C++编写的免费可用的完美哈希函数生成器,可以从用户提供的关键字列表中自动构建完美的函数(C++程序)。

🕑

People has tried various ways to hash a large range of integers into a smaller range of integers as uniformly as possible. In this e-Lecture, we jump directly to one of the best and most popular version: h(v) = v%M, i.e., map v into Hash Table of size M slots. The (%) is a modulo operator that gives the remainder after division. This is clearly fast, i.e., O(1) assuming that v does not exceed the natural integer data type limit.


The Hash Table size M is set to be a reasonably large prime not near a power of 2, about 2+ times larger than the expected number of keys N that will ever be used in the Hash Table. This way, the load factor α = N/M < 0.5 — we shall see later that having low load factor, thereby sacrificing empty spaces, help improving Hash Table performance.


Discuss: What if we set M to be a power of 10 (decimal) or power of 2 (binary)?

🕑

The content of this interesting slide (the answer of the usually intriguing discussion point from the earlier slide) is hidden and only available for legitimate CS lecturer worldwide. This mechanism is used in the various flipped classrooms in NUS.


If you are really a CS lecturer (or an IT teacher) (outside of NUS) and are interested to know the answers, please drop an email to stevenhalim at gmail dot com (show your University staff profile/relevant proof to Steven) for Steven to manually activate this CS lecturer-only feature for you.


FAQ: This feature will NOT be given to anyone else who is not a CS lecturer.

🕑

People has also tried various ways to hash Strings into a small range of integers as uniformly as possible. In this e-Lecture, we jump directly to one of the best and most popular version, shown below:

int hash_function(string v) { // note 1: v uses ['A'..'Z'] only
int sum = 0; // note 2: v is a short string
for (auto& c : v) // for each character c in v
sum = ((sum*26)%M + (c-'A'+1))%M; // M is table size
return sum;
}

Interactive (M = ∞), i.e., the modulo operation has no effect
v = , hash_string(v) = 0.


Discussion: In real life class, discuss the components of the hash function above, e.g., why loop through all characters?, will that be slower than O(1)?, why multiply with 26?, what if the string v uses more than just UPPERCASE chars?, etc.

🕑

The content of this interesting slide (the answer of the usually intriguing discussion point from the earlier slide) is hidden and only available for legitimate CS lecturer worldwide. This mechanism is used in the various flipped classrooms in NUS.


If you are really a CS lecturer (or an IT teacher) (outside of NUS) and are interested to know the answers, please drop an email to stevenhalim at gmail dot com (show your University staff profile/relevant proof to Steven) for Steven to manually activate this CS lecturer-only feature for you.


FAQ: This feature will NOT be given to anyone else who is not a CS lecturer.

🕑

There are two major ideas: Closed Addressing versus Open Addressing method.


In Closed Addressing, the Hash Table looks like an Adjacency List (a graph data structure). The hash code of a key gives its fixed/closed base address. Collision is resolved by appending the collided keys inside an auxiliary data structure (usually any form of List ADT) identified by the base address.


In Open Addressing, all hashed keys are located in a single array. The hash code of a key gives its base address. Collision is resolved by checking/probing multiple alternative addresses (hence the name open) in the table based on a certain rule.

🕑

Separate Chaining (SC) collision resolution technique is simple. We use M copies of auxiliary data structures, usually Doubly Linked Lists (PS: simpler auxiliary data structures exist). If two keys a and b both have the same hash value i, both will be appended to the (front/back) of Doubly Linked List i (in this visualization, we append to the back in O(1) with help of tail pointer). That's it, where the keys will be slotted in is completely dependent on the hash function itself, hence we also call Separate Chaining as Closed Addressing collision resolution technique.


If we use Separate Chaining, the load factor α = N/M is the average length of the M lists (unlike in Open Addressing, α can be "slightly over 1.0") and it will determine the performance of Search(v) as we may have to explore α elements on average. As Remove(v) also requires Search(v), its performance is similar as Search(v). Insert(v) is clearly O(1).


If we can bound α to be a small constant (true if we know the expected largest N in our Hash Table application so that we can set up M accordingly), then all Search(v), Insert(v), and Remove(v) operations using Separate Chaining will be O(1).

🕑

在这个可视化中,我们讨论了三种开放寻址 (OA) 冲突解决技术:线性探测 (LP),二次探测 (QP) 和双重哈希 (DH)。


要在这三种模式之间切换,请点击相应的标题。


设:
M = HT.length = 当前哈希表的大小,
base = (key%HT.length),
step = 当前的探测步骤,
secondary = smaller_prime - key%smaller_prime (为了避免零 —— 很快会详细说明)

我们很快就会看到,这三种模式的探测序列是:
线性探测:i=(base+step*1) % M,
二次探测:i=(base+step*step) % M,和
双重哈希:i=(base+step*secondary) % M。


所有三种 OA 技术都要求负载因子 α = N/M < 1.0 (否则无法进行更多的插入)。如果我们可以将 α 限制在一个小的常数 (如果我们知道我们的哈希表应用中预期的最大 N,以便我们可以相应地设置 M,最好 < 0.5 对于大多数 OA 变体),那么所有使用开放寻址的 Search(v),Insert(v) 和 Remove(v) 操作都将是 O(1) —— 详细信息省略。

🕑

View the visualization of Hash Table above.


In this visualization, we allow the insertion of duplicate keys (i.e., a multiset). Since a multiset is more general than a set, simply just insert distinct integers in this visualization if you want to see how Hash Table works on distict integer keys only.


Due to limited screen space, we switch from default (1.0x) scale to 0.5x scale whenever you want to visualize Hash Table size M ∈ [46..90] for OA techniques. The limit is a bit lower, i.e., M ∈ [20..31] for SC technique.


The Hash Table is visualized horizontally like an array where index 0 is placed at the leftmost of the first row and index M-1 is placed at the rightmost of the last row but the details are different when we are visualizing Separate Chaining (only the top row) versus Open Addressing (usually spans multiple rows) collision resolution techniques.

🕑

对于分离链接 (SC) 冲突解决技术,第一行包含 M 个 "H" (头) 指针,这些指针指向 M双向链表


然后,每个双向链表 i 包含所有被哈希到 i 的键,顺序是任意的(在0.5x的比例下,顶点标签显示在较小的黑点上方)。从数学上讲,所有可以表示为 i (mod M) 的键 — 包括所有 i 的重复项 — 都被哈希到 DLL i。再次强调,我们在这个可视化中不存储任何卫星数据。

🕑

在这个可视化中,我们讨论了三种开放寻址冲突解决技术:线性探测 (LP),二次探测 (QP) 和双重哈希 (DH)。


对于所有三种技术,每个哈希表单元格都显示为一个顶点,单元格值 [0..99] 显示为顶点标签(在 0.5x 缩放中,顶点标签显示在较小的黑点上方)。在不失一般性的情况下,我们在这个可视化中不显示任何卫星数据,因为我们只关注键的排列。我们保留值 -1 来表示一个 '空单元格'(可视化为一个空白顶点)和 -2 来表示一个 '已删除单元格'(可视化为一个带有缩写标签 "DEL" 的顶点)。范围从 [0..M-1] 的单元格索引显示为每个顶点下方的红色标签(在 1.0x 缩放中为 15 行索引,或在 0.5x 缩放中为 25 行索引)。

🕑

Try Insert([9,16,23,30,37,44]) to see how Insert(v) operation works if we use Separate Chaining as collision resolution technique. On such random insertions, the performance is good and each insertion is clearly O(1).


However if we try Insert([68,90]), notice that all integers {68,90} are 2 (modulo 11) so all of them will be appended into the (back of) Doubly Linked List 2. We will have a long chain in that list. Note that due to the screen limitation, we limit the length of each Doubly Linked List to be at maximum 6.

🕑

尝试 Search(35),可以看到 Search(v) 的运行时间可以达到 O(1+α)。


尝试 Remove(35),可以看到 Remove(v) 的运行时间也可以达到 O(1+α)。


如果 α 很大,分离链接的性能实际上并不是 O(1)。然而,如果我们大致知道我们的应用程序可能会使用的最大键值数量 N,那么我们可以相应地设置表大小 M,使得 α = N/M 是一个非常低的正(浮点)数,从而使分离链接的性能预期为 O(1)。

🕑
在线性探测冲突解决方法中,每当发生冲突时,我们会一次扫描一个索引,以查找下一个空/被删除的插槽(当我们到达最后一个插槽时再重新环绕)。
例如,我们假设我们以表格大小M = HT.length = 7的空的散列表 HT开始,如上所示,使用索引0到 M-1 = 7-1 = 6。请注意,7是素数。 (主)散列函数很简单,h(v)= v%M
此演练将向您展示使用线性探测作为冲突解决技术时Insert(v),Search(v)和Remove(v)操作所采取的步骤。
🕑

Now click Insert([18,14,21]) — three individual insertions in one command.


Recap (to be shown after you click the button above).


Formally, we describe Linear Probing index i as i = (base+step*1) % M where base is the (primary) hash value of key v, i.e., h(v) and step is the Linear Probing step starting from 1.


Tips: To do a quick mental calculation of a (small) integer V modulo M, we simply subtract V with the largest multiple of MV, e.g., 18%7 = 18-14 = 4, as 14 is the largest multiple of 7 that is ≤ 18.

🕑
现在点击Insert([1,35])(在上一张幻灯片中插入的前三个值的顶部)。
回顾(在点击上面的按钮后显示)
🕑
现在我们说明使用线性探测作为碰撞解析技术的Search(v)操作。 所采取的步骤与 Insert(v)操作非常相似,即我们从(主)散列键值开始,检查是否已找到v,否则我们一次向前移动一个索引(如果遇到结尾则从开头重新开始)并重新检查。 当我们遇到一个空单元时,我们就停止,这意味着v完全不在哈希表中(因为之前的Insert(v)操作会将v放在那里)。

现在点击Search(35) - 你应该看到探测序列[0,1,2,3(找到键35)]。

现在单击Search(7) - [1,2,3,4,5(空单元格,因此在哈希表中找不到键8)]。
🕑
现在我们来讨论移除(v)操作。
如果我们直接设置HT [i] = EMPTY单元,其中i是包含v的索引(在必要时进行线性探测之后),您是否意识到我们会导致问题? 为什么?

提示:查看以前的三张幻灯片,了解插入(v)和搜索(v)的行为。
🕑

The content of this interesting slide (the answer of the usually intriguing discussion point from the earlier slide) is hidden and only available for legitimate CS lecturer worldwide. This mechanism is used in the various flipped classrooms in NUS.


If you are really a CS lecturer (or an IT teacher) (outside of NUS) and are interested to know the answers, please drop an email to stevenhalim at gmail dot com (show your University staff profile/relevant proof to Steven) for Steven to manually activate this CS lecturer-only feature for you.


FAQ: This feature will NOT be given to anyone else who is not a CS lecturer.

🕑
现在让我们看看完整的删除(v)。 如果我们在索引 i 处找到 v(必要时在线性探测之后),我们必须设置 HT [i] = DELETED(在此可视化中缩写为DEL),其中DEL是一个特殊符号(通常你只能在你的应用程序种没有使用过的使用一个特殊符号)指示该单元格可以在将来的搜索(v)需要时绕过,但可以被将来的插入(w)覆盖。 这种策略被称为懒惰删除。
现在点击Remove(21) - [0,1(找到关键21,我们设置H [1] = DEL)]。
然后请继续讨论下一张幻灯片。
🕑
现在点击Search(35) - [0,1(绕过那个DELETED cell),2,3(找到键35)]。
想象一下,如果我们错误地设置H [1] = EMPTY,会发生什么。
🕑
现在单击Insert(28) - 您应该看到探测序列[0,1(使用DEL符号找到一个单元格)],因此,实际上可以用新值覆盖,而不会影响将来搜索(v)的正确性。 因此,我们把28放在索引1中。
🕑

尽管我们可以用线性探测来解决冲突,但这并不是最有效的方法。


我们定义一个为一组连续的占用槽。覆盖键的基地址的簇被称为键的主簇


现在注意到线性探测可以创建大的主簇,这将使Search(v)/Insert(v)/Remove(v)操作的运行时间超过宣传的O(1)。


看看上面的一个例子,其中M = 31,我们已经插入了15个键[0..14],使它们占用单元格[0..14] (α = 15/31 < 0.5)。现在看看 Insert(31) (第16个键)有多么"慢"。

🕑

The probe sequence of Linear Probing can be formally described as follows:

 h(v) // base address
(h(v) + 1*1) % M // 1st probe if there is a collision
(h(v) + 2*1) % M // 2nd probe if there is still a collision
(h(v) + 3*1) % M // 3rd probe if there is still a collision
...
(h(v) + k*1) % M // k-th probe, etc...

During Insert(v), if there is a collision but there is an empty (or DEL) slot remains in the Hash Table, we are sure to find it after at most M Linear Probing steps, i.e., in O(M). And when we do, the collision will be resolved, but the primary cluster of the key v is expanded as a result and future Hash Table operations will get slower too. Try the slow Search(31) on the same Hash Table as in the previous slide but with many DEL markers (suppose {4, 5, 8, 9, 10, 12, 14} have just been deleted).

🕑

在上一张幻灯片中(主要聚类,第1部分),我们打破了哈希函数应该将键均匀分布在 [0..M-1] 周围的假设。在下一个例子中,我们将展示即使哈希函数将键分布到 [0..M-1] 周围的几个相对较短的主要聚类中,主要聚类的问题仍然可能发生。


在屏幕上,看到 M = 31,插入了15个在 [0..99] 之间的随机整数(有几个随机但短的主要聚类)。如果我们接着插入这4个键 {2, 9, 12, 1},前三个键将“插入”三个空单元格,并意外地合并那些相邻的(但以前是分离的)聚类成为一个(非常)长的主要聚类。因此,下一个插入键1的操作将在这个长的主要聚类的开始处进行,最终需要进行几乎 O(M) 的探测步骤才能找到一个空单元格。尝试 Insert([2,9,12,1])

🕑

To reduce primary clustering, we can modify the probe sequence to:

 h(v) // base address
(h(v) + 1*1) % M // 1st probe if there is a collision
(h(v) + 2*2) % M // 2nd probe if there is still a collision
(h(v) + 3*3) % M // 3rd probe if there is still a collision
...
(h(v) + k*k) % M // k-th probe, etc...

That's it, the probe jumps quadratically, wrapping around the Hash Table as necessary.


A very common mistake as this is a different kind of Quadratic Probing:
Doing h(v), (h(v)+1) % M, (h(v)+1+4) % M, (h(v)+1+4+9) % M, ...

🕑
假设我们已经调用 Insert(18)和Insert(10)到大小为M = HT.length = 7的初始空Hash表中。由于18%7 = 4和10%7 = 3,18和3不会相互碰撞, 如上所示分别位于指数4和3中。
现在,我们点击Insert(38)
回顾(点击上面的按钮后显示)。
🕑
删除(x)和搜索(y)操作的定义类似。 只是这次我们使用二次探测而不是线性探测。
例如,假设我们在上一张幻灯片之后调用了Remove(18),并且标记了HT [4] = DEL。 如果我们调用Search(38),我们将使用与上一张幻灯片相同的二次探测序列,但是会通过标记为DELETED的HT [4]。
🕑
一目了然,二次探测(Quadratic Probing)能够快速解决我们之前对线性探测的主要聚类问题,但它是完美的碰撞解决技术吗?
尝试Insert([12,17])
你意识到刚刚发生了什么吗?
🕑
我们可以很容易地插入12,因为h(12)= 12%7 = 5 以前是空的(见上文)。
但是,即使我们仍然有3个空槽,我们仍然存在主要问题:h(17)= 17%7 = 3已被键10占用,(3 + 1 * 1)%7 = 4已被占用 (3 + 2 * 2)%7 = 0已被键38占用,(3 + 3 * 3)%7 = 5已被键12占用,(3 + 4 * 4)%7 = 5 (3 + 5 * 5)%7 = 0再次被键38占据,(3 + 6 * 6)%7 = 4又被键18占据,(3 + 7 * 7)%7 = 3再次被键10占用,如果我们继续二次探测,它将永远循环...
尽管我们仍然有几个(3)空单元格,但我们无法将此新值17插入到哈希表中。
🕑

如果 α < 0.5 并且 M 是一个质数(> 3),那么我们总是可以使用(这种形式的)二次探测找到一个空的插槽。回忆:α 是负载因子,M 是哈希表的大小(HT.length)。


如果满足上述两个要求,我们可以证明,包括基地址 h(v) 在内的前 M/2 个二次探测索引都是不同且唯一的。


但是超过这个范围就不能保证了。因此,如果我们想使用二次探测,我们需要确保 α < 0.5(在这个可视化中没有强制执行,但我们在 M 步后会中断循环以防止无限循环)。

🕑

We will use proof by contradiction. We first assume that two Quadratic Probing steps:
x and y, x != y (let's say x < y), can yield the same address modulo M.

h(v) + x*x = h(v) + y*y (mod M)
x*x = y*y (mod M) // strike out h(v) from both sides
x*x - y*y = 0 (mod M) // move y*y to LHS
(x-y)*(x+y) = 0 (mod M) // rearrange the formula

Now, either (x-y) or (x+y) has to be equal to zero.
As our assumption says x != y, then (x-y) cannot be 0.
As 0 ≤ x < y ≤ (M/2) and M is a prime > 3 (an odd integer),
then (x+y) also cannot be 0 modulo M.


Contradiction!


So the first M/2 Quadratic Probing steps cannot yield the same address modulo M
(if we set M to be a prime number greater than 3).


Discussion: Can we make Quadratic Probing able to use the other ~50% of the table cells?

🕑

The content of this interesting slide (the answer of the usually intriguing discussion point from the earlier slide) is hidden and only available for legitimate CS lecturer worldwide. This mechanism is used in the various flipped classrooms in NUS.


If you are really a CS lecturer (or an IT teacher) (outside of NUS) and are interested to know the answers, please drop an email to stevenhalim at gmail dot com (show your University staff profile/relevant proof to Steven) for Steven to manually activate this CS lecturer-only feature for you.


FAQ: This feature will NOT be given to anyone else who is not a CS lecturer.

🕑

在二次探测中,集群是沿着探测路径形成的,而不是像线性探测那样围绕基地址形成的。这些集群被称为次级集群,与困扰线性探测的主要集群相比,它们“不那么明显”。


次级集群是由于碰撞键使用相同的探测模式而形成的,即,如果两个不同的键有相同的基地址,那么它们的二次探测序列将会是相同的。


为了说明这一点,看看屏幕上M = 31。我们只用10个键填充了这个哈希表(所以负载因子 α = 10/31 ≤ 0.5),哈希表看起来“稀疏足够”(没有明显的大的主要集群)。然而,如果我们接着插入 Insert(62,93),尽管有很多(31-10 = 21)空单元格和62 != 93(最终散列到索引0的不同键),我们最终在这个“不那么明显”的次要聚集中进行了10次探测步骤(注意,{62, 93}都遵循类似的二次探测序列)。


二次探测中的次级集群并不像线性探测中的主要集群那么糟糕,因为一个好的哈希函数理论上应该将键分散到不同的基地址∈ [0..M-1]中。

🕑

To reduce primary and secondary clustering, we can modify the probe sequence to:

 h(v) // base address
(h(v) + 1*h2(v)) % M // 1st probe if there is a collision
(h(v) + 2*h2(v)) % M // 2nd probe if there is still a collision
(h(v) + 3*h2(v)) % M // 3rd probe if there is still a collision
...
(h(v) + k*h2(v)) % M // k-th probe, etc...

That's it, the probe jumps according to the value of the second hash function h2(v), wrapping around the Hash Table as necessary.

🕑

If h2(v) = 1, then Double Hashing works exactly the same as Linear Probing.
So we generally wants h2(v) > 1 to avoid primary clustering.


If h2(v) = 0, then Double Hashing does not work for an obvious reason as any probing step multiplied by 0 remains 0, i.e. we stay at the base address forever during a collision. We need to avoid this.


Usually (for integer keys), h2(v) = M' - v%M' where M' is a smaller prime than M.
This makes h2(v) ∈ [1..M'], which is diverse enough to avoid secondary clustering.


The usage of the secondary hash function makes it theoretically hard to have either primary or secondary clustering issue.

🕑
点击Insert([35,42])插入35,然后插入42到上面的当前哈希表。
回顾(点击上面的按钮后显示)。
🕑
删除(x)和搜索(y)操作的定义类似。 只是这次我们使用双散列(Double Hashing)而不是线性探测或二次探测。
例如,假设我们在上一张幻灯片之后调用了Remove(17),并且标记了HT [3] = DEL。 如果我们调用Search(35),我们将使用与前一张幻灯片相同的双哈希序列,但是会通过标记为DELETED的HT [3]。
🕑

In summary, a good Open Addressing collision resolution technique needs to:

  1. Always find an empty slot if it exists,
  2. Minimize clustering (of any kind),
  3. Give different probe sequences when 2 different keys collide,
  4. Fast, O(1).

Now, let's see the same test case that plagues Quadratic Probing earlier. Now try Insert(62,93) again. Although h(62) = h(93) = 0 and their collide with 31 that already occupy index 0, their probing steps are not the same: h2(62) = 29-62%29 = 25 is not the same as h2(93) = 29-93%29 = 23.


Discussion: Double Hashing seems to fit the bill. But... Is Double Hashing strategy flexible enough to be used as the default library implementation of a Hash Table? Or in a more general sense, which of the two collision resolution technique (Separate Chaining versus Open Addressing: Double Hashing) is the better one?

🕑

The content of this interesting slide (the answer of the usually intriguing discussion point from the earlier slide) is hidden and only available for legitimate CS lecturer worldwide. This mechanism is used in the various flipped classrooms in NUS.


If you are really a CS lecturer (or an IT teacher) (outside of NUS) and are interested to know the answers, please drop an email to stevenhalim at gmail dot com (show your University staff profile/relevant proof to Steven) for Steven to manually activate this CS lecturer-only feature for you.


FAQ: This feature will NOT be given to anyone else who is not a CS lecturer.

🕑

您已经完成了这个哈希表数据结构的基本工作,我们鼓励您在探索模式中进一步探索。

但是,我们仍然会为您提供一些本节中概述的更有趣的哈希表的挑战。

🕑
当负载因子α变高时,哈希表的性能会降低。 对于(标准)二次探测冲突解决方法,当哈希表的α> 0.5时,插入可能失败。
如果发生这种情况,我们可以重新散列(rehash)。 我们用一个新的散列函数构建另一个大约两倍的散列表。 我们遍历原始哈希表中的所有键,重新计算新的哈希值,然后将键(及其卫星数据)重新插入新的更大的哈希表中,最后删除较早的较小哈希表。
一个经验法则是,如果使用开放寻址,并且当α>小的常数(根据需要接近1.0),如果使用单独链接,当α≥0.5时重新散列。
如果我们知道可能的最大键数,我们可以始终将α作为一个低的数字。
🕑

However, if you ever need to implement a Hash Table in C++, Python, or Java, and your keys are either integers or strings, you can use the built-in C++ STL, Python standard library, or Java API, respectively. They already have good built-in implementation of default hash functions for integers or strings.


See C++ STL std::unordered_map, std::unordered_set, Python dict, set, or Java HashMap, HashSet.


For C++, note that the std::multimap/std::multiset implementations are also exist where duplicate keys are allowed.


For OCaml, we can use Hashtbl.


However, here is our take of a simple Separate Chaining implementation: HashTableDemo.cpp | py | java.

🕑

The content of this interesting slide (the answer of the usually intriguing discussion point from the earlier slide) is hidden and only available for legitimate CS lecturer worldwide. This mechanism is used in the various flipped classrooms in NUS.


If you are really a CS lecturer (or an IT teacher) (outside of NUS) and are interested to know the answers, please drop an email to stevenhalim at gmail dot com (show your University staff profile/relevant proof to Steven) for Steven to manually activate this CS lecturer-only feature for you.


FAQ: This feature will NOT be given to anyone else who is not a CS lecturer.

🕑

Hash Table is an extremely good data structure to implement Table ADT if the (integer or string) keys only need to be mapped to satellite-data, with O(1) performance for Search(v), Insert(v), and Remove(v) operations if the Hash Table is set up properly.


However, if we need to do much more with the keys, we may need to use an alternative data structure.

🕑

关于这个数据结构的一些更有趣的问题,请在哈希表培训模块上练习(不需要登录,但是只需短暂且中等难度设置)。
但是,对于注册用户,您应该登录,然后转到主要培训页面以正式清除此模块,这些成果将记录在您的用户帐户中。

🕑

尝试解决一些基本的编程问题,就很可能需要使用哈希表(特别是如果输入大小很大的时候):

  1. Kattis - cd(输入已经被排序,因此存在替代的非哈希表解决方案; 如果未对输入进行排序,则在哈希表的帮助下最好解决此组交集问题),
  2. Kattis - oddmanout(我们可以将较大的邀请代码映射到较小范围的整数;这是整数散列(大范围)的一种做法),
  3. Kattis - whatdoesthefoxsay(我们把不是狐狸的声音放入无序集合;这是散列字符串的做法)。


You have reached the last slide. Return to 'Exploration Mode' to start exploring!

Note that if you notice any bug in this visualization or if you want to request for a new visualization feature, do not hesitate to drop an email to the project leader: Dr Steven Halim via his email address: stevenhalim at gmail dot com.

🕑

新建(M, N)

插入(v)

移除(v)

>

New HT of size

M =

and

N =

random integers (α = N/M = 0.5)

前进

v =

前进

v =

前进

关于 团队 使用条款
隐私政策

关于

VisuAlgo最初由副教授Steven Halim于2011年构思,旨在通过提供自学、互动式学习平台,帮助学生更深入地理解数据结构和算法。

VisuAlgo涵盖了Steven Halim博士与Felix Halim博士、Suhendry Effendy博士合著的书《竞技编程》中讨论的许多高级算法。即使过去十年,VisuAlgo仍然是可视化和动画化这些复杂算法的独家平台。

虽然VisuAlgo主要面向新加坡国立大学(NUS)的学生,包括各种数据结构和算法课程(例如CS1010/等价课程,CS2040/等价课程(包括IT5003),CS3230,CS3233和CS4234),但它也是全球好奇心的宝贵资源,促进在线学习。

最初,VisuAlgo并不适用于智能手机等小触摸屏,因为复杂的算法可视化需要大量的像素空间和点击拖动交互。为了获得最佳用户体验,建议使用最低分辨率为1366x768的屏幕。然而,自2022年4月以来,VisuAlgo的移动(精简)版本已经推出,使得在智能手机屏幕上使用VisuAlgo的部分功能成为可能。

VisuAlgo仍然在不断发展中,正在开发更复杂的可视化。目前,该平台拥有24个可视化模块。

VisuAlgo配备了内置的问题生成器和答案验证器,其“在线测验系统”使学生能够测试他们对基本数据结构和算法的理解。问题根据特定规则随机生成,并且学生提交答案后会自动得到评分。随着越来越多的计算机科学教师在全球范围内采用这种在线测验系统,它可以有效地消除许多大学标准计算机科学考试中手工基本数据结构和算法问题。通过给通过在线测验的学生分配一个小但非零的权重,计算机科学教师可以显著提高学生对这些基本概念的掌握程度,因为他们可以在参加在线测验之前立即验证几乎无限数量的练习题。每个VisuAlgo可视化模块现在都包含自己的在线测验组件。

VisuAlgo已经被翻译成三种主要语言:英语、中文和印尼语。此外,我们还用各种语言撰写了关于VisuAlgo的公开笔记,包括印尼语、韩语、越南语和泰语:

id, kr, vn, th.

团队

项目领导和顾问(2011年7月至今)
Associate Professor Steven Halim, School of Computing (SoC), National University of Singapore (NUS)
Dr Felix Halim, Senior Software Engineer, Google (Mountain View)

本科生研究人员 1
CDTL TEG 1: Jul 2011-Apr 2012: Koh Zi Chun, Victor Loh Bo Huai

最后一年项目/ UROP学生 1
Jul 2012-Dec 2013: Phan Thi Quynh Trang, Peter Phandi, Albert Millardo Tjindradinata, Nguyen Hoang Duy
Jun 2013-Apr 2014 Rose Marie Tan Zhao Yun, Ivan Reinaldo

本科生研究人员 2
CDTL TEG 2: May 2014-Jul 2014: Jonathan Irvin Gunawan, Nathan Azaria, Ian Leow Tze Wei, Nguyen Viet Dung, Nguyen Khac Tung, Steven Kester Yuwono, Cao Shengze, Mohan Jishnu

最后一年项目/ UROP学生 2
Jun 2014-Apr 2015: Erin Teo Yi Ling, Wang Zi
Jun 2016-Dec 2017: Truong Ngoc Khanh, John Kevin Tjahjadi, Gabriella Michelle, Muhammad Rais Fathin Mudzakir
Aug 2021-Apr 2023: Liu Guangyuan, Manas Vegi, Sha Long, Vuong Hoang Long, Ting Xiao, Lim Dewen Aloysius

本科生研究人员 3
Optiver: Aug 2023-Oct 2023: Bui Hong Duc, Tay Ngan Lin

最后一年项目/ UROP学生 3
Aug 2023-Apr 2024: Xiong Jingya, Radian Krisno, Ng Wee Han, Tan Chee Heng
Aug 2024-Apr 2025: Edbert Geraldy Cangdinata, Huang Xing Chen, Nicholas Patrick

List of translators who have contributed ≥ 100 translations can be found at statistics page.

致谢
NUS CDTL gave Teaching Enhancement Grant to kickstart this project.

For Academic Year 2023/24 - present (currently AY 2025/26) - generous donations from Optiver will be used to further develop VisuAlgo.

使用条款

VisuAlgo慷慨地向全球计算机科学界提供免费服务。如果您喜欢VisuAlgo,我们恳请您向其他计算机科学学生和教师宣传它的存在。您可以通过社交媒体平台(如Facebook、YouTube、Instagram、TikTok、Twitter等)、课程网页、博客评论、电子邮件等方式分享VisuAlgo。

数据结构与算法(DSA)的学生和教师可以直接在课堂上使用本网站。如果您从本网站截取屏幕截图或视频,可以在其他地方使用,但请引用本网站的URL(https://visualgo.net)和/或下面的出版物列表作为参考。但请不要下载VisuAlgo的客户端文件并将其托管在您的网站上,因为这构成了抄袭行为。目前,我们不允许他人分叉此项目或创建VisuAlgo的变体。个人使用离线副本的客户端VisuAlgo是可以接受的。

请注意,VisuAlgo的在线测验组件具有重要的服务器端元素,保存服务器端脚本和数据库并不容易。目前,普通公众只能通过“培训模式”访问在线测验系统。“测试模式”提供了一个更受控制的环境,用于在新加坡国立大学的真实考试中使用随机生成的问题和自动验证。


出版物列表

这项工作曾在2012年国际大学生程序设计竞赛(波兰,华沙)的CLI研讨会上和2012年国际信息学奥林匹克竞赛(意大利,锡尔米奥内-蒙蒂基亚里)的IOI会议上展示过。您可以点击此链接阅读我们2012年关于该系统的论文(当时还没有称为VisuAlgo),以及此链接阅读2015年的简短更新(将VisuAlgo与之前的项目关联起来)。


错误报告或新功能请求

VisuAlgo并不是一个完成的项目。Steven Halim副教授仍在积极改进VisuAlgo。如果您在使用VisuAlgo时发现任何可视化页面/在线测验工具中的错误,或者您想要请求新功能,请联系Steven Halim副教授。他的联系方式是将他的名字连接起来,然后加上gmail dot com。

隐私政策

版本 1.2 (更新于2023年8月18日星期五)。

自2023年8月18日(星期五)起,我们不再使用 Google Analytics。因此,我们现在使用的所有 cookies 仅用于此网站的运营。即使是首次访问的用户,烦人的 cookie 同意弹窗现在也已关闭。

自2023年6月7日(星期五)起,由于 Optiver 的慷慨捐赠,全世界的任何人都可以自行创建一个 VisuAlgo 账户,以存储一些自定义设置(例如,布局模式,默认语言,播放速度等)。

此外,对于 NUS 学生,通过使用 VisuAlgo 账户(一个 NUS 官方电子邮件地址,课堂名册中的学生姓名,以及在服务器端加密的密码 - 不存储其他个人数据),您同意您的课程讲师跟踪您的电子讲义阅读和在线测验培训进度,这是顺利进行课程所必需的。您的 VisuAlgo 账户也将用于参加 NUS 官方的 VisuAlgo 在线测验,因此,将您的账户凭据传递给他人代您进行在线测验构成学术违规。课程结束后,您的用户账户将被清除,除非您选择保留您的账户(OPT-IN)。访问完整的 VisuAlgo 数据库(包含加密密码)的权限仅限于 Halim 教授本人。

对于全球其他已经给 Steven 写过信的 CS 讲师,需要一个 VisuAlgo 账户(您的(非 NUS)电子邮件地址,您可以使用任何显示名称,以及加密密码)来区分您的在线凭据与世界其他地方。您的账户将具有 CS 讲师特定的功能,即能够查看隐藏的幻灯片,这些幻灯片包含了在隐藏幻灯片之前的幻灯片中提出的问题的(有趣的)答案。您还可以访问 VisuAlgo 在线测验的 Hard 设置。您可以自由地使用这些材料来增强您的数据结构和算法课程。请注意,未来可能会有其他 CS 讲师特定的功能。

对于任何拥有 VisuAlgo 账户的人,如果您希望不再与 VisuAlgo 工具有关联,您可以自行删除您的账户。