







Hash Table is a data structure to map key to values (also called Table or Map Abstract Data Type/ADT). It uses a hash function to map large or even non-Integer keys into a small range of Integer indices (typically [0..hash_table_size-1]).
The probability of two distinct keys colliding into the same index is relatively high and each of this potential collision needs to be resolved to maintain data integrity.
There are several collision resolution strategies that will be highlighted in this visualization: Open Addressing (Linear Probing, Quadratic Probing, and Double Hashing) and Closed Addressing (Separate Chaining). Try clicking
for a sample animation of searching a specific value 7 in a randomly created Hash Table using Separate Chaining technique (duplicates are allowed).Remarks: By default, we show e-Lecture Mode for first time (or non logged-in) visitor.
If you are an NUS student and a repeat visitor, please login.
散列是一种算法(通过散列函数),将大型可变长度数据集(称为键,不一定是整数)映射为固定长度的较小整数数据集。
哈希表是一种数据结构,它使用哈希函数有效地将键映射到值(表或地图ADT),以便进行高效的搜索/检索,插入和/或删除。
散列表广泛应用于多种计算机软件中,特别是关联数组,数据库索引,缓存和集合。
在这个电子讲座中,我们将深入讨论表ADT,哈希表的基本概念,讨论哈希函数,然后再讨论哈希表数据结构本身的细节。
Pro-tip 1: Since you are not logged-in, you may be a first time visitor (or not an NUS student) who are not aware of the following keyboard shortcuts to navigate this e-Lecture mode: [PageDown]/[PageUp] to go to the next/previous slide, respectively, (and if the drop-down box is highlighted, you can also use [→ or ↓/← or ↑] to do the same),and [Esc] to toggle between this e-Lecture mode and exploration mode.
表ADT必须至少支持以下三种操作,并且尽可能高效:
- 搜索(v) - 确定v是否存在于ADT中,
- 插入(v) - 将v插入ADT,
- 删除(v) - 从ADT中删除v。
哈希表是这个表ADT的一个可能的好实现(另一个是这个)。
PS1:对于Table ADT的两个较弱的实现,您可以单击相应的链接:未排序数组或排序数组来阅读详细讨论。
PS2:在现场课程中,您可能想要比较Table ADT和List ADT的要求。
Pro-tip 2: We designed this visualization and this e-Lecture mode to look good on 1366x768 resolution or larger (typical modern laptop resolution in 2021). We recommend using Google Chrome to access VisuAlgo. Go to full screen mode (F11) to enjoy this setup. However, you can use zoom-in (Ctrl +) or zoom-out (Ctrl -) to calibrate this.
当整数键的范围很小时,例如 [0..M-1],我们可以使用大小为M的初始空(Boolean)数组A,并直接实现以下表ADT操作:
- 搜索(v):检查A [v]是true(填充)还是false(空),
- 插入(v):将A [v]设置为true(填充),
- 移除(v):将A [v]设置为false(空白)。
就是这样,我们使用小整数键本身来确定数组A中的地址,因此称为直接寻址。 很明显,所有三种主要的ADT操作都是O(1)。
题外话:这个想法在其他地方也有使用,例如在计数排序中。
Pro-tip 3: Other than using the typical media UI at the bottom of the page, you can also control the animation playback using keyboard shortcuts (in Exploration Mode): Spacebar to play/pause/replay the animation, ←/→ to step the animation backwards/forwards, respectively, and -/+ to decrease/increase the animation speed, respectively.
In Singapore (as of Apr 2023), bus routes are numbered from [2..991].
Not all integers between [2..991] are currently used, e.g., there is no bus route 989 — Search(989) should return false. A new bus route x may be introduced, i.e., Insert(x) or an existing bus route y may be discontinued, i.e., Remove(y).
As the range of possible bus routes is small, to record the data whether a bus route number exists or not, we can use a DAT with a Boolean array of size 1000 (generally, it is useful to give a few extra buffer cells on top of the current largest bus number of 991).
Notice that we can always add satellite data instead of just using a Boolean array to record the existence of the keys.
For example, we can use an associative String array A instead to map a bus route number to its operator name, e.g.,
A[2] = "Go-Ahead Singapore",
A[10] = "SBS Transit",
A[183] = "Tower Transit Singapore",
A[188] = "SMRT Buses", etc.
Discussion: Can you think of a few other real-life DAT examples?
The content of this interesting slide (the answer of the usually intriguing discussion point from the earlier slide) is hidden and only available for legitimate CS lecturer worldwide. This mechanism is used in the various flipped classrooms in NUS.
If you are really a CS lecturer (or an IT teacher) (outside of NUS) and are interested to know the answers, please drop an email to stevenhalim at gmail dot com (show your University staff profile/relevant proof to Steven) for Steven to manually activate this CS lecturer-only feature for you.
FAQ: This feature will NOT be given to anyone else who is not a CS lecturer.
键的范围必须很小。 如果我们有(非常)大范围的话,内存使用量会(非常的)大。
键必须密集,即键值中没有太多空白。 否则DAT将包含太多的空的(浪费的)单元。
我们将用哈希来克服这些限制。
使用哈希,我们可以:
- 将(一些)非整数(例如字符串)键映射成整数键,
- 将大整数映射成较小的整数。
For example, we have N = 400 Singapore phone numbers (Singapore phone number has 8 digits, so there are up to 10^8 = 100M possible phone numbers in Singapore).
Instead of using a DAT and use a gigantic array up to size M = 100 Million, we can use the following simple hash function h(v) = v%997.
This way, we map 8 digits phone numbers 6675 2378 and 6874 4483 into up to 3 digits h(6675 2378) = 237 and h(6874 4483) = 336, respectively. Therefore, we only need to prepare an array of size M = 997 (997 is a prime) instead of M = 100 Million.
- 搜索(v):检查A [h(v)]!= -1(假设 v≥0,我们对空单元使用-1)
- 插入(v):设置A [h(v)] = v(我们把 v 散列到h(v)中,所以我们需要以某种方式记录键 v),
- 删除(v):设置A [h(v)] = -1 --- 将进一步阐述。
如果我们有映射到卫星数据的键,并且我们也想记录原始键,我们可以使用如下一对(整数,卫星数据类型)数组实现哈希表:
- 搜索(v):返回A [h(v)],它是一对(v,卫星数据),可能是空的,
- 插入(v,卫星数据):设置A [h(v)] =对(v,卫星数据),
- 删除(v):设置A [h(v)] =(空对) - 将进一步详细阐述。
但是,现在你应该注意到有些地方是不完整的......
散列函数可能且很可能将不同的键(整数或不是)映射到同一个整数槽中,即多对一映射而不是一对一映射。
例如,早些时候三张幻灯片中的h(6675 2378)= 237,如果我们想插入另一个电话号码6675 4372,我们也会遇到问题,因为h(6675 4372)= 237。
这种情况称为碰撞,即两个(或更多)键具有相同的散列值。
我们来做一些计算吧。
假设Q(n)是房间中n个人不同生日的概率。
Q(n)= 365/365×364/365×363/365×...×(365-n + 1)/ 365,
即第一人的生日可以是365天中的任何一天,第二人的生日可以是除第一人的生日之外的任何365天,等等。
设P(n)为房间中 n 个人的相同生日(碰撞)的概率。P(n)= 1-Q(n)
我们计算P(23) = 0.507> 0.5(50%)。
因此,我们只需要在有365人的座位(单元格)的房间(哈希表)中有23人(少量键),发生碰撞事件(该房间中两个不同人的生日 是365天/插槽之一)就会超过50%。
问题1:我们已经看到了一个简单的散列函数,如电话号码示例中使用的h(v)= v%997,它将大范围的整数键映射到整数键的较小范围内,但非整数键的情况如何? 如何有效地做这样的散列?
问题2:我们已经看到,通过将大范围散列或映射到更小范围,很可能会发生碰撞。 如何处理它们?
- 快速计算,即在O(1)中,
- 尽可能使用最小插槽/散列表的大小M,
- 尽可能均匀地将键分散到不同的基地址∈[0..M-1],
- 尽可能减少碰撞。
使用散列函数从键v计算键v的散列值/散列码以获得范围从0到M-1的整数。 该散列值用作卫星数据的散列表条目的基址/归位索引/地址。
Using the Phone Numbers example, if we we define h(v) = floor(v/1 000 000),
i.e., we select the first two digits a phone number.
h(66 75 2378) = 66
h(68 74 4483) = 68
Discuss: What happen when you use that hash function? Hint: See this.
The content of this interesting slide (the answer of the usually intriguing discussion point from the earlier slide) is hidden and only available for legitimate CS lecturer worldwide. This mechanism is used in the various flipped classrooms in NUS.
If you are really a CS lecturer (or an IT teacher) (outside of NUS) and are interested to know the answers, please drop an email to stevenhalim at gmail dot com (show your University staff profile/relevant proof to Steven) for Steven to manually activate this CS lecturer-only feature for you.
FAQ: This feature will NOT be given to anyone else who is not a CS lecturer.
在讨论现实之前,让我们讨论理想的情况:完美的散列函数。
完美的散列函数是键和散列值之间的一对一映射,即根本不存在冲突。 如果事先知道所有的键是可能的。 例如,编译器/解释器搜索保留关键字。 但是,这种情况很少见。
当表格大小与提供的关键字数量相同时,实现最小的完美散列函数。 这种情况更为罕见。
如果你有兴趣,你可以探索GNU gperf,这是一个用C++编写的免费可用的完美哈希函数生成器,可以从用户提供的关键字列表中自动构建完美的函数(C++程序)。
哈希表大小M被设置为不接近2的幂的相当大的素数,大约比将在哈希表中使用的期望的键的数量N大2倍以上。 这样,负载因子α= N / M <0.5 — 稍后我们将看到具有低负载因子,从而牺牲空闲空间,有助于提高哈希表性能。
讨论:如果我们将M设置为10(十进制)的幂或2(幂的二进制),那该怎么办?
The content of this interesting slide (the answer of the usually intriguing discussion point from the earlier slide) is hidden and only available for legitimate CS lecturer worldwide. This mechanism is used in the various flipped classrooms in NUS.
If you are really a CS lecturer (or an IT teacher) (outside of NUS) and are interested to know the answers, please drop an email to stevenhalim at gmail dot com (show your University staff profile/relevant proof to Steven) for Steven to manually activate this CS lecturer-only feature for you.
FAQ: This feature will NOT be given to anyone else who is not a CS lecturer.
People has also tried various ways to hash Strings into a small range of Integers as uniformly as possible. In this e-Lecture, we jump directly to one of the best and most popular version, shown below:
int hash_function(string v) { // assumption 1: v uses ['A'..'Z'] only
int sum = 0; // assumption 2: v is a short string
for (auto& c : v) // for each character c in v
sum = ((sum*26)%M + (c-'A'+1))%M; // M is table size
return sum;
}
Interactive (M = ∞), i.e., the modulo operation has no effect
v = , = 0.
Discussion: In real life class, discuss the components of the hash function above, e.g., why loop through all characters?, will that be slower than O(1)?, why multiply with 26?, what if the string v uses more than just UPPERCASE chars?, etc.
The content of this interesting slide (the answer of the usually intriguing discussion point from the earlier slide) is hidden and only available for legitimate CS lecturer worldwide. This mechanism is used in the various flipped classrooms in NUS.
If you are really a CS lecturer (or an IT teacher) (outside of NUS) and are interested to know the answers, please drop an email to stevenhalim at gmail dot com (show your University staff profile/relevant proof to Steven) for Steven to manually activate this CS lecturer-only feature for you.
FAQ: This feature will NOT be given to anyone else who is not a CS lecturer.
There are two major ideas: Open Addressing versus Closed Addressing method.
In Open Addressing, all hashed keys are located in a single array. The hash code of a key gives its base address. Collision is resolved by checking/probing multiple alternative addresses (hence the name open) in the table based on a certain rule.
In Closed Addressing, the Hash Table looks like an Adjacency List (a graph data structure). The hash code of a key gives its fixed/closed base address. Collision is resolved by appending the collided keys inside an auxiliary data structure (usually any form of List ADT) identified by the base address.
There are three Open Addressing (OA) collision resolution techniques discussed in this visualization: Linear Probing (LP), Quadratic Probing (QP), and Double Hashing (DH).
To switch between the three modes, please click on the respective header.
Let:
M = HT.length = the current hash table size,
base = (key%HT.length),
step = the current probing step,
secondary = smaller_prime - key%smaller_prime (to avoid zero — elaborated soon)
We will soon see that the probing sequences of the three modes are:
Linear Probing: i=(base+step*1) % M,
Quadratic Probing: i=(base+step*step) % M, and
Double Hashing: i=(base+step*secondary) % M.
All three OA techniques require that the load factor α = N/M < 1.0 (otherwise no more insertion is possible). If we can bound α to be a small constant (true if we know the expected largest N in our Hash Table application so that we can set up M accordingly, preferably < 0.5 for most OA variants), then all Search(v), Insert(v), and Remove(v) operations using Open Addressing will be O(1) — details omitted.
Separate Chaining (SC) collision resolution technique is simple. We use M copies of auxiliary data structures, usually Doubly Linked Lists. If two keys a and b both have the same hash value i, both will be appended to the (front/back) of Doubly Linked List i (in this visualization, we append to the back in O(1) with help of tail pointer). That's it, where the keys will be slotted in is completely dependent on the hash function itself, hence we also call Separate Chaining as Closed Addressing collision resolution technique.
If we use Separate Chaining, the load factor α = N/M is the average length of the M lists (unlike in Open Addressing, α can be "slightly over 1.0") and it will determine the performance of Search(v) as we may have to explore α elements on average. As Remove(v) also requires Search(v), its performance is similar as Search(v). Insert(v) is clearly O(1).
If we can bound α to be a small constant (true if we know the expected largest N in our Hash Table application so that we can set up M accordingly), then all Search(v), Insert(v), and Remove(v) operations using Separate Chaining will be O(1).
View the visualization of Hash Table above.
In this visualization, we allow the insertion of duplicate keys (i.e., a multiset). Since a multiset is more general than a set, simply just insert distinct integers in this visualization if you want to see how Hash Table works on distict integer keys only.
Due to limited screen space, we switch from default (1.0x) scale to 0.5x scale whenever you want to visualize Hash Table size M ∈ [46..90] for OA techniques. The limit is a bit lower, i.e., M ∈ [20..31] for SC technique.
The Hash Table is visualized horizontally like an array where index 0 is placed at the leftmost of the first row and index M-1 is placed at the rightmost of the last row but the details are different when we are visualizing Open Addressing (usually spans multiple rows) versus Separate Chaining (only the top row) collision resolution techniques.
There are three Open Addressing collision resolution techniques discussed in this visualization: Linear Probing (LP), Quadratic Probing (QP), and Double Hashing (DH).
For all three techniques, each Hash Table cell is displayed as a vertex with cell value of [0..99] displayed as the vertex label (in 0.5x scale, the vertex label is displayed on top of the smaller black dot). Without loss of generality, we do not show any satellite data in this visualization as we concentrate only on the arrangement of the keys. We reserve value -1 to indicate an 'EMPTY cell' (visualized as a blank vertex) and -2 to indicate a 'DELETED cell' (visualized as a vertex with abbreviated label "DEL"). The cell indices ranging from [0..M-1] are shown as red label below each vertex (rows of 15 indices in 1.0x scale or rows of 25 indices in 0.5x scale).
For Separate Chaining (SC) collision resolution technique, the first row contains the M "H" (Head) pointers of M Doubly Linked Lists.
Then, each Doubly Linked List i contains all keys that are hashed into i in arbitrary order (in 0.5x scale, the vertex label is displayed on top of the smaller black dot). Mathematically, all keys that can be expressed as i (mod M) — including all duplicates of i — are hashed into DLL i. Again, we do not store any satellite data in this visualization.
例如,我们假设我们以表格大小M = HT.length = 7的空的散列表 HT开始,如上所示,使用索引0到 M-1 = 7-1 = 6。请注意,7是素数。 (主)散列函数很简单,h(v)= v%M。
此演练将向您展示使用线性探测作为冲突解决技术时Insert(v),Search(v)和Remove(v)操作所采取的步骤。
Now click
— three individual insertions in one command.Recap (to be shown after you click the button above).
Formally, we describe Linear Probing index i as i = (base+step*1) % M where base is the (primary) hash value of key v, i.e., h(v) and step is the Linear Probing step starting from 1.
Tips: To do a quick mental calculation of a (small) Integer V modulo M, we simply subtract V with the largest multiple of M ≤ V, e.g., 18%7 = 18-14 = 4, as 14 is the largest multiple of 7 that is ≤ 18.
回顾(在点击上面的按钮后显示)
现在点击 - 你应该看到探测序列[0,1,2,3(找到键35)]。
现在单击 - [1,2,3,4,5(空单元格,因此在哈希表中找不到键8)]。
如果我们直接设置HT [i] = EMPTY单元,其中i是包含v的索引(在必要时进行线性探测之后),您是否意识到我们会导致问题? 为什么?
提示:查看以前的三张幻灯片,了解插入(v)和搜索(v)的行为。
The content of this interesting slide (the answer of the usually intriguing discussion point from the earlier slide) is hidden and only available for legitimate CS lecturer worldwide. This mechanism is used in the various flipped classrooms in NUS.
If you are really a CS lecturer (or an IT teacher) (outside of NUS) and are interested to know the answers, please drop an email to stevenhalim at gmail dot com (show your University staff profile/relevant proof to Steven) for Steven to manually activate this CS lecturer-only feature for you.
FAQ: This feature will NOT be given to anyone else who is not a CS lecturer.
现在点击 - [0,1(找到关键21,我们设置H [1] = DEL)]。
然后请继续讨论下一张幻灯片。
想象一下,如果我们错误地设置H [1] = EMPTY,会发生什么。
Although we can resolve collision with Linear Probing, it is not the most effective way.
We define a cluster to be a collection of consecutive occupied slots. A cluster that covers the base address of a key is called the primary cluster of the key.
Now notice that Linear Probing can create large primary clusters that will increase the running time of Search(v)/Insert(v)/Remove(v) operations beyond the advertised O(1).
See an example above with M = 31 and we have inserted 15 keys [0..14] so that they occupy cells [0..14] (α = 15/31 < 0.5). Now see how 'slow'
(the 16th key) is.The probe sequence of Linear Probing can be formally described as follows:
h(v) // base address
(h(v) + 1*1) % M // 1st probing step if there is a collision
(h(v) + 2*1) % M // 2nd probing step if there is still a collision
(h(v) + 3*1) % M // 3rd probing step if there is still a collision
...
(h(v) + k*1) % M // k-th probing step, etc...
During Insert(v), if there is a collision but there is an empty (or DEL) slot remains in the Hash Table, we are sure to find it after at most M Linear Probing steps, i.e., in O(M). And when we do, the collision will be resolved, but the primary cluster of the key v is expanded as a result and future Hash Table operations will get slower too. Try the slow
on the same Hash Table as in the previous slide but with many DEL markers (suppose {4, 5, 8, 9, 10, 12, 14} have just been deleted).In the previous slide (Primary Clustering, Part 1), we break the assumption that the hash function should uniformly distribute keys around [0..M-1]. In the next example, we will show that the problem of primary clustering can still happen even if the hash function distribute the keys into several relatively short primary clusters around [0..M-1].
On screen, you see M = 31 with 15 random integers between [0..99] inserted (there are several random but short primary clusters). If we then insert these next 4 keys {2, 9, 12, 1}, the first three keys will "plug" the three empty cells and accidentally annex (or combine) those neighboring (but previously disjointed) clusters into a (very) long primary cluster. So the next insertion of a key 1 that lands at (the beginning of) this long primary cluster will end up performing almost O(M) probing steps just to find an empty cell. Try
.h(v)//基地址
(h(v)+ 1 * 1)%M //第一个探测步骤,如果发生碰撞
(h(v)+ 2 * 2)%M //第2次探测步骤,如果仍有冲突
(h(v)+ 3 * 3)%M //第三次探测步骤,如果仍有冲突
...
(h(v)+ k * k)%M //第k个探测步骤等...
就是这样,探针按照二次方跳转,根据需要环绕哈希表。
由于这是一种不同的二次探测,所以一个非常常见的错误就是:做H(V),(H(v)的1)%M,(H(v)的+ 1 + 4)%M,(H(V)+ 1 + 4 + 9)%M,...
现在,我们点击 。
回顾(点击上面的按钮后显示)。
例如,假设我们在上一张幻灯片之后调用了Remove(18),并且标记了HT [4] = DEL。 如果我们调用 ,我们将使用与上一张幻灯片相同的二次探测序列,但是会通过标记为DELETED的HT [4]。
尝试 。
你意识到刚刚发生了什么吗?
但是,即使我们仍然有3个空槽,我们仍然存在主要问题:h(17)= 17%7 = 3已被键10占用,(3 + 1 * 1)%7 = 4已被占用 (3 + 2 * 2)%7 = 0已被键38占用,(3 + 3 * 3)%7 = 5已被键12占用,(3 + 4 * 4)%7 = 5 (3 + 5 * 5)%7 = 0再次被键38占据,(3 + 6 * 6)%7 = 4又被键18占据,(3 + 7 * 7)%7 = 3再次被键10占用,如果我们继续二次探测,它将永远循环...
尽管我们仍然有几个(3)空单元格,但我们无法将此新值17插入到哈希表中。
If α < 0.5 and M is a prime (> 3), then we can always find an empty slot using (this form of) Quadratic Probing. Recall: α is the load factor and M is the Hash Table size (HT.length).
If the two requirements above are satisfied, we can prove that the first M/2 Quadratic Probing indices, including the base address h(v) are all distinct and unique.
But there is no such guarantee beyond that. Hence if we want to use Quadratic Probing, we need to ensure that α < 0.5 (not enforced in this visualization but we do break the loop after M steps to prevent infinite loop).
h(v) + x*x = h(v) + y*y (mod M)
x*x - y*y = 0 (mod M) //将y * y移动到左边
现在,(x-y)或(x + y)必须等于零。 我们的假设是x != y, 那么(x-y)不能为0. 由于0≤ x
因此,第一个M / 2二次探测步骤不能产生相同的地址模M(如果我们将M设置为大于3的质数)。
讨论:我们可以使二次探测能够使用其他约50%的表格单元吗?
The content of this interesting slide (the answer of the usually intriguing discussion point from the earlier slide) is hidden and only available for legitimate CS lecturer worldwide. This mechanism is used in the various flipped classrooms in NUS.
If you are really a CS lecturer (or an IT teacher) (outside of NUS) and are interested to know the answers, please drop an email to stevenhalim at gmail dot com (show your University staff profile/relevant proof to Steven) for Steven to manually activate this CS lecturer-only feature for you.
FAQ: This feature will NOT be given to anyone else who is not a CS lecturer.
In Quadratic Probing, clusters are formed along the path of probing, instead of around the base address like in Linear Probing. These clusters are called Secondary Clusters and it is 'less visible' compared to the Primary Clusters that plagued the Linear Probing.
Secondary clusters are formed as a result of using the same pattern in probing by colliding keys, i.e., if two distinct keys have the same base address, their Quadratic Probing sequences are going to be the same.
To illustrate this, see the screen with M = 31. We have populated this Hash Table with only 10 keys (so load factor α = 10/31 ≤ 0.5) and the Hash Table looks 'sparse enough' (no visibly big primary cluster). However, if we then insert
, despite the fact that there are many (31-10 = 21) empty cells and 62 != 93 (different keys that ends up hashed into index 0), we end up doing 10 probing steps along this 'less visible' secondary cluster (notice that both {62, 93} follow similar Quadratic Probing sequences).Secondary clustering in Quadratic Probing is not as bad as primary clustering in Linear Probing as a good hash function should theoretically disperse the keys into different base addresses ∈ [0..M-1] in the first place.
h(v)//基地址
(h(v)+ 1 * h2(v))%M //第一个探测步骤,如果有碰撞
(h(v)+ 2 * h2(v))%M //第2次探测步骤,如果仍有冲突
(h(v)+ 3 * h2(v))%M //第三次探测步骤,如果仍有冲突
...
(h(v)+ k * h2(v))%M //第k个探测步骤等...
就是这样,探测器根据第二个散列函数h2(v)的值跳转,根据需要环绕散列表。
如果h2(v)= 0,那么Double Hashing不起作用的原因很明显,因为任何探测步数乘以0仍然是0,即我们在碰撞期间永远停留在基地址我们需要避免这种情况。
通常(对于整数键),h2(v)= M' - v%M'其中M'是一个小于M的质数。这使得h2(v)∈[1..M'],它足够多样 二次聚类。
二次散列函数的使用使得理论上难以产生主要或次要群集问题。
回顾(点击上面的按钮后显示)。
例如,假设我们在上一张幻灯片之后调用了Remove(17),并且标记了HT [3] = DEL。 如果我们调用 ,我们将使用与前一张幻灯片相同的双哈希序列,但是会通过标记为DELETED的HT [3]。
In summary, a good Open Addressing collision resolution technique needs to:
- Always find an empty slot if it exists,
- Minimize clustering (of any kind),
- Give different probe sequences when 2 different keys collide,
- Fast, O(1).
Now, let's see the same test case that plagues Quadratic Probing earlier. Now try again. Although h(62) = h(93) = 0 and their collide with 31 that already occupy index 0, their probing steps are not the same: h2(62) = 29-62%29 = 25 is not the same as h2(93) = 29-93%29 = 23.
Discussion: Double Hashing seems to fit the bill. But... Is Double Hashing strategy flexible enough to be used as the default library implementation of a Hash Table? Let's see...
Try
to see how Insert(v) operation works if we use Separate Chaining as collision resolution technique. On such random insertions, the performance is good and each insertion is clearly O(1).However if we try
, notice that all Integers {68,90} are 2 (modulo 11) so all of them will be appended into the (back of) Doubly Linked List 2. We will have a long chain in that list. Note that due to the screen limitation, we limit the length of each Doubly Linked List to be at maximum 6.Try
to see that Search(v) can be made to run in O(1+α).Try
to see that Remove(v) can be made to run in O(1+α) too.If α is large, Separate Chaining performance is not really O(1). However, if we roughly know the potential maximum number of keys N that our application will ever use, then we can set table size M accordingly such that α = N/M is a very low positive (floating-point) number, thereby making Separate Chaining performances to be expected O(1).
The content of this interesting slide (the answer of the usually intriguing discussion point from the earlier slide) is hidden and only available for legitimate CS lecturer worldwide. This mechanism is used in the various flipped classrooms in NUS.
If you are really a CS lecturer (or an IT teacher) (outside of NUS) and are interested to know the answers, please drop an email to stevenhalim at gmail dot com (show your University staff profile/relevant proof to Steven) for Steven to manually activate this CS lecturer-only feature for you.
FAQ: This feature will NOT be given to anyone else who is not a CS lecturer.
您已经完成了这个哈希表数据结构的基本工作,我们鼓励您在探索模式中进一步探索。
但是,我们仍然会为您提供一些本节中概述的更有趣的哈希表的挑战。
如果发生这种情况,我们可以重新散列(rehash)。 我们用一个新的散列函数构建另一个大约两倍的散列表。 我们遍历原始哈希表中的所有键,重新计算新的哈希值,然后将键(及其卫星数据)重新插入新的更大的哈希表中,最后删除较早的较小哈希表。
一个经验法则是,如果使用开放寻址,并且当α>小的常数(根据需要接近1.0),如果使用单独链接,当α≥0.5时重新散列。
如果我们知道可能的最大键数,我们可以始终将α作为一个低的数字。
但是,如果您需要使用C ++, Python或Java实现哈希表,并且您的键(Keys)是整数或字符串,则可以使用内置的C ++ STL,Python标准库或Java API。 他们已经有了整数或字符串默认散列函数的内置实现。
请参阅C ++ STL unordered_map,unordered_set, Python dict, set, 或 Java HashMap,HashSet。
请注意,multimap / multiset实现也存在(允许重复键)。
对于OCaml, 我们可以用 Hashtbl。
这里是我们分离链接法的实现: HashTableDemo.cpp | py | java.
The content of this interesting slide (the answer of the usually intriguing discussion point from the earlier slide) is hidden and only available for legitimate CS lecturer worldwide. This mechanism is used in the various flipped classrooms in NUS.
If you are really a CS lecturer (or an IT teacher) (outside of NUS) and are interested to know the answers, please drop an email to stevenhalim at gmail dot com (show your University staff profile/relevant proof to Steven) for Steven to manually activate this CS lecturer-only feature for you.
FAQ: This feature will NOT be given to anyone else who is not a CS lecturer.
如果(整数或字符串)键只需要映射到卫星数据,则散列表是实现Table ADT的非常好的数据结构,对于Search(v),Insert(v)和Remove( v)如果哈希表设置正确,则执行时间复杂度为O(1)。
但是,如果我们需要更多地使用 键,我们可能需要使用另一种数据结构。
关于这个数据结构的一些更有趣的问题,请在哈希表培训模块上练习(不需要登录,但是只需短暂且中等难度设置)。
但是,对于注册用户,您应该登录,然后转到主要培训页面以正式清除此模块,这些成果将记录在您的用户帐户中。
尝试解决一些基本的编程问题,就很可能需要使用哈希表(特别是如果输入大小很大的时候):
- Kattis - cd(输入已经被排序,因此存在替代的非哈希表解决方案; 如果未对输入进行排序,则在哈希表的帮助下最好解决此组交集问题),
- Kattis - oddmanout(我们可以将较大的邀请代码映射到较小范围的整数;这是整数散列(大范围)的一种做法),
- Kattis - whatdoesthefoxsay(我们把不是狐狸的声音放入无序集合;这是散列字符串的做法)。
You have reached the last slide. Return to 'Exploration Mode' to start exploring!
Note that if you notice any bug in this visualization or if you want to request for a new visualization feature, do not hesitate to drop an email to the project leader: Dr Steven Halim via his email address: stevenhalim at gmail dot com.
创建(M, N)
搜索(v)
插入(v)
移除(v)