# Hash Table

## 1. Introduction

A Hash Table is a data structure that maps keys to values (it implements the Table or Map Abstract Data Type/ADT). It uses a hash function to map large or even non-Integer keys into a small range of Integer indices (typically [0..hash_table_size-1]).

The probability of two distinct keys colliding into the same index is relatively high, and each of these potential collisions needs to be resolved to maintain data integrity.

There are several collision resolution strategies that will be highlighted in this visualization: Open Addressing (Linear Probing, Quadratic Probing, and Double Hashing) and Closed Addressing (Separate Chaining). Try clicking Search(8) for a sample animation of searching a value in a Hash Table using Separate Chaining technique.

## 2. Motivation

A Table ADT must support at least the following three operations as efficiently as possible:

1. Search(v) — determine if v exists in the ADT or not,
2. Insert(v) — insert v into the ADT,
3. Remove(v) — remove v from the ADT.

A Hash Table is one possible good implementation for this Table ADT (another one is the balanced Binary Search Tree).

PS1: For two weaker implementations of the Table ADT — an unsorted array or a sorted array — you can click the respective links to read the detailed discussions.

PS2: In live class, you may want to compare the requirements of Table ADT vs List ADT.

### 2-2. Direct Addressing Table (DAT)

When the range of the Integer keys is small, e.g., [0..M-1], we can use an initially empty (Boolean) array A of size M and implement the following Table ADT operations directly:

1. Search(v): Check if A[v] is true (filled) or false (empty),
2. Insert(v): Set A[v] to be true (filled),
3. Remove(v): Set A[v] to be false (empty).

That's it: we use the small Integer key itself to determine the address in array A, hence the name Direct Addressing. It is clear that all three major Table ADT operations run in O(1).
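The three O(1) operations above are short enough to sketch directly; here is a minimal C++ sketch (the struct name and the constructor parameter are illustrative choices, not from the text):

```cpp
#include <cassert>
#include <vector>

// Direct Addressing Table: the small Integer key v itself is the index.
struct DAT {
  std::vector<bool> A;                       // initially all false (empty)
  explicit DAT(int M) : A(M, false) {}
  bool search(int v) const { return A[v]; }  // O(1)
  void insert(int v) { A[v] = true; }        // O(1)
  void remove(int v) { A[v] = false; }       // O(1)
};
```

With `DAT t(1000);`, the bus route example in the next slides fits directly (route numbers in [2..991]).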

PS: This idea is also used elsewhere, e.g., in Counting Sort.

### 2-3. DAT Example

In Singapore (as of Sep 2021), bus routes are numbered from [2..991].

Not all integers between [2..991] are currently used, e.g., there is no bus route 989 — Search(989) should return false. A new bus route x may be introduced, i.e., Insert(x) or an existing bus route y may be discontinued, i.e., Remove(y).

As the range of possible bus routes is small, to record whether each bus route number exists or not, we can use a DAT with a Boolean array of size 1000.

Discussion: In a live class, we may discuss why we use 1000 instead of 991 (or 992).

### 2-4. DAT Example with Satellite Data

Notice that we can always add satellite data instead of just using a Boolean array to record the existence of the keys.

For example, we can use an associative String array A instead to map a bus route number to its operator name, e.g.,

```
A[2]   = "Go-Ahead Singapore"
A[10]  = "SBS Transit"
A[183] = "Tower Transit Singapore"
A[188] = "SMRT Buses"
...
```

Discussion: Can you think of a few other real-life DAT examples?

### 2-5. Answer

[This is a hidden slide]

### 2-6. Limitations of DAT

The keys must be (or must be easily mapped to) non-negative Integer values. Note that a basic DAT has problems with the full version of the example in the previous few slides, as there are actually variations of bus route numbers in Singapore, e.g., 96B, 151A, NR10, etc.

The range of keys must be small; the memory usage will be (insanely) large if we have an (insanely) large range.

The keys must be dense, i.e., there should not be many gaps in the key values; otherwise the DAT will contain too many empty (and wasted) cells.

We will overcome these restrictions with hashing.

## 3. Hashing: The Idea

Using hashing, we can:

1. Map (some) non-Integer keys (e.g., Strings) to Integer keys, and
2. Map large Integers to smaller Integers.

### 3-1. Phone Number Example

For example, we have N = 400 Singapore phone numbers (Singapore phone number has 8 digits, so there are up to 10^8 = 100M possible phone numbers in Singapore).

Instead of using a DAT with a gigantic array of size up to M = 100 Million, we can use the simple hash function h(v) = v%997.

This way, we map the 8-digit phone numbers 6675 2378 and 6874 4483 into at most 3-digit hash values h(6675 2378) = 237 and h(6874 4483) = 336, respectively. Therefore, we only need to prepare an array of size M = 997 (or simply 1000) instead of M = 100 Million.
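The two hash values above can be checked with a short sketch (the function name `h` matches this slide; the 64-bit parameter type is just a safe choice for 8-digit inputs):

```cpp
#include <cassert>

const int M = 997;                           // prime table size
int h(long long v) { return (int)(v % M); }  // simple modulo hash
// h(66752378) == 237 and h(68744483) == 336, as stated in the text.
```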

### 3-2. Hash Table Preview

1. Search(v): Check whether A[h(v)] != -1 (assuming v ≥ 0, we use -1 for an empty cell),
2. Insert(v): Set A[h(v)] = v (we hash v into h(v), so we need to somehow record the key v),
3. Remove(v): Set A[h(v)] = -1 — to be elaborated further.
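The three preview operations can be sketched as follows (a deliberately naive sketch: colliding keys simply overwrite each other, which is exactly the problem the rest of this e-Lecture addresses; `erase` is used instead of `remove` only to avoid clashing with the C standard library name):

```cpp
#include <cassert>
#include <vector>

const int M = 997;               // table size
std::vector<int> A(M, -1);       // -1 marks an EMPTY cell; keys are assumed >= 0
int h(int v) { return v % M; }

bool search(int v) { return A[h(v)] == v; }  // also implies A[h(v)] != -1
void insert(int v) { A[h(v)] = v; }          // record the key v itself
void erase(int v)  { A[h(v)] = -1; }         // to be elaborated further
```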

### 3-3. Hash Table with Satellite Data

1. Search(v): Return A[h(v)], which is a pair (v, satellite-data), possibly empty,
2. Insert(v, satellite-data): Set A[h(v)] = the pair (v, satellite-data),
3. Remove(v): Set A[h(v)] = (empty pair) — to be elaborated further.

### 3-4. Collisions

A hash function may, and quite likely will, map different keys (Integer or not) into the same Integer slot, i.e., a many-to-one instead of a one-to-one mapping.

For example, h(6675 2378) = 237 from three slides earlier, and if we want to insert another phone number, 6675 4372, we will have a problem as h(6675 4372) = 237 too.

This situation is called a collision, i.e., two (or more) keys have the same hash value.

### 3-5. Probability of Collision

The Birthday (von Mises) Paradox asks: 'How many people (keys) must be in a room (a Hash Table) of 365 seats (cells) before the probability that some two people share a birthday (a collision: two keys hashed to the same cell), ignoring leap years (i.e., all years have 365 days), becomes > 50 percent (i.e., more likely than not)?'

The answer, which may be surprising to some of us, is revealed by the calculation below.

Let's do some calculation.

### 3-6. The Calculation

Let Q(n) be the probability of unique birthday for n people in a room.
Q(n) = 365/365 × 364/365 × 363/365 × ... × (365-n+1)/365,
i.e., the first person's birthday can be any of the 365 days, the second person's birthday can be any of the 365 days except the first person's birthday, and so on.

Let P(n) be the probability of same birthday (collision) for n people in a room.
P(n) = 1-Q(n).

We compute that P(23) = 0.507 > 0.5 (50%).

Thus, we only need 23 people (a small number of keys) in the room (Hash Table) of 365 seats (cells) for a (more than) 50% chance of a collision (two different people in that room sharing one of the 365 days/slots).
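The two formulas above can be replayed numerically; a short sketch following the definitions of Q(n) and P(n) in this slide:

```cpp
#include <cassert>

// Q(n): probability that n people all have distinct birthdays (365-day year).
double Q(int n) {
  double q = 1.0;
  for (int i = 0; i < n; ++i) q *= (365.0 - i) / 365.0;
  return q;
}

// P(n): probability that at least two of the n people share a birthday.
double P(int n) { return 1.0 - Q(n); }
// P(22) is still below 0.5; P(23) is about 0.507, crossing the 50% mark.
```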

## 4. Hash Functions

How to create a good hash function with these desirable properties?

1. It is fast to compute, i.e., O(1),
2. It uses as small a Hash Table size M as possible,
3. It scatters the keys into different base addresses ∈ [0..M-1] as uniformly as possible,
4. It experiences as few collisions as possible.

### 4-3. Answer

[This is a hidden slide]

### 4-4. Perfect Hash Functions

Before discussing the reality, let's discuss the ideal case: perfect hash functions.

A perfect hash function is a one-to-one mapping between keys and hash values, i.e., no collisions at all. It is possible if all keys are known beforehand, e.g., a compiler/interpreter searching for reserved keywords. However, such cases are rare.

A minimal perfect hash function is achieved when the table size is the same as the number of keywords supplied. This case is even rarer.

If you are interested, you can explore GNU gperf, a freely available perfect hash function generator written in C++ that automatically constructs perfect functions (a C++ program) from a user supplied list of keywords.

### 4-5. Hashing Integers — Best Practice

People have tried various ways to hash a large range of Integers into a smaller range of Integers as uniformly as possible. In this e-Lecture, we jump directly to one of the best and most popular versions: h(v) = v%M, i.e., map v into a Hash Table of M slots. The (%) is the modulo operator that gives the remainder after division. This is clearly fast, i.e., O(1), assuming that v does not exceed the natural Integer data type limit.

The Hash Table size M is set to be a reasonably large prime not near a power of 2, about 2+ times larger than the expected number of keys N that will ever be used in the Hash Table. This way, the load factor α = N/M < 0.5 — we shall see later that having a low load factor, thereby sacrificing empty space, helps improve Hash Table performance.

Discuss: What if we set M to be a power of 10 (decimal) or power of 2 (binary)?

### 4-6. Answer

[This is a hidden slide]

### 4-7. Hashing Strings — Best Practice

People have also tried various ways to hash Strings into a small range of Integers as uniformly as possible. In this e-Lecture, we jump directly to one of the best and most popular versions, shown below:

```cpp
int hash_function(string v) { // assumption 1: v uses ['A'..'Z'] only
  int sum = 0;                // assumption 2: v is a short string
  for (auto& c : v)           // for each character c in v
    sum = ((sum*26)%M + (c-'A'+1))%M; // M is the table size
  return sum;
}
```

Discussion: In a live class, discuss the components of the hash function above, e.g., why loop through all characters? Will that be slower than O(1)? Why multiply by 26? What if the string v uses more than just UPPERCASE characters? Etc.
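For reference, here is a self-contained, runnable version of that function; M = 997 is an assumed prime table size (the slide leaves M as an unspecified global):

```cpp
#include <cassert>
#include <string>

const int M = 997;  // assumed prime table size; the slide does not fix M

// Treats v as a base-26 number (A=1, ..., Z=26) and reduces it modulo M
// incrementally, so intermediate values never overflow for reasonable M.
int hash_function(const std::string& v) { // assumption 1: v uses ['A'..'Z'] only
  int sum = 0;                            // assumption 2: v is a short string
  for (char c : v)
    sum = ((sum * 26) % M + (c - 'A' + 1)) % M;
  return sum;                             // a base address in [0..M-1]
}
```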

### 4-8. Answer

[This is a hidden slide]

## 5. Collision Resolution

HT_size = the size of the current Hash Table

### 5-2. Separate Chaining (SC)

The Separate Chaining (SC) collision resolution technique is simple. We use M copies of auxiliary data structures, usually Doubly Linked Lists. If two keys a and b both have the same hash value i, both will be appended to the (front/back of) Doubly Linked List i (in this visualization, we append to the back in O(1) with the help of a tail pointer). That's it: where the keys will be slotted is completely dependent on the hash function itself, hence Separate Chaining is also called a Closed Addressing collision resolution technique.

If we use Separate Chaining, the load factor α = N/M describes the average length of the M lists, and it determines the performance of Search(v), as we may have to explore α elements on average. As Remove(v) also requires Search(v), its performance is similar to that of Search(v). Insert(v) is clearly O(1).

If we can bound α to be a small constant (true if we know the expected largest N in our Hash Table application so that we can set up M accordingly), then all Search(v), Insert(v), and Remove(v) operations using Separate Chaining will be O(1).
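The description above can be sketched with M doubly linked lists (`std::list` in C++); the class and method names here are illustrative, not from the linked HashTableDemo.cpp:

```cpp
#include <cassert>
#include <list>
#include <vector>

// Separate Chaining: M doubly linked lists; keys are appended to the back.
class SCHashTable {
  int M;                                // table size, ideally a suitable prime
  std::vector<std::list<int>> table;
  int h(int v) const { return v % M; }
public:
  explicit SCHashTable(int m) : M(m), table(m) {}
  void insert(int v) { table[h(v)].push_back(v); }  // O(1) via tail pointer
  bool search(int v) const {                        // O(1 + alpha) on average
    for (int x : table[h(v)]) if (x == v) return true;
    return false;
  }
  void remove(int v) { table[h(v)].remove(v); }     // search + unlink: O(1 + alpha)
};
```

For example, with `SCHashTable t(11);`, inserting 79, 68, and 90 chains all three keys in list 2, since all are 2 (modulo 11).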

## 6. Visualization

View the visualization of the Hash Table above.

In this visualization, we prevent insertion of duplicate keys.

Due to limited screen space, we limit the maximum Hash Table size to be M = 19.

The Hash Table is visualized horizontally like an array where index 0 is placed leftmost and index M-1 is placed rightmost but the details are different when we are visualizing Open Addressing versus Separate Chaining collision resolution techniques.

Discuss: What to modify in order to allow duplicate keys?

### 6-1. Open Addressing Versions

There are three Open Addressing collision resolution techniques discussed in this visualization: Linear Probing (LP), Quadratic Probing (QP), and Double Hashing (DH).

For all three techniques, each Hash Table cell is displayed as a vertex with its cell value in [0..99] displayed as the vertex label. Without loss of generality, we do not show any satellite data in this visualization as we concentrate only on the arrangement of the keys. We reserve the value -1 to indicate an 'EMPTY cell' (visualized as a blank vertex) and -2 to indicate a 'DELETED cell' (visualized as a vertex with the abbreviated label "DEL"). The cell indices, ranging over [0..M-1], are shown as red labels below each vertex.

## 7. Linear Probing

### 7-1. Insert([18,14,21])

Now click Insert([18,14,21]) — three individual insertions in one command.

Recap (to be shown after you click the button above).

Formally, we describe the Linear Probing index i as i = (base + step*1) % M, where base is the (primary) hash value of key v, i.e., h(v), and step is the Linear Probing step starting from 1.

Tips: To do a quick mental calculation of a (small) Integer V modulo M, we simply subtract V by the largest multiple of M that is ≤ V, e.g., 18%7 = 18-14 = 4, as 14 is the largest multiple of 7 that is ≤ 18.

### 7-3. Search(35) and Search(8)

Now we illustrate the Search(v) operation while using Linear Probing as the collision resolution technique. The steps taken are very similar to the Insert(v) operation: we start from the (primary) hash value and check whether we have found v; otherwise, we move one index forward at a time (wrapping around if necessary) and recheck. We stop when we encounter an empty cell, which implies that v is not in the Hash Table at all (an earlier Insert(v) would have placed v there otherwise).

Now click Search(35) — you should see probing sequence [0,1,2,3 (key 35 found)].

Now click Search(8) — [1,2,3,4, 5 (empty cell, so key 8 is not found in the Hash Table)].

### 7-5. Answer

[This is a hidden slide]

### 7-9. Primary Clustering, Part 1

Although we can resolve collision with Linear Probing, it is not the most effective way.

We define a cluster to be a collection of consecutive occupied slots. A cluster that covers the base address of a key is called the primary cluster of the key.

Now notice that Linear Probing can create large primary clusters that will increase the running time of Search(v)/Insert(v)/Remove(v) operations beyond the advertised O(1).

See an example above with M = 11 and we have inserted 8 keys that are all 6 (modulo 11), i.e., all have remainder 6 when divided by 11. Now see how 'slow' Insert(94) (the 9th key) is.

### 7-10. Linear Probing Sequence

The probe sequence of Linear Probing can be formally described as follows:

```
h(v)             // base address
(h(v) + 1*1) % M // 1st probing step if there is a collision
(h(v) + 2*1) % M // 2nd probing step if there is still a collision
(h(v) + 3*1) % M // 3rd probing step if there is still a collision
...
(h(v) + k*1) % M // k-th probing step, etc...
```

During Insert(v), if there is a collision but an empty (or DEL) slot remains in the Hash Table, we are sure to find it after at most M Linear Probing steps, i.e., in O(M). When we do, the collision is resolved, but the primary cluster of the key v expands as a result, and future Hash Table operations get slower too. Try the slow Search(94) on the same Hash Table as in the previous slide but with many DEL markers.
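The probe sequence above can be sketched as a minimal Linear Probing table; -1 marks EMPTY as in the visualization, the class name is illustrative, and Remove with DEL markers is deliberately omitted:

```cpp
#include <cassert>
#include <vector>

const int EMPTY = -1;  // matches the visualization's empty-cell marker

class LPHashTable {
  int M;
  std::vector<int> A;
  int h(int v) const { return v % M; }
public:
  explicit LPHashTable(int m) : M(m), A(m, EMPTY) {}
  bool insert(int v) {                          // false only if the table is full
    for (int step = 0; step < M; ++step) {      // at most M linear probes
      int i = (h(v) + step) % M;
      if (A[i] == EMPTY) { A[i] = v; return true; }
    }
    return false;
  }
  bool search(int v) const {
    for (int step = 0; step < M; ++step) {
      int i = (h(v) + step) % M;
      if (A[i] == v) return true;
      if (A[i] == EMPTY) return false;          // stop at the first empty cell
    }
    return false;
  }
};
```

With `LPHashTable t(7);`, the Insert([18,14,21]) example works out as in the slides: 18 lands at index 4, 14 at index 0, and 21 collides at 0 and is placed at index 1.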

### 7-11. Primary Clustering, Part 2

In the previous slide (Primary Clustering, Part 1), we broke the assumption that the hash function should uniformly distribute keys around [0..M-1]. In the next example, we show that primary clustering can still happen even if the hash function distributes the keys somewhat uniformly around [0..M-1].

On screen, you see M = 11 where the first 4 keys {11, 2, 4, 6} have been inserted. If we then insert the next 4 keys {22, 24, 26, 28}, these keys initially collide with the cells that already contain {11, 2, 4, 6}, take 'short' one-step probes, and by doing so 'plug' the empty cells, accidentally annexing (combining) those neighboring (but previously disjoint) clusters into one long primary cluster. The next insertion of a key {33} that lands at (the beginning of) this long primary cluster then performs almost O(M) probing steps just to find an empty cell. Try Insert([22,24,26,28,33]).

## 8. Quadratic Probing (QP)

```
h(v)             // base address
(h(v) + 1*1) % M // 1st probing step if there is a collision
(h(v) + 2*2) % M // 2nd probing step if there is still a collision
(h(v) + 3*3) % M // 3rd probing step if there is still a collision
...
(h(v) + k*k) % M // k-th probing step, etc...
```

### 8-6. Proof

```
h(v) + x*x ≡ h(v) + y*y (mod M)
x*x ≡ y*y (mod M)         // cancel h(v) from both sides
x*x - y*y ≡ 0 (mod M)     // move y*y to the left side
(x-y)*(x+y) ≡ 0 (mod M)   // factorize the left side
```
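A consequence of the derivation above: for a prime M, two probe steps x and y below ⌈M/2⌉ can only land on the same cell if x = y, so the first ⌈M/2⌉ quadratic probes are distinct. A small sketch confirming this count for M = 7 (an illustrative choice, not from the slide):

```cpp
#include <cassert>
#include <set>

// Number of distinct cells visited by the first M quadratic probes
// from base address `base` in a table of (prime) size M.
int distinct_quadratic_probes(int base, int M) {
  std::set<int> visited;
  for (int k = 0; k < M; ++k)
    visited.insert((base + k * k) % M);  // k-th quadratic probe
  return (int)visited.size();
}
// For M = 7 and base 0, only cells {0, 1, 2, 4} are reached: ceil(7/2) = 4 cells.
```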

### 8-7. Better Quadratic Probing

[This is a hidden slide]

### 8-8. Secondary Clustering

In Quadratic Probing, clusters are formed along the path of probing, instead of around the base address as in Linear Probing. These clusters are called Secondary Clusters, and they are 'less visible' than the Primary Clusters that plague Linear Probing.

Secondary clusters are formed as a result of using the same pattern in probing by colliding keys, i.e., if two distinct keys have the same base address, their Quadratic Probing sequences are going to be the same.

To illustrate this, see the screen with M = 19. We have populated this Hash Table with only 7 keys (so the load factor α = 7/19 < 0.5) and the Hash Table looks 'sparse enough' (no visibly big primary cluster). However, if we then call Insert(38), despite the fact that there are many (19-7 = 12) empty cells and 19 ≠ 38 (different keys that end up hashed into index 0), we end up doing 7 probing steps along this 'less visible' secondary cluster.

Secondary clustering in Quadratic Probing is not as bad as primary clustering in Linear Probing as a good hash function should theoretically disperse the keys into different base addresses ∈ [0..M-1] in the first place.

## 9. Double Hashing (DH)

```
h(v)                 // base address
(h(v) + 1*h2(v)) % M // 1st probing step if there is a collision
(h(v) + 2*h2(v)) % M // 2nd probing step if there is still a collision
(h(v) + 3*h2(v)) % M // 3rd probing step if there is still a collision
...
(h(v) + k*h2(v)) % M // k-th probing step, etc...
```

### 9-4. Good Open Addressing (OA) Collision Resolution

In summary, a good Open Addressing collision resolution technique needs to:

1. Always find an empty slot if one exists,
2. Minimize clustering (of any kind),
3. Give different probe sequences when two different keys collide,
4. Be fast, i.e., O(1).

Now, let's see the same test case that plagued Quadratic Probing earlier. Try Insert(38) again. Although h(19) = h(38) = 0 and they collide, their probing steps are not the same: h2(19) = 17-19%17 = 15 is not the same as h2(38) = 17-38%17 = 13.
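The two secondary hash values above can be checked directly; h2(v) = 17 - v%17 is the secondary hash this slide implies for the M = 19 example:

```cpp
#include <cassert>

// Secondary hash for the M = 19 example: a value in [1..17],
// never 0, so each colliding key always makes progress.
int h2(int v) { return 17 - v % 17; }
// h2(19) = 17 - 2 = 15 and h2(38) = 17 - 4 = 13, so the two colliding
// keys follow different probe sequences.
```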

Discussion: Double Hashing seems to fit the bill. But... Is Double Hashing strategy flexible enough to be used as the default library implementation of a Hash Table? Let's see...

## 10. Separate Chaining (SC)

Try Insert([9,16,23,30,37,44]) to see how Insert(v) operation works if we use Separate Chaining as collision resolution technique. On such random insertions, the performance is good and each insertion is clearly O(1).

However, if we try Insert([79,68,90]), notice that all the Integers {79,68,90} are 2 (modulo 11), so all of them will be appended to the (back of) Doubly Linked List 2. We will have a long chain in that list. Note that due to screen limitations, we limit the length of each Doubly Linked List to 6.

### 10-1. Search(35) and Remove(7)

Try Search(35) to see that Search(v) can be made to run in O(1+α).

Try Remove(7) to see that Remove(v) can be made to run in O(1+α) too.

If α is large, Separate Chaining performance is not really O(1). However, if we roughly know the potential maximum number of keys N that our application will ever use, then we can set the table size M accordingly such that α = N/M is a very low positive (floating-point) number, thereby making the expected performance of Separate Chaining O(1).

### 10-3. (Current) Answer

[This is a hidden slide]

## 11. Extras

### 11-2. Hash Table Implementations

However, if you ever need a Hash Table in C++, Python, or Java, and your keys are either Integers or Strings, you can use the built-in C++ STL, Python standard library, or Java API, respectively. They already have good built-in implementations of default hash functions for Integers and Strings.

See C++ STL std::unordered_map, std::unordered_set, Python dict, set, or Java HashMap, HashSet.

For C++, note that std::multimap/std::multiset implementations also exist, where duplicate keys are allowed.

For OCaml, we can use Hashtbl.

However, here is our take on a simple Separate Chaining implementation: HashTableDemo.cpp | py | java.

### 11-3. Data Structure Combination?

[This is a hidden slide]

### 11-6. Online Judge Exercises

1. Kattis - cd (the input is already sorted, so alternative non-Hash Table solutions exist; if the input were not sorted, this set intersection problem would be best solved with the help of a Hash Table),
2. Kattis - oddmanout (we can map the large invitation codes into a smaller range of Integers; this is a practice of hashing Integers from a large range),
3. Kattis - whatdoesthefoxsay (we put the sounds that are not the fox's into an unordered set; this is a practice of hashing Strings).