7    VisuAlgo.net / /hashtable Login LP QP DH SC
Modo de Exploração ▿

>

>
lento
rápido
go to beginning previous frame pause play next frame go to end

Hash Table is a data structure to map key to values (also called Table or Map Abstract Data Type/ADT). It uses a hash function to map large or even non-Integer keys into a small range of Integer indices (typically [0..hash_table_size-1]).


The probability of two distinct keys colliding into the same index is relatively high and each of this potential collision needs to be resolved to maintain data integrity.


There are several collision resolution strategies that will be highlighted in this visualization: Open Addressing (Linear Probing, Quadratic Probing, and Double Hashing) and Closed Addressing (Separate Chaining). Try clicking Search(8) for a sample animation of searching a value in a Hash Table using Separate Chaining technique.


Click 'Next' (on the top right)/press 'Page Down' to advance this e-Lecture slide, use the drop down list/press 'Space' to jump to a specific slide, or Click 'X' (on the bottom right)/press 'Esc' to go to Exploration mode.


Remarks: By default, we show e-Lecture Mode for first time (or non logged-in) visitor.
Please login if you are a repeated visitor or register for an (optional) free account first.

X Esc
Próx. PgDn

Hashing is an algorithm (via a hash function) that maps large data sets of variable length, called keys, not necessarily Integers, into smaller Integer data sets of a fixed length.


A Hash Table is a data structure that uses a hash function to efficiently map keys to values (Table or Map ADT), for efficient search/retrieval, insertion, and/or removals.


Hash Table is widely used in many kinds of computer software, particularly for associative arrays, database indexing, caches, and sets.


In this e-Lecture, we will digress to Table ADT, the basic ideas of Hashing, the discussion of Hash Functions before going into the details of Hash Table data structure itself.


Pro-tip: Since you are not logged-in, you may be a first time visitor who are not aware of the following keyboard shortcuts to navigate this e-Lecture mode: [PageDown] to advance to the next slide, [PageUp] to go back to the previous slide, [Esc] to toggle between this e-Lecture mode and exploration mode.

X Esc
Ant. PgUp
Próx. PgDn

A Table ADT must support at least the following three operations as efficient as possible:

  1. Search(v) — determine if v exists in the ADT or not,
  2. Insert(v) — insert v into the ADT,
  3. Remove(v) — remove v from the ADT.

Hash Table is one possible good implementation for this Table ADT (the other one is this).


PS1: For two weaker implementations of Table ADT, you can click the respective link: unsorted array or a sorted array to read the detailed discussions.


PS2: In live class, you may want to compare the requirements of Table ADT vs List ADT.


Another pro-tip: We designed this visualization and this e-Lecture mode to look good on 1366x768 resolution or larger (typical modern laptop resolution in 2017). We recommend using Google Chrome to access VisuAlgo. Go to full screen mode (F11) to enjoy this setup. However, you can use zoom-in (Ctrl +) or zoom-out (Ctrl -) to calibrate this.

X Esc
Ant. PgUp
Próx. PgDn

When the range of the Integer keys is small, e.g. [0..M-1], we can use an initially empty (Boolean) array A of size M and implement the following Table ADT operations directly:

  1. Search(v): Check if A[v] is true (filled) or false (empty),
  2. Insert(v): Set A[v] to be true (filled),
  3. Remove(v): Set A[v] to be false (empty).

That's it, we use the small Integer key itself to determine the address in array A, hence the name Direct Addressing. It is clear that all three major Table ADT operations are O(1).

X Esc
Ant. PgUp
Próx. PgDn

In Singapore (as of Mar 2018), bus routes are numbered from [2..990].


Not all integers between [2..990] are currently used, e.g. there is no bus route 989 — Search(989) should return false. A new bus route x may be introduced, i.e. Insert(x) or an existing bus route y may be discontinued, i.e. Remove(y).


As the range of possible bus routes is small, to record the data whether a bus route number exists or not, we can use a DAT with a Boolean array of size 1 000.


Discussion: In real life class, we may discuss on why we use 1 000 instead of 990 (or 991).

X Esc
Ant. PgUp
Próx. PgDn

Notice that we can always add satellite data instead of just using a Boolean array to record the existence of the keys.


For example, we can use an associative String array A instead to map a bus route number to its operator name, e.g.

A[2] = "Go-Ahead Singapore",
A[10] = "SBS Transit",
A[183] = "Tower Transit Singapore",
A[188] = "SMRT Buses", etc.

Discussion: Can you think of a few other real-life DAT examples?

X Esc
Ant. PgUp
Próx. PgDn

e-Lecture: The content of this slide is hidden and only available for legitimate CS lecturer worldwide. Drop an email to visualgo.info at gmail dot com if you want to activate this CS lecturer-only feature and you are really a CS lecturer (show your University staff profile).

X Esc
Ant. PgUp
Próx. PgDn

The keys must be (or can be easily mapped to) non-negative Integer values.
Basic DAT has problem in the full version of the example in the previous two slides as there are actually variations of bus route numbers in Singapore, e.g. 96B, 151A, NR10, etc.


The range of keys must be small.
The memory usage will be (insanely) large if we have (insanely) large range.


The keys must be dense, i.e. not many gaps in the key values.
DAT will contain too many empty cells otherwise.


We will overcome these restrictions with hashing.

X Esc
Ant. PgUp
Próx. PgDn

Using hashing, we can:

  1. Map (some) non-Integer keys to Integers keys,
  2. Map large Integers to smaller Integers,
  3. Influence the density, or load factor α = N/M, of the Hash Table where N is the number of keys and M is the size of the Hash Table.
X Esc
Ant. PgUp
Próx. PgDn

For example, we have N = 400 Singapore phone numbers (Singapore phone number has 8 digits, so there are up to 10^8 = 100M possible phone numbers in Singapore).


Instead of using a DAT and use a gigantic array up to size M = 100 Million, we can use the following simple hash function h(v) = v%997.


This way, we map 8 digits phone numbers 6675 2378 and 6874 4483 into up to 3 digits h(6675 2378) = 237 and h(6874 4483) = 336, respectively. Therefore, we only need to prepare array of size M = 997 (or 1000) instead of M = 100 Million.

X Esc
Ant. PgUp
Próx. PgDn

With hashing, we can now implement the following Table ADT operations using Integer array (instead of Boolean array) as follows:

  1. Search(v): Check if A[h(v)] != -1 (we use -1 for an empty cell assuming v ≥ 0),
  2. Insert(v): Set A[h(v)] = v (we hash v into h(v) so we need to somehow record key v),
  3. Remove(v): Set A[h(v)] = -1 — to be elaborated further.
X Esc
Ant. PgUp
Próx. PgDn

If we have keys that map to satellite data and we want to record the original keys too, we can implement the Hash Table using pair of (Integer, satellite-data-type) array as follows:

  1. Search(v): Return A[h(v)], which is a pair (v, satellite-data), possibly empty,
  2. Insert(v, satellite-data): Set A[h(v)] = pair(v, satellite-data),
  3. Remove(v): Set A[h(v)] = (empty pair) — to be elaborated further.

However, by now you should notice that something is incomplete...

X Esc
Ant. PgUp
Próx. PgDn

A hash function may, and quite likely, map different keys (Integer or not) into the same Integer slot, i.e. a many-to-one mapping instead of one-to-one mapping.


For example, h(6675 2378) = 237 from three slides earlier and if we want to insert another phone number 6675 4372, we will have a problem as h(6675 4372) = 237 too.


This situation is called a collision, i.e. two (or more) keys have the same hash value.

X Esc
Ant. PgUp
Próx. PgDn

The Birthday (von Mises) Paradox asks: 'How many people (number of keys) must be in a room (a Hash Table of size 365 cells) before the probability that some share a birthday (collision, two keys are hashed to the same cell), ignoring the year and leap days (i.e. all years have 365 days), becomes > 50 percent (i.e. more likely than not)?'


The answer, which maybe surprising for some of us, is Reveal.


Let's do some calculation.

X Esc
Ant. PgUp
Próx. PgDn

Let Q(n) be the probability of unique birthday for n people in a room.
Q(n) = 365/365 × 364/365 × 363/365 × ... × (365-n+1)/365,
i.e. the first person's birthday can be any of the 365 days, the second person's birthday can be any of the 365 days except the first person's birthday, and so on.


Let P(n) be the probability of same birthday (collision) for n people in a room.
P(n) = 1-Q(n).


Now P(23) = 0.507 > 0.5 (50%).


Thus, we only need 23 people (a small amount of keys) in the room (Hash Table of size 365 cells) for a (more than) 50% chance collision to happen (the birthday of two different people in that room is one of 365 days/slots).

X Esc
Ant. PgUp
Próx. PgDn

Issue 1: We have seen a simple hash function like the h(v) = x%997 used in Phone Numbers example that maps large range of Integer keys into a smaller range of Integer keys, but how about non Integer keys? How to do such hashing efficiently?


Issue 2: We have seen that by hashing, or mapping, large range into smaller range, there will very likely be a collision. How to deal with them?

X Esc
Ant. PgUp
Próx. PgDn

How to create a good one with these desirable properties?

  1. Fast to compute, i.e. in O(1),
  2. Uses as minimum slots/Hash Table size M as possible,
  3. Scatter the keys into different base addresses as uniformly as possible ∈ [0..M-1],
  4. Experience as minimum collisions as possible.
X Esc
Ant. PgUp
Próx. PgDn

Suppose we have a hash table of size M where keys are used to identify the satellite-data and a specific hash function is used to compute a hash value.


A hash value/hash code of key v is computed from the key v with the use of a hash function to get an Integer in the range 0 to M-1. This hash value is used as the base/home index/address of the Hash Table entry for the satellite-data.

X Esc
Ant. PgUp
Próx. PgDn

Using the Phone Numbers example, if we we define h(v) = floor(v/1 000 000),
i.e. we select the first two digits a phone number.

h(66 75 2378) = 66
h(68 74 4483) = 68

Discuss: What happen when you use that hash function? Hint: See this.

X Esc
Ant. PgUp
Próx. PgDn

e-Lecture: The content of this slide is hidden and only available for legitimate CS lecturer worldwide. Drop an email to visualgo.info at gmail dot com if you want to activate this CS lecturer-only feature and you are really a CS lecturer (show your University staff profile).

X Esc
Ant. PgUp
Próx. PgDn

Before discussing the reality, let's discuss the ideal case: perfect hash functions.


A perfect hash function is a one-to-one mapping between keys and hash values, i.e. no collision at all. It is possible if all keys are known beforehand. For example, a compiler/interpreter search for reserved keywords. However, such cases are rare.


A minimal perfect hash function is achieved when the table size is the same as the number of keywords supplied. This case is even rarer.


If you are interested, you can explore GNU gperf, a freely available perfect hash function generator written in C++ that automatically constructs perfect functions (a C++ program) from a user supplied list of keywords.

X Esc
Ant. PgUp
Próx. PgDn

People has tried various ways to hash a large range of Integers into a smaller range of Integers as uniformly as possible. In this e-Lecture, we jump directly to one of the best and most popular version: h(v) = v%M, i.e. map v into Hash Table of size M slots. The (%) is a modulo operator that gives the remainder after division. This is clearly fast, i.e. O(1) assuming that v does not exceed natural Integer data type limit.


The Hash Table size M is set to be a reasonably large prime not near a power of 2, about 2+ times larger than the expected number of keys N that will ever be used in the Hash Table. This way, the load factor α = N/M < 0.5 — we shall see later that having low load factor, thereby sacrificing empty spaces, help improving Hash Table performance.


Discuss: What if we set M to be a power of 10 (decimal) or power of 2 (binary)?

X Esc
Ant. PgUp
Próx. PgDn

e-Lecture: The content of this slide is hidden and only available for legitimate CS lecturer worldwide. Drop an email to visualgo.info at gmail dot com if you want to activate this CS lecturer-only feature and you are really a CS lecturer (show your University staff profile).

X Esc
Ant. PgUp
Próx. PgDn

People has also tried various ways to hash Strings into a small range of Integers as uniformly as possible. In this e-Lecture, we jump directly to one of the best and most popular version, shown below:

int hash_function(string v) { // assumption 1: v uses ['A'..'Z'] only
int sum = 0; // assumption 2: v is a short string
for (auto &c : v) // for each character c in v
sum = ((sum*26)%M + (c-'A'+1))%M; // M is table size
return sum;
}

Discussion: In real life class, discuss the components of the hash function above, e.g. why loop through all characters?, will that be slower than O(1)?, why multiply with 26?, what if the string v uses more than just UPPERCASE chars?, etc

X Esc
Ant. PgUp
Próx. PgDn

e-Lecture: The content of this slide is hidden and only available for legitimate CS lecturer worldwide. Drop an email to visualgo.info at gmail dot com if you want to activate this CS lecturer-only feature and you are really a CS lecturer (show your University staff profile).

X Esc
Ant. PgUp
Próx. PgDn

There are two major ideas: Open Addressing versus Closed Addressing method.


In Open Addressing, all hashed keys are located in a single array. The hash code of a key gives its base address. Collision is resolved by checking/probing multiple alternative addresses (hence the name open) in the table based on a certain rule.


In Closed Addressing, the Hash Table looks like an Adjacency List (a graph data structure). The hash code of a key gives its fixed/closed base address. Collision is resolved by appending the collided keys inside a (Doubly) Linked List identified by the base address.

X Esc
Ant. PgUp
Próx. PgDn

There are three Open Addressing (OA) collision resolution techniques discussed in this visualization: Linear Probing (LP), Quadratic Probing (QP), and Double Hashing (DH).


To switch between the three modes, please click on the respective header.


Let:
M = HT.length = the current hash table size,
base = (key%HT.length),
step = the current probing step,
secondary = smaller_prime - key%smaller_prime (to avoid zero — elaborated soon)

We will soon see that the probing sequences of the three modes are:
Linear Probing: i=(base+step*1) % M,
Quadratic Probing: i=(base+step*step) % M, and
Double Hashing: i=(base+step*secondary) % M.

X Esc
Ant. PgUp
Próx. PgDn

Separate Chaining (SC) collision resolution technique is simple. We use M copies of auxiliary data structures, usually Doubly Linked Lists. If two keys a and b both have the same hash value i, both will be appended to the (front/back) of Doubly Linked List i (in this visualization, we append to the back in O(1) with help of tail pointer). That's it, where the keys will be slotted in is completely dependent on the hash function itself, hence we also call Separate Chaining as Closed Addressing collision resolution technique.


If we use Separate Chaining, the load factor α = N/M describes the average length of the M lists and it will determine the performance of Search(v) as we may have to explore α elements on average. As Remove(v) — also requires Search(v), its performance will be similar as Search(v). Insert(v) is clearly O(1).


If we can bound α to be a small constant, all Search(v), Insert(v), and Remove(v) operations using Separate Chaining will be O(1).

X Esc
Ant. PgUp
Próx. PgDn

View the visualisation of Hash Table above.


In this visualization, we prevent insertion of duplicate keys.


Due to limited screen space, we limit the maximum Hash Table size to be M = 19.


The Hash Table is visualized horizontally like an array where index 0 is placed leftmost and index M-1 is placed rightmost but the details are different when we are visualizing Open Addressing versus Separate Chaining collision resolution techniques.

X Esc
Ant. PgUp
Próx. PgDn

There are three Open Addressing collision resolution techniques discussed in this visualization: Linear Probing (LP), Quadratic Probing (QP), and Double Hashing (DH).


For all three techniques, each Hash Table cell is displayed as a vertex with cell value of [0..99] displayed as the vertex label. Without loss of generality, we do not show any satellite data in this visualization as we concentrate only on the arrangement of the keys. We reserve value -1 to indicate an 'EMPTY cell' (visualized as a blank vertex) and -2 to indicate a 'DELETED cell' (visualized as a vertex with abbreviated label "DEL"). The cell indices ranging from [0..M-1] are shown as red label below each vertex.

X Esc
Ant. PgUp
Próx. PgDn

For Separate Chaining (SC) collision resolution technique, the first row contains the M "H" (Head) pointers of M Doubly Linked Lists.


Then, each Doubly Linked List i contains all keys that are hashed into i in arbitrary order. Mathematically, all keys that can be expressed as i (mod M) are hashed into DLL i. Again, we do not store any satellite data in this visualization.

X Esc
Ant. PgUp
Próx. PgDn

In Linear Probing collision resolution technique, we scan forwards one index at a time for the next empty/deleted slot (wrapping around when we have reached the last slot) whenever there is a collision.


For example, let's assume we start with an empty Hash Table HT with table size M = HT.length = 7 as shown above that uses index 0 to M-1 = 7-1 = 6. Notice that 7 is a prime number. The (primary) hash function is simple, h(v) = v%M.


This walk-through will show you the steps taken by Insert(v), Search(v), and Remove(v) operations when using linear probing as collision resolution technique.

X Esc
Ant. PgUp
Próx. PgDn

Now click Insert([18,14,21]) — three individual insertions in one command.


Recap (to be shown after you click the button above).


Formally, we describe Linear Probing index i as i = (base+step*1) % M where base is the (primary) hash value of key v, i.e. h(v) and step is the linear probing step starting from 1.


Tips: To do a quick mental calculation of a (small) integer V modulo M, we simply subtract V with the largest multiple of MV, e.g. 18%7 = 18-14 = 4, as 14 is the largest multiple of 7 that is ≤ 18.

X Esc
Ant. PgUp
Próx. PgDn

Now click Insert([1,35]) (on top of the first three values inserted in the previous slide).


Recap (to be shown after you click the button above)

X Esc
Ant. PgUp
Próx. PgDn

Now we illustrate Search(v) operation while using Linear Probing as collision resolution technique. The steps taken are very similar as with Insert(v) operation, i.e. we start from the (primary) hash key value and check if we have found v, otherwise we move one index forward at a time (wrapping around if necessary) and recheck on whether we have found v. We stop when we encounter an empty cell which implies that v is not in Hash Table at all (as earlier Insert(v) operation would have placed v there otherwise).


Now click Search(35) — you should see probing sequence [0,1,2,3 (key 35 found)].


Now click Search(8) — [1,2,3,4, 5 (empty cell, so key 8 is not found in the hash table)].

X Esc
Ant. PgUp
Próx. PgDn

Now let's discuss Remove(v) operation.


If we just set HT[i] = EMPTY cell straightaway where i is the index that contains v (after linear probing if necessary), do you realize that we will cause a problem? Why?


Hint: Review the past three slides on how Insert(v) and Search(v) behave.

X Esc
Ant. PgUp
Próx. PgDn

e-Lecture: The content of this slide is hidden and only available for legitimate CS lecturer worldwide. Drop an email to visualgo.info at gmail dot com if you want to activate this CS lecturer-only feature and you are really a CS lecturer (show your University staff profile).

X Esc
Ant. PgUp
Próx. PgDn

Now let's see the complete Remove(v). If we find v at index i (after linear probing if necessary), we have to set HT[i] = DELETED (abbreviated as DEL in this visualization) where DEL is a special symbol (generally you should only use a symbol that is not used in your application) to indicate that cell can be by-passed if necessary by future Search(v), but can be overwritten by future Insert(w). This strategy is called Lazy Deletion.


Now click Remove(21) — [0,1 (key 21 found and we set H[1] = DEL)].


Afterwards, please continue the discussion in the next slide.

X Esc
Ant. PgUp
Próx. PgDn

Now click Search(35) — [0,1 (bypassing that DELETED cell), 2,3 (found key 35)].


Imagine what would have happened if we wrongly set H[1] = EMPTY.

X Esc
Ant. PgUp
Próx. PgDn

Now click Insert(28) — you should see probing sequence [0,1 (found a cell with DEL symbol)], so it is actually can be overwritten with a new value without affecting the correctness of future Search(v). Therefore, we put 28 in index 1.

X Esc
Ant. PgUp
Próx. PgDn

Although we can resolve collision with Linear Probing, it is not the most effective way.


We define a cluster to be a collection of consecutive occupied slots. A cluster that covers the base address of a key is called the primary cluster of the key.


Now notice that Linear Probing can create large primary clusters that will increase the running time of Search(v)/Insert(v)/Remove(v) operations beyond the advertised O(1).


See an example above with M = 11 and we have inserted keys that are all 6 (modulo 11), i.e. all have remainder 6 when divided by 11. Now see how 'slow' Insert(94) is.

X Esc
Ant. PgUp
Próx. PgDn

The probe sequence of linear probing can be formally described as follows:

 h(v) // base address
(h(v) + 1*1) % M // 1st probing step if there is a collision
(h(v) + 2*1) % M // 2nd probing step if there is still a collision
(h(v) + 3*1) % M // 3rd probing step if there is still a collision
...
(h(v) + k*1) % M // k-th probing step, etc...

During Insert(v), if there is a collision but there is an empty (or DEL) slot remains in the Hash Table, we are sure to find it after at most M linear probing steps. And when we do, the collision will be resolved, but the primary cluster of the key v is expanded as a result and future Hash Table operations will get slower too.


The primary cluster size can be very big due to the annexation (or combination) of neighbouring (but previously disjointed) clusters.

X Esc
Ant. PgUp
Próx. PgDn

To reduce primary clustering, we can modify the probe sequence to:

 h(v) // base address
(h(v) + 1*1) % M // 1st probing step if there is a collision
(h(v) + 2*2) % M // 2nd probing step if there is still a collision
(h(v) + 3*3) % M // 3rd probing step if there is still a collision
...
(h(v) + k*k) % M // k-th probing step, etc...

That's it, the probe jumps quadratically, wrapping around the Hash Table as necessary.


A very common mistake as this is a different kind of Quadratic Probing:
Doing h(v), (h(v)+1) % M, (h(v)+1+4) % M, (h(v)+1+4+9) % M, ...

X Esc
Ant. PgUp
Próx. PgDn

Assume that we have called Insert(18) and Insert(10) into an initially empty Hash Table of size M = HT.length = 7. As 18%7 = 4 and 10%7 = 3, 18 and 3 do not collide and both reside in index 4 and 3 respectively as shown above.


Now, let's click Insert(38).


Recap (to be shown after you click the button above).

X Esc
Ant. PgUp
Próx. PgDn

Remove(x) and Search(y) operations are defined similarly. Just that this time we use Quadratic Probing instead of Linear Probing.


For example, assume that we have called Remove(18) after the previous slide and we mark HT[4] = DEL. If we then call Search(38), we will use the same Quadratic Probing sequence as with previous slide, but passing through HT[4] which marked as DELETED.

X Esc
Ant. PgUp
Próx. PgDn

In a glance, Quadratic Probing that jumps +1, +4, +9, +16, ... quadratically seems able to solve the primary clustering issue that we have with Linear Probing earlier, but is it the perfect collision resolution technique?


Try Insert([12,17]).


Do you realized what has just happened?

X Esc
Ant. PgUp
Próx. PgDn

We can insert 12 easily as h(12) = 12%7 = 5 was empty previously (see above).


However we have major problem inserting key 17 even if we still have 3 empty slots as:
h(17) = 17%7 = 3 is already occupied by key 10,
(3+1*1) % 7 = 4 is already occupied by key 18,
(3+2*2) % 7 = 0 is already occupied by key 38,
(3+3*3) % 7 = 5 is already occupied by key 12,
(3+4*4) % 7 = 5 again is already occupied by key 12,
(3+5*5) % 7 = 0 again is already occupied by key 38,
(3+6*6) % 7 = 4 again is already occupied by key 18,
(3+7*7) % 7 = 3 again is already occupied by key 10,
it will cycle forever if we continue the Quadratic Probing...


Although we still have a few (3) empty cells, we are unable to insert this new value 17 into the Hash Table...

X Esc
Ant. PgUp
Próx. PgDn

If α < 0.5 and M is a prime (> 3), then we can always find an empty slot using Quadratic Probing. Recall: α is the load factor and M is the Hash Table size (HT.length).


If the two requirements above are satisfied, we can prove that the first M/2 quadratic probing indices, including the base address h(v) are all distinct and unique.


But there is no such guarantee beyond that. Hence if we want to use Quadratic Probing, we need to ensure that α < 0.5 (not enforced in this visualization but we do break the loop after M steps to prevent infinite loop).

X Esc
Ant. PgUp
Próx. PgDn

We will use proof by contradiction. We first assume that two Quadratic Probing steps:
x and y, x != y (let's say x < y), can yield the same address modulo M.

h(v) + x*x = h(v) + y*y (mod M)
x*x = y*y (mod M) // strike out h(v) from both sides
x*x - y*y = 0 (mod M) // move y*y to LHS
(x-y)*(x+y) = 0 (mod M) // rearrange the formula

Now, either (x-y) or (x+y) has to be equal to zero.
As our assumption says x != y, then (x-y) cannot be 0.
As 0 ≤ x < y ≤ (M/2) and M is a prime > 3 (an odd Integer),
then (x+y) also cannot be 0 modulo M.


Contradiction!


So the first M/2 Quadratic Probing steps cannot yield the same address modulo M
(if we set M to be a prime number greater than 3).


Discussion: Can we make Quadratic Probing able to use the other ~50% of the table cells?

X Esc
Ant. PgUp
Próx. PgDn

e-Lecture: The content of this slide is hidden and only available for legitimate CS lecturer worldwide. Drop an email to visualgo.info at gmail dot com if you want to activate this CS lecturer-only feature and you are really a CS lecturer (show your University staff profile).

X Esc
Ant. PgUp
Próx. PgDn

In Quadratic Probing, clusters are formed along the path of probing, instead of around the base address like in Linear Probing. These clusters are called Secondary Clusters.


Secondary clusters are formed as a result of using the same pattern in probing by all keys. Notice that if two distinct keys have the same base address, their Quadratic Probing sequences are going to be the same.


Secondary clustering in Quadratic Probing is not as bad as primary clustering in Linear Probing as a good hash function should theoretically disperse the keys into different base addresses ∈ [0..M-1] in the first place.

X Esc
Ant. PgUp
Próx. PgDn

To reduce primary and secondary clustering, we can modify the probe sequence to:

 h(v) // base address
(h(v) + 1*h2(v)) % M // 1st probing step if there is a collision
(h(v) + 2*h2(v)) % M // 2nd probing step if there is still a collision
(h(v) + 3*h2(v)) % M // 3rd probing step if there is still a collision
...
(h(v) + k*h2(v)) % M // k-th probing step, etc...

That's it, the probe jumps according to the value of the second hash function h2(v), wrapping around the Hash Table as necessary.

X Esc
Ant. PgUp
Próx. PgDn

If h2(v) = 1, then Double Hashing works exactly the same as Linear Probing.
So we generally wants h2(v) > 1 to avoid primary clustering.


If h2(v) = 0, then Double Hashing does not work for an obvious reason as any probing step multiplied by 0 remains 0, i.e. we stay at the base address forever during a collision
We need to avoid this.


Usually (for Integer keys), h2(v) = M' - v%M' where M' is a smaller prime than M.
This makes h2(v) ∈ [1..M'], which is diverse enough to avoid secondary clustering.


The usage of the secondary hash function makes it theoretically hard to have either primary or secondary clustering issue.

X Esc
Ant. PgUp
Próx. PgDn

Click Insert([35,42]) to insert 35 and then 42 to the current Hash Table above.


Recap (to be shown after you click the button above).

X Esc
Ant. PgUp
Próx. PgDn

Remove(x) and Search(y) operations are defined similarly. Just that this time we use Double Hashing instead of Linear Probing or Quadratic Probing.


For example, assume that we have called Remove(17) after the previous slide and we mark HT[3] = DEL. If we then call Search(35), we will use the same Double Hashing sequence as with previous slide, but passing through HT[3] which marked as DELETED.

X Esc
Ant. PgUp
Próx. PgDn

In summary, a good Open Addressing collision resolution technique needs to:

  1. Always find an empty slot if it exists,
  2. Minimize clustering (of any kind),
  3. Give different probe sequences when 2 different keys collide,
  4. Fast, O(1).

Discussion: Double Hashing seems to fit the bill. But... Is Double Hashing strategy flexible enough to be used as the default library implementation of a Hash Table? Let's see...

X Esc
Ant. PgUp
Próx. PgDn

Try Insert([9,16,23,30,37,44]) to see how Insert(v) operation works if we use Separate Chaining as collision resolution technique. Note that all Integers {9,16,23,30,37,44} are 2 (modulo 7) so all of them will be appended into the (back of) Doubly Linked List 2 and each insertion is clearly O(1).


Due to the screen limitation, we limit the length of each Doubly Linked List to be 6.

X Esc
Ant. PgUp
Próx. PgDn

Try Search(35) to see that Search(v) can be made to run in O(1+α).


Try Remove(7) to see that Remove(v) can be made to run in O(1+α) too.


If α is large, Separate Chaining performance is not really O(1). However, if we roughly know the potential number of keys n that our application will ever use, then we can set table size m accordingly such that α = n/m is a very low positive number, thereby making Separate Chaining performances all O(1).

X Esc
Ant. PgUp
Próx. PgDn
Discussion: After all these explanations, which of the two collision resolution technique is the better one?
X Esc
Ant. PgUp
Próx. PgDn

e-Lecture: The content of this slide is hidden and only available for legitimate CS lecturer worldwide. Drop an email to visualgo.info at gmail dot com if you want to activate this CS lecturer-only feature and you are really a CS lecturer (show your University staff profile).

X Esc
Ant. PgUp
Próx. PgDn

You have reached the end of the basic stuffs of this Hash Table data structure and we encourage you to explore further in the Exploration Mode.


However, we still have a few more interesting Hash Table challenges for you that are outlined in this section.

X Esc
Ant. PgUp
Próx. PgDn

The performance of Hash Table degrades when the load factor α gets higher. For (standard) Quadratic Probing collision resolution technique, insertions might fail when the Hash Table α > 0.5.


If that happens, we can rehash. We build another Hash Table about twice as big with a new hash function. We go through all keys in the original Hash Table, recompute the new hash values, and re-insert the keys (with their satellite-data) into the new, bigger Hash Table, before finally we delete the older, smaller Hash Table.


A rule of thumb is to rehash when α ≥ 0.5 if using Open Addressing and when α > small constant close to 1.0 if using Separate Chaining.


If we know the maximum number of total possible keys, we can always influence α to be a low number.

X Esc
Ant. PgUp
Próx. PgDn

However, if you ever need to implement a Hash Table in C++ or Java and your keys are either Integers or Strings, you can use the built-in C++ STL or Java API. They already have good built-in implementation of default hash functions for Integers or Strings.


See C++ STL unordered_map, unordered_set or Java HashMap, HashSet.


Notice that multimap/multiset implementations also exist (duplicate keys are allowed).


However, here is our take of a simple Separate Chaining implementation in C++
(not generic enough though).

X Esc
Ant. PgUp
Próx. PgDn

e-Lecture: The content of this slide is hidden and only available for legitimate CS lecturer worldwide. Drop an email to visualgo.info at gmail dot com if you want to activate this CS lecturer-only feature and you are really a CS lecturer (show your University staff profile).

X Esc
Ant. PgUp
Próx. PgDn

Hash Table is an extremely good data structure to implement Table ADT if the (Integer or String) keys only need to be mapped to satellite-data, with O(1) performance for Search(v), Insert(v), and Remove(v) operations if the hash table is set up properly.


However, if we need to do much more with the keys, we may need to use an alternative data structure.

X Esc
Ant. PgUp
Próx. PgDn

For a few more interesting questions about this data structure, please practice on Hash Table training module (no login is required, but short and of medium difficulty setting only).


However, for registered users, you should login and then go to the Main Training Page to officially clear this module and such achievement will be recorded in your user account.

X Esc
Ant. PgUp
Próx. PgDn

Try to solve a few basic programming problems that somewhat requires the usage of Hash Table (especially if the input size is much larger):

  1. Kattis - cd (the inputs are already sorted so alternative, non Hash Table solution exists; if the inputs are not sorted, this set intersection problem is best solved with help of a Hash Table),
  2. Kattis - oddmanout (we can map large invitation codes into smaller range of integers; this is a practice of hashing (large range) of integers),
  3. Kattis - whatdoesthefoxsay (we put sounds that are not fox into an unordered set; this is a practice of hashing strings).
X Esc
Ant. PgUp
Próx. PgDn

As the action is being carried out, each step will be described in the status panel.

X Esc
Ant. PgUp
Próx. PgDn

e-Lecture: The content of this slide is hidden and only available for legitimate CS lecturer worldwide. Drop an email to visualgo.info at gmail dot com if you want to activate this CS lecturer-only feature and you are really a CS lecturer (show your University staff profile).

X Esc
Ant. PgUp
Próx. PgDn

Control the animation with the player controls! Keyboard shortcuts are:

Spacebar: play/pause/replay
Left/right arrows: step backward/step forward
-/+: decrease/increase speed
X Esc
Ant. PgUp
Próx. PgDn

Return to 'Exploration Mode' to start exploring!


Note that if you notice any bug in this visualization or if you want to request for a new visualization feature, do not hesitate to drop an email to the project leader: Dr Steven Halim via his email address: stevenhalim at gmail dot com.

X Esc
Ant. PgUp

Criar

Inserir(v)

Remover(v)

>

Create empty hash table of size =

Vai!

v =

Vai!

v =

Vai!

Sobre Time Termos de uso

Sobre

O VisuAlgo foi conceitualizado em 2011 por Dr. Steven Halim como uma ferramenta para auxiliar seus estudantes a entenderem melhor estruturas de dados e algoritmos, permitindo que eles aprendessem o básico por conta e em seu próprio ritmo.
VisuAlgo contém muitos algoritmos avançados que são discutidos no livro de Dr. Steven Halim ('Competitive Programming', em co-autoria com seu irmão Dr. Felix Halim) e além. Hoje, algumas visualizações/animações destes algoritmos avançados só podem ser encontrados no VisuAlgo.
Apesar de ter sido especificamente projetado para os estudantes da Universidade Nacional de Singapura (NUS) cursando várias disciplinas de estruturas de dados e algoritmos (ex.: CS1010, CS1020, CS2010, CS2020, CS3230, e CS3230), como defensores do aprendizado online, nós esperamos que mentes curiosas ao redor do mundo achem estas visualizações úteis também.
VisuAlgo não foi projetado para funcionar bem em telas de toque pequenas (ex.: smartphones) desde o  princípio devido à necessidade de atender a muitas visualizações complexas de algoritmos que requerem vários pixels e gestos de clicar-e-arrastar para interação. A resolução mínima para uma experiência de usuário respeitável é 1024x768 e somente a página inicial é relativamente amigável a dispositivos móveis. 
VisuAlgo é um projeto em andamento e mais visualizações complexas ainda estão em desenvolvimento. 
O desenvolvimento mais excitante é o gerador de questões e verificador automático (o sistema de quiz online) que permite aos estudantes testar seus conhecimentos de estruturas de dados e algoritmos básicos. As questões são aleatoriamente geradas através de algumas regras e as respostas dos estudantes são instantaneamente e automaticamente avaliadas assim que são submetidas para o nosso servidor de avaliação. Este sistema de quiz online, quando for adotado por mais instrutores de Ciência da Computação ao redor do mundo, deve tecnicamente eliminar questões manuais sobre estruturas de dados e algoritmos básicos de provas típicas de Ciência da Computação em muitas Universidades. Definindo um peso pequeno (mas não-zero) para aqueles aprovados no quiz online, um instrutor de Ciência da Computação pode (significativamente) melhorar o domínio de seus estudantes sobre estas questões básicas, uma vez que os estudantes têm virtualmente um número infinito de questões para praticar que podem ser verificadas instantaneamente antes que eles possam fazer o quiz online. O modo de treino atualmente contém questões para 12 módulos de visualização. Em breve nós adicionaremos os 8 módulos de visualização restantes, para que todos os módulos de visualização no VisuAlgo tenham um componente de quiz online.
Outro ramo de desenvolvimento em atividade é o subprojeto de internacionalização do VisuAlgo. Nós queremos preparar uma base de dados de termos de Ciência da Computação para todos os textos em inglês que aparecem no sistema VisuAlgo. Esta é uma tarefa grande e requer crowdsourcing. Uma vez que o sistema estiver pronto, nós convidaremos os visitantes do VisuAlgo a contribuir, especialmente se você não for um falante nativo de inglês. Atualmente, nós também temos notas públicas sobre o VisuAlgo em vários idiomas:
zh, id, kr, vn, th.

Time

Líder do Projeto & Conselheiro (Julho de 2011-presente)
Dr Steven Halim, Senior Lecturer, School of Computing (SoC), National University of Singapore (NUS)
Dr Felix Halim, Software Engineer, Google (Mountain View)

Estudantes Pesquisadores de Graduação 1 (Jul 2011-Apr 2012)
Koh Zi Chun, Victor Loh Bo Huai

Projeto Final do Ano/Estudantes do Programa de Oportunidades de Pesquisa para a Graduação (UROP) 1 (Jul 2012-Dec 2013)
Phan Thi Quynh Trang, Peter Phandi, Albert Millardo Tjindradinata, Nguyen Hoang Duy

Projeto Final do Ano/Estudantes do Programa de Oportunidades de Pesquisa para a Graduação (UROP) 2 (Jun 2013-Apr 2014)
Rose Marie Tan Zhao Yun, Ivan Reinaldo

Estudantes Pesquisadores de Graduação 2 (May 2014-Jul 2014)
Jonathan Irvin Gunawan, Nathan Azaria, Ian Leow Tze Wei, Nguyen Viet Dung, Nguyen Khac Tung, Steven Kester Yuwono, Cao Shengze, Mohan Jishnu

Projeto Final do Ano/Estudantes do Programa de Oportunidades de Pesquisa para a Graduação (UROP) 3 (Jun 2014-Apr 2015)
Erin Teo Yi Ling, Wang Zi

Projeto Final do Ano/Estudantes do Programa de Oportunidades de Pesquisa para a Graduação (UROP) 4 (Jun 2016-Dec 2017)
Truong Ngoc Khanh, John Kevin Tjahjadi, Gabriella Michelle, Muhammad Rais Fathin Mudzakir

List of translators who have contributed ≥100 translations can be found at statistics page.

Agradecimentos
Este projeto foi tornado possível pela generosa Concessão de Aperfeiçoamento de Ensino do Centro de Desenvolvimento de Ensino e Aprendizado (CDTL) da Universidade Nacional de Singapura (NUS).

Termos de uso

VisuAlgo é gratuito para a comunidade de Ciência da Computação na Terra. Se você gosta do VisuAlgo, o único pagamento que lhe pedimos é que você fale da existência do VisuAlgo para outros estudantes/instrutores de Ciência da Computação que você conhece =) via Facebook, Twitter, página do curso, blog, email, etc.
Se você é um estudante/instrutor de estruturas de dados e algoritmos, você tem permissão para usar este site diretamente para suas aulas. Se você tirar capturas de tela (vídeos) deste site, você pode usar as capturas de tela (vídeos) em outros lugares desde que você cite a URL deste website (http://visualgo.net) e/ou a lista de publicações abaixo como referência. Contudo, você NÃO tem permissão para baixar os arquivos do VisuAlgo (do lado do cliente) e hospedá-los em seu website, uma vez que isso configura plágio. No momento, nós NÃO permitimos a outras pessoas copiar este projeto e criar variantes do VisuAlgo. Não há problemas em usar a cópia offline (lado do cliente) do VisuAlgo para seu uso pessoal.
Note que o componente do quiz online do VisuAlgo, por natureza, é um componente pesado para os servidores e não há maneira fácil de salvar os scripts e bases de dados do servidor localmente. Atualmente, o público em geral pode apenas usar o 'modo de treinamento' para acessar este sistema de quiz online. Atualmente, o 'modo de prova' é um ambiente mais controlado para usar estas questões geradas randomicamente e verificação automática para um exame real na Universidade Nacional de Singapura (NUS). Outros instrutores de Ciência da Computação interessados devem contatar o prof. Dr. Steven Halim se você quiser experimentar este 'modo de prova'.'

Lista de Publicações

Este trabalho foi apresentado brevemente no CLI Workshop durante a Final Mundial do ACM ICPC 2012 (Polônia, Varsóvia) e na IOI Conference durante a IOI 2012 (Sirmione-Montichiari, Itália). Você pode clicar neste link para ler nosso paper de 2012 sobre este sistema (ele ainda não era chamado VisuAlgo em 2012).
Este trabalho foi feito em sua maioria por meus estudantes anteriores. Os relatórios finais mais recentes estão aqui: Erin, Wang Zi, Rose, Ivan.

Avisos de Bugs ou Solicitações de Novas Funcionalidades

VisuAlgo não é um projeto finalizado. Dr. Steven Halim ainda está ativamente melhorando o VisuAlgo. Se você está usando o VisuAlgo e perceber um bug em qualquer uma de nossas páginas de visualizações/ferramenta de quiz online ou se você quiser solicitar novas funcionalidades, por favor contate o Dr. Steven Halim. O contato dele é a concatenação de seu nome e adicione gmail ponto com.