Hash Tables
1. Introduction
Definition
A hash table (hash map) is a data structure that maps keys to values using a hash function to compute an index into an array of buckets. It provides average-case $O(1)$ time for insert, delete, and search.
The Core Idea
Given a key $k$, compute $\text{index} = h(k) \bmod m$, where $h$ is the hash function and $m$ is the table size. Store the key-value pair at this index.
2. Hash Functions
Properties of a Good Hash Function
- Deterministic: The same key always produces the same hash
- Efficient: Computed in $O(1)$ time
- Uniform distribution: Maps keys evenly across all indices
- Minimal collisions: Different keys rarely hash to the same index
Common Hash Functions
Division Method
$h(k) = k \bmod m$
Choosing $m$. Avoid powers of 2: if $m = 2^p$, then $h(k)$ is just the lowest $p$ bits of $k$, so patterns in keys align with binary structure. Choose $m$ to be a prime not close to a power of 2.
Warning: if $m$ shares common factors with many keys (for example, $m = 7$ when the keys are all multiples of 7), the hash function will map many keys to the same bucket.
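A minimal sketch (illustrative keys, chosen for this example) of the pitfall: with $m = 16$ the hash keeps only the low 4 bits of each key, so keys differing only in higher bits all collide, while a prime such as 13 spreads the same keys out.

```python
def division_hash(key, m):
    """Division method: h(k) = k mod m."""
    return key % m

# Keys that differ only above the low 4 bits (step of 16)
keys = [10, 26, 42, 58, 74]

# m = 16 (a power of 2) keeps only the low 4 bits: every key collides
print([division_hash(k, 16) for k in keys])   # [10, 10, 10, 10, 10]

# m = 13 (a prime) spreads the same keys across buckets
print([division_hash(k, 13) for k in keys])   # [10, 0, 3, 6, 9]
```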
Multiplication Method
$h(k) = \lfloor m \cdot (kA \bmod 1) \rfloor$, where $A \in (0, 1)$ is a constant (Knuth suggests $A = \frac{\sqrt{5} - 1}{2} \approx 0.618$).
Advantage: Works well with any value of $m$.
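A sketch of the multiplication method in Python (the constant $A$ follows Knuth's suggestion; floating-point rounding makes this illustrative rather than production-grade):

```python
import math

A = (math.sqrt(5) - 1) / 2   # Knuth's suggested constant, about 0.618

def mult_hash(key, m):
    """Multiplication method: h(k) = floor(m * frac(k * A))."""
    frac = (key * A) % 1.0    # fractional part of k * A
    return int(m * frac)

# Unlike the division method, m may safely be a power of 2 here
for k in [123, 456, 789]:
    print(k, mult_hash(k, 16))
```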
Polynomial Rolling Hash (for strings)
$h(s) = \left(s_0 \cdot p^{n-1} + s_1 \cdot p^{n-2} + \cdots + s_{n-1}\right) \bmod m$, where $p$ is a prime (typically 31 or 37) and $m$ is a large prime or $10^9 + 7$.
```python
def polynomial_hash(s, p=31, m=10**9 + 7):
    # Horner's method: h = (((s0*p + s1)*p + s2)*p + ...) mod m
    h = 0
    for ch in s:
        h = (h * p + ord(ch)) % m
    return h
```
3. Collisions
Definition
A collision occurs when two distinct keys hash to the same index: $h(k_1) = h(k_2)$ with $k_1 \neq k_2$.
Theorem (Pigeonhole Principle). If a hash table has $m$ buckets and stores more than $m$ keys, at least one collision must occur.
Proof. By the pigeonhole principle: $n$ keys mapped to $m$ buckets where $n > m$ means at least one bucket contains $\geq 2$ keys.
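The theorem can be checked empirically; this small sketch (illustrative, using $h(k) = k \bmod m$) reports whether a set of keys collides:

```python
def has_collision(keys, m):
    """True if two distinct keys share a bucket under h(k) = k mod m."""
    seen = set()
    for k in keys:
        b = k % m
        if b in seen:
            return True
        seen.add(b)
    return False

# Any m+1 distinct keys must collide in m buckets (pigeonhole)
m = 10
print(has_collision(range(m + 1), m))   # True: keys 0 and 10 share bucket 0
```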
4. Collision Resolution: Chaining
Method
Each bucket contains a linked list of all key-value pairs that hash to that index.
```python
class ChainedHashTable:
    def __init__(self, size=10):
        self.size = size
        self.table = [[] for _ in range(size)]

    def _hash(self, key):
        return hash(key) % self.size

    def insert(self, key, value):
        index = self._hash(key)
        for i, (k, v) in enumerate(self.table[index]):
            if k == key:
                self.table[index][i] = (key, value)  # update existing key
                return
        self.table[index].append((key, value))

    def search(self, key):
        index = self._hash(key)
        for k, v in self.table[index]:
            if k == key:
                return v
        return None

    def delete(self, key):
        index = self._hash(key)
        for i, (k, v) in enumerate(self.table[index]):
            if k == key:
                del self.table[index][i]
                return
```
Complexity Analysis
Theorem. Under simple uniform hashing (each key is equally likely to hash to any of the $m$ buckets, independently), the expected length of a chain for $n$ keys and $m$ buckets is $\alpha = n/m$ (the load factor).
Proof. By linearity of expectation, the expected number of keys in any particular bucket is $n \cdot \frac{1}{m} = \alpha$.
| Operation | Average case | Worst case |
|---|---|---|
| Search | $O(1 + \alpha)$ | $O(n)$ |
| Insert | $O(1 + \alpha)$ | $O(n)$ |
| Delete | $O(1 + \alpha)$ | $O(n)$ |
Theorem. Chaining requires $\Theta(m + n)$ total memory.
Proof. $m$ bucket headers plus $n$ nodes, each storing a key, value, and pointer.
5. Collision Resolution: Open Addressing
Method
All key-value pairs are stored in the table itself (no linked lists). When a collision occurs, the algorithm probes for the next available slot.
Linear Probing
$h(k, i) = (h(k) + i) \bmod m$, where $i = 0, 1, 2, \ldots$ is the probe number in the probe sequence.
```python
class LinearProbeHashTable:
    def __init__(self, size):
        self.size = size
        self.keys = [None] * size
        self.values = [None] * size
        self.deleted = [False] * size   # tombstones for lazy deletion

    def insert(self, key, value):
        first_free = None                # first tombstone seen, if any
        for i in range(self.size):
            index = (hash(key) + i) % self.size
            if self.keys[index] is None:
                # Key is not in the table; reuse an earlier tombstone if possible
                target = first_free if first_free is not None else index
                self.keys[target] = key
                self.values[target] = value
                self.deleted[target] = False
                return
            if self.deleted[index]:
                if first_free is None:
                    first_free = index   # remember it, but keep probing for the key
            elif self.keys[index] == key:
                self.values[index] = value   # update existing key
                return
        if first_free is not None:       # no empty slot, but a tombstone is reusable
            self.keys[first_free] = key
            self.values[first_free] = value
            self.deleted[first_free] = False
            return
        raise Exception("Table full")

    def search(self, key):
        for i in range(self.size):
            index = (hash(key) + i) % self.size
            if self.keys[index] is None:
                return None              # empty slot: key cannot be further along
            if self.keys[index] == key and not self.deleted[index]:
                return self.values[index]
        return None
```
Clustering in Linear Probing
Theorem. Linear probing suffers from primary clustering: consecutive occupied slots form clusters, and new insertions are more likely to land in larger clusters, making them grow even faster.
Proof of clustering. Consider a cluster of length $L$. The probability that the next insertion lands in this cluster is at least $(L+1)/m$ (one more than the cluster length, since the slot after the cluster also triggers growth: a key hashing to any of the $L$ occupied slots probes forward to that slot, and a key hashing directly to it fills it). When an insertion lands in the cluster, the cluster grows by at least 1. This positive feedback means large clusters grow disproportionately, leading to degraded performance. The expected number of probes for an unsuccessful search grows rapidly as the load factor approaches 1 (vs. $\frac{1}{1-\alpha}$ for uniform hashing).
More formally, after inserting $n$ keys into a table of size $m$ with linear probing, the expected number of probes for an unsuccessful search is approximately:
$\dfrac{1}{2}\left(1 + \dfrac{1}{(1-\alpha)^2}\right)$
where $\alpha = n/m$ is the load factor. As $\alpha \to 1$, this grows without bound.
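A quick numeric sketch comparing Knuth's linear-probing estimate $\tfrac{1}{2}\bigl(1 + \tfrac{1}{(1-\alpha)^2}\bigr)$ with the $\tfrac{1}{1-\alpha}$ figure for ideal uniform hashing:

```python
def linear_probe_estimate(alpha):
    """Knuth's estimate of probes for an unsuccessful search with linear probing."""
    return 0.5 * (1 + 1 / (1 - alpha) ** 2)

def uniform_estimate(alpha):
    """Expected probes for an unsuccessful search under ideal uniform hashing."""
    return 1 / (1 - alpha)

for alpha in [0.5, 0.75, 0.9]:
    print(alpha, round(linear_probe_estimate(alpha), 1), round(uniform_estimate(alpha), 1))
# 0.5 2.5 2.0
# 0.75 8.5 4.0
# 0.9 50.5 10.0
```

The gap widens sharply as the table fills, which is why linear probing needs a conservative load-factor threshold.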
Quadratic Probing
$h(k, i) = (h(k) + i^2) \bmod m$
Eliminates primary clustering but may cause secondary clustering (keys with the same initial hash follow the same probe sequence).
Theorem. If the table size $m$ is prime and the load factor $\alpha < 1/2$, quadratic probing will always find an empty slot.
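A sketch of quadratic-probing insertion with offsets $i^2$ and integer keys hashed by $k \bmod m$ (the function name and sample keys are illustrative):

```python
def quadratic_probe_insert(table, key):
    """Insert into an open-addressed table using h(k, i) = (k + i*i) mod m.

    Assumes integer keys with h(k) = k mod m; a sketch, not a full class.
    """
    m = len(table)
    for i in range(m):
        index = (key + i * i) % m
        if table[index] is None:
            table[index] = key
            return index
    raise RuntimeError("no empty slot found")

table = [None] * 11          # prime size, as the theorem requires
for k in [19, 36, 50, 5, 69]:
    quadratic_probe_insert(table, k)
print(table)   # [None, None, None, 36, 69, 5, 50, None, 19, None, None]
```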
Double Hashing
$h(k, i) = (h_1(k) + i \cdot h_2(k)) \bmod m$
Uses a second hash function to determine the probe step. Eliminates both primary and secondary clustering.
Requirement: $h_2(k) \neq 0$ and $h_2(k)$ is coprime with $m$ for every key. Common choice: $h_2(k) = 1 + (k \bmod m')$ where $m'$ is slightly less than $m$ (e.g., $m - 1$ or $m - 2$).
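A sketch of the resulting probe sequence, assuming the pair $h_1(k) = k \bmod m$ and $h_2(k) = 1 + (k \bmod (m-1))$ (one common choice, not the only valid one):

```python
def double_hash_probes(key, m):
    """Probe sequence (h1 + i*h2) mod m for i = 0..m-1.

    Assumes h1(k) = k mod m and h2(k) = 1 + (k mod (m-1)).
    """
    h1 = key % m
    h2 = 1 + (key % (m - 1))
    return [(h1 + i * h2) % m for i in range(m)]

# With m prime, h2 is coprime to m, so all m slots are visited before repeating
print(double_hash_probes(25, 11))   # [3, 9, 4, 10, 5, 0, 6, 1, 7, 2, 8]
```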
6. Load Factor and Resizing
Definition
$\alpha = \dfrac{n}{m}$, where $n$ is the number of stored keys and $m$ is the table size.
Resizing (Rehashing)
When $\alpha$ exceeds a threshold (typically 0.7 for chaining, 0.5 for open addressing), allocate a new table of size roughly $2m$ (often the next prime) and rehash all keys.
Theorem. Resizing maintains amortised $O(1)$ time per operation.
Proof. Similar to dynamic array analysis. The total cost of $n$ insertions with resizing at powers of 2 is dominated by the last resize: the rehashing work forms the geometric series $1 + 2 + 4 + \cdots + n < 2n$, so the total is $O(n)$. Amortised cost per insertion: $O(n)/n = O(1)$.
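A sketch of resize-on-threshold for a chained table (the 0.7 threshold and doubling follow the text; for simplicity the new size is $2m$ rather than the next prime):

```python
class ResizingHashTable:
    """Chained hash table that doubles when the load factor would exceed 0.7."""

    def __init__(self):
        self.size = 8
        self.count = 0
        self.table = [[] for _ in range(self.size)]

    def insert(self, key, value):
        if (self.count + 1) / self.size > 0.7:
            self._resize()
        bucket = self.table[hash(key) % self.size]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)     # update existing key
                return
        bucket.append((key, value))
        self.count += 1

    def _resize(self):
        old_items = [pair for bucket in self.table for pair in bucket]
        self.size *= 2
        self.table = [[] for _ in range(self.size)]
        for k, v in old_items:               # rehash every key into the new table
            self.table[hash(k) % self.size].append((k, v))

t = ResizingHashTable()
for i in range(100):
    t.insert(i, i * i)
print(t.size, t.count)   # 256 100 -- the table grew well past 100 / 0.7
```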
7. Comparison of Collision Resolution Methods
| Property | Chaining | Linear Probing | Double Hashing |
|---|---|---|---|
| Memory | $\Theta(m + n)$ | $\Theta(m)$ | $\Theta(m)$ |
| Cache performance | Poor (pointers) | Good (contiguous) | Good |
| Clustering | None | Primary | None |
| Delete | Simple (unlink node) | Lazy deletion | Lazy deletion |
| Load factor limit | No hard limit | $\alpha < 1$ | $\alpha < 1$ |
Board-specific
- AQA requires understanding of hash functions, collision resolution (linear probing, quadratic probing, rehashing), and calculating hash table load factor
- CIE (9618) covers hashing and collision handling; may use different terminology
- OCR (A) requires hash tables with collision resolution using linear probing and rehashing
- Edexcel covers hash tables and collision resolution methods
Problem Set
Problem 1. Using the division method $h(k) = k \bmod m$ with table size $m = 7$, compute the hash values for keys: 14, 21, 28, 35, 42. What do you observe?
Answer
With $m = 7$:
- 14 mod 7 = 0
- 21 mod 7 = 0
- 28 mod 7 = 0
- 35 mod 7 = 0
- 42 mod 7 = 0
All keys hash to index 0 — maximum collisions. This demonstrates why $m$ should not divide common key patterns. If $m = 7$ and all keys are multiples of 7, every key collides. Choose $m$ to be a prime that does not divide common key values.
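The observation can be confirmed in a line of Python:

```python
# Every key is a multiple of 7, so each one lands in bucket 0 when m = 7
print([k % 7 for k in [14, 21, 28, 35, 42]])   # [0, 0, 0, 0, 0]
```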
Problem 2. Insert keys 10, 22, 31, 4, 15, 28, 17 into a hash table of size 11 using linear probing. Show the table after each insertion.
Answer
| Key | Hash | Probe sequence | Final index |
|---|---|---|---|
| 10 | 10 | 10 | 10 |
| 22 | 0 | 0 | 0 |
| 31 | 9 | 9 | 9 |
| 4 | 4 | 4 | 4 |
| 15 | 4 | 4, 5 | 5 |
| 28 | 6 | 6 | 6 |
| 17 | 6 | 6, 7 | 7 |
Final table:
| Index | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Key | 22 | — | — | — | 4 | 15 | 28 | 17 | — | 31 | 10 |
Problem 3. Using the same keys as Problem 2, insert into a hash table using chaining with $m = 5$ and $h(k) = k \bmod 5$. Show each bucket.
Answer
| Bucket | Keys (chain) |
|---|---|
| 0 | 10 → 15 |
| 1 | 31 |
| 2 | 22 → 17 |
| 3 | 28 |
| 4 | 4 |
Load factor: $\alpha = 7/5 = 1.4$ (greater than 1, which chaining permits).
Problem 4. Compute the polynomial rolling hash of the string "abc" using base $p = 31$ and modulus $m = 10^9 + 7$ (the defaults of the `polynomial_hash` function above).
Answer
Computing iteratively (Horner's method), with ASCII codes a = 97, b = 98, c = 99:
- After 'a': $h = 97$
- After 'b': $h = 97 \cdot 31 + 98 = 3105$
- After 'c': $h = 3105 \cdot 31 + 99 = 96354$
Since $96354 < m$, no modular reduction is needed: the hash of "abc" is $96354$.
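The result can be cross-checked with the `polynomial_hash` function from Section 2, reproduced here with its defaults $p = 31$, $m = 10^9 + 7$:

```python
def polynomial_hash(s, p=31, m=10**9 + 7):
    h = 0
    for ch in s:
        h = (h * p + ord(ch)) % m
    return h

print(polynomial_hash("abc"))   # 96354
```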
Problem 5. Explain why the load factor must be less than 1 for open addressing but can exceed 1 for chaining.
Answer
In open addressing, all keys are stored in the table itself. If $\alpha > 1$, there are more keys than slots, making it impossible to store all keys (the table is full). With linear probing, once the table is full the search for an empty slot may never terminate.
In chaining, each bucket can hold an arbitrary number of keys (via a linked list). The table never "fills up" — chains simply grow longer. However, performance degrades as $\alpha$ increases, since search time is proportional to chain length.
Problem 6. Prove that with double hashing, the probe sequence visits all $m$ slots before repeating if $h_2(k)$ and $m$ are coprime.
Answer
Proof. The probe sequence is: $h_1(k),\; (h_1(k) + h_2(k)) \bmod m,\; (h_1(k) + 2h_2(k)) \bmod m,\; \ldots$
This is an arithmetic progression modulo $m$ with common difference $h_2(k)$. The sequence visits $m$ distinct values as long as $h_2(k)$ is coprime with $m$. If $\gcd(h_2(k), m) = d > 1$, the sequence cycles through only $m/d$ distinct values. When $\gcd(h_2(k), m) = 1$, the multiples $0, h_2(k), 2h_2(k), \ldots, (m-1)h_2(k)$ are pairwise distinct modulo $m$, so the sequence covers all $m$ residues before repeating.
This is why $h_2$ must be chosen so that $\gcd(h_2(k), m) = 1$ for every key.
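A short sketch illustrating the coprimality requirement with a composite modulus $m = 12$ (values chosen for this example):

```python
def probe_sequence(h1, h2, m):
    """Slots visited by h1 + i*h2 (mod m) for i = 0..m-1."""
    return [(h1 + i * h2) % m for i in range(m)]

# gcd(h2, m) = 1: all 12 slots are visited
print(len(set(probe_sequence(0, 5, 12))))   # 12

# gcd(h2, m) = 4: only m / gcd = 3 distinct slots are ever visited
print(len(set(probe_sequence(0, 4, 12))))   # 3
```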
Problem 7. A hash table with chaining has 1000 slots and contains 500 elements. What is the expected number of probes for a successful search? For an unsuccessful search?
Answer
Load factor: $\alpha = 500/1000 = 0.5$.
Unsuccessful search: Expected comparisons $= \alpha = 0.5$ (we traverse the entire chain, whose average length is $\alpha$).
Successful search: Expected probes $\approx 1 + \alpha/2 = 1.25$.
More precisely: for a successful search, we examine half the chain on average (the target is equally likely to be at any position in the chain). Expected chain length $= \alpha = 0.5$, so expected probes $\approx 1 + \alpha/2 = 1.25$.
Problem 8. Design a hash table that maps student IDs (7-digit integers) to names. Specify the hash function, table size, and collision resolution method. Justify your choices.
Answer
Table size: $m = 10007$ (a prime number, allowing for up to ~7000 students with load factor < 0.7).
Hash function: Division method: $h(\text{id}) = \text{id} \bmod 10007$.
Collision resolution: Chaining with linked lists.
Justification:
- Division method is simple and efficient ($O(1)$)
- Prime table size reduces collisions from patterns in student IDs (e.g., sequential IDs)
- Chaining handles variable load gracefully and simplifies deletion
- Load factor < 0.7 ensures $O(1)$ average operations
For revision on complexity, see Complexity Analysis.
Problems
Problem 1. A hash table uses the division method $h(k) = k \bmod m$ with table size $m = 13$. Calculate the hash value for each of the following keys: 26, 42, 93, 17, 77, 31, 55, 20.
Hint
Apply $h(k) = k \bmod 13$ to each key. Remember that the modulo operation gives the remainder after division.
Answer
With $h(k) = k \bmod 13$:
| Key | Calculation | Hash value |
|---|---|---|
| 26 | 26 ÷ 13 = 2 r 0 | 0 |
| 42 | 42 ÷ 13 = 3 r 3 | 3 |
| 93 | 93 ÷ 13 = 7 r 2 | 2 |
| 17 | 17 ÷ 13 = 1 r 4 | 4 |
| 77 | 77 ÷ 13 = 5 r 12 | 12 |
| 31 | 31 ÷ 13 = 2 r 5 | 5 |
| 55 | 55 ÷ 13 = 4 r 3 | 3 |
| 20 | 20 ÷ 13 = 1 r 7 | 7 |
Note: 42 and 55 both hash to index 3 — a collision occurs.
Problem 2. A hash function for strings uses the polynomial rolling hash: $h(s) = \left(c_0 \cdot 31^2 + c_1 \cdot 31 + c_2\right) \bmod 3373$ for a three-character string $s = c_0 c_1 c_2$, where $c_i$ is the ASCII code of the $i$-th character. Calculate the hash value for the string "Cat".
Hint
Look up the ASCII values: C = 67, a = 97, t = 116. Apply the formula: $h = (67 \cdot 31^2 + 97 \cdot 31 + 116) \bmod 3373$.
Answer
ASCII values: C = 67, a = 97, t = 116
Calculate each term:
- $67 \cdot 31^2 = 67 \cdot 961 = 64387$
- $97 \cdot 31 = 3007$
- $116 \cdot 1 = 116$
Sum: $64387 + 3007 + 116 = 67510$, and $67510 \bmod 3373 = 67510 - 20 \cdot 3373 = 50$.
The hash value of "Cat" is 50.
Problem 3. Insert the keys 44, 17, 31, 88, 61, 5, 22 into a hash table of size 7 using linear probing with $h(k) = k \bmod 7$. Show the state of the table after each insertion.
Hint
For linear probing, if index $h(k)$ is occupied, try $(h(k) + 1) \bmod 7$, then $(h(k) + 2) \bmod 7$, etc., wrapping around using modulo 7.
Answer
| Key | $h(k)$ | Probe sequence | Final index |
|---|---|---|---|
| 44 | 2 | 2 | 2 |
| 17 | 3 | 3 | 3 |
| 31 | 3 | 3 (occupied), 4 | 4 |
| 88 | 4 | 4 (occupied), 5 | 5 |
| 61 | 5 | 5 (occupied), 6 | 6 |
| 5 | 5 | 5, 6, 0 | 0 |
| 22 | 1 | 1 | 1 |
Step-by-step table states:
After inserting 44: [—, —, 44, —, —, —, —]
After inserting 17: [—, —, 44, 17, —, —, —]
After inserting 31: [—, —, 44, 17, 31, —, —]
After inserting 88: [—, —, 44, 17, 31, 88, —]
After inserting 61: [—, —, 44, 17, 31, 88, 61]
After inserting 5: [5, —, 44, 17, 31, 88, 61]
After inserting 22: [5, 22, 44, 17, 31, 88, 61]
Final table:
| Index | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|---|
| Key | 5 | 22 | 44 | 17 | 31 | 88 | 61 |
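The whole trace can be reproduced with a short simulation (a sketch; `linear_probe_build` is a helper written for this check, using the problem's $h(k) = k \bmod 7$):

```python
def linear_probe_build(keys, m):
    """Build an open-addressed table with h(k) = k mod m and linear probing."""
    table = [None] * m
    for key in keys:
        i = 0
        while table[(key + i) % m] is not None:   # probe until an empty slot
            i += 1
        table[(key + i) % m] = key
    return table

print(linear_probe_build([44, 17, 31, 88, 61, 5, 22], 7))
# [5, 22, 44, 17, 31, 88, 61]
```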
Problem 4. Using the final hash table from Problem 3, trace the search for key 61 using linear probing. Show each probe.
Hint
To search, compute $h(k) = k \bmod 7$ and probe sequentially. Stop when you find the key or reach an empty slot.
Answer
Search for 61 with $h(61) = 61 \bmod 7 = 5$:
Probe sequence:
- Index 5: value is 88. 88 ≠ 61, continue probing.
- Index 6: value is 61. 61 = 61. Found!
The key 61 was found at index 6 after 2 probes.
Note: This illustrates a drawback of linear probing — even though $h(61) = 5$, the key is stored at index 6 due to earlier collisions. We must continue probing past occupied slots until we find the key or an empty slot.
Problem 5. Insert the keys 19, 36, 50, 5, 69, 14, 75 into a hash table of size 11 using quadratic probing with $h(k, i) = (h(k) + i^2) \bmod 11$, where $h(k) = k \bmod 11$. Show the table after all insertions.
Hint
Quadratic probing tries offsets of $i^2$ (i.e., $0, 1, 4, 9, \ldots$ for $i = 0, 1, 2, 3, \ldots$). If the target index is occupied, try $(h(k) + 1) \bmod 11$, then $(h(k) + 4) \bmod 11$, and so on.
Answer
$h(k) = k \bmod 11$, $h(k, i) = (h(k) + i^2) \bmod 11$
| Key | $h(k)$ | Probe sequence ($i = 0, 1, 2, \ldots$) | Final index |
|---|---|---|---|
| 19 | 8 | 8 | 8 |
| 36 | 3 | 3 | 3 |
| 50 | 6 | 6 | 6 |
| 5 | 5 | 5 | 5 |
| 69 | 3 | 3 occupied, $(3 + 1) \bmod 11 = 4$ | 4 |
| 14 | 3 | 3 occupied, 4 occupied, $(3 + 4) \bmod 11 = 7$ | 7 |
| 75 | 9 | 9 | 9 |
Final table:
| Index | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Key | — | — | — | 36 | 69 | 5 | 50 | 14 | 19 | 75 | — |
Load factor: $\alpha = 7/11 \approx 0.64$
Problem 6. A hash table of size 7 contains the keys {14, 21, 7, 28, 35} using the division method $h(k) = k \bmod 7$. The load factor threshold is 0.7. A new key 42 needs to be inserted.
Describe the rehashing process: choose a new table size, rehash all existing keys plus the new key,
and show the final table using linear probing.
Hint
First check if rehashing is needed. If so, double the table size (to a prime), then reinsert every key into the new table using the new modulo.
Answer
Current state: 5 keys in a table of size 7. Load factor $\alpha = 5/7 \approx 0.71$.
After inserting 42: 6 keys, $\alpha = 6/7 \approx 0.86 > 0.7$. Rehashing is needed.
Step 1: Choose new table size. Double and find the nearest prime: $2 \times 7 = 14$, but 14 is not prime. The nearest prime ≥ 14 is 17. New $m = 17$.
Step 2: Rehash all keys using $h(k) = k \bmod 17$.
| Key | $k \bmod 17$ | Index |
|---|---|---|
| 14 | 14 | 14 |
| 21 | 4 | 4 |
| 7 | 7 | 7 |
| 28 | 11 | 11 |
| 35 | 1 | 1 |
| 42 | 8 | 8 |
No collisions occur with the new table size.
Final table (size 17):
| Index | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Key | — | 35 | — | — | 21 | — | — | 7 | 42 | — | — | 28 | — | — | 14 | — | — |
New load factor: $\alpha = 6/17 \approx 0.35 < 0.7$ ✓
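A quick Python check that rehashing into $m = 17$ is collision-free:

```python
keys = [14, 21, 7, 28, 35, 42]
indices = [k % 17 for k in keys]        # new bucket for each key
print(indices)                          # [14, 4, 7, 11, 1, 8]
print(len(set(indices)) == len(keys))   # True: all indices distinct, no collisions
```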
Problem 7. A hash table using chaining has 500 buckets and currently stores 350 key-value pairs. Calculate the load factor. If the table is resized to 1001 buckets (a prime), what is the new load factor? What is the expected number of comparisons for an unsuccessful search before and after resizing?
Hint
Load factor $\alpha = n/m$. For chaining, the expected number of comparisons in an unsuccessful search is $\alpha$ (the average chain length).
Answer
Before resizing:
- Load factor: $\alpha = 350/500 = 0.7$
- Expected comparisons for unsuccessful search: $\alpha = 0.7$
After resizing to 1001 buckets:
- Load factor: $\alpha = 350/1001 \approx 0.35$
- Expected comparisons for unsuccessful search: $\alpha \approx 0.35$
For successful search, the expected comparisons before resizing is $1 + \alpha/2 = 1.35$, and after resizing is $\approx 1 + 0.35/2 \approx 1.17$.
Resizing roughly halves the expected comparison count for unsuccessful searches, demonstrating why maintaining a low load factor is important for performance.
Problem 8. Compare linear probing and chaining as collision resolution methods. For each method, discuss: (a) the best-case and worst-case time complexity for search, (b) how deletion is handled, and (c) the effect of increasing load factor on performance.
Hint
Consider how each method stores data (in the table itself vs. in linked lists) and how this affects probe sequences, memory usage, and deletion.
Answer
| Aspect | Linear Probing | Chaining |
|---|---|---|
| Best-case search | $O(1)$ — key at its hash index | $O(1)$ — key is alone in its bucket |
| Worst-case search | $O(n)$ — all keys cluster together | $O(n)$ — all keys hash to the same bucket |
| Deletion | Requires "lazy deletion" (mark slot as deleted, not empty). If simply emptied, search would break by stopping early at the gap. | $O(1)$ — simply remove the node from the linked list |
| Load factor effect | Performance degrades rapidly as $\alpha \to 1$. Above $\alpha \approx 0.7$, clustering causes significant slowdown. Must keep $\alpha < 1$. | Performance degrades gradually. Chains grow linearly with $\alpha$. No hard upper limit on $\alpha$ (but should keep $\alpha$ low for efficiency). |
| Memory | $O(m)$ — fixed array size | $O(m + n)$ — array + linked list nodes |
| Cache performance | Good — contiguous memory access | Poor — following pointers to scattered nodes |
Key trade-off: Linear probing has better cache performance but suffers from clustering and requires careful load factor management. Chaining is simpler to implement and handles deletion cleanly but uses extra memory for pointers.
Problem 9. A library system needs a hash table to store book records, where each book has a 10-digit ISBN (e.g., 9780132350884). Design a suitable hash function. Explain your choice and show worked examples for at least three ISBNs.
Hint
Consider using the division method or a digit-selection technique. Think about what properties the hash function should have (uniform distribution, efficiency, minimal collisions).
Answer
Design choice: Use the division method with a prime table size. Since ISBNs are large integers, we can select specific digit groups to reduce the number while preserving distribution.
Hash function: Extract the last 5 digits of the ISBN and use the division method with $m = 10007$ (a prime): $h(\text{ISBN}) = (\text{ISBN} \bmod 100000) \bmod 10007$
Rationale:
- Taking the last 5 digits (ISBN mod 100000) gives values from 0–99999, which is manageable
- Using a prime $m = 10007$ ensures good distribution (avoids patterns in ISBNs aligning with divisors of $m$)
- The function is $O(1)$ and deterministic
Worked examples:
| ISBN | Last 5 digits | Value | Calculation | Hash |
|---|---|---|---|---|
| 9780132350884 | 50884 | 50884 | 50884 mod 10007 = 50884 − 5×10007 = 50884 − 50035 = 849 | 849 |
| 9780596007126 | 07126 | 7126 | 7126 mod 10007 = 7126 | 7126 |
| 9780201896831 | 96831 | 96831 | 96831 mod 10007 = 96831 − 9×10007 = 96831 − 90063 = 6768 | 6768 |
This gives good distribution across the table. With $m = 10007$ and a load factor target of 0.7, the table can hold approximately 7000 books before needing resizing.
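A sketch of the design above (the function name is ours; `m = 10007` as chosen in the answer):

```python
def isbn_hash(isbn, m=10007):
    """Last five digits of the ISBN, then the division method (m prime)."""
    return (isbn % 100000) % m

print(isbn_hash(9780132350884))   # 849
print(isbn_hash(9780596007126))   # 7126
print(isbn_hash(9780201896831))   # 6768
```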
Problem 10. (Exam-style multi-step question) A hash table of size 11 uses double hashing with $h_1(k) = k \bmod 11$ and $h_2(k) = 1 + (k \bmod 9)$. The following keys are to be inserted in order: 47, 25, 63, 14, 80, 36, 52.
(a) Show the probe sequence for each key and the final state of the hash table. (b) What is the load factor of the final table? (c) Trace the search for key 80 in the final table. How many probes are needed? (d) If key 25 is deleted, explain why lazy deletion (tombstone marking) is necessary in open addressing, and show what happens if a search for key 47 is performed after deletion without tombstones.
Hint
For double hashing: $h(k, i) = (h_1(k) + i \cdot h_2(k)) \bmod 11$. Calculate $h_1(k)$ and $h_2(k)$ for each key first, then trace the probe sequence.
Answer
(a) Insertion with double hashing:
$h_1(k) = k \bmod 11$, $h_2(k) = 1 + (k \bmod 9)$
| Key | $h_1(k)$ | $h_2(k)$ | Probe sequence $(h_1 + i \cdot h_2) \bmod 11$ | Final index |
|---|---|---|---|---|
| 47 | 3 | 3 | 3 | 3 |
| 25 | 3 | 8 | 3 occupied, $(3 + 8) \bmod 11 = 0$ | 0 |
| 63 | 8 | 1 | 8 | 8 |
| 14 | 3 | 6 | 3 occupied, $(3 + 6) \bmod 11 = 9$ | 9 |
| 80 | 3 | 9 | 3 occupied, $(3 + 9) \bmod 11 = 1$ | 1 |
| 36 | 3 | 1 | 3 occupied, $(3 + 1) \bmod 11 = 4$ | 4 |
| 52 | 8 | 8 | 8 occupied, $(8 + 8) \bmod 11 = 5$ | 5 |
Final table:
| Index | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Key | 25 | 80 | — | 47 | 36 | 52 | — | — | 63 | 14 | — |
(b) Load factor: $\alpha = 7/11 \approx 0.64$
(c) Search for key 80: $h_1(80) = 3$, $h_2(80) = 9$.
Probe sequence:
- $i = 0$: index 3 → value is 47 ≠ 80, continue
- $i = 1$: index $(3 + 9) \bmod 11 = 1$ → value is 80 = 80. Found!
Key 80 found at index 1 after 2 probes.
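Parts (a) to (c) can be cross-checked with a short simulation (a sketch assuming $h_2(k) = 1 + (k \bmod 9)$, which matches every probe sequence in the worked answer):

```python
def double_hash_build(keys, m):
    """Insert keys with h(k, i) = (h1(k) + i*h2(k)) mod m,
    where h1(k) = k mod m and h2(k) = 1 + (k mod 9)."""
    table = [None] * m
    for key in keys:
        h1, h2 = key % m, 1 + (key % 9)
        i = 0
        while table[(h1 + i * h2) % m] is not None:   # probe until empty
            i += 1
        table[(h1 + i * h2) % m] = key
    return table

final = double_hash_build([47, 25, 63, 14, 80, 36, 52], 11)
print(final)   # [25, 80, None, 47, 36, 52, None, None, 63, 14, None]
```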
(d) Deletion and lazy deletion:
Searching for key 47 itself still succeeds even if index 0 is simply emptied, because $h_1(47) = 3$ and 47 sits at its home slot; the danger only affects keys whose probe sequences pass through the deleted slot. In general, any key that was inserted after 25 and probed past index 0 becomes unreachable.
Concretely (a hypothetical extra key, not in the table above): suppose key 22 had also been inserted. Then $h_1(22) = 0$, which is occupied by 25, so 22 would be stored further along its probe sequence. If 25 is later deleted by marking index 0 as empty, a search for 22 computes $h_1(22) = 0$, finds an empty slot, and stops immediately, incorrectly concluding that 22 is not in the table.
Solution: Use lazy deletion (tombstone). Mark index 0 as DELETED (not EMPTY). During search, treat DELETED slots as occupied (continue probing past them). During insert, treat DELETED slots as available for insertion.