线性哈希是一种动态扩展哈希表的方法,其“线性”的名字源于这种方法每次只扩展一个Bucket的容量。
这种方法需要两个哈希函数。
At any given point of time, this method works with at most two hashing functions. The hash function changes its nature underneath dynamically and the hash table algorithms take care of using the right hashing algorithm for insert(), lookup(), remove() operations on a given key. Linear Hashing uses the chaining or overflow bucket approach. This along with the fact that hash table size grows by exactly one bucket at a time are two important facts about this method.
线性哈希的数学原理
假设我们有5,9,13
这三个Key,还有两个哈希函数,分别为 H1(KEY) = KEY % 4
,H2(KEY) = KEY % 8
。
现在我们使用这两个哈希函数来给Key计算Value:
Key | H1(Key) | H2(Key) |
---|---|---|
5 | 1 | 5 |
9 | 1 | 1 |
13 | 1 | 5 |
显然可见,如果Key % n = M
,那么Key % 2n = M 或 M+n
概念要点
- 哈希表的起始容量为
N
(N为整数,且为2的 I 次幂,即N=2^{I}
),其中N不包括浮动的 Bucket; - Bucket 标号为 0~N-1;
- 给定一个Key,通过哈希函数H()得到一个字节串,我们称该串为R;
- 我们使用字节串R中的低 I 位作为 Bucket 序号,该Bucket负责存储该Key。例如H(2)=010,I=2,则 Bucket[10] 负责存储(2,Value);
- 向已满的 Bucket 中插入数据,会造成“分裂”,我们用一个指针
S
来标识将被分裂的 Bucket; - 负载因子
load factor = (哈希中的KV数)/(Bucket数量)
表示哈希表“满的程度”,通常情况下,我们用这个参数来触发“分裂”。 - 分裂 Bucket B 时,先建立一个 Bucket B*,然后对 B 中的内容进行 rehash,该操作会将其中的内容分配到 B 和 B* 两个 Bucket 中。分裂后,标识分裂位置的指针向后移动。
线性哈希最重点的一个特点是:被分裂的 Bucket 不一定是浮动 Bucket,因为分裂的顺序是线性的(指针S
的移动是线性的),而且触发分裂的条件可以任意选择。
在可扩展哈希(extendible hash)中,被分裂的 Bucket 是浮动Bucket。
例子
初始哈希表有两个 bucket,序号分别为0和1。假设每个 bucket 可以存储两个键值对。
我们使用取余函数作为哈希函数。现在执行下列4个操作:
操作 | KEY | R=H(KEY) |
---|---|---|
Put | 2 | 01 0 |
Put | 1 | 00 1 |
Put | 3 | 01 1 |
Put | 4 | 10 0 |
R 的最低位即存储该 KEY 的 bucket 序号,也是 H1(K) = K % 2
的结果。
记住:
- 哈希表的 size 为
$N=2^{1}$
; - I = 1,所以我们使用字节串中的最低1位来选择Bucket。
分裂指针 | Bucket 序号 | KEYs | 浮动 KEYs |
---|---|---|---|
* | 0 | 2,4 | |
1 | 1,3 |
再执行一次插入,触发分裂:
操作 | KEY | R=H(KEY) |
---|---|---|
Put | 5 | 10 1 |
KEY 5 应该被存储在1号Bucket中,该Bucket已满,创建一个浮动Bucket来存储它:
分裂指针 | Bucket 序号 | KEYs | 浮动 KEYs |
---|---|---|---|
* | 0 | 2,4 | |
1 | 1,3 | 5 |
分裂被指针S指向的0号Bucket,为方便说明,下面将该Bucket称为B,先建立一个新Bucket B*:
分裂指针 | Bucket 序号 | KEYs | 浮动 KEYs |
---|---|---|---|
* | 0 (B) | 2,4 | |
1 | 1,3 | 5 | |
x (B*) |
对B中的内容进行rehash:
- This is done using an alternate hash function H2(Key)=Key%(2*N). This is because we no longer have 2 buckets but 3 now and so I should be 2.
- Another representation of H2(Key)=Key%(2*N): using one more bit of bitstring of the Key.
- 2=(010), 4=(100), so store [2,V] in bucket[10] and [4,V] in bucket[00].
- Increment S and now it points to bucket[1].
Split pointer | Bucket number | Records | Overflow KEYs |
---|---|---|---|
00 | 4 | ||
* | 1 | 1,3 | 5 |
10 | 2 |
Let\'s talk further about split().
Operation | KEY | R=H(KEY) |
---|---|---|
Put | 14 | 1110 |
Wait, which is our choice? H1() or H2()? H1(14)=0 (00). The result is smaller than(before) split pointer. So H2(14)=2 (10). The [14, V] goes to bucket[10] then.
Split pointer | Bucket number | Records | Overflow KEYs |
---|---|---|---|
- | 00 | 4 | |
* | 1 | 1,3 | 5 |
- | 10 | 2,14 |
New request comes:
Operation | KEY | R=H(KEY) |
---|---|---|
Get | 2 | 0010 |
H1(2)=0 (00) is smaller than(before) split pointer. So H2(2) = (10). Bucket[10] stores [2, V]! Hence, the split pointer not only indicates which bucket to split but also is a flag which tells us what should we choose between H1() or H2(). Let\'s do one more Put().
Operation | KEY | R=H(KEY) |
---|---|---|
Put | 15 | 1111 |
H1(15)=1 (01). Why we just use H1()? It is because the resulting bucket[01] has not been split(in the current round).
- We create a new overflow bucket, allocate memory for it, and then link it to the chain.
Split pointer | Bucket number | Records | Overflow KEYs |
---|---|---|---|
00 | 4 | ||
* | 1 | 1,3 | [5], [15] |
10 | 2,14 |
Note that [5] and [15] are stored in two overflow buckets. - Split bucket[1]
- Now split pointer points to bucket B
- Create a new bucket B*
- Rehash items in bucket B
- We have 4 buckets now, I = 2 is enough. So H2() is still valid now.
- H2(1)=01, H2(3)=11, H2(5)=01, H2(15)=11
Spliting:
Split pointer | Bucket number | Records | Overflow KEYs |
---|---|---|---|
00 | 4 | ||
* | 1 (B) | 1,3 | [5], [15] |
10 | 2,14 | ||
x (B*) |
After split:
Split pointer | Bucket number | Records | Overflow KEYs |
---|---|---|---|
* | 00 | 4 | |
01 (B) | 1,5 | ||
10 | 2,14 | ||
11 (B*) | 3,15 |
We started out with N=2 buckets. Now both the buckets are split, and hash table size grew to 4 buckets. This completes a single round of linear hashing. Now N = 4, and the split pointer points to bucket[00]. We can start a new round. In the new round, H1(Key) = old H2(Key) = Key%4, H2(Key) = Key%8. That means we use 2 LSBs(before split) or 4 LSBs(after split) as bucket index. Now have a look of another round.
Operation | KEY | R=H(KEY) |
---|---|---|
Put | 8 | 1000 |
Put | 12 | 1100 |
Split pointer | Bucket number | Records | Overflow KEYs |
---|---|---|---|
* | 00 | 4,8 | 12 |
- | 01 | 1,5 | |
- | 10 | 2,14 | |
- | 11 | 3,15 |
Split
Split pointer | Bucket number | Records | Overflow KEYs |
---|---|---|---|
* | 000 (B) | 4,8 | 12 |
01 | 1,5 | ||
10 | 2,14 | ||
11 | 3,15 | ||
x (B*) |
4: 01008: 100012:1100
Split pointer | Bucket number | Records | Overflow KEYs |
---|---|---|---|
000 (B) | 4 | ||
* | 01 | 1,5 | |
10 | 2,14 | ||
11 | 3,15 | ||
100 (B*) | 4,12 |
This round of linear hashing will continue until buckets[01] [10] [11] are split, and the table at the end would be 8 buckets.