Linear Hashing

In this blog post, I will give an introduction to a hashing methodology called Linear Hashing, which is one of the Dramatic Hashing methods.

Linear Hashing

Linear Hashing is a dynamic hashing scheme. It allows the hash table size to grow in a linear fashion; one bucket at a time, and that is where the method gets its name from. At any given point of time, this method works with at most two hashing functions. The hash function changes its nature underneath dynamically and the hash table algorithms take care of using the right hashing algorithm for insert(), lookup(), remove() operations on a given key.

Linear Hashing uses the chaining or overflow bucket approach. This along with the fact that hash table size grows by exactly "one bucket" at a time are two important facts about this method.

The mathematical fundamental of Linear Hashing

We assume that there is a sequence of KEYs: 5, 9 13 and we have the following two hash functions:

$$H1(KEY) = KEY \% 4$$
$$H2(KEY) = KEY \% 8$$

Now calculate the VALUEs with the two hash functions:

H1():

$$5,9,13 \% 4 = 1$$

H2():

$$5,13 \% 8 = 5$$
$$9 \% 8=1$$

We can conclude from the above equations that:

$$KEY \% n = M$$
$$KEY \% 2n = M \ or \ M \ + \ n$$

High-level points

  1. We start with “N” as the initial size of the hash table where “N” is some power of $2$. Note that “N” is the number of main hash buckets in the hash table and not the overflow buckets.
  2. Buckets are numbered through 0 to N-1
  3. \(N=2^{I}\), obviously "I" is an integer.
  4. The hash function $H_{i}(K)$ will first compute a bitstring R using the KEY K.
  5. We use the least "I" bits in the bitstring R as the bucket index for the given KEY K.
    An example: H(2) = 010, I = 2, where we store the record [2,Value] in bucket[10].
  6. An attempt to insert a record into a full bucket will cause a split(or you may say resize). A split pointer S indicates which bucket to split.
  7. $load factor = (number of items in the hash table)/(number of buckets)$. Load factor indicates the degree of fullness. Usually we use this load factor as the trigger for a split.
  8. When a resize happens, we split a particular bucket B by create a new bucket B*, and rehashing the contents of B, after which the contents will be distributed among B and B*. The split pointer is incremented to point to the subsequent bucket after every split.
  9. A threshold value of load factor L is set (say 75% or something). It implies that whenever the hash table is 75% full, the hash table will be resized. In case of any hashing scheme, this would mean increasing the size of the hash table by a certain number of new buckets followed by a rehash. However, in the case of linear hashing, this would imply:
    • Creating a new bucket B where B = N + B; N is the then size of the hash table, and B is the bucket we are splitting.
    • Rehashing the contents of bucket B (pointed to by S) between B and B*. This is what is meant by the split.

One of the most important facts about Linear Hashing that distinguishes the technique from its counterparts is "the bucket that is split is not necessarily the bucket that overflowed".

  • The split is done in a linear bucket number order. The trigger for the split can be anything. It may even be "every time we attach an overflow bucket to a particular bucket’s chain, we split". But yes, the bucket that we will split will always be the one pointed to by S which may not be the bucket that just overflowed.
  • This is completely different from extendible hashing technique where we always split the bucket that overflowed.

A simple example

Let's start with a simple example where the hash table has 2 buckets numbered 0 and 1. We assume that each bucket can store 2 records (KV items) before we chain out and create an overflow bucket. Similarly, each overflow bucket will store only 1 record.

We use MOD as the hash function and integers as the KEYs for the hash table.

Following operations are carries out:

Operation KEY R=H(KEY)
Put 2 010
Put 1 001
Put 3 011
Put 4 100

The hignlighted LSB in the bitstring R is really the bucket index and is also the result of hash function H1(K)=K%2.

Remember that:

  • The hash table's size is $N=2^{1}$,
  • so I = 1, and we use the least one bit of R as the bucket index for the given key

In order to make the representation simple, the value (say V) of each [K, V] record is hidden in the following tables.

Split pointer Bucket number Records Overflow KEYs
* 0 2,4
- 1 1,3

Now we Get(2), H(2) = 0 => bucket 0

Let's look into split().

To trigger split, let's do one more Put():

Operation KEY R=H(KEY)
Put 5 101

H(5)=5%2=1 (or the least one bit of R is 1). Bucket 1 is full, so create an overflow bucket and allocate memory for it, then store this KV pair in this new bucket. Link the new bucket to the overflow chain of bucket 1.

Now the hash table looks like:

Split pointer Bucket number Records Overflow KEYs
* 0 2,4
- 1 1,3 5

We split the bucket pointed by split pointer S (bucket 0) before returning from Put().

  • Create a new bucket B*
Split pointer Bucket number Records Overflow KEYs
* 0 (B) 2,4
- 1 1,3 5
- x (B*)
  • Rehash items in B
    • This is done using an alternate hash function H2(Key)=Key%(2*N). This is because we no longer have 2 buckets but 3 now and so "I" should be 2.
    • Another representation of H2(Key)=Key%(2*N): using one more bit of bitstring of the "Key".
    • 2=(010), 4=(100), so store [2,V] in bucket[10] and [4,V] in bucket[00].
  • Increment S and now it points to bucket[1].
Split pointer Bucket number Records Overflow KEYs
- 00 4
* 1 1,3 5
- 10 2

Let's talk further about split().

Operation KEY R=H(KEY)
Put 14 1110

Wait, which is our choice? H1() or H2()?

H1(14)=0 (00). The result is smaller than(before) split pointer.
So H2(14)=2 (10). The [14, V] goes to bucket[10] then.

Split pointer Bucket number Records Overflow KEYs
- 00 4
* 1 1,3 5
- 10 2,14

New request comes:

Operation KEY R=H(KEY)
Get 2 0010

H1(2)=0 (00) is smaller than(before) split pointer.
So H2(2) = (10). Bucket[10] stores [2, V]!

Hence, the split pointer not only indicates which bucket to split but also is a flag which tells us what should we choose between H1() or H2().

Let's do one more Put().

Operation KEY R=H(KEY)
Put 15 1111

H1(15)=1 (01). Why we just use H1()? It is because the resulting bucket[01] has not been split(in the current round).

  • We create a new overflow bucket, allocate memory for it, and then link it to the chain.
Split pointer Bucket number Records Overflow KEYs
- 00 4
* 1 1,3 [5], [15]
- 10 2,14

Note that [5] and [15] are stored in two overflow buckets.

  • Split bucket[1]
    • Now split pointer points to bucket B
    • Create a new bucket B*
    • Rehash items in bucket B
    • We have 4 buckets now, I = 2 is enough. So H2() is still valid now.
    • H2(1)=01, H2(3)=11, H2(5)=01, H2(15)=11

Spliting:

Split pointer Bucket number Records Overflow KEYs
- 00 4
* 1 (B) 1,3 [5], [15]
- 10 2,14
- x (B*)

After split:

Split pointer Bucket number Records Overflow KEYs
* 00 4
- 01 (B) 1,5
- 10 2,14
- 11 (B*) 3,15

We started out with N=2 buckets. Now both the buckets are split, and hash table size grew to 4 buckets. This completes a single round of linear hashing.

Now N = 4, and the split pointer points to bucket[00]. We can start a new round.
In the new round, H1(Key) = old H2(Key) = Key%4, H2(Key) = Key%8. That means we use 2 LSBs(before split) or 4 LSBs(after split) as bucket index.

Now have a look of another round.

Operation KEY R=H(KEY)
Put 8 1000
Put 12 1100
Split pointer Bucket number Records Overflow KEYs
* 00 4,8 12
- 01 1,5
- 10 2,14
- 11 3,15

Split

Split pointer Bucket number Records Overflow KEYs
* 000 (B) 4,8 12
- 01 1,5
- 10 2,14
- 11 3,15
- x (B*)

4: 0100
8: 1000
12:1100

Split pointer Bucket number Records Overflow KEYs
- 000 (B) 4
* 01 1,5
- 10 2,14
- 11 3,15
- 100 (B*) 4,12

This round of linear hashing will continue until buckets[01] [10] [11] are split, and the table at the end would be 8 buckets.

发表评论

电子邮件地址不会被公开。 必填项已用*标注

微信扫一扫,分享到朋友圈

Linear Hashing
返回顶部

显示

忘记密码?

显示

显示

获取验证码

Close