In this blog post, I will give an introduction to a hashing methodology called Linear Hashing, which is one of the Dramatic Hashing methods.

Linear Hashing

Linear Hashing is a dynamic hashing scheme. It allows the hash table size to grow in a linear fashion; one bucket at a time, and that is where the method gets its name from. At any given point of time, this method works with at most two hashing functions. The hash function changes its nature underneath dynamically and the hash table algorithms take care of using the right hashing algorithm for insert(), lookup(), remove() operations on a given key.

Linear Hashing uses the chaining or overflow bucket approach. This along with the fact that hash table size grows by exactly "one bucket" at a time are two important facts about this method.

The mathematical fundamental of Linear Hashing

We assume that there is a sequence of KEYs: 5, 9 13 and we have the following two hash functions:

$$H1(KEY) = KEY \% 4$$
$$H2(KEY) = KEY \% 8$$

Now calculate the VALUEs with the two hash functions:

H1():

$$5,9,13 \% 4 = 1$$

H2():

$$5,13 \% 8 = 5$$
$$9 \% 8=1$$

We can conclude from the above equations that:

$$KEY \% n = M$$
$$KEY \% 2n = M \ or \ M \ + \ n$$

High-level points

  1. We start with “N” as the initial size of the hash table where “N” is some power of $2$. Note that “N” is the number of main hash buckets in the hash table and not the overflow buckets.
  2. Buckets are numbered through 0 to N-1
  3. $N=2^{I}$, obviously "I" is an integer.
  4. The hash function $H_{i}(K)$ will first compute a bitstring R using the KEY K.
  5. We use the least "I" bits in the bitstring R as the bucket index for the given KEY K.
    An example: H(2) = 010, I = 2, where we store the record [2,Value] in bucket[10].
  6. An attempt to insert a record into a full bucket will cause a split(or you may say resize). A split pointer S indicates which bucket to split.
  7. $load factor = (number of items in the hash table)/(number of buckets)$. Load factor indicates the degree of fullness. Usually we use this load factor as the trigger for a split.
  8. When a resize happens, we split a particular bucket B by create a new bucket B*, and rehashing the contents of B, after which the contents will be distributed among B and B*. The split pointer is incremented to point to the subsequent bucket after every split.
  9. A threshold value of load factor L is set (say 75% or something). It implies that whenever the hash table is 75% full, the hash table will be resized. In case of any hashing scheme, this would mean increasing the size of the hash table by a certain number of new buckets followed by a rehash. However, in the case of linear hashing, this would imply:

    • Creating a new bucket B where B = N + B; N is the then size of the hash table, and B is the bucket we are splitting.
    • Rehashing the contents of bucket B (pointed to by S) between B and B*. This is what is meant by the split.

One of the most important facts about Linear Hashing that distinguishes the technique from its counterparts is "the bucket that is split is not necessarily the bucket that overflowed".

  • The split is done in a linear bucket number order. The trigger for the split can be anything. It may even be "every time we attach an overflow bucket to a particular bucket’s chain, we split". But yes, the bucket that we will split will always be the one pointed to by S which may not be the bucket that just overflowed.
  • This is completely different from extendible hashing technique where we always split the bucket that overflowed.

A simple example

Let's start with a simple example where the hash table has 2 buckets numbered 0 and 1. We assume that each bucket can store 2 records (KV items) before we chain out and create an overflow bucket. Similarly, each overflow bucket will store only 1 record.

We use MOD as the hash function and integers as the KEYs for the hash table.

Following operations are carries out:

OperationKEYR=H(KEY)
Put2010
Put1001
Put3011
Put4100

The hignlighted LSB in the bitstring R is really the bucket index and is also the result of hash function H1(K)=K%2.

Remember that:

  • The hash table's size is $N=2^{1}$,
  • so I = 1, and we use the least one bit of R as the bucket index for the given key

In order to make the representation simple, the value (say V) of each [K, V] record is hidden in the following tables.

Split pointerBucket numberRecordsOverflow KEYs
*02,4
-11,3

Now we Get(2), H(2) = 0 => bucket 0

Let's look into split().

To trigger split, let's do one more Put():

OperationKEYR=H(KEY)
Put5101

H(5)=5%2=1 (or the least one bit of R is 1). Bucket 1 is full, so create an overflow bucket and allocate memory for it, then store this KV pair in this new bucket. Link the new bucket to the overflow chain of bucket 1.

Now the hash table looks like:

Split pointerBucket numberRecordsOverflow KEYs
*02,4
-11,35

We split the bucket pointed by split pointer S (bucket 0) before returning from Put().

  • Create a new bucket B*
Split pointerBucket numberRecordsOverflow KEYs
*0 (B)2,4
-11,35
-x (B*)
  • Rehash items in B

    • This is done using an alternate hash function H2(Key)=Key%(2*N). This is because we no longer have 2 buckets but 3 now and so "I" should be 2.
    • Another representation of H2(Key)=Key%(2*N): using one more bit of bitstring of the "Key".
    • 2=(010), 4=(100), so store [2,V] in bucket[10] and [4,V] in bucket[00].
  • Increment S and now it points to bucket[1].
Split pointerBucket numberRecordsOverflow KEYs
-004
*11,35
-102

Let's talk further about split().

OperationKEYR=H(KEY)
Put141110

Wait, which is our choice? H1() or H2()?

H1(14)=0 (00). The result is smaller than(before) split pointer.
So H2(14)=2 (10). The [14, V] goes to bucket[10] then.

Split pointerBucket numberRecordsOverflow KEYs
-004
*11,35
-102,14

New request comes:

OperationKEYR=H(KEY)
Get20010

H1(2)=0 (00) is smaller than(before) split pointer.
So H2(2) = (10). Bucket[10] stores [2, V]!

Hence, the split pointer not only indicates which bucket to split but also is a flag which tells us what should we choose between H1() or H2().

Let's do one more Put().

OperationKEYR=H(KEY)
Put151111

H1(15)=1 (01). Why we just use H1()? It is because the resulting bucket[01] has not been split(in the current round).

  • We create a new overflow bucket, allocate memory for it, and then link it to the chain.
Split pointerBucket numberRecordsOverflow KEYs
-004
*11,3[5], [15]
-102,14

Note that [5] and [15] are stored in two overflow buckets.

  • Split bucket[1]

    • Now split pointer points to bucket B
    • Create a new bucket B*
    • Rehash items in bucket B
    • We have 4 buckets now, I = 2 is enough. So H2() is still valid now.
    • H2(1)=01, H2(3)=11, H2(5)=01, H2(15)=11

Spliting:

Split pointerBucket numberRecordsOverflow KEYs
-004
*1 (B)1,3[5], [15]
-102,14
-x (B*)

After split:

Split pointerBucket numberRecordsOverflow KEYs
*004
-01 (B)1,5
-102,14
-11 (B*)3,15

We started out with N=2 buckets. Now both the buckets are split, and hash table size grew to 4 buckets. This completes a single round of linear hashing.

Now N = 4, and the split pointer points to bucket[00]. We can start a new round.
In the new round, H1(Key) = old H2(Key) = Key%4, H2(Key) = Key%8. That means we use 2 LSBs(before split) or 4 LSBs(after split) as bucket index.

Now have a look of another round.

OperationKEYR=H(KEY)
Put81000
Put121100
Split pointerBucket numberRecordsOverflow KEYs
*004,812
-011,5
-102,14
-113,15

Split

Split pointerBucket numberRecordsOverflow KEYs
*000 (B)4,812
-011,5
-102,14
-113,15
-x (B*)

4: 0100
8: 1000
12:1100

Split pointerBucket numberRecordsOverflow KEYs
-000 (B)4
*011,5
-102,14
-113,15
-100 (B*)4,12

This round of linear hashing will continue until buckets[01] [10] [11] are split, and the table at the end would be 8 buckets.