CPSC 461: Copyright © 2002 Katrin Becker 1998-2002 Last Modified August 5, 2000 11:52 AM

Collision Resolution

Unless your list is static and you have created a perfect hashing function, there will be collisions, so every hashing algorithm must have a way of dealing with them. Keys that hash to the same home address are called synonyms. A method under which records pile up around the same addresses, so that collisions lead to further collisions, is said to exhibit primary clustering. A probe is an access to a distinct location (i.e. one look into the list).

Linear Probing: (also called progressive overflow) one of the simplest methods

- scan the file in a circular fashion, placing the synonym in the next available location
- simple, easy to maintain but not very efficient
- searching for a key not found could result in a search through the entire file (!)
- one improvement involves a 2-pass load: the first pass loads every record whose home address is still free, and the second pass loads the synonyms.
- tends to produce primary clustering
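
The points above can be sketched in Python as follows. This is a minimal illustration rather than part of the original notes: the hash function `key % n`, the integer keys, and the names `lp_insert` and `lp_search` are all assumptions chosen for the example.

```python
EMPTY = None

def lp_insert(table, key):
    """Insert key with linear probing; return the number of probes used."""
    n = len(table)
    home = key % n                      # assumed hash function: key mod table size
    for probes in range(1, n + 1):
        slot = (home + probes - 1) % n  # circular scan from the home address
        if table[slot] is EMPTY:
            table[slot] = key
            return probes
        if table[slot] == key:
            raise ValueError("duplicate record")
    raise OverflowError("full table")

def lp_search(table, key):
    """Return the slot holding key, or None; stops at the first empty slot."""
    n = len(table)
    home = key % n
    for offset in range(n):
        slot = (home + offset) % n
        if table[slot] is EMPTY:
            return None
        if table[slot] == key:
            return slot
    return None                         # searched the entire table

table = [EMPTY] * 7
for k in (10, 17, 24, 5):
    lp_insert(table, k)
print(table)
```

In this run 10, 17 and 24 are synonyms (home address 3) and fill slots 3, 4 and 5. Key 5 is not a synonym of theirs, yet it is pushed to slot 6 because the run of synonyms has grown over its home address, which is how primary clustering builds up.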
 
Linear Quotient: (also called double hashing) similar to progressive overflow but it uses a variable increment rather than one to go to the next location
- when you have a synonym, hash again to get the next location (use a different function)
- the next synonym can use the same second function again or one modified by the number of synonyms (this tends to randomize better: sometimes called quadratic probing)
- a potential problem to look out for: it is possible to think we have searched the entire file and found it full when actually we have simply skipped over available spaces. We can avoid this by keeping track of the number of locations searched.
Variation: truncate the data (e.g. use the last digit) and use that as your step value
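
A minimal Python sketch of the step-size form of this method is shown here. It rests on assumptions that are not in the notes: integer keys, home address `key % n`, a step taken from the quotient `key // n` (forced to be nonzero), and a prime table size so that every step value eventually visits every slot. The names `quotient_step`, `dh_insert` and `dh_search` are made up for the example.

```python
def quotient_step(key, n):
    """Assumed second hash: the quotient, reduced mod n and never zero."""
    return (key // n) % n or 1

def dh_insert(table, key):
    """Insert key using a key-dependent increment; return probes used."""
    n = len(table)                      # assumed prime
    slot = key % n
    step = quotient_step(key, n)
    for probes in range(1, n + 1):      # count locations searched
        if table[slot] is None:
            table[slot] = key
            return probes
        if table[slot] == key:
            raise ValueError("duplicate record")
        slot = (slot + step) % n
    raise OverflowError("full table")

def dh_search(table, key):
    """Follow the same probe sequence; stop at an empty slot."""
    n = len(table)
    slot, step = key % n, quotient_step(key, n)
    for _ in range(n):
        if table[slot] is None:
            return None
        if table[slot] == key:
            return slot
        slot = (slot + step) % n
    return None

table = [None] * 7
for k in (10, 17, 24):
    dh_insert(table, k)
print(table)
```

The three keys are synonyms at address 3, but their steps differ (1, 2 and 3), so they land in slots 3, 5 and 6 instead of forming one contiguous run. Counting probes up to the table size is the safeguard mentioned above against wrongly concluding that the table is full.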

Separate Overflow Area (cellar): set aside a separate area for overflow records
- remember to have the hashing function generate addresses only for the primary area.

Synonym Chaining: (also called Chained Progressive Overflow) use pointers to connect synonyms
- needs an extra field but search length is cut dramatically for keys not found
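
As a rough illustration, the following Python sketch combines a 2-pass load with pointer chains. The details are assumptions rather than the notes' own algorithm: keys are distinct integers, the hash is `key % n`, synonyms are placed by a linear scan for the next free slot, and the names `chained_load` and `chained_search` are hypothetical.

```python
def chained_load(keys, n):
    """Two-pass load; returns (slots, link), where link is the extra field."""
    slots = [None] * n
    link = [None] * n                   # index of the next synonym, if any
    synonyms = []
    for key in keys:                    # pass 1: records that get their home address
        home = key % n
        if slots[home] is None:
            slots[home] = key
        else:
            synonyms.append(key)
    for key in synonyms:                # pass 2: synonyms, linked onto the chain
        tail = key % n
        while link[tail] is not None:   # walk to the end of the synonym chain
            tail = link[tail]
        slot = tail
        for _ in range(n):              # next available location, circularly
            slot = (slot + 1) % n
            if slots[slot] is None:
                break
        else:
            raise OverflowError("full table")
        slots[slot] = key
        link[tail] = slot
    return slots, link

def chained_search(slots, link, key):
    """Follow links from the home address only; no scan of unrelated slots."""
    slot = key % len(slots)
    while slot is not None:
        if slots[slot] == key:
            return slot
        slot = link[slot]
    return None

slots, link = chained_load([10, 17, 5, 24], 7)
print(slots, link)
```

A search for the absent key 31 (home address 3) visits only the three records on that chain and then stops, instead of scanning on until it happens to reach an empty slot.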

Coalesced Hashing: (variation called Direct Chaining): Uses synonym chaining but when a key hashes to a location that is already used by one that is NOT a synonym it is added to the end of the current tenant's chain. This method finds the next available address by looking from the bottom up by convention. There are many choices for how to build the chain and choose the locations of overflow records:

  1. LISCH (late insertion standard coalesced hashing) - at the end
  2. LICH (late insertion coalesced hashing) - at the end using the cellar
  3. EISCH (early insertion standard coalesced hashing) - at the beginning
  4. EICH (early insertion coalesced hashing) - at the beginning, with a cellar
  5. REISCH (R for random) - choosing a random location for the overflow record
  6. BEISCH (B for bi-directional) - look up first, then down, then up, then down, etc.
  7. Direct Chaining (DCWC - direct chaining without coalescing) - moving records not stored at their home address.
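
The first variant, LISCH, can be sketched in Python roughly as follows. This is an illustrative reading of the description above under stated assumptions: integer keys, hash `key % n`, no deletions, and a hypothetical class name `Lisch`. Free slots are found by scanning from the highest address downward, following the bottom-up convention.

```python
class Lisch:
    """Late-insertion coalesced hashing without a cellar (sketch)."""

    def __init__(self, n):
        self.keys = [None] * n
        self.link = [None] * n          # pointer field for the chain
        self.free = n - 1               # bottom-up scan position for empty slots

    def insert(self, key):
        slot = key % len(self.keys)     # assumed hash function
        if self.keys[slot] is None:
            self.keys[slot] = key
            return slot
        while True:                     # walk to the end of the current tenant's chain
            if self.keys[slot] == key:
                raise ValueError("duplicate record")
            if self.link[slot] is None:
                break
            slot = self.link[slot]
        while self.free >= 0 and self.keys[self.free] is not None:
            self.free -= 1
        if self.free < 0:
            raise OverflowError("full table")
        self.keys[self.free] = key
        self.link[slot] = self.free     # late insertion: attach at the end of the chain
        return self.free

    def search(self, key):
        slot = key % len(self.keys)
        while slot is not None and self.keys[slot] is not None:
            if self.keys[slot] == key:
                return slot
            slot = self.link[slot]
        return None

t = Lisch(7)
for k in (10, 17, 6, 24):
    t.insert(k)
print(t.keys, t.link)
```

Key 6 hashes to slot 6, which is already held by 17 (a synonym of 10, not of 6), so 6 is appended to that tenant's chain. The chains for addresses 3 and 6 have coalesced into 3, 6, 5, 4, and 24 ends up three links away from its home address.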

Computed Chaining: (DYNAMIC COLLISION RESOLUTION - keys can be moved once stored) Compute the location of the next key in the chain instead of storing its full address: this has the advantage of not needing a big extra field.

The Algorithm:

  1. Hash the key of the record to be inserted to obtain the home address for storing the record.
  2. If the home address is empty, insert the record at that location.
  3. If the record is a duplicate, terminate with a "duplicate record" message.
  4. If the record stored at the hashed location is not at its home address, move it to the next empty location found by stepping through the table using the increment associated with its predecessor element, and then insert the incoming record into the hashed location, else
    1. Locate the end of the probe chain and in the process, check for a duplicate record.
    2. Use the increment associated with the last item in the probe chain to find an empty location for the incoming record. In the process, check for a full table.
    3. Set the pseudolink at the position of the predecessor record to connect to the empty location.
    4. Insert the record in the empty location.

Brent's Method: Dynamic collision resolution

The Algorithm:

  1. Hash the key of the record to be inserted to obtain the home address for storing the record.
  2. If the home address is empty, insert the record at that location, else
    1. Compute the next potential address for storing the incoming record. Initialize S ← 2.
    2. While potential storage address is not empty,
      1. Check if it is the home address. If it is, the table is full, terminate with a "full table" message.
      2. If the record stored at the potential storage address is the same as the incoming record, terminate with a "duplicate record" message.
      3. Compute the next potential address for storing the incoming record. Set S ← S + 1.

/* Attempt to move a record previously inserted */

  1. Initialize i ← 1 and j ← 1.
  2. While (i + j < S)
    1. Determine if the record stored at the ith position on the primary probe chain can be moved j offsets along its secondary probe chain.
    2. If it can be moved, then
      1. Move it and insert the incoming record into the vacated position i along its primary probe chain; terminate with a successful insertion, else
      2. Vary i and/or j to minimize the sum (i + j); when two combinations give the same sum, minimize on i.

/* Moving has failed */

  1. Insert the incoming record at the empty address found at the end of its primary probe chain.

The table below shows the progressive values of i and j as the search along the primary and secondary probe chains progresses. For a move to be viable, sum(i,j) must be less than S (S is the chain length for the incoming record if nothing is moved). i is the position of a placed record along the primary probe chain, and j is the number of offsets that record would be moved along its own probe chain (i.e. the secondary probe chain) should we decide to move it. The general idea is to reduce the total of the chain lengths by re-arranging keys.
 
       i   1  1  2  1  2  3  1  2  3  4  1  2  3  4  5
       j   1  2  1  3  2  1  4  3  2  1  5  4  3  2  1
sum(i,j)   2  3  3  4  4  4  5  5  5  5  6  6  6  6  6
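
To make the procedure concrete, here is a short Python sketch of an insertion in the spirit of Brent's method. It is a sketch under assumptions, not a definitive implementation: keys are integers, the home address is `key % n`, the step is the quotient `key // n` reduced mod n and forced nonzero, the table size is prime, and there are no deletions. The names `step_of`, `brent_insert` and `probes_to_find` are hypothetical.

```python
def step_of(key, n):
    """Assumed second hash giving a nonzero step size."""
    return (key // n) % n or 1

def brent_insert(table, key):
    n = len(table)                          # assumed prime
    home, step = key % n, step_of(key, n)
    if table[home] is None:
        table[home] = key
        return
    if table[home] == key:
        raise ValueError("duplicate record")
    # Walk the primary probe chain; s = probes needed if nothing is moved.
    s, addr = 2, (home + step) % n
    while table[addr] is not None:
        if addr == home:
            raise OverflowError("full table")
        if table[addr] == key:
            raise ValueError("duplicate record")
        addr = (addr + step) % n
        s += 1
    # Attempt to move a resident: increasing i + j, smaller i first.
    for total in range(2, s):               # only sums with i + j < s
        for i in range(1, total):
            j = total - i
            pos = (home + (i - 1) * step) % n        # i-th position on primary chain
            resident = table[pos]
            target = (pos + j * step_of(resident, n)) % n
            if table[target] is None:
                table[target] = resident    # move resident j offsets along its chain
                table[pos] = key            # incoming record takes the vacated slot
                return
    table[addr] = key                       # moving has failed

def probes_to_find(table, key):
    n = len(table)
    addr, step = key % n, step_of(key, n)
    for probes in range(1, n + 1):
        if table[addr] is None:
            return None
        if table[addr] == key:
            return probes
        addr = (addr + step) % n
    return None

table = [None] * 7
for k in (10, 5, 17):
    brent_insert(table, k)
print(table)
```

When 17 arrives, its primary chain is 3, 5, 0, so S = 3. The only candidate with i + j < 3 is i = 1, j = 1: the resident 10 at slot 3 can move one offset along its own chain to slot 4. After the move, 17 is found in 1 probe and 10 in 2, for a total of 4 probes over the three keys, compared with 5 if 17 had simply been placed at slot 0.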

Binary Trees: If Brent's method is good, can we get better? How about a binary decision tree? At each point in the process there are essentially two choices: continue to the next address along the probe chain or move the item being stored. A left branch means continue and a right branch means move. The tree here is used as a control structure, not to store data. The tree is generated in a breadth-first fashion (like the nodes in a heap sort).


Summary : Collision Resolution

Static Methods: (keys, once placed, don't move)

Linear Probing: (Open Addressing) Simplest; find next available location for record using linear search
Variation: 2-pass load with Linear Probing for synonyms only

Double Hashing: (Open Addressing) use second function to find the next location ('randomize' offset)
Variation 1: use H2[K] for second hash, then revert to Linear Probing if that fails
Variation 2: use H2[K] to generate step-size rather than new address

Synonym Chaining: (Chained Overflow/ Single Hash) use pointer to connect synonyms; cuts down # probes required to find; first value must be at home address
Coalesced Hashing: (Chained/ Single Hash) add record to end of current chain; don't need first value at home address

Dynamic Methods: (any key can be moved if it becomes advantageous to do so)

Direct Chaining: (Chained/ Single Hash) Dynamic variant of Coalesced Hashing: if placed key is not at its home address, remove it and all other keys after it in its chain; place incoming key and re-insert those that were removed.
Computed Chaining: (Chained/ Double Hash) store # of links rather than actual address; move records not at their home address; always use i-value of last record in chain to compute step value (i.e. step value changes at each link)
Brent's Method: (Open Addressing/ Double Hash) move any record if we will achieve a net gain (always use step value of incoming/searched for key; doesn't use additional pointer space)
Binary Tree Insertion: (Open Addressing/ Double Hash) move any record(s) to achieve a net gain (always use step value of incoming/searched for key; doesn't use additional pointer space)

Algorithm             | Static/Dynamic | Addressing Method | 1- or 2-Pass Load | Type of Link | Single/Double Hash Functions | Key used by H2[K]
----------------------|----------------|-------------------|-------------------|--------------|------------------------------|------------------
Linear Probing        | static         | open              | 1                 | n/a          | single                       | n/a
Double Hashing        | static         | open              | 1                 | n/a          | double (2 types)             | Type 1: incoming, new address; Type 2: incoming, step size
Chained Overflow      | static         | chained           | 2                 | actual link  | single                       | n/a
Coalesced Hashing     | static         | chained           | 1                 | actual link  | single                       | n/a
Computed Chaining     | dynamic        | chained           | 1                 | pseudolink   | double                       | resident (last value in chain)
Brent's Method        | dynamic        | open              | 1                 | n/a          | double                       | incoming
Binary Tree Insertion | dynamic        | open              | 1                 | n/a          | double                       | incoming

Another Variation:

Use patterns of record access to pre-sort the records before loading. This way frequently accessed records can be assured of having the shortest chains. Q: How can this be worked into the various collision resolution techniques? Can it be combined with all of them?
