CPSC 461: Copyright © 2002 Katrin Becker 1998-2002 Last Modified August 5, 2000 11:52 AM

Collision Resolution

Unless your list is static and you have created a perfect hashing function, there will be collisions. So every hashing algorithm must have a way of dealing with them. A function that produces a large number of collisions around the same home addresses is said to exhibit primary clustering. A probe is an access to a distinct location (i.e. one look into the list).

Linear Probing: (also called progressive overflow) one of the simplest methods

- scan the file in a circular fashion, placing the synonym in the next available location
- simple, easy to maintain but not very efficient
- searching for a key that is not in the file could result in a search through the entire file (!)
- one improvement involves a 2-pass load: the first pass loads every record that can go at its proper hashed location, and the second pass loads the synonyms.
- tends to produce primary clustering
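
A minimal sketch of the scan described above, assuming integer keys, a table held in a Python list, and key mod table-size as the hash function (none of which the notes prescribe). The names lp_insert and lp_search are made up for illustration.

```python
EMPTY = None

def lp_insert(table, key):
    """Insert key with linear probing; return the number of probes used."""
    m = len(table)
    home = key % m                      # assumed hash function
    for probe in range(m):
        addr = (home + probe) % m       # circular scan from the home address
        if table[addr] is EMPTY:
            table[addr] = key
            return probe + 1
        if table[addr] == key:
            raise ValueError("duplicate record")
    raise OverflowError("full table")

def lp_search(table, key):
    """Return the address holding key, or None if it is not in the table."""
    m = len(table)
    home = key % m
    for probe in range(m):
        addr = (home + probe) % m
        if table[addr] is EMPTY:        # an empty slot ends the search early
            return None
        if table[addr] == key:
            return addr
    return None                         # searched the entire table

table = [EMPTY] * 11
for k in (27, 18, 29, 28, 39, 13, 16):
    lp_insert(table, k)
print(table)
```

With these keys, addresses 5 through 10 fill up as one contiguous run, so key 16 (home address 5) needs six probes. This is the clustering the last bullet refers to.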

Linear Quotient: (also called double hashing) similar to progressive overflow but it uses a variable increment rather than one to go to the next location
- when you have a synonym, hash again to get the next location (use a different function)
- the next synonym can use the same second function again or one modified by the number of synonyms (this tends to randomize better: sometimes called quadratic probing)
- a potential problem to look out for: it is possible to think we have searched the entire file and found it full when actually we have simply skipped over available spaces. Can avoid this by keeping track of the number of locations searched.
Variation: truncate the data (e.g. use the last digit) and use that as your step value
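
A sketch of the variable-increment idea, under assumptions that are not in the notes: integer keys, key mod m as the first hash, the quotient (key div m) mod m as the step (forced to 1 when it would be 0), and a prime table size m so that any step visits every location. Counting probes, as suggested above, is what stops the loop. The names step, dh_insert and dh_search are illustrative only.

```python
def step(key, m):
    """Assumed second hash function: a step size derived from the quotient."""
    inc = (key // m) % m
    return inc if inc != 0 else 1       # a zero step would never leave home

def dh_insert(table, key):
    m = len(table)
    addr = key % m
    inc = step(key, m)
    for _ in range(m):                  # keep track of locations searched
        if table[addr] is None:
            table[addr] = key
            return addr
        if table[addr] == key:
            raise ValueError("duplicate record")
        addr = (addr + inc) % m         # jump by the key's own increment
    raise OverflowError("full table")

def dh_search(table, key):
    m = len(table)
    addr = key % m
    inc = step(key, m)
    for _ in range(m):
        if table[addr] is None:
            return None
        if table[addr] == key:
            return addr
        addr = (addr + inc) % m
    return None

table = [None] * 11                     # 11 is prime
for k in (27, 18, 29, 28, 39, 13, 16):
    dh_insert(table, k)
print(table)
```

If m were not prime, a step sharing a factor with m would revisit the same few locations, which is the "skipped over available spaces" problem noted above.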

Separate Overflow Area (cellar): set aside a separate area for overflow records
- remember to have the hashing function generate addresses only for the primary area.

Synonym Chaining: (also called Chained Progressive Overflow) use pointers to connect synonyms
- needs an extra field but search length is cut dramatically for keys not found

PROBLEM: a search must always be able to find a key by starting at its home address. This can only be guaranteed with two-pass loading or a separate overflow area. If an existing chain occupies someone else's home address, a record search that correctly started at that address would end up following the wrong chain.

Coalesced Hashing: (variation called Direct Chaining): Uses synonym chaining but when a key hashes to a location that is already used by one that is NOT a synonym it is added to the end of the current tenant's chain. This method finds the next available address by looking from the bottom up by convention. There are many choices for how to build the chain and choose the locations of overflow records:

LISCH (late insertion standard coalesced hashing) - at the end
LICH (late insertion coalesced hashing) - at the end, using the cellar
EISCH (early insertion standard coalesced hashing) - at the beginning
EICH (early insertion coalesced hashing) - at the beginning, with a cellar
REISCH (R for random) - choosing a random location for the overflow record
BEISCH (B for bi-directional) - look up first, then down, then up, then down, etc.
Direct Chaining (DCWC - direct chaining without coalescing) - moving records not stored at their home address.
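
A minimal sketch of the late-insertion variant without a cellar (LISCH), assuming integer keys, key mod m as the hash, and two parallel Python lists for the keys and the link fields. The function names are made up for illustration. The new record goes into the first free slot found from the bottom of the table up and is linked to the end of whatever chain passes through its home address.

```python
def lisch_insert(keys, links, key):
    m = len(keys)
    addr = key % m                      # assumed hash function
    if keys[addr] is None:
        keys[addr] = key
        return addr
    # Follow the chain that passes through the home address to its end,
    # even if the current tenant is not a synonym (this is the coalescing).
    while True:
        if keys[addr] == key:
            raise ValueError("duplicate record")
        if links[addr] is None:
            break
        addr = links[addr]
    free = m - 1                        # look for space from the bottom up
    while free >= 0 and keys[free] is not None:
        free -= 1
    if free < 0:
        raise OverflowError("full table")
    keys[free] = key
    links[addr] = free                  # late insertion: attach at chain end
    return free

def lisch_search(keys, links, key):
    addr = key % len(keys)
    while addr is not None and keys[addr] is not None:
        if keys[addr] == key:
            return addr
        addr = links[addr]
    return None

m = 11
keys, links = [None] * m, [None] * m
for k in (27, 18, 29, 28, 39, 13, 16, 42):
    lisch_insert(keys, links, k)
print(keys)
print(links)
```

In this run key 42 hashes to address 9, which already holds 39 (an overflow record from the chain that starts at address 6). 42 is appended to that chain, so the chain 6, 9, 4 now mixes keys with two different home addresses.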

Computed Chaining: (DYNAMIC COLLISION RESOLUTION - keys can be moved once stored) Compute the location of the next key in the chain: this has the advantage of not using a big extra field.

Instead of storing the actual address as the link, store a pseudolink.
What is stored is the chain length count. We use this value to compute the needed address. The advantages: we need to store only a small value so 'pointer size' is small and we no longer need to retrieve the intermediate values to get the one we need.
This method avoids coalescing by moving records found at someone else's home address. It uses the key actually stored at the probe address to compute the next probe address, not the key being stored or retrieved.

The Algorithm:

Hash the key of the record to be inserted to obtain the home address for storing the record.
If the home address is empty, insert the record at that location.
If the record is a duplicate, terminate with a "duplicate record" message.
If the item stored at the hashed location is not at its home address, move it to the next empty location found by stepping through the table using the increment associated with its predecessor element, and then insert the incoming record into the hashed location; else
1. Locate the end of the probe chain and in the process, check for a duplicate record.
2. Use the increment associated with the last item in the probe chain to find an empty location for the incoming record. In the process, check for a full table.
3. Set the pseudolink at the position of the predecessor record to connect to the empty location.
4. Insert the record in the empty location.

Brent's Method: Dynamic collision resolution

Assumes any record can be moved if it becomes convenient. The additional processing is justified because we usually insert an item only once but will want to retrieve it many times.
The Primary Probe Chain is the sequence of locations visited during the insertion or retrieval of a record.
The method involves moving a record when we want to put some other record in its place: so that we can store a record at its home address, we move the one that shouldn't be there.
The Secondary Probe Chain is the one we follow when trying to move a record from the primary probe chain.

The Algorithm:

Hash the key of the record to be inserted to obtain the home address for storing the record.
If the home address is empty, insert the record at that location, else
1. Compute the next potential address for storing the incoming record. Initialize S ← 2.
2. While the potential storage address is not empty,
  1. Check if it is the home address. If it is, the table is full; terminate with a "full table" message.
  2. If the record stored at the potential storage address is the same as the incoming record, terminate with a "duplicate record" message.
  3. Compute the next potential address for storing the incoming record. Set S ← S + 1.

/* Attempt to move a record previously inserted */

Initialize i ← 1 and j ← 1.
While (i + j < S)
1. Determine if the record stored at the ith position on the primary probe chain can be moved j offsets along its secondary probe chain.
2. If it can be moved, then
  1. Move it and insert the incoming record into the vacated position i along its primary probe chain; terminate with a successful insertion, else
  2. Vary i and/or j to minimize the sum of (i + j); if i = j, minimize on i.

/* Moving has failed */

Insert the incoming record at position S on its primary probe chain; terminate with a successful insertion.
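
The steps above can be sketched as follows. This is an illustration under stated assumptions rather than a reference implementation: integer keys, key mod m as the home address, a quotient-based increment (the hypothetical helper inc), and a prime table size m. Candidate moves are tried in order of increasing i + j, smallest i first.

```python
def inc(key, m):
    """Assumed second hash function: a step size that is never zero."""
    q = (key // m) % m
    return q if q else 1

def brent_insert(table, key):
    m = len(table)
    home = key % m
    step = inc(key, m)
    chain = []                          # occupied addresses at positions 1..S-1
    addr = home
    while table[addr] is not None:
        if table[addr] == key:
            raise ValueError("duplicate record")
        chain.append(addr)
        addr = (addr + step) % m
        if addr == home:
            raise OverflowError("full table")
    s = len(chain) + 1                  # probes needed if nothing is moved
    # Attempt to move a record previously inserted (only while i + j < S).
    for total in range(2, s):
        for i in range(1, total):
            j = total - i
            src = chain[i - 1]          # ith position on the primary chain
            resident = table[src]
            dst = (src + j * inc(resident, m)) % m   # j offsets along secondary chain
            if table[dst] is None:
                table[dst] = resident
                table[src] = key
                return src
    table[addr] = key                   # moving has failed: use position S
    return addr

def brent_search(table, key):
    m = len(table)
    addr, step = key % m, inc(key, m)
    for _ in range(m):
        if table[addr] is None:
            return None
        if table[addr] == key:
            return addr
        addr = (addr + step) % m
    return None

table = [None] * 11
for k in (27, 18, 29, 28, 39, 13, 16):
    brent_insert(table, k)
print(table)
```

In this run 39 would have needed three probes, so 28 is moved one offset along its own chain and 39 takes address 6. Likewise 16 would have needed six probes, so 27 is moved three offsets (i = 1, j = 3) and 16 is stored at its home address. Retrieval is an ordinary double-hashing search, since every moved record stays on its own probe chain.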

The table below gives the progressive values of i and j as the search along the primary and secondary probe chains progresses. To be a viable move, sum(i,j) must be less than S (S is the chain length for the incoming record if nothing is moved). i is the position of a placed record along the primary probe chain, and j is the length of this placed record's probe chain should we decide to move it (i.e. the secondary probe chain). The general idea is to reduce the total of the chain lengths by re-arranging keys.

i	1	1	2	1	2	3	1	2	3	4	1	2	3	4	5
j	1	2	1	3	2	1	4	3	2	1	5	4	3	2	1
sum(i,j)	2	3	3	4	4	4	5	5	5	5	6	6	6	6	6
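
The order of the (i, j) pairs in the table can be generated mechanically. A small sketch (the function name brent_order is made up for illustration) that yields every pair with i + j < S in increasing order of the sum, smallest i first:

```python
def brent_order(s):
    """Yield (i, j) pairs with i + j < s, lowest sum first, then lowest i."""
    for total in range(2, s):
        for i in range(1, total):
            yield i, total - i

pairs = list(brent_order(7))
print([i for i, j in pairs])
print([j for i, j in pairs])
print([i + j for i, j in pairs])
```

For S = 7 the three printed rows should match the i, j and sum(i,j) rows of the table.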

Binary Trees: If Brent's method is good, can we get better? How about a binary decision tree? At each point in the process there are essentially two choices: continue to the next address along the probe chain or move the item being stored. Left branch means continue; and Right branch means move. The tree here is used as a control structure - not to store data. The tree is generated in a breadth first fashion (like the nodes in a heap sort).

The Algorithm:
1. Hash the key of the record to be inserted to obtain the home address for storing the record.
2. If the home address is empty, insert the record at that location, else
  1. Until an empty location or a "full" table is encountered,
    1. Generate a binary tree control structure in a breadth first left to right fashion. The address of the lchild of a node is determined by adding (i) the increment associated with the key of the record coming in to the node to (ii) the current address. The address of the rchild of a node is determined by adding (i) the increment associated with the key of the record stored in the node to (ii) the current address.
    2. At the leftmost node on each level, check against the record associated with it for a duplicate record. If found, terminate with a "duplicate record" message.
  2. If a "full table", terminate with a "full table" message.
  3. If an empty node is found, the path from the empty node back to the root determines which records, if any, need to be moved. Each right link signifies that a relocation is necessary. First set the current node pointer to the last node generated in the binary tree and set the empty location pointer to the table address associated with the last node generated.
  4. Until the current pointer equals the root node of the binary tree (bottom up), note the type of branch from the parent of the current node to the current node.
    1. On a right branch, move the record stored at the location contained in the parent node into the location indicated by the empty location pointer. Set the empty location pointer to the newly vacated position and make the parent node the current node.
    2. On a left branch, make the parent node the current node.
  5. Insert the record coming into the root position into the empty location. Terminate with a successful insertion.
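
A sketch of the steps above, with several assumptions and simplifications that are not in the notes: integer keys, key mod m as the home address, a quotient-based increment (the hypothetical helper inc), and a prime table size m. The duplicate check is done up front by an ordinary search instead of at the leftmost node of each level, and a table address is never added to the tree twice except along the incoming key's own (leftmost) path. That pruning keeps the tree finite and the moves consistent, but it may skip some relocations the full method would consider.

```python
from collections import deque

def inc(key, m):
    """Assumed second hash function: a step size that is never zero."""
    q = (key // m) % m
    return q if q else 1

def search(table, key):
    m = len(table)
    addr, step = key % m, inc(key, m)
    for _ in range(m):
        if table[addr] is None:
            return None
        if table[addr] == key:
            return addr
        addr = (addr + step) % m
    return None

def bintree_insert(table, key):
    m = len(table)
    home = key % m
    if search(table, key) is not None:          # simplification: check first
        raise ValueError("duplicate record")
    # node = (address, key coming in, parent index, is_rchild, on_leftmost_path)
    nodes = [(home, key, None, False, True)]
    seen = {home}
    queue = deque([0])                          # breadth first, left to right
    found = None
    while queue:
        n = queue.popleft()
        addr, incoming, _, _, leftmost = nodes[n]
        resident = table[addr]
        if resident is None:
            found = n
            break
        # lchild steps by the incoming key, rchild by the key stored in the node
        for moved, is_right in ((incoming, False), (resident, True)):
            child = (addr + inc(moved, m)) % m
            on_left = leftmost and not is_right
            if on_left and child == home:
                continue                        # incoming key's chain wrapped around
            if on_left or child not in seen:
                seen.add(child)
                nodes.append((child, moved, n, is_right, on_left))
                queue.append(len(nodes) - 1)
    if found is None:
        raise OverflowError("full table")
    empty, cur = nodes[found][0], found
    while nodes[cur][2] is not None:            # walk back to the root
        parent, is_right = nodes[cur][2], nodes[cur][3]
        if is_right:                            # right link: relocate a record
            paddr = nodes[parent][0]
            table[empty], table[paddr] = table[paddr], None
            empty = paddr
        cur = parent
    table[empty] = key
    return empty

table = [None] * 11
for k in (27, 18, 29, 28, 39, 13, 16):
    bintree_insert(table, k)
print(table)
```

In this run, inserting 16 follows a left branch (16 continues to address 6), then a right branch (39 is moved from 6 along its own chain) and a left branch to the empty address 1. So 39 ends up at address 1 and 16 at address 6.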

Summary : Collision Resolution

Static Methods: (keys, once placed, don't move)

Linear Probing: (Open Addressing) Simplest; find next available location for record using linear search

Variation: 2-pass load with Linear Probing for synonyms only

Double Hashing: (Open Addressing) use second function to find the next location ('randomize' offset)

Variation 1: use H2[K] for second hash, then revert to Linear Probing if that fails

Variation 2: use H2[K] to generate step-size rather than new address

Synonym Chaining: (Chained Overflow/ Single Hash) use pointer to connect synonyms; cuts down # probes required to find; first value must be at home address

Coalesced Hashing: (Chained/ Single Hash) add record to end of current chain; don't need first value at home address

Dynamic Methods: (any key can be moved if it becomes advantageous to do so)

Direct Chaining: (Chained/ Single Hash) Dynamic variant of Coalesced Hashing: if placed key is not at its home address, remove it and all other keys after it in its chain; place incoming key and re-insert those that were removed.

Computed Chaining: (Chained/ Double Hash) store # of links rather than actual address; move records not at their home address; always use i-value of last record in chain to compute step value (i.e. step value changes at each link)

Brent's Method: (Open Addressing/ Double Hash) move any record if we will achieve a net gain (always use step value of incoming/searched for key; doesn't use additional pointer space)

Binary Tree Insertion: (Open Addressing/ Double Hash) move any record(s) to achieve a net gain (always use step value of incoming/searched for key; doesn't use additional pointer space)

Algorithm	Static/ Dynamic	Addressing Method	1- or 2-Pass Load	Type of Link	Single/Double Hash Functions	Key used by H2[K]
Linear Probing	static	open	1	n/a	single	n/a
Double Hashing	static	open	1	n/a	double (2 types)	Type 1: incoming, new address; Type 2: incoming, step size
Chained Overflow	static	chained	2	actual link	single	n/a
Coalesced Hashing	static	chained	1	actual link	single	n/a
Computed Chaining	dynamic	chained	1	pseudolink	double	resident (last value in chain)
Brent's Method	dynamic	open	1	n/a	double	incoming
Binary Tree Insertion	dynamic	open	1	n/a	double	incoming

Another Variation:

Use patterns of record access to pre-sort the records before loading. This way frequently accessed records can be assured of having the shortest chains. Q: How can this be worked into the various collision resolution techniques? Can it be combined with all of them?
