Cosequential Processing 1


Defined: coordinated processing of two or more sequential lists to produce a single output list.
Reasons: merge; union; matching; intersection

Working with two lists

Intersection (matching fields from two lists):
need to initialize properly
need to synchronize lists so no matches are missed
need to handle end of file properly
need to recognize errors (like values out of sequence)
want to be efficient, simple, maintainable

METHOD 1:
startup:

open input files
create output file
/* should treat 1st record separately */
set prev_name = low_val
more_names = true; /* reset if either file reaches EOF */

read (file1, name1)
read (file2, name2)
while (more_names) do

   if (name1 < name2) then
      read (file1, name1)
   else if (name1 > name2) then
      read (file2, name2)
   else /* match found */
      write (outfile, name1)
      read (file1, name1)
      read (file2, name2)
   endif

endwhile
cleanup;

The read procedure handles checking for end of file:

getval (file, val);
if EOF(file) then more_names = false
It also checks for records out of sequence:

if val <= prev_name then ERROR
update prev_name
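
A minimal Python sketch of the matching loop together with the checks done by read. It assumes each list is an in-memory sequence of names rather than a file, and that None never appears as a name (None stands in for end of file here). The names checked and match are illustrative and do not come from the text.

```python
def checked(names):
    """Yield names one at a time, raising if they are not strictly ascending."""
    prev = None  # plays the role of low_val before the first record
    for name in names:
        if prev is not None and name <= prev:
            raise ValueError(f"out of sequence: {name!r} after {prev!r}")
        prev = name
        yield name


def match(names1, names2):
    """Return the names that appear in both ascending lists."""
    it1, it2 = checked(names1), checked(names2)
    out = []
    name1 = next(it1, None)  # None means this list has reached its end
    name2 = next(it2, None)
    # Stop as soon as either list ends: no further matches are possible.
    while name1 is not None and name2 is not None:
        if name1 < name2:
            name1 = next(it1, None)
        elif name1 > name2:
            name2 = next(it2, None)
        else:  # match found
            out.append(name1)
            name1 = next(it1, None)
            name2 = next(it2, None)
    return out


print(match(["ADAMS", "CARTER", "DAVIS"], ["BAKER", "CARTER", "DAVIS", "EVANS"]))
```

Note that a sequence error is only detected if the loop actually reads that far; records after the point where the other list ends are never examined.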

Merging Two Lists (without duplicates):
basically as above, except that unmatched names are written out as well
startup:

open input files
create output file
set prev_name = low_val
more_names = true; /* gets reset when either file reaches EOF */
otherlist_done = false;
read (file1, name1)
read (file2, name2)
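while (more_names) do
   if (name1 < name2) then
      write (outfile, name1)
      read (file1, name1)
   else if (name1 > name2) then
      write (outfile, name2)
      read (file2, name2)
   else /* match found: write the name once, advance both lists */
      write (outfile, name1)
      read (file1, name1)
      read (file2, name2)
   endif
endwhile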
read and write remaining file if necessary

* We need to rewrite read so that we can continue to read from the remaining list after the first list has reached end-of-file. The text uses HIGH_VALUE. We can also just remember which list has ended.


procedure read ( whichlist, file, val );
getval (whichlist, file, val);

if EOF (file) & otherlist_done then /* both lists done */
   more_names = false
else if EOF (file) then /* just this list done */
   otherlist_done = true;
else if val <= prev_name then
   ERROR
update prev_name
end read

This approach can be applied to problems involving two different lists (obviously the key fields must be comparable) that require us to gather specific information from both lists.

[ EXAMPLE: see text p. 301 (268, 2nd edition) ]
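
A short Python sketch of the two-list merge without duplicates. It takes the "remember which list has ended" route rather than HIGH_VALUE: once one list runs out, the rest of the other is copied straight to the output. It assumes both inputs are in-memory lists in strictly ascending order; the function name merge_unique is made up for this example.

```python
def merge_unique(list1, list2):
    """Merge two strictly ascending lists, writing a matched name only once."""
    out = []
    i = j = 0
    while i < len(list1) and j < len(list2):
        if list1[i] < list2[j]:
            out.append(list1[i])
            i += 1
        elif list1[i] > list2[j]:
            out.append(list2[j])
            j += 1
        else:  # match found: write once, advance both lists
            out.append(list1[i])
            i += 1
            j += 1
    # Read and write the remaining list, if any (at most one is non-empty).
    out.extend(list1[i:])
    out.extend(list2[j:])
    return out


print(merge_unique(["ADAMS", "CARTER"], ["BAKER", "CARTER", "DAVIS"]))
```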

K-Way Merge
Ascending order by name: (makes no allowances for duplicates or out-of-sequence records)
while (more_names)

out_name = min( name1, name2, name3, ... namek )
write (outfile, out_name)
*if (name1 == out_name) then
   read( file1, name1 )
*if (name2 == out_name) then
   read( file2, name2 )
*if (name3 == out_name) then
   read( file3, name3 )
.
.
.
*if (namek == out_name) then
   read( filek, namek )
endwhile
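
The loop above can be sketched in Python as follows, assuming the K lists are in-memory lists already in ascending order. As in the pseudocode, every list whose current name equals out_name is advanced, so a name sitting at the head of several lists is written once. The function name k_way_merge is illustrative only.

```python
def k_way_merge(lists):
    """Merge K ascending lists by repeatedly taking the minimum of the heads."""
    pos = [0] * len(lists)  # next unread position in each list
    out = []
    while True:
        # Current names of the lists that still have something left.
        heads = [lst[p] for lst, p in zip(lists, pos) if p < len(lst)]
        if not heads:  # every list is exhausted
            break
        out_name = min(heads)
        out.append(out_name)
        # Advance every list whose current name was just written.
        for i, lst in enumerate(lists):
            if pos[i] < len(lst) and lst[pos[i]] == out_name:
                pos[i] += 1
    return out


print(k_way_merge([["A", "D"], ["B", "D", "E"], ["C"]]))
```

Finding the minimum costs about K comparisons per name written, which is the cost a selection tree reduces.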

Merging by Selection Trees

tournament tree

each node represents the winner of the comparison
root is the minimum value
write the root; replace appropriate leaf; run the tournament again
requires fewer comparisons than above
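
One possible way to code the tournament, as a rough Python sketch. It assumes the lists are in memory and ascending, stores the tree in an array (node n has children 2n and 2n+1), and keeps in each node the index of the list that won there, with -1 marking an exhausted list. Unlike the K-way loop above, it keeps duplicates. All names are invented for the example.

```python
def tournament_merge(lists):
    """Merge ascending lists with a winner (tournament) tree held in an array."""
    k = len(lists)
    size = 1
    while size < k:  # number of leaves: next power of two >= k
        size *= 2
    pos = [0] * k
    tree = [-1] * (2 * size)  # tree[1] is the root; leaves start at tree[size]

    def head(i):
        return lists[i][pos[i]]

    def winner(a, b):
        """Return the list index with the smaller current key (-1 = no player)."""
        if a == -1:
            return b
        if b == -1:
            return a
        return a if head(a) <= head(b) else b

    for i in range(k):
        tree[size + i] = i if lists[i] else -1
    for n in range(size - 1, 0, -1):  # play the initial tournament bottom-up
        tree[n] = winner(tree[2 * n], tree[2 * n + 1])

    out = []
    while tree[1] != -1:
        i = tree[1]  # the root holds the list with the minimum key
        out.append(head(i))
        pos[i] += 1  # replace the appropriate leaf with that list's next key
        if pos[i] == len(lists[i]):
            tree[size + i] = -1
        n = (size + i) // 2
        while n >= 1:  # replay only the matches on the path to the root
            tree[n] = winner(tree[2 * n], tree[2 * n + 1])
            n //= 2
    return out


print(tournament_merge([[1, 4, 7], [2, 5, 8], [3, 6, 9]]))
```

Each name written replays only the matches between one leaf and the root, roughly log2(K) comparisons instead of K.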

Merging by Heapsort
method of speeding up sorting in RAM
uses selection tree idea
begin sorting keys as soon as they are available
in this case we must build tree completely before we can start to write them out
rules:
1. each node has 1 key which is >= the key of its parent (so the smallest key is at the root)
2. tree is complete (leaves are only on two levels)
3. storage can be simple: children of I are at 2I and 2I+1; parent of J is at J/2 (integer division)

algorithm:

for I := 1 to count
   read next record into end of array (call it K)
   while K is not the root and K < parent(K)
       switch(K, parent(K)) /* may result in value of K changing */
   endwhile
endfor
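
A small Python sketch of this build step, assuming the keys are already in memory. Slot 0 of the array is left unused so that the 2I / 2I+1 / J/2 index rules hold exactly; heap_insert and build_heap are names chosen for the example.

```python
def heap_insert(heap, key):
    """Append key to a 1-based min-heap and move it up toward the root."""
    heap.append(key)
    k = len(heap) - 1  # index of the new key
    # Swap with the parent while the new key is smaller than its parent.
    while k > 1 and heap[k] < heap[k // 2]:
        heap[k], heap[k // 2] = heap[k // 2], heap[k]
        k //= 2


def build_heap(keys):
    """Build a heap one key at a time, as each key becomes available."""
    heap = [None]  # slot 0 unused: children of i are at 2i and 2i+1
    for key in keys:
        heap_insert(heap, key)
    return heap


heap = build_heap([5, 3, 8, 1, 9, 2])
print(heap[1])  # the smallest key is at the root
```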

Combining READ and BUILD_HEAP:

read records in blocks
process each block as it comes; get next one while processing current one
place each new block right in array at current end so next key to sort is where it should be; then just continue through array
possible delays while sorter waits for reader to catch up but reader should never have to wait (means sort takes just a bit longer than reading the file)

Combining TRAVERSE_HEAP and WRITE:

for I := 1 to record_count

   write record at array[1] (it’s the smallest)
   move array[end] to array[1] (call it K)
   end = end - 1

   while K has children and K > its smallest child
      switch (K, smallest child)
   endwhile

endfor

create block of records to write out
once we have a block we can write it while creating the next block
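
A sketch of the removal step in Python, using the same 1-based array layout (slot 0 unused). The starting heap is written out by hand so the block stands alone; pop_smallest is an illustrative name. Block buffering of the output is left out.

```python
def pop_smallest(heap):
    """Remove and return the smallest key from a 1-based min-heap."""
    smallest = heap[1]
    last = heap.pop()  # key at the end of the array
    end = len(heap) - 1  # index of the last key still in the heap
    if end >= 1:
        heap[1] = last  # move it to the root, then push it down
        k = 1
        while 2 * k <= end:  # while k has at least one child
            child = 2 * k
            if child + 1 <= end and heap[child + 1] < heap[child]:
                child += 1  # pick the smaller of the two children
            if heap[k] <= heap[child]:
                break  # heap property restored
            heap[k], heap[child] = heap[child], heap[k]
            k = child
    return smallest


heap = [None, 1, 3, 2, 7, 4]  # a valid heap: every key >= its parent
print([pop_smallest(heap) for _ in range(5)])
```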

Merging to sort large files on disk

Sorting just the keys is a good solution for files where all keys can be held in RAM; with bigger files this doesn’t work so well.
If we want to then sort the file, it is very expensive since we still have to retrieve each record separately
SOLUTION: read part of the file, sort it in RAM; write it; read next part etc....then merge the resultant files

- can sort BIG files of virtually any size
- reading for setup is sequential, so as fast as possible
- reading for merge/output is also sequential (only do random access when switching files)
- can apply Heapsort and combine input/output and processing ops
- since all is sequential; can do tape sort

Remaining bottleneck: merge phase

for K-way merge:
buffer size = (1/K) X size of RAM space = (1/K) X size of each run
takes K seeks to read all records in each run, K runs altogether, so merge = K² seeks
Sort Merge = O(K²)

100 byte records; 10 byte key field; can use 1 megabyte of RAM; can hold 10,000 records in RAM at a time
8,000,000 record file; break it into 800 runs of 10,000 records each:
- assume 1 seek per sequential access
- for sort have 800 seeks and transfers for reading and for writing
- for merge split RAM into 800 parts (each now holds 1/800 of a run) so must access each run 800 times
- 800 runs X 800 seeks = 640,000 seeks (1 megabyte buffers)
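
The arithmetic can be checked with a few lines of Python. The constant and function names are made up here, and "1 megabyte" is taken as 1,000,000 bytes so that 10,000 records of 100 bytes fit, as in the notes.

```python
RECORD_BYTES = 100
RAM_BYTES = 1_000_000  # "1 megabyte", taken as 1,000,000 bytes
FILE_RECORDS = 8_000_000

records_per_run = RAM_BYTES // RECORD_BYTES  # records that fit in RAM at once
runs = FILE_RECORDS // records_per_run  # number of sorted runs


def single_step_merge_seeks(k):
    """Seeks for a k-way merge when each buffer holds 1/k of a run."""
    seeks_per_run = k  # k buffer refills to read one whole run
    return k * seeks_per_run


merge_seeks = single_step_merge_seeks(runs)
print(records_per_run, runs, merge_seeks)  # prints: 10000 800 640000
```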

SOLUTIONS:

allocate more hardware
perform merge in several steps
increase length of sorted runs
overlap I/O operations

HARDWARE:

increase RAM
increase # of dedicated disk drives; organize files to minimize seeks
increase # of I/O channels

MULTI-STEP MERGE: 8,000,000 records

process 25 sets of 32 runs each (10,000 records/run, so 320,000 records per set)

Step 1 (sort/merge each subset):

each input buffer holds 1/32 of a run, so each 32-way merge takes 32 X 32 = 1,024 seeks.
Have 25 to do, so Total = 25 X 1,024 = 25,600 seeks (each resulting run is 32 megabytes)

Step 2 (merge the 25 sets):

allocate 1/25 of total buffer space for each run
each buffer then holds 400 records (same as 1/800 of a run; since 1 run = 32 megabytes).
800 seeks per run, so 25 X 800 seeks = 20,000 seeks

Total = 25,600 + 20,000 = 45,600 seeks

Traded an extra pass over the data for increased buffer space per run in each pass; increased buffer space per run = fewer random accesses (seeks).
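
A small Python check of the two-step figures, under the same assumptions as the notes (RAM holds exactly one original run, one seek per buffer refill) plus the assumption that the number of sets divides the number of runs evenly. The function name is hypothetical.

```python
def two_step_merge_seeks(runs, sets):
    """Return (step 1 seeks, step 2 seeks) for merging `runs` runs in `sets` groups."""
    runs_per_set = runs // sets
    # Step 1: each set is a runs_per_set-way merge costing runs_per_set ** 2 seeks.
    step1 = sets * runs_per_set ** 2
    # Step 2: RAM is split into `sets` buffers, each holding 1/sets of an original
    # run, and each long run is runs_per_set original runs long.
    seeks_per_long_run = runs_per_set * sets
    step2 = sets * seeks_per_long_run
    return step1, step2


step1, step2 = two_step_merge_seeks(800, 25)
print(step1, step2, step1 + step2)  # prints: 25600 20000 45600
```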

Heapsort Version 2: Using Replacement Selection

Read set of records and sort with heapsort (call this primary heap)
Write out only first record (smallest)
Bring in new key

if it’s bigger than the one just written, place it in primary heap
if it’s smaller, place it in secondary heap

Repeat until primary heap is empty; the secondary heap then becomes the primary heap for the next run

This typically increases the size of one heapsort (run) by factor of 2
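
A rough Python sketch of replacement selection, using the standard heapq module for the primary heap. It assumes the records are plain comparable keys held in memory and that None never occurs as a key (None marks end of input). For simplicity the secondary "heap" is kept as a plain list and only turned into a heap when it becomes the primary one. The function and parameter names are invented for the example.

```python
import heapq
from itertools import islice


def replacement_selection(records, heap_size):
    """Split records into sorted runs using a primary and a secondary heap."""
    it = iter(records)
    primary = list(islice(it, heap_size))  # read the first set of records
    heapq.heapify(primary)
    secondary = []
    runs, current = [], []
    while primary:
        smallest = heapq.heappop(primary)
        current.append(smallest)  # "write out" only the smallest key
        new_key = next(it, None)  # bring in a new key (None at end of input)
        if new_key is not None:
            if new_key >= smallest:
                heapq.heappush(primary, new_key)  # still fits in this run
            else:
                secondary.append(new_key)  # must wait for the next run
        if not primary:  # current run is finished
            runs.append(current)
            current = []
            primary, secondary = secondary, []
            heapq.heapify(primary)
    return runs


print(replacement_selection([5, 3, 8, 1, 9, 2, 7, 4, 6], heap_size=3))
```

With a heap of 3 keys this input yields the runs [3, 5, 8, 9] and [1, 2, 4, 6, 7], both longer than the heap itself.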
Cost:

800 runs = 1,600 seeks for plain heapsort (800 to read, 800 to write)
for replacement selection, split the buffer space into 7,500 records for the sort and 2,500 records for the input (waiting) buffer

8,000,000/2,500 = 3,200 seeks to read the file, and as many to write it, SO....
6,400 seeks for replacement selection sort (!)
BUT.....

run length for replacement selection sort is ~ 15,000 records

(7,500 spaces for records; 1 run does 2 X that = 15,000); 50% more than one run of plain heapsort

8,000,000 records / 15,000 records per run = 534 runs
Now for the merge part, split RAM (1 MB) into 534 buffers;

each buffer holds 18.73 records (let’s say 18), so
15,000/18 = 834 seeks per run
we get 834 seeks per run X 534 runs
= 445,356 seeks altogether (compared to 640,000 for plain heapsort)
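
The same figures, recomputed in Python as a quick check. The variable names are illustrative, and the rounding choices (round the run count and seeks per run up, round the buffer size down) are the ones the notes appear to use.

```python
import math

FILE_RECORDS = 8_000_000
RAM_RECORDS = 10_000  # records that fit in 1 megabyte of RAM
RUN_LENGTH = 15_000  # average run length from replacement selection

runs = math.ceil(FILE_RECORDS / RUN_LENGTH)  # number of runs to merge
buffer_records = RAM_RECORDS // runs  # whole records per merge buffer
seeks_per_run = math.ceil(RUN_LENGTH / buffer_records)
merge_seeks = seeks_per_run * runs

print(runs, buffer_records, seeks_per_run, merge_seeks)  # prints: 534 18 834 445356
```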

What if we add Multistep Merge to Replacement Selection Sort?

Approach	# Records/Seek to form Runs	Size of Runs Formed	# Runs Formed	Merge Pattern Used	# Seeks in Merge Phases	Total # of Seeks
800 RAM Sorts	10,000	10,000	800	25 X 32-way then 25-way	25,600; 20,000	127,200
Replacement Selection, Random Order	2,500	15,000	534	19 X 28-way then 19-way	22,876; 15,162	124,438
Replacement Selection, Partially Ordered	2,500	40,000	200	20 X 10-way then 20-way	8,000; 16,000	110,400

The whole thing can be further improved if we can dedicate two disks - one for input and one for output; and more yet if we can use more than one processor.
It can all fall apart if we are in a multiprogramming environment.
Ideally, sorting large files is done when the machine(s) can be dedicated to just the sort. These improvements also assume that reading and writing can go on at the same time as processing.

Summary:

Use Heapsort for in-RAM sorting
Use as much RAM as possible
Use Multi-Step Merge if # initial runs is large
Consider Replacement Selection to form initial runs if file is partially ordered
Use > 1 Disk Drive and I/O Channel

Sorting on tape

Distribute unsorted file into sorted runs (Replacement Selection is good)
Merge runs into single sorted file
Balanced Merge: spread runs among the available drives to make more efficient use of them

Two-Way balanced Merge

uses 4 tape drives (goes from 2 to 2)

K-Way Balanced Merge

as above but uses more drives

Multi-Phase Merge

try to eliminate empty runs
try to spread runs to maximize use of tapes while minimizing actual reading and writing


Maintenance:
adding, deleting, changing records

batch processing

- updates accumulate in a transaction file (often sorted into same key order as master file) and are applied to the master file in a maintenance run

- transaction files are often edited and carefully checked before being applied to the master file

- audit/error listing (log file) is standard; sometimes in a form that can be edited and used to form the basis for the next transaction file

- matched vs unmatched records (unmatched are those whose keys can be found in only one of the files - master and transaction)
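
A maintenance run is itself a cosequential process over the master and transaction files. The Python sketch below is only one way it might look, under several assumptions that are not in the notes: both files are in-memory lists sorted by key, there is at most one transaction per key, master records are (key, data) pairs, transactions are (key, action, data) triples, and the action names "add", "delete" and "change" are made up for the example.

```python
def apply_transactions(master, transactions):
    """Apply sorted transactions to a sorted master; return (new_master, log)."""
    new_master, log = [], []
    i = j = 0
    while i < len(master) or j < len(transactions):
        m_key = master[i][0] if i < len(master) else None
        t_key = transactions[j][0] if j < len(transactions) else None
        if t_key is None or (m_key is not None and m_key < t_key):
            new_master.append(master[i])  # unmatched master: copy unchanged
            i += 1
        elif m_key is None or t_key < m_key:
            key, action, data = transactions[j]  # unmatched transaction
            if action == "add":
                new_master.append((key, data))
            else:
                log.append(f"{action} {key}: no master record")
            j += 1
        else:  # matched: same key in both files
            key, action, data = transactions[j]
            if action == "change":
                new_master.append((key, data))
            elif action == "add":
                log.append(f"add {key}: already in master")
                new_master.append(master[i])
            # a matched "delete" simply drops the master record
            i += 1
            j += 1
    return new_master, log


master = [("A", 1), ("C", 3), ("D", 4)]
transactions = [("B", "add", 2), ("C", "change", 30),
                ("D", "delete", None), ("E", "delete", None)]
print(apply_transactions(master, transactions))
```

The log list plays the part of the audit/error listing: here it records the attempt to delete E, which has no master record.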
