CPSC 461: Copyright © 2002 Katrin Becker 1998-2002 Last Modified June 18, 2002 03:48 PM

Cosequential Processing and Sorting Large Files 

Defined: coordinated processing of two or more sequential lists to produce a single output list.
Reasons: merge; union; matching; intersection
 
Working with two lists
2-Way Intersection
Intersection (matching fields from two lists):
need to initialize properly
need to synchronize lists so no matches are missed
need to handle end of file properly
need to recognize errors (like values out of sequence)
want to be efficient, simple, maintainable

METHOD 1:
startup:
   open input files
   create output file
   /* should treat 1st record separately */
   set prev_name = low_val
   more_names = true; /* reset if either file reaches EOF */
   read (file1, name1)
   read (file2, name2)
while (more_names) do
   if (name1 < name2) then
      read (file1, name1)
   else if (name1 > name2) then
      read (file2, name2)
   else /* match found */
      write (outfile, name1)
      read (file1, name1)
      read (file2, name2)
   endif
endwhile
cleanup;
 
The read procedure checks for end of file and for records out of sequence:
   getval (file, val);
   if EOF (file) then
      more_names = false
   else if val <= prev_name then
      ERROR /* out of sequence; compare against the last value read from this same file */
   endif
   update prev_name
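The intersection loop and its read checks can be sketched in Python. This is only an illustration under stated assumptions: in-memory lists stand in for the two files, None marks end of file, and the names `checked` and `intersect` are made up for this sketch. Each input keeps its own previous value for the sequence check.

```python
def checked(seq, label):
    """Yield items of seq, raising if they are not strictly ascending."""
    prev = None
    for item in seq:
        if prev is not None and item <= prev:
            raise ValueError(f"{label}: {item!r} is out of sequence")
        prev = item
        yield item


def intersect(list1, list2):
    """Return the names that appear in both sorted lists."""
    it1, it2 = checked(list1, "list1"), checked(list2, "list2")
    out = []
    name1 = next(it1, None)  # None plays the role of EOF (assumption)
    name2 = next(it2, None)
    # Stop as soon as either list ends: no further match is possible.
    while name1 is not None and name2 is not None:
        if name1 < name2:
            name1 = next(it1, None)
        elif name1 > name2:
            name2 = next(it2, None)
        else:  # match found
            out.append(name1)
            name1 = next(it1, None)
            name2 = next(it2, None)
    return out


print(intersect(["ADAMS", "CARTER", "DAVIS"],
                ["BAKER", "CARTER", "DAVIS", "EVANS"]))
```

Only one comparison is made per step, and each list is read exactly once from front to back.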
 
2-Way Merge w/o Duplicates
Merging Two Lists (without duplicates):
basically as above
startup:
   open input files
   create output file
   set prev_name = low_val
   more_names = true; /* gets reset when either file reaches EOF */
   otherlist_done = false;
   read (file1, name1)
   read (file2, name2)
while (more_names) do
   if (name1 < name2) then
       write (outfile, name1)
       read ( file1, name1)
   else if (name1 > name2) then
       write (outfile, name2)
       read ( file2, name2)
   else /* match found */
      write ( outfile, name1 )
      read ( file1, name1)
      read (file2, name2)
   endif
endwhile
read and write remaining file if necessary

* The read procedure must be rewritten so that we can keep reading from the remaining list after the other list has reached end-of-file. The text uses HIGH_VALUE. We can also just remember which list has ended.
 
procedure read ( whichlist, file, val );
   getval (whichlist, file, val);
   if EOF (file) & otherlist_done then /* both lists done */
      more_names = false
   else if EOF (file) then /* just this list done */
      otherlist_done = true;
   else if val <= prev_name then
      ERROR
   endif
   update prev_name
end read
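A minimal Python sketch of the merge without duplicates, using the HIGH_VALUE idea mentioned above. The assumptions: lists stand in for files, and the sentinel is the highest Unicode code point, which is assumed never to occur as a real name. The function name `merge_unique` is made up for the sketch.

```python
HIGH_VALUE = "\U0010FFFF"  # assumed sentinel: sorts after any real name


def merge_unique(list1, list2):
    """Merge two sorted lists of names, writing matching names only once."""
    it1, it2 = iter(list1), iter(list2)
    # A list that has ended simply reports HIGH_VALUE from then on.
    name1 = next(it1, HIGH_VALUE)
    name2 = next(it2, HIGH_VALUE)
    out = []
    while name1 != HIGH_VALUE or name2 != HIGH_VALUE:
        if name1 < name2:
            out.append(name1)
            name1 = next(it1, HIGH_VALUE)
        elif name1 > name2:
            out.append(name2)
            name2 = next(it2, HIGH_VALUE)
        else:  # match found: write once, advance both
            out.append(name1)
            name1 = next(it1, HIGH_VALUE)
            name2 = next(it2, HIGH_VALUE)
    return out


print(merge_unique(["ADAMS", "CARTER"], ["BAKER", "CARTER", "EVANS"]))
```

Because an ended list always compares high, the main loop drains the remaining list by itself and no separate "copy the rest" step is needed.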
 
This approach can be applied to problems involving two different lists (the search fields must of course match) that require us to gather specific information from both lists.
 
[ EXAMPLE: see text p. 301 (268, 2nd edition) ]
 
K-Way Merge
Ascending order by name (makes no allowances for duplicates or out-of-sequence records):
while (more_names)
   out_name = min( name1, name2, name3, ... namek )
   write (outfile, out_name)
   if (name1 == out_name) then
      read( file1, name1 )
   if (name2 == out_name) then
      read( file2, name2 )
   if (name3 == out_name) then
      read( file3, name3 )
   .
   .
   .
   if (namek == out_name) then
      read( filek, namek )
endwhile
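The same loop as a Python sketch, with lists standing in for the K files and a private marker object standing in for EOF (both are assumptions of the sketch, as is the name `k_way_merge`). As in the pseudocode, every list whose current name equals out_name is advanced.

```python
_EOF = object()  # private end-of-list marker (assumption of this sketch)


def k_way_merge(lists):
    """Merge K sorted lists by repeatedly taking the minimum current name."""
    iters = [iter(lst) for lst in lists]
    heads = [next(it, _EOF) for it in iters]  # name1 .. namek
    out = []
    while any(h is not _EOF for h in heads):
        # Linear scan over the K current names: K-1 comparisons.
        out_name = min(h for h in heads if h is not _EOF)
        out.append(out_name)
        # Advance every list that currently holds out_name.
        for i, h in enumerate(heads):
            if h is not _EOF and h == out_name:
                heads[i] = next(iters[i], _EOF)
    return out


print(k_way_merge([[1, 4, 7], [2, 4, 8], [3, 9]]))
```

The linear scan is fine for small K; for larger K the selection tree described next reduces the work per record.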
 
Merging by Selection Trees
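Tournament Tree

each node represents the winner of the comparison between its two children
root is the minimum value
write the root; replace appropriate leaf; run the tournament again
requires fewer comparisons than above: about log2(K) per record written, instead of the K-1 needed to find the minimum of K names directly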
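One way to sketch a tournament (winner) tree in Python. The array layout, the padding of K up to a power of two, and the (exhausted, key, source) tuples are choices made for this sketch, not something taken from the text. Unlike the merge without duplicates, this version writes every record.

```python
def tournament_merge(lists):
    """Merge K sorted lists with a winner tree stored in an array."""
    iters = [iter(lst) for lst in lists]
    k = len(iters)
    size = 1
    while size < k:  # number of leaves: next power of two >= K
        size *= 2

    def entry(src):
        # (exhausted?, key, source): exhausted entries lose every match.
        if src >= k:
            return (True, None, src)
        try:
            return (False, next(iters[src]), src)
        except StopIteration:
            return (True, None, src)

    # Leaves live at size .. 2*size-1; node i holds the winner of 2i and 2i+1.
    tree = [None] * (2 * size)
    for src in range(size):
        tree[size + src] = entry(src)
    for node in range(size - 1, 0, -1):
        tree[node] = min(tree[2 * node], tree[2 * node + 1])

    out = []
    while not tree[1][0]:  # root is the overall minimum
        _, key, src = tree[1]
        out.append(key)
        # Replace the winning leaf, then replay only its path to the root.
        node = size + src
        tree[node] = entry(src)
        node //= 2
        while node >= 1:
            tree[node] = min(tree[2 * node], tree[2 * node + 1])
            node //= 2
    return out


print(tournament_merge([[1, 5, 9], [2, 6], [3, 4, 10]]))
```

Only the nodes on the path from the replaced leaf to the root are recomputed, which is where the saving over the linear scan comes from.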

Merging by Heapsort
method of speeding up sorting in RAM
uses selection tree idea
begin sorting keys as soon as they are available
in this case we must build tree completely before we can start to write them out
rules:
1. each node has 1 key, which is >= the key of its parent (so the root holds the smallest key)
2. tree is complete (leaves are only on two levels)
3. storage can be simple: children of I are at 2I and 2I+1; parent of J is at floor(J/2)

 

Heap Sort
algorithm:
for I := 1 to count
   read next record into end of array (call it K)
   while K is not the root and K < parent(K)
      switch(K, parent(K)) /* K moves up one level, so its position changes */
   endwhile
endfor
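The build loop above, sketched in Python. Slot 0 of the list is left unused so that the children of I are at 2I and 2I+1, as in the rules. The names `heap_insert` and `build_heap` are made up for the sketch.

```python
def heap_insert(heap, key):
    """Append key at the end of the array and let it rise to its place."""
    heap.append(key)
    k = len(heap) - 1  # position of the new key
    # Swap with the parent while the key is smaller than its parent.
    while k > 1 and heap[k] < heap[k // 2]:
        heap[k], heap[k // 2] = heap[k // 2], heap[k]
        k //= 2


def build_heap(keys):
    """Insert keys one at a time, as they would arrive from the file."""
    heap = [None]  # index 0 unused so that the parent of J is at J // 2
    for key in keys:
        heap_insert(heap, key)
    return heap


heap = build_heap([5, 3, 8, 1, 9, 2])
print(heap[1])  # the smallest key sits at the root
```

Each insertion touches at most one node per level, so a key can be placed while the next block of records is still being read.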

 Combining READ and BUILD_HEAP:

read records in blocks
process each block as it comes; get next one while processing current one
place each new block right in array at current end so next key to sort is where it should be; then just continue through array
possible delays while sorter waits for reader to catch up but reader should never have to wait (means sort takes just a bit longer than reading the file)
 
Combining TRAVERSE_HEAP and WRITE:
for I := 1 to record_count
   write record at array[1] (it’s the smallest)
   move array[end] to array[1] (call it K)
   end = end - 1
   while K > either child
      switch (K, smallest child)
   endwhile
endfor
 
create block of records to write out
once we have a block we can write it while creating the next block
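A Python sketch of the output loop, again with slot 0 unused. The example heap is built by hand and the names `pop_min` and `drain` are made up for the sketch. Each removal moves the last key to the root and lets it sink past its smaller child.

```python
def pop_min(heap):
    """Remove and return the smallest key (heap[1]) of a 1-based min-heap."""
    smallest = heap[1]
    last = heap.pop()  # take the key at the end of the array
    if len(heap) > 1:
        heap[1] = last  # move it to the root (call it K)
        k = 1
        n = len(heap) - 1  # index of the last key still in the heap
        while 2 * k <= n:
            child = 2 * k
            # Pick the smaller of the two children, if there are two.
            if child + 1 <= n and heap[child + 1] < heap[child]:
                child += 1
            if heap[k] <= heap[child]:
                break  # K is no larger than either child: done
            heap[k], heap[child] = heap[child], heap[k]
            k = child
    return smallest


def drain(heap):
    """Write out all keys in ascending order."""
    out = []
    while len(heap) > 1:
        out.append(pop_min(heap))
    return out


example = [None, 1, 3, 2, 5, 9, 8]  # a valid heap: every key >= its parent
print(drain(example))
```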
 
Merging to sort large files on disk
 
Sorting just the keys is a good solution for files where all keys can be held in RAM; with bigger files this doesn’t work so well.
If we then want to rearrange the file itself into sorted order, it is very expensive, since we still have to retrieve each record separately.
SOLUTION: read part of the file, sort it in RAM; write it; read next part etc....then merge the resultant files
- can sort BIG files of virtually any size
- reading for setup is sequential, so as fast as possible
- reading for merge/output is also sequential (only do random access when switching files)
- can apply Heapsort and combine input/output and processing ops
- since all is sequential; can do tape sort
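The overall shape of the solution, as a small Python sketch. Lists stand in for the file and for the run files on disk, the built-in sorted stands in for the in-RAM heapsort, and heapq.merge from the standard library does the K-way merge. The function names are made up for the sketch.

```python
import heapq


def sort_in_runs(records, run_size):
    """Cut the input into pieces that fit in RAM and sort each piece."""
    runs = []
    for start in range(0, len(records), run_size):
        # sorted() stands in for the in-RAM heapsort of one run.
        runs.append(sorted(records[start:start + run_size]))
    return runs


def external_sort(records, run_size):
    """Form sorted runs, then merge all of them in one K-way merge."""
    runs = sort_in_runs(records, run_size)
    return list(heapq.merge(*runs))


print(external_sort([9, 4, 7, 1, 8, 2, 6], run_size=3))
```

In a real sort each run would be written to disk and read back through its own input buffer, which is where the seek counts discussed next come from.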
 
Remaining bottleneck: merge phase
for K-way merge, buffer size = (1/K) X size of RAM space = (1/K) X size of each run
takes K seeks to read all records in each run, and there are K runs altogether, so merge = K^2 seeks.
Sort Merge = O(K^2)
 
100 byte records; 10 byte key field; can use 1 megabyte of RAM;
can hold 10,000 records in RAM at a time
 
8,000,000 record file; break it into 800 runs of 10,000 records each:
- assume 1 seek per sequential access
- for sort have 800 seeks and transfers for reading and for writing
- for merge split RAM into 800 parts (each now holds 1/800 of a run) so must access each run 800 times
- 800 runs X 800 seeks = 640,000 seeks (1 megabyte buffers )
 
SOLUTIONS:
allocate more hardware
perform merge in several steps
increase length of sorted runs
overlap I/O operations
HARDWARE:
increase RAM
increase # of dedicated disk drives; organize files to minimize seeks
increase # of I/O channels
 
MULTI-STEP MERGE: 8,000,000 records
split the 800 runs into 25 sets of 32 runs each (10,000 records/run, so 320,000 records per set)
Step 1 (sort/merge each subset):
each input buffer holds 1/32 of a run, so merging one set takes 32 X 32 = 1,024 seeks.
Have 25 to do so Total = 25 X 1,024 = 25,600 seeks (each resulting run is 32 megabytes)
Step 2 (merge the 25 sets):
allocate 1/25 of total buffer space for each run
each buffer then holds 400 records (the same as 1/800 of a run, since 1 run = 32 megabytes).
800 seeks per run, so 25 X 800 seeks = 20,000 seeks
Total = 25,600 + 20,000 = 45,600 seeks
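The two-step arithmetic can be checked with a few lines of Python. The function name and its parameters are made up for this sketch, and it counts only the seeks for reading during the merge, using the same one-seek-per-buffer-load assumption as above.

```python
def two_step_merge_seeks(total_runs, groups, ram_records, run_records):
    """Seeks for reading in a two-step merge, one seek per buffer load."""
    runs_per_group = total_runs // groups
    # Step 1: each group is a runs_per_group-way merge, so each of its
    # runs is read in runs_per_group pieces.
    step1 = groups * runs_per_group * runs_per_group
    # Step 2: RAM is split among the merged runs, one buffer per group.
    buffer_records = ram_records // groups
    merged_run_records = runs_per_group * run_records
    step2 = groups * (merged_run_records // buffer_records)
    return step1, step2


step1, step2 = two_step_merge_seeks(
    total_runs=800, groups=25, ram_records=10_000, run_records=10_000)
print(step1, step2, step1 + step2)
```

Trying other values of groups shows the trade-off: fewer, larger groups cost more in step 1, while more groups cost more in step 2.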
 
We trade an extra pass over the data for more buffer space per run in each pass; more buffer space per run means fewer seeks (less random access).
 
Heapsort Version 2: Using Replacement Selection
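Read set of records and sort with heapsort (call this primary heap)
Write out only first record (smallest)
Bring in new key
if it’s bigger than (or equal to) the one just written; place it in primary heap
if it’s smaller; place it in secondary heap (it has to wait for the next run)
Repeat until primary heap is empty; the run is then complete and the secondary heap becomes the primary heap for the next run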
 
This typically increases the size of one heapsort (run) by factor of 2
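A Python sketch of replacement selection. heapq from the standard library stands in for the heapsort heap, a list stands in for the input file, and each run is collected in a list instead of being written to disk. The names `replacement_selection` and `_DONE` are made up for the sketch.

```python
import heapq
from itertools import islice

_DONE = object()  # marks the end of the input (assumption of this sketch)


def replacement_selection(records, heap_size):
    """Split records into sorted runs using a heap of at most heap_size keys."""
    it = iter(records)
    primary = list(islice(it, heap_size))  # fill the primary heap
    heapq.heapify(primary)
    secondary = []  # keys that must wait for the next run
    runs, current = [], []
    while primary:
        smallest = heapq.heappop(primary)
        current.append(smallest)  # "write" the smallest key
        new_key = next(it, _DONE)  # bring in a new key
        if new_key is not _DONE:
            if new_key >= smallest:
                heapq.heappush(primary, new_key)  # still fits in this run
            else:
                secondary.append(new_key)
        if not primary:
            # Run complete: the waiting keys become the next primary heap.
            runs.append(current)
            current = []
            primary, secondary = secondary, []
            heapq.heapify(primary)
    return runs


keys = [33, 18, 24, 58, 14, 17, 7, 21, 67, 12, 5, 47, 16]
for run in replacement_selection(keys, heap_size=3):
    print(run)
```

With room for only 3 keys, the 13 keys above come out in 3 runs instead of the 5 that sorting 3 keys at a time would produce.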
Cost:
plain heapsort: 800 runs = 1,600 seeks (800 to read, 800 to write)
for replacement selection, split the 10,000-record RAM space into 7,500 records for the sort and 2,500 records for the input (waiting) buffer
reading the file takes 8,000,000/2,500 = 3,200 seeks, SO....
6,400 seeks (reading plus writing) for replacement selection sort (!)
BUT.....
 
run length for replacement selection sort is ~ 15,000 records
(7,500 spaces for records; 1 run does 2 X that = 15,000); 50% more than one run of plain heapsort
 
8,000,000 records / 15,000 records per run = 534 runs (rounded up)
Now for the merge part split RAM (1 MB) into 534 buffers;
each holds 18.73 records (let’s say 18) so
15,000/18 = 834 seeks per run (rounded up)
we get 834 seeks per run X 534 runs
= 445,356 seeks altogether (compared to 640,000 for plain heapsort)
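The same calculation in Python, with the rounding made explicit. The function name and parameters are made up for the sketch. Runs and seeks per run are rounded up, and records per buffer is rounded down to whole records.

```python
import math


def single_step_merge_seeks(total_records, run_records, ram_records):
    """Seeks to read every run once in a single K-way merge."""
    runs = math.ceil(total_records / run_records)
    buffer_records = ram_records // runs  # whole records per input buffer
    seeks_per_run = math.ceil(run_records / buffer_records)
    return runs, buffer_records, seeks_per_run, runs * seeks_per_run


runs, per_buffer, per_run, total = single_step_merge_seeks(
    total_records=8_000_000, run_records=15_000, ram_records=10_000)
print(runs, per_buffer, per_run, total)
```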
 
What if we add Multistep Merge to Replacement Selection Sort?

| Approach | # Records/Seek to Form Runs | Size of Runs Formed | # Runs Formed | Merge Pattern Used | # Seeks in Merge Phases | Total # of Seeks |
|---|---|---|---|---|---|---|
| 800 RAM Sorts | 10,000 | 10,000 | 800 | 25 X 32-way, then 25-way | 25,600, then 20,000 | 127,200 |
| Replacement Selection, Random Order | 2,500 | 15,000 | 534 | 19 X 28-way, then 19-way | 22,876, then 15,162 | 124,438 |
| Replacement Selection, Partially Ordered | 2,500 | 40,000 | 200 | 20 X 10-way, then 20-way | 8,000, then 16,000 | 110,400 |

The whole thing can be further improved if we can dedicate two disks - one for input and one for output; and more yet if we can use more than one processor.
It can all fall apart if we are in a multiprogramming environment.
Ideally, sorting large files is done when machine(s) can be dedicated to just the sort. Also, these improvements assume that reading and writing can overlap with processing.
 
Summary:
Use Heapsort for in-RAM sorting
Use as much RAM as possible
Use Multi-Step Merge if # initial runs is large
Consider Replacement Selection to form initial runs if file is partially ordered
Use more than one Disk Drive and I/O Channel
 
Sorting on tape
Distribute unsorted file into sorted runs (Replacement Selection is good)
Merge runs into single sorted file
Balanced Merge: spread runs among the available drives to make more efficient use of them
Two-Way Balanced Merge
uses 4 tape drives (goes from 2 to 2)
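A rough Python simulation of the 2-way balanced merge. Each tape is modelled as a list of runs, and the distribution rule (alternate runs between the two output tapes) is an assumption of this sketch. It returns the merged run and the number of merge passes.

```python
import heapq


def balanced_two_way_merge(runs):
    """Merge runs spread over 2 input tapes onto 2 output tapes, repeatedly."""
    # Initial distribution: alternate the sorted runs over two tapes.
    tapes_in = [runs[0::2], runs[1::2]]
    passes = 0
    while len(tapes_in[0]) + len(tapes_in[1]) > 1:
        tapes_out = [[], []]
        for i in range(len(tapes_in[0])):
            a = tapes_in[0][i]
            # The second tape may be one run short on the last step.
            b = tapes_in[1][i] if i < len(tapes_in[1]) else []
            # Merged runs alternate between the two output tapes.
            tapes_out[i % 2].append(list(heapq.merge(a, b)))
        tapes_in = tapes_out  # output tapes become the next input tapes
        passes += 1
    merged = tapes_in[0][0] if tapes_in[0] else []
    return merged, passes


result, passes = balanced_two_way_merge([[3, 9], [1, 7], [2, 8], [4, 6], [5]])
print(result, passes)
```

Each pass roughly halves the number of runs, so the number of passes grows with log2 of the number of initial runs.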
 

K-Way Balanced Merge: as above but uses more drives
[ FIGURE: 2-Way Balanced Merge - 2 ]

Multi-Phase Merge
try to eliminate empty runs
try to spread runs to maximize use of tapes while minimizing actual reading and writing
 
Maintenance:
adding, deleting, changing records
 
batch processing
- updates accumulate in a transaction file (often sorted into same key order as master file) and are applied to the master file in a maintenance run
- transaction files are often edited and carefully checked before being applied to the master file
- audit/error listing (log file) is standard; sometimes in a form that can be edited and used to form the basis for the next transaction file
- matched vs unmatched records (unmatched are those whose keys can be found in only one of the two files, master or transaction)
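A maintenance run is itself a cosequential process. The Python sketch below is only an illustration: the record layout ((key, data) pairs), the operation names "add", "change" and "delete", the wording of the log messages, and the limit of one transaction per key are all assumptions made for the sketch.

```python
def apply_transactions(master, transactions):
    """Apply sorted (key, op, data) transactions to a sorted master list.

    Returns the new master list and an error log of rejected transactions.
    """
    new_master, log = [], []
    i = j = 0
    while i < len(master) or j < len(transactions):
        m_key = master[i][0] if i < len(master) else None
        t_key = transactions[j][0] if j < len(transactions) else None
        if t_key is None or (m_key is not None and m_key < t_key):
            new_master.append(master[i])  # unmatched master: copy as is
            i += 1
        elif m_key is None or t_key < m_key:
            key, op, data = transactions[j]  # unmatched transaction
            if op == "add":
                new_master.append((key, data))
            else:
                log.append(f"{op} {key}: no matching master record")
            j += 1
        else:
            key, op, data = transactions[j]  # matched pair
            if op == "change":
                new_master.append((key, data))
            elif op == "add":
                log.append(f"add {key}: key already in master")
                new_master.append(master[i])
            # op == "delete": the master record is simply not copied
            i += 1
            j += 1
    return new_master, log


master = [("A", 1), ("C", 3), ("D", 4)]
updates = [("B", "add", 2), ("C", "change", 30),
           ("D", "delete", None), ("E", "delete", None)]
print(apply_transactions(master, updates))
```

Both files are read once, in key order, and the log collects exactly the unmatched or conflicting transactions for later review.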
 
