Cosequential Processing and Sorting Large Files
- open input files
- create output file
- /* should treat 1st record separately */
- set prev_name = low_val
- more_names = true; /* reset if either file reaches EOF */
- read (file1, name1)
- read (file2, name2)
- while (more_names) do
-     if (name1 < name2) then
-         read (file1, name1)
-     else if (name1 > name2) then
-         read (file2, name2)
-     else /* match found */
-         write (outfile, name1)
-         read (file1, name1)
-         read (file2, name2)
-     endif
- endwhile
- cleanup;
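The match loop above can be sketched in Python as follows. This is a minimal sketch, assuming both inputs are already in memory as sorted, duplicate-free lists; the function name `match` and the sample names are illustrative and not part of the notes.

```python
def match(list1, list2):
    """Return the names that appear in both sorted, duplicate-free lists."""
    out = []
    i = j = 0
    # Stop as soon as either list is exhausted (more_names becomes false).
    while i < len(list1) and j < len(list2):
        if list1[i] < list2[j]:
            i += 1                  # read next name from file1
        elif list1[i] > list2[j]:
            j += 1                  # read next name from file2
        else:                       # match found
            out.append(list1[i])
            i += 1
            j += 1
    return out


print(match(["ADAMS", "CARTER", "DAVIS"],
            ["BAKER", "CARTER", "DAVIS", "EVANS"]))
```

Each list is scanned once, front to back, which is what makes the process cosequential.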
- getval (file, val);
- if EOF (file) then more_names = false
- else if val <= prev_name then ERROR /* input out of sequence */
- update prev_name
- if EOF (file) & otherlist_done then /* both lists done */
- more_names = false
- else if EOF (file) then /* just this list done */
- otherlist_done = true;
- else if val <= prev_name then
- ERROR
- update prev_name
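The merge version of the input routine keeps going until both lists are done. A minimal Python sketch of that idea follows. The merge loop itself is not shown in these notes, so its shape here is an assumption, as are the names `checked` and `merge`; `None` stands in for "this list is done".

```python
def checked(names):
    """Yield names one at a time; raise if they are not strictly ascending."""
    prev = None                      # plays the role of low_val
    for name in names:
        if prev is not None and name <= prev:
            raise ValueError(f"sequence error: {name!r} after {prev!r}")
        prev = name                  # update prev_name
        yield name


def merge(list1, list2):
    """Merge two sorted lists, writing names found in both only once."""
    it1, it2 = checked(list1), checked(list2)
    name1, name2 = next(it1, None), next(it2, None)
    out = []
    # Continue until BOTH lists are done, not just one of them.
    while name1 is not None or name2 is not None:
        if name2 is None or (name1 is not None and name1 < name2):
            out.append(name1)
            name1 = next(it1, None)
        elif name1 is None or name1 > name2:
            out.append(name2)
            name2 = next(it2, None)
        else:                        # same name in both lists
            out.append(name1)
            name1, name2 = next(it1, None), next(it2, None)
    return out


print(merge(["ADAMS", "CARTER"], ["BAKER", "CARTER", "EVANS"]))
```

Keeping the sequence check inside the reader means the merge loop never has to look at `prev_name` itself.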
- while (more_names) do
-     out_name = min (name1, name2, name3, ... namek)
-     write (outfile, out_name)
-     if (name1 == out_name) then
-         read (file1, name1)
-     if (name2 == out_name) then
-         read (file2, name2)
-     if (name3 == out_name) then
-         read (file3, name3)
-     .
-     .
-     .
-     if (namek == out_name) then
-         read (filek, namek)
- endwhile
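The k-way loop above can be sketched in Python as follows. This is a minimal sketch, assuming each input is a sorted list and that `None` never occurs as a real name (it is used here to mark an exhausted list); the function name `kway_merge` is illustrative.

```python
def kway_merge(lists):
    """Merge k sorted lists; a name present in several lists is written once."""
    iters = [iter(lst) for lst in lists]
    heads = [next(it, None) for it in iters]   # name1 ... namek
    out = []
    while any(h is not None for h in heads):
        out_name = min(h for h in heads if h is not None)
        out.append(out_name)
        # Advance every list whose current name was just written.
        for k, head in enumerate(heads):
            if head == out_name:
                heads[k] = next(iters[k], None)
    return out


print(kway_merge([["A", "D"], ["B", "D", "F"], ["C"]]))
```

Scanning all k heads for the minimum costs k comparisons per output record, which is acceptable for small k; for large k a selection tree or heap over the heads is the usual alternative.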
Combining READ and BUILD_HEAP:
- read records in blocks
- process each block as it comes; fetch the next one while processing the current one
- place each new block in the array directly at the current end, so the next key to add to the heap is already where it should be; then just continue through the array
- the sorter may sometimes wait for the reader to catch up, but the reader should never have to wait (so the sort takes just a bit longer than reading the file)
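The block-at-a-time heap build can be sketched in Python as follows. This is a simplified sketch: it uses the standard `heapq` module and processes blocks one after another, so it shows the incremental build but not true overlap of I/O and processing, which would need asynchronous reads or a second thread. The function name and the sample blocks are illustrative.

```python
import heapq


def build_heap_from_blocks(blocks):
    """Build a min-heap incrementally as each block of keys arrives."""
    heap = []
    for block in blocks:             # each block stands for one read
        for key in block:
            # The new key goes at the end of the array and is sifted up.
            heapq.heappush(heap, key)
    return heap


heap = build_heap_from_blocks([[5, 3], [8, 1], [9, 2]])
print(heap[0])   # smallest key is at the root
```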
- for I := 1 to record_count
-     write record at array[1] (it's the smallest)
-     move array[end] to array[1] (call it K)
-     end = end - 1
-     while K > either child
-         swap (K, smaller child)
-     endwhile
- endfor
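The output loop above can be sketched in Python as follows. This is a minimal sketch under two assumptions: the array is 0-based (the notes use `array[1]` as the root), and the heap has already been built, here with the standard `heapq.heapify`. The names `sift_down` and `drain_heap` are illustrative.

```python
import heapq


def sift_down(heap, end):
    """Restore heap order in heap[0:end] after the root was replaced."""
    k = 0
    while True:
        left, right = 2 * k + 1, 2 * k + 2
        smallest = k
        if left < end and heap[left] < heap[smallest]:
            smallest = left
        if right < end and heap[right] < heap[smallest]:
            smallest = right
        if smallest == k:            # K is no larger than either child
            return
        heap[k], heap[smallest] = heap[smallest], heap[k]
        k = smallest


def drain_heap(heap):
    """Write out the keys of a min-heap in ascending order."""
    out = []
    end = len(heap)
    while end > 0:
        out.append(heap[0])          # root is the smallest remaining key
        end -= 1
        heap[0] = heap[end]          # move last key to the root (K)
        sift_down(heap, end)
    return out


keys = [5, 3, 8, 1, 9, 2]
heapq.heapify(keys)
print(drain_heap(keys))
```

Because each output record is available as soon as the root is removed, writing can proceed while the next sift-down runs.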
- - can sort BIG files of virtually any size
- - reading for setup is sequential, so as fast as possible
- - reading for merge/output is also sequential (random access happens only when switching files)
- - can apply Heapsort and overlap input/output with processing
- - since all access is sequential, the sort can also be done on tape
- allocate more hardware
- perform merge in several steps
- increase length of sorted runs
- overlap I/O operations
- HARDWARE:
- increase RAM
- increase # of dedicated disk drives; organize files to minimize seeks
- increase # of I/O channels
- divide the 800 runs into 25 sets of 32 runs each (10,000 records/run, so 320,000 records per set)
- Step 1 (sort/merge each subset):
- each input buffer holds 1/32 of a run, so each set takes 32 X 32 = 1,024 seeks.
- Have 25 sets to do, so Total = 25 X 1,024 = 25,600 seeks (each merged run is 32 megabytes)
- Step 2 (merge the 25 sets):
- allocate 1/25 of total buffer space for each run
- each buffer then holds 400 records (1/800 of a run, since 1 run = 320,000 records = 32 megabytes).
- 800 seeks per run, so 25 X 800 seeks = 20,000 seeks
- Total = 25,600 + 20,000 = 45,600 seeks
- Traded an extra pass over the data for more buffer space per run in each pass; more buffer space per run = fewer random accesses (seeks).
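The seek counts for the two-step merge can be checked with a short Python script. This is a sketch under assumptions inferred from the figures in these notes: memory holds 10,000 records (one initial run), memory is split evenly among the runs being merged, and fractional records per buffer are not rounded. The function name `merge_seeks` is illustrative.

```python
MEMORY_RECORDS = 10_000   # assumed: buffer space equals one initial run


def merge_seeks(num_runs, records_per_run, memory=MEMORY_RECORDS):
    """Seeks needed to read num_runs runs in one num_runs-way merge."""
    buffer_records = memory / num_runs            # each run gets 1/num_runs
    seeks_per_run = records_per_run / buffer_records
    return round(num_runs * seeks_per_run)


one_step = merge_seeks(800, 10_000)               # single 800-way merge
step1 = 25 * merge_seeks(32, 10_000)              # 25 separate 32-way merges
step2 = merge_seeks(25, 320_000)                  # final 25-way merge
print(one_step, step1, step2, step1 + step2)
# prints: 640000 25600 20000 45600
```

The second step reads every record again, yet the total stays far below the single-step figure because each buffer is so much larger.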
- Read a set of records and sort them with heapsort (call this the primary heap)
- Write out only the first record (smallest)
- Bring in a new key
- if it's bigger than the one just written, place it in the primary heap
- if it's smaller, place it in the secondary heap
- Repeat until the primary heap is empty; the secondary heap then becomes the primary heap for the next run
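The steps above can be sketched in Python as follows. This is a minimal sketch, assuming keys arrive from an in-memory iterable and using the standard `heapq` module for both heaps. The function name `replacement_selection`, the choice to treat a key equal to the last one written as belonging to the current run, and the sample keys are all assumptions for illustration.

```python
import heapq
from itertools import islice


def replacement_selection(keys, heap_size):
    """Split keys into sorted runs using a primary and a secondary heap."""
    source = iter(keys)
    primary = list(islice(source, heap_size))    # initial fill
    heapq.heapify(primary)
    secondary = []
    runs, current = [], []
    while primary:
        smallest = heapq.heappop(primary)
        current.append(smallest)                 # write out smallest key
        for new_key in islice(source, 1):        # bring in one new key, if any
            if new_key >= smallest:
                heapq.heappush(primary, new_key)     # still fits this run
            else:
                heapq.heappush(secondary, new_key)   # must wait for next run
        if not primary:                          # current run is finished
            runs.append(current)
            current = []
            primary, secondary = secondary, []
    return runs


keys = [33, 18, 24, 58, 14, 17, 7, 21, 67, 12, 5, 47, 16]
for run in replacement_selection(keys, heap_size=3):
    print(run)
```

With a heap of only three keys, the runs produced here are longer than three keys each, which is the point of the technique.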
- 800 runs = 1,600 seeks for plain heapsort
- for replacement selection, split the I/O buffer into 7,500 records for the sort and 2,500 records for the input (waiting) buffer
- = 8,000,000/2,500 = 3,200 seeks to read the file, so...
- 6,400 seeks for replacement selection sort (!)
- BUT...
- run length for replacement selection sort is ~15,000 records
- (7,500 spaces for records; 1 run holds 2 X that = 15,000); 50% more than one run of plain heapsort
- 8,000,000 records / 15,000 records per run = 534 runs
- Now for the merge part, split RAM (1 MB) into 534 buffers;
- each holds 18.73 records (let's say 18), so
- 15,000/18 = 834 seeks per run
- we get 834 seeks per run X 534 runs
- = 445,356 seeks altogether (compared to 640,000 for plain heapsort)
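The replacement selection figures can be reproduced with a short Python script. This is a sketch under assumptions read from the notes: 8,000,000 records, memory for 10,000 records, and a sort phase that costs one seek per input buffer load, doubled (taken here to cover reading plus writing). The function name and parameter names are illustrative.

```python
import math


def replacement_selection_seeks(file_records=8_000_000,
                                memory_records=10_000,
                                heap_records=7_500,
                                input_buffer_records=2_500):
    """Return (sort_seeks, runs, buffer_records, seeks_per_run, merge_seeks)."""
    # Sort phase: one seek per buffer load, doubled as in the notes.
    sort_seeks = 2 * (file_records // input_buffer_records)
    run_length = 2 * heap_records                  # ~2x the heap size
    runs = math.ceil(file_records / run_length)
    # Merge phase: memory is split into one buffer per run.
    buffer_records = memory_records // runs        # whole records only
    seeks_per_run = math.ceil(run_length / buffer_records)
    merge_seeks = runs * seeks_per_run
    return sort_seeks, runs, buffer_records, seeks_per_run, merge_seeks


print(replacement_selection_seeks())
# prints: (6400, 534, 18, 834, 445356)
```

The sort phase costs more seeks than plain heapsort (6,400 versus 1,600), but the smaller number of runs saves far more in the merge.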
- What if we add Multistep Merge to Replacement Selection Sort?
20,000 then 19-way 15,162 then 20-way 16,000
- Use Heapsort for in-RAM sorting
- Use as much RAM as possible
- Use Multi-Step Merge if # initial runs is large
- Consider Replacement Selection to form initial runs if file is partially ordered
- Use > 1 Disk Drive and I/O Channel
- Distribute unsorted file into sorted runs (Replacement Selection is good)
- Merge runs into single sorted file
- Balanced Merge: spread runs among the available drives to make more efficient use of them
- uses 4 tape drives (each pass merges from 2 tapes onto the other 2)
- try to eliminate empty runs
- try to spread runs to maximize use of tapes while minimizing actual reading and writing