CPSC 461: Copyright (C) 2003 Katrin Becker 1998-2002 Last Modified May 28, 2003 11:08 PM

Data Compression - Statistical Methods

- here we use variable-sized codes: shorter codes are assigned to more common symbols
E.g. Morse Code
A   .-       M   --       1   .----    Full-stop (period)       .-.-.-
Ä   .-.-     N   -.       2   ..---    Comma                    --..--
Á   .--.-    Ñ   --.--    3   ...--    Colon                    ---...
Å   .--.-    O   ---      4   ....-    Question mark (query)    ..--..
B   -...     Ö   ---.     5   .....    Apostrophe               .----.
C   -.-.     P   .--.     6   -....    Hyphen                   -....-
Ch  ----     Q   --.-     7   --...    Fraction bar             -..-.
D   -..      R   .-.      8   ---..    Brackets (parentheses)   -.--.-
E   .        S   ...      9   ----.    Quotation marks          .-..-.
É   ..-..    T   -        0   -----
F   ..-.     U   ..-
G   --.      Ü   ..--
H   ....     V   ...-
I   ..       W   .--
J   .---     X   -..-
K   -.-      Y   -.--
L   .-..     Z   --..

If the duration of a dot is taken to be one unit then that of a dash is three units. The space between the components of one character is one unit, between characters is three units and between words seven units. To indicate that a mistake has been made and for the receiver to delete the last word send ........ (eight dots).

Credit: Nick Wayth

When assigning variable-length codes, 2 main problems exist:

1. assigning codes that are unambiguous (no code may be a prefix of another)
2. assigning codes with the minimum average size
For a discussion of the Information Theory background, see additional notes.
Shannon-Fano Coding
1. Start by sorting symbols in descending order by frequency.
2. Then divide the list into 2 subsets such that each subset has close to the same total frequency;
one subset is assigned a 0 and the other is assigned a 1
3. Recursively divide each subset to assign the next part of the code.
4. Once the size of the subset = 2 we assign 0 & 1 and stop.
frequency   1st subdivision   2nd subdivision   3rd subdivision   final step   assigned code
0.25        0                 0                                                00
0.20        0                 1                                                01
0.15        1                 0                 0                              100
0.15        1                 0                 1                              101
0.10        1                 1                 0                              110
0.10        1                 1                 1                 0            1110
0.05        1                 1                 1                 1            1111
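The steps above can be sketched in Python (a minimal illustration, not tuned for efficiency; the split point is chosen to balance the two subsets' total frequencies as closely as possible):

```python
def shannon_fano(symbols):
    """symbols: list of (symbol, frequency) pairs; returns dict symbol -> code.
    Recursively splits the frequency-sorted list into two groups of roughly
    equal total frequency, appending '0' to one group and '1' to the other."""
    symbols = sorted(symbols, key=lambda s: s[1], reverse=True)
    codes = {sym: "" for sym, _ in symbols}

    def split(group):
        if len(group) <= 1:
            return
        total = sum(f for _, f in group)
        best_diff, cut = total, 1
        # find the split point that best balances the two subsets
        for i in range(1, len(group)):
            running = sum(f for _, f in group[:i])
            diff = abs((total - running) - running)
            if diff < best_diff:
                best_diff, cut = diff, i
        for sym, _ in group[:cut]:
            codes[sym] += "0"
        for sym, _ in group[cut:]:
            codes[sym] += "1"
        split(group[:cut])
        split(group[cut:])

    split(symbols)
    return codes
```

Running it on the frequencies from the table (scaled to integers) reproduces the codes in the last column.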

Huffman Coding

- developed by David Huffman in 1952
- similar to Shannon-Fano, but it constructs each code from its last bit to its first, building the tree bottom-up from the least probable symbols
- UNIX pack and unpack use Huffman codes on bytes

1. build list of symbols in descending order of probability
2. construct a tree, bottom up, with a symbol at every leaf
3. at each step select the 2 symbols with the smallest probability
add them to the top of the tree;
'delete' them from the list;
replace them with an auxiliary symbol that represents both
4. traverse the tree to determine the codes
( always assign code '0' to the symbol with the smallest probability )
See if you can generate the Huffman Trees for these:
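The construction above can be sketched with a heap handling the "select the 2 smallest" step (a minimal sketch; the tiebreak counter is only there to keep heap tuples comparable):

```python
import heapq
from itertools import count

def huffman_codes(freqs):
    """freqs: dict symbol -> probability. Returns dict symbol -> code.
    Repeatedly merges the two lowest-probability nodes (building the
    tree bottom-up), then reads the codes off root-to-leaf."""
    tiebreak = count()
    heap = [(p, next(tiebreak), sym) for sym, p in freqs.items()]
    heapq.heapify(heap)
    if len(heap) == 1:
        return {heap[0][2]: "0"}
    while len(heap) > 1:
        p1, _, left = heapq.heappop(heap)   # smallest probability
        p2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (p1 + p2, next(tiebreak), (left, right)))
    codes = {}
    def walk(node, code):
        if isinstance(node, tuple):          # internal node
            walk(node[0], code + "0")        # '0' to the smaller-probability branch
            walk(node[1], code + "1")
        else:                                # leaf symbol
            codes[node] = code
    walk(heap[0][2], "")
    return codes
```

For the six probabilities in the MNP5 question below (0.3, 0.25, 0.15, 0.15, 0.10, 0.05) this yields an average code length of 2.45 bits per symbol.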

Adaptive Huffman Coding:
EXCELLENT example applet to play with: http://www.cs.sfu.ca/cs/CC/365/mark/squeeze/AdaptiveHuff.html
- avoids 2 passes [compact on UNIX uses a form of this]
Starts with an empty Huffman Tree which is modified as we encounter symbols.
- the 1st occurrence of any symbol is written out uncompressed, assigned a code, and added to the tree
- the next time the symbol is encountered, the code is written out instead and its frequency is incremented by one
- since the tree has been modified, it is examined to see if it is still a Huffman Tree (best codes) and, if necessary, rearranged
Decompression works the same way - to stay in step it needs to know what is compressed and what is raw, so we need an escape code to mark the raw data - it too must vary in length.
Question: How does the length of the escape code relate to tree depth? (Does it have to equal the maximum depth of the tree?)
Updating the Tree:
1. compare X to its successors in the tree (left to right, bottom up)
2. if the immediate successor has Fsucc. > FX:
OK - nothing to do
otherwise swap X <-> SUCC. (unless SUCC is X's parent)
{swap with the last node in the <group> where Fsucc. = Fgroup.}
3. increment FX to FX + 1
4. Repeat with parent of X until ROOT
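The update rule works because a valid Huffman tree satisfies the "sibling property": listing its nodes bottom-up, left to right, the weights never decrease, and each internal node's weight is the sum of its children's. The swaps in step 2 restore exactly this ordering after an increment. A minimal sketch of the check (assuming the smaller-weight child is kept on the left within each level, as the standard construction does):

```python
class Node:
    def __init__(self, weight, left=None, right=None):
        self.weight = weight
        self.left = left
        self.right = right

def sibling_property_holds(root):
    """Check the sibling property: node weights listed bottom-up,
    left to right, must be non-decreasing, and every internal node's
    weight must equal the sum of its children's weights."""
    # collect nodes level by level (root first), then reverse
    levels, frontier = [], [root]
    while frontier:
        levels.append(frontier)
        frontier = [kid for n in frontier for kid in (n.left, n.right) if kid]
    order = [n for level in reversed(levels) for n in level]
    weights = [n.weight for n in order]
    sums_ok = all(n.weight == n.left.weight + n.right.weight
                  for n in order if n.left and n.right)
    return weights == sorted(weights) and sums_ok
```

When an increment would break this ordering, the updater swaps the incremented node with the last (highest-numbered) node of equal weight before incrementing, which is what step 2 describes.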

MNP5 (Microcom Networking Protocol) for modems
MNP specifies many things, including how to pack & unpack bytes before sending, how to transmit in synchronous and asynchronous modes, what modulation to use, etc.
CLASS 5 & CLASS 7 specify the compression algorithms
Class 5 uses a 2 stage process:
1. RLE
2. adaptive frequency coding
RLE is used when >= 3 identical characters are encountered. The compressor emits 3 copies followed by the repetition count.
This has 2 problems: 1) 3-byte runs end up being 4 bytes long, and 2) the max count is artificially set at 250.
The second stage of the encoding uses a variation of the adaptive Huffman coding technique described above.
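A sketch of the RLE stage (the exact MNP5 framing is an assumption here: the count byte is taken to hold the number of repeats beyond the first three, capped at 250, consistent with the limit noted above):

```python
def mnp5_rle(data: bytes) -> bytes:
    """MNP5-style run-length stage (sketch): when three or more equal
    bytes occur in a row, emit three copies followed by a count byte
    holding how many further repeats follow (capped at 250).  Runs
    shorter than three pass through untouched."""
    out = bytearray()
    i = 0
    while i < len(data):
        run = 1
        while i + run < len(data) and data[i + run] == data[i] and run < 3 + 250:
            run += 1
        if run >= 3:
            out += data[i:i+1] * 3
            out.append(run - 3)   # note: a 3-byte run grows to 4 bytes here
        else:
            out += data[i:i+1] * run
        i += run
    return bytes(out)
```

The `run - 3` count byte makes the first problem visible: a run of exactly three bytes expands to four.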
Question: Huffman codes are not unique for any given set of symbols. How many different valid Huffman codes are there? (Figure it out for 6 symbols with the following probabilities: 0.3, 0.25, 0.15, 0.15, 0.10, 0.05)

Facsimile Transmission:

Fax documents are scanned line by line; they are converted to black or white dots called pels (picture elements);

They are also encoded line by line


typical 8 1/2" line = 1728 pels/line; some only scan 8.2" - these get 1664 pels/line


T.4 (Group 3) - ITU-T (International Telecommunication Union)

uses a combo of RLE & Huffman encoding
- they analyzed run lengths for many, many faxes and found the most common run lengths were:
2,3,4 black pels and 2-7 white pels
- faxes are encoded as run lengths, e.g. '7w5b23w2b3w2b'
- the run length values are in turn encoded using a modified Huffman scheme
- obviously run lengths can be very long
- run lengths up to 63 each got their own code (1 white set and 1 black set); longer runs add make-up codes for the necessary multiples of 64
- long runs are made by "making change" from large to small run lengths using the greedy algorithm
- to avoid ambiguity each line has 1 white pel prepended at the start
bbwwwbbbbww = 1w2b3w4b2w, and
wwbwwwbbb = 3w1b3w3b
- has no error correction but reasonable error-detection
- recovery is not hard - once an error has been detected, just skip to the end of the line
- there's a special code for EOL so if error detected, search for EOL; if not found ABORT
- codes are 2-12 bits long so if decoder gets through 12 bits and still hasn't detected a run-length it has found an error
- each page has 1 EOL @ start and 6 EOL @ end
- images don't encode well - especially those that represent grey-level images - all are converted to black-and-white for the FAX and grey-levels are achieved by dithering
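The run-length conventions above can be sketched as follows (the helper names `t4_runs` and `split_run` are illustrative only; real T.4 then maps each run length to its modified-Huffman bit pattern, which is omitted here):

```python
def t4_runs(pels: str) -> str:
    """Run-length description of a scan line of 'w'/'b' pels.
    Per the convention above, one white pel is prepended so every
    line starts with a white run."""
    line = "w" + pels
    out, i = [], 0
    while i < len(line):
        j = i
        while j < len(line) and line[j] == line[i]:
            j += 1
        out.append(f"{j - i}{line[i]}")
        i = j
    return "".join(out)

def split_run(length: int) -> list:
    """'Making change' greedily: a long run is emitted as make-up
    pieces (multiples of 64) followed by a terminating piece (0-63)."""
    parts = []
    while length > 63:
        chunk = (length // 64) * 64
        # T.4 make-up codes only go up to 2560, so cap each piece
        chunk = min(chunk, 2560)
        parts.append(chunk)
        length -= chunk
    parts.append(length)
    return parts
```

This reproduces the worked examples: "bbwwwbbbbww" becomes 1w2b3w4b2w, and a run of 1000 pels is emitted as a make-up code for 960 plus a terminating code for 40.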
2-D Encoding
MR [Modified READ (Relative Element Address Designate); Group 4's MMR is a further modification]
- records differences between successive scan lines
- this method tends to be rather error-prone, so only a few lines in a row get encoded this way (usually 2 or 4 before another line is encoded the original 1-D way); restricting the number of 2-D lines makes it possible to recover without losing too much data

T.6 (Group 4)

uses 2-D encoding (MMR, Modified Modified READ) exclusively - somewhat different from Group 3

Text Compression
- typically statistical or dictionary
- the statistical approach requires a modelling stage and then a coding stage
- the model assigns probabilities
- some models use simple frequencies; others are based on context
- if based on context the algorithm can only use symbols already seen (history) as there is no opportunity for look-ahead
- the context is said to be N symbols
- attempts to predict a symbol are based on what we have seen so far (different probabilities are assigned in different contexts)
- said to use order-N Markov Model
PPM (Cleary & Witten)
- encoder maintains a statistical model of the text
1. input the next symbol S
2. assign it a probability P
3. send it to an adaptive arithmetic encoder
4. encode it with probability P
- the simplest form just counts occurrences: P = count / total_symbols
- if it is context based:
Example 1: "h" has frequency 5% in normal English text, but if the current symbol is "t" then the likelihood of "h" being next jumps to 30%, so we say the model of typical English predicts "h".
Example 2: "u" has normal probability 2% but if the current symbol is "q", then the probability of "u" being next is nearly 100%
Question: Why should we not assign a base probability of 0 to anything? Hint: what is the entropy of a symbol if its probability is 0?
Static PPM algorithms keep a table based on known words.
Adaptive context based methods start with an essentially empty table that changes as more is learned about the text being encoded.
The initial part of the scan will be compressed little.
This approach works well for lipograms (text that skews letter probabilities, e.g. "Gadsby" by E. V. Wright, a novel that contains no 'e's; "Alphabetical Africa" by Walter Abish, in which chapter one uses only words that start with 'a', chapter two adds words starting with 'b', and so on)
- can use short or long contexts for assigning probabilities (long contexts tend to weigh old input more heavily than new)
- can allow variable context: Let's say we have an order-3 context : "the" has been seen 27 times;
it was followed by "r" 11 times, "s" 9 times, "h" 6 times, "m" only once
now the next symbol is "a" - we've never seen "a" after "the", so reduce to 2nd order, what about "he", has it followed that? No? then what about just "e"?
if still not, then set a special flag for a first occurrence (-1 context)
- when the encoder decides to switch to a shorter context it emits an escape character - this allows the decoder to stay in step
- if a new character is seen, the encoder must keep switching and keep sending escapes - it ends up sending N + 1 escape characters
- the decoder's job is to find out what the next symbol actually is - it cannot use the encoder's approach: you can only check whether a symbol occurs in a context if you already know the symbol (which is exactly what the decoder is trying to find out)
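A toy context model in the spirit of the description above (a sketch only: it tracks counts per context and falls back to shorter contexts down to order -1, but omits the escape-probability bookkeeping and the arithmetic coder; the class name and method names are illustrative):

```python
from collections import defaultdict

class ContextModel:
    """Toy order-N context model: predict the next symbol from counts
    in the longest matching context, falling back to shorter contexts
    (where a real PPM coder would emit an escape each time) down to
    order -1, which means the symbol has never been seen at all."""
    def __init__(self, order=2):
        self.order = order
        # counts[context_string][symbol] -> occurrences seen so far
        self.counts = defaultdict(lambda: defaultdict(int))

    def predict(self, history, symbol):
        """Return (context_order_used, count); order -1, count 0
        means no matching context has ever seen this symbol."""
        for n in range(self.order, 0, -1):
            ctx = history[-n:] if len(history) >= n else None
            if ctx is not None and symbol in self.counts[ctx]:
                return n, self.counts[ctx][symbol]
        if symbol in self.counts[""]:
            return 0, self.counts[""][symbol]    # order-0: plain frequency
        return -1, 0

    def update(self, history, symbol):
        """Record the symbol in every context that applies (history only:
        the model never looks ahead)."""
        for n in range(self.order, 0, -1):
            if len(history) >= n:
                self.counts[history[-n:]][symbol] += 1
        self.counts[""][symbol] += 1
```

After training on "thethe", the model predicts 'e' in the order-2 context "th" with count 2; an unseen context falls back to the order-0 counts, and a never-seen symbol drops to order -1 - the case that forces the encoder to send N + 1 escapes.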
