CPSC 461: Copyright (C) 2003 Katrin Becker 1998-2002. Last Modified May 28, 2003 11:08 PM
Data Compression - Statistical Methods
  here we use variable-sized codes: shorter codes are assigned to more common symbols
 E.g. Morse Code
A  .-      N  -.      1  .----    Full stop (period)      .-.-.-
B  -...    O  ---     2  ..---    Comma                   --..--
C  -.-.    P  .--.    3  ...--    Colon                   ---...
D  -..     Q  --.-    4  ....-    Question mark (query)   ..--..
E  .       R  .-.     5  .....    Apostrophe              .----.
F  ..-.    S  ...     6  -....    Hyphen                  -....-
G  --.     T  -       7  --...    Fraction bar            -..-.
H  ....    U  ..-     8  ---..    Brackets (parentheses)  -.--.-
I  ..      V  ...-    9  ----.    Quotation marks         .-..-.
J  .---    W  .--     0  -----
K  -.-     X  -..-
L  .-..    Y  -.--
M  --      Z  --..

Accented letters and digraphs:
Ä  .-.-    Á  .--.-   Å  .--.-   É  ..-..   Ñ  --.--   Ö  ---.   Ü  ..--   Ch ----
If the duration of a dot is taken to be one unit then that of a dash is three units. The space between the components of one character is one unit, between characters is three units and between words seven units. To indicate that a mistake has been made and for the receiver to delete the last word send ........ (eight dots).
Credit: Nick Wayth
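The timing rules above can be sketched as a small duration calculator. This is an illustrative example, not part of the original notes; the message format (characters separated by spaces, words by " / ") is my own convention.

```python
# Compute the duration, in dot units, of a Morse transmission using the rules
# above: dot = 1, dash = 3, 1 unit between components of a character,
# 3 units between characters, 7 units between words.

def morse_duration(message: str) -> int:
    """message: words separated by ' / ', characters by ' ', e.g. '.... ..'"""
    total = 0
    words = message.split(" / ")
    for wi, word in enumerate(words):
        if wi > 0:
            total += 7                      # inter-word gap
        chars = word.split(" ")
        for ci, ch in enumerate(chars):
            if ci > 0:
                total += 3                  # inter-character gap
            for si, sym in enumerate(ch):
                if si > 0:
                    total += 1              # gap between dots/dashes
                total += 1 if sym == "." else 3
    return total

print(morse_duration("."))     # 'E' -> 1
print(morse_duration(".-"))    # 'A' -> 1 + 1 + 3 = 5
```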
When assigning variable-length codes, 2 main problems exist:
 1. assigning codes that are unambiguous
 2. assigning codes with the minimum average size
 For a discussion of the Information Theory background, see additional notes.
Shannon-Fano Coding

 1. Start by sorting symbols in descending order by frequency.
 2. Then divide the list into 2 subsets whose total frequencies are as close to equal as possible
 one subset is assigned a 0 and the other is assigned a 1
 3. Recursively divide each subset to assign the next part of the code.
 4. Once the size of the subset = 2 we assign 0 & 1 and stop.

 Example:
frequency   1st subdivision   2nd subdivision   3rd subdivision   final step   assigned code
0.25        0                 0                                                00
0.20        0                 1                                                01
0.15        1                 0                 0                              100
0.15        1                 0                 1                              101
0.10        1                 1                 0                              110
0.10        1                 1                 1                 0            1110
0.05        1                 1                 1                 1            1111
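The steps above can be sketched in Python. This is an illustrative implementation, not the course's reference code; the symbol names a-g are invented labels for the seven frequencies in the example table.

```python
# Shannon-Fano: sort symbols by descending probability, split the list where
# the two halves' totals are closest, assign 0/1, and recurse on each half.

def shannon_fano(probs):
    """probs: list of (symbol, probability). Returns {symbol: code}."""
    items = sorted(probs, key=lambda sp: sp[1], reverse=True)
    codes = {}

    def split(group, prefix):
        if len(group) == 1:
            codes[group[0][0]] = prefix or "0"
            return
        total = sum(p for _, p in group)
        running, best_i, best_diff = 0.0, 1, float("inf")
        for i in range(1, len(group)):         # find the most balanced split
            running += group[i - 1][1]
            diff = abs(running - (total - running))
            if diff < best_diff:
                best_diff, best_i = diff, i
        split(group[:best_i], prefix + "0")
        split(group[best_i:], prefix + "1")

    split(items, "")
    return codes

table = [("a", 0.25), ("b", 0.20), ("c", 0.15), ("d", 0.15),
         ("e", 0.10), ("f", 0.10), ("g", 0.05)]
print(shannon_fano(table))
# {'a': '00', 'b': '01', 'c': '100', 'd': '101', 'e': '110', 'f': '1110', 'g': '1111'}
```

Run on the example frequencies, this reproduces the assigned codes in the table above.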
Huffman Coding
  developed in 1952
  similar to Shannon-Fano, but it constructs the codes from the bottom up (lowest probabilities first)
  UNIX pack and unpack use Huffman codes on bytes
Algorithm:
 1. build list of symbols in descending order of probability
 2. construct a tree, bottom up, with a symbol at every leaf
 3. at each step select the 2 symbols with the smallest probability
 add them to the top of the tree;
 'delete' them from the list;
 replace them with an auxiliary symbol that represents both
 4. traverse the tree to determine the codes
 ( always assign code '0' to the symbol with the smallest probability )
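A minimal sketch of the algorithm above, assuming Python's heapq as the priority queue. The example probabilities are invented, and tie-breaking between equal probabilities may produce different (but equally valid) codes than a hand-drawn tree.

```python
import heapq
import itertools

# Huffman: repeatedly merge the two lowest-probability nodes, prefixing '0'
# onto the smaller group's codes and '1' onto the other's, until one tree remains.

def huffman(probs):
    """probs: dict {symbol: probability}. Returns {symbol: code}."""
    counter = itertools.count()    # tie-breaker so heapq never compares tuples of symbols
    heap = [(p, next(counter), (sym,)) for sym, p in probs.items()]
    heapq.heapify(heap)
    codes = {sym: "" for sym in probs}
    while len(heap) > 1:
        p0, _, group0 = heapq.heappop(heap)   # smallest probability -> bit '0'
        p1, _, group1 = heapq.heappop(heap)
        for sym in group0:
            codes[sym] = "0" + codes[sym]
        for sym in group1:
            codes[sym] = "1" + codes[sym]
        heapq.heappush(heap, (p0 + p1, next(counter), group0 + group1))
    return codes

codes = huffman({"a": 0.4, "b": 0.2, "c": 0.2, "d": 0.1, "e": 0.1})
print(codes)   # prefix-free; average length here is 2.2 bits/symbol
```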

 See if you can generate the Huffman Trees for these:


 Adaptive Huffman Coding:
 EXCELLENT example applet to play with: http://www.cs.sfu.ca/cs/CC/365/mark/squeeze/AdaptiveHuff.html

  avoids 2 passes [compact on UNIX uses a form of this]

 Starts with an empty Huffman Tree which is modified as we encounter symbols.

  1st occurrence of any symbol is written out uncompacted, assigned a code and added to the tree
  next time the symbol is encountered the code is written out instead and the frequency is incremented by one
  since the tree has been modified, it is examined to see if it is still a Huffman Tree (best codes) and if necessary, rearranged

 Decompression works the same way. To work, the decoder needs to know what is compressed and what is raw, so we need an escape code to mark the raw data; it too must vary in length.

 Question: How does the length of the escape code relate to tree depth? (Does it have to equal the maximum depth of the tree?)
 Updating the Tree:
 1. compare X to its successors in the tree (left to right, bottom up)
 2. if the immediate successor has F_{succ} > F_{X}
 then
 OK - nothing to do
 else
 swap X <-> SUCC (unless SUCC is the parent of X)
 {swap with the last node in the <group> where F_{succ} = F_{group}}
 REPEAT
 3. increment F_{X} to F_{X} + 1
 4. Repeat with parent of X until ROOT
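The incremental swapping update above is fiddly to implement compactly, so here is a deliberately simplified sketch of the same adaptive idea: it rebuilds the Huffman code from the current counts before each symbol instead of updating the tree in place. That is far slower than a real adaptive Huffman coder, but it shows the raw-vs-code behaviour described above. The "ESC:" marker is my own placeholder for the escape-plus-raw-symbol output.

```python
import heapq
import itertools
from collections import Counter

def build_codes(counts):
    """Huffman codes for the current frequency counts {symbol: count}."""
    if len(counts) == 1:
        return {next(iter(counts)): "0"}
    counter = itertools.count()              # tie-breaker for equal counts
    heap = [(n, next(counter), (sym,)) for sym, n in counts.items()]
    heapq.heapify(heap)
    codes = {sym: "" for sym in counts}
    while len(heap) > 1:
        n0, _, g0 = heapq.heappop(heap)      # two least-frequent groups
        n1, _, g1 = heapq.heappop(heap)
        for s in g0:
            codes[s] = "0" + codes[s]
        for s in g1:
            codes[s] = "1" + codes[s]
        heapq.heappush(heap, (n0 + n1, next(counter), g0 + g1))
    return codes

def adaptive_encode(text):
    counts = Counter()
    out = []
    for ch in text:
        if ch in counts:
            out.append(build_codes(counts)[ch])  # known symbol: emit current code
        else:
            out.append("ESC:" + ch)              # 1st occurrence: escape + raw
        counts[ch] += 1                          # model updated after each symbol
    return out

print(adaptive_encode("abab"))   # ['ESC:a', 'ESC:b', '0', '0']
```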


 MNP5 (Microcom Networking Protocol) for modems

 MNP specifies many things, including how to pack & unpack bytes before sending, how to transmit in synchronous and asynchronous modes, what modulation to use, etc.

 CLASS 5 & CLASS 7 specify the compression algorithms

 Class 5 uses a 2 stage process:
 1. RLE
 2. adaptive frequency coding

 RLE is used when >= 3 identical characters are encountered. The compressor emits 3 copies followed by the repetition count.

 This has 2 problems: 1) runs of exactly 3 characters end up being 4 bytes long, and 2) the maximum count is artificially capped at 250.
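The run-length stage can be sketched as follows. This is a hedged illustration of the scheme as described in these notes (3 literal copies followed by a count of the extra repetitions, capped at 250); the real MNP5 framing details differ.

```python
# MNP5-style RLE stage: runs of 3 or more identical bytes become the 3 bytes
# plus one count byte for the additional repetitions (0..250). Shorter runs
# pass through unchanged -- which is why 3-byte runs expand to 4 bytes.

def mnp5_rle(data: bytes) -> bytes:
    out = bytearray()
    i = 0
    while i < len(data):
        j = i
        while j < len(data) and data[j] == data[i] and j - i < 3 + 250:
            j += 1                        # extend the run, up to the cap
        run = j - i
        if run >= 3:
            out += data[i:i + 3]          # three literal copies
            out.append(run - 3)           # count of additional repeats
        else:
            out += data[i:j]              # short runs are left alone
        i = j
    return bytes(out)

print(mnp5_rle(b"aaaaabcc"))   # b'aaa\x02bcc'
```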

 The second stage of the encoding uses a variation of the adaptive Huffman coding technique described above.

 Question: Huffman codes are not unique for any given set of symbols. How many different valid Huffman codes are there? (Figure it out for 6 symbols with the following probabilities: 0.3, 0.25, 0.15, 0.15, 0.10, 0.05)
Facsimile Transmission:
Fax documents are scanned line by line; they are converted to black or white dots called pels (picture elements);
They are also encoded line by line
typical 8 1/2" line = 1728 pels/line; some only scan 8.2" - these get 1664 pels/line
T.4 (Group 3) - ITU-T (International Telecommunication Union)
 uses a combo of RLE & Huffman encoding
  they analyzed run lengths for many, many faxes and found the most common run lengths were:
 2, 3, and 4 black pels, and 2-7 white pels
  faxes are encoded by run lengths, e.g. '7w5b23w2b3w2b'
  the run length values are in turn encoded using a modified Huffman scheme
  obviously run lengths can be very long
  the 1st 63 run lengths were encoded (1 white and 1 black set), then the necessary multiples of 64
  long runs are made by "making change" from large to small run lengths using the greedy algorithm
  to avoid ambiguity each line has 1 white pel added at the start
 bbwwwbbbbww = 1w2b3w4b2w, and
 wwbwwwbbb = 3w1b3w3b
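The run-length step can be sketched in Python. This shows only the conversion from pels to run lengths with the prepended white pel; the actual T.4 modified Huffman code tables for the run-length values are omitted.

```python
# Convert a scan line of pels into alternating run lengths. One white pel is
# added at the start so every line begins with a white run (avoiding ambiguity).

def fax_runs(line: str) -> str:
    """line: string of 'w'/'b' pels. Returns e.g. '1w2b3w4b2w'."""
    pels = "w" + line                  # prepend the single white pel
    runs = []
    i = 0
    while i < len(pels):
        j = i
        while j < len(pels) and pels[j] == pels[i]:
            j += 1                     # extend the current run
        runs.append(f"{j - i}{pels[i]}")
        i = j
    return "".join(runs)

print(fax_runs("bbwwwbbbbww"))   # 1w2b3w4b2w
print(fax_runs("wwbwwwbbb"))     # 3w1b3w3b
```

These match the two worked examples above.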

  has no error correction but reasonable error detection
  recovery is not hard - once an error has been detected, just skip to the end of the line

  there's a special code for EOL, so if an error is detected, search for the next EOL; if none is found, ABORT
  codes are 2-12 bits long, so if the decoder gets through 12 bits and still hasn't recognized a run length it has found an error
  each page has 1 EOL @ start and 6 EOL @ end

  images don't encode well - especially those that represent grey-level images - all are converted to black-and-white for the FAX and grey levels are achieved by dithering

 2D Encoding

 MMR (Modified Modified READ (Relative Element Address Designate))
  records differences between successive scan lines
  this method tends to be rather error-prone, so only a few lines are encoded this way (usually 2 or 4) before another line is encoded the original way
  by restricting the number of lines encoded this way it is still possible to recover without losing too much data
T.6 (Group 4)

 uses 2D encoding exclusively - somewhat different from Group 3
 Text Compression

  typically statistical or dictionary

  the statistical approach requires a modeling stage and then a coding stage
  the model assigns probabilities
  some models use simple frequencies; others assign probabilities based on context
  if based on context the algorithm can only use symbols already seen (history) as there is no opportunity for lookahead
  the context is said to be N symbols
  attempts to predict a symbol are based on what we have seen so far (different probabilities are assigned in different contexts)
  said to use an order-N Markov Model

 PPM (Prediction by Partial Matching; Cleary & Witten)

  encoder maintains a statistical model of the text
 1. input next symbol S
 2. assign it a probability P
 3. send it to an adaptive arithmetic encoder
 4. encode it with probability P

  the simplest form just counts occurrences: P = count / total_symbols
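This simplest (order-0) model can be shown in a few lines. An illustrative sketch only: it computes the per-symbol probabilities from counts, and the adaptive arithmetic encoder stage is omitted.

```python
from collections import Counter

# Order-0 model: each symbol's probability is its count over the total
# number of symbols seen. No context is used.

def order0_model(text):
    counts = Counter(text)
    total = len(text)
    return {sym: n / total for sym, n in counts.items()}

print(order0_model("mississippi"))   # e.g. 's' -> 4/11, 'i' -> 4/11
```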

  if it is context based:
 Example 1: "h" has frequency 5% in normal English text, but if the current symbol is "t" then the likelihood of "h" being next jumps to 30%, so we say the model of typical English predicts "h".
 Example 2: "u" has normal probability 2% but if the current symbol is "q", then the probability of "u" being next is nearly 100%

 Question: Why should we not assign a base probability of 0 to anything? Hint: what is the entropy of the symbol if its probability is 0?

 Static PPM algorithms keep a table based on known words.

 Adaptive context based methods start with an essentially empty table that changes as more is learned about the text being encoded.

 The initial part of the scan will be compressed little.

 This approach works well for lipograms (text that skews letter probabilities, e.g.: "Gadsby" by E. V. Wright, a novel that contains no 'e's; "Alphabetical Africa" by Walter Abish, where chapter one has only words that start with 'a', chapter two adds words that start with 'b', and so on)

  can use short or long contexts for assigning probabilities (long ones tend to weight old input more heavily than new)

  can allow variable context: Let's say we have an order3 context : "the" has been seen 27 times;
 it was followed by "r" 11 times, "s" 9 times, "h" 6 times, "m" only once

 now the next symbol is "a" - we've never seen "a" after "the", so reduce to 2nd order: has "a" ever followed "he"? No? Then what about just "e"?

 if still not, then set a special flag for a first occurrence (the order -1 context)

  when the encoder decides to switch to a shorter context it emits an escape character - this allows the decoder to stay in step

  if a new character is seen, the encoder must keep switching and keep sending escapes - it ends up sending N + 1 escape characters
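The context fallback with escapes can be sketched as a toy (this is not full PPM: it only records which symbols have followed each context, with no probabilities or arithmetic coding, and the function names are my own):

```python
from collections import defaultdict

# For the next symbol, try the longest context first; each context that has
# never been followed by the symbol costs one 'ESC'. A completely new symbol
# costs N + 1 escapes plus the raw symbol, matching the note above (N = 3 here).

def predict(history, symbol, max_order=3):
    # build context tables from the history seen so far
    seen = defaultdict(set)
    for order in range(1, max_order + 1):
        for i in range(len(history) - order):
            seen[history[i:i + order]].add(history[i + order])
    events = []
    for order in range(max_order, 0, -1):
        ctx = history[-order:]
        if symbol in seen.get(ctx, set()):
            events.append(f"hit(order {order})")
            return events
        events.append("ESC")                 # shorten the context
    if symbol in history:
        events.append("hit(order 0)")        # seen somewhere, no context
    else:
        events.append("ESC")
        events.append(f"raw({symbol!r})")    # brand-new symbol: send it raw
    return events

print(predict("the cat in the hat", "e"))   # ['ESC', 'ESC', 'ESC', 'hit(order 0)']
```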

 the job of the decoder is to find out what the next symbol actually is - it cannot use the same approach ... you can only check a symbol in context if you know what it is (which is exactly what the decoder is trying to find out)