davidaltherr.net   (mobile) 
    home  |  journal  |  maps  |  archive  |  resume   
   
  academics : mathematics : huffman binary encodingShare


huffman binary encoding

  Download notebook at huffman binary encoding.

huffman_encoding.nb

Notebook Abstract

The following notebook develops the Huffman algorithm for reducing data to an optimized binary string.  While most solutions for this algorithm typically utilize flow control structures, e.g. 'For' or 'While' loops, this package presents a simpler approach via functional programming.  The Huffman algorithm utilizes a translation table that is based on the frequency of characters; the table may be generated from the data itself or from another dataset set with similar character frequencies.  The system is most efficient with large text files or other large datasets with a significant difference in relative character frequencies.  Also, remember that if the translation table is generated from the data itself, then the compression advantage must outweigh the overhead of the translation table.

Function Demonstration

Lets start with some sample data to demonstrate the following functions.  Remember that the data must be enclosed in a list.

[Graphics:Images/huffman_encoding_html_gr_1.gif]

The Huffman algorithm is really a three step process for encoding with one step for decoding.  First, we must determine the frequency of occurence for each character or symbol in our data.

[Graphics:Images/huffman_encoding_html_gr_2.gif]

The above function will analyze a list of data and generate a table of character and frequency relations sorted by frequency ascending.

[Graphics:Images/huffman_encoding_html_gr_3.gif]
[Graphics:Images/huffman_encoding_html_gr_4.gif]

Secondly, from the character frequency table we must generate a tree in which nodes and leafs can be represented by a binary string, i.e. there are always exactly two paths that we can take from any given node to a leaf or another node, represented by a '1' or a '0'.  The optimal stucture of the tree is determined by a recursive process which manipulates the characters based on their individual frequency.

[Graphics:Images/huffman_encoding_html_gr_5.gif]

The process begins by grouping the two characters of lowest frequency into a character 'set' under a node; the 'frequency' of the character set is then determined as the sum of the frequencies of its element characters.

[Graphics:Images/huffman_encoding_html_gr_6.gif]
[Graphics:Images/huffman_encoding_html_gr_7.gif]

The character set is then placed back into the frequency table in place of its element characters.  (Table truncated only for display purposes)

[Graphics:Images/huffman_encoding_html_gr_8.gif]
[Graphics:Images/huffman_encoding_html_gr_9.gif]

The frequency table is again sorted by frequency.   (Table truncated only for display purposes)

[Graphics:Images/huffman_encoding_html_gr_10.gif]
[Graphics:Images/huffman_encoding_html_gr_11.gif]

The process continues by grouping the two lowest frequency elements, whether they be characters or character sets, under a node.  For an original character set of n unique characters, the process will continue for n-1 recursions until we are eventually left with the full optimal tree structure, presented here as a nested list.

[Graphics:Images/huffman_encoding_html_gr_12.gif]
[Graphics:Images/huffman_encoding_html_gr_13.gif]

The next step is to reduce this structure to its binary representation in a translation table.  

[Graphics:Images/huffman_encoding_html_gr_14.gif]

Notice that you can navigate the above structure with only binary values as direction choices where '0' and '1' (respectively '1' and '2' in Mathematica) represent 'up' and 'down' as you move to the right.  Tracing the structure from the top level to a unique character or 'leaf' will result in a unique binary string.  If we map out all of the leafs and their corresponding binary indices within the structure, we are left with a translation table with character to binary assocations.

[Graphics:Images/huffman_encoding_html_gr_15.gif]
[Graphics:Images/huffman_encoding_html_gr_16.gif]

Notice that the most frequent characters have the shortest bit sequences, as is intended.  If our sample data were a large enough text such that all necessary characters were included, we might have a translation table suited for compressing most documents of the same language; English documents tend to exhibit high frequencies of the 'e', 's', 't', 'r', 'a' and space characters, whereas a document written in C++ might exhibit a high frequency of ';', '=', '(', ')', '[', ']' and tab characters.  The same holds true for many foreign languages; in fact, this algorithm is just as efficient with non-ASCII characters.  Most English texts can be written with standard 8-bit ASCII characters, but many languages require 16-bit and even 32-bit character sets, with the 32-bit character becoming the international standard.   Notice in the example above that we have reduced the most frequent characters to a minimum 5-bit representation, with the least frequent characters represented by a maximum 19-bit string.  A general rule of thumb is that greater compression is achevied with smaller alphabets and larger documents.  The next step is to make the appropriate replacements on a per-character basis.

[Graphics:Images/huffman_encoding_html_gr_17.gif]

We convert the translation table into a list of rules which we then apply to the data.  The result is the encoded bit stream seen below.

[Graphics:Images/huffman_encoding_html_gr_18.gif]
[Graphics:Images/huffman_encoding_html_gr_19.gif]

In order to decode the bit stream, we use the translation table to generate the same list of rules, only with each rule reversed.

[Graphics:Images/huffman_encoding_html_gr_20.gif]

The end result is the original list of data.

[Graphics:Images/huffman_encoding_html_gr_21.gif]
[Graphics:Images/huffman_encoding_html_gr_22.gif]

  Download notebook at huffman binary encoding.

  Return to mathematics.




  copyleft © 2010 david altherr
  free for use by open source license
  view: full | compact | mobile | print 

today:  
updated:  
2010.09.08 19.05
2009.02.05 17.26