(*********************************************************************** Mathematica-Compatible Notebook This notebook can be used on any computer system with Mathematica 4.0, MathReader 4.0, or any compatible application. The data for the notebook starts with the line containing stars above. To get the notebook into a Mathematica-compatible application, do one of the following: * Save the data starting with the line of stars above into a file with a name ending in .nb, then open the file inside the application; * Copy the data starting with the line of stars above to the clipboard, then use the Paste menu command inside the application. Data for notebooks contains only printable 7-bit ASCII and can be sent directly in email or through ftp in text mode. Newlines can be CR, LF or CRLF (Unix, Macintosh or MS-DOS style). NOTE: If you modify the data for this notebook not in a Mathematica- compatible application, you must delete the line below containing the word CacheID, otherwise Mathematica-compatible applications may try to use invalid cache data. For more information on notebooks and Mathematica-compatible applications, contact Wolfram Research: web: http://www.wolfram.com email: info@wolfram.com phone: +1-217-398-0700 (U.S.) Notebook reader applications are available free of charge from Wolfram Research. ***********************************************************************) (*CacheID: 232*) (*NotebookFileLineBreakTest NotebookFileLineBreakTest*) (*NotebookOptionsPosition[ 35760, 775]*) (*NotebookOutlinePosition[ 36462, 800]*) (* CellTagsIndexPosition[ 36418, 796]*) (*WindowFrame->Normal*) Notebook[{ Cell[CellGroupData[{ Cell[TextData[{ StyleBox["Huffman binary encoding:\n Text compression algorithm\n", FontFamily->"Arial", FontSize->24, FontWeight->"Bold"], StyleBox["\nDavid Altherr : altherda@email.uc.edu : www.davidaltherr.net", FontFamily->"Arial", FontSize->12, FontWeight->"Bold", FontSlant->"Italic"], StyleBox["\n", FontFamily->"Arial", FontSize->12], StyleBox["Copyright \[Copyright] 2001 David Altherr ; Permission granted to \ use under Open Source MIT public license.", FontFamily->"Arial", FontSize->12, FontSlant->"Italic"] }], "Title", FontColor->RGBColor[0, 0, 0.605478], Background->GrayLevel[0.900008]], Cell[CellGroupData[{ Cell["Notebook Abstract", "Section"], Cell["\<\ The following notebook develops the Huffman algorithm for reducing data to an \ optimized binary string. While most solutions for this algorithm typically \ utilize flow control structures, e.g. 'For' or 'While' loops, this package \ presents a simpler approach via functional programming. The Huffman \ algorithm utilizes a translation table that is based on the frequency of \ characters; the table may be generated from the data itself or from another \ dataset set with similar character frequencies. The system is most efficient \ with large text files or other large datasets with a significant difference \ in relative character frequencies. Also, remember that if the translation \ table is generated from the data itself, then the compression advantage must \ outweigh the overhead of the translation table.\ \>", "Text", Background->GrayLevel[0.900008]] }, Open ]], Cell[CellGroupData[{ Cell["Function Demonstration", "Section"], Cell["\<\ Lets start with some sample data to demonstrate the following functions. \ Remember that the data must be enclosed in a list.\ \>", "Text", Background->GrayLevel[0.900008]], Cell[BoxData[ \(\(data1\ = \ {"\", "\"};\)\)], "Input"], Cell["\<\ The Huffman algorithm is really a three step process for encoding with one \ step for decoding. First, we must determine the frequency of occurence for \ each character or symbol in our data.\ \>", "Text", Background->GrayLevel[0.900008]], Cell[BoxData[ \(\(getCharFreqTable[data_List] := Sort[\({Count[Flatten[\(Characters[ToString[#]] &\) /@ data], ToString[#[\([1]\)]]], ToString[#[\([1]\)]]} &\) /@ Transpose[{Union[ Flatten[\(Characters[ToString[#]] &\) /@ data]]}]];\)\)], "Input"], Cell["\<\ The above function will analyze a list of data and generate a table of \ character and frequency relations sorted by frequency ascending.\ \>", "Text", TextAlignment->Left, Background->GrayLevel[0.900008]], Cell[CellGroupData[{ Cell[BoxData[ \(getCharFreqTable[data1]\)], "Input"], Cell[BoxData[ FormBox[ RowBox[{"(", "\[NoBreak]", GridBox[{ {"1", "\<\";\"\>"}, {"1", "\<\"D\"\>"}, {"1", "\<\"F\"\>"}, {"1", "\<\"x\"\>"}, {"2", "\<\"H\"\>"}, {"2", "\<\"k\"\>"}, {"2", "\<\"W\"\>"}, {"3", "\<\"A\"\>"}, {"3", "\<\"q\"\>"}, {"3", "\<\"T\"\>"}, {"3", "\<\"z\"\>"}, {"4", "\<\",\"\>"}, {"4", "\<\"'\"\>"}, {"6", "\<\"v\"\>"}, {"6", "\<\"w\"\>"}, {"6", "\<\"y\"\>"}, {"8", "\<\".\"\>"}, {"9", "\<\"b\"\>"}, {"11", "\<\"p\"\>"}, {"14", "\<\"u\"\>"}, {"15", "\<\"d\"\>"}, {"17", "\<\"g\"\>"}, {"20", "\<\"c\"\>"}, {"21", "\<\"m\"\>"}, {"26", "\<\"f\"\>"}, {"34", "\<\"h\"\>"}, {"34", "\<\"l\"\>"}, {"35", "\<\"n\"\>"}, {"40", "\<\"s\"\>"}, {"45", "\<\"o\"\>"}, {"50", "\<\"r\"\>"}, {"51", "\<\"i\"\>"}, {"61", "\<\"a\"\>"}, {"74", "\<\"t\"\>"}, {"79", "\<\"e\"\>"}, {"124", "\<\" \"\>"} }], "\[NoBreak]", ")"}], TraditionalForm]], "Output"] }, Open ]], Cell["\<\ Secondly, from the character frequency table we must generate a tree in which \ nodes and leafs can be represented by a binary string, i.e. there are always \ exactly two paths that we can take from any given node to a leaf or another \ node, represented by a '1' or a '0'. The optimal stucture of the tree is \ determined by a recursive process which manipulates the characters based on \ their individual frequency.\ \>", "Text", TextAlignment->Left, Background->GrayLevel[0.900008]], Cell[BoxData[ \(getHuffmanTree[ data_List] := \(Nest[ Sort[Delete[ ReplacePart[#, {Plus @@ \((\(Transpose[ Take[#, {1, 2}]]\)[\([1]\)])\), Take[#, {1, 2}]}, {1}], {2}]] &, getCharFreqTable[data], Length[getCharFreqTable[data]] - 1]\)[\([1, 2]\)]\)], "Input"], Cell["\<\ The process begins by grouping the two characters of lowest frequency into a \ character 'set' under a node; the 'frequency' of the character set is then \ determined as the sum of the frequencies of its element characters.\ \>", "Text", TextAlignment->Left, Background->GrayLevel[0.900008]], Cell[CellGroupData[{ Cell[BoxData[ \(\({Plus @@ \((\(Transpose[Take[#, {1, 2}]]\)[\([1]\)])\), Take[#, {1, 2}]} &\)@getCharFreqTable[data1]\)], "Input"], Cell[BoxData[ FormBox[ RowBox[{"{", RowBox[{"2", ",", RowBox[{"(", "\[NoBreak]", GridBox[{ {"1", "\<\";\"\>"}, {"1", "\<\"D\"\>"} }], "\[NoBreak]", ")"}]}], "}"}], TraditionalForm]], "Output"] }, Open ]], Cell["\<\ The character set is then placed back into the frequency table in place of \ its element characters. (Table truncated only for display purposes)\ \>", "Text", TextAlignment->Left, Background->GrayLevel[0.900008]], Cell[CellGroupData[{ Cell[BoxData[ \(\((\(Delete[ ReplacePart[#, {Plus @@ \((\(Transpose[ Take[#, {1, 2}]]\)[\([1]\)])\), Take[#, {1, 2}]}, {1}], {2}] &\)@ getCharFreqTable[data1])\)[\([Range[1, 9]]\)]\)], "Input"], Cell[BoxData[ FormBox[ RowBox[{"(", "\[NoBreak]", GridBox[{ {"2", RowBox[{"(", "\[NoBreak]", GridBox[{ {"1", "\<\";\"\>"}, {"1", "\<\"D\"\>"} }], "\[NoBreak]", ")"}]}, {"1", "\<\"F\"\>"}, {"1", "\<\"x\"\>"}, {"2", "\<\"H\"\>"}, {"2", "\<\"k\"\>"}, {"2", "\<\"W\"\>"}, {"3", "\<\"A\"\>"}, {"3", "\<\"q\"\>"}, {"3", "\<\"T\"\>"} }], "\[NoBreak]", ")"}], TraditionalForm]], "Output"] }, Open ]], Cell["\<\ The frequency table is again sorted by frequency. (Table truncated only for \ display purposes)\ \>", "Text", TextAlignment->Left, Background->GrayLevel[0.900008]], Cell[CellGroupData[{ Cell[BoxData[ \(\((\(Sort[ Delete[ReplacePart[#, {Plus @@ \((\(Transpose[ Take[#, {1, 2}]]\)[\([1]\)])\), Take[#, {1, 2}]}, {1}], {2}]] &\)@ getCharFreqTable[data1])\)[\([Range[1, 9]]\)]\)], "Input"], Cell[BoxData[ FormBox[ RowBox[{"(", "\[NoBreak]", GridBox[{ {"1", "\<\"F\"\>"}, {"1", "\<\"x\"\>"}, {"2", "\<\"H\"\>"}, {"2", "\<\"k\"\>"}, {"2", "\<\"W\"\>"}, {"2", RowBox[{"(", "\[NoBreak]", GridBox[{ {"1", "\<\";\"\>"}, {"1", "\<\"D\"\>"} }], "\[NoBreak]", ")"}]}, {"3", "\<\"A\"\>"}, {"3", "\<\"q\"\>"}, {"3", "\<\"T\"\>"} }], "\[NoBreak]", ")"}], TraditionalForm]], "Output"] }, Open ]], Cell["\<\ The process continues by grouping the two lowest frequency elements, whether \ they be characters or character sets, under a node. For an original \ character set of n unique characters, the process will continue for n-1 \ recursions until we are eventually left with the full optimal tree structure, \ presented here as a nested list.\ \>", "Text", TextAlignment->Left, Background->GrayLevel[0.900008]], Cell[CellGroupData[{ Cell[BoxData[ \(getHuffmanTree[data1]\)], "Input"], Cell[BoxData[ FormBox[ RowBox[{"(", "\[NoBreak]", GridBox[{ {"329", RowBox[{"(", "\[NoBreak]", GridBox[{ {"153", RowBox[{"(", "\[NoBreak]", GridBox[{ {"74", "\<\"t\"\>"}, {"79", "\<\"e\"\>"} }], "\[NoBreak]", ")"}]}, {"176", RowBox[{"(", "\[NoBreak]", GridBox[{ {"81", RowBox[{"(", "\[NoBreak]", GridBox[{ {"40", "\<\"s\"\>"}, {"41", RowBox[{"(", "\[NoBreak]", GridBox[{ {"20", RowBox[{"(", "\[NoBreak]", GridBox[{ {"9", RowBox[{"(", "\[NoBreak]", GridBox[{ {"4", RowBox[{"(", "\[NoBreak]", GridBox[{ {"2", "\<\"W\"\>"}, {"2", RowBox[{"(", "\[NoBreak]", GridBox[{ {"1", "\<\";\"\>"}, {"1", "\<\"D\"\>"} }], "\[NoBreak]", ")"}]} }], "\[NoBreak]", ")"}]}, {"5", RowBox[{"(", "\[NoBreak]", GridBox[{ {"2", RowBox[{"(", "\[NoBreak]", GridBox[{ {"1", "\<\"F\"\>"}, {"1", "\<\"x\"\>"} }], "\[NoBreak]", ")"}]}, {"3", "\<\"A\"\>"} }], "\[NoBreak]", ")"}]} }], "\[NoBreak]", ")"}]}, {"11", "\<\"p\"\>"} }], "\[NoBreak]", ")"}]}, {"21", "\<\"m\"\>"} }], "\[NoBreak]", ")"}]} }], "\[NoBreak]", ")"}]}, {"95", RowBox[{"(", "\[NoBreak]", GridBox[{ {"45", "\<\"o\"\>"}, {"50", "\<\"r\"\>"} }], "\[NoBreak]", ")"}]} }], "\[NoBreak]", ")"}]} }], "\[NoBreak]", ")"}]}, {"487", RowBox[{"(", "\[NoBreak]", GridBox[{ {"223", RowBox[{"(", "\[NoBreak]", GridBox[{ {"101", RowBox[{"(", "\[NoBreak]", GridBox[{ {"50", RowBox[{"(", "\[NoBreak]", GridBox[{ {"24", RowBox[{"(", "\[NoBreak]", GridBox[{ {"12", RowBox[{"(", "\[NoBreak]", GridBox[{ {"6", "\<\"v\"\>"}, {"6", "\<\"w\"\>"} }], "\[NoBreak]", ")"}]}, {"12", RowBox[{"(", "\[NoBreak]", GridBox[{ {"6", "\<\"y\"\>"}, {"6", RowBox[{"(", "\[NoBreak]", GridBox[{ {"3", "\<\"q\"\>"}, {"3", "\<\"T\"\>"} }], "\[NoBreak]", ")"}]} }], "\[NoBreak]", ")"}]} }], "\[NoBreak]", ")"}]}, {"26", "\<\"f\"\>"} }], "\[NoBreak]", ")"}]}, {"51", "\<\"i\"\>"} }], "\[NoBreak]", ")"}]}, {"122", RowBox[{"(", "\[NoBreak]", GridBox[{ {"61", "\<\"a\"\>"}, {"61", RowBox[{"(", "\[NoBreak]", GridBox[{ {"29", RowBox[{"(", "\[NoBreak]", GridBox[{ {"14", "\<\"u\"\>"}, {"15", "\<\"d\"\>"} }], "\[NoBreak]", ")"}]}, {"32", RowBox[{"(", "\[NoBreak]", GridBox[{ {"15", RowBox[{"(", "\[NoBreak]", GridBox[{ {"7", RowBox[{"(", "\[NoBreak]", GridBox[{ {"3", "\<\"z\"\>"}, {"4", "\<\",\"\>"} }], "\[NoBreak]", ")"}]}, {"8", "\<\".\"\>"} }], "\[NoBreak]", ")"}]}, {"17", "\<\"g\"\>"} }], "\[NoBreak]", ")"}]} }], "\[NoBreak]", ")"}]} }], "\[NoBreak]", ")"}]} }], "\[NoBreak]", ")"}]}, {"264", RowBox[{"(", "\[NoBreak]", GridBox[{ {"124", "\<\" \"\>"}, {"140", RowBox[{"(", "\[NoBreak]", GridBox[{ {"68", RowBox[{"(", "\[NoBreak]", GridBox[{ {"34", "\<\"h\"\>"}, {"34", "\<\"l\"\>"} }], "\[NoBreak]", ")"}]}, {"72", RowBox[{"(", "\[NoBreak]", GridBox[{ {"35", "\<\"n\"\>"}, {"37", RowBox[{"(", "\[NoBreak]", GridBox[{ {"17", RowBox[{"(", "\[NoBreak]", GridBox[{ {"8", RowBox[{"(", "\[NoBreak]", GridBox[{ {"4", "\<\"'\"\>"}, {"4", RowBox[{"(", "\[NoBreak]", GridBox[{ {"2", "\<\"H\"\>"}, {"2", "\<\"k\"\>"} }], "\[NoBreak]", ")"}]} }], "\[NoBreak]", ")"}]}, {"9", "\<\"b\"\>"} }], "\[NoBreak]", ")"}]}, {"20", "\<\"c\"\>"} }], "\[NoBreak]", ")"}]} }], "\[NoBreak]", ")"}]} }], "\[NoBreak]", ")"}]} }], "\[NoBreak]", ")"}]} }], "\[NoBreak]", ")"}]} }], "\[NoBreak]", ")"}], TraditionalForm]], "Output"] }, Open ]], Cell["\<\ The next step is to reduce this structure to its binary representation in a \ translation table. \ \>", "Text", TextAlignment->Left, Background->GrayLevel[0.900008]], Cell[BoxData[ \(HuffmanTranslationTable[ data_List] := \((Clear[ArbFunc]; \[IndentingNewLine]transTable = Sort[Cases[ Flatten[MapIndexed[ArbFunc, getHuffmanTree[data], {\(-1\)}]], ArbFunc[_, {__, 2}]]]; \[IndentingNewLine]ArbFunc := {ToString@#1, StringJoin @@ \(ToString /@ \((Delete[#2, \(-1\)] - 1)\)\)} &; \[IndentingNewLine]transTable)\)\)], "Input"], Cell[TextData[{ "Notice that you can navigate the above structure with only binary values \ as direction choices where '0' and '1' (respectively '1' and '2' in ", StyleBox["Mathematica", FontSlant->"Italic"], ") represent 'up' and 'down' as you move to the right. Tracing the \ structure from the top level to a unique character or 'leaf' will result in a \ unique binary string. If we map out all of the leafs and their corresponding \ binary indices within the structure, we are left with a translation table \ with character to binary assocations." }], "Text", TextAlignment->Left, Background->GrayLevel[0.900008]], Cell[CellGroupData[{ Cell[BoxData[ \(translationTable = HuffmanTranslationTable[data1]\)], "Input"], Cell[BoxData[ FormBox[ RowBox[{"(", "\[NoBreak]", GridBox[{ {"\<\".\"\>", "\<\"1101111111011\"\>"}, {"\<\",\"\>", "\<\"110111111101011\"\>"}, {"\<\";\"\>", "\<\"01110111010101110\"\>"}, {"\<\"'\"\>", "\<\"111111111101010\"\>"}, {"\<\" \"\>", "\<\"11110\"\>"}, {"\<\"a\"\>", "\<\"1101110\"\>"}, {"\<\"A\"\>", "\<\"011101110101111\"\>"}, {"\<\"b\"\>", "\<\"1111111111011\"\>"}, {"\<\"c\"\>", "\<\"11111111111\"\>"}, {"\<\"d\"\>", "\<\"11011111011\"\>"}, {"\<\"D\"\>", "\<\"01110111010101111\"\>"}, {"\<\"e\"\>", "\<\"01011\"\>"}, {"\<\"f\"\>", "\<\"110101011\"\>"}, {"\<\"F\"\>", "\<\"01110111010111010\"\>"}, {"\<\"g\"\>", "\<\"11011111111\"\>"}, {"\<\"h\"\>", "\<\"111111010\"\>"}, {"\<\"H\"\>", "\<\"11111111110101110\"\>"}, {"\<\"i\"\>", "\<\"1101011\"\>"}, {"\<\"k\"\>", "\<\"11111111110101111\"\>"}, {"\<\"l\"\>", "\<\"111111011\"\>"}, {"\<\"m\"\>", "\<\"011101111\"\>"}, {"\<\"n\"\>", "\<\"111111110\"\>"}, {"\<\"o\"\>", "\<\"0111110\"\>"}, {"\<\"p\"\>", "\<\"01110111011\"\>"}, {"\<\"q\"\>", "\<\"110101010111110\"\>"}, {"\<\"r\"\>", "\<\"0111111\"\>"}, {"\<\"s\"\>", "\<\"0111010\"\>"}, {"\<\"t\"\>", "\<\"01010\"\>"}, {"\<\"T\"\>", "\<\"110101010111111\"\>"}, {"\<\"u\"\>", "\<\"11011111010\"\>"}, {"\<\"v\"\>", "\<\"1101010101010\"\>"}, {"\<\"w\"\>", "\<\"1101010101011\"\>"}, {"\<\"W\"\>", "\<\"011101110101010\"\>"}, {"\<\"x\"\>", "\<\"01110111010111011\"\>"}, {"\<\"y\"\>", "\<\"1101010101110\"\>"}, {"\<\"z\"\>", "\<\"110111111101010\"\>"} }], "\[NoBreak]", ")"}], TraditionalForm]], "Output"] }, Open ]], Cell["\<\ Notice that the most frequent characters have the shortest bit sequences, as \ is intended. If our sample data were a large enough text such that all \ necessary characters were included, we might have a translation table suited \ for compressing most documents of the same language; English documents tend \ to exhibit high frequencies of the 'e', 's', 't', 'r', 'a' and space \ characters, whereas a document written in C++ might exhibit a high frequency \ of ';', '=', '(', ')', '[', ']' and tab characters. The same holds true for \ many foreign languages; in fact, this algorithm is just as efficient with \ non-ASCII characters. Most English texts can be written with standard 8-bit \ ASCII characters, but many languages require 16-bit and even 32-bit character \ sets, with the 32-bit character becoming the international standard. Notice \ in the example above that we have reduced the most frequent characters to a \ minimum 5-bit representation, with the least frequent characters represented \ by a maximum 19-bit string. A general rule of thumb is that greater \ compression is achevied with smaller alphabets and larger documents. The \ next step is to make the appropriate replacements on a per-character basis. \ \ \>", "Text", TextAlignment->Left, Background->GrayLevel[0.900008]], Cell[BoxData[ \(HuffmanEncode[data_List, transTable_List] := StringReplace[ ToString /@ data, \(ToString[#[\([1]\)]] \[Rule] ToString[#[\([2]\)]] &\) /@ transTable]\)], "Input"], Cell["\<\ We convert the translation table into a list of rules which we then apply to \ the data. The result is the encoded bit stream seen below.\ \>", "Text", TextAlignment->Left, Background->GrayLevel[0.900008]], Cell[CellGroupData[{ Cell[BoxData[ \(dataEncoded = HuffmanEncode[data1, translationTable]\)], "Input"], Cell[BoxData[ \(TraditionalForm\`{\ "11010101011111111111101001011111101101010110111110111111011111111011011111011\ 010101010111101011111111110110111111111111011111111001111100101001011111111111\ 101101111100111110111111111101011111111011011111011010111101010101010010111111\ 110110111110011101110110111010111100101011111101001011111101111111111010111011\ 011111010110101011110101011011101111110111011111111011110110111011111101111011\ 111111011111001111111101011010101111110100111011111111011010101101111100111111\ 111100111111010111101111101111011111010111111111111101011111111110110111111111\ 111011011111011110111001010110111011110010100111110111101101110111111110111100\ 111110011101110110101011010110111011111101011110111111101010010111101111101111\ 110111111111101111010111111111101101110011111111010101011101111001110100101001\ 111111101011111111110110111111111101111111011111101111001110111010101011111101\ 011010111111110110101111110011101111011111001110100101011110011101001111101111\ 110111101111101001010110101101111101111111100111010111101101010110111110011111\ 111110010101111110101101011011101011110110111011111101111011111111011111001111\ 111101011010101111110100111011111111001010110101010111001110111011110101111111\ 111111110111011111101111111101111010101011101111011011111010010101101011111111\ 011110101111011111110101001011111101101010111111110110111110110101010101111110\ 111111111110111110111111110010100111111011111011111101111110011101001010011111\ 111011111010111111111110101011011111010011111101011011101011011111110101111110\ 010111101111111011110111111111101111111011111101111111111010100111011101011101\ 001111100111111111111111101010111100111110011111111110111111111101010011101110\ 101010111111010110101111111101101011111111111101010111101111110110111110011111\ 001110111011011101011011111110101111110010101111110101101011011101011110011101\ 110111101110111111111111111111111010111111011101101111111101011111100111011101\ 101111110101101110100101111111111001010011101011110110111011110011101011010110\ 111011110111011101111111101101011011111111110110111001110111011011101110110111\ 111011111011011101111111111111111101011110110101010101011010111101110111101101\ 010111101111101011111111011111111111010101101011011111011111111011011101111110\ 111111001110111011011111101111101101111111101111111101110011101111011101111110\ 101111111111011011111111110111111101111110111101101010101111111111110100101111\ 110111111111101011101101111101011010101111010101101110111111011101111111101111\ 011011101111110111101111111101111100111111110101101010111111010011101111111101\ 101111101001010110101111111101111010111101111111010100101101110101111011011101\ 111001010011111111011101111111100111010111111011110111001010110101101111101111\ 111101111001010110111011111111110111111110110101111110010101111110101101110010\ 101111011010110111010111101111111111011110111001110100101111011111011111100111\ 110111111110111100101011111101001011111101101010110111111010111101010101111101\ 101111101001011111111110111111111111101010101110111100111110110101011111101111\ 111111111111101011011100111111110111011111111111010100101101111110111010011101\ 110101011101111001010111111010010111111001010110111011111111110111111110110101\ 111110011101111110111011010101011101111011111111110110101111110110111111110101\ 111111111001011011111111011100101001011110111110111111011010101101111110111110\ 011101111111100101011111101001011111101101111101111011100101011011101111011010\ 110101001110100101111111101111010101111110011111001111111111011010101101111110\ 111110011101111111101101110111111110011111001010111111010010110111111111101101\ 111101111011100101011011100111010010110101011110011101001011010101111011010101\ 010111101011010101111110101111001110101101011011101111110101111111101111011100\ 111111111101111111111111111101011011100111111110111011111111111010100101101111\ 111111011010101101111110101111010101011111011011111010010111111111101111111111\ 111010110101101110101101111111011111101111011010101011111111111101001011111100\ 111010110101010111001110100101001011011101111111101101011011101011110011101111\ 011111001110100101011110010111101010111101010111101011111111111111101011010111\ 111111100101011110110101010101111010110101011111101011110111111011110111001111\ 111101111111101011111100101001011011101110101110110101011110110101011110101111\ 111101101011011101011110011111001111111111001111100101011111101001011011111111\ 110111111011110111001111111101111111101011111101101111101111011100101011011100\ 111010010110101001110101111011010101010111101011010101111110101111011011101111\ 001110101101011110111111111111111101101011110101011110101111111111111110111011\ 111111001010111101101111101111010111101010111101010110101101111110101111111111\ 011111111111010111111011010111111111101111001111110101111111101111011100101011\ 010111101010101010010111111011111111111111111010110111001111111101110111111111\ 110101001011011111111110110101011011111101011110101010111110110111110100101111\ 111111011111111111110101101011011101011011111110111111011110011101110101111111\ 111011011101001111101101111111010111111001111110101101110111101011011101111111\ 111111101101011011111111110010101111110101101110010101111011010111101010111111\ 001010111111010010111111001010011111111011101111111100111010111111011110111001\ 010110101101111101111111101111001010110111011111111110111111110110101111110110\ 101101110101111011011111111010111111111100101101111111101110010100101111011111\ 011111101101010110111111011111001110111111110010101111110100101111110110111110\ 111101110010101101110111101101011010100111010010111111110111101010111101111111\ 010111111001010111111010010111111111101111001010111111010010111111011111111111\ 011111001110111101110111011011111101011011101001110101101011011111011111111011\ 110110111011011111011110101010101011011101111111100101011011101101111111101011\ 111100111011111101111101001110100101011110011111011011111010010101101010101011\ 010111101011110111111111111110101111001010111111010010111111001111101101010101\ 010010110111111111111010010111101110110111110111111001111101101010111111001010\ 111111010010111111001010011111111011101111111100111010111111011110111001010110\ 10110111110111111110111100101011011101111111111011111111011010111101111111011"\ , "011101110101011111101110110101010101011010111101111101111110011101110101111\ 110111111101111110011101110101111111111011010101111110100101101111110111111"}\ \)], "Output"] }, Open ]], Cell["\<\ In order to decode the bit stream, we use the translation table to generate \ the same list of rules, only with each rule reversed.\ \>", "Text", TextAlignment->Left, Background->GrayLevel[0.900008]], Cell[BoxData[ \(HuffmanDecode[data_List, transTable_List] := StringReplace[ ToString /@ data, \(ToString[#[\([2]\)]] \[Rule] \ ToString[#[\([1]\)]] &\) /@ transTable]\)], "Input"], Cell["The end result is the original list of data.", "Text", TextAlignment->Left, Background->GrayLevel[0.900008]], Cell[CellGroupData[{ Cell[BoxData[ \(dataDecoded\ = \ HuffmanDecode[dataEncoded, translationTable]\)], "Input"], Cell[BoxData[ \(TraditionalForm\`{"The following notebook develops the Huffman \ algorithm for reducing data to an optimized binary string. While most \ solutions for this algorithm typically utilize flow control structures, e.g. \ 'For' or 'While' loops, this package presents a simpler approach via \ functional programming. The Huffman algorithm utilizes a translation table \ that is based on the frequency of characters; the table may be generated from \ the data itself or from another dataset set with similar character \ frequencies. The system is most efficient with large text files or other \ large datasets with a significant difference in relative character \ frequencies. Also, remember that if the translation table is generated from \ the data itself, then the compression advantage must outweigh the overhead of \ the translation table.", "David A. Altherr"}\)], "Output"] }, Open ]] }, Open ]], Cell[CellGroupData[{ Cell[TextData[{ StyleBox["Mathematica", FontSlant->"Italic"], " Package" }], "Section"], Cell["\<\ To implement these functions as a Package, save the following cell in a new \ file \"HuffmanEncoding.m\" in the ExtraPackages directory. Remember to use \ the 'File -> Save As Special -> Package Format' option when saving.\ \>", "Text", TextAlignment->Left, Background->GrayLevel[0.900008]], Cell[BoxData[{ \(\(BeginPackage["\"];\)\), "\[IndentingNewLine]", \(\(Begin["\<`Private`\>"];\)\[IndentingNewLine]\), "\n", \(\(getCharFreqTable::usage\ = \ "\";\)\[IndentingNewLine]\), "\ \n", \(\(getCharFreqTable[data_List] := Sort[\({Count[Flatten[\(Characters[ToString[#]] &\) /@ data], ToString[#[\([1]\)]]], ToString[#[\([1]\)]]} &\) /@ Transpose[{Union[ Flatten[\(Characters[ToString[#]] &\) /@ data]]}]];\)\n\), "\n", \(\(getHuffmanTree::usage\ = \ "\";\)\ \[IndentingNewLine]\), "\n", \(getHuffmanTree[ data_List] := \(Nest[ Sort[Delete[ ReplacePart[#, {Plus @@ \((\(Transpose[ Take[#, {1, 2}]]\)[\([1]\)])\), Take[#, {1, 2}]}, {1}], {2}]] &, getCharFreqTable[data], Length[getCharFreqTable[data]] - 1]\)[\([1, 2]\)]\[IndentingNewLine]\), "\n", \(\(HuffmanTranslationTable::usage\ = \ \ "\";\)\[IndentingNewLine]\), "\ \n", \(HuffmanTranslationTable[ data_List] := \((Clear[ArbFunc]; \[IndentingNewLine]TranslationTable = Sort[Cases[ Flatten[MapIndexed[ArbFunc, getHuffmanTree[data], {\(-1\)}]], ArbFunc[_, {__, 2}]]]; \[IndentingNewLine]ArbFunc := {ToString@#1, "\< \>" <> \ \((StringJoin @@ \(ToString /@ \((Delete[#2, \(-1\)] - 1)\)\))\) <> "\< \>"} &; \ \[IndentingNewLine]TranslationTable)\)\[IndentingNewLine]\), "\n", \(\(HuffmanEncode::usage\ = \ \ "\";\)\[IndentingNewLine]\), "\n", \(HuffmanEncode[data_List, transTable_List] := StringReplace[ ToString /@ data, \(ToString[#[\([1]\)]] \[Rule] ToString[#[\([2]\)]] &\) /@ transTable]\[IndentingNewLine]\), "\n", \(\(HuffmanDecode::usage\ = \ \ "\";\)\[IndentingNewLine]\), "\n", \(HuffmanDecode[data_List, transTable_List] := StringReplace[ ToString /@ data, \(ToString[#[\([2]\)]] -> ToString[#[\([1]\)]] &\) /@ transTable\[IndentingNewLine]]\[IndentingNewLine]\), "\n", \(\(End[];\)\), "\n", \(\(EndPackage[];\)\)}], "Input"] }, Closed]] }, Open ]] }, FrontEndVersion->"4.0 for Microsoft Windows", ScreenRectangle->{{0, 1024}, {0, 723}}, WindowSize->{1016, 696}, WindowMargins->{{0, Automatic}, {Automatic, 0}}, CellLabelAutoDelete->True, Magnification->1.25 ] (*********************************************************************** Cached data follows. If you edit this Notebook file directly, not using Mathematica, you must remove the line containing CacheID at the top of the file. The cache data will then be recreated when you save this file from within Mathematica. ***********************************************************************) (*CellTagsOutline CellTagsIndex->{} *) (*CellTagsIndex CellTagsIndex->{} *) (*NotebookFileOutline Notebook[{ Cell[CellGroupData[{ Cell[1739, 51, 664, 20, 226, "Title"], Cell[CellGroupData[{ Cell[2428, 75, 36, 0, 66, "Section"], Cell[2467, 77, 879, 13, 142, "Text"] }, Open ]], Cell[CellGroupData[{ Cell[3383, 95, 41, 0, 66, "Section"], Cell[3427, 97, 185, 4, 58, "Text"], Cell[3615, 103, 904, 12, 266, "Input"], Cell[4522, 117, 251, 5, 79, "Text"], Cell[4776, 124, 333, 6, 77, "Input"], Cell[5112, 132, 219, 5, 58, "Text"], Cell[CellGroupData[{ Cell[5356, 141, 56, 1, 35, "Input"], Cell[5415, 144, 1305, 39, 767, "Output"] }, Open ]], Cell[6735, 186, 500, 9, 100, "Text"], Cell[7238, 197, 356, 7, 98, "Input"], Cell[7597, 206, 305, 6, 79, "Text"], Cell[CellGroupData[{ Cell[7927, 216, 146, 2, 35, "Input"], Cell[8076, 220, 270, 7, 53, "Output"] }, Open ]], Cell[8361, 230, 227, 5, 58, "Text"], Cell[CellGroupData[{ Cell[8613, 239, 266, 5, 56, "Input"], Cell[8882, 246, 585, 16, 221, "Output"] }, Open ]], Cell[9482, 265, 179, 5, 58, "Text"], Cell[CellGroupData[{ Cell[9686, 274, 276, 5, 56, "Input"], Cell[9965, 281, 585, 16, 221, "Output"] }, Open ]], Cell[10565, 300, 418, 8, 79, "Text"], Cell[CellGroupData[{ Cell[11008, 312, 54, 1, 35, "Input"], Cell[11065, 315, 8209, 141, 767, "Output"] }, Open ]], Cell[19289, 459, 180, 5, 58, "Text"], Cell[19472, 466, 464, 8, 119, "Input"], Cell[19939, 476, 633, 12, 100, "Text"], Cell[CellGroupData[{ Cell[20597, 492, 82, 1, 35, "Input"], Cell[20682, 495, 1960, 39, 767, "Output"] }, Open ]], Cell[22657, 537, 1318, 20, 205, "Text"], Cell[23978, 559, 218, 5, 56, "Input"], Cell[24199, 566, 220, 5, 58, "Text"], Cell[CellGroupData[{ Cell[24444, 575, 85, 1, 35, "Input"], Cell[24532, 578, 6453, 82, 1127, "Output"] }, Open ]], Cell[31000, 663, 213, 5, 58, "Text"], Cell[31216, 670, 220, 5, 56, "Input"], Cell[31439, 677, 118, 2, 58, "Text"], Cell[CellGroupData[{ Cell[31582, 683, 102, 2, 35, "Input"], Cell[31687, 687, 897, 12, 140, "Output"] }, Open ]] }, Open ]], Cell[CellGroupData[{ Cell[32633, 705, 95, 4, 66, "Section"], Cell[32731, 711, 305, 6, 85, "Text"], Cell[33039, 719, 2693, 52, 887, "Input"] }, Closed]] }, Open ]] } ] *) (*********************************************************************** End of Mathematica Notebook file. ***********************************************************************)