Data Compression and Decompression using Greedy Huffman Algorithm
In computer science and information theory, a Huffman code is a particular type of optimal prefix code that is commonly used for lossless data compression. The process of finding or using such a code proceeds by means of Huffman coding, an algorithm developed by David A. Huffman while he was a Sc.D. student at MIT, and published in the 1952 paper "A Method for the Construction of Minimum-Redundancy Codes".
The output from Huffman's algorithm can be viewed as a variable-length code table for encoding a source symbol (such as a character in a file). The algorithm derives this table from the estimated probability or frequency of occurrence (weight) of each possible value of the source symbol. As in other entropy encoding methods, more common symbols are generally represented using fewer bits than less common symbols. Huffman's method can be implemented efficiently, finding a code in time linear in the number of input weights if these weights are sorted. However, although optimal among methods that encode symbols separately, Huffman coding is not always optimal among all compression methods; it is replaced by arithmetic coding or asymmetric numeral systems when a better compression ratio is required.
This repo has three projects: huffman-zipper generates a static .lib file, huffmanCLI provides the command-line interface, and huffmanGUI provides the graphical user interface.
- Clone this repo to your PC.
- Go to its project directory.
- Open huffman-zipper.sln in Visual Studio 2019.
- Set huffmanCLI or huffmanGUI as the startup project.
- To build the huffmanGUI project, configure wxWidgets for Visual Studio 2019.
- Build and run the project.
The process to build a Huffman binary tree for compression is simple and straightforward.
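The tree-building step can be sketched as follows. This is a minimal illustration using a min-heap, not the project's actual classes; the `Node`, `buildTree`, and `buildCodes` names are hypothetical.

```cpp
#include <queue>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical node type; the project's real class names may differ.
struct Node {
    char symbol;
    long long freq;
    Node *left = nullptr, *right = nullptr;
};

struct CompareFreq {
    bool operator()(const Node* a, const Node* b) const {
        return a->freq > b->freq;  // min-heap ordered by frequency
    }
};

// Build a Huffman tree: repeatedly merge the two least-frequent nodes
// until a single root remains.
Node* buildTree(const std::unordered_map<char, long long>& freq) {
    std::priority_queue<Node*, std::vector<Node*>, CompareFreq> pq;
    for (const auto& [ch, f] : freq)
        pq.push(new Node{ch, f});
    while (pq.size() > 1) {
        Node* l = pq.top(); pq.pop();
        Node* r = pq.top(); pq.pop();
        pq.push(new Node{'\0', l->freq + r->freq, l, r});
    }
    return pq.empty() ? nullptr : pq.top();
}

// Derive the code table by walking the tree: left edge = '0', right = '1'.
void buildCodes(const Node* n, const std::string& prefix,
                std::unordered_map<char, std::string>& codes) {
    if (!n) return;
    if (!n->left && !n->right) { codes[n->symbol] = prefix; return; }
    buildCodes(n->left, prefix + '0', codes);
    buildCodes(n->right, prefix + '1', codes);
}
```

By construction, more frequent symbols end up closer to the root and therefore receive shorter codes.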
In the current implementation of file compression using the Huffman compression algorithm, writing only the encoded bit string is insufficient to decode the compressed file during decompression. The same Huffman tree
that was used during compression is required to decode the encoded characters back to their original form. So, it is necessary to write a file header in the compressed file, which contains the Huffman tree
and other metadata required by the decompressor.
1. How to write a file header?
The compressed file header has the following structure:

Tree | File Count (2 bytes) | File Size | File Name |
---|---|---|---|
Pre-order traversal of the tree | Number of files to be compressed | Number of characters in each file | File name of each file |
2. How to save the tree in the compressed file?
One approach to writing the tree into the file is to use a pre-order traversal. Later, during decompression, the same traversal algorithm can reconstruct the tree from the file header.
While writing the tree to the compressed file, when we visit a leaf node, we write bit 1 followed by the symbol in 8 bits. When we visit an internal node (or the root node), we simply write bit 0. In this way, during decompression, when the program reads a bit 1, it reads the next 8 bits and wraps the character in a leaf node. When the program reads a bit 0, it creates a new internal node and connects the next two reconstructed nodes as its left and right children, following the same pre-order traversal.
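A sketch of this serialization, assuming the hypothetical `Node` type above. For clarity the bits are kept in a `std::string` of `'0'`/`'1'` characters; the real implementation packs them into bytes.

```cpp
#include <string>

struct Node { char symbol; Node *left, *right; };  // hypothetical node type

// Pre-order serialization: '1' + 8 symbol bits at a leaf, '0' at an
// internal node.
void serializeTree(const Node* n, std::string& bits) {
    if (!n->left && !n->right) {
        bits += '1';
        for (int i = 7; i >= 0; --i)
            bits += ((n->symbol >> i) & 1) ? '1' : '0';
    } else {
        bits += '0';
        serializeTree(n->left, bits);
        serializeTree(n->right, bits);
    }
}

// Rebuild the tree with the same pre-order walk: a '1' bit means the next
// 8 bits are a leaf's symbol; a '0' bit means an internal node whose two
// children follow recursively.
Node* deserializeTree(const std::string& bits, std::size_t& pos) {
    if (bits[pos++] == '1') {
        char c = 0;
        for (int i = 0; i < 8; ++i)
            c = static_cast<char>((c << 1) | (bits[pos++] - '0'));
        return new Node{c, nullptr, nullptr};
    }
    Node* n = new Node{'\0', nullptr, nullptr};
    n->left = deserializeTree(bits, pos);
    n->right = deserializeTree(bits, pos);
    return n;
}
```

Because every node is either a leaf or has exactly two children, the deserializer always knows from a single bit which case it is reading.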
3. How to mark EOF of the output file?
There are many ways to mark the EOF. For example, you can choose a character that is not present in the input file as a pseudo-EOF character. However, binary files may contain every possible byte, including whichever one was chosen as the pseudo-EOF. An alternative approach is to store the file size in the file header: the decompression program first reads the file size and then continues decoding until that many characters have been produced.
Decompression is much easier than compression. During decompression, the program first reads the header from the compressed file and reconstructs the tree from it. Using the tree, it then decodes the rest of the file back to the original characters, completing the decompression.
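The decoding loop can be sketched as follows, again with bits as a `'0'`/`'1'` string and the hypothetical `Node` type for illustration. Stopping after `fileSize` characters is what makes the padding bits in the last byte harmless.

```cpp
#include <string>

struct Node { char symbol; Node *left, *right; };  // hypothetical node type

// Decode `fileSize` characters by walking the tree: '0' goes left,
// '1' goes right; reaching a leaf emits its symbol and restarts at
// the root. Trailing padding bits beyond fileSize are ignored.
std::string decode(const Node* root, const std::string& bits,
                   std::size_t fileSize) {
    std::string out;
    const Node* n = root;
    std::size_t pos = 0;
    while (out.size() < fileSize && pos < bits.size()) {
        n = (bits[pos++] == '0') ? n->left : n->right;
        if (!n->left && !n->right) {  // leaf reached
            out += n->symbol;
            n = root;                 // restart from the root
        }
    }
    return out;
}
```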
The compression ratio and performance of Huffman coding depend on the size of the input text and the frequency of distinct characters in the file. The current implementation may produce a file larger than the original for small inputs because of the overhead of storing the Huffman tree
in the compressed file, which is required at decompression time.
But as the size of the input file increases, the compression ratio approaches roughly 50%.
The huffman-zipper documentation is available here.
- Ashish Lamsal
- Aaditya Subedi
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details