A C implementation of Huffman codes for lossless compression of text. The compression works on an input of characters and their frequencies. The `create_encoding` algorithm first creates a prefix-free bit encoding of the characters in order to reduce their size. This algorithm uses a priority queue, implemented in `priority_queue.c` as a min-heap backed by an array.
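As a rough illustration of how such a construction typically works, the sketch below builds a Huffman code tree with an array-backed min-heap. The `Node` struct, `MinHeap` type, and `build_huffman_tree` function are illustrative assumptions and do not necessarily match the types or function names used in `priority_queue.c` or `encoding.h`.

```c
#include <stdlib.h>

/* Illustrative types; the real structs in this repository may differ. */
typedef struct Node {
    unsigned char symbol;          /* meaningful only for leaf nodes   */
    unsigned long freq;            /* frequency, or sum of children's  */
    struct Node *left, *right;
} Node;

/* Minimal array-backed min-heap keyed on Node.freq. */
typedef struct {
    Node *items[512];
    int size;
} MinHeap;

static void heap_push(MinHeap *h, Node *n)
{
    int i = h->size++;
    h->items[i] = n;
    while (i > 0 && h->items[(i - 1) / 2]->freq > h->items[i]->freq) {
        Node *tmp = h->items[i];
        h->items[i] = h->items[(i - 1) / 2];
        h->items[(i - 1) / 2] = tmp;
        i = (i - 1) / 2;
    }
}

static Node *heap_pop(MinHeap *h)
{
    Node *min = h->items[0];
    h->items[0] = h->items[--h->size];
    for (int i = 0;;) {
        int l = 2 * i + 1, r = 2 * i + 2, s = i;
        if (l < h->size && h->items[l]->freq < h->items[s]->freq) s = l;
        if (r < h->size && h->items[r]->freq < h->items[s]->freq) s = r;
        if (s == i) break;
        Node *tmp = h->items[i];
        h->items[i] = h->items[s];
        h->items[s] = tmp;
        i = s;
    }
    return min;
}

/* Classic Huffman construction: repeatedly merge the two least frequent
 * nodes until a single code tree remains. */
Node *build_huffman_tree(const unsigned long freq[256])
{
    MinHeap h = { .size = 0 };
    for (int c = 0; c < 256; c++) {
        if (freq[c] == 0)
            continue;
        Node *leaf = calloc(1, sizeof *leaf);
        leaf->symbol = (unsigned char)c;
        leaf->freq = freq[c];
        heap_push(&h, leaf);
    }
    while (h.size > 1) {
        Node *a = heap_pop(&h), *b = heap_pop(&h);
        Node *parent = calloc(1, sizeof *parent);
        parent->freq = a->freq + b->freq;
        parent->left = a;
        parent->right = b;
        heap_push(&h, parent);
    }
    return h.size == 1 ? heap_pop(&h) : NULL;  /* root, or NULL for empty input */
}
```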
The `encode_file` algorithm in `encoder.c` takes text in and outputs an encoded (compressed) bitstream. The `decode_file` algorithm in `encoder.c` takes in the encoded bitstream and outputs text.
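For intuition, here is a hedged sketch of what the decoding walk could look like: consume the bitstream one bit at a time, descend the code tree, and emit a symbol at every leaf. The `Node` shape, the MSB-first bit order, and the 0-goes-left / 1-goes-right convention are assumptions for illustration, not details taken from `encoder.c`.

```c
#include <stdio.h>

/* Same illustrative node shape as in the tree-building sketch above. */
typedef struct Node {
    unsigned char symbol;
    unsigned long freq;
    struct Node *left, *right;
} Node;

/* Decode nbits meaningful bits (total bits minus trailing padding) by
 * walking the code tree and emitting a symbol at every leaf. */
void decode_bits(const unsigned char *buf, unsigned long nbits,
                 const Node *root, FILE *out)
{
    const Node *cur = root;
    for (unsigned long i = 0; i < nbits; i++) {
        int bit = (buf[i / 8] >> (7 - i % 8)) & 1;  /* assumed MSB-first order    */
        cur = bit ? cur->right : cur->left;         /* assumed 0 = left, 1 = right */
        if (cur->left == NULL && cur->right == NULL) {
            fputc(cur->symbol, out);                /* leaf reached: emit character */
            cur = root;                             /* restart at the root          */
        }
    }
}
```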
Since the smallest unit of any data type in C (and in most file systems) is 1 byte, several helper functions exist for bit manipulation (e.g. `set_ith_bit` in `encoder.c`). The `encode_file` and `decode_file` algorithms keep buffers of `unsigned char` in which bits are stored before being written to the file or decoded into text.
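A minimal sketch of such helpers is shown below. The name `set_ith_bit` comes from the text above, but the exact signature, the companion `get_ith_bit`, and the MSB-first bit numbering are assumptions rather than the repository's actual code.

```c
/* Sketch of bit helpers in the spirit of set_ith_bit; the actual
 * signatures in encoder.c may differ.  Bit i == 0 is taken to be the
 * most significant bit of buf[0], i.e. the first bit written. */
void set_ith_bit(unsigned char *buf, unsigned long i, int value)
{
    unsigned long byte = i / 8;                               /* which buffer byte  */
    unsigned char mask = (unsigned char)(1u << (7 - i % 8));  /* bit within it      */
    if (value)
        buf[byte] |= mask;                                    /* set the bit to 1   */
    else
        buf[byte] &= (unsigned char)~mask;                    /* clear the bit to 0 */
}

int get_ith_bit(const unsigned char *buf, unsigned long i)
{
    return (buf[i / 8] >> (7 - i % 8)) & 1;
}
```

With helpers like these, the encoder can append code bits one at a time into an `unsigned char` buffer and write whole bytes out to the compressed file, and the decoder can read them back bit by bit.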
- body containing content bytes
- Last compressed character may be less than a full byte, which is smaller than the minimum size the OS handles (one byte), so it is padded with `n` trailing zeros until it is a full byte (8 bits) in length
- `FOOTER_SIZE` (1) byte file footer containing `n`, the number of trailing zeros used as padding in the last encoded character
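To make the padding and footer concrete, here is a hedged sketch of how the final partial byte and the one-byte footer could be written; the function name and layout are illustrative, not taken from `encoder.c`.

```c
#include <stdio.h>

/* Illustrative flush of the final partial byte plus footer.
 * bits_in_last_byte is how many code bits actually landed in the
 * final buffer byte (1..8). */
void flush_and_write_footer(FILE *out, unsigned char last_byte,
                            int bits_in_last_byte)
{
    /* n trailing zero bits pad the last byte up to 8 bits. */
    unsigned char n = (unsigned char)((8 - bits_in_last_byte) % 8);

    /* If the buffer byte started out zeroed, its unused low-order bits
     * are already the zero padding, so it can be written as-is. */
    fputc(last_byte, out);

    /* FOOTER_SIZE (1) byte footer: the pad count n, so the decoder
     * knows how many trailing bits of the last body byte to ignore. */
    fputc(n, out);
}
```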
The `Encoding` data structure defined in `encoding.h` was changed with commit d3646b4. While the old structure was more space and runtime efficient (with bitwise operations and single integer comparisons to handle comparing encodings), the old format did not allow for leading zeros in an encoding. (Encoding leading zeros in a binary format would result in `011` evaluating to the integer 3, which is the same as the binary encoding `11`, even though these are both valid, distinct prefix-free encodings in this context.) Allowing for leading zeros in the encoding increases the number of available prefix-free encodings for an alphabet, which decreases the average binary encoding size (number of bits). The increase in compression rate was why I made the choice to refactor the code and change the format. The previous format and bitwise manipulation can be viewed in this prior state.
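The following is an illustrative contrast of the two ideas, not the actual struct definitions from `encoding.h`: a bare-integer representation collapses codes that differ only in leading zeros, while a representation that records each bit (and the bit length) explicitly keeps them distinct.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative contrast only; the actual Encoding structs in
 * encoding.h may use different fields and storage. */

/* Old-style idea: the code is a bare integer, so the codes 011 and 11
 * both evaluate to 3 and cannot be told apart. */
typedef struct {
    uint32_t code;                  /* 0b011 == 0b11 == 3 */
} OldEncoding;

/* New-style idea (an assumption about the refactor): store each bit
 * explicitly so leading zeros survive and 011 differs from 11, at the
 * cost of more space and a loop instead of one integer comparison. */
typedef struct {
    unsigned char bits[64];         /* each entry is 0 or 1                     */
    int length;                     /* number of bits, including leading zeros  */
} NewEncoding;

static bool encoding_equal(const NewEncoding *a, const NewEncoding *b)
{
    if (a->length != b->length)
        return false;
    for (int i = 0; i < a->length; i++)
        if (a->bits[i] != b->bits[i])
            return false;
    return true;
}
```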
- original text
- compressed text (not human readable)
- encoding file (not human readable)
- decompressed file
- small script to run the compression and decompression
The uncompressed file (`text.txt`) takes up 16 bytes while the compressed file (`test.cmp`) takes up 9 bytes.
Hex values of the corresponding files