Index O'Stuff
Home Compression Arithmetic CodingBurrows-Wheeler Transform Huffman Coding LZSS Coding LZW Coding Run Length Encoding Misc. Programming School ProjectsThesis Project crypt(3) Source Hamming Codes Bit Manipulation Libraries Square Root Approximation Library Sort Library Trailing Space Trimmer and Tab Remover Command Line Option Parser Humor Dictionary O'Modern TermsThe Ten Commandments of C Style Other Stuff TOPS Success StoryFree Win32 Software External Links SPAN (Spay/Neuter Animal Network)Los Angeles Pet Memorial Park Mirrors
Pages at geocitiesPages at dipperstein.com Obligatory Links
Visits since I started counting: |
Run Length Encoding (RLE) Discussion and Implementationby Michael DippersteinThe day after I posted my source and description for arithmetic coding, I became unable to resist the temptation to bang out something on RLE. I had a few hours of "free time" and what better way to fill them than to publish an RLE implementation? And so my version 0.1 library was born. Almost two years after that, I found myself discussing various RLE techniques, and the discussion touched on the PackBits technique. Thus the temption to bang out more on RLE arose. The result of that effort brings you rle-0.2, which provides support for traditional and PackBits styles of encoding and decoding. As with my other compression implementations, my intent is to publish an easy to follow ANSI C implementation of a run length encoding (RLE) algorithm. Anyone familiar with ANSI C and RLE should be able to follow and learn from my implementation. I'm sure that there's room for improvement of compression ratios, speed, and memory usage, but this project is about learning and sharing, not perfection. Click here for a link to my RLE source. The rest of this page discusses RLE and my implementation. Algorithm OverviewRun length encoding stands out from other methods of compression. It does not try to reduce the average symbol size like Huffman coding or arithmetic coding, and it doesn't replace strings with dictionary references like Lemple-Ziv and Lemple-Ziv-Welch style coding. RLE replaces a string of repeated symbols with a single symbol and a count (run length) indicating the number of times the symbol is repeated. Example: The numbers are in bold to indicate that they are values, not symbols. If you count the before and after string size, you'll notice that RLE didn't make this string any smaller. Part of the reason is because it now takes a symbol and a value to represent what used to take just single a single symbol. One of the ways of preventing single symbols from expanding is to only
apply coding to runs of more than one and use the run length to indicate the
number of additional copies of the encoded symbol are required. Ideally that
would give us a string like this: Now the string is shorter, but we have another problem. How do we know whether a symbol is being followed by another symbol or a run length? We could borrow a page out of LZSS and include an encoded/not encoded flag with each symbol. An other option is to follow each encoded symbol with a designated escape code that will indicate a run length follows. The escape code solution is pretty easy to handle, except that you need to be able to make sure that the escape code doesn't occur in your uncoded input, or you need to be able to indicate if you're just handling a symbol that is the same as an escape code. Someone much more creative than I am at this moment came up with a neat
idea that solves this problem. The escape code does not need to be a fixed
value, instead an escape code will be inferred whenever a symbol repeats
itself. The value that follows the repeated symbol will indicate the number
of additional symbols that follow. Using this scheme, the example string
is encoded like this: For lack of a better name, I'll refer to this technique as traditional RLE. The downside to traditional RLE is that one more symbol is required for each run, and runs of length 2 take more space encoded than not encoded (two symbols and a count of 0). However, based on a few brief tests on files containing C source, the double symbol escape sequence produced better results than no escape sequence at all. One technique for reducing the cost of two symbol runs is to encode two types of blocks: blocks that are copied verbatim, and runs. Runs of two symbols may be placed in the blocks copied verbatim. However, this technique incurs a penalty in order to be able to indicate that a block should be copied verbatim. This method was popularized by the PackBits algorithm. The PackBits algorithm will precede a block of data with a one byte header n, where n is interpreted as follows:
Hopefully you've been paying attention and noticed the following areas of improvements in the PackBits algorithm:
The library I provide implements a variant of the PackBits algorithm where the one byte header is interpreted as follows:
This variant algorithm allows runs to be two symbols longer, but cannot
encode two byte runs. Using this scheme, the example string is encoded like
this: Encoding StringsTraditional RLEEncoding using traditional RLE is fairly simple:
When actually implementing traditional RLE, a little attention to detail is required in Step 6. The run length is stored in a finite number of bits (I used an unsigned char). Runs longer than the amount that can be counted need to be broken up into to smaller runs. When the maximum count is reached, just write the count value and start the process of looking for a run all over again. You also need to handle the case where a run is ended by an EOF. When a run is ended by an EOF, write out the run length and exit. That's all there is to it. PackBits VariantEncoding using the PackBits variant is slightly more complicate than traditional RLE. The block header cannot be written until the type of block and it's length have been determined. Until then data must be held in a buffer of the maximum length that can be copied verbatim. The following steps describe PackBits style encoding:
That's pretty much all there is. You need to stop counting your run length in Step 6 if it reaches the maximum length you can account for in your header. My actual implementation is also a little less greedy. When I reach the maximum number of symbols that can be copied verbatim, I read an extra symbol or two in case the symbols at the end of a buffer are actually the start of a run. Decoding StringsTraditional RLEDecoding traditionally encoded strings is even easier than encoding. Not only are there less steps, but there are no caveats. To decode a traditionally encoded stream:
PackBits VariantIf that wasn't easy enough, it is even easier to decode strings encoded by the variant PackBits algorithm. To decode a variant PackBits encoded stream:
Further InformationFurther discussion of RLE with links to other documentation and libraries may be found at datacompression.info. As usual, I found Arturo Campos' discussion to be a help. Take a look at Apple's Technical Note TN for a description of the PackBits algorithm. Actual SoftwareI am releasing my implementations of of run length encoding (RLE) under the LGPL . At this time I have three revisions of the code to offer. As I add enhancements or fix bugs, I will post them here as newer versions. The larger the version number, the newer the version. I'm retaining the older versions for historical reasons. I recommend that most people download the newest version unless there is a compelling reason to do otherwise. Each version is contained in its own zipped archive which includes the source files and brief instructions for building an executable. None of the archives contain executable programs. A copy of the archives may be obtained by clicking on the links below.
PortabilityAll the source code that I have provided is written in strict ANSI C. I would expect it to build correctly on any machine with an ANSI C compiler. I have tested the code compiled with gcc on Linux and mingw on Windows 98. If you have any further questions or comments, you may contact me by e-mail. My e-mail address is: mdipper@alumni.engr.ucsb.edu For more information on RLE and other compression algorithms, visit DataCompression.info. |
Home
Last updated on August 30, 2007