Index O'Stuff

Home

Compression

Arithmetic Coding
Burrows-Wheeler Transform
Huffman Coding
LZSS Coding
LZW Coding
Run Length Encoding

Misc. Programming

School Projects
Thesis Project
crypt(3) Source
Hamming Codes
Bit Manipulation Libraries
Square Root Approximation Library
Sort Library
Trailing Space Trimmer and Tab Remover
Command Line Option Parser

Humor

Dictionary O'Modern Terms
The Ten Commandments of C Style

Other Stuff

TOPS Success Story
Free Win32 Software

External Links

SPAN (Spay/Neuter Animal Network)
Los Angeles Pet Memorial Park

Mirrors

Pages at geocities
Pages at dipperstein.com

Obligatory Links

Visits since I started counting:
Counter

Run Length Encoding (RLE) Discussion and Implementation

by Michael Dipperstein

The day after I posted my source and description for arithmetic coding, I became unable to resist the temptation to bang out something on RLE. I had a few hours of "free time" and what better way to fill them than to publish an RLE implementation? And so my version 0.1 library was born.

Almost two years after that, I found myself discussing various RLE techniques, and the discussion touched on the PackBits technique. Thus the temption to bang out more on RLE arose. The result of that effort brings you rle-0.2, which provides support for traditional and PackBits styles of encoding and decoding.

As with my other compression implementations, my intent is to publish an easy to follow ANSI C implementation of a run length encoding (RLE) algorithm. Anyone familiar with ANSI C and RLE should be able to follow and learn from my implementation. I'm sure that there's room for improvement of compression ratios, speed, and memory usage, but this project is about learning and sharing, not perfection.

Click here for a link to my RLE source. The rest of this page discusses RLE and my implementation.

Algorithm Overview

Run length encoding stands out from other methods of compression. It does not try to reduce the average symbol size like Huffman coding or arithmetic coding, and it doesn't replace strings with dictionary references like Lemple-Ziv and Lemple-Ziv-Welch style coding. RLE replaces a string of repeated symbols with a single symbol and a count (run length) indicating the number of times the symbol is repeated.

Example:
The string:
"aaaabbcdeeeeefghhhij"
may be replaced with
"a4b2c1d1e5f1g1h3i1j1".

The numbers are in bold to indicate that they are values, not symbols.

If you count the before and after string size, you'll notice that RLE didn't make this string any smaller. Part of the reason is because it now takes a symbol and a value to represent what used to take just single a single symbol.

One of the ways of preventing single symbols from expanding is to only apply coding to runs of more than one and use the run length to indicate the number of additional copies of the encoded symbol are required. Ideally that would give us a string like this:
"a3b1cde4fgh2ij".

Now the string is shorter, but we have another problem. How do we know whether a symbol is being followed by another symbol or a run length? We could borrow a page out of LZSS and include an encoded/not encoded flag with each symbol. An other option is to follow each encoded symbol with a designated escape code that will indicate a run length follows.

The escape code solution is pretty easy to handle, except that you need to be able to make sure that the escape code doesn't occur in your uncoded input, or you need to be able to indicate if you're just handling a symbol that is the same as an escape code.

Someone much more creative than I am at this moment came up with a neat idea that solves this problem. The escape code does not need to be a fixed value, instead an escape code will be inferred whenever a symbol repeats itself. The value that follows the repeated symbol will indicate the number of additional symbols that follow. Using this scheme, the example string is encoded like this:
"aa2bb0cdee3fghh1ij".

For lack of a better name, I'll refer to this technique as traditional RLE. The downside to traditional RLE is that one more symbol is required for each run, and runs of length 2 take more space encoded than not encoded (two symbols and a count of 0). However, based on a few brief tests on files containing C source, the double symbol escape sequence produced better results than no escape sequence at all.

One technique for reducing the cost of two symbol runs is to encode two types of blocks: blocks that are copied verbatim, and runs. Runs of two symbols may be placed in the blocks copied verbatim. However, this technique incurs a penalty in order to be able to indicate that a block should be copied verbatim.

This method was popularized by the PackBits algorithm. The PackBits algorithm will precede a block of data with a one byte header n, where n is interpreted as follows:

n	Meaning
0 to 127	Copy the next n + 1 symbols verbatim
-127 to -1	Repeat the next symbol 1 - n times
-128	Do nothing

Hopefully you've been paying attention and noticed the following areas of improvements in the PackBits algorithm:

The -128 header could be used for a run of 129 symbols, but it's not.
It takes two bytes to indicate a run of two, so nothing is saved.

The library I provide implements a variant of the PackBits algorithm where the one byte header is interpreted as follows:

n	Meaning
0 to 127	Copy the next n + 1 symbols verbatim
-128 to -1	Repeat the next symbol 2 - n times

This variant algorithm allows runs to be two symbols longer, but cannot encode two byte runs. Using this scheme, the example string is encoded like this:
"-2a3bbcd-3e1fg-1h1ij".

Encoding Strings

Traditional RLE

Encoding using traditional RLE is fairly simple:

Step 1.	Set the previous symbol equal to an unmatchable value.
Step 2.	Read the next symbol from the input stream.
Step 3.	If the symbol is an EOF exit.
Step 4.	Write out the current symbol.
Step 5.	If the symbol is an does not match the previous symbol, set the previous symbol to the current symbol, and go to step 2.
Step 6.	Read and count additional symbols until a non-matching symbol is found. This is the run length.
Step 7.	Write out the run length.
Step 8.	Write out the non-matching symbol.
Step 9.	Set the previous symbol to the non-matching symbol, and go to step 2.

When actually implementing traditional RLE, a little attention to detail is required in Step 6. The run length is stored in a finite number of bits (I used an unsigned char). Runs longer than the amount that can be counted need to be broken up into to smaller runs. When the maximum count is reached, just write the count value and start the process of looking for a run all over again. You also need to handle the case where a run is ended by an EOF. When a run is ended by an EOF, write out the run length and exit.

That's all there is to it.

PackBits Variant

Encoding using the PackBits variant is slightly more complicate than traditional RLE. The block header cannot be written until the type of block and it's length have been determined. Until then data must be held in a buffer of the maximum length that can be copied verbatim. The following steps describe PackBits style encoding:

Step 1.	Read symbols from the input stream into the buffer until one of the following occurs: The buffer is full (go to step 2) An EOF is reached (go to step 3) The last three symbols are identical (go to step 4)
Step 2.	If the buffer is full: write the buffer size - 1 write contents of the buffer go to step 1
Step 3.	If the symbol is an EOF: number of symbols in the buffer - 1 write contents of the buffer exit
Step 4.	If the last three symbols match, a run has been found. Determine the number of symbols in the buffer prior to the start of the run (n).
Step 5.	Write n - 1 followed by the contents of the buffer up to the start of the run.
Step 5.	Set the run length to 3.
Step 6.	Read additional symbols until a non-matching symbol is found. Increment the run length for each matching symbol.
Step 7.	Write out 2 - the run length the run length followed by the run symbol.
Step 8.	Write the non-matching symbol to the buffer and go to step 1.

That's pretty much all there is. You need to stop counting your run length in Step 6 if it reaches the maximum length you can account for in your header. My actual implementation is also a little less greedy. When I reach the maximum number of symbols that can be copied verbatim, I read an extra symbol or two in case the symbols at the end of a buffer are actually the start of a run.

Decoding Strings

Traditional RLE

Decoding traditionally encoded strings is even easier than encoding. Not only are there less steps, but there are no caveats. To decode a traditionally encoded stream:

Step 1.	Set the previous symbol equal to an unmatchable value.
Step 2.	Read the next symbol from the input stream.
Step 3.	If the symbol is an EOF exit.
Step 4.	Write out the current symbol.
Step 5.	If the symbol is an does not match the previous symbol, set the previous symbol to the current symbol, and go to step 2.
Step 6.	Read the run length.
Step 7.	Write out a run of the current symbol as long as indicated by the run length.
Step 8.	Go to step 1.

PackBits Variant

If that wasn't easy enough, it is even easier to decode strings encoded by the variant PackBits algorithm. To decode a variant PackBits encoded stream:

Step 1.	Read the block header (n).
Step 2.	If the header is an EOF exit.
Step 3.	If n is non-negative, copy the next n + 1 symbols to the output stream and go to step 1.
Step 4.	If n is negative, write 2 - n copies of the next symbol to the output stream and go to step 1.

Further Information

Further discussion of RLE with links to other documentation and libraries may be found at datacompression.info.

As usual, I found Arturo Campos' discussion to be a help.

Take a look at Apple's Technical Note TN for a description of the PackBits algorithm.

Actual Software

I am releasing my implementations of of run length encoding (RLE) under the LGPL . At this time I have three revisions of the code to offer. As I add enhancements or fix bugs, I will post them here as newer versions. The larger the version number, the newer the version. I'm retaining the older versions for historical reasons. I recommend that most people download the newest version unless there is a compelling reason to do otherwise.

Each version is contained in its own zipped archive which includes the source files and brief instructions for building an executable. None of the archives contain executable programs. A copy of the archives may be obtained by clicking on the links below.

Version	Comment
Version 0.3	Fixes bug for max length runs in the variant of the packbits algorithm.
	Replaces getopt with my optlist library.
	Explicitly license the library under LGPL version 3.
Version 0.2	Includes a variant of the packbits algorithm.
Version 0.1	Initial release.
	Does not include any packbits style encoding.

Portability

All the source code that I have provided is written in strict ANSI C. I would expect it to build correctly on any machine with an ANSI C compiler. I have tested the code compiled with gcc on Linux and mingw on Windows 98.

If you have any further questions or comments, you may contact me by e-mail. My e-mail address is: mdipper@alumni.engr.ucsb.edu

For more information on RLE and other compression algorithms, visit DataCompression.info.

Home
Last updated on August 30, 2007