
LZSS (LZ77) Discussion and Implementation

by Michael Dipperstein


The saga continues. As with everything else that I've written, my LZSS implementation left (and still leaves) room for improvement. This web page updates the work that I have done since my original LZSS release. I'm calling my latest release version 0.3 and my original release version 0.1. My description of the LZSS algorithm will highlight some of the differences between the two versions.

I've also developed an in-memory implementation of the LZSS algorithm that performs all encode and decode operations on arrays rather than files. This implementation might be useful to those developing on systems that do not include a file system.

Information on downloading the source code for all of my LZSS implementations may be found here.

As with my Huffman code implementation, my intent is to publish an easy to follow ANSI C implementation of the LZSS compression algorithm. Anyone familiar with ANSI C and the LZSS algorithm should be able to follow and learn from my implementation.

The rest of this page discusses the results of my efforts so far.


Algorithm Overview

LZSS is a dictionary encoding technique. Unlike Huffman coding, which attempts to reduce the average number of bits required to represent a symbol, LZSS attempts to replace a string of symbols with a reference to a dictionary location of the same string. It is intended that the dictionary reference should be shorter than the string it replaces. In the case of LZ77, the predecessor to LZSS, that wasn't always the case.

Encoding

The LZSS algorithm and its predecessor LZ77 attempt to compress series of strings by converting the strings into a dictionary offset and string length. So if the string "abcd" appeared in the dictionary at position 1234, it may be encoded as {offset = 1234, length = 4}.

Okay, so where does the dictionary come from, and why can't I find my whole file in it?

The Dictionary

The LZSS and LZ77 dictionary is not an external dictionary that lists all known symbol strings. Instead, the dictionary is a sliding window containing the last N symbols encoded/decoded. The larger N, the longer it takes to search the whole dictionary for a match and the more bits will be required to store the offset into the dictionary. Typically dictionaries contain a number of symbols that can be represented by a whole power of 2. A 432 symbol dictionary would require 9 bits to represent all possible offsets. If you need to use 9 bits, you might as well have a 512 symbol dictionary so that you can have more entries.

Since dictionaries are sliding windows, once the (N + 1)th symbol is processed and added to the dictionary, the first symbol is removed. Additional new symbols cause an equal number of the oldest symbols to slide out.

In the example above, after encoding "abcd" as {offset = 1234, length = 4}, the sliding window would shift over 4 characters and the first 4 symbols (offsets 0 .. 3) would slide off the front of the sliding window. "a", "b", "c", and "d" would then be entered into the dictionary at positions (N - 4), (N - 3), (N - 2), and (N - 1).

Representing Strings

As stated above, encoded strings are represented as an offset and a length. Since the dictionary size is fixed at N, the largest offset may be (N - 1), and the longest string matching a series of characters in the dictionary may be N characters. If we plan to store these values, that would require 2 * ceil(log2(N)) bits to represent such a string. For large N, it's highly unlikely that you'll ever read an N symbol string that matches the contents of the dictionary, so the maximum allowed match length is often limited.

In the examples I have seen, N is typically 4096 or 8192 and the maximum length allowed for matching strings is typically between 10 and 20 characters.

LZ77 vs. LZSS

In their original LZ77 algorithm, Lempel and Ziv proposed that all strings be encoded as a length and offset, even strings with no match. Storer and Szymanski observed that individual unmatched symbols or matched strings of one or two symbols take up more space to encode than they do to leave uncoded.

For example, encoding a string from a dictionary of 4096 symbols, and allowing for matches of up to 15 symbols, will require 16 bits to store an offset and a length. A single character is only 8 bits. Using this method, encoding a single character doubles the required storage.

Storer and Szymanski proposed preceding each symbol with a coded/uncoded flag. Using the above example, it now takes 9 bits for each uncoded character and 17 bits for each encoded string.

Storer and Szymanski also observed that if you're not encoding strings of length 0 to (M - 1), then M may be used as an offset to the length of the match. Applying their observation, instead of using 4 bits to represent match lengths of 0 to 15, the same 4 bits may be used to represent match lengths of M to (15 + M).

Putting it All Together

Based on the discussion above, encoding input requires the following steps:
Step 1. Initialize the dictionary to a known value.
Step 2. Read an uncoded string that is the length of the maximum allowable match.
Step 3. Search for the longest matching string in the dictionary.
Step 4. If a match is found that is greater than or equal to the minimum allowable match length:
    Step 4a. Write the encoded flag, then the offset and length to the encoded output.
    Step 4b. Otherwise, write the uncoded flag and the first uncoded symbol to the encoded output.
Step 5. Shift a copy of the symbols written to the encoded output from the unencoded string to the dictionary.
Step 6. Read a number of symbols from the uncoded input equal to the number of symbols written in Step 4.
Step 7. Repeat from Step 3, until the entire input has been encoded.

Decoding Strings

The LZSS decoding process is less resource intensive than the LZSS encoding process. The encoding process requires that the dictionary be searched for matches to the string being encoded. Decoding an offset and length combination only requires going to a dictionary offset and copying the specified number of symbols. No searching is required.

Decoding input requires the following steps:
Step 1. Initialize the dictionary to a known value.
Step 2. Read the encoded/not encoded flag.
Step 3. If the flag indicates an encoded string:
    Step 3a. Read the encoded length and offset, then copy the specified number of symbols from the dictionary to the decoded output.
    Step 3b. Otherwise, read the next character and write it to the decoded output.
Step 4. Shift a copy of the symbols written to the decoded output into the dictionary.
Step 5. Repeat from Step 2, until the entire input has been decoded.

Further Information

Further discussion of LZSS with links to other documentation and libraries may be found at http://datacompression.info/LZSS.shtml.


Implementation Issues

When I set out to implement LZSS, I had the following goals in mind:

  • The code should be easy to follow, easy to debug, and easy to modify.
  • Encoded strings should not be longer than two bytes (16 bits). 16 bit values are easy to deal with and keep encoded strings small.
  • Avoid reading/writing individual bits.

My version 0.2 and later code uses a bitfile library, so reading and writing individual bits isn't a big deal. All of my versions use codes that are integral bytes, and each of my newer versions is derived from my earlier versions, so there is a lot of common design.

What follows is a discussion about how these goals were accomplished.

The Dictionary

The dictionary size directly affects:

  • The time to search the entire dictionary
  • The size of the encoded offset into the dictionary
  • The likelihood of a string matching one that already exists in the dictionary

I chose to implement my dictionary as a 4096 character cyclic buffer (sliding window). I chose 4096 characters, because others have achieved good results with dictionaries of this size, and I can encode offsets of 0 to 4095 in 12 bits. If I encode the offset and length of a string in a 16 bit word, that leaves me with 4 bits for the length.

Minimum and Maximum Encoded Length

Keeping the goal of a 16 bit encoded string in mind, and the above requirement for a 12 bit offset, I'm left with 4 bits to encode the string length. With 4 bits, I can encode lengths of 0 through 15. However, as Storer and Szymanski observed, it only takes 1 byte to write out characters that match dictionary strings 0 or 1 character long, and 2 bytes to write out characters that match dictionary strings 2 characters long. Savings of one or more bytes per string doesn't occur unless the string is 3 characters or longer.

By not encoding strings that offer less than one byte of savings, the minimum encoded length is three characters. Since the minimum encoded length is three, I am able to add three to the 4 bit value stored in the length field, resulting in encoded string lengths of 3 to 18. So 18 is the maximum string length encoded by this implementation.

Encode/Not Encoded Flag

I've been calling my implementation a modified LZSS implementation. It's modified because of the way my version 0.1 code handles the encoded/not encoded flag. If I followed the traditional LZSS approach for handling the encoded/not encoded flag, writing an unencoded character would require a 9 bit write, 1 for the flag and 8 for the character. Similarly writing an encoded string would require 17 bits, 1 bit for the flag and 16 bits for the code.

While I was studying the algorithm, I came across some implementations that stored encoded flags in groups of eight followed by the characters or encoded strings that they represent. I don't know who to credit with the discovery of this technique, but it allowed my version 0.1 implementation to use standard one byte reads and writes instead of one bit reads and writes.

This technique also simplifies handling of the EOF. By processing only whole bytes, there are no spare bits, and the EOF of the encoded data is simply the EOF of the encoded file. Don't worry if I lost you on the EOF discussion; just relax, because there's no need to handle it specially.

After writing my version 0.1 implementation, I wrote a bitfile library that handles single and multiple bit writes. My version 0.2 and later code uses the bitfile library and handles the encoded/not encoded flag according to the traditional LZSS algorithm.

Finding Matches in the Dictionary

As stated above, one of my goals was to provide an LZSS implementation that is easy to read and easy to debug. So, to keep things simple, my first implementation used a brute force sequential search to match the string to be encoded to strings in the dictionary. Through experimentation and reading, I've discovered that the methods used for string matching significantly impact encoding time.

Others have successfully improved the time required to find matching strings using the following techniques:

  • Knuth, Morris, Pratt linear search optimization
  • Boyer-Moore linear search optimization
  • Linked Lists
  • Hash Tables
  • Multi-Layer Hash Tables
  • Binary Search Trees
  • Splays

I have already experimented with some of these techniques and plan to experiment with others as time allows. The rest of this section documents some of what I have tried so far.

Sequential Search

The first search I implemented was a sequential search. I simply start at the beginning of the sliding window dictionary and do a character by character comparison with the string being encoded. If the first characters match, I check the characters that follow. If I don't have a match, I move to the next character in the sliding window. The source code implementing a sequential search is contained in the version 0.1 file lzss.c and the file brute.c in versions 0.2 and later. The logic is pretty simple, but may require a number of comparisons on the order of (string length) × (dictionary length).

Linked Lists

The sequential search algorithm moves through the dictionary one character at a time, checking for matches to the first character in the string to be encoded. Assuming equal distribution of characters, there's a 1 in alphabet size chance of characters matching.

One approach to ensuring that the first character of the string being encoded matches the dictionary entry it is being compared to is to keep a list of all dictionary entries that start with a character for each character in the alphabet. For example, there would be one list of all the entries that start with 'A'. In the case of a 256 character alphabet, 256 lists must be maintained.

Since the dictionary is a sliding window of the last characters encoded by the algorithm, the lists of strings starting with a given character must be updated as old characters are removed from the dictionary and new characters are added to the dictionary. Linked lists are a natural choice for implementing such dynamic lists.

The additional memory overhead required to implement a collection of linked lists is one pointer to the head of each of the character lists, plus one pointer to the next character in the list for each character in the dictionary. It turns out the worst case number of operations for finding a match is still of the order (string length) × (dictionary length), but the average case is of the order of ((string length) × (dictionary length)) ÷ (alphabet size).

The source code implementing a linked list search is contained in the version 0.1 file lzlist.c and the file list.c in versions 0.2 and later. In my implementation, all pointers (list head and next) are int indices into the sliding window dictionary.

Hash Tables

After implementing string matches with linked lists, it seemed like it wouldn't be much effort to try matching using hash tables. In actuality, the linked lists approach is a solution involving hash tables where the hash key is the first character of the string and collisions are stored in a linked list. If that's all I wanted, I'd be done.

One problem with the linked list approach is that all the strings in a list only match the first character in the list, but a string will not be encoded in {offset, length} format until it matches at least some minimum (M) number of characters. To ensure that only strings matching at least M characters are searched, you can make the hash key the first M characters of the string. String matching will fly when you do that; however, the hash table will need (alphabet size)^M entries. For my implementation, alphabet size is 256 and M is 3, so the hash table would need 256^3 (that's 16,777,216 for anybody without a calculator) entries. That makes the hash table 4096 times the size of the dictionary.

A 16M entry hash table doesn't strike me as a viable solution. I wanted to try much smaller tables. In order to be able to use smaller tables, I needed a key that would distribute M long strings fairly evenly through the hash table. Rather than inventing and testing a new key algorithm, I chose the key generation method discussed in K. Sadakane and H. Imai: Improving the Speed of LZ77 Compression by Hashing and Suffix Sorting, IEICE Trans. Fundamentals, Vol. E83-A, No. 12, pp. 2689--2698, 2000. The key algorithm is supposed to be the same one used by gzip and may be implemented using the following C fragment:

key = 0;
for (i = 0; i < (M + 1); i++)
{
    key = (key << d) ^ (string[i]);
    key %= hashSize;
}

For my implementation I ended up settling on d = 5 and a hash size of 1024, though I found that the key distributed strings evenly for hash sizes that are a power of 2 between 256 and 131072.

There's only one additional complication. In the case of linked lists, adding or removing a character from the dictionary required that one entry be added or removed from a linked list. When hashing on M characters, adding or removing the nth character from the dictionary requires that strings starting at (n - M + 1) through n be added or removed from the hash table.

The source code implementing a hash table search is contained in the version 0.1 file lzhash.c and the file hash.c in version 0.2 and later. In my implementation, all pointers (hash table and next) are int indices into the sliding window dictionary.

Actual Software

I am releasing my implementations of the LZSS algorithm under the LGPL. If you've actually read this page to get all the way down here, you already know that I have different implementations. In general, earlier versions are simpler (maybe easier to follow) and later versions are easier to use as libraries and better suited for projects taking advantage of the LGPL. In some cases later versions also fix minor bugs.

Each version is contained in its own zipped archive which includes the source files and brief instructions for building an executable. None of the archives contain executable programs. A copy of the archives may be obtained by clicking on the links below.

Version 0.6
  • Replaces getopt with my optlist library.
  • Explicitly licenses the library under LGPL version 3.
Version 0.5.2
  • Closes the output file after encoding/decoding a file passed by file name. Previous versions closed the input file twice.
Version 0.5.1
  • Corrects an error that occurs when trying to use the default output file for decoding.
  • Minor comment and format corrections.
Version 0.5
  • Uses the new bit file library integer-based get and put bits functions, making it easier to change the dictionary size.
  • Minor speed improvements from using a conditional Wrap macro instead of the mod (%) operator.
  • The size of the hash table used to search the dictionary is now based on the size of the dictionary.
Version 0.4
  • Uses the latest bit file library, containing a fix for a bug in a function not used by this library.
Version 0.3
  • Split encode and decode routines into two files for smarter linking.
  • Added versions of the encode and decode routines that accept file pointers rather than file names.
  • Makefile now builds code as libraries for better LGPL compliance.
Version 0.2
  • Uses traditional LZSS encoding, where the coded/uncoded flags precede the symbols they are associated with, rather than aggregating the bits.
  • Separated encode/decode, match finding, and main into different files to allow easier linking of the LZSS routines to existing code.
  • Uses the bitfile library for reading and writing encoded files.
Version 0.1
  • Aggregates coded/uncoded bits into bytes.
  • Does not require bitwise file reads/writes.
In-Memory Version 0.1
  • Same as version 0.3, except arrays are used for storage instead of files.

If you have any further questions or comments, you may contact me by e-mail. My e-mail address is: mdipper@alumni.engr.ucsb.edu

For more information on LZSS and other compression algorithms, visit DataCompression.info.


Last updated on August 30, 2007
