Index O'Stuff
Home Compression Arithmetic CodingBurrows-Wheeler Transform Huffman Coding LZSS Coding LZW Coding Run Length Encoding Misc. Programming School ProjectsThesis Project crypt(3) Source Hamming Codes Bit Manipulation Libraries Square Root Approximation Library Sort Library Trailing Space Trimmer and Tab Remover Command Line Option Parser Humor Dictionary O'Modern TermsThe Ten Commandments of C Style Other Stuff TOPS Success StoryFree Win32 Software External Links SPAN (Spay/Neuter Animal Network)Los Angeles Pet Memorial Park Mirrors
Pages at geocitiesPages at dipperstein.com Obligatory Links
Visits since I started counting: |
LZSS (LZ77) Discussion and Implementationby Michael DippersteinThe saga continues. As with everything else that I've written, my LZSS implementation left (and still leaves) room for improvement. This web page updates the work that I have done since my original LZSS release. I'm calling my latest release version 0.3 and my original release version 0.1. My description of the LZSS algorithm will highlight some of the differences between the two versions. I've also developed an in-memory implementation of the LZSS algorithm that preforms all encode and decode operations on arrays rather than files. This implementation might be useful to those developing on systems that do not include a file sytem. Information on downloading the source code for all of my LZSS implementations may be found here. As with my Huffman code implementation, my intent is to publish an easy to follow ANSI C implementation of the LZSS compression algorithm. Anyone familiar with ANSI C and the LZSS algorithm should be able to follow and learn from my implementation. The rest of this page discusses the results of my efforts so far. Algorithm OverviewLZSS is a dictionary encoding technique. Unlike Huffman coding, which attempts to reduce the average amount of bits required to represent a symbol, LZSS attempts to replace a string of symbols with a reference to a dictionary location of the same string. It is intended that the dictionary reference should be shorter than the string it replaces. In the case of LZ77, the predecessor to LZSS, that wasn't always the case. EncodingThe LZSS algorithm and it's predecessor LZ77 attempt to compress series of strings by converting the strings into a dictionary offset and string length. So if the string "abcd" appeared in the dictionary at position 1234, it may be encoded as {offset = 1234, length = 4}. Okay, so where does the dictionary come from, and why can't I find my whole file in it? The DictionaryThe LZSS and LZ77 dictionary is not an external dictionary that lists all known symbol strings. Instead, the dictionary is a sliding window containing the last N symbols encoded/decoded. The larger N, the longer it takes to search the whole dictionary for a match and the more bits will be required to store the offset into the dictionary. Typically dictionaries contain an amount of symbols that can be represented by a whole power of 2. A 432 symbol dictionary would require 9 bits to represent all possible offsets. If you need to use 9 bits, you might as well have a 512 symbol dictionary so that you can have more entries. Since dictionaries are sliding windows, once the (N + 1)th symbol is processed and added to the dictionary, the first symbol is removed. Additional new symbols cause an equal number of the oldest symbols to slide out. In the example above, after encoding "abcd" as {offset = 1234, length = 4}, the sliding window would shift over 4 characters and the first 4 symbols (offsets 0 .. 3) would slide off the front of the sliding window. "a", "b", "c", and "d" would then be entered the dictionary into positions (N - 4), (N - 3), (N - 2), and (N - 1). Representing StringsAs stated above, encoded strings are represented as an offset and a length. Since the dictionary size is fixed at N the largest offset may be (N - 1), and the longest string matching a series of characters in the dictionary may be N characters. If we plan to store these values, that would require 2 * ceil(log2(N)) bits to represent such a string. For large N, it's highly unlikely that you'll ever read an N symbol string that matches the contents of dictionary, so the maximum allowed match length is often limited. In the examples I have seen, N is typically 4096 or 8192 and the maximum length allowed for matching strings is typically between 10 and 20 characters. LZ77 vs. LZSSIn their original LZ77 algorithm, Lempel and Ziv proposed that all strings be encoded as a length and offset, even strings with no match. Storer and Szymanski observed that individual unmatched symbols or matched strings of one or two symbols take up more space to encode than they do to leave uncoded. For example, encoding a string from a dictionary of 4096 symbols, and allowing for matches of up to 15 symbols, will require 16 bits to store an offset and a length. A single character is only 8 bits. Using this method, encoding a single character doubles the required storage. Storer and Szymanski proposed preceding each symbol a with coded/uncoded flag. Using the above example it now takes 9 bits for each uncoded character and 17 bits for each encoded string. Storer and Szymanski also observed that if you're not encoding strings of length 0 to M, then M may be used as an offset to the length of the match. Applying their observation, instead of using 4 bits to represent match lengths of 0 to 15, the same 4 bits may be used to represent match lengths of M to (15 + M). Putting it All TogetherBased on the discussion above, encoding input requires the following
steps: Decoding StringsThe LZSS decoding process is less resource intensive than the LZSS encoding process. The encoding process requires that a the dictionary is searched for matches to the string to be encoding. Decoding an offset and length combination only requires going to a dictionary offset and copying the specified number of symbols. No searching is required. Decoding input requires the following steps: Further InformationFurther discussion of LZSS with links to other documentation and libraries may be found at http://datacompression.info/LZSS.shtml. Implementation IssuesWhen I set out to implement LZSS, I had the following goals in mind:
My version 0.2 and later code uses a bitfile library, so reading and writing individual bits isn't a big deal. All of my versions use codes that are integral bytes, and each of my newer versions are derived from my earlier versions so there is a lot of common design. What follows is a discussion about how these goals were accomplished. The DictionaryThe dictionary size directly effects:
I chose to implement my dictionary as a 4096 character cyclic buffer (sliding window). I chose 4096 characters, because others have achieved good results with dictionaries of this size, and I can encode offsets of 0 to 4095 in 12 bits. If I encode the offset and length of a string in a 16 bit word, that leaves me with 4 bits for the length. Minimum and Maximum Encoded LengthKeeping the goal of a 16 bit encoded string in mind, and the above requirement for a 12 bit offset, I'm left with 4 bits to encode the string length. With 4 bits, I can encode lengths of 0 through 15. However, as Storer and Szymanski observed, it only takes 1 byte to write out characters that match dictionary strings 0 or 1 character long, and 2 bytes to write out characters that match dictionary strings 2 characters long. Savings of one or more bytes per string doesn't occur unless the string is 3 characters or longer. By not encoding strings that offer less than one byte of savings, the minimum encoded length is three characters. Since the minimum encoded length is three, I am able to add three to the 4 bit value stored in the length field, resulting in encoded string lengths of 3 to 18. So 18 is the maximum string length encoded by this implementation. Encode/Not Encoded FlagI've been calling my implementation a modified LZSS implementation. It's modified because of the way my version 0.1 code handles the encoded/not encoded flag. If I followed the traditional LZSS approach for handling the encoded/not encoded flag, writing an unencoded character would require a 9 bit write, 1 for the flag and 8 for the character. Similarly writing an encoded string would require 17 bits, 1 bit for the flag and 16 bits for the code. While I was studying the algorithm, I came across some implementations that stored encoded flags in groups of eight followed by the characters or encoded strings that they represent. I don't know who to credit with the discovery of this technique, but it allowed my version 0.1 implementation to use standard one byte reads and writes instead of one bit reads and writes. This technique also makes the EOF clear. By processing only bytes, there are no spare bits, and the encoded EOF is actually the EOF of the encoded file. Don't worry about it if I lost you on the EOF discussion, just relax, because there's no need to handle it specially. After writing my version 0.1 implementation, I wrote a bitfile library that handles single and multiple bit writes. My version 0.2 and later code uses the bitfile library and handles the encoded/not encoded flag according to the traditional LZSS algorithm. Finding Matches in the DictionaryAs stated above, one of my goals was to provide an LZSS implementation that is easy to read and easy to debug. So, to keep things simple, my first implementation used a brute force sequential search to match the string to be encoded to strings in the dictionary. Through experimentation and reading, I've discovered that the methods used for string matching significantly impact encoding time. Other's have successfully improved the time required to find matching strings using the following techniques:
I have already experimented with some of these techniques and plan to experiment with others as time allows. The rest of this section documents some of what I have tried so far. Sequential SearchThe first search I implemented was a sequential search. I simply start at the beginning of the sliding window dictionary and do a character by character comparisons with the string being encoded. If the first characters match, I check the characters that follow. If I don't have a match, I move to the next character in the sliding window. The source code implementing a sequential search is contained in the version 0.1 file lzss.c and the file brute.c in versions 0.2 and later. The logic is pretty simple, but may require a number of comparisons on the order of (string length) × (dictionary length). Linked ListsThe sequential search algorithm moves through the dictionary one character at a time, checking for matches to the first character in the string to be encoded. Assuming equal distribution of characters, there's a 1 in alphabet size chance of characters matching. One approach to ensuring that the first character of the string being encoded matches the dictionary entry it is being compared to is to keep a list of all dictionary entries that start with a character for each character in the alphabet. For example, there would be one list of all the entries that start with 'A'. In the case of a 256 character alphabet, 256 lists must be maintained. Since the dictionary is a sliding window of the last characters encoded by the algorithm, the lists of strings starting with a given character must be updated as old characters are removed from the dictionary and new characters are added to the dictionary. Linked lists are a natural choice for implementing such dynamic lists. The additional memory overhead required to implement a collection of linked lists is one pointer to the head of each of the character lists, plus one pointer to the next character in the list for each character in the dictionary. It turns out the worst case number of operations for finding a match is still of the order (string length) × (dictionary length), but the average case is of the order of ((string length) × (dictionary length)) ÷ (alphabet size). The source code implementing a linked list search is contained in the version 0.1 file lzlist.c and the file list.c in versions 0.2 and later. In my implementation, all pointers (list head and next) are int indices into the sliding window dictionary. Hash TablesAfter implementing string matches with linked lists, it seemed like a wasn't much effort to try matching using hash tables. In actuality, the linked lists approach, is a solution involving hash tables where the hash key is the first character of the string and collisions are stored in a linked list. If that's all I wanted, I'd be done. One problem with the linked list approach is that all the strings in a list only match the first character in the list, but a string will not be encoded in {offset, length} format until it matches at least some minimum (M) number of characters. To ensure that only strings matching M characters are searched, you can make the hash key the first M characters of the string. String matching will fly when you do that, however the hash table will need (alphabet size)M entries. For my implementation alphabet size is 256 and M is 3, so the hash table would need 2563 (that's 16,777,216 for anybody without a calculator) entries. That makes the hash table 4096 times the size of the dictionary. A 16M entry hash table doesn't strike me as a viable solution. I wanted to try much smaller tables. In order to be able to use smaller tables, I needed a key that would distribute M long strings fairly evenly through the hash table. Rather than inventing and testing a new key algorithm, I chose the key generation method discussed in K. Sadakane and H. Imai: Improving the Speed of LZ77 Compression by Hashing and Suffix Sorting, IEICE Trans. Fundamentals, Vol. E83-A, No. 12, pp. 2689--2798, 2000. The key algorithm is supposed to be the same algorithm used by gzip and may be implemented using the following C fragment:
key = 0; For my implementation I ended up settling on d = 5 and hash size = 1024, though I found that the key distributed strings evenly for hash size of a power of 2 between 256 and 131072. There's only one additional complication. In the case of linked lists, adding or removing a character from the dictionary required that one entry be added or removed from a linked list. When hashing on M characters, adding or removing the nth character from the dictionary requires that strings starting at (n - M + 1) through n be added or removed from the hash table. The source code implementing a hash table search is contained in the version 0.1 file lzhash.c and the file hash.c in version 0.2 and later. In my implementation, all pointers (hash table and next) are int indices into the sliding window dictionary. Actual SoftwareI am releasing my implementations of the LZSS algorithm under the LGPL . If you've actually read this page to get all the way down here, you already know that I have different implementations. In general earlier versions are simpler (maybe easier to follow) and later versions are easier to use as libraries and better suited for projects taking advantage of the LGPL. In some cases later version also fix minor bugs. Each version is contained in its own zipped archive which includes the source files and brief instructions for building an executable. None of the archives contain executable programs. A copy of the archives may be obtained by clicking on the links below.
If you have any further questions or comments, you may contact me by e-mail. My e-mail address is: mdipper@alumni.engr.ucsb.edu For more information on LZSS and other compression algorithms, visit DataCompression.info. |
Home
Last updated on August 30, 2007