Index O'Stuff
Home Compression Arithmetic CodingBurrows-Wheeler Transform Huffman Coding LZSS Coding LZW Coding Run Length Encoding Misc. Programming School ProjectsThesis Project crypt(3) Source Hamming Codes Bit Manipulation Libraries Square Root Approximation Library Sort Library Trailing Space Trimmer and Tab Remover Command Line Option Parser Humor Dictionary O'Modern TermsThe Ten Commandments of C Style Other Stuff TOPS Success StoryFree Win32 Software External Links SPAN (Spay/Neuter Animal Network)Los Angeles Pet Memorial Park Mirrors
Pages at geocitiesPages at dipperstein.com Obligatory Links
Visits since I started counting: |
Arithmetic Code Discussion and Implementationby Michael DippersteinThe insanity continues. After playing with Huffman coding and LZSS, I had to give arithmetic coding a try. It wasn't enough to understand how it worked I had to actually implement it. I've read about what this stuff did to Ross Williams. I can only hope that I'm not going to be subjected to a similar fate. As with my other compression implementations, my intent is to publish an easy to follow ANSI C implementation of the arithmetic coding algorithm. Anyone familiar with ANSI C and the arithmetic coding algorithm should be able to follow and learn from my implementation. There's a lot of room for improvement of compression ratios, speed, and memory usage, but this project is about learning and sharing, not perfection. Click here for information on downloading my source code. The rest of this page discusses arithmetic coding and the results of my efforts so far. Algorithm OverviewArithmetic coding is similar to Huffman coding; they both achieve their compression by reducing the average number of bits required to represent a symbol. Given: More often than not, the optimal number of bits is fractional. Unlike Huffman coding, arithmetic coding provides the ability to represent symbols with fractional bits. Since, ∑pi = 1, we can represent each probability, pi, as a unique non-overlapping range of values between 0 and 1. There's no magic in this, we're just creating ranges on a probability line. For example, suppose we have an alphabet 'a', 'b', 'c', 'd', and 'e' with probabilities of occurrence of 30%, 15%, 25%, 10%, and 20%. We can choose the following range assignments to each symbol based on its probability:
Where square brackets '[' and ']' mean the adjacent number is included and parenthesis '(' and ')' mean the adjacent number is excluded. Ranges assignments like the ones in this table can then be use for encoding and decoding strings of symbols in the alphabet. Algorithms using ranges for coding are often referred to as range coders. Encoding StringsBy assigning each symbol its own unique probability range, it's possible to encode a single symbol by its range. Using this approach, we could encode a string as a series of probability ranges, but that doesn't compress anything. Instead additional symbols may be encoded by restricting the current probability range by the range of a new symbol being encoded. The pseudo code below illustrates how additional symbols may be added to an encoded string by restricting the string's range bounds. lower bound = 0 upper bound = 1 while there are still symbols to encode current range = upper bound - lower bound upper bound = lower bound + (current range × upper bound of new symbol) lower bound = lower bound + (current range × lower bound of new symbol) end while Any value between the computed lower and upper probability bounds now encodes the input string. Example:
Start with lower and upper probability bounds of 0 and 1. It should become apparent from the example that precision requirements increase as additional symbols are encoded. Strings of unlimited length require infinite precision probability range bounds. The section on implementation discusses how the need for infinite precision is handled. Decoding StringsThe decoding process must start with a an encoded value representing a string. By definition, the encoded value lies within the lower and upper probability range bounds of the string it represents. Since the encoding process keeps restricting ranges (without shifting), the initial value also falls within the range of the first encoded symbol. Successive encoded symbols may be identified by removing the scaling applied by the known symbol. To do this, subtract out the lower probability range bound of the known symbol, and multiply by the size of the symbols' range. Based on the discussion above, decoding a value may be performed following the steps in the pseudo code below: encoded value = encoded input while string is not fully decoded identify the symbol containing encoded value within its range //remove effects of symbol from encoded value current range = upper bound of new symbol - lower bound of new symbol encoded value = (encoded value - lower bound of new symbol) ÷ current range end while Example:
Decode first symbol There are two issues I've glossed over with this example: knowing when to stop and the required computational precision. The section on implementation discusses both of these issues. Order-n ModelsThe arithmetic coding algorithm that I described is an order-0 zero algorithm. The probability ranges for a symbol are dependant upon 0 preceding symbols. There are order-n algorithms in which a symbol's probability range is based on the n symbols that preceded it. A common example for why we'd want to use a model where n > 0 is the one where English text is being encoded. If a 'q' has just been encoded, there is nearly a 100% chance that it will be followed by a 'u'. Qwest Communications and ruined things for everyone. One of the biggest drawbacks of order-n models where n > 0 is the amount of memory required to store the probabilities ranges. If it takes O(N) memory to store the probability ranges for an order-0 model of an N symbol alphabet, it takes O(Nn + 1) memory for a order-n model. ImplementationThis section discusses some of the issues of implementing an arithmetic encoder/decoder and the approach that I took to handle those issues. As I have already stated, my implementation is intended to be easy to develop and follow, not necessarily optimal. What is a SymbolOne of the first questions that needs to be resolved before you start is "What is a symbol?". For my implementation a symbol is any 8 bit combination as well as an End Of File (EOF) marker. This means that there are 257 possible symbols in any encoded stream. Infinite PrecisionIn my algorithm overview I noted that the lower and upper probability range bounds as well as the encoded value theoretically require infinite precision values on the range [0, 1). The method chosen to handle this largely impacts the rest of the implementation. Not wanting to blaze new trials, I chose to use the common solution. Values on the probability range scale will be represented as infinite length fixed point values, with the binary point before the most significant bit. Example: Wow Infinity. That's a Lot of Bits.Fortunately for us, we don't need to operate on all the bits at once. As additional symbols are encoded, the lower and upper range bounds converge. As the bounds converge, their most significant bits stop changing. By choosing the precision of each symbol's range bounds, we can also limit the amount of bits required for each computation. As you read through this section, you will learn how it is possible to achieve good compression results using 16 bit computations. Computing Probability RangesI've implemented both static and adaptive order-0 probability models. The my static model computes probability ranges based on symbols counts obtained by making a pass on the input prior to starting the encoding. Others may use a predetermined set of ranges. My adaptive model assumes a uniform distribution and updates itself as new symbols are processed. For my static model, the first step to computing probability ranges for each symbol is to simply count occurrences of each symbol, as well as the total numbers of symbols. The next step is to scale all of the symbol counts so that they can be in N bit computations. To do this symbol counts must be scaled to use no more than N - 2 bits. In the case of 16 bit computations, counts must be 14 bits or less. You'll see why in the next section.
Now we have a scaled count for each symbol. We need to convert the scaled counts to range bounds on a probability line. For an alphabet with symbols S0, S1, ... Sn and scaled counts c0, c1, ... cn, we can define ranges as follows:
For my adaptive model we start with scaled symbols, but the N - 2 bit restriction still applies. It's possible to add the effect of a new symbol that would make the symbol count too large to be represented by N - 2 bits (see Adaptive Model and Symbol Range Updates). When this happens, I just rescale the current model for half the symbol count.
Encoding with Scaled RangesThe standard form of arithmetic coding's encoding is based on fractional ranges on a probability line between 0 and 1. We just assigned our symbols probability ranges on a probability line between 0 and ∑ci (total symbol count rescaled). The encoding algorithm needs to be adapted for our new range scale. The following pseudo code modifies the standard form of the encoding algorithm for a range scale of [0, ∑ci):
lower bound = 0 There are a few important things to know about this algorithm:
Decoding with Scaled RangesThe standard form of arithmetic coding's decoding is also based on fractional ranges on a probability line between 0 and 1. Like encoding, we have to rescale our calculations for the decoding process. The following pseudo code modifies the standard form of the decoding algorithm for a range scale of [0, ∑ci):
encoded value = encoded input Limiting Computations to N BitsWe have a way of representing floating point values as an infinite stream of bits, we have a way of scaling our probability ranges to N − 2 bits, and we can encode and decode using the scaled values. This still doesn't get us to a finite number of bits. There are a few facts that we can use to develop a solution that limits computations to N bits:
As the lower and upper probability range bounds start to converge, they are likely to reach a point where the have identical MSBs (most significant bits). Once the bounds have matching MSBs, the MSBs will remain matching for the rest of the encoding process. From that point on we don't need to worry about the MSBs anymore. Shift the MSB out to the encoded data stream, and shift in a 0 and a 1 for the LSBs of the lower and upper range limits respectively. Since we didn't perform an operation that affects bits beyond the Nth, we know the LSBs will be 0 and 1. Example: UnderflowNow that we have a rule for shifting bits through our N bit variables, We can do everything with N bit math. Well almost. It turns out there's one thing I overlooked. What happens when the lower and upper range bounds start to converge around 1/2 (0111.... and 1000...)? This condition is called an underflow condition. When an underflow condition occurs the MSBs of the range bounds will never match. The good news is that there's a fairly simple way to handle underflow. We can recognize that an underflow condition is pending when the two MSBs of the lower range bound are different from the two MSBs of the upper range bound. What we have to do in this situation is to remove the second bit from both range bounds, shift the other bits left, and remember that we had an underflow condition. We may still have an underflow condition so this process may need to be repeated. When we finally get an opportunity to shift out the MSB, follow it with the underflow bit(s) of the opposite value of the converged MSB. Example: Stopping The Decoding ProcessIn my decoding example I stated that a three symbol string should be decoded, but if you notice, there is still a coded value when we're done. We stopped decoding because the problem said to only decode three symbols. One of the methods for knowing when to stop the decoding process is to include a symbol count somewhere in the encoded file. The file header is an ideal place to store the symbol count. Another approach is to implement a "bijective" coder. The idea of a "bijective" implementation seems like more a hassle than a benefit. If you're interested in learning about "bijective" encoding/decoding refer to SCOTT's "one to one" compression discussion. I took the approach of encoding an EOF. The upside is that I don't need to include a total symbol count in my header. I don't even need to count EOFs, because there's only one. The downsides are that the EOF does take up space in the probability line, and encoding the EOF may also add extra bits to the encoded file. I performed a few tests comparing including a total symbol count in the file header against encoding the EOF. Encoding the EOF appeared to result in the better average compression ratios. My testing was far from scientific. If I get some time I will actually do a mathematical analysis of the two methods to determine when one method is better than the other. Stopping the Encoding ProcessSince I stop my encoding process after encoding the EOF it is at that point that I know the final probability range bounds. However, our probability range bounds have infinite precision and we need to write enough bits to be sure that the value we output falls within the range for the string we just encoded. Since the decoder also uses N bits, we just need to make sure it will have N bits to work with when it decodes the EOF. To do this, output the lower range bound's second MSB, all remaining underflow bits, plus an additional underflow bit. Static vs. Adaptive ModelsThe arithmetic coding algorithm is well suited for both static and adaptive probability models. Encoders and decoders using adaptive probability models start with a fixed model and use a set of rules to adjust the model as symbols are encoded/decoded. Encoders and decoders that use static symbol probability models start with a model that doesn't change during the encoding process. Static probability models may be constant regardless of what is being encoded or based on the encoded input. Static Model and File HeadersMy static probability model implementation uses a file header. Symbol probabilities are computed prior to the start of the encoding process and they are stored in a file header for the decoder to use at the start of the decoding process. The file header is section of data prior to the encoded output that includes scaled range counts for each encoded symbol (except the EOF whose count is always one). Since I've scaled my counts to all be 14 bit values, I cheat a little and just write 14 bit values to the encoded file. Handling the 14 bit values is only slightly more complicated than handling 16 bit values, and it makes the file 512 bits smaller. Adaptive Model and Symbol Range UpdatesIn the section on Computing Probability Ranges I touched on the potential need to rescale adaptive probability ranges and how I did it. What I neglected to mention was why this happens and what it means. As its name would suggest, the adaptive model updates with the data stream. I chose to update my model after a symbol is encoded or decoded. After a symbol is encoded or decoded, it's upper bound and the bounds of every symbol after it must be incremented by one. On average that will be 128 updates per an encode/decode in a 256 symbol alphabet. Others have tried to reduce the number of updates required by placing the more common symbols near the end of the list of ranges. Laziness and the my inability to see a great performance benefit kept me from trying that approach. If adding a new symbol causes the total probability to exceed N - 2 bits, the probability ranges must be rescaled. One side effect of rescaling is that symbols encoded/decode after a rescale operation will have twice the weight of symbols encoded prior to the rescale (the weight of prior symbols is halved). I have a feeling that this may be beneficial more often than not, but it's not intended to be a feature of my implementation. PortabilityAll the source code that I have provided is written in strict ANSI C. I would expect it to build correctly on any machine with an ANSI C compiler. I have tested the code compiled with gcc on Linux and mingw on Windows 98. There are some compile time options that offer minimal speed-up when compiling code for a little endian target, but the little endian code is disabled by default. My implementation assumes that long integers (long) are bigger than short integers (sort), if this is not the case with your compiler, a compile time error should be issued. This implementation is also incapable of handling files with greater than ULONG_MAX bytes to be encoded. The program will issue an error and gracefully terminate if the file being encoded is too large. Further InformationFurther discussion of the arithmetic coding algorithm with links to other documentation and libraries may be found at http://www.datacompression.info/ArithmeticCoding.shtml. I found Arturo Campos' and Mark Nelson's discussions to be a huge help. Actual SoftwareI am releasing my implementations of the arithmetic coding algorithm under the LGPL. All of my releases may be found on this site. As I add enhancements or fix bugs, I will post them here as newer versions. The larger the version number, the newer the version. I'm retaining the older versions for historical reasons. I recommend that most people download the newest version unless there is a compelling reason to do otherwise. Each version is contained in its own zipped archive which includes the source files and brief instructions for building an executable. None of the archives contain executable programs. A copy of the archives may be obtained by clicking on the links below.
If you have any further questions or comments, you may contact me by e-mail. My e-mail address is: mdipper@alumni.engr.ucsb.edu For more information on arithmetic coding algorithm and other compression algorithms, visit DataCompression.info. |
Home
Last updated on August 29, 2007