The Green Tree Of Compression Methods
A Practical Introduction To Data Compression

Prologue
========
What is information? According to the most popular definition, any interaction between objects in which one of them acquires some substance while the other(s) do not lose it is called information interaction, and the transmitted substance is called information. Information that falls under this definition can better be called "physical". But the definition does not take into account non-physical, "absolute" information, which in no way depends on material objects. The simplest example is the multiplication table, known to everyone from the first years of life to the last. The fact that the sum of the squares of the legs equals the square of the hypotenuse; that the ratio of a circle's circumference to its diameter equals "pi", and the ratio of a circle's area to the area of a square whose side equals the circle's radius is also "pi"; and so on. This is precisely "absolute" information about objects of the non-material world. If the material world disappears, all this information stays immutable: the multiplication table, and all formulas and theorems of all branches of mathematics.

More complex examples: the set of numbers coding the DNA of some man Nika, and the set of numbers resulting from scanning his photo. Such information, of course, has no meaning outside the material world, but, unlike "chaotic" info, it acquires meaning in the material world and thereby, as it were, raises its own status, "develops" in it, and displays certain properties, aspects, attributes, characteristics. Only due to this, only in the material world (and also in its models), does the "best" part of information become "knowledge" and even "wisdom", while other pieces remain unmanifested "chaos". Thus, the difference between analogue (physical) information and digital information (data) is much more principal and deep than it originally seems.

Answers
=======
Question 1: How is information compressed? There are hundreds of links at http://www.compression-pointers.com , and hundreds of compressors at http://act.by.net . What are all these compression methods, programs and standards about, in short?

Answer 1: That's rather easy. But some definitions first.

A BIT is an "atom" of digital information (DATA): imagine a small "box" somewhere in space-time (in a memory chip, on a CD-ROM, floppy or hard disk, a communication line...) with only two possible states: full (1, one, true, yes, exists) or empty (0, zero, false, no, doesn't exist).

A finite sequence of bits we call A CODE.

A (traditional) BYTE is a sequence of eight bits. A GENERALIZED N-bit BYTE is a sequence of N bits and has 2^N (two raised to the N-th power) possible states (thus, a traditional 8-bit byte can have 256 different values: 0, 1, ..., 255).

A finite sequence of bytes we call A WORD, and the number of bytes in this finite sequence is the word's length. In the general case a word consists of N-bit bytes, not 8-bit ones (thus, a code is a word of generalized 1-bit bytes).

Q2: What's a symbol then? Exactly this word is present in numerous patents about devices and methods for data compression.

A2: A symbol is an atom of a language (a letter, a note, a figure...). ASCII (American Standard Code for Information Interchange) sets a correspondence between 8-bit bytes and symbols, but when we meet national alphabets, we need more: 16 bits per symbol (the UNICODE standard).

Finally, (lossless) COMPRESSION of a BLOCK (a finite piece of digital information) is a description of how to bit-exactly recreate the given block, put into another, shorter (compressed) block occupying fewer bits than the initial block we describe.
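To make the notion of a generalized N-bit byte concrete, here is a minimal sketch (Python is used purely for illustration, and the helper name split_into_nbit_bytes is made up for this example) that reads one and the same block as sequences of bytes of different widths:

    def split_into_nbit_bytes(block: bytes, n: int) -> list[int]:
        # View the block as a sequence of generalized n-bit bytes (MSB first).
        # With n = 1 the result is simply the block's code, i.e. its bit sequence.
        bits = []
        for byte in block:
            for shift in range(7, -1, -1):
                bits.append((byte >> shift) & 1)
        values = []
        for i in range(0, len(bits) - n + 1, n):
            value = 0
            for bit in bits[i:i + n]:
                value = (value << 1) | bit
            values.append(value)
        return values

    data = bytes([0x12, 0x34, 0x56, 0x78])
    print(split_into_nbit_bytes(data, 8))    # four traditional 8-bit bytes
    print(split_into_nbit_bytes(data, 16))   # the same block as two 16-bit bytes

The same 32 bits can thus be treated as 32 one-bit bytes (a code), four 8-bit bytes, or two 16-bit bytes; which view is "right" depends on how the block was created.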
The so-called LOSSY compression consists of two different processes (not independent and, as a rule, running in parallel): (1) extracting the useful parts from blocks, creating a model that depends on the purpose of the compression and the peculiarities of both the source and the destination of the information, and preparing, as well as possible, the data for step (2), lossless compression. When measuring physical parameters (brightness, frequency, voltage, amplitude, resistance and so on), inexactitude is inevitable, therefore "round-off" is quite permissible.

Q3: What's the difference between these definitions and existing terms?

A3: From time immemorial, when there were more 8-bit processors than 16-bit ones and there was no need for UNICODE, people have usually meant "byte" when saying "symbol", and "8-bit byte" when saying "byte". A sequence of two bytes the computer world calls "a word", and a sequence of four bytes a "double word" (dword). Very many non-Intel processors are constructed so that every addressable memory unit contains a 16-, 24- or 32-bit byte, not an 8-bit one as in all IBM-compatible computers. If a "compressed" block is longer than the initial one, it is not compression but recoding, by definition, regardless of what nearly half of the currently existing compressors "think" about it (increasing the size by 1% - 5%, instead of 1 bit, in the "worst cases" of incompressible data). If lossy "compression" does not contain true compression algorithms, it merely removes information not relevant for the known purpose, inside the limits of the chosen model. However, there are so many definitions of "compression", and especially of "code", that misunderstandings appear much more often than they should.

Now, with such clear definitions, the following seems rather evident:

A) Any base compression method is good for compressing either blocks of bits, or blocks of bytes, or blocks of words, as these are very different objects. Exact mathematical definitions can be given for these three base types of digital information, but in human words: when compressing a block, the algorithm assumes either that the bits in the given block appear according to some law (a block of bits), or that not bits but bytes appear according to some function (a block of bytes), or that neither bits nor bytes but words appear according to some rule (a block of words).

B) A given block will be compressed better if its structure is known: whether it is a block of bits, a block of N-bit bytes, or a block of words of N-bit bytes. And, of course, if we know "the answer": the rules it was created with, what the length of records and bytes was, how they were formed. In other words, the more the compression process "knows" about the block creation process, the better it compresses. Practically all blocks are created as either blocks of bits (1), blocks of bytes (2), or blocks of words (3). Two "boundary" types can be added: (0) "chaos" (no laws can be detected - neither for words, nor for bytes, nor for bits) and (4) "texts": sets of words appearing according to some rule; for example, a block containing English sentences, Russian sentences, and texts of functions in the C programming language. Other types, like mixtures of bytes and bits, or of words and bits, are (practically) much less probable.

C) In each of the three cases, two variants are possible: the probabilities of the elements (bits (1), bytes (2), words (3)) are different, and they either a) don't depend on the previous and following elements, the "context" (a so-called "Bernoulli source" of elements, without memory); or b) highly depend on the previous and following elements (a "Markov source", with memory). A simple way to check which of the two variants a block of bytes belongs to is sketched below.
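A minimal sketch of that check (Python, used only for illustration; the function names are made up): the order-0 entropy corresponds to a memoryless "Bernoulli" source, the order-1 conditional entropy takes the previous byte as context, and a clearly lower second value means the bytes depend on their context (case b):

    import math
    from collections import Counter

    def order0_entropy(block: bytes) -> float:
        # Bits per byte if bytes are drawn independently ("Bernoulli source").
        n = len(block)
        counts = Counter(block)
        return -sum(c / n * math.log2(c / n) for c in counts.values())

    def order1_entropy(block: bytes) -> float:
        # Bits per byte given the previous byte as context ("Markov source").
        n = len(block) - 1
        pairs = Counter(zip(block, block[1:]))
        contexts = Counter(block[:-1])
        return -sum(c / n * math.log2(c / contexts[a]) for (a, b), c in pairs.items())

    sample = open(__file__, "rb").read()   # any block of bytes will do
    print(round(order0_entropy(sample), 3), round(order1_entropy(sample), 3))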
Other cases are so rare that they can be considered "practically impossible" (for example: elements sometimes depend on the context and sometimes don't; depend, but not for all values; depend on selected elements of the context, not the nearest ones). We see that case 2b (probabilities of bytes are different and depend on the context) is identical to case 3 (probabilities of words are different), while case 1b (bits, depending on context) is not identical to case 2 (the simplest example is poorly compressed data: files like *.zip, *.gif). That's because, by definition, a finite sequence of bytes is a word, while a finite sequence of bits is a code, not a byte. In other words, a byte is a record of fixed length, and a word is a record of variable length.

D) To simplify the data structure and improve compression, the main algorithm, when compressing blocks of words, creates blocks of bytes and/or bits; when compressing blocks of bytes, it creates blocks of bits; and only when compressing blocks of bits does it output to the final compressed block. For certain reasons the majority of compression programs today simply skip the third step (compression of blocks of bits), or merge it with the second (compression of blocks of bytes), starting from the assumption "a block of words is given". And only archivers with options for "multimedia compression" check for blocks of bytes. That's why people speak in terms of "primary" and "secondary" compression: "modeling" and/or "sorting" (changing the byte order), then "encoding" (bytes to codes); a toy two-stage sketch of this split is given after A4 below. We can often see phrases like "it uses LZ77 followed by an arithmetic coder" in the comp.compression newsgroup, in archiver manuals, and in compression-related articles. As we all know, all the first compressors like ARC, ARJ, LHA, PAK, ZIP, ZOO etc. were using LZ77 or LZ78 followed by a Huffman or a Shannon-Fano encoder.

Q4: What are these reasons?

A4: The first main reason is that most "natural" blocks appearing in computers from input devices (keyboard, scanner, microphone, camera...) are really blocks of words. In the general case, words of N-bit bytes: modern graphics devices operate on 24- or 32-bit bytes ("pixels"), while audio devices operate on 8-, 16- or 32-bit bytes ("samples"). The same holds for executable instructions for the Central Processing Unit (CPU) and Floating Point Unit (FPU) (files with executable code: .exe, .dll ...). Blocks of bytes and bits appear, as a rule, when computers process that "natural" data. Thus, the main problem is to find out how those _words_ were created, what the laws were. It is the primary compression algorithm that plays the main role in achieving best-in-the-world compression of certain types of files (see http://act.by.net ).

Second, it's much easier for computers to deal with bytes than with bits. Each byte has a unique address in memory, and each address points to a byte (an 8-bit one, if it's an IBM-compatible PC). It was only 20 years ago that the first 16-bit processors appeared, able to process more than 8-bit units with one instruction.

Third, there are practically no good algorithms for compressing blocks of bits (due to the first two reasons). Appropriate primary and secondary algorithms give excellent results, while the third step takes much time and gives little improvement.

More than 10 years ago, when the "universal" .zip format was developed, nearly all readable texts were blocks of 8-bit ASCII symbols, nearly all executable modules were for 16- or even 8-bit processors, and nearly all multimedia files were blocks of 8- or even 4-bit bytes. That's why it seemed enough to have one algorithm for all data types: blocks of 8-bit bytes (multimedia files) and blocks of words (texts, executables).
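Here is the toy two-stage sketch promised in D) (Python, illustration only; real block-sorting compressors use suffix sorting, and a third stage - a Huffman or arithmetic coder - would follow): a naive Burrows-Wheeler transform reorders the bytes, and move-to-front turns the result into small values that a final encoder can code cheaply:

    def bwt(block: bytes) -> tuple[bytes, int]:
        # Naive Burrows-Wheeler transform: sort all rotations, keep the last column.
        rotations = sorted(block[i:] + block[:i] for i in range(len(block)))
        last_column = bytes(rot[-1] for rot in rotations)
        return last_column, rotations.index(block)

    def mtf(data: bytes) -> bytes:
        # Move-to-front: recently seen bytes get small numbers, preparing the
        # block for the third step (Huffman or arithmetic coding of the codes).
        table = list(range(256))
        out = bytearray()
        for b in data:
            i = table.index(b)
            out.append(i)
            table.pop(i)
            table.insert(0, b)
        return bytes(out)

    last, index = bwt(b"abracadabra abracadabra")
    print(last, index, mtf(last))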
Many special algorithms have been developed since that time, but not very many universal programs achieving compression close to the world records on all types of files (RK, UHArc, RAR, ERI, ACE).

Q5: Any other methods of classifying digital data?

A5: Of course, many others exist; only the most relevant one is described above. The following can also be mentioned:
- One-dimensional, two-dimensional (pictures, for example), three-dimensional (video, 3D images) and so on. We know that 2D-3D-ND files are presently met in less than 10% of cases, while the probability of seeing them in an "unknown data" set is less than 1% (they are stored separately, as a rule, and processed with special algorithms). But in this case (it was detected that the data is not one-dimensional) the more important question is immediately solved: whether it contains bits, bytes or words - bytes, with probability more than 0.999.
- The so-called "textual" and "binary" data. The most classical classification. Under our definitions, the first are blocks of words, and the second is everything else, including mixtures of all types.
- Extension-based classification (.ARC, .AVI, .BMP etc. files). The way for those who are trying to make commercial archivers. (People who know the PPM and BlockSorting families of compression methods may think that order-1, order-2, order-3 and so on types of data exist.)

Q6: What compression methods are known nowadays?

A6: For blocks of words or bytes: LZ77, LZ78, LZW (Lempel-Ziv-Welch), LZSS and many other "sliding window" and dictionary variants; BlockSorting - full (BWT, Burrows-Wheeler Transform) or partial (using N preceding elements) within the given block, or with "parallel" dependent block(s) of the same size; PPM (Prediction by Partial Match); Markov and other modeling.

For bytes: MTF (move to front); the methods of Huffman and Shannon-Fano; LPC (Linear Prediction Coding) and its plain case with one-element memory, Delta Coding (compute/code differences between neighbouring bytes); Distance Coding (compute/code distances between equal values), also known as Homala Benesa (how many larger (bytes) before the next same (byte)); SEM (separate exponents and mantissas).

For bytes or bits: RLE (run length encoding); arithmetic coding with all possible modifications (range coder, ELS - entropy logarithmic scale, and so on); ENUC (enumerative coding); ERI93.

These are only the main families (branches from the three main branches of the tree) of base compression methods. Each family contains lots of different methods, each method has many variants, and each variant has some unique internal parameters.

Q7: They are usually somehow combined, aren't they?

A7: Practice shows that for blocks of words it's good to first apply one or a few algorithms from the first family: in the simplest case a preselected one; in a better case one selected per block, before starting to compress the given block; in the best case they are somehow synthesized during the process of compressing the block of words (a synthesis of LZ and BS is particularly useful). In the second step one or a few methods from the second family are applied, and in the third step, methods from the third. In the case of blocks of bytes, the first and most difficult step had better be skipped, and in the case of blocks of bits it's better to go immediately to the third step, skipping both the first and the second; a small sketch of this byte-then-encoder route follows.
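A minimal sketch (Python, illustration only; the toy "audio" signal and the helper name delta_encode are invented for the example): delta coding, the one-element case of LPC, is applied to a block of 8-bit samples before a standard Deflate encoder, and for smooth signals the delta-coded block typically compresses much better than the raw one:

    import math
    import zlib

    def delta_encode(samples: list[int]) -> bytes:
        # One-element LPC / delta coding: store differences between neighbouring bytes.
        prev = 0
        deltas = bytearray()
        for s in samples:
            deltas.append((s - prev) & 0xFF)   # wrap each difference into an 8-bit byte
            prev = s
        return bytes(deltas)

    # A toy "audio" block: a slowly varying 8-bit signal of 4096 samples.
    signal = [int(128 + 100 * math.sin(i / 40)) & 0xFF for i in range(4096)]

    print(len(zlib.compress(bytes(signal), 9)))         # raw samples
    print(len(zlib.compress(delta_encode(signal), 9)))  # delta-coded: usually much smaller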
It's not good to apply algorithms meant for words to bytes or bits. One example is PNG (shows good performance, see http://go.to/artest ); the other is 7-Zip ( http://www.7-zip.com ), which gives compression up to 20% better than PKzip, Info-zip, GZip, WinZip etc. (for example, BINARIES.DAT from http://artest1.tripod.com/artest18.zip : 457724 bytes with "7zip a -mx", 548844 with "pkzip -exx", 551255 with "gzip -9"), because it has the knowledge that sometimes even very long strings are better compressed if regarded as bytes, not as words.

It's not good to apply algorithms meant for bytes to words or bits (the very first compressors did exactly that: the Huffman method, the only one known at that time, was applied to all types of data; the question about their structure didn't even arise - one method exists, nothing to choose from).

But we shouldn't completely forget about "incorrect", "unpractical" data: for example, blocks of words that are better compressed if RLE is applied first, before all other methods, are certainly possible (RLE for words: assumes that every word is written N times, N>1). Or blocks of bits that will be compressed better with PPM: poorly compressed data - .ZIP, .MP3, .GIF, .JPG, .PNG files and so on.

Q8: What can be said about static and adaptive coding?

A8: In that distant past, when computers were huge and files were small, the problem of extracting logically different fragments from a given file did not arise. First, there were relatively fewer files with diverse data, and also fewer types of data: byte dimensionality rarely reached even 24 or 32. Second, those logically different fragments were so small that the expense of describing their borders was comparable with the gain from using them. Recording a new binary tree (methods of Huffman and Shannon-Fano, 8-bit bytes) takes a few hundred bytes; moreover, the complexity, time and memory requirements of both the compression and decompression algorithms can increase 2 or 3 times. Therefore it's meaningless to pay attention to fragments shorter than a few kilobytes, especially if the average file length is less than 100 kilobytes and the probability of diversity in a file's contents is less than 1/100. It's different now: the average length is about ten times bigger, and substantial diversity of contents is also about ten times more probable. Theoretically, any compression algorithm can be made both "entirely static" (all parameters are assigned at the start only) and "entirely adaptive" (all parameters are periodically recalculated). But in practice, when saying "static or adaptive", people do honour to the very first compression method, described by David Huffman in 1952.

Q9: How to detect whether a given file contains blocks of bits, bytes or words?

A9: This is one of the most pressing, lengthy and interesting questions. As can be seen from the reports at http://compression.ca and http://go.to/artest , wrong treatment of blocks leads to compression up to 50 times worse, especially on so-called "multimedia" files (take, for example, ASTAV.DAT from ftp://ftp.cdrom.com/pub/simtelnet/msdos/astronmy/mpla_eng.zip , 125K). Most files contain blocks of words of 8- or 16-bit bytes, while "multimedia" (graphics, video, audio) files are, as a rule, blocks of generalized bytes (or very short words) of 8, 16, 24 or 32 bits. That's why the easiest way is to look at the file's extension and/or its header, and/or to test a fragment of a few kilobytes and take the method that does best on this fragment, as sketched below. It's not nice to apply algorithms intended for words to bytes or even bits, and it's not much better to apply algorithms for bytes to words or bits.
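A minimal sketch of the "test a fragment" approach from A9 (Python, illustration only; the probe size and the candidate set are arbitrary, and a real archiver would try its own primary/secondary combinations rather than these library codecs):

    import bz2
    import lzma
    import zlib

    def pick_method(block: bytes, probe: int = 8192) -> str:
        # Compress only a small prefix with each candidate and report the winner;
        # the whole block would then be compressed with the chosen method.
        fragment = block[:probe]
        candidates = {
            "deflate (LZ77 + Huffman)": lambda d: zlib.compress(d, 9),
            "bzip2 (BWT + MTF + Huffman)": lambda d: bz2.compress(d, 9),
            "LZMA (LZ variant + range coder)": lambda d: lzma.compress(d),
        }
        sizes = {name: len(encode(fragment)) for name, encode in candidates.items()}
        return min(sizes, key=sizes.get)

    with open("ASTAV.DAT", "rb") as f:   # any test file will do
        print(pick_method(f.read()))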
Q10: Any other methods besides this easiest one?

A10: The following way takes much more time and programming. Assume that we have an algorithm to measure the orderliness (m.o.) in a given block of words, returning a value from 0 (complete chaos, all words are different) to 1 (complete order, all words are identical); an algorithm for the m.o. in a given block of bytes; and also an algorithm for the m.o. in a given block of bits. Then we simply assume that we've got a block of words and compute the first value W, applying the first algorithm; then assume it's a block of bytes and get the second number Y; and finally do the same with the third measure B. Comparing these three numbers (they satisfy 0 <= W <= 1, 0 <= Y <= 1, 0 <= B <= 1) we easily see which structure our examined block corresponds to best. If we admit that a byte can be either 8-, 16- or 32-bit, we need to calculate three values for bytes and three for words, seven in all.

Q11: Looks like all three algorithms do something very similar, but with different elements - bits, bytes, words?

A11: Exactly so. The simplest example, when the elements are bits, and the algorithm first splits the initial block into logically different fragments (so that, for example, a block with a hundred "zeroes" followed by a hundred "ones" is processed as two "good" fragments, not as one "bad" one):

    B = ( |n[1] - N/2| + |n[0] - N/2| ) / N

where N is the number of bits in the block, n[0] the number of "zeroes", n[1] the number of "ones". As we see, B=0 if n[0]=n[1]=N/2, and B=1 if n[0]=N or n[1]=N. The same trivial example for the case of 8-bit bytes:

    Y = ( |n[255] - N/256| + ... + |n[0] - N/256| ) * 128 / (N*255)

Y=0 if n[255]=...=n[0]=N/256; Y=1 if n[255]=N, or n[254]=N, ... , or n[0]=N (both measures are implemented in the sketch given after A12).

Q12: Quite simple for bits and bytes, but what about words? They can consist of 8-, 16- or 32-bit bytes, while their lengths can be 2, 3, ..., L bytes, where L is close to 10.

A12: As in any research - ideas, assumptions, modeling, testing, optimization... Searching for a better algorithm is a very fascinating process. You'll surely like it! The future of data compression is certain, and success is always near. New compression tasks will appear 50 and 100 years after these words are written, and new algorithms will be developed to better solve those tasks. First, 8-bit elements will soon become as rare as 16-bit elements were ten or twenty years ago. Second, we are presently limited to one-dimensional, not content-addressable memory; though some processors can treat memory as two-dimensional, they are relatively rare and novel. Finally, if we look at the definition of compression, there is no further restriction on what a description may be. It can certainly reference everything that the destination can always access. But when the referenced data is inaccessible, the description is undecipherable. Thus, if we only assume that all references used in the description will be accessible, but don't know it exactly, we do "potentially lossy" compression. Only non-physical information is always accessible: it doesn't depend on material objects (see the prologue), and its size is infinite. And when we study how it can be applied (wavelets and fractals, for example) - we actually study how we depend on it.
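A minimal sketch of the two measures from A11 (Python, illustration only; the preliminary splitting into logically different fragments is omitted, so each function scores a single fragment):

    from collections import Counter

    def bit_measure(bits: list[int]) -> float:
        # B = ( |n[1] - N/2| + |n[0] - N/2| ) / N
        # 0 when zeroes and ones are balanced, 1 when only one value occurs.
        n = len(bits)
        ones = sum(bits)
        zeroes = n - ones
        return (abs(ones - n / 2) + abs(zeroes - n / 2)) / n

    def byte_measure(block: bytes) -> float:
        # Y = ( |n[255] - N/256| + ... + |n[0] - N/256| ) * 128 / (N*255)
        # 0 for a perfectly flat histogram, 1 when a single byte value fills the block.
        n = len(block)
        counts = Counter(block)
        total = sum(abs(counts.get(v, 0) - n / 256) for v in range(256))
        return total * 128 / (n * 255)

    print(bit_measure([0, 1] * 100), bit_measure([1] * 200))                  # 0.0 1.0
    print(byte_measure(bytes(range(256)) * 8), byte_measure(b"\x00" * 2048))  # 0.0 1.0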
Epilogue
========

RHLAOC
~~~~~~
HLAO     http://www.faqs.org/faqs/compression-faq/part2/
HL       http://www.cs.sfu.ca/CC/365/li/squeeze/
O        http://www.data-compression.com/theory.html
HL       http://www.data-compression.com/lossless.html
RHL-O-   http://www.dspguide.com/datacomp.htm
RH-      http://www.go2net.com/internet/deep/1997/01/01/
RHL      http://www.ganssle.com/articles/acompres.htm
RHLAO    http://www.newmediarepublic.com/dvideo/compression/adv04.html
H O-     http://www.rdrop.com/~cary/html/data_compression.html
RHLAO-   http://www.arturocampos.com/cp_ch1.html
RHLA-    http://w3.antd.nist.gov/cag/lecser_ay1/lecser_ay1.html
RH- -    http://www.ccs.neu.edu/groups/honors-program/freshsem/19951996/jnl22/jeff.html
RHLAO    http://www.ece.umn.edu/users/kieffer/ece5585.html
RHLAO-   http://www.ics.uci.edu/~dan/pubs/DataCompression.html
RHL-O    http://www.ils.unc.edu/~willc/dcfulltext.html
L O      http://www.cs.pdx.edu/~idr/unbzip2.cgi?compression/acb_compress.html.bz2
R-L-O    http://cswww.willamette.edu/~sarnold/cs443/compression.html
RHLAO-   http://www.cs.tut.fi/~albert/Dev/pucrunch/packing.html
RHLA     http://vectorsite.tripod.com/ttdcmp1.html
R L O    http://members.aol.com/breunling/obcompr.htm
RHLAO-   http://www.rasip.fer.hr/research/compress/algorithms/index.html
HLA -    http://www.ifi.uio.no/~ftp/publications/cand-scient-theses/SHuseby/html/node41.html
RHLAO    http://www.cs.su.oz.au/~symvonis/teaching/cs4-data-compression.html
-L- -    http://www.image.cityu.edu.hk/~loben/thesis/node22.html
L O      http://home.uleth.ca/~borrtj/pres/index.htm
RHL-O-   http://www.scit.wlv.ac.uk/~c9581158/main.html
RHLA -   http://www.eee.bham.ac.uk/WoolleySI/All7/body0.htm
RHLAO-   http://wannabe.guru.org/alg/node163.html

R - page contains a description of Run Length Encoding
H - description of Huffman Coding
L - Lempel-Ziv methods
A - Arithmetic Coding
O - (at least some words about) other methods
C - attempts to classify types of data and/or compression methods

Unfortunately, in all these overviews and introductions no useful words were found about:
- classification of types of digital data;
- classification of compression algorithms, the "tree" of methods;
- which algorithms are appropriate for which data types, i.e. no answer to the most pressing question:

**********************************************************************
*        How to choose the most effective algorithm,                *
*        having nothing but the data itself ?                       *
*        (among hundreds existing and new appearing every day)      *
**********************************************************************

This text in English (ver. 1.4 now) can be found at http://geocities.com/eri32/int.htm
The Russian version (ver. 1.4) is at http://geocities.com/eri32/intro.htm

With best kind regards and excuses for misprints,
A. Ratushnyak, http://go.to/artest

