Clearly, the more bytes transmitted, the greater the cost to the sender, so data compression is often employed to reduce transmission bills. It is important to note that data compression is closely related to data representation. The data sent over a channel can be viewed as a sequence of symbols drawn from some (possibly infinite) symbol set. Data compression can be approached in three general ways:
Encoding a Finite Set of Equally Likely Symbols
In many applications the messages are drawn from a finite set and expressed in ASCII. For example, the titles in a library's collection might be usefully regarded as a finite set of symbols. Suppose that each day a complete list of the books requested were sent to each branch library, so that every branch would know which books were most in demand and where. The daily transmission might consist of a branch number, followed by the list of all the titles requested at that branch that day.
A book title typically runs to about 20 characters and thus, represented in 7-bit ASCII, would require about 140 bits. Obviously no library in the world has 2^140 books, and hence by simply giving each book a sequence number it is possible to reduce the number of bits needed per title. Although the receiver must have the numbered list, this need only be sent once and can even be transmitted by mailing a magnetic tape.
In cases where an occasional reference is made to an item not in the numbered list, the name can be spelled out in full, using an escape convention.
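As an illustration, the following Python sketch (hypothetical names throughout, including the ESCAPE marker) encodes requests as small integer indices into a shared numbered list, spelling out in full only those titles that are not in it.

    import math

    # Hypothetical numbered list agreed in advance by both libraries.
    catalogue = ["A Tale of Two Cities", "Moby Dick", "Middlemarch"]
    index_of = {title: i for i, title in enumerate(catalogue)}

    ESCAPE = -1  # signals that the full title follows, spelled out in ASCII

    def encode_requests(titles):
        # Replace each known title by its sequence number; unknown titles
        # are sent after the escape marker.
        return [index_of[t] if t in index_of else (ESCAPE, t) for t in titles]

    # With N titles in the list, an index needs only ceil(log2 N) bits,
    # far fewer than the roughly 140 bits of a 20-character ASCII title.
    print(encode_requests(["Moby Dick", "An Unlisted Pamphlet"]))
    print(math.ceil(math.log2(len(index_of))), "bits per indexed title")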
Frequency Dependent Coding (Huffman)
In virtually all text, some symbols occur more often than others. In English text the letter "E" occurs roughly 100 times more often than the letter "Q", and the word "THE" occurs about 10 times more often than the word "BE".
This observation suggests an encoding scheme in which common symbols are assigned short codes and rare symbols are assigned long codes.
For example, consider the storage of four symbols A, B, C and D, with occurrence probabilities of 0.50, 0.30, 0.15 and 0.05 respectively. By encoding them as 0, 10, 110 and 111 (instead of 00, 01, 10 and 11), the average symbol now requires only 1.7 bits.
In light of the above example, it is interesting to ask what the minimum number of bits per symbol is. Information theory provides the answer: the minimum is the entropy of the symbol distribution. Unfortunately, there is no way to achieve this theoretical minimum with independently coded symbols, because the ideal code lengths are, in general, fractional numbers of bits.
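For the four-symbol example above, the short Python calculation below (a sketch of the arithmetic only) evaluates that entropy to roughly 1.65 bits per symbol, just under the 1.7 bits achieved by the variable-length code.

    import math

    # Symbol probabilities from the example above.
    probs = {"A": 0.50, "B": 0.30, "C": 0.15, "D": 0.05}

    # Entropy: the theoretical minimum number of bits per symbol.
    entropy = -sum(p * math.log2(p) for p in probs.values())

    # Average length of the variable-length code 0, 10, 110, 111.
    lengths = {"A": 1, "B": 2, "C": 3, "D": 3}
    average = sum(probs[s] * lengths[s] for s in probs)

    print(round(entropy, 2), "bits/symbol minimum")   # about 1.65
    print(round(average, 2), "bits/symbol achieved")  # 1.7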
However, an algorithm developed by Huffman can be used to produce a reasonable approximation to this theoretical limit. Huffman coding exploits the property that not all symbols in a transmitted frame occur with the same frequency; for example, in a frame comprising strings of characters, certain characters occur more often than others. This approach to encoding is therefore a form of statistical encoding. The algorithm, now known as Huffman coding, is outlined below.
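A minimal sketch of the procedure, assuming Python and the probabilities of the earlier example, is given here: the two least frequent unmarked nodes are repeatedly merged into a new node until a single tree remains, and each symbol's codeword is read off the arcs on the path to it.

    import heapq
    from itertools import count

    def huffman_code(freqs):
        # Build a binary Huffman code from a symbol -> frequency mapping by
        # repeatedly merging the two least frequent nodes into a new node.
        tie = count()  # tie-breaker so equal frequencies never compare dicts
        heap = [(f, next(tie), {s: ""}) for s, f in freqs.items()]
        heapq.heapify(heap)
        while len(heap) > 1:
            f1, _, left = heapq.heappop(heap)   # least frequent node
            f2, _, right = heapq.heappop(heap)  # next least frequent node
            merged = {s: "0" + c for s, c in left.items()}
            merged.update({s: "1" + c for s, c in right.items()})
            heapq.heappush(heap, (f1 + f2, next(tie), merged))
        return heap[0][2]

    print(huffman_code({"A": 0.50, "B": 0.30, "C": 0.15, "D": 0.05}))
    # Produces an optimal code such as A->0, B->11, C->101, D->100, which
    # differs from the code listed earlier but has the same 1.7-bit average.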
Huffman coding can also be done in radices other than 2. For example, by choosing the 256 smallest unmarked nodes at each step, and having 256 arcs radiate from each intermediate node, we get a code in which each symbol is an integral number of bytes.
Context Dependent Encoding
Lastly, another approach to data compression is to devise a scheme that determines the conditional probability of each symbol occurring after each possible predecessor. With letters as symbols, this would result in 26 tables: one for the frequency distribution following an "A", one for the frequency distribution following a "B", and so on. If there are strong correlations between symbols and their successors, this method yields large savings.
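A sketch of the idea in Python (hypothetical, letters only): one frequency table is built per predecessor letter from sample text, and each table could then drive its own statistical code.

    from collections import Counter, defaultdict

    def conditional_tables(text):
        # Count, for each letter, how often every successor letter follows it,
        # giving one frequency table per possible predecessor.
        tables = defaultdict(Counter)
        letters = [c for c in text.upper() if c.isalpha()]
        for prev, nxt in zip(letters, letters[1:]):
            tables[prev][nxt] += 1
        return tables

    tables = conditional_tables("the theory of context dependent encoding")
    print(tables["T"].most_common())  # letters most likely to follow a 'T'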
One application of this approach is the digital transmission of colour television. The signal can be thought of as a sequence of frames (usually 25 or 30 per second), each containing a rectangular array of picture elements called pixels. One simple scheme might be an image of 1000 by 600 pixels, with each pixel being a 16-bit number: fifteen of these bits would encode the red, green and blue intensities, and the last bit would distinguish data from control signals.
A straight encoding of this information would require 9.6 megabits per frame (600,000 pixels at 16 bits each), giving a data rate of 240 megabits/sec at 25 frames per second. By the Nyquist theorem, transmitting 240 megabits/sec with a two-level signal requires a bandwidth of at least 120 MHz. Since current analogue television uses only 6 MHz per channel, this digital scheme is not likely to find widespread acceptance in a world where the spectrum is already oversubscribed.
However, it should be noted that most frames are almost identical to their predecessors. If the television set had a buffer capable of holding one frame, only the differences between successive frames would have to be transmitted, giving a significant saving in the amount of information sent.
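A sketch of this difference scheme, assuming Python lists of 16-bit pixel values standing in for two successive frames: only the pixels that changed, together with their positions, need to be sent, and the receiver patches its buffered copy of the previous frame.

    def frame_difference(previous, current):
        # List (position, new_value) for every pixel that changed between
        # two successive frames of equal size.
        return [(i, new) for i, (old, new) in enumerate(zip(previous, current))
                if old != new]

    def apply_difference(previous, changes):
        # Rebuild the current frame from the buffered previous frame.
        frame = list(previous)
        for i, new in changes:
            frame[i] = new
        return frame

    prev_frame = [0x7FFF] * 8      # a tiny 8-pixel "frame" for illustration
    curr_frame = list(prev_frame)
    curr_frame[3] = 0x1234         # only one pixel changes
    changes = frame_difference(prev_frame, curr_frame)
    print(changes)                                              # [(3, 4660)]
    print(apply_difference(prev_frame, changes) == curr_frame)  # True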