Technical Notice

 
   
     
 

Extract from Wikipedia (the complete article can be see here)

In information theory, the Shannon entropy or information entropy is a measure of the uncertainty associated with a random variable. It quantifies the information contained in a message, usually in bits or bits/symbol. It is the minimum message length necessary to communicate information.

This also represents an absolute limit on the best possible lossless compression of any communication: treating a message as a series of symbols, the shortest possible representation to transmit the message is the Shannon entropy in bits/symbol multiplied by the number of symbols in the original message.

A fair coin has an entropy of one bit. However, if the coin is not fair, then the uncertainty is lower (if asked to bet on the next outcome, we would bet preferentially on the most frequent result), and thus the Shannon entropy is lower. A long string of repeating characters has an entropy of 0, since every character is predictable. The entropy of English text is between 1.0 and 1.5 bits per letter.

Equivalently, the Shannon entropy is a measure of the average information content the recipient is missing when he does not know the value of the random variable.

The concept was introduced by Claude E. Shannon in his 1948 paper "A Mathematical Theory of Communication".

Definition

The information entropy of a discrete random variable X, that can take on possible values {x1...xn} is

where

I(X) is the information content or self-information of X, which is itself a random variable; and

p(xi) = Pr(X=xi) is the probability mass function of X.

Characterization

Information entropy is characterised by these desiderata:

(Define and )

Continuity

The measure should be continuous — i.e., changing the value of one of the probabilities by a very small amount should only change the entropy by a small amount.

Symmetry

The measure should be unchanged if the outcomes xi are re-ordered.

Maximum

If all the outcomes are equally likely, then entropy should be maximal (uncertainty is highest when all possible events are equiprobable). In this case, the entropy increases with the number of outcomes.

Additivity

The amount of entropy should be the same independently of how the process is regarded as being divided into parts.

This last functional relationship characterizes the entropy of a system with sub-systems. It demands that the entropy of a system can be calculated from the entropy of its sub-systems if we know how the sub-systems interact with each other.

Assume that we have an ensemble of n elements with a uniform distribution on them. If we mentally divide this ensemble into k boxes (sub-systems) with bi elements in each, the entropy can be calculated as a sum of individual entropies of the boxes weighed by the probability of finding oneself in that particular box PLUS the entropy of the system of boxes.

For positive integers bi where b1 + ... + bk = n,

This implies that the entropy of a certain outcome is zero:

Any definition of entropy satisfying these assumptions has the form:

Where K is a constant corresponding to a choice of measurement units.Information entropy explained

For a random variable with outcomes , the Shannon information entropy, a measure of uncertainty (see further below) and denoted by , is defined as

(1)

Where is the probability mass function of outcome, and is the base of the logarithm used. Possible values of are 2, , and 10. The unit of the information entropy is bit for , nat for , dit (or digit) for . For the uninitiated, it is hard to develop a feel for the totally abstract expression in Eq.(1), which could be a big turn-off for further exploring this beautiful theory.

To understand the meaning of Eq.(1), let's first consider a set of possible outcomes (events) , with equal probability . An example would be a fair die with values, from to . The uncertainty for such set of outcomes can be measured by

(2)

The logarithm is used so to provide the additivity characteristic for independent uncertainty. For example, consider appending to each value of the first die the value of a second die, which has possible outcomes . There are thus possible outcomes . The uncertainty for such set of outcomes is then

(3)

Thus the uncertainty of playing with two dice is obtained by adding the uncertainty of the second die to the uncertainty of the first die .

Now return to the case of playing with one die only (the first one); we can write

In the case of a non-uniform probability mass function (or distribution in the case of continuous random variable), we let

(4)

which is also called a surprisal; the lower the probability , i.e. , the higher the uncertainty or the surprise, i.e. , for the outcome \displaystyle x_i

The average uncertainty , with being the average operator, is obtained by

(5)

and is used as the definition of the information entropy in Eq.(1). The above also explained why information entropy and information uncertainty can be used interchangeably.

 

 

 

 
 
 
 
 

S = the signal, N = the noise

 

Schematic diagram of a communication system

 

Average Uncertainty

 

Entropy

Download "a mathematical theory of communication"