|
Extract from Wikipedia (the complete
article can be see here)
In information theory, the Shannon entropy or
information entropy is a measure of the uncertainty associated with a random
variable. It quantifies the information contained in a message, usually in
bits or bits/symbol. It is the minimum message length necessary to communicate
information.
This also represents an absolute limit on
the best possible lossless compression of any communication: treating a message as a series of symbols, the shortest
possible representation to transmit the message is the Shannon entropy in
bits/symbol multiplied by the number of symbols in the original message.
A fair coin has an entropy of one bit. However, if the coin is not fair, then the
uncertainty is lower (if asked to bet on the next outcome, we would bet
preferentially on the most frequent result), and thus the Shannon entropy is
lower. A long string of repeating characters has an entropy of 0, since every
character is predictable. The entropy of English text is between 1.0 and 1.5
bits per letter.
Equivalently, the Shannon entropy is a
measure of the average information content the recipient is missing when he does not know the value of the
random variable.
The concept was introduced by Claude
E. Shannon in his 1948 paper "A Mathematical Theory of
Communication".
Definition
The information entropy of a discrete random variable X, that can take on possible values {x1...xn} is

where
I(X) is the information content or self-information of X, which is itself a random variable; and
p(xi) = Pr(X=xi) is the probability mass function of X.
Characterization
Information entropy is characterised by these desiderata:
(Define and )
Continuity
The measure should be continuous — i.e., changing the value of
one of the probabilities by a very small amount should only change the entropy
by a small amount.
Symmetry
The measure should be unchanged if the
outcomes xi are re-ordered.
Maximum
If all the outcomes are equally likely,
then entropy should be maximal (uncertainty is highest when all possible events
are equiprobable). In this case, the entropy increases with the number of
outcomes.
Additivity
The amount of entropy should be the same
independently of how the process is regarded as being divided into parts.
This last functional relationship
characterizes the entropy of a system with sub-systems. It demands that the
entropy of a system can be calculated from the entropy of its sub-systems if we
know how the sub-systems interact with each other.
Assume that we have an ensemble of n elements with a uniform distribution on them. If we mentally divide
this ensemble into k boxes (sub-systems)
with bi elements in each, the
entropy can be calculated as a sum of individual entropies of the boxes weighed
by the probability of finding oneself in that particular box PLUS the entropy
of the system of boxes.
For positive
integers bi where b1 + ... + bk = n,
This implies that the entropy of a
certain outcome is zero:

Any definition of entropy satisfying
these assumptions has the form:

Where K is a constant corresponding to a choice of measurement units.Information entropy explained
For a random variable with outcomes , the Shannon information entropy, a measure of uncertainty (see further
below) and denoted by , is defined as

|
(1) |
Where is the probability mass function of outcome , and is the base of the logarithm used. Possible values of are 2, , and 10. The unit of the information entropy is bit for , nat for , dit (or
digit) for . For the uninitiated, it is hard to develop a feel for the totally
abstract expression in Eq.(1), which could be a big turn-off for further
exploring this beautiful theory.
To understand the meaning of Eq.(1),
let's first consider a set of possible outcomes (events) , with equal probability . An example would be a fair die with values, from to . The uncertainty for such set of outcomes can be measured by

|
(2) |
The logarithm is used so to provide the
additivity characteristic for independent uncertainty. For example, consider
appending to each value of the first die the value of a second die, which has possible outcomes . There are thus possible outcomes . The
uncertainty for such set of outcomes is then

|
(3) |
Thus the uncertainty of playing with two
dice is obtained by adding the uncertainty of the second die to the uncertainty of the first die .
Now return to the case of playing with
one die only (the first one); we can write

In the case of a non-uniform probability
mass function (or distribution in the case of continuous random variable), we
let

|
(4) |
which is also called a surprisal;
the lower the probability , i.e. , the higher the
uncertainty or the surprise, i.e. , for the
outcome 
The average uncertainty , with being the
average operator, is obtained by

|
(5) |
and is used as the definition of the
information entropy in Eq.(1). The above also explained why information entropy and information uncertainty can be used interchangeably.
|