ITDM - 01 - Entropy
1 Lecture Info
Date:
2 Probability Theory
Probability theory studies the properties and patterns of random experiments.
Def: A random variable is a real-valued number \(X(\omega)\) assigned to every outcome of a stochastic experiment, such that \(\forall x \in \mathbb{R}\), the set
\[\{\omega: X(\omega) \leq x\}\]
is an event.
Observation: The function \(X: \Omega \to \mathbb{R}\) defining the random variable is not random. What is actually random is the outcome \(\omega \in \Omega\): the randomness of \(X(\omega)\) comes entirely from \(\omega\), not from the mapping \(X\).
When dealing with random variables, the following facts are useful to remember
- The outcome of an experiment is denoted as \(\omega\) and it is uniquely determined by carrying out the experiment.
- The set of all possible outcomes is denoted by \(\Omega\) and is called the sample space.
- A point \(\omega \in \Omega\) is called a sample point, while a subset \(A \subseteq \Omega\) is called an event.
We have two different types of random variables, depending on the range of the function, i.e. the set of values it can take (a short sketch follows this list):
- Discrete random variables: If the range is countable, then we define the following functions of interest
  - Cumulative Distribution Function:
    \[F_X(x) := P(X \leq x)\]
  - Probability Mass Function:
    \[p_X(x_i) := P(X = x_i)\]
- Continuous random variables: If the range is an uncountable subset of \(\mathbb{R}\) (e.g. an interval), then instead of a probability mass function we have a probability density function, denoted by \(f_X(x)\), and a cumulative distribution function characterized by the following equation
\[F_X(x) := \int\limits_{-\infty}^x f_X(t) dt\]
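As an illustration, the following minimal Python sketch builds the PMF and CDF of a simple discrete random variable, a fair six-sided die; the die and its uniform probabilities are assumed only for this example.

```python
# PMF and CDF of a simple discrete random variable: a fair six-sided die
# (the die and its uniform probabilities are assumed only for illustration).
values = [1, 2, 3, 4, 5, 6]          # possible values x_i of X
pmf = {x: 1 / 6 for x in values}     # p_X(x_i) = P(X = x_i)

def cdf(x: float) -> float:
    """F_X(x) := P(X <= x), obtained by summing the PMF up to x."""
    return sum(p for xi, p in pmf.items() if xi <= x)

print(cdf(3))   # ~0.5  (up to floating-point rounding)
print(cdf(6))   # ~1.0  -- the CDF reaches 1 at the largest value
```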
3 Information
Information is an abstract term which can be interpreted as additional knowledge. Information is thus received when a question is answered. Once obtained, we can use information to modify our behaviors.
Observation: Modern information theory is based on the following articles:
- "Certain Factors Affecting Telegraph Speed", Harry Nyquist, 1924.
- "Trasmission of Information", Ralph V. Hartley, 1928.
- "A Mathematical Theory of Communication", C.E. Shannon, 1948.
Shannon was the first to use a probability model for a communication system. Since Shannon, information has always been linked to probability and randomness.
4 Self-Information
To define a useful measure of information we start with the following basic assumptions:
- If no knowledge is gained (zero knowledge), then no information is received either.
- If \(I\) is the amount of information, then \(I \geq 0\) and there is no upper bound.
Combining the two we get \(I \in [0, \infty)\).
Information and probability are intuitively linked in the following way:
- Events with low probability give high information if they happen, and
- Events with high probability give low information if they happen.
We thus arrive at the following definition
Definition: Let \(x_i\) be the outcome of the experiment. Then the self-information of the event \((X = x_i)\) is defined as follows
\[I_a(x_i) = \log_a \frac{1}{p_X(x_i)}\]
where \(a \in \mathbb{R}^+\).
Usually \(a = 2\), and the unit of measure is called bit (binary unit).
Observation: Initially the word bit was used to represent a unit of measure, and not a binary digit. You can thus have a real number of bits.
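As a small numerical sketch of this definition, the self-information can be computed directly from the probability of an event; the probabilities below are arbitrary examples.

```python
import math

def self_information(p: float, a: float = 2.0) -> float:
    """I_a(x_i) = log_a(1 / p_X(x_i)) for an event of probability p."""
    return math.log(1.0 / p, a)

# With a = 2 the result is measured in bits, and it need not be an integer.
print(self_information(0.5))   # 1.0 bit  (a fair coin flip)
print(self_information(0.25))  # 2.0 bits
print(self_information(0.3))   # ~1.74 bits -> a real number of bits
```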
4.1 Properties of Self-Information
The self-information measure we have just defined has the following properties
- Continuity
- Non-negativity
- Monotonicity (the lower the probability of an event, the higher its self-information)
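For example, monotonicity can be checked on two events with probabilities \(p_1 = 1/2\) and \(p_2 = 1/8\):
\[\log_2 \frac{1}{1/2} = 1 \text{ bit} \;<\; \log_2 \frac{1}{1/8} = 3 \text{ bits}\]
so the less likely event carries more self-information.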
5 Entropy
The self-information measure is defined for a single event and it can be computed only after the experiment is over and the outcome is known.
We thus introduce the concept of entropy which allows us to compute the amount of information of a random experiment before the experiment is actually carried out.
Definition: Given a discrete random variable \(X\) with \(N_X\) possible values, the entropy of \(X\) is defined as
\[H(X) := \mathbb{E}[I(X)] = \sum\limits_{i = 1}^{N_X} p(x_i) \cdot I(x_i) = \sum\limits_{i = 1}^{N_X} p(x_i) \cdot \log_2 \frac{1}{p(x_i)} = -\sum\limits_{i = 1}^{N_X} p(x_i) \log_2{p(x_i)}\]
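A minimal Python sketch of this definition, assuming an example PMF chosen only for illustration:

```python
import math

def entropy(pmf: dict) -> float:
    """H(X) = -sum_i p(x_i) * log2(p(x_i)); terms with p = 0 contribute 0."""
    return -sum(p * math.log2(p) for p in pmf.values() if p > 0)

# Example PMF of a loaded four-sided die (the probabilities are an assumption).
pmf = {1: 0.5, 2: 0.25, 3: 0.125, 4: 0.125}
print(entropy(pmf))  # 1.75 bits
```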
5.1 Properties of Entropy
Entropy has the following properties
- Continuity.
- Non-Negativity.
- Concavity.
In particular, the concavity of \(H(X)\) means that the entropy function has a unique maximum, which can be computed.
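As a concrete check in the binary case (anticipating the coin-toss example below), for a variable taking two values with probabilities \(p\) and \(1 - p\),
\[H(p) = -p \log_2 p - (1 - p) \log_2 (1 - p), \qquad \frac{dH}{dp} = \log_2 \frac{1 - p}{p} = 0 \iff p = \frac{1}{2}\]
so the unique maximum is attained at \(p = 1/2\), where \(H = 1\) bit.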
5.2 Example
Consider the typical coin tossing problem with \(P(H) = P(T) = 1/2\). By computing the entropy we get
\[H(X) = \frac12 \cdot \log_2{2} + \frac12 \cdot \log_2{2} = 1\]
The binary entropy function \(H(p)\) is maximized when \(p = 1/2\).
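A short Python sketch evaluating the binary entropy at a few arbitrarily chosen values of \(p\), confirming that the fair coin is the most uncertain case:

```python
import math

def binary_entropy(p: float) -> float:
    """H(p) = -p*log2(p) - (1 - p)*log2(1 - p), with H(0) = H(1) = 0 by convention."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# Entropy is largest for the fair coin and decreases as the coin becomes biased.
for p in (0.1, 0.25, 0.5, 0.75, 0.9):
    print(f"H({p}) = {binary_entropy(p):.3f} bits")
```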