Physics/A/Shannon entropy

Simple information entropy
Figure 1 illustrates the fact that flipping two coins can yield four different outcomes (represented as four pairs of flipped coins). Using ($H$,$T$) to represent (heads, tails), we can list these four outcomes:


 * $HH$, $HT$, $TH$, and $TT$.

We shall use the symbol $\Omega$ (Omega) to denote the number of messages that can be sent by these two coins: $\Omega = 4$ outcomes.



The popularity of this inspired me to seek other ways to use coins to understand what is known as Shannon's entropy. Table 1 extends Figure 1 to include the relationship between entropy $$S$$ and the number of messages $$\Omega$$ (Omega) one might generate by displaying coins as either "heads" or "tails". But neither the table, the figure, nor even the formula,


 * $$S=\log_2\Omega,$$

captures the full complexity associated with Shannon's information entropy. A hint at this complexity can be seen in the following question:

If entropy is equivalent to the number of coins used to convey information, how should one deal with a "one-sided coin"?

Such a question must be viewed outside the scope of mathematics. The fact that one rarely hears "fire!" in a crowded theater does not remove that word from the lexicon of how the audience should behave in a theater. A one-sided coin amounts to a base-1 "alphabet": Shannon's entropy, $$H,$$ was defined so that such coins contribute nothing to the entropy.

''Define $S$ as simple entropy and note that it is the number of "coins" used. Shannon's entropy, $H$, is defined so that it also depends on the probability with which a coin is used. Connect probability to frequency.''

$$H\le S =N$$
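This inequality is easy to check numerically. The sketch below (the helper name `shannon_entropy` is mine, chosen for clarity) computes $H$ for $N=3$ fair coins, where $H=S=N$, and for a "one-sided coin", which contributes nothing:

```python
from math import log2

def shannon_entropy(probs):
    """Shannon entropy H = -sum p*log2(p) in bits; 0*log2(0) is taken as 0."""
    return -sum(p * log2(p) for p in probs if p > 0)

N = 3  # number of coins; simple entropy S = N bits

# Fair coins: every one of the 2^N messages is equally likely.
fair = [1 / 2**N] * 2**N
print(shannon_entropy(fair))  # 3.0 bits, so H = S = N

# A "one-sided coin": one message sent with certainty carries no information.
certain = [1.0]
print(shannon_entropy(certain))  # 0.0 bits
```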

Additive property I
Figure 2 illustrates how Elephant communicates with the animal kingdom every morning by displaying 5 coins, using his trunk to indicate where the sequence begins. As shown in Figure 3, Elephant asks Crow to display the first 3 coins and Rat to display the last two coins. When acting on behalf of Elephant, Crow and Rat each point to the first coin in their message, and the animal kingdom understands that Crow is to be read first. In this case Crow and Rat are not sending independent signals, but are instead following Elephant's instructions. On the other hand, if Crow and Rat act independently, Crow controls 3 bits of entropy, while Rat controls 2 bits. The relationship between acting dependently and independently can be summarized as:

Elephant's total, or net, entropy $$S_\text{Tot}$$ equals the sum of the entropies controlled by Crow and Rat, and the number of messages Elephant can send equals the product of what Crow and Rat could send if they act independently:

$$S_\text{Tot}=S_\text{Crow}+S_\text{Rat}$$     $$\Leftrightarrow$$     $$\Omega_\text{Tot}=\Omega_\text{Crow}\cdot\Omega_\text{Rat}$$
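Because the logarithm turns products into sums, the equivalence can be verified directly; a minimal sketch using Crow's and Rat's message counts:

```python
from math import log2

omega_crow = 2**3  # Crow's 3 coins can show 8 messages
omega_rat = 2**2   # Rat's 2 coins can show 4 messages

S_crow = log2(omega_crow)             # 3.0 bits
S_rat = log2(omega_rat)               # 2.0 bits
S_tot = log2(omega_crow * omega_rat)  # entropy of all 32 combined messages

print(S_tot)           # 5.0
print(S_crow + S_rat)  # 5.0: multiplying message counts adds entropies
```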

Entropy is independent of the "alphabet" or "language" used
Neither our simple entropy $$S$$ nor the Shannon entropy $$H$$ distinguishes between the "languages" or "alphabets" one might use when sending messages. Rat could equivalently send the messages by displaying two coins. Or messages could be expressed using binary numbers, or even four Arabic numerals using the four-sided die shown in Figure 3.

First derivation of Shannon entropy
$$H = \underbrace{\sum p_j \log_2\left(p_j^{-1}\right)}_{\text{all possible messages}}\equiv -\underbrace{\sum p_j \log_2\left(p_j\right)}_{\text{all possible messages}}$$

Expectation value reviewed

 * Expected value; weighted average

The expectation, or "average", value of the four integers $$\{1, 2, 2, 5\}$$ is:

$$E(x)=\frac{1+2+2+5}{4}= 2.5 =\frac 1 4\cdot 1 + \frac 1 2\cdot 2 + \frac 1 4\cdot 5 $$

The fractions $1/4$, $1/2$, and $1/4$ refer to the probabilities of $x$ being $1$, $2$, and $5$, respectively. This permits us to write the expectation value as a sum involving all probabilities:

$$E(x)= 1\cdot p_{(x=1)}+2\cdot p_{(x=2)}+5\cdot p_{(x=5)}\rightarrow \sum_x p_x\cdot x $$
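The same expectation can be computed both ways; a short sketch (variable names are mine):

```python
from collections import Counter

values = [1, 2, 2, 5]

# Direct average of the list
mean = sum(values) / len(values)

# Probability-weighted sum: E(x) = sum_x p_x * x
counts = Counter(values)
expectation = sum((n / len(values)) * x for x, n in counts.items())

print(mean)         # 2.5
print(expectation)  # 2.5
```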

Additive property revisited
$$-H_T= \sum_{k=1}^{N_T}p_k\log p_k,$$ where $$N_T=N_1N_2,$$ and $$p_k=p_\alpha p_\beta,$$

where the logarithm's base is two: $$\log p_k\equiv\log_2(p_k).$$

$$-H_T=\sum_{i=1}^{N_1}\sum_{j=1}^{N_2}(p_ip_j)\log (p_ip_j)$$ $$= \sum_{i=1}^{N_1}\sum_{j=1}^{N_2}p_ip_j\cdot\underbrace{\left(\log p_i+\log p_j\right)}_{\log p_ip_j}$$ $$= \sum_{i=1}^{N_1}\sum_{j=1}^{N_2}\left(p_ip_j\log p_i + p_ip_j\log p_j\right)$$ $$ = \underbrace{\left(\sum_{j=1}^{N_2}p_j\right)}_{=1}\sum_{i=1}^{N_1}p_i\log p_i +\underbrace{\left(\sum_{i=1}^{N_1}p_i\right)}_{=1}\sum_{j=1}^{N_2}p_j\log p_j$$ $$ = \sum_{i=1}^{N_1}p_i\log p_i + \sum_{j=1}^{N_2}p_j\log p_j$$ $$ = -H_1-H_2$$
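This factorization can be checked with arbitrary (normalized) distributions; a minimal sketch, with `H` standing for Shannon entropy in bits:

```python
from math import log2
from itertools import product

def H(probs):
    """Shannon entropy in bits; 0*log2(0) is taken as 0."""
    return -sum(p * log2(p) for p in probs if p > 0)

# Arbitrary normalized distributions for the two independent senders
p_crow = [0.5, 0.25, 0.125, 0.125]
p_rat = [0.7, 0.3]

# Joint distribution when the two act independently: p_k = p_i * p_j
p_joint = [pi * pj for pi, pj in product(p_crow, p_rat)]

print(H(p_joint))            # equals H_1 + H_2
print(H(p_crow) + H(p_rat))
```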


Wheel of fortune


$$S_T=\log(n_C+n_R)$$$$= H_{CR} + \overbrace{\frac{n_C}{n_C+n_R}}^{p_C}\log n_C+ \overbrace{\frac{n_R}{n_C+n_R}}^{p_R}\log n_R$$

$$H_{CR} = \log(n_C+n_R)-\overbrace{\frac{n_C}{n_C+n_R}}^{p_C}\log n_C- \overbrace{\frac{n_R}{n_C+n_R}}^{p_R}\log n_R$$$$= \log(n_C+n_R)-p_C\log n_C- p_R\log n_R$$

Use the fact that $$p_C+p_R = 1$$ to write this as:

$$H_{CR} $$$$= (p_C+p_R)\log(n_C+n_R)-p_C\log n_C- p_R\log n_R$$ $$ = p_C\left(\underbrace{\log(n_C+n_R)-\log n_C}_{\log(1/p_C)}\right)+ p_R\left(\underbrace{\log(n_C+n_R)-\log n_R}_{\log(1/p_R)}\right)$$ $$ = p_C \log(1/p_C)+p_R \log(1/p_R)$$
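Plugging in sample counts confirms the identity; here $n_C=8$ and $n_R=4$ are illustrative choices, not values taken from the figures:

```python
from math import log2

n_C, n_R = 8, 4  # illustrative message counts for the two branches
p_C = n_C / (n_C + n_R)
p_R = n_R / (n_C + n_R)

# Left-hand side: total entropy minus the entropy inside each branch
H_CR = log2(n_C + n_R) - p_C * log2(n_C) - p_R * log2(n_R)

# Right-hand side: two-outcome Shannon entropy of choosing a branch
H_choice = p_C * log2(1 / p_C) + p_R * log2(1 / p_R)

print(H_CR)      # both print the same value
print(H_choice)
```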

Move
Moved to Draft:Information theory/Permutations and combinations: $$\text{If labeled: }\Omega = 5! = 120\,\text{outcomes}$$, $$\text{If unlabeled: }\Omega =\frac{(2+3)!}{2!\, 3!}=10\,\text{outcomes}$$



Don't move
Employ the fact that $$\Omega=2^S$$ is often a very large number, since $S$ is often a large number. A simplified version of Stirling's formula, $$\ln (n!)\approx n\ln n -n,$$ becomes, for logarithms to base two:

$$\log (n!)\approx n\log n - \frac{n}{\ln 2} $$

$$n=n_C+n_R$$

$$\widetilde S = \log\left[\frac{(n_C+n_R)!}{n_C!n_R!}\right]$$

$$\approx (n_C+n_R)\log (n_C+n_R) -n_C\log n_C -n_R\log n_R$$

$$= n_C\Bigl[ \log (n_C+n_R) -\log n_C\Bigr] + n_R\Bigl[ \log (n_C+n_R) - \log n_R\Bigr] $$

$$\widetilde S = n_C\log\left(\frac{n}{n_C}\right)+n_R\log\left(\frac{n}{n_R}\right)\quad\text{if }n\gg 1 $$
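The quality of this approximation can be tested against the exact binomial count; the sketch below uses $n_C=300$ and $n_R=200$ as illustrative values:

```python
from math import log2, factorial

# Illustrative sizes (not from the essay's figures)
n_C, n_R = 300, 200
n = n_C + n_R

# Exact: log2 of the binomial coefficient counting coin arrangements
S_exact = log2(factorial(n) // (factorial(n_C) * factorial(n_R)))

# Stirling-based approximation from the formula above
S_approx = n_C * log2(n / n_C) + n_R * log2(n / n_R)

print(S_exact, S_approx)  # agree to within about one percent
```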

gamma from wikipedia
w:special:permalink/1040737805


 * $$n! = \Gamma(n + 1),$$


 * $$ \sqrt{\pi} \left(\frac{x}{e}\right)^x \left( 8x^3 + 4x^2 + x + \frac{1}{100} \right)^{1/6} < \Gamma(1+x) < \sqrt{\pi} \left(\frac{x}{e}\right)^x \left( 8x^3 + 4x^2 + x + \frac{1}{30} \right)^{1/6}.$$

Two coin examples
$$\begin{matrix} \,     & TT                      &TH                        &HT                        &HH                 \\ H=      &p_1 \log_2(1/p_1)       &+p_2 \log_2(1/p_2)        &+p_3 \log_2(1/p_3)        &+p_4 \log_2(1/p_4)   \\ 1=      &0                       &+0                        &+\frac 12\log_2 (2)       &+\frac 12\log_2 (2)   \\ 1.918...=&\frac 13\frac 12\log_2(6)&+\frac 13\frac 12\log_2(6)&+\frac 23\frac 12\log_2(3)&+\frac 23\frac 12\log_2(3)\\ 1.836...=&\frac 13\frac 13\log_2(9)&+\frac 13\frac 23\log_2(9/2)&+\frac 23\frac 13\log_2(9/2) &+\frac 23\frac 23\log_2(9/4)\\ \end{matrix} $$
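Each row of the matrix above can be reproduced from Shannon's formula; a short sketch:

```python
from math import log2

def H(probs):
    """Shannon entropy in bits; 0*log2(0) is taken as 0."""
    return -sum(p * log2(p) for p in probs if p > 0)

# Row 2: first coin always heads, second coin fair
print(H([1/2, 1/2]))            # 1.0

# Row 3: first coin biased (p_T = 1/3), second coin fair
print(H([1/6, 1/6, 1/3, 1/3]))  # 1.918...

# Row 4: both coins biased with p_T = 1/3
print(H([1/9, 2/9, 2/9, 4/9]))  # 1.836...
```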

additive property of shannon entropy
$$-(H_1+H_2)= \sum_{\alpha=1}^{N_1}p_\alpha\log p_\alpha + \sum_{\beta=1}^{N_2}p_\beta\log p_\beta.$$

In our example, the crow's entropy is $$H_1$$, and $$N_1=8$$. The eight outcomes for the crow's three coins have probabilities:

$$p_\alpha\in\{ p_1, p_2,p_3,p_4,p_5,p_6,p_7,p_8\}$$ for the outcomes $$(000, 001, 010, 011, 100, 101, 110, 111),$$ respectively. Similarly, Rat's four possible outcomes have probabilities $$p_\beta\in\{ p_1, p_2,p_3,p_4\}$$.

short
$$-H_T= \sum_{k=1}^{N_T}p_k\log p_k,$$ where $$N_T=N_1N_2,$$ and $$p_k=p_\alpha p_\beta.$$ $$-H_T=\sum_{i=1}^{N_1}\sum_{j=1}^{N_2}(p_ip_j)\log (p_ip_j)$$ $$= \sum_{i=1}^{N_1}\sum_{j=1}^{N_2}p_ip_j\left(\log p_i+\log p_j\right)$$ $$= \sum_{i=1}^{N_1}\sum_{j=1}^{N_2}\left(p_ip_j\log p_i + p_ip_j\log p_j\right)$$ $$ = \sum_{j=1}^{N_2}p_j\sum_{i=1}^{N_1}p_i\log p_i + \sum_{i=1}^{N_1}p_i\sum_{j=1}^{N_2}p_j\log p_j$$

Image gallery
All images from the same randomized version of file:Image entropy with Inkscape First 01.svg

blurbs
I thought of calling it "information", but the word was overly used, so I decided to call it "uncertainty". [...] Von Neumann told me, "You should call it entropy, for two reasons. In the first place your uncertainty function has been used in statistical mechanics under that name, so it already has a name. In the second place, and more important, nobody knows what entropy really is, so in a debate you will always have the advantage."

$$ \overbrace{00}^{{}^{\prime\prime\,}\!1^{\,\prime\prime}}\; \overbrace{00}^{{}^{\prime\prime\,}\!2^{\,\prime\prime}}\; \overbrace{00}^{{}^{\prime\prime\,}\!3^{\,\prime\prime}}\; \overbrace{00}^{{}^{\prime\prime\,}\!4^{\,\prime\prime}}\; $$

Rules are needed to keep entropy "simple"
Crow's messages are {000, 001, 010, 011, 100, 101, 110, 111}, which correspond to the integers 0-7 in base 2. Crow could send more messages by choosing whether to include leading zeros, and the sole reason for forbidding such signals is to keep the system as simple as possible. For example, we "declare" the one-sided coin to be incapable of conveying information because $$\log(1)=0$$, and not because it is impossible to convey a message by not flipping a coin. Later, this convention that silence never conveys information will lead us to an alternative path to Shannon's formula for entropy.

H is Shannon entropy: H ≤ S
Shannon entropy $$H$$ is a generalization of the simple formula $$S=\log_2\Omega$$ that can be used to adjust for situations where some messages are either never sent, or are less likely to be sent. It involves the probabilities of certain messages being sent or not sent, and the formula can be used regardless of the mechanism by which the messages are selected.

The vagueness associated with whether these probabilities are estimated by observation of many signals, or determined by some unknown means does not prevent Shannon's entropy from being useful. For example:


 * Gadsby is a 50,000 word novel that does not contain the letter "e". Any casual analysis leads to the conclusion that the omission of the letter is deliberate.
 * Those engaged in espionage can infer that a signal whose "alphabet" appears to be a uniformly random sequence of letters is likely to be encrypted.
 * In written English, the letters "q" and "k" have the same pronunciation, which suggests that our text documents could be made smaller by removing the letter "q", for example, spelling "quick" as "quik". This is not much of a concern for text documents, but images and movies can be more easily stored or transmitted if compression is used.

The entropy of Rat's message does not depend on whether the binary system of two coins is used, or whether Rat displays the information using the four sided die shown in Figure 3.

Math
Letting $H$ = "heads" and $T$ = "tails", these 4 outcomes can be described in a number of different ways, for example: $$\{HH,HT,TH,TT\}\Leftrightarrow \{00,01,10,11\}\Leftrightarrow \{\text{one},\text{two},\text{three},\text{four}\}$$

This is the simplest example of the relationship between the entropy $S$ and the number of messages $\Omega$.

REWRITE: Figure 1 is currently being used by Wikipedias in five different languages. It illustrates the fact that the value of Shannon Entropy $$S$$ is the number of coins flipped:

if the coins are "fair", i.e. $$S=N$$. The simplest example of an "unfair" coin would be one with either two "heads" or two "tails". Any reasonable measure of information content would ignore that coin and yield an entropy of $$N-1.$$ The scope of this informal peek into entropy is limited:

Figure 2: The elephant with 5 coins can send $$2^5=32$$ different messages, which is associated with 5 bits of information (if the coins are "fair"). If three coins are given to the crow and two coins to the rat, the crow can send $$2^3=8$$ different messages, while the rat can send only $$2^2=4$$ different messages. Even though the rat and crow can independently send only $$8+4=12$$ different messages between them, together they possess the same number of bits of information, $$2+3=5,$$ as the elephant had. This illustrates the additive nature of entropy.

Assume all outcomes are equally probable
I find Crow the more illuminating, in part due to the confusion of 4=2×2 versus 4=2+2. It is often better to count not coins but outcomes. For Rat, this is equivalent to a four-sided coin. The "alphabet" used to send information is not important. Coins have advantages and disadvantages.

$$p_1=p_2=p_3=\frac{1-P}{3}\implies p_1+p_2+p_3 = 1-P$$

Reminders
 * Information theory/Shannon entropy Reminders:
 * Include links from Information theory and Introduction_to_Information_Theory

decisions

 * 1) REFERENCE Wikipedia permalink section #Characterization for two rigorous paths to Shannon's formula

Characterization


From w:Entropy (Information theory):
 * Let $\operatorname{I}$ be the information function, which one assumes to be twice continuously differentiable; one has:


 * $$\begin{align} & \operatorname{I}(p_1 p_2) &=\ & \operatorname{I}(p_1) + \operatorname{I}(p_2) && \quad \text{Starting from property 4} \\ & p_2 \operatorname{I}'(p_1 p_2) &=\ & \operatorname{I}'(p_1) && \quad \text{taking the derivative w.r.t}\ p_1 \\ & \operatorname{I}'(p_1 p_2) + p_1 p_2 \operatorname{I}''(p_1 p_2) &=\ & 0 && \quad \text{taking the derivative w.r.t}\ p_2 \\ & \operatorname{I}'(u) + u \operatorname{I}''(u) &=\ & 0 && \quad \text{introducing}\, u = p_1 p_2 \\ & (u \mapsto u \operatorname{I}'(u))' &=\ & 0 \end{align}$$

This differential equation leads to the solution $$\operatorname{I}(u) = k \log u$$ for any $$k \in \mathbb{R}$$. Property 2 leads to $$k < 0$$. Properties 1 and 3 then hold also.


 * What is the entropy $S$ when one of the coins is "unfair", in that the two outcomes occur with unequal probabilities?
 * How can this definition be extended to include entities with more than two outcomes?
 * What does it mean to say that entropy is an "extensive property"?
 * How can this essay contribute to higher education?

Expected number of bits and entropy per symbol
Variable-length_code helped me with Shannon's source coding theorem

Number of messages
This has a trivial resolution if we note that the exponential function, $$y=2^x,$$ and the logarithmic function, $$x=\log_2(y),$$ are mutually inverse functions:


 * $$y=2^x \text{ if and only if } x=\log_2(y)\;\ldots (\text{ provided } y>0.)$$

For example, the $$\Omega = 6$$ messages associated with the throw of a six-sided die have an entropy of $$S=\log_2(6)\approx 2.585.$$
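A quick check of this value and of the inverse relation:

```python
from math import log2

omega = 6  # messages from one throw of a six-sided die
S = log2(omega)
print(S)  # 2.584962...  (between 2 and 3 bits)

# The inverse relation: 2**S recovers the number of messages
print(2**S)  # approximately 6
```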

micro/macro state grand ensemble/Gibbs entropy

 * These words are used differently and should be avoided
 * Microstate (statistical mechanics)
 * http://pillowlab.princeton.edu/teaching/statneuro2018/slides/notes08_infotheory.pdf uses a Lagrange multiplier to establish the extremum; see (8.19), or equivalently 8.2.2. See also efficiency at (8.22)

- Since this essay is about the entropy of information,

large numbers
Observable_universe 10^80


 * 1) No attempt to explain. This introduction to the formula is not rigorous. Its purpose is to familiarize the reader with the formula's properties. Slowly build: begin with information as the result of a random generator flipping N coins, ignoring surprisal. N evolves into Shannon entropy.
 * 2) The elephant shown to the right uses 5 coins to send messages. Sometimes he gives three coins to the crow and two coins to the rat so they can also send messages. The number of possible messages available to each animal depends on how many coins are used.


 * 1) a kilobyte is a small file
 * 2) https://en.wikipedia.org/wiki/Byte
 * 3) https://en.wikipedia.org/wiki/Kilobyte

Links

 * An Intuitive Guide to the Concept of Entropy Arising in Various Sectors of Science: uses the Stirling approximation to show that the p log p form arises from large amounts of entropy.
 * towardsdatascience.com pages can be accessed for free at first, but later it seems you need to pay.
 * https://planetcalc.com/2476/
 * Stirling's approximation
 * https://math.stackexchange.com/questions/331103/intuitive-explanation-of-entropy
 * https://en.wikipedia.org/w/index.php?title=Entropy_in_thermodynamics_and_information_theory&oldid=1020584746#See_also
 * Interesting links, mostly to WP
 * https://en.wikipedia.org/w/index.php?title=Entropy_(information_theory)&oldid=1033726547#Characterization
 * 2 derivations of PlogP stuff.
 * I=-log p is information content (low p means big I)
 * already referenced!
 * https://en.wikipedia.org/wiki/Entropy_(statistical_thermodynamics)
 * Uses log n! = n log n to justify PlogP stuff
 * Binary entropy function graphs $$\operatorname H(X) = \operatorname H_\text{b}(p) = -p \log_2 p - (1 - p) \log_2 (1 - p)$$,
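The binary entropy function cited in the last item is simple to implement; a sketch (the name `H_b` follows the formula above):

```python
from math import log2

def H_b(p):
    """Binary entropy function in bits; 0*log2(0) is taken as 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)

print(H_b(0.5))   # 1.0  (maximum: a fair coin)
print(H_b(0.11))  # close to 0.5
print(H_b(0.0))   # 0.0  (a "one-sided coin")
```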

Boltzmann or Gibbs?

 * Boltzmann's entropy formula