
Some Statistics Knowledge

Basic concepts

What's the essence of probability? There are two views:

  • Frequentist: Probability is an objective thing. We can know probability from the result of repeating a random event many times in the same condition.
  • Bayesian: Probability is a subjective thing. Probability means how you think it's likely to happen based on your initial assumptions and the evidences you see. Probability is relative to the information you have.

Probability is related to sampling assumptions. Example: the Bertrand paradox: there are many ways to randomly select a chord on a circle, each giving a different probability density of chords.

A distribution tells how likely a random variable is to take each value:

  • A discrete distribution can be a table, telling the probability of each possible outcome.
  • A discrete distribution can be a function, where the input is a possible outcome and the output is its probability.
  • A discrete distribution can be a vector (an array), where i-th number is the probability of i-th outcome.
  • A discrete distribution can be a histogram, where each pillar is a possible outcome, and the height of pillar is probability.
  • A continuous distribution can be described by a probability density function (PDF) $f$. A continuous distribution has infinitely many outcomes, and the probability of each specific outcome is zero (usually). We care about the probability of a range: $P(a<X<b)=\int_a^b f(x)dx$. The integral over the whole range should be 1: $\int_{-\infty}^{\infty}f(x)dx=1$. The value of a PDF can be larger than 1.
  • A distribution can also be described by a cumulative distribution function (CDF), $F(x) = P(X \leq x)$. It is the integral of the PDF: $F(x) = \int_{-\infty}^x f(t)dt$. It starts from 0, monotonically increases, and reaches 1.
  • The quantile function $Q$ is the inverse of the cumulative distribution function. $Q(p) = x$ means $F(x)=p$, i.e. $P(X \leq x) = p$. The top 25% value is $Q(0.75)$. The bottom 25% value is $Q(0.25)$. (See the sketch after this list.)
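As a concrete sketch of these relations (a minimal example assuming SciPy is available; the standard normal distribution is an arbitrary choice):

```python
import numpy as np
from scipy import stats, integrate

dist = stats.norm(loc=0.0, scale=1.0)    # standard normal distribution

# The probability of a range is the integral of the PDF over that range.
p_range, _ = integrate.quad(dist.pdf, -1.0, 1.0)
print(p_range)                           # ~0.683, same as dist.cdf(1) - dist.cdf(-1)

# The CDF and the quantile function are inverses of each other.
print(dist.cdf(1.0))                     # ~0.841
print(dist.ppf(0.841344746))             # ~1.0  (ppf is the quantile function Q)

# A PDF value itself can exceed 1 for a narrow distribution.
narrow = stats.norm(loc=0.0, scale=0.1)
print(narrow.pdf(0.0))                   # ~3.99 > 1
```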

Independent means that two random variables don't affect each other. Knowing one doesn't affect the distribution of other. But there are dependent random variables that, when you know one, the distribution of another changes.

$P(X=x)$ means the probability that random variable X takes value x. It can also be written as $P_X(x)$ or $P(X)$. Sometimes the probability density function $f$ is used to represent a distribution.

A joint distribution tells how likely a combination of multiple variables will be what value. For a joint distribution of X and Y, each outcome is a pair of X and Y, denoted (X,Y)(X, Y). If X and Y are independent, then P(X=x,Y=y)=P((X,Y)=(x,y))=P(X=x)P(Y=y)P(X=x,Y=y)=P((X,Y)=(x,y))=P(X=x) \cdot P(Y=y).

For a joint distribution of (X,Y)(X, Y), if we only care about X, then the distribution of X is called marginal distribution.

You can only add probability when two events are mutually exclusive.

You can only multiply probability when two events are independent, or multiplying a conditional probability with the condition's probability.

Conditional probability

P(EC)P(E \vert C) means the probability of EE happening if CC happens.

P(E|C) = \frac{P(\overbrace{E \cap C}^{\text{E and C both happen}})}{P(C)} \qquad\qquad P(E\cap C) = P(E|C) \cdot P(C)

If E and C are independent, then P(EC)=P(E)P(C)P(E \cap C) = P(E)P(C), then P(EC)=P(E)P(E \vert C)=P(E).

For example, there is a medical test for a disease. The test result can be positive (indicating the person has the disease) or negative. But the test is not always accurate.

There are two random variables: whether the test result is positive, and whether the person actually has the disease. Together they form a joint distribution with 4 cases:

|  | Test is positive | Test is negative |
| --- | --- | --- |
| Actually has disease | True positive $a$ | False negative (Type II Error) $b$ |
| Actually doesn't have disease | False positive (Type I Error) $c$ | True negative $d$ |

$a, b, c, d$ are the probabilities of the four cases, and $a + b + c + d = 1$.

For that distribution, there are two marginal distributions. If we only care about whether the person actually has disease and ignore the test result, then the marginal distribution is:

|  | Probability |
| --- | --- |
| Actually has disease | $a+b$ (the infection rate in the population) |
| Actually doesn't have disease | $c+d$ |

Similarly, there is also a marginal distribution of whether the test result is positive.

False negative rate is P(Test is negative  Actually has disease)P(\text{Test is negative } \vert \text{ Actually has disease}), it means the rate of negative test when actually having disease. And false positive rate is P(Test is positive  Actually doesn’t have disease)P(\text{Test is positive } \vert \text{ Actually doesn't have disease}).

\text{False negative rate} = P(\text{Test is negative} \mid \text{Actually has disease}) = \frac{b}{a + b} \qquad \text{False positive rate} = P(\text{Test is positive} \mid \text{Actually doesn't have disease}) = \frac{c}{c + d}

Some people may intuitively think false negative rate means P(Test result is false  Test is negative)P(\text{Test result is false } \vert \text{ Test is negative}), which equals P(Actually has disease  Test is negative)P(\text{Actually has disease } \vert \text{ Test is negative}), which equals bb+d\frac{b}{b+d}. But that's not the official definition of false negative.

Bayes' theorem allows "reversing" $P(A \vert B)$ into $P(B \vert A)$:

P(AB)=P(AB)P(B)=P(BA)P(A)P(B)P(A|B) = \frac{P(A \cap B)}{P(B)} = \frac{P(B|A)\cdot P(A)}{P(B)}
  • Prior means the distribution I assume before seeing some new information.
  • If I see some new information and improve my understanding of the distribution, then the new distribution that I assume is the posterior. (A numeric sketch follows below.)
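As a small numeric sketch of Bayes' theorem applied to the medical-test example above (all numbers here are made up for illustration):

```python
# Hypothetical numbers: 1% infection rate, 5% false positive rate, 10% false negative rate
p_disease = 0.01
p_pos_given_disease = 0.90       # 1 - false negative rate
p_pos_given_healthy = 0.05       # false positive rate

# P(positive): marginal probability of a positive test, by the law of total probability
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)

# Bayes' theorem: P(disease | positive) = P(positive | disease) * P(disease) / P(positive)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(p_disease_given_pos)       # ~0.154: the prior 0.01 is updated to the posterior 0.154
```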

Mean

The theoretical mean is the "weighted average" of all possible cases using theoretical probabilities.

E[X]E[X] denotes the theoretical mean of random variable XX, also called the expected value of XX. It's also often denoted as μ\mu.

For the discrete case, $E[X]$ is calculated by summing all theoretically possible values, each multiplied by its theoretical probability.

The mean for discrete case:

\mu = E[X] = \sum_{\underbrace{x}_{\text{consider all cases of } x}} x \cdot \overbrace{P(X=x)}^{\text{probability of that case}}

The mean for continuous case:

μ=E[X]=xp(x)dx\mu = E[X] = \int_{-\infty}^{\infty} x \cdot p(x) dx

Some rules related to mean:

  • Means add up: $E[X + Y] = E[X] + E[Y] \qquad E[\sum_iX_i] = \sum_iE[X_i]$
  • Multiplying a random variable by a constant k multiplies its mean: $E[kX] = k \cdot E[X]$
  • A constant's mean is that constant: $E[k] = k$

(The constant k doesn't necessarily need to be globally constant. It just needs to be a value that's not affected by the random outcome, i.e. "constant in context".)

Another important rule is that, if XX and YY are independent, then

E[XY]=E[X]E[Y]E[X \cdot Y] = E[X] \cdot E[Y]

Because when XX and YY are independent, P(X=xi,Y=yj)=P(X=xi)P(Y=yj)P(X=x_i, Y=y_j) = P(X=x_i) \cdot P(Y=y_j), then:

E[XY]=i,jxiyjP(X=xi,Y=yj)=i,jxiyjP(X=xi)P(Y=yj)E[X \cdot Y] = \sum_{i,j}{x_i \cdot y_j \cdot P(X=x_i, Y=y_j)} = \sum_{i,j}{x_i \cdot y_j \cdot P(X=x_i) \cdot P(Y=y_j)}

Note that $E[X+Y]=E[X]+E[Y]$ always holds regardless of independence, but $E[XY]=E[X]E[Y]$ requires independence.

For a sum, a common factor that doesn't depend on the summation index can be extracted out. So:

\sum_{i,j}f(i)g(j) = \sum_{i} \left( \sum_{j} \left(\underbrace{f(i)}_{\text{irrelevant to } j} \cdot g(j)\right) \right) = \sum_{i} \left( f(i) \underbrace{\sum_{j} g(j)}_{\text{irrelevant to } i} \right) = \left(\sum_{i} f(i)\right) \left(\sum_{j} g(j)\right)

Then:

i,jxiyjP(X=xi)P(Y=yj)=(ixiP(X=xi))(jyjP(Y=yj))=E[X]E[Y]\sum_{i,j}{x_i \cdot y_j \cdot P(X=x_i) \cdot P(Y=y_j)} = \left(\sum_ix_iP(X=x_i)\right) \cdot \left(\sum_jy_jP(Y=y_j)\right) = E[X] \cdot E[Y]

(That's for the discrete case. Continuous case is similar.)

If we have n samples of X, denoted $X_1, X_2, ..., X_n$, where each sample is a random variable, the samples are independent of each other, and each sample is drawn from the same distribution (independent and identically distributed, i.i.d.), then we can estimate the theoretical mean by calculating the average. The estimated mean is denoted $\hat{\mu}$ (mu hat):

E^i[X]=μ^=1niXi\hat{E}_i[X] = \hat{\mu} = \frac{1}{n} \sum_i{X_i}

The hat $\hat{\ }$ means it's an empirical value calculated from samples, not the theoretical value.

Some important clarifications:

  • The theoretical mean is weighted average using theoretical probabilities
  • The estimated mean (empirical mean, sample mean) is non-weighted average over samples
  • The theoretical mean is an accurate value, determined by the theoretical distribution
  • The estimated mean is an inaccurate random variable, because it's calculated from random samples

The mean of estimated mean equals the theoretical mean.

E[μ^]=E[1niXi]=1niE[Xi]=1niE[X]=1nnE[X]=μE[\hat{\mu}] = E[\frac{1}{n}\sum_iX_i] = \frac{1}{n} \sum_i E[X_i] = \frac{1}{n} \sum_i E[X] = \frac{1}{n} n \cdot E[X] = \mu

Note that if the samples are not independent of each other, or they are taken from different distributions, then the estimate may be biased.
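A quick simulation sketch (using NumPy; the exponential distribution with mean 2 is just an arbitrary example) showing the sample mean fluctuating around the theoretical mean and getting closer as n grows:

```python
import numpy as np

rng = np.random.default_rng(0)

# Theoretical mean of Exponential(scale=2) is 2.0
for n in [10, 1000, 100000]:
    samples = rng.exponential(scale=2.0, size=n)   # n i.i.d. samples
    print(n, samples.mean())                       # sample mean approaches 2.0 as n grows
```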

Variance

The theoretical variance, $\text{Var}[X]$, also denoted as $\sigma^2$, measures how "spread out" the distribution is.

σ2=Var[X]=E[(Xμ)2]\sigma ^2 = \text{Var}[X] = E[(X - \mu)^2]

If kk is a constant:

\text{Var}[kX] = k^2 \text{Var}[X] \qquad \text{Var}[X + k] = \text{Var}[X] \qquad \text{Var}[X] = E[X^2] - E[X]^2

Standard deviation (stdev) σ\sigma is the square root of variance. Multiplying a random variable by a constant also multiplies the standard deviation.

The covariance Cov[X,Y]\text{Cov}[X, Y] measures the "joint variability" of two random variables XX and YY.

Cov[X,Y]=E[(XE[X])(YE[Y])]Var[X]=Cov[X,X]\text{Cov}[X, Y] = E[(X-E[X])(Y-E[Y])] \quad\quad\quad \text{Var}[X]=\text{Cov}[X,X]

Some rules related to variance:

\text{Var}[X + Y] = E[((X-E[X])+(Y-E[Y]))^2] = E[(X-E[X])^2 + (Y-E[Y])^2 + 2(X-E[X])(Y-E[Y])] = \text{Var}[X] + \text{Var}[Y] + 2 \cdot \text{Cov}[X, Y]

If X and Y are independent, then as previously mentioned $E[XY]=E[X]\cdot E[Y]$, so

Cov[X,Y]=E[(XE[X])(YE[Y])]=E[XE[X]]E[YE[Y]]=00=0\text{Cov}[X, Y] = E[(X-E[X])(Y-E[Y])] = E[X-E[X]] \cdot E[Y-E[Y]] = 0 \cdot 0 = 0

so Var[X+Y]=Var[X]+Var[Y]\text{Var}[X + Y]= \text{Var}[X] + \text{Var}[Y]

The mean is sometimes also called location. The variance is sometimes called dispersion.

If we have some i.i.d samples but don't know the theoretical variance, how to estimate the variance? If we know the theoretical mean, then it's simple:

\hat{\sigma}^2 = \frac{1}{n} \sum_{i}(X_i - \mu)^2 \qquad E[\hat{\sigma}^2] = \sigma^2

However, the theoretical mean is different from the estimated mean. If we don't know the theoretical mean and use the estimated mean instead, the estimate will be biased, and we need to divide by n-1 instead of n to remove the bias:

σ^2=1n1i((Xiμ^)2)\hat{\sigma}^2 = \frac{1}{n-1} \sum_{i}((X_i - \hat{\mu})^2)

This is called Bessel's correction. Note that the more i.i.d. samples you have, the smaller the bias, so if you have many i.i.d. samples the bias doesn't matter in practice.
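A small simulation sketch of that bias (NumPy; a normal distribution with true variance 9 is an arbitrary choice), comparing division by n and by n-1:

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 5, 200000                               # small n makes the bias visible

samples = rng.normal(0.0, 3.0, size=(trials, n))    # true variance is 9
mu_hat = samples.mean(axis=1, keepdims=True)
ss = ((samples - mu_hat) ** 2).sum(axis=1)          # squared deviations from the sample mean

print((ss / n).mean())        # ~7.2, biased low by a factor (n-1)/n
print((ss / (n - 1)).mean())  # ~9.0, unbiased
```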

Originally, n samples have n degrees of freedom. If the estimated mean is held fixed, only n-1 degrees of freedom remain. That's an intuitive explanation of the correction. The exact deduction of the correction is trickier:

Deduction of Bessel's correction

Firstly, the estimated mean itself also has variance

Var[μ^]=Var[1niXi]=1n2Var[iXi]\text{Var}[\hat{\mu}] = \text{Var}\left[\frac{1}{n}\sum_iX_i\right] = \frac{1}{n^2} \text{Var}\left[\sum_iX_i\right]

Each sample is independent of the other samples. As previously mentioned, if X and Y are independent, adding the variables also adds the variances: $\text{Var}[X + Y]= \text{Var}[X] + \text{Var}[Y]$. So:

\text{Var}\left[\sum_i{X_i}\right] = \sum_i{\text{Var}[X_i]} = n\sigma^2 \qquad \text{Var}[\hat{\mu}] = \frac{1}{n^2} \text{Var}\left[\sum_iX_i\right] = \frac{1}{n^2} \cdot n\sigma^2 = \frac{\sigma^2}{n}

As previously mentioned E[μ^]=μE[\hat{\mu}] = \mu, then Var[μ^]=E[(μ^E[μ^])2]=E[(μ^μ)2]=σ2n\text{Var}[\hat{\mu}] = E[(\hat{\mu} - E[\hat{\mu}])^2] = E[(\hat{\mu} - \mu)^2] = \frac{\sigma^2}{n}. This will be used later.

A trick is to rewrite Xiμ^X_i - \hat{\mu} to (Xiμ)(μ^μ)(X_i - \mu) - (\hat{\mu} - \mu) and then expand:

\sum_{i}(X_i - \hat{\mu})^2 = \sum_{i}\left((X_i - \mu) - (\hat{\mu} - \mu)\right)^2 = \sum_i{\left( (X_i - \mu)^2-2(X_i - \mu)(\hat{\mu} - \mu)+(\hat{\mu} - \mu)^2\right)} = \sum_i{(X_i - \mu)^2} - 2 (\hat{\mu} - \mu) \sum_i{(X_i - \mu)} + n(\hat{\mu} - \mu)^2

Then take mean of two sides:

E\left[ \sum_{i}(X_i - \hat{\mu})^2 \right] = E\left[\sum_i{(X_i - \mu)^2} - 2 (\hat{\mu} - \mu) \sum_i{(X_i - \mu)} + n(\hat{\mu} - \mu)^2\right] = E\left[\sum_i{(X_i - \mu)^2}\right] - 2 E\left[(\hat{\mu} - \mu) \sum_i{(X_i - \mu)}\right] + n E[ (\hat{\mu} - \mu)^2 ]

There are now three terms. The first one equals nσ2n\sigma^2:

E[i(Xiμ)2]=nσ2E\left[\sum_i{(X_i - \mu)^2}\right] = n\sigma^2

note that

i(Xiμ)=(iXi)nμ=nμ^nμ=n(μ^μ)\sum_i{(X_i-\mu)} = (\sum_iX_i) - n\mu = n\hat{\mu} - n\mu = n(\hat{\mu}-\mu)

So the second one becomes

2E[(μ^μ)i(Xiμ)]=2E[(μ^μ)n(μ^μ)]=2nE[(μ^μ)2]-2 E\left[(\hat{\mu} - \mu) \sum_i{(X_i - \mu)}\right] = -2E[(\hat{\mu}-\mu)n(\hat{\mu}-\mu)] = -2nE[(\hat{\mu}-\mu)^2]

Now the above three things become

E[i((Xiμ^)2)]=nσ2nE[(μ^μ)2]E\left[ \sum_{i}((X_i - \hat{\mu})^2) \right]=n\sigma^2 -nE[(\hat{\mu}-\mu)^2]

E[(μ^μ)2]E[(\hat{\mu}-\mu)^2] is also Var[μ^]\text{Var}[\hat{\mu}]. As previously mentioned, it equals σ2n\frac{\sigma^2}{n}, so

E[i((Xiμ^)2)]=nσ2nσ2n=(n1)σ2E\left[ \sum_{i}((X_i - \hat{\mu})^2) \right]= n\sigma^2 -n \frac{\sigma^2}{n} = (n-1)\sigma^2

So

E[i((Xiμ^)2)n1]=σ2E\left[ \frac{\sum _ {i}((X_i - \hat{\mu})^2)}{n-1} \right] = \sigma^2

Z-score

For a random variable X, if we know its mean $\mu$ and standard deviation $\sigma$, then we can "standardize" it so that its mean becomes 0 and its standard deviation becomes 1:

Z=XμσZ = \frac{X-\mu}{\sigma}

That's called Z-score or standard score.

Often the theoretical mean and theoretical standard deviation are unknown, so the Z-score is computed using the sample mean and sample stdev:

Z=Xμ^σ^Z = \frac{X-\hat\mu}{\hat\sigma}

In deep learning, normalization uses the Z-score:

  • Layer normalization: it works on a vector. It treats the elements of the vector as different samples from the same distribution, and then replaces each element with its Z-score (using the sample mean and sample stdev).
  • Batch normalization: it works on a batch of vectors. It treats the elements at the same index across the vectors in the batch as different samples from the same distribution, and then computes their Z-scores (using the sample mean and sample stdev).

Note that in layer normalization and batch normalization, the variance is usually computed by dividing by n instead of n-1. A sketch follows below.
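A minimal layer-normalization sketch in NumPy (real implementations typically add a small epsilon for numerical stability, as here, and also learnable scale and shift parameters, which are omitted):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Z-score over the elements of one vector; divides by n (np.var's default), not n-1.
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / np.sqrt(x.var() + eps)

v = np.array([1.0, 2.0, 3.0, 4.0])
print(layer_norm(v))          # result has mean ~0 and standard deviation ~1
```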

Computing Z-score for a vector can also be seen as a projection:

  • The input x=(x1,x2,...,xn)\boldsymbol{x} = (x_1,x_2,...,x_n)
  • The vector of ones: 1=(1,1,...,1)\boldsymbol{1} = (1, 1, ..., 1)
  • Computing sample mean can be seen as scaling 1n\frac 1 n then dot product with the vector of ones: μ^=1nx1{\hat \mu}= \frac 1 n \boldsymbol{x} \cdot \boldsymbol{1}
  • Subtracting the sample mean can be seen as subtracting μ^1\hat {\mu} \cdot \boldsymbol{1}, let's call it y\boldsymbol y: y=xμ^1=x1n(x1)1\boldsymbol y = \boldsymbol x - {\hat \mu} \cdot \boldsymbol{1} = \boldsymbol x- \frac 1 n (\boldsymbol{x} \cdot \boldsymbol{1}) \cdot \boldsymbol{1}
  • Recall projection: projecting vector a\boldsymbol a onto b\boldsymbol b is (abbb)b(\frac{\boldsymbol a \cdot \boldsymbol b}{\boldsymbol b \cdot \boldsymbol b}) \cdot \boldsymbol b.
  • (1)2=n(\boldsymbol 1)^2 = n. So 1n(x1)1\frac 1 n (\boldsymbol{x} \cdot \boldsymbol{1}) \cdot \boldsymbol{1} is the projection of x\boldsymbol x onto 1\boldsymbol 1.
  • Subtracting it means removing the component in the direction of 1\boldsymbol 1 from x\boldsymbol x. So y\boldsymbol y is orthogonal to 1\boldsymbol 1. y\boldsymbol y is in a hyper-plane orthogonal to 1\boldsymbol 1.
  • Standard deviation can be seen as the length of y\boldsymbol y divide by n\sqrt{n} (or n1\sqrt{n-1}): σ2=1n(y)2\boldsymbol\sigma^2 = \frac 1 n (\boldsymbol y)^2, σ=1ny\boldsymbol\sigma = \frac 1 {\sqrt{n}} \vert \boldsymbol y \vert.
  • Dividing by standard deviation can be seen as projecting it onto unit sphere then multiply by n\sqrt n (or n1\sqrt{n-1}).
  • So computing Z-score can be seen as firstly projecting onto a hyper-plane that's orthogonal to 1\boldsymbol 1 and then projecting onto unit sphere then multiply by n\sqrt n (or n1\sqrt{n-1}).

Skewness

Skewness measures which side has more extreme values.

Skew[X]=E[(Xμ)3σ3]\text{Skew}[X] = E\left[\frac{(X - \mu)^3}{\sigma ^3}\right]

A large positive skew means there is a fat tail on positive side (may have positive Black Swans). A large negative skew means fat tail on negative side (may have negative Black Swans).

If the two sides are symmetric, the skew is 0, regardless of how fat the tails are. Gaussian distributions are symmetric so they have zero skew. Note that an asymmetric distribution can also have 0 skewness.

There is a concept called moments that unify mean, variance, skewness and kurtosis:

  • The n-th moment: E[Xn]E[X^n]. Mean is the first moment.
  • The n-th central moment: E[(Xμ)n]E[(X-\mu)^n]. Variance is the second central moment.
  • The n-th central standardized moment: E[(Xμσ)n]E[(\frac{X-\mu}{\sigma})^n]. Skewness is the third central standardized moment. Kurtosis is the fourth central standardized moment.

There is an unbiased way to estimate the third central moment $\mu_3$.

\mu_3[X] = E[(X-\mu)^3] \qquad \hat{\mu_3} = \frac{n}{(n-1)(n-2)} \sum_i (X_i - \hat{\mu})^3

The deduction of the unbiased third central moment estimator is similar to Bessel's correction, but more tricky.

A common way of estimating skewness from i.i.d. samples is to take the unbiased third central moment estimator and divide it by the cube of the estimated standard deviation (see the sketch below):

G1=μ3^σ^3=n(n1)(n2)i(Xiμ^)3σ^3G_1 = \frac{\hat{\mu_3}}{\hat{\sigma}^3} = \frac{n}{(n-1)(n-2)}\sum_i{\frac{(X_i - \hat{\mu})^3}{\hat{\sigma}^3}}

But it's still biased, as E[XY]E[\frac{X}{Y}] doesn't necessarily equal E[X]E[Y]\frac{E[X]}{E[Y]}. Unfortunately, there is no completely unbiased way to estimate skewness from i.i.d samples (unless you have other assumptions about the underlying distribution). The bias gets smaller with more i.i.d samples.
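A sketch of the G1 estimator in NumPy (the exponential distribution, whose theoretical skewness is 2, is just an arbitrary test case):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=100000)   # theoretical skewness of Exponential is 2

n = len(x)
mu_hat = x.mean()
sigma_hat = x.std(ddof=1)                     # Bessel-corrected standard deviation

# G1: unbiased third central moment estimate divided by sigma_hat^3 (still slightly biased)
g1 = n / ((n - 1) * (n - 2)) * np.sum(((x - mu_hat) / sigma_hat) ** 3)
print(g1)                                     # ~2
```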

Kurtosis

Larger kurtosis means it has a fatter tail. The more extreme values (Black Swans) it has, the higher its kurtosis.

Kurt[X]=E[(Xμ)4σ4]=E[(Xμ)4]σ4\text{Kurt}[X] = E\left[\frac{(X - \mu)^4}{\sigma ^4}\right] = \frac{E[(X-\mu)^4]}{\sigma^4}

Gaussian distributions have kurtosis of 3. Excess kurtosis is the kurtosis minus 3.

A common way of estimating excess kurtosis from i.i.d. samples is to take the unbiased estimator of the fourth cumulant ($E[(X-E[X])^4]-3\text{Var}[X]^2$) and divide it by the square of the unbiased estimator of variance:

G2=(n+1)n(n1)(n2)(n3)i((Xiμ^)4)σ^43(n1)2(n2)(n3)G_2 = \frac{(n+1)n}{(n-1)(n-2)(n-3)} \cdot \frac{\sum_i((X_i-\hat{\mu})^4)}{\hat{\sigma}^4} -3\frac{(n-1)^2}{(n-2)(n-3)}

It's still biased.

Control variate

If we have some independent samples of X, we can estimate the mean $E[X]$ by calculating the average $\hat{E}[X]=\frac{1}{n}\sum_i X_i$. The variance of the calculated average is $\frac{1}{n} \text{Var}[X]$, which shrinks as we take more samples.

However, if the variance of X is large and the number of samples is small, the average will have a large variance and the estimated mean will be inaccurate. We can make the estimate more accurate by using a control variate.

If:

  • we have a random variable Y that's correlated with X
  • we know the true mean of Y: E[Y]E[Y],

Then we can estimate E[X]E[X] using E^[X+λ(YE[Y])]\hat{E}[X+\lambda(Y-E[Y])], where λ\lambda is a constant. By choosing the right λ\lambda, the estimator can have lower variance than just calculating average of X. The Y here is called a control variate.

Some previous knowledge: E[E^[A]]=E[A]E[\hat{E}[A]] = E[A], Var[E^[A]]=1nVar[A]\text{Var}[\hat{E}[A]]=\frac{1}{n}\text{Var}[A].

The mean of that estimator is E[X]E[X], meaning that the estimator is unbiased:

E[\hat{E}[X+\lambda(Y-E[Y])]] = E[X+\lambda(Y-E[Y])] = E[X] + \lambda(\underbrace{E[Y-E[Y]]}_{=0}) = E[X]

Then calculate the variance of the estimator:

\text{Var}[\hat{E}[X+\lambda(Y-E[Y])]] = \frac{1}{n}\text{Var}[X+\lambda(Y-E[Y])] = \frac{1}{n}\text{Var}[X+\lambda Y\underbrace{-\lambda E[Y]}_\text{constant}] = \frac{1}{n}\text{Var}[X+\lambda Y] = \frac{1}{n}(\text{Var}[X]+\text{Var}[\lambda Y] + 2\,\text{cov}[X,\lambda Y]) = \frac{1}{n}(\text{Var}[X]+\lambda^2 \text{Var}[Y]+2\lambda\, \text{cov}[X,Y])

We want to minimize the variance of the estimator by choosing $\lambda$, i.e. find the $\lambda$ that minimizes $\text{Var}[Y] \lambda^2 + 2\text{cov}[X,Y] \lambda$. Quadratic function knowledge tells us that $ax^2+bx+c \ \ (a>0)$ is minimized at $x=\frac{-b}{2a}$, so the optimal lambda is:

λ=cov[X,Y]Var[Y]\lambda = - \frac{\text{cov}[X,Y]}{\text{Var}[Y]}

And by using that optimal λ\lambda, the variance of estimator is:

Var[E^[X+λ(YE[Y])]]=1n(Var[X]cov[X,Y]2Var[Y])\text{Var}[\hat{E}[X+\lambda(Y-E[Y])]]=\frac{1}{n} \left( \text{Var}[X] -\frac{\text{cov}[X,Y]^2}{\text{Var}[Y]} \right)

If X and Y are correlated, then cov[X,Y]2Var[Y]>0\frac{\text{cov}[X,Y]^2}{\text{Var}[Y]} > 0, then the new estimator has smaller variance and is more accurate than the simple one. The larger the correlation, the better it can be.
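A small simulation sketch of a control variate (NumPy; the toy setup where X = Y + noise and E[Y] = 0.5 is made up, and for simplicity the optimal λ is estimated from the same samples):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100

# Toy setup: Y is uniform on [0, 1] with known E[Y] = 0.5, and X = Y + noise, so E[X] = 0.5.
y = rng.uniform(0.0, 1.0, size=n)
x = y + rng.normal(0.0, 0.2, size=n)

lam = -np.cov(x, y)[0, 1] / np.var(y)         # lambda = -cov[X,Y] / Var[Y]

plain = x.mean()                              # simple estimator of E[X]
controlled = (x + lam * (y - 0.5)).mean()     # control-variate estimator
print(plain, controlled)                      # the controlled estimate is typically closer to 0.5
```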

Information entropy

Information entropy measures:

  • How uncertain a distribution is.
  • How much information a sample in that distribution carries.

If we want to measure the amount of information of a specific event, denoting event E's amount of information as $I(E)$, we can define these axioms:

  • If that event always happens, then it carries zero information. I(E)=0I(E) = 0 if P(E)=1P(E) = 1.
  • The more rare an event is, the larger information (more surprise) it carries. I(E)I(E) increases as P(E)P(E) decreases.
  • The information of two independent events happen together is the sum of the information of each event. Here I use (X,Y)(X, Y) to denote the combination of XX and YY. That means I((X,Y))=I(X)+I(Y)I((X, Y)) = I(X) + I(Y) if P((X,Y))=P(X)P(Y)P((X, Y)) = P(X) \cdot P(Y). This implies the usage of logarithm.

Then according to the three axioms, the definition of II is:

I(E)=logb1P(E)=logbP(E)I(E) = \log_b \frac{1}{P(E)} = - \log_b P(E)

The base bb is relative to the unit. We often use the amount of bits as the unit of amount of information. An event with 50% probability has 1 bit of information, then the base will be 2:

I(E)=log21P(E)(in bits)I(E) = \log_2 \frac{1}{P(E)} \quad \text{(in bits)}

Then, for a distribution, the expected value of information of one sample is the expected value of I(E)I(E). That defines information entropy HH:

H(X)=E[I(X)]=E[log21P(X)]H(X) = E[I(X)] = E\left[\log_2\frac{1}{P(X)}\right]

In discrete case:

H(X)=x(P(x)log2(1P(x)))H(X) = \sum_x \left(P(x) \cdot \log_2\left(\frac{1}{P(x)}\right) \right)

If there exists xx where P(x)=0P(x) = 0, then it can be ignored in entropy calculation, as limx0xlogx=0\lim_{x \to 0} x \log x = 0.

Information entropy in the discrete case is never negative.

In continuous case, where ff is the probability density function, this is called differential entropy:

H(X)=Xf(x)log1f(x)dxH(X) = \int_{\mathbb{X}} {f(x) \cdot \log \frac{1}{f(x)}} dx

(X\mathbb{X} means the set of xx where f(x)0f(x) \neq 0, also called support of ff.)

In continuous case the base is often ee rather than 2. Here log\log by default means loge\log_e.

In discrete case, 0<=P(x)<=10<=P(x)<=1, log1P(x)>0\log \frac{1}{P(x)} > 0, so entropy can never be negative. But in continuous case, probability density function can take value larger than 1, so entropy may be negative.

  • A fair coin toss with two cases has 1 bit of information entropy: $0.5 \log_2(\frac{1}{0.5}) + 0.5 \log_2(\frac{1}{0.5}) = 1$ bit.
  • If the coin is biased, for example heads has 90% probability and tails 10%, then its entropy is $0.9 \log_2(\frac{1}{0.9}) + 0.1 \log_2(\frac{1}{0.1}) \approx 0.47$ bits.
  • If it's even more biased, having 99.99% probability of heads and 0.01% of tails, then its entropy is $0.9999 \log_2(\frac{1}{0.9999}) + 0.0001 \log_2(\frac{1}{0.0001}) \approx 0.0015$ bits.
  • If a coin toss is fair but has a 0.01% chance of the coin standing up on the table, giving 3 cases with probabilities 0.0001, 0.49995, 0.49995, then its entropy is $0.0001 \log_2(\frac{1}{0.0001}) + 0.49995 \log_2(\frac{1}{0.49995}) + 0.49995 \log_2(\frac{1}{0.49995}) \approx 1.0014$ bits. (The standing-up event itself carries about 13.3 bits of information, but its probability is low so it contributes little to the entropy. These numbers are computed in the sketch below.)
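A small helper to reproduce these numbers (a minimal sketch in NumPy):

```python
import numpy as np

def entropy_bits(probs):
    # Shannon entropy in bits; zero-probability cases are dropped since lim x*log(x) = 0.
    p = np.asarray(probs, dtype=float)
    p = p[p > 0]
    return float(np.sum(p * np.log2(1.0 / p)))

print(entropy_bits([0.5, 0.5]))                  # 1.0
print(entropy_bits([0.9, 0.1]))                  # ~0.469
print(entropy_bits([0.9999, 0.0001]))            # ~0.0015
print(entropy_bits([0.0001, 0.49995, 0.49995]))  # ~1.0014
```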

If X and Y are independent, then H((X,Y))=E[I((X,Y))]=E[I(X)+I(Y)]=E[I(X)]+E[I(Y)]=H(X)+H(Y)H((X,Y))=E[I((X,Y))]=E[I(X)+I(Y)]=E[I(X)]+E[I(Y)]=H(X)+H(Y). If one fair coin toss has 1 bit entropy, then n independent tosses has n bit entropy.

If I split one case into two cases, entropy increases. If I merge two cases into one case, entropy reduces. Because p1log1p1+p2log1p2>(p1+p2)log1p1+p2p_1\log \frac{1}{p_1} + p_2\log \frac{1}{p_2} > (p_1+p_2) \log \frac{1}{p_1+p_2} (if p10,p20p_1 \neq 0, p_2 \neq 0), which is because that f(x)=log1xf(x)=\log \frac{1}{x} is convex, so p1p1+p2log1p1+p2p1+p2log1p2>log1p1+p2\frac{p_1}{p_1+p_2}\log\frac{1}{p_1}+\frac{p_2}{p_1+p_2}\log\frac{1}{p_2}>\log\frac{1}{p_1+p_2} , then multiply two sides by p1+p2p_1+p_2 gets the above result.

The information entropy is the theoretical minimum of information required to encode a sample. For example, to encode the result of a fair coin toss, we use 1 bit: 0 for heads and 1 for tails (reversing is also fine). If the coin is biased toward heads, to compress the information, we can use 0 for two consecutive heads, 10 for one head, and 11 for one tail, which requires fewer bits on average per sample. That may not be optimal, but the most optimal lossless compression cannot do better than the information entropy.

In continuous case, if kk is a positive constant, H(kX)=H(X)+logkH(kX) = H(X) + \log k:

Y=kX \quad (k>0) \qquad f_Y(y) = \frac{1}{k}f_X\left(\frac{y}{k}\right) \qquad H(Y) = \int {f_Y(y) \cdot \log\frac{1}{f_Y(y)}} dy = \int{\frac{1}{k}f_X(x)\log\frac{1}{\frac{1}{k}f_X(x)}} d(kx) = \int f_X(x) \left(\log\frac{1}{f_X(x)} + \log k \right) dx = \int f_X(x) \log \frac{1}{f_X(x)}dx + (\log k) \int f_X(x) dx = H(X) + \log k

Entropy is invariant to offset of random variable. H(X+k)=H(X)H(X+k)=H(X)

Joint information entropy

A joint distribution of X and Y is a distribution where each outcome is a pair of X and Y. Its entropy is called joint information entropy. Here I will use H((X,Y))H((X,Y)) to denote joint entropy (to avoid confusing with cross entropy).

H((X,Y))=E(X,Y)[log1P((X,Y))]=x,yP((X,Y)=(x,y))log1P((X,Y)=(x,y))H((X,Y)) = E_{(X,Y)}\left[\log\frac{1}{P((X,Y))}\right] = \sum_{x,y}P((X,Y)=(x,y)) \log \frac{1}{P((X,Y)=(x,y))}

If I fix the value of Y as yy, then see the distribution of X:

H(XY=y)=EX[log1P(XY=y)]=xP(X=xY=y)log1P(X=xY=y)H(X|Y=y) = E_X\left[\log \frac{1}{P(X|Y=y)} \right]=\sum_xP(X=x|Y=y) \log\frac{1}{P(X=x|Y=y)}

Taking the mean of that over different Y gives the conditional entropy:

H(XY)=Ey[H(XY=y)]=x,yP(Y=y)P(X=xY=y)log1P(X=xY=y)H(X|Y) = E_y[H(X|Y=y)] = \sum_{x,y} P(Y=y) P(X=x|Y=y) \log\frac{1}{P(X=x|Y=y)}

Applying conditional probability rule: P((X,Y))=P(XY)P(Y)P((X,Y)) = P(X \vert Y) P(Y)

=x,yP((X,Y))log1P(X=xY=y)= \sum_{x,y} P((X,Y)) \log \frac{1}{P(X=x|Y=y)}

So the conditional entropy is defined like this:

H(XY)=EX,Y[log1P(XY)]=x,yP((X,Y)=(x,y))log1P(X=xY=y)H(X|Y) = E_{X,Y}\left[\log\frac{1}{P(X|Y)}\right] = \sum_{x,y} P((X,Y)=(x,y)) \log \frac{1}{P(X=x|Y=y)}

$P((X, Y)) = P(X \vert Y) P(Y)$. Similarly, $H((X,Y))=H(X \vert Y)+H(Y)$. The exact deduction is as follows:

H(X|Y) + H(Y) = E_{X,Y}\left[ \log\frac{1}{P(X|Y)} \right] + E_Y\left[\log\frac{1}{P(Y)}\right] = E_{X,Y}\left[ \log\frac{1}{P(X|Y)} \right] + E_{X,Y}\left[\log\frac{1}{P(Y)}\right] = E_{X,Y}\left[ \log\frac{1}{P(X|Y)}+\log\frac{1}{P(Y)} \right] = E_{X,Y}\left[ \log\frac{1}{P(X|Y)P(Y)}\right] = E_{X,Y}\left[ \log \frac{1}{P((X,Y))} \right] = H((X,Y))

If XX and YY are not independent, then the joint entropy is smaller than if they are independent: H((X,Y))<H(X)+H(Y)H((X, Y)) < H(X) + H(Y). If X and Y are not independent then knowing X will also give some information about Y. This can be deduced by mutual information which will be explained below.

KL divergence

Here IA(x)I_A(x) denotes the amount of information of value (event) xx in distribution AA. The difference of information of the same value in two distributions AA and BB:

IB(x)IA(x)=log1PB(x)log1PA(x)=logPA(x)PB(x)I_B(x) - I_A(x) = \log \frac{1}{P_B(x)} - \log \frac{1}{P_A(x)} = \log \frac{P_A(x)}{P_B(x)}

The KL divergence from AA to BB is the expected value of that regarding the probabilities of AA. Here EAE_A means the expected value calculated using AA's probabilities:

DKL(AB)=EA[IB(x)IA(x)]=xPA(x)logPA(x)PB(x)D_{KL}(A \parallel B) = E_A[I_B(x) - I_A(x)] = \sum_x P_A(x) \log \frac{P_A(x)}{P_B(x)}

You can think KL divergence as:

  • The "distance" between two distributions.
  • If I "expect" the distribution is B, but the distribution is actually A, how much "surprise" do I get on average.
  • If I design a lossless compression algorithm optimized for B, but use it to compress data from A, then the compression will not be optimal and will contain redundant information. KL divergence measures how much redundant information it has on average.

KL divergence is also called relative entropy.

KL divergence is asymmetric: $D_{KL}(A\parallel B)$ is different from $D_{KL}(B\parallel A)$. Often the first distribution is the real underlying distribution, and the second distribution is an approximation or the model output.

If A and B are the same, the KL divergence between them is zero. Otherwise, KL divergence is positive. KL divergence can never be negative, as will be explained later.

$P_B(x)$ appears in the denominator. If there exists x with $P_B(x) = 0$ and $P_A(x) \neq 0$, the KL divergence can be seen as infinite. It can be read as "the model expects something to never happen, but it actually can happen". If there is no such case, we say that A is absolutely continuous with respect to B, written as $A \ll B$. This requires the outcomes possible under B to include all outcomes possible under A.

Another concept is cross entropy. The cross entropy from A to B, denoted H(A,B)H(A, B), is the entropy of A plus KL divergence from A to B:

H(A, B) = H(A) + D_{KL}(A \parallel B) = E_A[I_B(x)] \qquad H(A, B) = \sum_x P_A(x) \cdot \log \frac{1}{P_B(x)}

Information entropy H(X)H(X) can also be expressed as cross entropy of itself H(X,X)H(X, X), similar to the relation between variance and covariance.

(In some places H(A,B)H(A,B) denotes joint entropy. I use H((A,B))H((A,B)) for joint entropy to avoid ambiguity.)

Cross entropy is also asymmetric.

In deep learning cross entropy is often used as loss function. If each piece of training data's distribution's entropy H(A)H(A) is fixed, minimizing cross entropy is the same as minimizing KL divergence.
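A small sketch computing cross entropy and KL divergence for discrete distributions (NumPy; the two example distributions are arbitrary):

```python
import numpy as np

def kl_divergence(p_a, p_b):
    # D_KL(A || B) in bits; terms with P_A(x) = 0 are skipped.
    p_a, p_b = np.asarray(p_a, float), np.asarray(p_b, float)
    m = p_a > 0
    return float(np.sum(p_a[m] * np.log2(p_a[m] / p_b[m])))

def cross_entropy(p_a, p_b):
    # H(A, B) = H(A) + D_KL(A || B), in bits.
    p_a, p_b = np.asarray(p_a, float), np.asarray(p_b, float)
    m = p_a > 0
    return float(np.sum(p_a[m] * np.log2(1.0 / p_b[m])))

a = [0.5, 0.5]     # "true" distribution
b = [0.9, 0.1]     # model distribution
print(kl_divergence(a, b))   # ~0.737 bits; note the asymmetry:
print(kl_divergence(b, a))   # ~0.531 bits
print(cross_entropy(a, b))   # ~1.737 bits = H(A) + D_KL(A || B) = 1 + 0.737
```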

Jensen's inequality

Jensen's inequality states that for a concave function ff:

E[f(X)]f(E[X])E[f(X)] \leq f(E[X])

The reverse applies for convex.

Here is a visual example showing Jensen's inequality. For example I have a discrete distribution with 5 cases X1,X2,X3,X4,X5X_1,X_2,X_3,X_4,X_5 (these are possible outcomes of distribution, not samples), corresponding to X coordinates of the red dots. The probabilities of the 5 cases are p1,p2,p3,p4,p5p_1, p_2, p_3, p_4, p_5 that sum to 1.

E[X]=p1X1+p2X2+p3X3+p4X4+p5X5E[X] = p_1 X_1 + p_2 X_2 + p_3 X_3 + p_4 X_4 + p_5 X_5.

E[f(x)]=p1f(X1)+p2f(X2)+p3f(X3)+p4f(X4)+p5f(X5)E[f(x)] = p_1 f(X_1) + p_2 f(X_2) + p_3 f(X_3) + p_4 f(X_4) + p_5 f(X_5).

Then (E[X],E[f(x)])(E[X], E[f(x)]) can be seen as an interpolation between five points (X1,f(X1)),(X2,f(X2)),(X3,f(X3)),(X4,f(X4)),(X5,f(X5))(X_1, f(X_1)), (X_2, f(X_2)), (X_3, f(X_3)), (X_4, f(X_4)), (X_5, f(X_5)), using weights p1,p2,p3,p4,p5p_1, p_2, p_3, p_4, p_5. The possible area of the interpolated point correspond to the green convex polygon:

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Polygon

np.random.seed(42)

def concave_function(x):
    return -x**2

x_range = np.linspace(-3, 3, 1000)
y_range = concave_function(x_range)

x_points = np.random.uniform(-2.5, 2.5, 5)
x_points = np.sort(x_points)
y_points = concave_function(x_points)

average_x = np.mean(x_points)
f_of_average_x = concave_function(average_x)

average_of_f = np.mean(y_points)

plt.figure(figsize=(10, 6))

plt.plot(x_range, y_range, 'b-', linewidth=2, label='Concave function f(x)')

plt.scatter(x_points, y_points, color='red', s=50, zorder=3, label='($X_i$, $f(X_i)$)')

polygon_vertices = list(zip(x_points, y_points))
polygon = Polygon(polygon_vertices, closed=True, alpha=0.3, facecolor='green', edgecolor='green',
linewidth=2, label='Where ($E[X]$, $E[f(X)]$) can be')
plt.gca().add_patch(polygon)

inequality_text = "($E[X]$, $E[f(X)]$) is below ($E[X]$, $f(E[X])$)"
plt.text(0, min(y_range) + 1, inequality_text,
horizontalalignment='center', fontsize=12, bbox=dict(facecolor='white', alpha=0.7))

plt.grid(True, alpha=0.3)
plt.legend(loc='upper right')
plt.title("Visualization of Jensen's Inequality for a Concave Function", fontsize=14)
plt.xlabel('x', fontsize=12)
plt.ylabel('f(x)', fontsize=12)

plt.tight_layout()
#plt.show()

plt.savefig("jensen_inequality.svg")

For each point in green polygon (E[X],E[f(X)])(E[X], E[f(X)]), the point on function curve with the same X coordinate (E[X],f(E[X]))(E[X], f(E[X])) is above it. So E[f(X)]f(E[X])E[f(X)] \leq f(E[X]).

The same applies when you add more cases to the discrete distribution, the convex polygon will have more points but still below the function curve. The same applies to continuous distribution when there are infinitely many cases.

Jensen's inequality tells that KL divergence is non-negative:

There is a trick: extracting a factor of -1 puts $P_A$ in the denominator, where it will cancel later.

DKL(AB)=EA[logPA(x)PB(x)]=EA[logPB(x)PA(x)]D_{KL}(A\parallel B) = E_A\left[\log \frac{P_A(x)}{P_B(x)}\right] = - E_A\left[\log \frac{P_B(x)}{P_A(x)}\right]

The logarithm function is concave. Jensen's inequality gives:

EA[logPB(x)PA(x)]log(EA[PB(x)PA(x)])E_A\left[\log \frac{P_B(x)}{P_A(x)}\right] \leq \log \left( E_A \left[\frac{P_B(x)}{P_A(x)}\right] \right)

Multiplying -1 and flip:

DKL(AB)=EA[logPB(x)PA(x)]log(EA[PB(x)PA(x)])D_{KL}(A \parallel B) = - E_A\left[\log \frac{P_B(x)}{P_A(x)}\right] \geq -\log \left( E_A \left[\frac{P_B(x)}{P_A(x)}\right] \right)

The right side equals 0 because:

EA[PB(x)PA(x)]=xPA(x)PB(x)PA(x)=xPB(x)=1log(EA[PB(x)PA(x)])=log1=0E_A \left[\frac{P_B(x)}{P_A(x)}\right] = \sum_x P_A(x) \cdot \frac{P_B(x)}{P_A(x)} = \sum_x P_B(x) = 1 \quad\quad\quad -\log \left( E_A\left[\frac{P_B(x)}{P_A(x)}\right] \right) = -\log 1 = 0

Estimate KL divergence

If:

  • We have two distributions: A is the target distribution, B is the output of our model
  • We have n samples from A: $x_1, x_2, ... x_n$
  • We know the probability of each sample in each distribution, i.e. we know $P_A(x_i)$ and $P_B(x_i)$

Then how do we estimate the KL divergence $D_{KL}(A \parallel B)$?

Reference: Approximating KL Divergence

As the KL divergence is $E_A\left[\log \frac{P_A(x)}{P_B(x)}\right]$, the simple way is to calculate the average of $\log \frac{P_A(x)}{P_B(x)}$:

\hat{D}_{KL}(A \parallel B) = \hat E_{x \sim A}\left[\log\frac{P_A(x)}{P_B(x)}\right] = \frac{1}{n}\sum_{x \sim A} \log \frac{P_A(x)}{P_B(x)}

However, it may be negative in some cases, while the true KL divergence can never be negative. This may cause issues.

A better way to estimate KL divergence is:

D^KL(AB)=E^xA[logPA(x)PB(x)+PB(x)PA(x)1]\hat{D}_{KL}(A \parallel B) = \hat E_{x \sim A} \left[\log \frac{P_A(x)}{P_B(x)} + \frac{P_B(x)}{P_A(x)} - 1\right]

(PA(x)=0P_A(x) = 0 is impossible because it's sampled from A)

It's always non-negative and has no bias. The $\frac{P_B(x)}{P_A(x)}-1$ term is a control variate and is negatively correlated with $\log \frac{P_A(x)}{P_B(x)}$.

Recall control variates: if we want to estimate $E[X]$ from samples more accurately, we can find another variable Y that's correlated with X and whose theoretical mean $E[Y]$ we know, then use $\hat E[X+\lambda Y] - \lambda E[Y]$ to estimate $E[X]$. The parameter $\lambda$ is normally chosen by minimizing variance.

The mean of that control variate is zero, because $E_{x \sim A}\left[\frac{P_B(x)}{P_A(x)}-1\right]=\sum_x P_A(x) \left(\frac{P_B(x)}{P_A(x)}-1\right)=\sum_x (P_B(x) - P_A(x)) =\sum_x P_B(x) - \sum_x P_A(x)=0$

Here $\lambda=1$ is not chosen by minimizing variance, but by making the estimator non-negative. If I define $k=\frac{P_B(x)}{P_A(x)}$, then $\log \frac{P_A(x)}{P_B(x)} + \lambda(\frac{P_B(x)}{P_A(x)} - 1) = -\log k + \lambda(k-1)$. We want it to be non-negative: $-\log k + \lambda(k-1) \geq 0$ for all $k>0$; it can be seen that the line $y=\lambda (k-1)$ must stay above $y=\log k$, and the only solution is $\lambda=1$, where the line is tangent to $\log k$.
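A Monte Carlo sketch comparing the two estimators (NumPy; the two discrete distributions are arbitrary examples):

```python
import numpy as np

rng = np.random.default_rng(0)

p_a = np.array([0.5, 0.3, 0.2])     # target distribution A
p_b = np.array([0.3, 0.3, 0.4])     # model distribution B
true_kl = float(np.sum(p_a * np.log(p_a / p_b)))

x = rng.choice(len(p_a), size=2000, p=p_a)   # samples from A
ratio = p_a[x] / p_b[x]                      # P_A(x) / P_B(x) for each sample

naive = np.log(ratio)                        # unbiased, but individual terms can be negative
better = np.log(ratio) + 1.0 / ratio - 1.0   # adds the control variate P_B/P_A - 1; each term >= 0

print(true_kl, naive.mean(), better.mean())  # both means are close to the true KL divergence
print(naive.std(), better.std())             # the second estimator typically has lower variance
```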

KL divergence in reinforcement learning

In reinforcement learning, the KL divergence is often used in the loss function for regularization, to make training more stable.

For a policy model, the model accepts a state as input and outputs a distribution (a vector of probabilities) telling the probability of taking each possible action in that state, called the policy:

\pi_{\text{parameter}} (\text{action} \ | \ \text{state}) = \text{probability of taking that action given state}

Training keeps updating the model parameters. A KL divergence term can be added to the loss function to keep the new model's policy from drifting too far from the old policy, preventing the model from changing too fast and making training more stable:

Loss= ...+E^state[D^KL(πnew parameter(state)πold parameter(state))]\text{Loss} = \ ... + \hat E_{\text{state}}[\hat D_{KL}(\pi_\text{new parameter}(\cdot | \text{state}) \parallel \pi_\text{old parameter}(\cdot | \text{state}))]

That KL divergence is estimated from samples, using multiple states and multiple actions:

D^KL(πnew(state)πold(state))=E^actionπnew[logπnew(actionstate)πold(actionstate)+πold(actionstate)πnew(actionstate)1]\hat D_{KL}(\pi_\text{new}(\cdot|\text{state}) \parallel \pi_\text{old}(\cdot | \text{state}))= \hat E_{\text{action} \sim \pi_\text{new}}\left[ \log \frac{\pi_\text{new}(\text{action}|\text{state})} {\pi_\text{old}(\text{action}|\text{state})} +\frac{\pi_\text{old}(\text{action}|\text{state})} {\pi_\text{new}(\text{action}|\text{state})}-1 \right]

Update: According to On a few pitfalls in KL divergence gradient estimation for RL, the gradient of KL divergence behave differently. This section may need to be revised.

Mutual information

If X and Y are independent, then H((X,Y))=H(X)+H(Y)H((X,Y))=H(X)+H(Y). But if X and Y are not independent, knowing X reduces uncertainty of Y, then H((X,Y))<H(X)+H(Y)H((X,Y))<H(X)+H(Y).

Mutual information I(X;Y)I(X;Y) measures how "related" X and Y are:

I(X;Y)=H(X)+H(Y)H((X,Y))I(X;Y) = H(X) + H(Y) - H((X, Y))

For a joint distribution, if we only care about X, then the distribution of X alone is a marginal distribution; likewise for Y.

If we treat X and Y as independent, consider a "fake" joint distribution as if X and Y are independent. Denote that "fake" joint distribution as ZZ, then P(Z=(x,y))=P(X=x)P(Y=y)P(Z=(x,y))=P(X=x)P(Y=y). It's called "outer product of marginal distribution", because its probability matrix is the outer product of two marginal distributions, so it's denoted XYX \otimes Y.

Then mutual information can be expressed as KL divergence between joint distribution (X,Y)(X, Y) and that "fake" joint distribution XYX \otimes Y:

I(X;Y)=H(X)+H(Y)-H((X,Y))=E_X[I(X)]+E_Y[I(Y)]-E_{X,Y}[I((X,Y))] = E_{X,Y}[I(X)+I(Y)-I((X,Y))] = E_{X,Y}\left[\log\frac{1}{P(X)} + \log \frac{1}{P(Y)} - \log \frac{1}{P((X,Y))} \right] = E_{X,Y}\left[\log\frac{P((X,Y))}{P(X)P(Y)} \right] = D_{KL}((X,Y) \parallel (X \otimes Y))

KL divergence is zero when two distributions are the same, and KL divergence is positive when two distributions are not the same. So:

  • Mutual information I(X;Y)I(X;Y) is zero if the joint distribution (X,Y)(X,Y) is the same as XYX\otimes Y, which means X and Y are independent.
  • Mutual information I(X;Y)I(X;Y) is positive if X and Y are not independent.
  • Mutual information is never negative, because KL divergence is never negative.

H((X,Y))=H(X)+H(Y)I(X;Y)H((X,Y))=H(X)+H(Y)-I(X;Y), so if X and Y are not independent then H((X,Y))<H(X)+H(Y)H((X,Y))<H(X)+H(Y).

Mutual information is symmetric, I(X;Y)=I(Y;X)I(X;Y)=I(Y;X).

As H((X,Y))=H(XY)+H(Y)H((X,Y)) = H(X \vert Y) + H(Y), so I(X;Y)=H(X)+H(Y)H((X,Y))=H(X)H(XY)I(X;Y) = H(X) + H(Y) - H((X,Y)) = H(X) - H(X \vert Y).

If knowing Y completely determines X, knowing Y make the distribution of X collapse to one case with 100% probability, then H(XY)=0H(X \vert Y) = 0, then I(X;Y)=H(X)I(X;Y)=H(X).
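A small sketch computing mutual information from a joint probability table (NumPy; the two example tables are arbitrary):

```python
import numpy as np

def mutual_information_bits(joint):
    # I(X;Y) = D_KL(joint || outer product of marginals), in bits.
    joint = np.asarray(joint, dtype=float)
    px = joint.sum(axis=1, keepdims=True)    # marginal distribution of X
    py = joint.sum(axis=0, keepdims=True)    # marginal distribution of Y
    outer = px * py                          # "fake" joint distribution as if independent
    m = joint > 0
    return float(np.sum(joint[m] * np.log2(joint[m] / outer[m])))

independent = [[0.25, 0.25],
               [0.25, 0.25]]
dependent   = [[0.45, 0.05],
               [0.05, 0.45]]
print(mutual_information_bits(independent))  # 0.0
print(mutual_information_bits(dependent))    # ~0.53 bits
```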

Information Bottleneck theory in deep learning

Information Bottleneck theory tells that the training of neural network will learn an intermediary representation that:

  • Minimize I(Input ; IntermediaryRepresentation)I(\text{Input} \ ; \ \text{IntermediaryRepresentation}). Try to compress the intermediary representation and reduce unnecessary information related to input.
  • Maximize $I(\text{IntermediaryRepresentation} \ ; \ \text{Output})$. Try to keep as much of the information in the intermediary representation that's relevant to the output as possible.

Convolution

If we have two independent random variables X and Y, and consider the distribution of the sum $Z=X+Y$, then

P(Z=z)=x,y, if z=x+yP(X=x)P(Y=y)P(Z=z)=\sum_{x,y, \ \text{if} \ z=x+y} P(X=x)P(Y=y)

For each z, it sums over different x and y within the constraint z=x+yz=x+y.

The constraint z=x+yz=x+y allows determining yy from xx and zz: y=zxy=z-x, so it can be rewritten as:

P(Z=z)=xP(X=x)P(Y=zx)P(Z=z)=\sum_{x}P(X=x) P(Y=z-x)

In continuous case

fZ(z)=fX(x)fY(zx)dxf_Z(z) = \int_{-\infty}^{\infty} f_X(x) f_Y(z-x) dx

The probability density function of the sum fZf_Z is denoted as convolution of fXf_X and fYf_Y:

fZ=fXfYPZ=PXPYf_Z = f_X * f_Y \quad\quad\quad P_Z=P_X * P_Y

The convolution operator * can:

  • In continuous case, convolution takes two probability density functions, and give a new probability density function.
  • In discrete case, convolution can take two functions and give a new function. Each function inputs an outcome and outputs the probability of that outcome.
  • In the discrete case, convolution can take two vectors and give a new vector, where each vector's i-th element corresponds to the probability of the i-th outcome (see the sketch below).
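For example, the distribution of the sum of two independent fair dice is the convolution of the two probability vectors (a minimal NumPy sketch):

```python
import numpy as np

die = np.full(6, 1.0 / 6.0)          # P(1), ..., P(6) for one fair die

two_dice = np.convolve(die, die)     # 11 probabilities, for the sums 2..12
print(two_dice)                      # peaks at 6/36 for a sum of 7
print(two_dice.sum())                # 1.0
```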

Convolution can also work in 2D or more dimensions. If X=(x1,x2)X=(x_1,x_2) and Y=(y1,y2)Y=(y_1,y_2) are 2D random variables (two joint distributions), Z=X+Y=(z1,z2)Z=X+Y=(z_1,z_2) is convolution of X and Y:

fz(z1,z2)=fX(x1,x2)fY(z1x1,z2x2)dx1dx2f_z(z_1,z_2) = \int_{-\infty}^\infty \int_{-\infty}^\infty f_X(x_1,x_2) \cdot f_Y(z_1-x_1,z_2-x_2) dx_1dx_2

Convolution can also work on cases where the values are not probabilities. Convolutional neural network uses discrete version of convolution on matrices.

Likelihood

Normally when talking about probability we mean the probability of an outcome under a modelled distribution: P(outcome  modelled distribution)P(\text{outcome} \ \vert \ \text{modelled distribution}). But sometimes we have some concrete samples from a distribution but want to know which model suits the best, so we talk about the probability that a model is true given some samples: P(modelled distributionoutcome)P(\text{modelled distribution} \vert \text{outcome}).

If I have some samples, then some parameters make the samples more likely to come from the modelled distribution, and some parameters make the samples less likely to come from the modelled distribution.

For example, if I model a coin flip using a parameter $\theta$ and I observe that 10 coin flips have 9 heads and 1 tail, then $\theta=0.9$ is more likely than $\theta=0.5$. That's straightforward for a simple model. But for more complex models, we need a way to measure likelihood.

Likelihood L(θx1,x2,...,xn)L(\theta \vert x_1,x_2,...,x_n) measures:

  • How likely that we get samples x1,x2,...,xnx_1, x_2, ... , x_n from the modelled distribution using parameter θ\theta.
  • how likely a parameter θ\theta is the real underlying parameter, given some independent samples x1,x2,...,xnx_1,x_2,...,x_n.
L(θx1,x2,...xn)=f(x1θ)f(x2θ)...f(xnθ)=if(xiθ)L(\theta | x_1,x_2,...x_n) = f(x_1|\theta) \cdot f(x_2|\theta) \cdot ... \cdot f(x_n|\theta) = \prod_i f(x_i|\theta)

For example, if I model a coin flip distribution using a parameter θ\theta, the probability of head is θ\theta and tail is 1θ1-\theta. If I observe 10 coin flip has 9 heads and 1 tail, then the likelihood of θ\theta:

L(θ 9 heads and 1 tail )=θ9(1θ)L(\theta | \text{ 9 heads and 1 tail }) = \theta^9 \cdot (1-\theta)
  • If I assume that the coin flip is fair, θ=0.5\theta=0.5, then likelihood is about 0.000977.
  • If I assume $\theta=0.9$, then the likelihood is about 0.0387, which is larger.
  • If I assume θ=0.999\theta=0.999 then likelihood is about 0.00099, which is smaller than when assuming θ=0.9\theta=0.9.

The more likely a parameter is, the higher its likelihood. If θ\theta equals the true underlying parameter then likelihood takes maximum.

By taking the logarithm, multiplication becomes addition, making it easier to analyze. The log-likelihood function:

logL(θx1,x2,...xn)=ilogf(xiθ)\log L(\theta | x_1,x_2,...x_n) = \sum_i \log f(x_i|\theta)
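A small sketch of evaluating and maximizing the log-likelihood for the coin example (NumPy; the grid search over θ is just for illustration):

```python
import numpy as np

heads, tails = 9, 1                   # observed data: 9 heads and 1 tail

def log_likelihood(theta):
    return heads * np.log(theta) + tails * np.log(1.0 - theta)

thetas = np.linspace(0.001, 0.999, 999)
print(thetas[np.argmax(log_likelihood(thetas))])  # ~0.9, the maximum-likelihood estimate

print(np.exp(log_likelihood(0.5)))    # ~0.000977, likelihood of a fair coin
print(np.exp(log_likelihood(0.9)))    # ~0.0387
```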

Score and Fisher information

The score function is the derivative of log-likelihood with respect to parameter, for one sample:

s(θ;x)=logL(θx)θ=logf(xθ)θ=1f(xθ)f(xθ)θs(\theta;x) = \frac{\partial \log L(\theta | x)}{\partial \theta} = \frac{\partial\log f(x | \theta)}{\partial \theta} = \frac{1}{f(x|\theta)} \cdot \frac{\partial f(x|\theta)}{\partial \theta}

If θ\theta equals true underlying parameter, then mean of likelihood Ex[L(θx)]E_x[L(\theta \vert x)] takes maximum, mean of log-likelihood Ex[logL(θx)]E_x[\log L(\theta \vert x)] also takes maximum.

A differentiable function's maximum point has zero derivative, so when $\theta$ is the true parameter, the mean of the score function $E_x[s(\theta;x)]= \frac{\partial E_x[\log f(x \vert \theta)]}{\partial \theta}$ is zero.

The Fisher information I(θ)\mathcal{I}(\theta) is the mean of the square of score:

I(θ)=Ex[s(θ;x)2]\mathcal{I}(\theta) = E_x[s(\theta;x)^2]

(The mean is calculated over different outcomes, not different parameters.)

We can also think that Fisher information is always computed under the assumption that θ\theta is the true underlying parameter, then Ex[s(θ;x)]=0E_x[s(\theta;x)]=0, then Fisher information is the variance of score I(θ)=Varx[s(θ;x)]\mathcal{I}(\theta)=\text{Var}_x[s(\theta;x)].

Fisher information $\mathcal{I}(\theta)$ also measures the curvature of the log-likelihood, in parameter space, around $\theta$.

Fisher information measures how much information a sample can tell us about the underlying parameter.

Linear score

When the parameter is an offset and the offset is infinitely small, the score function is called the linear score. If the infinitely small offset is $\theta$, the offset probability density is $f_2(x \vert \theta) = f(x+\theta)$, then

s_\text{linear}(x)=s(\theta;x) = \frac{\partial \log f_2(x|\theta)}{\partial \theta} = \frac{\partial \log f(x+\overbrace{\theta}^{\approx 0})}{\partial \theta} = \frac{d\log f(x)}{dx}

In places that use the score function (and Fisher information) but do not specify which parameter, they usually refer to the linear score function.

Max-entropy distributions

Recall that if we make probability distribution more "spread out" the entropy will increase. If there is no constraint, maximizing entropy of real-number distribution will get a distribution that's "infinitely spread out over all real numbers". But if there are constraints, maximizing entropy will give some common and important distributions:

| Constraint | Max-entropy distribution |
| --- | --- |
| $a \leq X \leq b$ | Uniform distribution |
| $E[X]=\mu,\ \text{Var}[X]=\sigma^2$ | Normal distribution |
| $X \geq 0, \ E[X]=\mu$ | Exponential distribution |
| $X \geq m > 0, \ E[\log X] = g$ | Pareto (Type I) distribution |

There are other max-entropy distributions. See Wikipedia.

We can rediscover these max-entropy distributions by using Lagrange multipliers and the functional derivative.

Lagrange multiplier

To find the distribution with maximum entropy under a constraint, we can use Lagrange multipliers. If we want to find the maximum or minimum of $f(x)$ under the constraint that $g(x)=0$, we can define the Lagrangian function $\mathcal{L}$:

L(x,λ)=f(x)+λg(x)\mathcal{L}(x,\lambda) = f(x) + \lambda \cdot g(x)

Its two partial derivatives have special properties:

L(x,λ)x=f(x)x+λg(x)xL(x,λ)λ=g(x)\frac{\partial \mathcal{L}(x,\lambda)}{\partial x} = \frac{\partial f(x)}{\partial x} + \lambda \frac{\partial g(x)}{\partial x} \quad\quad\quad \frac{\partial \mathcal{L}(x,\lambda)}{\partial \lambda} = g(x)

Then solving the equations $\frac{\partial \mathcal{L}(x,\lambda)}{\partial x}=0$ and $\frac{\partial \mathcal{L}(x,\lambda)}{\partial \lambda}=0$ will find the maximum or minimum under the constraint. Similarly, if there are many constraints, there are multiple $\lambda$s. Similar things also apply to functions with multiple arguments. The argument x can be a number or even a function, which involves the functional derivative:

Functional derivative

A functional is a function that inputs a function and outputs a value; (one of) its inputs is a function rather than a value (it's a higher-order function). The functional derivative (also called variational derivative) means the derivative of a functional with respect to its argument function.

To compute functional derivative, we add a small "perturbation" to the function. f(x)f(x) becomes f(x)+ϵη(x)f(x)+ \epsilon \cdot \eta(x), where epsilon ϵ\epsilon is an infinitely small value that approaches zero, and eta η(x)\eta(x) is a test function. The test function can be any function that satisfy some properties.

The definition of functional derivative:

G(f+ϵη)ϵ=Gfη(x)dx\frac{\partial G(f+\epsilon \eta)}{\partial \epsilon} = \int \boxed{\frac{\partial G}{\partial f}} \cdot \eta(x) dx

Note that it's inside integration.

For example, this is a functional: G(f)=xf(x)dxG(f) = \int x f(x) dx. To compute functional derivative G(f)f\frac{\partial G(f)}{\partial f}, we firstly compute G(f+ϵη)ϵ\frac{\partial G(f+\epsilon \eta)}{\partial \epsilon} then try to make it into the form of Gfη(x)dx\int \boxed{\frac{\partial G}{\partial f}} \cdot \eta(x) dx

G(f+ϵη)ϵ=x(f(x)+ϵη(x))dxϵ=xη(x)dx\frac{\partial G(f+\epsilon \eta)}{\partial \epsilon}= \frac{\partial \int x (f(x)+\epsilon \eta(x))dx }{\partial \epsilon}= \int x \cdot \eta(x) dx

Then by pattern matching with the definition, we get Gf=x\frac{\partial G}{\partial f}=x.

Calculate functional derivative for G(f)=x2f(x)dxG(f)=\int x^2f(x)dx:

\frac{\partial G(f+\epsilon\eta)}{\partial \epsilon} = \frac{\partial\int \left(x^2(f(x)+\epsilon\eta(x))- x^2f(x)\right)dx}{\partial \epsilon} = \int x^2 \eta(x) dx

Then Gf=x2\frac{\partial G}{\partial f}=x^2.

Calculate functional derivative for G(f)=(f(x)logf(x))dxG(f) = \int (-f(x) \log f(x)) dx:

G(f+ϵη)ϵ=(1)(f(x)+ϵη(x))log(f(x)+ϵη(x))dxϵ\frac{\partial G(f+\epsilon \eta)}{\partial \epsilon} = \frac{\partial \int (-1) (f(x)+\epsilon \eta(x)) \log(f(x)+\epsilon\eta(x)) dx}{\partial \epsilon} =(1)(η(x)log(f(x)+ϵη(x))+η(x)f(x)+ϵη(x)(f(x)+ϵη(x)))dx= \int (-1) \left( \eta(x) \log (f(x)+\epsilon \eta(x)) + \frac{\eta(x)}{f(x)+\epsilon \eta(x)} (f(x)+\epsilon \eta(x)) \right) dx =(log(f(x)+ϵη(x))1)η(x)dx= \int \left( -\log(f(x)+\epsilon \eta(x)) - 1 \right) \eta(x) dx

As log\log is continuous, and ϵη(x)\epsilon \eta(x) is infinitely small, so log(f(x)+ϵη(x))=log(f(x))\log(f(x)+\epsilon \eta(x))=\log (f(x)):

G(f+ϵη)ϵ=(logf(x)1)η(x)dxGf=logf(x)1\frac{\partial G(f+\epsilon \eta)}{\partial \epsilon} = \int (-\log f(x) - 1) \eta(x) dx \quad\quad\quad \frac{\partial G}{\partial f} = -\log f(x) - 1

Get uniform distribution by maximizing entropy

If we constrain the range, $a \leq X \leq b$, then maximize the entropy using the functional derivative.

We have constraint abf(x)dx=1\int_a^b f(x)dx=1, which is abf(x)dx1=0\int_a^b f(x)dx-1=0.

L(f)=abf(x)log1f(x)dx+λ1(abf(x)dx1)\mathcal{L}(f) = \int_a^b f(x) \log \frac 1 {f(x)} dx + \lambda_1 \left(\int_a^b f(x)dx-1 \right) =ab(f(x)logf(x)+λ1f(x))dxλ1= \int_a^b (-f(x) \log f(x) + \lambda_1 f(x)) dx - \lambda_1

Compute derivatives

Lf=logf(x)1+λ1Lλ1=abf(x)dx1\frac{\partial \mathcal{L}}{\partial f} = -\log f(x) -1 + \lambda_1 \quad\quad\quad \frac{\partial \mathcal{L}}{\partial \lambda_1} = \int_a^b f(x)dx-1

Solve Lf=0\frac{\partial \mathcal{L}}{\partial f}=0:

logf(x)1+λ1=0logf(x)=λ11f(x)=eλ11-\log f(x) -1 + \lambda_1=0 \quad\quad\quad \log f(x) = \lambda_1 - 1 \quad\quad\quad f(x) = e^{\lambda_1-1}

Solve Lλ1=0\frac{\partial \mathcal{L}}{\partial \lambda_1}=0:

abf(x)dx=1abeλ11dx=1(ba)eλ11=1eλ11=1ba\int_a^b f(x)dx=1 \quad\quad\quad \int_a^b e^{\lambda_1-1} dx = 1 \quad\quad\quad (b-a) e^{\lambda_1-1} = 1 \quad\quad\quad e^{\lambda_1-1}=\frac 1 {b-a}

The result is f(x)=1ba   (axb)f(x) = \frac 1 {b-a} \ \ \ (a \leq x \leq b).

Normal distribution

The normal distribution, also called the Gaussian distribution, is important in statistics. It's the distribution with maximum entropy if we constrain its variance $\sigma^2$ to be a fixed finite value.

It has two parameters: the mean $\mu$ and the standard deviation $\sigma$. $N(\mu, \sigma^2)$ denotes a normal distribution. Changing $\mu$ moves the PDF along the X axis. Changing $\sigma$ scales the PDF along the X axis.

We can rediscover the normal distribution by maximizing entropy under a variance constraint. To do that there are two prerequisite concepts: Lagrange multipliers and the functional derivative.

Rediscover normal distribution by maximizing entropy with variance constraint

For a distribution's probability density function ff, we want to maximize its entropy H(f)=f(x)log1f(x)dxH(f)=\int f(x) \log\frac{1}{f(x)}dx under the constraint:

  • It's a valid probability density function: f(x)dx=1\int_{-\infty}^{\infty} f(x)dx=1, and f(x)0f(x) \geq 0
  • The mean: xf(x)dx=μ\int_{-\infty}^{\infty} x f(x) dx = \mu
  • The variance constraint: f(x)(xμ)2dx=σ2\int_{-\infty}^{\infty} f(x) (x-\mu)^2 dx = \sigma^2

We can simplify to make deduction easier:

  • Moving the probability density function along the X axis doesn't change entropy, so we can fix the mean at 0 (we can replace xx with xμx-\mu after finishing the deduction).
  • log1f(x)\log\frac{1}{f(x)} already implicitly requires f(x)>0f(x)>0
  • It turns out that the mean constraint xf(x)dx=0\int_{-\infty}^{\infty} x f(x) dx = 0 is not necessary to deduce the result, so we do not include it in the Lagrange multipliers. (Including it is also fine but makes the deduction more complex.)

The Lagrangian function:

\mathcal{L}(f,\lambda_1,\lambda_2)= \begin{cases} \int_{-\infty}^{\infty} f(x) \log\frac{1}{f(x)}dx \\ + \lambda_1 \left(\int_{-\infty}^{\infty} f(x)dx-1\right) \\ + \lambda_2 \left(\int_{-\infty}^{\infty} f(x)x^2dx -\sigma^2\right) \end{cases} =\int_{-\infty}^{\infty} (-f(x)\log f(x) + \lambda_1 f(x) + \lambda_2 x^2 f(x) ) dx - \lambda_1 - \lambda_2\sigma^2

Then compute the functional derivative Lf\frac{\partial \mathcal{L}}{\partial f}

Lf=logf(x)1+λ1+λ2x2\frac{\partial \mathcal{L}}{\partial f} = -\log f(x) - 1 + \lambda_1 + \lambda_2 x^2

Then solve Lf=0\frac{\partial \mathcal{L}}{\partial f}=0:

Lf=0logf(x)=1+λ1+λ2x2f(x)=e(1+λ1+λ2x2)\frac{\partial \mathcal{L}}{\partial f}=0 \quad\quad\quad \log f(x) = -1+\lambda_1+\lambda_2 x^2 \quad\quad\quad f(x) = e^{(-1+\lambda_1+\lambda_2 x^2)}

We get the rough form of normal distribution's probabilify density function.

Then solve Lλ1=0\frac{\partial \mathcal{L}}{\partial \lambda_1}=0:

Lλ1=0f(x)dx=1e(1+λ1+λ2x2)dx=1\frac{\partial \mathcal{L}}{\partial \lambda_1}=0 \quad\quad\quad \int_{-\infty}^{\infty} f(x)dx=1 \quad\quad\quad \int_{-\infty}^{\infty} e^{(-1+\lambda_1+\lambda_2 x^2)} dx = 1

That integration must converge, so λ2<0\lambda_2<0.

A subproblem: solve ekx2dx\int_{-\infty}^{\infty} e^{-k x^2}dx (k>0k>0). The trick is to firstly compute its square (ekx2dx)2(\int_{-\infty}^{\infty} e^{-k x^2}dx)^2, turning the integral into a two-dimensional one, and then substitute polar coordinates x=rcosθ, y=rsinθ, x2+y2=r2, dx dy=r dr dθx=r \cos \theta, \ y = r \sin \theta, \ x^2+y^2=r^2, \ dx\ dy = r \ dr \ d\theta :

(ekx2dx)2=ek(x2+y2)dx dy=θ=0θ=2πr=0r=rekr2dr dθ=2π0rekr2dr\left( \int_{-\infty}^{\infty} e^{-kx^2}dx\right)^2 =\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} e^{-k(x^2+y^2)}dx\ dy = \int_{\theta=0}^{\theta=2\pi}\int_{r=0}^{r=\infty} r e^{-kr^2}dr\ d\theta = 2\pi \int_{0}^{\infty} r e^{-kr^2}dr

Then substitute u=kr2, du=2kr dr, dr=12krduu=-kr^2, \ du = -2kr\ dr, \ dr = -\frac{1}{2kr}du:

=2π0(12k)eudu=πk0eudu=πk= 2\pi \int_{0}^{-\infty} (-\frac{1}{2k}) e^udu=\frac{\pi}{k}\int_{-\infty}^0e^udu=\frac{\pi}{k}

So ekx2dx=πk\int_{-\infty}^{\infty} e^{-kx^2}dx=\sqrt{\frac{\pi}{k}}.

Put λ2=k-\lambda_2=k

e(1+λ1+λ2x2)dx=e1+λ1eλ2x2=e1+λ1πλ2=1\int_{-\infty}^{\infty} e^{(-1+\lambda_1+\lambda_2 x^2)} dx = e^{-1+\lambda_1} \int_{-\infty}^{\infty} e^{\lambda_2 x^2} = e^{-1+\lambda_1} \sqrt{\frac{\pi}{-\lambda_2}} = 1 e1+λ1=λ2πe^{-1+\lambda_1} = \sqrt{\frac{-\lambda_2}{\pi}}

Then solve Lλ2=0\frac{\partial \mathcal{L}}{\partial \lambda_2}=0:

Lλ2=0x2f(x)dx=σ2x2e(1+λ1+λ2x2)dx=σ2\frac{\partial \mathcal{L}}{\partial \lambda_2}=0 \quad\quad\quad \int_{-\infty}^{\infty} x^2f(x)dx=\sigma^2 \quad\quad\quad \int_{-\infty}^{\infty} x^2e^{(-1+\lambda_1+\lambda_2 x^2)} dx = \sigma^2

It requires another trick. For the previous result ekx2dx=πk\int_{-\infty}^{\infty} e^{-kx^2}dx=\sqrt{\frac{\pi}{k}}, take derivative to kk on two sides:

e(x2)kdx=πk12take derivative to k(x2)e(x2)kdx=12πk32\int_{-\infty}^{\infty} e^{(-x^2)k}dx=\sqrt{\pi} k^{-\frac{1}{2}} \xrightarrow{\text{take derivative to }k} \int_{-\infty}^{\infty} (-x^2)e^{(-x^2)k}dx = -\frac{1}{2}\sqrt{\pi} k^{-\frac{3}{2}}

So x2ekx2dx=12πk3\int_{-\infty}^{\infty} x^2e^{-kx^2}dx = \frac{1}{2}\sqrt{\frac{\pi}{k^3}}

\int_{-\infty}^{\infty} x^2e^{(-1+\lambda_1+\lambda_2 x^2)} dx =e^{-1+\lambda_1} \int_{-\infty}^{\infty} x^2 e^{\lambda_2x^2}dx =e^{-1+\lambda_1} \cdot \frac{1}{2} \sqrt{\frac{\pi}{-\lambda_2^3}}=\sigma^2

By using e1+λ1=λ2πe^{-1+\lambda_1} = \sqrt{\frac{-\lambda_2}{\pi}}, we get:

λ2π12πλ23=σ21λ22=2σ2\sqrt{\frac{-\lambda_2}{\pi}} \cdot \frac{1}{2} \sqrt{\frac{\pi}{-\lambda_2^3}}=\sigma^2 \quad\quad\quad \sqrt{\frac{1}{\lambda_2^2}}=2\sigma^2

Previously we know that λ2<0\lambda_2<0, then λ2=12σ2\lambda_2=-\frac{1}{2\sigma^2}. Then e1+λ1=12πσ2e^{-1+\lambda_1}=\sqrt{\frac{1}{2\pi\sigma^2}}

Then we finally deduced the normal distribution's probability density function (when mean is 0):

f(x)=e(1+λ1+λ2x2)=12πσ2e12σ2x2f(x) = e^{(-1+\lambda_1+\lambda_2 x^2)} = \sqrt{\frac{1}{2\pi\sigma^2}} e^{-\frac{1}{2\sigma^2}x^2}

When mean is not 0, substitute xx as xμx-\mu, we get the general normal distribution:

f(x)=12πσ2e12σ2(xμ)2=12πσe12(xμσ)2f(x)=\sqrt{\frac{1}{2\pi\sigma^2}} e^{-\frac{1}{2\sigma^2}(x-\mu)^2} = \frac{1}{\sqrt{2\pi}\sigma} e^{-\frac{1}{2}\left( \frac{x-\mu}{\sigma} \right)^2}

Entropy of normal distribution

We can then calculate the entropy of normal distribution:

H(X)=f(x)log1f(x)dx=f(x)log(2πσ2e(xμ)22σ2)dxH(X) = \int f(x)\log\frac{1}{f(x)}dx=\int f(x) \log( \sqrt{2\pi\sigma^2}e^{\frac{(x-\mu)^2}{2\sigma^2}})dx =f(x)(12log(2πσ2)+(xμ)22σ2)dx=12log(2πσ2)f(x)dx=1+12σ2f(x)(xμ)2=σ2dx=\int f(x) \left(\frac{1}{2}\log(2\pi\sigma^2)+\frac{(x-\mu)^2}{2\sigma^2}\right)dx=\frac{1}{2}\log(2\pi\sigma^2)\underbrace{\int f(x)dx} _ {=1}+ \frac{1}{2\sigma^2}\underbrace{\int f(x)(x-\mu)^2} _ {=\sigma^2}dx =12log(2πσ2)+12=12log(2πeσ2)=\frac{1}{2}\log(2\pi\sigma^2)+\frac{1}{2}=\frac{1}{2}\log(2\pi e \sigma^2)

If X follows a normal distribution and Y follows any distribution with the same mean and variance, the cross entropy H(Y,X)H(Y,X) has the same value: 12log(2πeσ2)\frac{1}{2}\log(2\pi e \sigma^2), regardless of the exact probability density function of Y. The deduction is similar to the above:

H(Y,X)=fY(x)log1fX(x)dx=fY(x)log(2πσ2e(xμ)22σ2)dxH(Y,X)=\int f_Y(x) \log \frac 1 {f_X(x)} dx = \int f_Y(x) \log( \sqrt{2\pi\sigma^2}e^{\frac{(x-\mu)^2}{2\sigma^2}})dx =fY(x)(12log(2πσ2)+(xμ)22σ2)dx=12log(2πσ2)fY(x)dx=1+12σ2fY(x)(xμ)2=σ2dx=\int f_Y(x) \left(\frac{1}{2}\log(2\pi\sigma^2)+\frac{(x-\mu)^2}{2\sigma^2}\right)dx=\frac{1}{2}\log(2\pi\sigma^2)\underbrace{\int f_Y(x)dx} _ {=1}+ \frac{1}{2\sigma^2}\underbrace{\int f_Y(x)(x-\mu)^2} _ {=\sigma^2}dx =12log(2πσ2)+12=12log(2πeσ2)=\frac{1}{2}\log(2\pi\sigma^2)+\frac{1}{2}=\frac{1}{2}\log(2\pi e \sigma^2)
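A small numerical check of this (my own sketch; the mean, the variance and the choice of Y as a uniform distribution are arbitrary for the example):

import numpy as np
from scipy import integrate, stats

# X is normal; Y is a uniform distribution with the same mean and variance.
# Both H(X, X) (the entropy of X) and the cross entropy H(Y, X) should equal
# 0.5 * log(2*pi*e*sigma^2), regardless of Y's exact shape.
mu, sigma = 1.0, 2.0
normal = stats.norm(mu, sigma)

half_width = np.sqrt(3) * sigma    # uniform on [mu - hw, mu + hw] has variance sigma^2
uniform = stats.uniform(mu - half_width, 2 * half_width)

def cross_entropy(f_y, f_x, lo, hi):
    value, _ = integrate.quad(lambda x: f_y(x) * (-np.log(f_x(x))), lo, hi)
    return value

print(0.5 * np.log(2 * np.pi * np.e * sigma**2))                                  # theory
print(cross_entropy(normal.pdf, normal.pdf, mu - 30, mu + 30))                    # H(X, X)
print(cross_entropy(uniform.pdf, normal.pdf, mu - half_width, mu + half_width))   # H(Y, X)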

Central limit theorem

We have a random variable XX, which has mean 0 and (finite) variance σ2\sigma^2.

If we add up nn independent samples of XX: X1+X2+...+XnX_1+X_2+...+X_n, the variance of the sum is nσ2n\sigma^2.

To keep its variance constant, we can divide it by n\sqrt n, then we get Sn=X1+X2+...+XnnS_n = \frac{X_1+X_2+...+X_n}{\sqrt n}. Here SnS_n is called the standardized sum, because its variance does not change with the sample count.

The central limit theorem says that the standardized sum approaches a normal distribution as nn increases. No matter what the original distribution of XX is (as long as its variance is finite), the standardized sum will approach a normal distribution.
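A quick simulation of this (my own sketch; the exponential distribution is an arbitrary non-normal starting distribution):

import numpy as np
import matplotlib.pyplot as plt

# Standardized sums of a very non-normal distribution (exponential shifted to mean 0,
# variance 1) get closer to the standard normal PDF as n grows.
rng = np.random.default_rng(0)
n_sums = 100_000

for n in [1, 2, 10, 100]:
    samples = rng.exponential(1.0, size=(n_sums, n)) - 1.0   # mean 0, variance 1
    s_n = samples.sum(axis=1) / np.sqrt(n)                   # standardized sum
    plt.hist(s_n, bins=200, density=True, histtype='step', label=f"n = {n}")

xs = np.linspace(-4, 4, 400)
plt.plot(xs, np.exp(-xs**2 / 2) / np.sqrt(2 * np.pi), 'k--', label="normal PDF")
plt.legend()
plt.show()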

The information of the distribution of XX will be "washed out" during the process. This "washing out" of information is also an increase of entropy. As nn increases, the entropy of the standardized sum always increases (except when X already follows a normal distribution, in which case the entropy stays at the maximum). H(Sn+1)>H(Sn)H(S_{n+1}) > H(S_n) if XX is not normally distributed.

The normal distribution has the maximum entropy under the variance constraint. As the entropy of the standardized sum increases, it approaches the maximum and the distribution approaches a normal distribution. This is similar to the second law of thermodynamics.

This is called the Entropic Central Limit Theorem. Proving it is hard and requires a lot of prerequisite knowledge. See also: Solution of Shannon's problem on the monotonicity of entropy, Generalized Entropy Power Inequalities and Monotonicity Properties of Information

In the real world, many things follow normal distribution, like height of people, weight of people, error in manufacturing, error in measurement, etc.

The height of people is affected by many complex factors (nutrition, health, genetic factors, exercise, environmental factors, etc.). The combination of these complex factors definitely cannot be simplified to a standardized sum of i.i.d zero-mean samples X1+X2+...+Xnn\frac{X_1+X_2+...+X_n}{\sqrt n}. Some factors have large effects and some factors have small effects. The factors are not necessarily independent. But the height of people still roughly follows a normal distribution. This can be semi-explained by the second law of thermodynamics: the complex interactions of many factors increase the entropy of the height. At the same time there are also many factors that constrain the variance of height. Why is there a variance constraint? In some cases variance corresponds to instability. A human that is 100 meters tall is impossible as it's physically unstable. Similarly a human that's 1 cm tall cannot maintain normal biological function. The unstable things tend to collapse and vanish (survivorship bias), and the stable things remain. That's how the variance constraint occurs in nature. In some cases, variance corresponds to energy, and the variance is constrained by conservation of energy.

Although normal distribution is common, not all distributions are normal. There are also many things that follow fat-tail distributions.

Also note that the Central Limit Theorem works in the limit as nn approaches infinity. Even if a distribution's standardized sum approaches a normal distribution, the speed of convergence matters: some distributions converge to normal quickly, and some slowly. Some fat-tail distributions have finite variance, but their standardized sums converge to a normal distribution very slowly.

Multivariate normal distribution

Below, a bold letter (like x\boldsymbol x) denotes a column vector:

x=[x1x2...xn]\boldsymbol x = \begin{bmatrix} x_1 \\ x_2 \\ ... \\ x_n \end{bmatrix}

Linear transform: for a (column) vector x\boldsymbol{x}, multiplying a matrix AA on it, AxA\boldsymbol x, is a linear transformation. A linear transformation can contain rotation, scaling and shearing. For a row vector it's xA\boldsymbol xA. Two linear transformations can be combined into one, corresponding to matrix multiplication.

Affine transform: for a (column) vector x\boldsymbol x, multiply a matrix on it and then add some offset: Ax+bA\boldsymbol x + \boldsymbol b. It's a linear transform followed by a translation. Two affine transformations can be combined into one. If y=Ax+b,z=Cy+d\boldsymbol y=A\boldsymbol x+\boldsymbol b, \boldsymbol z=C\boldsymbol y+\boldsymbol d, then z=(CA)x+(Cb+d)\boldsymbol z=(CA)\boldsymbol x +(C\boldsymbol b + \boldsymbol d)

(in some places affine transformation is called "linear transformation".)

Normal distribution has linear properties:

  • If you multiply by a constant, the result still follows a normal distribution. XN kXNX \sim N \rightarrow \ kX \sim N
  • If you add a constant, the result still follows a normal distribution. XN(X+k)NX \sim N \rightarrow (X+k) \sim N
  • If you add up two independent normal random variables, the result still follows a normal distribution. XN,YN(X+Y)NX \sim N, Y \sim N \rightarrow (X+Y) \sim N
  • A linear combination of many independent normal random variables also follows a normal distribution. X1N,X2N,...XnN(k1X1+k2X2+...+knXn)NX_1 \sim N, X_2 \sim N, ... X_n \sim N \rightarrow (k_1X_1 + k_2X_2 + ... + k_nX_n) \sim N

If:

  • We have a (column) vector x\boldsymbol x of independent random variables x=(x1,x2,...xn)\boldsymbol x=(x_1, x_2, ... x_n), where each element follows a normal distribution (not necessarily the same normal distribution),
  • then, if we apply an affine transformation on that vector, which means multiplying a matrix AA and then adding an offset b\boldsymbol b, y=Ax+b\boldsymbol y=A\boldsymbol x+\boldsymbol b,
  • then each element of y\boldsymbol y is a linear combination of independent normal random variables, yi=x1Ai,1+x2Ai,2+...xnAi,n+biy_i=x_1 A_{i,1} + x_2 A_{i, 2} + ... x_n A_{i,n} + b_i,
  • so each element in y\boldsymbol y also follows a normal distribution. Now y\boldsymbol y follows a multivariate normal distribution.

Note that the elements of y\boldsymbol y are no longer necessarily independent.

What if I apply two or many affine transformations? Two affine transformations can be combined into one. So the result is still multivariate normal distribution.

To describe a multivariate normal distribution, an important concept is covariance matrix.

Recall covariance: Cov[X,Y]=E[(XE[X])(YE[Y])]\text{Cov}[X,Y]=E[(X-E[X])(Y-E[Y])]. Some rules about covariance:

  • It's symmetric: Cov[X,Y]=Cov[Y,X]\text{Cov}[X,Y] = \text{Cov}[Y,X]
  • If X and Y are independent, Cov[X,Y]=0\text{Cov}[X,Y]=0
  • Adding a constant: Cov[X+k,Y]=Cov[X,Y]\text{Cov}[X+k,Y] = \text{Cov}[X,Y]. Covariance is invariant to translation.
  • Multiplying constant Cov[kX,Y]=kCov[X,Y]\text{Cov}[k\cdot X,Y] = k \cdot \text{Cov}[X,Y]
  • Addition Cov[X+Y,Z]=Cov[X,Z]+Cov[Y,Z]\text{Cov}[X+Y,Z] = \text{Cov}[X,Z]+\text{Cov}[Y,Z]

Covariance matrix:

Cov(x,y)=E[(xE[x])(yE[y])T]\text{Cov}(\boldsymbol x,\boldsymbol y) = E[(\boldsymbol x - E[\boldsymbol x])(\boldsymbol y-E[\boldsymbol y])^T]

Here E[x]E[\boldsymbol x] takes the mean of each element in x\boldsymbol x and outputs a vector. It's element-wise: E[x]i=E[xi]E[\boldsymbol x]_i = E[\boldsymbol x_i]. Similar for a matrix.

The covariance matrix written out:

Cov(x,y)=[Cov[x1,y1] Cov[x1,y2] ... Cov[x1,yn]Cov[x2,y1] Cov[x2,y2] ... Cov[x2,yn]   Cov[xn,y1] Cov[xn,y2] ... Cov[xn,yn]]\text{Cov}(\boldsymbol x,\boldsymbol y)=\begin{bmatrix} \text{Cov}[\boldsymbol x_1,\boldsymbol y_1] &\ \text{Cov}[\boldsymbol x_1,\boldsymbol y_2] &\ ... &\ \text{Cov}[\boldsymbol x_1,\boldsymbol y_n] \\ \text{Cov}[\boldsymbol x_2,\boldsymbol y_1] &\ \text{Cov}[\boldsymbol x_2,\boldsymbol y_2] &\ ... &\ \text{Cov}[\boldsymbol x_2,\boldsymbol y_n] \\ \vdots &\ \vdots &\ \ddots &\ \vdots \\ \text{Cov}[\boldsymbol x_n,\boldsymbol y_1] &\ \text{Cov}[\boldsymbol x_n,\boldsymbol y_2] &\ ... &\ \text{Cov}[\boldsymbol x_n,\boldsymbol y_n] \end{bmatrix}

Recall that multiplying constant and addition can be "moved out of E[]E[]": E[kX]=kE[X], E[X+Y]=E[X]+E[Y]E[kX] = k E[X], \ E[X+Y]=E[X]+E[Y]. If AA is a matrix that contains random variable and BB is a matrix that's not random, then E[AB]=E[A]B, E[BA]=BE[A]E[A\cdot B] = E[A]\cdot B, \ E[B\cdot A] = B\cdot E[A], because multiplying a matrix come down to multiplying constant and adding up, which all can "move out of E[]E[]". Vector can be seen as a special kind of matrix.

So applying it to covariance matrix:

Cov(Ax,y)=E[(AxE[Ax])(yE[y])T]=E[(AxAE[x])(yE[y])T]\text{Cov}(A \cdot \boldsymbol x,\boldsymbol y) = E[(A\cdot \boldsymbol x - E[A\cdot \boldsymbol x])(\boldsymbol y-E[\boldsymbol y])^T] = E[(A\cdot \boldsymbol x - A \cdot E[\boldsymbol x])(\boldsymbol y-E[\boldsymbol y])^T] =AE[(xE[x])(yE[y])T]=ACov(x,y)=A\cdot E[(\boldsymbol x - E[\boldsymbol x])(\boldsymbol y-E[\boldsymbol y])^T] = A \cdot \text{Cov}(\boldsymbol x, \boldsymbol y)

Similarly, Cov(x,By)=Cov(x,y)BT\text{Cov}(\boldsymbol x, B \cdot \boldsymbol y) = \text{Cov}(\boldsymbol x, \boldsymbol y) \cdot B^T.

If x\boldsymbol x follows multivariate normal distribution, it can be described by mean vector μ\boldsymbol \mu (the mean of each element of x\boldsymbol x) and covariance matrix Cov(x,x)\text{Cov}(\boldsymbol x,\boldsymbol x).

Initially, if I have some independent normal variables x1,x2,...xnx_1, x_2, ... x_n with mean values μ1,...,μn\mu_1, ..., \mu_n and variances σ12,...,σn2\sigma_1^2, ..., \sigma_n^2. If we treat them as a multivariate normal distribution, the mean vector μx=(μ1,...,μn)\boldsymbol \mu_x = (\mu_1, ..., \mu_n). The covariance matrix will be diagonal as they are independent:

Cov(x,x)=[σ12 0 ... 00 σ22 ... 0   0 0 ... σn2]\text{Cov}(\boldsymbol x,\boldsymbol x) = \begin{bmatrix} \sigma_1^2 &\ 0 &\ ... &\ 0 \\ 0 &\ \sigma_2^2 &\ ... &\ 0 \\ \vdots &\ \vdots &\ \ddots &\ \vdots \\ 0 &\ 0 &\ ... &\ \sigma_n^2 \end{bmatrix}

Then if we apply an affine transformation y=Ax+b\boldsymbol y = A \boldsymbol x + \boldsymbol b, then μy=Aμx+b\boldsymbol \mu_y = A \mu_x + \boldsymbol b. Cov(y,y)=Cov(Ax+b,Ax+b)=Cov(Ax,Ax)=ACov(x,x)AT\text{Cov}(\boldsymbol y,\boldsymbol y) = \text{Cov}(A \boldsymbol x + \boldsymbol b,A \boldsymbol x + \boldsymbol b) = \text{Cov}(A \boldsymbol x, A \boldsymbol x) = A \text{Cov}(\boldsymbol x,\boldsymbol x) A^T.
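A small empirical check of these two formulas (my own sketch; the specific mean vector, standard deviations, matrix A and offset b are arbitrary):

import numpy as np

# For y = A x + b with independent normal x, the covariance of y should be
# A Cov(x, x) A^T and the mean of y should be A mu_x + b.
rng = np.random.default_rng(1)

mu_x = np.array([1.0, -2.0, 0.5])
sigmas = np.array([1.0, 0.5, 2.0])
A = np.array([[1.0, 2.0, 0.0],
              [0.5, -1.0, 3.0]])
b = np.array([10.0, -5.0])

x = rng.normal(mu_x, sigmas, size=(1_000_000, 3))   # independent normal samples (one row each)
y = x @ A.T + b                                     # affine transform applied to every row

print(np.cov(y, rowvar=False))                      # empirical Cov(y, y)
print(A @ np.diag(sigmas**2) @ A.T)                 # A Cov(x, x) A^T
print(y.mean(axis=0), A @ mu_x + b)                 # empirical mean vs A mu_x + b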

Gaussian splatting

The industry standard of 3D modelling is to model the 3D object as many triangles, called a mesh. It only models the visible surface of the object. It uses many triangles to approximate curved surfaces.

Gaussian splatting provides an alternative method of 3D modelling. The 3D scene is modelled by a lot of multivariate (3D) gaussian distributions, each called a gaussian. When rendering, each 3D gaussian distribution is projected onto a plane (the screen) and approximately becomes a 2D gaussian distribution, where the probability density now corresponds to color opacity.

Note that the projection is a perspective projection (near things look big and far things look small). Perspective projection is not linear, so after perspective projection the 3D Gaussian distribution is no longer strictly a 2D Gaussian distribution, but it is very close to one, so it is approximated as a 2D gaussian. If the projection were linear, the result would still be gaussian.

A gaussian's color can be fixed or can change based on different view directions.

TODO

Score-based diffusion model

In diffusion model, we add gaussian noise to image (or other things). Then the diffusion model takes noisy input and we train it to output the noise added to it. There will be many steps of adding noise and the model should output the noise added in each step.

Tweedie's formula shows that estimating the added noise is essentially the same as computing the score (the derivative of the log-likelihood) of the noised image distribution.

To simplify, here we only consider one dimension and one noise step (the same also applies to many dimensions and many noise steps).

If the original value is x0x_0, we add a noise ϵN(0,σ2)\epsilon \sim N(0, \sigma^2), the noise-added value is x1=x0+ϵx_1 = x_0 + \epsilon, x1N(x0,σ2)x_1 \sim N(x_0, \sigma^2).

The diffusion model only knows x1x_1 and doesn't know x0x_0. The diffusion model needs to estimate ϵ\epsilon from x1x_1.

Here:

  • p0(x0)p_0(x_0) is the probability density of the original clean value (for image generation, it corresponds to the probability distribution of images that we want to generate)
  • p1(x1)p_1(x_1) is the probability density of the noise-added value
  • p10(x1x0)p_{1 \vert 0}(x_1 \vert x_0) is the probability density of the noise-added value, given clean training data x0x_0. It's a normal distribution given x0x_0. It can also be seen as a function that takes two arguments x0,x1x_0, x_1.
  • p01(x0x1)p_{0 \vert 1}(x_0 \vert x_1) is the probability density of the original clean value given the noise-added value. It can also be seen as a function that takes two arguments x0,x1x_0, x_1.

(I use p10(x1x0)p_{1 \vert 0}(x_1 \vert x_0) instead of the shorter p(x1x0)p(x_1 \vert x_0) to reduce confusion between different distributions.)

p10(x1x0)p_{1 \vert 0}(x_1 \vert x_0) is a normal distribution:

p_{1 \vert 0}(x_1 \vert x_0) = \frac{1}{\sqrt{2\pi}\sigma} e^{-\frac{1}{2}\left( \frac{x_1-x_0}{\sigma} \right)^2}

Take log:

logp10(x1x0)=12(x1x0σ)2+log12πσ\log p_{1 \vert 0}(x_1 \vert x_0) = -\frac 1 2 \left( \frac{x_1-x_0}{\sigma} \right)^2 + \log \frac 1 {\sqrt{2\pi}\sigma}

The linear score function under condition:

logp10(x1x0)x1=(x1x0σ)1σ=x1x0σ2\frac{\partial \log p_{1 \vert 0}(x_1 \vert x_0)}{\partial x_1} = -\left(\frac{x_1-x_0}{\sigma} \right) \cdot \frac {1} {\sigma} = -\frac{x_1-x_0}{\sigma^2}

Bayes rule:

p01(x0x1)=p10(x1x0)p0(x0)p1(x1)p_{0 \vert 1}(x_0 \vert x_1) = \frac{p_{1 \vert 0}(x_1 \vert x_0) p_0(x_0)}{p_1(x_1)}

Take log

logp01(x0x1)=logp10(x1x0)+logp0(x0)logp1(x1)\log p_{0 \vert 1}(x_0 \vert x_1) = \log p_{1 \vert 0}(x_1 \vert x_0) + \log p_0(x_0) - \log p_1(x_1)

Take partial derivative to x1x_1:

logp01(x0x1)x1=logp10(x1x0)x1+logp0(x0)x1=0logp1(x1)x1\frac{\partial\log p_{0 \vert 1}(x_0 \vert x_1)}{\partial x_1} = \frac{\partial \log p_{1 \vert 0}(x_1 \vert x_0)}{\partial x_1} + \underbrace{\frac{\partial \log p_0(x_0)}{\partial x_1}}_{=0} - \frac{\partial \log p_1(x_1)}{\partial x_1}

Using previous result logp10(x1x0)x1=x1x0σ2\frac{\partial \log p_{1 \vert 0}(x_1 \vert x_0)}{\partial x_1} = - \frac{x_1-x_0}{\sigma^2}

logp01(x0x1)x1=x1x0σ2logp1(x1)x1\frac{\partial\log p_{0 \vert 1}(x_0 \vert x_1)}{\partial x_1} = - \frac{x_1-x_0}{\sigma^2} - \frac{\partial \log p_1(x_1)}{\partial x_1}

Rearrange:

σ2logp01(x0x1)x1=x1+x0σ2logp1(x1)x1\sigma^2 \frac{\partial\log p_{0 \vert 1}(x_0 \vert x_1)}{\partial x_1} = - x_1+x_0 - \sigma^2\frac{\partial \log p_1(x_1)}{\partial x_1} x0=σ2logp01(x0x1)x1+x1+σ2logp1(x1)x1x_0=\sigma^2 \frac{\partial\log p_{0 \vert 1}(x_0 \vert x_1)}{\partial x_1}+x_1+\sigma^2\frac{\partial \log p_1(x_1)}{\partial x_1}

Now suppose we already know the noise-added value x1x_1, but we don't know x0x_0, so x0x_0 is uncertain. We want to compute the expectation of x0x_0 under the condition that x1x_1 is known.

E[x0x1]=Ex0[σ2logp01(x0x1)x1+x1+σ2logp1(x1)x1x1]E[x_0 \vert x_1] = E_{x_0}\left[ \sigma^2 \frac{\partial\log p_{0 \vert 1}(x_0 \vert x_1)}{\partial x_1}+x_1+\sigma^2\frac{\partial \log p_1(x_1)}{\partial x_1} \biggr\vert x_1 \right] =x1+Ex0[σ2logp01(x0x1)x1x1]+Ex0[σ2logp1(x1)x1x1]= x_1 + E_{x_0}\left[\sigma^2 \frac{\partial\log p_{0 \vert 1}(x_0 \vert x_1)}{\partial x_1}\biggr\vert x_1\right] + E_{x_0}\left[ \sigma^2\frac{\partial \log p_1(x_1)}{\partial x_1} \biggr\vert x_1\right]

Within it, Ex0[logp01(x0x1)x1x1]=0E_{x_0}\left[ \frac{\partial\log p_{0 \vert 1}(x_0 \vert x_1)}{\partial x_1} \biggr\vert x_1 \right]=0, because

E_{x_0}\left[ \frac{\partial\log p_{0 \vert 1}(x_0 \vert x_1)}{\partial x_1} \biggr\vert x_1 \right]= \int p_{0 \vert 1}(x_0 \vert x_1) \cdot \frac{\partial \log p_{0 \vert 1}(x_0 \vert x_1)}{\partial x_1} dx_0 = \int \frac{\partial p_{0 \vert 1}(x_0 \vert x_1)}{\partial x_1} dx_0 = \frac{\partial}{\partial x_1} \int p_{0 \vert 1}(x_0 \vert x_1) dx_0 = \frac{\partial}{\partial x_1} 1 = 0

And Ex0[σ2logp1(x1)x1x1]=σ2logp1(x1)x1E_{x_0}\left[ \sigma^2\frac{\partial \log p_1(x_1)}{\partial x_1} \biggr\vert x_1\right] = \sigma^2\frac{\partial \log p_1(x_1)}{\partial x_1} because it's unrelated to random x0x_0.

So

E[x0x1]=x1+σ2logp1(x1)x1Train diffusion model to output thisE[x_0 \vert x_1] = x_1 + \underbrace{\sigma^2\frac{\partial \log p_1(x_1)}{\partial x_1}}_{\mathclap{\text{Train diffusion model to output this}}}

That's Tweedie's formula (for 1D case). It can be generalized to many dimensions, where the x0,x1x_0, x_1 are vectors, the distributions p0,p1,p01,p10p_0, p_1, p_{0 \vert 1}, p_{1 \vert 0} are joint distributions where different dimensions are not necessarily independent. The gaussian noise added to different dimensions are still independent.

The diffusion model is trained to estimate the added noise, which is the same as estimating the linear score.
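A small sanity check of Tweedie's formula (my own sketch): if we additionally assume the clean value itself is normal, x_0 \sim N(\mu_0, \tau^2), then p_1 = N(\mu_0, \tau^2+\sigma^2) and the posterior mean has a known closed form, so we can compare both sides directly:

import numpy as np

# Assumed toy setting: x0 ~ N(mu0, tau^2), noise ~ N(0, sigma^2), so
# p1 = N(mu0, tau^2 + sigma^2) and E[x0 | x1] = (tau^2*x1 + sigma^2*mu0)/(tau^2 + sigma^2).
mu0, tau, sigma = 2.0, 1.5, 0.7
x1 = 3.1

score_p1 = -(x1 - mu0) / (tau**2 + sigma**2)   # d/dx1 of log p1(x1) for this Gaussian p1

tweedie = x1 + sigma**2 * score_p1                                      # Tweedie's formula
posterior_mean = (tau**2 * x1 + sigma**2 * mu0) / (tau**2 + sigma**2)   # closed-form E[x0 | x1]

print(tweedie, posterior_mean)   # the two values should match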

Exponential distribution

If we have constraint X0X \geq 0 and fix the mean E[X]E[X] to a specific value μ\mu, then maximizing entropy gives exponential distribution. It can also be rediscovered from Lagrange multiplier:

\mathcal{L}(f,\lambda_1,\lambda_2)= \begin{cases} \int_{0}^{\infty} f(x) \log\frac{1}{f(x)}dx \\ + \lambda_1 \left(\int_{0}^{\infty} f(x)dx-1\right) \\ + \lambda_2 \left(\int_{0}^{\infty} f(x)xdx -\mu\right) \end{cases} =\int_{0}^{\infty} (-f(x)\log f(x) + \lambda_1 f(x) + \lambda_2 x f(x) ) dx - \lambda_1 - \lambda_2\mu

\frac{\partial \mathcal{L}}{\partial f} = -\log f(x) - 1 + \lambda_1 + \lambda_2 x \quad\quad\quad \frac{\partial \mathcal{L}}{\partial \lambda_1}=\int_0^{\infty}f(x)dx-1 \quad\quad\quad \frac{\partial \mathcal{L}}{\partial \lambda_2}=\int_0^{\infty} xf(x)dx-\mu

Then solve Lf=0\frac{\partial \mathcal{L}}{\partial f}=0:

Lf=0logf(x)=1+λ1+λ2xf(x)=e(1+λ1+λ2x)=e(1+λ1)eλ2x\frac{\partial \mathcal{L}}{\partial f}=0 \quad\quad\quad \log f(x) = -1+\lambda_1+\lambda_2 x \quad\quad\quad f(x) = e^{(-1+\lambda_1+\lambda_2 x)} = e^{(-1+\lambda_1)} \cdot e^{\lambda_2 x}

Then solve Lλ1=0\frac{\partial \mathcal{L}}{\partial \lambda_1}=0:

Lλ1=00e(1+λ1)eλ2xdx=10eλ2xdx=e1λ1\frac{\partial \mathcal{L}}{\partial \lambda_1}=0 \quad\quad\quad \int_0^{\infty} e^{(-1+\lambda_1)} \cdot e^{\lambda_2 x} dx = 1 \quad\quad\quad \int_0^{\infty} e^{\lambda_2 x} dx = e^{1-\lambda_1}

To make that integration finite, λ2<0\lambda_2 < 0.

Let u=λ2x, du=λ2dx,dx=1λ2duu = \lambda_2 x, \ du = \lambda_2 dx, dx=\frac 1 {\lambda_2} du,

0eλ2xdx=1λ20eudu=1λ2=e1λ1\int_0^{\infty} e^{\lambda_2 x} dx = \frac 1 {\lambda_2} \int_0^{-\infty} e^udu = -\frac 1 {\lambda_2} = e^{1-\lambda_1}

Then solve Lλ2=0\frac{\partial \mathcal{L}}{\partial \lambda_2}=0:

Lλ2=00xe(1+λ1+λ2x)dx=μ0xeλ2xdx=μe1λ1\frac{\partial \mathcal{L}}{\partial \lambda_2}=0 \quad\quad\quad \int_0^{\infty} x e^{(-1+\lambda_1+\lambda_2 x)} dx = \mu \quad\quad\quad \int_0^{\infty} x e^{\lambda_2 x} dx = \mu e^{1-\lambda_1} 0xeλ2xdx=(1λ2xeλ2x1λ22eλ2x)0=1λ22\int_0^{\infty} x e^{\lambda_2 x} dx = (\frac 1 {\lambda_2} x e^{\lambda_2 x} - \frac 1 {\lambda_2^2} e^{\lambda_2x}) _0^{\infty} = \frac 1 {\lambda_2^2}

Now we have

f(x)=e(1+λ1)eλ2x1λ2=e1λ11λ22=μe1λ1f(x) = e^{(-1+\lambda_1)} \cdot e^{\lambda_2 x} \quad\quad\quad -\frac 1 {\lambda_2} = e^{1-\lambda_1} \quad\quad\quad \frac 1 {\lambda_2^2}=\mu e^{1-\lambda_1}

Solving it gives λ2=1μ, e1λ1=μ\lambda_2 = - \frac 1 {\mu}, \ e^{1-\lambda_1} = \mu. Then

f(x)=1μe1μx(x0)f(x) = \frac 1 \mu e^{-\frac 1 \mu x} \quad\quad (x \geq 0)

In the common definition of exponential distribution, λ=1μ\lambda = \frac 1 \mu, f(x)=λeλxf(x) = \lambda e^{-\lambda x}.

Its tail function:

TailFunction(x)=P(X>x)=xλeλydy=(eλy)y=xy==eλx\text{TailFunction}(x) = P(X>x) = \int_x^{\infty} \lambda e^{-\lambda y}dy= \left(-e^{-\lambda y}\right) \biggr\vert_{y=x}^{y=\infty} = e^{-\lambda x}

If some event happens at a fixed rate (λ\lambda), the exponential distribution measures how long we need to wait for the next event, given that how long we still need to wait is independent of how long we have already waited (memorylessness).

Exponential distribution can measure:

  • The lifetime of machine components.
  • The time until a radioactive atom decays.
  • The time length of phone calls.
  • The time interval between two packets for a router.
  • ...

How to understand memorylessness? For example, a kind of radioactive atom decays once per 5 minutes on average. If the time unit is minutes, then λ=15\lambda = \frac 1 5. For a specific atom, if we wait for it to decay, the time we need to wait is on average 5 minutes. However, if we have already waited for 3 minutes and it still hasn't decayed, the expected time that we still need to wait is still 5 minutes. If we have waited for 100 minutes and it still hasn't decayed, the expected time that we still need to wait is still 5 minutes. The atom doesn't "remember" how long we have waited.

Memorylessness means the probability that we still need to wait needToWait\text{needToWait} amount of time is irrelevant to how long we have already waited:

P(t(alreadyWaited+needToWait)  talreadyWaited)=P(tneedToWait)P(t \geq (\text{alreadyWaited} + \text{needToWait}) \ \vert \ t \geq \text{alreadyWaited}) = P(t \geq \text{needToWait})
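A quick simulation of memorylessness (my own sketch; the mean wait of 5 minutes matches the atom example above):

import numpy as np

# For exponential waiting times with mean 5, the average *remaining* wait of the
# samples that exceed `already_waited` should still be about 5, for any waited time.
rng = np.random.default_rng(0)
t = rng.exponential(scale=5.0, size=10_000_000)   # waiting times, mean 5 minutes

for already_waited in [0.0, 3.0, 20.0]:
    remaining = t[t > already_waited] - already_waited
    print(already_waited, remaining.mean())   # all close to 5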

(We can also rediscover the exponential distribution from just memorylessness.)

Memorylessness is related to its maximum entropy property. Maximizing entropy under constraints means maximizing uncertainty and minimizing information other than the constraints. The only two constraints are X0X\geq 0, the wait time is positive, and E[X]=1λE[X]=\frac 1 \lambda, the average rate of the event. Other than these two constraints, there is no extra information: no information says waiting reduces the remaining wait time, and no information says waiting increases it. So the most unbiased distribution is the one where waiting has no effect on the remaining wait time. If the radioactive atom had some "internal memory" that changes over time and controls how likely it is to decay, then the waiting time distribution would encode extra information beyond the two constraints, which would make it no longer max-entropy.

Pareto distribution

Rediscover Pareto distribution from 80/20 rule

80/20 rule: for example, 80% of wealth is held by the richest 20% (the real numbers may be different).

It has a fractal property: even within the richest 20%, 80% of that wealth is held by the richest 20% within that group. Based on this fractal-like property, we can naturally get the Pareto distribution.

If the total people count is NN and the total wealth amount is WW, then 0.2N0.2N people have 0.8W0.8W wealth. Applying the same rule within those 0.2N0.2N people: 0.20.2N0.2 \cdot 0.2 N people have 0.80.8W0.8 \cdot 0.8W wealth. Applying it again, 0.20.20.2N0.2 \cdot 0.2 \cdot 0.2 N people have 0.80.80.8W0.8 \cdot 0.8 \cdot 0.8 W wealth.

Generalize it, 0.2kN0.2^k N people have 0.8kW0.8^k W wealth (kk can be generalized to continuous real number).

If the wealth variable is XX (assume X>0X > 0), its probability density function is f(x)f(x), and the proportion of people corresponds to probability. The richest 0.2k0.2^k proportion of people have 0.8kW0.8^kW wealth, and tt is the wealth threshold (minimum wealth) of that group:

P(Xt)=tf(x)dx=0.2kP(X \geq t) = \int_t^{\infty} f(x)dx = 0.2^k

Note that f(x)f(x) represents the probability density function (PDF), which corresponds to the density of the proportion of people. Nf(x)N\cdot f(x) is the people-amount density over wealth. Multiplying it by the wealth xx and integrating gives the total wealth in a range:

tx(Nf(x))dx=0.8kWtxf(x)dx=0.8kWN\int_t^{\infty} x (N \cdot f(x)) dx = 0.8^k W \quad\quad\quad \int_t^{\infty} x f(x) dx = 0.8^k \frac W N

We can rediscover Pareto distribution from these. The first thing to do is extract and eliminate kk:

tf(x)dx=0.2k=e(ln0.2)k(ln0.2)k=lntf(x)dx\int_t^{\infty} f(x)dx = 0.2^k = e^{(\ln 0.2)k} \quad\quad\quad (\ln 0.2) k=\ln\int_t^{\infty} f(x)dx txf(x)dx=WN0.8k=WNe(ln0.8)k(ln0.8)k=lnNtxf(x)dxW\int_t^{\infty} x f(x) dx = \frac W N 0.8^k = \frac W N e^{(\ln 0.8)k} \quad\quad\quad (\ln 0.8)k = \ln \frac{N\int_t^{\infty} x f(x) dx}{W} k=lntf(x)dxln0.2=lnNtxf(x)dxWln0.8ln0.8ln0.2lntf(x)dx=lnNtxf(x)dxWk=\frac{\ln\int_t^{\infty} f(x)dx}{\ln 0.2}=\frac{\ln \frac{N\int_t^{\infty} x f(x) dx}{W}}{\ln 0.8} \quad\quad\quad \frac{\ln 0.8}{\ln 0.2} \ln\int_t^{\infty} f(x)dx = \ln \frac{N\int_t^{\infty} x f(x) dx}{W} ln((tf(x)dx)ln0.8ln0.2)=lnNtxf(x)dxW\ln\left(\left(\int_t^{\infty} f(x)dx\right) ^{\frac{\ln 0.8}{\ln 0.2}} \right) = \ln \frac{N\int_t^{\infty} x f(x) dx}{W} (tf(x)dx)ln0.8ln0.2=NWtxf(x)dx\left(\int_t^{\infty} f(x)dx\right) ^{\frac{\ln 0.8}{\ln 0.2}} = \frac N W \int_t^{\infty} x f(x) dx

Then we can take derivative to tt on two sides:

ln0.8ln0.2(tf(x)dx)ln0.8ln0.21(f(t))=NW(tf(t))\frac{\ln 0.8}{\ln 0.2} \left( \int_t^{\infty} f(x)dx \right)^{\frac{\ln 0.8}{\ln 0.2}-1} (- f(t)) = \frac N W (-t f(t))

f(t)0f(t) \neq 0. Divide two sides by f(t)-f(t):

ln0.8ln0.2(tf(x)dx)ln0.8ln0.21=NWt\frac{\ln 0.8}{\ln 0.2} \left( \int_t^{\infty} f(x)dx \right)^{\frac{\ln 0.8}{\ln 0.2}-1} = \frac N W t ((tf(x)dx)ln0.8ln0.21)1ln0.8ln0.21=(Nln0.2Wln0.8t)1ln0.8ln0.21\left( \left( \int_t^{\infty} f(x)dx \right)^{\frac{\ln 0.8}{\ln 0.2}-1} \right)^{\frac 1 {\frac{\ln 0.8}{\ln 0.2}-1}} = \left(\frac {N\ln 0.2} {W\ln 0.8} t\right)^{\frac 1 {\frac{\ln 0.8}{\ln 0.2}-1}} tf(x)dx=(Nln0.2Wln0.8t)1ln0.8ln0.21=(Nln0.2Wln0.8t)ln0.2ln0.8ln0.2\int_t^{\infty} f(x)dx = \left( \frac{N\ln 0.2}{W\ln 0.8} t \right)^{\frac 1 {\frac{\ln 0.8}{\ln 0.2}-1}} = \left( \frac{N\ln 0.2}{W\ln 0.8} t \right)^{\frac {\ln 0.2} {\ln 0.8-\ln 0.2}}

Take derivative to tt on two sides again:

f(t)=ln0.2ln0.8ln0.2(Nln0.2Wln0.8t)ln0.2ln0.8ln0.21Nln0.2Wln0.8-f(t) = \frac {\ln 0.2} {\ln 0.8-\ln 0.2} \left( \frac{N\ln 0.2}{W\ln 0.8} t \right)^{\frac {\ln 0.2} {\ln 0.8-\ln 0.2} - 1} \cdot \frac{N\ln 0.2}{W\ln 0.8}

Now tt is an argument and can be renamed to xx. And do some adjustments:

f(x)=ln0.2ln0.8ln0.2(Nln0.2Wln0.8)ln0.2ln0.8ln0.2xln0.2ln0.8ln0.21f(x) = -\frac {\ln 0.2} {\ln 0.8-\ln 0.2} \left(\frac{N\ln 0.2}{W\ln 0.8}\right)^{\frac {\ln 0.2} {\ln 0.8-\ln 0.2} } \cdot x ^{\frac {\ln 0.2} {\ln 0.8-\ln 0.2} - 1}

Now we get the PDF. We still need to make the total probability integrate to 1 to make it a valid distribution, but there is no extra unknown parameter in the PDF left to adjust. The solution is to crop the range of XX. If we set the minimum wealth in the distribution to be mm (without constraining the maximum wealth), creating the constraint XmX \geq m, then using the previous result

mf(x)dx=1(Nln0.2Wln0.8m)ln0.2ln0.8ln0.2=1m=Wln0.8Nln0.20.1386WN\int_m^{\infty} f(x)dx = 1 \quad\quad\quad \left( \frac{N\ln 0.2}{W\ln 0.8} m \right)^{\frac {\ln 0.2} {\ln 0.8-\ln 0.2}} = 1 \quad\quad\quad m = \frac{W \ln 0.8}{N \ln 0.2} \approx 0.1386 \frac W N

Now we have rediscovered (a special case of) the Pareto distribution from just the fractal 80/20 rule. We can generalize it further for other cases like the 90/10 rule, the 80/10 rule, etc. and get the Pareto (Type I) distribution. It has two parameters, the shape parameter α\alpha (corresponding to ln0.2ln0.8ln0.2=ln5ln41.161-\frac {\ln 0.2} {\ln 0.8-\ln 0.2} = \frac{\ln 5}{\ln 4} \approx 1.161) and the minimum value mm:

f(x)={αmαxα1 xm,0 x<mf(x) = \begin{cases} \alpha m^\alpha x^{-\alpha-1} &\ x \geq m, \\ 0 &\ x < m \end{cases}

Note that in the real world a person's wealth can be negative (more debts than assets). The Pareto distribution is just an approximation. mm is the threshold above which the Pareto distribution starts to be a good approximation.

If α1\alpha \leq 1 then its theoretical mean is infinite. Of course, with finite samples the sample mean is finite, but if the theoretical mean is infinite, the more samples we have, the larger the sample mean tends to be, and the trend won't stop.

If α2\alpha \leq 2 then its theoretical variance is infinite. Recall that the central limit theorem requires finite variance. The standardized sum of values taken from a Pareto distribution with α2\alpha \leq 2 does not follow the central limit theorem because it has infinite variance.

Pareto distribution is often described using tail function (rather than probability density function):

TailFunction(x)=P(X>x)={mαxα if xm,1 if x<m\text{TailFunction}(x) = P(X>x) = \begin{cases} m^\alpha x^{-\alpha} &\ \text{if } x \geq m, \\ 1 &\ \text{if } x < m \end{cases}

Rediscover Pareto distribution by maximizing entropy under geometric mean constraint

There are additive values, like length, mass, money. For additive values, we often compute arithmetic average 1n(x1+x2+..+xn)\frac 1 n (x_1 + x_2 + .. + x_n).

There are also multiplicative values, like asset return rate, growth ratio. For multiplicative values, we often compute geometric average (x1x2...xn)1n(x_1 \cdot x_2 \cdot ... \cdot x_n)^{\frac 1 n}. For example, if an asset grows by 20% in first year, drops 10% in second year and grows 1% in third year, then the average growth ratio per year is (1.20.91.01)13(1.2 \cdot 0.9 \cdot 1.01)^{\frac 1 3}.

Logarithm allows turning multiplication into addition, and turning power into multiplication. If y=logxy = \log x, then log of geometric average of xx is arithmetic average of yy:

log((x1x2...xn)1n)=1n(logx1+logx2+...+logxn)=1n(y1+y2+...+yn)\log \left((x_1 \cdot x_2 \cdot ... \cdot x_n)^{\frac 1 n}\right) = \frac 1 n (\log x_1 + \log x_2 + ... + \log x_n)=\frac 1 n (y_1 + y_2 + ... + y_n)

Pareto distribution maximizes entropy under geometric mean constraint E[logX]E[\log X].

If we have the constraints Xm>0X \geq m > 0 and E[logX]=gE[\log X] = g, we use a Lagrange multiplier to maximize entropy:

L(f,λ1,λ2)={mf(x)log1f(x)dx+λ1(mf(x)dx1)+λ2(mf(x)logxdxg)\mathcal{L}(f, \lambda_1, \lambda_2)= \begin{cases}\int_m^{\infty} f(x) \log \frac 1 {f(x)} dx \\\\ + \lambda_1 (\int_m^{\infty} f(x)dx-1) \\\\ + \lambda_2 (\int_m^{\infty} f(x)\log x dx - g) \end{cases} L(f,λ1,λ2)=m( f(x)logf(x)+λ1f(x)+λ2f(x)logx )dxλ1gλ2\mathcal{L}(f, \lambda_1, \lambda_2) = \int_m^{\infty} (\ -f(x)\log f(x) + \lambda_1 f(x) + \lambda_2 f(x) \log x \ ) dx -\lambda_1 - g \lambda_2 Lf=logf(x)1+λ1+λ2logx\frac{\partial \mathcal{L}}{\partial f} = -\log f(x) - 1 + \lambda_1 + \lambda_2 \log x Lλ1=mf(x)dx1\frac{\partial \mathcal{L}}{\partial \lambda_1} = \int_m^{\infty} f(x) dx -1 Lλ2=mf(x)logx dxg\frac{\partial \mathcal{L}}{\partial \lambda_2} = \int_m^{\infty} f(x) \log x \ dx-g

Solve Lf=0\frac{\partial \mathcal{L}}{\partial f}=0:

logf(x)1+λ1+λ2logx=0- \log f(x) - 1 + \lambda_1 + \lambda_2 \log x=0 logf(x)=1+λ1+λ2logx\log f(x) = -1+\lambda_1 + \lambda_2 \log x f(x)=e1+λ1+λ2logxf(x) = e^{-1+\lambda_1+\lambda_2 \log x} f(x)=e1+λ1(elogx)λ2=e1+λ1xλ2f(x) = e^{-1+\lambda_1} \cdot (e^{\log x})^{\lambda_2} = e^{-1+\lambda_1} \cdot x^{\lambda_2}

Solve Lλ1=0\frac{\partial \mathcal{L}}{\partial \lambda_1}=0:

e1+λ1mxλ2dx=1mxλ2dx=e1λ1e^{-1+\lambda_1}\int_m^{\infty} x^{\lambda_2}dx = 1\quad\quad\quad \int_m^{\infty} x^{\lambda_2}dx = e^{1-\lambda_1}

To make mxλ2dx\int_m^{\infty} x^{\lambda_2}dx be finite, λ2<1\lambda_2 < -1.

mxλ2dx=(1λ2+1xλ2+1)x=mx==1λ2+1mλ2+1=e1λ1\int_m^{\infty} x^{\lambda_2}dx= \left( \frac{1}{\lambda_2+1}x^{\lambda_2+1} \right) \biggr\vert^{x=\infty}_{x=m} =- \frac 1 {\lambda_2+1} m^{\lambda_2 + 1} = e^{1-\lambda_1} mλ2+1λ2+1=e1λ1e1+λ1=λ2+1mλ2+1(1)\frac{m^{\lambda_2+1}}{\lambda_2+1} = -e^{1-\lambda_1} \tag{1}\quad\quad\quad e^{-1+\lambda_1}=-\frac{\lambda_2+1}{m^{\lambda_2+1}}

Solve Lλ2=0\frac{\partial \mathcal{L}}{\partial \lambda_2}=0:

mf(x)logx dx=g\int_m^{\infty} f(x) \log x \ dx=g me1+λ1xλ2logx dx=g\int_m^{\infty} e^{-1+\lambda_1} \cdot x^{\lambda_2} \log x \ dx=g

If we temporarily ignore e1+λ1e^{-1+\lambda_1} and compute mxλ2logx dx\int_m^{\infty} x^{\lambda_2} \log x \ dx. Let u=logxu=\log x, x=eux=e^u, dx=eududx = e^udu:

\int_m^{\infty} x^{\lambda_2} \log x \ dx=\int_{\log m}^{\infty} u\, e^{(\lambda_2+1) u} \ du = \left( \frac 1 {\lambda_2+1} u e^{(\lambda_2+1)u} - \frac 1 {(\lambda_2+1)^2} e^{(\lambda_2+1)u}\right) \biggr\vert_{u=\log m}^{u=\infty}

Then

mxλ2logx dx=1λ2+1(logm)e(λ2+1)logm+1(λ2+1)2e(λ2+1)logm\int_m^{\infty} x^{\lambda_2} \log x \ dx=- \frac 1 {\lambda_2+1} (\log m) e^{(\lambda_2+1)\log m} + \frac 1 {(\lambda_2+1)^2} e^{(\lambda_2+1)\log m} =1λ2+1(logm)m(λ2+1)+1(λ2+1)2m(λ2+1)=- \frac 1 {\lambda_2+1} (\log m) m^{(\lambda_2+1)} + \frac 1 {(\lambda_2+1)^2} m^{(\lambda_2+1)}

So

me1+λ1xλ2logx dx=e1+λ1(1λ2+1(logm)m(λ2+1)+1(λ2+1)2m(λ2+1))\int_m^{\infty} e^{-1+\lambda_1} \cdot x^{\lambda_2} \log x \ dx = e^{-1+\lambda_1} \left(- \frac 1 {\lambda_2+1} (\log m) m^{(\lambda_2+1)} + \frac 1 {(\lambda_2+1)^2} m^{(\lambda_2+1)} \right)

By using (1) e1+λ1=λ2+1mλ2+1e^{-1+\lambda_1}=-\frac{\lambda_2+1}{m^{\lambda_2+1}}:

=(logm+1λ2+1)=logm1λ2+1=g=- (-\log m + \frac 1 {\lambda_2+1})=\log m - \frac 1 {\lambda_2+1} = g 1λ2+1=logmgλ2+1=1logmg\frac 1 {\lambda_2+1} = \log m - g\quad\quad\quad\lambda_2+1 = \frac 1 {\log m - g} e1+λ1=λ2+1mλ2+1=1logmgm1logmge^{-1+\lambda_1}=-\frac{\lambda_2+1}{m^{\lambda_2+1}} = - \frac{\frac 1 {\log m - g}}{m^{\frac 1 {\log m - g}}} f(x)=e1+λ1xλ2=1logmgm1logmgx(1logmg1)f(x)= e^{-1+\lambda_1} \cdot x^{\lambda_2} = - \frac{\frac 1 {\log m - g}}{m^{\frac 1 {\log m - g}}} x^{(\frac 1 {\log m - g}-1)}

Let α=1logmg\alpha = -\frac 1 {\log m - g}; it becomes:

f(x)=αmαxα1(x>m)f(x) = \alpha m^{\alpha} x^{-\alpha-1} \quad\quad(x>m)

Now we rediscovered Pareto (Type I) distribution by maximizing entropy.

In the process we have λ2<1\lambda_2 \lt -1. From λ2+1=1logmg\lambda_2+1 = \frac 1 {\log m - g} we know logmg<0\log m - g <0, which is m<egm < e^g.

Share of top pp proportion

For example, if wealth follows a Pareto distribution, how do we compute the wealth share of the top 1%? More generally, how do we compute the share of the top pp proportion?

We firstly need to compute the threshold value tt of the top pp:

P(X>t) = p \quad\quad\quad m^\alpha t^{-\alpha}=p \quad\quad\quad t= (p m^{-\alpha})^{- \frac{1}{\alpha}} = m p^{- \frac{1}{\alpha}}

Then compute the share

\text{Share} = \frac{\int_t^{\infty} x N f(x)dx}{\int_m^{\infty} x N f(x)dx}=\frac{\int_t^{\infty} x f(x)dx}{\int_m^{\infty} x f(x)dx}

\int_b^{\infty} x f(x)dx = \int_b^{\infty} \alpha m^{\alpha} x^{-\alpha}dx = \alpha m^{\alpha} \cdot \left( \frac 1 {-\alpha+1} x^{-\alpha+1} \right) \biggr\vert_{x=b}^{x=\infty} = \left(- \alpha m^{\alpha} \frac 1 {-\alpha+1}\right) b^{-\alpha+1}

To make that integral finite, we need α+1<0-\alpha+1< 0, i.e. α>1\alpha > 1.

\text{Share}=\frac{\int_t^{\infty} x f(x)dx}{\int_m^{\infty} x f(x)dx}= \frac{\left(- \alpha m^{\alpha} \frac 1 {-\alpha+1}\right) t^{-\alpha+1}}{\left(- \alpha m^{\alpha} \frac 1 {-\alpha+1}\right) m^{-\alpha+1}}= \frac{t^{-\alpha+1}}{m^{-\alpha+1}} = \frac{m^{-\alpha+1} p^{ - \frac{1}{\alpha} (-\alpha+1)}}{m^{-\alpha+1}}=p^{1- \frac{1}{\alpha}}

The share proportion does not depend on mm.
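A numerical check of this formula (my own sketch; the value of m is arbitrary), using the 80/20 value \alpha = \frac{\ln 5}{\ln 4}:

import numpy as np
from scipy import integrate

# Integrate wealth above the top-p threshold and divide by the total wealth,
# then compare with the closed form p**(1 - 1/alpha).
alpha, m, p = np.log(5) / np.log(4), 1.0, 0.2

def pdf(x):
    return alpha * m**alpha * x**(-alpha - 1)

t = m * p**(-1 / alpha)                                 # wealth threshold of the top p

top_wealth, _ = integrate.quad(lambda x: x * pdf(x), t, np.inf)
total_wealth, _ = integrate.quad(lambda x: x * pdf(x), m, np.inf)

print(top_wealth / total_wealth, p**(1 - 1 / alpha))    # both approximately 0.8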

Some concrete numbers:

| α\alpha | Share of top 20% | Share of top 1% |
| - | - | - |
| 1.001 | 99.84% | 99.54% |
| 1.1 | 86.39% | 65.79% |
| 1.160964 | 80.00% | 52.81% |
| 1.2 | 76.47% | 46.42% |
| 1.3 | 68.98% | 34.55% |
| 1.5 | 58.48% | 21.54% |
| 2 | 44.72% | 10.00% |
| 2.5 | 38.07% | 6.31% |
| 3 | 34.20% | 4.64% |
print("| $\\alpha$ | Share of top 20% | Share of top 1% |\n| - | - | - |\n" + "\n".join([
    "|" + "|".join([f"{a}"] + [
        f"{pow(p, 1 - (1 / a)):.2%}" for p in [0.2, 0.01]
    ]) + "|" for a in [1.001, 1.1, 1.160964, 1.2, 1.3, 1.5, 2, 2.5, 3]
]))

Power law distributions

A distribution is a power law distribution if its tail function P(X>x)P(X>x) is roughly proportional to xαx^{-\alpha}, where α\alpha is called the exponent.

P(X>x)xα(roughly)P(X>x) \propto x^{-\alpha} \quad\quad \text{(roughly)}

The "roughly" here means that it can have deviations that become negligible when xx is large enough. Rigorously speaking it's P(X>x)L(x)xαP(X>x) \propto L(x) x^{-\alpha} where LL is a slowly varying function satisfying limxL(rx)L(x)=1\lim_{x \to \infty} \frac{L(rx)}{L(x)}=1 for every positive rr.

Note that in some places the power law is written as P(X>x)L(x)x(α1)P(X>x) \propto L(x) x^{-(\alpha-1)}. In these places the α\alpha is 1 larger than the α\alpha in Pareto distribution. The same α\alpha can have different meaning in different places. Here I will use the α\alpha that's consistent with the α\alpha in Pareto distribution.

The lower the exponent α\alpha, the more right-skewed the distribution is, and the more extreme values it has.

The power law parameter estimation according to Power laws, Pareto distributions and Zipf’s law:

| Distribution | Estimated min value that power law starts to hold | Estimated exponent α\alpha |
| - | - | - |
| Frequency of use of words | 1 | 1.20 |
| Number of citations of papers | 100 | 2.04 |
| Number of hits on web sites | 1 | 1.40 |
| Copies of books sold in the US | 2 millions | 2.51 |
| Telephone calls received | 10 | 1.22 |
| Magnitude of earthquakes | 3.8 | 2.04 |
| Diameter of moon craters | 0.01 | 2.14 |
| Intensity of solar flares | 200 | 0.83 |
| Intensity of wars | 3 | 0.80 |
| Net worth of Americans | $600 millions | 1.09 |
| Frequency of family names | 10000 | 0.94 |
| Population of US cities | 40000 | 1.30 |

Book The Black Swan also provides some estimation of power law parameter in real world:

| Distribution | Estimated exponent α\alpha |
| - | - |
| Number of books sold in the U.S. | 1.5 |
| Magnitude of earthquakes | 2.8 |
| Market moves | 3 (or lower) |
| Company size | 1.5 |
| People killed in terrorist attacks | 2 (but possibly a much lower exponent) |

Note that the estimation is not accurate because they are sensitive to rare extreme samples.

Note that there are things whose estimated α<1\alpha < 1: intensity of solar flares, intensity of wars, frequency of family names. Recall that in the Pareto (Type I) distribution, if α1\alpha \leq 1 then the theoretical mean is infinite. The sample mean tends to get higher and higher as we collect more samples, and the trend won't stop. If the intensity of war really does follow a power law and the real α<1\alpha < 1, then much larger wars are to be expected in the future.

Note that most of these things have an estimated α<2\alpha < 2. In the Pareto (Type I) distribution, if α2\alpha \leq 2 then its theoretical variance is infinite. Not having a finite variance means they do not follow the central limit theorem and should not be modelled using a Gaussian distribution.

There are other distributions that can have extreme values:

  • Log-normal distribution: If logX\log X is normally distributed, then XX follows a log-normal distribution. Put another way, if YY is normally distributed, then eYe^Y follows a log-normal distribution.
  • Stretched exponential distribution: P(X>x)P(X>x) is roughly proportional to ekxβe^{-kx^\beta} (β<1\beta < 1)
  • Power law with exponential cutoff: P(X>x)P(X>x) is roughly proportional to xαeλxx^{-\alpha} e^{-\lambda x}

They all have less extreme values than power law distributions, but more extreme values than normal distribution and exponential distribution.

Relation with exponential distribution

If TT follows exponential distribution, then aTa^T follows Pareto (Type I) distribution if a>1a>1.

If TT follows exponential distribution, its probability density fT(t)=λeλtf_T(t) = \lambda e^{-\lambda t} (T0T\geq 0), its cumulative distribution function FT(t)=P(T<t)=1eλtF_T(t) = P(T<t) = 1-e^{-\lambda t}

If Y=aTY=a^T, a>1a>1, then

P(Y<y)=P(aT<y)=P(T<logyloga)=1eλlogyloga=1(elogy)λloga=1yλlogaP(Y<y) = P(a^T < y) = P\left(T < \frac{\log y}{\log a}\right) = 1- e^{-\lambda \frac{\log y}{\log a}}=1- (e^{\log y})^{-\frac{\lambda}{\log a}}=1-y^{-\frac{\lambda}{\log a}} TailFunction(y)=P(Y>y)=1P(Y<y)=yλloga\text{TailFunction}(y)=P(Y>y) = 1-P(Y<y) = y^{-\frac{\lambda}{\log a}}

Because T0T\geq 0, Ya0=1Y \geq a^0=1. Now YY's tail function is in the same form as Pareto (Type I) distribution, where α=λloga, m=1\alpha=\frac{\lambda}{\log a}, \ m =1.
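A quick empirical check of this relation (my own sketch; the values of \lambda and a are arbitrary):

import numpy as np

# If T is exponential with rate lam and Y = a**T (a > 1), then Y should have a
# Pareto tail with alpha = lam / log(a) and m = 1: P(Y > y) = y**(-alpha).
rng = np.random.default_rng(0)
lam, a = 1.5, 2.0
alpha = lam / np.log(a)

t = rng.exponential(scale=1 / lam, size=1_000_000)
y = a**t

for threshold in [1.0, 2.0, 5.0, 10.0]:
    print(threshold, (y > threshold).mean(), threshold**(-alpha))   # empirical vs theoretical tail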

Lindy effect

If the lifetime of something follows a power law distribution, then it has the Lindy effect: the longer it has existed, the longer it is likely to continue existing.

If the lifetime TT follows a Pareto distribution and something is still alive at time tt, we can compute the expected lifetime under that condition.

(The mean is a weighted average. The conditional mean is also a weighted average, but under the condition the total integrated weight is no longer 1, so we need to divide by the total integrated weight.)

E[T | T > t] = \frac{\int_t^{\infty} xf(x)dx}{\int_t^{\infty} f(x)dx} = \frac{\int_t^{\infty} x \alpha m^{\alpha} x^{-\alpha-1} dx }{\int_t^{\infty} \alpha m^{\alpha} x^{-\alpha-1} dx} = \frac{\int_t^{\infty} x^{-\alpha} dx }{\int_t^{\infty} x^{-\alpha-1} dx} = \frac{ \frac 1 {-\alpha+1} x^{-\alpha+1} |_{x=t}^{x=\infty}}{\frac 1 {-\alpha} x^{-\alpha}|_{x=t}^{x=\infty}} = \frac{-\frac 1 {-\alpha+1} t^{-\alpha+1}}{-\frac 1 {-\alpha} t^{-\alpha}} = \frac{\alpha}{\alpha-1} t

(For that integration to be finite, α+1<0-\alpha+1<0, α>1\alpha>1)

The expected lifetime is αα1t\frac{\alpha}{\alpha-1} t under the condition that it has already lived to time tt. The expected remaining lifetime is αα1tt=1α1t\frac{\alpha}{\alpha-1} t-t= \frac{1}{\alpha-1}t. It increases by tt.
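A sampling check of this (my own sketch; \alpha = 2.5 and m = 1 are arbitrary choices with \alpha > 1):

import numpy as np

# For Pareto lifetimes, the mean remaining lifetime of the items still alive at time t
# should be roughly t / (alpha - 1), i.e. it grows as t grows (the Lindy effect).
rng = np.random.default_rng(0)
alpha, m = 2.5, 1.0

p = rng.uniform(0, 1, 10_000_000)
lifetimes = m * (1 - p)**(-1 / alpha)       # inverse-CDF sampling from Pareto(alpha, m)

for t in [1.0, 2.0, 5.0, 20.0]:
    survivors = lifetimes[lifetimes > t]
    print(t, (survivors - t).mean(), t / (alpha - 1))   # empirical vs t / (alpha - 1)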

Lindy effect often doesn't apply to physical things. Lindy effect often applies to information, like technology, culture, art, social norm, etc.

| Distribution of lifetime | Expected remaining lifetime of living ones |
| - | - |
| Normal distribution | Gets shorter as time passes |
| Exponential distribution | Does not change as time passes (memorylessness) |
| Pareto distribution | Gets longer as time passes (Lindy effect) |

Benford's law

If some numbers span multiple orders of magnitude, Benford's law says that about 30% of the numbers have leading digit 1, about 18% have leading digit 2, and so on. The digit dd's proportion is log10(1+1d)\log_{10} \left(1 + \frac 1 d \right).

Pareto distribution is a distribution that spans many orders of magnitudes. Let's compute the distribution of first digit if the number follows Pareto distribution.

If xx starts with digit dd then d10kx<(d+1)10kd 10^k \leq x < (d+1) 10^k, k=0,1,2,...k=0, 1, 2, ... The Pareto distribution has a lower bound mm. If we make mm randomly distributed, then analytically computing the probability of each starting digit becomes hard due to edge cases.

In this case, doing a Monte Carlo simulation is easier.

How to randomly sample numbers from a Pareto distribution? Firstly we know the cumulative distribution function F(x)=P(X<x)=1P(X>x)=1mαxαF(x) = P(X<x) = 1-P(X>x) = 1- m^\alpha x^{-\alpha}. We can then get quantile function, which is the inverse of FF: F(x)=p,  Q(p)=xF(x)=p, \ \ Q(p) = x

p=1mαxαmαxα=1pxα=(1p)mαp=1-m^\alpha x^{-\alpha} \quad\quad\quad m^\alpha x^{-\alpha}=1-p \quad\quad\quad x^{-\alpha} = (1-p) m^{-\alpha} (xα)1α=((1p)mα)1αx=m(1p)1αQ(p)=m(1p)1α(x^{-\alpha})^{- \frac{1}{\alpha}} = \left((1-p) m^{-\alpha}\right)^{- \frac{1}{\alpha}} \quad\quad\quad x = m (1-p)^{- \frac{1}{\alpha}} \quad\quad\quad Q(p) = m (1-p)^{- \frac{1}{\alpha}}

Now we can randomly sample pp between 0 and 1 then Q(p)Q(p) will follow Pareto distribution.

Given xx how to calculate its first digit? If 10x<10010\leq x<100 (1log10x<21 \leq \log_{10} x < 2) then first digit is x10\lfloor {\frac x {10}} \rfloor. If 100x<1000100 \leq x < 1000 (2log10x<32 \leq \log_{10}x < 3) then the first digit is x100\lfloor {\frac x {100}} \rfloor. Generalize it, the first digit dd is:

d=x10log10xd = \left\lfloor \frac {x} {10^{\lfloor \log_{10} x \rfloor}} \right\rfloor

Because the Pareto distribution has a lot of extreme values, directly calculating the samples will likely exceed the floating-point range and give some inf values. So we need to use log scale: only calculate using logx\log x and avoid using xx directly.

Sampling in log scale:

logx=log(m(1p)1α)=logm1αlog(1p)\log x = \log \left(m (1-p)^{- \frac{1}{\alpha}}\right) = \log m - \frac{1}{\alpha} \log (1-p)

Calculating first digit in log scale:

log10x=logexloge10\log_{10}x = \frac{\log_e x}{\log_e 10} logx10log10x=logxlog10xlog10=logxlogxlog10log10\log \frac {x} {10^{\lfloor \log_{10} x \rfloor}} = \log x - \lfloor \log_{10} x \rfloor \log 10 = \log x - \left\lfloor \frac{\log x}{\log 10} \right\rfloor \log 10 d=elogxlogxlog10log10d = \left\lfloor e^{\log x - \left\lfloor \frac{\log x}{\log 10} \right\rfloor \log 10} \right\rfloor

When α\alpha approaches 00 it accurately follows Benford's law. The larger α\alpha is, the larger the deviation from Benford's law.

If we fix the min value mm as a specific number, like 3, then when α\alpha is not very close to 00 the result significantly deviates from Benford's law. However, if we make mm a random value between 1 and 10 then it will be close to Benford's law.

import numpy as np
import matplotlib.pyplot as plt

def first_digit_of_log_x(log_x):
    log_10_x = log_x / np.log(10)
    exponent = log_x - np.floor(log_10_x) * np.log(10)
    return np.floor(np.exp(exponent)).astype(int)

benford_probs = [np.log10(1 + 1/d) for d in range(1, 10)]

n_samples = 1000000

fig, axs = plt.subplots(4, 2, figsize=(12, 10))
fig.suptitle("First digit distribution in Pareto Distributions")

def sub_plot(row, col, alpha, m, m_str):
    p = np.random.uniform(0, 1, n_samples)
    log_xs = np.log(m) - (np.log(1 - p)) / alpha
    digits = first_digit_of_log_x(log_xs)
    digit_counts = np.bincount(digits, minlength=10)[1:10]
    observed_probs = digit_counts / digit_counts.sum()

    axs[row, col].bar(range(1, 10), observed_probs, label='Result', color='#6075eb')
    axs[row, col].plot(range(1, 10), benford_probs, 'o-', label='According to Benford\'s Law', color='#ff7c3b')
    axs[row, col].set_title(f"$\\alpha$ = {alpha}, {m_str}")
    axs[row, col].legend()
    axs[row, col].set_xticks(range(1, 10))
    axs[row, col].set_ylim(0, 0.5)

sub_plot(0, 0, 0.001, np.random.uniform(1, 10, n_samples), '$m \\sim U[1,10]$')
sub_plot(1, 0, 0.9, np.random.uniform(1, 10, n_samples), '$m \\sim U[1,10]$')
sub_plot(2, 0, 1.2, np.random.uniform(1, 10, n_samples), '$m \\sim U[1,10]$')
sub_plot(3, 0, 2.0, np.random.uniform(1, 10, n_samples), '$m \\sim U[1,10]$')

sub_plot(0, 1, 0.001, 3.0, '$m = 3$')
sub_plot(1, 1, 0.9, 3.0, '$m = 3$')
sub_plot(2, 1, 1.2, 3.0, '$m = 3$')
sub_plot(3, 1, 2.0, 3.0, '$m = 3$')

plt.tight_layout()
plt.savefig("pareto_benfords_law.svg")

Hypothesis testing

We have a null hypothesis H0H_0, like "the coin is fair", and an alternative hypothesis H1H_1, like "the coin is unfair". We now need to test how likely H1H_1 is true using data.

The p-value is the probability of getting a result that's as extreme as, or more extreme than, the observed data, assuming the null hypothesis H0H_0 is true. If the p-value is small, the data is very unlikely under H0H_0, which is evidence in favor of the alternative hypothesis.

For example, if I do ten coin flips and get 9 heads and 1 tail, we ask how likely it is to see a result at least this extreme if the coin is fair. The p-value is the probability that we get a result as extreme or more extreme than the observed one, and the "extreme" is two-sided, so the p-value is P(9 heads 1 tail)+P(10 heads 0 tail)+P(1 heads 9 tail)+P(0 heads 10 tail)P(\text{9 heads 1 tail}) + P(\text{10 heads 0 tail}) + P(\text{1 heads 9 tail}) + P(\text{0 heads 10 tail}), assuming the coin flip is fair.
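The calculation for this example (a small sketch):

from math import comb

# Two-sided p-value for observing 9 heads out of 10 flips of a fair coin.
def prob_heads(k, n=10):
    return comb(n, k) * 0.5**n       # P(exactly k heads) under a fair coin

p_value = prob_heads(9) + prob_heads(10) + prob_heads(1) + prob_heads(0)
print(p_value)   # about 0.0215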

Can we swap the null hypothesis and the alternative hypothesis? For two conflicting hypotheses, which one should be the null hypothesis? The key is the burden of proof. The null hypothesis is the default that most people tend to agree with and that does not need proving. The alternative hypothesis is the special claim that requires you to prove it using the data.

The lower the p value, the higher your confidence that alternative hypothesis is true. But due to randomness you cannot be 100% sure.

Caveat: Collect data until significance

If you are doing an AB test and you keep collecting data, making a conclusion as soon as there is statistical significance (like a p-value lower than 0.05), this is not statistically sound. A random fluctuation in the process could lead to false positive results.

A more rigorous approach is to determine the required sample size before the AB test. Also, the less data you have, the stricter the hypothesis test should be (lower p-value threshold). According to the O'Brien-Fleming Boundary, the p-value threshold should be 0.001 when you have 25% of the data, 0.005 when you have 50% of the data, 0.015 when you have 75% of the data and 0.045 when you have 100% of the data.
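A simulation of why "collect data until significance" inflates false positives (my own sketch; the batch size, the number of looks and the use of a z-test are arbitrary choices for illustration):

import numpy as np
from scipy import stats

# The null hypothesis is true (a fair coin), but we run a two-sided z-test after every
# batch and stop as soon as p < 0.05. The fraction of "significant" runs is far above 5%.
rng = np.random.default_rng(0)
n_experiments = 2000
batch, max_batches = 100, 50
false_positives = 0

for _ in range(n_experiments):
    heads, n = 0, 0
    for _ in range(max_batches):
        heads += rng.binomial(batch, 0.5)
        n += batch
        z = (heads - n * 0.5) / np.sqrt(n * 0.25)   # z statistic for "proportion of heads = 0.5"
        p = 2 * (1 - stats.norm.cdf(abs(z)))        # two-sided p-value
        if p < 0.05:
            false_positives += 1
            break

print(false_positives / n_experiments)   # noticeably larger than 0.05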

Bootstrap

If I have some samples and I calculate values like the mean, variance, median, etc., the calculated value is called a statistic. The statistics themselves are also random. If you are sure that "with 95% probability the real median is between 8.1 and 8.2", then [8.1,8.2][8.1,8.2] is a confidence interval with a 95% confidence level. A confidence interval measures how uncertain a statistic is.

One way of computing a confidence interval is called the bootstrap. It doesn't require you to assume that the statistic is normally distributed, but it does require the samples to be i.i.d.

It works by resampling from the data to create many resampled datasets, then calculating the statistic on each resampled dataset, then reading the confidence interval off those statistics.

For example, if the original samples are [1.0,2.0,3.0,4.0,5.0][1.0, 2.0, 3.0, 4.0, 5.0], resampling means randomly selecting one element from the original data (with replacement) and repeating 5 times, giving things like [4.0,2.0,4.0,5.0,2.0][4.0, 2.0, 4.0, 5.0, 2.0] or [3.0,2.0,4.0,4.0,5.0][3.0, 2.0, 4.0, 4.0, 5.0] (they are likely to contain duplicates).

Then compute the statistics for each resample. If the confidence level is 95%, then the confidence interval's lower bound is the 2.5% percentile number in these statistics, and the upper bound is the 97.5% percentile number in these statistics.
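A minimal bootstrap sketch (my own example; the data is made up, and the statistic here is the median):

import numpy as np

# Bootstrap 95% confidence interval of the median: resample the data with replacement
# many times, compute the median of each resample, then take the 2.5% and 97.5% percentiles.
rng = np.random.default_rng(0)
data = rng.lognormal(mean=2.0, sigma=0.5, size=200)   # hypothetical i.i.d. samples

n_resamples = 10_000
medians = np.array([
    np.median(rng.choice(data, size=len(data), replace=True))
    for _ in range(n_resamples)
])

lower, upper = np.percentile(medians, [2.5, 97.5])
print(np.median(data), (lower, upper))   # sample median and its 95% confidence interval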

Overfitting

When we train a model (including deep learning and linear regression) we want it to also work on new data that's not in the training set. But training itself only changes the model parameters to fit the training data.

Overfitting means the training makes the model "memorize" the training data instead of discovering the underlying rule in the real world that generates the training data.

Reducing overfitting is a hard topic. The ways to reduce overfitting:

  • Regularization. Force the model to be "simpler". Force the model to compress the data. Weight sharing is also regularization (a CNN uses weight sharing compared to an MLP). Add inductive bias to restrict what the model can represent.

    (The old way of regularization is to simply reduce the parameter count, but in deep learning there is the deep double descent effect, where more parameters can be better.)

  • Make the model more expressive. If the model is not expressive enough to capture the real underlying rule in the real world that generates the training data, it's simply unable to generalize. An example is that an RNN is less expressive than a Transformer due to its fixed-size state.

  • Make the training data more comprehensive. Reinforcement learning, if done properly, can provide more comprehensive training data than supervised learning, because of the randomness in interacting with environment.

How to test how overfit a model is?

  • Separate the data into a training set and a test set. Only train using the training set and check the model's performance on the test set.
  • Test sensitivity to random fluctuation. We can add randomness to the parameters, input, hyperparameters, etc., then see how the model's performance changes. An overfit model is more sensitive to random perturbation because memorization is more "fragile" than the real underlying rule.

Issues in real-world statistics

  • Survivorship bias and selection bias.
  • Simpson's paradox and base rate fallacy.
  • Confusing correlation with causality.
  • Trying too many different hypotheses (spurious correlations).
  • Collecting data until significance.
  • Wrongly removing outliers.
  • ...