What's the essence of probability? There are two views:
Frequentist: Probability is an objective thing. We can know a probability from the results of repeating a random event many times under the same conditions.
Bayesian: Probability is a subjective thing. Probability means how likely you think something is to happen, based on your initial assumptions and the evidence you see. Probability is relative to the information you have.
Probability is related to sampling assumptions. Example: Bertrand's paradox: there are many ways to randomly select a chord on a circle, each giving a different probability density over chords.
A distribution tells how likely a random variable will be what value:
A discrete distribution can be a table, telling the probability of each possible outcome.
A discrete distribution can be a function, where the input is a possible outcome and the output is its probability.
A discrete distribution can be a vector (an array), where i-th number is the probability of i-th outcome.
A discrete distribution can be a histogram, where each pillar is a possible outcome, and the height of pillar is probability.
A continuous distribution can be described by a probability density function (PDF) $f$. A continuous distribution has infinitely many outcomes, and the probability of each specific outcome is zero (usually). We care about the probability of a range: $P(a<X<b)=\int_a^b f(x)\,dx$. The integral over the whole range should be 1: $\int_{-\infty}^{\infty}f(x)\,dx=1$. The value of a PDF can be larger than 1.
A distribution can be described by the cumulative distribution function (CDF), $F(x)=P(X\le x)$. It can be obtained by integrating the PDF: $F(x)=\int_{-\infty}^{x}f(t)\,dt$. It starts from 0, monotonically increases, and approaches 1.
The quantile function Q is the inverse of the cumulative distribution function. $Q(p)=x$ means $F(x)=p$ and $P(X\le x)=p$. The threshold of the top 25% is $Q(0.75)$; the threshold of the bottom 25% is $Q(0.25)$.
Independent means that two random variables don't affect each other: knowing one doesn't change the distribution of the other. For dependent random variables, knowing one changes the distribution of the other.
$P(X=x)$ means the probability that random variable X takes value x. It can also be written as $P_X(x)$ or $P(X)$. Sometimes the probability density function f is used to represent a distribution.
A joint distribution tells how likely a combination of multiple variables will be what value. For a joint distribution of X and Y, each outcome is a pair of X and Y, denoted (X,Y). If X and Y are independent, then P(X=x,Y=y)=P((X,Y)=(x,y))=P(X=x)⋅P(Y=y).
For a joint distribution of (X,Y), if we only care about X, then the distribution of X is called marginal distribution.
You can only add probability when two events are mutually exclusive.
You can only multiply probabilities when two events are independent, or when multiplying a conditional probability by the probability of its condition.
$P(E\mid C)$ means the probability of E happening if C happens.
$$P(E\mid C)=\frac{P(E\cap C)}{P(C)} \qquad P(E\cap C)=P(E\mid C)\cdot P(C)$$
where $P(E\cap C)$ is the probability that E and C both happen.
If E and C are independent, then P(E∩C)=P(E)P(C), then P(E∣C)=P(E).
For example, there is a medical test for a disease. The test result can be positive (indicating having the disease) or negative. But the test is not always accurate.
There are two random variables: whether the test result is positive, and whether the person actually has the disease. Together they form a joint distribution. The 4 cases:
|  | Test is positive | Test is negative |
| - | - | - |
| Actually has disease | True positive: $a$ | False negative (Type II error): $b$ |
| Actually doesn't have disease | False positive (Type I error): $c$ | True negative: $d$ |
$a,b,c,d$ are the four probabilities, and $a+b+c+d=1$.
For that distribution, there are two marginal distributions. If we only care about whether the person actually has disease and ignore the test result, then the marginal distribution is:
|  | Probability |
| - | - |
| Actually has disease | $a+b$ (the infection rate of the population) |
| Actually doesn't have disease | $c+d$ |
Similarly, there is also a marginal distribution of whether the test result is positive.
The false negative rate is P(Test is negative ∣ Actually has disease): the rate of negative tests among people who actually have the disease. The false positive rate is P(Test is positive ∣ Actually doesn't have disease).
$$\text{False negative rate}=P(\text{Test is negative}\mid\text{Actually has disease})=\frac{b}{a+b}$$
$$\text{False positive rate}=P(\text{Test is positive}\mid\text{Actually doesn't have disease})=\frac{c}{c+d}$$
Some people may intuitively think the false negative rate means P(Test result is false ∣ Test is negative), which equals P(Actually has disease ∣ Test is negative), which equals $\frac{b}{b+d}$. But that's not the definition of the false negative rate (that quantity is called the false omission rate).
Bayes' theorem allows "reversing" $P(A\mid B)$ into $P(B\mid A)$:
$$P(A\mid B)=\frac{P(A\cap B)}{P(B)}=\frac{P(B\mid A)\cdot P(A)}{P(B)}$$
Prior means what I assume the distribution is before knowing some new information.
If I see some new information and improve my understanding of the distribution, then the new distribution that I assume is the posterior.
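As a sketch of applying Bayes' theorem to the medical test example above, here the prevalence (1%), false negative rate (10%) and false positive rate (5%) are made-up numbers for illustration:

```python
# Hypothetical numbers: 1% prevalence, 10% false negative rate, 5% false positive rate.
p_disease = 0.01
p_pos_given_disease = 1 - 0.10      # P(positive | has disease)
p_pos_given_healthy = 0.05          # P(positive | doesn't have disease)

# P(positive): the marginal probability of a positive test (law of total probability).
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)

# Bayes' theorem: "reverse" P(positive | disease) into P(disease | positive).
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(f"P(disease | positive test) = {p_disease_given_pos:.3f}")  # ≈ 0.154
```

Even with a fairly accurate test, the posterior probability of having the disease given a positive result is small here, because the prior (prevalence) is small.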
The theoretical mean is the "weighted average" of all possible cases using theoretical probabilities.
E[X] denotes the theoretical mean of random variable X, also called the expected value of X. It's also often denoted as μ.
For the discrete case, E[X] is calculated by summing over all theoretically possible values, each multiplied by its theoretical probability.
The mean for discrete case:
$$\mu=E[X]=\sum_{\text{all cases of }x} x\cdot \underbrace{P(X=x)}_{\text{probability of that case}}$$
The mean for continuous case:
$$\mu=E[X]=\int_{-\infty}^{\infty} x\cdot p(x)\,dx$$
Some rules related to mean:
The mean of a sum of random variables is the sum of their means
$$E[X+Y]=E[X]+E[Y] \qquad E\left[\sum_i X_i\right]=\sum_i E[X_i]$$
Multiplying a random variable by a constant k multiplies its mean by k
E[kX]=k⋅E[X]
A constant's mean is that constant
E[k]=k
(The constant k doesn't necessarily need to be globally constant. It just needs to be a value that's not affected by the random outcome, i.e. "constant in context".)
Another important rule is that, if X and Y are independent, then
E[X⋅Y]=E[X]⋅E[Y]
Because when X and Y are independent, $P(X=x_i,Y=y_j)=P(X=x_i)\cdot P(Y=y_j)$, then:
$$E[X\cdot Y]=\sum_i\sum_j x_i y_j P(X=x_i)P(Y=y_j)=\left(\sum_i x_i P(X=x_i)\right)\left(\sum_j y_j P(Y=y_j)\right)=E[X]\cdot E[Y]$$
(That's for the discrete case. Continuous case is similar.)
If we have n samples of X, denoted $X_1,X_2,...,X_n$, each sample is a random variable, the samples are independent of each other, and each sample is taken from the same distribution (independently and identically distributed, i.i.d.), then we can estimate the theoretical mean by calculating the average. The estimated mean is denoted $\hat{\mu}$ (mu hat):
$$\hat{E}[X]=\hat{\mu}=\frac{1}{n}\sum_i X_i$$
Hat ^ means it's an empirical value calculated from samples, not the theoretical value.
Some important clarifications:
The theoretical mean is weighted average using theoretical probabilities
The estimated mean (empirical mean, sample mean) is non-weighted average over samples
The theoretical mean is an accurate value, determined by the theoretical distribution
The estimated mean is an inaccurate random variable, because it's calculated from random samples
The mean of the estimated mean equals the theoretical mean (the estimator is unbiased).
Note that if the samples are not independent of each other, or they are taken from different distributions, then the estimate may be biased.
The mean is sometimes also called location. The variance is sometimes called dispersion.
If we have some i.i.d samples but don't know the theoretical variance, how to estimate the variance? If we know the theoretical mean, then it's simple:
$$\hat{\sigma}^2=\frac{1}{n}\sum_i (X_i-\mu)^2 \qquad E[\hat{\sigma}^2]=\sigma^2$$
However, the theoretical mean is different from the estimated mean. If we don't know the theoretical mean and use the estimated mean instead, the estimator will be biased, and we need to divide by n−1 instead of n to avoid the bias:
$$\hat{\sigma}^2=\frac{1}{n-1}\sum_i (X_i-\hat{\mu})^2$$
This is called Bessel's correction. Note that the more i.i.d. samples you have, the smaller the bias, so if you have many i.i.d. samples the bias doesn't matter in practice.
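A quick NumPy simulation sketch of this bias (the normal distribution, sample size 5 and 200,000 repetitions are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5                                                     # small sample size, so the bias is visible
x = rng.normal(loc=10.0, scale=2.0, size=(200_000, n))    # true variance is 4.0

m = x.mean(axis=1, keepdims=True)                         # estimated mean of each sample set
biased = ((x - m) ** 2).sum(axis=1) / n                   # divide by n
corrected = ((x - m) ** 2).sum(axis=1) / (n - 1)          # Bessel's correction

print(biased.mean())     # ≈ 4.0 * (n-1)/n = 3.2 (systematically too small)
print(corrected.mean())  # ≈ 4.0 (unbiased)
```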
Originally, n samples have n degrees of freedom. If we keep the estimated mean fixed, only n−1 degrees of freedom remain. That's an intuitive explanation of the correction. The exact derivation of the correction is tricky:
Firstly, the estimated mean itself also has variance
$$\mathrm{Var}[\hat{\mu}]=\mathrm{Var}\left[\frac{1}{n}\sum_i X_i\right]=\frac{1}{n^2}\mathrm{Var}\left[\sum_i X_i\right]$$
Each sample is independent of the other samples, and, as previously mentioned, if X and Y are independent, adding the variables also adds the variances: $\mathrm{Var}[X+Y]=\mathrm{Var}[X]+\mathrm{Var}[Y]$. So $\mathrm{Var}[\hat{\mu}]=\frac{1}{n^2}\cdot n\sigma^2=\frac{\sigma^2}{n}$.
For a random variable X, if we know its mean μ and standard deviation σ, we can "standardize" it so that its mean becomes 0 and its standard deviation becomes 1:
$$Z=\frac{X-\mu}{\sigma}$$
That's called Z-score or standard score.
Often the theoretical mean and theoretical standard deviation are unknown, so the Z-score is computed using the sample mean and sample stdev:
$$Z=\frac{X-\hat{\mu}}{\hat{\sigma}}$$
In deep learning, normalization uses Z score:
Layer normalization: it works on a vector. It treats each element in the vector as a different sample from the same distribution, and then replaces each element with its Z-score (using the sample mean and sample stdev).
Batch normalization: it works on a batch of vectors. It treats the elements at the same index across the different vectors in a batch as different samples from the same distribution, and then computes the Z-score (using the sample mean and sample stdev).
Note that in layer normalization and batch normalization, the variance usually divides by n instead of n−1.
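A minimal sketch of layer normalization as a Z-score over one vector; real frameworks also add a small epsilon for numerical stability and learnable scale/shift parameters, which are simplified away here:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Treat the elements of one vector as samples from the same distribution,
    # then replace each element with its Z-score (sample mean / sample stdev).
    mu = x.mean()
    var = ((x - mu) ** 2).mean()     # note: divides by n, not n - 1
    return (x - mu) / np.sqrt(var + eps)

x = np.array([1.0, 2.0, 3.0, 10.0])
print(layer_norm(x))                 # result has (approximately) zero mean and unit variance
```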
Computing Z-score for a vector can also be seen as a projection:
The input x=(x1,x2,...,xn)
The vector of ones: 1=(1,1,...,1)
Computing the sample mean can be seen as taking the dot product with the vector of ones and scaling by $\frac{1}{n}$: $\hat{\mu}=\frac{1}{n}\,\mathbf{x}\cdot\mathbf{1}$
Subtracting the sample mean can be seen as subtracting $\hat{\mu}\cdot\mathbf{1}$; let's call the result $\mathbf{y}$: $\mathbf{y}=\mathbf{x}-\hat{\mu}\cdot\mathbf{1}=\mathbf{x}-\frac{1}{n}(\mathbf{x}\cdot\mathbf{1})\,\mathbf{1}$
Recall projection: projecting vector $\mathbf{a}$ onto $\mathbf{b}$ is $\left(\frac{\mathbf{a}\cdot\mathbf{b}}{\mathbf{b}\cdot\mathbf{b}}\right)\mathbf{b}$.
$\mathbf{1}\cdot\mathbf{1}=n$. So $\frac{1}{n}(\mathbf{x}\cdot\mathbf{1})\,\mathbf{1}$ is the projection of $\mathbf{x}$ onto $\mathbf{1}$.
Subtracting it means removing the component in the direction of 1 from x. So y is orthogonal to 1. y is in a hyper-plane orthogonal to 1.
The standard deviation can be seen as the length of $\mathbf{y}$ divided by $\sqrt{n}$ (or $\sqrt{n-1}$): $\hat{\sigma}^2=\frac{1}{n}|\mathbf{y}|^2$, $\hat{\sigma}=\frac{1}{\sqrt{n}}|\mathbf{y}|$.
Dividing by the standard deviation can be seen as projecting onto the unit sphere and then multiplying by $\sqrt{n}$ (or $\sqrt{n-1}$).
So computing the Z-score can be seen as first projecting onto a hyperplane orthogonal to $\mathbf{1}$, then projecting onto the unit sphere and multiplying by $\sqrt{n}$ (or $\sqrt{n-1}$).
Skewness measures which side has more extreme values.
$$\mathrm{Skew}[X]=E\left[\frac{(X-\mu)^3}{\sigma^3}\right]$$
A large positive skew means there is a fat tail on positive side (may have positive Black Swans). A large negative skew means fat tail on negative side (may have negative Black Swans).
If the two sides are symmetric, the skew is 0, regardless of how fat the tails are. Gaussian distributions are symmetric, so they have zero skew. Note that an asymmetric distribution can also have 0 skewness.
There is a concept called moments that unify mean, variance, skewness and kurtosis:
The n-th moment: E[Xn]. Mean is the first moment.
The n-th central moment: E[(X−μ)n]. Variance is the second central moment.
The n-th central standardized moment: $E\left[\left(\frac{X-\mu}{\sigma}\right)^n\right]$. Skewness is the third central standardized moment. Kurtosis is the fourth central standardized moment.
There is an unbiased way to estimate the third central moment $\mu_3$.
$$\mu_3[X]=E[(X-\mu)^3] \qquad \hat{\mu}_3=\frac{n}{(n-1)(n-2)}\sum_i (X_i-\hat{\mu})^3$$
The deduction of unbiased third central moment estimator is similar to Bessel's correction, but more tricky.
A common way of estimating skewness from i.i.d. samples is to take the unbiased third central moment estimator and divide it by the cube of the (bias-corrected) estimate of the standard deviation:
$$G_1=\frac{\hat{\mu}_3}{\hat{\sigma}^3}=\frac{n}{(n-1)(n-2)}\sum_i\frac{(X_i-\hat{\mu})^3}{\hat{\sigma}^3}$$
But it's still biased, because $E\left[\frac{X}{Y}\right]$ doesn't necessarily equal $\frac{E[X]}{E[Y]}$. Unfortunately, there is no completely unbiased way to estimate skewness from i.i.d. samples (unless you have other assumptions about the underlying distribution). The bias gets smaller with more i.i.d. samples.
Larger kurtosis means it has a fatter tail. The more extreme values (Black Swans) it has, the higher its kurtosis.
$$\mathrm{Kurt}[X]=E\left[\frac{(X-\mu)^4}{\sigma^4}\right]=\frac{E[(X-\mu)^4]}{\sigma^4}$$
Gaussian distributions have kurtosis of 3. Excess kurtosis is the kurtosis minus 3.
A common way of estimating excess kurtosis from i.i.d. samples is to take the unbiased estimator of the fourth cumulant ($E[(X-E[X])^4]-3\,\mathrm{Var}[X]^2$) and divide it by the square of the unbiased estimator of variance:
If we have some independent samples of X, we can estimate the mean E[X] by calculating the average $\hat{E}[X]=\frac{1}{n}\sum_i X_i$. The variance of the calculated average is $\frac{1}{n}\mathrm{Var}[X]$, which decreases as we get more samples.
However, if the variance of X is large and we only have a few samples, the average will have a large variance and the estimated mean will be inaccurate. We can make the estimation more accurate by using a control variate.
If:
we have a random variable Y that's correlated with X
we know the true mean of Y: E[Y],
Then we can estimate E[X] using E^[X+λ(Y−E[Y])], where λ is a constant. By choosing the right λ, the estimator can have lower variance than just calculating average of X. The Y here is called a control variate.
Some previous knowledge: $E[\hat{E}[A]]=E[A]$, $\mathrm{Var}[\hat{E}[A]]=\frac{1}{n}\mathrm{Var}[A]$.
The mean of that estimator is E[X], meaning that the estimator is unbiased:
We want to minimize the variance of the estimator by choosing λ. That means finding the λ that minimizes $\lambda^2\,\mathrm{Var}[Y]+2\lambda\,\mathrm{cov}[X,Y]$. Quadratic function knowledge tells us that $ax^2+bx+c$ ($a>0$) is minimized at $x=-\frac{b}{2a}$, so the optimal lambda is:
$$\lambda=-\frac{\mathrm{cov}[X,Y]}{\mathrm{Var}[Y]}$$
And by using that optimal λ, the variance of estimator is:
$$\mathrm{Var}\left[\hat{E}[X+\lambda(Y-E[Y])]\right]=\frac{1}{n}\left(\mathrm{Var}[X]-\frac{\mathrm{cov}[X,Y]^2}{\mathrm{Var}[Y]}\right)$$
If X and Y are correlated, then $\frac{\mathrm{cov}[X,Y]^2}{\mathrm{Var}[Y]}>0$, so the new estimator has smaller variance and is more accurate than the simple one. The larger the correlation, the better it can be.
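A small simulation sketch of this variance reduction. The setup is made up for illustration: Y is a standard normal control variate with known mean 0, and X is Y plus small noise (in practice λ would often be estimated from a separate pilot sample):

```python
import numpy as np

rng = np.random.default_rng(0)

def estimate(n=100):
    y = rng.normal(size=n)                   # control variate Y, known E[Y] = 0
    x = y + rng.normal(scale=0.3, size=n)    # X correlated with Y, true E[X] = 0
    lam = -np.cov(x, y)[0, 1] / y.var()      # optimal lambda = -cov(X,Y)/Var[Y]
    plain = x.mean()
    controlled = (x + lam * (y - 0.0)).mean()
    return plain, controlled

results = np.array([estimate() for _ in range(5_000)])
print(results.var(axis=0))   # variance of the plain estimator vs. the control-variate estimator
```

The control-variate estimator's variance is roughly Var[X] − cov[X,Y]²/Var[Y] divided by n, which here is much smaller than the plain estimator's variance.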
Information entropy measures how much information a sample from a distribution carries, on average.
If we want to measure the amount of information of a specific event, denoting event E's amount of information as I(E), we can define these axioms:
If that event always happens, then it carries zero information. I(E)=0 if P(E)=1.
The more rare an event is, the larger information (more surprise) it carries. I(E) increases as P(E) decreases.
The information of two independent events happening together is the sum of the information of each event. Here I use (X,Y) to denote the combination of X and Y. That means $I((X,Y))=I(X)+I(Y)$ if $P((X,Y))=P(X)\cdot P(Y)$. This implies the use of a logarithm.
Then according to the three axioms, the definition of I is:
$$I(E)=\log_b\frac{1}{P(E)}=-\log_b P(E)$$
The base b is relative to the unit. We often use the amount of bits as the unit of amount of information. An event with 50% probability has 1 bit of information, then the base will be 2:
$$I(E)=\log_2\frac{1}{P(E)}\quad\text{(in bits)}$$
Then, for a distribution, the expected value of information of one sample is the expected value of I(E). That defines information entropy H:
$$H(X)=E[I(X)]=E\left[\log_2\frac{1}{P(X)}\right]$$
In discrete case:
$$H(X)=\sum_x P(x)\cdot\log_2\frac{1}{P(x)}$$
If there exists x where P(x)=0, it can be ignored in the entropy calculation, because $\lim_{x\to 0^+}x\log x=0$.
Information entropy in the discrete case is never negative.
In continuous case, where f is the probability density function, this is called differential entropy:
$$H(X)=\int_{\mathcal{X}}f(x)\cdot\log\frac{1}{f(x)}\,dx$$
($\mathcal{X}$ means the set of x where $f(x)\neq 0$, also called the support of f.)
In continuous case the base is often e rather than 2. Here log by default means loge.
In the discrete case, $0\le P(x)\le 1$, so $\log\frac{1}{P(x)}\ge 0$ and entropy can never be negative. But in the continuous case, the probability density function can take values larger than 1, so differential entropy may be negative.
A fair coin toss with two cases has 1 bit of information entropy: $0.5\cdot\log_2\frac{1}{0.5}+0.5\cdot\log_2\frac{1}{0.5}=1$ bit.
If the coin is biased, for example heads has 90% probability and tails 10%, then its entropy is: $0.9\cdot\log_2\frac{1}{0.9}+0.1\cdot\log_2\frac{1}{0.1}\approx 0.47$ bits.
If it's even more biased, having 99.99% probability of heads and 0.01% probability of tails, then its entropy is: $0.9999\cdot\log_2\frac{1}{0.9999}+0.0001\cdot\log_2\frac{1}{0.0001}\approx 0.0015$ bits.
If a coin toss is fair but has a 0.01% chance of standing up on the table, giving 3 cases with probabilities 0.0001, 0.49995, 0.49995, then its entropy is $0.0001\cdot\log_2\frac{1}{0.0001}+0.49995\cdot\log_2\frac{1}{0.49995}+0.49995\cdot\log_2\frac{1}{0.49995}\approx 1.0014$ bits. (The standing-up event itself carries about 13.3 bits of information, but its probability is so low that it contributes little to the entropy.)
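A tiny sketch reproducing these entropy values:

```python
import math

def entropy_bits(probs):
    # Discrete information entropy in bits; terms with probability 0 are ignored.
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

print(entropy_bits([0.5, 0.5]))                  # 1.0
print(entropy_bits([0.9, 0.1]))                  # ≈ 0.47
print(entropy_bits([0.9999, 0.0001]))            # ≈ 0.0015
print(entropy_bits([0.0001, 0.49995, 0.49995]))  # ≈ 1.0014
```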
If X and Y are independent, then H((X,Y))=E[I((X,Y))]=E[I(X)+I(Y)]=E[I(X)]+E[I(Y)]=H(X)+H(Y). If one fair coin toss has 1 bit entropy, then n independent tosses has n bit entropy.
If I split one case into two cases, entropy increases. If I merge two cases into one case, entropy decreases. This is because $p_1\log\frac{1}{p_1}+p_2\log\frac{1}{p_2}>(p_1+p_2)\log\frac{1}{p_1+p_2}$ (if $p_1\neq 0, p_2\neq 0$), which follows because $f(x)=\log\frac{1}{x}$ is convex, so $\frac{p_1}{p_1+p_2}\log\frac{1}{p_1}+\frac{p_2}{p_1+p_2}\log\frac{1}{p_2}>\log\frac{1}{p_1+p_2}$; multiplying both sides by $p_1+p_2$ gives the result above.
The information entropy is the theoretical minimum amount of information required to encode a sample. For example, to encode the result of a fair coin toss, we use 1 bit: 0 for heads and 1 for tails (reversing is also fine). If the coin is biased towards heads, to compress the information, we can use 0 for two consecutive heads, 10 for one head, and 11 for one tail, which requires fewer bits per sample on average. That may not be optimal, but the most optimal lossless compression cannot do better than the information entropy.
In continuous case, if k is a positive constant, H(kX)=H(X)+logk:
A joint distribution of X and Y is a distribution where each outcome is a pair of X and Y. Its entropy is called joint information entropy. Here I will use H((X,Y)) to denote joint entropy (to avoid confusing with cross entropy).
If X and Y are not independent, then the joint entropy is smaller than if they are independent: H((X,Y))<H(X)+H(Y). If X and Y are not independent then knowing X will also give some information about Y. This can be deduced by mutual information which will be explained below.
Here $I_A(x)$ denotes the amount of information of value (event) x in distribution A. The difference of information of the same value in two distributions A and B:
The KL divergence from A to B is the expected value of that difference, taken with respect to A's probabilities. Here $E_A$ means the expected value calculated using A's probabilities:
If I "expect" the distribution is B, but the distribution is actually A, how much "surprise" do I get on average.
If I design a loseless compression algorithm optimized for B, but use it to compress data from A, then the compression will be not optimal and contain redundant information. KL divergence measures how much redundant information it has on average.
KL divergence is also called relative entropy.
KL divergence is asymmetric: $D_{KL}(A\parallel B)$ is different from $D_{KL}(B\parallel A)$. Often the first distribution is the real underlying distribution, and the second distribution is an approximation or the model output.
If A and B are the same, the KL divergence between them is zero. Otherwise, KL divergence is positive. KL divergence can never be negative, as will be explained later.
$P_B(x)$ appears in the denominator. If there exists x where $P_B(x)=0$ and $P_A(x)\neq 0$, then the KL divergence is infinite. It can be read as "the model expects something to never happen, but it actually can happen". If there is no such case, we say that A is absolutely continuous with respect to B, written $A\ll B$. This requires the outcomes of B to include all outcomes of A.
Another concept is cross entropy. The cross entropy from A to B, denoted H(A,B), is the entropy of A plus KL divergence from A to B:
Information entropy H(X) can also be expressed as cross entropy of itself H(X,X), similar to the relation between variance and covariance.
(In some places H(A,B) denotes joint entropy. I use H((A,B)) for joint entropy to avoid ambiguity.)
Cross entropy is also asymmetric.
In deep learning cross entropy is often used as loss function. If each piece of training data's distribution's entropy H(A) is fixed, minimizing cross entropy is the same as minimizing KL divergence.
Jensen's inequality states that for a concave function f:
E[f(X)]≤f(E[X])
The reverse applies for convex.
Here is a visual example showing Jensen's inequality. For example I have a discrete distribution with 5 cases X1,X2,X3,X4,X5 (these are possible outcomes of distribution, not samples), corresponding to X coordinates of the red dots. The probabilities of the 5 cases are p1,p2,p3,p4,p5 that sum to 1.
Then $(E[X],E[f(X)])$ can be seen as an interpolation between the five points $(X_1,f(X_1)),(X_2,f(X_2)),(X_3,f(X_3)),(X_4,f(X_4)),(X_5,f(X_5))$, using weights $p_1,p_2,p_3,p_4,p_5$. The possible region of the interpolated point corresponds to the green convex polygon:
For each point in green polygon (E[X],E[f(X)]), the point on function curve with the same X coordinate (E[X],f(E[X])) is above it. So E[f(X)]≤f(E[X]).
The same applies when you add more cases to the discrete distribution, the convex polygon will have more points but still below the function curve. The same applies to continuous distribution when there are infinitely many cases.
Jensen's inequality tells that KL divergence is non-negative:
There is a trick: factoring out a −1 puts $P_A$ in the denominator, where it will cancel later.
(PA(x)=0 is impossible because it's sampled from A)
It's never negative and has no bias. The term $\frac{P_B(x)}{P_A(x)}-1$ is a control variate and is negatively correlated with $\log\frac{P_A(x)}{P_B(x)}$.
Recall control variates: if we want to estimate E[X] from samples more accurately, we can find another variable Y that's correlated with X and whose theoretical mean E[Y] we know, then use $\hat{E}[X+\lambda Y]-\lambda E[Y]$ to estimate E[X]. The parameter λ is chosen by minimizing variance.
The mean of that control variate is zero, because $E_{x\sim A}\left[\frac{P_B(x)}{P_A(x)}-1\right]=\sum_x P_A(x)\left(\frac{P_B(x)}{P_A(x)}-1\right)=\sum_x(P_B(x)-P_A(x))=\sum_x P_B(x)-\sum_x P_A(x)=0$
The λ=1 is not chosen by minimizing variance, but by making the estimator non-negative. If I define $k=\frac{P_B(x)}{P_A(x)}$, then $\log\frac{P_A(x)}{P_B(x)}+\lambda\left(\frac{P_B(x)}{P_A(x)}-1\right)=-\log k+\lambda(k-1)$. We want it to be non-negative: $-\log k+\lambda(k-1)\ge 0$ for all $k>0$. This means the line $y=\lambda(k-1)$ must stay above $y=\log k$; the only solution is $\lambda=1$, where the line is the tangent line of $\log k$ at $k=1$.
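A minimal sketch of this estimator on two made-up categorical distributions (samples are drawn from A, and the ratio $P_B(x)/P_A(x)$ is assumed to be computable for each sample):

```python
import numpy as np

rng = np.random.default_rng(0)
p_a = np.array([0.1, 0.2, 0.3, 0.4])      # true distribution A (samples come from A)
p_b = np.array([0.25, 0.25, 0.25, 0.25])  # model distribution B

exact = np.sum(p_a * np.log(p_a / p_b))   # exact KL(A || B)

x = rng.choice(len(p_a), size=100_000, p=p_a)
ratio = p_b[x] / p_a[x]                           # P_B(x)/P_A(x) for each sample
naive = np.mean(np.log(1 / ratio))                # plain Monte Carlo estimate of E_A[log(P_A/P_B)]
with_cv = np.mean(np.log(1 / ratio) + (ratio - 1))  # add the control variate with lambda = 1

print(exact, naive, with_cv)   # the control-variate estimate has lower variance and non-negative terms
```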
In reinforcement learning, KL divergence is often used in the loss function as a regularizer to make training more stable.
In a policy model, the model accepts a state as input and outputs a distribution (a vector of probabilities) giving the probability of taking each possible action in that state, called the policy:
$$\pi_{\text{parameter}}(\text{action}\mid\text{state})=\text{probability of taking that action given the state}$$
The training keeps updating the model parameters. KL divergence can be added to the loss function to keep the new policy from drifting too far from the old policy, preventing the model from changing too fast and making training more stable:
If X and Y are independent, then H((X,Y))=H(X)+H(Y). But if X and Y are not independent, knowing X reduces uncertainty of Y, then H((X,Y))<H(X)+H(Y).
Mutual information I(X;Y) measures how "related" X and Y are:
I(X;Y)=H(X)+H(Y)−H((X,Y))
For a joint distribution, if we only care about X, then the distribution of X is a marginal distribution; similarly for Y.
If we treat X and Y as independent, consider a "fake" joint distribution as if X and Y are independent. Denote that "fake" joint distribution as Z, then P(Z=(x,y))=P(X=x)P(Y=y). It's called "outer product of marginal distribution", because its probability matrix is the outer product of two marginal distributions, so it's denoted X⊗Y.
Then mutual information can be expressed as KL divergence between joint distribution (X,Y) and that "fake" joint distribution X⊗Y:
KL divergence is zero when two distributions are the same, and KL divergence is positive when two distributions are not the same. So:
Mutual information I(X;Y) is zero if the joint distribution (X,Y) is the same as X⊗Y, which means X and Y are independent.
Mutual information I(X;Y) is positive if X and Y are not independent.
Mutual information is never negative, because KL divergence is never negative.
H((X,Y))=H(X)+H(Y)−I(X;Y), so if X and Y are not independent then H((X,Y))<H(X)+H(Y).
Mutual information is symmetric, I(X;Y)=I(Y;X).
As H((X,Y))=H(X∣Y)+H(Y), so I(X;Y)=H(X)+H(Y)−H((X,Y))=H(X)−H(X∣Y).
If knowing Y completely determines X, knowing Y makes the distribution of X collapse to one case with 100% probability, then $H(X\mid Y)=0$ and $I(X;Y)=H(X)$.
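As described above, mutual information is the KL divergence between the joint distribution and the outer product of the marginals. A minimal sketch on a made-up 2×2 joint probability table:

```python
import numpy as np

# A hypothetical 2x2 joint distribution P(X, Y); rows are X, columns are Y.
joint = np.array([[0.4, 0.1],
                  [0.1, 0.4]])

p_x = joint.sum(axis=1)          # marginal distribution of X
p_y = joint.sum(axis=0)          # marginal distribution of Y
indep = np.outer(p_x, p_y)       # "fake" joint distribution X ⊗ Y (as if independent)

# I(X;Y) = KL(joint || outer product of marginals), here in bits.
mi = np.sum(joint * np.log2(joint / indep))
print(mi)                        # > 0, because X and Y are not independent here
```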
Information Bottleneck theory tells that the training of neural network will learn an intermediary representation that:
Minimize I(Input;IntermediaryRepresentation). Try to compress the intermediary representation and reduce unnecessary information related to input.
Maximize I(IntermediaryRepresentation;Output). Try to keep as much of the information in the intermediary representation that's relevant to the output as possible.
If we have two independent random variables X and Y, and consider the distribution of the sum Z=X+Y, then
$$P(Z=z)=\sum_{x,y:\,x+y=z}P(X=x)P(Y=y)$$
For each z, it sums over different x and y within the constraint z=x+y.
The constraint z=x+y allows determining y from x and z: y=z−x, so it can be rewritten as:
$$P(Z=z)=\sum_x P(X=x)P(Y=z-x)$$
In continuous case
$$f_Z(z)=\int_{-\infty}^{\infty}f_X(x)f_Y(z-x)\,dx$$
The probability density function of the sum, $f_Z$, is the convolution of $f_X$ and $f_Y$:
$$f_Z=f_X * f_Y \qquad P_Z=P_X * P_Y$$
The convolution operator ∗ can work on different representations:
In continuous case, convolution takes two probability density functions, and give a new probability density function.
In discrete case, convolution can take two functions and give a new function. Each function inputs an outcome and outputs the probability of that outcome.
In the discrete case, convolution can take two vectors and give a new vector. Each vector's i-th element corresponds to the probability of the i-th outcome (see the sketch below).
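A small sketch using NumPy's `np.convolve` on the probability-vector representation, for the sum of two fair dice:

```python
import numpy as np

# Distribution of one fair six-sided die as a probability vector (outcomes 1..6).
die = np.ones(6) / 6

# The distribution of the sum of two independent dice is the convolution.
two_dice = np.convolve(die, die)   # outcomes 2..12
print(two_dice)                    # peaks at 7 with probability 6/36
```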
Convolution also works in 2D or more dimensions. If $X=(x_1,x_2)$ and $Y=(y_1,y_2)$ are 2D random variables (each a joint distribution of its two components), the distribution of $Z=X+Y=(z_1,z_2)$ is the convolution of the distributions of X and Y:
Convolution can also work on cases where the values are not probabilities. Convolutional neural network uses discrete version of convolution on matrices.
Normally when talking about probability we mean the probability of an outcome under a modelled distribution: P(outcome∣modelled distribution). But sometimes we have some concrete samples from a distribution but want to know which model suits the best, so we talk about the probability that a model is true given some samples: P(modelled distribution∣outcome).
If I have some samples, then some parameters make the samples more likely to come from the modelled distribution, and some parameters make the samples less likely to come from the modelled distribution.
For example, if I model a coin flip using a parameter θ and I observe that 10 coin flips have 9 heads and 1 tail, then θ=0.9 is more likely than θ=0.5. That's straightforward for a simple model. But for more complex models, we need a way to measure likelihood.
Likelihood L(θ∣x1,x2,...,xn) measures:
How likely that we get samples x1,x2,...,xn from the modelled distribution using parameter θ.
How likely a parameter θ is the real underlying parameter, given some independent samples $x_1,x_2,...,x_n$.
For example, if I model a coin flip distribution using a parameter θ, the probability of head is θ and tail is 1−θ. If I observe 10 coin flip has 9 heads and 1 tail, then the likelihood of θ:
$$L(\theta\mid\text{9 heads and 1 tail})=\theta^9\cdot(1-\theta)$$
If I assume that the coin flip is fair, θ=0.5, then likelihood is about 0.000977.
If I assume θ=0.9, then the likelihood is about 0.0387, which is larger.
If I assume θ=0.999 then likelihood is about 0.00099, which is smaller than when assuming θ=0.9.
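A short sketch evaluating this likelihood function and scanning a grid for the parameter that maximizes it (the grid-search approach is just one simple way to find the maximum):

```python
import numpy as np

def likelihood(theta):
    # L(theta | 9 heads and 1 tail) = theta^9 * (1 - theta)
    return theta**9 * (1 - theta)

for t in (0.5, 0.9, 0.999):
    print(t, likelihood(t))        # ≈ 0.000977, 0.0387, 0.00099

# Scan a grid of candidate parameters for the one with maximum likelihood.
grid = np.linspace(0, 1, 10_001)
print(grid[np.argmax(likelihood(grid))])   # 0.9
```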
The more likely a parameter is, the higher its likelihood. The parameter that maximizes the likelihood (the maximum likelihood estimate) is our best guess of the true underlying parameter.
By taking logarithm, multiply becomes addition, making it easier to analyze. The log-likelihood function:
If θ equals the true underlying parameter, then the mean of the log-likelihood $E_x[\log L(\theta\mid x)]$ takes its maximum.
A continuous function has zero derivative at its maximum point, so when θ is the true parameter, the mean of the score function (the derivative of the log-likelihood with respect to θ), $E_x[s(\theta;x)]=E_x\left[\frac{\partial}{\partial\theta}\log f(x\mid\theta)\right]$, is zero.
The Fisher information $I(\theta)$ is the mean of the square of the score:
$$I(\theta)=E_x[s(\theta;x)^2]$$
(The mean is calculated over different outcomes, not different parameters.)
We can also think that Fisher information is always computed under the assumption that θ is the true underlying parameter, then Ex[s(θ;x)]=0, then Fisher information is the variance of score I(θ)=Varx[s(θ;x)].
Fisher information $I(\theta)$ also measures the curvature of the expected log-likelihood, in parameter space, around θ.
Fisher information measures how much information a sample can tell us about the underlying parameter.
When the parameter is an offset and the offset is infinitesimally small, the score function is called the linear score. If the infinitesimal offset is θ and the offset probability density is $f_2(x\mid\theta)=f(x+\theta)$, then
Recall that if we make probability distribution more "spread out" the entropy will increase. If there is no constraint, maximizing entropy of real-number distribution will get a distribution that's "infinitely spread out over all real numbers". But if there are constraints, maximizing entropy will give some common and important distributions:
| Constraint | Max-entropy distribution |
| - | - |
| $a\le X\le b$ | Uniform distribution |
| $E[X]=\mu,\ \mathrm{Var}[X]=\sigma^2$ | Normal distribution |
| $X\ge 0,\ E[X]=\mu$ | Exponential distribution |
| $X\ge m>0,\ E[\log X]=g$ | Pareto (Type I) distribution |
There are other max-entropy distributions. See Wikipedia.
We can rediscover these max-entropy distributions by using Lagrange multipliers and functional derivatives.
To find the distribution with maximum entropy under a variance constraint, we can use a Lagrange multiplier. If we want to find the maximum or minimum of f(x) under the constraint that g(x)=0, we can define the Lagrangian function L:
L(x,λ)=f(x)+λ⋅g(x)
Its two partial derivatives have special properties:
$$\frac{\partial L(x,\lambda)}{\partial x}=\frac{\partial f(x)}{\partial x}+\lambda\frac{\partial g(x)}{\partial x} \qquad \frac{\partial L(x,\lambda)}{\partial\lambda}=g(x)$$
Then solving the equations $\frac{\partial L(x,\lambda)}{\partial x}=0$ and $\frac{\partial L(x,\lambda)}{\partial\lambda}=0$ finds the maximum or minimum under the constraint. Similarly, if there are multiple constraints, there are multiple λs. Similar things also apply to functions with multiple arguments. The argument x can be a number or even a function, which involves the functional derivative:
A functional is a function that inputs a function and outputs a value. (One of) its input is a function rather than a value (it's a higher-order function). Functional derivative (also called variational derivative) means the derivative of a functional respect to its argument function.
To compute a functional derivative, we add a small "perturbation" to the function. f(x) becomes $f(x)+\epsilon\cdot\eta(x)$, where epsilon ϵ is an infinitely small value that approaches zero, and eta η(x) is a test function. The test function can be any function that satisfies some properties.
The definition of functional derivative:
$$\left.\frac{\partial G(f+\epsilon\eta)}{\partial\epsilon}\right|_{\epsilon\to 0}=\int\frac{\partial G}{\partial f}\cdot\eta(x)\,dx$$
Note that it's inside integration.
For example, this is a functional: $G(f)=\int x f(x)\,dx$. To compute the functional derivative $\frac{\partial G}{\partial f}$, we first compute $\frac{\partial G(f+\epsilon\eta)}{\partial\epsilon}$ and then try to put it into the form $\int\frac{\partial G}{\partial f}\cdot\eta(x)\,dx$:
$$\frac{\partial G(f+\epsilon\eta)}{\partial\epsilon}=\frac{\partial}{\partial\epsilon}\int x\,(f(x)+\epsilon\eta(x))\,dx=\int x\cdot\eta(x)\,dx$$
Then by pattern matching with the definition, we get $\frac{\partial G}{\partial f}=x$.
Calculate the functional derivative for $G(f)=\int x^2 f(x)\,dx$:
The normal distribution, also called the Gaussian distribution, is important in statistics. It's the distribution with maximum entropy if we constrain its variance $\sigma^2$ to a finite value.
It has two parameters: the mean μ and the standard deviation σ. $N(\mu,\sigma^2)$ denotes a normal distribution. Changing μ moves the PDF along the X axis. Changing σ scales the PDF along the X axis.
We can rediscover the normal distribution by maximizing entropy under a variance constraint. To do that there are two prerequisite concepts: Lagrange multipliers and functional derivatives.
Rediscover normal distribution by maximizing entropy with variance constraint
For a distribution's probability density function f, we want to maximize its entropy $H(f)=\int f(x)\log\frac{1}{f(x)}\,dx$ under the constraints:
It's a valid probability density function: $\int_{-\infty}^{\infty}f(x)\,dx=1$, and $f(x)\ge 0$
The mean: $\int_{-\infty}^{\infty}xf(x)\,dx=\mu$
The variance constraint: $\int_{-\infty}^{\infty}f(x)(x-\mu)^2\,dx=\sigma^2$
We can simplify to make deduction easier:
Moving the probability density function along the X axis doesn't change entropy, so we can fix the mean at 0 (we can replace x with x−μ after finishing the deduction).
$\log\frac{1}{f(x)}$ already implicitly requires $f(x)>0$
It turns out that the mean constraint $\int_{-\infty}^{\infty}xf(x)\,dx=0$ is not necessary to deduce the result, so we need not include it among the Lagrange multipliers. (Including it is also fine but makes things more complex.)
We get the rough form of the normal distribution's probability density function.
Then solve $\frac{\partial L}{\partial\lambda_1}=0$:
$$\frac{\partial L}{\partial\lambda_1}=0 \;\Rightarrow\; \int_{-\infty}^{\infty}f(x)\,dx=1 \;\Rightarrow\; \int_{-\infty}^{\infty}e^{-1+\lambda_1+\lambda_2 x^2}\,dx=1$$
That integration must converge, so λ2<0.
A subproblem: solve $\int_{-\infty}^{\infty}e^{-kx^2}\,dx$ ($k>0$). The trick is to first compute its square $\left(\int_{-\infty}^{\infty}e^{-kx^2}\,dx\right)^2$, turning the integral into a two-dimensional one, and then substitute polar coordinates $x=r\cos\theta,\ y=r\sin\theta,\ x^2+y^2=r^2,\ dx\,dy=r\,dr\,d\theta$:
If X follows a normal distribution and Y is any distribution with the same mean and variance, the cross entropy H(Y,X) has the same value, $\frac{1}{2}\log(2\pi e\sigma^2)$, regardless of the exact probability density function of Y. The deduction is similar to the above:
We have a random variable X, which has mean 0 and (finite) variance $\sigma^2$.
If we add up n independent samples of X: $X_1+X_2+...+X_n$, the variance of the sum is $n\sigma^2$.
To keep the variance constant, we can divide the sum by $\sqrt{n}$, giving $S_n=\frac{X_1+X_2+...+X_n}{\sqrt{n}}$. Here $S_n$ is called the standardized sum, because its variance doesn't change with the sample count.
The central limit theorem says that the standardized sum approaches a normal distribution as n increases. No matter what the original distribution of X is (as long as its variance is finite), the standardized sum will approach a normal distribution.
The information of the distribution of X will be "washed out" during the process. This "washing out of information" is also an increase of entropy. As n increases, the entropy of the standardized sum always increases (except when X follows a normal distribution, in which case the entropy stays at the maximum). $H(S_{n+1})>H(S_n)$ if X is not normally distributed.
The normal distribution has the maximum entropy under a variance constraint. As the entropy of the standardized sum increases, it approaches that maximum and the distribution approaches normal. This is similar to the second law of thermodynamics.
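A quick simulation sketch of the central limit theorem; the choice of a uniform distribution for X and the use of excess kurtosis as a rough normality check are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def standardized_sum(n, size):
    # X is uniform on [-1, 1]: zero mean, finite variance, clearly not normal.
    x = rng.uniform(-1, 1, size=(size, n))
    return x.sum(axis=1) / np.sqrt(n)

for n in (1, 2, 10, 100):
    s = standardized_sum(n, 100_000)
    # Excess kurtosis approaches 0 (the Gaussian value) as n grows.
    excess_kurtosis = np.mean((s - s.mean()) ** 4) / s.var() ** 2 - 3
    print(n, round(excess_kurtosis, 3))
```

For the uniform distribution the excess kurtosis starts at about −1.2 and shrinks roughly like −1.2/n, illustrating how the standardized sum gets closer to normal as n increases.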
In the real world, many things follow normal distribution, like height of people, weight of people, error in manufacturing, error in measurement, etc.
The height of people is affected by many complex factors (nutrition, health, genetics, exercise, environmental factors, etc.). The combination of these complex factors definitely cannot be simplified to a standardized sum of i.i.d. zero-mean samples $\frac{X_1+X_2+...+X_n}{\sqrt{n}}$. Some factors have large effects and some have small effects. The factors are not necessarily independent. But the height of people still roughly follows a normal distribution. This can be semi-explained by the second law of thermodynamics: the complex interaction of many factors increases the entropy of the height. At the same time, there are also many factors that constrain the variance of height. Why is there a variance constraint? In some cases variance corresponds to instability. A human that is 100 meters tall is impossible because it's physically unstable. Similarly, a human that's 1 cm tall cannot maintain normal biological function. The unstable things tend to collapse and vanish (survivorship bias), and the stable things remain. That's how the variance constraint occurs in nature. In some places, variance corresponds to energy, and the variance is constrained by conservation of energy.
Although normal distribution is common, not all distributions are normal. There are also many things that follow fat-tail distributions.
Also note that the central limit theorem works as n approaches infinity. Even if a distribution's standardized sum approaches a normal distribution, the speed of convergence matters: some distributions converge to normal quickly, and some slowly. Some fat-tail distributions have finite variance, but their standardized sums converge to a normal distribution very slowly.
In below, bold letter (like x) means column vector:
$$\mathbf{x}=\begin{pmatrix}x_1\\x_2\\\vdots\\x_n\end{pmatrix}$$
Linear transform: for a (column) vector x, multiplying a matrix A onto it, Ax, is a linear transformation. A linear transformation can contain rotation, scaling and shearing. For a row vector it's xA. Two linear transformations can be combined into one, corresponding to matrix multiplication.
Affine transform: for a (column) vector x, multiply a matrix onto it and then add an offset: Ax+b. It can also translate (move) the result of the linear transform. Two affine transformations can be combined into one: if y=Ax+b and z=Cy+d, then z=(CA)x+(Cb+d)
(in some places affine transformation is called "linear transformation".)
Normal distribution has linear properties:
if you multiply by a constant, the result still follows a normal distribution. $X\sim N\Rightarrow kX\sim N$
if you add a constant, the result still follows a normal distribution. $X\sim N\Rightarrow (X+k)\sim N$
If you add up two independent normal random variables, the result still follows a normal distribution. $X\sim N,\ Y\sim N\Rightarrow (X+Y)\sim N$
A linear combination of many independent normal variables also follows a normal distribution. $X_1\sim N, X_2\sim N,...,X_n\sim N\Rightarrow (k_1X_1+k_2X_2+...+k_nX_n)\sim N$
If:
We have a (column) vector of independent random variables $\mathbf{x}=(x_1,x_2,...,x_n)$, where each element follows a normal distribution (not necessarily the same normal distribution),
then, if we apply an affine transformation to that vector, which means multiplying by a matrix A and then adding an offset b: $\mathbf{y}=A\mathbf{x}+\mathbf{b}$,
then each element of y is a linear combination of normal variables, $y_i=A_{i,1}x_1+A_{i,2}x_2+...+A_{i,n}x_n+b_i$,
so each element of y also follows a normal distribution. Now y follows a multivariate normal distribution.
Note that the elements of y are no longer necessarily independent.
What if I apply two or many affine transformations? Two affine transformations can be combined into one. So the result is still multivariate normal distribution.
To describe a multivariate normal distribution, an important concept is covariance matrix.
Recall covariance: Cov[X,Y]=E[(X−E[X])(Y−E[Y])]. Some rules about covariance:
It's symmetric: Cov[X,Y]=Cov[Y,X]
If X and Y are independent, Cov[X,Y]=0
Adding a constant: Cov[X+k,Y]=Cov[X,Y]. Covariance is invariant to translation.
Multiplying constant Cov[k⋅X,Y]=k⋅Cov[X,Y]
Addition Cov[X+Y,Z]=Cov[X,Z]+Cov[Y,Z]
Covariance matrix:
Cov(x,y)=E[(x−E[x])(y−E[y])T]
Here E[x] takes the mean of each element of x and outputs a vector; it's element-wise: $E[\mathbf{x}]_i=E[x_i]$. Similarly for matrices.
Recall that multiplying by a constant and adding can be "moved out of E[]": $E[kX]=kE[X]$, $E[X+Y]=E[X]+E[Y]$. If A is a matrix that contains random variables and B is a matrix that's not random, then $E[A\cdot B]=E[A]\cdot B$ and $E[B\cdot A]=B\cdot E[A]$, because multiplying by a matrix comes down to multiplying by constants and adding up, all of which can be "moved out of E[]". A vector can be seen as a special kind of matrix.
If x follows multivariate normal distribution, it can be described by mean vector μ (the mean of each element of x) and covariance matrix Cov(x,x).
Initially, suppose I have some independent normal variables $x_1,x_2,...,x_n$ with means $\mu_1,...,\mu_n$ and variances $\sigma_1^2,...,\sigma_n^2$. If we treat them as a multivariate normal distribution, the mean vector is $\mu_x=(\mu_1,...,\mu_n)$. The covariance matrix is diagonal because they are independent:
$$\mathrm{Cov}(\mathbf{x},\mathbf{x})=\begin{pmatrix}\sigma_1^2&0&\cdots&0\\0&\sigma_2^2&\cdots&0\\\vdots&\vdots&\ddots&\vdots\\0&0&\cdots&\sigma_n^2\end{pmatrix}$$
Then if we apply an affine transformation $\mathbf{y}=A\mathbf{x}+\mathbf{b}$, then $\mu_y=A\mu_x+\mathbf{b}$ and $\mathrm{Cov}(\mathbf{y},\mathbf{y})=\mathrm{Cov}(A\mathbf{x}+\mathbf{b},A\mathbf{x}+\mathbf{b})=\mathrm{Cov}(A\mathbf{x},A\mathbf{x})=A\,\mathrm{Cov}(\mathbf{x},\mathbf{x})\,A^T$.
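A small NumPy sketch checking this rule by simulation; the matrix A, offset b and the two variances are made-up example values:

```python
import numpy as np

rng = np.random.default_rng(0)

# Independent normal variables x1, x2 with standard deviations 1 and 3.
x = rng.normal(loc=0.0, scale=[1.0, 3.0], size=(200_000, 2))

A = np.array([[1.0, 2.0],
              [0.5, -1.0]])
b = np.array([4.0, -2.0])
y = x @ A.T + b                     # affine transform y = A x + b, applied per row

cov_x = np.diag([1.0**2, 3.0**2])   # theoretical covariance of x (diagonal)
print(A @ cov_x @ A.T)              # theoretical Cov(y, y) = A Cov(x, x) A^T
print(np.cov(y, rowvar=False))      # empirical covariance, close to the above
```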
The industry standard of 3D modelling is to model a 3D object as many triangles, called a mesh. It only models the visible surface of the object. It uses many triangles to approximate curved surfaces.
Gaussian splatting provides an alternative method of 3D modelling. The 3D scene is modelled by many multivariate (3D) Gaussian distributions, each called a gaussian. When rendering, each 3D Gaussian distribution is projected onto a plane (the screen), approximately becoming a 2D Gaussian distribution, where probability density now corresponds to color opacity.
Note that the projection is a perspective projection (near things look big and far things look small). Perspective projection is not linear. After perspective projection, the 3D Gaussian is no longer strictly a 2D Gaussian distribution, but it is very close to one, so it's approximated as a 2D gaussian. If the projection were linear, the result would still be exactly Gaussian.
A gaussian's color can be fixed or can change based on different view directions.
In diffusion model, we add gaussian noise to image (or other things). Then the diffusion model takes noisy input and we train it to output the noise added to it. There will be many steps of adding noise and the model should output the noise added in each step.
Tweedie's formula shows that estimating the added noise is equivalent to knowing the score (the derivative of the log probability density) of the noise-added image distribution.
To simplify, here we only consider one dimension and one noise step (the same also applies to many dimensions and many noise steps).
If the original value is x0, we add a noise ϵ∼N(0,σ2), the noise-added value is x1=x0+ϵ, x1∼N(x0,σ2).
The diffusion model only knows $x_1$ and doesn't know $x_0$. The diffusion model needs to estimate ϵ from $x_1$.
Here:
$p_0(x_0)$ is the probability density of the original clean value (for image generation, it corresponds to the probability distribution of images that we want to generate)
$p_1(x_1)$ is the probability density of the noise-added value
$p_{1|0}(x_1\mid x_0)$ is the probability density of the noise-added value, given clean training data $x_0$. It's a normal distribution given $x_0$. It can also be seen as a function that takes two arguments $x_0,x_1$.
$p_{0|1}(x_0\mid x_1)$ is the probability density of the original clean value given the noise-added value. It can also be seen as a function that takes two arguments $x_0,x_1$.
(I use $p_{1|0}(x_1\mid x_0)$ instead of the shorter $p(x_1\mid x_0)$ to reduce confusion between the different distributions.)
Now suppose we already know the noise-added value $x_1$, but we don't know $x_0$, so $x_0$ is uncertain. We want to compute the expectation of $x_0$ under the condition that $x_1$ is known.
And $E_{x_0}\left[\sigma^2\frac{\partial}{\partial x_1}\log p_1(x_1)\,\middle|\,x_1\right]=\sigma^2\frac{\partial}{\partial x_1}\log p_1(x_1)$, because that term does not depend on the random $x_0$.
So
$$E[x_0\mid x_1]=x_1+\underbrace{\sigma^2\frac{\partial}{\partial x_1}\log p_1(x_1)}_{\text{train the diffusion model to output this}}$$
That's Tweedie's formula (for the 1D case). It can be generalized to many dimensions, where $x_0,x_1$ are vectors and the distributions $p_0,p_1,p_{0|1},p_{1|0}$ are joint distributions whose dimensions are not necessarily independent. The Gaussian noise added to different dimensions is still independent.
The diffusion model is trained to estimate the added noise, which is the same as estimating the linear score.
If we have constraint X≥0 and fix the mean E[X] to a specific value μ, then maximizing entropy gives exponential distribution. It can also be rediscovered from Lagrange multiplier:
If some event happens at a fixed rate λ, the exponential distribution measures how long we need to wait for the next event, when the time we still need to wait is unrelated to how long we have already waited (memorylessness).
Exponential distribution can measure:
The lifetime of machine components.
The time until a radioactive atom decays.
The time length of phone calls.
The time interval between two packets for a router.
...
How to understand memorylessness? For example, a kind of radioactive atom decays once per 5 minutes on average. If the time unit is minutes, then $\lambda=\frac{1}{5}$. For a specific atom, if we wait for it to decay, the time we need to wait is on average 5 minutes. However, if we have already waited for 3 minutes and it still hasn't decayed, the expected time that we still need to wait is still 5 minutes. If we have waited for 100 minutes and it still hasn't decayed, the expected time that we still need to wait is still 5 minutes. The atom doesn't "remember" how long we have waited.
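A quick simulation sketch of the atom example above (the numbers mirror the example; `rng.exponential` is NumPy's exponential sampler):

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 1 / 5                                      # one decay per 5 minutes on average
wait = rng.exponential(scale=1 / lam, size=1_000_000)

print(wait.mean())                               # ≈ 5 minutes
already = 3.0
remaining = wait[wait > already] - already       # condition on having already waited 3 minutes
print(remaining.mean())                          # still ≈ 5 minutes: waiting doesn't help
```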
Memorylessness means the probability that we still need to wait needToWait amount of time is irrelevant to how long we have already waited:
(We can also rediscover exponential distrbution from just memorylessness.)
Memorylessness is related to the maximum-entropy property. Maximizing entropy under constraints means maximizing uncertainty and minimizing information other than the constraints. The only two constraints are $X\ge 0$ (the waiting time is non-negative) and $E[X]=\frac{1}{\lambda}$ (the average rate of the event). Other than these two constraints, there is no extra information: no information says waiting reduces the remaining waiting time, and no information says waiting increases it. So the most unbiased choice is that waiting has no effect on the remaining waiting time. If the radioactive atom had some "internal memory" that changes over time and controls how likely it is to decay, then the waiting time distribution would encode extra information beyond the two constraints, which would make it no longer max-entropy.
80/20 rule: for example, 80% of the wealth is held by the richest 20% of people (the real numbers may differ).
It has a fractal property: even within the richest 20%, 80% of that group's wealth is held by the richest 20% of the group. From this fractal-like property, we can naturally get the Pareto distribution.
If the total people count is N and the total wealth is W, then 0.2N people have 0.8W wealth. Applying the same rule within those 0.2N people: 0.2·0.2N people have 0.8·0.8W wealth. Applying it again, 0.2·0.2·0.2N people have 0.8·0.8·0.8W wealth.
Generalizing, $0.2^k N$ people have $0.8^k W$ wealth (k can be generalized to a continuous real number).
Let the wealth variable be X (assume X>0) with probability density function f(x), and let the proportion of people correspond to probability. The richest $0.2^k$ proportion of people have $0.8^k W$ wealth; let t be the wealth threshold (minimum wealth) of that group:
$$P(X\ge t)=\int_t^{\infty}f(x)\,dx=0.2^k$$
Note that f(x) is a probability density function (PDF), which corresponds to the density of the proportion of people. $N\cdot f(x)$ is the density of the number of people over wealth. Multiplying it by the wealth x and integrating gives the total wealth in a range:
$$\int_t^{\infty}x\,(N\cdot f(x))\,dx=0.8^k W \;\Rightarrow\; \int_t^{\infty}x f(x)\,dx=0.8^k\frac{W}{N}$$
We can rediscover Pareto distribution from these. The first thing to do is extract and eliminate k:
Now we have the PDF. We still need to make the total probability 1 to make it a valid distribution, but there is no extra unknown parameter in the PDF to adjust. The solution is to crop the range of X: if we set the minimum wealth in the distribution to m (without constraining the maximum wealth), creating the constraint $X\ge m$, then using the previous result
Now we have rediscovered (a special case of) the Pareto distribution from just the fractal 80/20 rule. We can generalize it further to other cases like the 90/10 rule, the 80/10 rule, etc. and get the Pareto (Type I) distribution. It has two parameters: the shape parameter α (for the 80/20 rule, $\alpha=\frac{\ln 0.2}{\ln 0.2-\ln 0.8}=\frac{\ln 5}{\ln 4}\approx 1.161$) and the minimum value m:
$$f(x)=\begin{cases}\alpha m^{\alpha}x^{-\alpha-1} & x\ge m\\0 & x<m\end{cases}$$
Note that in the real world a person's wealth can be negative (more debt than assets). The Pareto distribution is just an approximation; m is the threshold above which the Pareto distribution starts to be a good approximation.
If $\alpha\le 1$ then its theoretical mean is infinite. Of course, with finite samples the sample mean will be finite, but if the theoretical mean is infinite, the more samples we have, the larger the sample mean tends to be, and the trend won't stop.
If $\alpha\le 2$ then its theoretical variance is infinite. Recall that the central limit theorem requires finite variance. The standardized sum of values taken from a Pareto distribution with $\alpha\le 2$ does not follow the central limit theorem because it has infinite variance.
Pareto distribution is often described using tail function (rather than probability density function):
$$\text{TailFunction}(x)=P(X>x)=\begin{cases}m^{\alpha}x^{-\alpha} & \text{if } x\ge m\\1 & \text{if } x<m\end{cases}$$
Rediscover Pareto distribution by maximizing entropy under geometric mean constraint
There are additive values, like length, mass, money. For additive values, we often compute the arithmetic average $\frac{1}{n}(x_1+x_2+...+x_n)$.
There are also multiplicative values, like asset return rate, growth ratio. For multiplicative values, we often compute the geometric average $(x_1\cdot x_2\cdot...\cdot x_n)^{1/n}$. For example, if an asset grows by 20% in the first year, drops 10% in the second year and grows 1% in the third year, then the average growth ratio per year is $(1.2\cdot 0.9\cdot 1.01)^{1/3}$.
Logarithm allows turning multiplication into addition, and turning power into multiplication. If y=logx, then log of geometric average of x is arithmetic average of y:
For example, if wealth follows a Pareto distribution, how do we compute the wealth share of the top 1%? More generally, how do we compute the share of the top p proportion?
We first need to compute the threshold value t of the top p:
print("| $\\alpha$ | Share of top 20% | Share of top 1% |\n| - | - | - |\n"+"\n".join([ "|"+"|".join([f"{a}"]+[ f"{pow(p,1-(1/a)):.2%}"for p in[0.2,0.01] ])+"|"for a in[1.001,1.1,1.160964,1.2,1.3,1.5,2,2.5,3] ]))
A distribution is a power law distribution if its tail function P(X>x) is roughly proportional to $x^{-\alpha}$, where α is called the exponent.
$$P(X>x)\propto x^{-\alpha}\quad\text{(roughly)}$$
The "roughly" here means that there can be small deviations that become negligible when x is large enough. Rigorously speaking, $P(X>x)\propto L(x)x^{-\alpha}$ where L is a slowly varying function satisfying $\lim_{x\to\infty}\frac{L(rx)}{L(x)}=1$ for every positive r.
Note that in some places the power law is written as P(X>x)∝L(x)x−(α−1). In these places the α is 1 larger than the α in Pareto distribution. The same α can have different meaning in different places. Here I will use the α that's consistent with the α in Pareto distribution.
The lower the exponent α, the more right-skewed the distribution is, and the more extreme values it has.
The book The Black Swan also provides some estimates of power law exponents in the real world:
| Distribution | Estimated exponent α |
| - | - |
| Number of books sold in the U.S. | 1.5 |
| Magnitude of earthquakes | 2.8 |
| Market moves | 3 (or lower) |
| Company size | 1.5 |
| People killed in terrorist attacks | 2 (but possibly a much lower exponent) |
Note that the estimation is not accurate because they are sensitive to rare extreme samples.
Note that there are things whose estimated α<1: intensity of solar flares, intensity of wars, frequency of family names. Recall that in the Pareto (Type I) distribution, if $\alpha\le 1$ then the theoretical mean is infinite: the sample mean tends to grow as we collect more samples, and the trend won't stop. If the intensity of wars really does follow a power law with α<1, then much larger wars await in the future.
Note that most of these things have estimated α<2. In a Pareto (Type I) distribution, if $\alpha\le 2$ then the theoretical variance is infinite. Not having a finite variance means they do not follow the central limit theorem and should not be modelled using a Gaussian distribution.
There are other distributions that can have extreme values:
Log-normal distribution: If logX is normally distributed, then X follows log-normal distribution. Put in another way, if Y is normally distributed, then eY follows log-normal distribution.
Stretched exponential distribution: P(X>x) is roughly proportional to $e^{-kx^{\beta}}$ ($\beta<1$)
Power law with exponential cutoff: P(X>x) is roughly proportional to $x^{-\alpha}e^{-\lambda x}$
They all have less extreme values than power law distributions, but more extreme values than normal distribution and exponential distribution.
If the lifetime of something follows a power law distribution, then it has the Lindy effect: the longer it has existed, the longer it is likely to continue existing.
Suppose the lifetime T follows a Pareto distribution and something is still alive at time t; we can compute the expected lifetime under that condition.
(The mean is a weighted average. The conditional mean is also a weighted average, but under the condition. Since the total integrated weight is no longer 1, we need to divide by the total integrated weight.)
The expected lifetime is $\frac{\alpha t}{\alpha-1}$ under the condition that it has already lived to time t. The expected remaining lifetime is $\frac{\alpha t}{\alpha-1}-t=\frac{t}{\alpha-1}$. It grows proportionally with t.
Lindy effect often doesn't apply to physical things. Lindy effect often applies to information, like technology, culture, art, social norm, etc.
If some numbers span multiple orders of magnitude, Benford's law says that about 30% of the numbers have leading digit 1, about 18% have leading digit 2, and so on. Digit d's proportion is $\log_{10}(1+\frac{1}{d})$.
The Pareto distribution is a distribution that spans many orders of magnitude. Let's compute the distribution of the first digit if the number follows a Pareto distribution.
If x starts with digit d, then $d\cdot 10^k\le x<(d+1)\cdot 10^k$, $k=0,1,2,...$ The Pareto distribution has a lower bound m. If we make m randomly distributed, analytically computing the probability of each starting digit becomes hard due to edge cases.
In this case, doing a Monte Carlo simulation is easier.
How do we randomly sample numbers from a Pareto distribution? First, we know the cumulative distribution function $F(x)=P(X\le x)=1-P(X>x)=1-m^{\alpha}x^{-\alpha}$ (for $x\ge m$). We can then get the quantile function, which is the inverse of F ($F(x)=p \Leftrightarrow Q(p)=x$): $Q(p)=m(1-p)^{-1/\alpha}$.
Now we can randomly sample p between 0 and 1 then Q(p) will follow Pareto distribution.
Given x, how do we calculate its first digit? If $10\le x<100$ ($1\le\log_{10}x<2$), the first digit is $\lfloor x/10\rfloor$. If $100\le x<1000$ ($2\le\log_{10}x<3$), the first digit is $\lfloor x/100\rfloor$. Generalizing, the first digit d is:
$$d=\left\lfloor\frac{x}{10^{\lfloor\log_{10}x\rfloor}}\right\rfloor$$
Because the Pareto distribution has a lot of extreme values, directly computing the samples will likely exceed the floating-point range and give inf. So we need to work in log scale: compute with log x and avoid using x directly.
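Here is a minimal Monte Carlo sketch following the recipe above; the choices α = 0.5 and m uniform in [1, 10) are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.5
n = 1_000_000

m = rng.uniform(1, 10, size=n)            # random minimum value between 1 and 10
p = rng.uniform(size=n)

# Quantile function of the Pareto distribution: x = m * (1 - p)^(-1/alpha).
# Work in log10 scale so extreme values don't overflow floating point.
log10_x = np.log10(m) - np.log10(1 - p) / alpha

frac = log10_x - np.floor(log10_x)        # fractional part of log10(x)
first_digit = np.floor(10 ** frac).astype(int)

benford = np.log10(1 + 1 / np.arange(1, 10))
observed = np.bincount(first_digit, minlength=10)[1:10] / n
for d in range(1, 10):
    print(d, round(observed[d - 1], 4), round(benford[d - 1], 4))
```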
When α approaches 0 it accurately follows Benford's law. The larger α is, the larger the deviation from Benford's law.
If we fix the minimum value m to a specific number, like 3, then when α is not very close to 0 the result deviates significantly from Benford's law. However, if we make m a random value between 1 and 10, the result is close to Benford's law.
We have a null hypothesis H0, like "the coin is fair", and an alternative hypothesis H1, like "the coin is unfair". We now need to test how likely H1 is true using data.
The p-value is the probability of getting a result as extreme as, or more extreme than, the observed data, assuming the null hypothesis H0 is true. If the p-value is small, then the alternative hypothesis is likely true.
If I do ten coin flips and get 9 heads and 1 tail, the p-value is the probability of getting a result as extreme as or more extreme than that. The "extreme" is two-sided, so the p-value is P(9 heads 1 tail)+P(10 heads 0 tails)+P(1 head 9 tails)+P(0 heads 10 tails), assuming the coin flip is fair.
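A small sketch computing that two-sided p-value exactly from the binomial distribution:

```python
from math import comb

def prob_heads(k, n=10, theta=0.5):
    # Probability of exactly k heads in n flips of a fair coin (null hypothesis).
    return comb(n, k) * theta**k * (1 - theta)**(n - k)

p_value = prob_heads(9) + prob_heads(10) + prob_heads(1) + prob_heads(0)
print(p_value)   # ≈ 0.0215
```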
Can we swap the null hypothesis and alternative hypothesis? For two conflicting hypothesis, which one should be the null hypothesis? The key is burden of proof. The null hypothesis is the default that most people tend to agree and does not need proving. The alternative hypothesis is special and require you to prove using the data.
The lower the p value, the higher your confidence that alternative hypothesis is true. But due to randomness you cannot be 100% sure.
If you are doing an A/B test, keep collecting data, and make a conclusion as soon as there is statistical significance (like a p-value lower than 0.05), that is not statistically sound. A random fluctuation in the process could lead to a false positive result.
A more rigorous approach is to determine the required sample size before the A/B test. Also, the less data you have, the stricter the hypothesis test should be (lower p-value threshold). According to the O'Brien-Fleming boundary, the p-value threshold should be 0.001 when you have 25% of the data, 0.005 at 50%, 0.015 at 75% and 0.045 at 100%.
If I have some samples and I calculate values like the mean, variance, median, etc., the calculated value is called a statistic. Statistics themselves are also random. If you are sure that "with 95% probability the real median is between 8.1 and 8.2", then [8.1,8.2] is a confidence interval with 95% confidence level. A confidence interval measures how uncertain a statistic is.
One way of computing a confidence interval is called bootstrap. It doesn't require you to assume that the statistic is normally distributed. But it does require the samples to be i.i.d.
It works by resampling from the data (with replacement) to create many resampled datasets, computing the statistic on each resampled dataset, and then reading the confidence interval from those statistics.
For example, if the original samples are [1.0,2.0,3.0,4.0,5.0], resampling means randomly selecting one value from the original data, repeated 5 times, giving things like [4.0,2.0,4.0,5.0,2.0] or [3.0,2.0,4.0,4.0,5.0] (they are likely to contain duplicates).
Then compute the statistics for each resample. If the confidence level is 95%, then the confidence interval's lower bound is the 2.5% percentile number in these statistics, and the upper bound is the 97.5% percentile number in these statistics.
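A minimal sketch of bootstrapping a 95% confidence interval for the median of the example data (the choice of the median as the statistic and the 10,000 resamples are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
data = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

medians = []
for _ in range(10_000):
    resample = rng.choice(data, size=len(data), replace=True)  # sample with replacement
    medians.append(np.median(resample))

lower, upper = np.percentile(medians, [2.5, 97.5])
print(lower, upper)   # 95% bootstrap confidence interval for the median
```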
When we train a model (including deep learning and linear regression) we want it to also work on new data that's not in training set. But the training itself is to change the model parameter to fit training data.
Overfitting means the training make the model "memorize" the training data and does not discover the underlying rule in real world that generates training data.
Reducing overfitting is a hard topic. The ways to reduce overfitting:
Regularization. Force the model to be "simpler". Force the model to compress data. Weight sharing is also regularization (a CNN uses weight sharing compared to an MLP). Add inductive bias to limit the possibilities of the model.
(The old way of regularization is to simply reduce the parameter count, but in deep learning there is the deep double descent effect, where more parameters can perform better.)
Make the model more expressive. If the model is not expressive enough to capture the real underlying rule that generates the training data, it's simply unable to generalize. An example is that an RNN is less expressive than a Transformer due to its fixed-size state.
Make the training data more comprehensive. Reinforcement learning, if done properly, can provide more comprehensive training data than supervised learning, because of the randomness in interacting with environment.
How to test how overfit a model is?
Separate the data into training set and test set. Only train using training set and check model performance on test set.
Test sensitivity to random fluctuations. We can add randomness to the parameters, input, hyperparameters, etc., and then check the model performance. An overfit model is more sensitive to random perturbations, because memorization is more "fragile" than the real underlying rule.