Some notes and theorems about the estimation of parameters with Bayesian and Frequentist methods and their convergence in the limit of many observations.
Part 1: Bayesian
Bayes’ Theorem
From the definition of the conditional probability $P(X\mid Y) \equiv \frac{P(X\cap Y)}{P(Y)}$ one gets:

$$P(X\mid Y)\,P(Y) = P(X\cap Y) = P(Y\mid X)\,P(X)$$
This theorem - called Bayes' Theorem - can be used to estimate the probability $p(\theta\mid x)$ of a parameter being $\theta$ after measuring a value $x$ drawn from a probability distribution $p_\theta(x) \equiv p(x\mid\theta)$:

$$p(\theta\mid x) = \frac{p(x\mid\theta)\,p(\theta)}{p(x)} = \frac{p(x\mid\theta)\,p(\theta)}{\int p(x\mid\theta')\,p(\theta')\,d\theta'}$$
Here the fact that conditional probabilities integrate to one was used, implying that:

$$p(x) = p(x)\int p(\theta\mid x)\,d\theta = \int p(x\mid\theta)\,p(\theta)\,d\theta$$
The four probability densities are then:
- the posterior $p(\theta\mid x)$, which is the probability of the parameter being $\theta$ given that $x$ was measured
- the likelihood $p(x\mid\theta)$, which is the probability of measuring $x$ for a given parameter $\theta$
- the prior $p(\theta)$, which is the assumed distribution of the parameter $\theta$ over many repeated experiments
- the marginal probability $p(x)$ of measuring $x$, averaged over all possible parameters $\theta$
After the posterior has been found, the distribution of a new measurement $\tilde{x}$, $p(\tilde{x}\mid x) = \int p(\tilde{x}\mid\theta)\,p(\theta\mid x)\,d\theta$, is called the posterior predictive distribution.
Since the likelihood $p(x\mid\theta) = \prod_i p(x_i\mid\theta)$ becomes narrower and narrower as more values are measured, the posterior becomes increasingly independent of the prior with more measurements.
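This washing-out of the prior can be sketched numerically by discretizing Bayes' theorem on a grid of $\theta$ values, so the integral in the denominator becomes a sum. The coin-flip data and both priors below are made up for illustration:

```python
import math

# Sketch of Bayes' theorem on a discrete grid of theta values.
# The coin-flip data and both priors are made up for illustration.
def posterior(data, prior, thetas):
    post = []
    for theta, pr in zip(thetas, prior):
        # log-likelihood of Bernoulli data for this grid point
        loglik = sum(math.log(theta if x == 1 else 1.0 - theta) for x in data)
        post.append(math.exp(loglik) * pr)
    norm = sum(post)  # plays the role of p(x), up to the grid spacing
    return [p / norm for p in post]

thetas = [(i + 0.5) / 200 for i in range(200)]
flat = [1.0] * 200                 # flat prior
skewed = [t ** 4 for t in thetas]  # a very different, skewed prior

data = [1, 1, 0, 1] * 50           # 200 flips, 150 heads
p_flat = posterior(data, flat, thetas)
p_skew = posterior(data, skewed, thetas)

# with many measurements the two posteriors nearly coincide
max_diff = max(abs(a - b) for a, b in zip(p_flat, p_skew))
map_flat = thetas[p_flat.index(max(p_flat))]
```

With 200 flips the two posteriors are almost indistinguishable even though the priors differ strongly.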
Asymptotic behaviour of the likelihood
That the likelihood becomes narrower with more measurements can be seen by rewriting the likelihood function:

$$p(x\mid\theta) = \prod_i p(x_i\mid\theta) = e^{\log\prod_i p(x_i\mid\theta)} = e^{\sum_i \log p(x_i\mid\theta)}$$
Taylor expanding the exponent around the maximum of the likelihood (where $\theta_{ML}$ maximizes the (log-)likelihood and is - by definition - the maximum likelihood estimate of $\theta$), the first-order term vanishes and one gets:

$$\sum_i \log p(x_i\mid\theta) \approx \sum_i \log p(x_i\mid\theta_{ML}) + \frac{(\theta-\theta_{ML})^2}{2}\,\frac{\partial^2}{\partial\theta^2}\sum_i \log p(x_i\mid\theta_{ML})$$
The first term is just a multiplicative constant that is removed by the normalisation of the posterior. By the law of large numbers the second term will converge to the expectation value of the second derivative:

$$\frac{1}{N}\sum_i \frac{\partial^2}{\partial\theta^2}\log p(x_i\mid\theta_{ML}) \xrightarrow{N\to\infty} E\!\left[\frac{\partial^2}{\partial\theta^2}\log p(x\mid\theta_{ML})\right] \equiv -I(\theta)$$
The negative expectation value of the second derivative of the log-likelihood is also known as the Fisher information $I(\theta)$.
The likelihood function thus asymptotically becomes a normal distribution with mean $\mu = \theta_{ML}$ and variance $\sigma^2 = \frac{1}{N I(\theta)}$:

$$p(x\mid\theta)\ \xrightarrow{N\to\infty}\ p(x\mid\theta_{ML})\,e^{-\frac{N I(\theta)}{2}(\theta-\theta_{ML})^2}$$
A more rigorous proof of this is known as the Bernstein–von Mises theorem.
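The approach to normality can be illustrated numerically (a sketch with arbitrary sample sizes): for Bernoulli data the normalised likelihood on a $\theta$ grid is compared with the normal distribution of variance $\frac{1}{N I(\theta_{ML})}$, where $I(\theta) = \frac{1}{\theta(1-\theta)}$ for a Bernoulli distribution.

```python
import math

# Compares the normalised Bernoulli likelihood on a theta grid with the
# asymptotic normal N(theta_ML, 1/(N*I)), where I(theta) = 1/(theta(1-theta)).
# Returns the maximum deviation relative to the peak of the likelihood.
def rel_deviation(n_heads, n_total, grid=400):
    thetas = [(i + 0.5) / grid for i in range(grid)]
    lik = [t ** n_heads * (1.0 - t) ** (n_total - n_heads) for t in thetas]
    norm = sum(lik)
    lik = [l / norm for l in lik]
    t_ml = n_heads / n_total                 # maximum likelihood estimate
    var = t_ml * (1.0 - t_ml) / n_total      # 1/(N*I(theta_ML))
    gauss = [math.exp(-(t - t_ml) ** 2 / (2.0 * var)) for t in thetas]
    gnorm = sum(gauss)
    gauss = [g / gnorm for g in gauss]
    return max(abs(a - b) for a, b in zip(lik, gauss)) / max(lik)

d_small = rel_deviation(6, 10)       # 10 measurements
d_large = rel_deviation(600, 1000)   # 1000 measurements: much closer to normal
```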
Choice of Prior
A flat prior $p(\theta) = 1$ seems like a sensible choice when having no prior information, but it is not invariant under reparametrizations: For example $p(\sigma)\,d\sigma = 1\,d\sigma$ implies $p(\sigma^2)\,d\sigma^2 = \frac{1}{2\sigma}\,d\sigma^2$, even though $\sigma$ and $\sigma^2$ describe the same parameter. A more sensible choice of prior is one that maximizes the difference between prior and posterior distribution. This difference can be measured by the Kullback-Leibler divergence:
$$\iint p(x)\,p(\theta\mid x)\log\frac{p(\theta\mid x)}{p(\theta)}\,d\theta\,dx = \iint p(\theta)\,p(x\mid\theta)\log\frac{p(\theta\mid x)}{p(\theta)}\,d\theta\,dx \to \int p(\theta)\log\frac{\sqrt{N I(\theta)/2\pi}}{p(\theta)}\,d\theta$$

In the last step the asymptotic normal form of the posterior was inserted.
A prior that maximizes this difference is the Jeffreys prior, which is given by the square root of the Fisher information:

$$p(\theta) = \sqrt{I(\theta)} \equiv \sqrt{E\!\left[\left(\frac{\partial}{\partial\theta}\log p(x\mid\theta)\right)^2\right]}$$
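For a concrete case, the Fisher information of a Bernoulli distribution can be computed directly from this definition by summing over the two outcomes (a minimal sketch; the value $\theta = 0.3$ is arbitrary):

```python
import math

# Fisher information of a Bernoulli distribution p(x|theta) = theta^x (1-theta)^(1-x),
# computed from the definition E[(d/dtheta log p)^2] by summing over x in {0, 1}.
def fisher_information(theta):
    info = 0.0
    for x in (0, 1):
        p = theta if x == 1 else 1.0 - theta
        score = (x - theta) / (theta * (1.0 - theta))  # d/dtheta log p(x|theta)
        info += p * score ** 2
    return info

theta = 0.3                       # arbitrary example value
info = fisher_information(theta)  # should equal 1/(theta*(1-theta))
jeffreys = math.sqrt(info)        # unnormalised Jeffreys prior at theta
```

For the Bernoulli case this reproduces $I(\theta) = \frac{1}{\theta(1-\theta)}$, so the Jeffreys prior is $\propto \frac{1}{\sqrt{\theta(1-\theta)}}$, a Beta($\frac12$, $\frac12$) distribution.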
Credible intervals
From the posterior $p(\theta\mid x)$ one can determine a range $[\theta_{min}, \theta_{max}]$ in which $\theta$ lies with probability $1-\alpha$, called a credible interval:

$$P(\theta\le\theta_{max}\mid x) = \int_{-\infty}^{\theta_{max}} p(\theta\mid x)\,d\theta \stackrel{!}{=} 1-\frac{\alpha}{2} \stackrel{!}{=} \int_{\theta_{min}}^{\infty} p(\theta\mid x)\,d\theta = P(\theta\ge\theta_{min}\mid x)$$
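For a discretised posterior the equal-tailed interval can be read off by walking the cumulative distribution until the tail masses $\alpha/2$ are reached (a sketch with a made-up binomial posterior, 7 heads in 10 flips with a flat prior):

```python
# Equal-tailed credible interval read off a discretised posterior; the
# posterior here is a made-up example (7 heads in 10 flips, flat prior).
grid = 1000
thetas = [(i + 0.5) / grid for i in range(grid)]
post = [t ** 7 * (1.0 - t) ** 3 for t in thetas]
norm = sum(post)
post = [p / norm for p in post]

def credible_interval(thetas, post, alpha=0.05):
    # walk the cumulative distribution until alpha/2 and 1 - alpha/2 are reached
    cdf, lo, hi = 0.0, None, None
    for t, p in zip(thetas, post):
        cdf += p
        if lo is None and cdf >= alpha / 2.0:
            lo = t
        if hi is None and cdf >= 1.0 - alpha / 2.0:
            hi = t
    return lo, hi

lo, hi = credible_interval(thetas, post)
mass = sum(p for t, p in zip(thetas, post) if lo <= t <= hi)
```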
Loss functions
Instead of stating the full posterior $p(\theta\mid x)$, one can characterise the distribution by a value $\hat\theta(x)$ that minimizes the average of a loss function over the posterior, $\int L(\hat\theta(x),\theta)\,p(\theta\mid x)\,d\theta$. Possible loss functions include:
Mean squared error
The loss function $L = (\hat\theta(x)-\theta)^2$ is minimized by $\hat\theta(x) = \mathrm{mean}[p(\theta\mid x)]$ since one requires:

$$\frac{\partial}{\partial\hat\theta}\int (\hat\theta(x)-\theta)^2\,p(\theta\mid x)\,d\theta = 2\left(\hat\theta(x) - \int\theta\,p(\theta\mid x)\,d\theta\right) \stackrel{!}{=} 0$$
Here it was used that $\hat\theta(x)$ is independent of $\theta$. Plugging the mean back into the average squared error shows that the error is then given by the variance of the posterior.
This can alternatively be seen directly from the average loss function, where the cross term $2\langle(\hat\theta(x)-\langle\theta\rangle)(\langle\theta\rangle-\theta)\rangle$ vanishes:

$$\langle(\hat\theta(x)-\theta)^2\rangle = \langle(\hat\theta(x)-\langle\theta\rangle)^2\rangle + \langle(\theta-\langle\theta\rangle)^2\rangle \ge \langle(\theta-\langle\theta\rangle)^2\rangle$$
Mean absolute error
The loss function $L = |\hat\theta(x)-\theta|$ is minimized by $\hat\theta(x) = \mathrm{median}[p(\theta\mid x)]$ since one requires:

$$\frac{\partial}{\partial\hat\theta}\int |\hat\theta(x)-\theta|\,p(\theta\mid x)\,d\theta = \frac{\partial}{\partial\hat\theta}\left(\int_{-\infty}^{\hat\theta(x)}(\hat\theta(x)-\theta)\,p(\theta\mid x)\,d\theta - \int_{\hat\theta(x)}^{\infty}(\hat\theta(x)-\theta)\,p(\theta\mid x)\,d\theta\right) = \int_{-\infty}^{\hat\theta(x)} p(\theta\mid x)\,d\theta - \int_{\hat\theta(x)}^{\infty} p(\theta\mid x)\,d\theta \stackrel{!}{=} 0$$
By definition the median is exactly the value for which the cumulative probabilities above and below are equal.
Mean 0-1 error
The loss function $L = |\hat\theta(x)-\theta|^0$ with $0^0 \equiv 0$ is minimized by $\hat\theta(x) = \mathrm{mode}[p(\theta\mid x)]$ since one requires:

$$\frac{\partial}{\partial\hat\theta}\int|\hat\theta(x)-\theta|^0\,p(\theta\mid x)\,d\theta = \frac{\partial}{\partial\hat\theta}\left(1 - \int\delta(\hat\theta(x)-\theta)\,p(\theta\mid x)\,d\theta\right) = -\frac{\partial}{\partial\hat\theta}\,p(\hat\theta(x)\mid x) \stackrel{!}{=} 0$$
Luckily, for unimodal and symmetric distributions (e.g. the normal distribution) the mean, median and mode are all the same. In the following we will concentrate on the mean squared error.
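The three minimizers can be verified numerically on a toy posterior by brute-force minimisation of the average loss over a grid (all numbers below are illustrative; the posterior is a skewed made-up example where mean, median and mode differ):

```python
# Brute-force check that the three loss functions are minimised by the mean,
# median and mode. The skewed toy posterior below is made up.
grid = 400
thetas = [(i + 0.5) / grid for i in range(grid)]
post = [t ** 2 * (1.0 - t) ** 6 for t in thetas]
norm = sum(post)
post = [p / norm for p in post]

def minimiser(loss):
    # grid point with the smallest average loss over the posterior
    return min(thetas, key=lambda est: sum(p * loss(est, t)
                                           for t, p in zip(thetas, post)))

mse_min = minimiser(lambda e, t: (e - t) ** 2)       # squared error
mae_min = minimiser(lambda e, t: abs(e - t))         # absolute error
zo_min = minimiser(lambda e, t: 0.0 if e == t else 1.0)  # 0-1 error

mean = sum(t * p for t, p in zip(thetas, post))
cdf, median = 0.0, None
for t, p in zip(thetas, post):
    cdf += p
    if median is None and cdf >= 0.5:
        median = t
mode = max(zip(post, thetas))[1]
```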
Example: Estimating the parameters of a normal distribution
The likelihood function $p(x\mid\mu,\sigma)$ can be written with $\bar{x} = \frac{1}{n}\sum_i^n x_i$ as:

$$p(x\mid\mu,\sigma) = \prod_i p(x_i\mid\mu,\sigma) = \prod_i \frac{e^{-\frac{(x_i-\mu)^2}{2\sigma^2}}}{\sqrt{2\pi\sigma^2}} = \frac{e^{-\sum_i\frac{(x_i-\mu)^2}{2\sigma^2}}}{\sqrt{2\pi\sigma^2}^{\,n}} = \frac{e^{-\frac{\sum_i(x_i-\bar{x})^2 + n(\bar{x}-\mu)^2}{2\sigma^2}}}{\sqrt{2\pi\sigma^2}^{\,n}}$$
Using the Jeffreys prior $p(\mu) = 1$, $p(\sigma) = \frac{1}{\sqrt{2\pi\sigma^2}}$ and integrating out $\mu$ and $\sigma$ yields:

$$p(x) = \iint p(x\mid\mu,\sigma)\,p(\mu)\,p(\sigma)\,d\mu\,d\sigma = \iint \frac{e^{-\frac{\sum_i(x_i-\bar{x})^2 + n(\bar{x}-\mu)^2}{2\sigma^2}}}{\sqrt{2\pi\sigma^2}^{\,n+1}}\,d\mu\,d\sigma = \int \frac{e^{-\frac{\sum_i(x_i-\bar{x})^2}{2\sigma^2}}}{\sqrt{n}\,\sqrt{2\pi\sigma^2}^{\,n}}\,d\sigma = \frac{\Gamma(\frac{n-1}{2})}{\sqrt{n}\,\sqrt{2\pi}^{\,n}}\,\sqrt{\sum_i(x_i-\bar{x})^2}^{\,-(n-1)}$$
The posterior of the mean is a Student's t distribution with $n-1$ degrees of freedom:

$$p(\mu\mid x) = \frac{1}{p(x)}\int p(x\mid\mu,\sigma)\,p(\mu)\,p(\sigma)\,d\sigma = \frac{1}{p(x)}\int \frac{e^{-\frac{\sum_i(x_i-\bar{x})^2 + n(\bar{x}-\mu)^2}{2\sigma^2}}}{\sqrt{2\pi\sigma^2}^{\,n+1}}\,d\sigma = \frac{1}{p(x)}\,\frac{\Gamma(\frac{n}{2})}{\sqrt{2\pi}^{\,n+1}}\,\sqrt{\sum_i(x_i-\bar{x})^2 + n(\bar{x}-\mu)^2}^{\,-n} = \frac{\Gamma(\frac{n}{2})\,\sqrt{n}}{\Gamma(\frac{n-1}{2})\,\sqrt{\pi\sum_i(x_i-\bar{x})^2}}\,\sqrt{1+\frac{n(\bar{x}-\mu)^2}{\sum_i(x_i-\bar{x})^2}}^{\,-n}$$
The posterior of the variance is a scaled inverse $\chi^2$ distribution with $n-1$ degrees of freedom:

$$p(\sigma\mid x) = \frac{1}{p(x)}\int p(x\mid\mu,\sigma)\,p(\mu)\,p(\sigma)\,d\mu = \frac{1}{p(x)}\int \frac{e^{-\frac{\sum_i(x_i-\bar{x})^2 + n(\bar{x}-\mu)^2}{2\sigma^2}}}{\sqrt{2\pi\sigma^2}^{\,n+1}}\,d\mu = \frac{1}{p(x)}\,\frac{e^{-\frac{\sum_i(x_i-\bar{x})^2}{2\sigma^2}}}{\sqrt{n}\,\sqrt{2\pi\sigma^2}^{\,n}} = \frac{2}{\Gamma(\frac{n-1}{2})}\,\sqrt{\frac{\sum_i(x_i-\bar{x})^2}{2\sigma^2}}^{\,n-1}\,\frac{e^{-\frac{\sum_i(x_i-\bar{x})^2}{2\sigma^2}}}{\sigma}$$
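The Student's t result for the mean can be cross-checked numerically by doing the $\sigma$ integral with a crude Riemann sum and comparing against the closed form (the data sample below is made up; integration ranges and step sizes are rough choices):

```python
import math

# Numerical cross-check of the Student's t posterior of the mean. The data
# sample, grids and step sizes are arbitrary choices.
xs = [1.2, 0.4, 2.1, 1.7, 0.9, 1.5]
n = len(xs)
xbar = sum(xs) / n
S = sum((x - xbar) ** 2 for x in xs)

def numeric(mu, ds=0.002, steps=10000):
    # integral over sigma of p(x|mu,sigma) * p(mu) * p(sigma)
    total = 0.0
    for i in range(steps):
        s = 0.01 + ds * i
        total += ds * math.exp(-(S + n * (xbar - mu) ** 2) / (2.0 * s * s)) \
                 / math.sqrt(2.0 * math.pi * s * s) ** (n + 1)
    return total

def closed(mu):
    # normalised Student's t posterior with n-1 degrees of freedom
    return (math.gamma(n / 2.0) * math.sqrt(n)
            / (math.gamma((n - 1) / 2.0) * math.sqrt(math.pi * S))
            * (1.0 + n * (xbar - mu) ** 2 / S) ** (-n / 2.0))

dmu = 0.1
mus = [xbar + dmu * k for k in range(-20, 21)]
num = [numeric(mu) for mu in mus]
norm = sum(num) * dmu     # numerical stand-in for the evidence p(x)
rel_err = max(abs(v / norm - closed(mu)) for v, mu in zip(num, mus)) / closed(xbar)
```

Normalising the numerical integral over the $\mu$ grid removes all constant prefactors, so this checks the shape of the posterior independently of the normalisation of $p(x)$.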
Estimating the parameter $p$ of a binomial trial with the Jeffreys prior $p(p) = \frac{1}{\sqrt{p(1-p)}}$ yields a mean of:

$$\hat{p}(x) = \frac{\int p\,p^x(1-p)^{n-x}\,p(p)\,dp}{\int p^x(1-p)^{n-x}\,p(p)\,dp} = \frac{x+\frac{1}{2}}{n+1}$$
And a mean squared error of:

$$(\Delta\hat{p})^2 = \frac{\int p^2\,p^x(1-p)^{n-x}\,p(p)\,dp}{\int p^x(1-p)^{n-x}\,p(p)\,dp} - \left(\frac{\int p\,p^x(1-p)^{n-x}\,p(p)\,dp}{\int p^x(1-p)^{n-x}\,p(p)\,dp}\right)^2 = \frac{(x+\frac{1}{2})(n-x+\frac{1}{2})}{(n+2)(n+1)^2}$$
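Both closed forms can be checked by evaluating the integrals with a midpoint Riemann sum ($x = 3$ and $n = 10$ are arbitrary example values):

```python
import math

# Midpoint-rule check of the posterior mean and variance of a binomial
# parameter with the Jeffreys prior 1/sqrt(p(1-p)).
x, n = 3, 10
grid = 20000

def moment(k):
    # integral of p^k times the unnormalised posterior
    total = 0.0
    for i in range(grid):
        p = (i + 0.5) / grid
        w = p ** x * (1.0 - p) ** (n - x) / math.sqrt(p * (1.0 - p))
        total += w * p ** k
    return total / grid

m0, m1, m2 = moment(0), moment(1), moment(2)
mean = m1 / m0                 # should equal (x + 1/2) / (n + 1)
var = m2 / m0 - mean ** 2      # should equal (x+1/2)(n-x+1/2) / ((n+2)(n+1)^2)
```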
Bayes factor
Part 2: Frequentist
An alternative to the Bayesian method is to assume no prior/posterior distribution for $\theta$, relying instead on the mode of the likelihood function and looking at the distribution of the possible measurements $x$ instead of $\theta$.
Maximum Likelihood
A way of parameter estimation is the maximum likelihood method, where the estimator $\hat\theta(x)$ is given by the condition that it maximizes the likelihood, implying $\frac{\partial}{\partial\theta}p(x\mid\hat\theta(x)) \stackrel{!}{=} 0$. One can Taylor expand the derivative of the log-likelihood at the true parameter $\theta_0$ around the estimate $\hat\theta(x)$:

$$\frac{\partial}{\partial\theta}\log p(x\mid\theta_0) \approx \underbrace{\frac{\partial}{\partial\theta}\log p(x\mid\hat\theta(x))}_{0} + \frac{\partial^2}{\partial\theta^2}\log p(x\mid\hat\theta(x))\,(\theta_0-\hat\theta(x))$$
Solving for the difference $\theta_0-\hat\theta(x)$, it converges by the central limit theorem, the law of large numbers and Slutsky's theorem to the following normal distribution:

$$\theta_0-\hat\theta(x) = \frac{\frac{\partial}{\partial\theta}\log p(x\mid\theta_0)}{\frac{\partial^2}{\partial\theta^2}\log p(x\mid\hat\theta(x))} = \frac{\frac{1}{N}\sum_i\frac{\partial}{\partial\theta}\log p(x_i\mid\theta_0)}{\frac{1}{N}\sum_i\frac{\partial^2}{\partial\theta^2}\log p(x_i\mid\hat\theta(x))} \xrightarrow{N\to\infty} \frac{\mathcal{N}\!\left(0,\ E\!\left[\left(\frac{\partial}{\partial\theta}\log p(x\mid\theta_0)\right)^2\right]/N\right)}{E\!\left[\frac{\partial^2}{\partial\theta^2}\log p(x\mid\hat\theta(x))\right]}$$
Where it was used that the derivative of the logarithm vanishes in expectation:

$$E\!\left[\frac{\partial}{\partial\theta}\log p(x\mid\theta_0)\right] = \int p(x\mid\theta_0)\,\frac{\partial}{\partial\theta}\log p(x\mid\theta_0)\,dx = \int \frac{\partial}{\partial\theta}p(x\mid\theta_0)\,dx = \frac{\partial}{\partial\theta}E[1] = 0$$
Similarly the expectation values in the numerator and denominator are in fact both equal to the Fisher information (up to a sign):

$$E\!\left[\frac{\partial^2}{\partial\theta^2}\log p(x\mid\hat\theta(x))\right] = E\!\left[\frac{\partial}{\partial\theta}\left(\frac{1}{p(x\mid\hat\theta(x))}\,\frac{\partial}{\partial\theta}p(x\mid\hat\theta(x))\right)\right] = \underbrace{\frac{\partial^2}{\partial\theta^2}E[1]}_{0} - \underbrace{E\!\left[\left(\frac{\partial}{\partial\theta}\log p(x\mid\hat\theta(x))\right)^2\right]}_{I(\theta)}$$
If the estimator converges to the true value, $\hat\theta(x) \to \theta_0$, then:

$$\theta_0-\hat\theta(x)\ \xrightarrow{N\to\infty}\ \mathcal{N}\!\left(\mu=0,\ \sigma^2=\frac{1}{N\,I(\theta)}\right)$$
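This asymptotic variance can be illustrated with a simulation (a sketch using an exponential distribution, for which the maximum likelihood estimate of the rate $\lambda$ is $1/\mathrm{mean}(x)$ and $I(\lambda) = 1/\lambda^2$; all parameter values are arbitrary):

```python
import random

# Simulated sampling distribution of the maximum likelihood estimator of the
# rate lambda of an exponential distribution: lambda_ML = 1/mean(x) and
# I(lambda) = 1/lambda^2, so Var(lambda_ML) should approach lambda^2/N.
random.seed(42)
lam, N, reps = 2.0, 500, 2000
estimates = []
for _ in range(reps):
    total = sum(random.expovariate(lam) for _ in range(N))
    estimates.append(N / total)                 # maximum likelihood estimate

mean_est = sum(estimates) / reps
var_est = sum((e - mean_est) ** 2 for e in estimates) / reps
predicted = lam ** 2 / N                        # asymptotic variance 1/(N*I)
```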
Confidence Intervals
The confidence interval $[\hat\theta_{min}, \hat\theta_{max}]$ is the analog of the credible interval: over repeated measurements the true parameter $\theta$ is included in the interval a fraction $1-\alpha$ of the time. For a given $x$ the confidence interval can be found from:

$$P(X\le x\mid\hat\theta_{min}) = \int_{-\infty}^{x} p(x'\mid\hat\theta_{min})\,dx' \stackrel{!}{=} 1-\frac{\alpha}{2} \stackrel{!}{=} \int_{x}^{\infty} p(x'\mid\hat\theta_{max})\,dx' = P(X\ge x\mid\hat\theta_{max})$$
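The coverage property can be checked by simulation (a sketch for a normal distribution with known $\sigma$, where inverting the tail conditions gives the familiar interval $\bar{x} \pm z\,\sigma/\sqrt{n}$; the parameter values are arbitrary):

```python
import random

# Coverage check of a 1 - alpha confidence interval for the mean of a normal
# distribution with known sigma: the interval is mean(x) +/- z*sigma/sqrt(n)
# with z = 1.96 for alpha = 0.05.
random.seed(0)
mu_true, sigma, n, z = 5.0, 2.0, 25, 1.959964
trials, covered = 4000, 0
for _ in range(trials):
    xbar = sum(random.gauss(mu_true, sigma) for _ in range(n)) / n
    half = z * sigma / n ** 0.5
    if xbar - half <= mu_true <= xbar + half:
        covered += 1
coverage = covered / trials    # should be close to 0.95
```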
Mean squared error
The mean squared error is taken as the squared difference between the estimated parameter $\hat\theta(x)$ and the true parameter $\theta$, averaged over all possible $x$:

$$E[(\hat\theta(x)-\theta)^2] = \underbrace{(E[\hat\theta(x)]-\theta)^2}_{\mathrm{bias}^2} + \underbrace{E[\hat\theta(x)^2]-E[\hat\theta(x)]^2}_{\mathrm{variance}}$$
For unbiased estimators the mean squared error equals the variance of the estimator.
Estimating for example the parameter $p$ of a binomial trial with the maximum likelihood estimator yields:

$$\hat{p}(x) = \frac{x}{n}$$

$$(\Delta\hat{p})^2 = \sum_x \binom{n}{x} p^x (1-p)^{n-x} \left(\frac{x}{n}-p\right)^2 = \frac{p-p^2}{n}$$
Cramér–Rao bound
Defining the scalar product $\langle f(x), g(x)\rangle \equiv E[f(x)\,g(x)]$ one gets from the Cauchy–Schwarz inequality that the variance of an estimator $\hat\theta(x)$ is bounded by the Fisher information:

$$\left(\frac{\partial}{\partial\theta}E[\hat\theta(x)]\right)^2 = E\!\left[(\hat\theta(x)-\theta)\,\frac{\partial}{\partial\theta}\log p(x\mid\theta)\right]^2 \le E\!\left[(\hat\theta(x)-\theta)^2\right]\,E\!\left[\left(\frac{\partial}{\partial\theta}\log p(x\mid\theta)\right)^2\right]$$
Where the first equality follows from the interchange of integration and differentiation:

$$E\!\left[(\hat\theta(x)-\theta)\,\frac{\partial}{\partial\theta}\log p(x\mid\theta)\right] = \int p(x\mid\theta)\,(\hat\theta(x)-\theta)\,\frac{1}{p(x\mid\theta)}\,\frac{\partial}{\partial\theta}p(x\mid\theta)\,dx = \frac{\partial}{\partial\theta}\underbrace{\int \hat\theta(x)\,p(x\mid\theta)\,dx}_{=E[\hat\theta(x)]} - \theta\,\frac{\partial}{\partial\theta}\underbrace{\int p(x\mid\theta)\,dx}_{=1}$$
For unbiased estimators ($E[\hat\theta(x)] = \theta$) this gives a lower bound on the mean squared error as the inverse of the Fisher information $I(\theta)$:

$$E[(\hat\theta(x)-\theta)^2] \ge \frac{1}{I(\theta)}$$
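That the binomial maximum likelihood estimator $\frac{x}{n}$ from above saturates this bound can be checked exactly by summing over the binomial pmf ($n$ and $p$ below are arbitrary example values):

```python
import math

# Exact check that the binomial maximum likelihood estimator x/n attains the
# Cramer-Rao bound: its variance, computed from the binomial pmf, equals
# 1/I_n(p) with I_n(p) = n/(p(1-p)) the Fisher information of n trials.
n, p = 12, 0.35
pmf = [math.comb(n, x) * p ** x * (1.0 - p) ** (n - x) for x in range(n + 1)]
mean = sum(w * x / n for x, w in enumerate(pmf))
var = sum(w * (x / n - mean) ** 2 for x, w in enumerate(pmf))
bound = p * (1.0 - p) / n       # Cramer-Rao bound 1/I_n(p)
```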
The Cauchy–Schwarz inequality itself follows from:

$$0 \le \frac{1}{2}\,E\!\left[\left(\frac{x}{\sqrt{E[x^2]}}-\frac{y}{\sqrt{E[y^2]}}\right)^2\right] = 1-\frac{E[xy]}{\sqrt{E[x^2]\,E[y^2]}}$$
Linear regression
Likelihood ratio test