Niklas Buschmann

Notes on Statistics

Some notes and theorems about the estimation of parameters with Bayesian and Frequentist methods and their convergence in the limit of many observations.

Part 1: Bayesian

Bayes’ Theorem

From the definition of the conditional probability $P(X|Y) \equiv \frac{P(X \cap Y)}{P(Y)}$ one gets:

$$P(X|Y)P(Y) = P(X \cap Y) = P(Y|X)P(X)$$

This theorem, called Bayes’ Theorem, can be used to estimate the probability $\textrm{p}(\theta|x)$ of a parameter being $\theta$ after measuring a value $x$ drawn from a probability distribution $p_{\theta}(x) \equiv \textrm{p}(x|\theta)$:

$$\textrm{p}(\theta|x) = \frac{\textrm{p}(x|\theta)\textrm{p}(\theta)}{\textrm{p}(x)} = \frac{\textrm{p}(x|\theta)\textrm{p}(\theta)}{\int \textrm{p}(x|\theta')\textrm{p}(\theta')\textrm{d}\theta'}$$

Here the fact that conditional probability densities integrate to one was used, implying that:

$$\textrm{p}(x) = \textrm{p}(x)\int \textrm{p}(\theta|x)\textrm{d}\theta = \int \textrm{p}(x|\theta)\textrm{p}(\theta)\textrm{d}\theta$$

The four probability densities are then:

- $\textrm{p}(\theta)$: the prior
- $\textrm{p}(x|\theta)$: the likelihood
- $\textrm{p}(\theta|x)$: the posterior
- $\textrm{p}(x)$: the evidence

After the posterior has been found, the new probability $\textrm{p}(x|\mathbf{x}) = \int \textrm{p}(x|\theta)\textrm{p}(\theta|\mathbf{x})\textrm{d}\theta$ is called the posterior predictive distribution.
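The update rule above can be sketched numerically. Below is a minimal grid approximation for a Bernoulli parameter, assuming a flat prior and a small made-up data set (neither is fixed by the notes):

```python
import math

# Bayes' theorem on a discrete grid of theta values for a Bernoulli model:
# posterior = likelihood * prior / evidence, with the evidence obtained by
# summing over the grid (the discrete analog of the integral above).
def posterior_on_grid(data, grid):
    prior = [1.0 / len(grid)] * len(grid)                 # flat prior p(theta)
    lik = [math.prod(t if x else 1 - t for x in data) for t in grid]
    unnorm = [l * q for l, q in zip(lik, prior)]          # p(x|theta) p(theta)
    evidence = sum(unnorm)                                # p(x), the normalisation
    return [u / evidence for u in unnorm]

grid = [i / 100 for i in range(1, 100)]
post = posterior_on_grid([1, 1, 0, 1], grid)
```

With three successes in four trials the posterior peaks at $\theta = 0.75$, the maximum likelihood value, since the prior is flat.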

Since the likelihood $\textrm{p}(\mathbf{x}|\theta) = \prod_i \textrm{p}(x_i|\theta)$ becomes narrower and narrower with more measurements, the posterior becomes increasingly independent of the prior.
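This narrowing can be checked with a small simulation; the Bernoulli model, flat grid prior and seed below are illustrative assumptions:

```python
import math
import random

# Posterior standard deviation on a flat grid for simulated Bernoulli
# data: more observations should give a narrower posterior.
def posterior_sd(n_obs, p_true=0.6, seed=0):
    rng = random.Random(seed)
    data = [1 if rng.random() < p_true else 0 for _ in range(n_obs)]
    grid = [i / 400 for i in range(1, 400)]
    # work with log-likelihoods for numerical stability
    loglik = [sum(math.log(t if x else 1 - t) for x in data) for t in grid]
    shift = max(loglik)
    w = [math.exp(l - shift) for l in loglik]     # unnormalized posterior
    z = sum(w)
    mean = sum(t * wi for t, wi in zip(grid, w)) / z
    var = sum((t - mean) ** 2 * wi for t, wi in zip(grid, w)) / z
    return math.sqrt(var)
```

For example, `posterior_sd(400)` comes out well below `posterior_sd(25)`.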

Asymptotic behaviour of the likelihood

That the likelihood becomes narrower with more measurements can be seen by rewriting the likelihood function:

$$\textrm{p}(\mathbf{x}|\theta) = \prod_i \textrm{p}(x_i|\theta) = e^{\log(\prod_i \textrm{p}(x_i|\theta))} = e^{\sum_i \log(\textrm{p}(x_i|\theta))}$$

Taylor expanding the exponent around the maximum of the likelihood (where $\theta_{\textrm{ML}}$ maximizes the (log-)likelihood and is by definition the maximum likelihood estimate of $\theta$):

$$\sum_i \log(\textrm{p}(x_i|\theta)) \approx \sum_i \log(\textrm{p}(x_i|\theta_{\textrm{ML}})) + \left(\theta-\theta_{\textrm{ML}}\right)^2\frac{\partial^2}{\partial \theta^2}\sum_i \log(\textrm{p}(x_i|\theta_{\textrm{ML}}))/2$$

The first term is just a multiplicative constant that is removed by the normalisation of the posterior. By the law of large numbers the second term will converge to the expectation value of the second derivative:

$$\frac{1}{N}\sum_i \frac{\partial^2}{\partial \theta^2} \log(\textrm{p}(x_i|\theta_{\textrm{ML}})) \quad\overset{N\rightarrow\infty}{\rightarrow}\quad \mathbb{E}\left[\frac{\partial^2}{\partial \theta^2} \log(\textrm{p}(x|\theta_{\textrm{ML}}))\right] \equiv -I(\theta)$$

The negative of this expectation value of the second derivative of the log-likelihood is known as the Fisher information $I(\theta)$.

The likelihood function thus asymptotically becomes a normal distribution with mean $\mu = \theta_{\textrm{ML}}$ and variance $\sigma^2 = \frac{1}{NI(\theta)}$:

$$\textrm{p}(\mathbf{x}|\theta) \ \overset{N\rightarrow\infty}{\rightarrow}\ \textrm{p}(\mathbf{x}|\theta_{\textrm{ML}})\, e^{-\left(\theta-\theta_{\textrm{ML}}\right)^2 N I(\theta)/2}$$

A more rigorous proof of this statement is known as the Bernstein–von Mises theorem.
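The Gaussian shape can be verified numerically. For $N$ Bernoulli observations the Fisher information is $I(p) = \frac{1}{p(1-p)}$, so the normalized likelihood should have standard deviation close to $1/\sqrt{N I(p_{\textrm{ML}})}$; the counts below are made-up example values:

```python
import math

# Discretized Bernoulli likelihood for n observations with k successes
# (made-up counts): its normalized width should match 1/sqrt(n I(p_ML))
# with Fisher information I(p) = 1/(p(1-p)).
n, k = 1000, 600
grid = [i / 2000 for i in range(1, 2000)]
loglik = [k * math.log(t) + (n - k) * math.log(1 - t) for t in grid]
shift = max(loglik)
w = [math.exp(l - shift) for l in loglik]       # likelihood shape on the grid
z = sum(w)
mean = sum(t * wi for t, wi in zip(grid, w)) / z
sd = math.sqrt(sum((t - mean) ** 2 * wi for t, wi in zip(grid, w)) / z)
p_ml = k / n
predicted = math.sqrt(p_ml * (1 - p_ml) / n)    # 1 / sqrt(n I(p_ML))
```

Both numbers come out near $0.0155$.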

Choice of Prior

A flat prior $\textrm{p}(\theta)=1$ seems like a sensible choice when having no prior information, but it is not invariant under reparametrizations: for example $\textrm{p}(\sigma)\textrm{d}\sigma=1\,\textrm{d}\sigma$ implies $\textrm{p}(\sigma^2)\textrm{d}\sigma^2=\frac{1}{2\sigma}\textrm{d}\sigma^2$, even though both describe the same parameter. A more sensible choice is a prior that maximizes the difference between prior and posterior distribution. This difference can be measured by the Kullback–Leibler divergence:

$$\begin{aligned} \iint \textrm{p}(x)\textrm{p}(\theta|x)\log \frac{\textrm{p}(\theta|x)}{\textrm{p}(\theta)}\textrm{d}\theta\textrm{d}x &= \iint \textrm{p}(\theta)\textrm{p}(x|\theta)\log \frac{\textrm{p}(\theta|x)}{\textrm{p}(\theta)}\textrm{d}\theta\textrm{d}x \\ &\rightarrow \int \textrm{p}(\theta)\log\frac{\sqrt{NI(\theta)/2\pi}}{\textrm{p}(\theta)}\textrm{d}\theta \end{aligned}$$

A prior that maximizes this difference is the Jeffreys prior, which is given by the square root of the Fisher information:

$$\textrm{p}(\theta) = \sqrt{I(\theta)} \equiv \sqrt{\mathbb{E}\left[\left(\frac{\partial \log(\textrm{p}(x|\theta))}{\partial \theta}\right)^2\right]}$$
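A quick Monte Carlo check of this definition (the Bernoulli model, $p$ and the sample size are illustrative): the score is $1/p$ for $x=1$ and $-1/(1-p)$ for $x=0$, so $I(p)$ should come out as $\frac{1}{p(1-p)}$, giving the Jeffreys prior $\textrm{p}(p) \propto p^{-1/2}(1-p)^{-1/2}$:

```python
import random

# Monte Carlo sketch of the Fisher information for a Bernoulli model:
# the score d/dp log p(x|p) is 1/p for x = 1 and -1/(1-p) for x = 0,
# and E[score^2] should approach 1/(p(1-p)).
def fisher_info_mc(p, n=200_000, seed=1):
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = 1 if rng.random() < p else 0
        score = 1 / p if x else -1 / (1 - p)
        total += score ** 2
    return total / n

p = 0.3
```

The estimate lands close to $1/(0.3 \cdot 0.7) \approx 4.76$; the resulting Jeffreys prior is the (unnormalized) $\textrm{Beta}(\frac{1}{2},\frac{1}{2})$ density.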

Credible intervals

From the posterior $\textrm{p}(\theta|\mathbf{x})$ one can determine a range $[\theta_{\textrm{min}}, \theta_{\textrm{max}}]$ in which $\theta$ lies with probability $1-\alpha$, called a credible interval:

$$P(\theta\leq \theta_{\textrm{max}}|\mathbf{x})=\int_{-\infty}^{\theta_{\textrm{max}}} \textrm{p}(\theta|\mathbf{x})\textrm{d}\theta \overset{!}{=} 1-\alpha/2 \overset{!}{=} \int^{\infty}_{\theta_{\textrm{min}}} \textrm{p}(\theta|\mathbf{x})\textrm{d}\theta = P(\theta\geq \theta_{\textrm{min}}|\mathbf{x})$$
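On a grid, the two conditions above translate into scanning the cumulative sum of the posterior; the Beta-shaped stand-in posterior and $\alpha = 0.05$ below are illustrative:

```python
# Equal-tailed credible interval from a tabulated posterior: walk the
# cumulative distribution until alpha/2 and 1 - alpha/2 are reached.
alpha = 0.05
grid = [i / 1000 for i in range(1, 1000)]
post = [t ** 7 * (1 - t) ** 3 for t in grid]   # stand-in for p(theta|x)
z = sum(post)
cdf, lo, hi = 0.0, None, None
for t, w in zip(grid, post):
    cdf += w / z
    if lo is None and cdf >= alpha / 2:
        lo = t                                 # theta_min
    if hi is None and cdf >= 1 - alpha / 2:
        hi = t                                 # theta_max
```

For this posterior (a discretized $\textrm{Beta}(8,4)$) the interval contains the posterior mean $2/3$.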

Loss functions

Instead of stating the full posterior $\textrm{p}(\theta|\mathbf{x})$, one can characterise the distribution by a value $\hat{\theta}(\mathbf{x})$ that minimizes the average loss $\int L(\hat{\theta}(\mathbf{x}),\theta)\textrm{p}(\theta|\mathbf{x})\textrm{d}\theta$ over the posterior $\textrm{p}(\theta|\mathbf{x})$. Possible loss functions include:

Mean squared error

The loss function $L=\left|\hat{\theta}(\mathbf{x})-\theta\right|^2$ is minimized by $\hat{\theta}(\mathbf{x}) = \textrm{mean}[\textrm{p}(\theta|\mathbf{x})]$ since one requires:

$$\frac{\partial}{\partial \hat{\theta}} \int \left|\hat{\theta}(\mathbf{x})-\theta\right|^2 \textrm{p}(\theta|\mathbf{x})\textrm{d}\theta = 2\left(\hat{\theta}(\mathbf{x})-\int \theta\,\textrm{p}(\theta|\mathbf{x})\textrm{d}\theta\right) \overset{!}{=} 0$$

Here it was used that $\hat{\theta}(\mathbf{x})$ must be independent of $\theta$. Plugging the mean back into the average squared error shows that the error is then given by the variance of the posterior.

This can alternatively be seen directly from the average loss function:

$$\langle (\hat{\theta}(\mathbf{x})-\theta)^2 \rangle = \langle (\hat{\theta}(\mathbf{x})-\langle \theta \rangle)^2 + (\theta - \langle \theta \rangle)^2 \rangle \geq \langle (\theta - \langle \theta \rangle)^2 \rangle$$
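The claim can be checked numerically with samples standing in for the posterior (the distribution and seed below are arbitrary): the average squared loss, as a function of the estimate, is smallest at the sample mean.

```python
import random
import statistics

# Draw samples that play the role of the posterior and evaluate the
# average squared loss at candidate estimates.
rng = random.Random(2)
samples = [rng.gauss(1.0, 2.0) for _ in range(5000)]

def avg_sq_loss(est):
    return sum((est - s) ** 2 for s in samples) / len(samples)

best = statistics.mean(samples)
```

Shifting the estimate away from the mean by any amount $d$ increases the average loss by exactly $d^2$.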

Mean absolute error

The loss function $L=\left|\hat{\theta}(\mathbf{x})-\theta\right|^1$ is minimized by $\hat{\theta}(\mathbf{x}) = \textrm{median}[\textrm{p}(\theta|\mathbf{x})]$ since one requires:

$$\begin{aligned} \frac{\partial}{\partial \hat{\theta}} \int \left|\hat{\theta}(\mathbf{x})-\theta\right|^1 \textrm{p}(\theta|\mathbf{x})\textrm{d}\theta &= \frac{\partial}{\partial \hat{\theta}} \left(\int_{-\infty}^{\hat{\theta}(\mathbf{x})} \left(\hat{\theta}(\mathbf{x})-\theta\right)\textrm{p}(\theta|\mathbf{x})\textrm{d}\theta -\int_{\hat{\theta}(\mathbf{x})}^{\infty} \left(\hat{\theta}(\mathbf{x})-\theta\right)\textrm{p}(\theta|\mathbf{x})\textrm{d}\theta\right)\\ &= \int_{-\infty}^{\hat{\theta}(\mathbf{x})} \textrm{p}(\theta|\mathbf{x})\textrm{d}\theta - \int_{\hat{\theta}(\mathbf{x})}^{\infty} \textrm{p}(\theta|\mathbf{x})\textrm{d}\theta \overset{!}{=} 0 \end{aligned}$$

Here, by definition, the median is the value for which the cumulative probabilities above and below are equal.
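The same style of numeric check works for the absolute loss, again with samples standing in for the posterior (an asymmetric distribution is used on purpose, to separate mean and median):

```python
import random
import statistics

# Draw skewed samples that play the role of the posterior and evaluate
# the average absolute loss at candidate estimates.
rng = random.Random(3)
samples = [rng.expovariate(1.0) for _ in range(5001)]

def avg_abs_loss(est):
    return sum(abs(est - s) for s in samples) / len(samples)

best = statistics.median(samples)
```

Here the minimizer is the sample median, not the mean, since the exponential distribution is skewed.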

Mean 0-1 error

The loss function $L=\left|\hat{\theta}(\mathbf{x})-\theta\right|^0$ with $0^0 \equiv 0$ is minimized by $\hat{\theta}(\mathbf{x}) = \textrm{mode}[\textrm{p}(\theta|\mathbf{x})]$ since one requires:

$$\begin{aligned} \frac{\partial}{\partial \hat{\theta}} \int \left|\hat{\theta}(\mathbf{x})-\theta\right|^0 \textrm{p}(\theta|\mathbf{x})\textrm{d}\theta &= \frac{\partial}{\partial \hat{\theta}} \left(1 - \int \delta\left(\hat{\theta}(\mathbf{x})-\theta\right)\textrm{p}(\theta|\mathbf{x})\textrm{d}\theta\right) \\ &= -\frac{\partial}{\partial \hat{\theta}} \textrm{p}(\hat{\theta}(\mathbf{x})|\mathbf{x}) \overset{!}{=} 0 \end{aligned}$$

Luckily, for unimodal and symmetric distributions (e.g. the normal distribution) the mean, median and mode all coincide. In the following we will concentrate on the mean squared error.

Example: Estimating the parameters of a normal distribution

The likelihood function $\textrm{p}(\mathbf{x}|\mu,\sigma)$ can be written with $\overline{x} = \sum_i \frac{x_i}{n}$ as:

$$\textrm{p}(\mathbf{x}|\mu,\sigma) = \prod_i \textrm{p}(x_i|\mu,\sigma) = \prod_i \frac{e^{-\frac{(x_i-\mu)^2}{2\sigma^2}}}{\sqrt{2\pi \sigma^2}} = \frac{e^{-\sum_i \frac{(x_i-\mu)^2}{2\sigma^2}}}{\sqrt{2\pi \sigma^2}^n} = \frac{e^{- \frac{\sum_i(x_i-\overline{x})^2 + n(\overline{x}-\mu)^2}{2\sigma^2}}}{\sqrt{2\pi \sigma^2}^n}$$

Using the Jeffreys priors $\textrm{p}(\mu) = 1$ and $\textrm{p}(\sigma) = \frac{1}{\sqrt{2\pi\sigma^2}}$ and integrating out $\mu$ and $\sigma$ yields:

$$\begin{aligned} \textrm{p}(\mathbf{x}) &= \iint \textrm{p}(\mathbf{x}|\mu,\sigma)\textrm{p}(\mu)\textrm{p}(\sigma)\textrm{d}\mu\textrm{d}\sigma = \iint\frac{e^{- \frac{\sum_i(x_i-\overline{x})^2 + n(\overline{x}-\mu)^2}{2\sigma^2}}}{\sqrt{2\pi \sigma^2}^{n+1}}\textrm{d}\mu\textrm{d}\sigma \\ &= \int\frac{e^{-\frac{\sum_i(x_i-\overline{x})^2}{2\sigma^2}}}{\sqrt{n}\sqrt{2\pi \sigma^2}^n}\textrm{d}\sigma = \sqrt{\sum_i(x_i-\overline{x})^2}^{-(n-1)}\frac{\Gamma\left(\frac{n-1}{2}\right)}{\sqrt{n}\sqrt{2\pi^n}} \end{aligned}$$

The posterior of the mean is a Student’s t distribution with $n-1$ degrees of freedom:

$$\begin{aligned} \textrm{p}(\mu|\mathbf{x}) &= \frac{1}{\textrm{p}(\mathbf{x})}\int \textrm{p}(\mathbf{x}|\mu,\sigma)\textrm{p}(\mu)\textrm{p}(\sigma)\textrm{d}\sigma = \frac{1}{\textrm{p}(\mathbf{x})}\int\frac{e^{- \frac{\sum_i(x_i-\overline{x})^2 + n(\overline{x}-\mu)^2}{2\sigma^2}}}{\sqrt{2\pi \sigma^2}^{n+1}}\textrm{d}\sigma \\ &= \frac{1}{\textrm{p}(\mathbf{x})}\frac{\Gamma\left(\frac{n}{2}\right)}{\sqrt{2\pi^{n+1}}}\sqrt{\sum_i(x_i-\overline{x})^2 + n(\overline{x}-\mu)^2}^{-n} \\ &= \frac{\Gamma\left(\frac{n}{2}\right)\sqrt{n}}{\sqrt{\pi}\Gamma\left(\frac{n-1}{2}\right)\sqrt{\sum_i(x_i-\overline{x})^2}}\sqrt{1 + \frac{n(\overline{x}-\mu)^2}{\sum_i(x_i-\overline{x})^2}}^{-n} \end{aligned}$$

The posterior of the variance is an inverse $\chi^2$ distribution with $n-1$ degrees of freedom:

$$\begin{aligned} \textrm{p}(\sigma|\mathbf{x}) &= \frac{1}{\textrm{p}(\mathbf{x})}\int \textrm{p}(\mathbf{x}|\mu,\sigma)\textrm{p}(\mu)\textrm{p}(\sigma)\textrm{d}\mu = \frac{1}{\textrm{p}(\mathbf{x})}\int \frac{e^{- \frac{\sum_i(x_i-\overline{x})^2 + n(\overline{x}-\mu)^2}{2\sigma^2}}}{\sqrt{2\pi \sigma^2}^{n+1}}\textrm{d}\mu \\ &= \frac{1}{\textrm{p}(\mathbf{x})} \frac{e^{- \frac{\sum_i(x_i-\overline{x})^2}{2\sigma^2}}}{\sqrt{n}\sqrt{2\pi \sigma^2}^n} = \sqrt{\sum_i(x_i-\overline{x})^2}^{n}\frac{\sqrt{2}}{\Gamma\left(\frac{n-1}{2}\right)}\frac{e^{- \frac{\sum_i(x_i-\overline{x})^2}{2\sigma^2}}}{\sqrt{2\sigma^2}^n} \end{aligned}$$

Estimating the parameter $p$ of a binomial trial with the Jeffreys prior $\textrm{p}(p)=\sqrt{p(1-p)}^{-1}$ yields a mean of:

$$\hat{p}(x) = \frac{\int p\, p^x(1-p)^{n-x}\, \textrm{p}(p)\textrm{d}p}{\int p^x(1-p)^{n-x}\, \textrm{p}(p)\textrm{d}p} = \frac{x+\frac{1}{2}}{n+1}$$

And a mean squared error of:

$$(\Delta \hat{p})^2 = \frac{\int p^2 p^x(1-p)^{n-x}\, \textrm{p}(p)\textrm{d}p}{\int p^x(1-p)^{n-x}\, \textrm{p}(p)\textrm{d}p} - \hat{p}(x)^2 = \frac{(n-x+\frac{1}{2})(x+\frac{1}{2})}{(n+2)(n+1)^2}$$
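Both closed forms can be checked by numerical integration on a grid (the values of $n$ and $x$ below are examples):

```python
import math

# Integrate the Beta-shaped integrands of the posterior mean and MSE
# numerically with the midpoint rule and compare with the closed forms.
n, x = 20, 7
prior = lambda p: 1 / math.sqrt(p * (1 - p))        # Jeffreys prior
lik = lambda p: p ** x * (1 - p) ** (n - x)
grid = [(i + 0.5) / 10_000 for i in range(10_000)]  # midpoints, dp = 1e-4
den = sum(lik(p) * prior(p) for p in grid)
mean = sum(p * lik(p) * prior(p) for p in grid) / den
second = sum(p * p * lik(p) * prior(p) for p in grid) / den
mse = second - mean ** 2
```

The numerical mean should land on $(x+\frac{1}{2})/(n+1)$ and the numerical variance on the closed form above.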

Bayes factor

Part 2: Frequentist

An alternative to the Bayesian method is to not assume a prior/posterior distribution for $\theta$, instead relying on the mode of the likelihood function and looking at the distribution of the possible measurements $x$ instead of $\theta$.

Maximum Likelihood

A way of parameter estimation is the maximum likelihood method, where the estimator $\hat{\theta}(\mathbf{x})$ is given by the condition that $\hat{\theta}(\mathbf{x})$ maximizes the likelihood, implying $\frac{\partial \textrm{p}(\mathbf{x}|\hat{\theta}(\mathbf{x}))}{\partial \theta} \overset{!}{=} 0$. One can Taylor expand the likelihood around the true parameter $\theta_0$:

$$\frac{\partial \log(\textrm{p}(\mathbf{x}|\theta_0))}{\partial \theta} \approx \underbrace{\frac{\partial \log(\textrm{p}(\mathbf{x}|\hat{\theta}(\mathbf{x})))}{\partial \theta}}_{0} + \frac{\partial^2 \log(\textrm{p}(\mathbf{x}|\hat{\theta}(\mathbf{x})))}{\partial \theta^2} (\theta_0-\hat{\theta}(\mathbf{x}))$$

This difference converges by the central limit theorem, the law of large numbers and Slutsky’s theorem to the following normal distribution:

$$\theta_0-\hat{\theta}(\mathbf{x}) = \frac{\frac{\partial \log(\textrm{p}(\mathbf{x}|\theta_0))}{\partial \theta}}{\frac{\partial^2 \log(\textrm{p}(\mathbf{x}|\hat{\theta}(\mathbf{x})))}{\partial \theta^2}} = \frac{\frac{1}{n}\sum_i \frac{\partial \log(\textrm{p}(x_i|\theta_0))}{\partial \theta}}{\frac{1}{n}\sum_i \frac{\partial^2 \log(\textrm{p}(x_i|\hat{\theta}(\mathbf{x})))}{\partial \theta^2}} \overset{n\rightarrow\infty}{\rightarrow} \frac{\mathcal{N}\left(0, \mathbb{E}\left[\left(\frac{\partial \log(\textrm{p}(x|\theta_0))}{\partial \theta}\right)^2\right]/n\right)}{\mathbb{E}\left[\frac{\partial^2 \log(\textrm{p}(x|\hat{\theta}(\mathbf{x})))}{\partial \theta^2}\right]}$$

Here it was used that the derivative of the logarithm vanishes in expectation:

$$\mathbb{E}\left[\frac{\partial \log(\textrm{p}(x|\theta_0))}{\partial \theta}\right] = \int \textrm{p}(x|\theta_0)\frac{\partial \log(\textrm{p}(x|\theta_0))}{\partial \theta}\textrm{d}x = \int \frac{\partial}{\partial \theta}\textrm{p}(x|\theta_0)\textrm{d}x = \frac{\partial}{\partial \theta} \mathbb{E}[1] = 0$$

Similarly, the expectation values in the numerator and denominator are both given by the Fisher information (up to a sign):

$$\mathbb{E}\left[\frac{\partial^2 \log(\textrm{p}(x|\hat{\theta}(\mathbf{x})))}{\partial \theta^2}\right] = \mathbb{E}\left[\frac{\partial}{\partial \theta} \left(\frac{\partial \textrm{p}(x|\hat{\theta}(\mathbf{x}))}{\partial \theta}\frac{1}{\textrm{p}(x|\hat{\theta}(\mathbf{x}))}\right)\right] = \underbrace{\frac{\partial^2 \mathbb{E}[1]}{\partial \theta^2}}_0 -\underbrace{\mathbb{E}\left[\left(\frac{\partial \log(\textrm{p}(x|\hat{\theta}(\mathbf{x})))}{\partial \theta}\right)^2\right]}_{I(\theta)}$$

If the estimator converges to the true value, $\hat{\theta}(\mathbf{x}) \rightarrow \theta_0$, then:

$$\theta_0-\hat{\theta}(\mathbf{x}) \ \overset{n\rightarrow\infty}{\rightarrow}\ \mathcal{N}\left(\mu = 0, \sigma^2 = \frac{1}{n I(\theta)}\right)$$
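This sampling distribution can be simulated (a Bernoulli model; $p_0$, $n$ and the number of repetitions are example values): the spread of $\hat{p} = x/n$ over repeated experiments should match $1/\sqrt{n I(p_0)} = \sqrt{p_0(1-p_0)/n}$.

```python
import math
import random
import statistics

# Repeat a Bernoulli experiment many times and compare the spread of the
# ML estimate p_hat = x/n with the asymptotic prediction 1/sqrt(n I(p0)).
rng = random.Random(4)
p0, n, reps = 0.4, 500, 4000
estimates = [sum(rng.random() < p0 for _ in range(n)) / n for _ in range(reps)]
sd = statistics.pstdev(estimates)
predicted = math.sqrt(p0 * (1 - p0) / n)
```

The empirical standard deviation agrees with the prediction to within Monte Carlo noise.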

Confidence Intervals

The confidence interval $[\hat{\theta}_\textrm{min},\hat{\theta}_\textrm{max}]$ is the analog of the credible interval, where after repeated measurements the true parameter $\theta$ is included $1-\alpha$ of the time. For a given $x$ the confidence interval can be found from:

$$P(X\leq x|\hat{\theta}_{\textrm{min}})=\int_{-\infty}^x \textrm{p}(x'|\hat{\theta}_{\textrm{min}})\textrm{d}x' \overset{!}{=} 1-\alpha/2 \overset{!}{=} \int_x^{\infty} \textrm{p}(x'|\hat{\theta}_{\textrm{max}})\textrm{d}x' = P(X\geq x|\hat{\theta}_{\textrm{max}})$$
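A coverage simulation illustrates the defining property (assumed setup: normal data with known $\sigma$, for which the interval is $\overline{x} \pm 1.96\,\sigma/\sqrt{n}$ at $\alpha = 0.05$; all numbers are illustrative):

```python
import math
import random

# Repeated experiments with normal data of known sigma: the 95% interval
# xbar +- 1.96 sigma/sqrt(n) should contain the true mean about 95% of
# the time.
rng = random.Random(5)
mu, sigma, n, reps = 3.0, 2.0, 30, 2000
half = 1.96 * sigma / math.sqrt(n)
hits = 0
for _ in range(reps):
    xbar = sum(rng.gauss(mu, sigma) for _ in range(n)) / n
    hits += xbar - half <= mu <= xbar + half
coverage = hits / reps
```

Note that it is the interval, not the parameter, that varies from experiment to experiment.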

Mean squared error

The mean squared error is taken as the squared difference between the estimated parameter $\hat{\theta}(\mathbf{x})$ and the true parameter $\theta$, averaged over all possible $\mathbf{x}$:

$$\mathbb{E}\left[\left(\hat{\theta}(\mathbf{x})-\theta\right)^2\right] = \underbrace{\left(\mathbb{E}\left[\hat{\theta}(\mathbf{x})\right]-\theta\right)^2}_{\textrm{bias}^2} + \underbrace{\mathbb{E}\left[\hat{\theta}(\mathbf{x})^2\right] - \mathbb{E}\left[\hat{\theta}(\mathbf{x})\right]^2}_{\textrm{variance}}$$

For unbiased estimators the mean squared error equals the variance of the estimator.

Estimating for example the parameter $p$ of a binomial trial with the maximum likelihood estimator yields:

$$\hat{p}(x) = \frac{x}{n} \qquad (\Delta \hat{p})^2 = \sum_x\binom{n}{x}p^x(1-p)^{n-x}\left(\frac{x}{n}-p\right)^2 = \frac{p-p^2}{n}$$
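The variance formula can be verified exactly by summing over all $n+1$ outcomes (the values of $p$ and $n$ below are examples):

```python
import math

# Exact check of the variance of p_hat = x/n by summing the loss over
# the full binomial distribution.
p, n = 0.3, 12
mse = sum(math.comb(n, x) * p ** x * (1 - p) ** (n - x) * (x / n - p) ** 2
          for x in range(n + 1))
```

The sum reproduces $p(1-p)/n$ up to floating-point rounding.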

Cramér–Rao bound

Defining the scalar product $\langle f(x)g(x) \rangle \equiv \mathbb{E}\left[f(x)g(x)\right]$, one gets from the Cauchy–Schwarz inequality that the variance of an estimator $\hat{\theta}(\mathbf{x})$ is bounded by the Fisher information:

$$\left(\frac{\partial}{\partial \theta}\mathbb{E}\left[\hat{\theta}(\mathbf{x})\right]\right)^2 = \mathbb{E}\left[\left(\hat{\theta}(\mathbf{x})-\theta\right) \frac{\partial}{\partial \theta} \log(\textrm{p}(x|\theta))\right]^2 \leq \mathbb{E}\left[\left(\hat{\theta}(\mathbf{x})-\theta\right)^2\right]\mathbb{E}\left[\left(\frac{\partial}{\partial \theta} \log(\textrm{p}(x|\theta))\right)^2\right]$$

Here the first equality follows from the interchange of integration and differentiation:

$$\begin{aligned} \mathbb{E}\left[\left(\hat{\theta}(\mathbf{x})-\theta\right) \frac{\partial \log(\textrm{p}(x|\theta))}{\partial \theta} \right] &= \int \textrm{p}(x|\theta)\left(\hat{\theta}(\mathbf{x})-\theta\right) \frac{1}{\textrm{p}(x|\theta)} \frac{\partial \textrm{p}(x|\theta)}{\partial \theta}\textrm{d}x \\ &= \frac{\partial}{\partial \theta}\underbrace{\int \hat{\theta}(\mathbf{x})\, \textrm{p}(x|\theta)\, \textrm{d}x}_{=\,\mathbb{E}\left[\hat{\theta}(\mathbf{x})\right]} - \theta\frac{\partial}{\partial \theta} \underbrace{\int \textrm{p}(x|\theta)\, \textrm{d}x}_{=\,1} \end{aligned}$$

For unbiased estimators ($\mathbb{E}[\hat{\theta}(\mathbf{x})] = \theta$) this gives a lower bound on the mean squared error as the inverse of the Fisher information $I(\theta)$:

$$\mathbb{E}\left[\left(\hat{\theta}(\mathbf{x})-\theta\right)^2\right] \geq \frac{1}{I(\theta)}$$

The Cauchy–Schwarz inequality itself follows from:

$$\frac{1}{2}\ \mathbb{E}\left[\left(\frac{x}{\sqrt{\mathbb{E}[x^2]}} - \frac{y}{\sqrt{\mathbb{E}[y^2]}}\right)^2\right] = 1 - \frac{\mathbb{E}[xy]}{\sqrt{\mathbb{E}[x^2]\mathbb{E}[y^2]}} \geq 0$$
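For the binomial example above the ML estimator attains the bound: with $I(p) = \frac{1}{p(1-p)}$ per observation, the information of $n$ independent draws is $nI(p)$, and the variance $p(1-p)/n$ equals its inverse (the numbers below are examples):

```python
# The Cramer-Rao bound is attained by the binomial ML estimator:
# MSE = p(1-p)/n equals 1 / (n I(p)) with I(p) = 1/(p(1-p)) per draw.
p, n = 0.25, 40
fisher_n = n / (p * (1 - p))   # Fisher information of n independent draws
mse = p * (1 - p) / n          # variance of x/n from the MSE section
```

Estimators that attain the bound like this are called efficient.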

Linear regression

Likelihood ratio test