Niklas Buschmann

Least squares and least absolute deviations

When doing any kind of linear regression, the result can always be computed by requiring the estimated parameters $\theta_j$ to minimize the sum of squared differences between the measured values $y_i$ and the predicted values $f_{\theta}(x_i)$:

$$S_1 = \sum_i |\Delta_i|^2 = \sum_i |y_i - f_{\theta}(x_i)|^2$$

But this requirement always seemed kind of arbitrary to me - why not minimize the sum of absolute differences for example?

$$S_2 = \sum_i |\Delta_i| = \sum_i |y_i - f_{\theta}(x_i)|$$
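
To make the two objectives concrete, here is a minimal numerical sketch (not part of the original post): it fits a straight line $f_{\theta}(x) = a x + c$ to made-up data by minimizing $S_1$ and $S_2$ directly with `scipy.optimize.minimize`.

```python
import numpy as np
from scipy.optimize import minimize

# Toy data for a straight-line model f_theta(x) = a*x + c (made up for illustration)
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(0, 1, size=x.size)

def f(theta):
    a, c = theta
    return a * x + c

def S1(theta):
    # sum of squared differences
    return np.sum((y - f(theta)) ** 2)

def S2(theta):
    # sum of absolute differences
    return np.sum(np.abs(y - f(theta)))

theta_ls = minimize(S1, x0=[0.0, 0.0]).x                          # least squares
theta_lad = minimize(S2, x0=[0.0, 0.0], method="Nelder-Mead").x   # least absolute deviations (non-smooth objective)

print("least squares:             a=%.3f  c=%.3f" % tuple(theta_ls))
print("least absolute deviations: a=%.3f  c=%.3f" % tuple(theta_lad))
```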

In both cases the fact that the sum is minimized implies that the derivative with respect to any parameter $\theta_j$ must vanish:

$$\frac{\partial S_1}{\partial \theta_j} = \sum_i \frac{\partial |\Delta_i|^2}{\partial \theta_j} = -2 \sum_i \Delta_i\, \frac{\partial f_{\theta}(x_i)}{\partial \theta_j} = 0 \\[2ex] \frac{\partial S_2}{\partial \theta_j} = \sum_i \frac{\partial |\Delta_i|}{\partial \theta_j} = -\sum_i \frac{\Delta_i}{|\Delta_i|}\, \frac{\partial f_{\theta}(x_i)}{\partial \theta_j} = 0$$
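
As a concrete instance of these conditions (a straight line is just one example model), take $f_{\theta}(x) = a x + c$, so that $\frac{\partial f_{\theta}}{\partial a} = x_i$; the slope conditions then read

$$\frac{\partial S_1}{\partial a} = -2 \sum_i \Delta_i\, x_i = 0 \qquad \text{and} \qquad \frac{\partial S_2}{\partial a} = -\sum_i \frac{\Delta_i}{|\Delta_i|}\, x_i = 0$$

The condition coming from the constant $c$ is worked out next.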

Let us now assume that the model function $f$ includes a constant $c$, so that $f_{\theta}(x_i) = g_{\theta}(x_i) + c$. Since $\frac{\partial f_{\theta}}{\partial c} = 1$, this implies that:

$$\frac{\partial S_1}{\partial c} = -2\sum_i \Delta_i = 0 \quad \Rightarrow \quad \sum_i \frac{\Delta_i}{n} = 0$$

We can see that - with least squares - the constant $c$ ensures that the mean difference $\Delta_i$ between predicted and measured values is zero.
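
A quick numerical sanity check of this property (a sketch with made-up data; `np.polyfit` performs the least-squares fit of a line with intercept):

```python
import numpy as np

# Noisy straight-line data (made up for illustration)
rng = np.random.default_rng(1)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(0, 1, size=x.size)

# Least-squares fit of a line with intercept: np.polyfit minimizes S1
a, c = np.polyfit(x, y, deg=1)
residuals = y - (a * x + c)

# The fitted constant forces the mean residual to zero (up to floating point)
print("mean residual:", residuals.mean())
```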

Comparing this with the least absolute deviations approach yields:

$$\frac{\partial S_2}{\partial c} = -\sum_i \frac{\Delta_i}{|\Delta_i|} = 0 \quad \Rightarrow \quad n(\Delta_i > 0) = n(\Delta_i < 0)$$

Now - with least absolute deviations - the constant $c$ ensures that the median difference between predicted and measured values is zero. Since $\frac{\Delta_i}{|\Delta_i|} = \pm 1$, depending on whether the difference is positive or negative, the number of differences below zero must equal the number of differences above zero.
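
The analogous check for least absolute deviations (again a sketch with made-up data; $S_2$ is minimized directly with a derivative-free method, since it is not differentiable everywhere):

```python
import numpy as np
from scipy.optimize import minimize

# Noisy straight-line data (made up for illustration)
rng = np.random.default_rng(2)
x = np.linspace(0, 10, 51)
y = 2.0 * x + 1.0 + rng.normal(0, 1, size=x.size)

# Least absolute deviations fit: minimize S2 directly
S2 = lambda theta: np.sum(np.abs(y - (theta[0] * x + theta[1])))
a, c = minimize(S2, x0=[0.0, 0.0], method="Nelder-Mead").x

residuals = y - (a * x + c)
print("median residual:", np.median(residuals))                                      # ~0
print("above zero:", (residuals > 0).sum(), " below zero:", (residuals < 0).sum())   # (nearly) equal counts
```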

The same applies to all other parameters: least squares optimizes the mean difference and least absolute deviations optimizes the median difference - only that now the differences are weighted by a factor $\frac{\partial f_{\theta}}{\partial \theta_j}$ (which, since $f_{\theta}$ is linear in $\theta_j$, depends only on $x_i$).

This explains why the least squares approach is usually chosen: intuitively one expects the model to equal the measurements on average. But when outliers are present, the least absolute deviations approach can be more useful, since the median is more robust to outliers.
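
To see the robustness in action, here is a small sketch (toy data with one deliberately corrupted point; not from the original post) comparing the two fits:

```python
import numpy as np
from scipy.optimize import minimize

# Exact line y = 2x + 1, except for one gross outlier
x = np.linspace(0, 10, 21)
y = 2.0 * x + 1.0
y[-1] += 50.0  # corrupt the last measurement

# Least squares via np.polyfit, least absolute deviations via direct minimization of S2
a_ls, c_ls = np.polyfit(x, y, deg=1)
S2 = lambda theta: np.sum(np.abs(y - (theta[0] * x + theta[1])))
a_lad, c_lad = minimize(S2, x0=[1.0, 0.0], method="Nelder-Mead").x

print("least squares:             a=%.2f  c=%.2f" % (a_ls, c_ls))    # noticeably pulled towards the outlier
print("least absolute deviations: a=%.2f  c=%.2f" % (a_lad, c_lad))  # stays close to a=2, c=1
```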

Another comparison between the two approaches can be made using the maximum likelihood method: least squares corresponds to normally distributed errors, while least absolute deviations corresponds to double exponentially (Laplace) distributed errors.
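
Spelled out (a standard textbook argument): for independent errors with density $p$, maximizing the likelihood $\prod_i p(\Delta_i)$ is the same as minimizing $-\sum_i \ln p(\Delta_i)$, so

$$p(\Delta) \propto e^{-\Delta^2 / 2\sigma^2} \quad\Rightarrow\quad -\sum_i \ln p(\Delta_i) = \frac{S_1}{2\sigma^2} + \text{const} \\[2ex] p(\Delta) \propto e^{-|\Delta| / b} \quad\Rightarrow\quad -\sum_i \ln p(\Delta_i) = \frac{S_2}{b} + \text{const}$$

and the maximum likelihood estimate coincides with the minimizer of $S_1$ or $S_2$, respectively.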

See also Modes, Medians and Means: A Unifying Perspective, which this post builds on.