When doing a linear regression, the result can be computed by requiring the estimated parameters $\theta_j$ to minimize the sum of squared differences between the measured values $y_i$ and the predicted values $f_\theta(x_i)$:
$$S_2 \equiv \sum_i |f_\theta(x_i) - y_i|^2 \equiv \sum_i |\Delta_i|^2$$
But this requirement seems somewhat arbitrary: why not minimize the sum of absolute differences instead?
$$S_1 \equiv \sum_i |f_\theta(x_i) - y_i|^1 \equiv \sum_i |\Delta_i|^1$$
Or one could even go a step lower and minimize the following sum:
$$S_0 \equiv \sum_i |f_\theta(x_i) - y_i|^0 \equiv \sum_i |\Delta_i|^0$$
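As a minimal sketch, the three sums can be computed for a given set of predictions and measurements. The function name `loss_sums` and the sample numbers are made up for illustration. The sketch also assumes the convention $0^0 = 0$, so that the zeroth-power sum counts the residuals that are not exactly zero; without that convention the sum would be the same constant for every model.

```python
def loss_sums(predicted, measured):
    """Return (S2, S1, S0) for the residuals predicted - measured."""
    deltas = [p - m for p, m in zip(predicted, measured)]
    s2 = sum(abs(d) ** 2 for d in deltas)   # sum of squared differences
    s1 = sum(abs(d) for d in deltas)        # sum of absolute differences
    # Assumed convention 0**0 = 0: count only the non-zero residuals.
    s0 = sum(1 for d in deltas if d != 0)
    return s2, s1, s0


# Illustrative values only; the residuals are 0, -0.5, 0 and 2.
predicted = [1.0, 2.0, 3.0, 4.0]
measured = [1.0, 2.5, 3.0, 2.0]
print(loss_sums(predicted, measured))  # (4.25, 2.5, 2)
```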
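For the sum $S_n$ to be minimal, its derivatives with respect to the parameters must vanish: $\frac{\partial S_n}{\partial \theta_j} = 0$. If the model function $f$ includes a constant $\alpha$, so that $f_\theta(x) = g_\theta(x) + \alpha$, then $\frac{\partial f_\theta}{\partial \alpha} = 1$ and therefore $\frac{\partial S_n}{\partial \alpha} = \frac{\partial S_n}{\partial f(x)}$ must be zero.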
For the least squares approach, optimizing $\alpha$ ensures that the mean difference between predicted and measured values is zero, since:
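$$\frac{\partial S_2}{\partial \alpha} = \frac{\partial}{\partial f} \sum_i |\Delta_i|^2 = 2 \sum_i \Delta_i \overset{!}{=} 0$$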
With the least absolute deviations approach instead, the optimization ensures that the median difference between predicted and measured values is zero:
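$$\frac{\partial S_1}{\partial \alpha} = \frac{\partial}{\partial f} \sum_i |\Delta_i|^1 = \sum_i \frac{\Delta_i}{|\Delta_i|} = n(\Delta_i > 0) - n(\Delta_i < 0) \overset{!}{=} 0$$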
Optimizing $S_0$ will ensure that the difference has a mode at zero:
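$$\frac{\partial S_0}{\partial \alpha} = \frac{\partial}{\partial f} \sum_i |\Delta_i|^0 = -\frac{\partial}{\partial \alpha} n(\Delta_i = 0) \overset{!}{=} 0$$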
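The three conditions can be checked numerically in the simplest case, a model that consists only of the constant $\alpha$ (that is, $g_\theta = 0$, an assumption made here purely for illustration). The sketch below scans a grid of candidate values and picks the one minimizing each sum. The data, the grid and the helper name `best_alpha` are made up, and the zeroth-power sum again assumes the convention $0^0 = 0$.

```python
import statistics

# Illustrative measurements; the value 10.0 acts as an outlier.
y = [1.0, 2.0, 2.0, 3.0, 10.0]

# Candidate constants alpha = 0.00, 0.01, ..., 10.00 (brute-force grid).
alphas = [k / 100.0 for k in range(0, 1001)]


def best_alpha(loss):
    """Return the grid value of alpha that minimizes loss(residuals)."""
    return min(alphas, key=lambda a: loss([a - yi for yi in y]))


alpha_s2 = best_alpha(lambda ds: sum(d * d for d in ds))          # least squares
alpha_s1 = best_alpha(lambda ds: sum(abs(d) for d in ds))         # least absolute deviations
alpha_s0 = best_alpha(lambda ds: sum(1 for d in ds if d != 0))    # count of non-zero residuals

print(alpha_s2, statistics.mean(y))    # both close to 3.6
print(alpha_s1, statistics.median(y))  # both 2.0
print(alpha_s0, statistics.mode(y))    # both 2.0
```

On this sample the single large value pulls the least squares result up to the mean, while the other two results stay at the bulk of the data.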
This explains why the least squares approach is the usual choice: intuitively, one expects the model to equal the measurement on average. When outliers are present, however, the least absolute deviations approach can be more useful, because the median is more robust to outliers than the mean.
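The effect of an outlier on a straight-line fit can be sketched as follows. The data are made up: five points lie exactly on $y = 2x + 1$ and the last one is an outlier. The least squares line uses the standard closed-form solution. For least absolute deviations, this sketch relies on the fact that an optimal line can always be found among the lines passing through two of the data points, and simply tries every pair; this brute-force search is chosen for brevity and is not meant as an efficient method. The function names are hypothetical.

```python
from itertools import combinations


def least_squares_line(xs, ys):
    """Closed-form least squares fit; returns (slope, intercept)."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


def least_absolute_line(xs, ys):
    """Brute-force least absolute deviations fit; returns (slope, intercept)."""
    best = None
    # Try every line through two data points and keep the one with minimal S1.
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        if x1 == x2:
            continue  # vertical line, cannot be written as y = slope * x + intercept
        slope = (y2 - y1) / (x2 - x1)
        intercept = y1 - slope * x1
        s1 = sum(abs(slope * x + intercept - y) for x, y in zip(xs, ys))
        if best is None or s1 < best[0]:
            best = (s1, slope, intercept)
    return best[1], best[2]


# Points on y = 2x + 1, except the last one (30.0 instead of 11.0).
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0, 30.0]

print(least_squares_line(xs, ys))   # slope pulled well above 2 by the outlier
print(least_absolute_line(xs, ys))  # (2.0, 1.0)
```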
Another comparison between the two approaches can be made using the maximum likelihood method: least squares corresponds to normally distributed errors, and least absolute deviations corresponds to double exponentially (Laplace) distributed errors.
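A short derivation makes this correspondence explicit. It assumes independent errors $\Delta_i$ with a common scale; the symbols $\sigma$ (normal standard deviation), $b$ (Laplace scale) and $N$ (number of measurements) are introduced here for illustration. For normally distributed errors the negative log-likelihood is

$$-\ln L = -\sum_i \ln\left[\frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{\Delta_i^2}{2\sigma^2}\right)\right] = \frac{1}{2\sigma^2} \sum_i |\Delta_i|^2 + N \ln\left(\sqrt{2\pi}\,\sigma\right),$$

and for double exponentially distributed errors it is

$$-\ln L = -\sum_i \ln\left[\frac{1}{2b} \exp\left(-\frac{|\Delta_i|}{b}\right)\right] = \frac{1}{b} \sum_i |\Delta_i|^1 + N \ln(2b).$$

For fixed $\sigma$ or $b$, the last term in each line does not depend on $\theta$, so maximizing the likelihood is equivalent to minimizing $S_2$ in the first case and $S_1$ in the second.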
See also "Modes, Medians and Means: A Unifying Perspective", which served as the foundation of this post.