Calculator Statistics

I find it fascinating that calculators can perform some quite complex statistical analysis without storing the whole data set. Rather, they store just six values: $n$, $\Sigma x$, $\Sigma x^2$ [meaning $\sum{(x^2)}$, not $(\sum{x})^2$], $\Sigma y$, $\Sigma y^2$ and $\Sigma xy$.

Note that these are simple (atomic) symbols. On the other hand $\sum_{i} x_i$ and $\sum_{x \in X} x$ are formulae requiring the full details of the data set $X$.

Rather than stating similar formulae for $x$ and $y$, the variable $z$ will be used for $x$ or $y$.

Initial Empty Data Set

The initial state (for the empty data set) has:

$n = \Sigma x = \Sigma x^2 = \Sigma y = \Sigma y^2 = \Sigma xy = 0$.

Given a new point $\langle f,x,y \rangle$, where $f$ is the frequency (or weight), these values are updated thus:

$n$ :+= $f$;
$\Sigma x$ :+= $f x$ (i.e. $f \times x$);
$\Sigma x^2$ :+= $f x^2$;
$\Sigma y$ :+= $f y$;
$\Sigma y^2$ :+= $f y^2$;
$\Sigma xy$ :+= $f x y$.

To delete a point $\langle f,x,y \rangle$, simply add the point $\langle -f,x,y \rangle$.
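These update and delete rules fit in a few lines of Python. This is a minimal sketch; the class and method names are my own, not anything a real calculator exposes:

```python
# A minimal sketch of the calculator's running store: just the six
# values, updated per point. Names are illustrative assumptions.
class RunningSums:
    def __init__(self):
        # initial state for the empty data set: everything zero
        self.n = self.sx = self.sx2 = 0.0
        self.sy = self.sy2 = self.sxy = 0.0

    def add_point(self, f, x, y):
        # the six ":+=" updates for a new point <f, x, y>
        self.n   += f
        self.sx  += f * x
        self.sx2 += f * x * x
        self.sy  += f * y
        self.sy2 += f * y * y
        self.sxy += f * x * y

    def delete_point(self, f, x, y):
        # deleting a point is just adding <-f, x, y>
        self.add_point(-f, x, y)
```

Note that deleting every point added returns the store exactly to the initial all-zero state.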

Mean

The mean values of $x$ and $y$ are given by:
$\mu_z = \overline{z} = \dfrac{\sum_{i} f_i z_i}{\sum_{i} f_i} = \dfrac{\Sigma z}{n}$.
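As a sketch (the function name is my own), the means fall straight out of the stored values:

```python
def means(n, sx, sy):
    # mu_x = Sigma x / n and mu_y = Sigma y / n, where n = Sigma f
    return sx / n, sy / n
```

For example, the weighted points $\langle 1,1,10 \rangle$ and $\langle 3,5,2 \rangle$ give $n=4$, $\Sigma x = 16$, $\Sigma y = 16$, so both means are $4$.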

Variance and SD

The variances are given by:
$\sigma^2_z = {\rm var}(Z) = \dfrac{\sum_{i} f_i (z_i - \overline{z})^2}{\sum_{i} f_i} = \dfrac{n \Sigma z^2 - (\Sigma z)^2}{n^2}$.

The first (fraction) formula here is the definition. The second is that formula algebraically rearranged to use just our basic values.

The population standard deviations are given by:
$\sigma_z = {\rm sd}(Z) = \sqrt{\sigma^2_z}$.
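A sketch in Python using the rearranged formula (this assumes a non-empty data set, since it divides by $n$):

```python
import math

def variance(n, sz, sz2):
    # var(Z) = (n * Sigma z^2 - (Sigma z)^2) / n^2
    return (n * sz2 - sz * sz) / (n * n)

def sd(n, sz, sz2):
    # population standard deviation: sqrt of the variance
    return math.sqrt(variance(n, sz, sz2))
```

For $z = 1, 2, 3, 4$ (all frequencies one): $n=4$, $\Sigma z = 10$, $\Sigma z^2 = 30$, giving a variance of $(120 - 100)/16 = 1.25$.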

Some Intermediate Definitions

At this point it is useful to define some intermediate values that will crop up a few times:
$F_{zz} = n^2 {\rm var} (Z) = \sum_i f_i \sum_i f_i (z_i - \overline{z})^2$,
so $F_{zz} = n \Sigma z^2 - (\Sigma z)^2$;
$F_{xy} = n^2 {\rm covar} (X,Y) = \sum_i f_i \sum_i f_i (x_i - \overline{x})(y_i - \overline{y})$,
so $F_{xy} = n \Sigma xy - \Sigma x \Sigma y$.

(Actually, to say that e.g. $F_{zz} = n^2 {\rm var} (Z)$ is a little sloppy; that is really only true in the common case when $\forall i \bullet f_i = 1$, that is, all the $f_i$ are equal to one. It’s also a little sloppy to write ${\rm var} (Z)$ with no mention of $F$.)

Here’s one of the derivations:
$F_{zz} = \sum_i f_i \left[ \sum_i f_i (z_i - \overline{z})^2 \right]$
$= \sum_i f_i \left[ \sum_i f_i (z_i^2 + \overline{z}^2 - 2\overline{z}z_i) \right]$
$= \sum_i f_i \left[ \sum_i f_i z_i^2 + \overline{z}^2\sum_i f_i - 2\overline{z}\sum_i f_i z_i \right]$
$= n \left[ \Sigma z^2 + n \left( \dfrac{\Sigma z}{n}\right)^2 - 2\dfrac{\Sigma z}{n}\Sigma z \right]$
$= n \Sigma z^2 + (\Sigma z)^2 - 2 (\Sigma z)^2$
$= n \Sigma z^2 - (\Sigma z)^2$.
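The identity can be checked numerically. The sketch below (helper names are my own) compares the definitional weighted form against the rearranged form; the second helper recomputes $F_{zz}$ from the raw points, which the calculator itself never needs to do:

```python
def F_zz(n, sz, sz2):
    # rearranged form: F_zz = n * Sigma z^2 - (Sigma z)^2
    return n * sz2 - sz * sz

def F_zz_definitional(points):
    # points: list of (f, z) pairs
    # definitional form: Sigma f  *  Sigma f (z - zbar)^2
    n = sum(f for f, _ in points)
    zbar = sum(f * z for f, z in points) / n
    return n * sum(f * (z - zbar) ** 2 for f, z in points)
```

For the weighted points $(1, 1), (2, 3), (1, 5)$: $n=4$, $\Sigma z = 12$, $\Sigma z^2 = 44$, and both forms give $F_{zz} = 32$.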

Variance (again)

So now we have:
${\rm var}(Z) = \dfrac{F_{zz}}{n^2}$ and ${\rm sd}(Z) = \dfrac{\sqrt{F_{zz}}}{n}$.

RMS, Covariance, Correlation

We also have:
mean squares: $\overline{z^2} = \dfrac{\Sigma z^2}{n}$;
root mean squares: ${\rm rms}(z) = \sqrt{\overline{z^2}} = \sqrt{\dfrac{\Sigma z^2}{n}}$;
covariance: ${\rm covar}(X,Y) = \dfrac{F_{xy}}{n^2}$;
correlation: ${\rm correl}(X,Y) = \dfrac{{\rm covar}(X,Y)}{\sqrt{{\rm var}(X){\rm var}(Y)}} = \dfrac{F_{xy}}{\sqrt{F_{xx}F_{yy}}}$.
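All of these come from the six stored values alone; here is a sketch of the correlation (function name is mine):

```python
import math

def correlation(n, sx, sx2, sy, sy2, sxy):
    # correl(X, Y) = F_xy / sqrt(F_xx * F_yy)
    fxx = n * sx2 - sx * sx
    fyy = n * sy2 - sy * sy
    fxy = n * sxy - sx * sy
    return fxy / math.sqrt(fxx * fyy)
```

For exactly linear data, say $y = 2x$ at $x = 1, 2, 3$: $n=3$, $\Sigma x = 6$, $\Sigma x^2 = 14$, $\Sigma y = 12$, $\Sigma y^2 = 56$, $\Sigma xy = 28$, and the correlation comes out as $1$, as it should.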

Some Preparatory Mathematics

The basic values, however, can be used to produce much more statistical information. To go further we need to do some more preparation.

Calculus: Differentiations

First, some calculus; some differentiations that we’ll need later:

$\dfrac{\rm d}{{\rm d}b} \left( \dfrac{1}{b^2+1} \right) = - \dfrac{2b}{(b^2+1)^2}$;

$\dfrac{\rm d}{{\rm d}b} \left( \dfrac{2b}{b^2+1} \right) = - \dfrac{2(b^2-1)}{(b^2+1)^2}$;

$\dfrac{\rm d}{{\rm d}b} \left( \dfrac{b^2}{b^2+1} \right) = (+) \dfrac{2b}{(b^2+1)^2}$.
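These three derivatives can be sanity-checked with a central finite difference (the step size and test point below are arbitrary choices of mine):

```python
def num_deriv(g, b, h=1e-6):
    # central finite-difference approximation of dg/db
    return (g(b + h) - g(b - h)) / (2 * h)

b = 0.7  # arbitrary test point
# d/db 1/(b^2+1) = -2b/(b^2+1)^2
assert abs(num_deriv(lambda t: 1 / (t*t + 1), b) - (-2*b / (b*b + 1)**2)) < 1e-6
# d/db 2b/(b^2+1) = -2(b^2-1)/(b^2+1)^2
assert abs(num_deriv(lambda t: 2*t / (t*t + 1), b) - (-2*(b*b - 1) / (b*b + 1)**2)) < 1e-6
# d/db b^2/(b^2+1) = 2b/(b^2+1)^2
assert abs(num_deriv(lambda t: t*t / (t*t + 1), b) - (2*b / (b*b + 1)**2)) < 1e-6
```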

Data Set Linear Translation

Second, some more formula manipulation: shifting statistical data sets. (Here we'll abuse notation slightly and drop the $f_i$'s.)

Let $u_i = x_i + r$ and $v_i = y_i + s$.

So $\Sigma u = \Sigma(x+r) = \Sigma x + rn$;
similarly, $\Sigma v = \Sigma y + sn$;
$\Sigma u^2 = \Sigma(x+r)^2 = \Sigma(x^2+2rx+r^2) = \Sigma x^2 + r \Sigma x + r^2 n$;
similarly, $\Sigma v^2 = \Sigma y^2 + s \Sigma y + s^2 n$;
$\Sigma uv = \Sigma(x+r)(y+s) = \Sigma(xy+ry+sx+rx)$
$= \Sigma xy + r \Sigma y + s \Sigma x + rs n$.

In particular, if we let $r = - \overline{x}$ and $s = - \overline{y}$, then:

$\Sigma u = \Sigma v = 0$;
$\Sigma u^2 = \dfrac{F_{xx}}{n}$;
$\Sigma v^2 = \dfrac{F_{yy}}{n}$;
$\Sigma uv = \dfrac{F_{xy}}{n}$.

So, also $\overline{u} = \overline{v} = 0$.
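The special case $r = -\overline{x}$, $s = -\overline{y}$ is easy to verify numerically. The helper below (my own) recomputes the shifted sums directly from raw points, which, again, the calculator never needs to do:

```python
def shifted_sums(points, r, s):
    # points: list of (f, x, y); sums over u = x + r, v = y + s
    su = sv = su2 = sv2 = suv = 0.0
    for f, x, y in points:
        u, v = x + r, y + s
        su  += f * u
        sv  += f * v
        su2 += f * u * u
        sv2 += f * v * v
        suv += f * u * v
    return su, sv, su2, sv2, suv
```

For the points $(1,1,2), (2,3,1), (1,5,4)$ we have $\overline{x}=3$, $\overline{y}=2$, $F_{xx}=32$, $F_{yy}=24$, $F_{xy}=16$ and $n=4$; shifting by $r=-3$, $s=-2$ gives $\Sigma u = \Sigma v = 0$, $\Sigma u^2 = 8$, $\Sigma v^2 = 6$, $\Sigma uv = 4$, as the formulae predict.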

Geometry: Distance of Point from Line through the Origin

Finally, some simple geometry.

Consider a point $\langle u_0,v_0 \rangle$, a line $V=bU$, and the locus of a point $\langle u,v \rangle$ on the line. So, $v=bu$.

The square of the distance $d^2$ between the points is given (using Pythagoras) by:

$d^2 = (u-u_0)^2 + (v-v_0)^2 = (u-u_0)^2 + (bu-v_0)^2$
$= (u^2 + u_0^2 - 2 u u_0) + (b^2 u^2 + v_0^2 - 2 b u v_0)$
$= (b^2+1) u^2 - 2 (u_0 + b v_0) u + (u_0^2 + v_0^2)$.

The distance (squared) between the point $\langle u_0,v_0 \rangle$ and the line may be found by varying the point on the line and finding the minimum. So, differentiate:

$0 = \dfrac{{\rm d} d^2}{{\rm d}u} = 2(b^2+1)u - 2(u_0+bv_0)$
so $u = \dfrac{u_0+b v_0}{b^2+1}$.

Thus, back-substituting, we find that
$d^2 = u_0^2 + v_0^2 - \dfrac{(u_0 + b v_0)^2}{b^2+1}$.
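A quick sketch of this result, checked against the standard point-to-line distance $(b u_0 - v_0)^2/(b^2+1)$:

```python
def dist2_to_line(u0, v0, b):
    # squared distance from (u0, v0) to the line v = b * u,
    # using the back-substituted formula above
    return u0 * u0 + v0 * v0 - (u0 + b * v0) ** 2 / (b * b + 1)
```

For example, the point $(3, 1)$ and the line $v = 2u$: both forms give $d^2 = 5$.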

Linear Regression to a Line

We’re now ready to progress to linear regression to a line.

Vertical Offsets

Consider data points $\langle F,U,V \rangle$ with the mean at the origin, and a line $v=bu$ through the origin. For what gradient $b_V$ is the sum $O^2_V$ of the squared vertical offsets minimised?

$O^2_V = \sum_i (v_i-v(u_i))^2 = \sum_i (v_i-b u_i)^2 = \Sigma v^2 - 2b \Sigma uv + b^2 \Sigma u^2$.

This is minimal when
$0 = \dfrac{{\rm d}O^2_V}{{\rm d}b} = -2 \Sigma uv + 2b \Sigma u^2$,
so $b_V = \dfrac{\Sigma uv}{\Sigma u^2}$.

Therefore, for general data points $\langle F,X,Y \rangle$, we have the best fit line $y=a + bx$ given by:
$b_V = \dfrac{F_{xy}/n}{F_{xx}/n} = \dfrac{F_{xy}}{F_{xx}}$
and $a_V = \overline{y} - b_V \overline{x}$.

A similar formula minimises the horizontal offsets.
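Putting the pieces together, a sketch of the vertical-offset fit from the six stored values (the function name is mine):

```python
def fit_vertical(n, sx, sx2, sy, sxy):
    # best-fit y = a + b*x minimising vertical offsets:
    # b = F_xy / F_xx, a = ybar - b * xbar
    fxx = n * sx2 - sx * sx
    fxy = n * sxy - sx * sy
    b = fxy / fxx
    a = sy / n - b * (sx / n)
    return a, b
```

For the exact line $y = 2x + 1$ sampled at $x = 0, 1, 2$: $n=3$, $\Sigma x = 3$, $\Sigma x^2 = 5$, $\Sigma y = 9$, $\Sigma xy = 13$, and the fit recovers $a = 1$, $b = 2$.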

Perpendicular Offsets

Now, for what gradient $b_{\bot}$ is the sum $O^2_{\bot}$ of the squared perpendicular offsets minimised? This is only appropriate when $X$ and $Y$ (and thus $U$ and $V$) are of like dimension or type.

$O^2_{\bot} = \sum_i d_i^2 = \sum_i u_i^2 + v_i^2 - \dfrac{(u_i +b v_i)^2}{b^2+1}$
$= \Sigma u^2 + \Sigma v^2 - \dfrac{1}{b^2+1} \sum_i ( u_i^2 + 2b u_i v_i + b^2 v_i^2 )$
$= \Sigma u^2 + \Sigma v^2 - \left( \dfrac{\Sigma u^2}{b^2+1} + \dfrac{2b \Sigma uv}{b^2+1} + \dfrac{b^2 \Sigma v^2}{b^2+1} \right)$.

This is minimal when:

$0 = \dfrac{{\rm d}O^2_{\bot}}{{\rm d}b}$
$= \dfrac{\rm d}{{\rm d}b} \left( \dfrac{1}{b^2+1} \right) \Sigma u^2 + \dfrac{\rm d}{{\rm d}b} \left( \dfrac{2b}{b^2+1} \right) \Sigma uv + \dfrac{\rm d}{{\rm d}b} \left( \dfrac{b^2}{b^2+1} \right) \Sigma v^2$
$= - \dfrac{2b}{(b^2+1)^2} \Sigma u^2 - \dfrac{2(b^2-1)}{(b^2+1)^2} \Sigma uv + \dfrac{2b}{(b^2+1)^2} \Sigma v^2$
$= - \dfrac{2}{(b^2+1)^2} \left( b \Sigma u^2 + (b^2-1) \Sigma uv - b \Sigma v^2 \right)$
iff $0 = \Sigma uv b^2 + (\Sigma u^2 - \Sigma v^2) b - \Sigma uv$.

Thus, $b_{\bot} = \dfrac{\Sigma v^2 - \Sigma u^2 \pm \sqrt{(\Sigma v^2 - \Sigma u^2)^2 + 4 (\Sigma uv)^2 }}{2 \Sigma uv}$.

This formula gives two values for $b_{\bot}$. One is $b_0$, and the other is $b_1 = \dfrac{-1}{b_0}$. One of the values gives the minimum and the other the maximum square of perpendicular offsets.

For general data points $\langle F,X,Y \rangle$, we have the best fit line $y=a + bx$ given by:

$h = \dfrac{F_{yy} - F_{xx}}{2 F_{xy}}$;
$b_{\bot} = h \pm \sqrt{h^2 + 1}$
and $a_{\bot} = \overline{y} - b_{\bot} \overline{x}$.

Conclusion

These values may all be calculated from the basic statistical values $n$, $\Sigma x$, $\Sigma x^2$, $\Sigma y$, $\Sigma y^2$ and $\Sigma xy$; storage of the data set itself is not required.