Calculator Statistics

I find it fascinating that calculators can perform some quite complex statistical analysis without storing the whole data set. Rather, they store just six values: n, \Sigma x, \Sigma x^2 [meaning \sum{(x^2)}, not (\sum{x})^2], \Sigma y, \Sigma y^2 and \Sigma xy.

Note that these are simple (atomic) symbols. On the other hand \sum_{i} x_i and \sum_{x \in X} x are formulae requiring the full details of the data set X.

Rather than state similar formulae twice, once for x and once for y, I'll use the variable z to stand for either x or y.

Initial Empty Data Set

The initial state (for the empty data set) has:

n = \Sigma x = \Sigma x^2 = \Sigma y = \Sigma y^2 = \Sigma xy = 0.

Data Set Updates

Given a new point \langle f,x,y \rangle where f is the frequency and/or weight, these values are updated thus:

n :+= f
\Sigma x :+= f x (i.e. f \times x)
\Sigma x^2 :+= f x^2
\Sigma y :+= f y
\Sigma y^2 :+= f y^2
\Sigma xy :+= f x y

To delete a point \langle f,x,y \rangle, simply add the point \langle -f,x,y \rangle.
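As a rough sketch, the update rules might be coded like this in Python. The class name Sums and its field names (sx, sxx and so on) are my own labels for illustration, not anything a real calculator exposes.

```python
from dataclasses import dataclass


@dataclass
class Sums:
    """The six running values (hypothetical names, chosen for this sketch)."""
    n: float = 0.0    # n
    sx: float = 0.0   # Sigma x
    sxx: float = 0.0  # Sigma x^2
    sy: float = 0.0   # Sigma y
    syy: float = 0.0  # Sigma y^2
    sxy: float = 0.0  # Sigma xy

    def add(self, x: float, y: float, f: float = 1.0) -> None:
        """Fold the point <f, x, y> into the running values."""
        self.n += f
        self.sx += f * x
        self.sxx += f * x * x
        self.sy += f * y
        self.syy += f * y * y
        self.sxy += f * x * y

    def remove(self, x: float, y: float, f: float = 1.0) -> None:
        """Delete a point by adding it again with negated frequency."""
        self.add(x, y, -f)


s = Sums()
s.add(1.0, 2.0)
s.add(3.0, 5.0, f=2.0)
s.remove(1.0, 2.0)  # only the doubled point <2, 3, 5> remains
print(s)
```

Note that the data points themselves are never kept; only the six totals survive each call.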


Mean

The mean values of x and y are given by:
\mu_z = \overline{z} = \dfrac{\sum_{i} f_i z_i}{\sum_{i} f_i} = \dfrac{\Sigma z}{n}.

Variance and SD

The variances are given by:
\sigma^2_z = {\rm var}(Z) = \dfrac{\sum_{i} f_i (z_i - \overline{z})^2}{\sum_{i} f_i} = \dfrac{n \Sigma z^2 - (\Sigma z)^2}{n^2}.

The first (fraction) formula here is the definition. The second is that formula algebraically rearranged to use just our basic values.

The population standard deviations are given by:
\sigma_z = {\rm sd}(Z) = \sqrt{\sigma^2_z}.
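Here is a minimal Python sketch of the mean, variance and population SD computed from n, \Sigma z and \Sigma z^2 alone. The function and argument names (sz, szz) are assumptions of mine for illustration.

```python
import math


def mean(n, sz):
    """Mean from n and Sigma z."""
    return sz / n


def variance(n, sz, szz):
    """Population variance from n, Sigma z and Sigma z^2."""
    return (n * szz - sz * sz) / (n * n)


def sd(n, sz, szz):
    """Population standard deviation."""
    return math.sqrt(variance(n, sz, szz))


data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n = len(data)
sz = sum(data)
szz = sum(z * z for z in data)

m = mean(n, sz)
v = variance(n, sz, szz)
s = sd(n, sz, szz)
print(m, v, s)
```

One caveat: in floating point, n \Sigma z^2 - (\Sigma z)^2 subtracts two nearly equal numbers when the mean is large compared with the spread, so this one-pass form can lose precision on such data.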

Some Intermediate Definitions

At this point it is useful to define some intermediate values that will crop up a few times:
F_{zz} = n^2 {\rm var} (Z) = \sum_i f_i \sum_i f_i (z_i - \overline{z})^2,
so F_{zz}  = n \Sigma z^2 - (\Sigma z)^2 ;
F_{xy} = n^2 {\rm covar} (X,Y) = \sum_i f_i \sum_i f_i (x_i - \overline{x})(y_i - \overline{y}),
so F_{xy} = n \Sigma xy - \Sigma x \Sigma y.

(Actually, to say that e.g. F_{zz} = n^2 {\rm var} (Z) is a little sloppy; that is really only true in the common case when \forall i \bullet f_i = 1, that is, all the f_i are equal to one. It’s also a little sloppy to write {\rm var} (Z) with no mention of F.)

Here’s one of the derivations:
F_{zz} = \sum_i f_i  \left[ \sum_i f_i (z_i - \overline{z})^2 \right]
= \sum_i f_i  \left[ \sum_i f_i (z_i^2 + \overline{z}^2 - 2\overline{z}z_i) \right]
= \sum_i f_i  \left[ \sum_i f_i z_i^2 + \overline{z}^2\sum_i f_i  - 2\overline{z}\sum_i f_i z_i \right]
= n \left[ \Sigma z^2 + n \left( \dfrac{\Sigma z}{n}\right)^2  - 2\dfrac{\Sigma z}{n}\Sigma z \right]
= n \Sigma z^2 + (\Sigma z)^2  - 2 (\Sigma z)^2
= n \Sigma z^2 - (\Sigma z)^2.
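The derivation can be spot-checked numerically. This is only a sanity check on one small, made-up weighted data set, not a proof; the variable names are mine.

```python
import math

f = [1.0, 2.0, 3.0]  # frequencies f_i (made-up data)
z = [1.0, 4.0, 2.0]  # values z_i (made-up data)

n = sum(f)
sz = sum(fi * zi for fi, zi in zip(f, z))
szz = sum(fi * zi * zi for fi, zi in zip(f, z))
zbar = sz / n

# F_zz from the definition: (sum of f_i) times (sum of f_i (z_i - zbar)^2)
from_definition = n * sum(fi * (zi - zbar) ** 2 for fi, zi in zip(f, z))

# F_zz from the basic values only
from_sums = n * szz - sz ** 2

print(from_definition, from_sums)
```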

Variance (again)

So now we have:
{\rm var}(Z) = \dfrac{F_{zz}}{n^2} and {\rm sd}(Z) = \dfrac{\sqrt{F_{zz}}}{n}.

RMS, Covariance, Correlation

We also have:
mean squares: \overline{z^2} = \dfrac{\Sigma z^2}{n};
root mean squares: {\rm rms}(z) = \sqrt{\dfrac{\Sigma z^2}{n}};
covariance: {\rm covar}(X,Y) = \dfrac{F_{xy}}{n^2};
correlation: {\rm correl}(X,Y) = \dfrac{{\rm covar}(X,Y)}{\sqrt{{\rm var}(X){\rm var}(Y)}} = \dfrac{F_{xy}}{\sqrt{F_{xx}F_{yy}}}.
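A small Python sketch of the covariance and correlation formulae, again working only from the six basic values. The function names and the sample data are my own, chosen for illustration.

```python
import math


def covariance(n, sx, sy, sxy):
    """covar(X, Y) = F_xy / n^2."""
    return (n * sxy - sx * sy) / (n * n)


def correlation(n, sx, sxx, sy, syy, sxy):
    """correl(X, Y) = F_xy / sqrt(F_xx F_yy); undefined if F_xx or F_yy is 0."""
    fxx = n * sxx - sx * sx
    fyy = n * syy - sy * sy
    fxy = n * sxy - sx * sy
    return fxy / math.sqrt(fxx * fyy)


xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
n = len(xs)
sx, sy = sum(xs), sum(ys)
sxx = sum(x * x for x in xs)
syy = sum(y * y for y in ys)
sxy = sum(x * y for x, y in zip(xs, ys))

cov = covariance(n, sx, sy, sxy)
cor = correlation(n, sx, sxx, sy, syy, sxy)
print(cov, cor)
```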

Some Preparatory Mathematics

The basic values, however, can be used to produce much more statistical information. To go further we need to do some more preparation.

Calculus: Differentiations

First, some calculus; some differentiations that we’ll need later:

\dfrac{\rm d}{{\rm d}b} \left( \dfrac{1}{b^2+1} \right) = - \dfrac{2b}{(b^2+1)^2};

\dfrac{\rm d}{{\rm d}b} \left( \dfrac{2b}{b^2+1} \right) = - \dfrac{2(b^2-1)}{(b^2+1)^2};

\dfrac{\rm d}{{\rm d}b} \left( \dfrac{b^2}{b^2+1} \right) = (+) \dfrac{2b}{(b^2+1)^2}.
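As a sketch of where these come from, the second one follows from the quotient rule; the other two go the same way.

```latex
\dfrac{\rm d}{{\rm d}b} \left( \dfrac{2b}{b^2+1} \right)
= \dfrac{2(b^2+1) - 2b \cdot 2b}{(b^2+1)^2}
= \dfrac{2 - 2b^2}{(b^2+1)^2}
= - \dfrac{2(b^2-1)}{(b^2+1)^2}.
```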

Data Set Linear Translation

Second, some more formula manipulations: shifting statistical data sets. (Here, we’ll abuse notation slightly, and drop the f_i's.)

Let u_i = x_i + r and v_i = y_i + s.

So \Sigma u = \Sigma(x+r) = \Sigma x + rn;
similarly, \Sigma v = \Sigma y + sn;
\Sigma u^2 = \Sigma(x+r)^2 = \Sigma(x^2+2rx+r^2) = \Sigma x^2 + 2r \Sigma x + r^2 n;
similarly, \Sigma v^2 = \Sigma y^2 + 2s \Sigma y + s^2 n;
\Sigma uv = \Sigma(x+r)(y+s) = \Sigma(xy+ry+sx+rs)
= \Sigma xy + r \Sigma y + s \Sigma x + rs n.

In particular, if we let r = - \overline{x} and s = - \overline{y}, then:

\Sigma u = \Sigma v = 0;
\Sigma u^2 = \dfrac{F_{xx}}{n};
\Sigma v^2 = \dfrac{F_{yy}}{n};
\Sigma uv = \dfrac{F_{xy}}{n}.

So, also \overline{u} = \overline{v} = 0.
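The shift to the mean can be checked numerically. The sketch below computes the shifted sums from the original sums, using the expansion (x+r)^2 = x^2 + 2rx + r^2, and compares them with sums taken directly over the shifted points. The data and variable names are made up for illustration.

```python
import math

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
n = len(xs)
sx, sy = sum(xs), sum(ys)
sxx = sum(x * x for x in xs)
syy = sum(y * y for y in ys)
sxy = sum(x * y for x, y in zip(xs, ys))

# Shift so that the mean moves to the origin
r = -sx / n
s = -sy / n

# Shifted sums, computed from the original sums only
su = sx + r * n
sv = sy + s * n
suu = sxx + 2 * r * sx + r * r * n
svv = syy + 2 * s * sy + s * s * n
suv = sxy + r * sy + s * sx + r * s * n

# The same sums, computed directly from the shifted points
us = [x + r for x in xs]
vs = [y + s for y in ys]
direct_suu = sum(u * u for u in us)
direct_svv = sum(v * v for v in vs)
direct_suv = sum(u * v for u, v in zip(us, vs))

print(su, sv, suu, svv, suv)
```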

Geometry: Distance of Point from Line through the Origin

Finally, some simple geometry.

Consider a point \langle u_0,v_0 \rangle, a line V=bU, and the locus of a point \langle u,v \rangle on the line. So, v=bu.

The square of the distance d^2 between the points is given (using Pythagoras) by:

d^2 = (u-u_0)^2 + (v-v_0)^2 = (u-u_0)^2 + (bu-v_0)^2
= (u^2 + u_0^2 - 2 u u_0) + (b^2 u^2 + v_0^2 - 2 b u v_0)
= (b^2+1) u^2 - 2 (u_0 + b v_0) u + (u_0^2 + v_0^2).

The distance (squared) between the point \langle u_0,v_0 \rangle and the line may be found by varying the locus of the point on the line and finding the minimum. So, differentiate:

0 = \dfrac{{\rm d} d^2}{{\rm d}u} = 2(b^2+1)u - 2(u_0+bv_0)
so u = \dfrac{u_0+b v_0}{b^2+1}.

Thus, back-substituting, we find that
d^2 = u_0^2 + v_0^2 - \dfrac{(u_0 + b v_0)^2}{b^2+1}.
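A quick Python sketch of this result, cross-checked against the familiar point-to-line distance formula |b u_0 - v_0| / \sqrt{b^2+1}. The function names are mine.

```python
import math


def dist2_to_line(u0, v0, b):
    """Squared distance from (u0, v0) to the line v = b*u, as derived above."""
    return u0 * u0 + v0 * v0 - (u0 + b * v0) ** 2 / (b * b + 1)


def dist2_standard(u0, v0, b):
    """Same quantity from the standard formula for the line b*u - v = 0."""
    return (b * u0 - v0) ** 2 / (b * b + 1)


d2 = dist2_to_line(3.0, 4.0, 1.0)
print(d2, dist2_standard(3.0, 4.0, 1.0))
```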

Linear Regression to a Line

We’re now ready to progress to linear regression to a line.

Vertical Offsets

Consider data points \langle F,U,V \rangle with the mean at the origin, and a line v=bu through the origin. For what gradient b_V does this line minimise the sum O^2_V of the squared vertical offsets?

O^2_V = \sum_i (v_i-v(u_i))^2 = \sum_i (v_i-b u_i)^2 = \Sigma v^2 - 2b \Sigma uv + b^2 \Sigma u^2.

This is minimal when
0 = \dfrac{{\rm d}O^2_V}{{\rm d}b} = -2 \Sigma uv + 2b \Sigma u^2,
so b_V = \dfrac{\Sigma uv}{\Sigma u^2}.

Therefore, for general data points \langle F,X,Y \rangle, we have the best fit line y=a + bx given by:
b_V = \dfrac{F_{xy}/n}{F_{xx}/n} = \dfrac{F_{xy}}{F_{xx}}
and a_V = \overline{y} - b_V \overline{x}.
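As a minimal sketch, the vertical-offset fit could be computed from the six basic values like this. The function name vertical_fit and the sample data are assumptions of mine.

```python
import math


def vertical_fit(n, sx, sxx, sy, sxy):
    """Return (a, b) for y = a + b*x minimising squared vertical offsets.

    Undefined when F_xx is 0, i.e. when all the x values coincide.
    """
    fxx = n * sxx - sx * sx
    fxy = n * sxy - sx * sy
    b = fxy / fxx
    a = sy / n - b * sx / n
    return a, b


xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
n = len(xs)
sx, sy = sum(xs), sum(ys)
sxx = sum(x * x for x in xs)
sxy = sum(x * y for x, y in zip(xs, ys))

a, b = vertical_fit(n, sx, sxx, sy, sxy)
print(a, b)
```

Note that \Sigma y^2 is not needed for this fit.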

A similar formula minimises the horizontal offsets.

Perpendicular Offsets

Now, for what gradient b_{\bot} does the line minimise the sum O^2_{\bot} of the squared perpendicular offsets? This is only appropriate when X and Y (and thus U and V) are of like dimension or type.

O^2_{\bot} = \sum_i d_i^2 = \sum_i \left[ u_i^2 + v_i^2 - \dfrac{(u_i + b v_i)^2}{b^2+1} \right]
= \Sigma u^2 + \Sigma v^2 - \dfrac{1}{b^2+1} \sum_i ( u_i^2 + 2b u_i v_i + b^2 v_i^2 )
= \Sigma u^2 + \Sigma v^2 - \left( \dfrac{\Sigma u^2}{b^2+1} + \dfrac{2b \Sigma uv}{b^2+1} + \dfrac{b^2 \Sigma v^2}{b^2+1} \right).

This is minimal when:

0 = \dfrac{{\rm d}O^2_{\bot}}{{\rm d}b}
= - \left[ \dfrac{\rm d}{{\rm d}b} \left( \dfrac{1}{b^2+1} \right) \Sigma u^2 + \dfrac{\rm d}{{\rm d}b} \left( \dfrac{2b}{b^2+1} \right) \Sigma uv + \dfrac{\rm d}{{\rm d}b} \left( \dfrac{b^2}{b^2+1} \right) \Sigma v^2 \right]
= \dfrac{2b}{(b^2+1)^2} \Sigma u^2 + \dfrac{2(b^2-1)}{(b^2+1)^2} \Sigma uv - \dfrac{2b}{(b^2+1)^2} \Sigma v^2
= \dfrac{2}{(b^2+1)^2} \left( b \Sigma u^2 + (b^2-1) \Sigma uv - b \Sigma v^2 \right)
iff 0 = \Sigma uv b^2 + (\Sigma u^2 - \Sigma v^2) b - \Sigma uv.

Thus, b_{\bot} = \dfrac{\Sigma v^2 - \Sigma u^2 \pm \sqrt{(\Sigma v^2 - \Sigma u^2)^2 + 4 (\Sigma uv)^2 }}{2 \Sigma uv}.

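This formula gives two values for b_{\bot}. If one is b_0, the other is b_1 = \dfrac{-1}{b_0}, since the product of the roots of the quadratic is -1; the two candidate lines are perpendicular to each other. The root with the same sign as \Sigma uv gives the minimum square of perpendicular offsets, and the other gives the maximum.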

For general data points \langle F,X,Y \rangle, we have the best fit line y=a + bx given by:

h = \dfrac{F_{yy} - F_{xx}}{2 F_{xy}};
b_{\bot} = h \pm \sqrt{h^2 + 1}
and a_{\bot} = \overline{y} - b_{\bot} \overline{x}.
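Here is a hedged Python sketch of the perpendicular-offset fit. The function name perpendicular_fit is mine, and so is the rule for picking the root: I take the root with the same sign as F_{xy}, which is what the second derivative suggests gives the minimum, so treat that choice as an assumption to check against your own data. When F_{xy} = 0 the formula for h is undefined, and the sketch simply refuses.

```python
import math


def perpendicular_fit(n, sx, sxx, sy, syy, sxy):
    """Return (a, b) for y = a + b*x minimising squared perpendicular offsets."""
    fxx = n * sxx - sx * sx
    fyy = n * syy - sy * sy
    fxy = n * sxy - sx * sy
    if fxy == 0:
        raise ValueError("F_xy is zero, so h is undefined")
    h = (fyy - fxx) / (2 * fxy)
    root = math.sqrt(h * h + 1)
    # Assumption: the minimising gradient has the same sign as F_xy.
    b = h + root if fxy > 0 else h - root
    a = sy / n - b * sx / n
    return a, b


xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
n = len(xs)
sx, sy = sum(xs), sum(ys)
sxx = sum(x * x for x in xs)
syy = sum(y * y for y in ys)
sxy = sum(x * y for x, y in zip(xs, ys))

a, b = perpendicular_fit(n, sx, sxx, sy, syy, sxy)
print(a, b)
```

On this made-up data the perpendicular fit comes out steeper than the vertical-offset fit for the same points (gradient about 0.72 rather than 0.6).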


These values may all be calculated from the basic statistical values n, \Sigma x, \Sigma x^2, \Sigma y, \Sigma y^2 and \Sigma xy; storage of the data set itself is not required.

