I find it fascinating that calculators can perform some quite complex statistical analysis without storing the whole data set. Rather, they store just six values: $n$, $\Sigma x$, $\Sigma x^2$ [meaning $\sum (x^2)$, not $(\sum x)^2$], $\Sigma y$, $\Sigma y^2$, and $\Sigma xy$.

Note that these are simple (atomic) symbols. On the other hand, $\sum_{i=1}^{n} x_i$ and $\sum_{i=1}^{n} x_i^2$ are formulae requiring the full details of the data set $x_1, x_2, \ldots, x_n$.

Rather than stating similar formulae separately for $x$ and $y$, the variable $v$ will be used for either $x$ or $y$.

### Initial Empty Data Set

The initial state (for the empty data set) has:

$n = \Sigma x = \Sigma x^2 = \Sigma y = \Sigma y^2 = \Sigma xy = 0$.

### Data Set Updates

Given a new point $(x, y, f)$, where $f$ is the frequency and/or weight, these values are updated thus:

$n \mathrel{+}= f$;

$\Sigma x \mathrel{+}= fx$ (i.e. $\Sigma x := \Sigma x + fx$);

$\Sigma x^2 \mathrel{+}= fx^2$;

$\Sigma y \mathrel{+}= fy$;

$\Sigma y^2 \mathrel{+}= fy^2$;

$\Sigma xy \mathrel{+}= fxy$.

To delete a point $(x, y, f)$, simply add the point $(x, y, -f)$.
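These updates are easy to sketch in code. Here is a minimal Python illustration (not from the original post; the names are mine) maintaining the six values with add and delete operations:

```python
# The six running values, as a calculator might store them
# (an illustrative sketch, not any particular calculator's firmware).
# state = [n, sum_x, sum_x2, sum_y, sum_y2, sum_xy]

def empty():
    """Initial state for the empty data set: all six values zero."""
    return [0.0] * 6

def add(state, x, y, f=1.0):
    """Add the point (x, y) with frequency/weight f."""
    state[0] += f          # n         += f
    state[1] += f * x      # Sigma x   += f x
    state[2] += f * x * x  # Sigma x^2 += f x^2
    state[3] += f * y      # Sigma y   += f y
    state[4] += f * y * y  # Sigma y^2 += f y^2
    state[5] += f * x * y  # Sigma xy  += f x y
    return state

def delete(state, x, y, f=1.0):
    """Delete a point by adding it again with frequency -f."""
    return add(state, x, y, -f)
```

Deleting a point this way restores the previous six values exactly (up to floating-point rounding).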

### Mean

The mean values of $x$ and $y$ are given by:

$\bar v = \dfrac{\Sigma v}{n}$.

### Variance and SD

The variances are given by:

$\sigma_v^2 = \dfrac{\sum f\,(v - \bar v)^2}{n} = \dfrac{\Sigma v^2}{n} - \left(\dfrac{\Sigma v}{n}\right)^2$.

The first (fraction) formula here is the *definition*. The second is that formula algebraically rearranged to use just our basic values.

The population standard deviations are given by:

$\sigma_v = \sqrt{\dfrac{\Sigma v^2}{n} - \left(\dfrac{\Sigma v}{n}\right)^2}$.
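As a quick sanity check, here is a Python sketch (data and names are illustrative) computing the mean, population variance, and SD from the running sums for a single variable, and confirming the rearranged variance formula against the definition:

```python
import math

# Sketch: mean, variance, and SD for one variable from the running sums
# n, Sigma v, Sigma v^2 (all frequencies f = 1 here; data is illustrative).
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n = len(data)
sum_v = sum(data)
sum_v2 = sum(v * v for v in data)

mean = sum_v / n                     # v_bar = Sigma v / n
var = sum_v2 / n - (sum_v / n) ** 2  # rearranged variance formula
sd = math.sqrt(var)                  # population SD

# The definition form, which needs the full data set, agrees:
var_def = sum((v - mean) ** 2 for v in data) / n
assert math.isclose(var, var_def)
# For this data set: mean = 5.0, var = 4.0, sd = 2.0.
```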

### Some Intermediate Definitions

At this point it is useful to define some intermediate values that will crop up a few times:

$S_{vv} = \Sigma v^2 - \dfrac{(\Sigma v)^2}{n}$,

so $S_{vv} = \sum (v - \bar v)^2$;

$S_{xy} = \Sigma xy - \dfrac{\Sigma x\,\Sigma y}{n}$,

so $S_{xy} = \sum (x - \bar x)(y - \bar y)$.

(Actually, to say that e.g. $S_{vv} = \sum (v - \bar v)^2$ is a little sloppy; that is really only true in the common case when $f \equiv 1$, that is, all the frequencies are equal to one. It’s also a little sloppy to write $\sum (v - \bar v)^2$ with no mention of the index $i$.)

Here’s one of the derivations:

$S_{vv} = \sum f\,(v - \bar v)^2 = \sum f v^2 - 2\bar v \sum f v + \bar v^2 \sum f = \Sigma v^2 - 2\,\dfrac{\Sigma v}{n}\,\Sigma v + \left(\dfrac{\Sigma v}{n}\right)^2 n = \Sigma v^2 - \dfrac{(\Sigma v)^2}{n}$.
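The identity can also be checked numerically. A small Python sketch (the weighted data set is illustrative):

```python
# Numeric check of the derivation:
#   S_vv = Sigma v^2 - (Sigma v)^2 / n  equals  sum of f (v - v_bar)^2,
# on a small weighted data set (values are illustrative).
pts = [(1.0, 2.0), (4.0, 1.0), (5.0, 3.0)]  # (value v, frequency f)

n = sum(f for v, f in pts)
sum_v = sum(f * v for v, f in pts)
sum_v2 = sum(f * v * v for v, f in pts)
v_bar = sum_v / n

S_vv = sum_v2 - sum_v ** 2 / n
S_vv_def = sum(f * (v - v_bar) ** 2 for v, f in pts)

assert abs(S_vv - S_vv_def) < 1e-12
```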

### Variance (again)

So now we have:

$\sigma_v^2 = \dfrac{S_{vv}}{n}$, that is, $\sigma_x^2 = \dfrac{S_{xx}}{n}$ and $\sigma_y^2 = \dfrac{S_{yy}}{n}$.

### RMS, Covariance, Correlation

We also have:

mean squares: $\overline{v^2} = \dfrac{\Sigma v^2}{n}$;

root mean squares: $v_{\mathrm{rms}} = \sqrt{\dfrac{\Sigma v^2}{n}}$;

covariance: $\operatorname{cov}(x, y) = \dfrac{\sum f\,(x - \bar x)(y - \bar y)}{n} = \dfrac{S_{xy}}{n}$;

correlation: $r = \dfrac{\operatorname{cov}(x, y)}{\sigma_x \sigma_y} = \dfrac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}}$.
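A Python sketch of the covariance and correlation computed from the six sums, checked against the deviation-from-the-means definition (the data points are illustrative):

```python
import math

# Sketch: covariance and correlation from the six sums (all f = 1;
# the data points are illustrative).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

n = float(len(xs))
sum_x, sum_y = sum(xs), sum(ys)
sum_x2 = sum(x * x for x in xs)
sum_y2 = sum(y * y for y in ys)
sum_xy = sum(x * y for x, y in zip(xs, ys))

S_xx = sum_x2 - sum_x ** 2 / n
S_yy = sum_y2 - sum_y ** 2 / n
S_xy = sum_xy - sum_x * sum_y / n

cov = S_xy / n                     # covariance
r = S_xy / math.sqrt(S_xx * S_yy)  # correlation

# Agrees with the definition via deviations from the means:
x_bar, y_bar = sum_x / n, sum_y / n
cov_def = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / n
assert math.isclose(cov, cov_def)
```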

### Some Preparatory Mathematics

The basic values, however, can be used to produce much more statistical information. To go further we need to do some more preparation.

#### Calculus: Differentiations

First, some calculus; some differentiations that we’ll need later:

For constants $A$, $B$ and $C$:

$\dfrac{d}{dm}\left(A - 2Bm + Cm^2\right) = -2B + 2Cm$;

$\dfrac{d}{dm}\left(1 + m^2\right) = 2m$;

$\dfrac{d}{dm}\left(\dfrac{A - 2Bm + Cm^2}{1 + m^2}\right) = \dfrac{(-2B + 2Cm)(1 + m^2) - 2m\left(A - 2Bm + Cm^2\right)}{(1 + m^2)^2} = \dfrac{2\left(Bm^2 + (C - A)m - B\right)}{(1 + m^2)^2}$.
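The quotient-rule derivative, which is the one needed for the perpendicular-offset minimisation, can be verified by a finite-difference check in Python (the values of $A$, $B$, $C$ below are arbitrary illustrative constants):

```python
# Finite-difference check of the quotient-rule derivative
#   d/dm [(A - 2Bm + Cm^2) / (1 + m^2)]
#     = 2 (B m^2 + (C - A) m - B) / (1 + m^2)^2,
# with arbitrary illustrative values of A, B, C.
A, B, C = 3.0, 1.5, 2.0

def E(m):
    return (A - 2 * B * m + C * m * m) / (1 + m * m)

def dE(m):
    return 2 * (B * m * m + (C - A) * m - B) / (1 + m * m) ** 2

h = 1e-6
for m in (-2.0, -0.5, 0.0, 1.0, 3.0):
    numeric = (E(m + h) - E(m - h)) / (2 * h)  # central difference
    assert abs(numeric - dE(m)) < 1e-6
```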

#### Data Set Linear Translation

Second, some more formula manipulations: shifting statistical data sets. (Here, we’ll abuse notation slightly, and drop the $f$’s.)

Let $x' = x - a$ and $y' = y - b$.

So $\Sigma x' = \sum (x - a) = \Sigma x - na$;

similarly, $\Sigma y' = \Sigma y - nb$;

$\Sigma x'^2 = \sum (x - a)^2 = \Sigma x^2 - 2a\,\Sigma x + na^2$;

similarly, $\Sigma y'^2 = \Sigma y^2 - 2b\,\Sigma y + nb^2$;

$\Sigma x'y' = \sum (x - a)(y - b) = \Sigma xy - b\,\Sigma x - a\,\Sigma y + nab$.

In particular, if we let $a = \bar x$ and $b = \bar y$, then:

$\Sigma x' = \Sigma x - n\bar x = 0$;

$\Sigma y' = \Sigma y - n\bar y = 0$;

$\Sigma x'^2 = \Sigma x^2 - \dfrac{(\Sigma x)^2}{n} = S_{xx}$, and similarly $\Sigma y'^2 = S_{yy}$;

$\Sigma x'y' = \Sigma xy - \dfrac{\Sigma x\,\Sigma y}{n} = S_{xy}$.

So, also, $\bar{x'} = \bar{y'} = 0$: the translated data set has its mean at the origin.
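These translation results can be confirmed numerically. A Python sketch (the data points are illustrative):

```python
# Check the translation results: shifting by the means sends
# Sigma x', Sigma y' to 0 and Sigma x'^2, Sigma y'^2, Sigma x'y'
# to S_xx, S_yy, S_xy (data points are illustrative).
xs = [1.0, 2.0, 4.0, 7.0]
ys = [2.0, 3.0, 3.0, 8.0]
n = float(len(xs))

sum_x, sum_y = sum(xs), sum(ys)
S_xx = sum(x * x for x in xs) - sum_x ** 2 / n
S_yy = sum(y * y for y in ys) - sum_y ** 2 / n
S_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n

a, b = sum_x / n, sum_y / n  # a = x_bar, b = y_bar
xp = [x - a for x in xs]     # x' = x - a
yp = [y - b for y in ys]     # y' = y - b

assert abs(sum(xp)) < 1e-9 and abs(sum(yp)) < 1e-9
assert abs(sum(x * x for x in xp) - S_xx) < 1e-9
assert abs(sum(y * y for y in yp) - S_yy) < 1e-9
assert abs(sum(x * y for x, y in zip(xp, yp)) - S_xy) < 1e-9
```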

#### Geometry: Distance of Point from Line through the Origin

Finally, some simple geometry.

Consider a point $P = (x, y)$, a line $y = mx$, and the locus of a point $Q$ on the line. So, $Q = (t, mt)$ for some parameter $t$.

The square of the distance between the points is given (using Pythagoras) by:

$|PQ|^2 = (x - t)^2 + (y - mt)^2$.

The distance (squared) between the point and the line may be found by varying the locus of the point on the line and finding the minimum. So, differentiate:

$\dfrac{d\,|PQ|^2}{dt} = -2(x - t) - 2m(y - mt) = 0$, so $t = \dfrac{x + my}{1 + m^2}$.

Thus, back-substituting, we find that

$d^2 = \dfrac{(y - mx)^2}{1 + m^2}$.
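The distance formula can be cross-checked against a brute-force minimisation over the locus parameter. A Python sketch (the point and gradient are illustrative):

```python
# Check d^2 = (y - m x)^2 / (1 + m^2) against a brute-force minimum
# of (x - t)^2 + (y - m t)^2 over a fine grid of locus points (t, m t).
x, y, m = 3.0, 4.0, 0.5  # illustrative point and gradient

d2_formula = (y - m * x) ** 2 / (1 + m * m)

t_best = (x + m * y) / (1 + m * m)  # the minimising t from the text
d2_at_best = (x - t_best) ** 2 + (y - m * t_best) ** 2

d2_grid = min((x - t) ** 2 + (y - m * t) ** 2
              for t in (i / 1000.0 for i in range(-10000, 10001)))

assert abs(d2_formula - d2_at_best) < 1e-12
assert d2_grid >= d2_formula - 1e-12
assert d2_grid - d2_formula < 1e-4
```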

### Linear Regression to a Line

We’re now ready to progress to linear regression to a line.

#### Vertical Offsets

Consider data points $(x', y')$ with the mean at the origin, and a line $y = mx$ through the origin. For what gradient $m$ is the sum of the squares of the *vertical* offsets minimised?

$E(m) = \sum f\,(y' - mx')^2 = \Sigma y'^2 - 2m\,\Sigma x'y' + m^2\,\Sigma x'^2 = S_{yy} - 2S_{xy}m + S_{xx}m^2$.

This is minimal when

$\dfrac{dE}{dm} = -2S_{xy} + 2S_{xx}m = 0$,

so $m = \dfrac{S_{xy}}{S_{xx}}$.

Therefore, for general data points $(x, y)$, we have the best fit line given by:

$m = \dfrac{S_{xy}}{S_{xx}}$ and $y - \bar y = m(x - \bar x)$.

A similar formula minimises the *horizontal* offsets.
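Here is a Python sketch of the vertical-offset fit computed from the sums, checked by confirming that the resulting gradient beats nearby gradients (the data points are illustrative):

```python
# Sketch: least-squares line (vertical offsets) from the sums:
#   m = S_xy / S_xx,   line: y - y_bar = m (x - x_bar)
# (data points are illustrative).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 4.0, 8.0]
n = float(len(xs))

sum_x, sum_y = sum(xs), sum(ys)
S_xx = sum(x * x for x in xs) - sum_x ** 2 / n
S_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n

x_bar, y_bar = sum_x / n, sum_y / n
m = S_xy / S_xx
c = y_bar - m * x_bar  # intercept of y = m x + c

def sq_err(slope):
    """Sum of squared vertical offsets from the line through the means."""
    return sum((y - y_bar - slope * (x - x_bar)) ** 2 for x, y in zip(xs, ys))

# m should beat nearby slopes:
assert sq_err(m) <= sq_err(m + 0.01) and sq_err(m) <= sq_err(m - 0.01)
```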

#### Perpendicular Offsets

Now, for what gradient $m$ is the sum of the squares of the *perpendicular* offsets minimised? This is only appropriate when $x$ and $y$ (and thus $x'$ and $y'$) are of like dimension or type.

$E(m) = \sum f\,\dfrac{(y' - mx')^2}{1 + m^2} = \dfrac{S_{yy} - 2S_{xy}m + S_{xx}m^2}{1 + m^2}$.

This is minimal when:

$\dfrac{dE}{dm} = \dfrac{2\left(S_{xy}m^2 + (S_{xx} - S_{yy})m - S_{xy}\right)}{(1 + m^2)^2} = 0$, iff $S_{xy}m^2 + (S_{xx} - S_{yy})m - S_{xy} = 0$.

Thus, $m = \dfrac{(S_{yy} - S_{xx}) \pm \sqrt{(S_{yy} - S_{xx})^2 + 4S_{xy}^2}}{2S_{xy}}$.

This formula gives two values for $m$. One is $m_+$, taking the positive square root, and the other is $m_-$, taking the negative square root. One of the values gives the minimum and the other the maximum square of perpendicular offsets; since $m_+ m_- = -1$, the two candidate lines are perpendicular to each other.

For general data points $(x, y)$, we have the best fit line given by:

$m = \dfrac{(S_{yy} - S_{xx}) + \sqrt{(S_{yy} - S_{xx})^2 + 4S_{xy}^2}}{2S_{xy}}$, taking the root with the same sign as $S_{xy}$, which gives the minimum;

and $y - \bar y = m(x - \bar x)$.
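A Python sketch of the perpendicular-offset fit from the sums, confirming that the two roots give perpendicular lines and that the chosen root is the minimiser (the data points are illustrative):

```python
import math

# Sketch: perpendicular-offset (orthogonal) best-fit gradient from the sums
# (data points are illustrative).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 4.0, 8.0]
n = float(len(xs))

sum_x, sum_y = sum(xs), sum(ys)
S_xx = sum(x * x for x in xs) - sum_x ** 2 / n
S_yy = sum(y * y for y in ys) - sum_y ** 2 / n
S_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n

disc = math.sqrt((S_yy - S_xx) ** 2 + 4 * S_xy ** 2)
m_min = ((S_yy - S_xx) + disc) / (2 * S_xy)  # minimises perpendicular offsets
m_max = ((S_yy - S_xx) - disc) / (2 * S_xy)  # maximises them

def perp_err(m):
    """Sum of squared perpendicular offsets, via the S values."""
    return (S_yy - 2 * m * S_xy + m * m * S_xx) / (1 + m * m)

# The two candidate lines are perpendicular to each other...
assert abs(m_min * m_max + 1) < 1e-9
# ...and the '+' root is the minimiser here.
assert perp_err(m_min) < perp_err(m_max)
```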

### Conclusion

These values may all be calculated from the basic statistical values $n$, $\Sigma x$, $\Sigma x^2$, $\Sigma y$, $\Sigma y^2$, and $\Sigma xy$; storage of the data set itself is not required.

Saturday, 2 January 2016 at 22:48

See also B. Doyle’s

http://aperiodicity.com/2015/12/27/the-method-of-least-squares/