We'll never know exactly what the true value is for the average length of grass on the library lawn. The only way to know would be to go through all the blades of grass and measure every one. Same with the average height of people: we can't measure everyone on the planet. But we can come up with estimates of these quantities.
For example, we can pick 100 blades of grass and get an average of 6.1". We could do it again with different blades and get 5.9". Suppose we did that a million times and averaged all the results. We'd better get the true value (6"). If we don't, we're in trouble: we'd call our estimate "biased". Is it biased?
Let's use a bar to denote a finite average:
$$\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i$$
So I'm saying that if we take the expectation value of the sample mean, we'd better get the true average $\mu$. Let's see if that works:
$$\langle \bar{x} \rangle = \frac{1}{N}\sum_{i=1}^{N} \langle x_i \rangle = \frac{1}{N}\, N\mu = \mu$$
Good: the sample mean is an unbiased estimate of the true average.
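This is easy to check numerically. Here is a minimal sketch (all numbers hypothetical): we build a large fake "lawn" whose blade lengths average about 6.0", repeat the pick-100-blades experiment many times, and average the sample means.

```python
import random

# Hypothetical population: 100,000 grass blades with true average length
# near 6.0 inches (the 0.5" spread is an arbitrary choice for illustration).
random.seed(0)
population = [random.gauss(6.0, 0.5) for _ in range(100_000)]

# Repeat the "pick 100 blades and average them" experiment many times.
sample_means = []
for _ in range(10_000):
    sample = random.sample(population, 100)
    sample_means.append(sum(sample) / len(sample))

# Averaging the sample means should land very close to the true average,
# illustrating that the sample mean is an unbiased estimator.
grand_mean = sum(sample_means) / len(sample_means)
print(grand_mean)
```

Any single sample mean wanders around (6.1", 5.9", ...), but their grand average settles onto the population's true average.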
Now what's a good estimate of the variance? The definition in eqn. 1.39 can be said in words, though it's a mouthful: the average (expectation value) of the square of the differences of data values from their mean. That definition relies on something that you'd never really be able to measure. You're averaging over all members of your population. So since we can't measure all the blades of grass, what's a good measure using only 100 of them?
The first guess you'd have is to take
$$s^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \bar{x})^2$$
The culprit in this is that your estimate $\bar{x}$ for the true average is itself a little off, and this messes up your answer. Let's illustrate with $N = 2$. And suppose the true variance is $\sigma^2$ and the true average is 0. This simplifies the complete calculation because $\langle x_i \rangle = 0$ and $\langle x_i^2 \rangle = \sigma^2$, and the above formula is
$$s^2 = \frac{1}{2}\left[(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2\right], \qquad \bar{x} = \frac{x_1 + x_2}{2},$$
which works out to
$$s^2 = \frac{(x_1 - x_2)^2}{4} = \frac{x_1^2 + x_2^2 - 2 x_1 x_2}{4}.$$
If we now take the expectation value of this, the final "cross term" drops out, since $\langle x_1 x_2 \rangle = \langle x_1 \rangle \langle x_2 \rangle = 0$ assuming independence of the data. So we get
$$\langle s^2 \rangle = \frac{\sigma^2 + \sigma^2}{4} = \frac{\sigma^2}{2},$$
which is not the true variance $\sigma^2$: our naive estimate is biased. The fix is to divide by $N - 1$ instead of $N$,
$$s^2 = \frac{1}{N-1}\sum_{i=1}^{N} (x_i - \bar{x})^2,$$
which satisfies $\langle s^2 \rangle = \sigma^2$. If $N$ is large, then $\frac{N-1}{N} \approx 1$ and dividing by $N$ versus $N-1$ hardly matters, that's true. But if $N$ is small, which it sometimes is, then this makes a pretty big difference.
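The bias is easy to see in a simulation. A minimal sketch (hypothetical setup): draw many pairs ($N = 2$) from a distribution with true mean 0 and true variance 1, then compare the naive divide-by-$N$ estimate against the divide-by-$(N-1)$ estimate.

```python
import random

# Hypothetical setup: standard normal data, so true mean = 0, true variance = 1.
random.seed(1)
naive, corrected = [], []
for _ in range(200_000):
    x1, x2 = random.gauss(0, 1), random.gauss(0, 1)
    xbar = (x1 + x2) / 2
    ss = (x1 - xbar) ** 2 + (x2 - xbar) ** 2
    naive.append(ss / 2)      # divide by N: biased
    corrected.append(ss / 1)  # divide by N - 1: unbiased

# With N = 2 the naive estimate averages to about sigma^2 / 2,
# while the corrected estimate averages to about sigma^2.
mean_naive = sum(naive) / len(naive)
mean_corrected = sum(corrected) / len(corrected)
print(mean_naive, mean_corrected)
```

With $N = 2$ the naive estimator recovers only about half the true variance, exactly the $\frac{N-1}{N}$ factor above, while the corrected one averages to the true value.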
So in summary, we now know how to estimate the average and variance of a population from a finite number of data points. Now we'll get to the harder part: how to determine whether two populations are the same or different.