Correlations

Suppose you roll two dice. Their outcomes are independent. You can also say that the outcomes are "uncorrelated". At the other extreme, the event that the first die lands 6 and the event that the total is 12 are very highly correlated. How do you quantify correlation?

Suppose you have two variables, say $x$ and $y$, and they represent the outcomes of throwing the first and second die respectively. Then you know $\langle xy \rangle = \langle x\rangle \langle y \rangle$, because of independence. In this case you'd also say their correlation was zero. That suggests that a good measure for correlation is the covariance:
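
As a quick numerical check (a minimal Python sketch using only the standard library; the variable names here are my own choices), you can simulate many rolls and verify that $\langle xy\rangle \approx \langle x\rangle \langle y\rangle$ for independent dice:

\begin{verbatim}
import random

random.seed(0)            # reproducible rolls
N = 100_000

# N independent rolls of each die
x = [random.randint(1, 6) for _ in range(N)]
y = [random.randint(1, 6) for _ in range(N)]

def mean(v):
    return sum(v) / len(v)

# For independent dice these two numbers should agree up to
# sampling noise; each is close to 3.5 * 3.5 = 12.25.
print(mean([a * b for a, b in zip(x, y)]))   # <xy>
print(mean(x) * mean(y))                     # <x><y>
\end{verbatim}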

\begin{displaymath}
C(x,y) \equiv \langle xy\rangle - \langle x\rangle \langle y\rangle
\end{displaymath} (1.58)

It's easy to show that this is the same as
\begin{displaymath}
C(x,y) = \langle (x - \langle x\rangle) (y - \langle y\rangle)\rangle
\end{displaymath} (1.59)
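
To see why, expand the product inside the average, using the fact that averaging is linear and that $\langle x\rangle$ and $\langle y\rangle$ are just constants:
\begin{displaymath}
\langle (x - \langle x\rangle)(y - \langle y\rangle)\rangle = \langle xy\rangle - \langle x\rangle\langle y\rangle - \langle y\rangle\langle x\rangle + \langle x\rangle\langle y\rangle = \langle xy\rangle - \langle x\rangle \langle y\rangle
\end{displaymath}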

How do we calculate these in practice for $N$ data points $x_1,x_2,\dots,x_N$ and $y_1,y_2,\dots,y_N$? We use the definition of averaging given earlier in eqn. 1.13. Written this way, it's a bit less elegant:
\begin{displaymath}
C(x,y) = {1\over N}\sum_i^N (x_i - \langle x\rangle) (y_i - \langle y\rangle)
\end{displaymath} (1.60)

with
\begin{displaymath}
\langle x\rangle = {1\over N} \sum_i^N x_i
\end{displaymath} (1.61)

and
\begin{displaymath}
\langle y\rangle = {1\over N} \sum_i^N y_i
\end{displaymath} (1.62)
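
In code, eqns. 1.60-1.62 translate directly (a minimal Python sketch; the function name cov is my own choice, not a standard library routine):

\begin{verbatim}
def cov(x, y):
    """Covariance per eqn. 1.60, with means per eqns. 1.61-1.62."""
    N = len(x)
    x_mean = sum(x) / N                      # eqn. 1.61
    y_mean = sum(y) / N                      # eqn. 1.62
    return sum((a - x_mean) * (b - y_mean)
               for a, b in zip(x, y)) / N    # eqn. 1.60
\end{verbatim}

One caveat worth knowing: library routines such as NumPy's np.cov divide by $N-1$ by default rather than $N$, so for small samples their output will differ slightly from this definition.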

However, this is "unnormalized". You might want it to be 1 if the variables were completely correlated, for example, $x=y$. In the case $x=y$, we have $C(x,x) = \langle x^2\rangle - \langle x\rangle^2$, which is the variance of $x$, $Var(x)$. Often we want a "normalized" definition; dividing by the appropriate factor gives the "correlation coefficient":

\begin{displaymath}
R(x,y) \equiv {C(x,y)\over \sqrt{Var(x)Var(y)}} = {\langle (x - \langle x\rangle)(y - \langle y\rangle)\rangle \over \sqrt{\langle (x - \langle x\rangle)^2\rangle \langle (y - \langle y\rangle)^2\rangle}}
\end{displaymath} (1.63)

So with two completely uncorrelated variables, you get a correlation of 0. If they're perfectly correlated, you get 1. If they're perfectly anti-correlated, you get -1.
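
Building on the cov sketch above, eqn. 1.63 is one more line of code, and a few toy inputs confirm the cases just described (again a sketch; corrcoef is my own name for the function, though NumPy's np.corrcoef computes the same quantity):

\begin{verbatim}
import math

def corrcoef(x, y):
    """Correlation coefficient per eqn. 1.63."""
    return cov(x, y) / math.sqrt(cov(x, x) * cov(y, y))

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
print(corrcoef(xs, xs))                  # perfectly correlated:       1.0
print(corrcoef(xs, [-v for v in xs]))    # perfectly anti-correlated: -1.0
\end{verbatim}

Applied to the simulated dice from earlier, corrcoef(x, y) comes out near 0, while the first die against the total, corrcoef(x, [a + b for a, b in zip(x, y)]), comes out near $1/\sqrt{2} \approx 0.71$: correlated, but not perfectly.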

Let's take an example. Take $x$ to be the weight of Swedish men, and $y$ to be their height. You'd expect there to be a correlation between the two variables, because there aren't going to be many 7 foot tall men weighing 110 lbs. But there are probably some 5 foot 2 inch men weighing that much. This can be represented graphically. You take 20 Swedish men and plot weight versus height for each one (this isn't real data):

\begin{figure}\begin{center}
\epsfig{width=.4\textwidth,file=WeightHeight.eps}
\end{center}\end{figure}
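
To generate synthetic numbers like the figure's and check their correlation coefficient (hedged: this is invented data, just as in the figure, and the parameter choices are mine), NumPy makes it a couple of lines:

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

# 20 made-up Swedish men: heights in cm, weights in kg loosely
# tracking height plus independent scatter.
height = rng.normal(180, 7, size=20)
weight = 80 + 0.9 * (height - 180) + rng.normal(0, 6, size=20)

# Off-diagonal entry of the 2x2 correlation matrix: clearly
# positive, but well below 1.
print(np.corrcoef(height, weight)[0, 1])
\end{verbatim}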

In this case there is a correlation between the two, but it's far from perfect. You'd expect the $R$ for this data to be somewhat less than 1. But let's now take $x$ to be the time since a lawn was last mowed, and $y$ to be the height of the grass.

\begin{figure}\begin{center}
\epsfig{width=.4\textwidth,file=TimeHeight.eps}
\end{center}\end{figure}

In this case the two are very highly correlated, and you expect a correlation coefficient close to 1.

You can see a lot of examples of different correlation coefficients here.

Note that the choice of variables $x$ and $y$ in the above was arbitrary; you could have called them $s$ and $t$, or elephant and daisy, instead.


