The point of averages and the two numbers SDX and SDY give us some information about a scatterplot, but they do not tell us the extent of the association between the variables. The correlation coefficient r is a quantitative measure of association: it tells us whether the scatterplot tilts up or down, and how tightly the data cluster around a straight line. In this chapter we study the correlation coefficient, and when it can be used with the point of averages, SDX, and SDY to summarize scatterplots.
The figure is a scatterplot of the GMAT data, which were introduced in Chapter 3, "Multivariate Data." It has a new number below the scatterplot: the correlation coefficient r. When you first load this page, the figure should be a scatterplot of Quantitative GMAT score versus Verbal GMAT score. If not, please select those variables from the drop-down menus at the top of the figure. Then the figure should show r = 0.35. Quantitative GMAT and Verbal GMAT are positively associated: Students with above average Quantitative GMAT scores tend also to have above average Verbal GMAT scores, and students with below average Quantitative GMAT scores tend to have below average Verbal GMAT scores. The value of r is positive when variables are positively associated; the value of r is negative when variables are negatively associated. The value of r is always between -1 and +1.
Sometimes we shall use subscripts to clarify which correlation coefficient we are talking about: The symbol rXY denotes the correlation coefficient for X and Y. The correlation coefficient for a scatterplot of Y versus X is always equal to the correlation coefficient for a scatterplot of X versus Y (rXY = rYX). See for yourself: Change the figure to show the scatterplot of Verbal GMAT versus Quantitative GMAT. You should see r = 0.35 again.
Look at scatterplots of different pairs of variables in the GMAT data (ignore the School variable). There are six pairs of variables:
The following exercise helps train your eye to see small differences in correlation in scatterplots.
Correlation is a measure of linear association: how nearly a scatterplot follows a straight line. Two variables are positively correlated if the scatterplot slopes upwards (r > 0); they are negatively correlated if the scatterplot slopes downward (r < 0). Note that linear association is not the only kind of association: Some variables are nonlinearly associated (discussed later in this chapter). For example, the average monthly rainfall in Berkeley, CA, is associated with the month of the year, but that association is nonlinear: It is a seasonal variation that cycles annually. Correlation does not measure nonlinear association, only linear association. The correlation coefficient is appropriate only for quantitative variables, not ordinal or categorical variables, even if their values are numerical.
Correlation is a measure of association, not causation. For example, the average height of people at maturity in the United States has been increasing for decades. Similarly, there is evidence that the number of plant species is decreasing with time. These two variables have a negative correlation, but there is no (straightforward) causal connection between them. A secular trend in both manifests as a correlation between them.
The correlation coefficient r is close to 1 if the data cluster tightly around a straight line that slopes up from left to right. The correlation coefficient is close to -1 if the data cluster tightly around a straight line that slopes down from left to right. If the data do not cluster around a straight line, the correlation coefficient r is close to zero, even if the variables have a strong nonlinear association. The figure lets you make scatterplots with specific values of the correlation coefficient r, and specific numbers of data n. Note that r is undefined if n is less than two.
The following exercises check your knowledge of basic facts about r, and your ability to gauge r by eye.
Some scatterplots have curved patterns. Such scatterplots are said to show nonlinear association between the two variables. The correlation coefficient does not reflect nonlinear relationships between variables, only linear ones. For example, even if the association is quite strong, if it is nonlinear, the correlation coefficient r can be small or zero.
In the figure, the scatter in Y for a given value of X is very small, so the association is strong. In fact, there is a deterministic relationship between the two variables: Y = sin(X). (The plot is half a period of the sine function.) Even though the association is perfect (one can predict Y exactly from X), the correlation coefficient r is exactly zero. This is because the association is purely nonlinear. The correlation coefficient measures whether there is a trend in the data, and what fraction of the scatter in the data is accounted for by the trend.
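The claim that r is zero for half a period of the sine function can be checked numerically. The following is a minimal sketch, assuming evenly spaced values of X between 0 and π (the spacing and the number of points are illustrative choices, not taken from the figure) and assuming the SD is computed by dividing by n. The helper name `correlation` is hypothetical.

```python
import math
from statistics import mean, pstdev


def correlation(x, y):
    """Average of the products of x and y after converting to standard units."""
    mx, my = mean(x), mean(y)
    sx, sy = pstdev(x), pstdev(y)  # assumption: SD divides by n, not n - 1
    return mean([((a - mx) / sx) * ((b - my) / sy) for a, b in zip(x, y)])


n = 101  # illustrative number of points

# Half a period of the sine function: X runs from 0 to pi.
half_x = [math.pi * i / (n - 1) for i in range(n)]
half_y = [math.sin(v) for v in half_x]
r_half = correlation(half_x, half_y)

# For comparison, only the rising part of the curve: X runs from 0 to pi/2.
quarter_x = half_x[: n // 2 + 1]
quarter_y = half_y[: n // 2 + 1]
r_quarter = correlation(quarter_x, quarter_y)

print(f"half period:    r = {r_half:.6f}")
print(f"quarter period: r = {r_quarter:.6f}")
```

Over the half period the rising and falling halves of the curve cancel, so r is zero up to rounding error, even though Y is determined exactly by X. Over the rising part alone there is a clear upward trend, so r is large there.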
In the next figure, the correlation coefficient is reasonably large (0.71), because there is an overall trend in the data. However, the correlation coefficient still does not show how strongly associated the variables are, because the pattern of their relationship is curved (nonlinear). The correlation coefficient is not a good summary of the association of these variables.
Correlation and Association
The correlation coefficient r measures only linear association: how nearly the data fall on a straight line. It is not a good summary of association if the scatterplot has a nonlinear (curved) pattern.
Recall that data are homoscedastic if the SD of the values of Y for points in a vertical slice through the scatterplot is about the same, regardless of the location of the slice. In contrast, if the SD of the values of Y in a vertical slice varies a great deal depending on the location of the slice, the data are heteroscedastic. All the scatterplots we have seen so far in this chapter are roughly homoscedastic. The figure shows a heteroscedastic scatterplot with the corresponding correlation coefficient.
The scatter in a vertical slice near the right of the plot is much larger than the scatter in a vertical slice near the left of the plot. There is not much association between Y and X, but the correlation coefficient is still 0.15, an artifact of the heteroscedasticity.
Correlation and Heteroscedasticity
The correlation coefficient r is not a good summary of association if the data are heteroscedastic.
Recall that a datum that does not fit the overall pattern in the data, or that is many SDs from the other data in at least one of its coordinates, is called an outlier. A single outlier that is far from the point of averages can have a large effect on the correlation coefficient. The next two figures show two extreme examples. In the first, the outlier makes the correlation coefficient nearly one; without it, the correlation coefficient would be nearly zero. In the second, the outlier makes the correlation coefficient nearly zero; without it, the correlation coefficient would be nearly one.
The figure lets you add points to the scatterplot by clicking the scatterplot; a point is added wherever the cursor is. Adding a point typically will change the correlation coefficient. You can add as many points as you wish. Click the Clear Added Points button to delete the points you added. If you click the Ignore Added Points button, the new points will not be included in computing the correlation coefficient.
Try adding a point to the scatterplot and seeing how much you can change the correlation coefficient. Clear the point, and try again. See how large and how small you can make the correlation coefficient be by adding just one point. You should be able to change r from 0 to plus or minus 0.12 or more. If you could add a point beyond the limits of the plot, you could make r vary from nearly -1 to nearly 1. The following exercise checks your understanding of the influence a single point can have on the correlation coefficient.
Correlation and Outliers
The correlation coefficient r is not a good summary of association if the data have outliers.
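The influence of a single outlier can be illustrated with a few made-up points. This is a minimal sketch, not the data behind the figures: the coordinates are hypothetical, chosen so the arithmetic is easy to follow, and the SD is assumed to divide by n. The helper name `correlation` is hypothetical.

```python
from statistics import mean, pstdev


def correlation(x, y):
    """Average of the products of x and y after converting to standard units."""
    mx, my = mean(x), mean(y)
    sx, sy = pstdev(x), pstdev(y)  # assumption: SD divides by n, not n - 1
    return mean([((a - mx) / sx) * ((b - my) / sy) for a, b in zip(x, y)])


# Case 1: four points at the corners of a square have r = 0.
square_x = [1.0, 1.0, -1.0, -1.0]
square_y = [1.0, -1.0, 1.0, -1.0]
r_square = correlation(square_x, square_y)

# One far-away point on the diagonal pulls r close to 1.
r_square_outlier = correlation(square_x + [100.0], square_y + [100.0])

# Case 2: ten points exactly on the line y = x have r = 1.
line_x = [float(i) for i in range(1, 11)]
line_y = list(line_x)
r_line = correlation(line_x, line_y)

# One point far to the right and well below the line pulls r close to 0.
r_line_outlier = correlation(line_x + [30.0], line_y + [1.8])

print(f"square: {r_square:.4f} -> with outlier {r_square_outlier:.4f}")
print(f"line:   {r_line:.4f} -> with outlier {r_line_outlier:.4f}")
```

In both cases the added point is far from the point of averages, which is what gives it so much leverage over r.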
If a scatterplot does not show nonlinearity, heteroscedasticity or outliers, it is "football-shaped."
Five-number summary of football-shaped Scatterplots
Football-shaped scatterplots can be summarized rather well by five numbers:
the mean of X, the mean of Y, the SD of X, the SD of Y, and r.
We saw earlier in this chapter that the correlation coefficient measures linear association. We know that r does not measure nonlinear association. We know that the value of r can be deceptive if the data are heteroscedastic or contain outliers. We know that r is always between -1 and +1. We know how to estimate r by eye. But we do not know how to compute r from data. In this section, we shall learn how to compute the correlation coefficient: r is the average product of X and Y, after putting X and Y on an equal footing by transforming them to standard units—standard deviations above the mean.
Standard units are a way of putting different kinds of observations on the same scale. The idea is to replace a datum by the number of standard deviations it is above the mean of the data. If a datum is above the mean, its value in standard units is positive; if it is below the mean, its value in standard units is negative. A datum that is above the mean by 2.5 times the SD is 2.5 in standard units.
When a list is transformed to standard units, the mean of the new list is zero, and the SD of the new list is one: that is what it means for a set of data to be in standard units. Standard units are dimensionless. If the original list has units, the original SD has the same units. To transform a measurement to standard units, we divide the measurement (minus the mean) by the SD, which cancels the original units.
If we know the mean and SD of the original data, we can restore a datum that is in standard units to the original units of measurement, as follows:
original value = (value in standard units) × SD + mean.
Note that both the transformation from original units to standard units and the transformation from standard units to original units are affine transformations. The accompanying example illustrates converting from original units to standard units and back. It is a dynamic example: it changes whenever you reload the page.
Values that are larger than the mean are positive in standard units.
Values that are less than the mean are negative in standard units.
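The two conversions described above can be sketched in a few lines. This is an illustration under stated assumptions: the list of values is made up, the function names `to_standard_units` and `from_standard_units` are hypothetical, and the SD is computed by dividing by n (`statistics.pstdev`).

```python
from statistics import mean, pstdev


def to_standard_units(values):
    """Replace each value by the number of SDs it lies above the mean."""
    m = mean(values)
    sd = pstdev(values)  # assumption: SD divides by n, not n - 1
    return [(v - m) / sd for v in values]


def from_standard_units(z_values, m, sd):
    """Invert the conversion: original value = (standard units) x SD + mean."""
    return [z * sd + m for z in z_values]


# Made-up list with mean 5 and SD 2.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

z = to_standard_units(data)
restored = from_standard_units(z, mean(data), pstdev(data))

print("standard units:", z)          # 2.0 becomes -1.5, 9.0 becomes 2.0
print("mean of z:", mean(z))         # zero, up to rounding
print("SD of z:", pstdev(z))         # one, up to rounding
print("restored:", restored)
```

Values below the mean (such as 2.0) come out negative, values above the mean (such as 9.0) come out positive, and converting back recovers the original list.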
The following exercise checks your ability to convert a measurement to standard units.
The correlation coefficient r of two variables X and Y is the average of the product of X in standard units and Y in standard units. You must be sure to multiply the measurements corresponding to the same individual. The order in which you multiply doesn't matter, but you should not change the order of one set of measurements relative to the other. An example will help make the idea clear.
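The recipe can be written out step by step. The following is a minimal sketch on made-up paired measurements; the function names `standard_units` and `correlation` are hypothetical, and the SD is assumed to divide by n. Pairing the lists with `zip` keeps each X with the Y for the same individual.

```python
from statistics import mean, pstdev


def standard_units(values):
    """Number of SDs each value lies above the mean of its list."""
    m = mean(values)
    sd = pstdev(values)  # assumption: SD divides by n, not n - 1
    return [(v - m) / sd for v in values]


def correlation(x, y):
    """Average of the products of x and y in standard units."""
    if len(x) != len(y):
        raise ValueError("x and y must list the same individuals")
    zx = standard_units(x)
    zy = standard_units(y)
    # zip pairs the i-th x with the i-th y: the same individual.
    products = [a * b for a, b in zip(zx, zy)]
    return mean(products)


# Made-up measurements on five individuals.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

r = correlation(x, y)
print(f"r = {r:.4f}")

# Points exactly on a line with positive slope give r = 1.
r_line = correlation(x, [2.0 * v + 1.0 for v in x])
print(f"r for y = 2x + 1: {r_line:.4f}")
```

For these five pairs r is about 0.77. Shuffling one list without shuffling the other in the same way would pair measurements from different individuals and generally give a different number.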
Because the correlation coefficient uses the two lists after converting them to standard units, the correlation coefficient does not change if the lists are changed in any of the following ways:
If only one of the lists is multiplied by a negative number, r changes sign, but has the same magnitude. A positive association becomes negative, and vice versa.
Because computing the correlation coefficient involves converting both lists to standard units and multiplying the results, and because multiplication does not depend on the order of the factors, it does not matter which list is first:
rXY = rYX.
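These properties can be checked numerically. The sketch below uses made-up data and a hypothetical helper `correlation`, with the SD assumed to divide by n. It relies on the fact that adding a constant to a list, or multiplying a list by a positive constant, leaves the list unchanged in standard units; the particular constants are arbitrary.

```python
from statistics import mean, pstdev


def correlation(x, y):
    """Average of the products of x and y after converting to standard units."""
    mx, my = mean(x), mean(y)
    sx, sy = pstdev(x), pstdev(y)  # assumption: SD divides by n, not n - 1
    return mean([((a - mx) / sx) * ((b - my) / sy) for a, b in zip(x, y)])


# Made-up paired measurements.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

r_xy = correlation(x, y)

# Swapping the roles of the lists does not change r.
r_yx = correlation(y, x)

# Adding a constant to one list does not change r.
r_shifted = correlation([v + 100.0 for v in x], y)

# Multiplying one list by a positive constant does not change r.
r_scaled = correlation(x, [3.0 * v for v in y])

# Multiplying only one list by a negative constant flips the sign of r.
r_negated = correlation([-2.0 * v for v in x], y)

print(f"r_xy = {r_xy:.4f}, r_yx = {r_yx:.4f}")
print(f"shifted = {r_shifted:.4f}, scaled = {r_scaled:.4f}")
print(f"negated = {r_negated:.4f}")
```

The first four values agree up to rounding; the last has the same magnitude and the opposite sign.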
Two football-shaped scatterplots can have the same correlation coefficient but look quite different if the SDs of the variables are different.
The following exercises check your ability to convert variables to standard units and to compute the correlation coefficient.
Correlations based on averages can be arbitrarily misleading if they are interpreted to be about individuals. Correlations based on averages are usually too high, because they ignore the variability across individuals. Correlation of averages is called ecological correlation.
For example, the figure is a scatterplot of the GMAT data set, averaged by school. That is, there are now five "individuals": each one is one of the five schools. The first-year MBA GPA for a school is the average of the first-year MBA GPAs of all the students at that school, etc.
For the averaged data, the correlation of quantitative and verbal GMAT scores is 0.95; for the original data, it was only 0.35. If you interpreted the correlation based on averaged data as the association between quantitative and verbal GMAT scores for individuals, you would be way off: There is far more scatter in individual students' scores. This effect is called "ecological correlation."
On the other hand, averaging can reduce correlations. For example, the correlation of the averaged first-year MBA GPA and undergraduate GPA is zero, while for the original data, it is 0.24.
For a large group of college students, the ages of Freshmen will vary, as will the ages within other years, so the correlation coefficient for age and the number of years one has been in school will not equal 1.
However, if we take the average ages for each class {Freshmen, Sophomores, Juniors, Seniors}, the averages will probably be very close to 19, 20, 21, and 22, respectively, and the average number of years of education will be pretty close to 13, 14, 15, and 16. The correlation coefficient between the average age in a class and the average number of years of education in a class will be much closer to 1. Nonetheless, we cannot predict an individual's age very well from the number of years he or she has been in school.
For a really extreme example, imagine dividing the university population into two groups, faculty and undergraduates (we leave out graduate students and staff). The ages and educational levels vary both within and across these groups, but what happens if we plot just the averages for the two groups? We will get two points, one with average age about 20 and average number of years of education about 14, and one with average age closer to 45 and average number of years of education about 22. These two points will lie on a straight line (any two points do), and the line will have positive slope (the faculty are older on the average, and have more years of education on the average), so the correlation coefficient will be +1.
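The gap between correlations of averages and correlations of individuals can be reproduced with a small constructed data set. This is a sketch with entirely hypothetical numbers (they are not the GMAT data or real ages): three groups whose centers rise together, with wide scatter of individuals around each center. The helper `correlation` is hypothetical, and the SD is assumed to divide by n.

```python
from statistics import mean, pstdev


def correlation(x, y):
    """Average of the products of x and y after converting to standard units."""
    mx, my = mean(x), mean(y)
    sx, sy = pstdev(x), pstdev(y)  # assumption: SD divides by n, not n - 1
    return mean([((a - mx) / sx) * ((b - my) / sy) for a, b in zip(x, y)])


# Hypothetical group centers: X and Y rise together from group to group.
centers = [0.0, 10.0, 20.0]

# Hypothetical individual offsets: wide scatter with no trend within a group.
offsets = [(-15.0, 15.0), (15.0, -15.0), (-15.0, -15.0), (15.0, 15.0)]

# Each group is a list of (x, y) pairs for its individuals.
groups = [[(c + dx, c + dy) for dx, dy in offsets] for c in centers]

# Correlation for individuals: pool everyone.
individuals = [pair for group in groups for pair in group]
r_individuals = correlation([p[0] for p in individuals],
                            [p[1] for p in individuals])

# Ecological correlation: one point per group, the pair of group averages.
avg_x = [mean([p[0] for p in group]) for group in groups]
avg_y = [mean([p[1] for p in group]) for group in groups]
r_averages = correlation(avg_x, avg_y)

print(f"individuals: r = {r_individuals:.4f}")
print(f"averages:    r = {r_averages:.4f}")
```

Here the correlation for individuals is about 0.23, while the correlation of the three group averages is 1: averaging removes the scatter within groups and leaves only the trend between groups.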
Ecological Correlation
Correlation coefficients of averages are called ecological correlations.
Correlations of averages of measurements can differ enormously from correlations of individual measurements.
Typically, they are much larger, but they can be smaller, too.
When you examine a claim that the association between two variables is strong, be alert to the possibility that the stated correlation is an ecological correlation. If it is, the correlation coefficient for individuals could be quite different, and it tends to be smaller in magnitude.
Correlation is a measure of linear association between two variables. If larger than average values of X tend to occur in conjunction with larger than average values of Y and smaller than average values of X tend to occur in conjunction with smaller than average values of Y, the correlation coefficient rXY of X and Y is positive. If larger than average values of X tend to occur in conjunction with smaller than average values of Y and smaller than average values of X tend to occur in conjunction with larger than average values of Y, rXY is negative. The correlation coefficient of X and Y is always between -1 and +1. If the points in a scatterplot of Y versus X fall on a straight line with slope greater than zero, rXY = 1. If the points in a scatterplot of Y versus X fall on a straight line with slope less than zero, rXY = -1. If the points in a scatterplot of Y versus X fall on a horizontal line, rXY is not defined.
The correlation coefficient does not measure all kinds of association—only linear association. The correlation coefficient, the point of averages, SDX and SDY summarize football-shaped scatterplots well, but not scatterplots that show nonlinearity, heteroscedasticity or outliers. Two variables can have strong nonlinear association and small or zero correlation. A single outlier can make the correlation coefficient small or large.
Strong correlation between two variables does not entail a causal relationship between them; neither does a causal relationship between two variables entail any correlation between them. Beware claims of causality on the basis of correlation.
Converting to standard units makes different variables commensurable. A measurement in standard units is the number of SDs the measurement is above the mean. Values larger than the mean are positive in standard units; values below the mean are negative in standard units. The mean of a list in standard units is zero, and the SD of a list in standard units is 1. The correlation coefficient of X and Y is the average of the products of X and Y in standard units. It is important to multiply the value of X by the value of Y for the same individual.
Ecological correlations are correlation coefficients of averages across groups of individuals, rather than correlation coefficients for individuals. Ecological correlations tend to be stronger than the correlation coefficient for individuals, although the opposite is also possible. Beware arguments about association that rely on ecological correlations.