Many interesting questions about the world involve more than one variable and the relationships among variables. For example, within the United States, there is a negative association between the amount of money an individual spends on healthcare in a given year and the number of years he survives beyond that year: the more spent on healthcare, the shorter the life expectancy. What does this mean? Could it be true? Does healthcare shorten your life?
Association is a property of two or more variables; in this example, the amount an individual spends on healthcare, and the number of additional years the individual survives. Association is not the same as causation. This chapter presents tools for studying more than one variable at a time, and the relationships among variables, including association.
So far, we have been looking at one variable at a time. We now start to look at the relationship among two or more variables, each measured for the same collection of individuals. An "individual" is not necessarily a person: it might be an automobile, a place, a family, a university, etc. For example, the two variables might be the heights of a man and of his son, in which case the "individual" is the pair (father, son). Such pairs of measurements are called bivariate data. Observations of two or more variables per individual in general are called multivariate data.
We will use the GMAT data as an example of a multivariate data set. These data were made available by Howard Wainer of the Educational Testing Service. The data comprise 5 variables measured for each of 913 individuals, who were then students in their second year of an MBA program at five good business schools. The variables are the undergraduate GPA, verbal and quantitative GMAT scores, first-year MBA GPA, and an integer indicating which of the five business schools the student attended. I do not know the year these data were collected. The accompanying applet allows you to display a histogram of each of those variables in turn.
The drop-down menu at the top of the applet (which should show "Verbal GMAT" when you first visit the page) lets you select which variable in the data set is displayed in the histogram. The List Data button opens a table of the values of the 5 variables for all 913 students. The "Mean" and "SD" are the mean and SD of the variable currently displayed.
By looking at these five histograms, we can learn about the distribution of each of the variables. For example, both the verbal and quantitative GMAT scores have means of about 35 and SDs a bit over 6 points. That is, on the average, these students scored about 35 points on the verbal GMAT and about 35 points on the quantitative GMAT, but individual scores varied from those averages, typically by about 6 points.
Did students with higher than average verbal GMAT scores also tend to have higher than average quantitative GMAT scores (are verbal and quantitative GMAT scores positively associated)? Or, perhaps, was there a tendency for students who did better than average on the verbal GMAT to do worse than average on the quantitative GMAT (are verbal and quantitative GMAT scores negatively associated)? Similarly, did students who had higher than average undergraduate GPAs tend to do better than average in their first year as MBA students? Suppose you were the director of admissions for an MBA program. Which variables seem to predict how a student will do in his or her first year in the MBA program? How would you decide whom to admit?
Such questions are hard to answer using just the five histograms. These questions are about the association of the measured variables. The histograms say nothing about the association. The association is also quite hard to see directly from the list of data, especially for lists as long as this one. (Try it: click the List Data button and see whether you can find a relationship among the variables.) To see association graphically, we need to display more than one variable at a time.
One of the best tools for studying the association of two variables graphically is the scatterplot or scatter diagram. Scatterplots are especially helpful when the number of data is large—studying a list is then virtually hopeless. A scatterplot plots two measured variables against each other, for each individual. That is, the x (horizontal) coordinate of a point in a scatterplot is the value of one measurement (X) of an individual, and the y (vertical) coordinate of that point is the other measurement (Y) of the same individual. We call such a plot a scatterplot of Y versus X or a scatterplot of Y against X.
The accompanying applet makes scatterplots of pairs of variables.
When you first arrive at this page, the scatterplot should show the quantitative GMAT scores (on the vertical or "y" axis) versus the verbal GMAT scores (on the horizontal or "x" axis). You can change which variable is plotted against which by using the drop-down menus containing the variable names, located at the top of the figure. If the scatterplot is not of quantitative GMAT versus verbal GMAT, please change to those variables.
Clicking the List Data button opens a table of the data. Clicking the Univariate Stats button opens a window that contains summary statistics of the 5 variables: the number of individuals for whom each variable was measured, the mean and SD of each variable, and the minimum, lower quartile, median, upper quartile and maximum of each variable.
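The summary statistics listed above can be computed for any single variable. The following is a minimal sketch, not the applet's own code. It assumes the SD divides by n (the population SD) and that quartiles are found by the "inclusive" interpolation rule; other quartile conventions give slightly different answers for short lists. The function name univariate_stats and the scores are hypothetical and are not taken from the GMAT data.

```python
from statistics import mean, pstdev, quantiles


def univariate_stats(values):
    """Summarize one variable: count, mean, SD, and five-number summary."""
    # Assumption: quartiles use the "inclusive" interpolation method.
    q1, med, q3 = quantiles(values, n=4, method="inclusive")
    return {
        "n": len(values),
        "mean": mean(values),
        "sd": pstdev(values),  # assumption: SD divides by n, not n - 1
        "min": min(values),
        "lower quartile": q1,
        "median": med,
        "upper quartile": q3,
        "max": max(values),
    }


# Hypothetical scores, not from the GMAT data set.
scores = [28, 31, 35, 36, 40]
stats = univariate_stats(scores)
for name, value in stats.items():
    print(f"{name}: {value}")
```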
The red square in the middle of the scatterplot is the point of averages. Its horizontal coordinate is the mean of the values of the variable plotted on the horizontal axis (the mean verbal GMAT score at first), and its vertical coordinate is the mean of the values of the variable plotted on the vertical axis (the mean quantitative GMAT score at first). The point of averages is a measure of the "center" of a scatterplot, quite analogous to the mean as a measure of the center of a list.
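As a small illustration, the point of averages can be computed directly from the two lists of measurements. This is only a sketch: the function name point_of_averages and the five pairs of scores are hypothetical, not values from the GMAT data.

```python
from statistics import mean


def point_of_averages(xs, ys):
    """Return (mean of X, mean of Y) for paired measurements."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have one value per individual")
    return (mean(xs), mean(ys))


# Hypothetical scores for five students (not from the GMAT data set).
verbal = [31, 35, 40, 28, 36]  # plotted on the horizontal axis
quant = [37, 33, 60, 30, 35]   # plotted on the vertical axis

print(point_of_averages(verbal, quant))
```

Each coordinate is computed from one variable alone, so the point of averages by itself says nothing about how the two variables are related.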
Put the cursor over the point of averages. The "meter" at the bottom of the plot that looks like
x = 35.nn y = 35.nn
will show that the x (horizontal) coordinate of the cursor is about 35 and the y (vertical) coordinate of the cursor is also about 35. Put the cursor over the highest blue dot on the plot. You should be able to tell from the meter that the x-value of that point is about 40, and its y-value is about 60. The dot corresponds to a single student whose verbal GMAT score was about 40, and whose quantitative GMAT score was about 60. That student scored above average on both parts of the GMAT test, but much further above average in the quantitative score.
Click the List Data button to open a table of all the student data. The top line shows that the variables are:
School "1st year MBA GPA" "Verbal GMAT" "Quant. GMAT" "Undergrad. GPA"
Each row in the table corresponds to one student. The first column in each row is the student's business school, the second column is his first-year MBA GPA, the third column is his verbal GMAT score, the fourth column is his quantitative GMAT score, and the fifth column is his undergraduate GPA. The first row in the table is:
1 3.155 31 37 3.53.
That is, the first student attended business school 1, had a first-year MBA GPA of 3.155, scored 31 on the verbal GMAT, 37 on the quantitative GMAT, and had an undergraduate GPA of 3.53.
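One way to hold such a row in a program is as a record with one named field per column. The sketch below assumes the rows are whitespace-separated in the column order described above; the class name Student, its field names, and the function parse_row are hypothetical choices made for illustration.

```python
from typing import NamedTuple


class Student(NamedTuple):
    school: int           # which of the five business schools
    mba_gpa: float        # first-year MBA GPA
    verbal: int           # verbal GMAT score
    quant: int            # quantitative GMAT score
    undergrad_gpa: float  # undergraduate GPA


def parse_row(line):
    """Parse one whitespace-separated row of the data table."""
    school, mba_gpa, verbal, quant, undergrad_gpa = line.split()
    return Student(int(school), float(mba_gpa), int(verbal),
                   int(quant), float(undergrad_gpa))


# The first row of the table, as quoted in the text.
first = parse_row("1 3.155 31 37 3.53")
print(first.verbal, first.quant)
```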
Find the record of the student whose quantitative GMAT score was 60. (Hint: it's the 13th student in school 3.) Click that row of the table. The corresponding point in the scatterplot should turn yellow. You can highlight any number of points by clicking the corresponding rows of the table. To clear the highlighting of a point, click its row again.
The following exercises check your ability to use the scatterplot applet to answer questions about multivariate data.
Scatterplots let us see the relationships among variables. Does one variable tend to be larger when another is large? Does the relationship follow a straight line? Is the scatter in one variable the same, regardless of the value of the other variable?
The accompanying scatterplot illustrates a linear relationship between the variables. The scatterplot is roughly football-shaped: the points do not lie exactly on a line, but are scattered more-or-less evenly around one. (Note: this figure will be different every time you visit or reload the page.)
The next scatterplot illustrates nonlinearity. The pattern in the relationship between the variables is not a straight line; it is curved. The data are scattered more-or-less evenly around a curve: The scatter in the values of Y is about the same for different values of X, that is, in different vertical slices through the scatterplot.
When the scatter in Y is about the same in different vertical slices through a scatterplot, the data (and the scatterplot) are said to be homoscedastic (equal scatter). So far, all the plots in this section have been homoscedastic. The next figure is a scatterplot of heteroscedastic data: The scatter in vertical slices depends on where you take the slice.
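One rough numerical check for heteroscedasticity is to compute the SD of Y separately in several vertical slices and compare the results. The sketch below assumes the population SD (dividing by n). The function name sds_by_slice, the slice edges, and the data are hypothetical; the data were constructed so that the scatter in Y grows with X.

```python
from statistics import pstdev


def sds_by_slice(xs, ys, edges):
    """SD of y in each vertical slice edges[i] <= x < edges[i + 1]."""
    result = []
    for lo, hi in zip(edges, edges[1:]):
        in_slice = [y for x, y in zip(xs, ys) if lo <= x < hi]
        # An empty slice has no SD; record None for it.
        result.append(pstdev(in_slice) if in_slice else None)
    return result


# Hypothetical data: the vertical scatter increases from left to right.
xs = [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3]
ys = [-1, 1, -1, 1, -2, 2, -2, 2, -3, 3, -3, 3]

slice_sds = sds_by_slice(xs, ys, [1, 2, 3, 4])
print(slice_sds)
```

If the SDs in the slices are all about the same, the data are roughly homoscedastic; if they differ appreciably, as here, the data are heteroscedastic.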
A point that does not fit the overall pattern of the data, or that is many SDs from the bulk of the data, is called an outlier. The accompanying figures are examples of scatterplots of data with a large outlier.
The following exercises check your ability to categorize scatterplots.
We augmented measures of location for single variables by measures of spread, and we shall do the same for pairs of variables. A measure of the horizontal spread is the SD of the variable plotted on the horizontal or x axis. We shall write this as SDX. Similarly, the SD of the variable plotted on the vertical or y axis is a measure of the vertical spread. We'll write this as SDY.
The SD is a measure of the scatter in a list. The typical deviation of the x coordinate of a point from the mean of the x coordinates is SDX. The typical deviation of the y coordinate of a point from the mean of the y coordinates is SDY. We know from Chebychev's inequality that, for example, the x coordinates of at least 75% of the points will be within ±2×SDX of the x coordinate of the point of averages, and that the y coordinates of at least 75% of the points will be within ±2×SDY of the y coordinate of the point of averages. However, in narrow ranges of x (vertical slices), the scatter in y might typically be smaller than SDY, and in narrow ranges of y (horizontal slices), the scatter in x might typically be smaller than SDX. If so, the two variables are associated.
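The 75% bound can be checked on any list of numbers. Here is a minimal sketch, assuming the SD divides by n; the function name fraction_within and the scores (which include one low outlier) are hypothetical.

```python
from statistics import mean, pstdev


def fraction_within(values, k):
    """Fraction of values within k SDs of their mean."""
    if k <= 0:
        raise ValueError("k must be positive")
    m, sd = mean(values), pstdev(values)
    return sum(abs(v - m) <= k * sd for v in values) / len(values)


# Hypothetical scores, including one unusually low value.
xs = [28, 31, 35, 36, 40, 33, 34, 37, 12, 35]

frac = fraction_within(xs, 2)
print(frac)
```

Chebychev's inequality guarantees that the result for k = 2 is at least 0.75 for every list; for many lists, including this one, the actual fraction is considerably larger.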
If individuals with larger than average values of one variable tend to have larger than average values of the other, and individuals with smaller than average values of one variable tend to have smaller than average values of the other, the scatter of the values of Y in vertical slices through the scatterplot will be smaller than SDY. Such a scatterplot shows positive association. If individuals with larger than average values of one variable tend to have smaller than average values of the other, and individuals with smaller than average values of one variable tend to have larger than average values of the other, the scatter of the values of Y in vertical slices through the scatterplot also will be smaller than SDY; this is called negative association. Positive and negative associations are examples of linear association; variables can be associated nonlinearly as well.
The next figure is the scatterplot of the GMAT data again, but this time with four new lines: two vertical lines at the mean value of X plus and minus SDX, the SD of the variable plotted on the horizontal axis, and two horizontal lines at the mean value of Y plus and minus SDY, the SD of the variable plotted on the vertical axis. There is also a new button, labeled "No SDs." Click the button. The label will change to "SDs," and the lines will go away. Buttons on figures in this book usually say what will happen when you click them. If the figure does not show the scatterplot of Quantitative GMAT versus Verbal GMAT, change the variables using the drop-down menus at the top of the figure.
The cloud of points in the scatterplot tilts slightly upward towards the right. Individuals with larger than average Quantitative GMAT scores tend to have larger than average Verbal GMAT scores, and individuals with smaller than average Quantitative GMAT scores tend to have smaller than average Verbal GMAT scores, so these variables are positively associated. The association between the Verbal and Quantitative GMAT scores of these students is not very strong. Knowing that a particular student scored above average on the Verbal GMAT does not let us guess how well the student did on the Quantitative GMAT much more accurately than we could have without knowing the student's Verbal GMAT score. For example, the student with the highest Quantitative GMAT score didn't do nearly as well on the Verbal GMAT as many other students, and the student with the lowest Quantitative GMAT score did above average on the Verbal portion of the exam. The overall vertical scatter in the scatterplot, SDY, is the SD of the Quantitative GMAT scores, which is about 6.77. Now consider just those students whose Verbal GMAT scores are between 43 and 45 (there are 77 such students). The SD of the Quantitative GMAT scores of those students was only about 5.91, rather less than the overall SD of the Quantitative GMAT scores, because there is an association between the variables. If we took other "slices" through the data (other narrow ranges of Verbal GMAT), we would typically find something similar: the SD of the Quantitative GMAT scores for students whose Verbal GMAT scores are in narrow ranges tends to be a bit smaller than the overall SD of the Quantitative GMAT scores.
Two variables are associated if knowing the value of one of them tells us something about the value of the other. Slightly more precisely, X and Y are associated if the SD of the values of Y of points whose X coordinates are in a narrow range of values (a vertical slice through the scatterplot) is smaller than the overall SD of Y, or if the SD of the values of X of points whose Y coordinates are in a narrow range of values (horizontal slice through the scatterplot) is smaller than the overall SD of X. In the GMAT data, the SD of values of Quantitative GMAT in narrow ranges of Verbal GMAT is about 5% smaller than the overall SD of Quantitative GMAT: They are associated, but only weakly.
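The comparison in this definition is easy to carry out numerically. The following sketch compares the SD of Y within one vertical slice to the overall SD of Y, assuming the SD divides by n. The function name slice_sd and the data are hypothetical; the data were constructed to be positively associated and are not the GMAT data.

```python
from statistics import pstdev


def slice_sd(xs, ys, lo, hi):
    """SD of the y values of points whose x value satisfies lo <= x <= hi."""
    in_slice = [y for x, y in zip(xs, ys) if lo <= x <= hi]
    if not in_slice:
        raise ValueError("no points in the slice")
    return pstdev(in_slice)


# Hypothetical positively associated data.
xs = [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]
ys = [2, 4, 3, 5, 4, 6, 5, 7, 6, 8]

overall_sd = pstdev(ys)              # overall vertical scatter, SDY
middle_sd = slice_sd(xs, ys, 3, 3)   # vertical scatter where x = 3
print(overall_sd, middle_sd, middle_sd / overall_sd)
```

The smaller the ratio of the slice SD to the overall SD, the more knowing X narrows down the value of Y. A single slice can be misleading, so it is worth repeating the comparison for several slices.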
Use the drop-down menus to plot first-year MBA GPA versus undergraduate GPA. Again, there is a slight positive association between these variables: The cloud of points tilts upward to the right. Students who had higher than average GPAs as undergraduates tended to have higher than average GPAs in their first year of business school. This is not surprising. What is perhaps surprising is how much scatter there is: The undergraduate GPA really does not predict the graduate GPA very well. If it did, the scatterplot would have less scatter in any vertical slice through the plot than it does. The association between these variables is even weaker than the association between Verbal and Quantitative GMAT.
If the association were strong, we could do a good job of predicting the first-year MBA GPA of a student from his or her undergraduate GPA. Because the association is weak, knowing a student's undergraduate GPA doesn't help us very much to predict his or her performance in the first year of an MBA program. This is part of what makes the admissions screening process difficult, and why schools combine several criteria in making admissions decisions.
There is a good reason the undergraduate GPA might not be a good predictor of MBA GPA for these students. How does a student get into the data set? He or she must have been admitted to an MBA program. If the admissions process balances undergraduate GPA with other factors that might predict whether a student will succeed, such as letters of recommendation, GMAT scores, etc., you might expect that the students with below average (for this group) undergraduate GPAs were admitted precisely because there were other reasons for thinking the student would succeed, as reflected in the first year MBA GPA. Indeed, this seems to be the case: The association between undergraduate GPA and first year MBA GPA is weaker for those students whose undergraduate GPA was significantly below average.
The following tool lets you look at the distribution of one variable for subsets of a multivariate data set defined by restricting some of the variables to various ranges.
Use the drop-down menu at the top of the tool to select "1st year MBA GPA" as the variable to show in the histogram. You should now see a green histogram of all the first year MBA GPAs. Now do the following:
Now you will see two histograms superposed on one another. The blue histogram is that of all 913 students' first-year MBA GPAs. The green histogram is that of the first year MBA GPAs of just those 59 students whose undergraduate GPA was 3.8 or above. For each class interval, the shorter bin is plotted in front, so you can see the height of every bin for both the original data and the restricted data. The bottom line of the tool, after the List Data button, shows the number of individuals in the full data set (913), the mean of their first-year MBA GPAs (3.1088), the SD of their first-year MBA GPAs (0.492), the number of individuals in the restricted data set (59), the mean of just their first-year MBA GPAs (3.5167), and the SD of just their first-year MBA GPAs (0.3853).
Notice that the first-year MBA GPAs of the 59 students who did really well as undergraduates (GPA 3.8 or above) are higher on the average than those of the overall group of 913 students, and the scatter in their GPAs is smaller: The association between these variables is positive.
Change the restrictions on the undergraduate GPA as follows:
The blue histogram remains as it was, but the green histogram is that of the first-year MBA GPAs of just those 82 students whose undergraduate GPA was between 2.6 and 2.8. In the last line of the tool, you can see that the SD of the first-year MBA GPAs of the students in this slice is larger (0.5489) than that of the entire group of 913 students (0.492). That is the case in many such slices through the scatterplot, so the association is weak.
The following exercises address assessing association from scatterplots and superposed histograms.
Association between variables often is used, erroneously, as evidence that there is a causal relationship between variables. The introduction to this chapter noted that there is a negative association between money spent on healthcare and life expectancy: The more one spends on healthcare in a given year, the fewer additional years one tends to survive. Does that mean that spending money on healthcare tends to shorten one's life?
Certainly not. As noted in the introduction, it is generally the sickest individuals who spend the most on healthcare in a given year. Their life expectancies are short, whether or not they get healthcare. Healthcare probably lengthens their lives. The negative association of these variables has little to do with any causal relationship between them.
More generally, association does not measure causation. There is a fallacy of logic known since ancient times: post hoc ergo propter hoc, which translates as "after this, therefore because of this." It is common to assume that if two things are associated, there is some causal relationship between them: One causes the other.
That is simply fallacious. For example, the Moon and Earth are gradually getting further apart. Similarly, the Dow-Jones Industrial Average (DJIA) generally has had an upward trend on a time scale of decades: both have positive secular trends. Does the increase in the DJIA cause the Earth and Moon to separate? Does the increased distance between Earth and Moon cause the DJIA to go up? Does some other single thing make them both increase? I think none of these is plausible.
As another example, consider the quantitative GMAT scores and first-year MBA GPAs of the students in the GMAT data set. Even though students with higher than average quantitative GMAT scores tended to have higher than average first-year MBA GPAs, getting higher quantitative GMAT scores did not cause the GPA to be higher. If a student takes a GMAT preparation class, that might help his or her GMAT score. Will it also improve the student's first-year MBA GPA? Probably not. Can doing better in the first year of business school cause a student's GMAT score to go up? Certainly not if the GMAT was taken before the first year of business school.
Work the following exercises to check your understanding of the difference between association and causation.
Multivariate data are observations of two or more variables per individual. Two variables at a time (bivariate data) can be displayed in a scatterplot. The points in a scatterplot represent individuals. The coordinates of each point are the values of the two variables for that individual. The point of averages in a scatterplot is the point with coordinates
(mean of X, mean of Y),
where X is the variable plotted on the x (horizontal) axis and Y is the variable plotted on the y (vertical) axis. Two variables are associated if the scatter of one variable in slices defined by restricting the other variable to a small range is less than the overall scatter of the first variable. (A vertical slice is a set of points with x coordinates that are in a restricted range. A horizontal slice is a set of points with y coordinates that are in a restricted range. Vertical scatter is the SD of the y coordinates of a collection of points; horizontal scatter is the SD of the x coordinates of a collection of points.)
One can detect association by comparing the histogram of one variable for the entire data set with the histogram of subsets of the data defined by restricting the other variable to small ranges: If the variables are associated, the histograms for the subsets will have less spread than the histogram for the entire data set has. It is easier to see association between pairs of variables with scatterplots. Association can be linear or nonlinear. If two variables are linearly associated, the points in their scatterplot are scattered more or less symmetrically around a straight line. If two variables are nonlinearly associated, the points in their scatterplot are scattered around a curve. Two variables are positively associated if individuals with higher than average values of one variable also tend to have higher than average values of the other variable, and individuals with lower than average values of one variable tend to have lower than average values of the other variable. Two variables are negatively associated if individuals with higher than average values of one variable tend to have lower than average values of the other variable, and vice versa. Linear association is always positive or negative. Nonlinear association need not be positive or negative.
Outliers are points many standard deviations away from the bulk of the data in at least one of their coordinates. Homoscedasticity means same scatter: The vertical scatter in different vertical slices through the scatterplot is about the same, regardless of where the slice is centered. Heteroscedasticity means different scatter: The vertical scatter in different vertical slices varies appreciably, depending on where the slice is centered. If a scatterplot shows linear association (or no association), homoscedasticity, and no outliers, it is said to be football shaped.
Association is not causation: two variables can have strong association and have no causal connection, and two variables can have a causal (deterministic) connection and no association. Post hoc ergo propter hoc is the fallacy of concluding from association between two variables that the variables have a cause-and-effect relationship.