This applet lets you study the relationship between pairs of variables using scatterplots, the correlation coefficient, the graph of averages, linear regression, and residual plots.

You can select one of four data sets using the drop-down menu, or type in the URL of a different data set. The four data sets are:

- data about the number of homeless in 50 cities in the USA
- data about the 47 smallest of those 50 cities
- pollutant emissions from EPA test vehicles in 96 tests
- data about the GPAs and GMAT scores of 913 first-year MBA students

The next two choice boxes let you select which variable in the selected data set to plot on the X axis, and which to plot on the Y axis. On monochrome monitors on Unix systems, the box itself might not be visible: click on the name of the variable to see the other choices.

The buttons should be self-explanatory. They let you:

- plot ±1 SD from the point of averages, which is plotted in red;
- plot the SD line, the graph of averages (yellow squares), and the regression line;
- pop up a window containing the currently plotted data set;
- pop up a window containing summary statistics for each variable;
- toggle between a scatterplot of the data and a residual plot;
- use or ignore points you have added by clicking on the graph;
- clear points you added previously by clicking on the graph.

The univariate summary statistics are always for the original data; they do not include any points you have added.

You can find the X and Y values for any point by positioning the mouse cursor over it: the coordinates of the cursor are given in the lower right corner of the applet. If you select some rows of data in the data set window and press Return, the corresponding points in the scatterplot will be plotted in yellow, rather than blue.
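For readers who want to check the plotted quantities by hand, here is a minimal Python sketch of the point of averages, the SDs, the correlation coefficient, the SD line, the regression line, the residuals, and a graph of averages. It is not the applet's own code. Two choices are assumptions: the SD divides by n rather than n − 1, and the graph of averages groups X values into equal-width bins (the `bins` parameter is hypothetical). All function names are illustrative.

```python
from math import sqrt


def mean(v):
    return sum(v) / len(v)


def sd(v):
    # Assumption: population SD (divide by n), not the sample SD (n - 1).
    m = mean(v)
    return sqrt(sum((a - m) ** 2 for a in v) / len(v))


def correlation(x, y):
    # Average of the products of the values in standard units.
    mx, my, sx, sy = mean(x), mean(y), sd(x), sd(y)
    return mean([((a - mx) / sx) * ((b - my) / sy) for a, b in zip(x, y)])


def sd_line(x, y):
    # Through the point of averages, with slope +/- SD(y)/SD(x);
    # the sign is the sign of the correlation coefficient.
    sign = 1 if correlation(x, y) >= 0 else -1
    slope = sign * sd(y) / sd(x)
    return slope, mean(y) - slope * mean(x)


def regression_line(x, y):
    # Through the point of averages, with slope r * SD(y)/SD(x).
    slope = correlation(x, y) * sd(y) / sd(x)
    return slope, mean(y) - slope * mean(x)


def residuals(x, y):
    # Vertical distance of each point from the regression line.
    slope, intercept = regression_line(x, y)
    return [b - (slope * a + intercept) for a, b in zip(x, y)]


def graph_of_averages(x, y, bins=5):
    # Assumption: equal-width bins of x; each bin is represented by
    # (bin midpoint, average of y in the bin). Empty bins are omitted.
    lo, hi = min(x), max(x)
    width = (hi - lo) / bins or 1.0
    groups = {}
    for a, b in zip(x, y):
        k = min(int((a - lo) / width), bins - 1)
        groups.setdefault(k, []).append(b)
    return [(lo + (k + 0.5) * width, mean(v)) for k, v in sorted(groups.items())]
```

Each function returning a line gives a `(slope, intercept)` pair. Plotting `residuals(x, y)` against `x` gives the residual plot; the residuals always average to zero.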

**Cities data**. These
data come from a September 25, 1987, article by W. Tucker in the
National Review. Mr. Tucker presented the results of analyses by
Prof. Jeffrey Simonoff of the Department of Statistics and
Operations Research, Stern School of Business, New York
University. The "conceivably relevant factors" Prof.
Simonoff considered in studying the homeless rate per
thousand population were the population size, vacancy rate, and
unemployment rate, in 50 cities in the USA. The homeless figures
for 35 of the cities came from the 1984 Report to the Secretary
of Housing and Urban Development on Emergency Shelters and
Homeless Populations. The homeless data for the other 15 cities
(St. Louis, Santa Monica, Newark, Yonkers, Dallas-Fort Worth,
Denver, Charleston WV, Atlanta, San Diego, New Orleans,
Albuquerque, Tucson, Burlington, Milwaukee, Providence, and
Lincoln NE) were from local sources, and were chosen because 1987
or 1988 homeless estimates for those cities happened to be
available. The other data for those cities came from various
federal agencies, such as the Census Bureau, HUD, and NOAA
(Prof. J. Simonoff, personal communication, 1998).
The cities47 data set excludes the three largest
of the 50 cities.

**CCV data**. The
Correlation Check Vehicle (CCV) data were made available by Leo
Breiman, Department of Statistics, UCB. The data were collected
by the Environmental Protection Agency. The test vehicles are
1977 Chevrolet Novas, modified in various ways, including the
removal of their catalytic converters and other emissions-control
systems. These data were collected in 1979 at a single
laboratory, with constant engine load and fuel temperature.
The measured variables are the emissions of hydrocarbons (HC),
nitrogen oxides (NOx), and carbon monoxide (CO), measured in
milligrams per mile. Three outliers whose cause was known were
removed from the data set; 96 measurements of each of the three
variables remain. "Test" is just a number that
identifies the case, not a measurement.

**GMAT data**. These data
were made available by Howard Wainer, Educational Testing
Service. The data are the undergraduate GPAs, verbal and
quantitative GMAT scores, and first-year MBA GPAs of 913
students from five major universities. I do not know how the
students were selected, nor do I know the year of the study.

To plot a different data set, give the URL of a file on the web. The file should be in the following format:

- The first line has the variable names, separated by tabs or spaces. Complicated names that contain spaces can be surrounded with quotation marks. If a line contains two slashes (//), everything from the slashes to the end of the line is ignored as a comment.
- The remaining lines contain the data, separated by spaces or tabs. The order of the data should be every variable for the first case, then every variable for the second case, and so on, up to the last case. Again, everything in a line after two slashes is ignored.
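To check a file before pointing the applet at it, the format above can be read with a short Python sketch like the one below. It illustrates the rules as stated and is not the applet's own parser. It assumes that all data values are numeric, that lines left empty after comment removal are skipped, and that quoted names do not themselves contain `//`. The names `parse_dataset` and `load_dataset` are hypothetical.

```python
import re
from urllib.request import urlopen

# A double-quoted name (which may contain spaces) or a run of non-space characters.
_NAME = re.compile(r'"([^"]*)"|(\S+)')


def parse_dataset(text):
    """Return (names, cases) parsed from text in the format described above."""
    # Everything from '//' to the end of a line is a comment.
    lines = [line.split("//", 1)[0] for line in text.splitlines()]
    # Assumption: lines that are blank after comment removal are skipped.
    lines = [line for line in lines if line.strip()]
    if not lines:
        raise ValueError("no variable names found")

    # First remaining line: variable names, bare or double-quoted.
    names = [quoted or bare for quoted, bare in _NAME.findall(lines[0])]

    # Remaining lines: one stream of values, case by case.
    # Assumption: every value is numeric.
    values = [float(tok) for line in lines[1:] for tok in line.split()]
    k = len(names)
    if len(values) % k != 0:
        raise ValueError("data do not fill a whole number of cases")
    cases = [values[i:i + k] for i in range(0, len(values), k)]
    return names, cases


def load_dataset(url):
    """Hypothetical helper: fetch the file at url and parse it."""
    with urlopen(url) as response:
        return parse_dataset(response.read().decode("utf-8"))
```

Because the data are described as a sequence of values rather than one case per line, the sketch groups values by the number of variable names, so a case may span more than one line.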