This text was written for a "terminal" introductory class in Statistics for Business, Social Science, or liberal arts; that is, this is the first and last class in Statistics for most students who take it. Accordingly, the text is not geared toward theory, numerical analysis, or sophisticated formulae; neither does it contain a bestiary of techniques or named probability distributions. Rather, I hope to help students to think logically about quantitative evidence and to translate real-world situations into mathematical questions; and to expose students to a few important statistical and probabilistic concepts and to some of the difficulties, subjective decisions, and pitfalls in analyzing data and making inferences from numbers. The text develops probability, estimation, and inference using counting arguments: there is no calculus involved.
I hope that students who study from these materials will:
The text goes further with counting arguments and combinatorics than most elementary textbooks do; it also goes further with data analysis. The applets incorporated into the materials enable students to analyze real datasets (the largest has 913 observations of 5 variables) without the pedagogical overhead of teaching students to use a proprietary statistics package. Students also reproduce numerical experiments that demonstrate key concepts, such as sampling distributions, confidence intervals, and the Law of Large Numbers. Using applets also has eliminated the need to teach students to read arcane tables associated with different distributions; instead, students type the relevant parameters into textboxes, highlight a range of values, and read off the probability. I have tried to emphasize topics that can be taught most effectively with this sort of interactive online tool. I have sought to provide enough variety in the material that instructors can pick and choose from among the chapters to find material appropriate to the level at which they desire to teach. The most technical material is in footnotes and sidebars, so that it does not interrupt the flow. Many of the examples and datasets for exercises are real -- they arose in my consulting work, in experiments I am familiar with, or they are in the public domain (for example, data on GMAT scores, undergraduate GPA, and MBA GPA). Many of the inference problems are real, too. For example, the Kassel Dowsing Experiment is a real test of the ability of dowsers to determine whether water is running in a buried pipe; the derivation of Fisher's exact test is in the context of determining whether targeted Web advertising works, a problem I have studied for a consulting client; the case studies about employment discrimination and theft of trade secrets derive from my work as an expert witness.
I have tried to motivate many of the computations by inference problems. Probability, hypothesis testing, randomization, and sampling error are woven into the discussion of experiments and sample surveys. For some introductory courses, the probability in those sections will suffice. For instructors who desire a more quantitative text, there are additional chapters on probability distributions, discrete random variables, and expectation. The book does not discuss continuous distributions: the normal curve, Student's t-curve, and the chi-square curve appear as approximations to the probability histograms of discrete random variables, not as probability densities of continuous random variables. These curves are motivated by interactive experiments using applets that show empirically that the sampling distributions of some random variables converge to the curves. Probability is developed by counting; inference is developed using counting and sampling experiments that illustrate regularities. Above all else, I have strived to be correct and not silly -- I tried to avoid presenting anything I would not do to data as a consultant. There are exceptions, but I have tried to mark them clearly. For example, I find little use for the t-test or Student t confidence intervals for the mean, but as a concession to their popularity, I have included them -- isolated in a single chapter that I usually do not cover.
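To make concrete the idea that the normal curve approximates the probability histogram of a discrete random variable, here is a minimal Python sketch. It is an illustration only, not code from these materials (which use Java applets). The function names and the example numbers (100 tosses of a fair coin, 45 to 55 heads) are assumptions chosen for the illustration. The exact chance is found by counting; the approximate chance is the area under the normal curve over the same range of bins.

```python
import math

def binomial_pmf(n, p, k):
    """Chance of exactly k successes in n independent trials, found by counting."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def normal_cdf(z):
    """Area under the standard normal curve to the left of z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def exact_and_approx(n, p, lo, hi):
    """Chance that the number of successes is between lo and hi, inclusive.

    Returns the exact value (by counting) and the normal approximation,
    which uses the area over the histogram bins from lo - 0.5 to hi + 0.5.
    """
    exact = sum(binomial_pmf(n, p, k) for k in range(lo, hi + 1))
    mean = n * p                          # expected number of successes
    se = math.sqrt(n * p * (1 - p))       # standard error of the number of successes
    approx = normal_cdf((hi + 0.5 - mean) / se) - normal_cdf((lo - 0.5 - mean) / se)
    return exact, approx

# Hypothetical example: 100 tosses of a fair coin, 45 to 55 heads.
exact, approx = exact_and_approx(n=100, p=0.5, lo=45, hi=55)
print(f"exact: {exact:.4f}  normal approximation: {approx:.4f}")
```

Rerunning the sketch with a smaller number of trials, or with p far from 0.5, shows how the agreement degrades when the probability histogram is far from bell-shaped.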
These materials have been used to teach large undergraduate courses at UC Berkeley since 1997. They are the basis of the first fully online course offered by UC Berkeley, Statistics N21, Summer 2007.
If you are reading this in print, rather than in a Web browser, the following does not apply to the version you are reading. The online version of the materials has much more functionality than any print book can have:
The software empowers students to reproduce numerical experiments themselves, without having to learn a statistical language (using instead a standard Web browser), which encourages exploration and inquiry-based learning. The text uses the power of the Internet in many ways, including the following:
These materials do not assume that the reader has any previous knowledge of statistics or probability. However, the reader needs to be comfortable with percentages, exponentiation and square roots, and "scientific notation" (numbers times powers of ten). Assignment 0 is a review and quiz covering the prerequisite material. The ultimate calculations are all simple, but the logical reasoning needed to reduce the problems to those simple calculations is sometimes subtle.
Some of the footnotes and sidebars rely on elementary calculus to find stationary points of convex, continuously differentiable functions. For example, the mean is characterized as the number from which the rms of the residuals is smallest, and the regression line is characterized as the least-squares line. Those derivations can be skipped with impunity.
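For readers curious about what such a derivation looks like, the following is a short sketch of the argument for the mean. It is written for this preface and is not copied from the footnotes. The notation is an assumption: the data are x_1, ..., x_n, their mean is x-bar, c is any candidate number, and f(c) is the mean of the squared residuals, so the rms of the residuals is the square root of f(c). Because the square root is increasing, the c that minimizes f also minimizes the rms.

```latex
% Data x_1, ..., x_n with mean \bar{x}; c is any candidate number.
\[
f(c) = \frac{1}{n}\sum_{i=1}^{n}(x_i - c)^2,
\qquad
\mathrm{rms}(c) = \sqrt{f(c)} .
\]
% Stationary point of the convex function f:
\[
f'(c) = -\frac{2}{n}\sum_{i=1}^{n}(x_i - c) = -2(\bar{x} - c) = 0
\quad\Longrightarrow\quad c = \bar{x},
\qquad
f''(c) = 2 > 0 .
\]
% The same conclusion without calculus:
\[
f(c) = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2 + (\bar{x} - c)^2 .
\]
```

The last identity holds because the residuals from the mean sum to zero; it shows directly that f(c) is smallest when c equals the mean.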
These materials are composed of XHTML, CSS, Java, and JavaScript. As of 29 May 2007, they consisted of 186 XHTML files containing about 108,000 lines of XHTML and JavaScript, 65 Java classes containing about 16,000 lines of code, 16 JavaScript libraries containing about 5,000 lines of code, 34 data files containing about 5,000 records, a cascading style sheet with about 400 lines, and a handful of .jpg and .gif files. The choice to use XHTML, CSS, Java, and JavaScript was motivated by these design criteria:
Using XHTML with Java, JavaScript and CSS allowed me to make the content dynamic: many of the examples and exercises in the text change whenever the page is reloaded, so students can get unlimited practice at certain kinds of problems. Similarly, each student gets a different version of each assignment and exam, but can see the solutions to his/her version after the due date.
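One common way to give each student a different but reproducible version of a problem is to seed a random number generator from the student's identity. The Python sketch below illustrates that idea only; it is an assumption about how such a scheme could work, not a description of the JavaScript actually used in these materials. The function `make_problem`, the identifier format, and the score range are all hypothetical.

```python
import hashlib
import random

def make_problem(student_id, assignment, problem):
    """Return (data, answer) for one student's version of a problem.

    The seed depends only on the three identifiers, so the same student
    always gets the same numbers, and the solution to that version can be
    regenerated after the due date without storing it.
    """
    key = f"{student_id}:{assignment}:{problem}".encode("utf-8")
    seed = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
    rng = random.Random(seed)
    data = [rng.randint(50, 100) for _ in range(10)]  # ten hypothetical scores
    answer = sum(data) / len(data)                    # the mean the student must find
    return data, answer

# The same identifiers always reproduce the same version.
data, answer = make_problem("student-001", assignment=3, problem=2)
data_again, answer_again = make_problem("student-001", assignment=3, problem=2)
print(data, answer, data == data_again)
```

Examples that should change on every reload, rather than stay fixed for a student, would simply use an unseeded generator instead.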
There are a number of advantages to using Java applets rather than an integrated statistical package:
I would recommend that instructors who wish to evaluate these materials for possible adoption look first at Chapters 3-5. Those chapters illustrate several aspects of the text: dynamic exercises, the use of real data in examples and exercises, the histogram and scatterplot applets, and the gradual introduction of new functionality (buttons and displayed statistics) into the applets as students learn new concepts. For example, when the scatterplot applet arrives in Chapter 3, "Multivariate Data and Scatterplots," its only controls change the variables plotted, list the data, show univariate statistics of the variables in the dataset (summary statistics covered in the first two chapters), and display the coordinates of the cursor. (Selecting a row in the data listing highlights the corresponding point in the scatterplot.) In Chapter 4, "Correlation and Association," the scatterplot applet acquires the correlation coefficient, and a button to show graphically the standard deviations of the two variables plotted; it is also invoked to display randomly generated data that attain a given value of the correlation coefficient. It also starts to allow students to add points by clicking on the plot, to see the effect of additional data on the correlation coefficient. In Chapter 5, "Regression," the same applet gains buttons to show the graph of averages, the SD line, and the regression line.
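As an aside on how random data can be made to attain a given value of the correlation coefficient, here is one standard construction, sketched in Python. It is not necessarily how the scatterplot applet does it; the function names, the sample size, and the use of normally distributed values are assumptions made for the illustration. The idea is to build a noise variable that is exactly uncorrelated with x in the sample, then mix x and the noise in the right proportions.

```python
import math
import random

def standardize(values):
    """Convert a list to standard units: mean 0, SD 1 (SD with divisor n)."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / sd for v in values]

def correlation(xs, ys):
    """Correlation coefficient: the average product of values in standard units."""
    sx, sy = standardize(xs), standardize(ys)
    return sum(a * b for a, b in zip(sx, sy)) / len(xs)

def data_with_correlation(r, n, rng):
    """Generate n points (x, y) whose sample correlation coefficient equals r,
    up to floating-point rounding. Requires -1 <= r <= 1 and n >= 3."""
    x = standardize([rng.gauss(0, 1) for _ in range(n)])
    noise = standardize([rng.gauss(0, 1) for _ in range(n)])
    # Remove from the noise the part that is linearly related to x,
    # so that the residual is exactly uncorrelated with x in this sample.
    slope = sum(a * b for a, b in zip(x, noise)) / n
    resid = standardize([e - slope * a for a, e in zip(x, noise)])
    # Mix x and the residual so that y has SD 1 and correlation r with x.
    y = [r * a + math.sqrt(1 - r * r) * e for a, e in zip(x, resid)]
    return x, y

x, y = data_with_correlation(r=0.7, n=50, rng=random.Random(0))
print(correlation(x, y))
```

Changing the seed changes the cloud of points but not the correlation coefficient, which is the point of the exercise: very different-looking scatterplots can share the same value of r.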
After Chapters 3-5, I would recommend looking at the collection of Java applets to see how various concepts are presented graphically; in particular, be sure to see the applets for Venn diagrams, sampling distributions, confidence intervals, and the Law of Large Numbers. To see how tables of probabilities are eliminated, see the applets for the Normal Distribution, Student's t-Distribution, and the Chi-square Distribution. I would recommend then looking at Chapters 10, 11, 20, and 23.
Philip B. Stark is Professor of Statistics at the University of California, Berkeley, where he has been on the faculty since 1988. He received his bachelor's degree in Philosophy from Princeton University in 1980, and his PhD in Earth Science from the Scripps Institution of Oceanography in 1986. He received a National Science Foundation Postdoctoral Fellowship in Mathematical Sciences in 1987 and the Presidential Young Investigator Award in 1989. He was elected a Fellow of the Institute of Physics in 1999. Philip dropped out of high school and law school. He has served on the editorial boards of journals in applied mathematics, geophysics, and statistics, and has given over 130 invited lectures at conferences and universities in 17 countries. He is the author or co-author of over 70 technical papers.
Philip has done research in astrophysics, microwave cosmology, earthquake prediction, geomagnetism, geochemistry, seismic tomography, signal recovery, constrained confidence estimation, probability density estimation, spectrum estimation, information retrieval, inverse problems, adjusting the U.S. Census, causal inference, and human hearing. He specializes in problems with very large datasets; software written by him and his students performs part of the routine data reduction for a geomagnetic satellite and a network of solar telescopes.
Philip has consulted in IC mask manufacturing, oil exploration, water treatment, predicting e-mail spool fill, electrical activity of the brain, and targeted Internet advertising; and he has served as an expert witness in litigation and legislation on topics ranging from natural resources to the U.S. Census, the Child Online Protection Act (sampling the Internet and testing content filters, which involved the controversial subpoena of search records and indexed webpages from Google, Yahoo!, and MSN), consumer protection, employment discrimination, insurance, product liability, property tax assessment, truth in advertising, marketing, equal protection, trade secrets, intellectual property, risk assessment, wage and hour disputes, and anti-trust. He has testified to the U.S. House of Representatives Subcommittee on the Census, to the California State Senate Natural Resources and Wildlife Committee, and to the California Department of Fish and Game. He has consulted for the U.S. Department of Justice, the Federal Trade Commission, the U.S. Department of Agriculture, the U.S. Census Bureau, the U.S. Attorney’s Office of the Northern District of California, the U.S. Department of Veterans Affairs, the Los Angeles County Superior Court, the National Solar Observatory, the California Secretary of State, public utilities, major corporations, and numerous law firms, including six of the 25 largest. Philip has served on the Technical Advisory Boards of two online publishing firms, an online data mining company and a photo search engine. He was the Faculty Assistant for Educational Technology at the University of California, Berkeley, from 2001 to 2003 and chaired the U.C. Berkeley Educational Technology Committee from 2001 to 2005.
Philip does not like to be called "Phil." He advocates open-source software, wears sandals whenever possible, runs 100-mile endurance trail races, roasts his own coffee beans for espresso, and thinks this book is proof that obsessive-compulsive disorder is a job qualification. Philip lives in Berkeley, California, with his laptop, his cell phone, and six muddy pairs of running shoes.
This project would not have been possible without Ofer Licht, who gave competent, intelligent, and congenial answers to my then-noobie questions about Java and JavaScript, pointed me to lots of useful material, and who wrote the original server-side Perl cgi scripts for grading homework and querying the grade database. Duncan Temple Lang was also a helpful and sympathetic resource regarding the intricacies and nuisances of Java 1.0; in addition, he is primarily responsible for the multi-threaded data server used to load large data sets (home-grown technology that anticipates Ajax). I am grateful to Rudy Guerra, who co-wrote an earlier version of the chapter on counting and the assignment on experiments. David A. Freedman, and the excellent "dead-tree book" Statistics by Freedman, Pisani, and Purves, were inspirational. Deirdre Lynch made several valuable suggestions regarding the user interface, and Sydney Jones was extremely helpful in identifying problems with flow, organization, consistency, and prose. This book is dedicated to Alessandra and Naomi.