# HCC Statistics Worksheet

Here are a few hints for these problems.

For problem 2: Set up a probability distribution table for the winnings. Let Z = winnings, as in the table below. Fill in the probabilities given in the problem, then set E(Z) = 0 to see what the relationship is between x and y.

| Z    | x | -y |
|------|---|----|
| P(Z) |   |    |

For #3: This is an example of going from Bernoulli to Binomial; see my example about the audit in lecture 6 from June 17th. For part b, you do not need to write out all of the sequences, but you do need to know how many there are and the probability of each sequence.

For #6: To determine the probability statement, ignore the fact that it is over a week; think about the fact that they are going to let the first one go by. The hint about the week concerns what the parameter should be.

If you have any other questions about these let me know. I can add more hints if there is confusion.
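As a generic sketch of the E(Z) computation for a table like the one above — the values below are invented for illustration, not the ones from the problem — the expected value of a discrete distribution can be checked in R:

```r
# Hypothetical winnings table: these numbers are made up for illustration.
# Use the x, y, and probabilities given in the actual problem.
z  <- c(5, -3)       # possible winnings: win x = 5 or lose y = 3
p  <- c(0.4, 0.6)    # their probabilities (they must sum to 1)
EZ <- sum(z * p)     # E(Z) = sum over values of z * P(Z = z)
EZ                   # 5*0.4 + (-3)*0.6 = 0.2
```

For a fair game E(Z) = 0, which forces a relationship of the form x · P(win) = y · P(lose).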


Statistics for the Sciences

Charles Peters

Contents

1 Background
1.1 Overview and Basic Concepts
1.2 Populations, Samples and Variables
1.3 Types of Variables
1.4 Distributions
1.5 Random Experiments
1.6 Sample Spaces
1.7 Computing in Statistics
1.8 Data Sources
1.9 Exercises

2 Descriptive and Graphical Statistics
2.1 Location Measures
2.1.1 The Mean
2.1.2 Repeated Values
2.1.3 The Median
2.1.4 Other Quantiles
2.1.5 Trimmed Means
2.1.6 Robustness
2.1.7 The Five Number Summary
2.1.8 Exercises
2.2 Grouped Data, Histograms, and Cumulative Frequency Diagrams
2.2.1 Frequency Tables
2.2.2 Histograms
2.2.3 Cumulative Frequency Diagrams
2.2.4 Exercises
2.3 Measures of Variability or Scale
2.3.1 The Variance and Standard Deviation
2.3.2 The Mean and Median Absolute Deviation
2.3.3 The Interquartile Range
2.3.4 Exercises
2.4 Boxplots
2.4.1 Exercises
2.5 Factor Variables and Barplots
2.5.1 Tabulated Factor Variables
2.5.2 Exercises
2.6 Jointly Distributed Variables
2.6.1 Two Factor Variables
2.6.2 One Factor and One Numeric Variable
2.6.3 Two Numeric Variables
2.6.4 Exercises

3 Probability
3.1 Background
3.2 Equally Likely Outcomes
3.3 Combinations of Events
3.3.1 Exercises
3.4 Rules for Probability Measures
3.5 Counting Outcomes. Sampling with and without Replacement
3.5.1 Exercises
3.6 Conditional Probability
3.6.1 Relating Conditional and Unconditional Probabilities
3.6.2 Bayes’ Rule
3.7 Independent Events
3.7.1 Exercises
3.8 Replications of a Random Experiment

4 Discrete Distributions
4.1 Random Variables
4.2 Discrete Random Variables
4.3 Expected Values of Discrete Variables
4.3.1 Exercises
4.4 Bernoulli Random Variables
4.4.1 The Mean and Variance of a Bernoulli Variable
4.5 Binomial Random Variables
4.5.1 The Mean and Variance of a Binomial Distribution
4.6 Cumulative Distributions
4.6.1 Exercises
4.7 Hypergeometric Distributions
4.7.1 The Mean and Variance of a Hypergeometric Distribution
4.8 Poisson Distributions
4.8.1 The Mean and Variance of a Poisson Distribution
4.8.2 The Poisson Approximation to the Binomial Distribution
4.8.3 Exercises
4.9 Jointly Distributed Variables
4.9.1 Covariance and Correlation
4.10 Multinomial Distributions
4.10.1 Exercises

5 Continuous Distributions
5.1 Density Functions
5.2 Expected Values and Quantiles for Continuous Distributions
5.2.1 Expected Values
5.2.2 Quantiles
5.2.3 Exercises
5.3 Uniform Distributions
5.3.1 The Mean, Variance and Quantile Function of a Uniform Distribution
5.3.2 Exercises
5.4 Exponential Distributions and Their Relatives
5.4.1 Exponential Distributions
5.4.2 Gamma Distributions
5.4.3 Weibull Distributions
5.4.4 Exercises
5.5 Normal Distributions
5.5.1 Tables of the Standard Normal Distribution
5.5.2 Other Normal Distributions
5.5.3 The Normal Approximation to the Binomial Distribution
5.5.4 Exercises

6 Joint Distributions and Sampling Distributions
6.1 Introduction
6.2 Jointly Distributed Continuous Variables
6.2.1 Covariance and Correlation
6.2.2 Bivariate Normal Distributions
6.3 Mixed Joint Distributions
6.4 Independent Random Variables
6.4.1 Exercises
6.5 Sums of Random Variables
6.5.1 Simulating Random Samples
6.6 The Central Limit Theorem
6.6.1 Exercises
6.7 Other Distributions Associated with Normal Sampling
6.7.1 Chi Square Distributions
6.7.2 Student t Distributions
6.7.3 The Joint Distribution of the Sample Mean and Variance
6.7.4 Exercises

7 Statistical Inference for a Single Population
7.1 Introduction
7.2 Estimation of Parameters
7.2.1 Estimators
7.2.2 Desirable Properties of Estimators
7.3 Estimating a Population Mean
7.3.1 Finding the Required Sample Size
7.3.2 Confidence Intervals
7.3.3 Small Sample Confidence Intervals for a Normal Mean
7.3.4 Exercises
7.4 Estimating a Population Proportion
7.4.1 Choosing the Sample Size
7.4.2 Confidence Intervals for p
7.4.3 Exercises
7.5 Estimating Quantiles
7.5.1 Exercises
7.6 Estimating the Variance and Standard Deviation
7.6.1 Exercises

8 Hypothesis Testing
8.1 Introduction
8.2 Test Statistics – Type 1 and Type 2 Errors
8.3 Hypotheses About a Population Mean
8.3.1 Large sample tests for the mean when the variance is unknown
8.3.2 Student t Tests for Small Samples
8.4 p-values
8.4.1 Using R’s t.test function
8.4.2 Exercises
8.5 Hypotheses About a Population Proportion
8.5.1 Exercises

9 Regression and Correlation
9.1 Examples of Linear Regression Problems
9.2 Least Squares Estimates
9.2.1 The "lm" Function in R
9.2.2 Exercises
9.3 Distributions of the Least Squares Estimators
9.3.1 Exercises
9.4 Inference for the Regression Parameters
9.4.1 Confidence Intervals for the Parameters
9.4.2 Hypothesis Tests for the Parameters
9.4.3 Exercises
9.5 Correlation
9.5.1 Confidence intervals for ρ
9.5.2 Exercises

10 Inferences on Two Groups or Populations
10.1 Large Sample Comparison of Means
10.2 Large Sample Comparison of Proportions
10.3 Testing Equality of Population Proportions
10.3.1 Comparing Proportions with R
10.4 Small Sample Comparison of Normal Means
10.4.1 The Welch Test and Confidence Interval
10.4.2 The t-test with Equal Variances
10.4.3 Exercises
10.5 Paired Observations
10.5.1 Crossover Studies
10.5.2 Estimating the Size of the Effect
10.5.3 Exercises

11 Analysis of Variance
11.1 Single Factor Analysis of Variance
11.1.1 Mathematical Description of One Way Anova
11.1.2 Anova Using R
11.1.3 Multiple Comparisons
11.1.4 Exercises
11.2 Two-Way Analysis of Variance
11.2.1 Two-way ANOVA with Replications
11.2.2 The Additive Model
11.2.3 Exercises

12 Analysis of Categorical Data
12.1 Multinomial Distributions
12.1.1 Estimators and Hypothesis Tests for the Parameters
12.1.2 Multinomial Probabilities That Are Functions of Other Parameters
12.1.3 Exercises
12.2 Testing Equality of Multinomial Probabilities
12.3 Independence of Attributes: Contingency Tables
12.3.1 Exercises

13 Miscellaneous Topics
13.1 Multiple Linear Regression
13.1.1 Inferences Based on Normality
13.1.2 Using R's "lm" Function for Multiple Regression
13.1.3 Factor Variables as Predictors
13.1.4 Exercises
13.2 Nonparametric Methods
13.2.1 The Signed Rank Test
13.2.2 The Mean and Variance of V and V+
13.2.3 Confidence Intervals for the Location Parameter ∆
13.2.4 Exercises
13.2.5 The Wilcoxon Rank Sum Test
13.2.6 Estimating the Shift Parameter
13.2.7 Exercises
13.3 Bootstrap Confidence Intervals
13.3.1 Exercises


Chapter 1

Background

1.1 Overview and Basic Concepts

Statistics is the art of summarizing data, depicting data, extracting information from data, and inferring properties of data sources. Statistics and the theory of probability are often conflated in popular discussion. They are distinct subjects, although statistics depends on probability to quantify the strength of its inferences. Only a small amount of probability is required for this book. It will be developed in Chapter 3 and throughout the text as needed. We begin by introducing some basic ideas and terminology.

1.2 Populations, Samples and Variables

A population is a set of individual elements whose collective properties are the subject of investigation. Usually, populations are large collections whose individual members cannot all be examined in detail. In statistical inference a manageable subset of the population is selected according to certain sampling procedures and properties of the subset are generalized to the entire population. These generalizations are accompanied by statements quantifying their accuracy and reliability. The selected subset is called a sample from the population.

Examples:

(a) the population of registered voters in a congressional district,

(b) the population of U.S. adult males,

(c) the population of currently enrolled students at a certain large urban university,

(d) the population of all transactions in the U.S. stock market for the past month,

(e) the population of all peak temperatures at points on the Earth’s surface over a given time interval.

Some samples from these populations might be:

(a) the voters contacted in a pre-election telephone poll,


(b) adult males interviewed by a TV reporter,

(c) the dean’s list,

(d) transactions recorded on the books of Morgan Stanley,

(e) peak temperatures recorded at several weather stations.

Clearly, for these particular samples, some generalizations from sample to population would be highly questionable. With the possible exception of (e), none of these subsets would be accepted as scientifically valid samples allowing generalization to the entire population. Researchers have devised a number of procedures for obtaining samples that allow generalization with a specified degree of confidence. The names of some common procedures are simple random sampling, cluster sampling, and stratified sampling. We will focus for the most part on procedures based on simple random sampling. Simple random sampling will be described in Chapters 3 and 4.

A population variable is an attribute that has a value for each individual in the population. In other words, it is a function from the population to some set of possible values. It may be helpful to imagine a population as a spreadsheet with one row or record for each individual member. Along the ith row, the values of a number of attributes of the ith individual are recorded in different columns. The column headings of the spreadsheet can be thought of as the population variables. For example, if the population is the set of currently enrolled students at the urban university, some of the variables are academic classification, number of hours currently enrolled, total hours taken, grade point average, gender, ethnic classification, major, and so on. Variables such as these, defined for the same population, are said to be jointly observed or jointly distributed. General results pertaining to jointly distributed variables are presented in Chapter 6.

1.3 Types of Variables

Variables are classified according to the kinds of values they have. The three basic types are numeric variables, factor variables, and ordered factor variables. Numeric variables are those for which arithmetic operations such as addition and subtraction make sense. Numeric variables are often related to a scale of measurement and expressed in units, such as meters, seconds, or dollars. Factor variables are those whose values are mere names, to which arithmetic operations do not apply. Factors usually have a small number of possible values. These values might be designated by numbers. If they are, the numbers that represent distinct values are chosen merely for convenience. The values of factors might also be letters, words, or pictorial symbols. Factor variables are also called nominal variables or categorical variables. Ordered factor variables are factors whose values are ordered in some natural and important way. Some textbooks have a more elaborate classification of variables, with various subtypes. The three types above are enough for our purposes.

Examples: Consider the population of students currently enrolled at a large university. Each student has a residency status, either resident or nonresident. Residency status is an unordered factor variable. Academic classification is an ordered factor with values “freshman”, “sophomore”, “junior”, “senior”, “post-baccalaureate” and “graduate student”. The number of hours enrolled is a numeric variable with integer values. The distance a student travels from home to campus is a numeric variable expressed in miles or kilometers. Home area code is an unordered factor variable whose values happen to be designated by numbers.

1.4 Distributions

Let X be the name of a population variable such as the distance from home to campus for students at the large urban university. The values assumed by X for the individual members of the population have a distribution. By this we mean that if a particular subset of possible values of X is given, there is a certain proportion of the members of the population whose X-values belong to that subset. This assignment of proportions to sets of possible values is called the distribution of X.

For very large populations or for variables with many distinct values, statisticians may employ mathematical models of distributions. One of the goals of statistical inference is to find the best mathematical model from a given class of models. In Chapter 4 we describe some of the most useful discrete model distributions, including the binomial, hypergeometric, Poisson, and multinomial distributions. These are mainly for numeric variables that have integer values or for factor variables. In Chapter 5 we discuss some of the continuous model distributions: exponential, normal, gamma and others. These are mainly for numeric variables recorded with high precision.
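R has ready-made functions for the model distributions named above. As a quick illustration — the particular arguments here are chosen arbitrarily, just to show the naming scheme (d* for a probability or density, p* for a cumulative probability):

```r
# Probabilities and densities from some built-in model distributions
dbinom(3, size = 10, prob = 0.5)   # P(X = 3) for a binomial(10, 0.5)
dpois(2, lambda = 4)               # P(X = 2) for a Poisson with mean 4
dnorm(0)                           # standard normal density at 0
pnorm(1.96)                        # P(Z <= 1.96), about 0.975
```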

1.5 Random Experiments

An experiment can be something as simple as flipping a coin or as complex as conducting a public opinion poll. A random experiment is one with the following two characteristics:

(1) The experiment can be replicated an indefinite number of times under essentially the same experimental conditions.

(2) There is a degree of uncertainty in the outcome of the experiment. The outcome may vary from replication to replication even though experimental conditions are the same.

When we say that an experiment can be replicated under the same conditions, we mean that controllable or observable conditions that we think might affect the outcome are the same. There may be hidden conditions that affect the outcome, but we cannot account for them. Implicit in (1) is the idea that replications of a random experiment are independent, that is, the outcomes of some replications do not affect the outcomes of others. Obviously, a random experiment is an idealization of a real experiment. Some simple experiments, such as tossing a coin, approach this ideal closely while more complicated experiments may not.
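Independent replications are easy to mimic on a computer. A minimal sketch in R, simulating ten replications of tossing a fair coin with a head coded as 1 and a tail as 0:

```r
# Simulate 10 independent tosses of a fair coin
set.seed(1)                                       # for reproducibility
tosses <- sample(c(0, 1), size = 10, replace = TRUE)
tosses                                            # the 10 simulated outcomes
sum(tosses)                                       # total number of heads
```

Rerunning without `set.seed` gives different outcomes each time, which is exactly the replication-to-replication variability described in (2).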

There are two broad categories of random experiments. Designed experiments are those in which the researcher exercises control over some of the conditions that affect the outcome. Observational studies are experiments in which control is not possible, although variables associated with the primary outcome may be observed. Experimental design is an extensively studied aspect of statistics. We do not treat it here because it is impossible to do it justice in a relatively short introductory textbook and because the design of experiments can be left to more discipline-specific applied statistics courses.

Examples:

(a) An engineer compares several brands of 9 volt batteries by placing samples of each brand all under the same load and observing the times it takes each battery to decrease to several pre-assigned voltage levels. This is a designed experiment because the engineer controls the voltages at which she observes the times. The times observed for each battery are the outcomes of the experiment. Presumably if the experiment were to be replicated with different samples of the same brands, the observed times would be different. Hence, the outcomes of different replications are uncertain.

(b) Two astronomers investigate the spectral class (basically the surface temperature) and the absolute luminosity of a large sample of stars in an effort to discover a relationship between these variables. This is an observational study. The astronomers have no control over any stellar variables. Another astronomer observing a different sample of stars would observe different temperatures and luminosities. (N.B. This is an actual experiment carried out by Ejnar Hertzsprung and Henry Norris Russell in 1910. It resulted in a famous scatterplot known as the H-R diagram. We will encounter scatterplots in Chapter 2.)

1.6 Sample Spaces

The sample space of a random experiment is the set of all its possible outcomes. We use the Greek capital letter Ω (omega) to denote the sample space. There may be some degree of arbitrariness in the description of Ω depending on how the outcomes of the experiment are represented symbolically.

Examples:

(a) Toss a coin. Ω = {H, T}, where “H” denotes a head and “T” a tail. Another way of representing the outcome is to let the number 1 denote a head and 0 a tail (or vice-versa). If we do this, then Ω = {0, 1}. In the latter representation the outcome of the experiment is just the number of heads.

(b) Toss a coin 5 times, i.e., replicate the experiment in (a) 5 times. An outcome of this experiment is a 5 term sequence of heads and tails. A typical outcome might be indicated by (H,T,T,H,H), or by (1,0,0,1,1). Even for this little experiment it is cumbersome to list all the outcomes, so we use a shorter notation

Ω = {(x1, x2, x3, x4, x5) | for each i, xi = 0 or xi = 1}.
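For an experiment this small, R can enumerate Ω directly. A sketch using `expand.grid`, which builds every combination of its arguments:

```r
# Enumerate the sample space of 5 coin tosses: all 5-term sequences
# of 0s and 1s, one row per outcome
omega <- expand.grid(x1 = 0:1, x2 = 0:1, x3 = 0:1, x4 = 0:1, x5 = 0:1)
nrow(omega)    # 2^5 = 32 outcomes
head(omega)    # the first few outcomes
```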

(c) Select a student randomly from the population of all currently enrolled students. The sample space is the same as the population. Here, the word “randomly” is vague. We will give it a precise definition later.


(d) Repeat the Michelson-Morley experiment to measure the speed of the Earth relative to the ether (which doesn’t exist, as we now know). The outcome of the experiment could conceivably be any nonnegative number, so we take Ω = [0, ∞) = {x | x is a real number and x ≥ 0}. Uncertainty arises from the fact that this is a very delicate experiment with several sources of unpredictable error.

(e) Randomly select a 5 card draw poker hand from the standard deck of 52 cards. The sample space is the collection of all subsets of size 5 of the set of 52 cards in a standard deck.

(f) Replicate the Hertzsprung-Russell experiment with a sample of n stars. The sample space is the set of all n-term sequences of pairs of positive numbers (xi, yi), i = 1, · · · , n, where xi is the surface temperature of the ith star and yi is its luminosity. Presumably, xi and yi could be any positive numbers.

1.7 Computing in Statistics

Even moderately large data sets cannot be managed effectively without a computer and computer software. Furthermore, much of applied statistics is exploratory in nature and cannot be carried out by hand, even with a calculator. Spreadsheet programs, such as Microsoft Excel, are designed to manipulate data in tabular form and have functions for performing the common tasks of statistics. In addition, many add-ins are available, some of them free, for enhancing the graphical and statistical capabilities of spreadsheet programs. Because it is so common in the business world, it is very beneficial for students to have some experience with Excel or a similar program.

The disadvantages of spreadsheet programs are their dependence on the spreadsheet data format with cell ranges as input for statistical functions, their lack of flexibility, and their relatively poor graphics. Many highly sophisticated packages for statistics and data analysis are available. Some of the best known commercial packages are Minitab, SAS, SPSS, Splus, Stata, and Systat. The package used in this text is called R. It is an open source implementation of the same language used in Splus and it may be downloaded free at http://www.r-project.org.

Since its first release in 2000, R has grown explosively, both in capabilities and in worldwide popularity. It is now one of the most widely used packages for science and engineering applications. For the foreseeable future it will be one of the top data analysis systems available.

After downloading and installing R, we recommend that you download and install another free package called Rstudio. It can be obtained from http://www.rstudio.com.

Rstudio makes importing data into R much easier and makes it easier to integrate R output with other programs and applications.


Our approach to teaching R is to teach by example. Detailed instructions on using R and Rstudio for some of the exercises will be provided. R solutions for many of the exercises may readily be obtained by emulating examples worked out in the text. There are many instructional videos for R on YouTube. There are also abundant internet resources which you will find if you Google "R". Two of the most useful are the manual provided by the R project,

https://cran.r-project.org/doc/manuals/R-intro.html

and the simpleR documentation by John Verzani at

https://cran.r-project.org/doc/contrib/Verzani-SimpleR.pdf


1.8 Data Sources

Data files used in this course are from four sources. Some are local in origin and come from student or course data at the University of Houston. Others are simulated but made to look as realistic as possible. These and others are available at

http://www.math.uh.edu/ charles/data

Many data sets are included with R in the datasets library and other contributed packages. We will refer to them frequently. The main external sources of data are the data archives maintained by the Journal of Statistics Education,

www.amstat.org/publications/jse

and the Statistical Science Web:

http://www.stasci.org/datasets.html

1.9 Exercises

1. Go to http://www.math.uh.edu/ charles/data. Examine the data set "Air Pollution Filter Noise". Identify the variables and give their types.

2. Highlight the data in Air Pollution Filter Noise. Include the column headings but not the language preceding the column headings. Copy and paste the data into a plain text file, for example with Notepad in Windows. Import the text file into Excel or another spreadsheet program. Create a new folder or directory named "math3339" and save both files there.

3. Start R by double clicking on the big blue R icon on your desktop. Click on the file menu at the top of the R Gui window. Select "change dir . . . ". In the window that opens next, find the name of the directory where you saved the text file and double click on the name of that directory. Suppose that you named your file "apfilternoise". (Name it anything you like.) Import the file into R with the command

> apfilternoise=read.table("apfilternoise.txt",header=T)

and display it with the command

> apfilternoise

Click on the file menu at the top again and select "Exit". At the prompt to save your workspace, click "Yes". If you open the folder where your work was saved you will see another big blue R icon. If you double click on it, R will start again and your previously saved workspace will be restored.

If you use Rstudio for this exercise you can import apfilternoise into R by clicking on the "Import Dataset" tab. This will open a window on your file system and allow you to select the file you saved in Exercise 2. The dialog box allows you to rename the data and make other minor changes before importing the data as a data frame in R.

4. If you are using Rstudio, click on the "Packages" tab and then the word "datasets". Find the data set "airquality" and click on it. Read about it. If you are using R alone, type

> help(airquality)

at the command prompt > in the Console window. Then type

> airquality

to view the data. Could "Month" and "Day" be considered ordered factors rather than numeric variables?
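As a pointer for this exercise, Month can be converted to an ordered factor as sketched below; the month labels are my own and are not part of the data set.

```r
# Sketch: treating Month in the airquality data set as an ordered factor.
aq <- airquality
aq$Month <- factor(aq$Month, levels = 5:9,
                   labels = c("May", "Jun", "Jul", "Aug", "Sep"),  # assumed labels
                   ordered = TRUE)
summary(aq$Month)   # a table of counts per month rather than a numeric summary
```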

5. A random experiment consists of throwing a standard 6-sided die and noting the number of spots on the upper face. Describe the sample space of this experiment.

6. An experiment consists of replicating the experiment in exercise 5 four times. Describe the sample space of this experiment. How many possible outcomes does this experiment have?

7. A random experiment consists of tossing a coin 4 times. Describe the sample space of this experiment. In what proportion of all outcomes of the experiment will there be exactly 2 heads?

8. The airquality data set has 153 rows, one for each day in May through September of 1973. One of the variables is named "Wind", for wind speed. We will calculate some values of the distribution of Wind using R. Suppose we are interested in the proportion of days for which the wind speed was greater than 12. Attach the airquality data frame to your R workspace with the command

> attach(airquality)

This allows you to address the variable Wind in R without going through bothersome intermediate steps. The number of days for which Wind was greater than 12 is given by

> sum(Wind > 12)

and the proportion is

> sum(Wind > 12)/153

Find the proportion of days for which Wind is less than or equal to 10. (Hint: Less than or equal in R is denoted by <=.)

The mean is a location measure: if a and b > 0 are constants and y = a + bx, then ȳ = a + bx̄. Other location measures introduced below behave in the same way.
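This property is easy to check numerically in R; the vector below is made up for illustration.

```r
# If y = a + b*x with b > 0, the mean transforms the same way.
x <- c(2, 4, 9)
a <- 10; b <- 2
y <- a + b*x
mean(y)         # 20
a + b*mean(x)   # also 20
```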

2.1.2 Repeated Values

When there are repeated values of x, there is an equivalent formula for the mean. Let the m distinct values of x be denoted by v1, . . . , vm. Let ni be the number of times vi is repeated and let fi = ni/n. Note that n1 + · · · + nm = n and f1 + · · · + fm = 1. Then the average is given by

x̄ = f1v1 + · · · + fmvm = (1/n)(n1v1 + · · · + nmvm).  (2.2)

The number ni is the frequency of the value vi and fi is its relative frequency.
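Formula (2.2) can be checked with R's table( ) function; the data here are made up.

```r
# Computing the mean from the frequencies of the distinct values, as in (2.2).
x <- c(2, 2, 3, 5, 5, 5)
tab <- table(x)                  # frequencies n_i of the distinct values v_i
v <- as.numeric(names(tab))      # distinct values: 2 3 5
f <- as.numeric(tab)/length(x)   # relative frequencies f_i
sum(f*v)                         # 3.666667, the same as mean(x)
```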

2.1.3 The Median

Let x be a numeric variable with values x1, x2, . . . , xn. Arrange the values in increasing order x(1) ≤ x(2) ≤ . . . ≤ x(n). The median of x is a number median(x) such that at least half the values of x are ≤ median(x) and at least half the values of x are ≥ median(x). This is the essential idea, but unfortunately there may be an interval of numbers that satisfy this definition rather than a single number. The ambiguity is usually resolved by taking the median to be the midpoint of that interval. Thus, if n is odd, n = 2k + 1, where k is a positive integer, and

median(x) = x(k+1),  (2.3)

while if n is even, n = 2k, and

median(x) = (x(k) + x(k+1))/2.  (2.4)

2.1.4 Other Quantiles

Let p ∈ (0, 1) be a number between 0 and 1. The pth quantile of x is more commonly known as the 100pth percentile; e.g., the 0.8 quantile is the same as the 80th percentile. We define it as a number q(x, p) such that the fraction of values of x that are ≤ q(x, p) is at least p and the fraction of values of x that are ≥ q(x, p) is at least 1 − p. For example, at least 80 percent of the values of x are ≤ the 80th percentile of x and at least 20 percent of the values of x are ≥ its 80th percentile. Again, this may not define a unique number q(x, p). Software packages such as R have rules for resolving the ambiguity, but the details are usually not important. If you are doing an exercise by hand, any number that satisfies the criteria for q(x, p) is an acceptable answer. If you are using R, you may accept the answer returned by R.

The median is the 50th percentile, i.e., the 0.5 quantile q(x, 0.50). The 25th and 75th percentiles q(x, 0.25) and q(x, 0.75) are called the first and third quartiles. The 10th, 20th, 30th, etc. percentiles are called the deciles. All quantiles, including the median, are location measures as defined above: if y = a + bx, where a and b > 0 are constants, then q(y, p) = a + bq(x, p).

2.1.5 Trimmed Means

Trimmed means of a variable x are obtained by finding the mean of the values of x excluding a given percentage of the largest and smallest values. For example, the 5% trimmed mean is the mean of the values of x excluding the largest 5% of the values and the smallest 5% of the values. In other words, it is the mean of all the values between the 5th and 95th percentiles of x. A trimmed mean is a location measure.

2.1.6 Robustness

A robust measure of location is one that is not affected by a few extremely large or extremely small values. Values of a numeric variable that lie a great distance from most of the other values are called outliers. Outliers might be the result of mistakes in measuring or recording data, perhaps from misplacing a decimal point. The mean is not a robust location measure. It can be affected significantly by a single extreme outlier if that outlying value is extreme enough. Thus, if there is any doubt about the quality of the data, the median or a trimmed mean might be preferred to the mean as a reliable location measure. The median is very insensitive to outliers. A 5% trimmed mean is insensitive to outliers that make up no more than 5% of the data values.

Example 2.1. "mydata" is a numeric vector of 21 made-up values.

> mydata
 [1]  1  5  5  6  7  7  8 12 12 15 15 18 22 22 23 24 28 29 35 36 53

You can enter the data into your own R workspace with R's "scan" function, as follows.

> mydata=scan( )
1: 1 5 5 6 7 7 8 12 12 15
11: 15 18 22 22 23 24 28 29 35 36 53
22:
Read 21 items

When you call "scan( )", R will respond with the prompt 1:, meaning that it expects you to enter the first data value. After entering the first data value, enter as many as you like, separated by blank spaces. You can hit the enter key at any time to start a new line of input. This is convenient if there are a lot of data values to be entered. After entering the last data value hit the enter key twice to signal the end of the input. To check for errors, simply type the name of the data object.

> mydata
 [1]  1  5  5  6  7  7  8 12 12 15 15 18 22 22 23 24 28 29 35 36 53

To find the mean of mydata, add the values together and divide by 21. Do this with your calculator or by hand. Compare the answer to that given by R.

> mean(mydata)
[1] 18.2381

Now let us find the median of mydata. Since n = 21 = 2 × 10 + 1 is odd, we take the 11th value in increasing order as the median.

median(mydata) = mydata(11) = 15.

This is confirmed by R:

> median(mydata)
[1] 15

Next, we will find the 25th percentile q(mydata, 0.25). According to the definition, we are looking for a number q such that at least 25% of the data values are ≤ q and at least 75% are ≥ q. At least 25% of 21 means at least 5.25, but since we must have a whole number, at least 6 values must be ≤ q. Likewise, at least 15.75, or at least 16, values must be ≥ q. The unique number satisfying these criteria is q = 7. Thus,

q(mydata, 0.25) = 7

This agrees with R.

> quantile(mydata,0.25)
25%
  7

Notice that there are repeated values of mydata. The distinct values and their frequencies can be obtained with R using the "table" function.

> table(mydata)
mydata
 1  5  6  7  8 12 15 18 22 23 24 28 29 35 36 53
 1  2  1  2  1  2  2  1  2  1  1  1  1  1  1  1

The first row of this table shows the distinct values vi and the second row shows their frequencies ni. We can use the formula (2.2) for the mean when values are repeated.

mean(mydata) = (1/21)(1 × 1 + 2 × 5 + 1 × 6 + 2 × 7 + · · · ) = 18.2381.

Finally, let us calculate the 5% trimmed mean of mydata. Similar to the ambiguity in the definition of the quantiles, there is some ambiguity in the definition of the trimmed mean. Five percent of 21 is not a whole number, so we must round up or down and eliminate that number of the largest and the smallest values to calculate the trimmed mean. R's convention is to round down. Thus, we eliminate the largest and also the smallest value of mydata before calculating the mean. The 5% trimmed mean is

(1/19)(5 + 5 + 6 + 7 + 7 + · · · + 29 + 35 + 36) = 17.31579.

The "mean" function in R is also good for trimmed means. You simply have to tell R the degree of trimming.

> mean(mydata,trim=.05)
[1] 17.31579

2.1.7 The Five Number Summary

The five number summary is a convenient way of summarizing numeric data. The five numbers are the minimum value, the first quartile (25th percentile), the median, the third quartile (75th percentile), and the maximum value. The R function is "fivenum".

> fivenum(mydata)
[1]  1  7 15 24 53

R has another, more useful function "summary" which returns the five numbers and, in addition, the mean.

> summary(mydata)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   1.00    7.00   15.00   18.24   24.00   53.00

Example 2.2. The gap between the largest value of mydata, 53, and the next largest, 36, is much greater than the gap between any other two consecutive values. The largest value might be considered an outlier. Suppose that experimenters decide to discard that observation as a possible error. R has a neat trick for discarding one or more values. Since 53 is the 21st component of mydata, you can discard it as follows.

> mydata
 [1]  1  5  5  6  7  7  8 12 12 15 15 18 22 22 23 24 28 29 35 36 53
> mydata[-21]
 [1]  1  5  5  6  7  7  8 12 12 15 15 18 22 22 23 24 28 29 35 36

Now compare the summaries.

> summary(mydata)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   1.00    7.00   15.00   18.24   24.00   53.00
> summary(mydata[-21])
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   1.00    7.00   15.00   16.50   23.25   36.00

Notice that the median has not changed with the elimination of the outlier, but the mean has. This illustrates the greater robustness of the median as a location measure.

2.1.8 Exercises

The reacttimes data set has 50 observations of human reaction times to a physical stimulus. The reaction times are named Times and arranged in increasing order below.

0.12 0.30 0.35 0.37 0.44 0.57 0.61 0.62 0.71 0.80 0.88 1.02 1.08 1.12 1.13 1.17 1.21 1.23 1.35
1.41 1.42 1.42 1.46 1.50 1.52 1.54 1.60 1.61 1.68 1.72 1.86 1.90 1.91 2.07 2.09 2.16 2.17 2.20
2.29 2.32 2.39 2.47 2.60 2.86 3.43 3.43 3.77 3.97 4.54 4.73

1. Find the mean and median of Times without using R. You may use your calculator.

2. Import reacttimes into your workspace as a data frame with a single variable Times. Attach the reacttimes data frame to your workspace with

> attach(reacttimes)

Then calculate the mean of Times by using R's mean( ) function and the median with R's median( ) function.

3. Find the 60th percentile of Times without using R. There may be more than one acceptable answer.

4. Find the 60th percentile of Times using R’s quantile( ) function.

5. Find the 40th percentiles of mydata and mydata[-21] by hand and also by using R.

6. Find the 5% trimmed mean of Times.

7. Find the five number summary of Times.

8. The 40th value Times(40) of the reaction time data is 2.32. Change it to 232.0 and recalculate the mean and median. You can make the change in R by

> Times[40]=232.0

Change it back after you are finished with this exercise.

2.2 Grouped Data, Histograms, and Cumulative Frequency Diagrams

2.2.1 Frequency Tables

Large data sets are often summarized by grouping values. Let x be a numeric variable with values x1, x2, . . . , xn. Choose numbers c0 < c1 < . . . < cm such that all the values of x are between c0 and cm. For each i, let ni be the number of values of x (including repetitions) that are in the interval (ci−1, ci], i.e., the number of indices j such that ci−1 < xj ≤ ci. A frequency table of x is a table showing the class intervals (ci−1, ci] along with the frequencies ni with which the data values fall into each interval. Sometimes additional columns are included showing the relative frequencies fi = ni/n, the cumulative relative frequencies Fi = f1 + · · · + fi, and the midpoints of the class intervals.
Example 2.3. The reaction time data is repeated below.
0.12 0.30 0.35 0.37 0.44 0.57 0.61 0.62 0.71 0.80 0.88 1.02 1.08 1.12 1.13 1.17 1.21 1.23 1.35
1.41 1.42 1.42 1.46 1.50 1.52 1.54 1.60 1.61 1.68 1.72 1.86 1.90 1.91 2.07 2.09 2.16 2.17 2.20
2.29 2.32 2.39 2.47 2.60 2.86 3.43 3.43 3.77 3.97 4.54 4.73
We choose 5 class intervals of equal length 1 unit, beginning with c0 = 0 and ending with c5 = 5.

Interval   Midpoint   ni   fi     Fi
(0,1]      0.5        11   0.22   0.22
(1,2]      1.5        22   0.44   0.66
(2,3]      2.5        11   0.22   0.88
(3,4]      3.5         4   0.08   0.96
(4,5]      4.5         2   0.04   1.00
With only a frequency table like the one above, the mean and median of the original data cannot be calculated exactly. However, they can be estimated. If we take the midpoint of an interval as a stand-in for all the values in that interval, then we can use the formula in the preceding section for calculating a mean with repeated values. Thus, in the example above, we would estimate the mean as

0.22(0.5) + 0.44(1.5) + 0.22(2.5) + 0.08(3.5) + 0.04(4.5) = 1.78

Estimating the median is a bit more difficult. By examining the cumulative frequencies Fi, we see that 22% of the data is less than or equal to 1 and 66% of the data is less than or equal to 2. Therefore, the median lies between 1 and 2. That is, it is 1 + a certain fraction of the distance from 1 to 2. A reasonable guess at that fraction is given by linear interpolation between the cumulative frequencies at 1 and 2. In other words, we estimate the median as

1 + ((.50 − .22)/(.66 − .22))(2 − 1) = 1.636.

A cruder estimate of the median is just the midpoint of the interval that contains the median, in this case 1.5. We leave it as an exercise to calculate the mean and median of the reaction time data and to compare them to these approximations.
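Both estimates are easy to reproduce in R from the frequency table alone.

```r
# Grouped-data estimates of the mean and median of the reaction times.
mid <- c(0.5, 1.5, 2.5, 3.5, 4.5)        # class interval midpoints
f   <- c(0.22, 0.44, 0.22, 0.08, 0.04)   # relative frequencies
sum(f*mid)                               # estimated mean: 1.78
1 + (0.50 - 0.22)/(0.66 - 0.22)          # interpolated median: about 1.636
```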
2.2.2 Histograms
The figure below is a histogram of the reaction times.

> reacttimes=read.table("reacttimes.txt",header=T)
> hist(reacttimes$Times,breaks=0:5,xlab="Reaction Times")

[Figure: histogram of reacttimes$Times; x-axis "Reaction Times" from 0 to 5, y-axis "Frequency" from 0 to 20.]

The histogram is a graphical depiction of the grouped data. The end points ci of the class intervals are shown on the horizontal axis. This is an absolute frequency histogram because the heights of the vertical bars above the class intervals are the absolute frequencies ni. A relative frequency histogram would show the relative frequencies fi. A density histogram has bars whose heights are the relative frequencies divided by the lengths of the corresponding class intervals. Thus, in a density histogram the area of the bar is equal to the relative frequency. If all class intervals have the same length, these types of histograms all have the same shape and convey the same visual information.

No doubt you have noticed that the description of the class intervals in a frequency table and a histogram was very vague. The number of intervals can affect the appearance of the histogram significantly. Too many class intervals result in a "spiky" histogram that may emphasize spurious, accidental groupings of data values too much. Too few class intervals may obscure real features of the data distribution. The number of intervals is usually decided after some experimentation. A number of guidelines have been proposed. The default suggestion in R is Sturges' rule: m = 1 + log2(n), rounded up to the nearest integer. However, Sturges' rule is only a suggestion that probably will be overridden by the need to produce a histogram that is easily interpreted and pleasing to the eye. R routinely and intelligently violates Sturges' rule for just such reasons.
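For the 50 reaction times, Sturges' rule is easy to evaluate; nclass.Sturges( ) is R's built-in implementation, which depends only on the length of its argument.

```r
# Sturges' rule: m = 1 + log2(n), rounded up to the nearest integer.
n <- 50
ceiling(1 + log2(n))         # suggests 7 class intervals
nclass.Sturges(numeric(50))  # R's version gives the same answer
```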

In the example just above, the endpoints c0 = 0, c1 = 1, · · · , c5 = 5 of the class intervals were given by the optional "breaks" argument to the histogram function hist( ). If the "breaks" argument is omitted, R will choose its own class intervals.

> hist(reacttimes$Times,xlab="Reaction Times")

[Figure: histogram of reacttimes$Times with R's default class intervals; x-axis "Reaction Times" from 0 to 5, y-axis "Frequency" from 0 to 12.]

2.2.3 Cumulative Frequency Diagrams

To construct a cumulative frequency diagram, start with a frequency table of the grouped data as in Example 2.3. Let us suppose there are m class intervals with endpoints c0 < c1 < · · · < cm. Let F1, F2, · · · , Fm be the cumulative relative frequencies for the class intervals. Plot the points (c0, 0), (c1, F1), · · · , (cm, Fm) on a rectangular coordinate system and connect adjacent points with straight line segments. The result will look similar to the figure below.
> Fs=c(0, 0.22, 0.66, 0.88, 0.96, 1)
> plot(0:5,Fs,type="l",xlab=" ",ylab="Cumulative Relative Frequency")
> points(0:5,Fs)

[Figure: cumulative relative frequency diagram; the points (0, 0), (1, 0.22), . . . , (5, 1) joined by line segments, y-axis "Cumulative Relative Frequency" from 0.0 to 1.0, x-axis from 0 to 5.]

At any position a on the horizontal axis, the height of the curve at that point is the approximate proportion of data values that are less than or equal to a. The height is the exact proportion at the end points c0, · · · , cm of the class intervals. In the diagram above, the height of the curve at a = 2.5 is about 0.77. Therefore, about 77% of the data is ≤ 2.5.


The diagram can be used in an inverse manner to find the approximate values of quantiles. To approximate the pth quantile of the data, find the intersection of the curve with the horizontal line y = p. The horizontal coordinate of that point of intersection is the approximate pth quantile. In our example, the horizontal line y = 0.60 intersects the curve at a point with horizontal coordinate about 1.86. Therefore, the 60th percentile of the data is about 1.86.
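This interpolation can be carried out in R with the approx( ) function, using the Fs vector defined earlier.

```r
# Inverse linear interpolation on the cumulative frequency diagram.
Fs <- c(0, 0.22, 0.66, 0.88, 0.96, 1)
approx(x = Fs, y = 0:5, xout = 0.60)$y   # about 1.86, the 60th percentile
```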

[Figure: the cumulative relative frequency diagram, from which quantiles can be read as described above.]

2.2.4 Exercises

It is sometimes advantageous to transform data in some way, i.e., to define a new variable y as a function of the old variable x. We might want to do this so that we can more easily apply certain statistical inference procedures you will learn about later. A common transformation is the logarithmic transformation. The natural logarithms of the reaction times are, to two places:

-2.12 -1.20 -1.05 -0.99 -0.82 -0.56 -0.49 -0.48 -0.34 -0.22 -0.13 0.02 0.08 0.11 0.12 0.16 0.19
0.21 0.30 0.34 0.35 0.35 0.38 0.40 0.42 0.43 0.47 0.48 0.52 0.54 0.62 0.64 0.65 0.73 0.74 0.77
0.78 0.79 0.83 0.84 0.87 0.90 0.96 1.05 1.23 1.23 1.33 1.38 1.51 1.55

1. Attach "reacttimes" to your R workspace and verify the data above with

> attach(reacttimes)
> log(Times)

Summarize the new variable.

> summary(log(Times))

2. Use 9 class intervals of equal length beginning with c0 = −2.5 and ending with c9 = 2. Make a frequency table of log(Times) like the one in Example 2.3.

3. By hand, without using R, make a histogram of log(Times).

4. Use R to make a histogram of log(Times).

5. Estimate the mean and median of log(Times) from the grouped data. Compare to the answers given in the summary.


6. By hand, make a cumulative frequency diagram of log(Times). With it, estimate the 40th percentile of log(Times). Compare your answer to that returned by the quantile( ) function.

7. Import the data set www.math.uh.edu/ charles/data/FEV.txt into R as a data frame named "FEV". The variable "fev" is a set of 654 values of forced expiratory volume (a measure of lung capacity) for human subjects. With R make a histogram of fev. Allow R to choose its own class intervals. You can suggest a number of class intervals to R by using the "breaks" argument to the histogram function hist( ), e.g.,

> hist(FEV$fev,breaks=5)

R does not always accept your suggestion. Experiment with at least four different choices for the number of class intervals and comment on the results.

2.3 Measures of Variability or Scale

2.3.1 The Variance and Standard Deviation

Let x be a population variable with values x1, x2, . . . , xn. Some of the values might be repeated. The variance of x is

var(x) = σ² = (1/n)[(x1 − µ(x))² + · · · + (xn − µ(x))²].  (2.5)

The standard deviation of x is

sd(x) = σ = √var(x).  (2.6)

When x1, x2, . . . , xn are values of x from a sample rather than the entire population, we modify the definition of the variance slightly, use a different notation, and call these objects the sample variance and standard deviation.

s² = (1/(n − 1))[(x1 − x̄)² + · · · + (xn − x̄)²],  (2.7)

s = √s².  (2.8)

The reason for modifying the definition for the sample variance has to do with its properties as an estimate of the population variance.

Alternate algebraically equivalent formulas for the variance and sample variance are

σ² = (1/n)(x1² + · · · + xn²) − µ(x)²,

s² = (1/(n − 1))(x1² + · · · + xn² − n x̄²).

These are sometimes easier to use for hand computation.
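A quick numerical check of the second formula in R, on a made-up sample:

```r
# The shortcut formula agrees with R's var( ) function.
x <- c(1, 2, 6)
n <- length(x)
(sum(x^2) - n*mean(x)^2)/(n - 1)   # 7
var(x)                             # 7
```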

The standard deviation σ is called a measure of scale because of the way it behaves under linear transformations of the data. If a new variable y is defined by y = a + bx, where a and b are constants, sd(y) = |b|sd(x). For example, the standard deviation of Fahrenheit temperatures is 1.8 times the standard deviation of Celsius temperatures. The transformation y = a + bx can be thought of as a rescaling operation, or a choice of a different system of measurement units, and the standard deviation takes account of it in a natural way.
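The Fahrenheit/Celsius remark can be confirmed directly; the temperatures below are made up.

```r
# sd is a scale measure: sd(a + b*x) = |b|*sd(x).
celsius <- c(10, 15, 25)
fahrenheit <- 32 + 1.8*celsius
all.equal(sd(fahrenheit), 1.8*sd(celsius))   # TRUE
```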

2.3.2 The Mean and Median Absolute Deviation

Suppose that you must choose a single number c to represent all the values of a variable x as accurately as possible. One measure of the overall error with which c represents the values of x is

g(c) = √((1/n)[(x1 − c)² + · · · + (xn − c)²]).  (2.9)

In the exercises, you are asked to show that this expression is minimized when c = x̄. In other words, the single number which most accurately represents all the values is, by this criterion, the mean of the variable. Furthermore, the minimum possible overall error, by this criterion, is the standard deviation. However, this is not the only reasonable criterion. Another is

h(c) = (1/n)(|x1 − c| + · · · + |xn − c|).

It can be shown that this criterion is minimized when c = median(x). The minimum value of h(c) is called the mean absolute deviation from the median. It is a scale measure which is somewhat more robust (less affected by outliers) than the standard deviation, but still not very robust. A related, very robust measure of scale is the median absolute deviation from the median, or mad:

mad(x) = median(|x − median(x)|).  (2.10)

In R, the mad is adjusted by a constant factor of 1.4826. For data with a normal distribution the adjusted mad is equal to the standard deviation. Normal distributions are discussed in Chapter 5.
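The relationship between the hand computation and R's mad( ) can be seen on a small made-up example.

```r
# mad by hand versus R's mad( ), which multiplies by 1.4826 by default.
x <- c(1, 2, 4, 8, 20)
median(abs(x - median(x)))   # unadjusted mad: 3
mad(x, constant = 1)         # the same: 3
mad(x)                       # adjusted: 1.4826 * 3 = 4.4478
```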

2.3.3 The Interquartile Range

The interquartile range of a variable x is the difference between its 75th and 25th percentiles:

IQR(x) = q(x, .75) − q(x, .25)  (2.11)

It is a robust measure of scale which is important in the construction and interpretation of boxplots, discussed below.

All of these measures of scale are valid for comparison of the "spread" or variability of numeric variables about a central value. In general, the greater their values, the more spread out the values of the variable are. Of course, the standard deviation, median absolute deviation, and interquartile range of a variable are different quantities and one must be careful to compare like measures. The standard deviation, mad, IQR and other measures of scale obey the basic formula for changes in measurement scale. If a and b are constants and y = a + bx, then

sd(y) = |b|sd(x)  (2.12)

mad(y) = |b|mad(x)  (2.13)

IQR(y) = |b|IQR(x)  (2.14)

Example 2.4. The "xdata" data set is a simulated sample of 100 numeric observations. The pictures below show the histograms arising from multiplying the data by 1, 2, 0.5, and 3. All the measures of scale (standard deviation, mad, IQR and so on) will be multiplied by the same factors.

[Figure: four histograms of xdata, 2 * xdata, 0.5 * xdata, and 3 * xdata on a common horizontal axis from −10 to 10; the y-axes show "Frequency" from 0 to 50.]

> var(xdata); sd(xdata)
[1] 1.144041
[1] 1.069598
> var(2*xdata); sd(2*xdata)
[1] 4.576163
[1] 2.139197
> mad(xdata); IQR(xdata)
[1] 0.9334394
[1] 1.24708
> mad(2*xdata); IQR(2*xdata)
[1] 1.866879
[1] 2.494159


2.3.4 Exercises

1. By hand, find the sample variance and standard deviation of mydata. Repeat with R.

2. By hand, find the median absolute deviation of mydata. Repeat with R and observe that R's answer is 1.4826 times the answer you got by hand.

3. Find the variance and standard deviation of the response time data. Treat it as a sample from a larger population.

4. Find the interquartile range and the median absolute deviation for the response time data.

5. In the response time data, replace the value x40 = 2.32 by 232.0. Recalculate the standard deviation, the interquartile range and the median absolute deviation and compare with the answers from problems 3 and 4.

6. Show that the function g(c) in equation (2.9) is minimized when c = µ(x). Hint: Minimize g(c)².

2.4 Boxplots

Boxplots are also called box and whisker diagrams. Essentially, a boxplot is a graphical representation of the five number summary. The boxplot below depicts the sensory response data of the preceding section without the log transformation.

> boxplot(reacttimes$Times,horizontal=T,xlab="Reaction Times")

[Figure: horizontal boxplot of the reaction times; x-axis "Reaction Times" from 0 to 4.]

> summary(reacttimes)
     Times
 Min.   :0.120
 1st Qu.:1.090
 Median :1.530
 Mean   :1.742
 3rd Qu.:2.192
 Max.   :4.730

The central box in the diagram encloses the middle 50% of the numeric data. The left and right boundaries of the box mark the first and third quartiles. The boldface middle line in the box marks the median of the data. Thus, the interquartile range is the distance between the left and right boundaries of the central box. For construction of a boxplot, an outlier is defined as a data value whose distance from the nearest quartile is more than 1.5 times the interquartile range. Outliers are indicated by isolated points (tiny circles in this boxplot). The dashed lines extending outward from the quartiles are called the whiskers. They extend from the quartiles to the most extreme values in either direction that are not outliers.
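The fences beyond which points are plotted as outliers can be sketched from the quartiles; note that boxplot( ) actually computes its hinges with fivenum( ), which can differ slightly from quantile( ).

```r
# Approximate outlier fences: more than 1.5*IQR beyond the nearest quartile.
# Assumes the reacttimes data frame from earlier in the chapter.
q <- quantile(reacttimes$Times, c(0.25, 0.75))    # about 1.090 and 2.192
iqr <- q[2] - q[1]
c(lower = q[1] - 1.5*iqr, upper = q[2] + 1.5*iqr) # roughly -0.56 and 3.85
```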

This boxplot shows a number of interesting things about the response time data.

(a) The median is about 1.5. The interquartile range is slightly more than 1.

(b) The three largest values are outliers. They lie a long way from most of the data. They might call for special investigation or explanation.

(c) The distribution of values is not symmetric about the median. The values in the lower half of the data are more crowded together than those in the upper half. This is shown by comparing the distances from the median to the two quartiles, by the lengths of the whiskers, and by the presence of outliers at the upper end.

The asymmetry of the distribution of values is also evident in the histogram of the preceding section.

2.4.1 Exercises

1. Make a boxplot of the log-transformed reaction time data. Is the transformed data more symmetrically distributed than the original data?

2. The average public school teacher salaries in thousands of dollars for all 50 states and Washington D.C. are in the data set teacher salaries. The salary data in the Pay variable are listed below in increasing order.

18.1 18.4 19.5 19.6 20.3 20.3 20.5 20.6 20.8 20.9 20.9
21.0 21.4 21.6 21.7 21.8 22.0 22.1 22.3 22.3 22.5 22.6
22.8 22.9 23.4 24.3 24.5 24.6 25.2 25.6 25.8 25.8 25.9
25.9 26.0 26.5 26.6 26.6 26.8 27.2 27.2 27.2 27.2 27.4
27.6 29.1 29.5 30.2 30.7 34.0 41.5

By hand, make a boxplot of the data above.

3. Use R to make a boxplot of Pay in teacher salaries.

4. By hand, make a boxplot of mydata[-21]. Show any outliers.

5. Make a boxplot of mydata with R.

6. The data set airquality is one of R's included data sets. It shows daily measurements of ozone concentration (Ozone), solar radiation (Solar.R), wind speed (Wind), and temperature (Temp) for 5 summer months in 1973 in New York City. Some of the observations are missing and are recorded as NA, meaning not available. View an overall summary of the variables in airquality with the command

> summary(airquality)

Ignore the summaries for Month and Day since those variables should be factors, not numeric variables, and their summaries are meaningless. Attach airquality to your workspace


> attach(airquality)

and make boxplots of Ozone, Solar.R, Wind, and Temp. Comment on any noteworthy

features.

2.5 Factor Variables and Barplots

2.5.1 Tabulated Factor Variables

The location and scale measures discussed up to this point apply to numeric variables but

not to factor variables. The best way to summarize the values of factor variables is to

tabulate them and display the frequencies with a bar chart or barplot.
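Tabulating a factor is just counting how often each of its levels occurs, with missing values omitted. A minimal Python sketch of what R's table( ) does (Python used purely for illustration; the income levels and responses below are made up, with None playing the role of R's NA):

```python
from collections import Counter

# Hypothetical responses for a factor variable (income group); None = missing.
inc = ["<20K", "20-35K", ">35K", "20-35K", None, "<20K", "20-35K"]

# Tabulate, omitting missing values as R's table() does by default.
counts = Counter(x for x in inc if x is not None)
print(counts)  # frequency of each level
```

A barplot of the factor is then just a bar of height counts[level] for each level.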

Example 2.5. The Montana outlook poll was a study conducted by the Bureau of Business and Economic Research, University of Montana in 1993. A sample of 209 Montana residents was classified according to their age group, their sex, their income group, their political affiliation, the area of the state they lived in, whether they expected their personal finances to improve, and whether they expected the state's financial situation to improve. The data is at Montana.txt. The data in numeric form with description is at http://lib.stat.cmu.edu/DASL. Here is a summary of all the variables in the data frame.

> summary(Montana)
   AGE          SEX         INC           POL        AREA       FIN             STAT
 35-54:66               <20K  :47     Dem :84     NE:58    better:71    better   :118
 NA's : 1               20-35K:83     Ind :40     SE:78    same  :76    no better: 63
                        >35K  :60     Rep :78     W :73    worse :61    NA's     : 28
                        NA's  :19     NA's: 7              NA's  : 1

All of these variables are factor variables. Their values are tabulated in the summary above,

but for illustration we will tabulate the income variable INC separately using R’s table( )

function.

> attach(Montana)
> table(INC)
INC
  <20K   >35K 20-35K
    47     60     83

Notice that the 19 missing values are simply omitted from the table. A bar chart or barplot

can be constructed from the tabulated values.

> attach(Montana)

> barplot(table(INC))

[Barplot of table(INC): bars for <20K (47), >35K (60), and 20-35K (83)]

The category labels below this plot are not in their natural order. To correct it we will tell R to rearrange the categories and put the third category (20-35K) second and the second category (>35K) third.

> barplot(table(INC)[c(1,3,2)])

[Barplot of table(INC)[c(1,3,2)]: bars in the natural order <20K, 20-35K, >35K]

A bar plot and a histogram are superficially similar, but they are different. A histogram is for numeric data after it has been grouped, so it is a type of bar plot. However, bar plots are also useful for non-numeric categories or factors. Notice that in our examples histograms have a measurement scale on the horizontal axis. Barplots for factor variables do not.

2.5.2 Exercises

1. Make bar plots of the other variables in Montana.

2. The teacher salaries data set has a variable called Region which indicates which region

of the U.S. a state is in. Tabulate Region and make a bar plot of it.

3. WorldPhones is a dataset included with R. First, read about it.

> help(WorldPhones)

Then display it with

> WorldPhones

WorldPhones is a matrix, not a data frame. The row names of the matrix are the years "1951", "1956", etc. The column names are the geographical regions "N.Amer", "Europe", etc. You can extract a single column or a single row by, for example,

> WorldPhones[,"Europe"]
> WorldPhones["1961",]

You can make barplots of any column or row simply by embedding these commands as

arguments of the barplot( ) function. Make barplots of all the rows. Does it seem that

telephone usage became more evenly distributed across the regions for the years 1951-1961?

Bear in mind that the vertical axis scales are different for different years.

2.6 Jointly Distributed Variables

When two or more variables are jointly distributed, or jointly observed, it is important to

understand how they are related and how closely they are related. We will consider just two

variables, generically named x and y.

2.6.1 Two Factor Variables

When x and y are both factor variables the best way to reveal their relationship is to cross tabulate them. If x has levels a1, a2, ..., ar, y has levels b1, b2, ..., bc, and there are n joint observations of x and y, then their cross tabulation is the r × c matrix with entries nij equal to the number of cases in which x = ai and y = bj. The cross tabulation is easy to accomplish with the table( ) function of R.
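As a quick illustration of the definition, the r × c matrix of counts nij can be built directly from the list of joint observations. A Python sketch (the observation pairs are made up for illustration; the text itself uses R's table( )):

```python
from collections import Counter

# Hypothetical joint observations of two factor variables x and y.
pairs = [("NE", "Dem"), ("NE", "Rep"), ("SE", "Dem"),
         ("W", "Dem"), ("SE", "Rep"), ("NE", "Rep")]

rows = sorted({x for x, _ in pairs})   # levels a_1, ..., a_r of x
cols = sorted({y for _, y in pairs})   # levels b_1, ..., b_c of y
counts = Counter(pairs)

# r x c matrix with n_ij = number of cases where x = a_i and y = b_j
table = [[counts[(a, b)] for b in cols] for a in rows]
print(table)
```

Dividing every entry by n = len(pairs) would give the relative frequency table discussed below.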

Example 2.6. There are n = 209 cases in the Montana data. The cross tabulation of the

two variables x = AREA (region of the state) and y = POL (political party preference) is

> table(AREA,POL)
      POL
AREA   Dem Ind Rep
  NE    15  12  30
  SE    30  16  31
  W     39  12  17

From the table we see that there were 15 respondents to the survey in the northeastern

region who preferred the Democratic Party. There were 30 in the northeast who were Republicans.

It may be more revealing to show a table of relative frequencies rather than absolute frequencies. To do so, simply divide all the table entries nij by the total number of cases n.

In our example, the relative frequency table rounded to 3 places is


> table(AREA,POL)/209
      POL
AREA          Dem        Ind        Rep
  NE   0.07177033 0.05741627 0.14354067
  SE   0.14354067 0.07655502 0.14832536
  W    0.18660287 0.05741627 0.08133971
> round(.Last.value,3)
      POL
AREA     Dem   Ind   Rep
  NE   0.072 0.057 0.144
  SE   0.144 0.077 0.148
  W    0.187 0.057 0.081


The relative frequencies in a cross tabulation can be displayed with a mosaic plot.

> attach(Montana)
> plot(POL~AREA)

[Mosaic plot of POL by AREA: one vertical bar for each area (NE, SE, W), each divided according to party preference (Dem, Ind, Rep)]

The formula POL ~ AREA tells R to treat AREA as the x variable, with values arrayed horizontally, and POL as the y variable with values arrayed vertically. The widths of the vertical bars vary slightly because they are proportional to the relative frequencies of the levels of the x variable. From the plot you can see that the western region is predominantly Democratic while the northeastern region is predominantly Republican, at least in the sample. The southeastern region has the greatest sample representation and it is about evenly split between the two major parties.

2.6.2 One Factor and One Numeric Variable

We will next consider the case where x is a factor and y is numeric. The figure below

compares placement test scores for each of the letter grades in a sample of 179 students who

took a particular math course in the same semester under the same instructor. The two

jointly observed population variables are the letter grade received and the placement test

score. The figure separates test scores according to the letter grade and shows a boxplot

for each group of students. One would expect to see a decrease in the median test score as

the letter grade decreases and that is confirmed by the picture. However, the decrease in

median test scores from a letter grade of B to a grade of F is not very dramatic, especially

compared to the size of the IQRs. This suggests that the placement test is not especially

good at predicting a student’s final grade in the course. Notice the two outliers. The outlier

for the ”W” group is clearly a mistake in recording data because the scale of scores only

went to 100.

> test.vs.grade=read.csv("test.vs.grade.csv",header=T)
> attach(test.vs.grade)
> plot(Test~Grade,varwidth=T)

[Side-by-side boxplots of Test for each Grade: A, B, C, D, F, W]

We used the same formula argument of the form y ~ x here as in the previous example. The plot function plot( ) knows to produce side-by-side boxplots when y is numeric and x is a factor. The boxplot( ) function would work just as well. The argument varwidth=T tells R to allow the widths of the boxes to vary and reflect the number of observations in each group.

2.6.3 Two Numeric Variables

Scatterplots

Next, we consider the case where both x and y are numeric variables, jointly observed, so

that we have the same number n of observations of each. Indeed, we have n pairs of observations (x1 , y1 ), (x2 , y2 ), . . . , (xn , yn ). If we plot the n points in a Cartesian plane, we obtain

a scatterplot or a scatter diagram of the two variables.

Below are the first 10 rows of the ”Payroll” data set. The column labeled ”payroll” is the total

monthly payroll in thousands of dollars for each company listed. The column ”employees”

is the number of employees in each company and ”industry” indicates which of two related

industries the company is in. A scatterplot of all 50 values of the two variables ”payroll”

and ”employees” is also shown.

> Payroll=read.table("Payroll.txt",header=T)
> attach(Payroll)
> plot(payroll~employees,col=industry)

[Scatterplot of payroll versus employees for all 50 companies, points colored by industry]

> Payroll[1:10,]
   payroll employees industry
1   190.67        85        A
2   233.58       109        A
3   244.04       130        B
4   351.41       166        A
5   298.60       154        B
6   241.43       124        B
7   143.93        38        B
8   242.33       116        A
9   216.88       103        A
10  195.97       101        A

The scatterplot shows that in general the more employees a company has, the higher its monthly payroll. Of course this is expected. It also shows that the relationship between the number of employees and the payroll is quite strong. For any given number of employees, the variation in payrolls for that number is small compared to the overall variation in payrolls for all employment levels. In this plot, the data from industry A is in black and that from industry B is red. The plot shows that for employees ≥ 100, payrolls for industry A are generally greater than those for industry B at the same level of employment.

Covariance and Correlation

If x and y are jointly distributed numeric variables, we define their covariance as

cov(x, y) = (1/n) Σᵢ₌₁ⁿ (xᵢ − µ(x))(yᵢ − µ(y)).

If x and y come from samples of size n rather than the whole population, replace the

denominator n by n − 1 and the population means µ(x), µ(y) by the sample means x̄, ȳ

to obtain the sample covariance. The sign of the covariance reveals something about the

relationship between x and y. If the covariance is negative, values of x greater than µ(x)

tend to be accompanied by values of y less than µ(y). Values of x less than µ(x) tend to go

with values of y greater than µ(y), so x and y tend to deviate from their means in opposite

directions. If cov(x, y) > 0, they tend to deviate in the same direction. The strength of

these tendencies is not expressed by the covariance because its magnitude depends on the

variability of each of the variables about its mean. To correct this, we divide each deviation

in the sum by the standard deviation of the variable. The resulting quantity is called the

correlation between x and y:

cor(x, y) = cov(x, y) / (sd(x) ∗ sd(y)).

The correlation between payroll and employees in the example above is 0.9782 (97.82%).

Theorem 2.1. The correlation between x and y satisfies −1 ≤ cor(x, y) ≤ 1. cor(x, y) = 1

if and only if there are constants a and b > 0 such that y = a + bx. cor(x, y) = −1 if and

only if y = a + bx with b < 0.
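The b > 0 case of Theorem 2.1 is easy to check numerically. Below is an illustrative Python sketch (not the text's R) of the population covariance and correlation formulas, applied to data with an exact linear relation y = a + bx:

```python
import math

def mean(v):
    return sum(v) / len(v)

def cov(x, y):
    # population covariance, as defined in the text
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)

def cor(x, y):
    return cov(x, y) / math.sqrt(cov(x, x) * cov(y, y))

x = [1.0, 2.0, 4.0, 7.0]
y = [3.0 + 2.0 * v for v in x]   # exact linear relation with b = 2 > 0
print(cor(x, y))                 # 1.0, up to floating-point rounding
```

Replacing b = 2 with a negative slope gives a correlation of −1, the other extreme allowed by the theorem.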
A correlation close to 1 indicates a strong positive relationship (tending to vary in the same
direction from their means) between x and y while a correlation close to −1 indicates a strong
negative relationship. A correlation close to 0 indicates that there is no linear relationship
between x and y. In this case, x and y are said to be (nearly) uncorrelated. There might
be a relationship between x and y but it would be nonlinear. The picture below shows a
scatterplot of two variables that are clearly related but very nearly uncorrelated.
> xs=runif(500,0,3*pi)

> ys=sin(xs)+rnorm(500,0,.15)

> cor(xs,ys)

[1] 0.005307598

> plot(xs,ys)

[Scatterplot of ys versus xs: the points trace the sine curve, yet cor(xs, ys) ≈ 0]

Some sample scatterplots of variables with different population correlations are shown below.

[Four sample scatterplots of variables with population correlations cor(x,y) = 0, 0.3, −0.5, and 0.9]

2.6.4 Exercises

1. With the Montana data, cross tabulate AREA and INC. Also make a mosaic plot of

these two variables. Do these suggest anything about the economics of Montana?

2. Do the same for AREA and POL. What, if anything, do you conclude about the politics

of Montana?

3. Do the same for AREA and AGE. Draw the appropriate conclusions.

4. With the Auto Pollution Filter Noise data, construct side by side boxplots of the variable

NOISE for the different levels of the factor SIZE. Comment. Do the same for NOISE and

TYPE.

5. With the Payroll data, construct side by side boxplots of "employees" versus "industry" and "payroll" versus "industry". Are these boxplots as informative as the color-coded scatterplot in Section 2.6.3?


6. If you are using Rstudio click on the ”Packages” tab, then the checkbox next to the library

MASS. Click on the word MASS and then the data set ”mammals” and read about it. If

you are using R alone, in the Console window at the prompt > type

> data(mammals,package="MASS")

View the data with

> mammals

Make a scatterplot with the following commands and comment on the result.


> attach(mammals)

> plot(body,brain)

Also make a scatterplot of the log transformed body and brain weights.

> plot(log(body),log(brain))

A recently discovered hominid species, Homo floresiensis, had an estimated average body weight of 25 kg. Based on the scatterplots, what would you guess its brain weight to be?

7. Let x and y be jointly distributed numeric variables and let z = a + by, where a and b

are constants. Show that cov(x, z) = b ∗ cov(x, y). Show that if b > 0, cor(x, z) = cor(x, y).

What happens if b < 0?
8. Find the covariance and correlation between payroll and employees for the first 10 rows
only of the Payroll data.
Chapter 3

Probability

3.1 Background
In Chapter 1 we described a random experiment as one that can be replicated indefinitely
many times and whose outcome has a degree of uncertainty from replication to replication.
The uncertainty in a random experiment is subject to treatment with the tools of mathematical probability. The mathematical theory of probability is a huge subject which has
developed separately from the development of statistics. In this chapter we describe only
its most basic elements.
Recall from Chapter 1 that the set of all possible outcomes of a random experiment is called
its sample space and is denoted by the symbol Ω. An event is a set of outcomes, i.e., a
subset of Ω. A probability measure is a function which assigns numbers between 0 and 1
to events. The number assigned to an event is called its probability. If the sample space
Ω, the collection of events, and the probability measure are all specified, they constitute a
probability model of the random experiment.
Probability models do not come directly from nature. They are devised by researchers seeking to understand regularities in the phenomena they are studying. Possibly, the observed
result of an experiment cannot easily be reconciled with predictions based on the probability
model. In this case, the model is called into question or even refuted. The formalization of
this process constitutes most of the subject of statistical inference.
3.2 Equally Likely Outcomes
The simplest probability models have a finite sample space Ω. The collection of events is
the collection of all subsets of Ω and the probability of an event is simply the proportion
of all possible outcomes that correspond to that event. In such models, we say that the
experiment has equally likely outcomes. If the sample space has N elements and E is a
subset of Ω, then

Pr(E) = #(E)/N.

Each of the elementary events {ω} consisting of a single outcome has the same probability 1/N.
Here we introduce some notation that will be used throughout this text. The probability measure for a random experiment is denoted by the abbreviation Pr, sometimes with subscripts. Events will be denoted by upper case Latin letters near the beginning of the alphabet. The expression #(E) denotes the number of elements of the subset E.
Example 3.1. The Payroll data consists of 50 observations of 3 variables, ”payroll”, ”employees” and ”industry”. Suppose that a random experiment is to choose one record from
the Payroll data and suppose that the experiment has equally likely outcomes. Then, as the
summary below shows, the probability that industry A is selected is
Pr(industry = A) = 27/50 = 0.54.
> Payroll=read.table("Payroll.txt",header=T)
> summary(Payroll)
    payroll         employees       industry
 Min.   :129.1   Min.   : 26.00   A:27
 1st Qu.:167.8   1st Qu.: 71.25   B:23
 Median :216.1   Median :108.50
 Mean   :228.2   Mean   :106.42
 3rd Qu.:287.8   3rd Qu.:143.25
 Max.   :354.8   Max.   :172.00

In this example we use another common and convenient notational convention. The event whose probability we want is described in quasi-natural language as "industry=A" rather than with the formal but too cumbersome {ω ∈ Payroll | industry(ω) = A}. The description "industry=A" refers to the set of all possible outcomes of the experiment for which the variable "industry" has the value "A". This sort of informal description of an event will be used again and again.

The assumption of equally likely outcomes is an assumption about the selection procedure

for obtaining one record from the data. It is conceivable that a selection method is employed

for which this assumption is not valid. If so, we should be able to discover that it is invalid

by replicating the experiment sufficiently many times. This is a basic principle of classical

statistical inference. It relies on a famous result of mathematical probability theory called

the law of large numbers. One version of it is loosely stated as follows:

Law of Large Numbers: Let E be an event associated with a random experiment and let Pr be the probability measure of a true probability model of the experiment. Suppose the experiment is replicated n times and let P̂r(E) = (1/n) × #(replications in which E occurs). Then P̂r(E) → Pr(E) as n → ∞.

P̂r(E) is called the empirical probability of E.
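The law of large numbers can be illustrated by simulation. The Python sketch below (a made-up simulation, not part of the text) replicates the experiment of Example 3.1 many times and tracks the empirical probability of the event "industry = A", whose true probability is 0.54:

```python
import random

random.seed(7)

# One record is drawn at random from 50 records, 27 of which are industry "A".
population = ["A"] * 27 + ["B"] * 23
n = 100_000   # number of replications
hits = sum(1 for _ in range(n) if random.choice(population) == "A")
empirical = hits / n
print(empirical)   # close to the true Pr(E) = 0.54
```

Increasing n drives the empirical probability ever closer to 0.54, which is what the theorem asserts.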

3.3 Combinations of Events

Events are related to other events by familiar set operations. Let E1, E2, ... be a finite or infinite sequence of events. The union of E1 and E2 is the event

E1 ∪ E2 = {ω ∈ Ω | ω ∈ E1 or ω ∈ E2}.

More generally,

⋃ᵢ Eᵢ = E1 ∪ E2 ∪ ... = {ω ∈ Ω | ω ∈ Eᵢ for some i}.

The intersection of E1 and E2 is the event

E1 ∩ E2 = {ω ∈ Ω | ω ∈ E1 and ω ∈ E2},

and, in general,

⋂ᵢ Eᵢ = E1 ∩ E2 ∩ ... = {ω ∈ Ω | ω ∈ Eᵢ for all i}.

Sometimes we omit the intersection symbol ∩ and simply conjoin the symbols for the events in an intersection. In other words,

E1E2...En = E1 ∩ E2 ∩ ... ∩ En.

The complement of the event E is the event

∼E = {ω ∈ Ω | ω ∉ E}.

∼E occurs if and only if E does not occur. The event E1∼E2 occurs if and only if E1 occurs and E2 does not occur.

Finally, the entire sample space Ω is an event with complement φ, the empty event. The

empty event never occurs. We need the empty event because it is possible to formulate a

perfectly sensible description of an event which happens never to be satisfied. For example,

if Ω = Payroll, the event ”employees < 25” is never satisfied, so it is the empty event.
We also have the subset relation between events. E1 ⊆ E2 means that if E1 occurs, then
E2 occurs, or in more familiar language, E1 is a subset of E2 . For any event E, it is true
that φ ⊆ E ⊆ Ω. E2 ⊇ E1 means the same as E1 ⊆ E2 .
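These event operations correspond exactly to set operations in most programming languages. A Python sketch over a small sample space (the particular events are arbitrary, chosen only for illustration):

```python
# Events as Python sets over a small sample space Omega.
omega = set(range(1, 13))                # outcomes 1..12
e = {n for n in omega if n % 2 == 1}     # event "odd"
f = {n for n in omega if n % 3 == 0}     # event "multiple of 3"

union = e | f          # E ∪ F
inter = e & f          # E ∩ F (written EF in the text)
comp_e = omega - e     # ∼E
e_minus_f = e - f      # E∼F: E occurs and F does not

# Subset relation: EF ⊆ E ⊆ Ω
print(inter <= e <= omega)   # True
```

The empty event φ is just the empty set, and E ⊆ F is the ordinary subset test.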
3.3.1 Exercises
1. A random experiment consists of throwing a pair of dice, say a red die and a green
die, simultaneously. They are standard 6-sided dice with one to six dots on different faces.
Describe the sample space.
2. For the same experiment, let E be the event that the sum of the numbers of spots on the
two dice is an odd number. Write E as a subset of the sample space, i.e., list the outcomes
in E.
3. List the outcomes in the event F = ”the sum of the spots is a multiple of 3”.
4. Find ∼ F , E ∪ F , EF = E ∩ F , and E ∼ F .
5. Assume that the outcomes of this experiment are equally likely. Find the probability of
each of the events in # 4.
6. Show that for any events E1 and E2 , if E1 ⊆ E2 then ∼ E2 ⊆∼ E1 .
7. The ”mammals” data set in the ”MASS” library contains the result of a study of sleep in
mammal species. 1 2 Load the ”mammals” data set into your R workspace. In Rstudio you
can click on the ”Packages” tab and then on the checkbox next to MASS. Without Rstudio,
type
> data(mammals,package="MASS")

Attach the mammals data frame to your R search path with

> attach(mammals)

A random experiment is to choose one of the species listed in this data set. All outcomes

are equally likely. You can obtain a list of the species in the event ”body > 200” with the

command

> subset(mammals,body>200)

What is the probability of this event, i.e., what is the probability that you randomly select

a species with a body weight greater than 200 kg?

You can obtain a count of the species with body weights greater than 200 kg, by

> sum(body > 200)

1. Weisberg, S. (1985) Applied Linear Regression. 2nd edition. Wiley, pp. 144-5.
2. Allison, T. and Cicchetti, D. V. (1976) Sleep in mammals: ecological and constitutional correlates. Science 194, 732-734.


8. What are the species in the event that the ratio of brain weight to body weight is greater

than 0.02? Remember that brain weight is recorded in grams and body weight in kilograms,

so body weight must be multiplied by 1000 to make the two weights comparable. The species

belonging to this event can be obtained with the R command

> subset(mammals,brain/body/1000 > 0.02)

What is the probability of that event?

3.4 Rules for Probability Measures

The assumption of equally likely outcomes is often the starting point for the construction of a probability model. However, there are many random experiments for which this assumption is wrong. Regardless of how a probability measure for a model of a random experiment is chosen, there are certain rules that it must satisfy. They are:

1. 0 ≤ Pr(E) ≤ 1 for each event E.

2. Pr(Ω) = 1.

3. If E1, E2, ... is a finite or infinite sequence of events such that EᵢEⱼ = φ for i ≠ j, then Pr(⋃ᵢ Eᵢ) = Σᵢ Pr(Eᵢ). If EᵢEⱼ = φ for all i ≠ j we say that the events E1, E2, ... are pairwise disjoint. This means that no two of the events can both occur simultaneously.

These are the basic rules. There are other properties that may be derived from them as theorems.

4. Pr(E ∼F) = Pr(E) − Pr(EF) for all events E and F. In particular, Pr(∼E) = 1 − Pr(E).

5. Pr(φ) = 0.

6. Pr(E ∪ F) = Pr(E) + Pr(F) − Pr(EF) for all events E and F.

7. If E ⊆ F, then Pr(E) ≤ Pr(F).

8. If E1 ⊆ E2 ⊆ ... is an infinite sequence of events, then Pr(⋃ᵢ Eᵢ) = limᵢ→∞ Pr(Eᵢ).

9. If E1 ⊇ E2 ⊇ ... is an infinite sequence of events, then Pr(⋂ᵢ Eᵢ) = limᵢ→∞ Pr(Eᵢ).
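Rule 6 (the inclusion-exclusion rule) is easy to verify numerically under equally likely outcomes, since every probability is just a count divided by #(Ω). A Python sketch with arbitrary illustrative events:

```python
# Numeric check of rule 6: Pr(E ∪ F) = Pr(E) + Pr(F) − Pr(EF),
# for equally likely outcomes on a small sample space.
omega = set(range(1, 37))    # 36 equally likely outcomes, e.g. two dice
e = set(range(1, 19))        # one event
f = set(range(10, 28))       # another, overlapping event

pr = lambda a: len(a) / len(omega)
lhs = pr(e | f)
rhs = pr(e) + pr(f) - pr(e & f)
print(lhs == rhs)   # True
```

Counting each outcome in E ∩ F once on both sides is exactly why the Pr(EF) term must be subtracted.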

3.5 Counting Outcomes. Sampling with and without Replacement

Suppose a random experiment with sample space Ω is replicated n times. The result is a

sequence (ω1 , ω2 , . . . , ωn ), where ωi ∈ Ω is the outcome of the ith replication. This sequence

is the outcome of a so-called compound experiment – the sequential replications of the basic

experiment. The sample space of this compound experiment is the n-fold cartesian product

Ωn = Ω × Ω × · · · × Ω. Now suppose that the basic experiment is to choose one member of a

finite population with N elements. We may identify the sample space Ω with the population.

Consider an outcome (ω1 , ω2 , . . . , ωn ) of the replicated experiment. There are N possibilities

for ω1 and for each of those there are N possibilities for ω2 and for each pair ω1 , ω2 there

are N possibilities for ω3, and so on. In all, there are N × N × ... × N = Nⁿ possibilities for the entire sequence (ω1, ω2, ..., ωn). If all outcomes of the compound experiment are equally likely, then each has probability 1/Nⁿ. Moreover, it can be shown that the compound experiment has equally likely outcomes if and only if the basic experiment has equally likely outcomes, each with probability 1/N.

Definition: An ordered random sample of size n with replacement from a population of size N is a randomly chosen sequence of length n of elements of the population, where repetitions are possible and each outcome (ω1, ω2, ..., ωn) has probability 1/Nⁿ.

Now suppose that we sample one element ω1 from the population, with all N outcomes equally likely. Next, we sample one element ω2 from the population excluding the one already chosen. That is, we randomly select one element from Ω ∼{ω1} with all the remaining N − 1 elements being equally likely. Next, we randomly select one element ω3 from the N − 2 elements of Ω ∼{ω1, ω2}, and so on until at last we select ωn from the remaining N − (n − 1) elements of the population. The result is a nonrepeating sequence (ω1, ω2, ..., ωn) of length n from the population. A nonrepeating sequence of length n is also called a permutation of length n from the N objects of the population. The total number of such permutations is

N × (N − 1) × ... × (N − n + 1) = N!/(N − n)!.

Obviously, we must have n ≤ N for this to make sense. The number of permutations of length N from a set of N objects is N!. It can be shown that, with the sampling scheme described above, all permutations of length n are equally likely to result. Each has probability (N − n)!/N! of occurring.

Definition: An ordered random sample of size n without replacement from a population of size N is a randomly chosen nonrepeating sequence of length n from the population where each outcome (ω1, ω2, ..., ωn) has probability (N − n)!/N!.

Most of the time when sampling without replacement from a finite population, we do not care about the order of appearance of the elements of the sample. Two nonrepeating sequences with the same elements in different order will be regarded as equivalent. In other words, we are concerned only with the resulting subset of the population. Let us count the number of subsets of size n from a set of N objects. Temporarily, let C denote that number. Each subset of size n can be ordered in n! different ways to give a nonrepeating sequence. Thus, the number of nonrepeating sequences of length n is C times n!. So, N!/(N − n)! = C × n!, i.e.,

C = N!/(n!(N − n)!) = C(N, n).

This is the binomial coefficient "N choose n" that appears in the binomial theorem:

(a + b)^N = Σₙ₌₀ᴺ C(N, n) aⁿ b^(N−n).

Definition: A simple random sample of size n from a population of size N is a randomly chosen subset of size n from the population, where each subset has the same probability of being chosen, namely 1/C(N, n).

A simple random sample may be obtained by choosing objects from the population sequentially, in the manner described above, and then ignoring the order of their selection.
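The three counts introduced above — ordered samples with replacement, permutations, and subsets — are available directly in Python's standard math module, which also lets us confirm the identity N!/(N − n)! = C × n! numerically (an illustrative check, not part of the text):

```python
import math

N, n = 365, 23

with_replacement = N ** n          # ordered samples with replacement: N^n
perms = math.perm(N, n)            # permutations: N!/(N-n)!
subsets = math.comb(N, n)          # subsets: N!/(n!(N-n)!)

# Each subset of size n can be ordered in n! ways, so:
print(perms == subsets * math.factorial(n))   # True
```

The same identity is what justified dividing by n! to pass from ordered to unordered samples.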

Example: The Birthday Problem

There are N = 365 days in a year. (Ignore leap years.) Suppose n = 23 people are

chosen randomly and their birthdays recorded. What is the probability that at least two of

them have the same birthday?

Solution: Arbitrarily numbering the people involved from 1 to n, their birthdays form an

ordered sample, with replacement, from the set of N = 365 birthdays. Therefore, each

sequence has probability N1n of occurring. No two people have the same birthday if and

only if the sequence is actually nonrepeating. The number of nonrepeating sequences of

birthdays is N (N − 1) · · · (N − n + 1). Therefore, the event ”No two people have the same

birthday" has probability

N(N − 1) · · · (N − n + 1) / Nⁿ = (1 − 1/N)(1 − 2/N) · · · (1 − (n−1)/N).

With n = 23 and N = 365 we can find this in R as follows:

> prod(1-(1:22)/365)
[1] 0.4927028

So, there is about a 49% probability that no two people in a random selection of 23 have the

same birthday. In other words, the probability that at least two share a birthday is about

51%.
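The same product can be computed in Python (mirroring the R prod( ) call above; the function name is ours):

```python
# Probability that n randomly chosen birthdays (N equally likely days)
# are all distinct: (1 - 1/N)(1 - 2/N)...(1 - (n-1)/N).
def p_no_repeat(n, N):
    p = 1.0
    for i in range(1, n):
        p *= 1 - i / N
    return p

p = p_no_repeat(23, 365)
print(p)        # about 0.4927, matching the R computation
print(1 - p)    # about 0.5073: chance of at least one shared birthday
```

The same function answers exercise-style variants by changing n and N, e.g. the Martian year of N = 669 days.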

An important, intuitively obvious principle in statistics is that if the sample size n is very

small in comparison to the population size N , a sample taken without replacement may

be regarded as one taken with replacement, if it is mathematically convenient to do so.

A sample of size 100 taken with replacement from a population of 100,000 has very little

chance of repeating itself. The probability of a repetition is about 5%.

3.5.1 Exercises

1. A red 6-sided die and a green 6-sided die are thrown simultaneously. The outcomes of

this experiment are equally likely. What is the probability that at least one of the dice lands

with a 6 on its upper face?

2. A hand of 5-card draw poker is a simple random sample from the standard deck of 52

cards. How many 5 draw poker hands are there? In 5-card stud poker, the cards are dealt

sequentially and the order of appearance is important. How many 5-stud poker hands are

there?

3. How many hands of 5-draw poker contain the ace of hearts? What is the probability that

a 5-card draw hand contains the ace of hearts?

4. Everybody in Ourtown is a fool or a knave or possibly both. 70% of the citizens are fools

and 85% are knaves. One citizen is randomly selected to be mayor. What is the probability

that the mayor is both a fool and a knave?

5. What is the probability that the mayor is a fool but not a knave?

6. A Martian year has 669 days. An R program for calculating the probability of no repetitions in a sample with replacement of n birthdays from a year of N days is given below.

> birthdays=function(n,N) prod(1-1:(n-1)/N)

To invoke this function with, for example, n=12 and N=400 simply type

> birthdays(12,400)

Check that the program gives the right answer for N=365 and n=23. Then use it to find

the number n of Martians that must be sampled in order for the probability of a repetition

to be at least 0.5.

7. A standard deck of 52 cards has four queens. Two cards are randomly drawn in succession, without replacement, from a standard deck. What is the probability that the first

card is a queen? What is the probability that the second card is a queen? If three cards are

drawn, what is the probability that the third is a queen? Make a general conjecture. Prove

it if you can. (Hint: Does the probability change if ”queen” is replaced by ”king” or ”seven”?)

3.6 Conditional Probability

Definition: Let A and B be events with Pr(B) > 0. The conditional probability of A, given B is:

Pr(A|B) = Pr(AB)/Pr(B).    (3.1)


Pr(A) itself is called the unconditional probability of A.

Example 3.2. R includes a tabulation by various factors of the 2201 passengers and crew on the fatal voyage of the Titanic. Read about it by typing

> help(Titanic)

We are going to look at two of these factors, the class of accommodations of the passengers and crew and whether they survived or not.

> dimnames(Titanic)
$Class
[1] "1st"  "2nd"  "3rd"  "Crew"

$Sex
[1] "Male"   "Female"

$Age
[1] "Child" "Adult"

$Survived
[1] "No"  "Yes"

> margin.table(Titanic,c(1,4))
      Survived
Class   No Yes
  1st  122 203
  2nd  167 118
  3rd  528 178
  Crew 673 212

> addmargins(.Last.value)
      Survived
Class    No  Yes  Sum
  1st   122  203  325
  2nd   167  118  285
  3rd   528  178  706
  Crew  673  212  885
  Sum  1490  711 2201

Suppose that a passenger or crew member is selected randomly. The unconditional probability that that person survived is 711/2201 = 0.323.

Let us calculate the conditional probability of survival, given that the person selected was in a first class cabin. If A = "survived" and B = "first class", then

Pr(AB) = 203/2201 = 0.0922

and

Pr(B) = 325/2201 = 0.1477.

Thus,

Pr(A|B) = 0.0922/0.1477 = 0.625.

First class passengers had a 62.5% chance of survival. For random sampling from a finite population such as this, we can use the counts of occurrences of the events rather than their probabilities because the denominators in Pr(AB) and Pr(B) cancel:

Pr(A|B) = #(AB)/#(B) = 203/325 = 0.625
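The count-based calculation is easy to verify numerically. The text's computations use R; this Python sketch simply transcribes the counts from the margin table above:

```python
# Survival counts by class, transcribed from the Titanic margin table.
counts = {
    "1st":  {"No": 122, "Yes": 203},
    "2nd":  {"No": 167, "Yes": 118},
    "3rd":  {"No": 528, "Yes": 178},
    "Crew": {"No": 673, "Yes": 212},
}

total = sum(v["No"] + v["Yes"] for v in counts.values())   # 2201
survived = sum(v["Yes"] for v in counts.values())          # 711
print(round(survived / total, 3))                          # 0.323, unconditional

# Conditional probability of survival given first class: #(AB)/#(B)
first = counts["1st"]
print(round(first["Yes"] / (first["No"] + first["Yes"]), 3))   # 0.625
```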


For comparison, look at the conditional probabilities of survival for the other classes.

Pr(survived|second class) = 118/285 = 0.414

Pr(survived|third class) = 178/706 = 0.252

Pr(survived|crew) = 212/885 = 0.240

3.6.1 Relating Conditional and Unconditional Probabilities

The defining equation (3.1) for conditional probability can be written as

Pr(AB) = Pr(A|B)Pr(B),    (3.2)

which is often more useful, especially when Pr(A|B) is easily determined from the description of the experiment. There is an even more useful result, sometimes called the law of total probability. Let B1, B2, · · · , Bk be pairwise disjoint events such that each Pr(Bi) > 0 and Ω = B1 ∪ B2 ∪ · · · ∪ Bk. Let A be another event. Then

Pr(A) = ∑ from i=1 to k of Pr(A|Bi)Pr(Bi).    (3.3)

This follows from property (3) of probability measures, the fact that A = (AB1) ∪ · · · ∪ (ABk) is a union of pairwise disjoint events, and Pr(ABi) = Pr(A|Bi)Pr(Bi).
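Equation (3.3) can be checked numerically with the Titanic data, taking the four classes as the partition B1, ..., B4. A sketch in Python (counts transcribed from the table above):

```python
# Law of total probability check: the four classes partition the 2201 people.
class_totals = {"1st": 325, "2nd": 285, "3rd": 706, "Crew": 885}
survived     = {"1st": 203, "2nd": 118, "3rd": 178, "Crew": 212}
N = sum(class_totals.values())                        # 2201

# Right-hand side of (3.3): sum of Pr(A|Bi) * Pr(Bi)
rhs = sum((survived[c] / class_totals[c]) * (class_totals[c] / N)
          for c in class_totals)
lhs = sum(survived.values()) / N                      # Pr(A) = 711/2201

print(round(lhs, 3), round(rhs, 3))                   # both 0.323
```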

Example 3.3. Diagnostic Tests:

Let D denote the presence of a disease in a randomly selected member of a given population. Suppose that there is a diagnostic test for the disease and let T denote the event that a random subject tests positive, that is, that the test indicates the disease. The conditional probability Pr(T|D) is called the sensitivity of the test. The conditional probability Pr(∼T|∼D) is called the specificity of the test. The unconditional probability Pr(D) is called the prevalence of the disease in the population. A good test will have both a high sensitivity and…
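These definitions plug directly into the law of total probability (3.3), with D and its complement as the partition. A sketch in Python; the sensitivity, specificity, and prevalence values below are illustrative assumptions, not from the text:

```python
# Pr(T) = Pr(T|D)Pr(D) + Pr(T|~D)Pr(~D), with illustrative values only.
sens = 0.95   # Pr(T | D): sensitivity (assumed)
spec = 0.90   # Pr(~T | ~D): specificity (assumed)
prev = 0.02   # Pr(D): prevalence (assumed)

# Positive tests come from true positives and false positives.
pr_T = sens * prev + (1 - spec) * (1 - prev)
print(round(pr_T, 3))   # 0.117
```

Note that even with a sensitive, specific test, a low prevalence makes most positives false positives.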
