# MATH 1280 University of The People Statistics Worksheet

Introduction to Statistical Thinking (With R, Without Calculus)

Benjamin Yakir, The Hebrew University

March, 2011


In memory of my father, Moshe Yakir, and the family he lost.


Preface

The target audience for this book is college students who are required to learn statistics, students with little background in mathematics and often no motivation to learn more. It is assumed that the students do have basic skills in using computers and have access to one. Moreover, it is assumed that the students are willing to actively follow the discussion in the text, to practice, and more importantly, to think.

Teaching statistics is a challenge. Teaching it to students who are required to learn the subject as part of their curriculum is an art mastered by few. In the past I have tried to master this art and failed. In desperation, I wrote this book.

This book uses the basic structure of a generic introduction-to-statistics course. However, in some ways I have chosen to diverge from the traditional approach. One divergence is the introduction of R as part of the learning process. Many have used statistical packages or spreadsheets as tools for teaching statistics. Others have used R in advanced courses. I am not aware of attempts to use R in introductory-level courses. Indeed, mastering R requires much investment of time and energy that may be distracting and counterproductive for learning more fundamental issues. Yet, I believe that if one restricts the application of R to a limited number of commands, the benefits that R provides outweigh the difficulties that R engenders.

Another departure from the standard approach is the treatment of probability as part of the course. In this book I do not attempt to teach probability as a subject matter, but only specific elements of it which I feel are essential for understanding statistics. Hence, Kolmogorov’s Axioms are out, as are attempts to prove basic theorems and a balls-and-urns type of discussion. On the other hand, emphasis is given to the notion of a random variable and, in that context, the sample space.

The first part of the book deals with descriptive statistics and provides probability concepts that are required for the interpretation of statistical inference. Statistical inference is the subject of the second part of the book.

The first chapter is a short introduction to statistics and probability. Students are required to have access to R right from the start. Instructions regarding the installation of R on a PC are provided.

The second chapter deals with data structures and variation. Chapter 3 provides numerical and graphical tools for presenting and summarizing the distribution of data.

The fundamentals of probability are treated in Chapters 4 to 7. The concept of a random variable is presented in Chapter 4 and examples of special types of random variables are discussed in Chapter 5. Chapter 6 deals with the Normal random variable. Chapter 7 introduces sampling distribution and presents the Central Limit Theorem and the Law of Large Numbers. Chapter 8 summarizes the material of the first seven chapters and discusses it in the statistical context.

Chapter 9 starts the second part of the book and the discussion of statistical inference. It provides an overview of the topics that are presented in the subsequent chapters. The material of the first half is revisited.

Chapters 10 to 12 introduce the basic tools of statistical inference, namely point estimation, estimation with a confidence interval, and the testing of statistical hypotheses. All these concepts are demonstrated in the context of a single measurement.

Chapters 13 to 15 discuss inference that involves the comparison of two measurements. The context in which these comparisons are carried out is that of regression, which relates the distribution of a response to an explanatory variable. In Chapter 13 the response is numeric and the explanatory variable is a factor with two levels. In Chapter 14 both the response and the explanatory variable are numeric, and in Chapter 15 the response is a factor with two levels.

Chapter 16 ends the book with the analysis of two case studies. These analyses require the application of the tools that are presented throughout the book.

This book was originally written for a pair of courses at the University of the People. As such, each part was restricted to 8 chapters. Due to lack of space, some important material, especially the concepts of correlation and statistical independence, was omitted. In future versions of the book I hope to fill this gap.

Large portions of this book, mainly in the first chapters and some of the quizzes, are based on material from the online book “Collaborative Statistics” by Barbara Illowsky and Susan Dean (Connexions, March 2, 2010. http://cnx.org/content/col10522/1.37/). Most of the material was edited by this author, who is the only person responsible for any errors that were introduced in the process of editing.

Case studies that are presented in the second part of the book are taken from the Rice Virtual Lab in Statistics and can be found in its Case Studies section. The responsibility for mistakes in the analysis of the data, if such mistakes are found, is my own.

I would like to thank my mother Ruth who, apart from giving birth to, feeding, and educating me, has also helped to improve the pedagogical structure of this text. I would also like to thank Gary Engstrom for correcting many of the mistakes in English that I made.

This book is open source and may be used by anyone who wishes to do so, under the conditions of the Creative Commons Attribution License (CC-BY 3.0).

Jerusalem, March 2011

Benjamin Yakir

Contents

Preface

I Introduction to Statistics

1 Introduction
   1.1 Student Learning Objectives
   1.2 Why Learn Statistics?
   1.3 Statistics
   1.4 Probability
   1.5 Key Terms
   1.6 The R Programming Environment
      1.6.1 Some Basic R Commands
   1.7 Solved Exercises
   1.8 Summary

2 Sampling and Data Structures
   2.1 Student Learning Objectives
   2.2 The Sampled Data
      2.2.1 Variation in Data
      2.2.2 Variation in Samples
      2.2.3 Frequency
      2.2.4 Critical Evaluation
   2.3 Reading Data into R
      2.3.1 Saving the File and Setting the Working Directory
      2.3.2 Reading a CSV File into R
      2.3.3 Data Types
   2.4 Solved Exercises
   2.5 Summary

3 Descriptive Statistics
   3.1 Student Learning Objectives
   3.2 Displaying Data
      3.2.1 Histograms
      3.2.2 Box Plots
   3.3 Measures of the Center of Data
      3.3.1 Skewness, the Mean and the Median
   3.4 Measures of the Spread of Data
   3.5 Solved Exercises
   3.6 Summary

4 Probability
   4.1 Student Learning Objective
   4.2 Different Forms of Variability
   4.3 A Population
   4.4 Random Variables
      4.4.1 Sample Space and Distribution
      4.4.2 Expectation and Standard Deviation
   4.5 Probability and Statistics
   4.6 Solved Exercises
   4.7 Summary

5 Random Variables
   5.1 Student Learning Objective
   5.2 Discrete Random Variables
      5.2.1 The Binomial Random Variable
      5.2.2 The Poisson Random Variable
   5.3 Continuous Random Variable
      5.3.1 The Uniform Random Variable
      5.3.2 The Exponential Random Variable
   5.4 Solved Exercises
   5.5 Summary

6 The Normal Random Variable
   6.1 Student Learning Objective
   6.2 The Normal Random Variable
      6.2.1 The Normal Distribution
      6.2.2 The Standard Normal Distribution
      6.2.3 Computing Percentiles
      6.2.4 Outliers and the Normal Distribution
   6.3 Approximation of the Binomial Distribution
      6.3.1 Approximate Binomial Probabilities and Percentiles
      6.3.2 Continuity Corrections
   6.4 Solved Exercises
   6.5 Summary

7 The Sampling Distribution
   7.1 Student Learning Objective
   7.2 The Sampling Distribution
      7.2.1 A Random Sample
      7.2.2 Sampling From a Population
      7.2.3 Theoretical Models
   7.3 Law of Large Numbers and Central Limit Theorem
      7.3.1 The Law of Large Numbers
      7.3.2 The Central Limit Theorem (CLT)
      7.3.3 Applying the Central Limit Theorem
   7.4 Solved Exercises
   7.5 Summary

8 Overview and Integration
   8.1 Student Learning Objective
   8.2 An Overview
   8.3 Integrated Applications
      8.3.1 Example 1
      8.3.2 Example 2
      8.3.3 Example 3
      8.3.4 Example 4
      8.3.5 Example 5

II Statistical Inference

9 Introduction to Statistical Inference
   9.1 Student Learning Objectives
   9.2 Key Terms
   9.3 The Cars Data Set
   9.4 The Sampling Distribution
      9.4.1 Statistics
      9.4.2 The Sampling Distribution
      9.4.3 Theoretical Distributions of Observations
      9.4.4 Sampling Distribution of Statistics
      9.4.5 The Normal Approximation
      9.4.6 Simulations
   9.5 Solved Exercises
   9.6 Summary

10 Point Estimation
   10.1 Student Learning Objectives
   10.2 Estimating Parameters
   10.3 Estimation of the Expectation
      10.3.1 The Accuracy of the Sample Average
      10.3.2 Comparing Estimators
   10.4 Variance and Standard Deviation
   10.5 Estimation of Other Parameters
   10.6 Solved Exercises
   10.7 Summary

11 Confidence Intervals
   11.1 Student Learning Objectives
   11.2 Intervals for Mean and Proportion
      11.2.1 Examples of Confidence Intervals
      11.2.2 Confidence Intervals for the Mean
      11.2.3 Confidence Intervals for a Proportion
   11.3 Intervals for Normal Measurements
      11.3.1 Confidence Intervals for a Normal Mean
      11.3.2 Confidence Intervals for a Normal Variance
   11.4 Choosing the Sample Size
   11.5 Solved Exercises
   11.6 Summary

12 Testing Hypothesis
   12.1 Student Learning Objectives
   12.2 The Theory of Hypothesis Testing
      12.2.1 An Example of Hypothesis Testing
      12.2.2 The Structure of a Statistical Test of Hypotheses
      12.2.3 Error Types and Error Probabilities
      12.2.4 p-Values
   12.3 Testing Hypothesis on Expectation
   12.4 Testing Hypothesis on Proportion
   12.5 Solved Exercises
   12.6 Summary

13 Comparing Two Samples
   13.1 Student Learning Objectives
   13.2 Comparing Two Distributions
   13.3 Comparing the Sample Means
      13.3.1 An Example of a Comparison of Means
      13.3.2 Confidence Interval for the Difference
      13.3.3 The t-Test for Two Means
   13.4 Comparing Sample Variances
   13.5 Solved Exercises
   13.6 Summary

14 Linear Regression
   14.1 Student Learning Objectives
   14.2 Points and Lines
      14.2.1 The Scatter Plot
      14.2.2 Linear Equation
   14.3 Linear Regression
      14.3.1 Fitting the Regression Line
      14.3.2 Inference
   14.4 R-squared and the Variance of Residuals
   14.5 Solved Exercises
   14.6 Summary

15 A Bernoulli Response
   15.1 Student Learning Objectives
   15.2 Comparing Sample Proportions
   15.3 Logistic Regression
   15.4 Solved Exercises

16 Case Studies
   16.1 Student Learning Objective
   16.2 A Review
   16.3 Case Studies
      16.3.1 Physicians’ Reactions to the Size of a Patient
      16.3.2 Physical Strength and Job Performance
   16.4 Summary
      16.4.1 Concluding Remarks
      16.4.2 Discussion in the Forum

Part I

Introduction to Statistics


Chapter 1

Introduction

1.1 Student Learning Objectives

This chapter introduces the basic concepts of statistics. Special attention is given to concepts that are used in the first part of this book, the part that deals with graphical and numeric statistical ways to describe data (descriptive statistics) as well as the mathematical theory of probability that enables statisticians to draw conclusions from data.

The course applies the widely used freeware programming environment for statistical analysis known as R. In this chapter we will discuss the installation of the program and present very basic features of that system.

By the end of this chapter, the student should be able to:

• Recognize key terms in statistics and probability.

• Install the R program on an accessible computer.

• Learn and apply a few basic operations of the computational system R.

1.2 Why Learn Statistics?

You are probably asking yourself the question, “When and where will I use statistics?”. If you read any newspaper, watch television, or use the Internet, you will see statistical information. There are statistics about crime, sports, education, politics, and real estate. Typically, when you read a newspaper article or watch a news program on television, you are given sample information. With this information, you may make a decision about the correctness of a statement, claim, or “fact”. Statistical methods can help you make the “best educated guess”.

Since you will undoubtedly be given statistical information at some point in your life, you need to know some techniques for analyzing the information thoughtfully. Think about buying a house or managing a budget. Think about your chosen profession. The fields of economics, business, psychology, education, biology, law, computer science, police science, and early childhood development all require at least one course in statistics.


[Figure 1.1: Frequency of Average Time (in Hours) Spent Sleeping per Night. A bar plot with x = Time (hours of sleep, from 4 to 9) on the horizontal axis and y = Frequency (from 0 to 4) on the vertical axis.]

Included in this chapter are the basic ideas and words of probability and statistics. In the process of learning the first part of the book, and more so in the second part of the book, you will understand that statistics and probability work together.

1.3 Statistics

The science of statistics deals with the collection, analysis, interpretation, and presentation of data. We see and use data in our everyday lives. To be able to use data correctly is essential to many professions and is in your own best self-interest.

For example, assume the average time (in hours, to the nearest half-hour) a group of people sleep per night has been recorded. Consider the following data:

5, 5.5, 6, 6, 6, 6.5, 6.5, 6.5, 6.5, 7, 7, 8, 8, 9.

In Figure 1.1 this data is presented in graphical form (called a bar plot). A bar plot consists of a number axis (the x-axis) and bars (vertical lines) positioned above the number axis. The length of each bar corresponds to the number of data points that obtain the given numerical value. In the given plot the frequency of average time (in hours) spent sleeping per night is presented, with hours of sleep on the horizontal x-axis and frequency on the vertical y-axis.

Think of the following questions:

• Would the bar plot constructed from data collected from a different group of people look the same as or different from the example? Why?

• If one had carried out the same investigation in a different group of the same size and age as the one used for the example, do you think the results would be the same? Why or why not?

• Where does the data appear to cluster? How could you interpret the clustering?

The questions above ask you to analyze and interpret your data. With this example, you have begun your study of statistics.

In this course, you will learn how to organize and summarize data. Organizing and summarizing data is called descriptive statistics. Two ways to summarize data are by graphing and by numbers (for example, finding an average). In the second part of the book you will also learn how to use formal methods for drawing conclusions from “good” data. The formal methods are called inferential statistics. Statistical inference uses probabilistic concepts to determine whether the conclusions drawn are reliable or not.

Effective interpretation of data is based on good procedures for producing data and thoughtful examination of the data. In the process of learning how to interpret data you will probably encounter what may seem to be too many mathematical formulae that describe these procedures. However, you should always remember that the goal of statistics is not to perform numerous calculations using the formulae, but to gain an understanding of your data. The calculations can be done using a calculator or a computer. The understanding must come from you. If you can thoroughly grasp the basics of statistics, you can be more confident in the decisions you make in life.
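As a small preview of summarizing by numbers, the average of the sleep data shown above can be computed directly in R. The sketch below uses the base R function mean, which is treated formally later in the book:

```r
# Average hours of sleep for the recorded group (the data of Section 1.3)
X <- c(5, 5.5, 6, 6, 6, 6.5, 6.5, 6.5, 6.5, 7, 7, 8, 8, 9)
mean(X)  # 6.678571 -- the sum of the 14 values, 93.5, divided by 14
```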

1.4 Probability

Probability is the mathematical theory used to study uncertainty. It provides tools for the formalization and quantification of the notion of uncertainty. In particular, it deals with the chance of an event occurring. For example, if the different potential outcomes of an experiment are equally likely to occur then the probability of each outcome is taken to be the reciprocal of the number of potential outcomes. As an illustration, consider tossing a fair coin. There are two possible outcomes – a head or a tail – and the probability of each outcome is 1/2.

If you toss a fair coin 4 times, the outcomes may not necessarily be 2 heads and 2 tails. However, if you toss the same coin 4,000 times, the outcomes will be close to 2,000 heads and 2,000 tails. It is very unlikely to obtain more than 2,060 tails and it is similarly unlikely to obtain less than 1,940 tails. This is consistent with the expected theoretical probability of heads in any one toss. Even though the outcomes of a few repetitions are uncertain, there is a regular pattern of outcomes when the number of repetitions is large. Statistics exploits this pattern regularity in order to make extrapolations from the observed sample to the entire population.
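The regularity described above can be seen in a short simulation. The following sketch uses the R function sample to toss a virtual fair coin 4,000 times; the seed value is an arbitrary choice, made only so that the run is reproducible:

```r
# Toss a fair coin 4,000 times and count the outcomes
set.seed(1)                # arbitrary seed, for reproducibility only
tosses <- sample(c("Head", "Tail"), size = 4000, replace = TRUE)
table(tosses)              # counts of heads and of tails, each close to 2,000
mean(tosses == "Head")     # relative frequency of heads, close to 1/2
```

Re-running the code with a different seed changes the counts slightly, but they remain close to 2,000 each, in line with the pattern regularity described in the text.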

The theory of probability began with the study of games of chance such as poker. Today, probability is used to predict the likelihood of an earthquake, of rain, or whether you will get an “A” in this course. Doctors use probability to determine the chance of a vaccination causing the disease the vaccination is supposed to prevent. A stockbroker uses probability to determine the rate of return on a client’s investments. You might use probability to decide whether or not to buy a lottery ticket.

Although probability is instrumental for the development of the theory of statistics, in this introductory course we will not develop the mathematical theory of probability. Instead, we will concentrate on the philosophical aspects of the theory and use computerized simulations in order to demonstrate probabilistic computations that are applied in statistical inference.

1.5 Key Terms

In statistics, we generally want to study a population. You can think of a population as an entire collection of persons, things, or objects under study. To study the larger population, we select a sample. The idea of sampling is to select a portion (or subset) of the larger population and study that portion (the sample) to gain information about the population. Data are the result of sampling from a population.

Because it takes a lot of time and money to examine an entire population, sampling is a very practical technique. If you wished to compute the overall grade point average at your school, it would make sense to select a sample of students who attend the school. The data collected from the sample would be the students’ grade point averages. In presidential elections, opinion poll samples of 1,000 to 2,000 people are taken. The opinion poll is supposed to represent the views of the people in the entire country. Manufacturers of canned carbonated drinks take samples to determine whether the manufactured 16 ounce containers do indeed contain 16 ounces of the drink.

From the sample data, we can calculate a statistic. A statistic is a number that is a property of the sample. For example, if we consider one math class to be a sample of the population of all math classes, then the average number of points earned by students in that one math class at the end of the term is an example of a statistic. The statistic can be used as an estimate of a population parameter. A parameter is a number that is a property of the population. Since we considered all math classes to be the population, then the average number of points earned per student over all the math classes is an example of a parameter.

One of the main concerns in the field of statistics is how accurately a statistic estimates a parameter. The accuracy really depends on how well the sample represents the population. The sample must contain the characteristics of the population in order to be a representative sample.

Two words that come up often in statistics are average and proportion. If you were to take three exams in your math classes and obtained scores of 86, 75, and 92, you calculate your average score by adding the three exam scores and dividing by three (your average score would be 84.3 to one decimal place). If, in your math class, there are 40 students and 22 are men and 18 are women, then the proportion of men students is 22/40 and the proportion of women students is 18/40. Average and proportion are discussed in more detail in later chapters.
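The average and the proportions of this example can be reproduced with one-line computations in R, a small sketch using only arithmetic and the function mean:

```r
# The three exam scores and their average
scores <- c(86, 75, 92)
mean(scores)           # 84.33333..., the sum 253 divided by 3
round(mean(scores), 1) # 84.3, the average to one decimal place

# Proportions of men and of women in a class of 40 students
22/40   # proportion of men:   0.55
18/40   # proportion of women: 0.45
```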

1.6 The R Programming Environment

The R Programming Environment is a widely used open source system for statistical analysis and statistical programming. It includes thousands of functions for the implementation of both standard and exotic statistical methods and it is probably the most popular system in the academic world for the development of new statistical tools. We will use R in order to apply the statistical methods that will be discussed in the book to some example data sets and in order to demonstrate, via simulations, concepts associated with probability and its application in statistics.

The demonstrations in the book involve very basic R programming skills and the applications are implemented using, in most cases, simple and natural code. A detailed explanation will accompany the code that is used.

Learning R, like the learning of any other programming language, can be achieved only through practice. Hence, we strongly recommend that you not only read the code presented in the book but also run it yourself, in parallel to the reading of the provided explanations. Moreover, you are encouraged to play with the code: introduce changes in the code and in the data and see how the output changes as a result. One should not be afraid to experiment. At worst, the computer may crash or freeze. In both cases, restarting the computer will solve the problem . . .

You may download R from the R project home page http://www.r-project.org and install it on the computer that you are using.[1]

1.6.1 Some Basic R Commands

R is an object-oriented programming system. During the session you may create and manipulate objects by the use of functions that are part of the basic installation. You may also use the R programming language. Most of the functions that are part of the system are themselves written in the R language and one may easily write new functions or modify existing functions to suit specific needs.

Let us start by opening the R Console window by double-clicking on the R icon. Type in the R Console window, immediately after the “>” prompt, the expression “1+2” and then hit the Return key. (Do not include the double quotation marks in the expression that you type!):

> 1+2

[1] 3

>

The prompt “>” indicates that the system is ready to receive commands. Writing an expression, such as “1+2”, and hitting the Return key sends the expression to be executed. The execution of the expression may produce an object, in this case an object that is composed of a single number, the number “3”.

[1] Detailed explanation of how to install the system on an XP Windows Operating System may be found here: http://pluto.huji.ac.il/~msby/StatThink/install_R_WinXP.html.

Whenever required, the R system takes an action. If no other specifications are given regarding the required action then the system will apply the pre-programmed action. This action is called the default action. In the case of hitting the Return key after the expression that we wrote, the default is to display the produced object on the screen.

Next, let us demonstrate R in a more meaningful way by using it in order to produce the bar plot of Figure 1.1. First we have to input the data. We will produce a sequence of numbers that form the data.[2] For that we will use the function “c” that combines its arguments and produces a sequence with the arguments as the components of the sequence. Write the expression:

> c(5,5.5,6,6,6,6.5,6.5,6.5,6.5,7,7,8,8,9)

at the prompt and hit return. The result should look like this:

> c(5,5.5,6,6,6,6.5,6.5,6.5,6.5,7,7,8,8,9)

[1] 5.0 5.5 6.0 6.0 6.0 6.5 6.5 6.5 6.5 7.0 7.0 8.0 8.0 9.0

>

The function “c” is an example of an R function. A function has a name, “c” in this case, that is followed by brackets that include the input to the function. We call the components of the input the arguments of the function. Arguments are separated by commas. A function produces an output, which is typically an R object. In the current example an object of the form of a sequence was created and, according to the default application of the system, was sent to the screen and not saved.

If we want to create an object for further manipulation then we should save

it and give it a name. For example, it we want to save the vector of data under

the name “X” we may write the following expression at the prompt (and then

hit return):

> X

The arrow that appears after the “X” is produced by typing the less than key

“ x

Error: object “x” not found

An object named “x” does not exist in the R system and we have not created

such object. The object “X”, on the other hand, does exist.

Names of functions that are part of the system are fixed, but you are free to choose names for the objects that you create. For example, if one wants to create


an object by the name “my.vector” that contains the numbers 3, 7, 3, 3, and -5, then one may write the expression “my.vector <- c(3,7,3,3,-5)” at the prompt (and hit Return).

The frequency of each of the values in “X” may be computed by applying the function “table” to the object:

> table(X)
X
  5 5.5   6 6.5   7   8   9
  1   1   3   4   2   2   1

Notice that the output of the function “table” is a table of the different levels of the input vector and the frequency of each level. This output is yet another type of object.

The bar-plot of Figure 1.1 can be produced by the application of the function

“plot” to the object that is produced as an output of the function “table”:

> plot(table(X))

Observe that a graphical window was opened with the target plot. The plot that

appears in the graphical window should coincide with the plot in Figure 1.3.

This plot is practically identical to the plot in Figure 1.1. The only difference is in the names given to the axes. These names were changed in Figure 1.1 for clarity.

Clearly, if one wants to produce a bar-plot of other numerical data, all one has to do is replace the object “X” in the expression “plot(table(X))” by an object that contains the other data. For example, to plot the data in “my.vector” you may use “plot(table(my.vector))”.
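The steps of this section can thus be condensed into a short session. The following sketch repeats them with the data used above (the intermediate object name “freq” is ours, not part of the original text):

```r
# Enter the data, tabulate it, and plot the table of frequencies.
X <- c(5,5.5,6,6,6,6.5,6.5,6.5,6.5,7,7,8,8,9)  # the data used above
freq <- table(X)   # frequencies of the different values
plot(freq)         # bar plot of the frequencies, as in Figure 1.1
```

Saving the intermediate table under a name also lets you inspect it before plotting, rather than nesting the calls as in “plot(table(X))”.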

1.7 Solved Exercises

Question 1.1. A potential candidate for a political position in some state is interested in knowing her chances of winning her party’s primaries and being selected as the party’s candidate for the position. In order to examine the opinions of her party’s voters she hires the services of a polling agency. The polling is conducted among 500 registered voters of the party. One of the questions the pollsters ask refers to the willingness of the voters to vote for a female candidate for the job. Forty-two percent of the people asked said that they prefer to have a woman running for the job. Thirty-eight percent said that the candidate’s gender is irrelevant. The rest prefer a male candidate. Which of the following is (i) a population, (ii) a sample, (iii) a parameter, and (iv) a statistic:

1. The 500 registered voters.

2. The percentage, among all registered voters of the given party, of those

that prefer a male candidate.

3. The number 42% that corresponds to the percentage of those that prefer

a female candidate.

4. The voters in the state that are registered to the given party.

Figure 1.3: The Plot Produced by the Expression “plot(table(X))”

Solution (to Question 1.1.1): According to the information in the question the polling was conducted among 500 registered voters. The 500 registered voters correspond to the sample.

Solution (to Question 1.1.2): The percentage, among all registered voters

of the given party, of those that prefer a male candidate is a parameter. This

quantity is a characteristic of the population.

Solution (to Question 1.1.3): It is given that 42% of the sample prefer a

female candidate. This quantity is a numerical characteristic of the data, of the

sample. Hence, it is a statistic.

Solution (to Question 1.1.4): The voters in the state that are registered to the given party form the target population.

Question 1.2. The number of customers waiting in front of a coffee shop at opening time was recorded over 25 days. The results were:

4, 2, 1, 1, 0, 2, 1, 2, 4, 2, 5, 3, 1, 5, 1, 5, 1, 2, 1, 1, 3, 4, 2, 4, 3 .

Figure 1.4: The Plot Produced by the Expression “plot(table(n.cost))”

1. Identify the number of days in which 5 customers were waiting.

2. Find the number of waiting customers that occurred the largest number of times.

3. Find the number of waiting customers that occurred the least number of times.

Solution (to Question 1.2): One may read the data into R and create a table

using the code:

> n.cost <- c(4,2,1,1,0,2,1,2,4,2,5,3,1,5,1,5,1,2,1,1,3,4,2,4,3)
> table(n.cost)

n.cost

0 1 2 3 4 5

1 8 6 3 4 3

For convenience, one may also create the bar plot of the data using the code:

> plot(table(n.cost))


The bar plot is presented in Figure 1.4.

Solution (to Question 1.2.1): The number of days in which 5 customers were waiting is 3, since the frequency of the value “5” in the data is 3. That can be seen from the table by noticing that the number below the value “5” is 3. It can also be seen from the bar plot by observing that the height of the bar above the value “5” is equal to 3.

Solution (to Question 1.2.2): The number of waiting customers that occurred the largest number of times is 1. The value “1” occurred 8 times, more than any other value. Notice that the bar above this value is the highest.

Solution (to Question 1.2.3): The value “0”, which occurred only once, occurred the least number of times.
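The three answers can also be extracted directly with R functions. A sketch, assuming the data were stored in “n.cost” as in the solution above; the built-in functions “which.max” and “which.min” return the position of the largest and smallest entries of a sequence:

```r
n.cost <- c(4,2,1,1,0,2,1,2,4,2,5,3,1,5,1,5,1,2,1,1,3,4,2,4,3)
freq <- table(n.cost)
freq["5"]               # frequency of the value 5 (answer: 3)
names(which.max(freq))  # most frequent value (answer: "1")
names(which.min(freq))  # least frequent value (answer: "0")
```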

1.8 Summary

Glossary

Data: A set of observations taken on a sample from a population.

Statistic: A numerical characteristic of the data. A statistic estimates the corresponding population parameter. For example, the average number of contributions to the course’s forum for this term is an estimate for the average number of contributions in all future terms (a parameter).

Statistics: The science that deals with processing, presentation and inference from data.

Probability: A mathematical field that models and investigates the notion of

randomness.

Discuss in the forum

A sample is a subgroup of the population that is supposed to represent the

entire population. In your opinion, is it appropriate to attempt to represent the

entire population only by a sample?

When you formulate your answer to this question it may be useful to come up with an example of a question from your own field of interest that one may want to investigate. In the context of this example you may identify a target population which you think is suited for the investigation of the given question. The appropriateness of using a sample can then be discussed in the context of the example question and the population you have identified.


Chapter 2

Sampling and Data Structures

2.1 Student Learning Objectives

In this chapter we deal with issues associated with the data that is obtained from a sample. The variability associated with this data is emphasized and critical thinking about the validity of the data is encouraged. A method for the introduction of data from an external source into R is proposed and the data types used by R for storage are described. By the end of this chapter, the student should be able to:

• Recognize potential difficulties with sampled data.

• Read an external data file into R.

• Create and interpret frequency tables.

2.2 The Sampled Data

The aim in statistics is to learn the characteristics of a population on the basis

of a sample selected from the population. An essential part of this analysis

involves consideration of variation in the data.

2.2.1 Variation in Data

Variation is given a central role in statistics. To some extent, the assessment of variation and the quantification of its contribution to uncertainties in making inferences is the statistician’s main concern.

Variation is present in any set of data. For example, 16-ounce cans of beverage may contain more or less than 16 ounces of liquid. In one study, eight 16-ounce cans were measured and produced the following amounts (in ounces) of beverage:

15.8, 16.1, 15.2, 14.8, 15.8, 15.9, 16.0, 15.5 .

Measurements of the amount of beverage in a 16-ounce can may vary because the conditions of measurement varied or because the exact amount, 16 ounces of liquid, was not put into the cans. Manufacturers regularly run tests to determine

if the amount of beverage in a 16-ounce can falls within the desired range.

Be aware that if an investigator collects data, the data may vary somewhat

from the data someone else is taking for the same purpose. This is completely

natural. However, if two or more investigators are taking data from the same source and get very different results, it is time for them to reevaluate their data-collection methods and data-recording accuracy.

2.2.2 Variation in Samples

Two or more samples from the same population, all having the same characteristics as the population, may nonetheless be different from each other. Suppose

Doreen and Jung both decide to study the average amount of time students

sleep each night and use all students at their college as the population. Doreen may decide to sample randomly a given number of students from the entire body of college students. Jung, on the other hand, may decide to sample randomly a given number of classes and survey all students in the selected classes.

Doreen’s method is called random sampling whereas Jung’s method is called

cluster sampling. Doreen’s sample will be different from Jung’s sample even

though both samples have the characteristics of the population. Even if Doreen

and Jung used the same sampling method, in all likelihood their samples would

be different. Neither would be wrong, however.
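Random sampling itself can be carried out in R with the built-in function “sample”. In the sketch below the population is, for illustration only, the numbers 1 to 1000 standing in for student identifiers:

```r
population <- 1:1000     # identifiers of the population members (made up)
sample(population, 25)   # 25 members drawn at random, without replacement
```

Running the expression twice will, in all likelihood, produce two different samples, which is exactly the sample-to-sample variation discussed above.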

If Doreen and Jung took larger samples (i.e. the number of data values

is increased), their sample results (say, the average amount of time a student

sleeps) would be closer to the actual population average. But still, their samples

would be, most probably, different from each other.

The size of a sample (often called the number of observations) is important.

The examples you have seen in this book so far have been small. Samples of only

a few hundred observations, or even smaller, are sufficient for many purposes.

In polling, samples of 1200 to 1500 observations are considered large

enough and good enough if the survey is random and is well done. The theory of

statistical inference, that is the subject matter of the second part of this book,

provides justification for these claims.

2.2.3 Frequency

The primary way of summarizing the variability of data is via the frequency

distribution. Consider an example. Twenty students were asked how many

hours they worked per day. Their responses, in hours, are listed below:

5, 6, 3, 3, 2, 4, 7, 5, 2, 3, 5, 6, 5, 4, 4, 3, 5, 2, 5, 3 .

Let us create an R object by the name “work.hours” that contains these data and apply the function “table” to it:

> work.hours <- c(5,6,3,3,2,4,7,5,2,3,5,6,5,4,4,3,5,2,5,3)
> table(work.hours)
work.hours
2 3 4 5 6 7
3 5 3 6 2 1


Recall that the function “table” takes as input a sequence of data and produces

as output the frequencies of the different values.

We may get a clearer understanding of the meaning of the output of the function “table” if we present the outcome as a frequency table that lists the different data values in ascending order together with their frequencies. To that end we may apply the function “data.frame” to the output of the “table” function and obtain:

> data.frame(table(work.hours))
  work.hours Freq
1          2    3
2          3    5
3          4    3
4          5    6
5          6    2
6          7    1

A frequency is the number of times a given datum occurs in a data set.

According to the table above, there are three students who work 2 hours, five

students who work 3 hours, etc. The total of the frequency column, 20, represents the total number of students included in the sample.

The function “data.frame” transforms its input into a data frame, which is

the standard way of storing statistical data. We will introduce data frames in

more detail in Section 2.3 below.

A relative frequency is the fraction of times a value occurs. To find the

relative frequencies, divide each frequency by the total number of students in

the sample – 20 in this case. Relative frequencies can be written as fractions,

percents, or decimals.

As an illustration let us compute the relative frequencies in our data:

> freq <- table(work.hours)
> freq
work.hours
2 3 4 5 6 7
3 5 3 6 2 1
> sum(freq)
[1] 20
> freq/sum(freq)
work.hours
   2    3    4    5    6    7
0.15 0.25 0.15 0.30 0.10 0.05

We stored the frequencies in an object called “freq”. The contents of the object are the frequencies 3, 5, 3, 6, 2 and 1. The function “sum” sums the components of its input. The sum of the frequencies is the sample size, the total number of students that responded to the survey, which is 20. Hence, when we apply the function “sum” to the object “freq” we get 20 as an output.

The outcome of dividing an object by a number is a division of each element in the object by the given number. Therefore, when we divide “freq” by

“sum(freq)” (the number 20) we get a sequence of relative frequencies. The

first entry to this sequence is 3/20 = 0.15, the second entry is 5/20 = 0.25, and

the last entry is 1/20 = 0.05. The sum of the relative frequencies should always

be equal to 1:


> sum(freq/sum(freq))

[1] 1

The cumulative relative frequency is the accumulation of previous relative

frequencies. To find the cumulative relative frequencies, add all the previous

relative frequencies to the relative frequency of the current value. Alternatively,

we may apply the function “cumsum” to the sequence of relative frequencies:

> cumsum(freq/sum(freq))
   2    3    4    5    6    7
0.15 0.40 0.55 0.85 0.95 1.00

Observe that the cumulative relative frequency of the smallest value 2 is the relative frequency of that value (0.15). The cumulative relative frequency of the second

value 3 is the sum of the relative frequency of the smaller value (0.15) and

the relative frequency of the current value (0.25), which produces a total of

0.15 + 0.25 = 0.40. Likewise, for the third value 4 we get a cumulative relative

frequency of 0.15 + 0.25 + 0.15 = 0.55. The last entry of the cumulative relative

frequency column is one, indicating that one hundred percent of the data has

been accumulated.

The computation of the cumulative relative frequency was carried out with

the aid of the function “cumsum”. This function takes as an input argument a

numerical sequence and produces as output a numerical sequence of the same

length with the cumulative sums of the components of the input sequence.
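The behavior of “cumsum” can be seen in isolation on a short sequence:

```r
# Each output entry is the sum of all input entries up to,
# and including, that position.
cumsum(c(1, 2, 3, 4))
# [1]  1  3  6 10
```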

2.2.4 Critical Evaluation

Inappropriate methods of sampling and data collection may produce samples that do not represent the target population. A naïve application of statistical analysis to such data may produce misleading conclusions.

Consequently, it is important to evaluate critically the statistical analyses

we encounter before accepting the conclusions that are obtained as a result of

these analyses. Common problems that occur in data that one should be aware of include:

Problems with Samples: A sample should be representative of the population. A sample that is not representative of the population is biased.

Biased samples may produce results that are inaccurate and not valid.

Data Quality: Avoidable errors may be introduced to the data via inaccurate

handling of forms, mistakes in the input of data, etc. Data should be

cleaned from such errors as much as possible.

Self-Selected Samples: Responses only from people who choose to respond, such as in call-in surveys, are often biased.

Sample Size Issues: Samples that are too small may be unreliable. Larger

samples, when possible, are better. In some situations, small samples are

unavoidable and can still be used to draw conclusions. Examples: Crash

testing cars, medical testing for rare conditions.

Undue Influence: Collecting data or asking questions in a way that influences

the response.


Causality: A relationship between two variables does not mean that one causes

the other to occur. They may both be related (correlated) because of their

relationship to a third variable.

Self-Funded or Self-Interest Studies: A study performed by a person or

organization in order to support their claim. Is the study impartial? Read

the study carefully to evaluate the work. Do not automatically assume

that the study is good but do not automatically assume the study is bad

either. Evaluate it on its merits and the work done.

Misleading Use of Data: Improperly displayed graphs and incomplete data.

Confounding: Confounding occurs when the effects of multiple factors on a response cannot be separated from each other. Confounding makes it difficult or impossible to draw valid conclusions about the effect of each factor.

2.3 Reading Data into R

In the examples so far the size of the data set was very small and we were able

to input the data directly into R with the use of the function “c”. In more

practical settings the data sets to be analyzed are much larger and it is very

inefficient to enter them manually. In this section we learn how to upload data

from a file in the Comma Separated Values (CSV) format.

The file “ex1.csv” contains data on the sex and height of 100 individuals.

This file is given in the CSV format. The file can be found on the internet

at http://pluto.huji.ac.il/~msby/StatThink/Datasets/ex1.csv. We will

discuss the process of reading data from a file into R and use this file as an

illustration.

2.3.1 Saving the File and Setting the Working Directory

Before the file is read into R you may find it convenient to obtain a copy of the

file and store it in some directory on the computer and read the file from that

directory. We recommend that you create a special directory in which you keep

all the material associated with this course. In the explanations provided below we assume that the directory in which the file is stored is called “IntroStat”. (See Figure 2.1.)

Files in the CSV format are ordinary text files. They can be created manually or by converting data stored in a different format into this particular format. A convenient way to produce, browse and edit CSV files is by the use of a standard spreadsheet program such as Excel or Calc. The Excel spreadsheet is part of Microsoft’s Office suite. The Calc spreadsheet is part of the OpenOffice suite that is freely distributed by the OpenOffice Organization.

Opening a CSV file with a spreadsheet program displays a spreadsheet with the content of the file. Values in the cells of the spreadsheet may be modified directly. (However, when saving, one should pay attention to save the file in the CSV format.) Similarly, new CSV files may be created by entering the data in an empty spreadsheet. The first row should include the name of the variable, preferably as a single character string with no empty spaces. The following rows may contain the data values associated with this variable. When saving, the spreadsheet should be saved in the CSV format by using the “Save As” dialog and choosing the CSV option in the “Save as type” selection.

Figure 2.1: The File “read.csv”

After saving a file with the data in a directory, R should be notified where

the file is located in order to be able to read it. A simple way of doing so is

by setting the directory with the file as R’s working directory. The working

directory is the first place R is searching for files. Files produced by R are saved

in that directory. In Windows, during an active R session, one may set the working directory to be some target directory with the “File/Change Dir…” dialog. This dialog is opened by selecting the option “File” on the left-hand side of the menu bar at the top of the R Console window. Selecting the option “Change Dir…” in the menu that opens will start the dialog. (See Figure 2.2.) Browsing via this dialog window to the directory of choice, selecting it, and approving the selection by clicking the “OK” button in the dialog window will set the directory of choice as the working directory of R.
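The working directory can also be inspected and changed from within R itself, using the base functions “getwd” and “setwd”. (The path below is only an example.)

```r
getwd()                 # display the current working directory
setwd("C:/IntroStat")   # set the working directory to the course directory
```

This alternative works the same way on any operating system, not only on Windows.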

Rather than changing the working directory every time that R is opened one

may set a selected directory to be R’s working directory on opening. Again, we

demonstrate how to do this on the XP Windows operating system.

The R icon was added to the Desktop when the R system was installed.

The R Console is opened by double-clicking on this icon. One may change

the properties of the icon so that it sets a directory of choice as R’s working

directory.

Figure 2.2: Changing The Working Directory

In order to do so, click on the icon with the right mouse button. A menu opens in which you should select the option “Properties”. As a result, a dialog

window opens. (See Figure 2.3.) Look at the line that starts with the words

“Start in” and continues with a name of a directory that is the current working

directory. The name of this directory is enclosed in double quotes and is given with its full path, i.e. its address on the computer. This name and path should

be changed to the name and path of the directory that you want to fix as the

new working directory.

Consider again Figure 2.1. Imagine that one wants to fix the directory that

contains the file “ex1.csv” as the permanent working directory. Notice that

the full address of the directory appears at the “Address” bar on the top of

the window. One may copy the address and paste it instead of the name of the

current working directory that is specified in the “Properties” dialog of the

R icon. One should make sure that the address to the new directory is, again,

placed between double-quotes. (See in Figure 2.4 the dialog window after the

changing the address of the working directory. Compare this to Figure 2.3 of

the window before the change.) After approving the change by clicking the

“OK” bottom the new working directory is set. Henceforth, each time that the

R Console is opened by double-clicking the icon it will have the designated

directory as its working directory.

In the rest of this book we assume that a designated directory is set as R’s

working directory and that all external files that need to be read into R, such

as “ex1.csv” for example, are saved in that working directory. Once a working

directory has been set then the history of subsequent R sessions is stored in that

directory. Hence, if you choose to save the image of the session when you end

the session then objects created in the session will be loaded the next time the R Console is opened.

Figure 2.3: Setting the Working Directory (Before the Change)

Figure 2.4: Setting the Working Directory (After the Change)

2.3.2 Reading a CSV File into R

Now that a copy of the file “ex1.csv” is placed in the working directory we

would like to read its content into R. Reading of files in the CSV format can be

carried out with the R function “read.csv”. To read the file of the example we

run the following line of code in the R Console window:

> ex.1 <- read.csv("ex1.csv")

The content of the file is now stored in the object “ex.1”. Typing the name of the object at the prompt and hitting Return displays its content:

> ex.1

         id    sex height
1   5696379 FEMALE    182
2   3019088   MALE    168
3   2038883   MALE    172
4   1920587 FEMALE    154
5   6006813   MALE    174
6   4055945 FEMALE    176
.         .      .      .
.         .      .      .
.         .      .      .
98  9383288   MALE    195
99  1582961 FEMALE    129
100 9805356   MALE    172
>

(Notice that we have erased the middle rows. In the R Console window you

should obtain the full table. However, in order to see the upper part of the

output you may need to scroll up the window.)

The object “ex.1”, the output of the function “read.csv”, is a data frame.

Data frames are the standard tabular format of storing statistical data. The

columns of the table are called variables and correspond to measurements. In

this example the three variables are:

id: A 7-digit number that serves as a unique identifier of the subject.

sex: The sex of each subject. The values are either “MALE” or “FEMALE”.

height: The height (in centimeter) of each subject. A numerical value.
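A data frame with the same structure may also be built by hand with the function “data.frame”, which is one way to see how variables of different types fit together in one table. The values below are made up for illustration:

```r
# A tiny data frame mimicking the structure of "ex.1".
small <- data.frame(id = c(1234567, 7654321),
                    sex = c("FEMALE", "MALE"),
                    height = c(182, 168))
small   # a data frame with 2 observations of the 3 variables
```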

1 If the file is located in a different directory then the complete address, including the path to the file, should be provided. The file need not reside on the computer. One may provide, for example, a URL (an internet address) as the address. Thus, instead of saving the file of the example on the computer one may read its content into an R object by using the line of code “ex.1 <- read.csv("http://pluto.huji.ac.il/~msby/StatThink/Datasets/ex1.csv")”.

> cumsum(freq)
 1  2  3  4  5  6  7
 4  7 18 28 32 38 45

1. How many cows were involved in this study?

2. How many cows gave birth to a total of 4 calves?

3. What is the relative frequency of cows that gave birth to at least 4 calves?

Solution (to Question 2.2.1): The total number of cows that were involved in this study is 45. The object “freq” contains the table of frequencies of the cows, divided according to the number of calves that they had. The cumulative

frequency of all the cows that had 7 calves or less, which includes all cows in

the study, is reported under the number “7” in the output of the expression

“cumsum(freq)”. This number is 45.

Solution (to Question 2.2.2): The number of cows that gave birth to a total

of 4 calves is 10. Indeed, the cumulative frequency of cows that gave birth to

4 calves or less is 28. The cumulative frequency of cows that gave birth to 3

calves or less is 18. The frequency of cows that gave birth to exactly 4 calves is

the difference between these two numbers: 28 – 18 = 10.

Solution (to Question 2.2.3): The relative frequency of cows that gave birth

to at least 4 calves is 27/45 = 0.6. Notice that the cumulative frequency of

cows that gave at most 3 calves is 18. The total number of cows is 45. Hence,

the number of cows with 4 or more calves is the difference between these two

numbers: 45 – 18 = 27. The relative frequency of such cows is the ratio between

this number and the total number of cows: 27/45 = 0.6.
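The arithmetic in the three solutions can be reproduced in R. A sketch, assuming the object “freq” holds the frequency table from the question:

```r
cs <- cumsum(freq)             # cumulative frequencies, as in the question
cs["7"]                        # total number of cows: 45
cs["4"] - cs["3"]              # cows with exactly 4 calves: 28 - 18 = 10
(cs["7"] - cs["3"]) / cs["7"]  # relative frequency of 4 or more calves: 0.6
```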

2.5 Summary

Glossary

Population: The collection, or set, of all individuals, objects, or measurements

whose properties are being studied.

Sample: A portion of the population under study. A sample is representative if it characterizes the population being studied.

Frequency: The number of times a value occurs in the data.

Relative Frequency: The ratio between the frequency and the size of the data.

Cumulative Relative Frequency: The term applies to an ordered set of data

values from smallest to largest. The cumulative relative frequency is the

sum of the relative frequencies for all values that are less than or equal to

the given value.

Data Frame: A tabular format for storing statistical data. Columns correspond to variables and rows correspond to observations.

Variable: A measurement that may be carried out over a collection of subjects.

The outcome of the measurement may be numerical, which produces a

quantitative variable; or it may be non-numeric, in which case a factor is

produced.

Observation: The evaluation of a variable (or variables) for a given subject.

CSV Files: A digital format for storing data frames.

Factor: Qualitative data that is associated with categorization or the description of an attribute.

Quantitative: Data generated by numerical measurements.

Discuss in the forum

Factors are qualitative data that are associated with categorization or the description of an attribute. On the other hand, numeric data are generated by

numerical measurements. A common practice is to code the levels of factors

using numerical values. What do you think of this practice?

In the formulation of your answer to the question you may think of an example of a factor variable from your own field of interest. You may describe a benefit or a disadvantage that results from the use of numerical values to code the levels of this factor.
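The issue can be demonstrated in R. When a factor is coerced to numbers, the labels are replaced by the numerical codes of the levels (alphabetical order by default), and arithmetic that makes no sense for the original categories becomes possible. The data below are made up:

```r
sex <- factor(c("MALE", "FEMALE", "FEMALE", "MALE"))
as.numeric(sex)        # the codes: 2 1 1 2 ("FEMALE" = 1, "MALE" = 2)
mean(as.numeric(sex))  # computable, but statistically meaningless
```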


Chapter 3

Descriptive Statistics

3.1 Student Learning Objectives

This chapter deals with numerical and graphical ways to describe and display

data. This area of statistics is called descriptive statistics. You will learn to

calculate and interpret these measures and graphs. By the end of this chapter,

you should be able to:

• Use histograms and box plots in order to display data graphically.

• Calculate measures of central location: mean and median.

• Calculate measures of the spread: variance, standard deviation, and interquartile range.

• Identify outliers, which are values that do not fit the rest of the distribution.

3.2 Displaying Data

Once you have collected data, what will you do with it? Data can be described

and presented in many different formats. For example, suppose you are interested in buying a house in a particular area. You may have no clue about the

house prices, so you may ask your real estate agent to give you a sample data

set of prices. Looking at all the prices in the sample is often overwhelming. A

better way may be to look at the median price and the variation of prices. The

median and variation are just two ways that you will learn to describe data.

Your agent might also provide you with a graph of the data.

A statistical graph is a tool that helps you learn about the shape of the

distribution of a sample. The graph can be a more effective way of presenting

data than a mass of numbers because we can see where data clusters and where

there are only a few data values. Newspapers and the Internet use graphs to

show trends and to enable readers to compare facts and figures quickly.

Statisticians often start the analysis by graphing the data in order to get an

overall picture of it. Afterwards, more formal tools may be applied.

In the previous chapters we used the bar plot, where bars that indicate the frequencies of the values in the data are placed over these values. In this chapter


Figure 3.1: Histogram of Height

our emphasis will be on histograms and box plots, which are other types of plots. Some of the other types of graphs that are frequently used, but will not be discussed in this book, are the stem-and-leaf plot, the frequency polygon (a type of broken-line graph) and the pie chart. The types of plots that will be discussed, as well as the types that will not, are all tightly linked to the notion of frequency of the data that was introduced in Chapter 2 and are intended to give a graphical representation of this notion.

3.2.1

Histograms

The histogram is a frequently used method for displaying the distribution of

continuous numerical data. An advantage of a histogram is that it can readily

display large data sets. A rule of thumb is to use a histogram when the data

set consists of 100 values or more.

One may produce a histogram in R by the application of the function “hist”

to a sequence of numerical data. Let us read into R the data frame “ex.1” that contains data on the sex and height of 100 individuals and create a histogram of the heights:

> ex.1 <- read.csv("ex1.csv")
> hist(ex.1$height)

The outcome of the function is a plot that appears in the graphical window and

is presented in Figure 3.1.

The data set, which is the content of the CSV file “ex1.csv”, was used in

Chapter 2 in order to demonstrate the reading of data that is stored in an external

file into R. The first line of the above script reads in the data from “ex1.csv”

into a data frame object named “ex.1” that maintains the data internally in R.

The second line of the script produces the histogram. We will discuss below the

code associated with this second line.

A histogram consists of contiguous boxes. It has both a horizontal axis and

a vertical axis. The horizontal axis is labeled with what the data represents (the

height, in this example). The vertical axis presents frequencies and is labeled

“Frequency”. By the examination of the histogram one can appreciate the shape

of the data, the center, and the spread of the data.

The histogram is constructed by dividing the range of the data (the x-axis)

into equal intervals, which are the bases for the boxes. The height of each box

represents the count of the number of observations that fall within the interval.

For example, consider the box with the base between 160 and 170. There is a total of 19 subjects with height larger than 160 but no more than 170 (that is, 160 < height ≤ 170). Consequently, the height of that box1 is 19.
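The count behind each box can be reproduced by hand. The following sketch uses a small illustrative vector (not the "ex.1" data) so that the numbers are easy to verify:

```r
# Illustrative heights (hypothetical values, not the ex.1 data)
heights <- c(155, 162, 168, 170, 171, 184)
# Count the observations that fall in the interval (160, 170]
sum(heights > 160 & heights <= 170)  # 3 (namely 162, 168 and 170)
```

The comparison operators produce a sequence of TRUE/FALSE values, and "sum" counts the TRUE entries.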
The input to the function “hist” should be a sequence of numerical values.
In principle, one may use the function "c" to produce a sequence of data and apply the histogram-plotting function to its output. However, in the current case the data are already stored in the data frame "ex.1"; all we need to learn is how to extract the data so that it can be used as input to the function "hist" that plots the histogram.
Notice the structure of the input that we have used in order to construct
the histogram of the variable “height” in the “ex.1” data frame. One may
address the variable “variable.name” in the data frame “dataframe.name”
using the format: “dataframe.name$variable.name”. Indeed, when we type
the expression “ex.1$height” we get as an output the values of the variable
“height” from the given data frame:
> ex.1$height

[1] 182 168 172 154 174 176 193 156 157 186 143 182 194 187 171

[16] 178 157 156 172 157 171 164 142 140 202 176 165 176 175 170

[31] 169 153 169 158 208 185 157 147 160 173 164 182 175 165 194

[46] 178 178 186 165 180 174 169 173 199 163 160 172 177 165 205

[61] 193 158 180 167 165 183 171 191 191 152 148 176 155 156 177

[76] 180 186 167 174 171 148 153 136 199 161 150 181 166 147 168

[91] 188 170 189 117 174 187 141 195 129 172

This is a numeric sequence and can serve as the input to a function that expects a

numeric sequence as input, a function such as “hist”. (But also other functions,

for example, “sum” and “cumsum”.)

1 In some books a histogram is introduced as a form of a density. In densities the area of the box represents the frequency or the relative frequency. In the current example the height would have been 19/10 = 1.9 had the area of the box represented the frequency, and it would have been (19/100)/10 = 0.019 had the area represented the relative frequency. However, in this book we follow the default of R, in which the height represents the frequency.
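The density heights quoted in the footnote are simple arithmetic, and this minimal sketch just redoes the computation (19 observations in a box of width 10, out of 100 observations in total):

```r
freq <- 19    # observations in the 160-170 box
n <- 100      # total number of observations
width <- 10   # width of the box
freq / width        # height if the area represented the frequency: 1.9
(freq / n) / width  # height if the area represented the relative frequency: 0.019
```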


There are 100 observations in the variable "ex.1$height". That many observations cannot be displayed on the screen in one line. Consequently, the sequence of the data is wrapped and displayed over several lines. Notice that the square brackets on the left-hand side of each line indicate the position in the sequence of the first value on that line. Hence, the number on the first line is "[1]". The number on the second line is "[16]", since the second line starts with the 16th observation in the display given in the book. Notice that the numbers in the square brackets in your R Console window may be different, depending on the display settings of your computer.

3.2.2 Box Plots

The box plot, or box-whisker plot, gives a good graphical overall impression of

the concentration of the data. It also shows how far from most of the data the

extreme values are. In principle, the box plot is constructed from five values: the

smallest value, the first quartile, the median, the third quartile, and the largest

value. The median, the first quartile, and the third quartile will be discussed

here, and then once more in the next section.

The median, a number, is a way of measuring the “center” of the data. You

can think of the median as the “middle value,” although it does not actually

have to be one of the observed values. It is a number that separates ordered

data into halves. Half the values are the same size or smaller than the median

and half the values are the same size or larger than it. For example, consider

the following data that contains 14 values:

1, 11.5, 6, 7.2, 4, 8, 9, 10, 6.8, 8.3, 2, 2, 10, 1 .

Ordered, from smallest to largest, we get:

1, 1, 2, 2, 4, 6, 6.8, 7.2, 8, 8.3, 9, 10, 10, 11.5 .

The median is between the 7th value, 6.8, and the 8th value, 7.2. To find the median, add the two values together and divide by 2:

(6.8 + 7.2)/2 = 7

The median is 7. Half of the values are smaller than 7 and half of the values are larger than 7.

Quartiles are numbers that separate the data into quarters. Quartiles may

or may not be part of the data. To find the quartiles, first find the median or

second quartile. The first quartile is the middle value of the lower half of the

data and the third quartile is the middle value of the upper half of the data.

For illustration consider the same data set from above:

1, 1, 2, 2, 4, 6, 6.8, 7.2, 8, 8.3, 9, 10, 10, 11.5 .

The median or second quartile is 7. The lower half of the data is:

1, 1, 2, 2, 4, 6, 6.8 .

The middle value of the lower half is 2. The number 2, which is part of the data

in this case, is the first quartile which is denoted Q1. One-fourth of the values

are the same or less than 2 and three-fourths of the values are more than 2.


Figure 3.2: Box Plot of the Example

The upper half of the data is:

7.2, 8, 8.3, 9, 10, 10, 11.5

The middle value of the upper half is 9. The number 9 is the third quartile

which is denoted Q3. Three-fourths of the values are less than 9 and one-fourth

of the values2 are more than 9.
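These by-hand computations can be reproduced with the function "median" applied to the sorted halves of the data. (R's own quartile computation, such as the one used by the "summary" function, may follow a slightly different rule; see the footnote.)

```r
# The 14 values of the running example, in sorted order
x <- sort(c(1, 11.5, 6, 7.2, 4, 8, 9, 10, 6.8, 8.3, 2, 2, 10, 1))
median(x)        # the median (second quartile): 7
median(x[1:7])   # median of the lower half, the first quartile Q1: 2
median(x[8:14])  # median of the upper half, the third quartile Q3: 9
```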

Outliers are values that do not fit with the rest of the data and lie outside of

the normal range. Data points with values that are much too large or much too

small in comparison to the vast majority of the observations will be identified

as outliers. In the context of the construction of a box plot we identify potential

outliers with the help of the inter-quartile range (IQR). The inter-quartile range

is the distance between the third quartile (Q3) and the first quartile (Q1), i.e.,

IQR = Q3 − Q1. A data point that is larger than the third quartile plus 1.5

times the inter-quartile range will be marked as a potential outlier. Likewise,

a data point smaller than the first quartile minus 1.5 times the inter-quartile

2 The actual computation in R of the first quartile and the third quartile may vary slightly

from the description given here, depending on the exact structure of the data.


range will also be so marked. Outliers may have a substantial effect on the outcome of statistical analysis; therefore, it is important to be alerted to the presence of outliers.

In the running example we obtained an inter-quartile range of size 9 − 2 = 7. The upper threshold for defining an outlier is 9 + 1.5 × 7 = 19.5 and the lower threshold is 2 − 1.5 × 7 = −8.5. All data points are within the two thresholds; hence there are no outliers in this data.

In the construction of a box plot one uses a vertical rectangular box and two vertical "whiskers" that extend from the ends of the box to the smallest and largest data values that are not outliers. Outlier values, if any exist, are marked as points above or below the endpoints of the whiskers. The smallest and largest non-outlier data values label the endpoints of the axis. The first quartile marks one end of the box and the third quartile marks the other end. The central 50% of the data fall within the box.

One may produce a box plot with the aid of the function “boxplot”. The

input to the function is a sequence of numerical values and the output is a plot.

As an example, let us produce the box plot of the 14 data points that were used

as an illustration:

> boxplot(c(1,11.5,6,7.2,4,8,9,10,6.8,8.3,2,2,10,1))

The resulting box plot is presented in Figure 3.2. Observe that the end

points of the whiskers are 1, for the minimal value, and 11.5 for the largest

value. The end values of the box are 9 for the third quartile and 2 for the first

quartile. The median 7 is marked inside the box.

Next, let us examine the box plot for the height data:

> boxplot(ex.1$height)

The resulting box plot is presented in Figure 3.3. In order to assess the plot let

us compute quartiles of the variable:

> summary(ex.1$height)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  117.0   158.0   171.0   170.1   180.2   208.0

The function "summary", when applied to a numerical sequence, produces the minimal and maximal entries, as well as the first, second and third quartiles (the second being the Median). It also computes the average of the numbers (the Mean), which will be discussed in the next section.

Let us compare the results with the plot in Figure 3.3. Observe that the

median 171 coincides with the thick horizontal line inside the box and that the

lower end of the box coincides with first quartile 158.0 and the upper end with

180.2, which is the third quartile. The inter-quartile range is 180.2 − 158.0 =

22.2. The upper threshold is 180.2 + 1.5 × 22.2 = 213.5. This threshold is

larger than the largest observation (208.0). Hence, the largest observation is

not an outlier and it marks the end of the upper whisker. The lower threshold

is 158.0 − 1.5 × 22.2 = 124.7. The minimal observation (117.0) is less than this

threshold. Hence it is an outlier and it is marked as a point below the end of the

lower whisker. The second smallest observation is 129. It lies above the lower

threshold and it marks the end point of the lower whisker.
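The whisker and outlier logic can be checked directly from the quartiles reported by "summary". This sketch simply redoes the threshold arithmetic:

```r
q1 <- 158.0; q3 <- 180.2   # quartiles reported by summary(ex.1$height)
iqr <- q3 - q1             # inter-quartile range: 22.2
upper <- q3 + 1.5 * iqr    # upper threshold: 213.5
lower <- q1 - 1.5 * iqr    # lower threshold: 124.7
117.0 < lower              # TRUE: the minimal observation is an outlier
208.0 > upper              # FALSE: the maximal observation is not
```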


Figure 3.3: Box Plot of Height

3.3 Measures of the Center of Data

The two most widely used measures of the central location of the data are the

mean (average) and the median. To calculate the average weight of 50 people

one should add together the 50 weights and divide the result by 50. To find

the median weight of the same 50 people, one may order the data and locate

a number that splits the data into two equal parts. The median is generally a

better measure of the center when there are extreme values or outliers because

it is not affected by the precise numerical values of the outliers. Nonetheless,

the mean is the most commonly used measure of the center.

We shall use small Latin letters such as x to mark the sequence of data.

In such a case we may mark the sample mean by placing a bar over the x: x̄

(pronounced “x bar”).

The mean can be calculated by averaging the data points directly, or it can be calculated from the relative frequencies of the values that are present in the data. In the latter case one multiplies each distinct value by its relative frequency and then sums the products across all values. To see that both ways of calculating

the mean are the same, consider the data:

1, 1, 1, 2, 2, 3, 4, 4, 4, 4, 4 .

[Three panels, each titled "Histogram of x"]

Figure 3.4: Three Histograms

In the first way of calculating the mean we get:

x̄ = (1 + 1 + 1 + 2 + 2 + 3 + 4 + 4 + 4 + 4 + 4)/11 ≈ 2.7 .

Alternatively, we may note that the distinct values in the sample are 1, 2, 3,

and 4 with relative frequencies of 3/11, 2/11, 1/11 and 5/11, respectively. The

alternative method of computation produces:

x̄ = 1 × (3/11) + 2 × (2/11) + 3 × (1/11) + 4 × (5/11) ≈ 2.7 .
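Both computations can be reproduced in R. This short sketch checks that the direct average and the relative-frequency sum agree:

```r
x <- c(1, 1, 1, 2, 2, 3, 4, 4, 4, 4, 4)
mean(x)                          # the direct average
vals <- c(1, 2, 3, 4)            # the distinct values in the sample
rel.freq <- c(3, 2, 1, 5) / 11   # their relative frequencies
sum(vals * rel.freq)             # the same result, 30/11, about 2.727
```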

3.3.1 Skewness, the Mean and the Median

Consider the following data set:

4, 5, 6, 6, 6, 7, 7, 7, 7, 7, 7, 8, 8, 8, 9, 10


This data produces the uppermost histogram in Figure 3.4. Each interval has width one and each value is located at the middle of an interval. The histogram displays a symmetrical distribution of data. A distribution is symmetrical if a vertical line can be drawn at some point in the histogram such that the shapes to the left and to the right of the vertical line are mirror images of each other.

Let us compute the mean and the median of this data:

> x <- c(4, 5, 6, 6, 6, 7, 7, 7, 7, 7, 7, 8, 8, 8, 9, 10)
> mean(x)
[1] 7
> median(x)
[1] 7

The mean and the median are each 7 for these data. In a perfectly symmetrical

distribution, the mean and the median are the same3 .

The functions “mean” and “median” were used in order to compute the mean

and median. Both functions expect a numeric sequence as an input and produce

the appropriate measure of centrality of the sequence as an output.

The histogram for the data:

4, 5, 6, 6, 6, 7, 7, 7, 7, 7, 7, 8

is not symmetrical and is displayed in the middle of Figure 3.4. The right-hand

side seems “chopped off” compared to the left side. The shape of the distribution

is called skewed to the left because it is pulled out towards the left.

Let us compute the mean and the median for this data:

> x <- c(4, 5, 6, 6, 6, 7, 7, 7, 7, 7, 7, 8)
> mean(x)
[1] 6.416667
> median(x)
[1] 7

(Notice that the original data is replaced by the new data when object x is

reassigned.) The median is still 7, but the mean is less than 7. The relation

between the mean and the median reflects the skewing.

Consider yet another set of data:

6, 7, 7, 7, 7, 7, 7, 8, 8, 8, 9, 10

The histogram for the data is also not symmetrical and is displayed at the

bottom of Figure 3.4. Notice that it is skewed to the right. Compute the mean

and the median:

> x <- c(6, 7, 7, 7, 7, 7, 7, 8, 8, 8, 9, 10)
> mean(x)
[1] 7.583333
> median(x)
[1] 7

3 In the case of a symmetric distribution the vertical line of symmetry is located at the

mean, which is also equal to the median.


The median is yet again equal to 7, but this time the mean is greater than 7.

Again, the mean reflects the skewing.

In summary, if the distribution of data is skewed to the left then the mean

is less than the median. If the distribution of data is skewed to the right then

the median is less than the mean.

Examine the data on the height in “ex.1”:

> mean(ex.1$height)

[1] 170.11

> median(ex.1$height)

[1] 171

Observe that the histogram of the height (Figure 3.1) is skewed to the left. This

is consistent with the fact that the mean is less than the median.

3.4 Measures of the Spread of Data

One measure of the spread of the data is the inter-quartile range that was

introduced in the context of the box plot. However, the most important measure

of spread is the standard deviation.

Before dealing with the standard deviation let us discuss the calculation of

the variance. If xi is a data value for subject i and x̄ is the sample mean,

then xi − x̄ is called the deviation of subject i from the mean, or simply the

deviation. In a data set, there are as many deviations as there are data values.

The variance is in principle the average of the squares of the deviations.

Consider the following example: In a fifth grade class, the teacher was interested in the average age and the standard deviation of the ages of her students.

Here are the ages of her students to the nearest half a year:

9, 9.5, 9.5, 10, 10, 10, 10, 10.5, 10.5, 10.5, 10.5, 11, 11, 11, 11, 11, 11,

11.5, 11.5, 11.5 .

In order to explain the computation of the variance of these data let us create

an object x that contains the data:

> x <- c(9, 9.5, 9.5, 10, 10, 10, 10, 10.5, 10.5, 10.5, 10.5, 11, 11, 11, 11,
+ 11, 11, 11.5, 11.5, 11.5)
> length(x)
[1] 20

Pay attention to the fact that we did not write the "+" at the beginning of the second line. That symbol was produced by R when moving to the next line, to indicate that the expression is not yet complete and will not be executed. Only after inputting the right bracket and hitting the Return key does R carry out the command and create the object "x". When you execute this example yourself on your own computer make sure not to copy the "+" sign. Instead, if you hit the Return key after the last comma on the first line, the plus sign will be produced by R as a new prompt and you can go on typing in the rest of the numbers.

The function “length” returns the length of the input sequence. Notice that

we have a total of 20 data points.

The next step involves the computation of the deviations:


> x.bar <- mean(x)
> x.bar
[1] 10.525

> x - x.bar

[1] -1.525 -1.025 -1.025 -0.525 -0.525 -0.525 -0.525 -0.025

[9] -0.025 -0.025 -0.025 0.475 0.475 0.475 0.475 0.475

[17] 0.475 0.975 0.975 0.975

The average of the observations is equal to 10.525, and when we subtract this number from each of the components of the sequence x we obtain the deviations. For example, the first deviation is obtained as 9 − 10.525 = −1.525, the second deviation is 9.5 − 10.525 = −1.025, and so forth. The 20th deviation is 11.5 − 10.525 = 0.975, and this is the last number that is presented in the output.

From a more technical point of view, observe that the expression that computed the deviations, "x - x.bar", involved the subtraction of a single value (x.bar) from a sequence of 20 values (x). The expression resulted in the subtraction of the value from each component of the sequence. This is an example of the general way in which R operates on sequences: the typical behavior of R is to apply an operation to each component of the sequence.

As yet another illustration of this property consider the computation of the

squares of the deviations:

> (x - x.bar)^2

[1] 2.325625 1.050625 1.050625 0.275625 0.275625 0.275625

[7] 0.275625 0.000625 0.000625 0.000625 0.000625 0.225625

[13] 0.225625 0.225625 0.225625 0.225625 0.225625 0.950625

[19] 0.950625 0.950625

Recall that "x - x.bar" is a sequence of length 20. We apply the square function to this sequence, and it is applied to each of the components of the sequence. Indeed, for the first component we have (−1.525)² = 2.325625, for the second component (−1.025)² = 1.050625, and for the last component (0.975)² = 0.950625.

For the variance we sum the square of the deviations and divide by the total

number of data values minus one (n − 1). The standard deviation is obtained

by taking the square root of the variance:

> sum((x - x.bar)^2)/(length(x)-1)
[1] 0.5125
> sqrt(sum((x - x.bar)^2)/(length(x)-1))
[1] 0.715891

If the variance is produced as a result of dividing the sum of squares by the

number of observations minus one then the variance is called the sample variance.

The function "var" computes the sample variance and the function "sd" computes the sample standard deviation. The input to both functions is the sequence of data values and the outputs are the sample variance and the standard deviation, respectively:

> var(x)

[1] 0.5125


> sd(x)

[1] 0.715891

In the computation of the variance we divide the sum of squared deviations

by the number of deviations minus one and not by the number of deviations.

The reason for that stems from the theory of statistical inference that will be

discussed in Part II of this book. Unless the size of the data is small, dividing

by n or by n − 1 does not introduce much of a difference.
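The difference between the two divisors is easy to see numerically. This sketch compares them on the age data of the example:

```r
x <- c(9, 9.5, 9.5, 10, 10, 10, 10, 10.5, 10.5, 10.5, 10.5,
       11, 11, 11, 11, 11, 11, 11.5, 11.5, 11.5)
n <- length(x)
dev2 <- (x - mean(x))^2   # the squared deviations
sum(dev2) / (n - 1)       # sample variance, as computed by var(x): 0.5125
sum(dev2) / n             # dividing by n instead: 0.486875
```

With 20 observations the two versions already differ by only about 5%, and the gap shrinks as the data grows.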

The variance is a squared measure and does not have the same units as

the data. Taking the square root solves the problem. The standard deviation

measures the spread in the same units as the data.

The sample standard deviation, s, is either zero or is larger than zero. When

s = 0, there is no spread and the data values are equal to each other. When s

is a lot larger than zero, the data values are very spread out about the mean.

Outliers can make s very large.

The standard deviation is a number that measures how far data values are

from their mean. For example, if the data contains the value 7 and if the mean

of the data is 5 and the standard deviation is 2, then the value 7 is one standard

deviation from its mean because 5 + 1 × 2 = 7. We say, then, that 7 is one

standard deviation larger than the mean 5 (or also say “to the right of 5”). If

the value 1 was also part of the data set, then 1 is two standard deviations

smaller than the mean (or two standard deviations to the left of 5) because

5 − 2 × 2 = 1.
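The number of standard deviations that separates a value from the mean can be computed directly by dividing the deviation by the standard deviation, as in this sketch of the example above:

```r
m <- 5  # the mean of the example
s <- 2  # the standard deviation of the example
(7 - m) / s  # the value 7 is 1 standard deviation above the mean
(1 - m) / s  # the value 1 is 2 standard deviations below the mean
```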

The standard deviation, when first presented, may not be too simple to

interpret. By graphing your data, you can get a better “feel” for the deviations

and the standard deviation. You will find that in symmetrical distributions, the

standard deviation can be very helpful but in skewed distributions, the standard

deviation is less so. The reason is that the two sides of a skewed distribution

have different spreads. In a skewed distribution, it is better to look at the first

quartile, the median, the third quartile, the smallest value, and the largest value.

3.5 Solved Exercises

Question 3.1. Three sequences of data were saved in 3 R objects named “x1”,

“x2” and “x3”, respectively. The application of the function “summary” to each

of these objects is presented below:

> summary(x1)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  0.000   2.498   3.218   3.081   3.840   4.871
> summary(x2)
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
0.0001083 0.5772000 1.5070000 1.8420000 2.9050000 4.9880000
> summary(x3)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  2.200   3.391   4.020   4.077   4.690   6.414

In Figure 3.5 one may find the histograms of these three data sequences, given

in a random order. In Figure 3.6 one may find the box plots of the same data,

given in yet a different order.

[Three histogram panels, titled Histogram 1, Histogram 2 and Histogram 3]

Figure 3.5: Three Histograms

1. Match the summary result with the appropriate histogram and the appropriate box plot.

2. Is the value 0.000 in the sequence “x1” an outlier?

3. Is the value 6.414 in the sequence “x3” an outlier?

Solution (to Question 3.1.1): Consider the data “x1”. From the summary

we see that it is distributed in the range between 0 and slightly below 5. The

central 50% of the distribution are located between 2.5 and 3.8. The mean and

median are approximately equal to each other, which suggests an approximately

symmetric distribution. Consider the histograms in Figure 3.5. Histograms 1 and 3 correspond to distributions in the appropriate range. However, the distribution in Histogram 3 is concentrated at lower values than suggested by the given first and third quartiles. Consequently, we match the summary of "x1" with Histogram 1.

Consider the data “x2”. Again, the distribution is in the range between 0 and

slightly below 5. The central 50% of the distribution are located between 0.6 and

1.8. The mean is larger than the median, which suggests a distribution skewed to the right. Therefore, we match the summary of "x2" with Histogram 3.

[Three box plot panels, titled Boxplot 1, Boxplot 2 and Boxplot 3]

Figure 3.6: Three Box Plots

For the data in "x3" we may note that the distribution is in the range between 2 and 6. The histogram that fits this description is Histogram 2.

The box plot is essentially a graphical representation of the information presented by the function "summary". Following the rationale of matching the summaries with the histograms, we obtain that Histogram 1 should be matched with Box-plot 2 in Figure 3.6, Histogram 2 matches Box-plot 3, and Histogram 3 matches Box-plot 1. Indeed, it is easier to match the box plots with the summaries. However, it is a good idea to practice the direct matching of histograms with box plots.

Solution (to Question 3.1.2): The data in “x1” fits Box-plot 2 in Figure 3.6.

The value 0.000 is the smallest value in the data and it corresponds to the

smallest point in the box plot. Since this point is below the bottom whisker it

follows that it is an outlier. More directly, we may note that the inter-quartile

range is equal to IQR = 3.840 − 2.498 = 1.342. The lower threshold is equal to

2.498 − 1.5 × 1.342 = 0.485, which is larger than the given value. Consequently,

the given value 0.000 is an outlier.


Solution (to Question 3.1.3): Observe that the data in "x3" fits Box-plot 3 in Figure 3.6. The value 6.414 is the largest value in the data; it corresponds to the endpoint of the upper whisker in the box plot and is not an outlier. Alternatively, we may note that the inter-quartile range is equal to IQR = 4.690 − 3.391 = 1.299. The upper threshold is equal to 4.690 + 1.5 × 1.299 = 6.6385, which is larger than the given value. Consequently, the given value 6.414 is not an outlier.

Question 3.2. The number of toilet facilities in each of 30 buildings was counted.

The results are recorded in an R object by the name “x”. The frequency table

of the data “x” is:

> table(x)

x

2 4 6 8 10

10 6 10 2 2

1. What is the mean (x̄) of the data?

2. What is the sample standard deviation of the data?

3. What is the median of the data?

4. What is the inter-quartile range (IQR) of the data?

5. How many standard deviations away from the mean is the value 10?

Solution (to Question 3.2.1): In order to compute the mean of the data we

may write the following simple R code:

> x.val <- c(2, 4, 6, 8, 10)
> freq <- c(10, 6, 10, 2, 2)
> rel.freq <- freq/sum(freq)
> x.bar <- sum(x.val*rel.freq)
> x.bar
[1] 4.666667

We created an object "x.val" that contains the unique values of the data and an object "freq" that contains the frequencies of the values. The object "rel.freq" contains the relative frequencies, the ratios between the frequencies and the number of observations. The average is computed as the sum of the products of the values with their relative frequencies.
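As a check (the use of "rep" here is an alternative route, not part of the solution above), one may reconstruct the full data from the frequency table and average it directly:

```r
# Expand the frequency table into the 30 individual observations
x <- rep(c(2, 4, 6, 8, 10), times = c(10, 6, 10, 2, 2))
length(x)  # 30 observations
mean(x)    # 4.666667, agreeing with the relative-frequency computation
```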
