# Concordia College Ratio and Regression Questions

R Companion for Sampling: Design and Analysis, Third Edition

Yan Lu and Sharon L. Lohr

First edition published 2022 by CRC Press, 6000 Broken Sound Parkway NW, Suite 300, Boca Raton, FL 33487-2742

and by CRC Press, 2 Park Square, Milton Park, Abingdon, Oxon, OX14 4RN

© 2022 Yan Lu and Sharon L. Lohr

CRC Press is an imprint of Taylor & Francis Group, LLC

Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may rectify in any future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.

For permission to photocopy or use material electronically from this work, access www.copyright.com or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. For works that are not available on CCC please contact mpkbookspermissions@tandf.co.uk

Trademark notice: Product or corporate names may be trademarks or registered trademarks and are used only for identification and explanation without intent to infringe. SAS® and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.

Library of Congress Cataloging‑in‑Publication Data

Names: Lu, Yan, author. | Lohr, Sharon L., 1960- author.

Title: R companion for sampling : design and analysis / Yan Lu and Sharon L. Lohr.

Description: First edition. | Boca Raton : CRC Press, 2022. | Includes bibliographical references and index. | Summary: “The R Companion for Sampling: Design and Analysis, designed to be read alongside Sampling: Design and Analysis, Third Edition by Sharon L. Lohr (SDA; 2022, CRC Press), shows how to use functions in base R and contributed packages to perform calculations for the examples in SDA. No prior experience with R is needed. Chapter 1 tells you how to obtain R and RStudio, introduces basic features of the R statistical software environment, and helps you get started with analyzing data. Each subsequent chapter provides step-by-step guidance for working through the data examples in the corresponding chapter of SDA, with code, output, and interpretation. Tips and warnings help you develop good programming practices and avoid common survey data analysis errors. R features and functions are introduced as they are needed so you can see how each type of sample is selected and analyzed. Each chapter builds on the knowledge developed earlier for simpler designs; after finishing the book, you will know how to use R to select and analyze almost any type of probability sample” -- Provided by publisher.

Identifiers: LCCN 2021039318 (print) | LCCN 2021039319 (ebook) | ISBN 9781032135946 (paperback) | ISBN 9781032132150 (hardback) | ISBN 9781003228196 (ebook)

Subjects: LCSH: R (Computer program language) | Sampling (Statistics)

Classification: LCC QA276.45.R3 L8 2022 (print) | LCC QA276.45.R3 (ebook) | DDC 519.5/202855133–dc23

LC record available at https://lccn.loc.gov/2021039318

LC ebook record available at https://lccn.loc.gov/2021039319

ISBN: 978-1-032-13215-0 (hbk)

ISBN: 978-1-032-13594-6 (pbk)

ISBN: 978-1-003-22819-6 (ebk)

DOI: 10.1201/9781003228196

Access the Support Material: https://www.routledge.com/9781032135946

To Guoyi and Lynn, and to Doug

Contents

Preface

1 Getting Started

1.1 Obtaining the Software

1.2 Installing R Packages

1.3 R Basics

1.4 Reading Data into R

1.5 Saving Output

1.6 Integrating R Output into LaTeX Documents

1.7 Missing Data

1.8 Summary, Tips, and Warnings

2 Simple Random Sampling

2.1 Selecting a Simple Random Sample

2.2 Computing Statistics from a Simple Random Sample

2.3 Additional Code for Exercises

2.4 Summary, Tips, and Warnings

3 Stratified Sampling

3.1 Allocation Methods

3.2 Selecting a Stratified Random Sample

3.3 Computing Statistics from a Stratified Random Sample

3.4 Estimating Proportions from a Stratified Random Sample

3.5 Additional Code for Exercises

3.6 Summary, Tips, and Warnings

4 Ratio and Regression Estimation

4.1 Ratio Estimation

4.2 Regression Estimation

4.3 Domain Estimation

4.4 Poststratification

4.5 Ratio Estimation with Stratified Sampling

4.6 Model-Based Ratio and Regression Estimation

4.7 Summary, Tips, and Warnings

5 Cluster Sampling with Equal Probabilities

5.1 Estimates from One-Stage Cluster Samples

5.2 Estimates from Multi-Stage Cluster Samples

5.3 Model-Based Design and Analysis for Cluster Samples

5.4 Additional Code for Exercises

5.5 Summary, Tips, and Warnings

6 Sampling with Unequal Probabilities

6.1 Selecting a Sample with Unequal Probabilities

6.1.1 Sampling with Replacement

6.1.2 Sampling without Replacement

6.2 Selecting a Two-Stage Cluster Sample

6.3 Computing Estimates from an Unequal-Probability Sample

6.3.1 Estimates from with-Replacement Samples

6.3.2 Estimates from without-Replacement Samples

6.4 Summary, Tips, and Warnings

7 Complex Surveys

7.1 Selecting a Stratified Two-Stage Sample

7.2 Estimating Quantiles

7.3 Computing Estimates from Stratified Multistage Samples

7.4 Univariate Plots from Complex Surveys

7.5 Scatterplots from Complex Surveys

7.6 Additional Code for Exercises

7.7 Summary, Tips, and Warnings

8 Nonresponse

8.1 How R Functions Treat Missing Data

8.2 Poststratification and Raking

8.3 Imputation

8.4 Summary, Tips, and Warnings

9 Variance Estimation in Complex Surveys

9.1 Replicate Samples and Random Groups

9.2 Constructing Replicate Weights

9.2.1 Balanced Repeated Replication

9.2.2 Jackknife

9.2.3 Bootstrap

9.2.4 Replicate Weights and Nonresponse Adjustments

9.3 Using Replicate Weights from a Survey Data File

9.4 Summary, Tips, and Warnings

10 Categorical Data Analysis in Complex Surveys

10.1 Contingency Tables and Odds Ratios

10.2 Chi-Square Tests

10.3 Loglinear Models

10.4 Summary, Tips, and Warnings

11 Regression with Complex Survey Data

11.1 Straight Line Regression with a Simple Random Sample

11.2 Linear Regression for Complex Survey Data

11.3 Using Regression to Compare Domain Means

11.4 Logistic Regression

11.5 Additional Resources and Code

11.6 Summary, Tips, and Warnings

12 Additional Topics for Survey Data Analysis

12.1 Two-Phase Sampling

12.2 Estimating the Size of a Population

12.2.1 Ratio Estimation of Population Size

12.2.2 Loglinear Models with Multiple Lists

12.3 Small Area Estimation

12.4 Summary

A Data Set Descriptions

Bibliography

Index

Preface

R Companion for Sampling: Design and Analysis, Third Edition shows how to use the R statistical software environment to perform the calculations in the textbook Sampling: Design and Analysis, Third Edition (SDA) by Sharon L. Lohr. It is intended to be read in conjunction with SDA and is not a standalone text. The parallel book by Lohr (2022) shows how to perform the computations for the examples using SAS® software, and could be read together with this book and SDA to learn how to perform the analyses in each software package.

All code and data sets can be downloaded from any of the following websites:

https://math.unm.edu/~luyan/rbook.html

https://www.sharonlohr.com

https://www.routledge.com/9781032135946

The first two websites also contain additional programs, not discussed in this book, that you can adapt for some of the SDA exercises. The data sets used in this book have also been saved in R format in the contributed R package SDAResources (Lu and Lohr, 2021).

In this book, we give step-by-step guidance for using functions from base R and contributed packages to select samples and analyze the data sets discussed in Chapters 1–13 of SDA. The software, however, can do much more than analyze the examples presented in this book. You can find information on advanced capabilities for the survey and sampling contributed packages in the documentation for those packages by Lumley (2020) and Tillé and Matei (2021); the books and articles by Lumley (2004, 2010) and Tillé and Matei (2010) provide additional information about the packages. Goga (2018) gives an overview of using R for survey sampling.

For easy reference, the index at the back of the book gives page numbers for the examples in SDA. To locate the code and output for Example 2.5, for example, look up the subentry “Example 02.05” under “Examples in SDA” in the index. The book also gives code and suggestions for some of the exercises in SDA, and these are listed under index entry “Exercises in SDA.”

Each chapter ends with a summary section containing tips and warnings for the analyses discussed in that chapter. These provide ways of avoiding common survey data analysis errors and checking whether you did the analysis correctly.

Although prior experience with R is helpful, it is not needed to read this book. Chapter 1 tells how to obtain the software and do basic operations in R. It also lists resources for learning more about programming in R and tells how to obtain help.

This book makes use of functions that exist in base R and contributed packages, and does not discuss how to write R functions. One of R’s most valuable features, however, is the capacity for writing functions to carry out new tasks. Advanced R users may want to write their own functions to select samples or analyze data from a complex survey. When teaching survey sampling to students who have R programming experience, we have sometimes asked them to write their own functions to carry out various sampling tasks. This helps solidify their knowledge of the material and allows them to do computations not available in existing functions. For example, we have asked students to write R functions to perform allocation for and analyze data from a stratified random sample, select a with-replacement unequal-probability sample using Lahiri’s method, compute the Sen–Yates–Grundy estimate of the variance, simulate the sampling distribution of a statistic, and find empirical estimates of the coverage probability of a confidence interval for a biased estimator.
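As an illustration of the kind of function such an exercise might produce, here is a minimal sketch of Lahiri's acceptance-rejection method for drawing a with-replacement sample with probability proportional to size. The function name `lahiri_sample`, its arguments, and the size measures are hypothetical and chosen only for illustration; they do not come from this book or from any contributed package. The sketch assumes every unit has a known, positive size measure, and it draws a continuous uniform number so that non-integer sizes also work (with integer sizes, one could instead draw an integer between 1 and the largest size).

```r
# Hypothetical helper (not from the book or any package): select n units with
# replacement, with probability proportional to `size`, using Lahiri's
# acceptance-rejection method.
lahiri_sample <- function(size, n) {
  stopifnot(is.numeric(size), length(size) >= 1, all(size > 0), n >= 1)
  N <- length(size)
  max_size <- max(size)
  selected <- integer(n)
  draws <- 0
  while (draws < n) {
    i <- sample.int(N, 1)        # candidate unit, each with probability 1/N
    u <- runif(1, 0, max_size)   # uniform random number on (0, max_size)
    if (u <= size[i]) {          # accept with probability size[i] / max_size
      draws <- draws + 1
      selected[draws] <- i
    }
  }
  selected
}

set.seed(2022)                   # arbitrary seed, for reproducibility only
sizes <- c(10, 40, 25, 5, 20)    # made-up size measures for five units
lahiri_sample(sizes, n = 3)      # indices of the selected units
```

Because a candidate is kept only when the uniform draw falls at or below its size, each accepted draw selects unit i with probability proportional to its size, without the cumulative totals that the cumulative-size method requires.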

All code, data sets, and output in this book are provided for educational purposes only and without warranty. Base R does not contain functions for survey data, and this book relies heavily on contributed packages that have been developed. These packages are in widespread use and have been quality-checked by their authors and other users. We have verified that the calculations from the R functions used for the examples in this book agree with calculations by the formulas and with calculations performed in other survey software packages.

Other R packages may not be checked as carefully, however. Although R contributed packages undergo some consistency and functionality tests when they are submitted (see Wickham, 2015, for a description of checks that are performed), no central authority reviews the packages to make sure that the functions do what they claim or that the algorithms perform computations accurately. Most R contributed packages are not peer-reviewed, and you should be aware that some may contain errors.

The code and output in this book were developed using version 4.0.4 of R for Windows (R Core Team, 2021) and the versions of the packages listed in their respective bibliography entries, and all code in the book works with those versions. But R is a dynamic language, and the R Core Team and authors of contributed packages can change or remove functions at any time. Although most authors who revise a package try to avoid changes that will affect previously written code, functions in R are not guaranteed to be backward compatible; it is possible that R code you write today may not work the same way with future versions of the software. If backward compatibility is important to you (for example, if you will be using the same code to produce estimates each year for an annual survey), you may want to perform or check your computations in a package that is backward compatible, such as SAS software. If a function changes in a subsequent version of an R package, you can either:

• Read the documentation and change your code so that it works with the modified function, or

• Download and use the older version of the package. You can find previous versions on the package’s web page under the heading “Old sources.”
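The second option can also be carried out from within R. The following is only a sketch under stated assumptions: the helper names `cran_archive_url` and `install_old_version` are hypothetical, the version number "4.0" is an illustrative placeholder rather than a recommendation, and the URL pattern assumes the package's old sources are stored in the usual CRAN archive layout. Installing from source with `repos = NULL` does not install dependencies, and packages containing compiled code also need build tools on your machine.

```r
# Hypothetical helper: build the URL of an archived source version of a
# package, assuming the standard CRAN archive layout.
cran_archive_url <- function(pkg, version) {
  sprintf("https://cran.r-project.org/src/contrib/Archive/%s/%s_%s.tar.gz",
          pkg, pkg, version)
}

# Hypothetical helper: install that archived version from source.
# Requires an internet connection; dependencies must already be installed.
install_old_version <- function(pkg, version) {
  install.packages(cran_archive_url(pkg, version),
                   repos = NULL, type = "source")
}

# Record the versions you are currently running before changing anything.
print(R.version.string)
if (requireNamespace("survey", quietly = TRUE)) {
  print(packageVersion("survey"))
}

# Illustrative call (version number is a placeholder); uncomment to run.
# install_old_version("survey", "4.0")
```

Keeping a note of the R and package versions alongside your analysis code makes it easier to reproduce earlier results if a function later changes.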

Acknowledgments. Many thanks to John Kimmel, our editor at CRC Press, for encouraging us to write this book, and to the CRC Press production team for all their support and help. We are grateful to Yves Tillé and Thomas Lumley for answering questions about the sampling and survey packages. Students in Yan Lu’s sampling class at the University of New Mexico provided helpful suggestions for clarifying the material. We also want to thank Lynn Zhang for helping with the preparation of the SDAResources package.

1 Getting Started

The R statistical software environment is a powerful and flexible platform for performing statistical analyses. The basic package contains thousands of functions for computing statistics, and user-contributed packages for this open-source software provide thousands more. Advanced users can write their own functions to implement new methods for statistical analyses.

Best of all, the base R package and all user-contributed packages are available free of charge to anyone with an internet connection.

This chapter tells you how to obtain R software and contributed packages and introduces you to some basic R functions. It also shows you how to read data sets into R and save output and graphics produced while you are using the package.

Conventions used in this book. This book is intended to be read in conjunction with Sampling: Design and Analysis, Third Edition by Sharon L. Lohr, henceforth referred to as SDA. Many of the examples in this book refer to figures, tables, examples, or exercises in SDA. To avoid confusion, we refer to figures in SDA as “Figure x.x in SDA.” We refer to figures in this book as “Figure x.x” with no qualifier.

The names of external data files and programs, such as agsrs.csv and ch02.R, are in typewriter font, as are the names of R packages and code we type. Variable names, function names, and internal R data set names are in italic type.

Much of this book consists of R commands and output, set in light shaded boxes such as the following:

```r
# This is a comment
# Enter data values into vector 'myvec'
myvec
```
