Exploratory Data Analysis Project Paper
The final project will be a short exploratory data analysis using a real data set to investigate some research questions. At a minimum, the project must involve importing and cleaning data and creating visualizations and/or statistical summaries that help to address the research questions. The most successful projects will incorporate multiple and varied aspects of the coding techniques we cover in class. There is no required page length. The document should be written in R Markdown and turned in as a .pdf or .html file. Note that the R Markdown file for the final project should be well-organized (i.e., with sections and headers) and readable (i.e., it should be free from typographical errors and the writing should flow well). Data from Kaggle and machine learning repositories are not allowed. PLOS ONE is my first recommendation for places to look for data.
- Project details and syllabus are attached.
Here are some links to publicly available data resources:
- PLOS ONE – open access journal that requires all data be made publicly available; search for a randomized trial; papers published after 2016 are more likely to have data files available: https://journals.plos.org/plosone/
- data.gov – open data from the US government; search for a randomized trial: https://data.gov/
- Harvard Datahub for Field Experiments – data from randomized experiments in the social sciences: https://dataverse.harvard.edu/dataverse/DFEEP
- openICPSR – social, behavioral, and health sciences research data: https://www.openicpsr.org/openicpsr/
- Journal of Open Psychology Data – openly available data from a variety of papers: https://openpsychologydata.metajnl.com/
- Data – Open Access Journal – open access journal on data in science: https://www.mdpi.com/journal/data
- The National Center for Education Statistics (https://nces.ed.gov/), including ECLS (https://nces.ed.gov/surveys/SurveyGroups.asp?group…)
- The Youth Risk Behavior Survey, available from the CDC: https://www.cdc.gov/healthyyouth/data/yrbs/data.ht…
- The Current Population Survey, available from the U.S. Census Bureau: https://www.census.gov/programs-surveys/cps/data-d…
- The Fragile Families & Child Wellbeing Study, available from Princeton University: https://fragilefamilies.princeton.edu/documentatio…
Introduction to Data Analysis and Graphics in R
HUDM 5026 – 3 Credits
Fall 2022
----------------------------------------------------------------------
Instructor: Bryan Keller, PhD
Faculty Webpage
Time: T 3:00 PM – 4:40 PM
ONLINE via Canvas Zoom
Office Hours: TBA
Office: GDH 456
Contact: keller4@tc.columbia.edu
Course TA: Rachel Lee (yl3751@tc.columbia.edu)
Office Hour: Fri 12 – 1 PM
----------------------------------------------------------------------
Email is the best way to get in touch with me outside of class meetings.
Overview:
R is free open-source software maintained by a Core Team whose members
continually work to improve the R source code. The base distribution of R comes with
the capability to run a wide variety of statistical procedures and many options for
graphical exploration of data. In addition to the base distribution, many user-written
packages that extend the functionality of R are freely available. This course provides
an introduction to R with emphasis on the application of fundamental data management,
graphical and statistical techniques.
Most of the work you do for this course will be done outside of class through
readings, homework assignments, DataCamp, and your final project paper. By the time
we meet as a class, you should be prepared to discuss the assigned reading materials and
work on data analysis problems related to the readings.
Prerequisites:
One of either HUDM 4122 or HUDM 4125. A course in regression is
recommended. Prior programming experience with R is not required.
Required texts:
[WG] Wickham, H. & Grolemund, G. (2017). R for Data Science: Import, Tidy,
Transform, Visualize, and Model Data. Sebastopol, CA: O’Reilly Media,
Inc. (link)
[C] Cairo, A. (2016). The Truthful Art: Data, Charts, and Maps for
Communication. Berkeley, CA: New Riders. (Amazon link)
[K] Kabacoff, R. I. (2011). R in Action: Data Analysis and Graphics with R.
Shelter Island, NY: Manning, 2nd Ed. (Amazon link; any edition is ok but
chapters noted in the reading list below refer to the 2nd edition)
Other useful texts:
• Braun, W. J. & Murdoch, D. J. (2007). A First Course in Statistical Programming
with R. Cambridge, UK: Cambridge University Press. (A primer for statistical
programming in R.)
• Field, A., Miles, J., & Field, Z. (2012). Discovering Statistics Using R. Thousand
Oaks, CA: Sage. (Survey of statistical procedures for social sciences in R.)
• Fox, J. and Weisberg, S. (2011). An R Companion to Applied Regression (2nd ed.).
Thousand Oaks, CA: SAGE. (This book is the R cookbook that accompanies
Fox’s Applied Regression Analysis text.)
• Matloff, N. (2011). The Art of R Programming: A Tour of Statistical Software
Design. San Francisco, CA: No Starch Press. (R from a design perspective.)
• R Core Development Team (2022). An Introduction to R. (link) (Free online
resource that covers the basics with more of an emphasis on statistical modeling.)
• Tufte, E. (2001). The Visual Display of Quantitative Information (2nd ed.). Cheshire, CT:
Graphics Press. (Classic book on theory of visual display of data.)
• Venables, W. N. & Ripley, B. D. (2002). Modern Applied Statistics with S (4th
ed.). New York, NY: Springer-Verlag. (A classic book that showcases the power
and functionality of base R with a focus on applied statistical procedures.)
• [FOX] Fox, J. D. (2020). Regression diagnostics: An introduction (2nd ed.).
SAGE. link
• [GGP2] Wickham, H. (2010). ggplot2: Elegant Graphics for Data Analysis (2nd
ed.). New York, NY: Springer. (This book is a ggplot2 handbook.)
Links
• Comprehensive R Archive Network (CRAN). Download R here. Also find
manuals and archived help lists. https://cran.r-project.org/
• RStudio. Download RStudio here. https://www.rstudio.com/
• Tidyverse. Packages for data science. https://www.tidyverse.org/
• Stack Overflow. For R programming questions and answers.
https://stackoverflow.com/
• Quick-R. Resources and cheat sheet-style primers. http://www.statmethods.net/
• R Markdown. https://rmarkdown.rstudio.com/
• R-CHARTS. Plot examples with code. https://r-charts.com/
• Course datacamp invite (link will be provided in Canvas)
Learning goals:
By the end of this course you should be comfortable doing the following:
1. Importing data and handling routine data management issues including
   a. identifying and deleting or replacing missing values
   b. dropping or adding cases or variables
   c. merging data from different sources
   d. pivoting data from long to wide or wide to long format
   e. creating new variables by transforming existing variables
2. Summarizing data with graphical displays including
   a. univariate quantitative data
      i. histogram
      ii. boxplot
      iii. kernel density plot
      iv. qq-plot
   b. univariate qualitative data
      i. barplot
      ii. pie chart
   c. bivariate quantitative data
      i. scatterplot
   d. bivariate qualitative data
      i. stacked or grouped barplot
   e. multivariate data of mixed type
      i. scatterplot with enhanced features
      ii. conditional boxplot
   f. geographic data
      i. choropleth
3. Creating frequency tables and calculating statistics including
   a. measures of central tendency (mean, median, mode)
   b. measures of spread (variance, IQR, quantiles)
   c. Pearson and non-parametric correlations
   d. running and interpreting linear regression output
4. Working in both base R and in specialized packages including those in the
   tidyverse such as ggplot2 and dplyr.
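As a rough illustration of the data management skills in goal 1, here is a minimal base R sketch. The data frames `scores` and `groups` and all of their values are made up for this example, and replacing a missing value with the column mean is just one simple option, not a recommendation. The tidyverse offers equivalents for the same steps (for example, `dplyr::left_join()` and `tidyr::pivot_longer()`).

```r
# Hypothetical toy data: test scores with missing values, plus a second
# source holding each participant's study arm.
scores <- data.frame(id   = 1:4,
                     pre  = c(10, NA, 14, 9),
                     post = c(12, 15, NA, 11))
groups <- data.frame(id  = 1:4,
                     arm = c("control", "treatment", "control", "treatment"))

# (a) Identify missing values, then either drop or replace them
n_missing <- colSums(is.na(scores))   # number of NAs in each column
complete  <- na.omit(scores)          # keep only rows with no NAs
imputed   <- scores
imputed$pre[is.na(imputed$pre)] <- mean(imputed$pre, na.rm = TRUE)

# (c) Merge data from different sources on a shared key
merged <- merge(scores, groups, by = "id")

# (e) Create a new variable by transforming existing ones
merged$gain <- merged$post - merged$pre

# (d) Pivot from wide (one row per id) to long (one row per id and time)
long <- reshape(scores, direction = "long",
                varying = c("pre", "post"), v.names = "score",
                timevar = "time", times = c("pre", "post"), idvar = "id")

print(n_missing)
print(merged)
print(long)
```

Note that `gain` is `NA` wherever either score is missing; whether to drop, impute, or keep such cases depends on the research question.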
Tentative Course Schedule
     Date   Topic                                          Readings
01.  09/06  Introduction                                   Syllabus
DATA TYPES AND GRAPHING
02.  09/13  Functions and sample statistics                K 1, 5; C 1
03.  09/20  Vectors & univariate plots                     K 2, 6; WG 1:2; C 2
04.  09/27  Data frames & bivariate plots                  K 3; WG 3, 11; C 3
05.  10/04  More on plotting multivariate data             K 11; WG 7; C 4
DATA WRANGLING AND BASIC STATISTICS
06.  10/11  Data transformations & descriptive statistics  K 7, 8; WG 5; C 5
07.  10/18  Tibbles & tidy data & frequency tables         K 7; WG 10, 12; C 6
08.  10/25  Strings, regular expressions & correlation     K 5, 7; WG 14; C 7
09.  11/01  Review goals of final project
INTERMEDIATE-LEVEL EXTENSIONS
     11/08  Election Day. College Holiday.
10.  11/15  Relational data & t-tests                      K 7; WG 13; C 8
11.  11/22  Choropleth maps                                K 8; GGP2 6; link; C 10
12.  11/29  Multiple linear regression                     K 8; WG 23:24; C 9
13.  12/06  Regression diagnostics                         FOX 4-7
14.  12/13  Review & meetings about final projects
15.  12/20  No class; final projects due
Grading:
Attendance will account for 5% of the final grade. Your attendance grade will
be based on your attendance record, including being on time to class sessions. Note that
you will be required to use both audio and video so make necessary preparations to have
them in good working order by class time. Failure to have audio and video access will
negatively impact your attendance grade.
DataCamp participation will account for 15% of the final grade. Register for
a free DataCamp account and get started right away with the “Introduction to R” course.
During the first week of class I will send you an email invite to join my DataCamp
classroom, and once you join you will have access to other courses. Complete 2,000 XP
per week for classes 1-10 for a total of 20,000 XP by the end of class 10. For full credit,
be sure to hit 20,000 XP by the end of the day on Monday 6/27. Note that it is fine to mix
and match chapters from different courses but that all XP has to be completed from
chapters related to R or relevant R packages. Some suggested courses:
1. Introduction to R (6200 XP)
2. Introduction to the Tidyverse (4150 XP)
3. Introduction to Data Visualization with ggplot2 (4300 XP)
4. Transforming Data with dplyr (3850 XP)
5. Introduction to Statistics in R (4250 XP)
6. Intermediate R (6950 XP)
Reading responses will account for 15% of the final grade. Each week with
assigned readings will have a discussion thread on Canvas where the expectation is that
you will post at least one thought or question you had about the reading and also respond
to at least one classmate’s post. The posts do not need to be long and it is not necessary to
post about all of the reading assignments, although it is ok to do so if you have something
to share.
Homework problem sets will account for 35% of the final grade. Most class
periods will involve time to work on a POTD which will be written up and submitted on
Canvas before the next class meeting. POTDs should be written up in R Markdown.
The final project will account for 30% of the final grade. The final project will
be a short exploratory data analysis using a real data set to investigate some research
questions. At a minimum, the project must involve importing and cleaning data and
creating visualizations and/or statistical summaries that help to address the research
questions. The most successful projects will incorporate multiple and varied aspects of
the coding techniques we cover in class. There is no required page length. The document
should be written in R Markdown and turned in as a .pdf or .html file. Note that the R
Markdown file for the final project should be well-organized (i.e., with sections and
headers) and readable (i.e., it should be free from typographical errors and the writing
should flow well). Data from Kaggle and machine learning repositories are not allowed.
PLOS ONE is my first recommendation for places to look for data.
Here are some links to publicly available data resources:
• PLOS ONE – open access journal that requires all data be made publicly
available; search for a randomized trial; papers published after 2016 are more
likely to have data files available
• data.gov – open data from the US government; search for a randomized trial
• Harvard Datahub for Field Experiments – data from randomized experiments in
the social sciences
• openICPSR – social, behavioral, and health sciences research data
• Journal of Open Psychology Data – openly available data from a variety of papers
• Data – Open Access Journal – open access journal on data in science
• The National Center for Education Statistics, including ECLS
• The Youth Risk Behavior Survey, available from the CDC
• The Current Population Survey, available from the U.S. Census Bureau
• The Fragile Families & Child Wellbeing Study, available from Princeton
University
Average  Rare  (93,100]  (90,93]  (87,90]  (83,87]  (80,83]  (77,80]  (73,77]  (70,73]  [0,70]
Grade    A+    A         A-       B+       B        B-       C+       C        C-       F
For the full text of the grading symbols approved by Teachers College Faculty
please refer to http://www.tc.columbia.edu/policylibrary/Grading.
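The interval notation in the grade table matches the labels produced by R's `cut()` function, so the weighting and cutoffs can be sketched in a few lines. This is only an illustration: the category scores below are made up, the sketch assumes every category is scored on a 0-100 scale with no rounding, and A+ is left out because the table marks it as rare rather than tying it to a cutoff.

```r
# Category weights from the Grading section (percent of the final grade)
weights <- c(attendance = 5, datacamp = 15, reading = 15,
             homework = 35, project = 30)
stopifnot(sum(weights) == 100)

# Hypothetical category scores on a 0-100 scale (made up for illustration)
scores <- c(attendance = 100, datacamp = 100, reading = 90,
            homework = 85, project = 88)

# Weighted average across the five categories
final_avg <- weighted.mean(scores, weights)

# Cutoffs from the grade table; right = TRUE gives intervals of the form (a, b],
# and include.lowest = TRUE closes the first interval so that it is [0, 70].
breaks <- c(0, 70, 73, 77, 80, 83, 87, 90, 93, 100)
grades <- c("F", "C-", "C", "C+", "B-", "B", "B+", "A-", "A")

to_letter <- function(avg) {
  as.character(cut(avg, breaks = breaks, labels = grades,
                   right = TRUE, include.lowest = TRUE))
}

print(final_avg)
print(to_letter(final_avg))
print(to_letter(c(93, 93.1, 70, 70.5)))
```

Because each interval is closed on the right, an average of exactly 93 falls in (90,93] and an average of exactly 70 falls in [0,70].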
Late work:
If extenuating circumstances arise and you know you will not be able to turn in an
assignment on time, get in touch with me as far in advance of the due date as possible,
but no later than 24 hours before the due date, except in the case of an emergency. Late
work submitted without advance approval will receive no credit.
The Provost and Dean of the College in conjunction with the Faculty has adopted the
following statements to be included on all Teachers College syllabi.
1. Accommodations – The College will make reasonable accommodations for persons
with documented disabilities. Students are encouraged to contact the Office of
Access and Services for Individuals with Disabilities (OASID) for information about
registration. You can reach OASID by email at oasid@tc.columbia.edu, stop by 163
Thorndike Hall or call 212-678-3689. Services are available only to students who
have registered and submit appropriate documentation. As your instructor, I am
happy to discuss specific needs with you as well. Please report any access related
concerns about instructional material to OASID and to me as your instructor.
2. Incomplete Grades – For the full text of the Incomplete Grade policy please refer
to http://www.tc.columbia.edu/policylibrary/Incomplete Grades
3. Student Responsibility for Monitoring TC email account – Students are expected
to monitor their TC email accounts. For the full text of the Student Responsibility for
Monitoring TC email account please refer
to http://www.tc.columbia.edu/policylibrary/Student Responsibility for Monitoring
TC Email Account
4. Religious Observance – For the full text of the Religious Observance policy, please
refer to http://www.tc.columbia.edu/policylibrary/provost/religious-observance/
5. Sexual Harassment and Violence Reporting – Teachers College is committed to
maintaining a safe environment for students. Because of this commitment and
because of federal and state regulations, we must advise you that if you tell any of
your instructors about sexual harassment or gender-based misconduct involving a
member of the campus community, your instructor is required to report this
information to the Title IX Coordinator, Janice Robinson. She will treat this
information as private, but will need to follow up with you and possibly look into the
matter. The Ombuds officer for Gender-Based Misconduct is a confidential resource
available for students, staff and faculty. “Gender-based misconduct” includes sexual
assault, stalking, sexual harassment, dating violence, domestic violence, sexual
exploitation, and gender-based harassment. For more information,
see http://sexualrespect.columbia.edu/gender-based-misconduct-policy-students.
Emergency Plan:
TC is prepared for a wide range of emergencies. After declaring an emergency situation, the
President/Provost will provide the community with critical information on procedures and
available assistance. If travel to campus is not feasible, instructors will facilitate academic
continuity through Canvas and other technologies, if possible.
1. It is the student’s responsibility to ensure that they are set to receive email
notifications from TC and communications from their instructor at their TC email
address.
2. Within the first two sessions for the course, students are expected to review and be
prepared to follow the instructions stated in the emergency plan.
3. The plan may consist of downloading or obtaining all available readings for the
course or the instructor may provide other instructions.
Rubric for HUDM 5026 Projects
This is an exploratory data analysis project with summary statistics and graphical displays. That
said, you should have some motivating research questions as you begin your exploration of the
data.
Paper details:
• Write in R Markdown
• Include references in APA style in a section at the end
• Upload your work as a knitted .pdf or .html document
Introduction & literature review (20 points)
• 1-2 pages, most of which should be literature review
• There are two types of papers you would want to review here: (1) papers which have used
your dataset, and (2) papers which looked at the same variables to answer questions
similar to yours. If you use a data set from PLOS ONE, you will already have one paper to
discuss.
• When you review research, focus on two things: (1) what did they find? and (2) how does
it relate to and inform my motivating questions?
• If you are working with a data set from a particular paper, try to replicate some of the
fundamental findings from the paper.
Methods and Sample (30 points)
• 1-2 pages; go into more depth here than you would if you were writing a research
paper for another class.
• Describe your sample (20 points): who are they? You must present a table of
descriptive statistics, or you will lose points. Suppose you are working with six
variables for this project. In that case, I expect to see descriptive information about these
variables in the methods section so that the scale and measurement of each variable is
understood. Generally, you want to report at least the mean and SD of numeric variables
(I also think the min and max are very useful, as they help find data errors and give
confidence that there are no issues with your variables). For categorical variables,
give frequencies and percentages. (Descriptive data here should focus on the raw
categorical variable, not the dummy variable.)
• Describe the graphical and statistical methods you will use for this project (10 points).
For example, if you use boxplots, you should note that the boxplot provides a summary
of the first, second, and third quartiles, the min and the max, and any outliers, and
explain how to interpret the plot. Also describe the cleaning process. How did you handle
missing data? How did you organize the data, and what did you do to preprocess it to get it
ready for analysis?
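The descriptive table and boxplot summary described above might be put together along the following lines. This is a base R sketch, not a required template; it uses the built-in `mtcars` data purely as a stand-in (it is not an acceptable project data set), and the choice of variables is arbitrary.

```r
# Stand-in data: three numeric variables and one categorical variable
dat <- mtcars[, c("mpg", "hp", "wt")]
dat$cyl <- factor(mtcars$cyl)   # keep the raw categorical variable as a factor

# Numeric variables: n, mean, SD, min, and max, skipping missing values
num_vars <- c("mpg", "hp", "wt")
desc <- t(sapply(dat[num_vars], function(x) {
  c(n    = sum(!is.na(x)),
    mean = mean(x, na.rm = TRUE),
    sd   = sd(x, na.rm = TRUE),
    min  = min(x, na.rm = TRUE),
    max  = max(x, na.rm = TRUE))
}))
print(round(desc, 2))

# Categorical variable: frequencies and percentages (NAs shown if present)
freq <- table(dat$cyl, useNA = "ifany")
pct  <- round(100 * prop.table(freq), 1)
print(cbind(n = freq, percent = pct))

# What a boxplot of mpg summarizes: $stats holds the lower whisker end,
# lower hinge, median, upper hinge, and upper whisker end; $out holds the
# points drawn individually as outliers (beyond 1.5 * IQR from the box).
bp <- boxplot.stats(dat$mpg)
print(bp$stats)
print(bp$out)
```

The hinges reported by `boxplot.stats()` are close to, but not always identical to, the quartiles returned by `quantile()`, which is worth mentioning when you explain how to read the plot.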
Findings (30 points)
• 1-2 pages
• Here is where you will show off your visualizations and statistical summaries
• Focus your reporting and interpretation on only the things that are relevant to your
research question!
• The most successful projects will incorporate multiple and varied aspects of the coding
techniques we cover in class.
Discussion (10 points)
• 1-2 pages
• Relate your findings back to your literature review: what did you find that is useful for
the scientific community?
• How do your analyses inform your motivating research questions?
• All research papers have weaknesses; what are yours? You can think of weaknesses of
regression, or of your data set (for example, was an important variable missing?).
Organization, grammar, and flow (10 points)
• Make sure your paper is free of grammatical errors and is organized well
• Using hyperlinks and a table of contents in Markdown is a good idea.