question regarding R language
Preprocessing your data
It’s way more important than you think…
ISTA 321
What is preprocessing?
Preprocessing is the process of making your data useable for statistical or
machine learning models.
It is probably one of the most critical steps in the whole data mining pipeline, as
improper preprocessing can lead to improper inference.
What tasks are in preprocessing?
● Selecting columns or data you want
● Cleaning up special characters
○ $ or % in numeric columns
● Identify erroneous values in your data
○ Correct?
○ Convert to NA?
○ Delete whole row or column?
● Center / scale variables
● Imputing missing data
● Convert categorical data to multiple binary columns (one hot encoding)
● Splitting data into training and test sets
What order do you need to do these tasks?
● There isn’t a specific order for these tasks, and often you’ll jump back and
forth.
● The one thing that is critical: if you impute (fill in missing values with
estimated ones) or scale/center your data, you must estimate the fill-in values
and the center/scale statistics from the training set only, then apply them to
the test set. Estimating them on the whole data can cause something called data
leakage… more on that later.
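A minimal sketch of the training-set-only rule, using mean imputation in base R. The data frames `train` and `test` and their values are made up for illustration, and mean imputation is just one assumed way of estimating a missing value.

```r
# Toy data with missing values; every number here is invented for illustration.
train <- data.frame(gdp = c(30, 100, NA, 200, 50))
test  <- data.frame(gdp = c(125, NA))

# Estimate the fill-in value from the training set only.
train_mean <- mean(train$gdp, na.rm = TRUE)

# Apply that same training-set value to both sets.
# The test rows never contribute to the estimate, so nothing leaks.
train$gdp[is.na(train$gdp)] <- train_mean
test$gdp[is.na(test$gdp)]   <- train_mean
```

Computing the mean over `train` and `test` together would let information from the test rows influence the fitted data, which is the leakage the slide warns about.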
Selecting columns
Continent   Population   GDP   year   lifeExp
Asia        32           30    1992   66
Europe      64           100   1992   69
N America   401          400   1999   65
N America   200          200   1995   66
S America   25           50    2013   68
Asia        135          125   2002   68
Asia        75           75    2007   63
Selecting columns
Country     Continent   Population   GDP     year   lifeExp   yr_collected
Cambodia    Asia        32           $30     1992   66        2017
Germany     Europe      64           $$100   1992   69        2017
USA         N America   401,000      $400    1999   65        2017
Mexico      N America   200          $200    1995   66        2017
Argentina   S America   25           $50     2013   68        2017
Japan       Aia         135          $125    2002   6         2017
Thailand    Asia        75           $75     3007   63        2017
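The raw table above can be cleaned roughly as follows. This is a sketch in base R, not the course's own code: the data frame names `raw` and `clean` are hypothetical, the life expectancy cutoff of 20 is an assumed plausibility bound, and converting suspicious numbers to NA is only one of the options listed earlier (correct, convert to NA, or delete).

```r
# Rebuild the raw table; messy columns are read in as text.
raw <- data.frame(
  Country      = c("Cambodia", "Germany", "USA", "Mexico", "Argentina", "Japan", "Thailand"),
  Continent    = c("Asia", "Europe", "N America", "N America", "S America", "Aia", "Asia"),
  Population   = c("32", "64", "401,000", "200", "25", "135", "75"),
  GDP          = c("$30", "$$100", "$400", "$200", "$50", "$125", "$75"),
  year         = c(1992, 1992, 1999, 1995, 2013, 2002, 3007),
  lifeExp      = c(66, 69, 65, 66, 68, 6, 63),
  yr_collected = 2017,
  stringsAsFactors = FALSE
)

clean <- raw

# Strip everything that is not a digit or a decimal point, then convert to numeric.
clean$GDP        <- as.numeric(gsub("[^0-9.]", "", clean$GDP))
clean$Population <- as.numeric(gsub("[^0-9.]", "", clean$Population))
# Note: "401,000" becomes 401000; whether that is a units error still needs a human check.

# A year later than the collection year cannot be right, so mark it as missing.
clean$year[clean$year > clean$yr_collected] <- NA

# Assumed plausibility bound: a life expectancy below 20 is treated as an entry error.
clean$lifeExp[clean$lifeExp < 20] <- NA

# Correct an obvious typo in a categorical level.
clean$Continent[clean$Continent == "Aia"] <- "Asia"

# Keep only the columns wanted for modelling.
clean <- clean[, c("Continent", "Population", "GDP", "year", "lifeExp")]
```

The NA values left behind are what a later imputation step would fill in.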
One hot encoding
A fancy way of converting each categorical variable with more than two levels into binary (0/1) columns, one per level
Continent   Is_Asia   Is_Europe   Is_North   Is_South
Asia        1         0           0          0
Europe      0         1           0          0
N America   0         0           1          0
N America   0         0           1          0
S America   0         0           0          1
Asia        1         0           0          0
Asia        1         0           0          0
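One way to produce the table above in base R is `model.matrix()` with the intercept removed, which yields one 0/1 column per factor level. This is a sketch under that assumption; the names `df`, `onehot`, and `encoded` are hypothetical, and the `Is_*` column names are assigned by hand to match the slide.

```r
# Continent column from the table above, with the level order fixed explicitly
# so the renamed columns below line up with the right levels.
df <- data.frame(
  Continent = factor(
    c("Asia", "Europe", "N America", "N America", "S America", "Asia", "Asia"),
    levels = c("Asia", "Europe", "N America", "S America")
  )
)

# "- 1" drops the intercept, so every level gets its own indicator column.
onehot <- model.matrix(~ Continent - 1, data = df)
colnames(onehot) <- c("Is_Asia", "Is_Europe", "Is_North", "Is_South")

# Attach the indicator columns to the original data.
encoded <- cbind(df, onehot)
```

Each row has exactly one 1 across the four indicator columns, which is a quick sanity check after encoding.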
Scaling and centering your data
● Lots of models calculate fit based on distance.
● Let’s say you have two features:
○ Salary: $15,000 to $150,000
○ Age: 25 to 75
● If you try to predict something with a model that uses distance, salary will dominate.
● So you need to convert the features to the same scale.
○ Center to mean 0 and scale to SD 1
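The salary and age example can be sketched with base R's `scale()`, which subtracts each column's mean and divides by its standard deviation. The five rows of data below are invented to span the ranges on the slide, and the name `features` is hypothetical.

```r
# Made-up values covering the salary and age ranges mentioned above.
features <- data.frame(
  salary = c(15000, 40000, 60000, 90000, 150000),
  age    = c(25, 35, 50, 60, 75)
)

# Raw Euclidean distance between the first two people: almost entirely salary.
raw_dist <- as.matrix(dist(features))[1, 2]

# Center each column to mean 0 and scale it to SD 1.
scaled <- scale(features)

# After scaling, both features contribute on a comparable footing.
scaled_dist <- as.matrix(dist(scaled))[1, 2]

# scale() stores the means and SDs it used, so the same values
# can be reused later on new data.
centers <- attr(scaled, "scaled:center")
sds     <- attr(scaled, "scaled:scale")
```

When a training and test split is involved, `centers` and `sds` would be computed on the training rows and then applied to the test rows.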
Splitting your data
● Lots of machine learning models don’t have p-values or traditional statistical
measures associated with them.
● Instead what we do is we ‘train’ or fit our model to a large set of the data
● And then we test how accurate our model is at predicting the rest.
● Let’s think about our bike data from 2018.
○ Make a model that determines how temperature influences the number of rides for 300 randomly selected days.
■ # rides ~ avg_temp
○ Then go and predict the other 65 values of # rides using just the avg_temp on those days.
○ You then compare the known values of your target against the predicted values of your target.
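The 300/65 split described above might look like this in base R. The bike data itself is not included here, so the sketch simulates a stand-in; the data frame `bikes`, the column names `n_rides` and `avg_temp`, the simulated relationship, and the choice of root mean squared error as the comparison are all assumptions for illustration.

```r
set.seed(321)  # make the random split reproducible

# Simulated stand-in for the 2018 bike data: one row per day.
bikes <- data.frame(avg_temp = runif(365, min = 30, max = 100))
bikes$n_rides <- 50 + 4 * bikes$avg_temp + rnorm(365, sd = 25)

# Randomly pick 300 days for training; the remaining 65 form the test set.
train_rows <- sample(nrow(bikes), 300)
train <- bikes[train_rows, ]
test  <- bikes[-train_rows, ]

# Fit the model on the training days only.
fit <- lm(n_rides ~ avg_temp, data = train)

# Predict rides for the held-out days from avg_temp alone.
pred <- predict(fit, newdata = test)

# Compare known and predicted values, here with root mean squared error.
rmse <- sqrt(mean((test$n_rides - pred)^2))
```

A smaller `rmse` means the predictions sit closer to the known ride counts on days the model never saw.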