By Anna Kayfitz, CEO of StrategicDB Corp
As millions or billions of data elements flow into your business each day, it is almost inevitable that some of them will lack the quality needed to build efficient business models. Ensuring that your data is clean should always be the first, and arguably most important, part of a data science workflow: without it, you will struggle to see what matters and may make the wrong decisions because of duplicates, anomalies or missing information.
One of the most common and powerful data programming tools is R, an open-source language and environment for statistical computing and graphics. R provides users with all the tools needed to build data science projects, but, as with anything, it is only as good as the data fed into it. Fortunately, there are a number of libraries within the R environment that help with data cleaning and manipulation before the start of any project.
Exploring the data
Most of the tools for exploring a set of data that you’ve imported already exist within the R platform.
summary(data)
This handy command simply gives an overview of all your data attributes, displaying the min, max, median, mean and category splits for each. It is a good method for quickly spotting any potential data anomalies.
Following on from this, you can use a histogram to better understand the distribution of your data. It will visually show any outliers within the dataset, or within any numeric column that you are particularly looking to observe.
The plyr package
You will need to install the plyr package before creating your histogram, using the standard R functionality for installing libraries (the hist() function itself is part of base R graphics).
This will create a visualisation of your data so you can spot any anomalies quickly. A boxplot visualisation uses the same approach but splits the data into quartiles for outlier detection. Both of these combined will quickly tell you whether you need to limit the dataset, or only use certain segments of it, in any algorithms or statistical modelling.
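As a rough sketch, assuming a data frame called data with a numeric column named amount (both are placeholder names for your own data):

#Install and load plyr (hist() and boxplot() themselves are part of base R graphics)
install.packages("plyr")
library(plyr)
#Histogram of the distribution, to spot skew and outliers
hist(data$amount)
#Boxplot of the same column, split into quartiles with outliers flagged
boxplot(data$amount)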
Correcting errors
R has a number of pre-built methods for correcting data errors, such as converting values as you might do in Excel or SQL with simple logic, e.g. as.character() converts a column to a character string.
However, if you want to start correcting the errors that you saw in your histogram or boxplot, there are additional packages that have the capability of doing just that.
The stringr package
There are a few different ways in which stringr can help cleanse your data, including trimming white space and replacing certain unnecessary words. These are quite standard bits of code, structured as str_trim(YOUR_DATA_FIELD), which simply removes the surrounding white space.
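A minimal sketch, assuming a data frame data with a text column city (placeholder names):

#Install and load stringr
install.packages("stringr")
library(stringr)
#Remove leading and trailing white space
data$city <- str_trim(data$city)
#Replace an unnecessary word with something more useful
data$city <- str_replace(data$city, "Unknown", "Not Available")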
However, what about removing the anomalies that our histogram told us we had? That needs a bit more complexity, but as a basic example we can tell R to replace all the outliers in a field with the median value of that field. This pulls everything back together and removes the bias the anomalies introduce.
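A basic sketch of that idea, again using the placeholder column amount; boxplot.stats() flags the outliers and we overwrite them with the column median:

#Values flagged as outliers by the boxplot rule
outliers <- boxplot.stats(data$amount)$out
#Replace them with the median of the column (ignoring missing values)
data$amount[data$amount %in% outliers] <- median(data$amount, na.rm = TRUE)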
Missing values
It is very simple in R to check for incomplete data and perform an action on that field. For example, na.omit() will eliminate missing values completely from your chosen data column.
There are similar options to replace blank values with 0s or NA, depending on the field type, to improve the consistency of the dataset.
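For example, with the same placeholder column amount:

#Drop every row that contains a missing value (assign the result to keep it)
data <- na.omit(data)
#Or replace missing values in a single column with 0
data$amount[is.na(data$amount)] <- 0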
The tidyr package
The tidyr package is designed to tidy your data. It works by identifying the variables in your dataset and using the tools provided to move them into columns, with three main functions: gather(), separate() and spread().
The gather() function takes multiple columns and gathers them into key-value pairs. As an example, say you have exam score data like this:
| Name | Exam A | Exam B |
| --- | --- | --- |
| John | 55 | 80 |
| Mike | 76 | 90 |
| Sam | 45 | 75 |
The gather() function transforms that into usable columns like this:
| Name | Exam | Score |
| --- | --- | --- |
| John | A | 55 |
| Mike | A | 76 |
| Sam | A | 45 |
| John | B | 80 |
| Mike | B | 90 |
| Sam | B | 75 |
Now we are truly able to analyse the exam scores. The separate() and spread() functions do similar things, which you can explore once you have the package, but ultimately they realign your data as needed.
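A sketch of that reshaping, with the column names simplified to ExamA and ExamB (recent versions of tidyr recommend pivot_longer(), but gather() still works):

#Load tidyr and rebuild the exam table from above
library(tidyr)
scores <- data.frame(Name = c("John", "Mike", "Sam"),
                     ExamA = c(55, 76, 45),
                     ExamB = c(80, 90, 75))
#Gather the two exam columns into key/value pairs
long_scores <- gather(scores, key = "Exam", value = "Score", ExamA, ExamB)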
Here are a few other packages of note that may be useful for data cleansing in R:
- The purrr package
The purrr package is designed for data wrangling. It is quite similar to the plyr package, albeit newer, and some users simply find it easier to use and more standardised in its functionality.
- The sqldf package
A lot of R users are more comfortable coding in SQL than in R. This package allows you to write SQL queries within RStudio to select your data elements (see the sketch after this list).
- The janitor package
This package can find duplicates across multiple columns and create clean, friendly column names from your data frame with ease. It even has a get_dupes() function for finding duplicate values across multiple rows of data. If you are looking to dedupe your data in a more advanced manner, for example finding different combinations or using fuzzy logic, you may want to look into a dedicated deduping tool instead.
- The splitstackshape package
This is an older package that can work with comma-separated values stored in a data frame column. It is useful for survey or text-analysis preparation.
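Hedged sketches of two of these packages in use, assuming a data frame data with columns first_name and last_name (placeholder names):

#janitor: tidy up column names and look for duplicate rows
install.packages("janitor")
library(janitor)
data <- clean_names(data)
get_dupes(data, first_name, last_name)
#sqldf: query a data frame with ordinary SQL
install.packages("sqldf")
library(sqldf)
sqldf("SELECT first_name, COUNT(*) AS n FROM data GROUP BY first_name")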
R has a huge number of packages, and this article only touches the surface of what it can do. With new libraries popping up all the time, it is important to do your research and pick the right ones before starting any new project.
Bio: Anna Kayfitz is the CEO of StrategicDB Corp, a data cleansing and analytics company. She holds an MBA from the Schulich School of Business and spent over 10 years working in data analytics and marketing roles prior to founding StrategicDB.
Data cleaning is the process of transforming raw data into consistent data that can be analyzed. It is aimed at improving the content of statistical statements based on the data, as well as their reliability, and can profoundly influence the conclusions drawn from that data.
R has a comprehensive set of tools that are specifically designed to clean data in an effective and thorough manner.
STEP 1: Initial Exploratory Analysis
The first step to the overall data cleaning process involves an initial exploration of the data frame that you have just imported into R. It is very important to understand how you can import data into R and save it as a data frame.
#Set the working directory and list its contents
setwd("C:/Users/NAGRAJ/Desktop/House Pricing")
dir()
#Read the CSV file, treating empty strings as missing values
data <- read.csv("Regression-Analysis-House Pricing.csv", na.strings = "")
#Open the data frame in the viewer
View(data)
The first thing that you should do is check the class of your data frame:
class(data)
This renders an output as shown below in which we can clearly see that our dataset is saved as a data frame.
[1] "data.frame"
Next, we want to check the number of rows and columns the data frame has.
The code and its result:
dim(data)
[1] 932 10
Here we can see that the data frame has 932 rows and 10 columns.
We can view the summary statistics for all the columns of the data frame using the code shown below:
summary(data)
This prints summary statistics for each column: the minimum, quartiles, mean and maximum for numeric columns, counts for categorical columns, and the number of missing values where present.
STEP 2: Visual Exploratory Analysis
There are two types of plots that you should use during your cleaning process: the histogram and the boxplot.
- Histogram
The histogram is very useful for visualizing the overall distribution of a numeric column. We can determine whether the distribution is normal, unimodal, bimodal or some other shape of interest. We can also use histograms to figure out whether there are outliers in the particular numeric column under study. To plot a histogram for a particular column, we use the code shown below:
#Install and load plyr (hist() itself is part of base R graphics)
install.packages("plyr")
library(plyr)
#Histogram of the distance-to-taxi column
hist(data$Dist_Taxi)
- Boxplot
Boxplots are super useful because they show you the median along with the first and third quartiles, and they are the best way of spotting outliers in your data frame. To visualize a boxplot, we use the code shown below:
boxplot(data$Dist_Taxi)
STEP 3: Correcting the errors!
This step focuses on the methods that you can use to correct all the errors that you have seen.
If we want to rename a column of our data frame, we can do so using the code shown below:
data$carpet_area <- data$Carpet
In the code above, the values of the Carpet column are made available under the new name carpet_area.
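Strictly speaking, the line above creates a copy of the Carpet column under the new name rather than renaming it. A minimal base-R sketch of a true rename:

#Rename the Carpet column in place, without keeping a duplicate
names(data)[names(data) == "Carpet"] <- "carpet_area"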
Sometimes columns have an incorrect type associated with them, for example numeric values stored as text, or text elements stored in a numeric column. In such a case we can change the type of the column using the code shown below:
data$Dist_Taxi<-as.character(data$Dist_Taxi)
class(data$Dist_Taxi)
[1] "character"
There is a wide array of type conversions you can carry out in R. The most common ones are listed below, followed by a short note on coercion.
as.character()
as.numeric()
as.integer()
as.logical()
as.factor()
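One caveat worth a quick example: when coercing text to numbers, any value that cannot be parsed is silently turned into NA (with a warning), so it pays to check the result afterwards.

#Unparseable values become NA when coerced
x <- c("10", "20", "ten")
as.numeric(x)
#[1] 10 20 NA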
String manipulation in R comes in handy when you are working with datasets that have a lot of text based elements.
In order to change all the text to uppercase or lowercase in a particular column we need to execute the code shown below:
#Making all uppercase
data$Parking <- toupper(data$Parking)
#Making all lowercase
data$Parking <- tolower(data$Parking)
If we want to trim the whitespace in the text under a column, we use the code shown below:
#Installing and loading the required packages
install.packages("stringr")
library(stringr)
#Trimming all whitespace
data$Dist_Taxi <- str_trim(data$Dist_Taxi)
If we want to replace a particular word or letter under a column we can do so using the code below:
#Replacing "Not Provided" with "Not Available"
data$Parking <- str_replace(data$Parking, "Not Provided", "Not Available")
To replace the outliers with a summary statistic such as the median, the following code is used:
#Replacing the outliers of a particular column with the median
vec1 <- boxplot.stats(data$Dist_Taxi)$out
data$Dist_Taxi[data$Dist_Taxi %in% vec1] <- median(data$Dist_Taxi, na.rm = TRUE)
The next section will show you how to deal with your missing values:
#Checking for missing values in the entire dataframe
any(is.na(data))
#Checking for the total number of missing values in the entire dataframe
sum(is.na(data))
#Checking for the total number of missing values in a particular column
sum(is.na(data$Dist_Taxi))
#Eliminating missing values completely from the entire dataframe (assign the result, e.g. to data_complete, to keep it)
data_complete <- na.omit(data)
#Eliminating missing values completely from a particular column
na.omit(data$Dist_Taxi)
#Replacing the NA's in the entire dataframe with 0s
data[is.na(data)] <- 0
#Replacing the NA's in a particular column with 0s
data$Dist_Taxi[is.na(data$Dist_Taxi)] <- 0
#Replacing the NA's in a particular column with a summary statistic such as the median (na.rm = TRUE keeps the remaining NA's from making the median itself NA)
data$Dist_Taxi[is.na(data$Dist_Taxi)] <- median(data$Dist_Taxi, na.rm = TRUE)
Suppose we want to unite two columns in our data frame. We can do so using the code shown below:
#Installing and loading the required package
install.packages("tidyr")
library(tidyr)
#Unite the City_Category and Parking columns, joined with an underscore
data1 <- unite(data = data, col = city_category_with_parking, City_Category, Parking, sep = "_")
View(data1)
The unite() function takes four main arguments: the data frame, the new column name, and the names of the columns that you want to unite (here City_Category and Parking); the sep argument sets the separator placed between the joined values.
Conversely we can also separate a column as shown below:
#Separate the united column back into its two parts at the underscore
data2 <- separate(data = data1, city_category_with_parking, c("City_Category", "Parking"), sep = "_")
View(data2)
The separate() function takes four arguments: the data frame, the column that we want to separate, the names of the new columns, and the separator at which the column should be split.
Steps 1 to 3 above give you a relatively clean dataset. Always keep exploring new ways to clean your data, and never stop exploring.
Happy Cleaning!
About Chandana:
Chandana holds a B.E. degree. She worked as an Analyst Intern with Nikhil Guru Consulting Analytics Service LLP (Nikhil Analytics), Bangalore.