Tidy Data

Data Wrangling
What is Tidy Data?
Tidyverse R Packages
Exercises
Additional Resources
- Tidy Video

Data Wrangling

Learning how to manipulate data is essential for data science. You need to know how to import data, inspect it, clean it, tidy it, and document it.

Data wrangling, or data munging - cleaning and reformatting data to get it ready for data analysis - is often one of the most time-consuming aspects of data analysis.

Below are some useful resources for thinking about and doing essential data handling tasks.

What is Tidy Data?

Hadley Wickham introduced a structured framework for thinking about data organization in his 2014 paper Tidy Data, published in the Journal of Statistical Software.

The paper makes a clear distinction between “messy” vs. “tidy” data, and introduced a small set of tools to perform essential tasks for data cleaning and tidying. The following brief introduction distills / excerpts some of the basics of data tidying based on Wickham’s framework, as described in the chapters on data tidying from these two books:

R for Data Science: Tidy Data - by Hadley Wickham
Data Science with R: Data Tidying - by Garrett Grolemund

Basic Principles

Tidy data builds on a premise of data science that data sets contain both values and relationships. Tidy data displays the relationships in a data set as consistently as it displays the values in a data set.

Before diving in further, it is useful to remember that R is a vectorized programming language, which is a core feature that tidy data takes advantage of.

Data structures in R are built from vectors, and R’s operations are optimized to work with vectors. In fact, under the hood a data frame is just a list of atomic vectors arranged to look like a table, where each column in the table is an atomic vector:

A tidy data table

Tidy datasets

Tidy datasets are easy to manipulate, model and visualize, and have a specific structure:

Each type of observational unit is a table
- Each variable in the data set is placed in its own column
- Each observation is placed in its own row
- Each value is placed in its own cell

A tidy data table

Pivoting

Pivot longer

Sometimes column names are not names of variables but they are actually values of a variable (e.g. year) across which multiple observations (e.g. tuberculosis cases) have been made, as illustrated below:

country	1999	2000
Afghanistan	745	2666
Brazil	37737	80488
China	212258	213766

In this format, each row actually represents two “observations” (here, number of cases per country per year) instead of one.

To make the data tidy, we want to pivot the columns labeled 1999 and 2000 into a new pair of variables, year and cases:

A tidy data table

This creates a longer table that is easier to extract data summaries from using standard R tools. A more abstract representation of the same idea is shown below:

Long format {width=60%}

Pivot wider

On the flip side, sometimes multiple measurements (e.g. cases, population) for the same variable (e.g. year) are distributed across multiple rows. In this situation, we want to pivot the data into a wider format so that each observation is arranged on a single row:

A tidy data table

Missing values

A really important thing you need to watch out for is missing data. This is pretty common, especially for e.g. RNA-seq data, where some genes may not have any read counts, and so will be absent from a table of differentially expressed genes.

Missing data may be represented in different ways in tabular data files that you will import, e.g.:

empty cells
text such as “-“, “na”, “N/A”, etc.
NaN

You will need to check for these and replace them with a valid symbol for missing data in R. This is one reason why it is crucial to inspect your data before you start working with it.

Missing values in R: NA

In R, missing data values are represented as NA (without quotes; this is not treated as a character string).

Before you start working with your data, it is crucial to figure out if there are any missing values and what symbols are used to represent them. If missing data are represented in a systematic way using some character string (e.g. “n/a”), you can specify this string when you read in a data table from a file:

# read in a table and recode missing values as NA
> my_table = read.csv("example.csv", na.strings = "n/a")
> head(my_table)
  patientid grade nodesexam nodespos extent nodalstatus size pgr er
1  pid00001     3        32        3     10           6   60   2  2
2  pid00002     2        13        1     10           6   15   1  1
3  pid00003     3         8        0     10           0    8   1  1
4  pid00004     3        20       NA     10           0   10   2  2
5  pid00005     2        16        8     10           6   15   2  1
6  pid00006     3        19        0     10           0   48   1  1

Once you have imported your data and all your missing values are properly represented, you can test for them using logical operators, for example:

> my_vector = c(1:5,NA,7:10)
> is.na(my_vector)
 [1] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE

You can use this to filter out rows with missing data when you are cleaning and tidying your data, before you start working with it. There are a variety of ways to do this; one example is shown below:

# use dplyr na.omit() to remove lines with missing values in the nodespos colum
> my_table %>% na.omit(nodespos)
   patientid grade nodesexam nodespos extent nodalstatus size pgr er
1   pid00001     3        32        3     10           6   60   2  2
2   pid00002     2        13        1     10           6   15   1  1
3   pid00003     3         8        0     10           0    8   1  1
5   pid00005     2        16        8     10           6   15   2  1
6   pid00006     3        19        0     10           0   48   1  1

Tidyverse R Packages

Tidyverse tools provide a collection of packages that help you store, format, manipulate, and display your data. We will introduce some of these tools as we practice working with data throughout the semester.

Exercises

For practice, try working through these vignettes:

Chapter 12: Tidy Data in R for Data Science
Tidy data from the CRAN tidyr documentation

Additional Resources

Tidy Video

If you like watching videos, you can watch watch his presentation on Tidy Data and Tidy Tools (58:45):

Tidy Data

Table of contents

Data Wrangling

What is Tidy Data?

Basic Principles

Tidy datasets

Pivoting

Pivot longer

Pivot wider

Missing values

Missing values in R: NA

Tidyverse R Packages

Exercises

Additional Resources

Tidy Video