dplyr
Our World in Data publishes data, visualizations, and analyses of issues of global importance, such as poverty, climate change, and disease. Today, we will analyze a data set on COVID19 compiled by them. It contains profiles of 207 countries or territories. The data include the daily number of cases, deaths, vaccinations etc., and they are accompanied by some useful country stats, e.g. their population. We downloaded the data set (version 44) from kaggle but only kept a few selected columns.
First, import and examine the data.
covid <- read.csv("owid-covid-data_filtered.csv")
head(covid)
## continent location date new_cases new_deaths icu_patients new_tests
## 1 Asia Afghanistan 2020-02-24 1 NA NA NA
## 2 Asia Afghanistan 2020-02-25 0 NA NA NA
## 3 Asia Afghanistan 2020-02-26 0 NA NA NA
## 4 Asia Afghanistan 2020-02-27 0 NA NA NA
## 5 Asia Afghanistan 2020-02-28 0 NA NA NA
## 6 Asia Afghanistan 2020-02-29 0 NA NA NA
## population
## 1 39835428
## 2 39835428
## 3 39835428
## 4 39835428
## 5 39835428
## 6 39835428
tail(covid)
## continent location date new_cases new_deaths icu_patients
## 116299 Africa Zimbabwe 2021-09-07 125 10 NA
## 116300 Africa Zimbabwe 2021-09-08 135 14 NA
## 116301 Africa Zimbabwe 2021-09-09 125 4 NA
## 116302 Africa Zimbabwe 2021-09-10 107 11 NA
## 116303 Africa Zimbabwe 2021-09-11 57 4 NA
## 116304 Africa Zimbabwe 2021-09-12 49 2 NA
## new_tests population
## 116299 NA 15092171
## 116300 NA 15092171
## 116301 NA 15092171
## 116302 NA 15092171
## 116303 NA 15092171
## 116304 NA 15092171
str(covid)
## 'data.frame': 116304 obs. of 8 variables:
## $ continent : chr "Asia" "Asia" "Asia" "Asia" ...
## $ location : chr "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
## $ date : chr "2020-02-24" "2020-02-25" "2020-02-26" "2020-02-27" ...
## $ new_cases : int 1 0 0 0 0 0 0 0 1 2 ...
## $ new_deaths : int NA NA NA NA NA NA NA NA NA NA ...
## $ icu_patients: int NA NA NA NA NA NA NA NA NA NA ...
## $ new_tests : int NA NA NA NA NA NA NA NA NA NA ...
## $ population : num 39835428 39835428 39835428 39835428 39835428 ...
Determine whether European countries that have the highest total number of COVID19 cases normalized to population size also have the highest number of COVID19-related deaths.
First, describe each step in words.
1. Subset countries located in Europe.
2. For each country, sum up new cases and deaths across all recorded days to get their total numbers.
3. Normalize the number of cases and deaths to the population size.
4. Sort the data by the normalized total number of cases and plot them.
Next, write dplyr
code for each individual step.
# 1.1. subset countries located in Europe
covid_europe <- filter(covid,
continent == "Europe")
head(covid_europe)
## continent location date new_cases new_deaths icu_patients new_tests
## 1 Europe Albania 2020-02-25 NA NA NA 8
## 2 Europe Albania 2020-02-26 NA NA NA 5
## 3 Europe Albania 2020-02-27 NA NA NA 4
## 4 Europe Albania 2020-02-28 NA NA NA 1
## 5 Europe Albania 2020-02-29 NA NA NA 8
## 6 Europe Albania 2020-03-01 NA NA NA 3
## population
## 1 2872934
## 2 2872934
## 3 2872934
## 4 2872934
## 5 2872934
## 6 2872934
# 1.2. the same, but using piping
covid_europe <-
covid %>%
filter(continent == "Europe") # no need to have covid inside parenthesis because it has already been piped into the function
head(covid_europe)
## continent location date new_cases new_deaths icu_patients new_tests
## 1 Europe Albania 2020-02-25 NA NA NA 8
## 2 Europe Albania 2020-02-26 NA NA NA 5
## 3 Europe Albania 2020-02-27 NA NA NA 4
## 4 Europe Albania 2020-02-28 NA NA NA 1
## 5 Europe Albania 2020-02-29 NA NA NA 8
## 6 Europe Albania 2020-03-01 NA NA NA 3
## population
## 1 2872934
## 2 2872934
## 3 2872934
## 4 2872934
## 5 2872934
## 6 2872934
# 2.1. group by country - the data won't change but this step is necessary before performing any sort of calculations that need to be done for each group separately
covid_europe_grouped <-
covid_europe %>%
group_by(location)
head(covid_europe_grouped) # the data are now a tibble (think of it as an enhanced data.frame) - note the new attribute "Groups"
## # A tibble: 6 x 8
## # Groups: location [1]
## continent location date new_cases new_deaths icu_patients new_tests
## <chr> <chr> <chr> <int> <int> <int> <int>
## 1 Europe Albania 2020-02-25 NA NA NA 8
## 2 Europe Albania 2020-02-26 NA NA NA 5
## 3 Europe Albania 2020-02-27 NA NA NA 4
## 4 Europe Albania 2020-02-28 NA NA NA 1
## 5 Europe Albania 2020-02-29 NA NA NA 8
## 6 Europe Albania 2020-03-01 NA NA NA 3
## # … with 1 more variable: population <dbl>
# 2.2. sum up cases and deaths across days for each country (=for each group)
covid_europe_grouped_totals <-
covid_europe_grouped %>%
summarize(total_cases = sum(new_cases, na.rm = TRUE), # na.rm=TRUE makes sum ignore NA values
total_deaths = sum(new_deaths, na.rm = TRUE), # na.rm=TRUE makes sum ignore NA values
population = mean(population)) # could have been max or min - population is likely the same for each day
head(covid_europe_grouped_totals)
## # A tibble: 6 x 4
## location total_cases total_deaths population
## <chr> <int> <int> <dbl>
## 1 Albania 157026 2543 2872934
## 2 Andorra 15083 130 77354
## 3 Austria 710030 10839 9043072
## 4 Belarus 503073 3917 9442867
## 5 Belgium 1205516 25454 11632334
## 6 Bosnia and Herzegovina 220636 9995 3263459
# 3. normalize cases and deaths to the population size
covid_europe_grouped_totals_norm <-
covid_europe_grouped_totals %>%
mutate(total_cases_per_thousand = 1000*total_cases/population,
total_deaths_per_100k = 100000*total_deaths/population) # we arbitrarily normalize by 100k instead of 1k people to make the number of cases and death easily comparable when plotted on the same scale
head(covid_europe_grouped_totals_norm)
## # A tibble: 6 x 6
## location total_cases total_deaths population total_cases_per… total_deaths_pe…
## <chr> <int> <int> <dbl> <dbl> <dbl>
## 1 Albania 157026 2543 2872934 54.7 88.5
## 2 Andorra 15083 130 77354 195. 168.
## 3 Austria 710030 10839 9043072 78.5 120.
## 4 Belarus 503073 3917 9442867 53.3 41.5
## 5 Belgium 1205516 25454 11632334 104. 219.
## 6 Bosnia … 220636 9995 3263459 67.6 306.
# 4.1. sort in descending order
covid_europe_grouped_totals_norm_sorted <-
covid_europe_grouped_totals_norm %>%
arrange(desc(total_cases_per_thousand)) # use desc() to sort in descending order
head(covid_europe_grouped_totals_norm_sorted)
## # A tibble: 6 x 6
## location total_cases total_deaths population total_cases_per… total_deaths_pe…
## <chr> <int> <int> <dbl> <dbl> <dbl>
## 1 Andorra 15083 130 77354 195. 168.
## 2 Montene… 121908 1800 628051 194. 287.
## 3 San Mar… 5373 90 34010 158. 265.
## 4 Czechia 1683179 30413 10724553 157. 284.
## 5 Slovenia 276174 4469 2078723 133. 215.
## 6 Cyprus 116209 528 888005 131. 59.5
# 4.2. only keep those columns that are relevant
covid_europe_grouped_totals_norm_sorted_selected <-
covid_europe_grouped_totals_norm_sorted %>%
select(location, total_cases_per_thousand, total_deaths_per_100k)
head(covid_europe_grouped_totals_norm_sorted_selected)
## # A tibble: 6 x 3
## location total_cases_per_thousand total_deaths_per_100k
## <chr> <dbl> <dbl>
## 1 Andorra 195. 168.
## 2 Montenegro 194. 287.
## 3 San Marino 158. 265.
## 4 Czechia 157. 284.
## 5 Slovenia 133. 215.
## 6 Cyprus 131. 59.5
# 4.3. transform the data into the long format
covid_europe_grouped_totals_norm_sorted_selected_long <-
covid_europe_grouped_totals_norm_sorted_selected %>%
gather(key = "metric",
value = "value",
"total_cases_per_thousand",
"total_deaths_per_100k") # gather() is actually from the package tidyr, but it is compatible with dplyr
head(covid_europe_grouped_totals_norm_sorted_selected_long)
## # A tibble: 6 x 3
## location metric value
## <chr> <chr> <dbl>
## 1 Andorra total_cases_per_thousand 195.
## 2 Montenegro total_cases_per_thousand 194.
## 3 San Marino total_cases_per_thousand 158.
## 4 Czechia total_cases_per_thousand 157.
## 5 Slovenia total_cases_per_thousand 133.
## 6 Cyprus total_cases_per_thousand 131.
tail(covid_europe_grouped_totals_norm_sorted_selected_long)
## # A tibble: 6 x 3
## location metric value
## <chr> <chr> <dbl>
## 1 Finland total_deaths_per_100k 18.9
## 2 Faeroe Islands total_deaths_per_100k 0
## 3 Gibraltar total_deaths_per_100k 0
## 4 Guernsey total_deaths_per_100k 0
## 5 Isle of Man total_deaths_per_100k 0
## 6 Jersey total_deaths_per_100k 0
# 4.4. convert "location" to a factor and force its levels to maintain the same order as in the sorted data frame (otherwise, countries will be sorted alphabetically by ggplot2)
covid_europe_grouped_totals_norm_sorted_selected_long$location <-
factor(covid_europe_grouped_totals_norm_sorted_selected_long$location,
levels = unique(covid_europe_grouped_totals_norm_sorted_selected_long$location)) # we use unique() here because each country name is repeated twice in the tibble, and level names must be unique. this trick works here but won't necessarily work elsewhere
Finally, plot the data on a ggplot2
bar graph. Make sure that the bars representing number of cases and number of deaths for each country are plotted side-by-side.
ggplot(data = covid_europe_grouped_totals_norm_sorted_selected_long,
mapping = aes(x = location,
y = value,
fill = metric)) +
geom_col(position = "dodge") +
coord_flip()
dplyr
steps into a single pipelineJoin all the individual steps with %>%
, feed the input data.frame into the pipeline, and assign the output tibble to a variable.
covid_dplyr <-
covid %>%
filter(continent == "Europe") %>%
group_by(location) %>%
summarize(total_cases = sum(new_cases, na.rm = TRUE),
total_deaths = sum(new_deaths, na.rm = TRUE),
population = mean(population)) %>%
mutate(total_cases_per_thousand = 1000*total_cases/population,
total_deaths_per_100k = 100000*total_deaths/population) %>%
arrange(desc(total_cases_per_thousand)) %>%
select(location, total_cases_per_thousand, total_deaths_per_100k) %>%
gather(key = "metric",
value = "value",
"total_cases_per_thousand",
"total_deaths_per_100k") %>%
mutate(location = factor(location, levels = unique(location)))
covid_dplyr
## # A tibble: 102 x 3
## location metric value
## <fct> <chr> <dbl>
## 1 Andorra total_cases_per_thousand 195.
## 2 Montenegro total_cases_per_thousand 194.
## 3 San Marino total_cases_per_thousand 158.
## 4 Czechia total_cases_per_thousand 157.
## 5 Slovenia total_cases_per_thousand 133.
## 6 Cyprus total_cases_per_thousand 131.
## 7 Luxembourg total_cases_per_thousand 121.
## 8 Serbia total_cases_per_thousand 120.
## 9 Netherlands total_cases_per_thousand 117.
## 10 Lithuania total_cases_per_thousand 115.
## # … with 92 more rows
Plot the data again to make sure the plot looks the same.
ggplot(data = covid_dplyr,
mapping = aes(x = location,
y = value,
fill = metric)) +
geom_col(position = "dodge") +
coord_flip()