Our World in Data: the COVID19 pandemic

Our World in Data publishes data, visualizations, and analyses of issues of global importance, such as poverty, climate change, and disease. Today, we will analyze a data set on COVID19 compiled by them. It contains profiles of 207 countries or territories. The data include the daily number of cases, deaths, vaccinations etc., and they are accompanied by some useful country stats, e.g. their population. We downloaded the data set (version 44) from kaggle but only kept a few selected columns.

Look at the data

First, import and examine the data.

covid <- read.csv("owid-covid-data_filtered.csv")
head(covid)
##   continent    location       date new_cases new_deaths icu_patients new_tests
## 1      Asia Afghanistan 2020-02-24         1         NA           NA        NA
## 2      Asia Afghanistan 2020-02-25         0         NA           NA        NA
## 3      Asia Afghanistan 2020-02-26         0         NA           NA        NA
## 4      Asia Afghanistan 2020-02-27         0         NA           NA        NA
## 5      Asia Afghanistan 2020-02-28         0         NA           NA        NA
## 6      Asia Afghanistan 2020-02-29         0         NA           NA        NA
##   population
## 1   39835428
## 2   39835428
## 3   39835428
## 4   39835428
## 5   39835428
## 6   39835428
tail(covid)
##        continent location       date new_cases new_deaths icu_patients
## 116299    Africa Zimbabwe 2021-09-07       125         10           NA
## 116300    Africa Zimbabwe 2021-09-08       135         14           NA
## 116301    Africa Zimbabwe 2021-09-09       125          4           NA
## 116302    Africa Zimbabwe 2021-09-10       107         11           NA
## 116303    Africa Zimbabwe 2021-09-11        57          4           NA
## 116304    Africa Zimbabwe 2021-09-12        49          2           NA
##        new_tests population
## 116299        NA   15092171
## 116300        NA   15092171
## 116301        NA   15092171
## 116302        NA   15092171
## 116303        NA   15092171
## 116304        NA   15092171
str(covid)
## 'data.frame':    116304 obs. of  8 variables:
##  $ continent   : chr  "Asia" "Asia" "Asia" "Asia" ...
##  $ location    : chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
##  $ date        : chr  "2020-02-24" "2020-02-25" "2020-02-26" "2020-02-27" ...
##  $ new_cases   : int  1 0 0 0 0 0 0 0 1 2 ...
##  $ new_deaths  : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ icu_patients: int  NA NA NA NA NA NA NA NA NA NA ...
##  $ new_tests   : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ population  : num  39835428 39835428 39835428 39835428 39835428 ...

Set an ambitious goal

Determine whether European countries that have the highest total number of COVID19 cases normalized to population size also have the highest number of COVID19-related deaths.

Split it into steps

First, describe each step in words.

1. Subset countries located in Europe.
2. For each country, sum up new cases and deaths across all recorded days to get their total numbers.
3. Normalize the number of cases and deaths to the population size.
4. Sort the data by the normalized total number of cases and plot them.

Next, write dplyr code for each individual step.

# 1.1. subset countries located in Europe
covid_europe <- filter(covid,
                       continent == "Europe")

head(covid_europe)
##   continent location       date new_cases new_deaths icu_patients new_tests
## 1    Europe  Albania 2020-02-25        NA         NA           NA         8
## 2    Europe  Albania 2020-02-26        NA         NA           NA         5
## 3    Europe  Albania 2020-02-27        NA         NA           NA         4
## 4    Europe  Albania 2020-02-28        NA         NA           NA         1
## 5    Europe  Albania 2020-02-29        NA         NA           NA         8
## 6    Europe  Albania 2020-03-01        NA         NA           NA         3
##   population
## 1    2872934
## 2    2872934
## 3    2872934
## 4    2872934
## 5    2872934
## 6    2872934
# 1.2. the same, but using piping
covid_europe <-
  covid %>%
  filter(continent == "Europe") # no need to have covid inside parenthesis because it has already been piped into the function

head(covid_europe)
##   continent location       date new_cases new_deaths icu_patients new_tests
## 1    Europe  Albania 2020-02-25        NA         NA           NA         8
## 2    Europe  Albania 2020-02-26        NA         NA           NA         5
## 3    Europe  Albania 2020-02-27        NA         NA           NA         4
## 4    Europe  Albania 2020-02-28        NA         NA           NA         1
## 5    Europe  Albania 2020-02-29        NA         NA           NA         8
## 6    Europe  Albania 2020-03-01        NA         NA           NA         3
##   population
## 1    2872934
## 2    2872934
## 3    2872934
## 4    2872934
## 5    2872934
## 6    2872934
# 2.1. group by country - the data won't change but this step is necessary before performing any sort of calculations that need to be done for each group separately
covid_europe_grouped <-
  covid_europe %>%
  group_by(location)

head(covid_europe_grouped) # the data are now a tibble (think of it as an enhanced data.frame) - note the new attribute "Groups"
## # A tibble: 6 x 8
## # Groups:   location [1]
##   continent location date       new_cases new_deaths icu_patients new_tests
##   <chr>     <chr>    <chr>          <int>      <int>        <int>     <int>
## 1 Europe    Albania  2020-02-25        NA         NA           NA         8
## 2 Europe    Albania  2020-02-26        NA         NA           NA         5
## 3 Europe    Albania  2020-02-27        NA         NA           NA         4
## 4 Europe    Albania  2020-02-28        NA         NA           NA         1
## 5 Europe    Albania  2020-02-29        NA         NA           NA         8
## 6 Europe    Albania  2020-03-01        NA         NA           NA         3
## # … with 1 more variable: population <dbl>
# 2.2. sum up cases and deaths across days for each country (=for each group)
covid_europe_grouped_totals <-
  covid_europe_grouped %>%
  summarize(total_cases = sum(new_cases, na.rm = TRUE), # na.rm=TRUE makes sum ignore NA values
            total_deaths = sum(new_deaths, na.rm = TRUE), # na.rm=TRUE makes sum ignore NA values
            population = mean(population)) # could have been max or min - population is likely the same for each day

head(covid_europe_grouped_totals)
## # A tibble: 6 x 4
##   location               total_cases total_deaths population
##   <chr>                        <int>        <int>      <dbl>
## 1 Albania                     157026         2543    2872934
## 2 Andorra                      15083          130      77354
## 3 Austria                     710030        10839    9043072
## 4 Belarus                     503073         3917    9442867
## 5 Belgium                    1205516        25454   11632334
## 6 Bosnia and Herzegovina      220636         9995    3263459
# 3. normalize cases and deaths to the population size
covid_europe_grouped_totals_norm <-
  covid_europe_grouped_totals %>% 
  mutate(total_cases_per_thousand = 1000*total_cases/population,
         total_deaths_per_100k = 100000*total_deaths/population) # we arbitrarily normalize by 100k instead of 1k people to make the number of cases and death easily comparable when plotted on the same scale

head(covid_europe_grouped_totals_norm)
## # A tibble: 6 x 6
##   location total_cases total_deaths population total_cases_per… total_deaths_pe…
##   <chr>          <int>        <int>      <dbl>            <dbl>            <dbl>
## 1 Albania       157026         2543    2872934             54.7             88.5
## 2 Andorra        15083          130      77354            195.             168. 
## 3 Austria       710030        10839    9043072             78.5            120. 
## 4 Belarus       503073         3917    9442867             53.3             41.5
## 5 Belgium      1205516        25454   11632334            104.             219. 
## 6 Bosnia …      220636         9995    3263459             67.6            306.
# 4.1. sort in descending order
covid_europe_grouped_totals_norm_sorted <-
  covid_europe_grouped_totals_norm %>%
  arrange(desc(total_cases_per_thousand)) # use desc() to sort in descending order

head(covid_europe_grouped_totals_norm_sorted)
## # A tibble: 6 x 6
##   location total_cases total_deaths population total_cases_per… total_deaths_pe…
##   <chr>          <int>        <int>      <dbl>            <dbl>            <dbl>
## 1 Andorra        15083          130      77354             195.            168. 
## 2 Montene…      121908         1800     628051             194.            287. 
## 3 San Mar…        5373           90      34010             158.            265. 
## 4 Czechia      1683179        30413   10724553             157.            284. 
## 5 Slovenia      276174         4469    2078723             133.            215. 
## 6 Cyprus        116209          528     888005             131.             59.5
# 4.2. only keep those columns that are relevant
covid_europe_grouped_totals_norm_sorted_selected <-
  covid_europe_grouped_totals_norm_sorted %>%
  select(location, total_cases_per_thousand, total_deaths_per_100k)

head(covid_europe_grouped_totals_norm_sorted_selected)
## # A tibble: 6 x 3
##   location   total_cases_per_thousand total_deaths_per_100k
##   <chr>                         <dbl>                 <dbl>
## 1 Andorra                        195.                 168. 
## 2 Montenegro                     194.                 287. 
## 3 San Marino                     158.                 265. 
## 4 Czechia                        157.                 284. 
## 5 Slovenia                       133.                 215. 
## 6 Cyprus                         131.                  59.5
# 4.3. transform the data into the long format
covid_europe_grouped_totals_norm_sorted_selected_long <-
  covid_europe_grouped_totals_norm_sorted_selected %>%
  gather(key = "metric",
         value = "value",
         "total_cases_per_thousand",
         "total_deaths_per_100k") # gather() is actually from the package tidyr, but it is compatible with dplyr

head(covid_europe_grouped_totals_norm_sorted_selected_long)
## # A tibble: 6 x 3
##   location   metric                   value
##   <chr>      <chr>                    <dbl>
## 1 Andorra    total_cases_per_thousand  195.
## 2 Montenegro total_cases_per_thousand  194.
## 3 San Marino total_cases_per_thousand  158.
## 4 Czechia    total_cases_per_thousand  157.
## 5 Slovenia   total_cases_per_thousand  133.
## 6 Cyprus     total_cases_per_thousand  131.
tail(covid_europe_grouped_totals_norm_sorted_selected_long)
## # A tibble: 6 x 3
##   location       metric                value
##   <chr>          <chr>                 <dbl>
## 1 Finland        total_deaths_per_100k  18.9
## 2 Faeroe Islands total_deaths_per_100k   0  
## 3 Gibraltar      total_deaths_per_100k   0  
## 4 Guernsey       total_deaths_per_100k   0  
## 5 Isle of Man    total_deaths_per_100k   0  
## 6 Jersey         total_deaths_per_100k   0
# 4.4. convert "location" to a factor and force its levels to maintain the same order as in the sorted data frame (otherwise, countries will be sorted alphabetically by ggplot2)
covid_europe_grouped_totals_norm_sorted_selected_long$location <- 
  factor(covid_europe_grouped_totals_norm_sorted_selected_long$location,
         levels = unique(covid_europe_grouped_totals_norm_sorted_selected_long$location)) # we use unique() here because each country name is repeated twice in the tibble, and level names must be unique. this trick works here but won't necessarily work elsewhere

Finally, plot the data on a ggplot2 bar graph. Make sure that the bars representing number of cases and number of deaths for each country are plotted side-by-side.

ggplot(data = covid_europe_grouped_totals_norm_sorted_selected_long,
       mapping = aes(x = location,
                     y = value,
                     fill = metric)) +
  geom_col(position = "dodge") +
  coord_flip()

Combine all dplyr steps into a single pipeline

Join all the individual steps with %>%, feed the input data.frame into the pipeline, and assign the output tibble to a variable.

covid_dplyr <- 
  covid %>%
  filter(continent == "Europe") %>%
  group_by(location) %>% 
  summarize(total_cases = sum(new_cases, na.rm = TRUE),
            total_deaths = sum(new_deaths, na.rm = TRUE),
            population = mean(population)) %>% 
  mutate(total_cases_per_thousand = 1000*total_cases/population,
         total_deaths_per_100k = 100000*total_deaths/population) %>% 
  arrange(desc(total_cases_per_thousand)) %>%
  select(location, total_cases_per_thousand, total_deaths_per_100k) %>%
  gather(key = "metric",
         value = "value",
         "total_cases_per_thousand",
         "total_deaths_per_100k") %>%
  mutate(location = factor(location, levels = unique(location)))

covid_dplyr
## # A tibble: 102 x 3
##    location    metric                   value
##    <fct>       <chr>                    <dbl>
##  1 Andorra     total_cases_per_thousand  195.
##  2 Montenegro  total_cases_per_thousand  194.
##  3 San Marino  total_cases_per_thousand  158.
##  4 Czechia     total_cases_per_thousand  157.
##  5 Slovenia    total_cases_per_thousand  133.
##  6 Cyprus      total_cases_per_thousand  131.
##  7 Luxembourg  total_cases_per_thousand  121.
##  8 Serbia      total_cases_per_thousand  120.
##  9 Netherlands total_cases_per_thousand  117.
## 10 Lithuania   total_cases_per_thousand  115.
## # … with 92 more rows

Plot the data again to make sure the plot looks the same.

ggplot(data = covid_dplyr,
       mapping = aes(x = location,
                     y = value,
                     fill = metric)) +
  geom_col(position = "dodge") +
  coord_flip()