r – how to group and summarise column by two other columns

The most obvious problem in the code submitted in the question is that it does not correctly aggregate cases and deaths by country and year. Less obvious errors are related to choices made to subset the data, handling of missing values, and the use of stat = "identity" as an argument in geom_bar().

Here’s a completely reproducible example that reproduces the instructor’s chart.

First, we load the data from the European CDC.

library(ggplot2)
library(dplyr)
library(tidyr)

data <- read.csv("https://opendata.ecdc.europa.eu/covid19/nationalcasedeath_eueea_daily_ei/csv", 
                 na.strings = "", fileEncoding = "UTF-8-BOM")

Next, we create a list of countries to subset that match those in the instructor’s chart.

# select some countries
countryList <- c("France","Italy","Germany","Poland","Romania")

Here we group by country and year, and then aggregate cases & deaths, then we calculate the death rate (deaths per 100 confirmed cases), and save to an output data frame.

data %>%
     filter(countriesAndTerritories %in% countryList) %>%
     group_by(countriesAndTerritories,year) %>% 
     summarise(cases = sum(cases,na.rm=TRUE),
               deaths = sum(deaths,na.rm=TRUE)) %>%
     mutate(deathRate = deaths / (cases / 100)) -> summedData

We use the na.rm = TRUE argument on sum() to include as much of the data as possible, since the codebook tells us that these are daily reports of cases and deaths.

If we view the data frame with View(summedData)we see that the death rates are between 1 and 4, as expected.

Having inspected the data, we plot it with ggplot().

enter image description here

Diagnosing the Error

We’ll walk step-by-step through the code of the original post to find the errors, now that we know we are able to reproduce the professor’s chart with the data provided from the European CDC.

After reading the data as above, we subset countries and execute the first part of the dplyr pipeline, and save to a data frame.

selected_countries <- c("France","Italy","Norway","Sweden","Finland")

data%>%
     filter(countriesAndTerritories%in%selected_countries) -> step1

In the RStudio object viewer we see that the resulting data frame has 4,240 observations.

If we summarize the data to look at the average daily cases and average daily deaths, we see that the cases average 12260.9 while deaths average 81.03.

> mean(step1$cases,na.rm=TRUE)   
[1] 12260.9
> mean(step1$deaths,na.rm=TRUE) 
[1] 81.03202

So far, so good because this means that the average death rate across all the data is less than 1.0, which makes sense given worldwide reports about COVID mortality rates since March 2020.

> mean(step1$deaths,na.rm=TRUE) / (mean(step1$cases,na.rm=TRUE) /100)
[1] 0.6608977

Next, we execute the tidyr::drop_na() function and see what happens.

library(tidyr)
step1 %>%     drop_na(deaths) -> step2 
nrow(step2)

Looks like we’ve lost 24 observations.

> nrow(step2)
[1] 4216

When we sort by deaths and inspect the output from step1 in the RStudio data viewer, we find the disappearing observations in Norway. There are 24 days where there were cases recorded but no deaths.

enter image description here

Still, there’s nothing to suggest we’d generate a graph where 400 people die for every 100 confirmed COVID cases.

Next, we apply the next operation in the original poster’s dplyr pipeline and count the rows.

step2 %>% filter(deaths>0, cases>0) -> step3 nrow(step3)

Hmm… we’ve lost over 980 rows of data now.

> nrow(step3)
[1] 3231

At this point, the code is throwing valid data away, which is going to skew the results. Why? COVID case and death counts are notorious for data corrections over time, so sometimes governments will report negative cases or deaths to correct past over-reporting errors.

Sure enough, our data includes rows with negative values.

enter image description here

Wow! one country reported -348,846 cases in one day. That’s a major correction.

Inspecting the data again, we see that France is the culprit here. If we were a more serious observation with this data, the researcher would be obligated to assess the validity of by doing things as reviewing news reports about case COVID reporting corrections in France during 2021.

enter image description here

Now, when the original code uses mutate() to calculate death rates, it does not aggregate over countriesAndTerritories or year.

step3 %>%  mutate(d =(deaths*100)/cases) -> step4 

Therefore, when the code uses ggplot() with geom(stat = "identity") the y axis uses the values ​​of the individual observations, 3,231 at this point, and produces unexpected results.

Corrected Version of Original Poster’s Chart

Here is the code that correctly analyzes the data, using the five countries selected by the original poster.

countryList <- c("France","Italy","Norway","Sweeden","Finland")
data %>%
     filter(countriesAndTerritories %in% countryList) %>%
     group_by(countriesAndTerritories,year) %>% 
     summarise(cases = sum(cases,na.rm=TRUE),
               deaths = sum(deaths,na.rm=TRUE)) %>%
     mutate(deathRate = deaths / (cases / 100)) -> summedData

# plot the results

ggplot(data = summedData,
       aes(x=countriesAndTerritories, y=deathRate, fill=as.factor(year)))+
     geom_bar(position = "dodge", stat = "identity")+
     labs(x="Countries",y="Deaths Per 100 Cases", fill="year")+
     ggtitle("Number of Deaths per 100 confirmed cases in each year")

…and the output, which has death rates between 1 and 3.5%.

enter image description here

Note that the data frame used to generate the chart has only

Conclusions

First, it’s important to understand the details of the data we’re analyzing, especially when there are plenty of outside references such as worldwide COVID death rates.

Second, it’s important to understand what are valid observations in a data set, such as whether it’s reasonable for a data correction like the one France made in May 2021.

Finally, it’s important to conduct a face validity analysis of the results. Is it realistic to expect 400 people to die for every 100 confirmed COVID cases? For a disease with a worldwide death rate between 1 and 4%, probably not.

Leave a Comment