Data wrangling tricks from the R4DS slack

Every now and then there is a question on the R4DS slack that produces a large number of replies. This case was particularly interesting because the user that had the issue deleted all the original content, the question, the clarifications and so on.

I thought it would be interesting to document this, because it may be useful to other people.

The question was posted with regards to some genes data, and the task was to filter out some of the genes that have missing values. But generally speaking the question can be summarized like this:

How to filter a dataset in a such way that all observations from a group are dropped if the group doesn’t meet certain criteria?

These are our sample data:

dat <- data.frame(
    groups = c("a", "a", "a", "b", "b", "b", "c", "c", "c"),
    features = c("c", "d", "e", "e", "e", "d", "d", "f", "g"),
    col1 = c(1, 2, 5, NA, 5, NA, 6, 7, NA),
    col2 = c(1, 2, 4, 4, NA, 5, 6, NA, 7),
    col3 = c(1, 3, 4, NA, 5, NA, 4, 7, 8)
)

dat
##   groups features col1 col2 col3
## 1      a        c    1    1    1
## 2      a        d    2    2    3
## 3      a        e    5    4    4
## 4      b        e   NA    4   NA
## 5      b        e    5   NA    5
## 6      b        d   NA    5   NA
## 7      c        d    6    6    4
## 8      c        f    7   NA    7
## 9      c        g   NA    7    8

We want to keep only group a since it is the only one where none of the observations have missing data.

There is a case of complete observations in group c as well, but it is only one (on row 7), so we don’t want to keep that group.

A simple pipeline to do this would be:

library(dplyr)
library(tidyr)

dat |> 
  drop_na() |> 
  group_by(groups) |> 
  filter(n() == 3)
## # A tibble: 3 × 5
## # Groups:   groups [1]
##   groups features  col1  col2  col3
##   <chr>  <chr>    <dbl> <dbl> <dbl>
## 1 a      c            1     1     1
## 2 a      d            2     2     3
## 3 a      e            5     4     4

After providing this solution, I thought about the different scenarios where a similar recipe may be needed. For example, what if the data is too big and, data.table would be preferred? Or what if the pipeline should be part of a bigger project using python for example?

More R

So here are the possible solutions with other libraries. First data.table.

library(data.table)

dat_dt <- setDT(dat)

dat_grp <-
  dat_dt[complete.cases(dat_dt),][, .N, by = groups][N == 3]

dt_result <- dat_dt[dat_grp, on = "groups"]

dt_result
##    groups features col1 col2 col3 N
## 1:      a        c    1    1    1 3
## 2:      a        d    2    2    3 3
## 3:      a        e    5    4    4 3

Most of the time I find data.table unreadable, and luckily, a tidytable solution is also possible.

dat |> 
  tidytable::drop_na() |> 
  tidytable::group_by(groups) |> 
  tidytable::filter(dplyr::n() == 3)
## # A tidytable: 3 × 5
## # Groups:      groups
##   groups features  col1  col2  col3
##   <chr>  <chr>    <dbl> <dbl> <dbl>
## 1 a      c            1     1     1
## 2 a      d            2     2     3
## 3 a      e            5     4     4

Python

On to python. Here is a simple chain in pandas that does the same.

import pandas as pd

dat_py = r.dat

(dat_py.dropna(axis = 0, how = "any")
.groupby("groups")
.filter(lambda x: len(x) == 3))
##   groups features  col1  col2  col3
## 0      a        c   1.0   1.0   1.0
## 1      a        d   2.0   2.0   3.0
## 2      a        e   5.0   4.0   4.0

The interesting bit is that drop_na() and dropna() behave differently. In tidyr, the function assumes that it is checking all of the columns by default, while in pandas arguments have to be called to check if any of the columns have missing values.

SQL

And, how about some SQL? It is possible, of course, that your data may be in a database and not in a flat file. Here is a sqlite solution.

library(DBI)

con <- dbConnect(RSQLite::SQLite(), ":memory:")

dbWriteTable(con, "dat", dat)

query <- "SELECT t1.* FROM dat AS t1
INNER JOIN (
    SELECT groups, COUNT(*) as cnt
    FROM dat
    WHERE col1 IS NOT NULL AND col2 IS NOT NUll AND col3 IS NOT NULL
    GROUP BY groups
) AS t2
ON t1.groups = t2.groups
WHERE t2.cnt = 3;"
  
res <- dbSendQuery(con, query)

dbFetch(res)
##   groups features col1 col2 col3
## 1      a        c    1    1    1
## 2      a        d    2    2    3
## 3      a        e    5    4    4

Summary

This was a quick post that provides answers with different tools to the question: How to filter a dataset in a such way that all observations from a group are dropped if the group doesn’t meet certain criteria?

We saw how to do it in tidyverse and in data.table. What is missing is base R. Sorry about that! However there are solutions in pandas and SQL. Hopefully something can be useful to somebody.

Novica Nakov
Novica Nakov

Data Wrangler.