(df <- data.frame(a=c(1,1,1,2), b=c(1,1,NA,2)))
a b
1 1 1
2 1 1
3 1 NA
4 2 2
# How many times is 1 in column a
nrow(df[df$a == 1,])
[1] 3
# How many times is 1 in column b
nrow(df[df$b == 1,])
[1] 3
This example is modeled on this Mastodon post:
https://fediscience.org/@thadryanjs/111188342897535820
As there are only two 1’s in column b, this answer of 3 is incorrect.
What’s going on here?
What’s a correct way to count the number of 1’s in each of these two columns?
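This doesn’t work because of the NA in column b: the comparison evaluates to NA for that row, and subsetting a data.frame with an NA index produces a row of NAs, so three rows are returned: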
df[df$b == 1,]
a b
1 1 1
2 1 1
NA NA NA
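The behaviour above can be reproduced step by step in base R. This is a minimal sketch; the variable names `is_one`, `n_which`, and `n_in` are illustrative and not from the original post. It shows that the comparison itself yields NA, and that `which()` or `%in%` avoid the extra row.

```r
df <- data.frame(a = c(1, 1, 1, 2), b = c(1, 1, NA, 2))

# The comparison propagates the missing value: TRUE TRUE NA FALSE
is_one <- df$b == 1

# which() returns only the positions that are TRUE, so the NA is dropped
n_which <- nrow(df[which(is_one), ])

# %in% never returns NA, so it can be used directly as a row index
n_in <- nrow(df[df$b %in% 1, ])

n_which  # 2
n_in     # 2
```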
Using a data.table instead of a data.frame would work:
library(data.table)
(dt <- data.table(a=c(1,1,1,2), b=c(1,1,NA,2)))
a b
<num> <num>
1: 1 1
2: 1 1
3: 1 NA
4: 2 2
# How many times is 1 in column a
nrow(dt[dt$a == 1,])
[1] 3
# How many times is 1 in column b
nrow(dt[dt$b == 1,])
[1] 2
Counting it more directly is another possibility:
sum(df$b==1, na.rm = TRUE)
[1] 2
Tidyverse commands also give the correct answer:
suppressMessages(library(tidyverse))
df %>% filter(b == 1) %>% nrow()
[1] 2
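To answer the question for both columns at once, the direct-count approach can be applied column-wise. The following is a sketch in base R only; the names `counts_sapply` and `counts_colsums` are illustrative.

```r
df <- data.frame(a = c(1, 1, 1, 2), b = c(1, 1, NA, 2))

# Apply the na.rm-aware count to every column
counts_sapply <- sapply(df, function(col) sum(col == 1, na.rm = TRUE))

# Equivalent: compare the whole data.frame, then sum each column
counts_colsums <- colSums(df == 1, na.rm = TRUE)

counts_sapply   # a = 3, b = 2
counts_colsums  # a = 3, b = 2
```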
Are there any r’s in the vector LETTERS?
I used ‘which’ to determine there were zero copies of the target in the vector of interest, but then testing whether the answer returned by ‘which’ is zero is tricky.
See discussion on Mastodon here:
https://fediscience.org/@StatGenDan/111052432535136731
# LETTERS contains the uppercase letters
LETTERS
[1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S"
[20] "T" "U" "V" "W" "X" "Y" "Z"
# This returns a vector of length zero:
which(LETTERS == "r")
integer(0)
# Testing if it is equal to the integer zero does not work
0 == which(LETTERS == "r")
logical(0)
# Testing if it is equal to the integer one does not work
1 == which(LETTERS == "r")
logical(0)
So what is a more correct way to test whether any r is present in the LETTERS vector?
Thomas Lumley commented:
“In general, there are functions in R that return a length-1 answer (any, length, sum, min,…) and there are functions that return a variable-length answer (==, which, +, -,…). You have a length-1 question: are there any ’r’s? You need a function with a fixed length-1 return value.”
https://fediscience.org/@tslumley/111053882380113100
# Number of r's in LETTERS
length(which(LETTERS == "r"))
[1] 0
# Number of r's in LETTERS
sum(LETTERS=='r')
[1] 0
# Is r present in LETTERS?
any(LETTERS=='r')
[1] FALSE
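The distinction matters most inside an `if` condition, which needs a length-1 value. The sketch below uses illustrative names (`idx`, `res`) and wraps the failing test in `tryCatch()` so it can be run without stopping; it is one way to demonstrate the point, not the only one.

```r
idx <- which(LETTERS == "r")

# A zero-length comparison cannot be used as an if() condition:
# if (logical(0)) raises an error, which is caught here
res <- tryCatch(
  if (0 == idx) "zero" else "nonzero",
  error = function(e) "error"
)
res                  # "error"

# Length-1 alternatives that are safe inside if()
length(idx) == 0     # TRUE
"r" %in% LETTERS     # FALSE
"R" %in% LETTERS     # TRUE
```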
June Choe shared this on Mastodon
https://fosstodon.org/@yjunechoe/111026163637396686
A student in my intro #rstats class taught me something new today (by way of a cryptic “bug”).
Suppose you’re asked why this {purrr} code that should return the mean of each list element is not working as expected.
map(list(x=1:3, y=4:6), mean)
#> $x
#> [1] 1
#>
#> $y
#> [1] 4
What do you think is the simplest explanation for this behavior (in terms of the mistake that the student could’ve made)? It’s not so obvious - there are multiple R “quirks” cascading!
library(purrr)
set.seed(123)
mean <- mean(sample(2, 10, replace=TRUE))
mean
[1] 1.4
# These means are correct
mean(1:3)
[1] 2
mean(4:6)
[1] 5
# These means are correct:
lapply(list(x=1:3, y=4:6), mean)
$x
[1] 2
$y
[1] 5
# But these means are incorrect:
map(list(x=1:3, y=4:6), mean)
$x
[1] 1
$y
[1] 4
Why are the means computed using the map function from the purrr package incorrect?
It is not applying the mean function, but rather it is applying the mean variable, which has a value of 1.4.
As the map documentation states, while the map command is typically used to apply a function in its .f argument, the .f argument can also accept an integer. When it does so, it is interpreted as follows:
A string, integer, or list, e.g. "idx", 1, or list("idx", 1) which are shorthand for pluck(x, "idx"), pluck(x, 1), and pluck(x, "idx", 1) respectively.
So when we map using the mean variable, it is used as an index to pluck elements out of each list element. During the double-to-integer conversion, 1.4 is truncated to 1, so map plucks the first value of each element.
map(list(x=1:3, y=4:6), 1.4)
$x
[1] 1
$y
[1] 4
map(list(x=1:3, y=4:6), 1)
$x
[1] 1
$y
[1] 4
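Why do `mean(1:3)` and `lapply()` still work while the variable is in the way? The sketch below, in base R only, illustrates the lookup rules involved; the names `found_by_name` and `via_lapply` are illustrative. A bare name passed as an argument resolves to the nearest binding of any type, whereas call syntax and `match.fun()` (which `lapply()` uses on its `FUN` argument) look specifically for a function.

```r
mean <- 1.4                      # shadows base::mean with a number

# A bare name evaluates to the variable, not the function
is.function(mean)                # FALSE

# Call syntax skips non-function bindings and finds base::mean
mean(1:3)                        # 2

# Looking the name up in function mode also skips the number
found_by_name <- get("mean", mode = "function")
is.function(found_by_name)       # TRUE

# lapply() resolves FUN via match.fun(), so it still gets the function
via_lapply <- lapply(list(x = 1:3, y = 4:6), mean)

# Removing the variable restores the usual meaning of the bare name
rm(mean)
is.function(mean)                # TRUE
```

With purrr, passing the function explicitly, for example `base::mean` or an anonymous function such as `\(v) mean(v)`, should avoid picking up the variable.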
Moral: Be careful to avoid using existing R function names, like mean, as the names of your variables.
Relevant discussion can be found in
https://adv-r.hadley.nz/functions.html#functions-versus-variables
where it is stated:
“For the record, using the same name for different things is confusing and best avoided!”