library(tidyverse)
library(tidylog)
22 R Reordering Exercise
22.1 Load Libraries
22.2 Create some example data
Here we set up a data dictionary dd
and some corresponding data ds
. However, it is better if the order of the rows in the data dictionary dd
match the order of the columns in the data ds
.
set.seed(1562345)
# Set up a data dictionary
<- data.frame(VARNAME = sample(letters, 26), TYPE = "numeric")
dd # Set up data
<- as.data.frame(t(dd %>%
ds arrange(VARNAME)))
names(ds) <- letters
rownames(ds) <- NULL
1, ] <- rnorm(26)
ds[2, ] <- runif(26)
ds[$ID <- c(1, 2)
ds<- ds %>%
ds select(ID, everything())
select: columns reordered (ID, a, b, c, d, …)
# Randomly rearrange the columns
<- sample(letters, 26)
idx <- c("ID", idx)
idx <- ds %>%
ds select(all_of(idx))
select: columns reordered (ID, b, z, a, p, …)
<- bind_rows(dd, data.frame(VARNAME = "ID", TYPE = "string"))
dd dim(dd)
[1] 27 2
head(dd)
VARNAME TYPE
1 c numeric
2 m numeric
3 f numeric
4 e numeric
5 a numeric
6 d numeric
dim(ds)
[1] 2 27
head(ds[1:3])
ID b z
1 1 1.02333343074042 0.47956883003516
2 2 0.858655267162248 0.136965574463829
names(ds)
[1] "ID" "b" "z" "a" "p" "f" "u" "m" "q" "n" "d" "o" "s" "k" "e"
[16] "x" "c" "h" "i" "g" "j" "r" "t" "y" "l" "w" "v"
22.3 Task: Reorder rows in dd
in the order of ds
’s columns
colnames(ds)
[1] "ID" "b" "z" "a" "p" "f" "u" "m" "q" "n" "d" "o" "s" "k" "e"
[16] "x" "c" "h" "i" "g" "j" "r" "t" "y" "l" "w" "v"
$VARNAME dd
[1] "c" "m" "f" "e" "a" "d" "v" "h" "k" "t" "p" "j" "l" "x" "w"
[16] "y" "b" "o" "s" "r" "i" "z" "u" "n" "g" "q" "ID"
This assumes that every row of dd
is in colnames(ds)
and every colnames(ds)
value is represented in dd
. Perhaps that should be checked first.
22.4 Assumption Check Question
How would you check that every variable listed in the data dictionary dd
is named in colnames(ds)
and every colnames(ds)
value is represented in the data dictionary dd
?
table(dd$VARNAME %in% colnames(ds))
TRUE
27
table(colnames(ds) %in% dd$VARNAME)
TRUE
27
Note that we should also check to see if the VARNAME
’s are unique and the colnames
of ds
are unique.
sum(duplicated(dd$VARNAME))
[1] 0
sum(duplicated(colnames(ds)))
[1] 0
22.5 Task: Reorder rows in dd
to match the order of the columns in ds
Task: Reorder rows in the data dictionary dd
to match the order of the columns in the data ds
- What are various ways you could rearrange the rows of a data frame?
# Assign VARNAME to be the rownames of dd
rownames(dd) <- dd$VARNAME
# Rearrange by row names:
<- dd[colnames(ds), ]
dd2 # Check if this worked:
all.equal(dd2$VARNAME, colnames(ds))
[1] TRUE
We can use match
also:
# match returns a vector of the positions of (first) matches of its first
# argument in its second.
<- dd[match(colnames(ds), dd$VARNAME), ]
dd3 # Check if this worked:
all.equal(dd3$VARNAME, colnames(ds))
[1] TRUE
22.6 Question: use arrange
?
Question: Is there a way to do this using arrange
?
This does not work, because tidyverse
wants to work on columns of data within dd
:
<- dd %>%
dd4 arrange(colnames(ds))
# Check if this worked:
all.equal(dd4$VARNAME, colnames(ds))
[1] "26 string mismatches"
22.7 Question: use arrange
?
Question: Is there a way to do this using arrange
?
arrange()
orders the rows of a data frame by the values of selected columns.
<- dd %>%
dd4 mutate(neworder = match(.$VARNAME, colnames(ds))) %>%
arrange(neworder) %>%
select(-neworder)
mutate: new variable 'neworder' (integer) with 27 unique values and 0% NA
select: dropped one variable (neworder)
all.equal(dd4$VARNAME, colnames(ds))
[1] TRUE
22.8 Question: use slice
Question: Is there a way to do this using the slice
command?
slice()
lets you index rows by their (integer) locations.
<- dd %>%
dd6 slice(match(colnames(ds), .$VARNAME))
slice: no rows removed
all.equal(dd6$VARNAME, colnames(ds))
[1] TRUE
22.9 Question: use select
?
Question: Is there a way to do this by transposing and then using select
?
# Transpose so rows become columns, and then we can use 'select' to rearrange
# those columns, and then transpose back, and rename columns as needed.
<- dd %>%
dd5 t() %>%
as_tibble(.name_repair = "unique") %>%
select(colnames(ds)) %>%
t() %>%
as.data.frame() %>%
rename(VARNAME = "V1", TYPE = "V2")
select: columns reordered (ID, b, z, a, p, …)
rename: renamed 2 variables (VARNAME, TYPE)
all.equal(dd5$VARNAME, colnames(ds))
[1] TRUE
22.10 Question: use row names
Question: What about using row names?
“While a tibble can have row names (e.g., when converting from a regular data frame), they are removed when subsetting with the [ operator. A warning will be raised when attempting to assign non-NULL row names to a tibble. Generally, it is best to avoid row names, because they are basically a character column with different semantics than every other column.”
From: https://tibble.tidyverse.org/reference/rownames.html