22 R Reordering Exercise
22.1 Create some example data
Here we set up a data dictionary dd
and some corresponding data ds
. However, it is better if the order of the rows in the data dictionary dd
match the order of the columns in the data ds
.
Let’s examine what we’ve created:
22.2 Task: Reorder rows in dd
in the order of ds
’s columns
This assumes that every row of dd
is in colnames(ds)
and every colnames(ds)
value is represented in dd
. Perhaps that should be checked first.
22.3 Assumption Check Question
How would you check that every variable listed in the data dictionary dd
is named in colnames(ds)
and every colnames(ds)
value is represented in the data dictionary dd
?
Note that we should also check to see if the VARNAME
’s are unique and the colnames
of ds
are unique.
22.4 Task: Reorder rows in dd
to match the order of the columns in ds
Task: Reorder rows in the data dictionary dd
to match the order of the columns in the data ds
- What are various ways you could rearrange the rows of a data frame?
This works by assigning VARNAME
to be the row names of dd
, and then rearranging dd
by row names.
We can use match
also:
To better understand how this works, let’s look at the match
function’s output:
The first element of the match
output is 27. This means that the first element in colnames(ds)
, ID
is the 27th element in dd$VARNAME
. So when we do dd[match(colnames(ds), dd$VARNAME),]
, we’re moving the 27th row of dd
into the first position, so as to put the ID
row first. And so on. Thus dd[match(colnames(ds), dd$VARNAME),]
rearranges the rows of dd
into the order of the colnames(ds)
.
22.5 Question: use arrange
?
Question: Is there a way to do this using arrange
?
This does not work, because tidyverse
wants to work on columns of data within dd
:
You can’t use colnames(ds)
directly in arrange
because arrange
wants to work on columns of data within dd
, and colnames(ds)
is not a column of dd
. In the Tidyverse, it is generally required that you work with columns of data contained within the input data frame, and not with external data that lies outside of the input data frame such as, in this case, the column names of ds
.
22.6 Question: use arrange
?
Question: Is there a way to do this using arrange
?
arrange()
orders the rows of a data frame by the values of selected columns.
This works by creating a new column neworder
that contains the position of the VARNAME
in the colnames(ds)
. We then use arrange
to sort by this new column, and then remove the new column.
22.7 Question: use slice
Question: Is there a way to do this using the slice
command?
The slice()
function lets you index rows by their (integer) locations.
Hint: Combine slice
with match
.
The match
function returns a vector of the positions of (first) matches of its first argument in its second. So if you have a data frame with row names, you can use match
to get the positions of the row names you want to select. You can then use slice
to select the rows by position.
According to its documentation, slice
lets you index rows by their (integer) locations. It allows you to select, remove, and duplicate rows. Here’s an example combining slice
and match
where we have a data frame with row names and we want to select the rows in the order “A”, “B”, “C”:
> df
a b
B 1 3
C 2 2
A 3 1
> match(c("A","B","C"),rownames(df))
[1] 3 1 2
> df %>% slice(match(c("A","B","C"),rownames(df)))
a b
A 3 1
B 1 3
C 2 2
Here the match vector of 3 1 2
tells us that row “A” is in position 3 in the rownames(df)
vector, row “B” is in position 1, and row “C” is in position 2. So when we do slice(3,1,2)
that moves row 3 into position 1, row 1 into position 2, and row 2 into position 3, as desired.
In our case, we can use match
to get the positions of the VARNAME
’s, and then use slice
to rearrange the rows by position.
22.8 Question: use select
?
Question: Is there a way to do this by transposing and then using select
?
The t()
function transposes a matrix or data frame. It generates a new data frame where the rows are the columns of the original data frame and the columns are the rows of the original data frame.
Here is an example of using t()
to transpose a data frame:
> df
a b
B 1 3
C 2 2
A 3 1
> t(df)
B C A
a 1 2 3
b 3 2 1
When we use t(df)
we get a new data frame where the rows are the columns of the original data frame and the columns are the rows of the original data frame.
We can then use select
to rearrange the columns, and then transpose back to get the original data frame with the columns in the desired order.
So here we apply this to our data dictionary dd
:
This works by transposing the data frame so that the rows become columns. We then use select
to rearrange the columns, and then transpose back to get the original data frame with the columns in the desired order. We then rename the columns as needed.
22.9 Question: use row names
Question: What about using row names?
“While a tibble can have row names (e.g., when converting from a regular data frame), they are removed when subsetting with the [ operator. A warning will be raised when attempting to assign non-NULL row names to a tibble. Generally, it is best to avoid row names, because they are basically a character column with different semantics than every other column.”
From: https://tibble.tidyverse.org/reference/rownames.html