16 R Functions Exercise
16.1 Load Libraries
library(tidyverse)
# library(tidylog)
16.2 Data set creation code
# Simulate data sets: 26 named individuals, each with a random normal trait value
for (i in 1:10) {
  fl <- data.frame(name=rep(paste0("name",i),26))
  b <- data.frame(name = rep(NA, 26))
  b$name <- paste0(fl$name,"_",letters)
  b$trait <- rnorm(26)
  write_tsv(b,paste0("data/dataset",i,".txt"))
}
16.3 Example
Here we have been sent three data sets, in files that contain the quantitative trait values for each person in the data set:
“dataset1.txt” “dataset2.txt” “dataset3.txt”
And we’ve been asked to make a table that gives, for each dataset, the sample size (N), the mean of the trait, the median, and the variance.
We could do this by reading in each data set, one by one, as follows:
results <- data.frame(dataset=rep(NA,3),N=NA, mean=NA, median=NA, var=NA)
fl1 <- read.table("data/dataset1.txt",sep="\t",header=TRUE)
results$dataset[1] <- "dataset1"
results$N <- nrow(fl1)
results$mean[1] <- mean(fl1$trait)
results$median[1] <- median(fl1$trait)
results$var[1] <- var(fl1$trait)
results
dataset N mean median var
1 dataset1 26 0.09762111 0.2198957 0.5974116
2 <NA> 26 NA NA NA
3 <NA> 26 NA NA NA
fl2 <- read.table("data/dataset2.txt",sep="\t",header=TRUE)
results$dataset[2] <- "dataset2"
results$N <- nrow(fl2)
results$mean[2] <- mean(fl2$trait)
results$median[2] <- median(fl2$trait)
results$var[2] <- var(fl2$trait)
results
dataset N mean median var
1 dataset1 26 0.09762111 0.2198957 0.5974116
2 dataset2 26 0.43486401 0.3558736 1.0936651
3 <NA> 26 NA NA NA
fl3 <- read.table("data/dataset3.txt",sep="\t",header=TRUE)
results$dataset[3] <- "dataset3"
results$N <- nrow(fl3)
results$mean[3] <- mean(fl3$trait)
results$median[3] <- median(fl3$trait)
results$var[3] <- var(fl3$trait)
results
dataset N mean median var
1 dataset1 26 0.09762111 0.2198957 0.5974116
2 dataset2 26 0.43486401 0.3558736 1.0936651
3 dataset3 26 0.07508335 0.0445614 0.7950574
Your colleague initially sent you the three data sets above, but has now sent three more and asked you to update the ‘results’ table.
As you can see, the code above is very repetitive. So let’s automate this by writing a function that loops through a list of data set files named “dataset1.txt”, “dataset2.txt”, “dataset3.txt”, etc., building up the results table as above.
16.3.1 Question: How could we construct a list of file names?
This ‘Run code’ WebR chunk needs to be run first, before the later ones, as it downloads and reads in the required data files. The WebR chunks should be run in order, as you encounter them, from beginning to end.
We now have the files “dataset1.txt”, “dataset2.txt”, “dataset3.txt”, …, “dataset6.txt” in the ‘data’ directory.
Question: How could we construct a list of file names?
Hint: Use the list.files command.
Hint: The list.files command provides a handy way to get a list of the input files:
fls <- list.files(path="data",pattern="dataset*")
fls
[1] "dataset1.txt" "dataset2.txt" "dataset3.txt" "dataset4.txt" "dataset5.txt"
[6] "dataset6.txt"
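One detail worth knowing here: the pattern argument of list.files is a regular expression, not a shell-style wildcard, so "dataset*" matches any file name that contains "datase" followed by zero or more "t" characters. The sketch below illustrates the difference and shows how glob2rx can turn a wildcard into an anchored regular expression. The temporary folder and the file names in it are made up purely for illustration.

```r
# Illustration only: a temporary folder with hypothetical file names
tmp <- file.path(tempdir(), "list_files_demo")
dir.create(tmp, showWarnings = FALSE)
file.create(file.path(tmp, c("dataset1.txt", "dataset2.txt",
                             "dataset_notes.md", "old_dataset1.txt")))

# "dataset*" is read as a regular expression, so it matches all four names
loose <- list.files(path = tmp, pattern = "dataset*")
loose

# glob2rx() converts the wildcard "dataset*.txt" into an anchored regex
strict <- list.files(path = tmp, pattern = utils::glob2rx("dataset*.txt"))
strict
```

With only the six data files in the ‘data’ folder, both patterns return the same list; the stricter form matters once other files share the folder.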
16.3.2 Question: Outline a possible algorithm
Outline a possible algorithm that loops through a list of input data set files named “dataset1.txt”, “dataset2.txt”, “dataset3.txt”, etc., building up the results table as above.
- Read in the input file names into a list
- Set up an empty results table
- For each file in our file name list
  - Read the file
  - Compute the statistics
  - Insert the information into the results table
- Return the filled-in results table
16.3.3 Question: Construct a more detailed step-by-step algorithm.
Construct a more detailed step-by-step algorithm.
- Input the path to the folder containing the data files
- Read in the input file names into a list fls
- Count the number of input files N
- Set up an empty results table with N rows
- For each file in our file name list fls
  - Read the file
  - Compute the statistics
  - Insert the information into the correct row of the results table
- Return the filled-in results table
16.3.4 Task: Write a read_data_file function.
Write a read_data_file function to accomplish the required steps for a single input data file.
- Make the number in the data file name an argument.
Here we make the number in the data file name an argument.
results <- data.frame(dataset=rep(NA,6),N=NA, mean=NA, median=NA, var=NA)
read_data_file <- function(n=1, results) {
  fl1 <- read.table(paste0("data/dataset",n,".txt"),sep="\t",header=TRUE)
  results$dataset[n] <- paste0("dataset",n,".txt")
  results$N <- nrow(fl1)
  results$mean[n] <- mean(fl1$trait)
  results$median[n] <- median(fl1$trait)
  results$var[n] <- var(fl1$trait)
  results
}
(results <- read_data_file(n=1, results))
dataset N mean median var
1 dataset1.txt 26 0.09762111 0.2198957 0.5974116
2 <NA> 26 NA NA NA
3 <NA> 26 NA NA NA
4 <NA> 26 NA NA NA
5 <NA> 26 NA NA NA
6 <NA> 26 NA NA NA
(results <- read_data_file(n=2, results))
dataset N mean median var
1 dataset1.txt 26 0.09762111 0.2198957 0.5974116
2 dataset2.txt 26 0.43486401 0.3558736 1.0936651
3 <NA> 26 NA NA NA
4 <NA> 26 NA NA NA
5 <NA> 26 NA NA NA
6 <NA> 26 NA NA NA
- Make the path to the input file an argument to your read_data_file function.
Here we make the path to the input file an argument.
read_data_file_v2 <- function(n=1, flnm="dataset1.txt", results) {
  fl1 <- read.table(paste0("data/",flnm),sep="\t",header=TRUE)
  results$dataset[n] <- flnm
  results$N <- nrow(fl1)
  results$mean[n] <- mean(fl1$trait)
  results$median[n] <- median(fl1$trait)
  results$var[n] <- var(fl1$trait)
  results
}
(results <- read_data_file_v2(n=1, flnm = "dataset1.txt", results))
dataset N mean median var
1 dataset1.txt 26 0.09762111 0.2198957 0.5974116
2 dataset2.txt 26 0.43486401 0.3558736 1.0936651
3 <NA> 26 NA NA NA
4 <NA> 26 NA NA NA
5 <NA> 26 NA NA NA
6 <NA> 26 NA NA NA
(results <- read_data_file_v2(n=2, flnm = "dataset2.txt", results))
dataset N mean median var
1 dataset1.txt 26 0.09762111 0.2198957 0.5974116
2 dataset2.txt 26 0.43486401 0.3558736 1.0936651
3 <NA> 26 NA NA NA
4 <NA> 26 NA NA NA
5 <NA> 26 NA NA NA
6 <NA> 26 NA NA NA
16.3.5 Question: What does the above code assume?
What does the above code assume?
Assumes a file naming style of ‘dataset*.txt’ where the asterisk represents 1, 2, 3, …
Assumes the files are in the “data” folder.
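Both assumptions could be relaxed by passing the folder and the file name in as arguments, and by checking that the file exists before reading it. The sketch below is one possible way to do this, not part of the exercise itself: summarise_one_file, its dir argument, and the small demo file are hypothetical names and data introduced for illustration.

```r
# Hypothetical helper: summarise one tab-separated file with a 'trait' column
summarise_one_file <- function(flnm, dir = "data") {
  path <- file.path(dir, flnm)
  if (!file.exists(path)) {
    stop("File not found: ", path)
  }
  fl <- read.table(path, sep = "\t", header = TRUE)
  data.frame(dataset = flnm,
             N = nrow(fl),
             mean = mean(fl$trait),
             median = median(fl$trait),
             var = var(fl$trait))
}

# Demo on a small made-up file written to a temporary folder
demo_dir <- tempdir()
demo <- data.frame(name = c("a", "b", "c", "d"), trait = c(1, 2, 3, 6))
write.table(demo, file.path(demo_dir, "study_A.tsv"),
            sep = "\t", row.names = FALSE, quote = FALSE)
demo_row <- summarise_one_file("study_A.tsv", dir = demo_dir)
demo_row
```

Because the helper returns a one-row data frame instead of modifying a shared table, it no longer depends on the file being numbered or on where the file lives.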
16.3.6 Question: Extend your function to process all of the files
The above function read_data_file processes one file at a time. How would you write a function that loops over all of our files?
fls <- list.files(path="data",pattern="dataset*")
loop_over_dataset <- function(fls) {
  # Input: the list of file names
  # Output: the 'results' table
  # Count the number of data set file names in fls
  n_datasets <- length(fls)
  # Set up a results dataframe with n_datasets rows
  results <- data.frame(dataset=rep(NA,n_datasets),N=NA, mean=NA, median=NA, var=NA)
  for (n in 1:n_datasets) {
    results <- read_data_file(n=n, results=results)
  }
  return(results)
}
loop_over_dataset(fls = fls)
dataset N mean median var
1 dataset1.txt 26 0.09762111 0.21989574 0.5974116
2 dataset2.txt 26 0.43486401 0.35587359 1.0936651
3 dataset3.txt 26 0.07508335 0.04456140 0.7950574
4 dataset4.txt 26 0.06259720 0.04813915 0.9186042
5 dataset5.txt 26 -0.09288522 -0.19155759 0.9978161
6 dataset6.txt 26 -0.20266667 -0.23845426 1.5605823
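Note that loop_over_dataset uses only the length of fls; the file names themselves are rebuilt inside read_data_file from n. A variant that reads whatever names appear in fls might look like the sketch below. It is self-contained for illustration only: read_one and loop_over_files are hypothetical names, and the two small files are made up and written to a temporary folder.

```r
# Hypothetical variant: fill row n of 'results' from the file named flnm
read_one <- function(n, flnm, dir, results) {
  fl <- read.table(file.path(dir, flnm), sep = "\t", header = TRUE)
  results$dataset[n] <- flnm
  results$N[n] <- nrow(fl)
  results$mean[n] <- mean(fl$trait)
  results$median[n] <- median(fl$trait)
  results$var[n] <- var(fl$trait)
  results
}

loop_over_files <- function(fls, dir) {
  n_files <- length(fls)
  results <- data.frame(dataset = rep(NA, n_files), N = NA,
                        mean = NA, median = NA, var = NA)
  # Loop over positions in fls, passing the actual file name each time
  for (n in seq_along(fls)) {
    results <- read_one(n, fls[n], dir, results)
  }
  results
}

# Made-up demo files of different sizes
demo_dir <- file.path(tempdir(), "loop_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.table(data.frame(trait = c(1, 2, 3)),
            file.path(demo_dir, "cohortA.txt"), sep = "\t", row.names = FALSE)
write.table(data.frame(trait = c(2, 4, 6, 8)),
            file.path(demo_dir, "cohortB.txt"), sep = "\t", row.names = FALSE)
demo_fls <- list.files(path = demo_dir, pattern = "\\.txt$")
demo_results <- loop_over_files(demo_fls, demo_dir)
demo_results
```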
16.3.7 Bonus question: Can you find a subtle mistake in the read_data_file function?
results <- data.frame(dataset=rep(NA,6),N=NA, mean=NA, median=NA, var=NA)
read_data_file <- function(n=1, results) {
  fl1 <- read.table(paste0("data/dataset",n,".txt"),sep="\t",header=TRUE)
  results$dataset[n] <- paste0("dataset",n,".txt")
  results$N <- nrow(fl1)
  results$mean[n] <- mean(fl1$trait)
  results$median[n] <- median(fl1$trait)
  results$var[n] <- var(fl1$trait)
  invisible(results)
}
If N varies across the data sets, then this line will not do the right thing, because it overwrites the entire N column with the row count of the current file:
results$N <- nrow(fl1)
Instead this line should be
results$N[n] <- nrow(fl1)
results <- data.frame(dataset=rep(NA,6),N=NA, mean=NA, median=NA, var=NA)
read_data_file <- function(n=1, results) {
  fl1 <- read.table(paste0("data/dataset",n,".txt"),sep="\t",header=TRUE)
  results$dataset[n] <- paste0("dataset",n,".txt")
  results$N[n] <- nrow(fl1)
  results$mean[n] <- mean(fl1$trait)
  results$median[n] <- median(fl1$trait)
  results$var[n] <- var(fl1$trait)
  invisible(results)
}
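The effect of the missing [n] can be seen on a small in-memory data frame, with no files involved. In this sketch the sizes 26 and 40 are made-up values chosen for illustration, and whole_col and by_row are hypothetical names.

```r
# Two pre-allocated tables, one per assignment style
whole_col <- data.frame(dataset = c("A", "B"), N = NA)
by_row <- data.frame(dataset = c("A", "B"), N = NA)

# Pretend data set A has 26 rows and data set B has 40 rows (made-up sizes)
sizes <- c(26, 40)
for (n in 1:2) {
  whole_col$N <- sizes[n]   # replaces the entire N column each time
  by_row$N[n] <- sizes[n]   # updates only row n
}

whole_col$N  # both rows end up holding the last value, 40
by_row$N     # row 1 holds 26 and row 2 holds 40
```

The mistake goes unnoticed in this exercise only because every simulated data set happens to have 26 rows.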
16.3.8 Bonus question: Why does this end in an error?
read_data_file_v2("dataset1.txt",results)
Error in file(file, "rt"): invalid 'description' argument
The read_data_file_v2 function’s arguments are n, flnm, and results.
When we call it in this manner:
read_data_file_v2("dataset1.txt",results)
we are calling it using unnamed arguments, so they are interpreted by position. That means it assigns the string “dataset1.txt” to the n argument and the results R object to the flnm argument, which is not what was intended.
If we use named arguments, then this runs without any errors:
read_data_file_v2(flnm = "dataset1.txt",results = results)
dataset N mean median var
1 dataset1.txt 26 0.09762111 0.2198957 0.5974116
2 <NA> 26 NA NA NA
3 <NA> 26 NA NA NA
4 <NA> 26 NA NA NA
5 <NA> 26 NA NA NA
6 <NA> 26 NA NA NA
In this case, note that n took on the default value of 1.
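These matching rules can be checked with a toy function that has the same argument order. Here show_args is a hypothetical function, introduced only for illustration, that reports which value landed in which argument without reading any files.

```r
# Hypothetical function with the same argument order as read_data_file_v2
show_args <- function(n = 1, flnm = "dataset1.txt", results) {
  list(n = n, flnm = flnm, results_supplied = !missing(results))
}

# Unnamed arguments are matched by position: the file name lands in n,
# the second value lands in flnm, and results is left unsupplied
by_position <- show_args("dataset1.txt", "a results table")
by_position$n
by_position$flnm
by_position$results_supplied

# Named arguments are matched by name, so n keeps its default of 1
by_name <- show_args(flnm = "dataset1.txt", results = "a results table")
by_name$n
by_name$flnm
by_name$results_supplied
```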
16.3.9 Bonus question: Write a more concise function
Instead of inserting item by item, write a more concise function by putting all the data in a one-row data frame, and then insert the one-row data frame into the appropriate row of the pre-allocated results data frame.
Here we set up a data frame containing a new row of data.
read_data_file_v3 <- function(n=1, results) {
  fl1 <- read.table(paste0("data/dataset",n,".txt"),sep="\t",header=TRUE)
  NewRow <- data.frame(dataset = paste0("dataset",n,".txt"),
                       N = nrow(fl1),
                       mean = mean(fl1$trait),
                       median = median(fl1$trait),
                       var = var(fl1$trait)
  )
  results[n,] <- NewRow
  results
}
read_data_file_v3(1, results)
dataset N mean median var
1 dataset1.txt 26 0.09762111 0.2198957 0.5974116
2 <NA> NA NA NA NA
3 <NA> NA NA NA NA
4 <NA> NA NA NA NA
5 <NA> NA NA NA NA
6 <NA> NA NA NA NA
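A related pattern, not used in this exercise, skips the pre-allocated table altogether: build one row per data set with lapply and then stack the rows with rbind. The sketch below applies it to made-up in-memory data frames rather than files; one_row and toy_data are hypothetical names introduced for illustration.

```r
# Hypothetical helper: one summary row for a data frame with a 'trait' column
one_row <- function(label, fl) {
  data.frame(dataset = label,
             N = nrow(fl),
             mean = mean(fl$trait),
             median = median(fl$trait),
             var = var(fl$trait))
}

# Made-up data sets of different sizes
toy_data <- list(
  toy1 = data.frame(trait = c(1, 2, 3)),
  toy2 = data.frame(trait = c(10, 20, 30, 40))
)

# Build one row per data set, then stack the rows into a single table
rows <- lapply(names(toy_data), function(nm) one_row(nm, toy_data[[nm]]))
stacked <- do.call(rbind, rows)
stacked
```

Since each row is built independently, there is no row index to get wrong, at the cost of not knowing the table's size up front.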