16  R Functions Exercise

16.1 Load Libraries

library(tidyverse)
# library(tidylog)

16.2 Data set creation code

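# Simulate ten data sets, each with 26 people and a standard normal trait.
# Creating the 'data' directory here is an added safeguard for a fresh project.
dir.create("data", showWarnings = FALSE)
for (i in 1:10) {
  # Person identifiers: name<i>_a, name<i>_b, ..., name<i>_z
  b <- data.frame(name = paste0("name", i, "_", letters))
  b$trait <- rnorm(26)
  write_tsv(b, paste0("data/dataset", i, ".txt"))
}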

16.3 Example

Here we have been sent three data sets, in files that contain the quantitative trait values for each person in the data set:

“dataset1.txt” “dataset2.txt” “dataset3.txt”

And we’ve been asked to make a table that gives, for each dataset, the sample size (N), the mean of the trait, the median, and the variance.

We could do this by reading in each data set, one by one, as follows:

results <- data.frame(dataset=rep(NA,3),N=NA, mean=NA, median=NA, var=NA)
fl1 <- read.table("data/dataset1.txt",sep="\t",header=TRUE)
results$dataset[1] <- "dataset1"
results$N <- nrow(fl1)
results$mean[1] <- mean(fl1$trait)
results$median[1] <- median(fl1$trait)
results$var[1] <- var(fl1$trait)
results
   dataset  N       mean    median       var
1 dataset1 26 0.09762111 0.2198957 0.5974116
2     <NA> 26         NA        NA        NA
3     <NA> 26         NA        NA        NA
fl2 <- read.table("data/dataset2.txt",sep="\t",header=TRUE)
results$dataset[2] <- "dataset2"
results$N <- nrow(fl2)
results$mean[2] <- mean(fl2$trait)
results$median[2] <- median(fl2$trait)
results$var[2] <- var(fl2$trait)
results
   dataset  N       mean    median       var
1 dataset1 26 0.09762111 0.2198957 0.5974116
2 dataset2 26 0.43486401 0.3558736 1.0936651
3     <NA> 26         NA        NA        NA
fl3 <- read.table("data/dataset3.txt",sep="\t",header=TRUE)
results$dataset[3] <- "dataset3"
results$N <- nrow(fl3)
results$mean[3] <- mean(fl3$trait)
results$median[3] <- median(fl3$trait)
results$var[3] <- var(fl3$trait)
results
   dataset  N       mean    median       var
1 dataset1 26 0.09762111 0.2198957 0.5974116
2 dataset2 26 0.43486401 0.3558736 1.0936651
3 dataset3 26 0.07508335 0.0445614 0.7950574

Your colleague initially sent you the three data sets above, but has now sent you three more and asked you to update the ‘results’ table.

As you can see, the code above is very repetitive. So let’s automate this by writing a function that loops through a list of data set files named “dataset1.txt”, “dataset2.txt”, “dataset3.txt”, etc., building up the results table as above.

16.3.1 Question: How could we construct a list of file names?

Note

This Run code WebR chunk needs to be run first, before the later ones, as it downloads and reads in the required data files. The WebR chunks should be run in order, as you encounter them, from beginning to end.

We now have the files “dataset1.txt”, “dataset2.txt”, “dataset3.txt”, …, “dataset6.txt” in the ‘data’ directory.

Question: How could we construct a list of file names?

Hint: the list.files command provides a handy way to get a list of the input files:

fls <- list.files(path="data",pattern="^dataset[0-9]+\\.txt$")
fls
[1] "dataset1.txt" "dataset2.txt" "dataset3.txt" "dataset4.txt" "dataset5.txt"
[6] "dataset6.txt"
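
One detail worth knowing is that the pattern argument of list.files is a regular expression, not a shell-style wildcard. The sketch below illustrates the difference in a temporary directory; the file names are made up for illustration and are not part of the exercise data.

```r
# Work in a throwaway directory so nothing in 'data' is touched.
tmp <- file.path(tempdir(), "list_files_demo")
dir.create(tmp, showWarnings = FALSE)
# Hypothetical file names, for illustration only.
invisible(file.create(file.path(tmp, c("dataset1.txt", "dataset2.txt",
                                       "dataset_notes.md", "old_dataset1.txt"))))

# A bare prefix matches anywhere in the name, so all four files are returned.
loose <- list.files(path = tmp, pattern = "dataset")

# An anchored regular expression keeps only dataset<number>.txt.
strict <- list.files(path = tmp, pattern = "^dataset[0-9]+\\.txt$")

# glob2rx() translates a shell-style wildcard into the equivalent regex.
globbed <- list.files(path = tmp, pattern = glob2rx("dataset*.txt"))

strict
```

Anchoring the pattern with ^ and $ guards against picking up stray files that merely contain the word "dataset".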

16.3.2 Question: Outline a possible algorithm

Outline a possible algorithm that loops through a list of input data set files named “dataset1.txt”, “dataset2.txt”, “dataset3.txt”, etc., building up the results table as above.

  • Read in the input file names into a list
  • Set up an empty results table
  • For each file in our file name list
    • Read the file
    • Compute the statistics
    • Insert the information into the results table
  • Return the filled-in results table

16.3.3 Question: Construct a more detailed step-by-step algorithm.

Construct a more detailed step-by-step algorithm.

  • Input the path to the folder containing the data files
  • Read in the input file names into a list fls
  • Count the number of input files N
  • Set up an empty results table with N rows
  • For each file in our file name list fls
    • Read the file
    • Compute the statistics
    • Insert the information into the correct row of the results table
  • Return the filled-in results table
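
The steps above can be sketched directly in R. This is one possible reading of the algorithm rather than the definitive solution: summarise_folder is a hypothetical name, n_files stands in for the file count N to avoid clashing with the N column, and the two small input files contain made-up values written to a temporary folder.

```r
# Create two small tab-separated files in a temporary folder (made-up values).
demo_dir <- file.path(tempdir(), "algorithm_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.table(data.frame(name = c("a", "b", "c"), trait = c(1, 2, 6)),
            file.path(demo_dir, "dataset1.txt"),
            sep = "\t", row.names = FALSE, quote = FALSE)
write.table(data.frame(name = c("d", "e"), trait = c(4, 8)),
            file.path(demo_dir, "dataset2.txt"),
            sep = "\t", row.names = FALSE, quote = FALSE)

# summarise_folder() is a hypothetical name used for this sketch only.
summarise_folder <- function(path) {
  # Read the input file names into fls and count them
  fls <- list.files(path = path, pattern = "^dataset[0-9]+\\.txt$")
  n_files <- length(fls)
  # Set up an empty results table with one row per file
  results <- data.frame(dataset = rep(NA, n_files), N = NA,
                        mean = NA, median = NA, var = NA)
  for (k in seq_along(fls)) {
    # Read the file
    d <- read.table(file.path(path, fls[k]), sep = "\t", header = TRUE)
    # Compute the statistics and insert them into row k
    results$dataset[k] <- fls[k]
    results$N[k] <- nrow(d)
    results$mean[k] <- mean(d$trait)
    results$median[k] <- median(d$trait)
    results$var[k] <- var(d$trait)
  }
  # Return the filled-in results table
  results
}

demo_results <- summarise_folder(demo_dir)
demo_results
```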

16.3.4 Task: Write a read_data_file function.

Write a read_data_file function to accomplish the required steps for a single input data file.

  1. Make the number in the data file name an argument.

Here we make the number in the data file name an argument.

results <- data.frame(dataset=rep(NA,6),N=NA, mean=NA, median=NA, var=NA)
read_data_file <- function(n=1, results) {
  fl1 <- read.table(paste0("data/dataset",n,".txt"),sep="\t",header=TRUE)
  results$dataset[n] <- paste0("dataset",n,".txt")
  results$N <- nrow(fl1)
  results$mean[n] <- mean(fl1$trait)
  results$median[n] <- median(fl1$trait)
  results$var[n] <- var(fl1$trait)
  results
}
(results <- read_data_file(n=1, results))
       dataset  N       mean    median       var
1 dataset1.txt 26 0.09762111 0.2198957 0.5974116
2         <NA> 26         NA        NA        NA
3         <NA> 26         NA        NA        NA
4         <NA> 26         NA        NA        NA
5         <NA> 26         NA        NA        NA
6         <NA> 26         NA        NA        NA
(results <- read_data_file(n=2, results))
       dataset  N       mean    median       var
1 dataset1.txt 26 0.09762111 0.2198957 0.5974116
2 dataset2.txt 26 0.43486401 0.3558736 1.0936651
3         <NA> 26         NA        NA        NA
4         <NA> 26         NA        NA        NA
5         <NA> 26         NA        NA        NA
6         <NA> 26         NA        NA        NA

  2. Make the path to the input file an argument to your read_data_file function.

Here we make the path to the input file an argument.

read_data_file_v2 <- function(n=1, flnm="dataset1.txt", results) {
  fl1 <- read.table(paste0("data/",flnm),sep="\t",header=TRUE)
  results$dataset[n] <- flnm
  results$N <- nrow(fl1)
  results$mean[n] <- mean(fl1$trait)
  results$median[n] <- median(fl1$trait)
  results$var[n] <- var(fl1$trait)
  results
}
(results <- read_data_file_v2(n=1, flnm = "dataset1.txt", results))
       dataset  N       mean    median       var
1 dataset1.txt 26 0.09762111 0.2198957 0.5974116
2 dataset2.txt 26 0.43486401 0.3558736 1.0936651
3         <NA> 26         NA        NA        NA
4         <NA> 26         NA        NA        NA
5         <NA> 26         NA        NA        NA
6         <NA> 26         NA        NA        NA
(results <- read_data_file_v2(n=2, flnm = "dataset2.txt", results))
       dataset  N       mean    median       var
1 dataset1.txt 26 0.09762111 0.2198957 0.5974116
2 dataset2.txt 26 0.43486401 0.3558736 1.0936651
3         <NA> 26         NA        NA        NA
4         <NA> 26         NA        NA        NA
5         <NA> 26         NA        NA        NA
6         <NA> 26         NA        NA        NA

16.3.5 Question: What does the above code assume?

What does the above code assume?

Assumes a file naming style of ‘dataset*.txt’ where the asterisk represents 1, 2, 3, …

Assumes the files are in the “data” folder.
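
The code also implicitly assumes that each file is tab-separated, has a header row, and contains a numeric column named trait. A minimal sketch of how such assumptions could be checked before computing anything is shown below; check_data_file is a hypothetical helper, not part of the exercise, and the demonstration file holds made-up values in a temporary folder.

```r
# check_data_file() is a hypothetical helper, not part of the exercise.
check_data_file <- function(flnm, dir = "data") {
  path <- file.path(dir, flnm)
  if (!file.exists(path)) {
    stop("File not found: ", path)
  }
  d <- read.table(path, sep = "\t", header = TRUE)
  if (!"trait" %in% names(d)) {
    stop("No 'trait' column in ", path)
  }
  if (!is.numeric(d$trait)) {
    stop("'trait' is not numeric in ", path)
  }
  # Return the data invisibly so the check can be chained with later steps.
  invisible(d)
}

# Demonstration on a temporary file with made-up values.
demo_dir <- file.path(tempdir(), "check_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.table(data.frame(name = c("a", "b"), trait = c(0.5, 1.5)),
            file.path(demo_dir, "dataset1.txt"),
            sep = "\t", row.names = FALSE, quote = FALSE)

ok <- check_data_file("dataset1.txt", dir = demo_dir)
# A missing file is reported with an informative message.
missing_msg <- tryCatch(check_data_file("dataset99.txt", dir = demo_dir),
                        error = function(e) conditionMessage(e))
missing_msg
```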

16.3.6 Question: Extend your function to process all of the files

The above function read_data_file processes one file at a time. How would you write a function that loops over all of our files?

fls <- list.files(path="data",pattern="^dataset[0-9]+\\.txt$")

loop_over_dataset <- function(fls) {
  # Input: the list of file names
  # Output: the 'results' table
  # Note: read_data_file builds each file name from n, so fls is used
  # here only to count the files
  # Count the number of data set file names in fls
  n_datasets <- length(fls)
  # Set up a results dataframe with n_datasets rows
  results <- data.frame(dataset=rep(NA,n_datasets),N=NA, mean=NA, median=NA, var=NA)
  for (n in seq_len(n_datasets)) {
    results <- read_data_file(n=n, results=results)
  }
  return(results)
}

loop_over_dataset(fls = fls)
       dataset  N        mean      median       var
1 dataset1.txt 26  0.09762111  0.21989574 0.5974116
2 dataset2.txt 26  0.43486401  0.35587359 1.0936651
3 dataset3.txt 26  0.07508335  0.04456140 0.7950574
4 dataset4.txt 26  0.06259720  0.04813915 0.9186042
5 dataset5.txt 26 -0.09288522 -0.19155759 0.9978161
6 dataset6.txt 26 -0.20266667 -0.23845426 1.5605823

16.3.7 Bonus question: Can you find a subtle mistake in the read_data_file function?

results <- data.frame(dataset=rep(NA,6),N=NA, mean=NA, median=NA, var=NA)
read_data_file <- function(n=1, results) {
  fl1 <- read.table(paste0("data/dataset",n,".txt"),sep="\t",header=TRUE)
  results$dataset[n] <- paste0("dataset",n,".txt")
  results$N <- nrow(fl1)
  results$mean[n] <- mean(fl1$trait)
  results$median[n] <- median(fl1$trait)
  results$var[n] <- var(fl1$trait)
  invisible(results)
}

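If N varies across the data sets, then this line will not do the right thing, because it overwrites the entire N column with the row count of the most recently read file:

results$N <- nrow(fl1)

Instead, this line should be

results$N[n] <- nrow(fl1)

so that only row n is updated. The corrected function is:

results <- data.frame(dataset=rep(NA,6),N=NA, mean=NA, median=NA, var=NA)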
read_data_file <- function(n=1, results) {
  fl1 <- read.table(paste0("data/dataset",n,".txt"),sep="\t",header=TRUE)
  results$dataset[n] <- paste0("dataset",n,".txt")
  results$N[n] <- nrow(fl1)
  results$mean[n] <- mean(fl1$trait)
  results$median[n] <- median(fl1$trait)
  results$var[n] <- var(fl1$trait)
  invisible(results)
}
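
The difference between the two forms of assignment can be seen on a small pre-allocated table. The sketch below uses a three-row table and made-up sample sizes of 26 and 30, purely for illustration.

```r
# A small pre-allocated table, as in the exercise but with three rows.
tab <- data.frame(dataset = rep(NA, 3), N = NA)

# Whole-column assignment: the single value is recycled into every row.
whole <- tab
whole$N <- 26
whole$N <- 30   # a later file with 30 rows overwrites all three entries

# Element assignment: only the indexed row changes.
by_row <- tab
by_row$N[1] <- 26
by_row$N[2] <- 30

whole$N   # 30 30 30
by_row$N  # 26 30 NA
```

With whole-column assignment, the table only looks right when every data set happens to have the same number of rows, which is why the mistake is easy to miss here.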

16.3.8 Bonus question: Why does this end in an error?

read_data_file_v2("dataset1.txt",results)
Error in file(file, "rt"): invalid 'description' argument

The read_data_file_v2 function’s arguments are n, flnm, and results.

When we call it in this manner:

read_data_file_v2("dataset1.txt",results)

we are calling it using unnamed arguments, so they are matched by position. That means the string “dataset1.txt” is assigned to the n argument and the results data frame to the flnm argument, which is not what was intended.

If we use named arguments, then this runs without any errors:

read_data_file_v2(flnm = "dataset1.txt",results = results)
       dataset  N       mean    median       var
1 dataset1.txt 26 0.09762111 0.2198957 0.5974116
2         <NA> 26         NA        NA        NA
3         <NA> 26         NA        NA        NA
4         <NA> 26         NA        NA        NA
5         <NA> 26         NA        NA        NA
6         <NA> 26         NA        NA        NA

In this case, note that n took on the default value of 1.
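
The same matching rules can be seen in isolation with a toy function. In the sketch below, describe_call is a hypothetical function that has the same argument order as read_data_file_v2 but only reports what each argument received.

```r
# describe_call() is a hypothetical function with the same argument order
# as read_data_file_v2: n, then flnm, then results.
describe_call <- function(n = 1, flnm = "dataset1.txt", results = NULL) {
  paste0("n=", n, "; flnm=", flnm)
}

# Positional matching: the first unnamed value goes to n, not to flnm.
by_position <- describe_call("dataset2.txt")

# Named matching: flnm receives the value and n keeps its default of 1.
by_name <- describe_call(flnm = "dataset2.txt")

by_position  # "n=dataset2.txt; flnm=dataset1.txt"
by_name      # "n=1; flnm=dataset2.txt"
```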

16.3.9 Bonus question: Write a more concise function

Instead of inserting item by item, write a more concise function by putting all the data in a one-row data frame, and then inserting the one-row data frame into the appropriate row of the pre-allocated results data frame.

Here we set up a data frame containing a new row of data.

read_data_file_v3 <- function(n=1, results) {
  fl1 <- read.table(paste0("data/dataset",n,".txt"),sep="\t",header=TRUE)
  NewRow <- data.frame(
    dataset = paste0("dataset",n,".txt"),
    N = nrow(fl1),
    mean = mean(fl1$trait),
    median = median(fl1$trait),
    var = var(fl1$trait)
  )
  results[n,] <- NewRow
  results
}
read_data_file_v3(1, results)
       dataset  N       mean    median       var
1 dataset1.txt 26 0.09762111 0.2198957 0.5974116
2         <NA> NA         NA        NA        NA
3         <NA> NA         NA        NA        NA
4         <NA> NA         NA        NA        NA
5         <NA> NA         NA        NA        NA
6         <NA> NA         NA        NA        NA
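
Once each file is summarised as a one-row data frame, a related option is to skip pre-allocation and stack the rows instead. The sketch below is an alternative under stated assumptions rather than the exercise's intended answer: summarise_one is a hypothetical helper, and the two input files contain made-up values written to a temporary folder.

```r
# Two small tab-separated files with made-up values, in a temporary folder.
demo_dir <- file.path(tempdir(), "rbind_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.table(data.frame(name = c("a", "b", "c"), trait = c(1, 2, 6)),
            file.path(demo_dir, "dataset1.txt"),
            sep = "\t", row.names = FALSE, quote = FALSE)
write.table(data.frame(name = c("d", "e"), trait = c(4, 8)),
            file.path(demo_dir, "dataset2.txt"),
            sep = "\t", row.names = FALSE, quote = FALSE)

# summarise_one() is a hypothetical helper returning a one-row data frame.
summarise_one <- function(path) {
  d <- read.table(path, sep = "\t", header = TRUE)
  data.frame(dataset = basename(path),
             N = nrow(d),
             mean = mean(d$trait),
             median = median(d$trait),
             var = var(d$trait))
}

# full.names = TRUE returns paths that can be passed straight to read.table.
paths <- list.files(demo_dir, pattern = "^dataset[0-9]+\\.txt$",
                    full.names = TRUE)
# One one-row data frame per file, then stack them into a single table.
rows <- lapply(paths, summarise_one)
stacked <- do.call(rbind, rows)
stacked
```

Because each row is built from the file's own path, this version does not depend on the files being numbered consecutively from 1.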