11  R Factors Exercise

11.1 Acknowledgements

This chapter is a slightly modified version of the Understanding Factors chapter from Programming with R by Software Carpentry, which is made available under the Creative Commons Attribution 4.0 license.

This chapter is made available according to the same license.

11.2 Objectives

  • Understand how to represent categorical data in R.
  • Know the difference between ordered and unordered factors.
  • Be aware of some of the problems encountered when using factors.

11.3 Questions

  • How is categorical data represented in R?
  • How do I work with factors?

11.4 Factors

Factors are used to represent categorical data. Factors can be ordered or unordered and are an important class for statistical analysis and for plotting.

Factors are stored as integers, and have labels associated with these unique integers. While factors look (and often behave) like character vectors, they are actually integers under the hood, and you need to be careful when treating them like strings.

Once created, factors can only contain a pre-defined set of values, known as levels. By default, R always sorts levels in alphabetical order. For instance, if you have a factor with 2 levels:

The factor() Command

The factor() command is used to create and modify factors in R:
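A minimal sketch of such a two-level factor is shown here. The variable name sex and the four values are assumptions, chosen only to match the description in the surrounding text (two levels, with "male" as the first element).

```r
# Hypothetical example vector: two distinct values, with "male" first.
sex <- factor(c("male", "female", "female", "male"))

# Printing a factor shows its values followed by its levels.
print(sex)

# The underlying integer codes, one per element.
as.integer(sex)
```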

R will assign 1 to the level "female" and 2 to the level "male" (because f comes before m, even though the first element in this vector is "male"). You can check this by using the function levels(), and check the number of levels using nlevels():
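The two checks mentioned above might look like this, again assuming the hypothetical sex vector:

```r
# Hypothetical example vector, as described in the text.
sex <- factor(c("male", "female", "female", "male"))

levels(sex)   # "female" "male" (alphabetical, not order of appearance)
nlevels(sex)  # 2
```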

Sometimes the order of the levels does not matter. At other times you might want to specify the order because it is meaningful (e.g., “low”, “medium”, “high”) or because it is required by a particular type of analysis. Additionally, specifying the order of the levels allows us to compare levels:
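As a sketch of how the order can be specified, the following uses a hypothetical vector named food (the name and values are illustrative assumptions). The levels argument sets the order, and ordered = TRUE tells R that the order is meaningful, which is what makes comparisons possible.

```r
# Hypothetical data: the default level order is alphabetical.
food <- factor(c("low", "high", "medium", "high", "low", "medium", "high"))
levels(food)  # "high" "low" "medium"

# Specify the level order explicitly (still an unordered factor).
food <- factor(food, levels = c("low", "medium", "high"))
levels(food)  # "low" "medium" "high"

# Calling min(food) at this point would raise an error, because
# comparisons are not meaningful for unordered factors.

# Declare the factor as ordered so that levels can be compared.
food <- factor(food, levels = c("low", "medium", "high"), ordered = TRUE)
min(food)          # low
food[1] < food[2]  # TRUE, because "low" comes before "high"
```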

In R’s memory, these factors are represented by numbers (1, 2, 3). They are better than simple integer labels because factors are self-describing: "low", "medium", and "high" are more descriptive than 1, 2, 3. Which is low? You wouldn’t be able to tell with just integer data. Factors have this information built in. It is particularly helpful when there are many levels (like the subjects in our example data set).

11.5 Challenge: Representing Data in R

You have a vector representing levels of exercise undertaken by 5 subjects:

“L”, “N”, “N”, “I”, “L” (N = none, L = light, I = intense)

What is the best way to represent this in R?

  1. exercise <- c("L", "N", "N", "I", "L")

  2. exercise <- factor(c("L", "N", "N", "I", "L"), ordered = TRUE)

  3. exercise <- factor(c("L", "N", "N", "I", "L"), levels = c("N", "L", "I"), ordered = FALSE)

  4. exercise <- factor(c("L", "N", "N", "I", "L"), levels = c("N", "L", "I"), ordered = TRUE)

The correct solution is option 4.

We only expect three categories (“N”, “L”, “I”). We can order these from least intense to most intense, so let’s use ordered.
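To see what the fourth option gives us, here is a short sketch. The comparison at the end is an added illustration of why the ordering is useful, not part of the original challenge.

```r
# The fourth option: explicit levels, from least to most intense, and ordered.
exercise <- factor(c("L", "N", "N", "I", "L"),
                   levels = c("N", "L", "I"),
                   ordered = TRUE)

# Printing shows the ordering of the levels: N < L < I.
print(exercise)

# Counts are reported in level order rather than alphabetical order.
table(exercise)

# Because the factor is ordered, we can ask which subjects did at
# least light exercise.
exercise >= "L"  # TRUE FALSE FALSE TRUE TRUE
```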

11.5.1 Converting Factors

Converting from a factor to a number can cause problems:
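A minimal sketch of the problem, assuming a hypothetical factor f built from three numbers:

```r
# Hypothetical factor created from numeric values.
f <- factor(c(3.4, 1.2, 5))

# as.numeric() returns the underlying integer codes, not the original values.
as.numeric(f)  # 2 1 3
```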

This does not behave as expected (and there is no warning).

The recommended way is to use the integer vector to index the factor levels:
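Using the same hypothetical factor f, indexing the levels might look like this:

```r
# Hypothetical factor created from numeric values.
f <- factor(c(3.4, 1.2, 5))

# Use the factor's integer codes to look up the matching level labels.
levels(f)[f]  # "3.4" "1.2" "5"
```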

This returns a character vector, so the as.numeric() function is still required to convert the values to the proper type (numeric).
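Putting both steps together for the hypothetical factor f gives a sketch of the full conversion:

```r
# Hypothetical factor created from numeric values.
f <- factor(c(3.4, 1.2, 5))

# Look up the level labels first, then convert the labels to numbers.
f_numeric <- as.numeric(levels(f)[f])
f_numeric  # 3.4 1.2 5.0
```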

11.5.2 Using Factors

Let’s load our example data to see the use of factors:

Default Behavior

stringsAsFactors = TRUE was the default behavior in R prior to version 4.0. From version 4.0 onward the default is stringsAsFactors = FALSE, so we set the argument explicitly here to override that default and to make the behavior clear.
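The chapter's data file is not reproduced here, so the following is only a sketch that reads a small, made-up table from a string. The column names ID and BloodPressure and all of the values are assumptions for illustration; only the use of read.csv() with stringsAsFactors = TRUE reflects the text.

```r
# Hypothetical CSV content standing in for the chapter's example data file.
csv_text <- "ID,Gender,Group,BloodPressure
Sub001,m,Control,132
Sub002,f,Treatment1,128
Sub003,F,Treatment2,141
Sub004,M,Control,117"

# Read the text as if it were a file, converting text columns to factors.
dat <- read.csv(text = csv_text, stringsAsFactors = TRUE)

# Inspect the structure: text columns are factors, numbers stay numeric.
str(dat)
sapply(dat, class)
```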

Notice the first 3 columns have been converted to factors. These values were text in the data file so R automatically interpreted them as categorical variables.

Notice that the summary() function handles factors differently from numbers (and strings): for a factor it reports the number of occurrences of each value, which is often more useful information.

The summary() Function

The summary() function is a great way of spotting errors in your data (look at the dat$Gender column). It’s also a great way of spotting missing data.
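As a small illustration, assuming made-up vectors named gender and age, summary() reports counts per level for a factor, including a separate count of missing values, and quartiles for numbers:

```r
# Hypothetical data with one miscoded value ("M") and one missing value.
gender <- factor(c("m", "f", "M", "f", NA))
age <- c(34, 27, 45, 51, 29)

# For a factor: one count per level, plus a count of NA values.
summary(gender)

# For numbers: minimum, quartiles, mean, and maximum.
summary(age)

# For plain strings: only length, class, and mode.
summary(as.character(gender))
```

The stray "M" level with a count of 1 is the kind of coding error that stands out in this output.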

11.6 Challenge: Reordering Factors

The function table() tabulates observations and can be used to create bar plots quickly. For instance:
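A sketch of this, using a hypothetical group vector in place of dat$Group (the group names and values are assumptions):

```r
# Hypothetical group assignments standing in for dat$Group.
group <- factor(c("Control", "Treatment1", "Treatment2", "Treatment1",
                  "Control", "Treatment2", "Treatment2"))

# Count the observations in each level.
counts <- table(group)
counts

# Draw one bar per level, in level order.
barplot(counts)
```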

Use the factor() command to modify the column dat$Group so that the control group is plotted last.
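One possible approach, sketched on a hypothetical group vector rather than the real dat$Group column (the level names are assumptions), is to pass the desired order to the levels argument:

```r
# Hypothetical group assignments standing in for dat$Group.
group <- factor(c("Control", "Treatment1", "Treatment2", "Treatment1", "Control"))
levels(group)  # "Control" "Treatment1" "Treatment2"

# Re-create the factor with the control group listed last.
group <- factor(group, levels = c("Treatment1", "Treatment2", "Control"))
levels(group)  # "Treatment1" "Treatment2" "Control"

# Bars are drawn in level order, so the control group now appears last.
barplot(table(group))
```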

11.6.1 Removing Levels from a Factor

Some of the Gender values in our dataset have been coded incorrectly. Let’s remove levels from this factor.

Values should have been recorded as lowercase ‘m’ and ‘f’. We should correct this.
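A minimal sketch of the correction for uppercase "M", assuming a made-up gender vector in place of dat$Gender:

```r
# Hypothetical values standing in for dat$Gender, with miscoded entries.
gender <- factor(c("m", "f", "F", "M", "m", "f"))

# Replace every "M" with "m". This works because "m" is already a level.
gender[gender == "M"] <- "m"

# "M" now has a count of zero, but it is still listed as a level.
table(gender)
```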

11.7 Challenge: Updating Factors

Why does this plot show 4 levels?

How many levels does dat$Gender have?

dat$Gender has 4 levels, so the plot shows 4 levels.

We need to tell R that “M” is no longer a valid value for this column. We use the droplevels() function to remove extra levels.
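Continuing with a hypothetical gender vector in place of dat$Gender, dropping the unused level might look like this:

```r
# Hypothetical values standing in for dat$Gender.
gender <- factor(c("m", "f", "F", "M", "m", "f"))
gender[gender == "M"] <- "m"
nlevels(gender)  # still 4: "M" is unused but remains a level

# Remove levels that no longer occur in the data.
gender <- droplevels(gender)
nlevels(gender)  # 3
levels(gender)
```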

Adjusting Factor Levels

Adjusting the levels() of a factor provides a useful shortcut for reassigning values in this case.
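As a sketch of this shortcut, again on a made-up gender vector, renaming the level "F" to "f" merges it with the existing "f" level and recodes every affected value in one step:

```r
# Hypothetical values standing in for dat$Gender, with "F" miscoded.
gender <- factor(c("m", "f", "F", "m", "f"))

# Rename the "F" level to "f". Levels with the same name are merged.
levels(gender)[levels(gender) == "F"] <- "f"

levels(gender)
table(gender)
```

Selecting the level by name rather than by position avoids relying on how the levels happen to be sorted.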

  • Factors are used to represent categorical data.
  • Factors can be ordered or unordered.
  • Some R functions have special methods for handling factors.