Efficient Coding Practices

Learn best practices for creating efficient code in R, including parallelization and vectorization.

We'll cover the following...

Avoiding unnecessary looping
Vectorization
Optimizing data structures and functions
Memory management
Parallelization
Conclusion

Looping over data can be slow and inefficient in R. In many cases, we can avoid unnecessary looping and perform the same operation more quickly. One way to do this is to lean on built-in functions and packages designed to handle large datasets, such as the tidyverse packages (dplyr in particular) or data.table.

Whenever we’re tempted to build a loop, our first instinct should be to look for a function that already handles the requirement more efficiently. Doing so can have a dramatic impact on our code’s performance! A loop is especially tempting when a new requirement involves manipulating a set of rows or columns, but in most cases a tidyverse function already handles the need efficiently; it’s just a matter of finding it.
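As a small illustration of the "set of columns" case, the sketch below computes the mean of every numeric column in iris twice: once with a loop over column names and once with dplyr's across(). The object names (iris_tbl, numeric_cols, loop_means, across_means) are hypothetical and chosen only for this example, and the code assumes dplyr 1.0.0 or later, where across() is available.

```r
# Sketch: column-wise means with a loop versus dplyr::across()
library(dplyr, warn.conflicts = FALSE)
library(tibble)

iris_tbl <- as_tibble(iris)  # hypothetical name, for illustration only

# Loop version: visit each numeric column by name
numeric_cols <- names(iris_tbl)[sapply(iris_tbl, is.numeric)]
loop_means <- numeric(0)
for (col in numeric_cols) {
  loop_means[col] <- mean(iris_tbl[[col]])
}

# dplyr version: across() applies mean() to every numeric column in one call
across_means <- iris_tbl %>%
  summarize(across(where(is.numeric), mean))

print(loop_means)
print(across_means)
```

The loop needs bookkeeping (selecting columns, growing a result vector), while the across() call states the intent directly: apply mean to every numeric column.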

#load tidyverse libraries
library(ggplot2)
library(purrr)
library(tibble)
library(dplyr, warn.conflicts = FALSE)
library(tidyr)
library(stringr)
library(readr)
#Bad example: Manipulating data with a for loop
#Create an empty list to store the results
VAR_IrisData <- as_tibble(iris)
OUT_ResultsFor <- list()
#Iterate over each species in the iris dataset
for (VAR_CurrSpecies in unique(VAR_IrisData$Species)) {
  
  #Subset the data for the current species
  VAR_SubsetData <- VAR_IrisData[VAR_IrisData$Species == VAR_CurrSpecies, ]
  
  #Calculate the mean of Sepal.Length for the current species
  VAR_MeanSepalLength <- mean(VAR_SubsetData$Sepal.Length)
  
  #Add the result to the list
  OUT_ResultsFor[[VAR_CurrSpecies]] <- VAR_MeanSepalLength
}
#Print the results for looping
print("Looping results")
print(OUT_ResultsFor)
#Good example: Manipulating data with tidyverse functions
#Use group_by and summarize functions from dplyr
OUT_ResultsTidy <- VAR_IrisData %>%
                      group_by(Species) %>%
                      summarize(MeanSepalLength = mean(Sepal.Length))
#Print the results for tidyverse functions
print("Tidyverse results")
print(OUT_ResultsTidy)

In this example, we present two solutions for calculating the mean Sepal.Length for each flower species in the iris dataset. The first, less ideal, solution uses a for loop. The second, much better, solution uses the dplyr group_by and summarize functions.

Line 12: Create a list object, OUT_ResultsFor, to store the results of the mean Sepal.Length calculation.
Line 17: Filter the dataset to the Species currently being looped over, storing it in VAR_SubsetData.
Line 20: Calculate the mean Sepal.Length for the current Species, storing it in VAR_MeanSepalLength.
Line 23: Add the result to our list object, OUT_ResultsFor.
Lines 30–32: Carry out the same calculation, mean Sepal.Length by Species, using the dplyr functions group_by and summarize, which remove the need for looping.

From this code example, notice that the two outputs are similar but not identical. The for loop returns a list where each element is the mean Sepal.Length for a particular flower species, while the tidyverse approach returns a tibble containing the same results. Also note that the printing of the tibble defaults to showing two decimal ...
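One way to confirm that the two approaches hold the same values despite the different container types is to flatten both into named numeric vectors and compare them. This is a minimal sketch, not part of the original lesson code; the names iris_tbl, loop_result, tidy_result, loop_values, and tidy_values are hypothetical.

```r
# Sketch: checking that the loop and dplyr results agree on the values
library(dplyr, warn.conflicts = FALSE)
library(tibble)

iris_tbl <- as_tibble(iris)  # hypothetical name, for illustration only

# Loop result: a named list with one element per species
loop_result <- list()
for (species in levels(iris_tbl$Species)) {
  loop_result[[species]] <-
    mean(iris_tbl$Sepal.Length[iris_tbl$Species == species])
}

# dplyr result: a tibble with one row per species
tidy_result <- iris_tbl %>%
  group_by(Species) %>%
  summarize(MeanSepalLength = mean(Sepal.Length))

# Flatten both into named numeric vectors so they can be compared directly
loop_values <- unlist(loop_result)
tidy_values <- setNames(tidy_result$MeanSepalLength,
                        as.character(tidy_result$Species))

print(all.equal(loop_values, tidy_values))
```

Pulling the column out of the tibble, as tidy_values does, also gives access to the stored values themselves rather than the rounded display used when a tibble is printed.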
