Embarrassingly Parallel Computing with doAzureParallel

Why?

  1. You want to run 100 regressions; each takes one hour, and the only difference between them is the data set it uses. This is an embarrassingly parallel problem.
  2. For whatever reason, you want to use Azure instead of Google Compute Engine…

Before you start

I will assume that:

  • you have an Azure account,
  • you have correctly installed and configured doAzureParallel (see the install sketch just below).
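
If you still need to install it, the package is distributed on GitHub. A minimal sketch, assuming the install instructions in the doAzureParallel README have not changed (it also depends on the rAzureBatch package):

install.packages("devtools")
devtools::install_github("Azure/rAzureBatch")
devtools::install_github("Azure/doAzureParallel")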

Create some fake data

library(dplyr)
library(stringr)
library(future)     # futures let the 100 files be written in parallel
plan(multisession)  # run the futures in separate local R sessions

set.seed(12618)
n <- 10000
if (!dir.exists("data")) dir.create("data")  # saveRDS() below needs this folder

fakeData <- list()
for (ii in 1:100) {
  fakeData[[ii]] <- future({
    fakeDF <- data.frame(x = rnorm(n, 0, 1), e = rnorm(n, 0, 1)) %>%
      mutate(y = 0.5 * x + e) %>%
      select(-e)
    fname <- paste0("./data/file", str_pad(ii, 3, pad = "0"), ".RDS")
    saveRDS(fakeDF, file = fname)
    return(paste0(fname, " has been written"))
  }, seed = TRUE)  # seed = TRUE gives each future a parallel-safe RNG stream
}

v <- lapply(fakeData, FUN = value)  # blocks until every file has been written

Run with doAzureParallel

# Getting Started ---------------------------------------------------------
library(doAzureParallel)

# 1. Generate your credential and cluster configuration files.  
generateClusterConfig("cluster.json")
generateCredentialsConfig("credentials.json")

# 2. Fill out your credential config and cluster config files.
#    Enter your Azure Batch account and Azure Storage keys/account info into your
#    credential config ("credentials.json"), and configure your cluster (VM size,
#    number of nodes, R packages to pre-install, ...) in your cluster config ("cluster.json").
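#
#    An illustrative cluster.json -- a sketch only; the generated template is the
#    source of truth and field names can vary across doAzureParallel versions:
#    {
#      "name": "regressionPool",
#      "vmSize": "Standard_D2_v2",
#      "poolSize": {
#        "dedicatedNodes":   { "min": 4, "max": 4 },
#        "lowPriorityNodes": { "min": 0, "max": 0 }
#      },
#      "rPackages": { "cran": [], "github": [], "bioconductor": [] }
#    }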

# 3. Set your credentials - you need to give the R session your credentials to interact with Azure
setCredentials("credentials.json")

# 4. Register the pool. This will create a new pool if your pool hasn't already been provisioned.
cluster <- makeCluster("cluster.json")

# 5. Register the pool as your parallel backend
registerDoAzureParallel(cluster)

# 6. Check that your parallel backend has been registered
getDoParWorkers()

# 7. Run the 100 regressions on the pool

my_files <- lapply(list.files("data", full.names = TRUE), readRDS)  # read the 100 data sets into a list

results <- foreach(fileX = my_files) %dopar% {
  lm1  <- lm(formula = y~x, data = fileX)
  return(lm1)
}
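
# results is an ordinary R list of fitted lm objects, so it can be inspected locally.
# Quick sanity check (a sketch; the slope used to simulate the data was 0.5):
slopes <- vapply(results, function(m) coef(m)[["x"]], numeric(1))
summary(slopes)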

# 8. Shut down your pool
stopCluster(cluster)

Things to consider

In the future I want to look into preloading the data onto each node, rather than uploading it from my local session with every foreach task.