Embarrassingly Parallel Computing with doAzureParallel
Why?
- You want to run 100 regressions, each of which takes an hour, and the only difference between them is the data set they use. This is an embarrassingly parallel problem.
- For whatever reason, you want to use Azure instead of Google Compute Engine…
Before you start
I will assume that:
- you have an Azure account,
- you have correctly installed and configured doAzureParallel (if not, see the install sketch below).
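If doAzureParallel is not installed yet, the package README (at the time of writing) suggests installing it and its companion package rAzureBatch from GitHub, roughly like this:
# Install from GitHub via devtools (sketch based on the doAzureParallel README)
install.packages("devtools")
devtools::install_github("Azure/rAzureBatch")
devtools::install_github("Azure/doAzureParallel")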
Create some fake data
library(dplyr)
library(stringr)
library(future)  # provides future() and value(), used below

plan(multisession)                        # run the file-writing futures in parallel on local cores
dir.create("data", showWarnings = FALSE)  # make sure ./data exists before writing into it

set.seed(12618)
n <- 10000

fakeData <- list()
for (ii in 1:100) {
  fakeData[[ii]] <- future({
    # y = 0.5 * x + noise; drop the noise column before saving
    fakeDF <- data.frame(x = rnorm(n, 0, 1), e = rnorm(n, 0, 1)) %>%
      mutate(y = 0.5 * x + e) %>%
      select(-e)
    fname <- paste0("./data/file", str_pad(ii, 3, pad = "0"), ".RDS")
    saveRDS(fakeDF, file = fname)
    return(paste0(fname, " has been written"))
  }, seed = TRUE)  # parallel-safe random numbers inside each future
}

# Block until every future has finished writing its file
v <- lapply(fakeData, FUN = value)
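As a quick sanity check (not strictly necessary), you can confirm that all 100 RDS files actually landed in ./data:
length(list.files("data", pattern = "\\.RDS$"))  # should be 100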
Run with doAzureParallel
# Getting Started ---------------------------------------------------------
library(doAzureParallel)
# 1. Generate your credential and cluster configuration files.
generateClusterConfig("cluster.json")
generateCredentialsConfig("credentials.json")
# 2. Fill out your credential config and cluster config files.
# Enter your Azure Batch account and Azure Storage keys/account info into your credential
# config ("credentials.json"), and configure your cluster in your cluster config ("cluster.json")
# 3. Set your credentials - you need to give the R session your credentials to interact with Azure
setCredentials("credentials.json")
# 4. Create the cluster. This will provision a new pool if one hasn't already been provisioned.
cluster <- makeCluster("cluster.json")
# 5. Register the pool as your parallel backend
registerDoAzureParallel(cluster)
# 6. Check that your parallel backend has been registered
getDoParWorkers()
# 7. Run the 100 regressions across the cluster
# Read all of the fake data sets into the local session; foreach ships one to each task
my_files <- lapply(list.files("data", full.names = TRUE), readRDS)
results <- foreach(fileX = my_files) %dopar% {
  lm1 <- lm(formula = y ~ x, data = fileX)
  return(lm1)
}
# 8. Shut down your pool
stopCluster(cluster)
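Since results is just a list of lm objects, you can summarise the fits locally once the jobs come back. A small follow-up sketch (not part of the original workflow):
# Pull the estimated slope on x out of each fitted model
slopes <- vapply(results, function(m) coef(m)[["x"]], numeric(1))
summary(slopes)  # should be centred near the true slope of 0.5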
Things to consider
In the future, I want to look into preloading the data onto each node, so the data frames don't have to be shipped from my local R session every time the job runs (one untested alternative is sketched below).
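Until then, one untested alternative: rather than reading every RDS file into the local session and letting foreach serialise the data frames out to the workers, each task could pull its own file from Azure Blob Storage by URL. A rough sketch, assuming the files have been uploaded to a publicly readable container (the storage account and container names below are made up):
# Hypothetical: each task downloads and reads its own file from blob storage
blob_base  <- "https://mystorageaccount.blob.core.windows.net/fakedata"
file_names <- sprintf("file%03d.RDS", 1:100)

results_blob <- foreach(f = file_names) %dopar% {
  df <- readRDS(gzcon(url(paste0(blob_base, "/", f))))  # RDS files are gzip-compressed by default
  lm(y ~ x, data = df)
}
This still downloads each file at task run time rather than truly preloading it onto the node, but it keeps the data transfer off my local machine.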