Create an EMR cluster with support for R.
Fix Development Tools
To fix the gower package installation, remove the gcc 7.2 toolchain and install the base Development Tools group:
sudo yum remove gcc72-c++.x86_64 libgcc72.x86_64
sudo yum groupinstall 'Development Tools'
Also create the file ~/.R/Makevars with the following contents:
CC = /usr/bin/gcc64
CXX = /usr/bin/g++
SHLIB_OPENMP_CFLAGS = -fopenmp
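To double-check that the compiler the Makevars points at is actually present, you can query it from R (a quick sanity check using the path above, not part of the original steps):
# Print the version of the compiler referenced by ~/.R/Makevars
system("/usr/bin/gcc64 --version")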
Also install the R development headers:
sudo yum install R-devel
Then install the required packages:
install.packages("tidymodels")
install.packages("tune")
install.packages("mlbench")
install.packages("magrittr")
install.packages("dplyr")
install.packages("parsnip")
install.packages("kernlab")library(dplyr)
library(magrittr)
library(parsnip)
library(recipes)
library(rsample)
library(yardstick)
library(tune)
Follow the tidymodels Grid Search Tutorial; the objects it creates (svm_mod, iono_rs, roc_vals, and ctrl) are used in the timing runs below.
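That tutorial tunes an RBF SVM on the mlbench Ionosphere data. As a rough sketch of the objects it creates, using the packages already loaded above (the exact preprocessing and seed may differ from the tutorial, so treat this only as a stand-in):
library(mlbench)
# Ionosphere data; drop the factor and constant columns so all predictors are numeric
data(Ionosphere)
Ionosphere <- Ionosphere %>% select(-V1, -V2)
# RBF SVM with two tuning parameters, fit with kernlab
svm_mod <-
  svm_rbf(cost = tune(), rbf_sigma = tune()) %>%
  set_mode("classification") %>%
  set_engine("kernlab")
# Bootstrap resamples, ROC AUC as the metric, and default grid control
set.seed(4943)
iono_rs <- bootstraps(Ionosphere, times = 30)
roc_vals <- metric_set(roc_auc)
ctrl <- control_grid(verbose = FALSE)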
Now let's try with sparklyr, in this case using a 3-node cluster:
install.packages("remotes")
remotes::install_github("sparklyr/sparklyr")
library(sparklyr)
# Connect to Spark on YARN, requesting 24 executors (3 nodes with 8 CPUs each)
sc <- spark_connect(
  master = "yarn",
  spark_home = "/usr/lib/spark/",
  config = list(
    "spark.executor.instances" = 24
  )
)
# Validate that spark_apply() is working properly; repartition across 3 nodes with 8 CPUs each
sdf_len(sc, 3 * 8, repartition = 3 * 8) %>% spark_apply(~ 42)
First, let's capture the execution time without using Spark:
system.time({
  tune_grid(
    Class ~ .,
    model = svm_mod,
    resamples = iono_rs,
    metrics = roc_vals,
    control = ctrl
  )
})
   user  system elapsed 
133.386   0.503 133.883
You can then register Spark as a foreach backend. Note that this is a new feature to be released in sparklyr 1.2:
# Register Spark as the foreach backend
registerDoSpark(sc)
# Check number of parallel workers
foreach::getDoParWorkers()
[1] 24
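Before launching the full grid search, you can sanity-check the backend with a trivial %dopar% loop (not part of the original steps; foreach dispatches each iteration to the registered doSpark backend):
library(foreach)
# Each iteration is evaluated by a Spark worker and the results are combined on the driver
foreach(i = 1:4, .combine = c) %dopar% sqrt(i)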
Then rerun the grid search, this time using Spark:
system.time({
  tune_grid(
    Class ~ .,
    model = svm_mod,
    resamples = iono_rs,
    metrics = roc_vals,
    control = ctrl
  )
})
   user  system elapsed 
  3.735   0.310  85.088
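The elapsed time drops from roughly 134 to 85 seconds, and nearly all of the user time moves off the driver onto the executors. When you are done, close the Spark connection:
# Disconnect from the cluster
spark_disconnect(sc)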