Scalable Machine Learning with Apache Spark

Duration: 2 days

Industry: Information Technology

About this course

What is the Scalable Machine Learning with Apache Spark course all about?

In this course, you will work through the full data science workflow, including data exploration, feature engineering, model building, and hyperparameter tuning. By the end of the course, you will have built an end-to-end distributed machine learning pipeline that is ready for production.

This course guides students through the process of building machine learning solutions using Spark. You will build and tune ML models with SparkML using transformers, estimators, and pipelines. This course highlights some of the key differences between SparkML and single-node libraries such as scikit-learn. Furthermore, you will reproduce your experiments and version your models using MLflow.
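For readers new to the terms above, the transformer, estimator, and pipeline pattern can be illustrated with a toy, single-machine sketch in plain Python. This is not SparkML's API and is not taken from the course material. Every class name below (MeanCenterer, MeanCentererModel, AddSquare, Pipeline, PipelineModel) is hypothetical and exists only to show the idea: an estimator learns something from data and returns a fitted transformer, a transformer adds columns to rows, and a pipeline chains both kinds of stage.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence

Row = Dict[str, float]  # a "row" is just a dict of column name -> value


class Transformer:
    """Stateless or already-fitted stage: maps rows to rows."""

    def transform(self, rows: Sequence[Row]) -> List[Row]:
        raise NotImplementedError


class Estimator:
    """Stage that must see data first; fit() returns a Transformer."""

    def fit(self, rows: Sequence[Row]) -> Transformer:
        raise NotImplementedError


@dataclass
class MeanCentererModel(Transformer):
    column: str
    mean: float  # learned during fit()

    def transform(self, rows: Sequence[Row]) -> List[Row]:
        out_col = self.column + "_centered"
        return [{**r, out_col: r[self.column] - self.mean} for r in rows]


@dataclass
class MeanCenterer(Estimator):
    column: str

    def fit(self, rows: Sequence[Row]) -> MeanCentererModel:
        mean = sum(r[self.column] for r in rows) / len(rows)
        return MeanCentererModel(self.column, mean)


@dataclass
class AddSquare(Transformer):
    column: str

    def transform(self, rows: Sequence[Row]) -> List[Row]:
        out_col = self.column + "_sq"
        return [{**r, out_col: r[self.column] ** 2} for r in rows]


class PipelineModel(Transformer):
    """A fitted pipeline: applies each fitted stage in order."""

    def __init__(self, stages: Sequence[Transformer]):
        self.stages = list(stages)

    def transform(self, rows: Sequence[Row]) -> List[Row]:
        current = list(rows)
        for stage in self.stages:
            current = stage.transform(current)
        return current


class Pipeline(Estimator):
    """Fits estimators in order, feeding each stage the previous output."""

    def __init__(self, stages: Sequence[object]):
        self.stages = list(stages)

    def fit(self, rows: Sequence[Row]) -> PipelineModel:
        fitted: List[Transformer] = []
        current = list(rows)
        for stage in self.stages:
            # Estimators are fitted on the data as transformed so far.
            model = stage.fit(current) if isinstance(stage, Estimator) else stage
            fitted.append(model)
            current = model.transform(current)
        return PipelineModel(fitted)


train = [{"x": 1.0}, {"x": 2.0}, {"x": 3.0}]
pipeline = Pipeline([MeanCenterer("x"), AddSquare("x_centered")])
model = pipeline.fit(train)            # learns mean(x) from the training rows
result = model.transform([{"x": 5.0}])  # reuses the learned mean on new data
print(result)
```

The point of the split is that the value learned from the training rows is stored in the fitted model and reused unchanged on new data, which is the same separation the course describes for SparkML at cluster scale.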

You will also integrate third-party libraries, such as XGBoost, into Spark workloads. In addition, you will leverage Spark to scale inference of single-node models and parallelize hyperparameter tuning. This course includes hands-on labs and concludes with a collaborative capstone project. All notebooks are available in Python, and in Scala where available.

What is Apache Spark?

InfoWorld describes Spark as a data processing framework that can quickly perform processing tasks on very large data sets and can also distribute data processing tasks across multiple computers, either on its own or in tandem with other distributed computing tools. These two qualities are key to the big data and machine learning worlds, which require the marshaling of massive computing power to crunch through large data stores. Spark also takes some of the programming burden of these tasks off the shoulders of developers with an easy-to-use API that abstracts away much of the grunt work of distributed computing and big data processing.

For more information about this course, please check this blog from P2L.

Who can benefit?

Data scientist
Machine learning engineer

This is what you'll learn

Create data processing pipelines with Spark
Build and tune machine learning models with SparkML
Track, version, and deploy models with MLflow
Perform distributed hyperparameter tuning with Hyperopt
Use Spark to scale the inference of single-node models

Prerequisite Skills

Intermediate experience with Python
Beginning experience with the PySpark DataFrame API (or have taken the Apache Spark Programming with Databricks class)
Working knowledge of machine learning and data science

Schedule (iMVP)

Sep 29-30, 2022

Oct 27-28, 2022

Nov 17-18, 2022

Dec 8-9, 2022
