Apache Spark Programming with Databricks 101

What is Apache Spark?

Databricks describes Apache Spark as a lightning-fast unified analytics engine for big data and machine learning. Since its release, Spark has seen rapid adoption by enterprises across a wide range of industries. Internet powerhouses such as Netflix, Yahoo, and eBay have deployed Spark at massive scale, collectively processing multiple petabytes of data on clusters of over 8,000 nodes. It has quickly become the largest open-source community in big data, with over 1,000 contributors from more than 250 organizations.

What is the Apache Spark Programming with Databricks course all about?

This course uses a case study-driven approach to explore the fundamentals of Spark programming with Databricks, including Spark architecture, the DataFrame API, query optimization, and Structured Streaming. First, you will become familiar with Databricks and Spark, recognize their major components, and explore datasets for the case study using the Databricks environment. After ingesting data from various file formats, you will process and analyze datasets by applying a variety of DataFrame transformations, Column expressions, and built-in functions. Lastly, you will execute streaming queries to process streaming data and explore the advantages of using Delta Lake.

What is the duration of the course?

The course is two days long.

Course Objectives:

Upon completion of the course, students should be able to meet the following objectives:

  • Define the major components of Spark architecture and execution hierarchy
  • Describe how DataFrames are built, transformed, and evaluated in Spark
  • Apply the DataFrame API to explore, preprocess, join, and ingest data in Spark
  • Apply the Structured Streaming API to perform analytics on streaming data
  • Navigate the Spark UI and describe how the Catalyst optimizer, partitioning, and caching affect Spark’s execution performance

Target Audience:

  • Data engineer
  • Data scientist
  • Machine learning engineer
  • Data architect

Prerequisites:

  • Familiarity with basic SQL concepts (select, filter, group by, join, etc.)
  • Beginner programming experience with Python or Scala (syntax, conditions, loops, functions)

Additional Notes:

All participants will need:

  • An internet connection
  • A device with a supported internet browser

NOTE: The class will be delivered through GoToTraining, our chosen online platform. Each registrant will receive GoToTraining log-in instructions prior to attendance.

Course Outline:

Day 1: DataFrames

  • Introduction: Databricks Ecosystem, Spark Overview, Case Study
  • Databricks Platform: Databricks Concepts, Databricks Platform, Lab
  • Spark SQL: Spark SQL, DataFrames, SparkSession, Lab
  • Reader and Writer: Data Sources, DataFrameReader/Writer, Lab

Day 2: DataFrames and Transformations

  • DataFrame and Column: Columns and Expressions, Transformations, Actions, Rows, Lab
  • Aggregation: GroupBy, Grouped Data Methods, Aggregate Functions, Math Functions, Lab
  • Datetimes: Dates and Timestamps, Datetime Patterns, Date Functions, Lab
  • Complex Types: String Functions, Collection Functions
  • Additional Functions: Non-aggregate Functions, Na Functions, Lab

Day 3: Transformations and Spark Internals

  • UDFs: UDFs, Vectorized UDFs, Performance, Lab
  • Spark Architecture: Spark Cluster, Spark Execution, Shuffling
  • Query Optimization: Query Optimization, Catalyst Optimizer, Adaptive Query Execution
  • Partitioning: Partitions vs. Cores, Default Shuffle Partitions, Repartition, Lab
  • Review: Review of lab

Day 4: Structured Streaming and Delta

  • Streaming Query: Streaming Concepts, Streaming Query, Transformations, Monitoring, Lab
  • Processing Streams: Lab
  • Delta Lake: Delta Lake Concepts, Batch and Streaming

Conclusion:

Are you looking to learn the mechanics of an analytics platform that accelerates innovation by unifying data science, engineering, and business? Then look no further. The Apache Spark Programming with Databricks training course will shed light on the basics of creating Spark jobs, loading data, and working with data.

To enroll, contact P2L today!
