Apache Spark 3 for Data Engineering and Analytics with Python

This course focuses primarily on explaining the concepts of Python and PySpark. It will help you enhance your data analysis skills using the structured Spark DataFrame APIs.

88 on-demand videos & exercises
Level: Beginner
English
8hrs 30mins
Access on mobile, web and TV

What to know about this course

Apache Spark 3 is an open-source distributed engine for querying and processing data. This course gives you a detailed understanding of PySpark and its stack, and guides you through the process of data analytics using PySpark. The author uses an interactive approach to explain key concepts of PySpark such as the Spark architecture, Spark execution, transformations and actions using the structured API, and much more. You will be able to leverage the power of Python, Java, and SQL and put it to use in the Spark ecosystem.

You will start by getting a firm understanding of the Apache Spark architecture and how to set up a Python environment for Spark. You will then learn techniques for collecting, cleaning, and visualizing data by creating dashboards in Databricks, and how to use SQL to interact with DataFrames.
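
To give a feel for the difference between transformations and actions mentioned above, here is a minimal sketch in plain Python. It is a toy model only: the `LazyDataset` class and the `double` helper are hypothetical names invented for this illustration and are not part of the PySpark API or the course code bundle. The idea it shows is that transformations only record what should happen, while an action actually runs the recorded steps.

```python
from typing import Callable, Iterable, List, Optional


class LazyDataset:
    """Toy stand-in for a Spark dataset; NOT the PySpark API."""

    def __init__(self, source: Iterable, plan: Optional[List[tuple]] = None):
        self._source = list(source)
        self._plan = plan or []  # recorded steps, oldest first

    # Transformations: record a step and return a new dataset.
    # No element of the data is processed here.
    def map(self, fn: Callable) -> "LazyDataset":
        return LazyDataset(self._source, self._plan + [("map", fn)])

    def filter(self, predicate: Callable) -> "LazyDataset":
        return LazyDataset(self._source, self._plan + [("filter", predicate)])

    # Actions: run the recorded plan and return a concrete result.
    def collect(self) -> list:
        rows = self._source
        for kind, fn in self._plan:
            if kind == "map":
                rows = [fn(r) for r in rows]
            else:
                rows = [r for r in rows if fn(r)]
        return rows

    def count(self) -> int:
        return len(self.collect())


calls = []  # records every time the mapping function actually runs


def double(x):
    calls.append(x)
    return x * 2


numbers = LazyDataset([1, 2, 3, 4, 5])
pipeline = numbers.map(double).filter(lambda x: x > 4)  # transformations only
calls_before_action = len(calls)  # nothing has been computed yet
result = pipeline.collect()       # the action triggers execution
```

In this sketch, `double` is not called until `collect()` runs. Spark's real execution model is far richer (it plans and distributes the work across a cluster), so treat this only as an analogy for lazy evaluation.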

The author provides an in-depth review of RDDs and contrasts them with DataFrames. Challenge problems appear at intervals throughout the course so that you get a firm grasp of the concepts taught. The code bundle for this course is available here: https://github.com/PacktPublishing/Apache-Spark-3-for-Data-Engineering-and-Analytics-with-Python-

Who's this course for?

  • This course is designed for Python developers who wish to learn how to use the language for data engineering and analytics with PySpark.
  • Any aspiring data engineering and analytics professionals.
  • Data scientists/analysts who wish to learn an analytical processing strategy that can be deployed over a big data cluster.
  • Data managers who want to gain a deeper understanding of managing data over a cluster.

What you'll learn

  • Learn Spark architecture, transformations, and actions using the structured API.
  • Learn to set up your own local PySpark environment.
  • Learn to interpret the DAG (Directed Acyclic Graph) for Spark execution.
  • Learn to interpret the Spark web UI.
  • Learn the RDD (Resilient Distributed Dataset) API.
  • Learn to visualize (graphs and dashboards) data on Databricks.

Key Features

  • The course explains how the exam is structured, the way that the questions should be approached and how to study successfully to pass.
  • The course also includes invaluable advice on the best way to prepare and what to expect from the testing process.

About the Author

David Mngadi

David Mngadi is a data management professional who is influenced by the power of data in our lives and has helped several companies become more data-driven to gain a competitive edge and meet regulatory requirements. Over the last 15 years, he has had the pleasure of designing and implementing data warehousing solutions in the retail, telco, and banking industries and, more recently, implementations focused on big data lakes. He is passionate about technology and teaching programming online.