50 Hours of Big Data, PySpark, AWS, Scala, and Scraping

This four-in-one course covers data scraping and data mining with Python, from beginner to pro; Big Data with Scala and Spark; Big Data with PySpark and AWS; and MongoDB for beginners.

621 on-demand videos & exercises
Level: Beginner
English
54hrs 32mins

What to know about this course

Part 1 covers the most in-demand Scala skills and provides an in-depth understanding of core Scala concepts. It wraps up with a discussion of MapReduce and of ETL pipelines that use Spark to move data from AWS S3 to AWS RDS. This part includes six mini-projects and one Scala Spark project.
Part 2 uses PySpark for data analysis. You will explore Spark RDDs and DataFrames, some Spark SQL queries, the transformations and actions that can be performed on data using RDDs and DataFrames, and the Spark and Hadoop ecosystem and its underlying architecture. You will also learn how to leverage AWS storage, database, and compute services, and how Spark communicates with different AWS services.
Part 3 is all about data scraping and data mining. You will cover important concepts such as browser execution and communication with the server, synchronous and asynchronous requests, parsing the data in the server's response, tools for data scraping, the Python requests module, and more.
In Part 4, you will use MongoDB to develop an understanding of NoSQL databases. You will explore the basic operations and the MongoDB query, projection, and update operators. This part winds up with two projects: developing a CRUD-based application using Django and MongoDB, and implementing an ETL pipeline using PySpark to load data into MongoDB.
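
As a small taste of the Part 3 topic of parsing data in a server's response, here is a minimal sketch using only Python's standard library. It is not taken from the course material: the `LinkCollector` class, the `extract_links` helper, and the sample HTML are hypothetical, and a real scraper would obtain the HTML from an HTTP client such as the requests module rather than from a hard-coded string.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag seen while parsing."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current tag.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html_text):
    """Return all link targets found in an HTML document string."""
    parser = LinkCollector()
    parser.feed(html_text)
    parser.close()
    return parser.links


# Hypothetical response body standing in for what a server might return.
SAMPLE_HTML = """
<html><body>
  <a href="/courses/pyspark">PySpark</a>
  <a href="/courses/scala">Scala</a>
  <a>no link here</a>
</body></html>
"""

if __name__ == "__main__":
    print(extract_links(SAMPLE_HTML))
```

The anchor tag without an `href` attribute is skipped, so only the two course paths are collected.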

By the end of this course, you will be able to relate the concepts and practical aspects of the technologies you have learned to real-world problems. All the resources for this course are available at https://github.com/PacktPublishing/50-Hours-of-Big-Data-PySpark-AWS-Scala-and-Scraping


Who's this course for?

This course is designed for absolute beginners who want to create intelligent solutions, work with real data, and enjoy learning theory and then putting it into practice. Data scientists, machine learning experts, and drop shippers will all benefit from this training.

A basic understanding of programming, HTML tags, Python, SQL, and Node.js is required.

However, no prior knowledge of data scraping or Scala is needed.

What you'll learn

  • Build an ETL pipeline from AWS S3 to AWS RDS using Spark
  • Explore Spark/Hadoop applications, ecosystem, and architecture
  • Learn collaborative filtering in PySpark
  • Recognize the distinction between synchronous and asynchronous requests
  • Understand MongoDB CRUD, query operators, projection operators, and update operators
  • Build APIs for CRUD operations in MongoDB through Django
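
The distinction between synchronous and asynchronous requests listed above can be illustrated with a short standard-library sketch. It makes no real network calls: `fetch_sync` and `fetch_async` are hypothetical stand-ins that simply wait for a fixed delay, and the page names are made up for illustration.

```python
import asyncio
import time


def fetch_sync(name, delay):
    # Stand-in for a blocking request: the caller waits for the full delay.
    time.sleep(delay)
    return f"{name} done"


async def fetch_async(name, delay):
    # Stand-in for a non-blocking request: while this one waits,
    # the event loop is free to run the other pending requests.
    await asyncio.sleep(delay)
    return f"{name} done"


def run_sync(names, delay):
    # Requests run one after another.
    return [fetch_sync(name, delay) for name in names]


async def run_async(names, delay):
    # Requests are started together and awaited as a group;
    # gather returns the results in the order the requests were passed in.
    return await asyncio.gather(*(fetch_async(name, delay) for name in names))


if __name__ == "__main__":
    pages = ["page-1", "page-2", "page-3"]

    start = time.perf_counter()
    run_sync(pages, 0.1)
    print(f"synchronous:  {time.perf_counter() - start:.2f}s")

    start = time.perf_counter()
    asyncio.run(run_async(pages, 0.1))
    print(f"asynchronous: {time.perf_counter() - start:.2f}s")
```

Both versions return the same results. The synchronous run takes roughly the sum of the delays, while the asynchronous run takes roughly the longest single delay, because the waits overlap.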

Key Features

  • Data scraping and data mining with Python, from beginner to pro
  • Clear unfolding of concepts with examples in Python, Scrapy, Scala, PySpark, and MongoDB
  • Master Big Data with PySpark and AWS

About the Author

AI Sciences

AI Sciences is a group of experts, PhDs, and artificial intelligence practitioners with backgrounds in computer science, machine learning, and statistics. Some work at large companies such as Amazon, Google, Facebook, Microsoft, KPMG, BCG, and IBM.
AI Sciences produces a series of courses for beginners and newcomers on the techniques and methods of machine learning, statistics, artificial intelligence, and data science. The courses aim to help those who want to understand these techniques more easily and to start with less theory and less extended reading. Today, AI Sciences also publishes more comprehensive courses on specific topics for wider audiences. Its courses have helped more than 100,000 students master AI and data science.