Mastering Big Data Analytics with PySpark, Published by Packt
This is the code repository for Mastering Big Data Analytics with PySpark [Video], published by Packt. It contains all the supporting project files necessary to work through the video course from start to finish. Authored by: Danny Meijer
PySpark helps you perform data analysis at scale; it enables you to build more scalable analyses and pipelines. This course starts by introducing you to PySpark's potential for performing effective analyses of large datasets. You'll learn how to interact with Spark from Python and connect Jupyter to Spark to provide rich data visualizations. After that, you'll delve into Spark's various components and its architecture.
You'll learn to work with Apache Spark and perform ML tasks more smoothly than before. You'll gather and query data using Spark SQL to overcome the challenges involved in reading it, use the DataFrame API to work with Spark MLlib, and learn about the Pipeline API. Finally, we provide tips and tricks for deploying your code and for performance tuning.
By the end of this course, you will not only be able to perform efficient data analytics, but will also have learned to use PySpark to easily analyze large datasets at scale in your organization.
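To give a flavour of this workflow, here is a minimal, self-contained sketch combining the DataFrame API, a Spark SQL query, and an MLlib Pipeline. The data and column names (feature_a, feature_b, label) are made up for illustration and are not part of the course material.

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.appName("course-sketch").getOrCreate()

# A tiny, made-up DataFrame standing in for a real dataset
df = spark.createDataFrame(
    [(1.0, 2.0, 5.1), (2.0, 0.5, 6.3), (3.0, 1.5, 9.8)],
    ["feature_a", "feature_b", "label"],
)

# Query the same data through Spark SQL
df.createOrReplaceTempView("samples")
spark.sql("SELECT AVG(label) AS avg_label FROM samples").show()

# MLlib Pipeline: assemble the features, then fit a linear regression
pipeline = Pipeline(stages=[
    VectorAssembler(inputCols=["feature_a", "feature_b"], outputCol="features"),
    LinearRegression(featuresCol="features", labelCol="label"),
])
model = pipeline.fit(df)
model.transform(df).select("label", "prediction").show()

spark.stop()
```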
This course will greatly appeal to data science enthusiasts, data scientists, and anyone who is familiar with machine learning concepts and wants to scale their work out to big data.
If you find it difficult to analyze large datasets that keep growing, then this course is the perfect guide for you!
A working knowledge of Python is assumed.
For successful completion of this course, students will require a computer system with at least the following:
- OS: Windows, Mac, or Linux
- Processor: any processor from the last few years
- Memory: 2GB RAM
- Storage: 300MB for the Integrated Development Environment (IDE) and 1GB for cache
For an optimal experience with hands-on labs and other practical activities, we recommend the following configuration:
- OS: Windows, Mac, or Linux
- Processor: Core i5 or better (or AMD equivalent)
- Memory: 8GB RAM or better
- Storage: 2GB free for build caches and dependencies
Software requirements: a Windows, Mac, or Linux operating system, and Docker.
Setting up your interactive development environment
Once you have cloned this repository locally, simply navigate to the folder in which you stored the repo and run:
python download_data.py
This will populate the data-sets folder in your repo with a number of data sets that will be used throughout the course.
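As a quick sanity check, you can point PySpark at one of the downloaded files. The file name below is a hypothetical placeholder; replace it with a file that actually exists in your data-sets folder.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("data-check").getOrCreate()

# Replace "data-sets/some_file.csv" with a file placed there by download_data.py
df = spark.read.csv("data-sets/some_file.csv", header=True, inferSchema=True)
df.printSchema()
print(df.count(), "rows")

spark.stop()
```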
The Docker image bundled with this course (see Dockerfile) is based on the pyspark-notebook image, distributed and maintained by Jupyter (Github link). Original copyright (c) Jupyter Development Team. Distributed under the terms of the Modified BSD License.
This course's Docker image extends pyspark-notebook with the following additions:
- pyspark-stubs and blackcellmagic
- Jupyter notebook extensions (using jupyter_contrib_nbextensions)
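For instance, the bundled blackcellmagic extension can be used inside a notebook running in this container to auto-format code with Black; the snippet below shows the extension's standard usage and is not course-specific material.

```python
# Run once per notebook session to load the bundled extension:
%load_ext blackcellmagic

# Then start any code cell with the %%black cell magic; when that cell runs,
# its contents are reformatted in Black's style.
```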
There are two ways to access the Docker container in this course: the run_me.py script (recommended), or a manual setup.
run_me.py script (recommended)
The easiest way to run the container that belongs to this course is by running
python run_me.py
from the course's repository. This will automatically build the Docker image, set up the Docker container, download the data, and set up the necessary volume mounts.
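For reference, here is a minimal sketch of the kind of steps such a script automates. This is not the actual run_me.py, just an illustration built from the manual commands documented in the next section.

```python
# NOT the actual run_me.py -- a rough illustration of the steps it automates,
# mirroring the manual docker commands documented below.
import pathlib
import subprocess

repo = pathlib.Path(__file__).parent.resolve()

# 1. Download the course data sets
subprocess.run(["python", "download_data.py"], cwd=repo, check=True)

# 2. Build the Docker image from the bundled Dockerfile
subprocess.run(
    ["docker", "build", "--rm", "-f", "Dockerfile",
     "-t", "mastering_pyspark_ml:latest", "."],
    cwd=repo, check=True,
)

# 3. Run the container with the repo mounted and the Jupyter/Spark UI ports exposed
subprocess.run(
    ["docker", "run", "-v", f"{repo}:/home/jovyan/", "--rm", "-d",
     "-p", "8888:8888", "-p", "4040:4040",
     "--name", "mastering_pyspark_ml", "mastering_pyspark_ml"],
    cwd=repo, check=True,
)
```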
If you would rather start the Docker container manually, use the following instructions:
1. Download the data:
python download_data.py
2. Build the image:
docker build --rm -f "Dockerfile" -t mastering_pyspark_ml:latest .
3. Run the image. Ensure that you replace /path/to/mastering_pyspark_ml/repo/ in the following command, and run it in a terminal or command prompt:
docker run -v /path/to/mastering_pyspark_ml/repo/:/home/jovyan/ --rm -d -p 8888:8888 -p 4040:4040 --name mastering_pyspark_ml mastering_pyspark_ml
4. Open Jupyter Lab once the Docker container is running by navigating to http://localhost:8888/lab
Once you are ready to shut down the Docker container, you can use the following command:
docker stop mastering_pyspark_ml