Course

PySpark: Level 01

One of the most valuable technology skills is the ability to analyze huge data sets, and one of the best technologies for this task is Apache Spark. Top technology companies such as Google, Facebook, Netflix, Airbnb, Amazon, and NASA use Spark to solve their big data problems. Spark is a tool for parallel computation on large datasets, and it integrates well with Python; PySpark is the Python package that makes this possible. Apache Spark is a lightning-fast cluster computing framework designed for fast computation. It builds on the Hadoop MapReduce model and extends it to efficiently support more types of computation, including interactive queries and stream processing. This course covers the basics of Spark Core programming and is designed to enhance your Spark skills. By the end of the course, you will understand Spark DataFrames, DataFrame operations, and important concepts such as partitioning of data.

22 Lessons
Outcomes

By the end of the course, learners will be able to:

  • Explain the rationale behind using Apache Spark and how it is implemented
  • Identify situations where Spark, and its Python API PySpark, can be leveraged for very fast processing
  • Perform different operations with PySpark, including passing functions, caching, and transformations/actions
  • Compare the modes of running Spark and choose the right method for launching an application
  • Configure SparkSession parameters and explain the difference between SparkSession and SparkContext (see the first sketch after this list)
  • Create DataFrames from CSV files, from existing RDDs, by transforming an existing DataFrame, and by applying a StructType schema
  • Work with the Parquet, Avro, and ORC file formats and create DataFrames from them
  • Apply transformations, actions, and other operations to DataFrames
  • Perform partitioning and repartitioning (see the second sketch after this list)
  • Gain hands-on experience working with DataFrames and Spark SQL
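A minimal sketch of the SparkSession and DataFrame workflow mentioned in the outcomes above. The file name employees.csv, the schema, and the column names are illustrative assumptions, not course material:

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType, IntegerType

    # SparkSession is the single entry point for DataFrame and SQL work;
    # the lower-level SparkContext remains reachable via spark.sparkContext.
    spark = SparkSession.builder.appName("pyspark-level-01").getOrCreate()

    # Explicit schema via StructType (columns and file path are hypothetical).
    schema = StructType([
        StructField("name", StringType(), True),
        StructField("department", StringType(), True),
        StructField("salary", IntegerType(), True),
    ])

    df = spark.read.csv("employees.csv", header=True, schema=schema)

    # Transformations (filter, select) are lazy; the action (count) triggers execution.
    high_paid = df.filter(df.salary > 50000).select("name", "department")
    high_paid.cache()          # keep the result in memory for reuse
    print(high_paid.count())   # action: runs the job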
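A second sketch, continuing from the DataFrame above, illustrating Parquet I/O, repartitioning, and Spark SQL; the output path and the aggregation query are likewise hypothetical:

    # Write as Parquet, repartitioned by a column (path is illustrative).
    df.repartition(4, "department").write.mode("overwrite").parquet("employees_parquet")

    # Read it back and inspect how many partitions the data now spans.
    parquet_df = spark.read.parquet("employees_parquet")
    print(parquet_df.rdd.getNumPartitions())

    # Register a temporary view so the same data can be queried with Spark SQL.
    parquet_df.createOrReplaceTempView("employees")
    spark.sql("""
        SELECT department, AVG(salary) AS avg_salary
        FROM employees
        GROUP BY department
    """).show()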

Course Contributors:

  • Issac Joseph
  • Shivanand Ukkali
Level: 01
Duration: 22 Hours
Pre-requisites: Python (Level 1), SQL (Level 1) – For Spark SQL
What’s next: Data Management