Data Processing with PySpark: Exploring Data Cleaning, Transformation, and Action Operations on DataFrames


  • General Python, Web/DevOps
  • Long Talk
  • General

    By Sandra Oriji

    Data Analyst

    Abstract:

    In the era of big data, efficient data processing is crucial for extracting meaningful insights. PySpark, the Python API for Apache Spark, provides powerful tools for distributed data processing. This talk delves into PySpark's capabilities, focusing on data cleaning, transformation, and action operations. The session will cover:

    1. The fundamentals of PySpark, including the benefits of Spark's Resilient Distributed Datasets (RDDs) and DataFrames.
    2. Common data quality issues and techniques for handling missing values and duplicates, showcasing PySpark's data-cleansing functions such as filtering, imputation, and deduplication.
    3. PySpark's powerful transformations, such as map, along with DataFrame operations like select, groupBy, and join.
    4. Best practices for optimizing performance.
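    The operations the abstract names can be sketched as follows. This is a minimal illustration, not material from the talk itself; the sample data, column names, and app name are hypothetical.

    ```python
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("pyspark-cleaning-demo").getOrCreate()

    # Hypothetical sample data with a duplicate row and missing values.
    rows = [
        ("Ada", 34, "Lagos"),
        ("Ada", 34, "Lagos"),    # exact duplicate
        ("Ben", None, "Abuja"),  # missing age
        ("Chi", 29, None),       # missing city
    ]
    df = spark.createDataFrame(rows, ["name", "age", "city"])

    # Deduplication: drop exact duplicate rows.
    deduped = df.dropDuplicates()

    # Imputation: fill missing ages with the (truncated) column mean.
    mean_age = deduped.select(F.avg("age")).first()[0]
    imputed = deduped.fillna({"age": int(mean_age)})

    # Filtering: keep only rows with a known city.
    cleaned = imputed.filter(F.col("city").isNotNull())

    # Transformations such as select, groupBy, and join are lazy: they
    # build an execution plan that only runs when an action
    # (collect, count, show) is called.
    lookup = spark.createDataFrame(
        [("Lagos", "NG"), ("Abuja", "NG")], ["city", "country"]
    )
    joined = (
        cleaned.join(lookup, on="city", how="left")
        .select("name", "age", "country")
    )

    result = joined.orderBy("name").collect()  # action: triggers execution
    ```

    Note the split between transformations (which are lazy and only describe the computation) and actions like `collect`, which actually execute the plan across the cluster; keeping this distinction in mind is the basis for the performance practices the talk closes with.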

