Data Processing with PySpark: Exploring Data Cleaning, Transformation, and Action Operations on DataFrames
By Sandra Oriji
Data Analyst

Abstract:
In the era of big data, efficient data processing is crucial for extracting meaningful insights. PySpark, the Python API for Apache Spark, provides powerful tools for distributed data processing. This talk delves into PySpark's capabilities, focusing on data cleaning, transformation, and action operations. This session will cover:

1. The fundamentals of PySpark, and the benefits of Spark's Resilient Distributed Datasets (RDDs) and DataFrames.
2. Common data quality issues and techniques for handling missing values and duplicates, showcasing PySpark's functions for data cleansing such as filtering, imputation, and deduplication (a minimal sketch follows below).
3. PySpark's powerful transformations, such as map, alongside DataFrame operations like select, groupBy, and join.
4. Best practices for optimizing performance.
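To make the cleansing step concrete, here is a minimal sketch of the filtering, imputation, and deduplication techniques named in point 2. The file name customers.csv and the columns customer_id, age, and income are illustrative assumptions, not part of the talk:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("pyspark-cleaning-demo").getOrCreate()

# Hypothetical input file and column names, for illustration only.
df = spark.read.csv("customers.csv", header=True, inferSchema=True)

# Filtering: keep only rows with a valid, positive age.
df = df.filter(F.col("age").isNotNull() & (F.col("age") > 0))

# Imputation: replace missing incomes with the column mean.
mean_income = df.select(F.mean("income")).first()[0]
df = df.na.fill({"income": mean_income})

# Deduplication: drop repeated rows for the same customer.
df = df.dropDuplicates(["customer_id"])
```

The transformations in point 3 might be sketched as follows; a second table orders.csv with an amount column is again an assumption. Note that select, join, and groupBy are lazy, while show() is the action that triggers execution:

```python
# A hypothetical second table, joined on the shared key.
orders = spark.read.csv("orders.csv", header=True, inferSchema=True)

# select, join, and groupBy are lazy transformations; nothing executes yet.
summary = (
    df.select("customer_id", "income")
      .join(orders, on="customer_id", how="inner")
      .groupBy("customer_id")
      .agg(F.sum("amount").alias("total_spent"))
)

# An RDD-level map, for contrast with the DataFrame API.
pairs = summary.rdd.map(lambda row: (row["customer_id"], row["total_spent"]))

# show() is an action: it triggers execution of the whole plan above.
summary.show(5)
```

Finally, two of the common performance levers point 4 alludes to are caching a reused DataFrame and broadcasting the small side of a join; which levers the talk actually covers is not specified, so this is only one plausible illustration:

```python
from pyspark.sql.functions import broadcast

# Caching: keep a DataFrame that is reused across actions in memory.
df.cache()
df.count()  # the first action materializes the cache

# Broadcast join: ship the small table to every executor to avoid a shuffle.
joined = df.join(broadcast(orders), on="customer_id")
```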
Other Talks
- Ensuring Fairness in AI: Mitigating Bias with Python Libraries and Frameworks, by Kweyakie Afi Blebo
- FastDjango: Conjuring Powerful APIs with the Sorcery of Django Ninja, by Julius Boakye
- Keynote, by Marlene Mhangami