Software Developer
Berlin, Germany
Mastering a data pipeline with Python: 6 years of learned lessons from mistakes to success.
Robson has been a developer since 2003 and leads a multifaceted life. In 2014 he transitioned into data engineering, using Python to handle complex pipelines and to glue other technologies together. He lives in Berlin, where in his free time he is an apprentice paramedical tattooer, helping people recover their self-esteem, and he races Olympic- and Ironman-distance triathlons.
-
Mastering a data pipeline with Python: 6 years of learned lessons from mistakes to success.
Description
Building data pipelines is a well-established task: a vast number of tools automate pipeline creation with a few clicks in the cloud, and they can solve simple or well-defined standard problems. This presentation distills years of experience and painful mistakes using Python as the core for creating reliable data pipelines and managing insane amounts of valuable data. We'll cover how each piece fits into the puzzle: data acquisition, ingestion, transformation, storage, workflow management, and serving. We'll also walk through best practices and common pitfalls, comparing PySpark with Dask and Pandas, touching on Airflow, and looking at Apache Arrow as a new approach.
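To make the pipeline stages named above concrete, here is a minimal, hypothetical sketch (function names and the sample data are illustrative, not from the talk) of acquisition, transformation, and storage as composable plain-Python steps:

```python
import csv
import io
import json

# Hypothetical raw input standing in for a data-acquisition step.
RAW_CSV = "user_id,amount\n1,10.5\n2,not_a_number\n3,7.25\n"

def acquire(raw):
    """Acquisition/ingestion: parse CSV text into dict records."""
    return list(csv.DictReader(io.StringIO(raw)))

def transform(rows):
    """Transformation: cast types and drop malformed records."""
    clean = []
    for row in rows:
        try:
            clean.append({"user_id": int(row["user_id"]),
                          "amount": float(row["amount"])})
        except ValueError:
            continue  # a real pipeline would route this to a dead-letter queue
    return clean

def store(rows):
    """Storage/serving: serialize records to JSON lines."""
    return "\n".join(json.dumps(r) for r in rows)

result = store(transform(acquire(RAW_CSV)))
```

The same stage boundaries apply whether the steps run in pandas, Dask, or PySpark; only the execution engine changes.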
Outline
1. Anatomy of a data product (5 minutes)
* A brief look at a data product compared with a traditional software product.
* Reasons why data products fail or succeed.
2. Lambda and Kappa architecture (5 minutes)
* Introducing the concepts
* Use cases
* Common issues
* Identifying opportunities to use them.
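As a toy illustration of the Lambda idea (all names hypothetical, with simple in-memory "views" standing in for real batch and stream processors): a batch view computed over historical events is merged with a speed-layer view over recent events at query time.

```python
from collections import Counter

# Historical events, reprocessed periodically by the batch layer.
batch_events = [("page_a", 1), ("page_b", 1), ("page_a", 1)]
# Fresh events not yet covered by the last batch run (speed layer).
recent_events = [("page_a", 1), ("page_c", 1)]

def batch_view(events):
    """Batch layer: full recomputation over all historical data."""
    view = Counter()
    for key, n in events:
        view[key] += n
    return view

def speed_view(events):
    """Speed layer: counts over fresh data only (same logic here;
    a real system would use a stream processor)."""
    return batch_view(events)

def query(key):
    """Serving layer: merge batch and speed views at read time."""
    return batch_view(batch_events)[key] + speed_view(recent_events)[key]
```

Kappa, by contrast, drops the batch layer and recomputes everything from the event stream alone.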
3. Qualities of a Pipeline (10 minutes)
* Security
* Automation
* Monitoring
* Testable and Traceable
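As a hint of what "monitoring" and "traceable" can mean in practice, a hypothetical standard-library-only sketch: each pipeline step logs its input and output record counts, so data loss can be localized to a step.

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipeline")

def traced(step):
    """Decorator: log input/output sizes of each pipeline step."""
    def wrapper(records):
        out = step(records)
        log.info("%s: %d in, %d out", step.__name__, len(records), len(out))
        return out
    return wrapper

@traced
def drop_negatives(records):
    """Example step: filter out invalid (negative) values."""
    return [r for r in records if r >= 0]

result = drop_negatives([1, -2, 3])
```

Real deployments would ship these counts to a metrics backend instead of a log, but the step-level accounting is the same idea.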
4. Properties of a Pipeline (25 minutes)
1. Producers vs Consumers
* WWW (When, What, and Where) is the data?
2. Python tools (ELT and Streaming)
* PySpark
* Pandas and Dask
* Apache Arrow
3. Analysis
* Dask
* Pandas
4. Management and Scheduling
* Airflow
* Luigi
* Kedro
5. Debugging
6. Testing
7. Validation
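For the testing and validation items above, one lightweight approach (illustrative only, not the talk's prescribed method) is record-level schema validation with plain functions that slot directly into a pytest suite:

```python
def validate_record(record, schema):
    """Validation: check that required fields exist and have the right types."""
    errors = []
    for field, expected_type in schema.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            errors.append(f"bad type for {field}: {type(record[field]).__name__}")
    return errors

# Hypothetical schema for the pipeline's records.
SCHEMA = {"user_id": int, "amount": float}

def test_valid_record():
    assert validate_record({"user_id": 1, "amount": 9.9}, SCHEMA) == []

def test_invalid_record():
    errs = validate_record({"user_id": "1"}, SCHEMA)
    assert "bad type for user_id: str" in errs
    assert "missing field: amount" in errs
```

Running the validator at ingestion boundaries keeps bad records from propagating downstream, which also simplifies debugging.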