swandrn/pyspark-pipeline

PySpark ETL pipeline

A PySpark ETL pipeline that transforms a CSV file stored on Amazon S3, orchestrated with Apache Airflow.
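As a rough illustration of the kind of job this describes, the sketch below reads a CSV from S3, applies a placeholder transformation, and writes the result back. It is not the repository's actual code: the function names, the `loaded_at` column, the Parquet output format, and the transformation itself are all assumptions. It also assumes the `hadoop-aws` connector and AWS credentials are already configured for Spark, which is what makes `s3a://` paths readable.

```python
def s3a_uri(bucket: str, key: str) -> str:
    """Build an s3a:// URI, the scheme used by Hadoop's S3A connector."""
    return f"s3a://{bucket.strip('/')}/{key.lstrip('/')}"


def run_etl(source_uri: str, target_uri: str) -> None:
    """Hypothetical ETL step: CSV in, cleaned Parquet out."""
    # Imported lazily so this module can be loaded without Spark installed.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("csv-etl-sketch").getOrCreate()
    try:
        # Extract: read the CSV, treating the first row as column names.
        df = spark.read.csv(source_uri, header=True, inferSchema=True)
        # Transform (placeholder): drop fully empty rows, stamp the load time.
        cleaned = df.dropna(how="all").withColumn(
            "loaded_at", F.current_timestamp()
        )
        # Load: overwrite the target location with Parquet files.
        cleaned.write.mode("overwrite").parquet(target_uri)
    finally:
        spark.stop()
```

A call such as `run_etl(s3a_uri("example-bucket", "raw/input.csv"), s3a_uri("example-bucket", "processed/output"))` would run the job; the bucket and keys here are made up.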

Dockerfile

The docker-compose.yaml file has been tested with Docker 27.2.0. The Dockerfile extends the Apache Airflow image and installs the additional dependencies that PySpark needs to run.

Dependencies

  • Java JDK 15.0.2
  • Apache Hadoop 3.3.6
  • Spark 3.5.2
  • Airflow 2.10.1

pip installs can be found here.
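
With Spark and Airflow installed in the same image, one plausible way to orchestrate the job is an Airflow task that shells out to `spark-submit`. The sketch below is an assumption about how that could look, not the DAG shipped in this repository: the DAG id, task id, script path `/opt/airflow/jobs/etl.py`, and the choice of `BashOperator` are all hypothetical.

```python
import shlex
from datetime import datetime


def spark_submit_command(script_path: str, master: str = "local[*]") -> str:
    """Return a shell-safe spark-submit invocation for a PySpark script."""
    return shlex.join(["spark-submit", "--master", master, script_path])


def build_dag():
    """Build a one-task DAG that runs the (hypothetical) ETL script."""
    # Imported lazily so the helper above works without Airflow installed.
    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(
        dag_id="pyspark_etl_sketch",      # hypothetical name
        start_date=datetime(2024, 1, 1),
        schedule=None,                    # no schedule: trigger manually
        catchup=False,
    ) as dag:
        BashOperator(
            task_id="run_spark_etl",
            # Hypothetical script location inside the Airflow container.
            bash_command=spark_submit_command("/opt/airflow/jobs/etl.py"),
        )
    return dag
```

In a real DAG file, the result would be assigned at module level (for example `dag = build_dag()`) so that the Airflow scheduler can discover it.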
