The main purpose of the project is to practice Apache Spark in Scala.
Project Overview:
This project leverages Scala to implement an Extract, Transform, and Load (ETL) pipeline. Data is extracted from various sources (CSV files and PostgreSQL databases), undergoes transformations and analysis, and is then loaded into three distinct sinks (CSV, Parquet, and PostgreSQL).
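As a rough sketch of how such a pipeline might bootstrap Spark, the snippet below creates a SparkSession; the application name and local master are assumptions for a local run, not details taken from the project.

```scala
import org.apache.spark.sql.SparkSession

// Entry point sketch: appName and the local[*] master are assumptions for local runs;
// a deployed job would typically receive the master from spark-submit instead.
val spark: SparkSession = SparkSession.builder()
  .appName("transactions-etl")
  .master("local[*]")
  .getOrCreate()
```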
Data Sources:
- Multiple CSV files
- PostgreSQL databases: Transaction Poland, Transaction France, Transaction China, Transaction USA
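A hedged sketch of the extract step, reusing the SparkSession above; the CSV path, JDBC URL, credentials, and table name are placeholder assumptions rather than the project's actual values.

```scala
import org.apache.spark.sql.DataFrame
import java.util.Properties

// Extract from a CSV file (path and options are illustrative assumptions).
val polandDf: DataFrame = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("data/transaction_poland.csv")

// Extract from PostgreSQL over JDBC (URL, credentials, and table name are placeholders).
val jdbcProps = new Properties()
jdbcProps.setProperty("user", "postgres")
jdbcProps.setProperty("password", "postgres")
jdbcProps.setProperty("driver", "org.postgresql.Driver")

val franceDf: DataFrame = spark.read
  .jdbc("jdbc:postgresql://localhost:5432/postgres", "transaction_france", jdbcProps)
```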
Data Transformations and Analysis:
The transformation and analysis layer covers typical steps such as data cleaning, filtering, and aggregation; more complex operations can be added depending on the data's nature and intended use.
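Assuming the DataFrames from the extract sketch and hypothetical column names (amount, country, transaction_date), the transformation stage might look roughly like this:

```scala
import org.apache.spark.sql.functions._

// Combine the per-country sources, then clean, filter, and aggregate.
// Column names (amount, country, transaction_date) are illustrative assumptions.
val transactions = polandDf.unionByName(franceDf)

val dailyTotals = transactions
  .na.drop(Seq("amount", "transaction_date"))        // cleaning: drop incomplete rows
  .filter(col("amount") > 0)                         // filtering: keep positive amounts
  .groupBy(col("country"), col("transaction_date"))  // aggregation per country and day
  .agg(sum("amount").as("total_amount"), count(lit(1)).as("transaction_count"))
```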
Data Sinks:
- CSV files
- Parquet files
- PostgreSQL databases
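A sketch of the load step for the three sinks, assuming the dailyTotals DataFrame and JDBC properties from the sketches above; output paths and the target table name are placeholders.

```scala
// Load into CSV (output path is an assumption).
dailyTotals.write
  .mode("overwrite")
  .option("header", "true")
  .csv("output/daily_totals_csv")

// Load into Parquet.
dailyTotals.write
  .mode("overwrite")
  .parquet("output/daily_totals_parquet")

// Load into PostgreSQL via JDBC (table name is a placeholder).
dailyTotals.write
  .mode("overwrite")
  .jdbc("jdbc:postgresql://localhost:5432/postgres", "daily_totals", jdbcProps)
```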
How to Run:
- Clone the project from GitHub
- Open project
- Build "postgres" Docker image cd PostgresSQL && docker build -t postgres .
- Start the Docker container (e.g. `docker run -d --name postgres -p 5432:5432 postgres`)
- Check the PostgreSQL connection: `docker exec -it postgres psql -U postgres -d postgres`, then list the tables with `\dt` inside the psql shell
- Run the Scala application (e.g. `sbt run`)