# Data Lake Project

Create high-grade data pipelines that are dynamic, built from reusable tasks, can be monitored, and allow easy backfills. The pipelines also run tests against the datasets after the ETL steps have executed to catch any discrepancies.

## Datasets

The project works with two datasets that reside in S3:

`s3://udacity-dend/song_data`

The first dataset is a subset of real data from the Million Song Dataset. Each file is in JSON format and contains metadata about a song and the artist of that song.

`s3://udacity-dend/log_data`

The second dataset consists of log files in JSON format generated by an event simulator based on the songs in the dataset above.

## Solution

Custom operators perform tasks such as staging the data, filling the data warehouse, and running checks on the data as the final step in Apache Airflow. The ETL pipeline loads song data and song play logs from S3 to populate the staging tables staging_event and staging_songs, then loads data into the final fact and dimension tables using multiple user-defined operators and a SubDAG.
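For orientation, here is a minimal sketch of how the main DAG could wire these operators together. Everything in it — the DAG id, connection ids, import paths, task names, and the `load_dimensions_subdag` factory — is an assumption for illustration (Airflow 1.10-style imports), not a description of the exact code in `dags/dag.py`:

```python
# Hypothetical sketch of dags/dag.py (Airflow 1.10-style imports assumed).
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.subdag_operator import SubDagOperator

# Assumed import paths for the custom plugins described below.
from operators.stage_redshift import StageToRedshiftOperator
from operators.load_fact import LoadFactOperator
from operators.data_quality import DataQualityOperator
from helpers.sql_queries import SqlQueries
from subdag import load_dimensions_subdag

DAG_NAME = "sparkify_etl"  # hypothetical DAG id

default_args = {
    "owner": "sparkify",
    "start_date": datetime(2019, 1, 12),
    "depends_on_past": False,
    "retries": 3,
    "retry_delay": timedelta(minutes=5),
}

dag = DAG(
    DAG_NAME,
    default_args=default_args,
    description="Load and transform song/log data in Redshift with Airflow",
    schedule_interval="@hourly",
)

start = DummyOperator(task_id="Begin_execution", dag=dag)

stage_events = StageToRedshiftOperator(
    task_id="Stage_events", dag=dag,
    redshift_conn_id="redshift", aws_credentials_id="aws_credentials",
    table="staging_events", s3_path="s3://udacity-dend/log_data",
)

stage_songs = StageToRedshiftOperator(
    task_id="Stage_songs", dag=dag,
    redshift_conn_id="redshift", aws_credentials_id="aws_credentials",
    table="staging_songs", s3_path="s3://udacity-dend/song_data",
)

load_songplays = LoadFactOperator(
    task_id="Load_songplays_fact_table", dag=dag,
    redshift_conn_id="redshift", table="songplays",
    select_sql=SqlQueries.songplay_table_insert,
)

load_dimensions = SubDagOperator(
    task_id="Load_dimensions", dag=dag,
    subdag=load_dimensions_subdag(
        DAG_NAME, "Load_dimensions", default_args, schedule_interval="@hourly"
    ),
)

quality_checks = DataQualityOperator(
    task_id="Run_data_quality_checks", dag=dag,
    redshift_conn_id="redshift",
    checks=[{"sql": "SELECT COUNT(*) FROM users WHERE userid IS NULL",
             "expected": 0}],
)

end = DummyOperator(task_id="Stop_execution", dag=dag)

# Staging runs in parallel, then fact, dimensions, and quality checks in order.
start >> stage_events
start >> stage_songs
stage_events >> load_songplays
stage_songs >> load_songplays
load_songplays >> load_dimensions
load_dimensions >> quality_checks
quality_checks >> end
```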

## Scripts

`dags/dag.py` has all the imports and tasks of the main DAG in place.

`dags/subdag.py` has all the imports and tasks to create and load the dimension tables.
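A SubDAG for the dimension loads is typically built from a factory function that returns a child DAG. The sketch below is hypothetical; the factory name, the `SqlQueries` attribute names, and the connection id are assumptions:

```python
# Hypothetical sketch of dags/subdag.py; names and import paths are assumptions.
from airflow import DAG

from operators.load_dimension import LoadDimensionOperator
from helpers.sql_queries import SqlQueries


def load_dimensions_subdag(parent_dag_name, child_dag_name, default_args, **kwargs):
    """Return a child DAG whose tasks each load one dimension table."""
    dag = DAG(
        dag_id=f"{parent_dag_name}.{child_dag_name}",  # required SubDAG id format
        default_args=default_args,
        **kwargs,
    )

    # Table name -> SELECT statement; attribute names are assumptions.
    dimensions = {
        "users": SqlQueries.user_table_insert,
        "songs": SqlQueries.song_table_insert,
        "artists": SqlQueries.artist_table_insert,
        "time": SqlQueries.time_table_insert,
    }

    for table, select_sql in dimensions.items():
        LoadDimensionOperator(
            task_id=f"Load_{table}_dim_table",
            dag=dag,
            redshift_conn_id="redshift",
            table=table,
            select_sql=select_sql,
            truncate=True,  # truncate-insert keeps dimensions free of duplicates
        )

    return dag
```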

`plugins/helpers/sql_queries.py` defines the SQL statements that create the tables and populate the fact and dimension tables using data from the staging tables.
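As a rough idea of the shape of that helper, a sketch follows; the class name, attribute names, and column lists are assumptions, and the real file defines more statements than shown here:

```python
# Hypothetical sketch of plugins/helpers/sql_queries.py; columns are assumptions.
class SqlQueries:
    # Build the songplays fact rows by joining the two staging tables.
    songplay_table_insert = """
        SELECT
            md5(events.sessionid || events.start_time) AS playid,
            events.start_time,
            events.userid,
            events.level,
            songs.song_id,
            songs.artist_id,
            events.sessionid,
            events.location,
            events.useragent
        FROM (SELECT TIMESTAMP 'epoch' + ts/1000 * INTERVAL '1 second' AS start_time, *
              FROM staging_events
              WHERE page = 'NextSong') events
        LEFT JOIN staging_songs songs
            ON events.song = songs.title
           AND events.artist = songs.artist_name
    """

    # One SELECT per dimension; only users is shown here.
    user_table_insert = """
        SELECT DISTINCT userid, firstname, lastname, gender, level
        FROM staging_events
        WHERE page = 'NextSong'
    """
```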

`plugins/operators/stage_redshift.py` loads data from S3 into the staging tables on Redshift.
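A staging operator of this kind usually wraps a Redshift `COPY` command. The following is a hedged sketch assuming Airflow 1.10-style hooks (`PostgresHook`, `AwsHook`) and hypothetical parameter names such as `s3_path` and `json_option`:

```python
# Hypothetical sketch of plugins/operators/stage_redshift.py.
from airflow.contrib.hooks.aws_hook import AwsHook
from airflow.hooks.postgres_hook import PostgresHook
from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults


class StageToRedshiftOperator(BaseOperator):
    """Copy JSON files from S3 into a Redshift staging table."""

    copy_sql = """
        COPY {table}
        FROM '{s3_path}'
        ACCESS_KEY_ID '{access_key}'
        SECRET_ACCESS_KEY '{secret_key}'
        FORMAT AS JSON '{json_option}'
    """

    @apply_defaults
    def __init__(self, redshift_conn_id="", aws_credentials_id="",
                 table="", s3_path="", json_option="auto", *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.redshift_conn_id = redshift_conn_id
        self.aws_credentials_id = aws_credentials_id
        self.table = table
        self.s3_path = s3_path
        # A JSONPaths file can be supplied here if column names differ from the JSON keys.
        self.json_option = json_option

    def execute(self, context):
        credentials = AwsHook(self.aws_credentials_id).get_credentials()
        redshift = PostgresHook(postgres_conn_id=self.redshift_conn_id)
        # Clear the staging table so reruns do not accumulate duplicates.
        redshift.run(f"DELETE FROM {self.table}")
        redshift.run(StageToRedshiftOperator.copy_sql.format(
            table=self.table,
            s3_path=self.s3_path,
            access_key=credentials.access_key,
            secret_key=credentials.secret_key,
            json_option=self.json_option,
        ))
```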

`plugins/operators/load_fact.py` populates fact tables using data from the staging tables.
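A minimal sketch of such a fact-load operator, with assumed parameter names, might look like this:

```python
# Hypothetical sketch of plugins/operators/load_fact.py.
from airflow.hooks.postgres_hook import PostgresHook
from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults


class LoadFactOperator(BaseOperator):
    """Append rows into a fact table from a SELECT over the staging tables."""

    @apply_defaults
    def __init__(self, redshift_conn_id="", table="", select_sql="", *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.redshift_conn_id = redshift_conn_id
        self.table = table
        self.select_sql = select_sql

    def execute(self, context):
        redshift = PostgresHook(postgres_conn_id=self.redshift_conn_id)
        # Fact tables are typically append-only, so no truncate here.
        redshift.run(f"INSERT INTO {self.table} {self.select_sql}")
```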

`plugins/operators/load_dimension.py` populates dimension tables using data from the staging tables.
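Dimension loads commonly use a truncate-insert pattern so that reruns and backfills do not duplicate rows. A hypothetical sketch, with the `truncate` flag and parameter names as assumptions:

```python
# Hypothetical sketch of plugins/operators/load_dimension.py.
from airflow.hooks.postgres_hook import PostgresHook
from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults


class LoadDimensionOperator(BaseOperator):
    """Load a dimension table, optionally truncating it first (truncate-insert)."""

    @apply_defaults
    def __init__(self, redshift_conn_id="", table="", select_sql="",
                 truncate=True, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.redshift_conn_id = redshift_conn_id
        self.table = table
        self.select_sql = select_sql
        self.truncate = truncate

    def execute(self, context):
        redshift = PostgresHook(postgres_conn_id=self.redshift_conn_id)
        if self.truncate:
            # Empty the dimension first so the load is idempotent.
            redshift.run(f"TRUNCATE TABLE {self.table}")
        redshift.run(f"INSERT INTO {self.table} {self.select_sql}")
```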

`plugins/operators/data_quality.py` checks data quality after the data is loaded.
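One common way to implement this is to run a list of SQL checks and compare each result to an expected value. The sketch below assumes that shape for a `checks` argument; it is not necessarily the interface used in this repository:

```python
# Hypothetical sketch of plugins/operators/data_quality.py.
from airflow.hooks.postgres_hook import PostgresHook
from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults


class DataQualityOperator(BaseOperator):
    """Run SQL checks and fail the task if any result differs from the expected value."""

    @apply_defaults
    def __init__(self, redshift_conn_id="", checks=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.redshift_conn_id = redshift_conn_id
        # Each check is a dict like {"sql": "SELECT COUNT(*) ...", "expected": 0}.
        self.checks = checks or []

    def execute(self, context):
        redshift = PostgresHook(postgres_conn_id=self.redshift_conn_id)
        for check in self.checks:
            records = redshift.get_records(check["sql"])
            if not records or records[0][0] != check["expected"]:
                raise ValueError(f"Data quality check failed: {check['sql']}")
            self.log.info("Data quality check passed: %s", check["sql"])
```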

## Running the tests

Start the Airflow web server and run the main DAG through the Airflow UI.