
Data Pipelines Project

Create high-grade data pipelines that are dynamic, built from reusable tasks, can be monitored, and allow easy backfills. The pipeline also runs tests against the datasets after the ETL steps have executed to catch any discrepancies.

Datasets

The project works with two datasets that reside in S3:

s3://udacity-dend/song_data

The first dataset is a subset of real data from the Million Song Dataset. Each file is in JSON format and contains metadata about a song and the artist of that song.

s3://udacity-dend/log_data

The second dataset consists of log files in JSON format generated by an event simulator based on the songs in the dataset above.
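
For orientation, a single record from each dataset has roughly the following shape. The values below are placeholders for illustration and the field lists are typical of the Udacity versions of these datasets; the exact contents may differ.

```python
# Illustrative record shapes only: placeholder values, field names as commonly
# seen in the Udacity song_data and log_data datasets.

song_record = {
    "num_songs": 1,
    "artist_id": "ARXXXXXXXXXXXXXXXX",
    "artist_name": "Example Artist",
    "artist_location": "",
    "artist_latitude": None,
    "artist_longitude": None,
    "song_id": "SOXXXXXXXXXXXXXXXX",
    "title": "Example Song",
    "duration": 218.93,
    "year": 0,
}

log_record = {
    "artist": "Example Artist",
    "song": "Example Song",
    "length": 218.93,
    "firstName": "Jane",
    "lastName": "Doe",
    "gender": "F",
    "level": "free",
    "userId": "26",
    "sessionId": 123,
    "itemInSession": 0,
    "page": "NextSong",
    "location": "San Francisco-Oakland-Hayward, CA",
    "userAgent": "Mozilla/5.0",
    "ts": 1542241826796,
}
```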

Solution

Custom operators perform tasks such as staging the data, filling the data warehouse, and running checks on the data as the final step, all orchestrated by Apache Airflow. The ETL pipeline loads song data and song play logs from S3 to populate the staging tables staging_event and staging_songs, then loads data into the final fact and dimension tables using multiple user-defined operators and a SubDAG.
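
As a rough sketch of how the main DAG in dags/dag.py might wire these pieces together (the DAG name, task IDs, schedule, and default arguments below are illustrative assumptions, not necessarily the exact values used in this repository):

```python
# Sketch of the main DAG wiring (Airflow 1.x style). Names below are assumptions.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

default_args = {
    "owner": "udacity",
    "depends_on_past": False,
    "retries": 3,
    "retry_delay": timedelta(minutes=5),
}

dag = DAG(
    "sparkify_etl",
    default_args=default_args,
    description="Load and transform data in Redshift with Airflow",
    schedule_interval="@hourly",
    start_date=datetime(2019, 1, 12),
    catchup=False,
)

start = DummyOperator(task_id="Begin_execution", dag=dag)
end = DummyOperator(task_id="Stop_execution", dag=dag)

# The staging, fact, dimension, and data-quality tasks would be built from the
# custom operators under plugins/operators/ and the SubDAG in dags/subdag.py,
# for example:
#
#   stage_events = StageToRedshiftOperator(task_id="Stage_events", dag=dag, ...)
#   stage_songs = StageToRedshiftOperator(task_id="Stage_songs", dag=dag, ...)
#   load_songplays = LoadFactOperator(task_id="Load_songplays_fact_table", dag=dag, ...)
#   load_dimensions = SubDagOperator(subdag=..., task_id="Load_dimension_tables", dag=dag)
#   run_quality_checks = DataQualityOperator(task_id="Run_data_quality_checks", dag=dag, ...)
#
# Dependency graph: stage both datasets in parallel, load the fact table, then
# the dimension tables via the SubDAG, then run the quality checks:
#
#   start >> [stage_events, stage_songs] >> load_songplays
#   load_songplays >> load_dimensions >> run_quality_checks >> end
```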

Scripts

dags/dag.py defines the main DAG, with all imports, tasks, and task dependencies in place.

dags/subdag.py defines the SubDAG that creates and loads the dimension tables.

plugins/helpers/sql_queries.py defines the SQL statements that create the tables and populate the fact and dimension tables from the staging tables.

plugins/operators/stage_redshift.py defines the operator that loads data from S3 into the staging tables on Redshift (a sketch of this kind of operator appears below).

plugins/operators/load_fact.py defines the operator that populates the fact table from the staging tables.

plugins/operators/load_dimension.py defines the operator that populates the dimension tables from the staging tables.

plugins/operators/data_quality.py defines the operator that checks data quality after the load (see the second sketch below).
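
To illustrate the staging step, here is a minimal sketch of an S3-to-Redshift operator in the spirit of stage_redshift.py; the constructor arguments, COPY options, and credential handling are assumptions and may differ from the code in this repository.

```python
# Minimal sketch of an S3-to-Redshift staging operator (Airflow 1.x imports).
from airflow.contrib.hooks.aws_hook import AwsHook
from airflow.hooks.postgres_hook import PostgresHook
from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults


class StageToRedshiftOperator(BaseOperator):
    """COPY JSON files from S3 into a Redshift staging table."""

    copy_sql = """
        COPY {table}
        FROM '{s3_path}'
        ACCESS_KEY_ID '{access_key}'
        SECRET_ACCESS_KEY '{secret_key}'
        FORMAT AS JSON '{json_option}'
    """

    @apply_defaults
    def __init__(self, redshift_conn_id="", aws_credentials_id="",
                 table="", s3_path="", json_option="auto", *args, **kwargs):
        super(StageToRedshiftOperator, self).__init__(*args, **kwargs)
        self.redshift_conn_id = redshift_conn_id
        self.aws_credentials_id = aws_credentials_id
        self.table = table
        self.s3_path = s3_path
        self.json_option = json_option

    def execute(self, context):
        # Pull AWS credentials from the Airflow connection and run the COPY.
        credentials = AwsHook(self.aws_credentials_id).get_credentials()
        redshift = PostgresHook(postgres_conn_id=self.redshift_conn_id)
        redshift.run(StageToRedshiftOperator.copy_sql.format(
            table=self.table,
            s3_path=self.s3_path,
            access_key=credentials.access_key,
            secret_key=credentials.secret_key,
            json_option=self.json_option,
        ))
```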
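
And a minimal sketch of a data-quality operator that fails the run when a table comes back empty; the actual checks performed by data_quality.py may be different.

```python
# Minimal sketch of a data-quality operator: fail if any listed table is empty.
from airflow.hooks.postgres_hook import PostgresHook
from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults


class DataQualityOperator(BaseOperator):
    """Run row-count checks against a list of tables after the load."""

    @apply_defaults
    def __init__(self, redshift_conn_id="", tables=None, *args, **kwargs):
        super(DataQualityOperator, self).__init__(*args, **kwargs)
        self.redshift_conn_id = redshift_conn_id
        self.tables = tables or []

    def execute(self, context):
        redshift = PostgresHook(postgres_conn_id=self.redshift_conn_id)
        for table in self.tables:
            records = redshift.get_records(f"SELECT COUNT(*) FROM {table}")
            if not records or not records[0] or records[0][0] < 1:
                raise ValueError(f"Data quality check failed: {table} returned no rows")
            self.log.info(f"Data quality check on {table} passed with {records[0][0]} records")
```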

Running the tests

Start the Airflow web server and run the main DAG through the Airflow UI.
