Create high-grade data pipelines that are dynamic, built from reusable tasks, can be monitored, and allow easy backfills. The pipeline also runs tests against the datasets after the ETL steps have executed, to catch any discrepancies in the data.
The project works with two datasets that reside in S3:
s3://udacity-dend/song_data
The first dataset is a subset of real data from the Million Song Dataset. Each file is in JSON format and contains metadata about a song and the artist of that song.
s3://udacity-dend/log_data
The second dataset consists of log files in JSON format generated by an event simulator based on the songs in the dataset above.
Create custom operators to perform tasks such as staging the data, filling the data warehouse, and running checks on the data as the final step in Apache Airflow. The ETL pipeline loads the song data and song-play logs from S3 to populate the staging tables staging_events and staging_songs, then loads the data into the final fact and dimension tables using multiple user-defined operators and a subDAG.
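Below is a minimal sketch of how dags/dag.py might wire these pieces together. The Airflow 1.10-style imports, task IDs, connection IDs, and operator arguments are illustrative assumptions, not the repository's exact code; the actual signatures live in the operator files listed below, and the dimension-loading subDAG is omitted for brevity.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

# Assumes the plugins/ directory is importable as a package; adjust to match the repo.
from operators.stage_redshift import StageToRedshiftOperator
from operators.load_fact import LoadFactOperator
from operators.data_quality import DataQualityOperator

default_args = {
    "owner": "sparkify",
    "start_date": datetime(2019, 1, 12),
    "depends_on_past": False,
    "retries": 3,
    "retry_delay": timedelta(minutes=5),
    "email_on_retry": False,
}

dag = DAG(
    "etl_pipeline",
    default_args=default_args,
    description="Load and transform data in Redshift with Airflow",
    schedule_interval="@hourly",
    catchup=False,
)

start = DummyOperator(task_id="Begin_execution", dag=dag)

# Copy raw JSON from S3 into the two staging tables.
stage_events = StageToRedshiftOperator(
    task_id="Stage_events",
    dag=dag,
    redshift_conn_id="redshift",
    aws_credentials_id="aws_credentials",
    table="staging_events",
    s3_bucket="udacity-dend",
    s3_key="log_data",
)

stage_songs = StageToRedshiftOperator(
    task_id="Stage_songs",
    dag=dag,
    redshift_conn_id="redshift",
    aws_credentials_id="aws_credentials",
    table="staging_songs",
    s3_bucket="udacity-dend",
    s3_key="song_data",
)

# Populate the songplays fact table from the staging tables.
load_songplays = LoadFactOperator(
    task_id="Load_songplays_fact_table",
    dag=dag,
    redshift_conn_id="redshift",
    table="songplays",
)

# Verify the loaded tables are not empty. The dimension-table subDAG from
# dags/subdag.py would sit between the fact load and this check.
quality_checks = DataQualityOperator(
    task_id="Run_data_quality_checks",
    dag=dag,
    redshift_conn_id="redshift",
    tables=["songplays", "users", "songs", "artists", "time"],
)

end = DummyOperator(task_id="Stop_execution", dag=dag)

start >> [stage_events, stage_songs]
[stage_events, stage_songs] >> load_songplays
load_songplays >> quality_checks >> end
```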
dags/dag.py defines the main DAG, with all imports and tasks in place
dags/subdag.py defines the subDAG that creates and loads the dimension tables
plugins/helpers/sql_queries.py defines the SQL statements that create tables and populate the fact and dimension tables from the staging tables (see the sketch after this list)
plugins/operators/stage_redshift.py loads data from S3 into the staging tables on Redshift
plugins/operators/load_fact.py populates fact tables using data from the staging tables
plugins/operators/load_dimension.py populates dimension tables using data from the staging tables
plugins/operators/data_quality.py checks data quality after the data is loaded (a minimal operator sketch also follows this list)
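As a rough illustration of the helper module, plugins/helpers/sql_queries.py typically gathers plain SQL strings in a class that the operators reference. The column names below follow a typical Sparkify star schema and are assumptions, not necessarily the repository's exact statements.

```python
# Sketch of plugins/helpers/sql_queries.py: SQL kept as class attributes so
# operators can pass the statements to Redshift. Column names are assumed.
class SqlQueries:
    songplay_table_insert = ("""
        SELECT
            md5(events.sessionid || events.start_time) AS songplay_id,
            events.start_time,
            events.userid,
            events.level,
            songs.song_id,
            songs.artist_id,
            events.sessionid,
            events.location,
            events.useragent
        FROM (
            SELECT TIMESTAMP 'epoch' + ts/1000 * INTERVAL '1 second' AS start_time, *
            FROM staging_events
            WHERE page = 'NextSong'
        ) events
        LEFT JOIN staging_songs songs
            ON events.song = songs.title
           AND events.artist = songs.artist_name
           AND events.length = songs.duration
    """)

    user_table_insert = ("""
        SELECT DISTINCT userid, firstname, lastname, gender, level
        FROM staging_events
        WHERE page = 'NextSong'
    """)
```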
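Likewise, here is a minimal sketch of one of the custom operators, using plugins/operators/data_quality.py as the example; the constructor arguments and the count-based check are assumptions for illustration.

```python
from airflow.hooks.postgres_hook import PostgresHook
from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults


class DataQualityOperator(BaseOperator):
    """Fail the task if any of the given tables is empty after the load."""

    ui_color = "#89DA59"

    @apply_defaults
    def __init__(self, redshift_conn_id="", tables=None, *args, **kwargs):
        super(DataQualityOperator, self).__init__(*args, **kwargs)
        self.redshift_conn_id = redshift_conn_id
        self.tables = tables or []

    def execute(self, context):
        redshift = PostgresHook(postgres_conn_id=self.redshift_conn_id)
        for table in self.tables:
            records = redshift.get_records(f"SELECT COUNT(*) FROM {table}")
            if not records or not records[0] or records[0][0] < 1:
                raise ValueError(f"Data quality check failed: {table} returned no rows")
            self.log.info(f"Data quality check on {table} passed with {records[0][0]} records")
```

Raising an exception from execute() marks the task as failed, so a bad load stops the DAG before downstream consumers ever see the discrepancy.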
Start the Airflow web server and run the main DAG through the Airflow UI.