The DeFtunes Music Purchase Data Pipeline is an end-to-end solution designed to enable data analytics for a new music purchase feature at DeFtunes—a subscription-based music streaming service. This pipeline ingests, transforms, and stores purchase data to facilitate comprehensive analysis on song purchases, user behavior, and service trends.
The data model is designed in a star schema format to optimize analytical queries, centered around a fact table and multiple dimension tables.
- fact_session: Captures details of each song purchase session.
- dim_songs: Contains song details such as title, release year, and track ID.
- dim_artists: Provides artist information including artist name and MusicBrainz Identifier.
- dim_users: Stores user data like name, subscription date, location, and country code.
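As a hedged illustration of the kind of analytical query this star schema serves, the sketch below runs a fact-to-dimension join through the Redshift Data API CLI; the workgroup name, database name, and the `artist_id` join key are assumptions for illustration, not values taken from the actual deployment:

```bash
# Hypothetical star-schema query via the Redshift Data API.
# Workgroup, database, and the artist_id join key are assumptions;
# a provisioned cluster would use --cluster-identifier plus credentials
# instead of --workgroup-name.
aws redshift-data execute-statement \
  --workgroup-name <workgroup-name> \
  --database dev \
  --sql "SELECT a.artist_name, COUNT(*) AS purchases
         FROM fact_session f
         JOIN dim_artists a ON f.artist_id = a.artist_id
         GROUP BY a.artist_name
         ORDER BY purchases DESC
         LIMIT 10;"
```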
The pipeline is orchestrated using Apache Airflow and includes two DAGs: `deftunes_api_pipeline_dag` and `deftunes_songs_pipeline_dag`.
Follow the steps below to set up and run the data pipeline:
Before initializing Terraform, make sure the AWS environment is set up through the `template.yml` file, which provisions VS Code on an EC2 instance.
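If the environment hasn't been created yet, the template can be deployed as a CloudFormation stack. This is a minimal sketch, assuming `template.yml` is a standard CloudFormation template; the stack name is a placeholder:

```bash
# Deploy the EC2/VS Code environment from the template (stack name is a placeholder)
aws cloudformation deploy \
  --template-file template.yml \
  --stack-name deftunes-dev-env \
  --capabilities CAPABILITY_NAMED_IAM
```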
- Initialize Terraform

  ```bash
  cd terraform
  terraform init
  ```
- Apply Terraform Configurations

  Extract data:

  ```bash
  terraform apply -target=module.extract_job
  ```

  Transform data:

  ```bash
  terraform apply -target=module.transform_job
  ```

  Set up the serving layer:

  ```bash
  terraform apply -target=module.serving
  ```
- Run AWS Glue Jobs

  Use the outputs from Terraform to execute the AWS Glue jobs that create the necessary tables, as sketched below.
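  A hedged sketch of doing this from the AWS CLI follows; `<glue-job-name>` is a placeholder for whichever job names the Terraform outputs report:

  ```bash
  # List Terraform outputs to find the Glue job names (output names vary by module)
  terraform output

  # Start a Glue job run; <glue-job-name> is a placeholder for an output value
  aws glue start-job-run --job-name <glue-job-name>

  # Check the run status using the JobRunId returned by the previous command
  aws glue get-job-run --job-name <glue-job-name> --run-id <run-id>
  ```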
- Apply Data Quality Configuration

  ```bash
  terraform apply -target=module.data_quality
  ```
- Run Airflow DAGs

  Trigger the following DAGs in Airflow to execute the data pipeline (a CLI sketch follows the list):

  - `deftunes_api_pipeline_dag`
  - `deftunes_songs_pipeline_dag`
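  Assuming CLI access to the Airflow environment, the DAGs can also be triggered with the standard Airflow CLI instead of the UI:

  ```bash
  # Unpause (if needed) and trigger each pipeline DAG
  airflow dags unpause deftunes_api_pipeline_dag
  airflow dags trigger deftunes_api_pipeline_dag
  airflow dags unpause deftunes_songs_pipeline_dag
  airflow dags trigger deftunes_songs_pipeline_dag
  ```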
The pipeline is built on the following technology stack:

- Data Extraction and Transformation: AWS Glue, Apache Iceberg
- Data Storage: AWS S3, Amazon Redshift Spectrum
- Orchestration: Apache Airflow
- Data Modeling: dbt (Data Build Tool)
- Visualization: Apache Superset
- Infrastructure as Code: Terraform