Sparkify, a music streaming startup, has grown its user base and song database and wants to move its processes and data onto the cloud. Its data resides in S3, in a directory of JSON logs of user activity in the app, as well as a directory of JSON metadata on the songs in the app.
This project builds an ETL pipeline that extracts the data from S3, stages it in Redshift, and transforms it into a set of dimensional tables for the analytics team to continue finding insights into what songs their users are listening to.
The schema has two main parts: a fact table and dimension tables. Note: PK stands for primary key and FK stands for foreign key.
Fact Table
- songplays - records in log data associated with song plays, i.e. records with page NextSong
    - songplay_id (PK)
    - start_time (FK to time table)
    - user_id (FK to users table)
    - level
    - song_id (FK to songs table)
    - artist_id (FK to artists table)
    - session_id
    - location
    - user_agent
Dimension Tables
- users - users in the app
    - user_id (PK)
    - first_name
    - last_name
    - gender
    - level
- songs - songs in the music database
    - song_id (PK)
    - title
    - artist_id (FK to artists table)
    - year
    - duration
- artists - artists in the music database
    - artist_id (PK)
    - name
    - location
    - latitude
    - longitude
- time - timestamps of records in songplays broken down into specific units
    - start_time (PK)
    - hour
    - day
    - week
    - month
    - year
    - weekday
The songplays table is at the center of this star schema, and all four dimension tables branch from it.
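For illustration, the tables can be created with Redshift-style DDL similar to the statements kept in sql_queries.py. The snippet below is only a sketch: the exact column types, constraints, and any distribution or sort keys used in the project may differ.

```python
# Sketch only: Redshift-style CREATE TABLE statements for the fact table and one
# dimension table. Types and constraints here are assumptions, not the project's
# exact definitions (see sql_queries.py for the real ones).

songplay_table_create = """
CREATE TABLE IF NOT EXISTS songplays (
    songplay_id INT IDENTITY(0, 1) PRIMARY KEY,
    start_time  TIMESTAMP NOT NULL REFERENCES time (start_time),
    user_id     INT       NOT NULL REFERENCES users (user_id),
    level       VARCHAR,
    song_id     VARCHAR REFERENCES songs (song_id),
    artist_id   VARCHAR REFERENCES artists (artist_id),
    session_id  INT,
    location    VARCHAR,
    user_agent  VARCHAR
);
"""

user_table_create = """
CREATE TABLE IF NOT EXISTS users (
    user_id    INT PRIMARY KEY,
    first_name VARCHAR,
    last_name  VARCHAR,
    gender     VARCHAR,
    level      VARCHAR
);
"""
```

In Redshift, primary and foreign key constraints are informational only (they are not enforced), but they document the star-schema relationships between the fact and dimension tables.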
- Load data from S3 into the staging tables staging_events and staging_songs on Redshift.
- Load data from the staging tables into the fact and dimension tables above on Redshift (a sketch of both steps follows this list).
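As a rough sketch of those two steps (the real statements live in sql_queries.py): the staging load is a COPY from S3, and the analytics load is an INSERT ... SELECT from the staging tables. The S3 paths, IAM role, region, and staging column names below are placeholders and assumptions.

```python
# Illustrative only: placeholder S3 paths, IAM role, region, and assumed staging
# column names; the project's actual COPY and INSERT statements are in sql_queries.py.

staging_events_copy = """
COPY staging_events
FROM 's3://<your-bucket>/log_data'
IAM_ROLE '<your-iam-role-arn>'
REGION '<your-region>'
FORMAT AS JSON 's3://<your-bucket>/log_json_path.json';
"""

songplay_table_insert = """
INSERT INTO songplays (start_time, user_id, level, song_id, artist_id,
                       session_id, location, user_agent)
SELECT DISTINCT
    TIMESTAMP 'epoch' + e.ts / 1000 * INTERVAL '1 second' AS start_time,
    e.userId,
    e.level,
    s.song_id,
    s.artist_id,
    e.sessionId,
    e.location,
    e.userAgent
FROM staging_events e
LEFT JOIN staging_songs s
    ON e.song = s.title
   AND e.artist = s.artist_name
WHERE e.page = 'NextSong';
"""
```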
- create_tables.py drops any existing tables in the database and creates all tables.
- dwh.cfg contains the configuration and connection information for the database.
- etl.py loads data into the staging tables and inserts data into the analytics tables (see the sketch after this list).
- sql_queries.py contains all queries used by create_tables.py and etl.py.
- README.md provides information about the project.
- The images folder contains the ER diagram for this project.
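For orientation, etl.py roughly follows the flow sketched below: read dwh.cfg, connect to Redshift, run the COPY statements, then run the INSERT statements. The function names, query-list names (copy_table_queries, insert_table_queries), and config section used here are illustrative assumptions rather than the project's exact identifiers.

```python
# Rough sketch of etl.py's flow; the query-list names and config keys are
# assumptions, not necessarily the project's exact identifiers.
import configparser
import psycopg2

from sql_queries import copy_table_queries, insert_table_queries  # assumed names


def load_staging_tables(cur, conn):
    """Run the COPY statements that load the S3 data into the staging tables."""
    for query in copy_table_queries:
        cur.execute(query)
        conn.commit()


def insert_tables(cur, conn):
    """Run the INSERT statements that populate the fact and dimension tables."""
    for query in insert_table_queries:
        cur.execute(query)
        conn.commit()


def main():
    config = configparser.ConfigParser()
    config.read('dwh.cfg')

    # Assumes a [CLUSTER] section with host, dbname, user, password, and port.
    conn = psycopg2.connect(
        "host={} dbname={} user={} password={} port={}".format(*config['CLUSTER'].values())
    )
    cur = conn.cursor()

    load_staging_tables(cur, conn)
    insert_tables(cur, conn)

    conn.close()


if __name__ == "__main__":
    main()
```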
- Run create_tables.py first to create all tables on Redshift.
- Then run etl.py to load the data into the tables on Redshift.