The startup Sparkify has outgrown its songs and users database, and the team wants to move its processes and data into the cloud. Sparkify's data sources are two public S3 buckets, whose data is stored in directories of JSON logs of user activity on the app. Redshift is the service where the data is ingested and transformed: through the COPY command we access the JSON files inside the buckets and load their content into our staging tables. In this project we use two Amazon Web Services: S3 (data storage) and Redshift (data warehouse).
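As a rough illustration of that ingestion step, a staging COPY might look like the string below, written in the same style as the project's SQL module. The bucket path, JSONPaths file and IAM role ARN are placeholders, not the project's real values.

```python
# Hedged sketch of a staging-table COPY. The bucket path, JSONPaths file
# and IAM role ARN are placeholders, not the project's real values.
staging_events_copy = """
    COPY staging_events
    FROM 's3://<log-data-bucket>/log_data'
    IAM_ROLE '<your-iam-role-arn>'
    FORMAT AS JSON 's3://<log-data-bucket>/log_json_path.json'
    REGION 'us-west-2';
"""
```

The repository is organised into the following files: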
- dhw.cfg - Configuration file that contains info about the Redshift cluster, the IAM role and S3 (a configparser sketch of how the scripts read it follows this list).
- create_tables.py - Python script with two functions, 'drop_tables' and 'create_tables', which drop and (re)create the tables (see the sketch after this list).
- sql_queries.py - This file contains variables holding SQL statements as strings, grouped into CREATE, DROP, COPY and INSERT statements.
- etl.py - Python script that copies the JSON data from the log files into the staging tables and then loads the other tables (see the sketch after this list).
- IaC.ipynb - Notebook with the infrastructure-as-code steps to create and configure the IAM role and Redshift cluster objects (a hedged boto3 sketch appears at the end of this README).
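The Python scripts are assumed to read dhw.cfg with Python's configparser; a minimal sketch is below. The section and key names ([CLUSTER], [IAM_ROLE], [S3]) are assumptions about the file's layout, not its verified contents.

```python
import configparser

# Hedged sketch: read the project configuration file.
# The section/key names below are assumptions about dhw.cfg's layout.
config = configparser.ConfigParser()
config.read('dhw.cfg')

host = config.get('CLUSTER', 'HOST')
db_name = config.get('CLUSTER', 'DB_NAME')
db_user = config.get('CLUSTER', 'DB_USER')
db_password = config.get('CLUSTER', 'DB_PASSWORD')
db_port = config.get('CLUSTER', 'DB_PORT')

iam_role_arn = config.get('IAM_ROLE', 'ARN')
log_data_path = config.get('S3', 'LOG_DATA')
```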
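A minimal sketch of what create_tables.py does, assuming sql_queries exposes drop_table_queries and create_table_queries lists (those names, and the [CLUSTER] keys, are assumptions):

```python
import configparser
import psycopg2

# Assumed names: sql_queries is expected to expose these two lists of SQL strings.
from sql_queries import create_table_queries, drop_table_queries


def drop_tables(cur, conn):
    # Drop every table so the script can be re-run from a clean state.
    for query in drop_table_queries:
        cur.execute(query)
        conn.commit()


def create_tables(cur, conn):
    # Create the staging and analytics tables.
    for query in create_table_queries:
        cur.execute(query)
        conn.commit()


def main():
    config = configparser.ConfigParser()
    config.read('dhw.cfg')

    conn = psycopg2.connect(
        host=config.get('CLUSTER', 'HOST'),
        dbname=config.get('CLUSTER', 'DB_NAME'),
        user=config.get('CLUSTER', 'DB_USER'),
        password=config.get('CLUSTER', 'DB_PASSWORD'),
        port=config.get('CLUSTER', 'DB_PORT'),
    )
    cur = conn.cursor()

    drop_tables(cur, conn)
    create_tables(cur, conn)

    conn.close()


if __name__ == "__main__":
    main()
```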
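And a rough sketch of the flow in etl.py, assuming sql_queries also exposes copy_table_queries and insert_table_queries lists (again assumed names):

```python
import configparser
import psycopg2

# Assumed names: lists of COPY and INSERT ... SELECT statements in sql_queries.
from sql_queries import copy_table_queries, insert_table_queries


def load_staging_tables(cur, conn):
    # COPY the JSON files from S3 into the staging tables.
    for query in copy_table_queries:
        cur.execute(query)
        conn.commit()


def insert_tables(cur, conn):
    # Transform the staged rows into the other tables with INSERT ... SELECT.
    for query in insert_table_queries:
        cur.execute(query)
        conn.commit()


def main():
    config = configparser.ConfigParser()
    config.read('dhw.cfg')

    conn = psycopg2.connect(
        host=config.get('CLUSTER', 'HOST'),
        dbname=config.get('CLUSTER', 'DB_NAME'),
        user=config.get('CLUSTER', 'DB_USER'),
        password=config.get('CLUSTER', 'DB_PASSWORD'),
        port=config.get('CLUSTER', 'DB_PORT'),
    )
    cur = conn.cursor()

    load_staging_tables(cur, conn)
    insert_tables(cur, conn)

    conn.close()


if __name__ == "__main__":
    main()
```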
In this project most of the ETL is done with SQL (Python is used just as a bridge); transformation and data normalization are done by query, so check out the sql_queries Python module (an illustrative transformation query follows below).
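To make the "transformation by query" point concrete, here is an illustrative example of the kind of INSERT ... SELECT statement such a module contains; the table and column names are assumptions, not the project's actual schema.

```python
# Illustrative only: the table and column names are assumptions about the schema.
user_table_insert = """
    INSERT INTO users (user_id, first_name, last_name, gender, level)
    SELECT DISTINCT userId, firstName, lastName, gender, level
    FROM staging_events
    WHERE userId IS NOT NULL;
"""
```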
Although the data sources are provided by two public S3 buckets, the only thing you need to run the example is an AWS Redshift cluster up and running.
- In this example a Redshift dc2.large cluster with 4 nodes has been created, at a cost of USD 0.25 per hour.
- Also in this example I used the IAM role authorization mechanism; the only policy attached to the IAM role is 'AmazonS3ReadOnlyAccess' (see the boto3 sketch below).
- DON'T FORGET TO DELETE YOUR CLUSTER WHEN YOU'RE DONE, BECAUSE IT WILL KEEP COSTING YOU MONEY. BE CAREFUL, PLEASE.
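For reference, the IaC.ipynb steps (create the IAM role with 'AmazonS3ReadOnlyAccess', spin up the 4-node dc2.large cluster, and delete it afterwards) roughly correspond to the boto3 calls sketched below. The role name, cluster identifier, database name, credentials and region are placeholders; adapt them to your own dhw.cfg.

```python
import json
import boto3

# Hedged sketch of the IaC.ipynb steps. Role name, cluster identifier,
# region and credentials are placeholders, not the project's real values.
iam = boto3.client('iam', region_name='us-west-2')
redshift = boto3.client('redshift', region_name='us-west-2')

# 1. Create an IAM role that Redshift can assume, and attach read-only S3 access.
iam.create_role(
    RoleName='sparkifyRedshiftRole',
    AssumeRolePolicyDocument=json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "redshift.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }],
    }),
    Description='Allows Redshift to read from S3 (sketch).',
)
iam.attach_role_policy(
    RoleName='sparkifyRedshiftRole',
    PolicyArn='arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess',
)
role_arn = iam.get_role(RoleName='sparkifyRedshiftRole')['Role']['Arn']

# 2. Create a 4-node dc2.large cluster with that role attached.
redshift.create_cluster(
    ClusterType='multi-node',
    NodeType='dc2.large',
    NumberOfNodes=4,
    DBName='sparkify',
    ClusterIdentifier='sparkify-cluster',
    MasterUsername='awsuser',
    MasterUserPassword='<choose-a-password>',
    IamRoles=[role_arn],
)

# 3. When you are finished with the project, delete the cluster so it stops
#    billing (run this step later, not right after creation).
redshift.delete_cluster(
    ClusterIdentifier='sparkify-cluster',
    SkipFinalClusterSnapshot=True,
)
```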