We have a dataset scraped from the internet in September 2023 that includes Amazon products' prices and sales. The task is to extract the data, upload it to AWS, and build a dashboard that visualizes the data for users. The data comes from Kaggle.
- Data extraction from sources such as databases, CSV files, or APIs (see the sketch after this list)
- Data loading into an AWS S3 bucket
- Data transformation with AWS Glue, saving Parquet files in the S3 bucket
- Querying the data with AWS Athena, saving the query results in a bucket
- Importing the query results into Tableau for dashboard visualization
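For the extraction step, a minimal sketch of loading and sanity-checking the Kaggle CSV with pandas; the filename is an assumption, not the actual dataset name:

```python
import pandas as pd

# Load the Kaggle CSV locally before anything touches AWS.
# The filename is an assumption -- use whatever Kaggle names the download.
df = pd.read_csv("amazon_products.csv")

# Quick sanity checks on the extracted data.
print(df.shape)   # rows x columns
print(df.dtypes)  # column types (prices should end up numeric)
print(df.head())  # first few records
```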
- Download the CSV file from Kaggle
- The CSV files have not been pushed to the repo
- Make sure to create an IAM user to run the project
- Attach S3 full access, Glue, Athena, and QuickSight access policies to the user
- Create a bucket in your region
- Create two folders in the bucket (staging/ and datawarehouse/)
- Upload the files to the staging folder (see the boto3 sketch below)
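A minimal boto3 sketch of the bucket setup and upload; the bucket name, region, and file path are assumptions for illustration:

```python
import boto3

REGION = "us-east-1"                 # assumption: pick your own region
BUCKET = "amazon-products-pipeline"  # hypothetical bucket name
LOCAL_CSV = "amazon_products.csv"    # hypothetical path to the Kaggle CSV

s3 = boto3.client("s3", region_name=REGION)

# Create the bucket; us-east-1 rejects an explicit LocationConstraint.
if REGION == "us-east-1":
    s3.create_bucket(Bucket=BUCKET)
else:
    s3.create_bucket(
        Bucket=BUCKET,
        CreateBucketConfiguration={"LocationConstraint": REGION},
    )

# "Folders" in S3 are just key prefixes; uploading under staging/ creates it.
s3.upload_file(LOCAL_CSV, BUCKET, "staging/amazon_products.csv")
```

This runs under the IAM user created above, with its credentials configured locally (e.g. via `aws configure`).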
- Search for Glue and open Visual ETL
- Create a job with Visual ETL
- Choose the source
- Set up the data source
- Add transforms (join / drop columns / rename columns / update schema)
- Set the target to the S3 bucket
- Create a role that lets Glue call AWS services, with the S3 full access policy attached
- Set up the job properties and save
- Run the pipeline
- The data is written to the bucket in Parquet format; the generated job is roughly equivalent to the PySpark sketch below
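Visual ETL generates a PySpark script behind the scenes. A minimal sketch of an equivalent Glue job, assuming hypothetical bucket paths and column names:

```python
import sys
from awsglue.transforms import DropFields, RenameField
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the raw CSV from the staging prefix (paths are assumptions).
source = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://amazon-products-pipeline/staging/"]},
    format="csv",
    format_options={"withHeader": True},
)

# Drop and rename columns (the column names are hypothetical).
cleaned = DropFields.apply(frame=source, paths=["image_url", "product_link"])
renamed = RenameField.apply(frame=cleaned, old_name="actual_price", new_name="price")

# Write Parquet into the datawarehouse prefix.
glue_context.write_dynamic_frame.from_options(
    frame=renamed,
    connection_type="s3",
    connection_options={"path": "s3://amazon-products-pipeline/datawarehouse/"},
    format="parquet",
)

job.commit()
```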
- Create a crawler
- Add the data source (the datawarehouse folder)
- Add the IAM role (choose AWSGlueServiceRole-xx)
- Create a database and use it as the target database
- Hit "Create crawler"
- Run the crawler (this can also be scripted with boto3, as sketched below)
- A table is created in the Glue Data Catalog
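A boto3 sketch of the same crawler setup; the database and crawler names are hypothetical:

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")  # region is an assumption

# Target database in the Glue Data Catalog (hypothetical name).
glue.create_database(DatabaseInput={"Name": "amazon_products_db"})

glue.create_crawler(
    Name="amazon-products-crawler",  # hypothetical crawler name
    Role="AWSGlueServiceRole-xx",    # the Glue service role created earlier
    DatabaseName="amazon_products_db",
    Targets={"S3Targets": [{"Path": "s3://amazon-products-pipeline/datawarehouse/"}]},
)

# Running the crawler scans the Parquet files and creates the table.
glue.start_crawler(Name="amazon-products-crawler")
```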
- choose "Analyze your data using PySpark and Spark SQL" Launch notebook editor
- Query editor - settings - manage - Query result location (create a new bucket for it)
- do the sql query and it returns the result and the file saved in the bucket
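The same query can also be issued programmatically. A sketch with boto3, where the query, database, and result bucket names are assumptions:

```python
import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")  # region is an assumption

# Table, column, and bucket names below are hypothetical.
execution = athena.start_query_execution(
    QueryString=(
        "SELECT product_name, price "
        "FROM amazon_products ORDER BY price DESC LIMIT 10"
    ),
    QueryExecutionContext={"Database": "amazon_products_db"},
    ResultConfiguration={"OutputLocation": "s3://amazon-products-query-results/"},
)

# Poll until the query finishes; Athena writes the result CSV to the output location.
query_id = execution["QueryExecutionId"]
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

print(state, f"s3://amazon-products-query-results/{query_id}.csv")
```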
- Sign up for Tableau Public
- Upload the Athena query results (the CSV saved in the results bucket)
- Build your dashboard