We have a dataset scraped from the internet in September 2023 that includes Amazon products' prices and sales. The task is to extract the data, upload it to AWS, and build a dashboard that visualizes the data for users. The data comes from Kaggle.
- Data extraction from sources such as databases, CSV files, or APIs (see the sketch after this list)
- Data loading into an AWS S3 bucket
- Data transformation with AWS Glue, saving Parquet files in the S3 bucket
- Querying the data with AWS Athena, saving the query results in a bucket
- Importing the query results into Tableau for dashboard visualization
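For the extraction step, a minimal sketch of loading and sanity-checking the Kaggle CSV with pandas; the filename is an assumption, not the actual dataset name:

```python
import pandas as pd

# Load the Kaggle CSV locally before anything touches AWS.
# The filename is an assumption -- use whatever Kaggle names the download.
df = pd.read_csv("amazon_products.csv")

# Quick sanity checks on the extracted data.
print(df.shape)   # rows x columns
print(df.dtypes)  # column types (prices should end up numeric)
print(df.head())  # first few records
```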
- Download the CSV file from Kaggle
- The CSV files have not been pushed to the repo
- Make sure to create an IAM user to run the project
- Attach S3 full access, Glue, Athena, and QuickSight access policies to the user
- Create a bucket in your region
- Create two folders in the bucket (staging/ and datawarehouse/)
- Upload the files to the staging folder (see the boto3 sketch below)
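A minimal boto3 sketch of the bucket setup and upload; the bucket name, region, and file path are assumptions for illustration:

```python
import boto3

REGION = "us-east-1"                 # assumption: pick your own region
BUCKET = "amazon-products-pipeline"  # hypothetical bucket name
LOCAL_CSV = "amazon_products.csv"    # hypothetical path to the Kaggle CSV

s3 = boto3.client("s3", region_name=REGION)

# Create the bucket; us-east-1 rejects an explicit LocationConstraint.
if REGION == "us-east-1":
    s3.create_bucket(Bucket=BUCKET)
else:
    s3.create_bucket(
        Bucket=BUCKET,
        CreateBucketConfiguration={"LocationConstraint": REGION},
    )

# "Folders" in S3 are just key prefixes; uploading under staging/ creates it.
s3.upload_file(LOCAL_CSV, BUCKET, "staging/amazon_products.csv")
```

This runs under the IAM user created above, with its credentials configured locally (e.g. via `aws configure`).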
- Search for Glue and open Visual ETL
- Create a job with Visual ETL
- Choose the source
- Set up the data source
- Add transforms (join / drop columns / rename columns / update schema)
- Set the target to the S3 bucket
- Create a role that lets Glue call AWS services, with the S3 full access policy attached
- Set up the job properties and save
- Run the pipeline
- The data is written to the bucket in Parquet format; the generated job is roughly equivalent to the PySpark sketch below
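Visual ETL generates a PySpark script behind the scenes. A minimal sketch of an equivalent Glue job, assuming hypothetical bucket paths and column names:

```python
import sys
from awsglue.transforms import DropFields, RenameField
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the raw CSV from the staging prefix (paths are assumptions).
source = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://amazon-products-pipeline/staging/"]},
    format="csv",
    format_options={"withHeader": True},
)

# Drop and rename columns (the column names are hypothetical).
cleaned = DropFields.apply(frame=source, paths=["image_url", "product_link"])
renamed = RenameField.apply(frame=cleaned, old_name="actual_price", new_name="price")

# Write Parquet into the datawarehouse prefix.
glue_context.write_dynamic_frame.from_options(
    frame=renamed,
    connection_type="s3",
    connection_options={"path": "s3://amazon-products-pipeline/datawarehouse/"},
    format="parquet",
)

job.commit()
```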
- Create a crawler
- Add the data source (the datawarehouse folder)
- Add the IAM role (choose AWSGlueServiceRole-xx)
- Create a database and use it as the target database
- Hit "Create crawler"
- Run the crawler (this can also be scripted with boto3, as sketched below)
- A table is created in the Glue Data Catalog
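A boto3 sketch of the same crawler setup; the database and crawler names are hypothetical:

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")  # region is an assumption

# Target database in the Glue Data Catalog (hypothetical name).
glue.create_database(DatabaseInput={"Name": "amazon_products_db"})

glue.create_crawler(
    Name="amazon-products-crawler",  # hypothetical crawler name
    Role="AWSGlueServiceRole-xx",    # the Glue service role created earlier
    DatabaseName="amazon_products_db",
    Targets={"S3Targets": [{"Path": "s3://amazon-products-pipeline/datawarehouse/"}]},
)

# Running the crawler scans the Parquet files and creates the table.
glue.start_crawler(Name="amazon-products-crawler")
```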
- choose "Analyze your data using PySpark and Spark SQL" Launch notebook editor
- Query editor - settings - manage - Query result location (create a new bucket for it)
- do the sql query and it returns the result and the file saved in the bucket
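The same query can also be issued programmatically. A sketch with boto3, where the query, database, and result bucket names are assumptions:

```python
import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")  # region is an assumption

# Table, column, and bucket names below are hypothetical.
execution = athena.start_query_execution(
    QueryString=(
        "SELECT product_name, price "
        "FROM amazon_products ORDER BY price DESC LIMIT 10"
    ),
    QueryExecutionContext={"Database": "amazon_products_db"},
    ResultConfiguration={"OutputLocation": "s3://amazon-products-query-results/"},
)

# Poll until the query finishes; Athena writes the result CSV to the output location.
query_id = execution["QueryExecutionId"]
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

print(state, f"s3://amazon-products-query-results/{query_id}.csv")
```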
- Sign up for Tableau Public
- Upload the Athena query results (the CSV saved in the results bucket)
- Build your dashboard