Here is the final project for the Udacity Data Engineer NanoDegree.
To keep the project easy to share while requiring as little cloud setup as possible, I built it in a Google Colab notebook (a free online Jupyter notebook with GPU support). Spark runs as a local session inside the notebook, but you still need to set up a few services (a configuration sketch follows the list):
- an AWS IAM user with the AmazonS3FullAccess permission
- an S3 bucket URL
- a Google Maps API key
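As a rough idea of how the AWS pieces above plug into a local Spark session, here is a minimal sketch. The key values, bucket name, app name, and the hadoop-aws version are placeholders or assumptions, not the project's real values; adapt them to your own setup.

```python
from pyspark.sql import SparkSession

AWS_ACCESS_KEY_ID = "YOUR_ACCESS_KEY"       # from the IAM user with AmazonS3FullAccess
AWS_SECRET_ACCESS_KEY = "YOUR_SECRET_KEY"   # from the same IAM user
S3_BUCKET = "s3a://your-output-bucket"      # your own S3 bucket URL

# Local Spark session using every core on the machine, with the Hadoop AWS
# connector so DataFrames can be written straight to the S3 bucket.
# The hadoop-aws version must match the Hadoop build bundled with your Spark.
spark = (
    SparkSession.builder
    .appName("iowa-liquor-sales-etl")
    .master("local[*]")
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:2.7.4")
    .config("spark.hadoop.fs.s3a.access.key", AWS_ACCESS_KEY_ID)
    .config("spark.hadoop.fs.s3a.secret.key", AWS_SECRET_ACCESS_KEY)
    .getOrCreate()
)
```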
The notebook guides you through every step; you can find it here:
https://colab.research.google.com/drive/18v5oG_lFUvaHzBfYfIQIIRtL3E1u8Uxa
If you run into a permission issue launching it, save your own copy to your Google Drive and run that instead.
If you prefer to run the project locally, this repository also contains a .ipynb file with the full project. In that case you need to change the way the .csv file holding the AWS credentials is loaded (see the sketch below), and you need either a local Spark installation or a cluster in the cloud, with the setup code adjusted accordingly.
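For the local variant, one possible way to load the credentials is to read the .csv directly from disk instead of uploading it through the Colab file widget. The filename and column headers below are assumptions based on the file AWS generates when you create the IAM user; check your own download and adjust them if they differ.

```python
import pandas as pd

# Read the downloaded IAM credentials file from a local path (hypothetical name).
creds = pd.read_csv("credentials.csv")
AWS_ACCESS_KEY_ID = creds["Access key ID"].iloc[0]
AWS_SECRET_ACCESS_KEY = creds["Secret access key"].iloc[0]
```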
This project uses a slightly modified version of the Iowa Liquor Sales dataset from Kaggle (more details in the Colab notebook). The goal is to ETL this 3.5 GB file into a schema better suited to data analysis: from a flat file to a star schema, ready to be consumed by a data analyst or a data scientist.
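The sketch below shows the rough shape of such an ETL with Spark: one wide flat file in, one fact table plus dimension tables out, written as Parquet. It reuses the `spark` session and `S3_BUCKET` placeholder from the earlier sketch, and the column names and date format are hypothetical stand-ins for the real headers (see the data dictionary for the actual ones).

```python
from pyspark.sql import functions as F

# Read the flat 3.5 GB file; the raw date column is assumed to be a string,
# so adjust the format pattern to the actual file.
sales = (
    spark.read.csv("iowa_liquor_sales.csv", header=True, inferSchema=True)
    .withColumn("sale_date", F.to_date("date", "MM/dd/yyyy"))
)

# Dimension: one row per vendor.
dim_vendor = sales.select("vendor_number", "vendor_name").dropDuplicates(["vendor_number"])

# Dimension: calendar attributes derived from the sale date.
dim_date = (
    sales.select("sale_date").dropDuplicates()
    .withColumn("year", F.year("sale_date"))
    .withColumn("month", F.month("sale_date"))
    .withColumn("day_of_week", F.dayofweek("sale_date"))
)

# Fact: one row per sale line, keeping only measures and foreign keys.
fact_sales = sales.select(
    "invoice_number", "sale_date", "vendor_number", "store_number",
    "bottles_sold", "sale_dollars", "volume_sold_liters",
)

# Persist the star schema to the S3 bucket for downstream analysis.
fact_sales.write.mode("overwrite").parquet(S3_BUCKET + "/fact_sales")
dim_vendor.write.mode("overwrite").parquet(S3_BUCKET + "/dim_vendor")
dim_date.write.mode("overwrite").parquet(S3_BUCKET + "/dim_date")
```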
You can find a data dictionary in this repository, stored in an open format so it is available to everyone.
The focus is on sales, with the ability to slice and dice the data across dimensions such as vendor, location, and date.
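As an example of the kind of slicing the star schema enables, the query below computes yearly sales dollars per vendor. It follows the hypothetical table and column names used in the sketch above, not the exact names produced by the notebook.

```python
fact_sales.createOrReplaceTempView("fact_sales")
dim_vendor.createOrReplaceTempView("dim_vendor")
dim_date.createOrReplaceTempView("dim_date")

# Top vendors by total sales per year, sliced along the date and vendor dimensions.
spark.sql("""
    SELECT d.year, v.vendor_name, SUM(f.sale_dollars) AS total_sales
    FROM fact_sales f
    JOIN dim_date d   ON f.sale_date = d.sale_date
    JOIN dim_vendor v ON f.vendor_number = v.vendor_number
    GROUP BY d.year, v.vendor_name
    ORDER BY total_sales DESC
""").show(10)
```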
The dataset is too big to wrangle with pandas alone, but it does not need a big Hadoop cluster either. That is why I chose to run Spark locally, using all the cores available on the system.
Even though this dataset is large, it cannot really be labelled Big Data. If it were 100 times bigger, I would have to set up a real Spark cluster, for example with AWS EMR.
Also, this dataset is a one-time file. In a real-world case where new data arrived every day from reports, I would set up a daily workflow with Airflow and split the logic of this notebook into several tasks in a @daily DAG (a sketch follows).
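Here is a minimal sketch of what such a DAG could look like, assuming Airflow 2.x. The task callables are hypothetical names standing in for the notebook's main steps, not functions that exist in this repository.

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

# Hypothetical wrappers around the notebook's main steps.
def extract_raw_sales(**context): ...
def build_dimensions(**context): ...
def build_fact_table(**context): ...
def run_quality_checks(**context): ...

with DAG(
    dag_id="iowa_liquor_sales_etl",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",   # the @daily schedule mentioned above
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract_raw_sales", python_callable=extract_raw_sales)
    dims = PythonOperator(task_id="build_dimensions", python_callable=build_dimensions)
    fact = PythonOperator(task_id="build_fact_table", python_callable=build_fact_table)
    checks = PythonOperator(task_id="run_quality_checks", python_callable=run_quality_checks)

    # Run the steps in order, once per day.
    extract >> dims >> fact >> checks
```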
A real Hadoop cluster with Spark SQL on top would also be a good fit if 100 people in a company had to work with the data, since it could scale according to CPU consumption.