The data used in this project is the Olist dataset from Kaggle, which contains Brazilian e-commerce data.
The objective of this project is to extract all users whose delivery was more than 10 days late.
To make the project more interesting, there are some constraints on the time data:
- order_purchase_timestamp is by default in the São Paulo timezone
- order_delivered_customer_date is by default in the timezone of the customer's delivery address
To work with all of Brazil's timezones, I added a CSV with the timezone for each state. (This data comes from the Wikipedia pages on Brazilian time zones and Brazilian states.)
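To make this concrete, here is a minimal sketch of how both timestamps could be normalized to UTC with Spark. The file names and the state/timezone column names are assumptions rather than the exact ones used in this repository, and the Column-typed timezone overload of to_utc_timestamp requires Spark 2.4+:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("olist").getOrCreate()

// File and column names below are assumptions based on the public Olist dataset;
// adjust them to the CSVs actually present in the data folder.
val orders    = spark.read.option("header", "true").csv("data/olist_orders_dataset.csv")
val customers = spark.read.option("header", "true").csv("data/olist_customers_dataset.csv")
val stateTz   = spark.read.option("header", "true").csv("data/state_timezones.csv") // state -> timezone

val normalized = orders
  .join(customers, "customer_id")                        // brings in customer_state
  .join(stateTz, col("customer_state") === col("state")) // brings in the customer's timezone
  // Purchase timestamps are in São Paulo time; delivery timestamps are in the customer's timezone.
  .withColumn("purchase_utc",
    to_utc_timestamp(to_timestamp(col("order_purchase_timestamp")), "America/Sao_Paulo"))
  .withColumn("delivered_utc",
    to_utc_timestamp(to_timestamp(col("order_delivered_customer_date")), col("timezone")))
```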
To explore the data and test pieces of code, I used a notebook, olist.ipynb, in the repository. To execute its cells you need Jupyter Notebook with the spylon kernel.
The architecture of this project is based on the olist batch application and the file system.
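Since the batch is launched with spark-submit --class OlistCli (see the run instructions below), its entry point presumably looks something like the following skeleton; the argument handling here is an assumption, not the repository's actual CLI:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical skeleton of the batch entry point; argument names and defaults are assumptions.
object OlistCli {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("olist-batch").getOrCreate()
    val dataDir   = args.lift(0).getOrElse("data")       // where the input CSVs live
    val outputDir = args.lift(1).getOrElse("out/result") // where the merged result is written
    // ... run the three steps listed below and write the merged result to outputDir ...
    spark.stop()
  }
}
```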
The olist batch is composed of 3 steps (sketched below):
- get the users with more than 10 days of delay who did not receive their order (unless the order was canceled)
- get the users with more than 10 days of delay who did receive their order
- merge both results
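A minimal sketch of these three steps, continuing from the normalized DataFrame above. Measuring the delay against order_estimated_delivery_date is an assumption; the repository may define the delay differently:

```scala
// Step 1: more than 10 days of delay, order never delivered, and not canceled.
val lateNotDelivered = normalized
  .filter(col("order_delivered_customer_date").isNull)
  .filter(col("order_status") =!= "canceled")
  .filter(datediff(current_timestamp(), to_timestamp(col("order_estimated_delivery_date"))) > 10)

// Step 2: more than 10 days of delay and the order was delivered.
val lateDelivered = normalized
  .filter(col("order_delivered_customer_date").isNotNull)
  .filter(datediff(col("delivered_utc"), to_timestamp(col("order_estimated_delivery_date"))) > 10)

// Step 3: merge both results into a single list of users.
val lateUsers = lateDelivered.unionByName(lateNotDelivered)
  .select("customer_id")
  .distinct()
```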
To run the project locally:
- clone the repo:
  git clone https://github.com/souff/olist.git
- run the batch with mill:
  mill batch.standalone.run
- build the assembly with mill:
  mill batch.assembly
- run it with spark-submit:
  spark-submit --class OlistCli out/batch/assembly.dest/out.jar
- build the standalone assembly with mill (this is the jar to upload to S3 below):
  mill batch.standalone.assembly
To run the batch on an EMR cluster:
- Create an S3 bucket and upload the files:
  - the jar file into a jobs folder (you need to use the standalone jar)
  - the csv files into a data folder
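For example, with the AWS CLI (the bucket name is a placeholder, and the standalone jar path is an assumption based on mill's usual output layout):
  aws s3 cp out/batch/standalone/assembly.dest/out.jar s3://<your-bucket>/jobs/
  aws s3 cp data/ s3://<your-bucket>/data/ --recursive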
- Create the EMR cluster, using the same Spark version that you have locally. You can use this screenshot for the configuration:
  Leave the other parameters at their defaults.
- Create the step:
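A Spark application step on EMR boils down to a spark-submit call along these lines (the bucket and jar names are placeholders matching whatever you uploaded above):
  spark-submit --deploy-mode cluster --class OlistCli s3://<your-bucket>/jobs/out.jar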
You can check in the interface that the step succeeded:
To get the result, access the S3 bucket: