The data used in this project is the Olist dataset from Kaggle, which contains Brazilian e-commerce data.
The objective of this project is to extract all users whose delivery was more than 10 days late.
To make the project more interesting, there are some constraints on the time data:
- order_purchase_timestamp is by default in the São Paulo timezone
- order_delivered_customer_date is by default in the timezone of the customer's delivery address
To work with all of Brazil's timezones, I added a CSV with the timezone for each state. (This data comes from the Wikipedia pages on Brazilian time zones and Brazilian states.)
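To make this concrete, here is a minimal sketch of how both timestamps could be normalized to UTC with Spark. The file names and the state/timezone column names are assumptions rather than the exact ones used in this repository, and the Column-typed timezone overload of to_utc_timestamp requires Spark 2.4+:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("olist").getOrCreate()

// File and column names below are assumptions based on the public Olist dataset;
// adjust them to the CSVs actually present in the data folder.
val orders    = spark.read.option("header", "true").csv("data/olist_orders_dataset.csv")
val customers = spark.read.option("header", "true").csv("data/olist_customers_dataset.csv")
val stateTz   = spark.read.option("header", "true").csv("data/state_timezones.csv") // state -> timezone

val normalized = orders
  .join(customers, "customer_id")                        // brings in customer_state
  .join(stateTz, col("customer_state") === col("state")) // brings in the customer's timezone
  // Purchase timestamps are in São Paulo time; delivery timestamps are in the customer's timezone.
  .withColumn("purchase_utc",
    to_utc_timestamp(to_timestamp(col("order_purchase_timestamp")), "America/Sao_Paulo"))
  .withColumn("delivered_utc",
    to_utc_timestamp(to_timestamp(col("order_delivered_customer_date")), col("timezone")))
```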
To explore the data and test pieces of code, I used a notebook, olist.ipynb, in the repository. To execute its cells you need Jupyter Notebook with the spylon kernel.
The architecture of this project is based on the olist batch application and the file system.
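Since the batch is launched with spark-submit --class OlistCli (see the run instructions below), its entry point presumably looks something like the following skeleton; the argument handling here is an assumption, not the repository's actual CLI:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical skeleton of the batch entry point; argument names and defaults are assumptions.
object OlistCli {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("olist-batch").getOrCreate()
    val dataDir   = args.lift(0).getOrElse("data")       // where the input CSVs live
    val outputDir = args.lift(1).getOrElse("out/result") // where the merged result is written
    // ... run the three steps listed below and write the merged result to outputDir ...
    spark.stop()
  }
}
```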
The olist batch is composed of 3 steps (sketched below):
- get the users with more than 10 days of delay who did not receive their order (unless the order was canceled)
- get the users with more than 10 days of delay who did receive their order
- merge both results
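A minimal sketch of these three steps, continuing from the normalized DataFrame above. Measuring the delay against order_estimated_delivery_date is an assumption; the repository may define the delay differently:

```scala
// Step 1: more than 10 days of delay, order never delivered, and not canceled.
val lateNotDelivered = normalized
  .filter(col("order_delivered_customer_date").isNull)
  .filter(col("order_status") =!= "canceled")
  .filter(datediff(current_timestamp(), to_timestamp(col("order_estimated_delivery_date"))) > 10)

// Step 2: more than 10 days of delay and the order was delivered.
val lateDelivered = normalized
  .filter(col("order_delivered_customer_date").isNotNull)
  .filter(datediff(col("delivered_utc"), to_timestamp(col("order_estimated_delivery_date"))) > 10)

// Step 3: merge both results into a single list of users.
val lateUsers = lateDelivered.unionByName(lateNotDelivered)
  .select("customer_id")
  .distinct()
```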
To run the project locally:
- clone the repo:
  git clone https://github.com/souff/olist.git
- run the batch with mill:
  mill batch.standalone.run
- build the assembly with mill:
  mill batch.assembly
- run it with spark-submit:
  spark-submit --class OlistCli out/batch/assembly.dest/out.jar
- build the standalone assembly with mill (this is the jar to upload to S3 below):
  mill batch.standalone.assembly
To run the batch on an EMR cluster:
- Create an S3 bucket and upload the files:
  - the jar file into a jobs folder (you need to use the standalone jar)
  - the csv files into a data folder
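For example, with the AWS CLI (the bucket name is a placeholder, and the standalone jar path is an assumption based on mill's usual output layout):
  aws s3 cp out/batch/standalone/assembly.dest/out.jar s3://<your-bucket>/jobs/
  aws s3 cp data/ s3://<your-bucket>/data/ --recursive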
- Create the EMR cluster, using the same Spark version that you have locally. You can use this screenshot for the configuration:
  Leave the other parameters at their defaults.
- Create the step:
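A Spark application step on EMR boils down to a spark-submit call along these lines (the bucket and jar names are placeholders matching whatever you uploaded above):
  spark-submit --deploy-mode cluster --class OlistCli s3://<your-bucket>/jobs/out.jar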
You can check in the interface that the step succeeded:
To get the result, access the S3 bucket: