Path to the project repository on the cluster: ~/home/team24/project/bigdata_project_team24
This repository contains the following directories and files:
data/
contains the dataset files.
models/
contains the Spark ML models.
notebooks/
contains the project's Jupyter notebooks, which are used for learning purposes (interactive PDA).
output/
is the output directory for storing the results of the project. It can contain csv files, text files, images, and any other materials produced as output of the pipeline.
scripts/
is a place for storing the .sh and .py scripts of the pipeline.
sql/
is a folder for keeping all .sql and .hql files.
requirements.txt
lists the Python packages needed for running your Python scripts. Feel free to add more packages when necessary.
main.sh
is the main script. It runs the scripts of all pipeline stages, which executes the full pipeline and stores the results in the output/ folder. When checking your project repository, the grader will run only the main script and check the results in the output/ folder.
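
Since the grader runs only main.sh, it helps to see what such a script could look like. The following is a minimal sketch, not the project's actual script: the stage*.sh naming inside scripts/ and the REPO_ROOT variable are assumptions made for illustration. It creates output/ if it is missing and then runs each stage script in turn, stopping at the first failure.

```shell
#!/bin/bash
# Sketch of main.sh: run every pipeline stage and keep the results in output/.
# ASSUMPTION: stage scripts are named scripts/stage*.sh (hypothetical convention).
set -euo pipefail

# ASSUMPTION: the script is started from the repository root unless REPO_ROOT is set.
REPO_ROOT="${REPO_ROOT:-$(pwd)}"
mkdir -p "$REPO_ROOT/output"

stages_run=0
for stage in "$REPO_ROOT"/scripts/stage*.sh; do
    # If the glob matched nothing, the pattern is left as-is; skip it.
    [ -e "$stage" ] || continue
    echo "Running $(basename "$stage")"
    bash "$stage"
    stages_run=$((stages_run + 1))
done

echo "Finished $stages_run stage(s); results are in output/"
```

The glob expands in lexicographic order, so with ten or more stages the names would need zero-padding (stage01.sh, stage02.sh, and so on) to run in the intended sequence. Because of set -e, a failing stage aborts the run instead of letting later stages work on incomplete results.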