NYC-Open-Data-Profiling-and-Cleaning

Created for CSGY 6513 Big Data Final Project Part 2.

Part 1 of the project (Assignment 3), which contains Jupyter notebooks for profiling and cleaning the NYC Citywide Payroll data, can be found here: https://github.com/alekzanderx1/NYC-Citywide-Payroll-Data-Profiling-and-Cleaning

Steps to Reproduce:

1. Log in to Peel, create a new working directory, and clone or copy all the project files there. Then navigate to the code folder.
2. Run the script run_part1.sh by calling ./run_part1.sh in the shell. This submits a Spark job for each dataset.

   NOTE: If you get a permission denied error when running the above command, run chmod +x run_part1.sh and try again.
3. Once all the Spark jobs have finished, the original and cleaned samples of each dataset are available in the current working directory, named [DatasetID]Original.csv and [DatasetID]Output.csv.
4. To calculate precision and recall for the above sample data, follow the "Steps to calculate accuracy" given separately below.

Improvements were made to the original script based on the above results. To run the improved script, continue with steps 5 and 6:

5. To calculate precision and recall of the improved script, run ./run_part2subset.sh. This behaves the same as run_part1.sh and outputs both the original and cleaned CSV of each dataset sample. Follow step 4 to calculate precision and recall from the output.

   NOTE: If you get a permission denied error when running the above command, run chmod +x run_part2subset.sh and try again.
6. Run the script run_part2.sh by calling ./run_part2.sh in the shell. This writes the entire cleaned datasets to HDFS with filenames of the form [DatasetID]Cleaned.out.

   NOTE: If you get a permission denied error when running the above command, run chmod +x run_part2.sh and try again.

Steps to calculate accuracy:

1. Follow steps 1 to 3 in "Steps to Reproduce".
2. Copy the output CSV files to your local machine and open them in the spreadsheet software of your choice. You can use WinSCP or the scp command with the following syntax: scp <netID>@<peelurl>:/file/to/send /where/to/put
3. Manually inspect and compare the Original and Output files of each dataset to assess the effectiveness of the cleaning approach.
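
To narrow down what needs manual inspection, the changed cells can be listed programmatically first. The following is a minimal Python sketch, not part of the project scripts. It assumes that both CSV files have a header row, the same columns, and the same row order; the function name changed_cells and the file names in the usage comment are hypothetical.

```python
import csv


def changed_cells(original_path, output_path):
    """Yield (row_number, column, original_value, cleaned_value) for each differing cell.

    Assumes both files have a header row, identical columns, and rows in the
    same order. Rows beyond the length of the shorter file are ignored.
    """
    with open(original_path, newline="") as f_orig, open(output_path, newline="") as f_out:
        orig_rows = csv.DictReader(f_orig)
        out_rows = csv.DictReader(f_out)
        for row_number, (orig, out) in enumerate(zip(orig_rows, out_rows), start=1):
            for column, orig_value in orig.items():
                cleaned_value = out.get(column)
                if orig_value != cleaned_value:
                    yield row_number, column, orig_value, cleaned_value


# Hypothetical usage, with file names following the [DatasetID] pattern:
# for change in changed_cells("abcdOriginal.csv", "abcdOutput.csv"):
#     print(change)
```

Each reported change still has to be judged by hand as a correct or incorrect cleaning, and cells that should have been cleaned but were left untouched will not appear in this list.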

Use the following formulas to calculate precision and recall:
```
Precision = True Positive Count / (True Positive Count + False Positive Count)

Recall = True Positive Count / (True Positive Count + False Negative Count)
```
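
As a worked illustration of these formulas, here is a small Python sketch. The function name precision_recall and the counts in the example are hypothetical and not taken from the project results. One plausible reading for this project is that a true positive is a cell that needed cleaning and was cleaned correctly, a false positive is a cell that was changed when it should not have been, and a false negative is a cell that needed cleaning but was left unchanged. Returning 0.0 when a denominator is zero is a convention chosen here, not something the README specifies.

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Return (precision, recall) computed from manually counted outcomes."""
    predicted_positives = true_positives + false_positives
    actual_positives = true_positives + false_negatives
    # Assumption: report 0.0 instead of raising when a denominator is zero.
    precision = true_positives / predicted_positives if predicted_positives else 0.0
    recall = true_positives / actual_positives if actual_positives else 0.0
    return precision, recall


# Hypothetical counts: 45 correct fixes, 5 wrong changes, 10 missed errors.
precision, recall = precision_recall(45, 5, 10)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

With these example counts, precision is 45 / 50 and recall is 45 / 55.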

Reference Data creation

The ReferenceData folder contains reference data for nonprofit names in NYC, as well as the PySpark script used to create it. Steps to create:

1. Go to the ReferenceData directory in Peel.
2. Run spark-submit createReferenceData.py
3. Download the result by running hfs -getmerge NonprofitNameReferenceDataset.csv NonprofitNameReferenceDataset.csv

Team:

Syed Ahmad

Suyash Soniminde
