
San Francisco Datasets (ETL-Project)

Objective:

The purpose of this project is to collaborate as a team to:

  • Extract - 4 different datasets (in at least 2 formats)
  • Transform - each dataset, based on its current state
  • Load - these unrelated datasets, as individual collections, into a NoSQL database

Resources:

This repo contains the collaborative work for Team 1. In this repo, you will find the following:

Code Folder (link here)

Contains each team member's Jupyter Notebook, holding the code used for the extraction and transformation of their respective dataset(s):

| Team Member | Filename | Description |
| --- | --- | --- |
| William | william_notebook.ipynb | Extracts a CSV file into a Pandas DataFrame containing stats for 150 San Francisco restaurants, then undertakes transformation (incl. str.replace and code to convert longitude and latitude to zip codes) before finally converting the result to a JSON file. |
| Caitlin | SFzips.ipynb | Imports 2x CSV files containing (i) park and (ii) neighbourhood information, performs transformation (incl. dropping duplicates and deleting obsolete rows), then saves 2x new CSV files and finally converts the contents of both to a JSON file. |
| Heesung | Heesung_code_with_comments.ipynb | Imports 2x CSV files: (i) boba shops in the Bay Area and (ii) Pokemon Go spawns in the Bay Area. Transformation includes removing irrelevant information, using Geopy to return zip codes from latitude and longitude, and merging the two datasets on zip code. |
| Kathryn | Google_data.ipynb | Uses Google's API to query every hotel located within 50,000 meters of the latitude and longitude corresponding to each zip code in the original dataset. |

Input Folder (link here)

Contains each team member's raw dataset(s), captured once extraction was complete but before transformation and loading into the destination database:

| Team Member | Filename | Description |
| --- | --- | --- |
| William | restaurant_raw.csv | Data (incl. longitude and latitude) for 150 Bay Area restaurants. |
| Caitlin | Neighborhoods.csv | Data for San Francisco parks (incl. a score for each park and its longitude and latitude). |
| Caitlin | SFZ.csv | A list of zip codes for each neighbourhood in San Francisco. |
| Heesung | boba.csv | Data for boba merchants in the Bay Area, including each location's longitude and latitude. |
| Heesung | pokemon-spawns.csv | Data for Pokemon Go spawns in the Bay Area, including the longitude and latitude of each spawn location, as well as a unique identifier for each Pokemon. |
| Kathryn | N/A | As Kathryn queried an API, she has no input file per se. |

Outputs Folder (link here)

Contains each team member's output files: the JSON files that were subsequently used to form the collections making up the final MongoDB database, plus any output files produced once transformation was complete and before conversion to JSON:

| Team Member | Filename | Description |
| --- | --- | --- |
| William | will_data.JSON | The product of William's extraction and transformation: a JSON file used to load the 'Restaurants' collection of the final MongoDB database. |
| Caitlin | SF_Park_Scores.csv | The output of both of Caitlin's datasets, once transformation had been completed on her original input files (SF_Park_Scores.csv & SFZ.csv). |
| Caitlin | sf_parks.json | The JSON conversion of the same transformed output, used to form the 'parks' collection. |
| Heesung | pika_boba.csv | A CSV of both datasets, once transformation was complete and before conversion to a JSON file. |
| Heesung | boba_pika.json | The boba dataset once zip codes had been obtained, using the longitude and latitude of each boba shop, with help from the geocoders library. |
| Heesung | boba_zip.csv | The merged dataset (clean_boba_pika_df), ready for loading into the MongoDB database as a collection. |
| Heesung | pika_zip.csv | The Pokemon dataset, including zip codes for each spawn location. |
| Kathryn | TBC | To be confirmed with Kathryn. |

The Process:

The objective of this project was to work as a team, as collaboratively as possible, to walk through the ETL process with at least 2 datasets from different sources (e.g. Kaggle and an API), in at least 2 differing formats (e.g. JSON and CSV).

The obvious approach would have been for each team member to tackle a different stage of the ETL process; however, it was collectively agreed that this approach wouldn't leave the entire team empowered with knowledge of every step of the ETL process.

Consensus was therefore reached that each team member would go through the ETL process individually, but in parallel. Each team member would therefore:

  • Extract at least 1x dataset
  • Perform transformation, dependent on the type and state of the data following extraction. The prerequisite was that each team member would present their transformed data in JSON format.
  • Individually load their JSON-formatted, transformed dataset(s) into a personal NoSQL (MongoDB) database, as a sanity check, before...
  • Submitting their JSON file (once successfully parsed by MongoDB) to their respective resources folder.

Once each team member had successfully extracted, transformed and 'test' loaded their datasets into MongoDB, Kathryn took the helm of the load element of the project, uploading each team member's JSON file as an individual collection in a master MongoDB database.
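
As an illustration of the individual 'test' load, here is a minimal sketch assuming a local MongoDB instance and that each output file holds a JSON array of documents; the filenames and collection names follow the repo, while the connection string is an assumption:

```python
# Minimal sanity-check load of one member's JSON output into a
# local MongoDB instance (assumed to be running on the default port).
import json
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["locations_mdb"]

# Assumes the file contains a JSON array of documents
with open("will_data.JSON") as f:
    records = json.load(f)

db["restaurants"].insert_many(records)
print(db["restaurants"].count_documents({}))
```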

EXTRACTION

In total the team extracted 6 datasets, from 2 sources, in 2 formats:

  • 6 datasets: stored in the Input Folder
  • 2 sources: we used Kaggle and the Google API
  • 2 formats: William's, Heesung's and Caitlin's data all arrived in CSV format, whilst Kathryn's API call to Google returned JSON.

Each team member describes, in turn, the reasoning behind their respective extraction:

  • Heesung Extraction Notes: In this project, I wanted to find out the correlation between boba shop ratings and the number of Pokemon Go Pikachu spawns in the SF Bay Area. The two files included the following information:

    • Boba shops in the Bay Area:

      • id: Unique id of the boba shop
      • name: Name of the boba shop
      • rating: Yelp rating of the boba shop, on a scale of 1-5
      • address: Address of the boba shop
      • city: City the boba shop is in
      • lat: Latitude of the boba shop
      • long: Longitude of the boba shop
    • SF Bay Area Pokemon Go Spawns:

      • id: Unique id of the Pokemon
      • number: Unique number of the Pokemon character
      • lat: Latitude of the despawn location
      • long: Longitude of the despawn location
      • encounter_ms: The time the Pokemon was encountered
      • disappear_ms: The time the Pokemon disappeared
  • William Extraction Notes: as a budding restaurant-goer (and someone relatively new to the Bay Area), I wanted to find restaurant data for the Bay Area that might make for interesting analysis. I found a dataset on Kaggle that looked like it had been neglected. It holds information for 150 restaurants, including address, cuisine type, price point and, importantly, the longitude and latitude of each restaurant. My extraction only involved downloading the dataset from Kaggle, before beginning transformation.

  • Caitlin Extraction Notes: I relied on two separate CSV files for the extraction stage of this project; SFZ.csv contained information on San Francisco neighbourhoods (with corresponding zip codes), whilst SF_Park_Scores.csv contained information on parks within the San Francisco area, including a score for each respective park. Both source data files derived from Kaggle. Extraction included (i) downloading each respective CSV file and (ii) reading the CSV files into Pandas.

  • Kathryn Extraction Notes: for my data, I chose to locate all of the hotels in each of the zip codes in the Bay Area. I used the Google Places API to locate hotels within approximately 30 miles (50,000 meters) of the latitude and longitude coordinates corresponding to each zip code. This returned about 20 hotels per zip code, across the 50 zip codes in the dataset. A sketch of this kind of call is shown below.
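
For reference, this is a minimal sketch of the kind of Places "Nearby Search" call Kathryn describes; the exact request used in Google_data.ipynb isn't reproduced here, so the 'lodging' type, the example coordinates and the variable names are assumptions:

```python
# Query the Google Places "Nearby Search" endpoint for hotels
# around a latitude/longitude pair.
import requests

API_KEY = "YOUR_GOOGLE_API_KEY"  # placeholder
URL = "https://maps.googleapis.com/maps/api/place/nearbysearch/json"

def hotels_near(lat, lng, radius_m=50_000):
    """Return lodging results within radius_m meters of (lat, lng)."""
    params = {
        "location": f"{lat},{lng}",
        "radius": radius_m,   # the Places API caps radius at 50,000 m
        "type": "lodging",
        "key": API_KEY,
    }
    resp = requests.get(URL, params=params)
    resp.raise_for_status()
    return resp.json().get("results", [])

# Illustrative coordinates for one zip code (downtown San Francisco)
hotels = hotels_near(37.7749, -122.4194)
print(len(hotels))
```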

TRANSFORMATION

Whilst all team members extracted their data from reputable sources, every dataset required a degree of cleanup and analysis. The following transformations took place:

  • Heesung Transformation Notes: Once I had loaded my files, as CSVs, into DataFrames, I performed a series of tasks that included (a condensed sketch of steps (ii)-(iv) appears after this list):

    (i) Removing data irrelevant to the project
    (ii) Using Geopy to find zip codes, using longitude and latitude
    (iii) Merging both datasets on the zip code column
    (iv) Removing rows containing missing information
    (v) Saving the cleansed datasets, first as a CSV and then as a JSON file

    | Transformation | Code | Description |
    | --- | --- | --- |
    | DataType conversion | .astype(int) | Checking data volume for each Pokemon |
    | Filtering | .loc[pokemon_df["name"]=="Pikachu",:] | Filtering on a specific character |
    | Index reset | .reset_index(drop=True) | Resetting the index on a DataFrame |
    | Geopy query | geopy.Nominatim | Querying the zip codes for corresponding longitude and latitude |
    | Counting | .value_counts() | Counting instances of certain Pokemon characters |
    | Purging | ['Unnamed: 0'] | Deleting unnecessary columns |
    | JSON conversion | to_json | Converting DataFrame(s)/CSV to JSON |
  • William's Transformation Notes: the restaurant information dataset that I extracted from Kaggle looked to be a badly converted JSON-to-CSV file; it contained a lot of white space, obsolete characters, and values from one column spilling over into a completely separate column's cells. My objective for the transformation was simply to bring order and consistency to my data. I struggled initially to write the code to repetitively remove white space from different columns; however, following a lot of trial and error, I was able to write a simple for loop that iterated through my DataFrame and (by referencing a list containing the correct column headers) replaced the white space with correct values. The other challenging element of my transformation was using ArcGIS to look up zip codes for each restaurant, using that restaurant's respective latitude and longitude.

    | Transformation | Code | Description |
    | --- | --- | --- |
    | CSV to DataFrame | pd.DataFrame | Converting the read CSV file into a DataFrame, in preparation for transformation |
    | String replacement | str.replace | Replacing white space |
    | Dropping columns/cells | df.drop | Removing obsolete columns and/or cells |
    | Merging DataFrames | df.merge | Merging DataFrames together, prior to JSON conversion |
    | JSON conversion | to_json | Converting DataFrame(s)/CSV to JSON |
  • Caitlin's Transformation Notes: both datasets contained some unnecessary and hard-to-read information. The transformation process consisted of cleaning them for clear, functional analysis. My transformation steps were as follows: deleted unnecessary columns, dropped duplicate rows, dropped rows with no zip/lat/long data, and formatted columns for clarity.

    | Transformation | Code | Description |
    | --- | --- | --- |
    | Drop duplicates | .drop_duplicates() | Dropping duplicate values |
    | Drop NaN | .dropna() | Dropping NaN values |
    | Renaming columns | ['new_column_name'] | Renaming columns for consistency |
    | DataType conversion | .astype(int) | Converting data types, prior to JSON conversion |
    | JSON conversion | to_json | Converting DataFrame(s)/CSV to JSON |
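
To make the shared zip-code pattern concrete, here is a condensed sketch of the Geopy lookup and merge that Heesung describes above (William's ArcGIS lookup follows the same shape). Column names match the dataset schemas listed earlier; the user_agent string and output structure are assumptions, not taken verbatim from the notebooks:

```python
# Reverse-geocode each row's coordinates to a zip code, then merge
# the two datasets on that zip code.
import pandas as pd
from geopy.geocoders import Nominatim

geolocator = Nominatim(user_agent="etl-project-example")

def zip_from_coords(lat, long):
    """Reverse-geocode a (lat, long) pair to a zip code."""
    location = geolocator.reverse((lat, long))
    return location.raw["address"].get("postcode")

boba_df = pd.read_csv("boba.csv")
pokemon_df = pd.read_csv("pokemon-spawns.csv")

# Filter to a single character, as in Heesung's notebook
pikachu_df = pokemon_df.loc[pokemon_df["name"] == "Pikachu", :].reset_index(drop=True)

# Look up zip codes row by row (slow: one geocoding request per row)
boba_df["zip"] = boba_df.apply(lambda r: zip_from_coords(r["lat"], r["long"]), axis=1)
pikachu_df["zip"] = pikachu_df.apply(lambda r: zip_from_coords(r["lat"], r["long"]), axis=1)

# Merge on zip code, drop incomplete rows, and save as CSV then JSON
merged = boba_df.merge(pikachu_df, on="zip").dropna()
merged.to_csv("pika_boba.csv", index=False)
merged.to_json("boba_pika.json")
```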

LOADING

Loading all datasets: Kathryn volunteered to take all our JSON files and build the team database. Here are the steps she took:

Kathryn's Team Loading Steps: I looped through a Pandas DataFrame of zip codes in San Francisco, making an API call to Google Places for all hotels within 50,000 meters of each zip code's latitude and longitude equivalent:

  • Each of these records was added to a MongoDB database called locations_mdb, under a collection called "hotels". Now all hotels can be viewed for each latitude and longitude pair.
  • Heesung's data I imported into the database under a collection called "boba".
  • The remaining two dataframes I converted from CSV to JSON format by (see the sketch below):
    • Converting each to a dictionary in "index" format
    • Converting each dictionary to a JSON string
    • Loading each JSON string into JSON format
    • Inserting the data into the MongoDB database as its own collection (resulting in "parks" and "restaurants" collections)
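
A minimal sketch of that dictionary-to-JSON-to-collection path, assuming a local MongoDB instance; the input filename here is hypothetical, as the repo doesn't show Kathryn's exact loading code:

```python
# Convert one of the remaining CSVs to "index"-format JSON and insert
# it into the master database as its own collection.
import json
import pandas as pd
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["locations_mdb"]

df = pd.read_csv("sf_parks.csv")  # hypothetical filename for the parks data

# dict in "index" format -> JSON string -> parsed JSON -> MongoDB
records_dict = df.to_dict(orient="index")
json_string = json.dumps(records_dict)
parsed = json.loads(json_string)

# Insert one document per original row into the "parks" collection
db["parks"].insert_many(list(parsed.values()))
```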

Bonus

Taking it to the next level: Heesung decided to take things one step beyond the brief by producing a map that identified, geographically, where the boba shops (green), Pokemon spawns (yellow), restaurants (red) and park locations (blue) were in the Bay Area.
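
The repo doesn't include Heesung's plotting code, but a category-coloured map of this kind could be sketched as below; the use of matplotlib, the filenames, and the assumption that every file exposes lat/long columns are all illustrative:

```python
# Scatter each dataset's coordinates in its own colour.
import pandas as pd
import matplotlib.pyplot as plt

layers = {
    "boba.csv": ("green", "Boba shops"),
    "pokemon-spawns.csv": ("yellow", "Pokemon spawns"),
    "restaurant_raw.csv": ("red", "Restaurants"),
    "Neighborhoods.csv": ("blue", "Parks"),
}

fig, ax = plt.subplots(figsize=(8, 8))
for filename, (colour, label) in layers.items():
    df = pd.read_csv(filename)
    ax.scatter(df["long"], df["lat"], c=colour, s=10, label=label, alpha=0.6)

ax.set_xlabel("Longitude")
ax.set_ylabel("Latitude")
ax.set_title("Bay Area locations by category")
ax.legend()
plt.show()
```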
