- Extract - 4 different datasets (in at least 2 formats)
- Transform - each dataset, based on its current state
- Load - these unrelated datasets, as individual collections, into a NoSQL database
Code Folder (link here)
Contains each team member's Jupyter Notebook, with the code used for the extraction and transformation of their respective dataset(s):
Team Member | Filename | Description |
---|---|---|
William | william_notebook.ipynb | Extracts a CSV file of stats for 150 San Francisco restaurants into a pandas DataFrame, performs transformation (inc. `str.replace` and code to convert longitude and latitude to zip codes) and finally converts the result to a JSON file. |
Caitlin | SFzips.ipynb | Imports 2x CSV files containing (i) park and (ii) neighbourhood information, performs transformation (inc. dropping duplicates and deleting obsolete rows), saves 2x new CSV files and finally converts the contents of both to a JSON file. |
Heesung | Heesung_code_with_comments.ipynb | Imports 2x CSV files: (i) boba shops in the Bay Area and (ii) Pokemon Go spawns in the Bay Area. Transformation includes removal of irrelevant information, using Geopy to return zip codes from latitude and longitude, and merging the two datasets using the zip code as the join key. |
Kathryn | Google_data.ipynb | Uses Google's API to query every hotel located within 50,000 meters of the latitude and longitude corresponding to each zip code in the original dataset. |
Input Folder (link here)
Contains each team member's raw dataset(s), after extraction but before transformation and loading into the destination database:
TEAM MEMBER | FILENAME | DESCRIPTION |
---|---|---|
William | restaurant_raw.csv | Contains data (inc. longitude and latitude) for 150 Bay Area restaurants. |
Caitlin | Neighborhoods.csv | Contains data for San Francisco parks (inc. a score for each park and its longitude and latitude). |
 | SFZ.csv | Contains a list of zip codes for each neighbourhood in San Francisco. |
Heesung | boba.csv | Contains data for boba merchants in the Bay Area, including each location's longitude and latitude. |
 | pokemon-spawns.csv | Contains data for Pokemon Go spawns in the Bay Area, including the longitude and latitude of each spawn location, as well as a unique identifier for each Pokemon. |
Kathryn | N/A | As Kathryn queried an API, she has no input file per se. |
Outputs Folder (link here)
Contains each team member's output files: the JSON files that were subsequently loaded as collections into the final MongoDB database, plus any intermediate files produced after transformation but before conversion to JSON:
TEAM MEMBER | FILENAME | DESCRIPTION |
---|---|---|
William | will_data.JSON | The product of William's extraction and transformation; loaded as the 'Restaurants' collection of the final MongoDB database. |
Caitlin | SF_Park_Scores.csv | The combined output of Caitlin's datasets, once transformation had been completed on her original input files (SF_Park_Scores.csv & SFZ.csv). |
 | sf_parks.json | The same output converted to JSON, used to form the 'parks' collection of the final MongoDB database. |
Heesung | pika_boba.csv | A CSV of both datasets merged, once transformation had completed and before conversion into a JSON file. |
 | boba_pika.json | The merged dataset (clean_boba_pika_df) in JSON format, ready for loading into the MongoDB database as a collection. |
 | boba_zip.csv | The boba dataset, once zip codes had been obtained from the longitude and latitude of each boba shop, with help from the geocoders library. |
 | pika_zip.csv | The Pokemon dataset, including zip codes for each spawn location. |
Kathryn | TBC | CONFIRM WITH KATHRYN |
The objective of this project was to work collaboratively as a team to walk through the ETL process with at least 2 datasets, from different sources (e.g. Kaggle & an API), in at least 2 differing formats (e.g. JSON and CSV).
The obvious approach would have been for individual team members to each tackle a different stage of the ETL process; however, it was collectively agreed that this approach wouldn't leave the entire team empowered with knowledge of every step of the ETL process.
Consensus was therefore reached, that each team member would go through the ETL process individually, but in parallel. Each team member would therefore:
- Extract at least 1x dataset
- Perform transformation, dependent on the type and state of the data following extraction, with the prerequisite that each team member would present their transformed data in JSON format.
- Individually load their JSON-formatted, transformed dataset(s) into a personal NoSQL (MongoDB) database, as a sanity check, before...
- Submitting their JSON file (having been successfully parsed by MongoDB) to their respective resources folder.
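The per-member sanity check can be sketched as below. This is a minimal sketch, not the team's actual code: `validate_for_mongo` is an illustrative name, and the pymongo call is shown commented out because it needs a running local MongoDB.

```python
import json

def validate_for_mongo(json_path):
    """Parse a transformed JSON file and check that every record is a
    flat JSON object -- the shape pymongo's insert_many() expects."""
    with open(json_path) as f:
        records = json.load(f)
    if isinstance(records, dict):
        # pandas to_json(orient="index") produces {"0": {...}, "1": {...}}
        records = list(records.values())
    bad = sum(1 for r in records if not isinstance(r, dict))
    if bad:
        raise ValueError(f"{bad} record(s) are not JSON objects")
    return records

# The actual sanity load would then be something like:
#   from pymongo import MongoClient
#   client = MongoClient("mongodb://localhost:27017/")
#   client["sanity_db"]["restaurants"].insert_many(validate_for_mongo("will_data.JSON"))
```

Validating before inserting means a malformed file fails fast, rather than half-loading into the database.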
Once each team member had successfully extracted, transformed and 'test' loaded their datasets into MongoDB, Kathryn took the helm of the load element of the project, uploading each team member's JSON file as an individual collection in a master MongoDB database.
In total the team extracted 6 datasets, from 3 sources, in 2 formats:
- 6 datasets: stored in the Input Folder
- 3 sources: we used Kaggle and the Google API
- 2 formats: William, Heesung and Caitlin's data all arrived in CSV format, whilst Kathryn's API call to Google returned JSON.
Each team member describes, in turn, their reasoning behind the extraction of their respective data:
Heesung Extraction Notes: In this project, I wanted to find out the correlation between boba shop ratings and the number of Pokemon Go Pikachu spawns in the SF Bay Area. The two files included the following information:

Boba shops in the Bay Area:
- id: unique id of the boba shop
- name: name of the boba shop
- rating: Yelp rating of the boba shop, on a scale of 1-5
- address: address of the boba shop
- city: city the boba shop is in
- lat: latitude of the boba shop
- long: longitude of the boba shop

SF Bay Area Pokemon Go Spawns:
- id: unique id of the Pokemon
- number: unique number of the Pokemon character
- lat: latitude of the despawn location
- long: longitude of the despawn location
- encounter_ms: time the Pokemon was encountered
- disappear_ms: time the Pokemon disappeared
William Extraction Notes: as a budding restaurant goer (and as someone who is relatively new to the Bay Area), I wanted to find restaurant data for the Bay Area that might make for interesting analysis. I found a dataset on Kaggle that looked like it had been neglected. It holds information for 150 restaurants, including address, cuisine type, price point and, importantly, the longitude and latitude of each restaurant. My extraction only involved downloading the dataset from Kaggle before beginning transformation.
Caitlin Extraction Notes: I relied on two separate CSV files for the extraction stage of this project; SFZ.csv contained information on San Francisco neighbourhoods (with corresponding zip codes), whilst SF_Park_Scores.csv contained information on parks within the San Francisco area, including a score for each respective park. Both source data files derived from Kaggle. Extraction included (i) downloading each respective CSV file and (ii) reading the CSV files into pandas.
Kathryn Extraction Notes: for my data, I chose to locate all of the hotels in each of the zip codes in the Bay Area. I used the Google Places API to locate hotels within approximately 30 miles (50,000 meters) of the latitude and longitude coordinates corresponding to each zip code. This generated about 20 hotels for each of the 50 zip codes in the dataset.
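Kathryn's notebook isn't reproduced here, but the query step can be sketched against the documented Places Nearby Search endpoint. The function names are illustrative, and an actual call requires a valid API key and network access; `lodging` is the Places type used for hotels, and 50,000 m is the maximum radius the endpoint accepts.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

PLACES_URL = "https://maps.googleapis.com/maps/api/place/nearbysearch/json"

def build_hotel_query(lat, lng, api_key, radius_m=50000):
    """Build a Nearby Search URL for lodging within radius_m metres
    of a lat/lng pair."""
    params = {
        "location": f"{lat},{lng}",
        "radius": radius_m,
        "type": "lodging",  # Places' category covering hotels
        "key": api_key,
    }
    return f"{PLACES_URL}?{urlencode(params)}"

def fetch_hotels(lat, lng, api_key):
    """Make the actual call -- needs a valid key and network access."""
    with urlopen(build_hotel_query(lat, lng, api_key)) as resp:
        return json.loads(resp.read())["results"]
```

Looping this over a DataFrame of zip-code centroids, as described above, yields the per-zip hotel records.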
Whilst all team members extracted their data from reputable sources, all data required a degree of cleanup and analysis. The following transformations took place:
Heesung Transformation Notes: Once I had loaded my CSV files into DataFrames, I performed a series of tasks that included:
(i) Removing data irrelevant to the project
(ii) Using Geopy to find zip codes from longitude and latitude
(iii) Merging both datasets on the zip code column
(iv) Removing rows containing missing information
(v) Saving the cleansed datasets, firstly as CSV and secondly as a JSON file

TRANSFORMATION | CODE | DESCRIPTION |
---|---|---|
DataType conversion | `.astype(int)` | Checking data volume for each Pokemon |
Filtering | `.loc[pokemon_df["name"]=="Pikachu",:]` | Filtering on a specific character |
Index reset | `.reset_index(drop=True)` | Resetting the index on a DataFrame |
Geopy query | `geopy.Nominatim` | Used to query the zip codes corresponding to longitude and latitude |
Counting | `.value_counts()` | Counting instances of certain Pokemon characters |
Purging | `['Unnamed: 0']` | Deleting unnecessary columns |
JSON conversion | `to_json` | Converting DataFrame(s) / CSV to JSON |
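The Geopy lookup in step (ii) can be sketched as below. The function names are illustrative, not taken from the notebook, and the geolocator is passed in as an argument so the parsing logic can be exercised without a live network call.

```python
def postcode_from_raw(raw):
    """Pull the postcode out of a raw Nominatim reverse-geocode payload."""
    return raw.get("address", {}).get("postcode")

def zip_from_coords(geolocator, lat, lon):
    """Reverse-geocode one lat/lon pair to a zip code.

    geolocator is a geopy Nominatim instance, e.g.:
        from geopy.geocoders import Nominatim
        geolocator = Nominatim(user_agent="sf-etl-project")
    """
    location = geolocator.reverse((lat, lon))
    return None if location is None else postcode_from_raw(location.raw)
```

Nominatim rate-limits requests, so in practice each lookup should be throttled (geopy ships a RateLimiter helper for exactly this).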
William's Transformation Notes: the restaurant dataset that I extracted from Kaggle looked to be a badly converted JSON-to-CSV file; it contained a lot of white space, obsolete characters and values from one column spilling over into a completely separate column's cells. My objective for the transformation was simply to bring order and consistency to my data. I initially struggled to write code to repetitively remove white space from different columns; however, following a lot of trial and error, I was able to write a simple for loop that iterated through my DataFrame and (by referencing a list containing the correct column headers) replaced the white space with correct values. The other challenging element of my transformation was using ArcGIS to look up zip codes for each restaurant, using that restaurant's respective latitude and longitude.

TRANSFORMATION | CODE | DESCRIPTION |
---|---|---|
CSV to DataFrame | `pd.DataFrame` | Converting my read CSV file into a DataFrame in preparation for transformation |
String replacement | `str.replace` | Replacing white space |
Dropping columns / cells | `df.drop` | Removing obsolete columns and/or cells |
Merging DataFrames | `df.merge` | Merging DataFrames together, prior to JSON conversion |
JSON conversion | `to_json` | Converting DataFrame(s) / CSV to JSON |
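The header-and-whitespace cleanup William describes might look roughly like this; `tidy_columns` and the header list are illustrative names, not the notebook's actual code.

```python
import pandas as pd

def tidy_columns(df, correct_headers):
    """Restore the intended header names, then strip stray white space
    from every string column by iterating over the DataFrame."""
    df = df.copy()
    df.columns = correct_headers           # list of known-good column names
    for col in df.columns:
        if df[col].dtype == object:        # only string-like columns
            df[col] = df[col].str.strip()  # the whitespace cleanup
    return df
```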
Caitlin's Transformation Notes: both datasets contained some unnecessary and hard-to-read information; the transformation process consisted of cleaning them for clear, functional analysis. My transformation steps were as follows: deleted unnecessary columns, dropped duplicate rows, dropped rows with no zip/lat/long data, and formatted columns for clarity.

TRANSFORMATION | CODE | DESCRIPTION |
---|---|---|
Drop duplicates | `.drop_duplicates()` | Dropping duplicate values |
Drop NaN | `dropna()` | Dropping NaN values |
Renaming columns | `['new_column_name']` | Renaming columns for consistency |
DataType conversion | `.astype(int)` | Converting the DataType, prior to JSON conversion |
JSON conversion | `to_json` | Converting DataFrame(s) / CSV to JSON |
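Chained together, Caitlin's steps might look like the sketch below; the column names (`zip`, `lat`, `long`) are assumptions about the files, not taken from the notebook.

```python
import pandas as pd

def clean_parks(df):
    """The cleanup steps chained: drop duplicates, drop rows missing
    location data, rename for consistency, then reset the index."""
    return (
        df.drop_duplicates()
          .dropna(subset=["zip", "lat", "long"])
          .rename(columns={"zip": "zipcode"})
          .reset_index(drop=True)
    )
```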
Loading all datasets: Kathryn volunteered to take all our JSON files and build the team database. Here are the steps she took:
Kathryn Team Loading Steps: I looped through a pandas DataFrame of zip codes in San Francisco, making an API call to Google Places for all hotels within 50,000 meters of each zip code's latitude and longitude equivalent:
- Each of these records was added to a MongoDB database called locations_mdb, under a collection called “hotels”. Now all hotels can be viewed for each latitude and longitude pair.
- Heesung’s data I imported to the database under a collection called ‘boba’
- The remaining two dataframes I converted from CSV to JSON format by:
- Converting each to a dictionary in “index” format
- Converting each dictionary to a JSON string
- Loading each JSON string into JSON format
- Inserting the data into the MongoDB database as its own collection (resulting in “parks” and “restaurants” collections)
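That CSV-to-collection path can be sketched as follows; `csv_to_documents` is an illustrative name, and the pymongo insert is shown commented out because it needs a running MongoDB.

```python
import json
import pandas as pd

def csv_to_documents(csv_path):
    """CSV -> DataFrame -> "index"-oriented dict -> JSON string ->
    parsed list of records, ready for insert_many()."""
    df = pd.read_csv(csv_path)
    as_dict = df.to_dict(orient="index")  # {0: {...}, 1: {...}}
    as_json = json.dumps(as_dict)         # round-trip guarantees JSON-safe values
    return list(json.loads(as_json).values())

# Inserting into the master database would then be something like:
#   from pymongo import MongoClient
#   docs = csv_to_documents("sf_parks.csv")
#   MongoClient()["locations_mdb"]["parks"].insert_many(docs)
```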
Taking it to the next level: Heesung decided to go one step beyond the brief, producing a map that identifies geographically where the boba shops (green), Pokemon spawns (yellow), restaurants (red) and park locations (blue) are in the Bay Area.