This project creates a data pipeline to load Louisville Metro (KY) expenditure data (via data.louisvilleky.gov) for comparison across fiscal years. The comparisons include expenditures by department and type per year. Reports for each fiscal year can have different column names, so this project consolidates the data, maps related columns, and resolves slight naming differences between agencies.
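As a rough sketch of that normalization, here is what the column consolidation and agency-name cleanup could look like in pandas. The column names and aliases below are made up for illustration; the real mappings live in the pipeline:

```python
import pandas as pd

# Hypothetical mappings for illustration; the actual per-year column
# names and agency aliases are defined in the pipeline's configuration.
COLUMN_MAP = {
    "DepartmentName": "department",
    "Department Name": "department",
    "ExpenseType": "expenditure_type",
    "Expenditure Type": "expenditure_type",
    "InvoiceAmount": "amount",
    "Amount": "amount",
}

AGENCY_ALIASES = {
    "Louisville Fire Dept": "Louisville Fire Department",
    "Parks & Rec": "Parks and Recreation",
}

def normalize(df: pd.DataFrame) -> pd.DataFrame:
    """Rename year-specific columns to canonical names and unify agency spellings."""
    df = df.rename(columns=COLUMN_MAP)
    df["department"] = df["department"].replace(AGENCY_ALIASES)
    return df
```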
Necessary cloud infrastructure is created via Terraform. The data pipeline is orchestrated by a Prefect Docker container, which runs tasks to download the data and load it into a Google Cloud Storage (GCS) bucket.
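A minimal sketch of what such a download-and-load flow could look like with Prefect 2 and the google-cloud-storage client. The bucket name, URL parameter, and object layout are assumptions, not the project's actual values:

```python
import requests
from google.cloud import storage
from prefect import flow, task

BUCKET = "louisville-expenditures"  # hypothetical bucket name

@task(retries=3, retry_delay_seconds=30)
def download_report(url: str) -> bytes:
    resp = requests.get(url, timeout=60)
    resp.raise_for_status()
    return resp.content

@task
def upload_to_gcs(data: bytes, path: str) -> None:
    # Uses the google-cloud-storage client directly; the project may
    # use prefect-gcp blocks instead.
    blob = storage.Client().bucket(BUCKET).blob(path)
    blob.upload_from_string(data, content_type="text/csv")

@flow
def ingest(fiscal_year: int, url: str) -> None:
    data = download_report(url)
    upload_to_gcs(data, f"raw/fiscal_year={fiscal_year}/expenditures.csv")
```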
The next stage extracts data from the GCS bucket, performs cleanup and transformations, and loads it into Google BigQuery. The final stage uses dbt transformations to create a fact table for efficient data aggregation in the dashboard.
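As a sketch, the GCS-to-BigQuery step could be a load job along these lines; the bucket, dataset, and table names are placeholders:

```python
from google.cloud import bigquery

def load_fiscal_year(project_id: str, fiscal_year: int) -> None:
    """Load one fiscal year's CSV from GCS into a BigQuery staging table."""
    client = bigquery.Client(project=project_id)
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,
        write_disposition="WRITE_TRUNCATE",
    )
    uri = f"gs://louisville-expenditures/raw/fiscal_year={fiscal_year}/expenditures.csv"
    table_id = f"{project_id}.expenditures.stg_expenditures_{fiscal_year}"
    client.load_table_from_uri(uri, table_id, job_config=job_config).result()
```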
Although the entire dataset is less than 1 GB, I've partitioned the data by fiscal year to fulfill the project requirements.
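One way to express that partitioning on the BigQuery side is an integer-range partition on the fiscal year; the table id and schema below are hypothetical:

```python
from google.cloud import bigquery

client = bigquery.Client()
table = bigquery.Table(
    "my-project.expenditures.expenditures",  # hypothetical table id
    schema=[
        bigquery.SchemaField("fiscal_year", "INTEGER"),
        bigquery.SchemaField("department", "STRING"),
        bigquery.SchemaField("expenditure_type", "STRING"),
        bigquery.SchemaField("amount", "NUMERIC"),
    ],
)
# One partition per fiscal year within the given range.
table.range_partitioning = bigquery.RangePartitioning(
    field="fiscal_year",
    range_=bigquery.PartitionRange(start=2008, end=2032, interval=1),
)
client.create_table(table, exists_ok=True)
```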
Requirements:

- Docker installed
- Terraform installed
- Google Cloud account
- New project created
- Google Cloud service account for this project with Storage and BigQuery roles, with its JSON access key downloaded
- Linux or WSL and the ability to run `make`
To run the project:

1. Setup: cloud credentials / env variables / Docker install / WSL
   - 1.a Export an environment variable with your GCP project ID:
     > export GCP_PROJECT_ID=<Project ID>
   - 1.b Download a service account JSON key for a service account with Storage and BigQuery roles. Place the JSON file into the `credentials` directory with the name `gcp_service_account_key.json`.
2. Run `make init` in the project directory. This will use Terraform to create the necessary infrastructure and build the Docker containers.
3. Run `make start`. This will start the Prefect containers to run the ETL task flows.
4. Run `make transform`. This will start the dbt container and run the transformations.
5. Run `make stop` to stop the containers.
6. Run `make clean` to stop the containers, remove volumes, and use Terraform to take down the resources created by this project.
The Google Data Studio (now Looker Studio) dashboard can be viewed here: https://lookerstudio.google.com/reporting/dcd5e9e7-bf72-4a5b-addd-f1e0349159b9
A future improvement: include a script or optional parameter to download directly from the original data source instead of the snapshot in this repo. It should download the portal's latest previously generated CSV, rather than requesting a new report, to avoid excessive report generation on their end.
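Sketched as a flow parameter, the option might look like this; both URL patterns are placeholders, not the portal's or this repo's real paths:

```python
def source_url(fiscal_year: int, use_original_source: bool = False) -> str:
    """Pick where to download a fiscal year's CSV from."""
    if use_original_source:
        # The portal's latest previously generated CSV, rather than
        # triggering a fresh report generation on their servers.
        return f"https://data.louisvilleky.gov/<dataset-path>/FY{fiscal_year}.csv"
    # Default: the snapshot committed to this repo.
    return f"https://github.com/<user>/<repo>/raw/main/data/FY{fiscal_year}.csv"
```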
Thank you to everyone involved in the DataTalksClub Data Engineering Zoomcamp 2023 for helping me improve my knowledge of the data engineering concepts and tools used in this project: https://github.com/DataTalksClub/data-engineering-zoomcamp
Thanks also to the Louisville Open Data portal for compiling this data and making it publicly available.
Data license: PDDL (https://louisville-metro-opendata-lojic.hub.arcgis.com/pages/terms-of-use-and-license)