Setup and load Fisheye data into Azure
In this case we are using Azure services. We'll use Azure Functions (serverless) to manage the load into ADLS Gen2 (as a blob), then load that blob into Postgres. Some links for reference:
- Azure Setup Instructions here to help get us started.
- ADLS into Postgres
Our initial use case is data from the EPA ECHO dataset, so we'll start there. Most of these downloads are .zip files containing multiple child files. We'll have to identify how to use Azure resources to unpack and load them, since the examples basically load from local .csv files; a rough sketch of one approach follows.
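As a starting point, here is a minimal sketch (not an established pipeline) of how a Python Azure Function could unpack a zipped ECHO export sitting in blob storage. The container name, blob name, and environment variable below are placeholders; swap in the real values from your setup.

```python
import io
import os
import zipfile

from azure.storage.blob import BlobServiceClient  # pip install azure-storage-blob

# Assumed environment variable holding the storage connection string (see .env below)
service = BlobServiceClient.from_connection_string(os.environ["AZURE_STORAGE_CONNECTION_STRING"])
container = service.get_container_client("echo-raw")  # placeholder container name

# Pull the zipped export into memory (placeholder blob name)
zip_bytes = container.download_blob("echo_exporter.zip").readall()

# Unpack each child .csv and re-upload it as its own blob so it can be loaded into Postgres
with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
    for name in archive.namelist():
        if name.endswith(".csv"):
            container.upload_blob(f"unpacked/{name}", archive.read(name), overwrite=True)
```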
We are using Python virtual environments. If you are unfamiliar with venv, please reach out to the other contributors. Please install the packages from requirements.txt into your environment.
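For example, a typical setup (assuming Python 3 and an environment directory named .venv) looks like this:
> python3 -m venv .venv
> source .venv/bin/activate
> pip install -r requirements.txt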
Make sure you have a file called .env that resembles the file .env.sample. Copy the sample file and replace the values with your own.
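If you want to confirm the values are being picked up, here is a minimal sketch using python-dotenv (which may or may not be what this repo's scripts use); the key name DATABASE_URL is just a placeholder, use whatever keys .env.sample actually defines.

```python
import os

from dotenv import load_dotenv  # pip install python-dotenv

# Reads key=value pairs from .env into the process environment
load_dotenv()

# Placeholder key; check .env.sample for the real variable names
print(os.getenv("DATABASE_URL"))
```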
For local dev, please install Postgres and set up a database called "fisheye_dev". We'll create a schema called "epa" until we have a full architecture review. If you are using Windows Subsystem for Linux (WSL2) you can follow these instructions for PostgreSQL. You will need to install the database locally until we have established a public SQL server.
Remember to log in to Postgres as the postgres user and create a user for the application. These credentials should be unique and will only apply to your local database. Here is an example:
> sudo -u postgres createuser buddha
> sudo -u postgres createdb fisheye_dev
Log in to the db:
> sudo -u postgres psql
psql (12.11 (Ubuntu 12.11-0ubuntu0.20.04.1))
Type "help" for help.
> alter user buddha with encrypted password 'epa4you';
> grant all privileges on database fisheye_dev to buddha;
At this point, you're still the postgres user, so log out and re-enter the db as the application user:
> psql -U buddha -d fisheye_dev
There are no tables yet, but let's create the epa schema:
> create schema epa;
All right! Now you should have some structure and you can either a) join the ETL team or b) go back to Python.
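If you'd like to confirm the local database and schema are reachable from Python, here is a minimal sketch using psycopg2; the credentials match the example above, so substitute your own local values.

```python
import psycopg2  # pip install psycopg2-binary

# Credentials from the example above; use your own local values
conn = psycopg2.connect(
    host="localhost",
    dbname="fisheye_dev",
    user="buddha",
    password="epa4you",
)

with conn, conn.cursor() as cur:
    # Confirm the epa schema exists
    cur.execute(
        "select schema_name from information_schema.schemata where schema_name = 'epa'"
    )
    print(cur.fetchone())

conn.close()
```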
You may have to run the load script several times to fully populate the database. It can run out of memory, but if you run it again it will continue with the tables that have not been loaded yet.
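Longer term, one way to avoid the out-of-memory issue is to stream each CSV in chunks rather than reading it whole. This is only a sketch, not the current script: the file name, table name, and connection string below are placeholders.

```python
import pandas as pd  # pip install pandas sqlalchemy psycopg2-binary
from sqlalchemy import create_engine

# Placeholder connection string; in practice build it from the values in .env
engine = create_engine("postgresql+psycopg2://buddha:epa4you@localhost/fisheye_dev")

# Read the (placeholder) ECHO export 50,000 rows at a time instead of all at once
for chunk in pd.read_csv("ECHO_EXPORTER.csv", chunksize=50_000):
    # Append each chunk into the epa schema so an interrupted run can be resumed
    chunk.to_sql("echo_exporter", engine, schema="epa", if_exists="append", index=False)
```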