
Commit

address comments from PR review
lauriemerrell committed Dec 14, 2023
1 parent 2851cc6 commit 8b95e36
Showing 2 changed files with 7 additions and 7 deletions.
6 changes: 3 additions & 3 deletions airflow/dags/create_external_tables/README.md
@@ -15,10 +15,10 @@ post_hook: | # this is optional; can provide an example query to check that e
SELECT *
FROM `{{ get_project_id() }}`.<your dataset as defined below under destination_project_dataset_table>.<your table name as defined below under destination_project_dataset_table>
LIMIT 1;
-source_objects: # this tells the external table which path to look in for the objects that will be queryable through this external table
-  - "<the top level folder name within your bucket that should be used for this external table like my_data>/*.jsonl.gz"
+source_objects: # this tells the external table which path & file format to look in for the objects that will be queryable through this external table
+  - "<the top level folder name within your bucket that should be used for this external table like my_data>/*.<your file extension, most likely '.jsonl.gz'>"
destination_project_dataset_table: "<desired dataset name like external_my_data_source>.<desired table name, may be like topic_name__specific_data_name>" # this defines the external table name (dataset and table name) through which the data will be accessible in BigQuery
-source_format: NEWLINE_DELIMITED_JSON # file format of raw data; generally should not change
+source_format: NEWLINE_DELIMITED_JSON # file format of raw data; generally should not change -- allowable options are specified here: https://cloud.google.com/bigquery/docs/reference/rest/v2/tables#ExternalDataConfiguration.FIELDS.source_format
use_bq_client: true # this option only exists for backwards compatibility; should always be true for new tables
hive_options: # this section provides information about how hive-partitioning is used
mode: CUSTOM # options are CUSTOM and AUTO. if CUSTOM, you need to define the hive partitions and their datatypes in the source_uri_prefix below; if you use AUTO, you only need to provide the top-level directory in the source_uri_prefix
8 changes: 4 additions & 4 deletions docs/architecture/data.md
@@ -130,7 +130,7 @@ The [Should it be a dbt model?](tool_choice) docs section also has some guidance

### Bring data into Google Cloud Storage

-We store our raw, un-transformed data in Google Cloud Storage to ensure that we can always recover the raw data if needed.
+We store our raw, un-transformed data in Google Cloud Storage, usually in perpetuity, to ensure that we can always recover the raw data if needed.

We store data in [hive-partitioned buckets](https://cloud.google.com/bigquery/docs/hive-partitioned-queries#supported_data_layouts) so that data is clearly labeled and partitioned for better performance. We use UTC dates and timestamps in hive paths (for example, for the timestamp of the data extract) for consistency.
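
To make the path layout concrete, here is a small sketch (illustrative only, not code from the repo) of assembling a hive-partitioned object path from a UTC extract timestamp; the folder name and the `dt`/`ts` partition keys are assumptions for the example.

```python
from datetime import datetime, timezone

# Hypothetical example: the folder name and partition keys ("dt", "ts") are
# illustrative assumptions, not necessarily the exact keys used in production.
extract_ts = datetime(2023, 12, 14, 12, 30, 0, tzinfo=timezone.utc)

object_path = (
    "my_data/"                              # top-level folder for this data source
    f"dt={extract_ts.date().isoformat()}/"  # UTC date partition
    f"ts={extract_ts.isoformat()}/"         # UTC timestamp partition
    "results.jsonl.gz"
)

print(object_path)
# my_data/dt=2023-12-14/ts=2023-12-14T12:30:00+00:00/results.jsonl.gz
```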

@@ -140,16 +140,16 @@ The [Airflow README in the data-infra repo](https://github.com/cal-itp/data-infr

We often bring data into our environment in two steps, created as two separate Airflow [DAGs](https://airflow.apache.org/docs/apache-airflow/stable/core-concepts/dags.html):

-- **Sync the fully-raw data in its original format:** See for example the changes in the `airflow/dags/sync_elavon` directory in [data-infra PR #2376](https://github.com/cal-itp/data-infra/pull/2376/files). We do this to preserve the raw data in its original form. This data might be saved in a `calitp-<your-data-source>-raw` bucket.
+- **Sync the fully-raw data in its original format:** See for example the changes in the `airflow/dags/sync_elavon` directory in [data-infra PR #2376](https://github.com/cal-itp/data-infra/pull/2376/files) (note: this example is typical in terms of its overall structure and use of Cal-ITP storage classes and methods, but the specifics of how to access and request the upstream data source will vary). We do this to preserve the raw data in its original form. This data might be saved in a `calitp-<your-data-source>-raw` bucket.
- **Convert the saved raw data into a BigQuery-readable gzipped JSONL file:** See for example the changes in the `airflow/dags/parse_elavon` directory in [data-infra PR #2376](https://github.com/cal-itp/data-infra/pull/2376/files). This prepares the data to be read into BigQuery. **Conversion here should be limited to the bare minimum needed to make the data BigQuery-compatible, for example converting column names that would be invalid in BigQuery and changing the file type to gzipped JSONL.** This data might be saved in a `calitp-<your-data-source>-parsed` bucket. A minimal sketch of this conversion step is shown below.
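
As a rough illustration of that second step (this is not the actual `parse_elavon` code, and the file names are hypothetical), the sketch below converts a raw CSV file into gzipped newline-delimited JSON with minimally cleaned column names; the real DAGs do the equivalent against Google Cloud Storage using Cal-ITP storage helpers.

```python
import csv
import gzip
import json

def parse_to_jsonl_gz(raw_csv_path: str, out_path: str) -> None:
    """Convert a raw CSV file into gzipped, newline-delimited JSON (JSONL)
    so that it can be read by a BigQuery external table."""
    with open(raw_csv_path, newline="") as raw, gzip.open(out_path, "wt") as out:
        for row in csv.DictReader(raw):
            # Keep the transformation minimal: here, just make column names BigQuery-safe.
            record = {key.strip().lower().replace(" ", "_"): value for key, value in row.items()}
            out.write(json.dumps(record) + "\n")

# Hypothetical file names, for illustration only.
parse_to_jsonl_gz("elavon_export.csv", "elavon_export.jsonl.gz")
```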

```{note}
-When you merge a pull request creating a new Airflow DAG, that DAG will be paused by default. To start the DAG, someone will need to log into the Airflow UI and unpause the DAG.
+When you merge a pull request creating a new Airflow DAG, that DAG will be paused by default. To start the DAG, someone will need to log into [the Airflow UI (requires Composer access in Cal-ITP Google Cloud Platform instance)](https://o1d2fa0877cf3fb10p-tp.appspot.com/home) and unpause the DAG.
```

### Create external tables

-We use [external tables](https://cloud.google.com/bigquery/docs/external-data-sources#external_tables) to allow BigQuery to query data stored in Google Cloud Storage. External tables do not move data into BigQuery, they simply define the data schema which BigQuery can then use to access the data still stored in Google Cloud Storage.
+We use [external tables](https://cloud.google.com/bigquery/docs/external-data-sources#external_tables) to allow BigQuery to query data stored in Google Cloud Storage. External tables do not move data into BigQuery; they simply define the data schema which BigQuery can then use to access the data still stored in Google Cloud Storage.

External tables are created by the [`create_external_tables` Airflow DAG](https://github.com/cal-itp/data-infra/tree/main/airflow/dags/create_external_tables) using the [ExternalTable custom operator](https://github.com/cal-itp/data-infra/blob/main/airflow/plugins/operators/external_table.py). Testing guidance and example YAML for how to create your external table is provided in the [Airflow DAG documentation](https://github.com/cal-itp/data-infra/tree/main/airflow/dags/create_external_tables#create_external_tables).
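
For orientation on what that operator ultimately does, here is a minimal, hypothetical sketch of defining a hive-partitioned external table directly with the BigQuery Python client; the project, bucket, dataset, and table names are placeholders, and the repo's `ExternalTable` operator and its YAML configuration remain the authoritative way to do this.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Placeholder project/dataset/table and bucket names, for illustration only.
table = bigquery.Table("my-project.external_my_data_source.my_data__records")

external_config = bigquery.ExternalConfig("NEWLINE_DELIMITED_JSON")
external_config.source_uris = ["gs://calitp-my-data-parsed/my_data/*.jsonl.gz"]

hive_options = bigquery.HivePartitioningOptions()
hive_options.mode = "AUTO"  # infer partition keys and types from the hive path
hive_options.source_uri_prefix = "gs://calitp-my-data-parsed/my_data/"
external_config.hive_partitioning = hive_options

table.external_data_configuration = external_config
client.create_table(table, exists_ok=True)  # schema pointer only; no data is loaded into BigQuery
```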

