
NTD Ingestion – Remaining Tables in Dataset #3403

Closed

charlie-costanzo opened this issue Jul 23, 2024 · 8 comments

Labels: data-pipeline-ingestion-and-modeling (Ingesting, parsing and modeling data; Evan Siroky is product owner)
@charlie-costanzo (Member) commented Jul 23, 2024

User story / feature request

Part of #3401

As a data engineer, I would like to ingest the remaining tables in the NTD dataset as GCS blob storage and external tables, extending the work found in scrape_ntd.py and annual_database_service.yml (found below).

Building upon the work completed in #3345

Existing NTD patterns:

General Cal-ITP Pipeline Patterns
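
For orientation, here is a minimal sketch of the blob-storage half of that pattern: fetch an API endpoint and write gzipped newline-delimited JSON to GCS under a dt=/execution_ts= path. The endpoint ID, table name, and parameters are placeholders, not the actual values in scrape_ntd.py:

```python
import gzip
import json
from datetime import datetime, timezone

import requests
from google.cloud import storage

# Illustrative values only -- the real endpoint IDs and bucket layout
# live in scrape_ntd.py and its YAML configs.
ENDPOINT = "https://data.transportation.gov/resource/xxxx-xxxx.json"  # hypothetical Socrata endpoint
BUCKET = "calitp-ntd-api-products"
TABLE = "annual__assets"  # hypothetical table name

# Pull the records and serialize them as gzipped newline-delimited JSON,
# the format BigQuery external tables can read directly.
records = requests.get(ENDPOINT, params={"$limit": 50000}, timeout=60).json()
payload = gzip.compress("\n".join(json.dumps(r) for r in records).encode("utf-8"))

# Hive-style dt=/execution_ts= partitioning, mirroring the object path
# visible in the error message quoted later in this thread.
now = datetime.now(timezone.utc)
blob_path = f"{TABLE}/dt={now.date()}/execution_ts={now.isoformat()}/{TABLE}.jsonl.gz"

storage.Client().bucket(BUCKET).blob(blob_path).upload_from_string(
    payload, content_type="application/gzip"
)
```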

Acceptance Criteria

I can successfully access the data for the following tables from the https://www.transit.dot.gov/ntd/data-product website as external tables in the warehouse:

  • Assets
  • Expenses
  • Fares/Funding
  • Resources
  • Security and Safety
  • Service (previous years)

@erikamov (Contributor)

External tables on branch 3403-ntd-api-external-tables

erikamov pushed commits that referenced this issue Sep 18–19, 2024
@ohrite added this to the National Transit Database milestone Sep 18, 2024
@erikamov (Contributor)

  • External tables for 2022 API data were tested and are working. I am still reviewing the field types for each column, but the tables could be deployed as is (using the default STRING) and we can keep adjusting types as needed while we work on modeling.

  • historical__major_safety_events and historical__nonmajor_safety_and_security_events external tables for Historical API data are failing because of the column names :@computed_region_8fe2_rd7y and :@computed_region_m2nu_4dib (one possible workaround is sketched after this list).
    @charlie-costanzo @mjumbewu have you seen this problem before?

    The full error:
    google.api_core.exceptions.BadRequest: 400 Error while reading table: cal-itp-data-infra-staging.external_ntd__annual_reporting.historical__major_safety_events, error message: JSON parsing error in row starting at position 0: JSON object specified for non-record field: latlon File: gs://calitp-ntd-api-products/major_safety_events/historical/dt=2024-09-18/execution_ts=2024-09-18T21:46:57.651684+00:00/historical__major_safety_events.jsonl.gz

  • The external tables for the XLSX Complete Monthly Ridership with Adjustments and Estimates Data are also failing, but I think it is just my local configuration for accessing the XLSX files/buckets.
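
A minimal sketch of one possible fix for the Historical tables, assuming each JSONL row is a flat dict: drop the :@computed_region_* keys (which are not valid BigQuery column names) and serialize the nested latlon object to a string so it no longer parses as a record. The helper names are illustrative:

```python
import gzip
import json

# Keys to drop come from the error above; everything else passes through.
DROP_PREFIXES = (":@computed_region_",)

def sanitize_row(row: dict) -> dict:
    clean = {k: v for k, v in row.items() if not k.startswith(DROP_PREFIXES)}
    if isinstance(clean.get("latlon"), dict):
        # Serialize the nested object so the column can stay a STRING.
        clean["latlon"] = json.dumps(clean["latlon"])
    return clean

def sanitize_jsonl_gz(raw: bytes) -> bytes:
    """Rewrite a .jsonl.gz payload with the problematic fields cleaned up."""
    rows = (json.loads(line) for line in gzip.decompress(raw).splitlines() if line)
    out = "\n".join(json.dumps(sanitize_row(r)) for r in rows)
    return gzip.compress(out.encode("utf-8"))
```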

@evansiroky added the data-pipeline-ingestion-and-modeling label Sep 19, 2024
@charlie-costanzo (Member, Author)

Hey @erikamov, thanks for the update!

  • External tables for 2022 API data were tested and are working. I am still reviewing the field types for each column, but the tables could be deployed as is (using the default STRING) and we can keep adjusting types as needed while we work on modeling.

This is great, I say we get these tables merged as soon as we can and keep iterating as necessary

  • historical__major_safety_events and historical__nonmajor_safety_and_security_events external tables for Historical API data are failing because of the column names :@computed_region_8fe2_rd7y and :@computed_region_m2nu_4dib.

Hm, I'm assuming these column names aren't 'BigQuery safe'. If it will get things working, I say we just exclude these columns when listing the column names for the external table, since I don't think they're useful for analysis.

If that doesn't work, then I say we hold off on creating external tables from these files for the moment, since these tables weren't actually requested; I just added them for some unrelated testing and decided to keep them in place. We can always reintroduce them with proper handling if we learn they are desirable.
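
A sketch of what that exclusion could look like at the table-definition level with the google-cloud-bigquery client: give the external table an explicit schema that omits the offending columns and set ignore_unknown_values so BigQuery skips them. The schema fields shown are an illustrative subset, not the real column list:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Define the external table over the hive-partitioned JSONL files.
external_config = bigquery.ExternalConfig("NEWLINE_DELIMITED_JSON")
external_config.source_uris = [
    "gs://calitp-ntd-api-products/major_safety_events/historical/*.jsonl.gz"
]
external_config.compression = "GZIP"
# Skip any field not declared below, including the ":@computed_region_*"
# columns and the nested "latlon" object.
external_config.ignore_unknown_values = True
external_config.schema = [
    # Illustrative subset of columns, not the real field list.
    bigquery.SchemaField("ntd_id", "STRING"),
    bigquery.SchemaField("agency", "STRING"),
]

table = bigquery.Table(
    "cal-itp-data-infra-staging.external_ntd__annual_reporting"
    ".historical__major_safety_events"
)
table.external_data_configuration = external_config
client.create_table(table, exists_ok=True)
```

If the excluded columns turn out to be wanted later, the rows would need rewriting (as in the sanitizing sketch above), since ignore_unknown_values can only drop fields, not rename them.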

  • The external tables for the XLSX Complete Monthly Ridership with Adjustments and Estimates Data are also failing, but I think it is just my local configuration for accessing the XLSX files/buckets.

Let me know if I can help in any way with this one!

@erikamov (Contributor) commented Sep 19, 2024

I created a PR with the 2022 external tables and it is ready for review: #3465
There are tons of columns, so it would take a long time to keep reviewing each endpoint and checking the type of each column. I think we can start with most columns in these tables as STRING and change the data types as needed.
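
As an illustration of the cast-later approach, a downstream staging query can SAFE_CAST the defaulted STRING columns once their types are settled. The table and column names below are hypothetical:

```python
from google.cloud import bigquery

# Hypothetical staging query over one of the new external tables: columns
# land as STRING and are cast only where the target type is settled.
QUERY = """
SELECT
    SAFE_CAST(report_year AS INT64)      AS report_year,     -- assumed column
    SAFE_CAST(total_expenses AS NUMERIC) AS total_expenses,  -- assumed column
    agency                               AS agency           -- stays STRING
FROM `cal-itp-data-infra.external_ntd__annual_reporting.2022__expenses`
"""

client = bigquery.Client()
for row in client.query(QUERY).result():
    print(dict(row))
```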

I am still working on the XLSX tables.
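
For the XLSX path, one plausible shape for the conversion step (a sketch under an assumed file URL, sheet handling, and bucket name, not the DAG's actual code) turns each sheet into gzipped JSONL before upload:

```python
import gzip
import json

import pandas as pd  # plus openpyxl for .xlsx parsing
from google.cloud import storage

# Assumed source and destination; the real values belong in the DAG config.
XLSX_URL = "https://www.transit.dot.gov/sites/fta.dot.gov/files/ridership.xlsx"
BUCKET = "calitp-ntd-xlsx-products"

sheets = pd.read_excel(XLSX_URL, sheet_name=None)  # dict of sheet name -> DataFrame
bucket = storage.Client().bucket(BUCKET)

for name, df in sheets.items():
    # NaN is not valid JSON, so null it out before serializing.
    df = df.where(pd.notna(df), None)
    lines = "\n".join(
        json.dumps(rec, default=str) for rec in df.to_dict(orient="records")
    )
    # One JSONL.gz object per sheet so each can back its own external table.
    blob = bucket.blob(f"monthly_ridership/{name.lower().replace(' ', '_')}.jsonl.gz")
    blob.upload_from_string(
        gzip.compress(lines.encode("utf-8")), content_type="application/gzip"
    )
```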

@erikamov (Contributor)

2022 external tables were successfully created in cal-itp-data-infra.external_ntd__annual_reporting.

New PR created for NTD external tables for Complete Monthly Ridership with Adjustments and Estimates Data.

The remaining tables will not be created for now. In case they are needed, I kept the work on a branch. Those tables will need some extra work to be created, due to the position of the column names and invalid characters in the names (see previous comments).

erikamov pushed commits that referenced this issue Sep 23, 2024
@erikamov (Contributor)

Last PR merged. The next time create_external_tables runs, it will create those tables.

@erikamov (Contributor)

Tables created; new fields will be added on the next create_external_tables DAG runs.

@erikamov (Contributor)

The DAG updated all tables successfully.
