Commit

Add filtered datasets for "serious" incidents
h/t and thanks to @medievalmadeline for the core development of this
new feature 🎉
jsvine committed Mar 29, 2024
1 parent 8e317f1 commit 8d19a15
Showing 7 changed files with 31,951 additions and 0 deletions.
3 changes: 3 additions & 0 deletions .github/workflows/scrape.yaml
@@ -56,6 +56,9 @@ jobs:
- name: Get discovery dates
run: python scripts/01-get-discovery-dates.py --num-months $FETCH_NUM_MONTHS

- name: Create filtered datasets
run: python scripts/02-filter.py

- name: Write RSS
run: python scripts/03-generate-rss.py

3 changes: 3 additions & 0 deletions Makefile
@@ -30,6 +30,9 @@ fetch-data:
discover-dates:
venv/bin/python scripts/01-get-discovery-dates.py

filter-data:
venv/bin/python scripts/02-filter.py

publish-feed:
venv/bin/python scripts/03-generate-rss.py

4 changes: 4 additions & 0 deletions README.md
@@ -16,6 +16,8 @@ This repository, developed by the [Data Liberation Project](https://www.data-lib
- Status: 🔵 In progress
- Generate one file that contains a subset of fields (to keep size within GitHub's limits) for *all* reports
- Status: 🟠 Not yet started
- Generate filtered data focusing just on the most *serious* reports
- Status: 🟢 Completed, now available [here](data/processed/filtered/)
- Provide RSS feeds with the latest available incidents, nationally and by state
- Status: 🟢 Completed, now available [here](data/processed/feeds/)
- Provide RSS feeds listing incident [updates](https://www.ecfr.gov/current/title-49/subtitle-B/chapter-I/subchapter-C/part-171/subpart-B/section-171.16#p-171.16\(c\))
@@ -31,6 +33,7 @@ You can clone or [download](https://sites.northwestern.edu/researchcomputing/res

The files are split into months to stay within GitHub's file size limits. You can combine them with your preferred toolset. For example, using [`xsv`](https://github.com/BurntSushi/xsv#installation), you could run `xsv cat rows data/fetched/*.csv > combined.csv`.
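If you prefer Python, the same concatenation can be done with pandas (a minimal sketch; the `combine_csvs` helper name is ours, not part of the repository):

```python
import pathlib

import pandas as pd


def combine_csvs(directory, pattern="*.csv"):
    """Concatenate every CSV in `directory` into a single DataFrame.

    Reading with dtype=str avoids pandas inferring different types
    across the monthly files.
    """
    paths = sorted(pathlib.Path(directory).glob(pattern))
    return pd.concat(
        (pd.read_csv(p, dtype=str) for p in paths), ignore_index=True
    )
```

`combine_csvs("data/fetched").to_csv("combined.csv", index=False)` would mirror the `xsv` command above.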

A set of *all years'* incidents, *filtered* to just the most "serious," is available in the [`data/processed/filtered/`](data/processed/filtered/) directory.

### Resources

@@ -56,6 +59,7 @@ Many thanks to the volunteers who have contributed to this repository:
- [@gcappaert](https://github.com/gcappaert)
- [@m-nolan](https://github.com/m-nolan)
- [@rjintu](https://github.com/rjintu)
- [@medievalmadeline](https://github.com/medievalmadeline)

## Licensing

23 changes: 23 additions & 0 deletions data/processed/filtered/README.md
@@ -0,0 +1,23 @@
# Filtered Subsets

This directory contains filtered subsets of the full incident dataset.

## `serious-incidents.csv`

This dataset contains all rows for which *any* of the following fields has the value `Yes`:

- `Serious Incident Ind`
- `Hmis Serious Bulk Release`
- `Hmis Serious Evacuations`
- `Hmis Serious Fatality`
- `Hmis Serious Flight Plan`
- `Hmis Serious Injury`
- `Hmis Serious Major Artery`
- `Hmis Serious Marine Pollutant`
- `Hmis Serious Radioactive`

The Data Liberation Project thanks volunteer Madeline Everett for developing this filter, as well as the filter described below.

## `serious-incidents-expensive.csv`

This dataset begins with the same filter as above, but applies one additional constraint: the total cost of the incident (`Total Amount Of Damages`) must be $10,000 or more.
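As a rough illustration of how these flag columns can be inspected with pandas (a sketch only; the `flag_counts` helper and `SERIOUS_FLAGS` list are ours, not part of this repository):

```python
import pandas as pd

# The nine "serious" flag columns listed above.
SERIOUS_FLAGS = [
    "Serious Incident Ind",
    "Hmis Serious Bulk Release",
    "Hmis Serious Evacuations",
    "Hmis Serious Fatality",
    "Hmis Serious Flight Plan",
    "Hmis Serious Injury",
    "Hmis Serious Major Artery",
    "Hmis Serious Marine Pollutant",
    "Hmis Serious Radioactive",
]


def flag_counts(df):
    """Tally how many rows set each "serious" flag to Yes."""
    return {
        col: int((df[col] == "Yes").sum())
        for col in SERIOUS_FLAGS
        if col in df.columns
    }
```

Running `flag_counts` on the published `serious-incidents.csv` would show which criteria most often trigger the filter.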
10,000 changes: 10,000 additions & 0 deletions data/processed/filtered/serious-incidents-expensive.csv


21,872 changes: 21,872 additions & 0 deletions data/processed/filtered/serious-incidents.csv


46 changes: 46 additions & 0 deletions scripts/02-filter.py
@@ -0,0 +1,46 @@
import pathlib

import pandas as pd


def filter_rows(df, cost_min=0):
    # Keep rows where any "serious" flag is "Yes" and total damages
    # meet the minimum threshold.
    return df.loc[
(df["Total Amount Of Damages"] >= cost_min)
& (
(df["Serious Incident Ind"] == "Yes")
| (df["Hmis Serious Bulk Release"] == "Yes")
| (df["Hmis Serious Evacuations"] == "Yes")
| (df["Hmis Serious Fatality"] == "Yes")
| (df["Hmis Serious Flight Plan"] == "Yes")
| (df["Hmis Serious Injury"] == "Yes")
| (df["Hmis Serious Major Artery"] == "Yes")
| (df["Hmis Serious Marine Pollutant"] == "Yes")
| (df["Hmis Serious Radioactive"] == "Yes")
)
]


def read_csv(path):
    # Read all columns as strings, then convert the damages column to
    # int so it can be compared numerically in filter_rows.
    return pd.read_csv(path, dtype=str).astype({"Total Amount Of Damages": int})


def main():
# Collect all of the CSVs in the fetched folder
paths = sorted(pathlib.Path("data/fetched").glob("*.csv"))

# Concatenate all of the CSV files
all_rows = pd.concat(map(read_csv, paths), ignore_index=True)

# Filter to "serious" incidents
filtered_rows = filter_rows(all_rows)
filtered_rows.to_csv("data/processed/filtered/serious-incidents.csv", index=False)

# Filter the serious incidents to just those with $10k+ in total costs
filtered_rows_expensive = filter_rows(filtered_rows, cost_min=10000)
filtered_rows_expensive.to_csv(
"data/processed/filtered/serious-incidents-expensive.csv", index=False
)


if __name__ == "__main__":
main()
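For reference, the chained `|` comparisons in `filter_rows` are equivalent to an any-flag check across the nine columns, which can be sanity-checked on a toy DataFrame (a sketch, not part of the script; the vectorized form below is our restatement of the same logic):

```python
import pandas as pd

FLAGS = [
    "Serious Incident Ind", "Hmis Serious Bulk Release",
    "Hmis Serious Evacuations", "Hmis Serious Fatality",
    "Hmis Serious Flight Plan", "Hmis Serious Injury",
    "Hmis Serious Major Artery", "Hmis Serious Marine Pollutant",
    "Hmis Serious Radioactive",
]


def filter_rows(df, cost_min=0):
    # Equivalent to the chained comparisons in scripts/02-filter.py:
    # any serious flag set, and damages at or above the minimum.
    any_serious = (df[FLAGS] == "Yes").any(axis=1)
    return df.loc[(df["Total Amount Of Damages"] >= cost_min) & any_serious]


# Toy data: one cheap serious incident, one expensive serious
# incident, and one non-serious incident.
toy = pd.DataFrame({
    "Total Amount Of Damages": [500, 25000, 100],
    "Serious Incident Ind": ["Yes", "No", "No"],
    **{flag: ["No", "No", "No"] for flag in FLAGS[1:]},
})
toy.loc[1, "Hmis Serious Injury"] = "Yes"
```

With this toy frame, `filter_rows(toy)` keeps two rows and `filter_rows(toy, cost_min=10000)` keeps only the $25,000 incident.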
