Commit

Add filtered datasets for "serious" incidents
h/t and thanks to @medievalmadeline for the core development of this
new feature 🎉
jsvine committed Mar 29, 2024
1 parent 8e317f1 commit 8d19a15
Showing 7 changed files with 31,951 additions and 0 deletions.
3 changes: 3 additions & 0 deletions .github/workflows/scrape.yaml
@@ -56,6 +56,9 @@ jobs:
- name: Get discovery dates
run: python scripts/01-get-discovery-dates.py --num-months $FETCH_NUM_MONTHS

- name: Create filtered datasets
run: python scripts/02-filter.py

- name: Write RSS
run: python scripts/03-generate-rss.py

3 changes: 3 additions & 0 deletions Makefile
@@ -30,6 +30,9 @@ fetch-data:
discover-dates:
venv/bin/python scripts/01-get-discovery-dates.py

filter-data:
venv/bin/python scripts/02-filter.py

publish-feed:
venv/bin/python scripts/03-generate-rss.py

4 changes: 4 additions & 0 deletions README.md
@@ -16,6 +16,8 @@ This repository, developed by the [Data Liberation Project](https://www.data-lib
- Status: 🔵 In progress
- Generate one file that contains a subset of fields (to keep size within GitHub's limits) for *all* reports
- Status: 🟠 Not yet started
- Generate filtered data focusing just on the most *serious* reports
- Status: 🟢 Completed, now available [here](data/processed/filtered/)
- Provide RSS feeds with the latest available incidents, nationally and by state
- Status: 🟢 Completed, now available [here](data/processed/feeds/)
- Provide RSS feeds listing incident [updates](https://www.ecfr.gov/current/title-49/subtitle-B/chapter-I/subchapter-C/part-171/subpart-B/section-171.16#p-171.16\(c\))
@@ -31,6 +33,7 @@ You can clone or [download](https://sites.northwestern.edu/researchcomputing/res

The files are split into months to stay within GitHub's file size limits. You can combine them with your preferred toolset. For example, using [`xsv`](https://github.com/BurntSushi/xsv#installation), you could run `xsv cat rows data/fetched/*.csv > combined.csv`.
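If you prefer Python, the same concatenation can be done with pandas (a minimal sketch; the `combine_csvs` helper name is ours, not part of the repository):

```python
import pathlib

import pandas as pd


def combine_csvs(directory, pattern="*.csv"):
    """Concatenate every CSV in `directory` into a single DataFrame.

    Reading with dtype=str avoids pandas inferring different types
    across the monthly files.
    """
    paths = sorted(pathlib.Path(directory).glob(pattern))
    return pd.concat(
        (pd.read_csv(p, dtype=str) for p in paths), ignore_index=True
    )
```

`combine_csvs("data/fetched").to_csv("combined.csv", index=False)` would mirror the `xsv` command above.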

A set of *all years'* incidents, *filtered* to just the most "serious," is available in the [`data/processed/filtered/`](data/processed/filtered/) directory.

### Resources

@@ -56,6 +59,7 @@ Many thanks to the volunteers who have contributed to this repository:
- [@gcappaert](https://github.com/gcappaert)
- [@m-nolan](https://github.com/m-nolan)
- [@rjintu](https://github.com/rjintu)
- [@medievalmadeline](https://github.com/medievalmadeline)

## Licensing

23 changes: 23 additions & 0 deletions data/processed/filtered/README.md
@@ -0,0 +1,23 @@
# Filtered Subsets

This directory contains filtered subsets of the full incident dataset.

## `serious-incidents.csv`

This dataset contains all rows for which *any* of the following fields has the value `Yes`:

- `Serious Incident Ind`
- `Hmis Serious Bulk Release`
- `Hmis Serious Evacuations`
- `Hmis Serious Fatality`
- `Hmis Serious Flight Plan`
- `Hmis Serious Injury`
- `Hmis Serious Major Artery`
- `Hmis Serious Marine Pollutant`
- `Hmis Serious Radioactive`

The Data Liberation Project thanks volunteer Madeline Everett for developing this filter, as well as the filter described below.

## `serious-incidents-expensive.csv`

This dataset begins with the same filter as above, but applies one additional constraint: the total cost of the incident (`Total Amount Of Damages`) must be $10,000 or more.
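As a rough illustration of how these flag columns can be inspected with pandas (a sketch only; the `flag_counts` helper and `SERIOUS_FLAGS` list are ours, not part of this repository):

```python
import pandas as pd

# The nine "serious" flag columns listed above.
SERIOUS_FLAGS = [
    "Serious Incident Ind",
    "Hmis Serious Bulk Release",
    "Hmis Serious Evacuations",
    "Hmis Serious Fatality",
    "Hmis Serious Flight Plan",
    "Hmis Serious Injury",
    "Hmis Serious Major Artery",
    "Hmis Serious Marine Pollutant",
    "Hmis Serious Radioactive",
]


def flag_counts(df):
    """Tally how many rows set each "serious" flag to Yes."""
    return {
        col: int((df[col] == "Yes").sum())
        for col in SERIOUS_FLAGS
        if col in df.columns
    }
```

Running `flag_counts` on the published `serious-incidents.csv` would show which criteria most often trigger the filter.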
10,000 changes: 10,000 additions & 0 deletions data/processed/filtered/serious-incidents-expensive.csv


21,872 changes: 21,872 additions & 0 deletions data/processed/filtered/serious-incidents.csv


46 changes: 46 additions & 0 deletions scripts/02-filter.py
@@ -0,0 +1,46 @@
import pathlib

import pandas as pd


def filter_rows(df, cost_min=0):
    # Keep rows where any "serious" flag is "Yes" and total damages
    # meet the minimum threshold.
    return df.loc[
(df["Total Amount Of Damages"] >= cost_min)
& (
(df["Serious Incident Ind"] == "Yes")
| (df["Hmis Serious Bulk Release"] == "Yes")
| (df["Hmis Serious Evacuations"] == "Yes")
| (df["Hmis Serious Fatality"] == "Yes")
| (df["Hmis Serious Flight Plan"] == "Yes")
| (df["Hmis Serious Injury"] == "Yes")
| (df["Hmis Serious Major Artery"] == "Yes")
| (df["Hmis Serious Marine Pollutant"] == "Yes")
| (df["Hmis Serious Radioactive"] == "Yes")
)
]


def read_csv(path):
    # Read all columns as strings, then convert the damages column to
    # int so it can be compared numerically in filter_rows.
    return pd.read_csv(path, dtype=str).astype({"Total Amount Of Damages": int})


def main():
# Collect all of the CSVs in the fetched folder
paths = sorted(pathlib.Path("data/fetched").glob("*.csv"))

# Concatenate all of the CSV files
all_rows = pd.concat(map(read_csv, paths), ignore_index=True)

# Filter to "serious" incidents
filtered_rows = filter_rows(all_rows)
filtered_rows.to_csv("data/processed/filtered/serious-incidents.csv", index=False)

# Filter the serious incidents to just those with $10k+ in total costs
filtered_rows_expensive = filter_rows(filtered_rows, cost_min=10000)
filtered_rows_expensive.to_csv(
"data/processed/filtered/serious-incidents-expensive.csv", index=False
)


if __name__ == "__main__":
main()
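For reference, the chained `|` comparisons in `filter_rows` are equivalent to an any-flag check across the nine columns, which can be sanity-checked on a toy DataFrame (a sketch, not part of the script; the vectorized form below is our restatement of the same logic):

```python
import pandas as pd

FLAGS = [
    "Serious Incident Ind", "Hmis Serious Bulk Release",
    "Hmis Serious Evacuations", "Hmis Serious Fatality",
    "Hmis Serious Flight Plan", "Hmis Serious Injury",
    "Hmis Serious Major Artery", "Hmis Serious Marine Pollutant",
    "Hmis Serious Radioactive",
]


def filter_rows(df, cost_min=0):
    # Equivalent to the chained comparisons in scripts/02-filter.py:
    # any serious flag set, and damages at or above the minimum.
    any_serious = (df[FLAGS] == "Yes").any(axis=1)
    return df.loc[(df["Total Amount Of Damages"] >= cost_min) & any_serious]


# Toy data: one cheap serious incident, one expensive serious
# incident, and one non-serious incident.
toy = pd.DataFrame({
    "Total Amount Of Damages": [500, 25000, 100],
    "Serious Incident Ind": ["Yes", "No", "No"],
    **{flag: ["No", "No", "No"] for flag in FLAGS[1:]},
})
toy.loc[1, "Hmis Serious Injury"] = "Yes"
```

With this toy frame, `filter_rows(toy)` keeps two rows and `filter_rows(toy, cost_min=10000)` keeps only the $25,000 incident.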
