Fix pdp gen failure from different institution ids #47

ZakMiller · 2024-12-30T21:05:41Z

There's a cohort schema check that ensures all the data belongs to a single institution, but the PDP generation script creates a new institution id per record, so data created with the script fails validation. This PR creates the institution id once and uses it for both cohort and course data.
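A minimal sketch of the kind of check described above, to make the failure mode concrete. The real check lives in the cohort schema and is not shown in this PR, so the function name `check_single_institution` and the plain-dict record shape are assumptions for illustration only.

```python
def check_single_institution(records: list[dict]) -> None:
    """Raise ValueError unless every record shares one institution_id.

    Hypothetical stand-in for the cohort schema check; not the project's code.
    """
    ids = {record["institution_id"] for record in records}
    if len(ids) > 1:
        raise ValueError(
            f"expected a single institution_id, found {len(ids)}: {sorted(ids)}"
        )


# Records that each got a fresh id would trip the check ...
mixed = [{"institution_id": 11111}, {"institution_id": 22222}]
# ... while records sharing one id pass.
shared = [{"institution_id": 11111}, {"institution_id": 11111}]
```

Calling `check_single_institution(shared)` returns quietly, while `check_single_institution(mixed)` raises `ValueError`.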

changes

Generates the institution id once in the parent script and then passes it down.
Adds a test, for both cohort (which has the check) and course (which doesn't), that uses the new field to pass the same value down to multiple records (this would fail without this change).
Renames test_raw_cohort_record to test_raw_course_record in test_raw_course.py (unrelated; looks like a copy-paste error).
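The first two changes above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: `fake_cohort_record`, `fake_course_record`, and the record fields are hypothetical stand-ins for the real provider methods, and the id range is arbitrary.

```python
import random

rng = random.Random(0)  # seeded only so the sketch is reproducible


def fake_cohort_record(institution_id: int) -> dict:
    """Hypothetical stand-in for the cohort record generator."""
    return {"institution_id": institution_id, "student_id": rng.randint(1, 10**6)}


def fake_course_record(institution_id: int) -> dict:
    """Hypothetical stand-in for the course record generator."""
    return {"institution_id": institution_id, "course_number": rng.randint(100, 499)}


# The parent script picks the id once ...
institution_id = rng.randint(10_000, 99_999)
# ... and passes the same value to every record it generates.
cohort_records = [fake_cohort_record(institution_id) for _ in range(3)]
course_records = [fake_course_record(institution_id) for _ in range(3)]

# The property the new tests exercise: one id across all records.
all_ids = {r["institution_id"] for r in cohort_records + course_records}
```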

context

Noticed this while running through the templates using the existing test data in the repo. Some more similar changes to follow.

questions

Should the same schema check exist for the course schema?
Where should I create the Asana ticket (that's mentioned in CONTRIBUTING.md)?

bdewilde

Hey, good catch. Since uniqueness of institution_id only matters when generating multiple records, I think we should make it an optional parameter, so the single-record case still works outside the context of the generate data script. Does that make sense to you? Maybe I'm trying to preserve a use case that doesn't exist in practice, but making it a required arg for both types of records feels a bit heavy.
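One way the optional parameter suggested here might look, as a hedged sketch: the function name, return shape, and id range are assumptions, and only the `institution_id` parameter name comes from the discussion.

```python
import random
import typing as t


def fake_cohort_record(institution_id: t.Optional[int] = None) -> dict:
    """Hypothetical generator: use the caller's id, or make one up if absent."""
    if institution_id is None:
        # Single-record callers don't need to care about the id at all.
        institution_id = random.randint(10_000, 99_999)
    return {"institution_id": institution_id}


single = fake_cohort_record()             # standalone use, random id
batch = [fake_cohort_record(12345) for _ in range(3)]  # shared id for a dataset
```

With this shape the single-record case keeps working without arguments, while the generation script can still force one id across a whole dataset.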

src/student_success_tool/generation/pdp/raw_cohort.py

tests/generation/pdp/test_raw_course.py

bdewilde · 2025-01-03T00:37:42Z

src/student_success_tool/generation/pdp/raw_course.py

@@ -8,7 +8,7 @@

 class Provider(BaseProvider):
    def raw_course_record(
-        self, cohort_record: t.Optional[dict] = None, normalize_col_names: bool = False
+        self, cohort_record: t.Optional[dict] = None, normalize_col_names: bool = False, institution_id: int = 12345


I don't feel super strongly about this, but following existing patterns, it seems like we'd want to pull institution_id from the cohort_record if provided or otherwise generate a random one, rather than passing cohort record && a specified institution id. In this case, I think you'd just leave all the below code as it was, and just drop this extra arg?

Oh ya that seems way better. I was missing some context that you added with your comment.
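The pattern agreed on in this thread (take the id from `cohort_record` when one is given, otherwise generate a random one) could be sketched like this. The function body and the dict key are assumptions for illustration; the actual `raw_course_record` implementation is not shown in this conversation.

```python
import random
import typing as t


def fake_course_record(cohort_record: t.Optional[dict] = None) -> dict:
    """Hypothetical course generator that inherits the cohort's institution id."""
    if cohort_record is not None:
        # Reuse the id already chosen for the cohort record.
        institution_id = cohort_record["institution_id"]
    else:
        # No cohort context: fall back to a random id.
        institution_id = random.randint(10_000, 99_999)
    return {"institution_id": institution_id}


linked = fake_course_record({"institution_id": 42})
standalone = fake_course_record()
```

This avoids a signature where a caller could pass both a cohort record and a conflicting explicit id.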

bdewilde · 2025-01-04T17:05:42Z

scripts/generate_synthetic_pdp_datasets.py

        for _ in range(args.num_students)
    ]
    course_records = [
        FAKER.raw_course_record(
-            cohort_record, normalize_col_names=args.normalize_col_names
-        )
+            cohort_record, normalize_col_names=args.normalize_col_names)


Just curious, are you using an autoformatter on this code? These parens should be dangling according to ruff 🤷‍♂️

I had not, but I've now run ruff (in the latest commit).

bdewilde · 2025-01-04T17:07:06Z

tests/generation/pdp/test_raw_cohort.py

+        print(df_obs)
+
+
+def test_multiple_raw_cohort_records():


There's a cohort schema check that ensures there's a single institution for all the data, but the pdp generation script creates a new institution id per record. This creates the institution id once and uses it for both cohort and course data. Also fixed what's probably a copy paste error, test_raw_cohort_record -> test_raw_course_record in test_raw_course.py


ZakMiller requested review from bdewilde, kaylawilding and vishpillai123 as code owners December 30, 2024 21:05

ZakMiller force-pushed the bugfix/consistent-pdp-institution-id branch from 113fab5 to 15dc7a0 Compare December 30, 2024 21:06

bdewilde reviewed Jan 3, 2025


ZakMiller force-pushed the bugfix/consistent-pdp-institution-id branch from 03f0e05 to 076cf91 Compare January 3, 2025 19:57

ZakMiller mentioned this pull request Jan 3, 2025

Allow True and False through _cast_to_bool_via_int unchanged #49

Merged

ZakMiller force-pushed the bugfix/consistent-pdp-institution-id branch from 076cf91 to b40bc75 Compare January 3, 2025 21:53

bdewilde approved these changes Jan 4, 2025


ZakMiller and others added 4 commits January 4, 2025 14:26

Apply suggestions from code review

2f007f5

Co-authored-by: Burton DeWilde <[email protected]>

Responding to feedback

ed482f5

- Removed course changes (it was already looking at cohort)
- Default to generating an id if it's not provided

Ran ruff

e1fb9a4

ZakMiller force-pushed the bugfix/consistent-pdp-institution-id branch from 0f6d944 to e1fb9a4 Compare January 4, 2025 19:26

ZakMiller merged commit 58d2e2d into develop Jan 4, 2025
5 checks passed

ZakMiller deleted the bugfix/consistent-pdp-institution-id branch January 4, 2025 22:29
