Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JP-3742: Remove allowed_memory and associated environment variable #8975

Merged
merged 4 commits into from
Nov 25, 2024

Conversation

emolter
Copy link
Collaborator

@emolter emolter commented Nov 22, 2024

Resolves JP-3742

Closes #8775

This PR removes the allowed_memory parameter from the resample and outlier detection steps, the DMODEL_ALLOWED_MEMORY environment variable that would set the value of that parameter, and the custom OutputTooLargeError that was raised when that memory limit was hit.

There are multiple reasons why this parameter was not working as intended and should be removed, exposed in part by a conversation with Jesse Doggett about how this might be used in operations:

  • Checking the output_wcs.array_shape is insufficient to know the actual memory usage of the step in almost all cases. For example, testing has shown that when resample is called via outlier detection, the median computation is often more memory-hungry. The pixmap calculation may also take lots of memory, as computing it involves lots of array copying and its final size is 2x the input size of float64s. And the context array has the same height and width as the output but with a depth that is the number of input images divided by 32 (but see also JP-3707). This likely explains why the OutputTooLargeError is never encountered in ops.
  • It is not clear that the inclusion of swap memory is a good thing in many cases
  • It was realized that this OutputTooLargeError is not currently one of the metrics ops uses to check if a step ran out of memory. The only things checked are: exit status 137 from strun (killed by sigkill); “ValueError: assignment destination is read-only”; and “MemoryErr”. This OutputTooLargeError has apparently never been encountered.
  • In operations, the available memory read by psutil is not very helpful. Each machine has 10 job slots, but the available memory is (very likely but not yet tested) reading the total for the whole machine. Operations sets the available_memory equal to 0.7 (using the DMODEL_ALLOWED_MEMORY) environment variable, which is probably not a good idea for a machine running 10 jobs.

Tasks

  • request a review from someone specific, to avoid making the maintainers review every PR
  • add a build milestone, i.e. Build 11.3 (use the latest build if not sure)
  • Does this PR change user-facing code / API? (if not, label with no-changelog-entry-needed)
    • write news fragment(s) in changes/: echo "changed something" > changes/<PR#>.<changetype>.rst (see below for change types)
    • update or add relevant tests
    • update relevant docstrings and / or docs/ page
    • start a regression test and include a link to the running job (click here for instructions)
      • Do truth files need to be updated ("okified")?
        • after the reviewer has approved these changes, run okify_regtests to update the truth files
  • if a JIRA ticket exists, make sure it is resolved properly
news fragment change types...
  • changes/<PR#>.general.rst: infrastructure or miscellaneous change
  • changes/<PR#>.docs.rst
  • changes/<PR#>.stpipe.rst
  • changes/<PR#>.datamodels.rst
  • changes/<PR#>.scripts.rst
  • changes/<PR#>.fits_generator.rst
  • changes/<PR#>.set_telescope_pointing.rst
  • changes/<PR#>.pipeline.rst

stage 1

  • changes/<PR#>.group_scale.rst
  • changes/<PR#>.dq_init.rst
  • changes/<PR#>.emicorr.rst
  • changes/<PR#>.saturation.rst
  • changes/<PR#>.ipc.rst
  • changes/<PR#>.firstframe.rst
  • changes/<PR#>.lastframe.rst
  • changes/<PR#>.reset.rst
  • changes/<PR#>.superbias.rst
  • changes/<PR#>.refpix.rst
  • changes/<PR#>.linearity.rst
  • changes/<PR#>.rscd.rst
  • changes/<PR#>.persistence.rst
  • changes/<PR#>.dark_current.rst
  • changes/<PR#>.charge_migration.rst
  • changes/<PR#>.jump.rst
  • changes/<PR#>.clean_flicker_noise.rst
  • changes/<PR#>.ramp_fitting.rst
  • changes/<PR#>.gain_scale.rst

stage 2

  • changes/<PR#>.assign_wcs.rst
  • changes/<PR#>.badpix_selfcal.rst
  • changes/<PR#>.msaflagopen.rst
  • changes/<PR#>.nsclean.rst
  • changes/<PR#>.imprint.rst
  • changes/<PR#>.background.rst
  • changes/<PR#>.extract_2d.rst
  • changes/<PR#>.master_background.rst
  • changes/<PR#>.wavecorr.rst
  • changes/<PR#>.srctype.rst
  • changes/<PR#>.straylight.rst
  • changes/<PR#>.wfss_contam.rst
  • changes/<PR#>.flatfield.rst
  • changes/<PR#>.fringe.rst
  • changes/<PR#>.pathloss.rst
  • changes/<PR#>.barshadow.rst
  • changes/<PR#>.photom.rst
  • changes/<PR#>.pixel_replace.rst
  • changes/<PR#>.resample_spec.rst
  • changes/<PR#>.residual_fringe.rst
  • changes/<PR#>.cube_build.rst
  • changes/<PR#>.extract_1d.rst
  • changes/<PR#>.resample.rst

stage 3

  • changes/<PR#>.assign_mtwcs.rst
  • changes/<PR#>.mrs_imatch.rst
  • changes/<PR#>.tweakreg.rst
  • changes/<PR#>.skymatch.rst
  • changes/<PR#>.exp_to_source.rst
  • changes/<PR#>.outlier_detection.rst
  • changes/<PR#>.tso_photometry.rst
  • changes/<PR#>.stack_refs.rst
  • changes/<PR#>.align_refs.rst
  • changes/<PR#>.klip.rst
  • changes/<PR#>.spectral_leak.rst
  • changes/<PR#>.source_catalog.rst
  • changes/<PR#>.combine_1d.rst
  • changes/<PR#>.ami.rst

other

  • changes/<PR#>.wfs_combine.rst
  • changes/<PR#>.white_light.rst
  • changes/<PR#>.cube_skymatch.rst
  • changes/<PR#>.engdb_tools.rst
  • changes/<PR#>.guider_cds.rst

@emolter
Copy link
Collaborator Author

emolter commented Nov 22, 2024

starting regression tests here: https://github.com/spacetelescope/RegressionTests/actions/runs/11976873413

edit: regtest failure matches the one on main

Copy link

codecov bot commented Nov 22, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 64.53%. Comparing base (2e01d5f) to head (c4ccd7d).
Report is 674 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8975      +/-   ##
==========================================
+ Coverage   64.50%   64.53%   +0.02%     
==========================================
  Files         375      375              
  Lines       38750    38729      -21     
==========================================
- Hits        24996    24994       -2     
+ Misses      13754    13735      -19     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@emolter emolter marked this pull request as ready for review November 22, 2024 18:11
@emolter emolter requested a review from a team as a code owner November 22, 2024 18:11
@braingram
Copy link
Collaborator

Overall LGTM.

I think this means we can drop psutil as a dependency. Is there any other usage of that package?

@emolter
Copy link
Collaborator Author

emolter commented Nov 22, 2024

Overall LGTM.

I think this means we can drop psutil as a dependency. Is there any other usage of that package?

good catch, looks like no

Copy link
Collaborator

@braingram braingram left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I reran your regtest run with the psutil dependency removal and all tests passed.

Copy link
Contributor

@tapastro tapastro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@tapastro tapastro merged commit 6fc485b into spacetelescope:main Nov 25, 2024
31 checks passed
@emolter emolter deleted the JP-3742 branch November 25, 2024 15:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Reconsider resample allowed_memory and associated environment variable
3 participants