Changes to reduce memory footprint in Aggregator #1391

Open
snehal-das wants to merge 2 commits into develop from agg-memory-optimization

Conversation

snehal-das (Contributor) commented:

Changes made:

  • Explicit deletion of local variables reduces the run-time footprint, since the garbage collector can reclaim that memory sooner (see the sketch after this list).
  • Moved the memory measurement to after tensor_db::clean_up() to get a clearer picture of memory usage.
  • Added garbage collection at the end of the round, after the db clean_up, to free unused memory.
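
A minimal sketch of what these changes amount to, assuming an illustrative end-of-round helper (tensor_db clean_up and the round-retention idea come from this thread; every other name is a placeholder, not the actual Aggregator code):

```python
import gc

def end_of_round_cleanup(tensor_db, round_number, db_store_rounds=2):
    """Illustrative end-of-round cleanup: drop large locals, prune the
    tensor DB, then run the garbage collector before measuring memory."""
    # ... aggregation work happens here, producing large intermediates ...
    intermediates = {"round": round_number}  # placeholder for large per-round objects

    # Drop references explicitly so the objects can be collected now, rather
    # than surviving for the rest of the long-lived round-handling scope.
    del intermediates

    # Prune tensors outside the retention window, then force a collection.
    tensor_db.clean_up(db_store_rounds)
    gc.collect()

    # Any memory measurement should happen only after clean_up() + gc.collect(),
    # so the reading reflects memory that is actually still resident.
```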

snehal-das force-pushed the agg-memory-optimization branch from 1ece418 to aab7971 on February 19, 2025
snehal-das marked this pull request as ready for review on February 19, 2025
theakshaypant (Collaborator) left a comment:

Would also suggest running the formatting script, as the pipeline is failing:

sh scripts/format.sh

rahulga1 (Collaborator) commented:

@snehal-das can you run it with memory_callback with a large number of rounds to see the impact of this change?
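
For reference, a minimal way to sample the aggregator's resident memory per round (independent of the memory_callback plumbing, whose exact API isn't shown in this thread) is to log RSS with psutil; this is a generic sketch, not the callback itself:

```python
import gc
import psutil

_process = psutil.Process()

def log_round_memory(round_number, logger):
    """Log the aggregator's resident set size (MiB) at the end of a round."""
    gc.collect()  # collect first so the reading excludes already-dead objects
    rss_mib = _process.memory_info().rss / (1024 * 1024)
    logger.info("Round %d: aggregator RSS = %.1f MiB", round_number, rss_mib)
```

Plotting this value over a few hundred rounds should make the impact of the change visible.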

MasterSkepticista (Member) left a comment:

@snehal-das Does this PR "fix" the memory leak once and for all, or does it only optimize memory usage?

snehal-das force-pushed the agg-memory-optimization branch from a75c1a1 to 147d017 on February 25, 2025
snehal-das (Contributor, Author) commented on Feb 25, 2025:

can you run it with memory_callback with a large number of rounds to see the impact of this change?

@rahulga1 I've run these changes for 100-200 rounds with torch/mnist and torch/histology tests. I've also tested it with 32 collaborators to evaluate the impact of the change.

Does this PR "fix" the memory leak once and for all, or does it only optimize memory usage?

@MasterSkepticista The current state of the code does not fix the memory issues; it only reduces the aggregator's footprint. Overall memory usage still grows over time, albeit at a slower rate. A couple more approaches are being considered:

  1. Reducing memory fragmentation: tracking fordblks shows that fragmentation grows in proportion to the number of rounds (see the first sketch after this list).
  2. Delegating memory-intensive tasks to child processes, so that memory allocated during execution is returned to the allocator when the child exits (see the second sketch after this list).
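
On point 1, fordblks is reported by glibc's mallinfo; a rough way to sample it from Python on Linux (glibc only; mallinfo is deprecated in favour of mallinfo2 but still available) could look like this:

```python
import ctypes

class MallInfo(ctypes.Structure):
    # Field layout of glibc's `struct mallinfo` (all ints).
    _fields_ = [(name, ctypes.c_int) for name in (
        "arena", "ordblks", "smblks", "hblks", "hblkhd",
        "usmblks", "fsmblks", "uordblks", "fordblks", "keepcost")]

_libc = ctypes.CDLL("libc.so.6")
_libc.mallinfo.restype = MallInfo

def free_heap_bytes():
    """Bytes sitting in free heap chunks (fordblks). Growth across rounds
    points to fragmentation rather than a plain leak."""
    return _libc.mallinfo().fordblks
```

And on point 2, a sketch of pushing a memory-heavy step into a short-lived child process (the function and its arguments are illustrative), so that its allocations are released when the child exits:

```python
from multiprocessing import get_context

def run_in_child(fn, *args):
    """Run fn(*args) in a spawned child process and return its result."""
    ctx = get_context("spawn")
    with ctx.Pool(processes=1) as pool:
        return pool.apply(fn, args)
```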
