Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Properly cleanup processes and queues for MPRS and Fix pause for prefetch (#1075) #1096

Closed
wants to merge 1 commit into from

Conversation

ejguan
Copy link
Contributor

@ejguan ejguan commented Mar 17, 2023

Summary:
Fixes issue about MPRS.finalize when dataloader2.shutdown() is called

Changes

  • DataLoader2 should always clean up datapipe_iter at shutdown
  • Guard MPRS to finalize once
  • Fix the problem of ConnectionError when DataLoader early exits
    • This is caused by queue is joined when main/worker/dispatching process exits. No more request/response can be passed across processes.
      • Consumer process shouldn't join the req_queue at exit to make sure producer process can still access the remaining request. And, consumer will close req_queue after clean up to prevent any further request sent to queue.
      • Produce process shouldn't join the res_queue at exit to make sure consumer process can still access response. And, producer will close res_queue after clean up to prevent any further response sent to queue.
        • Main (Consumer) <-> Worker (Producer)
        • Worker (Consumer) -> Dispatching (Producer)
  • Fix pause API for DataLoader2
    • Invoke pause lazily until the limit+1 iteration is reached to align with python's iterator behavior. (Breaking internal tests)
    • Make prefetch.pause blocking unless there might be potential racing issue. Main thread is paused but prefetch worker is still trying to fetch data from iter.
  • Add tests to validate

Pull Request resolved: #1075

Differential Revision: D44168655

Pulled By: ejguan

…efetch (pytorch#1075)

Summary:
Fixes issue about `MPRS.finalize` when `dataloader2.shutdown()` is called

### Changes

- DataLoader2 should always clean up `datapipe_iter` at shutdown
- Guard `MPRS` to finalize once
- Fix the problem of `ConnectionError` when DataLoader early exits
  - This is caused by `queue` is joined when main/worker/dispatching process exits. No more request/response can be passed across processes.
    - Consumer process shouldn't join the `req_queue` at exit to make sure producer process can still access the remaining request. And, consumer will close `req_queue` after clean up to prevent any further request sent to queue.
    - Produce process shouldn't join the `res_queue` at exit to make sure consumer process can still access response. And, producer will close `res_queue` after clean up to prevent any further response sent to queue.
       - Main (Consumer) <-> Worker (Producer)
       - Worker (Consumer) -> Dispatching (Producer)
- Fix `pause` API for DataLoader2
    - Invoke `pause` lazily until the `limit+1` iteration is reached to align with python's iterator behavior.
    - Make `prefetch.pause` blocking unless there might be potential racing issue. Main thread is paused but prefetch worker is still trying to fetch data from `iter`.
- Add tests to validate

Pull Request resolved: pytorch#1075

Differential Revision: D44168655

Pulled By: ejguan

fbshipit-source-id: 4fa8c34438414cad7f46527355eb0d51866e6b15
@facebook-github-bot facebook-github-bot added CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. fb-exported labels Mar 17, 2023
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D44168655

@ejguan ejguan requested a review from NivekT March 17, 2023 17:35
@ejguan
Copy link
Contributor Author

ejguan commented Mar 17, 2023

This is the reland of #1075. To fix the internal test failures, I have to remove the optional pause change to DataLoader2.

@facebook-github-bot
Copy link
Contributor

@ejguan merged this pull request in 8a9b963.

@ejguan ejguan added the topic: improvements topic category label Mar 20, 2023
@ejguan ejguan added this to the 0.6.1 milestone Mar 20, 2023
NivekT pushed a commit that referenced this pull request Apr 19, 2023
…efetch (#1096)

Summary:
Pull Request resolved: #1096

Fixes issue about `MPRS.finalize` when `dataloader2.shutdown()` is called

### Changes

- DataLoader2 should always clean up `datapipe_iter` at shutdown
- Guard `MPRS` to finalize once
- Fix the problem of `ConnectionError` when DataLoader early exits
  - This is caused by `queue` is joined when main/worker/dispatching process exits. No more request/response can be passed across processes.
    - Consumer process shouldn't join the `req_queue` at exit to make sure producer process can still access the remaining request. And, consumer will close `req_queue` after clean up to prevent any further request sent to queue.
    - Produce process shouldn't join the `res_queue` at exit to make sure consumer process can still access response. And, producer will close `res_queue` after clean up to prevent any further response sent to queue.
       - Main (Consumer) <-> Worker (Producer)
       - Worker (Consumer) -> Dispatching (Producer)
- Fix `pause` API for DataLoader2
    - Invoke `pause` lazily until the `limit+1` iteration is reached to align with python's iterator behavior.
    - Make `prefetch.pause` blocking unless there might be potential racing issue. Main thread is paused but prefetch worker is still trying to fetch data from `iter`.
- Add tests to validate

Pull Request resolved: #1075

Reviewed By: NivekT

Differential Revision: D44168655

Pulled By: ejguan

fbshipit-source-id: fdfee5c27b512b5c0d5308e53a81b1cb2db70a43
ejguan added a commit that referenced this pull request Apr 20, 2023
…efetch (#1096)

Summary:
Pull Request resolved: #1096

Fixes issue about `MPRS.finalize` when `dataloader2.shutdown()` is called

### Changes

- DataLoader2 should always clean up `datapipe_iter` at shutdown
- Guard `MPRS` to finalize once
- Fix the problem of `ConnectionError` when DataLoader early exits
  - This is caused by `queue` is joined when main/worker/dispatching process exits. No more request/response can be passed across processes.
    - Consumer process shouldn't join the `req_queue` at exit to make sure producer process can still access the remaining request. And, consumer will close `req_queue` after clean up to prevent any further request sent to queue.
    - Produce process shouldn't join the `res_queue` at exit to make sure consumer process can still access response. And, producer will close `res_queue` after clean up to prevent any further response sent to queue.
       - Main (Consumer) <-> Worker (Producer)
       - Worker (Consumer) -> Dispatching (Producer)
- Fix `pause` API for DataLoader2
    - Invoke `pause` lazily until the `limit+1` iteration is reached to align with python's iterator behavior.
    - Make `prefetch.pause` blocking unless there might be potential racing issue. Main thread is paused but prefetch worker is still trying to fetch data from `iter`.
- Add tests to validate

Pull Request resolved: #1075

Reviewed By: NivekT

Differential Revision: D44168655

Pulled By: ejguan

fbshipit-source-id: fdfee5c27b512b5c0d5308e53a81b1cb2db70a43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. fb-exported Merged topic: improvements topic category
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants