Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Validate CUDA SHM region registration size #7178

Merged
merged 13 commits into from
May 7, 2024
Merged

Validate CUDA SHM region registration size #7178

merged 13 commits into from
May 7, 2024

Conversation

krishung5
Copy link
Contributor

@krishung5 krishung5 commented May 2, 2024

  • Validate CUDA SHM region size during registration
  • Add CUDA SHM registration tests
  • Refactor tests for CUDA SHM and System SHM

@krishung5 krishung5 requested review from rmccorm4 and jbkyang-nvi May 2, 2024 03:08
Copy link
Contributor

@rmccorm4 rmccorm4 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Refactor and general approach LGTM! Only minor comments, thanks for tackling this so quickly @krishung5 !

@krishung5 krishung5 requested review from rmccorm4 and nnshah1 May 3, 2024 18:17
rmccorm4
rmccorm4 previously approved these changes May 3, 2024
Copy link
Contributor

@nnshah1 nnshah1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

want to review level of detail passed back to client / user of api

@nnshah1 nnshah1 requested review from jbkyang-nvi and removed request for jbkyang-nvi May 3, 2024 21:34
Copy link
Contributor

@nnshah1 nnshah1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Want to normalize to a single error message for "failed to register shared memory .. invalid args"

@krishung5 krishung5 requested a review from nnshah1 May 6, 2024 19:35
Copy link
Contributor

@jbkyang-nvi jbkyang-nvi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Minor suggestions. Thanks for refactoring the tests

qa/L0_cuda_shared_memory/cuda_shared_memory_test.py Outdated Show resolved Hide resolved
src/shared_memory_manager.cc Show resolved Hide resolved
src/shared_memory_manager.cc Show resolved Hide resolved
@krishung5
Copy link
Contributor Author

Made the error message less informative for OpenCudaIPCRegion and fixed the L0_grpc test.

@krishung5 krishung5 requested a review from rmccorm4 May 7, 2024 01:41
@@ -255,11 +255,57 @@ OpenCudaIPCRegion(
cudaError_t err = cudaIpcOpenMemHandle(
data_ptr, *cuda_shm_handle, cudaIpcMemLazyEnablePeerAccess);
if (err != cudaSuccess) {
// Should not pass the detailed error message back to the client.
Copy link
Contributor

@nnshah1 nnshah1 May 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Should not pass the detailed error message back to the client.
// Log detailed error message and send generic error to client

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated!

nnshah1
nnshah1 previously approved these changes May 7, 2024
Copy link
Contributor

@nnshah1 nnshah1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

error logging looks good - one slight suggested tweak to comment

@krishung5 krishung5 requested a review from nnshah1 May 7, 2024 16:18
@krishung5 krishung5 merged commit c5227c0 into main May 7, 2024
3 checks passed
@krishung5 krishung5 deleted the krish-cuda-shm branch May 7, 2024 17:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

4 participants