
Kernel attributes #360

Merged
merged 34 commits into from
Feb 8, 2025
Conversation

ksimpson-work
Contributor

@ksimpson-work ksimpson-work commented Jan 6, 2025

Add getters and setters for the kernel attributes.

close #205

Contributor

copy-pr-bot bot commented Jan 6, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@ksimpson-work ksimpson-work self-assigned this Jan 6, 2025
@ksimpson-work ksimpson-work added enhancement Any code-related improvements P0 High priority - Must do! cuda.core Everything related to the cuda.core module labels Jan 6, 2025
@ksimpson-work
Contributor Author

/ok to test

@ksimpson-work
Contributor Author

/ok to test

@leofang leofang added this to the cuda.core beta 3 milestone Jan 7, 2025
@leofang leofang added feature New feature or request and removed enhancement Any code-related improvements labels Jan 8, 2025
@ksimpson-work
Contributor Author

/ok to test

@ksimpson-work
Contributor Author

/ok to test

@ksimpson-work
Contributor Author

I have a design question for any reviewers to weigh in on. There is another change in the works that adds device properties to the Device class, and the way I've implemented that is to have device_instance.properties -> DeviceProperties, where DeviceProperties lazily queries the properties and exposes them. In short, you would get a property like this:

device = Device()
device.properties.property_a

The reason I put all of the properties in the subclass is that there are a lot of them, and adding them straight to Device would make it very bloated.

The question is whether you think I should do the same thing here. Prior to making the device properties change, I thought this was the best way to implement it, but I am now leaning towards putting the attributes in a subclass so they would be accessed like:

kernel.attributes.attribute_a = True
variable = kernel.attributes.attribute_b

One considerable difference is that all the device properties are read-only, while some of the kernel attributes are read/write.
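The namespace-plus-lazy-query pattern described in this comment can be sketched in plain Python. Everything below is illustrative, not the actual cuda.core implementation: the `KernelAttributes` class shape, the attribute names, and the dict-backed query callable (which stands in for a driver call such as `cuFuncGetAttribute`) are assumptions made for the sketch.

```python
from functools import cached_property


class KernelAttributes:
    """Sketch of the attribute-namespace pattern: values are queried lazily
    on first access and cached on the instance thereafter."""

    def __init__(self, driver_query):
        # driver_query: callable mapping an attribute name to its value;
        # in this sketch it stands in for a CUDA driver round-trip.
        self._query = driver_query

    @cached_property
    def max_threads_per_block(self):
        return self._query("max_threads_per_block")

    @cached_property
    def num_regs(self):
        return self._query("num_regs")


class Kernel:
    """Minimal stand-in for a kernel object."""

    @cached_property
    def attributes(self):
        # The namespace object itself is also created lazily, so a Kernel
        # that never touches .attributes pays nothing for it.
        return KernelAttributes(
            lambda name: {"max_threads_per_block": 1024, "num_regs": 32}[name]
        )


kernel = Kernel()
print(kernel.attributes.max_threads_per_block)  # 1024, fetched once then cached
```

This keeps `dir(kernel)` small (one `attributes` entry) while tab completion on `kernel.attributes` surfaces the full attribute list.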

@ksimpson-work ksimpson-work marked this pull request as ready for review January 21, 2025 21:26
@ksimpson-work ksimpson-work requested a review from leofang January 21, 2025 21:26
@leofang
Member

leofang commented Jan 21, 2025

The question is whether you think I should do the same thing here. Prior to making the device properties change, I thought this was the best way to implement it, but I am now leaning towards putting the attributes in a subclass so they would be accessed

I really think this is the way to go! We definitely do not want to bloat the kernel/device instance when hitting tab.

@ksimpson-work
Contributor Author

ok cool, I agree. Change made

@ksimpson-work
Contributor Author

/ok to test

@ksimpson-work
Contributor Author

Updated the PR to remove the setters on the read/write properties, in line with the discussion about deadlock between properties and the launch config, plus a couple of formatting improvements to the docs.
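For context on dropping the setters: in Python, defining a property with a getter only is enough to make the attribute read-only. A minimal sketch (the class, attribute name, and backing dict are hypothetical, not the real cuda.core code):

```python
class KernelAttributes:
    """Hypothetical sketch: attributes exposed as getter-only properties."""

    def __init__(self, values):
        self._values = values  # stands in for driver-side state

    @property
    def max_dynamic_shared_size_bytes(self):
        # Getter only: with no corresponding .setter defined, assignment
        # raises AttributeError, so the attribute is read-only.
        return self._values["max_dynamic_shared_size_bytes"]


attrs = KernelAttributes({"max_dynamic_shared_size_bytes": 48 * 1024})
print(attrs.max_dynamic_shared_size_bytes)  # 49152
# attrs.max_dynamic_shared_size_bytes = 0  # raises AttributeError
```

Writable attributes would instead be routed through the launch config, sidestepping the deadlock concern raised in the discussion.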

@ksimpson-work
Contributor Author

/ok to test

…luggy-1.5.0

benchmark: 4.0.0 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=100000)
rootdir: /home/ksimpson/code/cuda-python/cuda_core
configfile: pyproject.toml
plugins: benchmark-4.0.0
collected 17 items

tests/test_module.py xAverage time per call to max_threads_per_block: 0.0000001646 seconds
.Average time per call to shared_size_bytes: 0.0000001421 seconds
.Average time per call to const_size_bytes: 0.0000001451 seconds
.Average time per call to local_size_bytes: 0.0000001464 seconds
.Average time per call to num_regs: 0.0000001585 seconds
.Average time per call to ptx_version: 0.0000002534 seconds
.Average time per call to binary_version: 0.0000001346 seconds
.Average time per call to cache_mode_ca: 0.0000001768 seconds
.Average time per call to cluster_size_must_be_set: 0.0000002234 seconds
.Average time per call to max_dynamic_shared_size_bytes: 0.0000001594 seconds
.Average time per call to preferred_shared_memory_carveout: 0.0000001541 seconds
.Average time per call to required_cluster_width: 0.0000001443 seconds
.Average time per call to required_cluster_height: 0.0000001399 seconds
.Average time per call to required_cluster_depth: 0.0000001660 seconds
.Average time per call to non_portable_cluster_size_allowed: 0.0000001502 seconds
.Average time per call to cluster_scheduling_policy_preference: 0.0000001410 seconds
.

====================================== 16 passed, 1 xfailed in 2.66s ======================================
(cuda_126) ksimpson@NV-3KWHSV3:~/code/cuda-python/cuda_core$ python -m pytest tests/test_module.py -s
=========================================== test session starts ===========================================
platform linux -- Python 3.12.7, pytest-8.3.3, pluggy-1.5.0
benchmark: 4.0.0 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=100000)
rootdir: /home/ksimpson/code/cuda-python/cuda_core
configfile: pyproject.toml
plugins: benchmark-4.0.0
collected 17 items

tests/test_module.py xAverage time per call to max_threads_per_block: 0.0000006603 seconds
.Average time per call to shared_size_bytes: 0.0000006781 seconds
.Average time per call to const_size_bytes: 0.0000005997 seconds
.Average time per call to local_size_bytes: 0.0000006500 seconds
.Average time per call to num_regs: 0.0000006209 seconds
.Average time per call to ptx_version: 0.0000006196 seconds
.Average time per call to binary_version: 0.0000006121 seconds
.Average time per call to cache_mode_ca: 0.0000006328 seconds
.Average time per call to cluster_size_must_be_set: 0.0000006298 seconds
.Average time per call to max_dynamic_shared_size_bytes: 0.0000006944 seconds
.Average time per call to preferred_shared_memory_carveout: 0.0000007717 seconds
.Average time per call to required_cluster_width: 0.0000006319 seconds
.Average time per call to required_cluster_height: 0.0000006384 seconds
.Average time per call to required_cluster_depth: 0.0000006286 seconds
.Average time per call to non_portable_cluster_size_allowed: 0.0000006788 seconds
.Average time per call to cluster_scheduling_policy_preference: 0.0000008922 seconds
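Per-call timings in the hundreds-of-nanoseconds range are what caching after the first query delivers. The caching behaviour itself can be sketched with `functools.cached_property`; this is illustrative only, with a call counter standing in for the real CUDA driver round-trip:

```python
from functools import cached_property


class Attrs:
    """Illustrative only: a counter stands in for a CUDA driver query."""

    def __init__(self):
        self.query_count = 0

    @cached_property
    def num_regs(self):
        self.query_count += 1  # this would be the expensive driver call
        return 32


a = Attrs()
for _ in range(1000):
    _ = a.num_regs  # repeated accesses hit the per-instance cache
print(a.query_count)  # 1: the underlying query ran only once
```

After the first access, subsequent reads are ordinary instance-dict lookups, which is why the benchmarked getters cost so little per call.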
@ksimpson-work
Contributor Author

@leofang I propose we leave the DeviceProperties #409 review on the back burner until this one is merged; then I will port all the relevant changes (caching, test skipping, etc.) over to that one.

leofang

This comment was marked as resolved.

@ksimpson-work
Contributor Author

/ok to test

@leofang leofang merged commit 7387715 into NVIDIA:main Feb 8, 2025
69 checks passed

github-actions bot commented Feb 8, 2025

Doc Preview CI
Preview removed because the pull request was closed or merged.

Development

Successfully merging this pull request may close these issues.

Add Kernel attribute getter/setter
2 participants