Derive the device_task_t from a parsec_object_t #694

Open
wants to merge 4 commits into base: master
Conversation

@bosilca bosilca commented Nov 6, 2024

Introduce the parsec_gpu_flow_info_s info structure to combine the flow information needed by the GPU code.

Allow the standard device tasks (aka parsec_gpu_dsl_task_t) to contain the flow_info array inside the task, while allowing other DSLs to have their own type of device task (derived from parsec_gpu_task_t).

Enhance the mechanism to release the device tasks via the release_device_task function pointer. The device code will call this function to let the DSL decide how the device task release should be handled. Some DSLs (PTG and DTD as of now) will call OBJ_RELEASE on it (the free is automatic), while others (TTG, for example) will have their own handling.


Signed-off-by: George Bosilca <[email protected]>
@bosilca bosilca requested a review from a team as a code owner November 6, 2024 22:22
Co-authored-by: Aurelien Bouteiller <[email protected]>
abouteiller previously approved these changes Dec 12, 2024
@abouteiller abouteiller added the API Change Change to the public API, backward incompatible (version major bump) label Dec 12, 2024
@abouteiller abouteiller self-requested a review December 13, 2024 12:47
@abouteiller abouteiller dismissed their stale review December 13, 2024 12:48

stress:gpu fail

abouteiller added a commit to abouteiller/dplasma that referenced this pull request Jan 2, 2025

@abouteiller

New problem: all tests with 2 GPUs appear to leak/over-retain GPU memory.

237/441 Test #237: dplasma_cgemm_2gpu_cuda_shm .....................................***Failed    1.78 sec
W@00000 /!\ DEBUG LEVEL WILL PROBABLY REDUCE THE PERFORMANCE OF THIS RUN /!\.
#+++++ cores detected       : 40
#+++++ nodes x cores + gpu  : 1 x 40 + 2 (40+2)
#+++++ thread mode          : THREAD_SERIALIZED
#+++++ P x Q                : 1 x 1 (1/1)
#+++++ M x N x K|NRHS       : 1940 x 1940 x 1940
#+++++ LDA , LDB , LDC      : 1940 , 1940 , 1940
#+++++ MB x NB              : 320 x 320
[   0] TIME(s)      0.43082 : PaRSEC initialized
***************************************************
 ----- TESTING CGEMM (N, N) --------
Generate matrices ... Done
Compute ... ... [****] TIME(s)      0.03400 : cgemm     PxQxg=   1 1   2 NB=  320 N=    1940 :    1718.191219 gflops - ENQ&PROG&DEST      0.04932 :    1184.270940 gflops - ENQ      0.01127 - DEST      0.00406
+---------------------------------------------------------------------------------------------------------------------------------------------+
|         |                    |                       Data In                              |                    Data Out                     |
|Rank   0 |  # KERNEL |    %   |  Required  |    Transfered H2D(%)  |    Transfered D2D(%)  |  Required  |      Transfered(%)  |   Evictions  |
|---------|-----------|--------|------------|-----------------------|-----------------------|------------|---------------------|--------------|
|  Dev  0 |       198 |  36.60 |     0.00 B |       0.00 B(   nan)  |       0.00 B(   nan)  |     0.00 B |     0.00 B(   nan)  |           0  | cpu-cores
|  Dev  1 |       196 |  36.23 |   459.38MB |      61.72MB( 13.44)  |      20.31MB(  4.42)  |   153.12MB |    21.88MB( 14.29)  |           0  | cuda(0)
|  Dev  2 |       147 |  27.17 |   344.53MB |      60.94MB( 17.69)  |      10.16MB(  2.95)  |   114.84MB |    16.41MB( 14.29)  |           0  | cuda(1)
|---------|-----------|--------|------------|-----------------------|-----------------------|------------|---------------------|--------------|
|All Devs |       541 | 100.00 |   803.91MB |     153.12MB( 19.05)  |      30.47MB(  3.79)  |   267.97MB |    38.28MB( 14.29)  |           0  |
+---------------------------------------------------------------------------------------------------------------------------------------------+
<DartMeasurement name="performance" type="numeric/double"
                 encoding="none" compression="none">
1718.19
</DartMeasurement>
W@00000 GPU[1:cuda(0)] memory leak detected: 12582912 bytes still allocated on GPU
D@00000 flush_lru: free: 16 units (8388608 bytes) from 0x7f618e000000 to 0x7f618e7fffff @zone_debug:217
D@00000 flush_lru: used: 2 units (1048576 bytes) from 0x7f618e800000 to 0x7f618e8fffff @zone_debug:224
D@00000 flush_lru: free: 24 units (12582912 bytes) from 0x7f618e900000 to 0x7f618f4fffff @zone_debug:217
D@00000 flush_lru: used: 2 units (1048576 bytes) from 0x7f618f500000 to 0x7f618f5fffff @zone_debug:224
D@00000 flush_lru: free: 36 units (18874368 bytes) from 0x7f618f600000 to 0x7f61907fffff @zone_debug:217
D@00000 flush_lru: used: 2 units (1048576 bytes) from 0x7f6190800000 to 0x7f61908fffff @zone_debug:224
D@00000 flush_lru: used: 2 units (1048576 bytes) from 0x7f6190900000 to 0x7f61909fffff @zone_debug:224
D@00000 flush_lru: used: 2 units (1048576 bytes) from 0x7f6190a00000 to 0x7f6190afffff @zone_debug:224
D@00000 flush_lru: used: 2 units (1048576 bytes) from 0x7f6190b00000 to 0x7f6190bfffff @zone_debug:224
D@00000 flush_lru: used: 2 units (1048576 bytes) from 0x7f6190c00000 to 0x7f6190cfffff @zone_debug:224
D@00000 flush_lru: free: 2 units (1048576 bytes) from 0x7f6190d00000 to 0x7f6190dfffff @zone_debug:217
D@00000 flush_lru: used: 2 units (1048576 bytes) from 0x7f6190e00000 to 0x7f6190efffff @zone_debug:224
D@00000 flush_lru: used: 2 units (1048576 bytes) from 0x7f6190f00000 to 0x7f6190ffffff @zone_debug:224
D@00000 flush_lru: free: 6 units (3145728 bytes) from 0x7f6191000000 to 0x7f61912fffff @zone_debug:217
D@00000 flush_lru: used: 2 units (1048576 bytes) from 0x7f6191300000 to 0x7f61913fffff @zone_debug:224
D@00000 flush_lru: used: 2 units (1048576 bytes) from 0x7f6191400000 to 0x7f61914fffff @zone_debug:224
D@00000 flush_lru: free: 18 units (9437184 bytes) from 0x7f6191500000 to 0x7f6191dfffff @zone_debug:217
D@00000 flush_lru: used: 2 units (1048576 bytes) from 0x7f6191e00000 to 0x7f6191efffff @zone_debug:224
D@00000 flush_lru: free: 3970 units (2081423360 bytes) from 0x7f6191f00000 to 0x7f620dffffff @zone_debug:217
testing_cgemm: /home/bouteill/parsec/dplasma/parsec/parsec/mca/device/device_gpu.c:831: int parsec_device_flush_lru(parsec_device_module_t *): Assertion `!in_use' failed.
[leconte:1749985] *** Process received signal ***
[leconte:1749985] Signal: Aborted (6)
[leconte:1749985] Signal code:  (-6)
[leconte:1749985] [ 0] /lib64/libc.so.6(+0x3e730)[0x7f62530f0730]
[leconte:1749985] [ 1] /lib64/libc.so.6(+0x8ba6c)[0x7f625313da6c]
[leconte:1749985] [ 2] /lib64/libc.so.6(raise+0x16)[0x7f62530f0686]
[leconte:1749985] [ 3] /lib64/libc.so.6(abort+0xd3)[0x7f62530da833]
[leconte:1749985] [ 4] /lib64/libc.so.6(+0x2875b)[0x7f62530da75b]
[leconte:1749985] [ 5] /lib64/libc.so.6(+0x373c6)[0x7f62530e93c6]
[leconte:1749985] [ 6] /home/bouteill/parsec/dplasma/build.cuda/parsec/parsec/libparsec.so.4(parsec_device_flush_lru+0x12c)[0x7f62a6ff7fbc]
[leconte:1749985] [ 7] /home/bouteill/parsec/dplasma/build.cuda/parsec/parsec/libparsec.so.4(parsec_devices_release_memory+0x57)[0x7f62a6ff5047]
[leconte:1749985] [ 8] /home/bouteill/parsec/dplasma/build.cuda/tests/./testing_cgemm[0x40550c]
[leconte:1749985] [ 9] /lib64/libc.so.6(+0x295d0)[0x7f62530db5d0]
[leconte:1749985] [10] /lib64/libc.so.6(__libc_start_main+0x80)[0x7f62530db680]
[leconte:1749985] [11] /home/bouteill/parsec/dplasma/build.cuda/tests/./testing_cgemm[0x4023f5]
[leconte:1749985] *** End of error message ***
srun: error: leconte: task 0: Aborted

devreal added a commit to devreal/ttg that referenced this pull request Jan 9, 2025
With ICLDisco/parsec#694, PaRSEC will support dynamically allocated flows based on the application-managed GPU task structure. This allows us to ditch the extra task class structure and lets us cut down loops over MAX_PARAM_COUNT.


Signed-off-by: Joseph Schuchart <[email protected]>
devreal added a commit to devreal/ttg that referenced this pull request Jan 9, 2025
With ICLDisco/parsec#694, PaRSEC will have a proper initializer for GPU task structures, but until then we need to initialize this field to something non-zero.

Signed-off-by: Joseph Schuchart <[email protected]>
gpu_task->prof_tp_id = 0;
#endif
gpu_task->ec = NULL;
gpu_task->last_data_check_epoch = 0;
This is wrong because it will deadlock on the first GPU task submitted.

Suggested change
gpu_task->last_data_check_epoch = 0;
gpu_task->last_data_check_epoch = (uint64_t)-1;

max_int

@devreal

devreal commented Jan 10, 2025

This is now used in TESSEorg/ttg#307 and seems to work (except for #694 (comment)). I'd like to see this go in ASAP.

@bosilca

bosilca commented Jan 10, 2025

I saw a notification with an lldb stack trace but I can't find it here. Weird! Anyway, it seemed to indicate that complete_stage was pointing to an incorrect location.

@abouteiller

> I saw a notification with an lldb stack trace but I can't find it here. Weird! Anyway, it seemed to indicate that complete_stage was pointing to an incorrect location.

#694 (comment)

" gpu_task->ec = (parsec_task_t*)this_task;\n"
" gpu_task->submit = &%s_kernel_submit_%s_%s;\n"
" gpu_task->task_type = 0;\n"
" gpu_task->task_type = PARSEC_GPU_TASK_TYPE_KERNEL;\n"
" gpu_task->last_data_check_epoch = -1; /* force at least one validation for the task */\n",
remove

parsec_gpu_task_t *gpu_task = (parsec_gpu_task_t *) calloc(1, sizeof(parsec_gpu_task_t));
PARSEC_OBJ_CONSTRUCT(gpu_task, parsec_list_item_t);
gpu_task->release_device_task = free; /* by default free the device task */
parsec_gpu_task_t *gpu_task = (parsec_gpu_task_t*)PARSEC_OBJ_NEW(parsec_gpu_dsl_task_t);
gpu_task->ec = (parsec_task_t *) this_task;
gpu_task->submit = dtd_tc->gpu_func_ptr;
gpu_task->task_type = 0;
gpu_task->last_data_check_epoch = -1; /* force at least one validation for the task */
remove
