Derive the device_task_t from a parsec_object_t #694

Open
wants to merge 4 commits into base: master
Conversation

@bosilca bosilca commented Nov 6, 2024

Introduce the parsec_gpu_flow_info_s info structure to combine the flow information needed by the GPU code.

Allow the standard device tasks (aka parsec_gpu_dsl_task_t) to contain the flow_info array inside the task, while allowing other DSLs to have their own type of device task (derived from parsec_gpu_task_t).

Enhance the mechanism to release the device tasks via the release_device_task function pointer. The device code will call this function to let the DSL decide how the device task release should be handled. Some DSLs (PTG and DTD as of now) will call OBJ_RELEASE on it (the free is automatic), while others (TTG, for example) will have their own handling.


Signed-off-by: George Bosilca <[email protected]>
@bosilca bosilca requested a review from a team as a code owner November 6, 2024 22:22
Co-authored-by: Aurelien Bouteiller <[email protected]>
abouteiller previously approved these changes Dec 12, 2024
@abouteiller abouteiller added the API Change Change to the public API, backward incompatible (version major bump) label Dec 12, 2024
@abouteiller abouteiller self-requested a review December 13, 2024 12:47
@abouteiller abouteiller dismissed their stale review December 13, 2024 12:48

stress:gpu fail

abouteiller added a commit to abouteiller/dplasma that referenced this pull request Jan 2, 2025

@abouteiller

New problem: all tests with 2 GPUs appear to leak/over-retain GPU memory.

237/441 Test #237: dplasma_cgemm_2gpu_cuda_shm .....................................***Failed    1.78 sec
W@00000 /!\ DEBUG LEVEL WILL PROBABLY REDUCE THE PERFORMANCE OF THIS RUN /!\.
#+++++ cores detected       : 40
#+++++ nodes x cores + gpu  : 1 x 40 + 2 (40+2)
#+++++ thread mode          : THREAD_SERIALIZED
#+++++ P x Q                : 1 x 1 (1/1)
#+++++ M x N x K|NRHS       : 1940 x 1940 x 1940
#+++++ LDA , LDB , LDC      : 1940 , 1940 , 1940
#+++++ MB x NB              : 320 x 320
[   0] TIME(s)      0.43082 : PaRSEC initialized
***************************************************
 ----- TESTING CGEMM (N, N) --------
Generate matrices ... Done
Compute ... ... [****] TIME(s)      0.03400 : cgemm     PxQxg=   1 1   2 NB=  320 N=    1940 :    1718.191219 gflops - ENQ&PROG&DEST      0.04932 :    1184.270940 gflops - ENQ      0.01127 - DEST      0.00406
+---------------------------------------------------------------------------------------------------------------------------------------------+
|         |                    |                       Data In                              |                    Data Out                     |
|Rank   0 |  # KERNEL |    %   |  Required  |    Transfered H2D(%)  |    Transfered D2D(%)  |  Required  |      Transfered(%)  |   Evictions  |
|---------|-----------|--------|------------|-----------------------|-----------------------|------------|---------------------|--------------|
|  Dev  0 |       198 |  36.60 |     0.00 B |       0.00 B(   nan)  |       0.00 B(   nan)  |     0.00 B |     0.00 B(   nan)  |           0  | cpu-cores
|  Dev  1 |       196 |  36.23 |   459.38MB |      61.72MB( 13.44)  |      20.31MB(  4.42)  |   153.12MB |    21.88MB( 14.29)  |           0  | cuda(0)
|  Dev  2 |       147 |  27.17 |   344.53MB |      60.94MB( 17.69)  |      10.16MB(  2.95)  |   114.84MB |    16.41MB( 14.29)  |           0  | cuda(1)
|---------|-----------|--------|------------|-----------------------|-----------------------|------------|---------------------|--------------|
|All Devs |       541 | 100.00 |   803.91MB |     153.12MB( 19.05)  |      30.47MB(  3.79)  |   267.97MB |    38.28MB( 14.29)  |           0  |
+---------------------------------------------------------------------------------------------------------------------------------------------+
<DartMeasurement name="performance" type="numeric/double"
                 encoding="none" compression="none">
1718.19
</DartMeasurement>
W@00000 GPU[1:cuda(0)] memory leak detected: 12582912 bytes still allocated on GPU
D@00000 flush_lru: free: 16 units (8388608 bytes) from 0x7f618e000000 to 0x7f618e7fffff @zone_debug:217
D@00000 flush_lru: used: 2 units (1048576 bytes) from 0x7f618e800000 to 0x7f618e8fffff @zone_debug:224
D@00000 flush_lru: free: 24 units (12582912 bytes) from 0x7f618e900000 to 0x7f618f4fffff @zone_debug:217
D@00000 flush_lru: used: 2 units (1048576 bytes) from 0x7f618f500000 to 0x7f618f5fffff @zone_debug:224
D@00000 flush_lru: free: 36 units (18874368 bytes) from 0x7f618f600000 to 0x7f61907fffff @zone_debug:217
D@00000 flush_lru: used: 2 units (1048576 bytes) from 0x7f6190800000 to 0x7f61908fffff @zone_debug:224
D@00000 flush_lru: used: 2 units (1048576 bytes) from 0x7f6190900000 to 0x7f61909fffff @zone_debug:224
D@00000 flush_lru: used: 2 units (1048576 bytes) from 0x7f6190a00000 to 0x7f6190afffff @zone_debug:224
D@00000 flush_lru: used: 2 units (1048576 bytes) from 0x7f6190b00000 to 0x7f6190bfffff @zone_debug:224
D@00000 flush_lru: used: 2 units (1048576 bytes) from 0x7f6190c00000 to 0x7f6190cfffff @zone_debug:224
D@00000 flush_lru: free: 2 units (1048576 bytes) from 0x7f6190d00000 to 0x7f6190dfffff @zone_debug:217
D@00000 flush_lru: used: 2 units (1048576 bytes) from 0x7f6190e00000 to 0x7f6190efffff @zone_debug:224
D@00000 flush_lru: used: 2 units (1048576 bytes) from 0x7f6190f00000 to 0x7f6190ffffff @zone_debug:224
D@00000 flush_lru: free: 6 units (3145728 bytes) from 0x7f6191000000 to 0x7f61912fffff @zone_debug:217
D@00000 flush_lru: used: 2 units (1048576 bytes) from 0x7f6191300000 to 0x7f61913fffff @zone_debug:224
D@00000 flush_lru: used: 2 units (1048576 bytes) from 0x7f6191400000 to 0x7f61914fffff @zone_debug:224
D@00000 flush_lru: free: 18 units (9437184 bytes) from 0x7f6191500000 to 0x7f6191dfffff @zone_debug:217
D@00000 flush_lru: used: 2 units (1048576 bytes) from 0x7f6191e00000 to 0x7f6191efffff @zone_debug:224
D@00000 flush_lru: free: 3970 units (2081423360 bytes) from 0x7f6191f00000 to 0x7f620dffffff @zone_debug:217
testing_cgemm: /home/bouteill/parsec/dplasma/parsec/parsec/mca/device/device_gpu.c:831: int parsec_device_flush_lru(parsec_device_module_t *): Assertion `!in_use' failed.
[leconte:1749985] *** Process received signal ***
[leconte:1749985] Signal: Aborted (6)
[leconte:1749985] Signal code:  (-6)
[leconte:1749985] [ 0] /lib64/libc.so.6(+0x3e730)[0x7f62530f0730]
[leconte:1749985] [ 1] /lib64/libc.so.6(+0x8ba6c)[0x7f625313da6c]
[leconte:1749985] [ 2] /lib64/libc.so.6(raise+0x16)[0x7f62530f0686]
[leconte:1749985] [ 3] /lib64/libc.so.6(abort+0xd3)[0x7f62530da833]
[leconte:1749985] [ 4] /lib64/libc.so.6(+0x2875b)[0x7f62530da75b]
[leconte:1749985] [ 5] /lib64/libc.so.6(+0x373c6)[0x7f62530e93c6]
[leconte:1749985] [ 6] /home/bouteill/parsec/dplasma/build.cuda/parsec/parsec/libparsec.so.4(parsec_device_flush_lru+0x12c)[0x7f62a6ff7fbc]
[leconte:1749985] [ 7] /home/bouteill/parsec/dplasma/build.cuda/parsec/parsec/libparsec.so.4(parsec_devices_release_memory+0x57)[0x7f62a6ff5047]
[leconte:1749985] [ 8] /home/bouteill/parsec/dplasma/build.cuda/tests/./testing_cgemm[0x40550c]
[leconte:1749985] [ 9] /lib64/libc.so.6(+0x295d0)[0x7f62530db5d0]
[leconte:1749985] [10] /lib64/libc.so.6(__libc_start_main+0x80)[0x7f62530db680]
[leconte:1749985] [11] /home/bouteill/parsec/dplasma/build.cuda/tests/./testing_cgemm[0x4023f5]
[leconte:1749985] *** End of error message ***
srun: error: leconte: task 0: Aborted

devreal added a commit to devreal/ttg that referenced this pull request Jan 9, 2025
With ICLDisco/parsec#694, PaRSEC will support dynamically allocated flows based on the application-managed GPU task structure. This allows us to ditch the extra task class structure and lets us cut down loops over MAX_PARAM_COUNT.


Signed-off-by: Joseph Schuchart <[email protected]>
devreal added a commit to devreal/ttg that referenced this pull request Jan 9, 2025
With ICLDisco/parsec#694, PaRSEC will have a proper initializer for GPU task structures, but until then we need to initialize this field to something non-zero.

Signed-off-by: Joseph Schuchart <[email protected]>
gpu_task->prof_tp_id = 0;
#endif
gpu_task->ec = NULL;
gpu_task->last_data_check_epoch = 0;
This is wrong because it will deadlock on the first GPU task submitted.

Suggested change
gpu_task->last_data_check_epoch = 0;
gpu_task->last_data_check_epoch = (uint64_t)-1;

max_int

@devreal

devreal commented Jan 10, 2025

This is now used in TESSEorg/ttg#307 and seems to work (except for #694 (comment)). I'd like to see this go in ASAP.

@bosilca

bosilca commented Jan 10, 2025

I saw a notification with an lldb stack trace but I can't find it here. Weird! Anyway, it seemed to indicate that complete_stage was pointing to an incorrect location.

@abouteiller

> I saw a notification with an lldb stack trace but I can't find it here. Weird! Anyway, it seemed to indicate that complete_stage was pointing to an incorrect location.

#694 (comment)

" gpu_task->ec = (parsec_task_t*)this_task;\n"
" gpu_task->submit = &%s_kernel_submit_%s_%s;\n"
" gpu_task->task_type = 0;\n"
" gpu_task->task_type = PARSEC_GPU_TASK_TYPE_KERNEL;\n"
" gpu_task->last_data_check_epoch = -1; /* force at least one validation for the task */\n",
remove

parsec_gpu_task_t *gpu_task = (parsec_gpu_task_t *) calloc(1, sizeof(parsec_gpu_task_t));
PARSEC_OBJ_CONSTRUCT(gpu_task, parsec_list_item_t);
gpu_task->release_device_task = free; /* by default free the device task */
parsec_gpu_task_t *gpu_task = (parsec_gpu_task_t*)PARSEC_OBJ_NEW(parsec_gpu_dsl_task_t);
gpu_task->ec = (parsec_task_t *) this_task;
gpu_task->submit = dtd_tc->gpu_func_ptr;
gpu_task->task_type = 0;
gpu_task->last_data_check_epoch = -1; /* force at least one validation for the task */
remove
