Ucc integration #591

Merged
merged 38 commits into from
Jul 14, 2022

Conversation

kaiyingshan
Collaborator

No description provided.

@nirandaperera
Collaborator

@kaiyingshan I'm getting the following segfault. I can't figure out where it is coming from or why. I'm using the latest UCC master branch.

(cylon_dev) niranda@aurora-r10:~/git/cylon/build$ ./bin/ucc_allgather_example 
[1656170916.449172] [aurora-r10:49843:0]          ucc_cl.c:57   UCC  ERROR no TLs are selected for CL_BASIC
[1656170916.449190] [aurora-r10:49843:0]         ucc_lib.c:127  UCC  ERROR lib_init failed for component: basic
[aurora-r10:49843:0:49843] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x164)
==== backtrace (tid:  49843) ====
 0  /home/niranda/miniconda3/envs/cylon_dev/lib/libucs.so.0(ucs_handle_error+0x2fd) [0x7f3eff20778d]
 1  /home/niranda/miniconda3/envs/cylon_dev/lib/libucs.so.0(+0x2b994) [0x7f3eff207994]
 2  /home/niranda/miniconda3/envs/cylon_dev/lib/libucs.so.0(+0x2bb5a) [0x7f3eff207b5a]
 3  /lib/x86_64-linux-gnu/libc.so.6(+0x43090) [0x7f3f00071090]
 4  /home/niranda/git/ucc/install/lib/libucc.so.1(ucc_collective_init+0x1e0) [0x7f3eff09bf30]
 5  /home/niranda/git/cylon/install/lib/libcylon.so.0.5.0(_ZNK5cylon3ucc21UccTableAllgatherImpl20AllgatherBufferSizesEPKiiPi+0x8e) [0x7f3f005f2b9e]
 6  /home/niranda/git/cylon/install/lib/libcylon.so.0.5.0(_ZN5cylon3net18TableAllgatherImpl7ExecuteERKSt10shared_ptrINS_15TableSerializerEERKS2_INS_9AllocatorEEiPSt6vectorIiSaIiEEPSB_IS2_INS_6BufferEESaISG_EEPSB_ISD_SaISD_EE+0xfc) [0x7f3f008d44dc]
 7  /home/niranda/git/cylon/install/lib/libcylon.so.0.5.0(_ZN5cylon3net18TableAllgatherImpl7ExecuteERKSt10shared_ptrINS_5TableEEPSt6vectorIS4_SaIS4_EE+0x1d1) [0x7f3f008d4ea1]
 8  /home/niranda/git/cylon/install/lib/libcylon.so.0.5.0(_ZNK5cylon3net15UCXCommunicator9AllGatherERKSt10shared_ptrINS_5TableEEPSt6vectorIS4_SaIS4_EE+0x5c) [0x7f3f005f192c]
 9  ./bin/ucc_allgather_example(+0x5f39) [0x55fbb51b8f39]
10  ./bin/ucc_allgather_example(+0x5c69) [0x55fbb51b8c69]
11  /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3) [0x7f3f00052083]
12  ./bin/ucc_allgather_example(+0x5e1e) [0x55fbb51b8e1e]
=================================
Segmentation fault (core dumped)

@kaiyingshan
Collaborator Author

I don't know if this is the same issue that I experienced, which is due to the ucc team size. When I run the example this way, it gives a warning "ucc_team.c:114 UCC WARN minimal size of UCC team is 2, provided 1", and it is caused by ucc_collective_init; when I run with mpirun -n it won't result in segfault.

std::cout<<std::endl;

/* Cleanup UCC */
UCC_CHECK(ucc_team_destroy(team));

ucc_team_destroy is a nonblocking operation and might return UCC_INPROGRESS, so it should be something like this:

    ucc_status_t status;
    // Keep calling ucc_team_destroy while it reports UCC_INPROGRESS.
    while (UCC_INPROGRESS == (status = ucc_team_destroy(team.team))) {
    }
    // Check the final status once, after the loop.
    if (UCC_OK != status) {
        std::cerr << "ucc_team_destroy failed\n";
    }
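
One way to structure this kind of polling is to loop only while the call reports an in-progress status and to check the final status once, after the loop. The sketch below is self-contained and does not use UCC: `Status`, `FakeTeam`, and `PollUntilDone` are hypothetical names invented for illustration, standing in for `ucc_status_t` and a nonblocking call such as `ucc_team_destroy`.

```cpp
// Hypothetical stand-in for ucc_status_t values; not the real UCC definitions.
enum class Status { Ok, InProgress, Error };

// Hypothetical stub that mimics a nonblocking destroy: it reports InProgress
// for a fixed number of calls and then returns a final status.
struct FakeTeam {
  int pending_calls;
  Status final_status;
  int calls;

  FakeTeam(int pending, Status final_result)
      : pending_calls(pending), final_status(final_result), calls(0) {}

  Status Destroy() {
    ++calls;
    return pending_calls-- > 0 ? Status::InProgress : final_status;
  }
};

// Re-issue the operation while it reports InProgress, then hand the final
// status back so the caller checks it exactly once, outside the loop.
template <typename Op>
Status PollUntilDone(Op &&op) {
  Status status;
  do {
    status = op();
  } while (status == Status::InProgress);
  return status;
}
```

With a real team handle, the lambda passed to `PollUntilDone` would presumably wrap the `ucc_team_destroy` call, and the caller would compare the returned status against `UCC_OK`; that mapping is an assumption, not something confirmed in this thread.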


RETURN_CYLON_STATUS_IF_UCC_FAILED(ucc_context_config_read(lib, nullptr, &ctx_config));
RETURN_CYLON_STATUS_IF_UCC_FAILED(ucc_context_create(lib, &ctx_params, ctx_config, &uccContext));
while (UCC_OK != ucc_context_progress(uccContext)) {

ucc_context_create is blocking, no need to call ucc_context_progress

@Sergei-Lebedev

> I don't know if this is the same issue that I experienced, which is due to the ucc team size. When I run the example this way, it gives a warning "ucc_team.c:114 UCC WARN minimal size of UCC team is 2, provided 1", and it is caused by ucc_collective_init; when I run with mpirun -n it won't result in segfault.

recently we added support for team size 1 (openucx/ucc#511)

@nirandaperera
Collaborator

nirandaperera commented Jul 12, 2022

> > I don't know if this is the same issue that I experienced, which is due to the ucc team size. When I run the example this way, it gives a warning "ucc_team.c:114 UCC WARN minimal size of UCC team is 2, provided 1", and it is caused by ucc_collective_init; when I run with mpirun -n it won't result in segfault.
>
> recently we added support for team size 1 (openucx/ucc#511)

@Sergei-Lebedev I'm still getting the following error with team size 1.
I opened an issue in UCC regarding this: openucx/ucc#567

(cylon_dev) niranda@aurora-r10:~/git/cylon$ ./build/bin/ucc_example 
[1657643132.702496] [aurora-r10:188948:0]   cl_basic_team.c:131  CL_BASIC ERROR no tl teams were created
[1657643132.702508] [aurora-r10:188948:0]        ucc_team.c:294  UCC  ERROR No CL teams were created
failed to create ucc team
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode -6.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
(cylon_dev) niranda@aurora-r10:~/git/cylon$ mpirun -n 1 ./build/bin/ucc_example 
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode -6.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
[1657643157.590587] [aurora-r10:189016:0]   cl_basic_team.c:131  CL_BASIC ERROR no tl teams were created
[1657643157.590601] [aurora-r10:189016:0]        ucc_team.c:294  UCC  ERROR No CL teams were created
failed to create ucc team

@nirandaperera nirandaperera marked this pull request as ready for review July 13, 2022 15:23
@nirandaperera
Collaborator

@kaiyingshan I reviewed your code and made some changes myself in this commit 9a7e9aa

Could you please check that?

@nirandaperera nirandaperera merged commit 4dd359f into main Jul 14, 2022
@nirandaperera
Collaborator

@kaiyingshan thank you for doing this

@nirandaperera nirandaperera mentioned this pull request Jul 15, 2022
5 tasks