-
Notifications
You must be signed in to change notification settings - Fork 403
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
prov/tcp dead stall - generic ofi_write_socket buffer size issue #5897
Comments
bernd@t420-work-1 tmp>cat socket-test.c // https://unix.stackexchange.com/questions/424380/what-values-may-linux-use-for-the-default-unix-socket-buffer-size #define NEEDED_SEND_BUF_SIZE (SOCK_OVERHEAD * 8192) char buf[1024]; enum { int main(int argc, char **argv)
// int option = SO_SNDBUFFORCE;
} |
Default wmem_max cat /proc/sys/net/core/wmem_max ./socket-test |
I'm not following the problem here. Is the ofi_signal code showing an issue? The signal code shouldn't ever have more than 1 byte written to the socket at a time. |
tcpx_cq_report_success() writes 1 byte for every completed cq? So if you don't read from the signal socket sufficiently fast it will block at some point. |
Are you referring to this call path: tcpx_cq_report_success() -> ofi_cq_signal() -> util_cq_signal() -> cq->wait->signal()? If so, do you know what that callback is set to? |
The call trace points to fd_signal_set().
The write socket call is only invoked if the write cnt equals the read cnt. Until someone reads from the socket (calls fd_signal_reset()) we shouldn't write to it again. The design is to avoid writing to the socket (and taking the kernel transition) unless necessary. |
Is this limit a process limit, rather than a per socket limit? How many CQs is the application using? The fd_signal abstraction sets the read fd to nonblocking, but not the write fd. The assumption was that there was no need, since we shouldn't ever write more than 1 byte per socket. We can set the write fd to nonblocking, but a failure here means that we've lost the signal, which could likely cause issues elsewhere. |
Sorry it is getting very very late (well, already early here), I will reply in more detail after getting some sleep. There might be very simple solution - pipe instead of the socket. I need to check why it is writing to the socket the (signal->wcnt == signal->rcnt) condition - in about 8 to 9 hours. |
I don't think pipe is available on windows. But worry about it tomorrow. |
Thanks a lot Sean for your help and your pointer to fd_signal_set()! I had seen this function in my traces, but didn't pay attention to it. In the end it turns out is just a missing lock protecting signal->wcnt and signal->rcnt from multi threaded access. |
Thanks for merging the patch, I'm closing the issue. |
Lets start with the root cause, by default a socket fd has limited number of 1B writes, because of internal linux overhead (I have no idea about BSD and other supported platforms). A small test program writing 1B to a socket can do that only 278 times without setting SO_SNDBUF and default procfs (centos7) settings. Setting SO_SNDBUF increases that to 555, but is still well below default CQ size of 1024 (we have actually increased that to 4096). That can be changed using SO_SNDBUFFORCE or setting /proc/sys/net/core/wmem_max, but that either requires root or CAP_NET_ADMIN permissions and not possible for us, or requires to change generic system settings and is not everywhere wanted. Other providers might be affected as well, for now I see a dead stall using prov/tcp
Our polling thread and which is the socket fd reader/consumer blocks on fastlock_acquire(&ep->lock);
Thread 73 (Thread 0x7ff3fdbc5700 (LWP 2609459)):
#0 pthread_spin_lock () at ../nptl/sysdeps/x86_64/pthread_spin_lock.S:33
#1 0x00007ff6968c8a46 in tcpx_try_func (util_ep=util_ep@entry=0x61a00000fc80) at prov/tcp/src/tcpx_progress.c:657
#2 0x00007ff6968c7334 in tcpx_cq_progress (cq=0x61200000ff40) at prov/tcp/src/tcpx_cq.c:58
#3 0x00007ff69686921e in ofi_cq_readfrom (cq_fid=0x61200000ff40, buf=, count=0, src_addr=src_addr@entry=0x0) at prov/util/src/util_cq.c:245
#4 0x00007ff6968694a4 in ofi_cq_read (cq_fid=, buf=, count=) at prov/util/src/util_cq.c:312
#5 0x00007ff69686c6ed in fi_cq_read (count=0, buf=0x0, cq=0x61200000ff40) at ./include/rdma/fi_eq.h:385
#6 util_poll_run (poll_fid=0x607000013880, context=0x7ff3fdbc4828, count=1) at prov/util/src/util_poll.c:95
#7 0x00007ff69686cb97 in fi_poll (count=1, context=0x7ff3fdbc4828, pollset=) at ./include/rdma/fi_eq.h:326
#8 util_wait_fd_try (wait=0x60d000006f90) at prov/util/src/util_wait.c:253
#9 0x00007ff69686cc12 in ofi_trywait (fabric=, fids=0x7ff3fdbc4910, count=2) at prov/util/src/util_wait.c:70
#10 0x0000000000a79f5f in fi_trywait (fabric=0x608000002620, fids=0x7ff3fdbc4910, count=2) at /home/bschubert/local/rhel7/libfabric/include/rdma/fi_eq.h:314
The socket fd writer has acquired that lock and blocks because the socket fd buffer is full.
Thread 11 (Thread 0x7ff67a80b700 (LWP 2609250)):
#0 0x00007ff69538d1fd in write () at ../sysdeps/unix/syscall-template.S:81
#1 0x0000000000480ed1 in __interceptor_write (fd=, ptr=0x7ff67a809b9f, count=) at /scratch/gcc-9.2.0-build/../gcc-9.2.0/libsanitizer/sanitizer_common/sanitizer_common_interceptors.inc:1087
#2 0x00007ff69686cbde in ofi_write_socket (count=1, buf=0x7ff67a809b9f, fd=) at ./include/unix/osd.h:129
#3 fd_signal_set (signal=0x60d000006fe8) at ./include/ofi_signal.h:91
#4 util_wait_fd_signal (util_wait=0x60d000006f90) at prov/util/src/util_wait.c:231
#5 0x00007ff6968697bf in util_cq_signal (cq=0x61200000ff40) at ./include/ofi_util.h:508
#6 ofi_cq_signal (cq_fid=cq_fid@entry=0x61200000ff40) at prov/util/src/util_cq.c:418
#7 0x00007ff6968c77c3 in tcpx_cq_report_success (cq=0x61200000ff40, xfer_entry=xfer_entry@entry=0x7ff3faae2960) at prov/tcp/src/tcpx_cq.c:166
#8 0x00007ff6968c83d1 in process_tx_entry (tx_entry=tx_entry@entry=0x7ff3faae2960) at prov/tcp/src/tcpx_progress.c:121
#9 0x00007ff6968c8b82 in tcpx_tx_queue_insert (tcpx_ep=tcpx_ep@entry=0x61a00000fc80, tx_entry=tx_entry@entry=0x7ff3faae2960) at prov/tcp/src/tcpx_progress.c:692
#10 0x00007ff6968c5680 in tcpx_send (ep=0x61a00000fc80, buf=0x7ff5fffde478, len=, desc=, dest_addr=, context=0x7ff5fffde000) at prov/tcp/src/tcpx_msg.c:263
#11 0x0000000000a3fdad in fi_send (ep=0x61a00000fc80, buf=0x7ff5fffde478, len=806, desc=0x40, dest_addr=18446744073709551615, context=0x7ff5fffde000) at /home/bschubert/local/rhel7/libfabric/include/rdma/fi_endpoint.h:291
One way to solve that for us might be to set our own wait object (FI_WAIT_SET), which then uses another wakeup way, like eventfd. But as this is a general problem, it should be solved in libfabric as well. Right now I don't have a good idea how, though. Using eventfd in general for fd-signals will not work, as this is linux specific. For prov/tcp it probably would help to do the ofi_cq_signal() outside of the ep lock, but at a minumum that requires some significant changes and I would like to get the opinion of upstream about this issue first.
Thanks,
Bernd
The text was updated successfully, but these errors were encountered: