Initialize xp_ifindex of the transport. #137
Conversation
On 6/9/18 2:15 AM, Malahal wrote:
It defaults to zero, so all responses go to a single queue, leading to just one thread doing the replies. Added a monotonically increasing value so we can use multiple threads for sending RPC replies.
This is a very bad idea. It creates contention in the kernel.
Also degrades CPU cache coherency.
My original change to this code measurably improved throughput.
Not sure how we'd measure this degradation other than anecdotally.
Data is sent over interfaces, not threads.
The whole point is to run one async hot thread per interface,
serializing the output. This releases the sending thread to do
other work in parallel, reducing the number of needed threads.
What we really need is support for multiple interfaces. But that properly belongs in the user, not in this library.
On 6/10/18 8:57 AM, William Allen Simpson wrote:
What we really need is support for multiple interfaces. But that properly belongs in the user, not in this library.
Correction. At the time this was written in 2015, much of the per-connection handling was in Ganesha. In Ganesha V2.5 and V2.6 (ntirpc 1.5 and 1.6), that was moved into ntirpc. So the interface index could now be determined in ntirpc. It would likely take multiple system calls per connection, but it could be done. Yet it probably wouldn't be measurably faster.
The only reason we have a thread here at all is that the POSIX API doesn't handle both async and iov zero-copy. Instead, we need a thread to prepare the next write upon completion of the previous write. All this code does is build the iov, and wait....
We need only one thread to handle this waiting.
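For illustration, here is a minimal sketch of that pattern with hypothetical names (not the actual ntirpc code): a single hot thread builds an iovec for each queued reply and blocks in writev() until the kernel accepts the data, then moves on to the next reply.

```c
#include <sys/uio.h>
#include <unistd.h>
#include <errno.h>

/* Hypothetical queue element: one reply already encoded as an iovec. */
struct pending_reply {
    struct iovec *iov;
    int iovcnt;
    struct pending_reply *next;
};

/* One "hot" thread per connection drains the output queue in order.
 * writev() blocks until the kernel has room in the socket send buffer. */
static int drain_queue(int fd, struct pending_reply *head)
{
    for (; head != NULL; head = head->next) {
        ssize_t n = writev(fd, head->iov, head->iovcnt);  /* waits here */
        if (n < 0)
            return -errno;  /* caller decides whether to destroy the xprt */
        /* real code must also handle short writes by advancing the iovec */
    }
    return 0;
}
```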
More investigation revealed that there is a client that is slow, or that downright doesn't acknowledge packets from our server. In this case writev/write hangs rather than returning an error. A bad client can cause Ganesha to hang! I think a non-blocking socket would help here; is there any downside to a non-blocking socket? It seems there was an effort to make it non-blocking, but the code is commented out for now.
So, I don't think this change will help that situation. The ifq doesn't determine which thread is used, just which queue is used. The thread that's used is the one that called svc_sendreply(). A workaround would be to make svc_vc_reply() call the async write (svc_ioq_write_submit()) rather than the sync version (svc_ioq_write_now()). That gets rid of hot-thread streaming, but it also means no write can block progress. So it would likely result in lower throughput.
Also, note that, in 2.5, this write should not block everything in Ganesha, just the one worker thread. So it would take as many dead clients as you have worker threads to halt entirely; and, in addition, the TCP session should eventually (30 minutes, I think) time out, closing the socket and freeing the thread.
No, non-blocking wouldn't help. The kernel buffers are filled.
Non-blocking would just loop forever waiting for them to empty.
The solution committed by Swen Schillig in G2.5-dev-5 was to enable
keepalive by default. Perhaps this was wrong.
Are you seeing TCP keepalive (0 length TCP segments) in your trace?
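For reference, enabling keepalive on a connected socket looks roughly like the following sketch; the probe timings shown are illustrative, not the values committed in G2.5-dev-5.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Illustrative only: turn on TCP keepalive probes for a connected socket.
 * idle/intvl/cnt are example values, not the project's defaults. */
static int enable_keepalive(int fd)
{
    int on = 1, idle = 60, intvl = 10, cnt = 3;

    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) < 0)
        return -1;
    setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof(idle));
    setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl, sizeof(intvl));
    setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &cnt, sizeof(cnt));
    return 0;
}
```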
As I was writing a reply at the same time:
On 6/12/18 9:51 AM, Daniel Gryniewicz wrote:
Also, note that, in 2.5, this write should not block everything in Ganesha, just the one worker thread.
His description is that all output traffic stops. So it's not a
thread problem. That means the kernel has run out of output buffers.
It still wouldn't help for a single client that asks for a lot of data, fills all available kernel buffers, and then never TCP ACKs the data. That will hang Ganesha output (and all system output) until the inactivity close that you mention below.
So it would take as many dead clients as you have worker threads to halt entirely; and, in addition, the TCP session should eventually (30 minutes, I think) time out, closing the socket and freeing the thread.
This is somewhat interesting. Where is that code?
Closing the socket won't free the kernel output buffers until
linger is over.
As I mentioned, the usual choice here is server-side TCP options.
SO_KEEPALIVE should have cleaned out the buffers.
But now there's a better option for Linux:
TCP_USER_TIMEOUT (since Linux 2.6.37)
Apparently, the kernel tcp_retries2 timer is 30 minutes. That's
much too long for us.
(I'm a bit behind the times, as IIRC my last Linux TCP kernel
contribution was in 2.6.32. So I just found this one.)
Malahal, could you redo this patch to set this where we already
set TCP_NODELAY?
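A sketch of what that might look like, placed next to the existing TCP_NODELAY setup; the helper name, fd, and 5-second value here are illustrative, not from any committed patch.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Illustrative only: abort the connection if transmitted data stays
 * unacknowledged for longer than timeout_ms (RFC 5482; Linux >= 2.6.37,
 * so the header must define TCP_USER_TIMEOUT). */
static int set_user_timeout(int fd)
{
    unsigned int timeout_ms = 5000;

    return setsockopt(fd, IPPROTO_TCP, TCP_USER_TIMEOUT,
                      &timeout_ms, sizeof(timeout_ms));
}
```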
Bill, based on the documentation, writev/sendto/send/sendmsg will wait for space in the socket send buffer. I don't think the entire kernel memory needs to be exhausted. BTW, I did a very small experiment to show that a single slow client could affect performance for other clients. Run these two lines on two different terminals on a Linux box: "pv" makes the second "nc" very slow to read the socket buffer after the pipe buffer is full. The first screen will show sendto system-call times. Initially, it will print a bunch of quick sendto calls, then it waits in the sendto call. My system has 8GB and it still had 4-5GB free when this happened.
Bill, instead of using TCP_USER_TIMEOUT, I used SO_SNDTIMEO. I think SO_SNDTIMEO is what we are concerned with. TCP_USER_TIMEOUT should also work, and it closes the TCP connection, which is good. SO_SNDTIMEO needs Ganesha to close the connection, and I am not sure SVC_DESTROY() is the right candidate. I have not tested the patch yet, though. Here is the patch: Most of these options are inherited from the listening socket, so setting these on the listening socket might be good enough. What do you think? Shall I go with SEND TIMEOUT or USER TIMEOUT? Is there anything that pushes us to use one over the other?
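For comparison, the SO_SNDTIMEO variant is roughly the following; this is a sketch only, and the helper name and 5-second value are illustrative.

```c
#include <sys/socket.h>
#include <sys/time.h>

/* Illustrative only: make blocking send/writev calls give up instead of
 * hanging forever when the socket send buffer stays full. */
static int set_send_timeout(int fd)
{
    struct timeval tv = { .tv_sec = 5, .tv_usec = 0 };

    return setsockopt(fd, SOL_SOCKET, SO_SNDTIMEO, &tv, sizeof(tv));
}
```

With this option set, a blocked writev() returns either a short byte count or -1 with errno set to EAGAIN/EWOULDBLOCK once the timeout expires, so the caller still has to decide what to do with the connection.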
On 6/14/18 12:07 PM, Malahal wrote:
My system has 8GB and it still has 4-5GB free when this happened.
Overall memory size has nothing to do with the kernel I/O buffers.
A quick search yields:
https://www.cyberciti.biz/faq/linux-tcp-tuning/
TCP memory is calculated automatically based on system memory; you can find the actual values by typing the following commands:
$ cat /proc/sys/net/ipv4/tcp_mem
The default and maximum amount for the receive socket memory:
$ cat /proc/sys/net/core/rmem_default
$ cat /proc/sys/net/core/rmem_max
The default and maximum amount for the send socket memory:
$ cat /proc/sys/net/core/wmem_default
$ cat /proc/sys/net/core/wmem_max
The maximum amount of option memory buffers:
$ cat /proc/sys/net/core/optmem_max
On 6/14/18 2:55 PM, Malahal wrote:
malahal@bb81c2b
I do not have the cycles to review or test patches. I'm just
trying to help out with technical advice so that this project
can be successful.
We know the original patch will not work. I suggest that you
replace it with your new patch, so that the commentary thus far
will be coupled with the solution.
Also, I urge you to try TCP_USER_TIMEOUT. It is somewhat standardized by RFC 5482, and has been available in Linux since 2.6.37, released January 4, 2011.
I was trying to show with the nc command that per-socket max buffers are used and writev()/send() times do depend on how fast the other side can read. SO_SNDTIMEO should work for this issue as well as with a bad client that ACKs everything and still advertises zero window size. TCP_USER_TIMEOUT will not work with the latter case and is only available in some versions (RHEL 6.x lacks it!). With either SO_SNDTIMEO or TCP_USER_TIMEOUT, we probably need a config parameter for extreme cases. Having a very large timeout like 5 seconds should be OK without any config parameter option, correct?
On 6/15/18 11:24 AM, Malahal wrote:
I was trying to show with the nc command that **per-socket max** buffers are used
That's not how it used to work. I'll ask Bruce.
and writev()/send() times do depend on how fast the other side can read.
Of course they do! If the client cannot keep up, the buffers will
fill. That's why we have buffers.
SO_SNDTIMEO should work for this issue as well as with a bad client that ACKs everything and still advertises zero window size.
Let's just fix the known issue with the VMware VIO server NFS client talking to RHEL 7 with Ganesha. Otherwise, it's just speculation.
TCP_USER_TIMEOUT will not work with the latter case
Are you sure? I'd have to look at the kernel support, and re-read
the RFC.
and is only available in some versions (RHEL6.x lacks it!).
G2.5 isn't targeted at an 8-year-old OS version. There are probably
many things that won't work.
With either SO_SNDTIMEO or TCP_USER_TIMEOUT, we probably need a config parameter for extreme cases. Having a very large timeout like 5 seconds should be OK without any config parameter option, correct?
Agreed on parameter, but that would limit you to using G2.7.
Sure. 5 seconds would be long for a datacenter, but OK for a VPN
from US to SA.
With a decent default, it could be backported to your G2.5.
Kaleb recently posted a patch for non-systemd environments; that would be RHEL/CentOS 6.x. Ganesha, being user level, should be more portable. Talking on IRC with DanG indicated that we should use SO_SNDTIMEO, as TCP_USER_TIMEOUT is not available on RHEL 6.x systems. So, keeping that in mind, I adjusted the patch here: malahal@100cc6b
Non-blocking sockets would be better, but that requires changes in many places to handle non-blocking sockets. Added a send timeout of 5 seconds, so writev should time out after at most 5 seconds. We destroy the socket upon any failure of writev.
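A sketch of that error path, assuming SO_SNDTIMEO is already set on the socket: the helper and its arguments are illustrative, and SVC_DESTROY() is the ntirpc transport teardown, though whether it is the right call here is exactly the open question above.

```c
#include <sys/uio.h>
#include <errno.h>
#include <rpc/svc.h>   /* ntirpc: SVCXPRT, SVC_DESTROY */

/* Sketch only: with SO_SNDTIMEO set, a stalled client makes writev() fail
 * with EAGAIN/EWOULDBLOCK (or return a short count) after ~5 seconds
 * instead of blocking forever.  On any failure we tear down the transport. */
static void send_reply_iov(SVCXPRT *xprt, int fd,
                           struct iovec *iov, int iovcnt)
{
    ssize_t n = writev(fd, iov, iovcnt);

    if (n < 0) {
        /* timed out or hard error: drop this client */
        SVC_DESTROY(xprt);
        return;
    }
    /* real code must also handle short writes (n less than the total length) */
}
```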