Initialize xp_ifindex of the transport. #137
Conversation
On 6/9/18 2:15 AM, Malahal wrote:
It defaults to zero, so all responses go to a single queue, leading to just one thread doing the replies. Added a monotonically increasing value so we can use multiple threads for sending RPC replies.
This is a very bad idea. It creates contention in the kernel.
Also degrades CPU cache coherency.
My original change to this code measurably improved throughput.
Not sure how we'd measure this degradation other than anecdotally.
Data is sent over interfaces, not threads.
The whole point is to run one async hot thread per interface,
serializing the output. This releases the sending thread to do
other work in parallel, reducing the number of needed threads.
What we really need is support for multiple interfaces. But that properly belongs in the user, not in this library.
On 6/10/18 8:57 AM, William Allen Simpson wrote:
What we really need is support for multiple interfaces. But that properly belongs in the user, not in this library.
Correction. At the time this was written in 2015, much of the per-connection handling was in Ganesha. In Ganesha V2.5 and V2.6 (ntirpc 1.5 and 1.6), that was moved into ntirpc. So the interface index could now be determined in ntirpc. It would likely take multiple system calls per connection, but it could be done. Yet it probably wouldn't be measurably faster.
The only reason we have a thread here at all is that the POSIX API doesn't handle both async and iov zero-copy. Instead, we need a thread to prepare the next write upon completion of the previous write. All this code does is build the iov, and wait....
We need only one thread to handle this waiting.
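For illustration, here is a minimal sketch of that pattern with hypothetical names (not the actual ntirpc code): a single hot thread builds an iovec for each queued reply and blocks in writev() until the kernel accepts the data, then moves on to the next reply.

```c
#include <sys/uio.h>
#include <unistd.h>
#include <errno.h>

/* Hypothetical queue element: one reply already encoded as an iovec. */
struct pending_reply {
    struct iovec *iov;
    int iovcnt;
    struct pending_reply *next;
};

/* One "hot" thread per connection drains the output queue in order.
 * writev() blocks until the kernel has room in the socket send buffer. */
static int drain_queue(int fd, struct pending_reply *head)
{
    for (; head != NULL; head = head->next) {
        ssize_t n = writev(fd, head->iov, head->iovcnt);  /* waits here */
        if (n < 0)
            return -errno;  /* caller decides whether to destroy the xprt */
        /* real code must also handle short writes by advancing the iovec */
    }
    return 0;
}
```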
More investigation revealed that there is a client that is slow, or that downright doesn't acknowledge packets from our server. In this case writev/write hangs rather than returning an error. A bad client can cause Ganesha to hang! I think a non-blocking socket would help here; is there any downside to a non-blocking socket? It seems there was an effort to make it non-blocking, but the code is commented out for now.
So, I don't think this change will help that situation. The ifq doesn't determine which thread is used, just which queue is used. The thread that's used is the one that called svc_sendreply(). A workaround would be to make svc_vc_reply() call the async write (svc_ioq_write_submit()) rather than the sync version (svc_ioq_write_now()). That gets rid of hot-thread streaming, but it also means no write can block progress. So it would likely result in lower throughput.
Also, note that, in 2.5, this write should not block everything in Ganesha, just the one worker thread. So it would take as many dead clients as you have worker threads to halt entirely; and, in addition, the TCP session should eventually (30 minutes, I think) time out, closing the socket and freeing the thread.
No, non-blocking wouldn't help. The kernel buffers are filled.
Non-blocking would just loop forever waiting for them to empty.
The solution committed by Swen Schillig in G2.5-dev-5 was to enable
keepalive by default. Perhaps this was wrong.
Are you seeing TCP keepalive (0 length TCP segments) in your trace?
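For reference, enabling keepalive on a connected socket looks roughly like the following sketch; the probe timings shown are illustrative, not the values committed in G2.5-dev-5.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Illustrative only: turn on TCP keepalive probes for a connected socket.
 * idle/intvl/cnt are example values, not the project's defaults. */
static int enable_keepalive(int fd)
{
    int on = 1, idle = 60, intvl = 10, cnt = 3;

    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) < 0)
        return -1;
    setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof(idle));
    setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl, sizeof(intvl));
    setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &cnt, sizeof(cnt));
    return 0;
}
```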
As I was writing a reply at the same time:
On 6/12/18 9:51 AM, Daniel Gryniewicz wrote:
Also, note that, in 2.5, this write should not block everything in Ganesha, just the one worker thread.
His description is that all output traffic stops. So it's not a
thread problem. That means the kernel has run out of output buffers.
It still wouldn't help for a single client that asks for a lot of data, fills all available kernel buffers, and then never TCP ACKs the data. That will hang Ganesha output (and all system output) until the inactivity close that you mention below.
So it would take as many dead clients as you have worker threads to halt entirely; and, in addition, the TCP session should eventually (30 minutes, I think) time out, closing the socket and freeing the thread.
This is somewhat interesting. Where is that code?
Closing the socket won't free the kernel output buffers until
linger is over.
As I mentioned, the usual choice here is server-side TCP options.
SO_KEEPALIVE should have cleaned out the buffers.
But now there's a better option for Linux:
TCP_USER_TIMEOUT (since Linux 2.6.37)
Apparently, the kernel tcp_retries2 timer is 30 minutes. That's
much too long for us.
(I'm a bit behind the times, as IIRC my last Linux TCP kernel
contribution was in 2.6.32. So I just found this one.)
Malahal, could you redo this patch to set this where we already
set TCP_NODELAY?
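A sketch of what that might look like, placed next to the existing TCP_NODELAY setup; the helper name, fd, and 5-second value here are illustrative, not from any committed patch.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Illustrative only: abort the connection if transmitted data stays
 * unacknowledged for longer than timeout_ms (RFC 5482; Linux >= 2.6.37,
 * so the header must define TCP_USER_TIMEOUT). */
static int set_user_timeout(int fd)
{
    unsigned int timeout_ms = 5000;

    return setsockopt(fd, IPPROTO_TCP, TCP_USER_TIMEOUT,
                      &timeout_ms, sizeof(timeout_ms));
}
```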
Bill, based on the documentation, writev/sendto/send/sendmsg will wait for space in the socket send buffer. I don't think the entire kernel memory needs to be exhausted. BTW, I did a very small experiment to show that a single slow client could affect performance for other clients. Run these two lines on two different terminals on a Linux box: "pv" makes the second "nc" very slow to read the socket buffer after the pipe buffer is full. The first screen will show sendto system-call times. Initially, it will print a bunch of quick sendto calls, then it waits in the sendto call. My system has 8GB and it still had 4-5GB free when this happened.
Bill, instead of using TCP_USER_TIMEOUT, I used SO_SNDTIMEO. I think SO_SNDTIMEO is what we are concerned with. TCP_USER_TIMEOUT should also work, and it closes the TCP connection, which is good. SO_SNDTIMEO needs Ganesha to close the connection, and I am not sure SVC_DESTROY() is the right candidate. I have not tested the patch yet, though. Here is the patch: Most of these options are inherited from the listening socket, so setting these on the listening socket might be good enough. What do you think? Shall I go with SEND TIMEOUT or USER TIMEOUT? Is there anything that pushes us to use one over the other?
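For comparison, the SO_SNDTIMEO variant is roughly the following; this is a sketch only, and the helper name and 5-second value are illustrative.

```c
#include <sys/socket.h>
#include <sys/time.h>

/* Illustrative only: make blocking send/writev calls give up instead of
 * hanging forever when the socket send buffer stays full. */
static int set_send_timeout(int fd)
{
    struct timeval tv = { .tv_sec = 5, .tv_usec = 0 };

    return setsockopt(fd, SOL_SOCKET, SO_SNDTIMEO, &tv, sizeof(tv));
}
```

With this option set, a blocked writev() returns either a short byte count or -1 with errno set to EAGAIN/EWOULDBLOCK once the timeout expires, so the caller still has to decide what to do with the connection.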
On 6/14/18 12:07 PM, Malahal wrote:
My system has 8GB and it still has 4-5GB free when this happened.
Overall memory size has nothing to do with the kernel I/O buffers.
A quick search yields:
https://www.cyberciti.biz/faq/linux-tcp-tuning/
TCP memory is calculated automatically based on system memory; you can find the actual values by typing the following commands:
$ cat /proc/sys/net/ipv4/tcp_mem
The default and maximum amount for the receive socket memory:
$ cat /proc/sys/net/core/rmem_default
$ cat /proc/sys/net/core/rmem_max
The default and maximum amount for the send socket memory:
$ cat /proc/sys/net/core/wmem_default
$ cat /proc/sys/net/core/wmem_max
The maximum amount of option memory buffers:
$ cat /proc/sys/net/core/optmem_max
On 6/14/18 2:55 PM, Malahal wrote:
malahal@bb81c2b
I do not have the cycles to review or test patches. I'm just
trying to help out with technical advice so that this project
can be successful.
We know the original patch will not work. I suggest that you
replace it with your new patch, so that the commentary thus far
will be coupled with the solution.
Also, I urge you to try TCP_USER_TIMEOUT. It is somewhat standardized by RFC 5482, and has been available in Linux since 2.6.37, released January 4, 2011.
I was trying to show with the nc command that per-socket max buffers are used and writev()/send() times do depend on how fast the other side can read. SO_SNDTIMEO should work for this issue as well as with a bad client that ACKs everything and still advertises zero window size. TCP_USER_TIMEOUT will not work with the latter case and is only available in some versions (RHEL 6.x lacks it!). With either SO_SNDTIMEO or TCP_USER_TIMEOUT, we probably need a config parameter for extreme cases. Having a very large timeout like 5 seconds should be OK without any config parameter option, correct?
On 6/15/18 11:24 AM, Malahal wrote:
I was trying to show with the nc command that **per-socket max** buffers are used
That's not how it used to work. I'll ask Bruce.
and writev()/send() times do depend on how fast the other side can read.
Of course they do! If the client cannot keep up, the buffers will
fill. That's why we have buffers.
SO_SNDTIMEO should work for this issue as well as with a bad client that ACKs everything and still advertises zero window size.
Let's just fix the known issue with the VMware VIO server NFS client talking to RHEL 7 with Ganesha. Otherwise, it's just speculation.
TCP_USER_TIMEOUT will not work with the latter case
Are you sure? I'd have to look at the kernel support, and re-read
the RFC.
and is only available in some versions (RHEL6.x lacks it!).
G2.5 isn't targeted at an 8-year-old OS version. There are probably
many things that won't work.
With either SO_SNDTIMEO or TCP_USER_TIMEOUT, we probably need a config parameter for extreme cases. Having a very large timeout like 5 seconds should be OK without any config parameter option, correct?
Agreed on parameter, but that would limit you to using G2.7.
Sure. 5 seconds would be long for a datacenter, but OK for a VPN
from US to SA.
With a decent default, it could be backported to your G2.5.
Kaleb recently posted a patch for non-systemd environments; that would be RHEL/CentOS 6.x. Ganesha, being user level, should be more portable. Talking on IRC with DanG indicated that we should use SO_SNDTIMEO, as TCP_USER_TIMEOUT is not available on RHEL 6.x systems. So, keeping that in mind, I adjusted the patch here: malahal@100cc6b
Non-blocking sockets would be better, but that requires changes in many places to handle non-blocking sockets. Added a send timeout of 5 seconds, so writev should time out after at most 5 seconds. We destroy the socket upon any failure of writev.
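A sketch of that error path, assuming SO_SNDTIMEO is already set on the socket: the helper and its arguments are illustrative, and SVC_DESTROY() is the ntirpc transport teardown, though whether it is the right call here is exactly the open question above.

```c
#include <sys/uio.h>
#include <errno.h>
#include <rpc/svc.h>   /* ntirpc: SVCXPRT, SVC_DESTROY */

/* Sketch only: with SO_SNDTIMEO set, a stalled client makes writev() fail
 * with EAGAIN/EWOULDBLOCK (or return a short count) after ~5 seconds
 * instead of blocking forever.  On any failure we tear down the transport. */
static void send_reply_iov(SVCXPRT *xprt, int fd,
                           struct iovec *iov, int iovcnt)
{
    ssize_t n = writev(fd, iov, iovcnt);

    if (n < 0) {
        /* timed out or hard error: drop this client */
        SVC_DESTROY(xprt);
        return;
    }
    /* real code must also handle short writes (n less than the total length) */
}
```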