Easily reproducible message timeouts #74
Comments
I think this might be caused by internal message framing overhead and I have an idea how to fix it. But I have to ask: is this a real use case, sending lots of very small messages?
I was doing it as part of testing, but it could theoretically happen.
Can't reproduce this:
Some questions:
Thanks
You may have to run it multiple times. But also, I am using brokers that [...]
I managed to reproduce this by adding latency to the loopback like this: [...] The above commit fixes the problem; can you verify on your end?
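The exact command did not survive in this thread; a typical way to add latency on the loopback interface with tc/netem on Linux, shown here only as an illustrative sketch, would be:
# add 200 ms of latency to the loopback interface (requires root; delay value is arbitrary)
sudo tc qdisc add dev lo root netem delay 200ms
# remove the added latency again when finished testing
sudo tc qdisc del dev lo root netem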
Didn't know you could do such a thing. Will try over the weekend. Thanks.
Magnus - the fix didn't seem to work right. I simply got the timeouts much more quickly (like a second) than I had in the past. And the number remaining seemed to follow some weird pattern...
Hi Magnus - I believe you mentioned offline that this could be caused by librdkafka using the wrong parameter for the message timeout? Just wondering if you figured it out?
Those timeouts were fixed prior to what you tested last in this issue report (940dfdc), so I don't think you benefited from those fixes in this scenario.
Not sure I follow - you're saying it's still broken? (Because that's what I'm seeing)
Still broken, yeah, I'll try to repro
I think this might be caused by rdkafka_performance changing the default value of [...]
This value was set to 5000ms for some unknown historical reason.
Or update to latest master and the default override will be gone.
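The property name was cut off in the comment above; assuming it is the topic-level message.timeout.ms setting that rdkafka_performance overrides, it could be raised explicitly on the command line, for example (values here are only illustrative):
# -X topic.<prop>=<val> passes a topic-level property; 300000 ms (5 min) is just an example value
./rdkafka_performance -P -t TENSECONDLOGS -b masked.com:5757 -s 2 -a 2 -z gzip -p 1 -c 1000000 \
  -X queue.buffering.max.messages=1000000 -X topic.message.timeout.ms=300000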
@edenhill could this be the cause of what I'm seeing in logs? Should I explicitly add [...]?
The Metadata timeouts are an indication of some problem with the broker(s) or network; they should typically not time out. Check your broker logs.
I tried setting that in the Producer client configuration and see this error: [...]
But in your docs I see it, and it's configured the same as other settings? Configured...
I'll try request.timeout.ms instead since it's not a "local" setting from the python client. Still trying to figure out the broker issue. I'm curious if perhaps the buffer size being 1 million could be impacting the open file limits on the server? Is a file created for each buffered message, and perhaps that explains why after they all get running we hit that ceiling and crash the brokers? Sorry to hijack the thread, but his errors seem all too similar to what I'm experiencing.
Messages are written to log segments; when a log segment reaches a certain configured size, it is rolled over and a new log segment is created.
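To the open-file question above: the broker's file descriptor count scales with partitions and log segments, not with the client's buffered message count. Segment rolling is governed by standard broker properties; the values below are only illustrative:
# server.properties (broker side), illustrative values only
log.segment.bytes=1073741824   # roll the active segment once it reaches ~1 GiB
log.roll.hours=168             # or roll after 7 days, whichever happens first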
Thanks for the clarification. A crazy thing happened: on a whim I updated my Ansible playbooks and reduced the 4-node cluster to 3 nodes for ZooKeeper/Kafka, and for Cassandra I removed the 4th node too. Restarted everything and the ETL is running fine; no issues whatsoever. Is it possible the issues were caused by having an even number of servers vs. an odd number? I did read that ZooKeeper should have an odd number of nodes, but I'd be very surprised if that would actually cause all these issues. Now everything is working 100% after moving from 4 nodes to 3.
Hi - not sure if this is an rdkafka issue or an actual Kafka issue. (Not sure which one I want...)
I'm finding that when I send a lot of small messages, I frequently get message timeouts. I see this both when using the API and when using the sample programs.
Example:
./rdkafka_performance -P -t TENSECONDLOGS -b masked.com:5757 -s 2 -a 2 -z gzip -p 1 -c 1000000 -X queue.buffering.max.messages=1000000
Message delivered: 471000 remain
Message delivered: 470000 remain
Message delivered: 469000 remain
Message delivered: 468000 remain
Message delivered: 467000 remain
Message delivered: 466000 remain
Message delivered: 465000 remain
Message delivered: 464000 remain
Message delivered: 463000 remain
Message delivered: 462000 remain
Message delivered: 461000 remain
Message delivered: 460000 remain
(sits and hangs here for a long time...and then:)
%4|1391743416.628|METADATA|rdkafka#producer-0| d146537-021.masked.com:5757/1052189105: Metadata request failed: Local: Message timed out
%4|1391743416.628|METADATA|rdkafka#producer-0| d146537-021.masked.com:5757/1052189105: Metadata request failed: Local: Message timed out
%4|1391743416.628|METADATA|rdkafka#producer-0| d146537-021.masked.com:5757/1052189105: Metadata request failed: Local: Message timed out
%4|1391743416.628|METADATA|rdkafka#producer-0| d146537-021.masked.com:5757/1052189105: Metadata request failed: Local: Message timed out
%4|1391743416.628|METADATA|rdkafka#producer-0| d146537-021.masked.com:5757/1052189105: Metadata request failed: Local: Message timed out
%4|1391743416.628|METADATA|rdkafka#producer-0| d146537-021.masked.com:5757/1052189105: Metadata request failed: Local: Message timed out
%4|1391743417.429|METADATA|rdkafka#producer-0| d146537-021.masked.com:5757/1052189105: Metadata request failed: Local: Message timed out
%4|1391743417.429|METADATA|rdkafka#producer-0| d146537-021.masked.com:5757/1052189105: Metadata request failed: Local: Message timed out
%4|1391743417.429|METADATA|rdkafka#producer-0| d146537-021.masked.com:5757/1052189105: Metadata request failed: Local: Message timed out
%4|1391743417.429|METADATA|rdkafka#producer-0| d146537-021.masked.com:5757/1052189105: Metadata request failed: Local: Message timed out
If I strace while it's hanging, I see:
futex(0x1374160, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x137418c, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 7175, {1391707106, 253292000}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0x1374160, FUTEX_WAKE_PRIVATE, 1) = 0
I sometimes see this on the cluster (but definitely not always):
[2014-02-06 22:02:54,585] INFO Client session timed out, have not heard from server in 4815ms for sessionid 0x343c4f6383400a7, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn)
[2014-02-06 22:02:54,689] INFO zookeeper state changed (Disconnected) (org.I0Itec.zkclient.ZkClient)
[2014-02-06 22:02:55,201] INFO Opening socket connection to server zookeeperlog1.masked.com/10.52.189.106:5700 (org.apache.zookeeper.ClientCnxn)
[2014-02-06 22:02:55,202] INFO Socket connection established to zookeeperlog1.masked.com/10.52.189.106:5700, initiating session (org.apache.zookeeper.ClientCnxn)
[2014-02-06 22:02:55,207] INFO Session establishment complete on server zookeeperlog1.masked.com/10.52.189.106:5700, sessionid = 0x343c4f6383400a7, negotiated timeout = 6000 (org.apache.zookeeper.ClientCnxn)
[2014-02-06 22:02:55,208] INFO zookeeper state changed (SyncConnected) (org.I0Itec.zkclient.ZkClient)
Do you think I should simply increase the ZooKeeper timeout? (Note it's running on the same host as the cluster node.) Have you seen this before?
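If raising the timeout, the usual knobs are the broker's ZooKeeper session settings; the log above shows a negotiated timeout of 6000 ms. The values below are only illustrative:
# server.properties (Kafka broker), illustrative values only
zookeeper.session.timeout.ms=15000      # allow longer pauses before the session expires
zookeeper.connection.timeout.ms=15000   # max time to wait when (re)connecting to ZooKeeper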