Easily reproducible message timeouts #74

Closed
winbatch opened this issue Feb 7, 2014 · 19 comments

@winbatch

winbatch commented Feb 7, 2014

Hi - not sure if this is an rdkafka issue or an actual Kafka issue. (Not sure which one I want..)

I'm finding that when I send a lot of small messages, I frequently get message timeouts. I see this both when using the API and when using the sample programs.

Example:

./rdkafka_performance -P -t TENSECONDLOGS -b masked.com:5757 -s 2 -a 2 -z gzip -p 1 -c 1000000 -X queue.buffering.max.messages=1000000

Message delivered: 471000 remain
Message delivered: 470000 remain
Message delivered: 469000 remain
Message delivered: 468000 remain
Message delivered: 467000 remain
Message delivered: 466000 remain
Message delivered: 465000 remain
Message delivered: 464000 remain
Message delivered: 463000 remain
Message delivered: 462000 remain
Message delivered: 461000 remain
Message delivered: 460000 remain

(it sits and hangs here for a long time, and then:)
%4|1391743416.628|METADATA|rdkafka#producer-0| d146537-021.masked.com:5757/1052189105: Metadata request failed: Local: Message timed out
%4|1391743416.628|METADATA|rdkafka#producer-0| d146537-021.masked.com:5757/1052189105: Metadata request failed: Local: Message timed out
%4|1391743416.628|METADATA|rdkafka#producer-0| d146537-021.masked.com:5757/1052189105: Metadata request failed: Local: Message timed out
%4|1391743416.628|METADATA|rdkafka#producer-0| d146537-021.masked.com:5757/1052189105: Metadata request failed: Local: Message timed out
%4|1391743416.628|METADATA|rdkafka#producer-0| d146537-021.masked.com:5757/1052189105: Metadata request failed: Local: Message timed out
%4|1391743416.628|METADATA|rdkafka#producer-0| d146537-021.masked.com:5757/1052189105: Metadata request failed: Local: Message timed out
%4|1391743417.429|METADATA|rdkafka#producer-0| d146537-021.masked.com:5757/1052189105: Metadata request failed: Local: Message timed out
%4|1391743417.429|METADATA|rdkafka#producer-0| d146537-021.masked.com:5757/1052189105: Metadata request failed: Local: Message timed out
%4|1391743417.429|METADATA|rdkafka#producer-0| d146537-021.masked.com:5757/1052189105: Metadata request failed: Local: Message timed out
%4|1391743417.429|METADATA|rdkafka#producer-0| d146537-021.masked.com:5757/1052189105: Metadata request failed: Local: Message timed out

If I strace while it's hanging, I see:

futex(0x1374160, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x137418c, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 7175, {1391707106, 253292000}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0x1374160, FUTEX_WAKE_PRIVATE, 1) = 0

I sometimes see this on the cluster (but definitely not always):

[2014-02-06 22:02:54,585] INFO Client session timed out, have not heard from server in 4815ms for sessionid 0x343c4f6383400a7, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn)
[2014-02-06 22:02:54,689] INFO zookeeper state changed (Disconnected) (org.I0Itec.zkclient.ZkClient)
[2014-02-06 22:02:55,201] INFO Opening socket connection to server zookeeperlog1.masked.com/10.52.189.106:5700 (org.apache.zookeeper.ClientCnxn)
[2014-02-06 22:02:55,202] INFO Socket connection established to zookeeperlog1.masked.com/10.52.189.106:5700, initiating session (org.apache.zookeeper.ClientCnxn)
[2014-02-06 22:02:55,207] INFO Session establishment complete on server zookeeperlog1.masked.com/10.52.189.106:5700, sessionid = 0x343c4f6383400a7, negotiated timeout = 6000 (org.apache.zookeeper.ClientCnxn)
[2014-02-06 22:02:55,208] INFO zookeeper state changed (SyncConnected) (org.I0Itec.zkclient.ZkClient)

Do you think I should simply increase the ZooKeeper timeout? (Note that it's running on the same host as the cluster node.) Have you seen this before?
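For readers hitting the same symptom from an application rather than the sample programs, here is a minimal sketch (not from this thread; it assumes the confluent-kafka-python client that comes up later in this issue, and the broker address and settings are illustrative) of how these per-message timeouts surface through the delivery callback when producing a flood of tiny messages:

```python
# Minimal sketch (illustrative only): produce many tiny messages and watch for
# "Local: Message timed out" errors in the delivery callback.
from confluent_kafka import Producer

p = Producer({
    'bootstrap.servers': 'broker1:9092',          # placeholder broker address
    'queue.buffering.max.messages': 1000000,
    'compression.codec': 'gzip',
})

def on_delivery(err, msg):
    # err is set for failures such as per-message timeouts, None on success
    if err is not None:
        print('Message delivery failed: {}'.format(err))

for _ in range(1000000):
    while True:
        try:
            # two-byte payloads, mirroring the -s 2 setting used above
            p.produce('TENSECONDLOGS', value=b'ab', on_delivery=on_delivery)
            break
        except BufferError:
            p.poll(1)  # local queue is full; serve delivery reports and retry
    p.poll(0)          # serve delivery callbacks as we go

p.flush()              # wait for outstanding messages (or their timeouts)
```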

@edenhill
Contributor

edenhill commented Feb 8, 2014

I think this might be caused by internal message framing overhead and I have an idea how to fix it.

But I have to ask: is this a real use case, sending lots of very small messages?

@winbatch
Author

winbatch commented Feb 8, 2014

I was doing it as part of testing, but it could theoretically happen. Remember that I am unable to batch, so if I have small content available, I'll send it immediately.


@edenhill
Contributor

edenhill commented Feb 8, 2014

Can't reproduce this:

$ ./rdkafka_performance -P -t multi -p 0 -b localhost -s 2 -a 2 -z gzip -c 1000000 -X queue.buffering.max.messages=1000000

...

% 1000000 messages and 2000000 bytes produced in 4809ms: 207911 msgs/s and 0.42 Mb/s, 0 messages failed, gzip compression

Some questions:

  • how many brokers are in the replication set for this topic?
  • what happens if you try -a -1, 0 and 1? (request.required.acks) NOTE: For 0 you will need to update to latest master due to issue Producer segfaults with request.required.acks=0 #75
  • is it possible that it's the replication between brokers that's the problem? I think you might have insinuated this in another issue at some point.
  • can you reproduce this with -d all? There's going to be a lot of output, so you can mail it to me directly.

Thanks

@winbatch
Author

winbatch commented Feb 8, 2014

You may have to run it multiple times. Also, I am using brokers that are not localhost: 3 brokers, 3 replicas. I'm unable to reproduce with all the debugging, presumably because it slows down the app enough. Will try with the other ack values.


@edenhill
Contributor

edenhill commented Feb 8, 2014

I managed to reproduce this by adding latency to the loopback like this: sudo tc qdisc add dev lo root netem delay 350ms

The above commit fixes the problem; can you verify on your end?

@winbatch
Author

winbatch commented Feb 8, 2014

Didn't know you could do such a thing. Will try over the weekend. Thanks very much.


@winbatch
Author

Magnus - the fix didn't seem to work right. I simply got the timeouts much more quickly (within about a second) than I had in the past, and the number remaining seemed to follow some weird pattern...
Message delivered: 556000 remain
Message delivered: 555000 remain
Message delivered: 554000 remain
Message delivered: 553000 remain
Message delivered: 552000 remain
Message delivered: 551000 remain
Message delivered: 550000 remain
Message delivered: 549000 remain
Message delivery failed: Local: Message timed out (548827 remain)
Message delivery failed: Local: Message timed out (548826 remain)
Message delivery failed: Local: Message timed out (548825 remain)
Message delivery failed: Local: Message timed out (548824 remain)
Message delivery failed: Local: Message timed out (548823 remain)
Message delivery failed: Local: Message timed out (548822 remain)
Message delivery failed: Local: Message timed out (548821 remain)
Message delivery failed: Local: Message timed out (548820 remain)
Message delivery failed: Local: Message timed out (548819 remain)
Message delivery failed: Local: Message timed out (548818 remain)
Message delivery failed: Local: Message timed out (548817 remain)
Message delivery failed: Local: Message timed out (548816 remain)
Message delivery failed: Local: Message timed out (548815 remain)
Message delivery failed: Local: Message timed out (548814 remain)
Message delivery failed: Local: Message timed out (548813 remain)
Message delivery failed: Local: Message timed out (548812 remain)
Message delivery failed: Local: Message timed out (548811 remain)
Message delivery failed: Local: Message timed out (548810 remain)
Message delivery failed: Local: Message timed out (548809 remain)
Message delivery failed: Local: Message timed out (548808 remain)
Message delivery failed: Local: Message timed out (548807 remain)
Message delivery failed: Local: Message timed out (548806 remain)
Message delivery failed: Local: Message timed out (548805 remain)
Message delivery failed: Local: Message timed out (548804 remain)
Message delivery failed: Local: Message timed out (548803 remain)
Message delivery failed: Local: Message timed out (548802 remain)
Message delivery failed: Local: Message timed out (548801 remain)
Message delivery failed: Local: Message timed out (548800 remain)
Message delivery failed: Local: Message timed out (548799 remain)
Message delivery failed: Local: Message timed out (548798 remain)
Message delivery failed: Local: Message timed out (548797 remain)
Message delivery failed: Local: Message timed out (548796 remain)
Message delivery failed: Local: Message timed out (548795 remain)
Message delivery failed: Local: Message timed out (548794 remain)
Message delivery failed: Local: Message timed out (548793 remain)
Message delivery failed: Local: Message timed out (548792 remain)
Message delivery failed: Local: Message timed out (548791 remain)
Message delivery failed: Local: Message timed out (548790 remain)
Message delivery failed: Local: Message timed out (548789 remain)
Message delivery failed: Local: Message timed out (548788 remain)
Message delivery failed: Local: Message timed out (548787 remain)
Message delivery failed: Local: Message timed out (548786 remain)
Message delivery failed: Local: Message timed out (548785 remain)
Message delivery failed: Local: Message timed out (548784 remain)
Message delivery failed: Local: Message timed out (548783 remain)
Message delivery failed: Local: Message timed out (548782 remain)
Message delivery failed: Local: Message timed out (548781 remain)
Message delivery failed: Local: Message timed out (548780 remain)
Message delivery failed: Local: Message timed out (548779 remain)
Message delivery failed: Local: Message timed out (548000 remain)
Message delivery failed: Local: Message timed out (547828 remain)
Message delivery failed: Local: Message timed out (547000 remain)
Message delivery failed: Local: Message timed out (546828 remain)
Message delivery failed: Local: Message timed out (546000 remain)
Message delivery failed: Local: Message timed out (545828 remain)
Message delivery failed: Local: Message timed out (545000 remain)
Message delivery failed: Local: Message timed out (544828 remain)

@winbatch
Author

Hi Magnus - I believe you mentioned offline that this could be caused by librdkafka using the wrong parameter for the message timeout? Just wondering if you figured it out?

@edenhill
Contributor

Those timeouts were fixed prior to what you tested last in this issue report (940dfdc), so I don't think you benefited from those fixes in this scenario.

@winbatch
Author

Not sure I follow - you're saying it's still broken? (Because that's what I'm seeing.)

@edenhill
Contributor

Still broken, yeah, I'll try to repro

@edenhill
Contributor

I think this might be caused by rdkafka_performance changing the default value of message.timeout.ms (300s) to 5s.
This is a legacy setting and I don't know why this value was chosen.
Try setting it back to its default value, -X topic.message.timeout.ms=300000, and the problem should go away.

edenhill added a commit that referenced this issue Mar 18, 2014
This value was set to 5000ms for some unknown historical reason.
@edenhill
Contributor

Or update to latest master and the default override will be gone.

@mikesparr

@edenhill could this be the cause of what I'm seeing in the logs? Should I explicitly add topic.message.timeout.ms to my confluent-kafka-python producer class?

2016-09-16 08:56:01,425 - INFO - extractor_rets.py - Sleeping for 5 minutes (300 seconds)
2016-09-16 09:01:08,140 - INFO - extractor_rets.py - Processed 0 rows
2016-09-16 09:01:08,141 - INFO - extractor_rets.py - Sleeping for 5 minutes (300 seconds)
%4|1474038111.837|METADATA|app2#producer-1| queue4:9092/4: Metadata request failed: Local: Timed out
%3|1474038111.837|FAIL|app2#producer-1| queue4:9092/4: 3 request(s) timed out: disconnect
%3|1474038111.837|ERROR|app2#producer-1| queue4:9092/4: 3 request(s) timed out: disconnect
%4|1474038174.946|METADATA|app2#producer-1| queue4:9092/4: Metadata request failed: Local: Timed out
%3|1474038174.946|FAIL|app2#producer-1| queue4:9092/4: 3 request(s) timed out: disconnect
%3|1474038174.946|ERROR|app2#producer-1| queue4:9092/4: 3 request(s) timed out: disconnect
%4|1474038238.055|METADATA|app2#producer-1| queue4:9092/4: Metadata request failed: Local: Timed out
%3|1474038238.055|FAIL|app2#producer-1| queue4:9092/4: 3 request(s) timed out: disconnect
%3|1474038238.055|ERROR|app2#producer-1| queue4:9092/4: 3 request(s) timed out: disconnect
%3|1474038252.485|FAIL|app2#producer-1| queue4:9092/4: Receive failed: Connection reset by peer
%3|1474038252.485|ERROR|app2#producer-1| queue4:9092/4: Receive failed: Connection reset by peer
%4|1474038300.358|METADATA|app2#producer-1| queue4:9092/4: Metadata request failed: Local: Timed out
%3|1474038314.380|FAIL|app2#producer-1| queue4:9092/4: 3 request(s) timed out: disconnect
%3|1474038314.380|ERROR|app2#producer-1| queue4:9092/4: 3 request(s) timed out: disconnect
%4|1474038316.394|METADATA|app2#producer-1| queue4:9092/4: Metadata request failed: Local: Timed out
2016-09-16 09:06:13,052 - INFO - extractor_rets.py - Processed 0 rows
2016-09-16 09:06:13,053 - INFO - extractor_rets.py - Sleeping for 5 minutes (300 seconds)
%3|1474038375.488|FAIL|app2#producer-1| queue4:9092/4: 3 request(s) timed out: disconnect
%3|1474038375.488|ERROR|app2#producer-1| queue4:9092/4: 3 request(s) timed out: disconnect
%4|1474038377.501|METADATA|app2#producer-1| queue4:9092/4: Metadata request failed: Local: Timed out
%3|1474038436.593|FAIL|app2#producer-1| queue4:9092/4: 3 request(s) timed out: disconnect
%3|1474038436.593|ERROR|app2#producer-1| queue4:9092/4: 3 request(s) timed out: disconnect
%4|1474038438.607|METADATA|app2#producer-1| queue4:9092/4: Metadata request failed: Local: Timed out
%3|1474038497.699|FAIL|app2#producer-1| queue4:9092/4: 3 request(s) timed out: disconnect
%3|1474038497.699|ERROR|app2#producer-1| queue4:9092/4: 3 request(s) timed out: disconnect
%4|1474038499.713|METADATA|app2#producer-1| queue4:9092/4: Metadata request failed: Local: Timed out

@edenhill
Contributor

The Metadata timeouts are an indication of some problem with the broker(s) or the network; they should typically not time out. Check your broker logs.

@mikesparr

I tried setting that in the Producer client configuration and I see this error:

2016-09-16 10:14:00,476 - ERROR - stream_producer.pyc - Error [KafkaError{code=_INVALID_ARG,val=-186,str="No such configuration property: "message.timeout.ms""}] publishing data at topic [rets.rayac.Office]:[None]-[rayac:1347]

But in your docs I see it, and I configured it the same as the other settings?

[screenshot: 2016-09-16 at 10:18 AM]

Configured...

[screenshot: 2016-09-16 at 10:17 AM]

@mikesparr

I'll try request.timeout.ms instead, since it's not a "local" setting from the Python client. Still trying to figure out the broker issue. I'm curious whether the buffer size being 1 million could be impacting the open file limits on the server? Is a file created for each buffered message, and perhaps that explains why, once they all get going, we hit that ceiling and crash the brokers? Sorry to hijack the thread, but his errors seem all too similar to what I'm experiencing.

@edenhill
Contributor

message.timeout.ms is a topic-level config property so you will need to set it in the "default.topic.config" sub-dict.
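A minimal sketch of that placement (assuming confluent-kafka-python; the broker list and other values here are placeholders, only the default.topic.config sub-dict is the point):

```python
# Hedged sketch: topic-level properties such as message.timeout.ms go inside
# the "default.topic.config" sub-dict, not at the top level of the config.
from confluent_kafka import Producer

producer = Producer({
    'bootstrap.servers': 'queue1:9092,queue2:9092',   # placeholder brokers
    'default.topic.config': {
        'message.timeout.ms': 300000,                 # librdkafka default (300 s)
    },
})
```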

Messages are written to log segments; when a log segment reaches a certain configured size it is rolled over and a new log segment is created.
So yes, the message size * message count does affect the number of open log segments.
Please see the broker properties log.segment.bytes and log.roll.* here:
http://kafka.apache.org/documentation.html#brokerconfigs
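As a rough, purely illustrative back-of-the-envelope calculation (the segment size below is the broker's stock log.segment.bytes default; the message numbers mirror the tests earlier in this thread):

```python
# Back-of-the-envelope only: how much log-segment space a burst of small
# messages occupies, ignoring per-message framing and index overhead.
message_size_bytes = 2                  # tiny payloads, as in the -s 2 tests above
message_count = 1_000_000
log_segment_bytes = 1_073_741_824       # broker default log.segment.bytes (1 GiB)

payload_bytes = message_size_bytes * message_count
segments = payload_bytes / log_segment_bytes
print(f"{payload_bytes} bytes of payload ~= {segments:.4f} of one log segment")
```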

@mikesparr

Thanks for the clarification. A crazy thing happened: on a whim I updated my Ansible playbooks and reduced the 4-node cluster to 3 nodes for ZooKeeper/Kafka, and for Cassandra I removed the 4th node too. Restarted everything and the ETL is running fine; no issues whatsoever.

Is it possible the issues were caused by having an even number of servers vs. an odd number? I did read that ZooKeeper should be an odd number, but I'd be very surprised if that would actually cause all these issues. Now everything is working 100% after moving from 4 nodes to 3.
