Easily reproducible message timeouts #74

Closed
winbatch opened this issue Feb 7, 2014 · 19 comments

@winbatch

winbatch commented Feb 7, 2014

Hi - not sure if this is an rdkafka issue or an actual Kafka issue. (Not sure which one I want..)

I'm finding that when I send a lot of small messages, I frequently get message timeouts. I see this both when using the API and when using the sample programs.

Example:

./rdkafka_performance -P -t TENSECONDLOGS -b masked.com:5757 -s 2 -a 2 -z gzip -p 1 -c 1000000 -X queue.buffering.max.messages=1000000

Message delivered: 471000 remain
Message delivered: 470000 remain
Message delivered: 469000 remain
Message delivered: 468000 remain
Message delivered: 467000 remain
Message delivered: 466000 remain
Message delivered: 465000 remain
Message delivered: 464000 remain
Message delivered: 463000 remain
Message delivered: 462000 remain
Message delivered: 461000 remain
Message delivered: 460000 remain

(it sits and hangs here for a long time, and then:)
%4|1391743416.628|METADATA|rdkafka#producer-0| d146537-021.masked.com:5757/1052189105: Metadata request failed: Local: Message timed out
%4|1391743416.628|METADATA|rdkafka#producer-0| d146537-021.masked.com:5757/1052189105: Metadata request failed: Local: Message timed out
%4|1391743416.628|METADATA|rdkafka#producer-0| d146537-021.masked.com:5757/1052189105: Metadata request failed: Local: Message timed out
%4|1391743416.628|METADATA|rdkafka#producer-0| d146537-021.masked.com:5757/1052189105: Metadata request failed: Local: Message timed out
%4|1391743416.628|METADATA|rdkafka#producer-0| d146537-021.masked.com:5757/1052189105: Metadata request failed: Local: Message timed out
%4|1391743416.628|METADATA|rdkafka#producer-0| d146537-021.masked.com:5757/1052189105: Metadata request failed: Local: Message timed out
%4|1391743417.429|METADATA|rdkafka#producer-0| d146537-021.masked.com:5757/1052189105: Metadata request failed: Local: Message timed out
%4|1391743417.429|METADATA|rdkafka#producer-0| d146537-021.masked.com:5757/1052189105: Metadata request failed: Local: Message timed out
%4|1391743417.429|METADATA|rdkafka#producer-0| d146537-021.masked.com:5757/1052189105: Metadata request failed: Local: Message timed out
%4|1391743417.429|METADATA|rdkafka#producer-0| d146537-021.masked.com:5757/1052189105: Metadata request failed: Local: Message timed out

If I strace while it's hanging, I see:

futex(0x1374160, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x137418c, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 7175, {1391707106, 253292000}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0x1374160, FUTEX_WAKE_PRIVATE, 1) = 0

I sometimes see this on the cluster (but definitely not always):

[2014-02-06 22:02:54,585] INFO Client session timed out, have not heard from server in 4815ms for sessionid 0x343c4f6383400a7, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn)
[2014-02-06 22:02:54,689] INFO zookeeper state changed (Disconnected) (org.I0Itec.zkclient.ZkClient)
[2014-02-06 22:02:55,201] INFO Opening socket connection to server zookeeperlog1.masked.com/10.52.189.106:5700 (org.apache.zookeeper.ClientCnxn)
[2014-02-06 22:02:55,202] INFO Socket connection established to zookeeperlog1.masked.com/10.52.189.106:5700, initiating session (org.apache.zookeeper.ClientCnxn)
[2014-02-06 22:02:55,207] INFO Session establishment complete on server zookeeperlog1.masked.com/10.52.189.106:5700, sessionid = 0x343c4f6383400a7, negotiated timeout = 6000 (org.apache.zookeeper.ClientCnxn)
[2014-02-06 22:02:55,208] INFO zookeeper state changed (SyncConnected) (org.I0Itec.zkclient.ZkClient)

Do you think I should simply increase the ZooKeeper timeout? (Note that it's running on the same host as the cluster node.) Have you seen this before?
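For readers hitting the same symptom from an application rather than the sample programs, here is a minimal sketch (not from this thread; it assumes the confluent-kafka-python client that comes up later in this issue, and the broker address and settings are illustrative) of how these per-message timeouts surface through the delivery callback when producing a flood of tiny messages:

```python
# Minimal sketch (illustrative only): produce many tiny messages and watch for
# "Local: Message timed out" errors in the delivery callback.
from confluent_kafka import Producer

p = Producer({
    'bootstrap.servers': 'broker1:9092',          # placeholder broker address
    'queue.buffering.max.messages': 1000000,
    'compression.codec': 'gzip',
})

def on_delivery(err, msg):
    # err is set for failures such as per-message timeouts, None on success
    if err is not None:
        print('Message delivery failed: {}'.format(err))

for _ in range(1000000):
    while True:
        try:
            # two-byte payloads, mirroring the -s 2 setting used above
            p.produce('TENSECONDLOGS', value=b'ab', on_delivery=on_delivery)
            break
        except BufferError:
            p.poll(1)  # local queue is full; serve delivery reports and retry
    p.poll(0)          # serve delivery callbacks as we go

p.flush()              # wait for outstanding messages (or their timeouts)
```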

@edenhill
Contributor

edenhill commented Feb 8, 2014

I think this might be caused by internal message framing overhead and I have an idea how to fix it.

But I have to ask: is this a real use case, sending lots of very small messages?

@winbatch
Author

winbatch commented Feb 8, 2014

I was doing it as part of testing, but it could theoretically happen. Remember that I am unable to batch, so if I have small content available, I'll send it immediately.


@edenhill
Contributor

edenhill commented Feb 8, 2014

Can't reproduce this:

$ ./rdkafka_performance -P -t multi -p 0 -b localhost -s 2 -a 2 -z gzip -c 1000000 -X queue.buffering.max.messages=1000000

...

% 1000000 messages and 2000000 bytes produced in 4809ms: 207911 msgs/s and 0.42 Mb/s, 0 messages failed, gzip compression

Some questions:

  • how many brokers are in the replication set for this topic?
  • what happens if you try -a -1, 0 and 1? (request.required.acks) NOTE: For 0 you will need to update to latest master due to issue Producer segfaults with request.required.acks=0 #75
  • is it possible that it's the replication between brokers that's the problem? I think you might have insinuated this in another issue at some point.
  • can you reproduce this with -d all? There's going to be a lot of output, so you can mail it to me directly.

Thanks

@winbatch
Author

winbatch commented Feb 8, 2014

You may have to run it multiple times. Also, I am using brokers that are not localhost: 3 brokers, 3 replicas. I'm unable to reproduce with all the debugging, presumably because it slows down the app enough. Will try with the other ack values.


@edenhill
Contributor

edenhill commented Feb 8, 2014

I managed to reproduce this by adding latency to the loopback like this: sudo tc qdisc add dev lo root netem delay 350ms

The above commit fixes the problem; can you verify on your end?

@winbatch
Author

winbatch commented Feb 8, 2014

Didn't know you could do such a thing. Will try over the weekend. Thanks very much.


@winbatch
Author

Magnus - the fix didn't seem to work right. I simply got the timeouts much more quickly (within about a second) than I had in the past, and the number remaining seemed to follow some weird pattern...
Message delivered: 556000 remain
Message delivered: 555000 remain
Message delivered: 554000 remain
Message delivered: 553000 remain
Message delivered: 552000 remain
Message delivered: 551000 remain
Message delivered: 550000 remain
Message delivered: 549000 remain
Message delivery failed: Local: Message timed out (548827 remain)
Message delivery failed: Local: Message timed out (548826 remain)
Message delivery failed: Local: Message timed out (548825 remain)
Message delivery failed: Local: Message timed out (548824 remain)
Message delivery failed: Local: Message timed out (548823 remain)
Message delivery failed: Local: Message timed out (548822 remain)
Message delivery failed: Local: Message timed out (548821 remain)
Message delivery failed: Local: Message timed out (548820 remain)
Message delivery failed: Local: Message timed out (548819 remain)
Message delivery failed: Local: Message timed out (548818 remain)
Message delivery failed: Local: Message timed out (548817 remain)
Message delivery failed: Local: Message timed out (548816 remain)
Message delivery failed: Local: Message timed out (548815 remain)
Message delivery failed: Local: Message timed out (548814 remain)
Message delivery failed: Local: Message timed out (548813 remain)
Message delivery failed: Local: Message timed out (548812 remain)
Message delivery failed: Local: Message timed out (548811 remain)
Message delivery failed: Local: Message timed out (548810 remain)
Message delivery failed: Local: Message timed out (548809 remain)
Message delivery failed: Local: Message timed out (548808 remain)
Message delivery failed: Local: Message timed out (548807 remain)
Message delivery failed: Local: Message timed out (548806 remain)
Message delivery failed: Local: Message timed out (548805 remain)
Message delivery failed: Local: Message timed out (548804 remain)
Message delivery failed: Local: Message timed out (548803 remain)
Message delivery failed: Local: Message timed out (548802 remain)
Message delivery failed: Local: Message timed out (548801 remain)
Message delivery failed: Local: Message timed out (548800 remain)
Message delivery failed: Local: Message timed out (548799 remain)
Message delivery failed: Local: Message timed out (548798 remain)
Message delivery failed: Local: Message timed out (548797 remain)
Message delivery failed: Local: Message timed out (548796 remain)
Message delivery failed: Local: Message timed out (548795 remain)
Message delivery failed: Local: Message timed out (548794 remain)
Message delivery failed: Local: Message timed out (548793 remain)
Message delivery failed: Local: Message timed out (548792 remain)
Message delivery failed: Local: Message timed out (548791 remain)
Message delivery failed: Local: Message timed out (548790 remain)
Message delivery failed: Local: Message timed out (548789 remain)
Message delivery failed: Local: Message timed out (548788 remain)
Message delivery failed: Local: Message timed out (548787 remain)
Message delivery failed: Local: Message timed out (548786 remain)
Message delivery failed: Local: Message timed out (548785 remain)
Message delivery failed: Local: Message timed out (548784 remain)
Message delivery failed: Local: Message timed out (548783 remain)
Message delivery failed: Local: Message timed out (548782 remain)
Message delivery failed: Local: Message timed out (548781 remain)
Message delivery failed: Local: Message timed out (548780 remain)
Message delivery failed: Local: Message timed out (548779 remain)
Message delivery failed: Local: Message timed out (548000 remain)
Message delivery failed: Local: Message timed out (547828 remain)
Message delivery failed: Local: Message timed out (547000 remain)
Message delivery failed: Local: Message timed out (546828 remain)
Message delivery failed: Local: Message timed out (546000 remain)
Message delivery failed: Local: Message timed out (545828 remain)
Message delivery failed: Local: Message timed out (545000 remain)
Message delivery failed: Local: Message timed out (544828 remain)

@winbatch
Author

Hi Magnus - I believe you mentioned offline that this could be caused by librdkafka using the wrong parameter for the message timeout? Just wondering if you figured it out?

@edenhill
Contributor

Those timeouts were fixed prior to what you tested last in this issue report (940dfdc), so I don't think you benefited from those fixes in this scenario.

@winbatch
Author

Not sure I follow - you're saying it's still broken? (Because that's what I'm seeing.)

@edenhill
Contributor

Still broken, yeah, I'll try to repro

@edenhill
Contributor

I think this might be caused by rdkafka_performance changing the default value of message.timeout.ms (300s) to 5s.
This is a legacy setting and I don't know why this value was chosen.
Try setting it back to its default value, -X topic.message.timeout.ms=300000, and the problem should go away.

edenhill added a commit that referenced this issue Mar 18, 2014
This value was set to 5000ms for some unknown historical reason.
@edenhill
Contributor

Or update to latest master and the default override will be gone.

@mikesparr

@edenhill could this be the cause of what I'm seeing in the logs? Should I explicitly add topic.message.timeout.ms to my confluent-kafka-python producer class?

2016-09-16 08:56:01,425 - INFO - extractor_rets.py - Sleeping for 5 minutes (300 seconds)
2016-09-16 09:01:08,140 - INFO - extractor_rets.py - Processed 0 rows
2016-09-16 09:01:08,141 - INFO - extractor_rets.py - Sleeping for 5 minutes (300 seconds)
%4|1474038111.837|METADATA|app2#producer-1| queue4:9092/4: Metadata request failed: Local: Timed out
%3|1474038111.837|FAIL|app2#producer-1| queue4:9092/4: 3 request(s) timed out: disconnect
%3|1474038111.837|ERROR|app2#producer-1| queue4:9092/4: 3 request(s) timed out: disconnect
%4|1474038174.946|METADATA|app2#producer-1| queue4:9092/4: Metadata request failed: Local: Timed out
%3|1474038174.946|FAIL|app2#producer-1| queue4:9092/4: 3 request(s) timed out: disconnect
%3|1474038174.946|ERROR|app2#producer-1| queue4:9092/4: 3 request(s) timed out: disconnect
%4|1474038238.055|METADATA|app2#producer-1| queue4:9092/4: Metadata request failed: Local: Timed out
%3|1474038238.055|FAIL|app2#producer-1| queue4:9092/4: 3 request(s) timed out: disconnect
%3|1474038238.055|ERROR|app2#producer-1| queue4:9092/4: 3 request(s) timed out: disconnect
%3|1474038252.485|FAIL|app2#producer-1| queue4:9092/4: Receive failed: Connection reset by peer
%3|1474038252.485|ERROR|app2#producer-1| queue4:9092/4: Receive failed: Connection reset by peer
%4|1474038300.358|METADATA|app2#producer-1| queue4:9092/4: Metadata request failed: Local: Timed out
%3|1474038314.380|FAIL|app2#producer-1| queue4:9092/4: 3 request(s) timed out: disconnect
%3|1474038314.380|ERROR|app2#producer-1| queue4:9092/4: 3 request(s) timed out: disconnect
%4|1474038316.394|METADATA|app2#producer-1| queue4:9092/4: Metadata request failed: Local: Timed out
2016-09-16 09:06:13,052 - INFO - extractor_rets.py - Processed 0 rows
2016-09-16 09:06:13,053 - INFO - extractor_rets.py - Sleeping for 5 minutes (300 seconds)
%3|1474038375.488|FAIL|app2#producer-1| queue4:9092/4: 3 request(s) timed out: disconnect
%3|1474038375.488|ERROR|app2#producer-1| queue4:9092/4: 3 request(s) timed out: disconnect
%4|1474038377.501|METADATA|app2#producer-1| queue4:9092/4: Metadata request failed: Local: Timed out
%3|1474038436.593|FAIL|app2#producer-1| queue4:9092/4: 3 request(s) timed out: disconnect
%3|1474038436.593|ERROR|app2#producer-1| queue4:9092/4: 3 request(s) timed out: disconnect
%4|1474038438.607|METADATA|app2#producer-1| queue4:9092/4: Metadata request failed: Local: Timed out
%3|1474038497.699|FAIL|app2#producer-1| queue4:9092/4: 3 request(s) timed out: disconnect
%3|1474038497.699|ERROR|app2#producer-1| queue4:9092/4: 3 request(s) timed out: disconnect
%4|1474038499.713|METADATA|app2#producer-1| queue4:9092/4: Metadata request failed: Local: Timed out

@edenhill
Contributor

The Metadata timeouts are an indication of some problem with the broker(s) or the network; they should typically not time out. Check your broker logs.

@mikesparr

I tried setting that in the Producer client configuration and I see this error:

2016-09-16 10:14:00,476 - ERROR - stream_producer.pyc - Error [KafkaError{code=_INVALID_ARG,val=-186,str="No such configuration property: "message.timeout.ms""}] publishing data at topic [rets.rayac.Office]:[None]-[rayac:1347]

But in your docs I see it, and I configured it the same as the other settings?

[screenshot: 2016-09-16 at 10:18 AM]

Configured...

[screenshot: 2016-09-16 at 10:17 AM]

@mikesparr

I'll try request.timeout.ms instead, since it's not a "local" setting from the Python client. Still trying to figure out the broker issue. I'm curious whether the buffer size being 1 million could be impacting the open file limits on the server? Is a file created for each buffered message, and perhaps that explains why, once they all get going, we hit that ceiling and crash the brokers? Sorry to hijack the thread, but his errors seem all too similar to what I'm experiencing.

@edenhill
Contributor

message.timeout.ms is a topic-level config property so you will need to set it in the "default.topic.config" sub-dict.
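A minimal sketch of that placement (assuming confluent-kafka-python; the broker list and other values here are placeholders, only the default.topic.config sub-dict is the point):

```python
# Hedged sketch: topic-level properties such as message.timeout.ms go inside
# the "default.topic.config" sub-dict, not at the top level of the config.
from confluent_kafka import Producer

producer = Producer({
    'bootstrap.servers': 'queue1:9092,queue2:9092',   # placeholder brokers
    'default.topic.config': {
        'message.timeout.ms': 300000,                 # librdkafka default (300 s)
    },
})
```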

Messages are written to log segments; when a log segment reaches a certain configured size it is rolled over and a new log segment is created.
So yes, the message size * message count does affect the number of open log segments.
Please see the broker properties log.segment.bytes and log.roll.* here:
http://kafka.apache.org/documentation.html#brokerconfigs
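As a rough, purely illustrative back-of-the-envelope calculation (the segment size below is the broker's stock log.segment.bytes default; the message numbers mirror the tests earlier in this thread):

```python
# Back-of-the-envelope only: how much log-segment space a burst of small
# messages occupies, ignoring per-message framing and index overhead.
message_size_bytes = 2                  # tiny payloads, as in the -s 2 tests above
message_count = 1_000_000
log_segment_bytes = 1_073_741_824       # broker default log.segment.bytes (1 GiB)

payload_bytes = message_size_bytes * message_count
segments = payload_bytes / log_segment_bytes
print(f"{payload_bytes} bytes of payload ~= {segments:.4f} of one log segment")
```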

@mikesparr

Thanks for the clarification. A crazy thing happened: on a whim I updated my Ansible playbooks and reduced the 4-node cluster to 3 nodes for ZooKeeper/Kafka, and for Cassandra I removed the 4th node too. Restarted everything and the ETL is running fine; no issues whatsoever.

Is it possible the issues were caused by having an even number of servers vs. an odd number? I did read that ZooKeeper should be an odd number, but I'd be very surprised if that would actually cause all these issues. Now everything is working 100% after moving from 4 nodes to 3.
