When starting up, recovery of shards takes up to 50 minutes #6372
Comments
Hey, can you give us some more info about what version of ES you are running?
It seems like you are setting a lot of things on the cluster. Can you provide the settings you are using on the cluster aside from the defaults?
Here it is: Our river + master nodes have identical configuration, except that the data node and master have the datadir/master flags switched. We only use one master in order to avoid the split-brain issue.
Please let me know if you need any further information. Thanks!
While restarting with 1000 indexes, 20k shards and 500 nodes, the master node took 15 minutes to get to the initial allocation. We are still running elasticsearch 1.0.2.
It seems that the requests are being done sequentially, so maybe it's possible to speed this up by running some requests in parallel / caching more information.
@miccon in those 15m, does the master node report the correct number of nodes? I wonder if it is the time it takes for the 500 nodes to join the master. We improved the latter considerably in 1.4 (yet to be released): #7493 (see batch joining bullet point)
@bleskes I agree that the batch joining of the nodes will indeed help when the cluster starts, so it should help with issue #5232. In order to get the nodes to join quickly in 1.0 we have to set discovery.zen.publish_timeout to 0 as described in the other issue. Here, these 15m are after the nodes have joined, so yes, the master reports the correct number of nodes. It might be related to how the master queries the nodes for which shards are available / when it calculates the allocation.
Before recovering indices from disk, the master asks all the nodes about what they have on disk. To do so the nodes need some information that's part of the cluster state, and if they don't have it they respond with "I don't know yet". The problem is that you have set … That said, do you know what happens in the 11 minutes? How big are the shards? Last, with a cluster of your size, I would really recommend you upgrade as soon as you can. We have had so many optimizations that will help you (batched joins, memory signature, and many more)
I have noticed cluster initialization 'hung' on the same cluster update task, like @miccon :
Notes:
@shikhar when the master assigns replicas, it first asks all the nodes what files they have for that shard on disk. The idea is to assign it to the node that already has the most files available. The stack trace you're seeing is the master waiting on a node to answer this question. I can see optimizations we can do here, but these requests should be relatively quick.
For now I can only answer
As opposed to the normal cluster init time of a couple of minutes, it seems to be taking over 10-15 mins, by which time alerts fire, so we re-bounce it. As for
I will be sure to check these out next time. Thanks @bleskes!
Hi, after 1.4.0 is released, we will also run some tests with production indexes to see if the long shard initialisation phase has also improved.
@bluelu great news. Indeed, 1.4 massively improved the time it takes to form large clusters by batching join requests. W.r.t. shard initialization time, let's see how it goes. We still need to reach out to all the nodes and ask them for information about what shards they have on disk before primaries can be allocated and the cluster becomes yellow.
Just an idea (I don't know how it's done at the moment, but I guess it's the same in 1.4.0):
master cpu is pretty low
It does change. I mainly wanted to update that in our case the problem might possibly be due to using JDK8u5. I was able to capture some more diagnostics when bouncing the nodes following one such event of super-slow cluster init. We have some automated thread-dumping if a node takes too long to go down nicely and kill -9 needs to be used. The thread dump on this (non-master) node revealed a bunch of threads executing code relevant to the RPCs issued by the master:
The really weird thing is that these threads are reported to be RUNNABLE although they are supposedly in … Anyway, we have seen this weirdness a couple of times in some non-ES usage as well. We plan to upgrade our JDK8 version and hopefully this occasional issue will stop happening altogether. UPDATE May 14, 2015: https://issues.apache.org/jira/browse/LUCENE-6482 - unrelated to the JDK version
@bleskes Cluster join speed has improved a lot, that's great. Still, the allocation takes a lot of time: it's now more than 30 minutes since cluster start and it hasn't progressed much (see the status below, it's still hanging at "initializing_shards": 8390). Thread dump excerpt from the master: elasticsearch[I51N16][clusterService#updateTask][T#1]" #52 daemon … Cluster health: {
@Thibaut Thx for the update. I'll read more carefully later but can you run http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/cluster-nodes-hot-threads.html on the master a couple of times? I wonder what it does.
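Since hot threads come up repeatedly in this thread, here is a minimal sketch of grabbing a few samples from one node (e.g. the elected master) with the 1.x Java client; the REST endpoint linked above does the same thing. The node id argument and the sampling loop are illustrative assumptions, not anything prescribed in the thread:

```java
import org.elasticsearch.action.admin.cluster.node.hotthreads.NodeHotThreads;
import org.elasticsearch.action.admin.cluster.node.hotthreads.NodesHotThreadsResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.common.unit.TimeValue;

public class HotThreadsSampler {

    // Grab a few hot-threads samples from one node and print them,
    // so consecutive samples can be compared.
    public static void sample(Client client, String masterNodeId) throws InterruptedException {
        for (int i = 0; i < 3; i++) {
            NodesHotThreadsResponse response = client.admin().cluster()
                    .prepareNodesHotThreads(masterNodeId)
                    .setThreads(3)                               // top 3 busiest threads
                    .setInterval(TimeValue.timeValueMillis(500)) // sampling interval
                    .get();
            for (NodeHotThreads node : response.getNodes()) {
                System.out.println(node.getHotThreads());
            }
            Thread.sleep(5000); // leave some time between samples
        }
    }
}
```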
@bleskes In the meantime, the allocation has advanced a little bit: { … } What we normally do (or did before) was to add only a few nodes first (main indexes, 200-300 nodes), wait for the cluster to come up, and then add the other nodes one by one afterwards (since we could afford to have some indexes red in the beginning), while keeping allocation/balancing disabled.
I see. The disk threshold allocation decider, in charge of making sure a node is not overloaded with shards, is calculating the size of relocating shards by walking the shards list again and again. We'd have to make it more efficient. A simple workaround is to temporarily set
and re-enable it once start up is done. Can you try?
Instead of iterating all shards of all indices to get all relocating shards for a given node, we can just use the RoutingNode#shardsWithState method and fetch all INITIALIZING / RELOCATING shards and check if they are relocating. This operation is much faster and uses pre-built data structures. Relates to #6372
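A rough sketch of the idea in that commit, written against the 1.x routing classes; the shardSizes map is a hypothetical stand-in for the per-shard size information the real decider consults, so treat this as an illustration of the approach rather than the actual patch:

```java
import org.elasticsearch.cluster.routing.MutableShardRouting;
import org.elasticsearch.cluster.routing.RoutingNode;
import org.elasticsearch.cluster.routing.ShardRoutingState;

import java.util.Map;

public final class RelocatingShardsSize {

    private RelocatingShardsSize() {}

    // Sum the sizes of shards currently relocating off this node by asking the
    // RoutingNode directly for its INITIALIZING / RELOCATING shards and keeping
    // those flagged as relocating, instead of walking every shard of every index.
    public static long sizeOfRelocatingShards(RoutingNode node, Map<String, Long> shardSizes) {
        long total = 0;
        for (MutableShardRouting shard :
                node.shardsWithState(ShardRoutingState.INITIALIZING, ShardRoutingState.RELOCATING)) {
            if (shard.relocating()) {
                // shardSizes is a hypothetical shard-id -> bytes lookup standing in
                // for the ClusterInfo data the real decider uses.
                Long size = shardSizes.get(shard.shardId().toString());
                total += size == null ? 0 : size;
            }
        }
        return total;
    }
}
```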
@bleskes What do you need in order to be able to debug this and reproduce the performance issue? Cluster state, settings and configuration? I can send you this in private? |
@Thibaut I can't look at things in detail right now. It is surprising that disabling include_relocations didn't kick in. Being able to reroute quickly is important for the operations of the master. As a temporary workaround (as I see from the tickets that it causes other problems as well) try disabling the disk threshold allocator altogether. PS - Simon already fixed that slowness we saw in your hot threads: #8803
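For reference, disabling the disk threshold decider as suggested is a dynamic cluster setting (cluster.routing.allocation.disk.threshold_enabled). A minimal sketch of toggling it with the 1.x Java client, assuming a connected Client instance; the same update can be sent through the cluster settings REST API:

```java
import org.elasticsearch.client.Client;
import org.elasticsearch.common.settings.ImmutableSettings;

public class DiskThresholdToggle {

    // Flip the disk threshold decider on or off via a transient cluster
    // setting (transient, so it does not survive a full cluster restart).
    public static void setDiskThresholdEnabled(Client client, boolean enabled) {
        client.admin().cluster().prepareUpdateSettings()
                .setTransientSettings(ImmutableSettings.settingsBuilder()
                        .put("cluster.routing.allocation.disk.threshold_enabled", enabled)
                        .build())
                .get();
    }
}
```

The idea would be to call setDiskThresholdEnabled(client, false) before the restart and flip it back to true once recovery is done.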
@bluelu it seems that the include_relocations setting was backported to 1.4, but the dynamic update code didn't make it (although it is registered for dynamic updates, see 4e5264c, which misses 4185566#diff-1b8dca987fbcfb8d8e452d7e29c4d058R92). I'm sorry for sending you down the wrong path. @dakrone I think it makes sense to add the dynamic update logic for 1.4.2, right?
@bleskes yes, most likely caused by a bad backport; I'll fix it in the 1.4 branch.
Fix for the setting issue in #8813
Did this also affect the persistent value, as this value is only read and set after recovery of the cluster state has started? We will apply both fixes to our branch and let you know how it worked on the next restart. (It can take 1-2 weeks until then.) Thanks!
This affects the setting set using the cluster settings update API. If the setting is set in …
We added more logging to the allocateUnassigned function in LocalGatewayAllocator (1.4.2).

The first iteration of primary allocation will take 5-10 minutes for 300 nodes started (all SSD nodes). Once primary allocation is done, secondary (replica) allocation starts. This also takes a lot of time if there are many unassigned shards (up to 10 minutes as well), going down to < 10 seconds if there are only a few left. During that time all cluster commands (e.g. create index, etc.) time out and won't be executed, as they time out after 30 seconds by default.

If we keep track of the shards which we have tried to assign (both replica and primary) and make sure that in each call we assign at least 1 shard, then we wouldn't have to run over all shards in each iteration, e.g. just check at most 100 shards in each iteration, or more until we assign at least 1 shard. When all shards have been handled, we would retry all shards again. I don't see a reason that allocateUnassigned must iterate over all unassigned primaries and shards at each iteration. Do you see an issue in doing this (e.g. data loss if not all primaries are directly allocated on the first call)? We are a little hesitant about implementing this because of that :-)
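To make the proposal above concrete, here is a small, self-contained sketch of the bounded-window idea; it is not Elasticsearch code, and ShardRef / tryAllocate are hypothetical placeholders for the real routing types and allocation deciders:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Predicate;

// Not Elasticsearch code: a sketch of looking at only a bounded window of
// unassigned shards per reroute pass while guaranteeing progress.
public final class WindowedAllocator<ShardRef> {

    private final Deque<ShardRef> unassigned = new ArrayDeque<>();
    private final int windowSize;

    public WindowedAllocator(int windowSize) {
        this.windowSize = windowSize;
    }

    public void add(ShardRef shard) {
        unassigned.addLast(shard);
    }

    // One pass: inspect at most windowSize shards, but keep going until at
    // least one shard is placed (or every pending shard has been looked at
    // once). Shards that cannot be placed yet are rotated to the back of the
    // queue so they are retried on a later pass.
    public int allocatePass(Predicate<ShardRef> tryAllocate) {
        int allocated = 0;
        int inspected = 0;
        final int pending = unassigned.size();
        while (!unassigned.isEmpty() && inspected < pending
                && (inspected < windowSize || allocated == 0)) {
            ShardRef shard = unassigned.pollFirst();
            inspected++;
            if (tryAllocate.test(shard)) {
                allocated++;
            } else {
                unassigned.addLast(shard);
            }
        }
        return allocated;
    }
}
```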
Yeah, this call is not algorithmically right (it goes per shard and not per node). We have plans to change this - no timeline yet :)
This is still slow - do you have hot threads from that period?
For what it's worth, not having all primaries immediately assigned is not a problem (as far as data loss goes; obviously you won't be able to index to those shards). It is already possible due to the node concurrent recoveries settings. I would like to first understand what is slow for you now - potentially we can find some simpler workaround.
Hey @bleskes, I'm working with @bluelu and @miccon. After some more analysis, there are two reasons why initial shard allocation is slow for us. First, the … Assume nodes of type A and B (these could be SSD and HDD nodes, for example). The nodes are marked as A or B through the attribute … The benchmark appended below (adapted from the class …) demonstrates this. In our actual setup (nodes of type A, B, C, D) things are even worse (>50 minutes total and >40 seconds for new shards).
Output:
Note that with some crude hacks (having BalancedShardsAllocator account for the different node types in earlier paths of the code) we reduced full allocation to a minute and allocation of new shards to a second.
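For readers unfamiliar with this kind of multi-node-type setup, the sketch below shows how it is typically expressed: a custom node attribute plus index-level allocation filtering. The attribute name "group" is purely hypothetical, since the comment above does not spell out the attribute actually used; the settings are built with the 1.x Java API:

```java
import org.elasticsearch.common.settings.ImmutableSettings;
import org.elasticsearch.common.settings.Settings;

public class NodeTypeSettingsExample {

    // A node is tagged with a custom attribute, e.g. node.group: "A".
    public static Settings nodeSettings(String nodeType) {
        return ImmutableSettings.settingsBuilder()
                .put("node.group", nodeType)
                .build();
    }

    // An index is then pinned to nodes of one type through allocation filtering.
    public static Settings indexSettings(String nodeType) {
        return ImmutableSettings.settingsBuilder()
                .put("index.routing.allocation.require.group", nodeType)
                .build();
    }
}
```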
Thx @ywelsch-t. This is a great reproduction. We'll look at it as soon as possible.
@ywelsch-t quick question in the meantime, can you elaborate on what you did:
Output:
Instead of iterating all shards of all indices to get all relocating shards for a given node, we can just use the RoutingNode#shardsWithState method and fetch all INITIALIZING / RELOCATING shards and check if they are relocating. This operation is much faster and uses pre-built data structures. Relates to elastic#6372
@ywelsch do you know if there are still improvements to make here? |
The remaining issue was about the BalancedShardsAllocator. I ran some tests comparing the performance of 1.5 (the version to which the remaining issue applied) to 1.7 and 2.x. On the test code I introduced above, 1.7 and 2.x both ran noticeably faster (around 2½ minutes instead of 15 minutes for 1.5). With the improvements in #15678, this was reduced to 1½ minutes. The rebalance step for a hot-warm setup with many nodes was also greatly improved with #15678 (taking 1 second instead of 10). I am closing this issue as I think all points are addressed.
We have about 1000 indexes, 20k shards and over 400 nodes in our cluster.
When we restart the cluster, it takes about 50 minutes to reach yellow state. All nodes (including the master node) seem to be idling.
It doesn't seem to be traffic related (the cluster state is only sent more often later, when the yellow state is being reached).
The master node seems to only request shard recovery every 11 minutes (but not for all shards), causing the long wait.
[2014-05-27 13:41:53,300][DEBUG][index.gateway ] [server5N5] [2013.06.24.0000_000][4] starting recovery from local ... {elasticsearch[server5N5][generic][T#1]}
[2014-05-27 13:41:53,325][DEBUG][index.gateway ] [server5N5] [2014.03.13.0000_000][1] starting recovery from local ... {elasticsearch[server5N5][generic][T#4]}
[2014-05-27 13:41:53,359][DEBUG][index.gateway ] [server5N5] [2012.07.06.0000_000][5] starting recovery from local ... {elasticsearch[server5N5][generic][T#5]}
[2014-05-27 13:41:53,367][DEBUG][index.gateway ] [server5N5] [2013.03.30.0000_000][4] starting recovery from local ... {elasticsearch[server5N5][generic][T#6]}
[2014-05-27 13:41:53,374][DEBUG][index.gateway ] [server5N5] [2012.05.27.0000_000][0] starting recovery from local ... {elasticsearch[server5N5][generic][T#7]}
[2014-05-27 13:41:53,462][DEBUG][index.gateway ] [server5N5] [2012.09.05.0000_000][0] starting recovery from local ... {elasticsearch[server5N5][generic][T#8]}
[2014-05-27 13:41:53,469][DEBUG][index.gateway ] [server5N5] [2013.06.17.0000_000][2] starting recovery from local ... {elasticsearch[server5N5][generic][T#9]}
[2014-05-27 13:41:53,552][DEBUG][index.gateway ] [server5N5] [2012.05.05.0000_000][9] starting recovery from local ... {elasticsearch[server5N5][generic][T#10]}
[2014-05-27 13:41:53,635][DEBUG][index.gateway ] [server5N5] [2012.05.22.0000_000][6] starting recovery from local ... {elasticsearch[server5N5][generic][T#10]}
[2014-05-27 13:41:53,642][DEBUG][index.gateway ] [server5N5] [2014.01.26.0000_000][0] starting recovery from local ... {elasticsearch[server5N5][generic][T#11]}
[2014-05-27 13:41:53,648][DEBUG][index.gateway ] [server5N5] [2012.11.09.0000_000][6] starting recovery from local ... {elasticsearch[server5N5][generic][T#12]}
[2014-05-27 13:41:53,678][DEBUG][index.gateway ] [server5N5] [2013.12.26.0000_000][1] starting recovery from local ... {elasticsearch[server5N5][generic][T#13]}
[2014-05-27 13:41:53,685][DEBUG][index.gateway ] [server5N5] [2014.01.11.0000_000][6] starting recovery from local ... {elasticsearch[server5N5][generic][T#14]}
[2014-05-27 13:41:53,693][DEBUG][index.gateway ] [server5N5] [2013.11.04.0000_000][1] starting recovery from local ... {elasticsearch[server5N5][generic][T#15]}
[2014-05-27 13:41:53,715][DEBUG][index.gateway ] [server5N5] [2014.02.18.0000_000][3] starting recovery from local ... {elasticsearch[server5N5][generic][T#16]}
[2014-05-27 13:41:53,739][DEBUG][index.gateway ] [server5N5] [2012.06.23.0000_000][4] starting recovery from local ... {elasticsearch[server5N5][generic][T#17]}
[2014-05-27 13:41:53,747][DEBUG][index.gateway ] [server5N5] [2013.07.07.0000_000][8] starting recovery from local ... {elasticsearch[server5N5][generic][T#18]}
[2014-05-27 13:41:53,754][DEBUG][index.gateway ] [server5N5] [2013.05.12.0000_000][6] starting recovery from local ... {elasticsearch[server5N5][generic][T#19]}
[2014-05-27 13:41:53,804][DEBUG][index.gateway ] [server5N5] [2013.02.15.0000_000][7] starting recovery from local ... {elasticsearch[server5N5][generic][T#20]}
[2014-05-27 13:41:53,995][DEBUG][index.gateway ] [server5N5] [2013.09.14.0000_000][4] starting recovery from local ... {elasticsearch[server5N5][generic][T#21]}
[2014-05-27 13:41:54,001][DEBUG][index.gateway ] [server5N5] [2012.11.04.0000_000][5] starting recovery from local ... {elasticsearch[server5N5][generic][T#22]}
[2014-05-27 13:41:54,007][DEBUG][index.gateway ] [server5N5] [2013.11.17.0000_000][3] starting recovery from local ... {elasticsearch[server5N5][generic][T#23]}
[2014-05-27 13:41:54,014][DEBUG][index.gateway ] [server5N5] [2014.03.15.0000_000][9] starting recovery from local ... {elasticsearch[server5N5][generic][T#24]}
[2014-05-27 13:41:54,088][DEBUG][index.gateway ] [server5N5] [2014.02.17.0000_000][7] starting recovery from local ... {elasticsearch[server5N5][generic][T#25]}
[2014-05-27 13:41:54,095][DEBUG][index.gateway ] [server5N5] [2013.08.05.0000_000][9] starting recovery from local ... {elasticsearch[server5N5][generic][T#26]}
[2014-05-27 13:41:54,500][DEBUG][index.gateway ] [server5N5] [2013.07.29.0000_000][0] starting recovery from local ... {elasticsearch[server5N5][generic][T#27]}
[2014-05-27 13:41:54,507][DEBUG][index.gateway ] [server5N5] [2014.04.02.0000_000][4] starting recovery from local ... {elasticsearch[server5N5][generic][T#28]}
[2014-05-27 13:41:54,514][DEBUG][index.gateway ] [server5N5] [2013.11.23.0000_000][7] starting recovery from local ... {elasticsearch[server5N5][generic][T#29]}
[2014-05-27 13:41:54,520][DEBUG][index.gateway ] [server5N5] [2013.01.04.0000_000][5] starting recovery from local ... {elasticsearch[server5N5][generic][T#30]}
[2014-05-27 13:41:54,526][DEBUG][index.gateway ] [server5N5] [2013.05.15.0000_000][6] starting recovery from local ... {elasticsearch[server5N5][generic][T#31]}
[2014-05-27 13:53:33,567][DEBUG][index.gateway ] [server5N5] [2012.12.13.0000_000][6] starting recovery from local ... {elasticsearch[server5N5][generic][T#307]}
[2014-05-27 13:53:33,645][DEBUG][index.gateway ] [server5N5] [2014.01.19.0000_000][6] starting recovery from local ... {elasticsearch[server5N5][generic][T#296]}
[2014-05-27 13:53:33,732][DEBUG][index.gateway ] [server5N5] [2014.02.06.0000_000][4] starting recovery from local ... {elasticsearch[server5N5][generic][T#295]}
[2014-05-27 13:53:33,760][DEBUG][index.gateway ] [server5N5] [2012.05.12.0000_000][6] starting recovery from local ... {elasticsearch[server5N5][generic][T#294]}
[2014-05-27 13:53:33,816][DEBUG][index.gateway ] [server5N5] [2013.01.09.0000_000][0] starting recovery from local ... {elasticsearch[server5N5][generic][T#313]}
[2014-05-27 13:53:34,005][DEBUG][index.gateway ] [server5N5] [2013.06.05.0000_000][6] starting recovery from local ... {elasticsearch[server5N5][generic][T#288]}
[2014-05-27 13:53:34,099][DEBUG][index.gateway ] [server5N5] [2012.11.25.0000_000][2] starting recovery from local ... {elasticsearch[server5N5][generic][T#257]}
[2014-05-27 13:53:34,257][DEBUG][index.gateway ] [server5N5] [2012.09.23.0000_000][3] starting recovery from local ... {elasticsearch[server5N5][generic][T#303]}
[2014-05-27 13:53:34,821][DEBUG][index.gateway ] [server5N5] [2012.06.21.0000_000][6] starting recovery from local ... {elasticsearch[server5N5][generic][T#304]}