When starting up, recovery of shards takes up to 50 minutes #6372

bluelu · 2014-06-02T09:11:51Z

We have about 1000 indexes, 20k shards and over 400 nodes in our cluster.

When we restart the cluster, it takes about 50 minutes to reach yellow state. All nodes (including the master node) seem to be idling.

It doesn't seem to be traffic related (the clsuter state is only sent more often later when yellow state is being reached).

The master node seems to only request shard recovery every 11 minutes (but not for all shards), causing the long wait.

[2014-05-27 13:41:53,300][DEBUG][index.gateway [2014-05-27 13:41:53,325][DEBUG][index.gateway [2014-05-27 13:41:53,359][DEBUG][index.gateway [2014-05-27 13:41:53,367][DEBUG][index.gateway [2014-05-27 13:41:53,374][DEBUG][index.gateway [2014-05-27 13:41:53,462][DEBUG][index.gateway [2014-05-27 13:41:53,469][DEBUG][index.gateway [2014-05-27 13:41:53,552][DEBUG][index.gateway [2014-05-27 13:41:53,635][DEBUG][index.gateway [2014-05-27 13:41:53,642][DEBUG][index.gateway [2014-05-27 13:41:53,648][DEBUG][index.gateway [2014-05-27 13:41:53,678][DEBUG][index.gateway [2014-05-27 13:41:53,685][DEBUG][index.gateway [2014-05-27 13:41:53,693][DEBUG][index.gateway [2014-05-27 13:41:53,715][DEBUG][index.gateway [2014-05-27 13:41:53,739][DEBUG][index.gateway [2014-05-27 13:41:53,747][DEBUG][index.gateway [2014-05-27 13:41:53,754][DEBUG][index.gateway [2014-05-27 13:41:53,804][DEBUG][index.gateway [2014-05-27 13:41:53,995][DEBUG][index.gateway [2014-05-27 13:41:54,001][DEBUG][index.gateway [2014-05-27 13:41:54,007][DEBUG][index.gateway [2014-05-27 13:41:54,014][DEBUG][index.gateway [2014-05-27 13:41:54,088][DEBUG][index.gateway [2014-05-27 13:41:54,095][DEBUG][index.gateway [2014-05-27 13:41:54,500][DEBUG][index.gateway [2014-05-27 13:41:54,507][DEBUG][index.gateway [2014-05-27 13:41:54,514][DEBUG][index.gateway [2014-05-27 13:41:54,520][DEBUG][index.gateway [2014-05-27 13:41:54,526][DEBUG][index.gateway [2014-05-27 13:53:33,567][DEBUG][index.gateway [2014-05-27 13:53:33,645][DEBUG][index.gateway [2014-05-27 13:53:33,732][DEBUG][index.gateway [2014-05-27 13:53:33,760][DEBUG][index.gateway [2014-05-27 13:53:33,816][DEBUG][index.gateway [2014-05-27 13:53:34,005][DEBUG][index.gateway [2014-05-27 13:53:34,099][DEBUG][index.gateway [2014-05-27 13:53:34,257][DEBUG][index.gateway [2014-05-27 13:53:34,821][DEBUG][index.gateway ] [server5N5] [2013.06.24.0000_000][4] starting recovery from local ... {elasticsearch[server5N5][generic][T#1]}
] [server5N5] [2014.03.13.0000_000][1] starting recovery from local ... {elasticsearch[server5N5][generic][T#4]}
] [server5N5] [2012.07.06.0000_000][5] starting recovery from local ... {elasticsearch[server5N5][generic][T#5]}
] [server5N5] [2013.03.30.0000_000][4] starting recovery from local ... {elasticsearch[server5N5][generic][T#6]}
] [server5N5] [2012.05.27.0000_000][0] starting recovery from local ... {elasticsearch[server5N5][generic][T#7]}
] [server5N5] [2012.09.05.0000_000][0] starting recovery from local ... {elasticsearch[server5N5][generic][T#8]}
] [server5N5] [2013.06.17.0000_000][2] starting recovery from local ... {elasticsearch[server5N5][generic][T#9]}
] [server5N5] [2012.05.05.0000_000][9] starting recovery from local ... {elasticsearch[server5N5][generic][T#10]}
] [server5N5] [2012.05.22.0000_000][6] starting recovery from local ... {elasticsearch[server5N5][generic][T#10]}
] [server5N5] [2014.01.26.0000_000][0] starting recovery from local ... {elasticsearch[server5N5][generic][T#11]}
] [server5N5] [2012.11.09.0000_000][6] starting recovery from local ... {elasticsearch[server5N5][generic][T#12]}
] [server5N5] [2013.12.26.0000_000][1] starting recovery from local ... {elasticsearch[server5N5][generic][T#13]}
] [server5N5] [2014.01.11.0000_000][6] starting recovery from local ... {elasticsearch[server5N5][generic][T#14]}
] [server5N5] [2013.11.04.0000_000][1] starting recovery from local ... {elasticsearch[server5N5][generic][T#15]}
] [server5N5] [2014.02.18.0000_000][3] starting recovery from local ... {elasticsearch[server5N5][generic][T#16]}
] [server5N5] [2012.06.23.0000_000][4] starting recovery from local ... {elasticsearch[server5N5][generic][T#17]}
] [server5N5] [2013.07.07.0000_000][8] starting recovery from local ... {elasticsearch[server5N5][generic][T#18]}
] [server5N5] [2013.05.12.0000_000][6] starting recovery from local ... {elasticsearch[server5N5][generic][T#19]}
] [server5N5] [2013.02.15.0000_000][7] starting recovery from local ... {elasticsearch[server5N5][generic][T#20]}
] [server5N5] [2013.09.14.0000_000][4] starting recovery from local ... {elasticsearch[server5N5][generic][T#21]}
] [server5N5] [2012.11.04.0000_000][5] starting recovery from local ... {elasticsearch[server5N5][generic][T#22]}
] [server5N5] [2013.11.17.0000_000][3] starting recovery from local ... {elasticsearch[server5N5][generic][T#23]}
] [server5N5] [2014.03.15.0000_000][9] starting recovery from local ... {elasticsearch[server5N5][generic][T#24]}
] [server5N5] [2014.02.17.0000_000][7] starting recovery from local ... {elasticsearch[server5N5][generic][T#25]}
] [server5N5] [2013.08.05.0000_000][9] starting recovery from local ... {elasticsearch[server5N5][generic][T#26]}
] [server5N5] [2013.07.29.0000_000][0] starting recovery from local ... {elasticsearch[server5N5][generic][T#27]}
] [server5N5] [2014.04.02.0000_000][4] starting recovery from local ... {elasticsearch[server5N5][generic][T#28]}
] [server5N5] [2013.11.23.0000_000][7] starting recovery from local ... {elasticsearch[server5N5][generic][T#29]}
] [server5N5] [2013.01.04.0000_000][5] starting recovery from local ... {elasticsearch[server5N5][generic][T#30]}
] [server5N5] [2013.05.15.0000_000][6] starting recovery from local ... {elasticsearch[server5N5][generic][T#31]}
] [server5N5] [2012.12.13.0000_000][6] starting recovery from local ... {elasticsearch[server5N5][generic][T#307]}
] [server5N5] [2014.01.19.0000_000][6] starting recovery from local ... {elasticsearch[server5N5][generic][T#296]}
] [server5N5] [2014.02.06.0000_000][4] starting recovery from local ... {elasticsearch[server5N5][generic][T#295]}
] [server5N5] [2012.05.12.0000_000][6] starting recovery from local ... {elasticsearch[server5N5][generic][T#294]}
] [server5N5] [2013.01.09.0000_000][0] starting recovery from local ... {elasticsearch[server5N5][generic][T#313]}
] [server5N5] [2013.06.05.0000_000][6] starting recovery from local ... {elasticsearch[server5N5][generic][T#288]}
] [server5N5] [2012.11.25.0000_000][2] starting recovery from local ... {elasticsearch[server5N5][generic][T#257]}
] [server5N5] [2012.09.23.0000_000][3] starting recovery from local ... {elasticsearch[server5N5][generic][T#303]}
] [server5N5] [2012.06.21.0000_000][6] starting recovery from local ... {elasticsearch[server5N5][generic][T#304]}

s1monw · 2014-06-02T09:18:57Z

hey,

can you give us some more info about what version of ES you are running?

bluelu · 2014-06-02T09:26:02Z

Hi, we use elasticsearch-1.0.2

Please keep in mind that his might also be related to
#6295
#5232

We set discovery.zen.publish_timeout:0 in order to start the cluster. Otherwise nodes wouldn't even join at first before the master node starts the recovery.

s1monw · 2014-06-02T09:27:52Z

it seems like you are setting a lot of things on the cluster. can you provide the settings you are using on the cluster aside of the defaults.

bluelu · 2014-06-02T09:44:07Z

Here it is:

Our river + master nodes have identical configuration, except that data node and master have datadir/master flags switched.

We only use one master in order to not have the split brain issue.

# Configuration Template for ES
# used by es data nodes (river is disabled)


# Set publish timeout to 0, otherwise cluster startup takes long time
# See https://github.com/elasticsearch/elasticsearch/issues/5232
discovery.zen.publish_timeout: 0


#lock memory
bootstrap.mlockall: true

# Cluster Settings
cluster:
  name: xxx

node:
  name: xxx
  t_cluster: xxx
  t_scluster: xxx
  master: false
  data: true
  max_local_storage_nodes: 1
  service: xxx

  # disable river allocation to data node
  river: _none_

#disabled cluster awareness -> allocation settings are now being set on index level.
#cluster.routing.allocation.awareness.attributes: rack_id

cluster.routing.allocation.node_initial_primaries_recoveries: 30
cluster.routing.allocation.node_concurrent_recoveries: 8
indices.recovery.max_bytes_per_sec: 100mb

path:
  logs: /tmp/searchdata_log/
  work: /elasticsearch_work/
  data: /ed1/,/ed2/

index:
  number_of_replicas: 1
  number_of_shards: 1
  query:
    bool:
      max_clause_count: 10240

# When changing the analysis settings, don't forget to also update
# com.xxx.modules.searchclient.analysis.SpecialNGramAnalyzer and
# com.xxx.modules.searchclient.analysis.xxxDefaultAnalyzer
  analysis:
    analyzer:
      xxx:
        type: custom
        tokenizer: uax_url_email
        filter: [standard, lowercase, elision_split, asciifolding]
      xxx-exact:
        type: custom
        tokenizer: whitespace
        filter: [lowercase, exact-match]
      xxx-exact-cs:
        type: custom
        tokenizer: whitespace
        filter: [exact-match]
      patternnewline:
        type: pattern
        pattern: "[\n\r]+"
        lowercase: false

# increase cache filter size to 30% (default is 20%)
indices.cache.filter.size: 30%

# Gateway Settings
gateway:
  recover_after_data_nodes: 425
  recover_after_time: 2m
  expected_data_nodes: 430
#gateway.local.compress: false
#gateway.local.pretty: true

# Use fixed ports, prevents es from starting multiple instances
# Should also already be prevented by max_local_storage_nodes set to 1
http.port: 9200
transport.tcp.port: 9300

discovery:
  zen.ping_timeout: 5s
  zen.minimum_master_nodes: 1
  zen.ping:
    multicast:
      enabled: false
    unicast:
      hosts: masternode[9300]
  zen.fd.ping_interval: 3s
  zen.fd.ping_timeout: 60s

script.native.relevance.type: com.xxx.modules.search.es.script.RelevanceScore
script.native.orelevance.type: com.xxx.modules.search.es.script.OldRelevanceScore
# script.native.sortvalue.type: com.xxx.modules.search.es.script.SortValue
script.native.dynamicdata.type: com.xxx.modules.search.es.script.DynamicDataScore
script.native.updateproject.type: com.xxx.modules.search.es.script.UpdateScript
script.native.projectcopies.type: com.xxx.modules.search.es.script.ProjectCopiesScript
script.native.propagatescript.type: com.xxx.modules.search.es.script.PropagateScript

# disable deleting of indices in a single API call
action.disable_delete_all_indices: true
# disable automatic index creation during indexation
action.auto_create_index: false

# since _all is disabled for our documents, we have to define another default_field
# edit: commented out, as the default_field has to be specified anyway during searching (content+title)
#       so we can leave the _all field enabled for auxiliary indices
# index.query.default_field: content

# disable automatic index creation for indices
action.auto_create_index: -dsearch_*,-nsearch_*,-tw_*,-twindex_*,-twproj_*,-twprojindex_*,-tw-*,+*

################################## Slow Log ##################################
# Shard level query and fetch threshold logging.

index.search.slowlog.level: TRACE
index.search.slowlog.threshold.query.warn: 30s
index.search.slowlog.threshold.query.info: 15s
index.search.slowlog.threshold.query.debug: 5s
index.search.slowlog.threshold.query.trace: 1s

index.search.slowlog.threshold.fetch.warn: 15s
index.search.slowlog.threshold.fetch.info: 10s
index.search.slowlog.threshold.fetch.debug: 5s
index.search.slowlog.threshold.fetch.trace: 1s

################################## GC Logging ################################

monitor.jvm.gc.ParNew.warn: 1000ms
monitor.jvm.gc.ParNew.info: 700ms
monitor.jvm.gc.ParNew.debug: 400ms

monitor.jvm.gc.ConcurrentMarkSweep.warn: 10s
monitor.jvm.gc.ConcurrentMarkSweep.info: 5s
monitor.jvm.gc.ConcurrentMarkSweep.debug: 2s

bluelu · 2014-06-05T09:15:39Z

Please let me know if you need any further information. thanks

ghost · 2014-09-15T08:23:34Z

While restarting with 1000 indexes, 20k shards and 500 nodes, the master node took 15 minutes to get to the initial allocation. We are still running elasticsearch 1.0.2.
Commit 6af80d5 does some optimizations about the allocation, but it seems that the optimization is only after getting the shard stores, although our nodes are mostly waiting for the responses from the cluster. The load on the master node is low (network and cpu wise) during the whole time:

"elasticsearch[I51N16][clusterService#updateTask][T#1]" daemon prio=10 tid=0x00007fc420005000 nid=0x2234 waiting on condition [0x00007fc4d5bc9000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x00000006ca9c15e8> (a org.elasticsearch.common.util.concurrent.BaseFuture$Sync)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:994)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1303)
        at org.elasticsearch.common.util.concurrent.BaseFuture$Sync.get(BaseFuture.java:274)
        at org.elasticsearch.common.util.concurrent.BaseFuture.get(BaseFuture.java:113)
        at org.elasticsearch.action.support.AdapterActionFuture.actionGet(AdapterActionFuture.java:45)
        at org.elasticsearch.gateway.local.LocalGatewayAllocator.buildShardStores(LocalGatewayAllocator.java:441)
        at org.elasticsearch.gateway.local.LocalGatewayAllocator.allocateUnassigned(LocalGatewayAllocator.java:279)
        at org.elasticsearch.cluster.routing.allocation.allocator.ShardsAllocators.allocateUnassigned(ShardsAllocators.java:74)
        at org.elasticsearch.cluster.routing.allocation.AllocationService.reroute(AllocationService.java:216)
        at org.elasticsearch.cluster.routing.allocation.AllocationService.reroute(AllocationService.java:159)
        at org.elasticsearch.cluster.routing.allocation.AllocationService.reroute(AllocationService.java:145)
        at org.elasticsearch.cluster.routing.RoutingService$1.execute(RoutingService.java:144)
        at org.elasticsearch.cluster.service.InternalClusterService$UpdateTask.run(InternalClusterService.java:308)
        at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:134)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:724)

It seems that the requests are being done sequentially, so maybe its possible to speed this up by running some requests in parallel / caching more information.

bleskes · 2014-09-29T11:54:33Z

@miccon in those 15m, does the master node report the correct number of nodes? I wonder if it is the time it takes for the 500 nodes to join the master. We improved the latter considerably in 1.4 (yet to be released): #7493 (see batch joining bullet point)

ghost · 2014-09-30T12:28:26Z

@bleskes I agree that the batch joining of the nodes will indeed help when the cluster starts, so it should help with issue #5232. In order to get the nodes to join quickly in 1.0 we have to set discovery.zen.publish_timeout to 0 as described in the other issue.

Here, these 15m are after the nodes have joined, so yes the master reports the correct number of nodes. It might be related to how the master queries the nodes for which shards are available / when it calculates the allocation.

bluelu · 2014-09-30T12:34:31Z

@bleskes I added more logs here at the end regarding to the slow recoery issue (miccon and I work together):
#6295

bleskes · 2014-09-30T19:06:10Z

The master node seems to only request shard recovery every 11 minutes (but not for all shards), causing the long wait.

Before recovering indices from disk, the master asks all the nodes about what they have on disk. To do so the nodes need some information that's part of the cluster state and if the don't have it they respond with "I don't know yet". The problem is that you have set discovery.zen.publish_timeout: 0, which means the master doesn't wait on the nodes and continues processing join events. It does not speed the processing on the node side. I suspect that's way recovering shards from disk takes long.

That said, do you know what happens in the 11 minutes? how big are the shards?

Last, with the cluster of your size, I would really recommend you upgrade as soon as you can . We have had so many optimizations that will help you (.batched joins , memory signature ... and many more)

shikhar · 2014-10-09T07:33:18Z

I have noticed cluster initialization 'hung' on the same cluster update task, like @miccon :

"elasticsearch[blabla-es0][clusterService#updateTask][T#1]" #79 daemon prio=5 os_prio=0 tid=0x00007fd16988d000 nid=0x6e01 waiting on condition [0x00007fd0bc279000]
   java.lang.Thread.State: WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for  <0x0000000614a22508> (a org.elasticsearch.common.util.concurrent.BaseFuture$Sync)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
    at org.elasticsearch.common.util.concurrent.BaseFuture$Sync.get(BaseFuture.java:274)
    at org.elasticsearch.common.util.concurrent.BaseFuture.get(BaseFuture.java:113)
    at org.elasticsearch.action.support.AdapterActionFuture.actionGet(AdapterActionFuture.java:45)
    at org.elasticsearch.gateway.local.LocalGatewayAllocator.buildShardStores(LocalGatewayAllocator.java:443)
    at org.elasticsearch.gateway.local.LocalGatewayAllocator.allocateUnassigned(LocalGatewayAllocator.java:281)
    at org.elasticsearch.cluster.routing.allocation.allocator.ShardsAllocators.allocateUnassigned(ShardsAllocators.java:74)
    at org.elasticsearch.cluster.routing.allocation.AllocationService.reroute(AllocationService.java:217)
    at org.elasticsearch.cluster.routing.allocation.AllocationService.applyStartedShards(AllocationService.java:86)
    at org.elasticsearch.cluster.action.shard.ShardStateAction$4.execute(ShardStateAction.java:278)
    at org.elasticsearch.cluster.service.InternalClusterService$UpdateTask.run(InternalClusterService.java:328)
    at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:153)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

Notes:

This is on a cluster using eskka discovery which might be reason to discount this report. However AFAICT I don't think it is doing anything wrong. eskka does publishing asynchronously, same as the effect of discovery.zen.publish_timeout: 0 with Zen -- so that seems to be a common feature here
The tdump is from the master node, and it was aware of all cluster nodes at this point and had published a few cluster state versions prior to this happening
ES version 1.3.4, Nodes: 20 Indices: 13 Shards: 1122 Data: 748.24 GB
This has happened a few times during bounce and gets resolved by bouncing the cluster again. Seems like some sort of race condition. Normally going from start to green only takes a couple of minutes.

bleskes · 2014-10-09T08:23:18Z

@shikhar when the master assigns replicas, it first asks all the nodes what files they have for that shard on disk. The idea is to assign it to the nodes that has most files already available. The stack trace you're seeing is master waiting on a node to answer this question. I can see optimizations we can do here, but but this requests should be relatively quick.

How long does the master wait?
do you see high cpu on it?
If you stack trace it a couple of times, does the lock pointer change? In your example it's - 0x0000000614a22508 . This will indicate the master is issuing multiple requests as opposed to hang on one.

shikhar · 2014-10-09T09:22:32Z

For now I can only answer

How long does the master wait?

As opposed to the normal cluster init time of a couple of minutes, it seems to be taking over 10-15 mins by which time alerts fire so we re-bounce it

As for

do you see high cpu on it?

If you stack trace it a couple of times, does the lock pointer change? In your example it's - 0x0000000614a22508 . This will indicate the master is issuing multiple requests as opposed to hang on one.

I will be sure to check these out next time. Thanks @bleskes!

bluelu · 2014-10-24T09:10:19Z

Hi,
Just a short update on our tests:
We created a 200 node empty cluster on 1.3.4, and the joining process took more than 45 minutes.
It worked fine (only 1-3 minutes) in 1.4.0 beta, so we will wait for the 1.4.0 version until we upgrade.

After 1.4.0 is release, we will then also run some tests with production indexes, to see if the long shard initialisation phase has also improved.

bleskes · 2014-10-24T09:32:10Z

@bluelu great news. Indeed 1.4 massively improved the time it takes to form large clusters by batching join requests.

W.r.t shard initialization time, let's see how it goes. We still need to reach out to all the nodes and ask them for information about what shards they have on disk before primaries can be allocated and cluster becomes yellow.

bluelu · 2014-10-24T10:08:54Z

Just an idea there (Don't know how it's done at the moment, but I guess it's like that at the moment also in 1.4.0):
I suppose that asking all nodes for their shards, also means that the shards need to do an integrity check (e.g. the new checksum checks, etc...) on their shards before they report back to the master node, which in our case could take some time on some servers (e.g. non ssd disks) with multiple shards, as it involves reading a lot of TBs of data per node.
If this process could be splitted in two, first list available shards (no integrety check, just check what is on the filesystem, should be very fast) and return, and start the current allocation phase. But during that phase, you could already allocate the primaries as soon as all nodes have finished the initialisation phases which contain a specific shard, as you know in advance already that no other nodes contain that shard.

shikhar · 2014-12-03T12:56:30Z

@bleskes

do you see high cpu on it?

master cpu is pretty low

If you stack trace it a couple of times, does the lock pointer change? In your example it's - 0x0000000614a22508 . This will indicate the master is issuing multiple requests as opposed to hang on one.

it does change

I mainly wanted to update that in our case the problem might possibly be due to using JDK8u5. I was able to capture some more diagnostics when bouncing the nodes following one such event of super-slow cluster init. We have some automated thread-dumping if a node takes too long to go down nicely and kill-9 needs to be used. The thread dump on this (non-master) node revealed a bunch of threads executing code relevant to the RPC's issued by the master:

"elasticsearch[blabla][generic][T#43]" #160 daemon prio=5 os_prio=0 tid=0x00007f79180c5800 nid=0x3d1f in Object.wait() [0x00007f79d9289000]
   java.lang.Thread.State: RUNNABLE
    at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:359)
    at org.apache.lucene.index.SegmentInfos$1.doBody(SegmentInfos.java:457)
    at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:912)
    at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:758)
    at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:453)
    at org.elasticsearch.common.lucene.Lucene.readSegmentInfos(Lucene.java:98)
    at org.elasticsearch.index.store.Store.readSegmentsInfo(Store.java:126)
    at org.elasticsearch.index.store.Store.access$300(Store.java:76)
    at org.elasticsearch.index.store.Store$MetadataSnapshot.buildMetadata(Store.java:465)
    at org.elasticsearch.index.store.Store$MetadataSnapshot.<init>(Store.java:456)
    at org.elasticsearch.index.store.Store.readMetadataSnapshot(Store.java:281)
    at org.elasticsearch.indices.store.TransportNodesListShardStoreMetaData.listStoreMetaData(TransportNodesListShardStoreMetaData.java:186)
    at org.elasticsearch.indices.store.TransportNodesListShardStoreMetaData.nodeOperation(TransportNodesListShardStoreMetaData.java:140)
    at org.elasticsearch.indices.store.TransportNodesListShardStoreMetaData.nodeOperation(TransportNodesListShardStoreMetaData.java:61)
    at org.elasticsearch.action.support.nodes.TransportNodesOperationAction$NodeTransportHandler.messageReceived(TransportNodesOperationAction.java:277)
    at org.elasticsearch.action.support.nodes.TransportNodesOperationAction$NodeTransportHandler.messageReceived(TransportNodesOperationAction.java:268)
    at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:275)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

The really weird thing is that these threads are reported to be RUNNABLE although they are supposedly in Object.wait(), and I can't even spot any relevant synchronization in SegmentInfos.read().

Anyway we have seen this weirdness a couple of times in some non-ES usage as well. We plan to upgrade our JDK8 version and hopefully this occasional issue will stop happening altogether.

UPDATE May 14, 15 https://issues.apache.org/jira/browse/LUCENE-6482 - unrelated to JDK version

bluelu · 2014-12-06T10:21:03Z

@bleskes
We upgraded now to 1.4.1:

Cluster join has increased much, that's great.

Still the allocation will take a lot of time (it's now more than 30 minutes since cluster start and it hasn't much progressed, see below status it's still hanging at "initializing_shards" : 8390).
The master node runs one Thread at 100% cpu. It's not stuck as the stack progresses.

elasticsearch[I51N16][clusterService#updateTask][T#1]" #52 daemon
prio=5 os_prio=0 tid=0x00007f1ad25d9000 nid=0x127f runnable
[0x00007f1a22171000]
java.lang.Thread.State: RUNNABLE
at org.elasticsearch.common.collect.UnmodifiableListIterator.(UnmodifiableListIterator.java:34)
at org.elasticsearch.common.collect.AbstractIndexedListIterator.(AbstractIndexedListIterator.java:68)
at org.elasticsearch.common.collect.Iterators$11.(Iterators.java:1058)
at org.elasticsearch.common.collect.Iterators.forArray(Iterators.java:1058)
at org.elasticsearch.common.collect.RegularImmutableList.listIterator(RegularImmutableList.java:106)
at org.elasticsearch.common.collect.ImmutableList.listIterator(ImmutableList.java:344)
at org.elasticsearch.common.collect.ImmutableList.iterator(ImmutableList.java:340)
at org.elasticsearch.cluster.routing.IndexShardRoutingTable.iterator(IndexShardRoutingTable.java:173)
at org.elasticsearch.cluster.routing.IndexShardRoutingTable.iterator(IndexShardRoutingTable.java:46)
at org.elasticsearch.cluster.routing.IndexShardRoutingTable.shardsWithState(IndexShardRoutingTable.java:552)
at org.elasticsearch.cluster.routing.IndexRoutingTable.shardsWithState(IndexRoutingTable.java:268)
at org.elasticsearch.cluster.routing.RoutingTable.shardsWithState(RoutingTable.java:114)
at org.elasticsearch.cluster.routing.allocation.decider.DiskThresholdDecider.sizeOfRelocatingShards(DiskThresholdDecider.java:225)
at org.elasticsearch.cluster.routing.allocation.decider.DiskThresholdDecider.canAllocate(DiskThresholdDecider.java:288)
at org.elasticsearch.cluster.routing.allocation.decider.AllocationDeciders.canAllocate(AllocationDeciders.java:74)
at org.elasticsearch.cluster.routing.allocation.allocator.BalancedShardsAllocator$Balancer.tryRelocateShard(BalancedShardsAllocator.java:799)
at org.elasticsearch.cluster.routing.allocation.allocator.BalancedShardsAllocator$Balancer.balance(BalancedShardsAllocator.java:426)
at org.elasticsearch.cluster.routing.allocation.allocator.BalancedShardsAllocator.rebalance(BalancedShardsAllocator.java:124)
at org.elasticsearch.cluster.routing.allocation.allocator.BalancedShardsAllocator.allocateUnassigned(BalancedShardsAllocator.java:118)
at org.elasticsearch.cluster.routing.allocation.allocator.ShardsAllocators.allocateUnassigned(ShardsAllocators.java:75)
at org.elasticsearch.cluster.routing.allocation.AllocationService.reroute(AllocationService.java:217)
at org.elasticsearch.cluster.routing.allocation.AllocationService.applyStartedShards(AllocationService.java:86)
at org.elasticsearch.cluster.action.shard.ShardStateAction$4.execute(ShardStateAction.java:281)
at org.elasticsearch.cluster.service.InternalClusterService$UpdateTask.run(InternalClusterService.java:329)
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:153)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Cluster health:

{
"cluster_name" : "cluster",
"status" : "red",
"timed_out" : false,
"number_of_nodes" : 651,
"number_of_data_nodes" : 650,
"active_primary_shards" : 0,
"active_shards" : 0,
"relocating_shards" : 0,
"initializing_shards" : 8390,
"unassigned_shards" : 16108
}

bleskes · 2014-12-06T10:28:28Z

@Thibaut Thx for the update. I'll read more carefully later but can you run http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/cluster-nodes-hot-threads.html on the master a couple of times? I wonder what it does.

On Sat, Dec 6, 2014 at 11:21 AM, Thibaut [email protected] wrote:

@bleskes
We upgraded now to 1.4.1:
Cluster join has increased much, that's great.
Still the allocation will take a lot of time (it's now more than 30 minutes since cluster start and it hasn't much progressed, see below status it's still hanging at "initializing_shards" : 8390).
The master node runs one Thread at 100% cpu. It's not stuck as the stack progresses.
elasticsearch[I51N16][clusterService#updateTask][T#1]" #52 daemon
prio=5 os_prio=0 tid=0x00007f1ad25d9000 nid=0x127f runnable
[0x00007f1a22171000]
java.lang.Thread.State: RUNNABLE
at org.elasticsearch.common.collect.UnmodifiableListIterator.(UnmodifiableListIterator.java:34)
at org.elasticsearch.common.collect.AbstractIndexedListIterator.(AbstractIndexedListIterator.java:68)
at org.elasticsearch.common.collect.Iterators$11.(Iterators.java:1058)
at org.elasticsearch.common.collect.Iterators.forArray(Iterators.java:1058)
at org.elasticsearch.common.collect.RegularImmutableList.listIterator(RegularImmutableList.java:106)
at org.elasticsearch.common.collect.ImmutableList.listIterator(ImmutableList.java:344)
at org.elasticsearch.common.collect.ImmutableList.iterator(ImmutableList.java:340)
at org.elasticsearch.cluster.routing.IndexShardRoutingTable.iterator(IndexShardRoutingTable.java:173)
at org.elasticsearch.cluster.routing.IndexShardRoutingTable.iterator(IndexShardRoutingTable.java:46)
at org.elasticsearch.cluster.routing.IndexShardRoutingTable.shardsWithState(IndexShardRoutingTable.java:552)
at org.elasticsearch.cluster.routing.IndexRoutingTable.shardsWithState(IndexRoutingTable.java:268)
at org.elasticsearch.cluster.routing.RoutingTable.shardsWithState(RoutingTable.java:114)
at org.elasticsearch.cluster.routing.allocation.decider.DiskThresholdDecider.sizeOfRelocatingShards(DiskThresholdDecider.java:225)
at org.elasticsearch.cluster.routing.allocation.decider.DiskThresholdDecider.canAllocate(DiskThresholdDecider.java:288)
at org.elasticsearch.cluster.routing.allocation.decider.AllocationDeciders.canAllocate(AllocationDeciders.java:74)
at org.elasticsearch.cluster.routing.allocation.allocator.BalancedShardsAllocator$Balancer.tryRelocateShard(BalancedShardsAllocator.java:799)
at org.elasticsearch.cluster.routing.allocation.allocator.BalancedShardsAllocator$Balancer.balance(BalancedShardsAllocator.java:426)
at org.elasticsearch.cluster.routing.allocation.allocator.BalancedShardsAllocator.rebalance(BalancedShardsAllocator.java:124)
at org.elasticsearch.cluster.routing.allocation.allocator.BalancedShardsAllocator.allocateUnassigned(BalancedShardsAllocator.java:118)
at org.elasticsearch.cluster.routing.allocation.allocator.ShardsAllocators.allocateUnassigned(ShardsAllocators.java:75)
at org.elasticsearch.cluster.routing.allocation.AllocationService.reroute(AllocationService.java:217)
at org.elasticsearch.cluster.routing.allocation.AllocationService.applyStartedShards(AllocationService.java:86)
at org.elasticsearch.cluster.action.shard.ShardStateAction$4.execute(ShardStateAction.java:281)
at org.elasticsearch.cluster.service.InternalClusterService$UpdateTask.run(InternalClusterService.java:329)
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:153)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Cluster health:
{
"cluster_name" : "cluster",
"status" : "red",
"timed_out" : false,
"number_of_nodes" : 651,
"number_of_data_nodes" : 650,
"active_primary_shards" : 0,
"active_shards" : 0,
"relocating_shards" : 0,
"initializing_shards" : 8390,
"unassigned_shards" : 16108

}

Reply to this email directly or view it on GitHub:
#6372 (comment)

bluelu · 2014-12-06T10:42:35Z

@bleskes
I uploaded 4 hothreads outputs of the master node to pastebin:
http://pastebin.com/H0mQcAFE

In the meantime, the allocation advanced a little bit:

{
"cluster_name" : "cluster",
"status" : "red",
"timed_out" : false,
"number_of_nodes" : 651,
"number_of_data_nodes" : 650,
"active_primary_shards" : 3468,
"active_shards" : 3468,
"relocating_shards" : 1,
"initializing_shards" : 7402,
"unassigned_shards" : 13628
}

What we normally do (or did before) was that we only added a few nodes first (main indexes, 200-300 nodes), waited for it to come up, and then added the other nodes 1 by 1 (since we could afford to have some indexes red in the beginning) afterwards, while having the allocating/balancing disabled.

bleskes · 2014-12-06T11:17:23Z

I see. The disk threshold allocator decider, in charge of making sure a node is not overloaded with shards is calculating the size of relocating shards walking the shards list again and again. We'd have to make it more efficient. A simple work around is to temporally set

curl -XPUT localhost:9200/_cluster/settings -d '{
    "transient" : {
        "cluster.routing.allocation.disk.include_relocations" : false
    }
}'

and enable it while start up is done. Can you try?

Instead of iterating all shards of all indices to get all relocating shards for a given node we can just use the RoutingNode#shardsWithState method and fetch all INITIALIZING / RELOCATING shards and check if they are relocating. This operation is much faster and uses pre-build data-structures. Relates to #6372

bluelu · 2014-12-07T10:03:26Z

@bleskes
It seems that both the reallocate and rebalance functions are just too slow if you have too many nodes and shards. We also use the Shard allocation filtering.

What do you need in order to be able to debug this and reproduce the performance issue? Cluster state, settings and configuration? I can send you this in private?

bleskes · 2014-12-07T12:00:03Z

@Thibaut I can't look at things in details right now. It is surprising that disabling include relocation didn't kick in. Being able to reroute quickly is important for the operations of the master. As a temporary work around (as I see from the tickets that it causes other problems as well) try disabling the disk threshold allocator all together.

Ps - Simon alread fixed that slowness we saw in your hot threads: #8803

On Sun, Dec 7, 2014 at 11:03 AM, Thibaut [email protected] wrote:

@bleskes
It seems that both the reallocate and rebalance functions are just too slow if you have too many nodes and shards. We also use the Shard allocation filtering.

What do you need in order to be able to debug this and reproduce the performance issue? Cluster state, settings and configuration? I can send you this in private?

Reply to this email directly or view it on GitHub:
#6372 (comment)

bleskes · 2014-12-08T08:20:42Z

@bluelu it seems that the include_relocations settings were backported to 1.4, but the dynamic update code didn't make it (although it is registered for dynamic updates, see 4e5264c ). Which misses 4185566#diff-1b8dca987fbcfb8d8e452d7e29c4d058R92 ). I'm sorry for sending you down a wrong path. @dakrone I think it makes sense to add the dynamic update logic for 1.4.2, right?

dakrone · 2014-12-08T09:20:31Z

@bleskes yes, most likely caused by a bad backport, I'll fix in the 1.4 branch.

dakrone · 2014-12-08T09:36:31Z

Fix for the setting issue in #8813

bluelu · 2014-12-08T10:05:37Z

Did this also affect the persistent value, as this value is only read and set after recovery of the cluster state has been started?

We will apply both fixes to our branch and let you know how it worked on next restart. (It can take 1-2 weeks till this) Thanks!

dakrone · 2014-12-08T10:10:46Z

Did this also affect the persistent value, as this value is only read and set after recovery of the cluster state has been started?

The affects the setting set using the cluster settings update API. If the setting is set in elasticsearch.yml it will still take effect.

bluelu · 2014-12-24T11:57:37Z

We added more logging to the allocateUnassigned function in LocalGatewayAllocator. (1.4.2)
While performance is fine when the cluster has allocated all shards, it's still rather slow at startup time and as long as there are many unassigned shards still to be assigned. We still have disabled cluster.routing.allocation.disk.include_relocation.

The first iteration of primary allocation will take 5-10 minutes for 300 nodes started (all SSD nodes)
The next iterations will be much faster (still a few seconds) until nothing is left.
It's caused by the following call in buildShardStates: TransportNodesListGatewayStartedShards.NodesLocalGatewayStartedShards response = listGatewayStartedShards.list(shard.shardId(), nodesIdsArray, listTimeout).actionGet();
which is executed sequentially for each node. As I see it this could easily be multithreaded.

If primary allocation is done, secondary allocation starts. This also takes a lot of time if there are many unassigned shards (up to 10 minutes as well), going down to < 10 seconds if there only a few left. During that time all cluster commands (e.g. create index, etc.. ) timeout and won't be executed as they timeout after 30 seconds by default.

If we keep track of the shards which we have tried to assign (both replicate and primary) and make sure that in each call we assign at least 1 shard, then we wouldn't have to run over all shards in each iteration, e.g. just check at most 100 shards in each iteration or more until we assign at least 1 shard? When all shards have been handled, we would retry all shards again.
Or otherwise just periodically call the unassigned function (no dog piling though) every 30 seconds.

I don't see a reason that allocateunassigned must iterate over all unasigned primaries and shards at each iteration? Do you see an issue in doing this (e.g. data loss that on the first call not all primaries will be directly allocated)? We are little hesitant about implementing this because of that :-)

bleskes · 2015-01-05T20:57:42Z

It's caused by the following call in buildShardStates: TransportNodesListGatewayStartedShards.NodesLocalGatewayStartedShards response = listGatewayStartedShards.list(shard.shardId(), nodesIdsArray, listTimeout).actionGet();
which is executed sequentially for each node. As I see it this could easily be multithreaded.

Yeah, this call is not algorithmically right (it goes per shard and not per node). We have plans to change this - no time line yet :)..

going down to < 10 seconds if there only a few left.

This is still slow - do you have hot threads from that period?

Do you see an issue in doing this (e.g. data loss that on the first call not all primaries will be directly allocated)?

for what it's worth, not having all primaries immediately assigned is not a problem (as far as data loss goes, obviously you want be able to index to those shards). It is already possible due to the node concurrent recoveries settings.

I would like to first understand what slow for you now - potentially we can find some simpler work around.

ywelsch-t · 2015-03-06T16:12:24Z

Hey @bleskes, I'm working with @bluelu and @miccon

After some more analysis, there are two reasons why initial shard allocation is slow for us. First, the LocalGatewayAllocator is slow. It will hopefully be adressed as part of #9502. Second, the BalancedShardsAllocator is slow. In the following, I will detail why this is the case for our setup.

Assume nodes of type A and B (could be SSD and HDD nodes for example). The nodes are marked as A or B through the attribute node_type. We now want to ensure that some of the indices are only allocated on A nodes and others only on B nodes. This is realized by setting index.routing.allocation.require.node_type either to A or B.

The benchmark appended below (adapted from the class ClusterAllocationRerouteBenchmark) applies the BalancedShardsAllocator to the cluster setup that was described here. It shows the following outputs: Full allocation takes 17 minutes and allocating a new index takes 30 seconds.

In our actual setup (nodes of type A,B,C,D) things are even worse (>50 minutes total and >40 seconds for new shards).

package org.elasticsearch.benchmark.cluster;

import com.google.common.collect.ImmutableMap;
import org.elasticsearch.cluster.ClusterName;
import org.elasticsearch.cluster.ClusterState;
import org.elasticsearch.cluster.metadata.IndexMetaData;
import org.elasticsearch.cluster.metadata.MetaData;
import org.elasticsearch.cluster.node.DiscoveryNodes;
import org.elasticsearch.cluster.routing.RoutingTable;
import org.elasticsearch.cluster.routing.allocation.AllocationService;
import org.elasticsearch.cluster.routing.allocation.RoutingAllocation;
import org.elasticsearch.common.logging.ESLogger;
import org.elasticsearch.common.logging.Loggers;
import org.elasticsearch.common.settings.ImmutableSettings;
import org.elasticsearch.common.unit.TimeValue;
import org.elasticsearch.test.ElasticsearchAllocationTestCase;

import java.util.Random;

import static org.elasticsearch.cluster.routing.ShardRoutingState.INITIALIZING;

public class SlowClusterAllocationRerouteIssue {

    private static final ESLogger logger = Loggers.getLogger(SlowClusterAllocationRerouteIssue.class);

    private static final String NODE_TYPE = "node_type";

    public static void main(String[] args) {
        final int numberOfRuns = 1;        

        final int numAIndices = 1000;
        final int numAShards = 10;
        final int numAReplicas = 1;

        final int numBIndices = 3000;
        final int numBShards = 1;
        final int numBReplicas = 1;

        final int numberOfANodes = 300;
        final int numberOfBNodes = 300;

        AllocationService strategy = ElasticsearchAllocationTestCase.createAllocationService(ImmutableSettings.builder()
                .put("cluster.routing.allocation.node_initial_primaries_recoveries", "30")
                .put("cluster.routing.allocation.node_concurrent_recoveries", "1")
                .build(), new Random(1));

        MetaData.Builder mb = MetaData.builder();
        for (int i = 1; i <= numAIndices; i++) {
            mb.put(IndexMetaData.builder("a_index" + i).settings(
                    ImmutableSettings.settingsBuilder().put("index.routing.allocation.require." + NODE_TYPE, "A"))
                    .numberOfShards(numAShards).numberOfReplicas(numAReplicas));
        }
        for (int i = 1; i <= numBIndices; i++) {
            mb.put(IndexMetaData.builder("b_index" + i).settings(
                    ImmutableSettings.settingsBuilder().put("index.routing.allocation.require." + NODE_TYPE, "B"))
                    .numberOfShards(numBShards).numberOfReplicas(numBReplicas));
        }
        MetaData metaData = mb.build();

        RoutingTable.Builder rb = RoutingTable.builder();
        for (int i = 1; i <= numAIndices; i++) {
            rb.addAsNew(metaData.index("a_index" + i));
        }
        for (int i = 1; i <= numBIndices; i++) {
            rb.addAsNew(metaData.index("b_index" + i));
        }
        RoutingTable routingTable = rb.build();

        DiscoveryNodes.Builder nb = DiscoveryNodes.builder();
        for (int i = 1; i <= numberOfANodes; i++) {
            nb.put(ElasticsearchAllocationTestCase.newNode("a_node" + i, ImmutableMap.of(NODE_TYPE, "A")));
        }
        for (int i = 1; i <= numberOfBNodes; i++) {
            nb.put(ElasticsearchAllocationTestCase.newNode("b_node" + i, ImmutableMap.of(NODE_TYPE, "B")));
        }

        ClusterState initialClusterState = ClusterState.builder(ClusterName.DEFAULT).metaData(metaData).routingTable(routingTable).nodes(nb).build();

        long start = System.currentTimeMillis();
        for (int i = 0; i < numberOfRuns; i++) {
            logger.info("[{}] starting... ", i);
            long runStart = System.currentTimeMillis();
            ClusterState clusterState = initialClusterState;
            while (clusterState.readOnlyRoutingNodes().hasUnassignedShards()) {
                logger.info("[{}] remaining unassigned {}", i, clusterState.readOnlyRoutingNodes().unassigned().size());
                RoutingAllocation.Result result = strategy.applyStartedShards(clusterState, clusterState.readOnlyRoutingNodes().shardsWithState(INITIALIZING));
                clusterState = ClusterState.builder(clusterState).routingResult(result).build();
                result = strategy.reroute(clusterState);
                clusterState = ClusterState.builder(clusterState).routingResult(result).build();
            }
            logger.info("[{}] took {}", i, TimeValue.timeValueMillis(System.currentTimeMillis() - runStart));
        }
        long took = System.currentTimeMillis() - start;
        logger.info("total took {}, AVG {}", TimeValue.timeValueMillis(took), TimeValue.timeValueMillis(took / numberOfRuns));
    }
}

Output:

[2015-03-06 16:07:28,080][INFO ][org.elasticsearch.benchmark.cluster] [0] starting... 
[2015-03-06 16:07:28,119][INFO ][org.elasticsearch.benchmark.cluster] [0] remaining unassigned 26000
[2015-03-06 16:07:42,633][INFO ][org.elasticsearch.benchmark.cluster] [0] remaining unassigned 14000
[2015-03-06 16:08:12,875][INFO ][org.elasticsearch.benchmark.cluster] [0] remaining unassigned 12700
[2015-03-06 16:08:42,247][INFO ][org.elasticsearch.benchmark.cluster] [0] remaining unassigned 12100
[2015-03-06 16:09:11,950][INFO ][org.elasticsearch.benchmark.cluster] [0] remaining unassigned 11500
[2015-03-06 16:09:42,086][INFO ][org.elasticsearch.benchmark.cluster] [0] remaining unassigned 10900
[2015-03-06 16:10:12,184][INFO ][org.elasticsearch.benchmark.cluster] [0] remaining unassigned 10300
[2015-03-06 16:10:42,527][INFO ][org.elasticsearch.benchmark.cluster] [0] remaining unassigned 9700
[2015-03-06 16:11:13,257][INFO ][org.elasticsearch.benchmark.cluster] [0] remaining unassigned 9100
[2015-03-06 16:11:44,172][INFO ][org.elasticsearch.benchmark.cluster] [0] remaining unassigned 8500
[2015-03-06 16:12:15,184][INFO ][org.elasticsearch.benchmark.cluster] [0] remaining unassigned 7900
[2015-03-06 16:12:46,170][INFO ][org.elasticsearch.benchmark.cluster] [0] remaining unassigned 7300
[2015-03-06 16:13:16,607][INFO ][org.elasticsearch.benchmark.cluster] [0] remaining unassigned 7000
[2015-03-06 16:13:45,755][INFO ][org.elasticsearch.benchmark.cluster] [0] remaining unassigned 6700
[2015-03-06 16:14:14,863][INFO ][org.elasticsearch.benchmark.cluster] [0] remaining unassigned 6400
[2015-03-06 16:14:43,883][INFO ][org.elasticsearch.benchmark.cluster] [0] remaining unassigned 6100
[2015-03-06 16:15:12,949][INFO ][org.elasticsearch.benchmark.cluster] [0] remaining unassigned 5800
[2015-03-06 16:15:41,760][INFO ][org.elasticsearch.benchmark.cluster] [0] remaining unassigned 5500
[2015-03-06 16:16:10,331][INFO ][org.elasticsearch.benchmark.cluster] [0] remaining unassigned 5200
[2015-03-06 16:16:39,046][INFO ][org.elasticsearch.benchmark.cluster] [0] remaining unassigned 4900
[2015-03-06 16:17:07,332][INFO ][org.elasticsearch.benchmark.cluster] [0] remaining unassigned 4600
[2015-03-06 16:17:35,698][INFO ][org.elasticsearch.benchmark.cluster] [0] remaining unassigned 4300
[2015-03-06 16:18:03,819][INFO ][org.elasticsearch.benchmark.cluster] [0] remaining unassigned 4000
[2015-03-06 16:18:31,773][INFO ][org.elasticsearch.benchmark.cluster] [0] remaining unassigned 3700
[2015-03-06 16:18:59,798][INFO ][org.elasticsearch.benchmark.cluster] [0] remaining unassigned 3400
[2015-03-06 16:19:27,735][INFO ][org.elasticsearch.benchmark.cluster] [0] remaining unassigned 3100
[2015-03-06 16:19:55,748][INFO ][org.elasticsearch.benchmark.cluster] [0] remaining unassigned 2800
[2015-03-06 16:20:23,786][INFO ][org.elasticsearch.benchmark.cluster] [0] remaining unassigned 2500
[2015-03-06 16:20:51,744][INFO ][org.elasticsearch.benchmark.cluster] [0] remaining unassigned 2200
[2015-03-06 16:21:19,735][INFO ][org.elasticsearch.benchmark.cluster] [0] remaining unassigned 1900
[2015-03-06 16:21:47,910][INFO ][org.elasticsearch.benchmark.cluster] [0] remaining unassigned 1600
[2015-03-06 16:22:16,249][INFO ][org.elasticsearch.benchmark.cluster] [0] remaining unassigned 1300
[2015-03-06 16:22:44,494][INFO ][org.elasticsearch.benchmark.cluster] [0] remaining unassigned 1000
[2015-03-06 16:23:12,843][INFO ][org.elasticsearch.benchmark.cluster] [0] remaining unassigned 700
[2015-03-06 16:23:40,748][INFO ][org.elasticsearch.benchmark.cluster] [0] remaining unassigned 400
[2015-03-06 16:24:08,520][INFO ][org.elasticsearch.benchmark.cluster] [0] remaining unassigned 200
[2015-03-06 16:24:26,921][INFO ][org.elasticsearch.benchmark.cluster] [0] took 16.9m
[2015-03-06 16:24:26,922][INFO ][org.elasticsearch.benchmark.cluster] total took 16.9m, AVG 16.9m

Note that with some crude hacks (having BalancedShardsAllocator account for the different node types in earlier pathes of the code) we reduced full allocation to a minute and allocation of new shards to a second.

bleskes · 2015-03-08T20:06:35Z

Thx @ywelsch-t . This is a great reproduction. We'll look at it as soon as possible.

bleskes · 2015-03-08T20:20:10Z

@ywelsch-t quick question in the meantime, can you elaborate on what you did:

Note that with some crude hacks (having BalancedShardsAllocator account for the different node types in earlier pathes of the code) we reduced full allocation to a minute and allocation of new shards to a second.

ywelsch-t · 2015-03-09T10:13:03Z

See ywelsch-t@3b9e5b2

With a profiler, I found that the two biggest pain points were BalancedShardsAllocator.Balancer.balance() and BalancedShardsAllocator.Balancer.allocateUnassigned()
Both methods now only consider the nodes which have the right node-type for the relevant shard/index.
The hack assumes that node_type references a single sub-cluster.

Output:

[2015-03-09 10:06:25,202][INFO ][org.elasticsearch.benchmark.cluster] [0] starting... 
[2015-03-09 10:06:25,237][INFO ][org.elasticsearch.benchmark.cluster] [0] remaining unassigned 26000
[2015-03-09 10:06:32,212][INFO ][org.elasticsearch.benchmark.cluster] [0] remaining unassigned 14000
[2015-03-09 10:06:35,434][INFO ][org.elasticsearch.benchmark.cluster] [0] remaining unassigned 12700
[2015-03-09 10:06:37,528][INFO ][org.elasticsearch.benchmark.cluster] [0] remaining unassigned 12100
[2015-03-09 10:06:39,269][INFO ][org.elasticsearch.benchmark.cluster] [0] remaining unassigned 11500
[2015-03-09 10:06:40,870][INFO ][org.elasticsearch.benchmark.cluster] [0] remaining unassigned 10900
[2015-03-09 10:06:42,787][INFO ][org.elasticsearch.benchmark.cluster] [0] remaining unassigned 10300
[2015-03-09 10:06:44,564][INFO ][org.elasticsearch.benchmark.cluster] [0] remaining unassigned 9700
[2015-03-09 10:06:46,231][INFO ][org.elasticsearch.benchmark.cluster] [0] remaining unassigned 9100
[2015-03-09 10:06:48,106][INFO ][org.elasticsearch.benchmark.cluster] [0] remaining unassigned 8500
[2015-03-09 10:06:49,904][INFO ][org.elasticsearch.benchmark.cluster] [0] remaining unassigned 7900
[2015-03-09 10:06:51,635][INFO ][org.elasticsearch.benchmark.cluster] [0] remaining unassigned 7300
[2015-03-09 10:06:53,041][INFO ][org.elasticsearch.benchmark.cluster] [0] remaining unassigned 7000
[2015-03-09 10:06:54,742][INFO ][org.elasticsearch.benchmark.cluster] [0] remaining unassigned 6700
[2015-03-09 10:06:56,395][INFO ][org.elasticsearch.benchmark.cluster] [0] remaining unassigned 6400
[2015-03-09 10:06:57,950][INFO ][org.elasticsearch.benchmark.cluster] [0] remaining unassigned 6100
[2015-03-09 10:06:59,304][INFO ][org.elasticsearch.benchmark.cluster] [0] remaining unassigned 5800
[2015-03-09 10:07:00,799][INFO ][org.elasticsearch.benchmark.cluster] [0] remaining unassigned 5500
[2015-03-09 10:07:02,430][INFO ][org.elasticsearch.benchmark.cluster] [0] remaining unassigned 5200
[2015-03-09 10:07:03,876][INFO ][org.elasticsearch.benchmark.cluster] [0] remaining unassigned 4900
[2015-03-09 10:07:05,354][INFO ][org.elasticsearch.benchmark.cluster] [0] remaining unassigned 4600
[2015-03-09 10:07:07,119][INFO ][org.elasticsearch.benchmark.cluster] [0] remaining unassigned 4300
[2015-03-09 10:07:08,757][INFO ][org.elasticsearch.benchmark.cluster] [0] remaining unassigned 4000
[2015-03-09 10:07:10,183][INFO ][org.elasticsearch.benchmark.cluster] [0] remaining unassigned 3700
[2015-03-09 10:07:11,694][INFO ][org.elasticsearch.benchmark.cluster] [0] remaining unassigned 3400
[2015-03-09 10:07:13,301][INFO ][org.elasticsearch.benchmark.cluster] [0] remaining unassigned 3100
[2015-03-09 10:07:14,869][INFO ][org.elasticsearch.benchmark.cluster] [0] remaining unassigned 2800
[2015-03-09 10:07:16,371][INFO ][org.elasticsearch.benchmark.cluster] [0] remaining unassigned 2500
[2015-03-09 10:07:17,859][INFO ][org.elasticsearch.benchmark.cluster] [0] remaining unassigned 2200
[2015-03-09 10:07:19,440][INFO ][org.elasticsearch.benchmark.cluster] [0] remaining unassigned 1900
[2015-03-09 10:07:21,087][INFO ][org.elasticsearch.benchmark.cluster] [0] remaining unassigned 1600
[2015-03-09 10:07:22,607][INFO ][org.elasticsearch.benchmark.cluster] [0] remaining unassigned 1300
[2015-03-09 10:07:24,130][INFO ][org.elasticsearch.benchmark.cluster] [0] remaining unassigned 1000
[2015-03-09 10:07:25,709][INFO ][org.elasticsearch.benchmark.cluster] [0] remaining unassigned 700
[2015-03-09 10:07:27,378][INFO ][org.elasticsearch.benchmark.cluster] [0] remaining unassigned 400
[2015-03-09 10:07:28,882][INFO ][org.elasticsearch.benchmark.cluster] [0] remaining unassigned 200
[2015-03-09 10:07:29,756][INFO ][org.elasticsearch.benchmark.cluster] [0] took 1m
[2015-03-09 10:07:29,756][INFO ][org.elasticsearch.benchmark.cluster] total took 1m, AVG 1m

Instead of iterating all shards of all indices to get all relocating shards for a given node we can just use the RoutingNode#shardsWithState method and fetch all INITIALIZING / RELOCATING shards and check if they are relocating. This operation is much faster and uses pre-build data-structures. Relates to elastic#6372

clintongormley · 2015-10-14T16:41:53Z

@ywelsch do you know if there are still improvements to make here?

pickypg · 2015-10-22T17:25:57Z

@ywelsch-t ^

ywelsch · 2015-12-28T19:08:10Z

The remaining issue was about the BalancedShardsAllocator. I ran some tests comparing the performance of 1.5 (the version to which applied the remaining issue) to 1.7 and 2.x. On the test code I introduced above, 1.7 and 2.x both ran noticeably faster (around 2½ minutes instead of 15 minutes for 1.5). With improvements in #15678, this was reduced to 1½ minutes. The rebalance step for hot-warm setup with many nodes was also greatly improved with #15678 (taking 1 second instead of 10).

I am closing this issue as I think all points are addressed.

bluelu mentioned this issue Jul 4, 2014

Slow cluster startup with zen discovery. #5232

Closed

bluelu mentioned this issue Sep 23, 2014

Only send diff of cluster state instead of full cluster state #6295

Closed

clintongormley added the discuss label Sep 24, 2014

clintongormley added feedback_needed stalled and removed discuss feedback_needed labels Oct 24, 2014

bleskes added feedback_needed and removed stalled labels Oct 24, 2014

bluelu mentioned this issue Dec 7, 2014

Nodes can't join anymore after they were killed (1.4.1) #8804

Closed

clintongormley added discuss and removed feedback_needed labels Dec 24, 2014

clintongormley assigned bleskes May 5, 2015

clintongormley added the :Allocation label May 5, 2015

clintongormley mentioned this issue Nov 21, 2015

Add unit testing framework for updating settings #8817

Closed

ywelsch mentioned this issue Dec 28, 2015

Speed improvements for BalancedShardsAllocator #15678

Merged

ywelsch removed the discuss label Dec 28, 2015

ywelsch closed this as completed Dec 28, 2015

lcawl added :Distributed Indexing/Distributed A catch all label for anything in the Distributed Area. Please avoid if you can. and removed :Allocation labels Feb 13, 2018

When starting up, recovery of shards takes up to 50 minutes #6372

When starting up, recovery of shards takes up to 50 minutes #6372

Comments

bluelu commented Jun 2, 2014

s1monw commented Jun 2, 2014

bluelu commented Jun 2, 2014

s1monw commented Jun 2, 2014

bluelu commented Jun 2, 2014

bluelu commented Jun 5, 2014

ghost commented Sep 15, 2014

bleskes commented Sep 29, 2014

ghost commented Sep 30, 2014

bluelu commented Sep 30, 2014

bleskes commented Sep 30, 2014

shikhar commented Oct 9, 2014

bleskes commented Oct 9, 2014

shikhar commented Oct 9, 2014

bluelu commented Oct 24, 2014

bleskes commented Oct 24, 2014

bluelu commented Oct 24, 2014

shikhar commented Dec 3, 2014

bluelu commented Dec 6, 2014

bleskes commented Dec 6, 2014

}

bluelu commented Dec 6, 2014

bleskes commented Dec 6, 2014

bluelu commented Dec 7, 2014

bleskes commented Dec 7, 2014

What do you need in order to be able to debug this and reproduce the performance issue? Cluster state, settings and configuration? I can send you this in private?

bleskes commented Dec 8, 2014

dakrone commented Dec 8, 2014

dakrone commented Dec 8, 2014

bluelu commented Dec 8, 2014

dakrone commented Dec 8, 2014

bluelu commented Dec 24, 2014

bleskes commented Jan 5, 2015

ywelsch-t commented Mar 6, 2015

bleskes commented Mar 8, 2015

bleskes commented Mar 8, 2015

ywelsch-t commented Mar 9, 2015

clintongormley commented Oct 14, 2015

pickypg commented Oct 22, 2015

ywelsch commented Dec 28, 2015