
Curious case of draining a data node in the presence of X-Pack Security #32340

Closed
jordansissel opened this issue Jul 24, 2018 · 6 comments
Labels
:Distributed Coordination/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) :Security/Security Security issues without another label

Comments

@jordansissel
Contributor

jordansissel commented Jul 24, 2018

Describe the feature:

When x-pack security is enabled and at least one data node is excluded with cluster.routing.allocation.exclude._ip, the cluster state will never show green because a single .security shard remains unallocated.

Ideally, cluster state green would be usable in a draining scenario even when X-Pack Security is active.

Elasticsearch version (bin/elasticsearch --version): docker.elastic.co/elasticsearch/elasticsearch-platinum:6.2.3

Plugins installed: x-pack and whatever else comes on the docker image.

JVM version (java -version): docker.elastic.co/elasticsearch/elasticsearch-platinum:6.2.3

OS version (uname -a if on a Unix-like system): docker.elastic.co/elasticsearch/elasticsearch-platinum:6.2.3

Description of the problem including expected versus actual behavior:

A few weeks ago, I migrated my Elasticsearch data nodes to bigger machines. To do this, I roughly followed these steps:

  1. Bring new nodes online and join the cluster
  2. Drain one old machine at a time with shard allocation filtering
  3. Proceed with additional drains, one at a time, only when the cluster finishes moving shards. I intended to observe this with cluster state "green".
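Step 2 above can be sketched with cluster-level shard allocation filtering. This is a minimal example, assuming a local cluster; the IP is a placeholder for the node being drained:

```shell
# Exclude one old data node (placeholder IP) from shard allocation;
# Elasticsearch then relocates its shards onto the remaining nodes.
curl -X PUT "localhost:9200/_cluster/settings" \
  -H 'Content-Type: application/json' -d '
{
  "transient": {
    "cluster.routing.allocation.exclude._ip": "10.0.0.1"
  }
}'
```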

My plan was to drain a single node and wait for the cluster to be green. However, it seems the way X-Pack Security configures the .security index prevents this. If there are 6 data nodes, the number of replicas for .security will be 5. When draining a node with shard allocation filtering, this means the cluster will complete draining and still be yellow!
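For context, and as an assumption about the 6.x behavior consistent with the auto-expand discussion later in this thread: the .security index is created with index.auto_expand_replicas set to 0-all, so its replica count follows the number of data nodes. A quick way to check (placeholder host):

```shell
# Inspect the .security-6 index settings; with auto_expand_replicas "0-all",
# number_of_replicas tracks (data node count - 1), so each drained node
# leaves one .security replica that can never be assigned.
curl -s "localhost:9200/.security-6/_settings?pretty"
```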

I understand solving this may be a challenge given the complexities (replica count based on data nodes, shard allocation filtering to exclude whole nodes, etc), but I wanted to raise it for discussion.

My current workaround is to query /_cat/shards, look for UNASSIGNED shards, and conclude "draining complete" only when every remaining UNASSIGNED shard belongs to the .security index (one per drained data node).

Example:

index                              shard prirep state          docs   store ip          node
.security-6                        0     r      UNASSIGNED                              
.security-6                        0     r      UNASSIGNED                              
.security-6                        0     r      UNASSIGNED                              
.security-6                        0     r      UNASSIGNED                              

For the above, I expect one UNASSIGNED per drained data node. In the case above, I have 4 drained data nodes, so 4 .security replica shards are unassigned.
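The workaround can be sketched as a small shell check. The sample data below stands in for a live _cat/shards call; index names and counts are illustrative:

```shell
# Sample output in _cat/shards column order:
#   index shard prirep state docs store ip node
shards='logs-2018.07 0 p STARTED 100 1mb 10.0.0.2 node-2
.security-6 0 r UNASSIGNED'
# Against a live cluster this would instead be:
#   shards=$(curl -s "localhost:9200/_cat/shards")

# Count UNASSIGNED shards that do NOT belong to the .security index.
non_security_unassigned=$(printf '%s\n' "$shards" \
  | awk '$4 == "UNASSIGNED" && $1 !~ /^\.security/' | wc -l)

# Draining is complete when only .security replicas remain unassigned.
if [ "$non_security_unassigned" -eq 0 ]; then
  echo "draining complete"
else
  echo "still draining: $non_security_unassigned shard(s) left"
fi
```

Against the sample data this prints "draining complete", since the only UNASSIGNED row is a .security-6 replica.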

Steps to reproduce:

  1. Enable X-Pack Security
  2. Use shard allocation filtering to exclude a single node
  3. Observe that the cluster never goes green, because a single .security replica shard remains unassigned.
@dnhatn dnhatn added the :Security/Security Security issues without another label label Jul 24, 2018
@elasticmachine
Collaborator

Pinging @elastic/es-security

@dnhatn dnhatn added the :Distributed Coordination/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) label Jul 24, 2018
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@jasontedor
Member

@jaymode I think this will be sufficiently addressed by the plan to stop using auto-expand replicas, is that right?

@jaymode
Member

jaymode commented Jul 25, 2018

That is correct

@tvernum
Contributor

tvernum commented Jul 25, 2018

You guys just managed to beat me to this.
Related issue: #29933

@ywelsch
Contributor

ywelsch commented Jul 25, 2018

Closed in favor of #29933

@ywelsch ywelsch closed this as completed Jul 25, 2018
7 participants