Do not use auto_expand_replicas #3580

Merged
merged 1 commit into openshift:master from lukas-vlcek:no_auto_expand_replicas on Mar 24, 2017

Conversation

lukas-vlcek
Contributor

Do not use the auto_expand_replicas feature. It is associated with unnecessary data recovery from other nodes.

See elastic/elasticsearch#1873 for details.
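
In practice the change boils down to replacing the auto-expand range with a fixed replica count in the ES config template; roughly along these lines (values illustrative, the actual diff is further down in this PR):

index:
  number_of_shards: 1
  number_of_replicas: 1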

@sdodson sdodson requested a review from ewolinetz March 7, 2017 16:12
@lukas-vlcek
Contributor Author

The following is a simple reproduction script:

# Start two ES nodes and for clarity make sure each node stores data to different location

# ES 5.2.2
./bin/elasticsearch -E path.data=./node1/data -E path.logs=./node1/logs
./bin/elasticsearch -E path.data=./node2/data -E path.logs=./node2/logs

# ES 2.4.4
# ./bin/elasticsearch -Dpath.data=./node1/data -Dpath.logs=./node1/logs
# ./bin/elasticsearch -Dpath.data=./node2/data -Dpath.logs=./node2/logs

After the nodes are up and the cluster has formed:

ES1=http://localhost:9200
ES2=http://localhost:9201

# curl $ES1/_cluster/health?pretty
# curl $ES2/_cluster/health?pretty

# Prepare index template using the "auto_expand_replicas"
curl -X POST $ES1/_template/auto_index -d '{
  "template": "auto*",
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0,
    "auto_expand_replicas": "0-3"
  }
}'

# Another index template using 1/1 shard/replica schema
curl -X POST $ES1/_template/fixed_index -d '{
  "template": "fixed*",
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 1
  }
}'

# Index two documents. Each goes to different index.
# - "auto" index uses 'auto_index_expand'
# - "fixed" index uses 1/1 schema
curl -X POST $ES1/auto/type -d '{ "name": "foo" }'
curl -X POST $ES1/fixed/type -d '{ "name": "foo" }'

# ... and let's make sure the data is flushed to disk
# so we can clearly track Lucene segment files on the FS
curl -X POST "$ES1/_flush?wait_if_ongoing=true&force=true"

At this point we can check _cat/indices:

curl $ES1/_cat/indices?v

health status index uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   auto  6OLcLCRhT8GM34aExedFkg   1   1          1            0      6.6kb          3.3kb
green  open   fixed miBt1Om1SsegDb67XNpmIg   1   1          1            0      6.6kb          3.3kb
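
To also see which node holds which copy of each shard, _cat/shards is a handy side check (output omitted here; the prirep column distinguishes primaries from replicas and the node column shows placement):

curl "$ES1/_cat/shards?v"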

And we can also check the data store of each node:

$ cd ./node1/data/nodes/0/indices
$ du -h
4.0K	./6OLcLCRhT8GM34aExedFkg/0/_state
 16K	./6OLcLCRhT8GM34aExedFkg/0/index
8.0K	./6OLcLCRhT8GM34aExedFkg/0/translog
 28K	./6OLcLCRhT8GM34aExedFkg/0
4.0K	./6OLcLCRhT8GM34aExedFkg/_state
 32K	./6OLcLCRhT8GM34aExedFkg
4.0K	./miBt1Om1SsegDb67XNpmIg/0/_state
 16K	./miBt1Om1SsegDb67XNpmIg/0/index
8.0K	./miBt1Om1SsegDb67XNpmIg/0/translog
 28K	./miBt1Om1SsegDb67XNpmIg/0
4.0K	./miBt1Om1SsegDb67XNpmIg/_state
 32K	./miBt1Om1SsegDb67XNpmIg
 64K	.
$ cd ./node2/data/nodes/0/indices
$ du -h
4.0K	./6OLcLCRhT8GM34aExedFkg/0/_state
 16K	./6OLcLCRhT8GM34aExedFkg/0/index
8.0K	./6OLcLCRhT8GM34aExedFkg/0/translog
 28K	./6OLcLCRhT8GM34aExedFkg/0
4.0K	./6OLcLCRhT8GM34aExedFkg/_state
 32K	./6OLcLCRhT8GM34aExedFkg
4.0K	./miBt1Om1SsegDb67XNpmIg/0/_state
 16K	./miBt1Om1SsegDb67XNpmIg/0/index
8.0K	./miBt1Om1SsegDb67XNpmIg/0/translog
 28K	./miBt1Om1SsegDb67XNpmIg/0
4.0K	./miBt1Om1SsegDb67XNpmIg/_state
 32K	./miBt1Om1SsegDb67XNpmIg
 64K	.

Both nodes hold data for both indices.

Now, let's disable allocation:

curl -X PUT $ES1/_cluster/settings -d '{
  "transient": {"cluster.routing.allocation.enable": "none"}
}'
# {"acknowledged":true,"persistent":{},"transient":{"cluster":{"routing":{"allocation":{"enable":"none"}}}}}

Shut down node2.

# The following will appear in log of node1:
# [CJxI8Hy] updating number_of_replicas to [0] for indices [auto]
# [CJxI8Hy] [auto/6OLcLCRhT8GM34aExedFkg] auto expanded replicas to [0]

Start node2 again:

./bin/elasticsearch -E path.data=./node2/data -E path.logs=./node2/logs

# its log will stop at:
# [mQvaM96] started

Now the data for the auto index (6OLcLCRhT8GM34aExedFkg) is gone from the node2 store:

$ du -h
4.0K	./6OLcLCRhT8GM34aExedFkg/_state
4.0K	./6OLcLCRhT8GM34aExedFkg
4.0K	./miBt1Om1SsegDb67XNpmIg/0/_state
 16K	./miBt1Om1SsegDb67XNpmIg/0/index
8.0K	./miBt1Om1SsegDb67XNpmIg/0/translog
 28K	./miBt1Om1SsegDb67XNpmIg/0
4.0K	./miBt1Om1SsegDb67XNpmIg/_state
 32K	./miBt1Om1SsegDb67XNpmIg
 36K	.

Now, let's enable allocation:

curl -X PUT $ES1/_cluster/settings -d '{
  "transient": {"cluster.routing.allocation.enable": "all"}
}'
# {"acknowledged":true,"persistent":{},"transient":{"cluster":{"routing":{"allocation":{"enable":"all"}}}}}

And the data for the auto index is recovered from the other node:

$ du -h
4.0K	./6OLcLCRhT8GM34aExedFkg/0/_state
 16K	./6OLcLCRhT8GM34aExedFkg/0/index
8.0K	./6OLcLCRhT8GM34aExedFkg/0/translog
 28K	./6OLcLCRhT8GM34aExedFkg/0
4.0K	./6OLcLCRhT8GM34aExedFkg/_state
 32K	./6OLcLCRhT8GM34aExedFkg
4.0K	./miBt1Om1SsegDb67XNpmIg/0/_state
 16K	./miBt1Om1SsegDb67XNpmIg/0/index
8.0K	./miBt1Om1SsegDb67XNpmIg/0/translog
 28K	./miBt1Om1SsegDb67XNpmIg/0
4.0K	./miBt1Om1SsegDb67XNpmIg/_state
 32K	./miBt1Om1SsegDb67XNpmIg
 64K	.
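
The same recovery can also be observed through the _cat/recovery API (a side check; the exact column names differ slightly between ES 2.x and 5.x):

curl "$ES1/_cat/recovery?v"
# the type/stage and bytes columns show how much data each shard
# had to pull over the network from another node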

For a more detailed investigation, it is possible to turn on logging like this:

  indices.recovery: TRACE
  index.shard: TRACE

This brings related low-level info into the ES logs.
Note: the logging configuration changed a bit in ES 5.x; the above tip works for ES 2.x.
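
For ES 5.x a rough equivalent is to raise the logger levels dynamically through the cluster settings API (a sketch; the logger names below are my best guess at the matching packages):

curl -X PUT $ES1/_cluster/settings -d '{
  "transient": {
    "logger.org.elasticsearch.indices.recovery": "TRACE",
    "logger.org.elasticsearch.index.shard": "TRACE"
  }
}'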

@ewolinetz
Contributor

@lukas-vlcek the intent of using auto_expand_replicas was to reduce the amount of administrative work that users needed to do with their EFK stack as deployed by OCP. This way a customer's cluster would be green even if they only have a single node (and the node wouldn't need to maintain information about replica shards that it may never be able to place).

Are we going to start recommending that customers update the number of replicas based on their cluster size, and also provide commands/documentation on how they would accomplish this? That is my concern with no longer using auto_expand_replicas.

@lukas-vlcek
Contributor Author

@ewolinetz I understand and welcome the effort to minimize user maintenance work, and I am not against it. But this ticket is not about ease of maintenance; it tries to describe the issues associated with using auto_expand_replicas. IMO it introduces serious problems (especially in our logging use case) and I do not see any justification for keeping it in the out-of-the-box experience any longer. If you want to keep it, then please make sure users are well informed about the associated risks and downsides so they can make a qualified decision.

From what I understand, when auto_expand_replicas is used, the local copy of node data is always dropped when a node joins the cluster, no matter whether shard allocation is turned on or off, and no matter whether the node shutdown/startup is controlled by the user or is the result of network load, a long GC run, or whatever. In the case of logging (especially given our existing setup, i.e. 1 primary shard and 2 or 3 replica shards) this means that every node joining a 2 or 3 node cluster will always pull all the data from the other nodes. Are the associated costs worth it?
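
A side check on the reproduction above: the index settings show this happening, because the master keeps rewriting number_of_replicas as nodes come and go while auto_expand_replicas itself stays fixed:

curl "$ES1/auto/_settings?pretty"
# "auto_expand_replicas" stays "0-3", while "number_of_replicas"
# flips between 0 and 1 as node2 leaves and rejoins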

//cc @portante

@lukas-vlcek
Contributor Author

@ewolinetz As for the second part of your question:

Are we going to start recommending that customers update the number of replicas based on their cluster size, and also provide commands/documentation on how they would accomplish this?

I think unless we can reliably automate this for users, providing tooling sounds like a good option to me. What is wrong with a cluster in yellow state compared to a single-node cluster in green state? With a single-node cluster aren't you at permanent risk of losing service availability anyway?
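
For the record, the manual command itself is tiny; something along these lines would raise the replica count on all existing indices (a sketch; the value 1 is illustrative, and new indices follow whatever the config/template says):

curl -X PUT $ES1/_settings -d '{ "index": { "number_of_replicas": 1 } }'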

@richm
Contributor

richm commented Mar 9, 2017

I think we just need to document the cases where having a yellow cluster is OK. It is really a problem for large deployments if nodes have to be unnecessarily initialized with Elasticsearch data.

@ewolinetz
Contributor

@lukas-vlcek can you also open a PR for this against release-1.5?

@@ -7,8 +7,7 @@ script:

 index:
   number_of_shards: 1
-  number_of_replicas: 0
-  auto_expand_replicas: 0-2
+  number_of_replicas: 1
Contributor

Let's make this 0 by default, and give the customers a setting to change it to the value they want.

I would think both of these numbers (shard and replica counts) should be templated to allow a variable to change their values.

Contributor

Can I make the argument that this is something we can specify as part of the install? Or, if a customer states they want an ES cluster size of 2 or more, we would then change this value?

Or, if it is not specified, we would change this value based on the intended cluster size (0-2) but then give users the ability to override it if they so desire.
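
A sketch of what that could look like in the template, assuming a cluster-size variable such as openshift_logging_es_cluster_size is available (the variable name and the 0-2 cap are only illustrative):

  number_of_replicas: "{{ [ (openshift_logging_es_cluster_size | int) - 1, 2 ] | min }}"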

Contributor

@ewolinetz ewolinetz left a comment

can we make the number_of_replicas configurable instead?

@jcantrill
Contributor

@lukas-vlcek It looks like this PR is to address https://bugzilla.redhat.com/show_bug.cgi?id=1430910. Could you also update the commit message to 'bug 1430910. MESSAGE HERE', which will add the commit info to the PR?

@jcantrill
Contributor

@ewolinetz If we are to make this configurable, we should consider how to incorporate these changes as part of a map. It is going to become exceedingly unwieldy to create a var like 'openshift_logging_ADDYOURCONFIGOPTIONHERE' every time we add these tweaks. It probably should be something like:

openshift_logging_es_settings:
   number_of_shards: 1
   number_of_replicas: 0

@ewolinetz
Contributor

@jcantrill I'd just be hesitant to increase the complexity simply to reduce verbosity.

@jcantrill
Contributor

@ewolinetz I don't see this as being that much more complex. Ansible already supports the notion of variable files, which are YAML, as opposed to inventory files. This seems like a more advanced configuration anyway, where we might expect users to have, or be able to utilize, variable files.
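
For example, a small vars file could carry these settings and be passed with -e (the file name is illustrative; the variable names match the ones discussed in this PR):

# logging_vars.yml
openshift_logging_es_number_of_shards: 1
openshift_logging_es_number_of_replicas: 1

ansible-playbook -i <inventory> <logging-playbook> -e @logging_vars.yml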

@ewolinetz
Contributor

@lukas-vlcek can you update the template for the ES config to use two variables, one for primary shards and one for replicas? Something along the lines of openshift_logging_es_primary_shards and openshift_logging_es_replicas (and then ops instances).

The defaults should probably be 1 primary and 0 or 1 replicas (either should be fine; customers are currently used to having a yellow state if they only have one node).

@lukas-vlcek
Contributor Author

@ewolinetz I will do it on Monday (on PTO today 😎 ).

@lukas-vlcek lukas-vlcek force-pushed the no_auto_expand_replicas branch from 80381c1 to c4a615e Compare March 20, 2017 11:54
@lukas-vlcek
Contributor Author

@ewolinetz let me know if this is what you meant (/me not expert on Ansible and its variables):

number_of_shards: {{openshift_logging_es_primary_shards | default ('1')}}
number_of_replicas: {{openshift_logging_es_replicas | default ('1')}}

@lukas-vlcek
Contributor Author

@ewolinetz Maybe I should also add the used variables to both roles/openshift_logging/defaults/main.yml and roles/openshift_logging/vars/main.yaml?

@portante
Contributor

@lukas-vlcek, I think the default for shard replication should be "0".

@lukas-vlcek lukas-vlcek force-pushed the no_auto_expand_replicas branch from c4a615e to e4f8ffb Compare March 20, 2017 13:55
@lukas-vlcek
Contributor Author

@portante updated

@ewolinetz
Contributor

@lukas-vlcek if you could just add them to defaults/main.yaml, that would be good. Entries in vars/main.yaml are generally not meant to be changed.

We should likely also have ops instances of these, since customers may want to configure them differently. That would mean that we would then pass the values into the templates in tasks/install_elasticsearch.yaml instead, if you wouldn't mind doing so.
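
Roughly, as a sketch (the es_* names are the template variables from this PR; the openshift_logging_es_ops_* names are illustrative), the ops pass would set something like:

  vars:
    es_number_of_shards: "{{ openshift_logging_es_ops_number_of_shards | default(1) }}"
    es_number_of_replicas: "{{ openshift_logging_es_ops_number_of_replicas | default(0) }}"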

Can you also update the README for the role just to explain what these vars do?

Otherwise LGTM

@lukas-vlcek lukas-vlcek force-pushed the no_auto_expand_replicas branch from e4f8ffb to b203e12 Compare March 21, 2017 13:13
@lukas-vlcek
Contributor Author

@ewolinetz I think I need some help figuring out how to modify roles/openshift_logging/tasks/install_elasticsearch.yaml to make this complete and shine.

@lukas-vlcek lukas-vlcek force-pushed the no_auto_expand_replicas branch 2 times, most recently from 072947d to b73bd39 Compare March 21, 2017 13:43
@lukas-vlcek
Contributor Author

@ewolinetz I did another update, PTAL

@@ -72,6 +72,8 @@ When both `openshift_logging_install_logging` and `openshift_logging_upgrade_log
- `openshift_logging_es_recover_after_time`: The amount of time ES will wait before it tries to recover. Defaults to '5m'.
- `openshift_logging_es_storage_group`: The storage group used for ES. Defaults to '65534'.
- `openshift_logging_es_nodeselector`: A map of labels (e.g. {"node":"infra","region":"west"} to select the nodes where the pod will land.
- `openshift_logging_es_number_of_shards`: The number of shards for every new index created in ES. Defaults to '1'.
Contributor

"The number of primary shards"

@@ -88,6 +90,8 @@ same as above for their non-ops counterparts, but apply to the OPS cluster insta
- `openshift_logging_es_ops_pvc_prefix`: logging-es-ops
- `openshift_logging_es_ops_recover_after_time`: 5m
- `openshift_logging_es_ops_storage_group`: 65534
- `openshift_logging_es_ops_number_of_shards`: The number of shards for every new index created in ES. Defaults to '1'.
Contributor

"The number of primary shards"

@lukas-vlcek lukas-vlcek force-pushed the no_auto_expand_replicas branch from 82526cb to 44a5dcc Compare March 21, 2017 15:37
@lukas-vlcek
Contributor Author

@ewolinetz updated

@ewolinetz
Contributor

aos-ci-test

@@ -134,6 +136,8 @@
openshift_logging_es_recover_after_time: "{{openshift_logging_es_ops_recover_after_time}}"
Contributor

Should this be openshift_logging_es_recover_after_time? Or should that be openshift_logging_es_ops_recover_after_time?

Contributor Author

@portante It could make sense; I will look at this tomorrow.

Contributor Author

@portante I think it is correct.

The {{openshift_logging_es_recover_after_time}} value is set as the ${RECOVER_AFTER_TIME} env variable in roles/openshift_logging/templates/es.j2, which is then used in the elasticsearch.yml.j2 template.

//cc @ewolinetz (Erik see below please)

However, I do not understand why this value is set only for the Ops DeploymentConfig in the tasks/install_elasticsearch.yaml script. Please compare the vars section for Ops vs the vars section for Non-Ops.

Why are the following four vars set only for the Ops DeploymentConfig?

  • es_node_quorum
  • es_recover_after_nodes
  • es_recover_expected_nodes
  • openshift_logging_es_recover_after_time

Apart from this, if there is any issue with the openshift_logging_es_recover_after_time usage and configuration, then I suggest addressing it in a separate ticket/PR.

Contributor

Per my IRC comment, the difference is that when running the 'ops' tasks, we update the variables to be the 'ops' variables. Technically, we should have written these as a single set of tasks run twice: once for non-ops and once for ops. The four variables in question are found in vars/main.yaml.
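
A sketch of that "single set of tasks, run twice" shape (file and variable names purely hypothetical):

  - include: install_es_component.yaml
    vars:
      es_component: es

  - include: install_es_component.yaml
    vars:
      es_component: es-ops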

@openshift-bot

@openshift-bot openshift-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 22, 2017
@lukas-vlcek lukas-vlcek force-pushed the no_auto_expand_replicas branch from 44a5dcc to a59661b Compare March 22, 2017 09:53
@lukas-vlcek
Contributor Author

rebased

@openshift-bot openshift-bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 22, 2017
@ewolinetz
Contributor

aos-ci-test

@ewolinetz
Contributor

[merge]

@openshift-bot

Evaluated for openshift ansible merge up to 9460def

@openshift-bot

openshift-bot commented Mar 23, 2017

continuous-integration/openshift-jenkins/merge SUCCESS (https://ci.openshift.redhat.com/jenkins/job/merge_pull_request_openshift_ansible/88/) (Base Commit: fdd0889)

@openshift-bot openshift-bot merged commit 6f48ddc into openshift:master Mar 24, 2017
-number_of_replicas: 0
-auto_expand_replicas: 0-2
+number_of_shards: {{ es_number_of_shards | default ('1') }}
+number_of_replicas: {{ es_number_of_replicas | default ('0') }}
Contributor

I'm not exactly sure why, but because of the way we use this template, the values have to be quoted. Please change this to:

   number_of_shards: "{{ es_number_of_shards | default ('1') }}"
   number_of_replicas: "{{ es_number_of_replicas | default ('0') }}"

You'll have to submit a new PR since this one was merged.

Contributor Author

@lukas-vlcek lukas-vlcek Mar 24, 2017

Thanks @richm! I opened new PRs #3763 and #3764.

@lukas-vlcek lukas-vlcek deleted the no_auto_expand_replicas branch March 24, 2017 09:49