
Detect error on _bulk api on elasticsearch plugin #681

Closed

HarukaMa opened this issue Feb 20, 2018 · 9 comments

Comments

@HarukaMa
Contributor

I am trying the elasticsearch plugin and encountered this issue:

HTTP log:

POST /_bulk HTTP/1.1
Host: localhost:9200
User-Agent: libcrp/0.1
Accept: */*
Content-Type: application/json
Content-Length: 46770
Expect: 100-continue

{ "index" : { "_index" : "graphene-2018-02", "_type" : "data", "op_type" : "create", "_id" : "2.9.145258579" } }
...

HTTP/1.1 200 OK
access-control-allow-credentials: true
content-type: application/json; charset=UTF-8
content-length: 10534

{"took":0,"errors":true,"items":[{"create":{"_index":"graphene-2018-02","_type":"data","_id":"2.9.145258579","status":403,"error":{"type":"cluster_block_exception","reason":"blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];"}}},{"create"
...

The error [FORBIDDEN/12/index read-only / allow delete (api)] seems related to low free disk space while indexing and prevents further insert operations, but the plugin currently ignores the error, causing all subsequent operations to be missing from this index.

I think it would be better if the plugin could detect such errors instead of silently ignoring them, to prevent data loss, since recovery could require a full replay, which is lengthy with elasticsearch. The wiki should also be updated with an estimated disk space requirement to help prevent this from happening. My partial data up to 2018-01 takes about 60 GB of disk space (excluding translog) using the best_compression codec setting, with reduced translog size and age settings and 2 shards per index.
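For reference, even after disk space is freed, this read-only block is not removed automatically; it has to be cleared manually. Something like this should work (the index name is just an example; _all can be used to clear every index):

$ curl -XPUT 'http://localhost:9200/graphene-2018-02/_settings' -H 'Content-Type: application/json' -d '{
  "index.blocks.read_only_allow_delete": null
}'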

@abitmore
Member

I heard that 160 GB of disk space is not enough for an up-to-date full history.

@HarukaMa
Contributor Author

The data of the indices shouldn't take more than 100 GB; I guess it's mainly because of the translog: by default every index has 5 shards, and each of them could take up to 512 MB, which means there could be 2.5 GB of temporary storage overhead for every index during replay. In my opinion, 5 shards is a bit of overkill, as the largest index is still below 20 GB currently. Also, enabling best_compression could save about 10% of the space at the expense of slower operations.
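Note that best_compression is a static setting: it can be set at index creation (or through a template), but not changed on an open index. A sketch, with an example index name:

$ curl -XPUT 'http://localhost:9200/graphene-2018-03' -H 'Content-Type: application/json' -d '{
  "settings": {
    "number_of_shards": 2,
    "index.codec": "best_compression"
  }
}'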

@oxarbitrage
Member

we have considered doing something with the http error logs (https://github.com/bitshares/bitshares-core/blob/master/libraries/plugins/elasticsearch/elasticsearch_plugin.cpp#L285-L295).

we can include code 413 there, but then do what? send a log msg to the node? try to kill the node?

let me know if you have a good idea, I will be happy to add it.

in regards to the shards, I am not an expert in ES, but the settings for an index are applied at creation: when the first insert is sent, it creates the index with the default settings. in order to control the index settings we need to send a query beforehand with the custom options we want, something like the sketch below.
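for example, this should pre-create an index with custom settings before the plugin writes to it (the index name is just an example):

$ curl -XPUT 'http://localhost:9200/graphene-2018-02' -H 'Content-Type: application/json' -d '{
  "settings": { "number_of_shards": 2 }
}'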

if you have a set of good settings I can try to add that.

in regards to the wiki, you are right, I added a note here: https://github.com/bitshares/bitshares-core/wiki/ElasticSearch-Plugin#checking-if-it-is-working with 160 gigs; even if it is less, better to have people prepared with a big hd.

@HarukaMa
Contributor Author

HarukaMa commented Feb 21, 2018

I'm using a template to pre-define the settings:

$ curl -XPUT 'http://localhost:9200/_template/graphene' -d '{
  "index_patterns" : ["graphene-*"],
  "settings": { "number_of_shards": 2,
    "index": {
      "translog": {
        "retention": {
          "size": "512mb", "age": "300s"
        }
      }
    }
  }
}' -H 'Content-Type: application/json'

This template applies those settings to every newly created index prefixed with graphene-. It's a one-time setup, so there is no need to specify the settings for every new index. In these settings I have also reduced the translog retention age (300s in the template above) to minimize storage usage, but I think that's optional.
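To verify the template was registered, you can fetch it back:

$ curl -XGET 'http://localhost:9200/_template/graphene?pretty'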

Also, some errors are returned with a 200 status, so checking the status code alone is not enough, I think.
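The top-level errors flag of the bulk response is what has to be inspected. For example, from the shell (batch.json is a hypothetical payload file; jq is assumed to be available):

$ curl -s -XPOST 'http://localhost:9200/_bulk' -H 'Content-Type: application/json' \
    --data-binary @batch.json \
  | jq '{errors: .errors, failed: [.items[][] | select(.status >= 300)]}'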

Is killing the node normal if one of the plugins encounters errors during normal operation? I'm not quite sure about it... I think we should have a way to "fix" partial indices, like replaying from a specific point instead of replaying from the start, to save (a lot of) time.

I have an additional question: can we somehow make the get_account_history api call use ES to get its data? Currently it only returns 1 op, which matches the behavior of the plugin, but this will affect the functionality of light wallets and various applications relying on this call, as they would need to interact with ES instead to get the data.

@oxarbitrage
Member

as each server can define its own pre-settings, I think it is better to add the command to the wiki instead of making the call from the plugin itself; added:
https://github.com/bitshares/bitshares-core/wiki/ElasticSearch-Plugin#pre-define-settings

killing the node is not something any other plugin does as far as I know; a msg in the witness console would at least be better than doing nothing. in the case of a full disk, a msg for error 413 will do it.

errors inside 200 are generally from a malformed query; I saw a lot of them while building the plugin but never saw any after release. the details for these can be obtained from the log index.
the other kind of error I saw inside 200 is "document already exists": as described at https://github.com/bitshares/bitshares-core/wiki/ElasticSearch-Plugin#note-on-duplicates, documents with the same id will not be added. this can be caused by a block arriving twice or something like that, and is ok to ignore.

in regards to get_account_history, I definitely think the call should do: if the elasticsearch plugin is active, use elastic; else use the normal call code.
I need approval from @abitmore and @pmconrad in order to do this.

@oxarbitrage
Member

the get_account_history changes to use elasticsearch when available are a no-go for the bitshares core development team. we already discussed it before but I forgot about it.
the reason is that we do not want more api calls inside bitshares-core if possible. if we add an elasticsearch version of get_account_history we need to do the same for get_account_history_operations, and once we have them, new calls will be requested, like get_account_history_by_date, get_account_history_by_block, etc.

this is against what we initially tried to do with the plugin, which is to remove the api call load from the nodes. to make all the queries imaginable with elasticsearch, the api node can 1) expose full elasticsearch access to the application (not recommended for security, but if the app is on the same machine as the node this is an option), 2) expose a wrapper like https://github.com/oxarbitrage/bitshares-es-wrapper, or 3) develop its own wrapper to expose the data the app will need.

we need to educate elasticsearch api node operators to use one of these options depending on their needs, but one thing is sure in the short term: bitshares-core will not make use of elasticsearch to pull data out.

@HarukaMa
Contributor Author

Then we need a way to let clients know whether the server is using the ES plugin, and that should belong to #626. For something like the reference wallet or similar applications I think we may still need some "standardized" way, if possible...

@abitmore abitmore added this to the Future Non-Consensus-Changing Release milestone Feb 25, 2018
@oxarbitrage
Member

The error handling has been improved in the latest version. There is a dedicated function looking for the error code here https://github.com/bitshares/bitshares-core/blob/develop/libraries/utilities/elasticsearch.cpp#L96 and returning true or false.

When sending data to ES fails, a plugin_exception will be raised: https://github.com/bitshares/bitshares-core/blob/develop/libraries/plugins/elasticsearch/elasticsearch_plugin.cpp#L405
This makes the plugin stop processing blocks and keep retrying until the problem is solved (ES can be down, in which case it will resume when ES is restarted; if there is no space, it will continue when space is freed, etc.).
So basically it will not keep going until the problem is fixed and the data can be sent.

For this reason I think this issue can be closed, but feel free to reopen it if you think this is not enough.

@oxarbitrage
Member

reference to the pull request where the error handling was added: #1201
