Large aggregation query -> OOM -> Index corruption -> data loss #12041
This message is worrisome - it seems you have 7GB of memory, which took 18.7 minutes to GC? What kind of hardware are you running on? Did you disable swap?
Once you've reached an OOM you'd better restart the node, as many unexpected things can happen. It's especially bad if the OOM happened in the network stack (like yours did). Can you specify some more details about the corruption reported?
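For context, heap usage and cumulative GC time can be read per node from the node stats API; a rough check against a typical 1.x setup (host, port and the exact JSON field names are assumptions):

```sh
# Heap usage and old-generation GC totals for every node
curl -s 'localhost:9200/_nodes/stats/jvm?pretty'

# If swap is enabled, disabling it (or at least lowering vm.swappiness) keeps
# GC pauses from being stretched further by paging; run on each data node:
sudo swapoff -a
```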
Each data node (like the one above) in our cluster has 16 GB RAM and 8 GB heap. The aggregations are very heavy, so long GCs are as such not the main problem here (we need to find ways to limit the queries on our part). My main question here is: should an OOM occur on one node, would shard corruption - and then data loss - be one of the possible (expected) outcomes?

The net effect of the crash has been a data loss of 20% of the original data for a specific set of indices. By looking at shard information we see that some shards have less data (take less space on disk) than they should. We also see missing documents in the affected indices - very evenly spread out across our primary key (an event number id - our data are time series). We suspect that due to the OOM some shards have been corrupted and then discarded. We have 1 replica for each shard (that is, 1 primary and 1 replica) in the cluster, but this has not prevented data loss. (Yes, we could have had more replicas but that is not the main issue here.)

It seems that long GC pauses have made data nodes unable to connect to the master (we see traces of that in the logs on the data nodes), so an additional factor here might be that the master has not had a chance to become aware of the shard corruption.

Mlockall is set to true by us (in elasticsearch.yml) but reading out the actual value from /_nodes/process shows the actual value is false. We are already aware of that - we believe the reason is that we have to use virtual machines on VMware. We have not disabled swap explicitly, so we run on OS defaults. We are using RHEL (7.0, I think).
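Two things that may be worth checking here: whether mlockall actually took effect, and whether the circuit breakers can bound the heavy aggregations before they exhaust the heap. A rough sketch against a 1.x cluster (limit values, host and ulimit details are illustrative, not a recommendation):

```sh
# elasticsearch.yml already has the 1.x setting:
#   bootstrap.mlockall: true
# Verify what the running node actually reports; mlockall usually also needs the
# memlock ulimit raised for the ES user (e.g. "elasticsearch - memlock unlimited"
# in /etc/security/limits.conf):
curl -s 'localhost:9200/_nodes/process?pretty'

# The request and fielddata circuit breakers (available since 1.4) make an oversized
# aggregation fail with an exception instead of OOM-ing the node:
curl -s -XPUT 'localhost:9200/_cluster/settings' -d '{
  "persistent": {
    "indices.breaker.request.limit": "30%",
    "indices.breaker.fielddata.limit": "40%"
  }
}'
```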
How much less? Shards may vary in size based on their merge schedules.
Do you delete documents? If so, you should take them into account, as the stats report the total number of documents in the index. There is a separate entry for deleted docs (which will be purged away by the background merge process). Are there any other signs of corruption? A log entry would be great.
A non-responsive node will be removed from the cluster and its shards will be assigned somewhere else. Once the node rejoins (after the GC finishes) it will be reassigned shards, which will be resynced with the primaries.
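For reference, the total vs. deleted document counts are visible in the index stats; something like the following, with a made-up index name:

```sh
# docs.count vs docs.deleted for one index (index name is hypothetical)
curl -s 'localhost:9200/logs-2015.02/_stats/docs?pretty'
```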
The indices are time-partitioned (one index for each month of data). Each index has 6 shards. The distribution of data for the May index - after our problems - was as follows:

Shard 0, 2, 3, 4, 5: 4.1-4.2 GB

For the February index, the distribution was:

Shard 1: 115 bytes / 79 bytes (primary & replica)

For other indices (not affected by data loss) we see very even shard sizes, within 5% of each other. So we suspect merge schedules are not a factor here.
No. This is mostly an append-only system. At times we update existing documents (which we realise may be implemented internally in Lucene as delete + add), but also here we do not think the differences are due to deletes (i.e. updates), as the indices that were not affected by data loss look very similar to each other, and they are all quite different from the ones affected.
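The per-shard sizes and document counts were compared via the shard-level APIs; roughly (index name is illustrative):

```sh
# Doc count and on-disk store size for every shard copy of the May index
curl -s 'localhost:9200/_cat/shards/logs-2015.05?v'
```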
This is for us the main sign of corruption:
I will look for other log messages and update this issue. We see problems on other data nodes as well (OOMs, missing cluster synchronisation) during the failure period. Here is an example of events from a different node. First occurrence of an OOM:
The node is then unable to let the master know of the shard failure:
The node is then unable to create shards:
We experienced problems with reassignments, so we had to allocate shards manually to get ES out of its "red state". It seems we hit the case described here: At least, this comment fits very well with our hypothesis (and previous observations):
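Manual allocation in 1.x goes through the cluster reroute API; a sketch with placeholder index, shard and node values (allow_primary forcibly assigns an empty primary and discards whatever data was left on that shard copy, so it is very much a last resort):

```sh
curl -s -XPOST 'localhost:9200/_cluster/reroute' -d '{
  "commands": [
    {
      "allocate": {
        "index": "logs-2015.05",
        "shard": 1,
        "node": "data-node-3",
        "allow_primary": true
      }
    }
  ]
}'
```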
Still, my main question is: should an OOM occur on one node, would shard corruption - and then data loss - be one of the possible (expected) outcomes? It would be very valuable for us to have your opinion on whether data loss (on a single node) could be expected should an OOM occur.
Searching ES logs for information about the February index (where some shards were emptied) gave the following result:
This is the first occurrence of a log entry from a data node that has handled those shards (within the period we had failures). So it seems that at this point in time, the directory containing this shard (shard 1) has been unexpectedly emptied. A bit before this, we see that the master node cannot contact this data node. These two log entries (one above, one below) are the only ones that show up in our log analysis tool and that give hints on what happened to the shards.
An OOM should not cause data corruption, and certainly should not cause shards to be removed. Is this something that you can reproduce, or that happens often? I would love to have some trace logs enabled and debug further...
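In 1.x, log levels can be raised at runtime through the cluster settings API; a minimal sketch (the logger names and levels below are illustrative - whichever packages are relevant would need to be traced):

```sh
curl -s -XPUT 'localhost:9200/_cluster/settings' -d '{
  "transient": {
    "logger.index.shard": "TRACE",
    "logger.indices.recovery": "DEBUG"
  }
}'
```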
We are working on reproducing the issue now, will post further details here.
@kristoffer-dyrkorn this may be related to the issue fixed in #12487, so I would advise upgrading.
We have not been able to recreate the issue and cannot spend further resources on trying, so I am afraid we cannot reach a final conclusion on what went wrong. We will, however, upgrade now.
thanks @kristoffer-dyrkorn - closing for now.
This issue happened again today (still on 1.5.2), with the same sequence of errors as described earlier. It does indeed look very similar to #12487. We lost four shards (2 x primary and replica) in this crash, and both primary and replica were located on the nodes that went OOM. After the nodes had been restarted, the files and directories for shards 4 and 5 were no longer on disk on the affected data nodes. Both nodes had long GC pauses, so I guess there were good chances for a race condition to happen. Luckily, it only affected one of the indices, which are backed up nightly, so we were able to restore all the data.
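For anyone hitting the same thing: restoring the affected index from a nightly snapshot looks roughly like this (repository, snapshot and index names are made up, and the existing index has to be closed or deleted before restoring over it):

```sh
curl -s -XPOST 'localhost:9200/logs-2015.05/_close'
curl -s -XPOST 'localhost:9200/_snapshot/nightly/snapshot-2015-07-19/_restore' -d '{
  "indices": "logs-2015.05"
}'
```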
We have experienced data loss in ES (1.5.2) after running very heavy aggregation queries. We get OOMs (which is not ideal, but tolerable, given our context) but also index corruption and data loss (which is bad).
Any ideas as to what has happened here - and how this could be prevented? Please see the log excerpt: