General stability issues influxdb v1.0beta3 (possible compaction problems, still)? #7074
Comments
Oliver's gist contains:
|
@sysadminsemantics could you also provide (ideally on a node that's in the middle of causing trouble):
You can then drag and drop all the files as an issue comment. If you're concerned about exposing stats, shards or diagnostics then you can email those to my first name as it appears on my profile page |
@sysadminsemantics I just realised that the profiles in the gist are from when the system is performing well? In that case we could also do with:
when it's performing badly, too 😄 |
All nodes just started having issues out of the blue; no major event triggered it. |
@sysadminsemantics Something looks off with your disks. You mentioned these disks were SSDs, but the iostat data looks like a spinning disk profile. Each disk is sustaining about 110 IOPS and is fully utilized. There are also many goroutines blocked on file IO related tasks. Are you able to test without the RAID setup? |
@jwilder thanks for the feedback. I was able to test the SSDs in several scenarios, and I think the conclusion and moral of the story is something as stupidly simple as IO queueing. We had completely overlooked that Debian Jessie by default enables the CFQ scheduler for every single device; not ideal for IO-intensive scenarios. Newer fio tests reveal 40K IOPS per disk without even hitting 100% disk util, and that's just testing random writes. Under normal circumstances, the InfluxDB nodes seem to be doing around this: That's only about 100 IOPS, but also only 13% IO util. Furthermore, it would seem that after changing the scheduler from CFQ to noop, the compaction lockup we experienced over the previous days no longer occurs: I'll still be monitoring the situation closely, but I'm hoping that this simple scheduler change might have solved it. |
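The scheduler swap described above is done through sysfs. Below is a minimal Go sketch, not part of the original report, that prints the active scheduler for each block device and can optionally switch it to noop; the `sd*` device glob, root privileges, and the classic (non-blk-mq) sysfs layout are all assumptions. Note the change does not persist across reboots; that would need, for example, a udev rule or the `elevator=noop` kernel parameter.

```go
// schedcheck reports the active IO scheduler for each block device and,
// when run with -set, switches it to "noop". A sketch only: it assumes
// the classic sysfs layout (/sys/block/<dev>/queue/scheduler), sd* device
// names, and root privileges for writing; on blk-mq kernels the
// equivalent no-op scheduler is called "none".
package main

import (
	"flag"
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

func main() {
	set := flag.Bool("set", false, "switch every listed device to the noop scheduler")
	flag.Parse()

	paths, err := filepath.Glob("/sys/block/sd*/queue/scheduler")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	for _, path := range paths {
		raw, err := os.ReadFile(path)
		if err != nil {
			fmt.Fprintf(os.Stderr, "read %s: %v\n", path, err)
			continue
		}
		// The active scheduler is shown in brackets, e.g. "noop deadline [cfq]".
		fmt.Printf("%s: %s\n", path, strings.TrimSpace(string(raw)))

		if *set {
			if err := os.WriteFile(path, []byte("noop"), 0644); err != nil {
				fmt.Fprintf(os.Stderr, "write %s: %v\n", path, err)
			}
		}
	}
}
```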
@sysadminsemantics Interesting discovery. I'm going to close this out as it appears to be a system config issue. If you run into other issues, please log a new issue. Thanks for all the detailed data as well; it really helps to understand what is going on. |
I'm afraid it happened again over the weekend, so we cheered slightly too early. The strange thing is that when I run iostat while InfluxDB is in the "bad" state, the IOPS reported are indeed very low (around the 100 mark, with 60 MB/sec throughput per drive): If, while InfluxDB is still in this bad state, I run an fio test of my own and then run iostat again, the throughput increases. Then, after an InfluxDB restart, the writes InfluxDB does are more performant than before; it seems like it is barely doing anything at all: When the nodes are in this bad state, the full compaction time automatically climbs to crazy numbers, probably because it cannot complete? I'm really interested in the full compaction behavior here. Is there a way to run it manually, to see if we can reproduce the error? If not, what details can you provide on the thresholds and/or schedules that trigger it? |
I'm going to reopen this issue since previous data may still be useful in diagnosing this. /cc @jwilder |
Just an update: before going home last night, I removed both mirror drives from the RAID1+0 array to do some performance testing on that isolated InfluxDB node the next morning (mainly looking into chunk size / stripe alignment). This morning when I came in, all nodes, with the exception of the node where I had taken out the mirrors, had the same issue as reported before, so software RAID is becoming more suspect. At this point we're thinking of sharding the cluster and redoing the whole setup altogether (maybe with ZFS), possibly with influxdb_relay on the nodes; or, on the other hand, actually giving the commercial clustered version a spin. In the meantime, are there any other people here who have handled production InfluxDB loads with software RAID and/or ZFS? |
More updates. This seems to happen systematically. I've started paying attention to what is actually happening in the logs as well, and I'm noticing that there isn't a file size difference after it is done compacting. Case in point, from the InfluxDB logs:
Looking at the /mnt/influxdb/data/influxdb/default/73/000000478-000035229.tsm file at the time of compaction, it is the following size: The log files appear to mention it being written into a "tmp" file, which is the exact same size: And when compaction has finally finished, and I'm assuming it overwrites the original tsm1 file with the tmp one, it is still the same size: I can reproduce this for virtually any tsm file on the filesystem. Which makes me question:
|
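To make the before/after size comparison from the comment above easier to repeat, here is a small Go sketch (not from the original thread); the shard path is just the example mentioned above, and the `.tsm`/`.tmp` suffixes are the ones observed in this report.

```go
// tsmsizes prints the size of every .tsm and .tmp file in a shard
// directory so that sizes before and after a compaction can be compared
// at a glance. A sketch only: adjust the path to your own data layout.
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

func main() {
	shardDir := "/mnt/influxdb/data/influxdb/default/73" // example path from this thread
	if len(os.Args) > 1 {
		shardDir = os.Args[1]
	}

	entries, err := os.ReadDir(shardDir)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	for _, e := range entries {
		if e.IsDir() {
			continue
		}
		name := e.Name()
		if strings.HasSuffix(name, ".tsm") || strings.HasSuffix(name, ".tmp") {
			info, err := e.Info()
			if err != nil {
				continue
			}
			fmt.Printf("%12d  %s\n", info.Size(), filepath.Join(shardDir, name))
		}
	}
}
```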
So, to rule out the very poor filesystem performance on my nodes, I changed the filesystem on all of them to ZFS. I've created a mirrored pool, so effectively I'm running the same thing as RAID1+0, and the IOPS are loads better than they were before: Please note that this IO spike still only occurs when the node is doing a full compaction. Which brings me to my next point: once a node is in full compaction mode, it never seems to finish. Whereas nodes doing a normal compaction at regular intervals are fine, once the full compaction happens it keeps a shitload of compaction tmp files in the data dirs and does not get rid of them unless I restart InfluxDB. This has become less of an issue now, because the disks are no longer at 100% IO util during the spike, so I actually have headroom for incoming metrics and don't see the spikey behavior I saw when the disks were 100% utilised. All of this, however, cannot be deemed default behavior. I'm close to writing an event handler in my monitoring system that checks the system / CPU / disk load and the number of compaction tmp files, and initiates a restart of InfluxDB. What information can you give me in the meantime about how compaction works? Is there any other, less hacky, way of ensuring that this doesn't happen? |
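For the monitoring event handler mentioned above, a minimal sketch could simply count lingering compaction `.tmp` files under the data directory and flag when they exceed a threshold. The path, the `.tmp` suffix, and the threshold below are assumptions based on the observations in this thread, not an official check.

```go
// tmpwatch counts leftover compaction .tmp files under an InfluxDB data
// directory and exits non-zero when they exceed a threshold, so it can be
// wired into a monitoring check that alerts (or triggers a restart).
// A sketch only: data path, suffix convention, and threshold are
// assumptions taken from this issue, not InfluxDB defaults.
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

func main() {
	dataDir := "/mnt/influxdb/data"
	const maxTmpFiles = 10 // arbitrary alert threshold

	count := 0
	err := filepath.Walk(dataDir, func(path string, info os.FileInfo, err error) error {
		if err != nil {
			return err
		}
		if !info.IsDir() && strings.HasSuffix(path, ".tmp") {
			count++
		}
		return nil
	})
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(2)
	}

	fmt.Printf("%d compaction tmp files under %s\n", count, dataDir)
	if count > maxTmpFiles {
		os.Exit(1) // treat as CRITICAL in the monitoring system
	}
}
```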
@sysadminsemantics If you are restarting the server in the middle of a compaction, the compactions will re-run the next time, but they will need to start over. A full compaction runs after a shard has been cold (e.g. no writes) for 24hrs by default. These don't run frequently because they can be very CPU intensive, and full compactions can take a long time due to a variety of factors. Other types of compactions run while the shard is accumulating new data. In the issue description, you were expecting that there would be no IO or load spikes after writes have stopped. This is not the way the system works, since shards are compacted while they receive writes and then fully compacted after 24hrs. You can try lowering your |
I'll try to answer some of your earlier questions:
This is a different file from the one that was being compacted.
Compactions convert series data from a write-optimized, less compact format to a read-optimized, compressed format. They also expunge deleted data and help ensure that larger shards can still be loaded quickly.
No. Once a shard is fully compacted and no new writes to the shard occur, it will not be re-compacted.
I don't understand the question.
No, they happen automatically to keep the shard both queryable and writeable. You can lower some of the thresholds used to trigger snapshot and full compactions, but I would not recommend changing the defaults without a good reason. |
The full compaction planner could return a plan that only included one generation. If this happened, a full compaction would run on that generation, producing just one generation again, and the planner would then repeat the plan. This could happen if there were two generations that were both over the max TSM file size and the second one happened to be in level 3 or lower. When this situation occurs, one CPU is pegged running a full compaction continuously and the disks become very busy, basically rewriting the same files over and over again. This can eventually cause disk and CPU saturation if it occurs with more than one shard. Fixes #7074
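As an illustration only (this is not the actual InfluxDB planner code), the guard described above can be sketched as: if the full-compaction plan would end up holding a single generation, return no plan instead of rewriting the same files forever. The candidate-selection condition and the 2GB size cap below are simplifying assumptions.

```go
// Illustrative sketch of the degenerate planning loop described above.
// A "generation" here stands in for a set of TSM files produced by one
// compaction; the bug was that a plan holding a single, already max-sized
// generation would be compacted into itself over and over.
package main

import "fmt"

type generation struct {
	level int
	size  int64 // total bytes of the generation's TSM files
}

const maxTSMFileSize int64 = 2 * 1024 * 1024 * 1024 // assumed ~2GB cap, for illustration

// planFull picks candidates for a full compaction, but if the plan ends up
// holding only one generation there is nothing to merge, so it returns no
// work instead of rewriting the same files (the guard added by the fix).
func planFull(gens []generation) []generation {
	var plan []generation
	for _, g := range gens {
		// simplified candidate rule: not yet fully compacted, or still under max size
		if g.level <= 3 || g.size < maxTSMFileSize {
			plan = append(plan, g)
		}
	}
	if len(plan) <= 1 {
		return nil // avoid the single-generation rewrite loop
	}
	return plan
}

func main() {
	// The problem case from the description: two generations over max size,
	// the second one still at level 3.
	gens := []generation{
		{level: 4, size: 3 * 1024 * 1024 * 1024},
		{level: 3, size: 3 * 1024 * 1024 * 1024},
	}
	fmt.Println("plan:", planFull(gens)) // prints "plan: []" — no degenerate plan
}
```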
Should be fixed via #7258 |
Bug report
System info:
4 of these (non-clustered, just mirroring all metrics data x4 via Telegraf; might shard or truly cluster these at some point):
ProLiant XL170r Gen9
Intel(R) Xeon(R) CPU E5-2660 (40 cores hyperthreaded)
256GB of RAM
5 x 960GB SSDs (HP MK0960GECQK, apparently these are refurbished Samsungs), out of which:
Steps to reproduce:
-> 54 shards
-> numMeasurements: 90
-> numSeries: 259848
Expected behavior:
Actual behavior:
I've attached a few screenshots of my InfluxDB instance metrics, as well as the reports per node (both during and after the occurrence). Also, separate bug: did we remove UDP metrics from _internal?
The gist for the bugreport is here:
https://gist.github.com/sysadminsemantics/18a9692acfeb417deba5b2ab7f67e825
You'll probably need more info, but if you have any pointers I could work with in the meantime, that would be greatly appreciated. Forgot to mention (though it should be apparent in the graphs) that the bad state occurs around 12:30.