General stability issues influxdb v1.0beta3 (possible compaction problems, still)? #7074
Comments
Oliver's gist contains:
|
@sysadminsemantics could you also provide (ideally on a node that's in the middle of causing trouble):
You can then drag and drop all the files as an issue comment. If you're concerned about exposing stats, shards or diagnostics then you can email those to my first name as it appears on my profile page |
@sysadminsemantics I just realised that the profiles in the gist are from when the system is performing well? In that case we could also do with:
when it's performing badly, too 😄 |
All nodes just started having issues out of the blue; no major event triggered it. |
@sysadminsemantics Something looks off with your disks. You mentioned these disks were SSDs, but the iostat data looks like a spinning disk profile. Each disk is sustaining about 110 IOPS and is fully utilized. There are also many goroutines blocked on file IO related tasks. Are you able to test without the RAID setup? |
@jwilder thanks for the feedback. I was able to test the SSDs in several scenarios, and I think the conclusion and moral of the story is something as stupidly simple as IO queueing. We had completely overlooked that Debian Jessie by default enables the CFQ scheduler for every single device; not ideal for IO-intensive scenarios. Newer fio tests reveal 40K IOPS per disk without even hitting 100% disk util, and that's just testing random writes. Under normal circumstances, the InfluxDB nodes seem to be doing around this: That's only about 100 IOPS, but also only 13% IO util. Furthermore, it would seem that after changing the scheduler from CFQ to noop, the compaction lockup we experienced over the previous days no longer occurs: I'll still be monitoring the situation closely, but I'm hoping that this simple scheduler change might have solved it. |
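The scheduler swap described above is done through sysfs. Below is a minimal Go sketch, not part of the original report, that prints the active scheduler for each block device and can optionally switch it to noop; the `sd*` device glob, root privileges, and the classic (non-blk-mq) sysfs layout are all assumptions. Note the change does not persist across reboots; that would need, for example, a udev rule or the `elevator=noop` kernel parameter.

```go
// schedcheck reports the active IO scheduler for each block device and,
// when run with -set, switches it to "noop". A sketch only: it assumes
// the classic sysfs layout (/sys/block/<dev>/queue/scheduler), sd* device
// names, and root privileges for writing; on blk-mq kernels the
// equivalent no-op scheduler is called "none".
package main

import (
	"flag"
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

func main() {
	set := flag.Bool("set", false, "switch every listed device to the noop scheduler")
	flag.Parse()

	paths, err := filepath.Glob("/sys/block/sd*/queue/scheduler")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	for _, path := range paths {
		raw, err := os.ReadFile(path)
		if err != nil {
			fmt.Fprintf(os.Stderr, "read %s: %v\n", path, err)
			continue
		}
		// The active scheduler is shown in brackets, e.g. "noop deadline [cfq]".
		fmt.Printf("%s: %s\n", path, strings.TrimSpace(string(raw)))

		if *set {
			if err := os.WriteFile(path, []byte("noop"), 0644); err != nil {
				fmt.Fprintf(os.Stderr, "write %s: %v\n", path, err)
			}
		}
	}
}
```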
@sysadminsemantics Interesting discovery. I'm going to close this out as it appears to be a system config issue. If you run into other issues, please log a new issue. Thanks for all the detailed data as well; it really helps to understand what is going on. |
I'm afraid it happened again over the weekend, so we cheered slightly too early. The strange thing is that when I run iostat while InfluxDB is in the "bad" state, the IOPS reported are indeed very low (around the 100 mark, with 60 MB/sec throughput per drive): If, while InfluxDB is still in this bad state, I run an fio test of my own and then run iostat again, the throughput increases. Then, after an InfluxDB restart, the writes InfluxDB does are more performant than before; it seems like it is barely doing anything at all: When the nodes are in this bad state, the full compaction time automatically climbs to crazy numbers, probably because it cannot complete? I'm really interested in the full compaction behavior here. Is there a way to run it manually, to see if we can reproduce the error? If not, what details can you provide on the thresholds and/or schedules that trigger it? |
I'm going to reopen this issue since previous data may still be useful in diagnosing this. /cc @jwilder |
Just an update: before going home last night, I removed both mirror drives from the RAID1+0 array to do some performance testing on that isolated InfluxDB node the next morning (mainly looking into chunk size / stripe alignment). This morning when I came in, all nodes, with the exception of the node where I had taken out the mirrors, had the same issue as reported before, so software RAID is becoming more suspect. At this point we're thinking of sharding the cluster and redoing the whole setup altogether (maybe with ZFS), possibly with influxdb_relay on the nodes; or, on the other hand, actually giving the commercial clustered version a spin. In the meantime, are there any other people here who have handled production InfluxDB loads with software RAID and/or ZFS? |
More updates. This seems to happen systematically. I've started paying attention to what is actually happening in the logs as well, and I'm noticing that there isn't a file size difference after it is done compacting. Case in point, from the InfluxDB logs:
Looking at the /mnt/influxdb/data/influxdb/default/73/000000478-000035229.tsm file at the time of compaction, it is the following size: The log files appear to mention it being written into a "tmp" file, which is the exact same size: And when compaction has finally finished, and I'm assuming it overwrites the original tsm1 file with the tmp one, it is still the same size: I can reproduce this for virtually any tsm file on the filesystem. Which makes me question:
|
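To make the before/after size comparison from the comment above easier to repeat, here is a small Go sketch (not from the original thread); the shard path is just the example mentioned above, and the `.tsm`/`.tmp` suffixes are the ones observed in this report.

```go
// tsmsizes prints the size of every .tsm and .tmp file in a shard
// directory so that sizes before and after a compaction can be compared
// at a glance. A sketch only: adjust the path to your own data layout.
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

func main() {
	shardDir := "/mnt/influxdb/data/influxdb/default/73" // example path from this thread
	if len(os.Args) > 1 {
		shardDir = os.Args[1]
	}

	entries, err := os.ReadDir(shardDir)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	for _, e := range entries {
		if e.IsDir() {
			continue
		}
		name := e.Name()
		if strings.HasSuffix(name, ".tsm") || strings.HasSuffix(name, ".tmp") {
			info, err := e.Info()
			if err != nil {
				continue
			}
			fmt.Printf("%12d  %s\n", info.Size(), filepath.Join(shardDir, name))
		}
	}
}
```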
So, to rule out the very poor filesystem performance on my nodes, I changed the filesystem on all of them to ZFS. I've created a mirrored pool, so effectively I'm running the same thing as RAID1+0, and the IOPS are loads better than they were before: Please note that this IO spike still only occurs when the node is doing a full compaction. Which brings me to my next point: once a node is in full compaction mode, it never seems to finish. Whereas nodes doing a normal compaction at regular intervals are fine, once the full compaction happens it keeps a shitload of compaction tmp files in the data dirs and does not get rid of them unless I restart InfluxDB. This has become less of an issue now, because the disks are no longer at 100% IO util during the spike, so I actually have headroom for incoming metrics and don't see the spikey behavior I saw when the disks were 100% utilised. All of this, however, cannot be deemed default behavior. I'm close to writing an event handler in my monitoring system that checks the system / CPU / disk load and the number of compaction tmp files, and initiates a restart of InfluxDB. What information can you give me in the meantime about how compaction works? Is there any other, less hacky, way of ensuring that this doesn't happen? |
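For the monitoring event handler mentioned above, a minimal sketch could simply count lingering compaction `.tmp` files under the data directory and flag when they exceed a threshold. The path, the `.tmp` suffix, and the threshold below are assumptions based on the observations in this thread, not an official check.

```go
// tmpwatch counts leftover compaction .tmp files under an InfluxDB data
// directory and exits non-zero when they exceed a threshold, so it can be
// wired into a monitoring check that alerts (or triggers a restart).
// A sketch only: data path, suffix convention, and threshold are
// assumptions taken from this issue, not InfluxDB defaults.
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

func main() {
	dataDir := "/mnt/influxdb/data"
	const maxTmpFiles = 10 // arbitrary alert threshold

	count := 0
	err := filepath.Walk(dataDir, func(path string, info os.FileInfo, err error) error {
		if err != nil {
			return err
		}
		if !info.IsDir() && strings.HasSuffix(path, ".tmp") {
			count++
		}
		return nil
	})
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(2)
	}

	fmt.Printf("%d compaction tmp files under %s\n", count, dataDir)
	if count > maxTmpFiles {
		os.Exit(1) // treat as CRITICAL in the monitoring system
	}
}
```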
@sysadminsemantics If you are restarting the server in the middle of a compaction, the compactions will re-run the next time, but they will need to start over. A full compaction runs after a shard has been cold (e.g. no writes) for 24hrs by default. These don't run frequently because they can be very CPU intensive, and full compactions can take a long time due to a variety of factors. Other types of compactions run while the shard is accumulating new data. In the issue description, you were expecting that there would be no IO or load spikes after writes have stopped. This is not the way the system works, since shards are compacted while they receive writes and then fully compacted after 24hrs. You can try lowering your |
I'll try to answer some of your earlier questions:
This is a different file from the one that was being compacted.
Compactions convert series data from a write-optimized, less compact format to a read-optimized, compressed format. They also expunge deleted data and help ensure that larger shards can still be loaded quickly.
No. Once a shard is fully compacted and no new writes to the shard occur, it will not be re-compacted.
I don't understand the question.
No, they happen automatically to keep the shard both queryable and writeable. You can lower some of the thresholds used to trigger snapshot and full compactions, but I would not recommend changing the defaults without a good reason. |
The full compaction planner could return a plan that only included one generation. If this happened, a full compaction would run on that generation, producing just one generation again, and the planner would then repeat the plan. This could happen if there were two generations that were both over the max TSM file size and the second one happened to be in level 3 or lower. When this situation occurs, one CPU is pegged running a full compaction continuously and the disks become very busy, basically rewriting the same files over and over again. This can eventually cause disk and CPU saturation if it occurs with more than one shard. Fixes #7074
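As an illustration only (this is not the actual InfluxDB planner code), the guard described above can be sketched as: if the full-compaction plan would end up holding a single generation, return no plan instead of rewriting the same files forever. The candidate-selection condition and the 2GB size cap below are simplifying assumptions.

```go
// Illustrative sketch of the degenerate planning loop described above.
// A "generation" here stands in for a set of TSM files produced by one
// compaction; the bug was that a plan holding a single, already max-sized
// generation would be compacted into itself over and over.
package main

import "fmt"

type generation struct {
	level int
	size  int64 // total bytes of the generation's TSM files
}

const maxTSMFileSize int64 = 2 * 1024 * 1024 * 1024 // assumed ~2GB cap, for illustration

// planFull picks candidates for a full compaction, but if the plan ends up
// holding only one generation there is nothing to merge, so it returns no
// work instead of rewriting the same files (the guard added by the fix).
func planFull(gens []generation) []generation {
	var plan []generation
	for _, g := range gens {
		// simplified candidate rule: not yet fully compacted, or still under max size
		if g.level <= 3 || g.size < maxTSMFileSize {
			plan = append(plan, g)
		}
	}
	if len(plan) <= 1 {
		return nil // avoid the single-generation rewrite loop
	}
	return plan
}

func main() {
	// The problem case from the description: two generations over max size,
	// the second one still at level 3.
	gens := []generation{
		{level: 4, size: 3 * 1024 * 1024 * 1024},
		{level: 3, size: 3 * 1024 * 1024 * 1024},
	}
	fmt.Println("plan:", planFull(gens)) // prints "plan: []" — no degenerate plan
}
```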
Should be fixed via #7258 |
Bug report
System info:
4 of these (non-clustered, just mirroring all metrics data x4 via Telegraf; might shard or truly cluster these at some point):
ProLiant XL170r Gen9
Intel(R) Xeon(R) CPU E5-2660 (40 cores hyperthreaded)
256GB of RAM
5 x 960GB SSDs (HP MK0960GECQK, apparently these are refurbished Samsungs), out of which:
Steps to reproduce:
-> 54 shards
-> numMeasurements: 90
-> numSeries: 259848
Expected behavior:
Actual behavior:
I've attached a few screenshots of my InfluxDB instance metrics, as well as the reports per node (both during and after the occurrence). Also, separate bug: did we remove UDP metrics from _internal?
The gist for the bugreport is here:
https://gist.github.com/sysadminsemantics/18a9692acfeb417deba5b2ab7f67e825
You'll probably need more info, but if you have any pointers I could work with in the meantime, that would be greatly appreciated. Forgot to mention (though it should be apparent in the graphs) that the bad state occurs around 12:30.