Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

General stability issues influxdb v1.0beta3 (possible compaction problems, still)? #7074

Closed
sysadminsemantics opened this issue Jul 27, 2016 · 15 comments
Assignees
Milestone

Comments

@sysadminsemantics
Copy link

sysadminsemantics commented Jul 27, 2016

Bug report

System info:

4 of these (non clustered, just mirroring all metrics data x 4 via telegraf, might shard or truly cluster these at some point):
ProLiant XL170r Gen9
Intel(R) Xeon(R) CPU E5-2660 (40 cores hyperthreaded)
256GB of RAM
5 x 960GB SSD's (HP MK0960GECQK, apparently these are refurbished Samsungs), out of which:

  • 4 of these are in RAID1+0, since the machines don't have a RAID controller, we're using software raid... -> xfs
  • 1 of these drives, following design guidelines, used as a dedicated location for the WAL -> xfs
  • InfluxDB v1.0.0beta3
  • Debian Jessie 8.2

Steps to reproduce:

  • InfluxDB will be happily parsing all our metrics for a while. Current stats:
    -> 54 shards
    -> numMeasurements: 90
    -> numSeries: 259848
  • a big event will occur, something like a power outage in a datacentre, stopping submission of metrics collection for a while
  • once metrics collection is resumed, the box in question will explode. io utilisation will rise to about 100%, the machine consuming all of its 40 threads
  • a restart of influxDB usually mitigates the issue (for a while). Current cluster has been running in a stable fashion for nearly 22 hours.
  • it "feels" like there is some queue'ing going on. WAL looks normal; the only thing that looks out of the ordinary is that the full compaction time suddenly spikes to be about 13 minutes on average. this could however also be a symptom of 100 io util
  • i've not been able so far to collect statistics about socket utilisation; amount of TIME_WAIT or CLOSE_WAIT sockets; which might be an interesting metric, but could also be highly symptomatic
  • i've collected a few reports of the cluster when it is actually behaving, but not when it is misbehaving. I will update this issue with that once this issue reoccurs (current reports are attached)
  • you'll see in a few reports that the max speed of the SSD's seems to be around 60 MB / sec. (and ioutil then being @ 100%) obviously that is very low, which is a separate hardware (and maybe software raid) issue we are investigating separately

Expected behavior:

  • no inexplicable loadspike / io lockup after metric submission has been stopped for a while

Actual behavior:

  • loadspike / io lockup

I've attached a few screenshots of my influxDB instance metrics; as well as the reports per node (both during and after the occurrence). Also; separate bug, did we remove UDP metrics from _internal?

screen shot 2016-07-27 at 11 32 28
screen shot 2016-07-27 at 11 33 25

The gist for the bugreport is here:
https://gist.github.com/sysadminsemantics/18a9692acfeb417deba5b2ab7f67e825

You'll probably need more info, but if you have any pointers I could work with in the meanwhile, that would be greatly appreciated. Forgot to mention (but should be apparent in the graphs), that the bad state occurs around 12:30.

@e-dard
Copy link
Contributor

e-dard commented Jul 27, 2016

Oliver's gist contains:

  • vars.txt
  • iostat.txt
  • goroutine profile
  • blocking profile

@e-dard
Copy link
Contributor

e-dard commented Jul 27, 2016

@sysadminsemantics could you also provide (ideally on a node that's in the middle of causing trouble):

curl -o heap.txt "http://localhost:8086/debug/pprof/heap?debug=1"
influx -execute "show shards" > shards.txt
influx -execute "show stats" > stats.txt
influx -execute "show diagnostics" > diagnostics.txt

You can then drag and drop all the files as an issue comment. If you're concerned about exposing stats, shards or diagnostics then you can email those to my first name as it appears on my profile page @influxdb.com.

@e-dard
Copy link
Contributor

e-dard commented Jul 27, 2016

@sysadminsemantics I just realised that the profiles in the gist are when the system is performing well? In that case we could also do with:

curl -o block.txt "http://localhost:8086/debug/pprof/block?debug=1" 
curl -o goroutine.txt "http://localhost:8086/debug/pprof/goroutine?debug=1" 
curl -o vars.txt "http://localhost:8086/debug/vars" 
iostat -xd 1 30 > iostat.txt

when it's performing badly, too 😄

@sysadminsemantics
Copy link
Author

@jwilder
Copy link
Contributor

jwilder commented Jul 27, 2016

@sysadminsemantics Something looks off with your disks. You mentioned these disks were SSDs, but the iostat data looks like a spinning disk profile. Each disk is sustaining about 110 IOPS and is fully utilized. There are also many goroutines blocked on file IO related tasks.

Are you able to test without the RAID setup?

@sysadminsemantics
Copy link
Author

sysadminsemantics commented Jul 28, 2016

@jwilder thanks for the feedback.

I was able to test SSDs in several scenarios, and I think the conclusion and morale of the story is something as stupidly simple as queue'ing. We've completely overseen that Jessie by default enables CFQ for every single device; not ideal for IO intensive scenarios.

Newer fio tests reveal:

screen shot 2016-07-28 at 14 55 11

40K IOPS per disk without even having 100% disk util; and this is us just testing random writes.

Under normal circumstances, the InfluxDB nodes seem to be doing around this:

screen shot 2016-07-28 at 15 23 16

That's only about 100 IOPS, but also only 13% io util.

Furthermore; it would seem that after changing the queue'ing from CFQ to noop, the compaction lockup we experienced from the previous days no longer occurs:
screen shot 2016-07-28 at 14 51 08

I'll still be monitoring the situation closely, but I'm hoping that just the simple queue'ing change might have solved this.

@jwilder
Copy link
Contributor

jwilder commented Jul 28, 2016

@sysadminsemantics Interesting discovery. I'm going to close this out as it appears to be an system config issue. If you run into other issues, please log a new issue.

Thanks for all the detailed data as well. Really helps to understand what is going more easily.

@jwilder jwilder closed this as completed Jul 28, 2016
@sysadminsemantics
Copy link
Author

I'm afraid it happened again over the weekend, so we cheered slightly too early. The strange thing is; when I run an iostat while influxDB is in "bad" state; the IOPS reported by iostat are indeed very low (around the 100 mark, with 60 MB / sec. throughput per drive):

screen shot 2016-08-01 at 10 35 47

If in the meanwhile; while influxDB is in this bad state, i run an fio on my own and then run the iostat, the throughput increases..

screen shot 2016-08-01 at 10 39 31

Then, after an influxDB restart, the writes that InfluxDB does do are more performant than they were before, it seems like it is barely doing anything at all:

screen shot 2016-08-01 at 10 50 39

When the nodes are in this bad state; automatically the full compaction time increases to crazy numbers, probably because it cannot complete?

screen shot 2016-08-01 at 10 54 36

I'm really interested in the full compaction theme here. Is there a way to manually run this, to see if we can reproduce the error? If not, what details can you provide on what the thresholds and / or schedules are for it to run?

@e-dard
Copy link
Contributor

e-dard commented Aug 1, 2016

I'm going to reopen this issue since previous data may still be useful in diagnosing this. /cc @jwilder

@e-dard e-dard reopened this Aug 1, 2016
@sysadminsemantics
Copy link
Author

Just an update that before going home last night, I removed both mirrored drives for the RAID1+0 array, to do some performance testing on that isolated InfluxDB nodes the next morning (mainly looking into chunk size / stripe alignment).

This morning when I came in, all nodes, with the exception of the node where I took out the mirrors, had the same issue as reported before, so software RAID is becoming more suspect.

At this point we're thinking of sharding the cluster and re-do'ing the whole set up altogether (maybe with ZFS), and maybe also with influxdb_relay on them; or on the other hand to actually give the commercial clustered version a spin.

In the meanwhile, are there any other people here that have handled production InfluxDB loads with software RAID and / or ZFS?

@sysadminsemantics
Copy link
Author

More updates. This seems to happen systematically. I've started paying attention to what is actually happening in the logs as well. I'm actually noticing that there isn't a filesize difference after it is done compacting:

Case in point:

From the influxDB logs:

[tsm1] 2016/08/03 07:58:49 beginning full compaction of group 0, 2 TSM files
[tsm1] 2016/08/03 07:58:49 compacting full group (0) /mnt/influxdb/data/influxdb/default/73/000000478-000035229.tsm (#0)
[tsm1] 2016/08/03 07:58:49 compacting full group (0) /mnt/influxdb/data/influxdb/default/73/000000478-000035230.tsm (#1)
[tsm1] 2016/08/03 07:59:10 compacted level 1 group (0) into /mnt/influxdb/data/influxdb/default/167/000000494-000000002.tsm.tmp (#0)
[tsm1] 2016/08/03 07:59:10 compacted level 1 group 0 of 2 files into 1 files in 51.076741817s

looking at the /mnt/influxdb/data/influxdb/default/73/000000478-000035229.tsm file at the time of compaction, it is the following size:
-rw-r--r-- 1 influxdb influxdb 2147802559 Aug 3 07:55 000000478-000035229.tsm

the logfiles appear to mention it writing it into a "tmp" file, which is the exact same size:
-rw-r--r-- 1 influxdb influxdb 2147802559 Aug 3 07:59 000000478-000035231.tsm.tmp

and when it has finally done compacting, and i'm assuming it overwrites the original tsm1-file with the tmp one, it is still the same size:
-rw-r--r-- 1 influxdb influxdb 2147802559 Aug 3 08:28 000000478-000035245.tsm

I can virtually reproduce this for any tsm-file on the filesystem. Which makes me question:

  • what is compaction intended to do?
  • will influxDB always attempt to compact a file, even though it cannot be compacted any further?
  • what is the mechanism for compaction?
  • is there a way to schedule compaction?

@sysadminsemantics
Copy link
Author

sysadminsemantics commented Aug 17, 2016

So, to exclude the very poor filesystem performance on my nodes, I changed the filesystem on all of my nodes to ZFS. I've created a mirrored pool, so effectively I am running the same thing as RAID1+0.

So the IOPS are loads better than what they were before:

screen shot 2016-08-17 at 11 05 03

Please note that this IO spike, however, still only occurs when the node is doing a full compaction. Which brings me to my next point, once a node is in full compaction mode, it never seems to finish.

Whereas there are nodes doing a normal compaction during regular intervals, once the full compaction happens, it seems to keep a shitload of compaction tmp files in the datadirs, and does not get rid of them unless I restart InfluxDB.

Now, this has become less of an issue now, as due to the fact that the disks during this io spike aren't at 100% ioutil anymore, I actually have headroom for metrics still to be coming in and not have the "spikey" behavior I see when disks are 100% utilised.

All of this, however, cannot be deemed default behavior. I'm nearly considering writing an event handler in my monitoring system, checking on the system / cpu / disk load, amount of compaction tmp files, to initiate a restart of InfluxDB.

What information can you provide me in the meanwhile about how compaction works? Is there any other less hacky way of ensuring that this doesn't happen?

@jwilder
Copy link
Contributor

jwilder commented Aug 17, 2016

@sysadminsemantics The tmp files are created by the compaction process and installed when the compaction complete. Once a TSM file is written, it is never modified. The tmp files are newly compacted TSM files that are not ready to be used just yet. We have various compaction stages that move data from a write optimized format into a read optimized format with higher levels of compression. A full compaction produces an optimal format for querying.

If you are restarting the server in the middle of a compaction, then the compactions will re-run the next time, but it will need to start over. A full compaction runs after a shard is cold (e.g. no writes) for 24hrs by default. These don't run frequently because they can be very CPU intensive and full compactions can take a long time due to a variety of factors. Other types of compactions run while the shard is accumulating new data.

In the issue description, you were expecting that there would be no IO or load spikes after writes have stopped. This is not the way the system works since shards are compacted and fully compacted after 24hrs. You can try lowering your [data].compact-full-write-cold-duration to 1hr to kick the process off more aggressively. If a write comes in to a shard that is fully compacted, it would be re-compacted after the cold duration threshold is exceeded. This can happen if you are writing data into the past or overwriting existing values. Doing either of those things makes compactions much more CPU intensive and slower.

@jwilder
Copy link
Contributor

jwilder commented Aug 17, 2016

I'll try to answer some of your earlier questions:

looking at the /mnt/influxdb/data/influxdb/default/73/000000478-000035229.tsm file at the time of compaction, it is the following size:
-rw-r--r-- 1 influxdb influxdb 2147802559 Aug 3 07:55 000000478-000035229.tsm

the logfiles appear to mention it writing it into a "tmp" file, which is the exact same size:
-rw-r--r-- 1 influxdb influxdb 2147802559 Aug 3 07:59 000000478-000035231.tsm.tmp

and when it has finally done compacting, and i'm assuming it overwrites the original tsm1-file with the tmp one, it is still the same size:
-rw-r--r-- 1 influxdb influxdb 2147802559 Aug 3 08:28 000000478-000035245.tsm

This is a different file than was being compacted 000000478-000035231.tsm.tmp would have produced 000000478-000035231.tsm when completed. The sizes are all 2GB because that is the maximum size of a TSM file before we split it into a second file. The contents of each of these tmp files would be the combination of files used for the compaction. Blocks within the file would be more dense, series data will be sorted and have better locality, index would be better organized and deleted data would be removed.

I can virtually reproduce this for any tsm-file on the filesystem. Which makes me question:

what is compaction intended to do?

Compactions convert series data from a write optimized, less compact format to read optimized compressed format. They also expunge deleted data and help to keep larger shards able to be loaded quickly.

will influxDB always attempt to compact a file, even though it cannot be compacted any further?

No. Once a shard is fully compacted and no new writes to the shard occur. It will not be re-compacted.

what is the mechanism for compaction?

I don't understand the question.

is there a way to schedule compaction?

No, they happen automatically to keep the shard both queryable and writeable. You can lower some of the thresholds used to trigger snapshot and full compactions, but I would not recommend changing the defaults without a good reason.

jwilder added a commit that referenced this issue Sep 3, 2016
The full compaction planner could return a plan that only included
one generation.  If this happened, a full compaction would run on that
generation producing just one generation again.  The planner would then
repeat the plan.

This could happen if there were two generations that were both over
the max TSM file size and the second one happened to be in level 3 or
lower.

When this situation occurs, one cpu is pegged running a full compaction
continuously and the disks become very busy basically rewriting the
same files over and over again.  This can eventually cause disk and CPU
saturation if it occurs with more than one shard.

Fixes #7074
@jwilder jwilder self-assigned this Sep 3, 2016
@jwilder jwilder added this to the 1.0.0 milestone Sep 6, 2016
@jwilder
Copy link
Contributor

jwilder commented Sep 6, 2016

Should be fixed via #7258

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants