Limit number of concurrent compactions #8276
Comments
Can you attach some profiles from when this is occurring? Also, what version are you running?
Hi,
Here is a copy of (part of) the log, which shows that very many compactions are started at the same time. This can easily be reproduced by:
Influx will then start a full compaction of all 100 shards at once. Each compaction may take a few minutes, so they do not have time to complete before others are started, and the system will start thrashing (100% utilization of the swap media) or you will get an OOM.
It looks to me like #7142 would solve this in a more generic way?
What version are you running?
InfluxDB starting, version 1.3.0~n201704010800, branch master, commit 15e594f
In our use case we are doing a backfill into InfluxDB with data from 2008 to 2017. Each file we import might have data for a day or a month, and we are importing them in ascending order (more or less). In some cases we have to import some of the data twice. The shard duration is 1 week, but we do have multiple databases. I guess it is this backfill, combined with a restart of Influx before compaction is done, that causes the large number of compactions to start at the same time.
@hpbieker For that range of time that you are backfilling, you may want to bump up the shard duration to greater than 1 week to reduce the number of shards you have. 1-3 months might be better. If that data is never going to be removed, even as high as 1 year would be good. It sounds like you may have sparse data as well. The FAQ has other suggestions about config and schema design. I've run into the many-compactions issue you are seeing due to the server being restarted frequently and compactions never completing. We do need to handle this case better when there are many shards.
@jwilder, thank you for your suggestion. I agree that increasing the shard duration to at least 4 weeks will be beneficial for read performance when the time window is > 1 week. However, it will result in large shards that will have to be recompacted from time to time.
We ran into the same problem yesterday when adding one point to each shard in a database using a script. It triggered the compaction of all shards at the same time. It looks like it fails with OOM if I have more than ~10 full compactions running at the same time. What is consuming memory in the full compactions? It looks like the memory usage increases as the compaction progresses, but shouldn't Influx free the memory as it is done with parts of the files? Or does it somehow wait to free some of the memory until a compaction is completed? I guess I will have the same problem if I do a drop measurement XX, because that will trigger a full compaction of all the shards at the same time.
In order to limit the number of compactions, I think the code below in engine.go should be modified to a) start a limited number of goroutines or b) add some code to throttle the number of goroutines running. The number should be configurable. Any suggestions on the best way to do this?
// Apply concurrently compacts all the groups in a compaction strategy.
}
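The snippet above is truncated, so the following is only a minimal sketch of option (b): capping the number of compaction goroutines with a buffered channel used as a counting semaphore. The names compactionGroup, compact, and applyLimited are illustrative stand-ins, not the actual engine.go API.

```go
// Minimal sketch, assuming placeholder types: throttle compaction goroutines
// with a buffered channel acting as a counting semaphore.
package main

import (
	"fmt"
	"sync"
)

type compactionGroup []string // placeholder: a set of TSM files to compact together

func compact(g compactionGroup) {
	// Placeholder for the real compaction work.
	fmt.Println("compacting", g)
}

// applyLimited compacts every group, but never runs more than limit at once.
func applyLimited(groups []compactionGroup, limit int) {
	sem := make(chan struct{}, limit) // counting semaphore with `limit` slots
	var wg sync.WaitGroup

	for _, g := range groups {
		wg.Add(1)
		sem <- struct{}{} // blocks until a slot is free
		go func(g compactionGroup) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot when this compaction finishes
			compact(g)
		}(g)
	}
	wg.Wait()
}

func main() {
	groups := []compactionGroup{
		{"shard1.tsm"}, {"shard2.tsm"}, {"shard3.tsm"}, {"shard4.tsm"},
	}
	applyLimited(groups, 2) // at most 2 compactions in flight
}
```

With this pattern the limit can simply be read from configuration instead of being hard-coded, which covers the "should be configurable" requirement above.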
Fixed via #8348
@hpbieker Great! Do you have a graph of goroutines by chance? It's available in _internal. They should drop significantly as well once the shards go cold and are recompacted. |
Hi again @jwilder, I think the first graph below confirms that; we upgraded at 9:00 today :-) The second graph is the same graph, but only with data after the upgrade.
I do not know if it is related to this commit or an earlier one, but it looks like the current version uses a bit more memory than the previous version. The memory consumption now ends up at around 38.4 GB, but it used to be 32.3 GB. As you can see from the graphs, I get a significant jump in various measurements at around 09:00, when we did the upgrade.
@hpbieker Would you be able to grab a heap profile? |
#8370 is merged and in current nightlies. |
@jwilder, we did a restart on 16th May at 12:00. I am not sure how to read this, but I see that the heap in use has been reduced, while heap alloc has increased. The number of objects has also increased. Should I provide some more stats? May 16th 2017, 12:05:11.000 INFO - InfluxDB starting, version 1.3.0-n201705150800, branch master, commit 3b70086 - influx-log
@hpbieker Are you running into issues with the current build, or just noting the difference here? If you can grab a heap snapshot, that might be useful.
Proposal:
I would like InfluxDB to have a configuration setting to limit the number of concurrent compactions.
Current behavior:
InfluxDB starts as many compactions concurrently as it can. As far as I understand, it is not limited by GOMAXPROCS.
Desired behavior:
The user can limit the number of compactions running at the same time.
Use case:
On my system I had 170 full compactions running concurrently after a restart of InfluxDB, as all of them are basically started at the same time. This caused the machine to run out of memory and start swapping, and thus it became useless. It might also cause the machine to run out of disk space.
I have 16 cores and 64 GB RAM, 1,500 shards distributed across ~10 databases, and 1.7 TB of data.
By limiting the number of concurrent compactions, these situations can be prevented.
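For reference, the fix mentioned above (#8348) added exactly this kind of knob. To my knowledge the setting is max-concurrent-compactions in the [data] section of influxdb.conf; the snippet below is a sketch of how one might cap compactions on a machine like the one described, and the exact name and default semantics should be verified against your version's documentation.

```toml
[data]
  # Cap the number of full and level compactions that may run at once.
  # A value of 0 typically derives the limit from the number of CPU cores;
  # here it is pinned explicitly to keep memory and I/O pressure bounded.
  max-concurrent-compactions = 4
```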