Influxdb crashes after 2 hours #7640
At the time of the crash, there were 114k goroutines in the system:
Roughly half of them were stuck at the same place:
@rubycut Can you attach the full panic? The two links provided cut off the actual cause of the panic. Can you also run the following and attach to the issue:
diagnostics:
@rubycut Can you attach
@jwilder ,
@jwilder , I attached the full panic. There is nothing important before the panic; influxdb was writing, business as usual:
pprof is not working. I am using stock influxdb 1.1.0 downloaded from your site; do I need to recompile or do anything else to turn pprof on?
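For background, influxd is a Go program, and Go services typically expose goroutine and heap dumps through the standard net/http/pprof package. Below is a minimal, generic sketch of that mechanism; the port and wiring here are illustrative only, not InfluxDB's actual setup, which depends on how the binary was built and configured:

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on the default mux
)

func main() {
	// With this running, a full goroutine dump can be fetched with:
	//   curl "http://localhost:6060/debug/pprof/goroutine?debug=2"
	// and a heap profile with:
	//   curl -o heap.out "http://localhost:6060/debug/pprof/heap"
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```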
@rubycut Can you attach the full trace? That tells me what the panic was, but not where it occurred or what else was going on at the time. For your shards, I'd recommend using a larger shard group duration. For a four-month retention period, a shard group duration of 1 month would significantly reduce the number of shards and goroutines running on your system. The number of databases compounds the problem as well. The shards don't look very big, so smaller shard durations are likely just creating overhead for little benefit. It looks like you may be using a custom kernel as well, which might be having problems with the number of goroutines/shards/etc. or other dependent system resources associated with them.
You may need to set:
@jwilder , when influxdb crashes, it writes its state in the log; that is what I already sent you. I don't know any other way to obtain a full trace.
@rubycut The two dropbox links only have the end of the trace. The beginning part is cut off which doesn't indicate what the panic was or where it occurred. This is all I see:
Oh man, Dropbox sucks. Here is the gist of the trace: https://gist.github.com/rubycut/d35daf01887b5d23ddd4571c500c8e64
Since we changed shardGroupDuration from one day to two weeks, the number of goroutines is going down; we are at 111k goroutines and they go down by about 1k each day. Today, influxdb ran for 4.5 hours without going down.
But we are observing strange memory behavior compared with influxdb version 0.13. The blue line is influxdb 1.1.0, while the green is influxdb 0.13; both servers are running the same load. As you can see, influxdb 1.1.0 has a constantly rising memory footprint, and I suspect that memory could be the cause of the crash, although the server doesn't seem to die from the OOM killer. The sudden drop in the blue line every couple of hours is when influxdb crashes.
This is after running for 50 minutes: goroutine.txt: https://gist.github.com/995cd72876f8638447fab50af25995d5
@jwilder , I added all the info you required. I am standing by if you need anything else.
This is prior to crash: https://gist.github.com/rubycut/a30c2351664424189e312e168154629e
Here is the pprof heap map as it is getting full: https://gist.github.com/b036527bc1163a7f52b0de6fa7fd2a6b
@jwilder , it seems that defaultWaitingWALWrites = 10 could be a bottleneck; look at https://github.com/influxdata/influxdb/blob/v1.1.0/tsdb/engine/tsm1/wal.go#L71. Is it safe for me to increase this value and recompile?
@rubycut Increasing it could cause higher memory usage. You may want to try lowering it since you have a lot of databases and shards.
I can't get it to work; whether I reduce it to 5 or increase it, memory always grows, as visible in this heatmap.
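For background on what defaultWaitingWALWrites bounds: a fixed limiter like this is commonly implemented in Go as a buffered channel used as a counting semaphore, so at most N writers proceed at once and the rest block. The sketch below only illustrates that pattern and is not the actual limiter code from tsm1/wal.go:

```go
package main

import (
	"fmt"
	"sync"
)

// fixedLimiter allows at most cap(l) goroutines to hold a slot at once.
type fixedLimiter chan struct{}

func newFixedLimiter(n int) fixedLimiter { return make(fixedLimiter, n) }

// Take blocks until a slot is free.
func (l fixedLimiter) Take() { l <- struct{}{} }

// Release frees a slot.
func (l fixedLimiter) Release() { <-l }

func main() {
	limiter := newFixedLimiter(10) // analogous to defaultWaitingWALWrites = 10

	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			limiter.Take() // excess writers queue here instead of all writing at once
			defer limiter.Release()
			fmt.Println("writing WAL data for writer", id)
		}(i)
	}
	wg.Wait()
}
```

Raising the limit lets more writers hold WAL buffers simultaneously (more memory), while lowering it makes writers queue sooner, which matches the trade-off described above.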
I brought memory under control. Still, it crashes every two hours with:
@rubycut I'm not sure what to make of that. That call trace is what happens at startup when loading shards. I'm not sure how you would see that just before it crashed. BTW, how did you generate that image? What tool is that?
A big pool can lead to huge memory usage under certain loads. See influxdata#7640 for detailed discussion.
@jwilder , you are right, this could already be startup since the instance is automatically restarted by supervisor. The tool is relatively new: https://stackimpact.com/
@jwilder I've been digging into the memory usage. At least in this 500-database config, the size of the requested buffers varies quite a bit, resulting in a large amount of reallocating, which prevents the pool from doing a good job. I've been experimenting with making 4 different pools, each with a different buffer size. All buffers in a pool are allocated with the same capacity, so there isn't a need for reallocating. After analyzing the requested buffers, here are the pool sizes that work well for this test config.
I see a handful of 11MB and 13MB requests, so I figured XL can just use whatever they requested. Under the load test for 40 minutes I'm only seeing 151MB being retained by the pools in total. Most of the activity is in the small and large pools, at about 50-100 requests per second each. The XL pool only has 1 item and has only been accessed 189 times in 40 minutes. I've thought about not pooling the XL requests, but it is only using up 13MB of RAM and it prevents heap fragmentation by not having to reallocate that same block. The medium pool got used a lot at startup time and then leveled off. I think that size must get used while reading in the cacheloader. Here are the pool stats after running for about 40 minutes under load:
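A minimal sketch of the bucketed-pool idea described above, assuming one sync.Pool per size class with every buffer in a class allocated at that class's fixed capacity, so reuse never requires reallocating. The class names and sizes are illustrative guesses, not the values from the actual WIP commit:

```go
package pool

import "sync"

// Illustrative size classes; the real experiment derived these from observed request sizes.
var classes = [4]int{
	64 * 1024,        // small
	512 * 1024,       // medium
	1 << 20,          // large
	16 * 1024 * 1024, // XL
}

var pools [4]sync.Pool

func init() {
	for i := range pools {
		size := classes[i] // fixed capacity for this class
		pools[i].New = func() interface{} { return make([]byte, 0, size) }
	}
}

// Get returns an empty buffer with at least sz capacity from the smallest matching class.
func Get(sz int) []byte {
	for i, size := range classes {
		if sz <= size {
			return pools[i].Get().([]byte)[:0]
		}
	}
	// Bigger than every class: allocate exactly what was asked for, unpooled.
	return make([]byte, 0, sz)
}

// Put returns a buffer to the pool matching its capacity class, if any.
func Put(b []byte) {
	for i, size := range classes {
		if cap(b) == size {
			pools[i].Put(b[:0])
			return
		}
	}
	// Buffers of unpooled sizes are dropped and left to the GC.
}
```

Sizing each class to the capacities actually requested is what avoids the capacity-check reallocation churn discussed later in the thread.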
@allenpetersen Interesting. Would you be able to try reverting the pool back to a `sync.Pool`-based implementation?
I was looking at that code and I agree that sync.Pool might not be the best fit because the pools can get cleared often. The thrashing I was seeing with bufPool is gone with the bucketed sizes. Here is the WIP commit; there is a lot of logging code I've been using to monitor the pools: https://github.com/allenpetersen/influxdb/commit/cf76c3ee4f14055a4bb1a08dace9d3879f6c2721 I'm still looking into what seems like a heap fragmentation issue. Linux is reporting 30GB virtual and 11GB resident, and there is 16GB of HeapReleased RAM.
@jwilder I quickly swapped out the pool implementation: https://github.com/allenpetersen/influxdb/commit/dd71b23513e42edd3ff25cfb918faac1491768a0 I'll get exact numbers here in a bit, but for example, with the older code it took almost 2 minutes to finish the cacheloader stage of startup. With the new code, Linux is only reporting 10GB virtual and 4.4GB resident, each about 1/3rd of the older code.
@jwilder After 70 minutes with the bucketed pools, here are the results:
I'll do a similar run with 1.1.0 now.
Here is a comparison of 1.1.0 to master+sync.Pool buckets
To be honest, I don't understand the 1.9GB of HeapReleased; my guess is something like a fragmented heap, so there are no full pages the kernel can take back.
@jwilder I ran a few more tests and using fixed-size sync.Pools worked best. Unfortunately there is a lot of variability between the test runs. I want to put together a repeatable workload.
@rubycut @allenpetersen there are a couple of problems with the current pool in `pkg/pool`. @rubycut's fix in #7683 will help with memory management for systems under load with occasional large buffer requirements, but the cost is increased allocations due to more contention over the now smaller pool. @allenpetersen we originally used `sync.Pool`. I've been working on a new pool implementation in a separate branch.
@e-dard What I noticed while trying to simulate the workload for this issue was a very high number of Get() requests failing the capacity check: https://github.com/influxdata/influxdb/blob/master/pkg/pool/bytes.go#L28
There were 200-400 calls to Bytes.Get() per second. Half of those requests were for small buffers of various sizes under 300KB. The other half were for a single size of 1048576 bytes. In a very short period of time, buffers in the pool channel were essentially randomly sized, resulting in a lot of reallocations when the capacity check failed. Looking back at the sync.Pool implementation from September, there was similar logic. My guess is the problem was more about discarding these pooled buffers when the capacity check fails rather than a GC emptying the pool. Maintaining a few pools of specific-sized buffers removed this thrashing and the allocations went way down. Actually, the total memory in all the pools went down quite a bit as well. I've been busy this week and haven't been able to simplify my test procedure. I've looked at stress and stress/v2, but this workload is rather specific and I think it will be easier to just write a simple program to generate the load. Let me know when your new branch is ready; I'm happy to test it.
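To make the thrashing concrete, here is a simplified sketch of a channel-backed byte pool with the kind of capacity check being discussed: a pooled buffer that is smaller than the request gets discarded and reallocated, so a pool full of randomly sized buffers turns into constant reallocation. This illustrates the pattern only and is not a copy of the actual pkg/pool/bytes.go code:

```go
package pool

// Bytes is a bounded pool of byte slices backed by a channel.
type Bytes struct {
	pool chan []byte
}

// NewBytes creates a pool that retains at most max buffers.
func NewBytes(max int) *Bytes {
	return &Bytes{pool: make(chan []byte, max)}
}

// Get returns a buffer of length sz, reusing a pooled buffer when it is big enough.
func (p *Bytes) Get(sz int) []byte {
	var b []byte
	select {
	case b = <-p.pool:
	default:
		// Pool empty: allocate a fresh buffer.
		return make([]byte, sz)
	}

	// Capacity check: a pooled buffer that is too small is simply discarded
	// and a new one allocated. With a mix of request sizes, this branch is
	// hit constantly, which is the reallocation churn described above.
	if cap(b) < sz {
		return make([]byte, sz)
	}
	return b[:sz]
}

// Put returns a buffer to the pool, dropping it when the pool is already full.
func (p *Bytes) Put(b []byte) {
	select {
	case p.pool <- b:
	default:
	}
}
```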
@e-dard Here is the latest version of my experiment with pools of specific sizes: https://github.com/allenpetersen/influxdb/commit/7a86d160ac2bba3e97c2ddcdf14c7e108eddac6d [edit] updated to latest commit
@e-dard, @allenpetersen , reducing the pool size with my patch solved the memory issues for us, but if you want us to test something, I am available.
@jwilder , after I changed the shard group duration, the number of shards has been decreasing little by little every day over the last two weeks. Now, sudden crashes of influxdb are not happening as often. It's quite possible that once thousands of these 24-hour shards expire, influxdb will not be crashing anymore.
Upgrading to influx 1.2.0-rc1 fixed the memory leak issue we were having. We've also dropped ~250 subscriptions, which brought the number of goroutines down to a reasonable level. However, Influx still crashes after 90-180 minutes of uptime and it seems it's not related to usage (write queries vs mixed queries).
@rubycut Has stability improved since lowering the number of shards?
@jwilder , we still have crashes. We are running 1.2.0-rc1 and my colleague @weshmashian is running tests and trying various things. The number of shards has decreased significantly since we switched to 14-day shard groups in mid-December, but crashes still happen, usually after a few hours; the longest period we were able to run without a crash is 48 hours. @weshmashian can provide details.
We're actually running three versions - 1.1.0, 1.2.0-rc1 and 1.2.0. So far rc1 has had no crashes in the past three days. We're trying to get the same behavior on 1.2.0 (matching configs and using the same dataset), but so far without success. We've also started getting the following panic on one 1.2.0-rc1 instance:
Full version: 1.2.0~rc1, branch master, commit bb029b5
@weshmashian how often does that panic occur? Has it happened on the final 1.2.0 release?
@e-dard it happened on final 1.2.0 as well:
I've synced a known good datadir and started up rc1 on it. It's not panicking any more, but it's still too early to tell if it's going to crash. Updated the panic in the previous comment as I only now realized it was missing the first line.
@weshmashian thanks. We need to open a new issue for this panic.
Should be fixed via #8348
My influx version is 1.5.2-1. My influx instance goes down every hour or so, and this started happening just 2-3 weeks ago. Attached some files for details.
Bug report
System info:
Influxdb 1.1.0
OS: Debian
Steps to reproduce:
Expected behavior:
Should run normally
Actual behavior:
Crashes after two hours