Influxdb crashes after 2 hours #7640
At the time of the crash, there were 114k goroutines in the system:
Roughly half of them were stuck at the same place:
@rubycut Can you attach the full panic? The two links provided cut off the actual cause of the panic. Can you also run the following and attach to the issue:
diagnostics:
@rubycut Can you attach
@jwilder ,
@jwilder , I attached the full panic. There is nothing important before the panic; influxdb was writing, business as usual:
pprof is not working. I am using stock influxdb 1.1.0 downloaded from your site; do I need to recompile or do anything else to turn pprof on?
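For background, influxd is a Go program, and Go services typically expose goroutine and heap dumps through the standard net/http/pprof package. Below is a minimal, generic sketch of that mechanism; the port and wiring here are illustrative only, not InfluxDB's actual setup, which depends on how the binary was built and configured:

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on the default mux
)

func main() {
	// With this running, a full goroutine dump can be fetched with:
	//   curl "http://localhost:6060/debug/pprof/goroutine?debug=2"
	// and a heap profile with:
	//   curl -o heap.out "http://localhost:6060/debug/pprof/heap"
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```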
@rubycut Can you attach the full trace? That tells me what the panic was, but not where it occurred or what else was going on at the time. For your shards, I'd recommend using a larger shard group duration. For a four-month retention period, a shard group duration of 1 month would significantly reduce the number of shards and goroutines running on your system. The number of databases compounds the problem as well. The shards don't look very big, so smaller shard durations are likely just creating overhead for little benefit. It looks like you may be using a custom kernel as well, which might be having problems with the number of goroutines/shards/etc. or other dependent system resources associated with them.
You may need to set:
@jwilder , when influxdb crashes, it writes its state in the log; that is what I already sent you. I don't know any other way to obtain a full trace.
@rubycut The two dropbox links only have the end of the trace. The beginning part is cut off which doesn't indicate what the panic was or where it occurred. This is all I see:
Oh man, Dropbox sucks. Here is the gist of the trace: https://gist.github.com/rubycut/d35daf01887b5d23ddd4571c500c8e64
Since we changed shardGroupDuration from one day to two weeks, the number of goroutines is going down; we are at 111k goroutines and they go down by about 1k each day. Today, influxdb ran for 4.5 hours without going down.
But we are observing strange memory behavior compared with influxdb version 0.13. The blue line is influxdb 1.1.0, while the green is influxdb 0.13; both servers are running the same load. As you can see, influxdb 1.1.0 has a constantly rising memory footprint, and I suspect that memory could be the cause of the crash, although the server doesn't seem to die from the OOM killer. The sudden drop in the blue line every couple of hours is when influxdb crashes.
This is after running for 50 minutes: goroutine.txt: https://gist.github.com/995cd72876f8638447fab50af25995d5
@jwilder , I added all the info you required. I am standing by if you need anything else.
This is prior to crash: https://gist.github.com/rubycut/a30c2351664424189e312e168154629e
Here is the pprof heap map as it is getting full: https://gist.github.com/b036527bc1163a7f52b0de6fa7fd2a6b
@jwilder , it seems that defaultWaitingWALWrites = 10 could be a bottleneck; look at https://github.com/influxdata/influxdb/blob/v1.1.0/tsdb/engine/tsm1/wal.go#L71. Is it safe for me to increase this value and recompile?
@rubycut Increasing it could cause higher memory usage. You may want to try lowering it since you have a lot of databases and shards.
I can't get it to work; whether I reduce it to 5 or increase it, memory always grows, as visible in this heatmap.
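For background on what defaultWaitingWALWrites bounds: a fixed limiter like this is commonly implemented in Go as a buffered channel used as a counting semaphore, so at most N writers proceed at once and the rest block. The sketch below only illustrates that pattern and is not the actual limiter code from tsm1/wal.go:

```go
package main

import (
	"fmt"
	"sync"
)

// fixedLimiter allows at most cap(l) goroutines to hold a slot at once.
type fixedLimiter chan struct{}

func newFixedLimiter(n int) fixedLimiter { return make(fixedLimiter, n) }

// Take blocks until a slot is free.
func (l fixedLimiter) Take() { l <- struct{}{} }

// Release frees a slot.
func (l fixedLimiter) Release() { <-l }

func main() {
	limiter := newFixedLimiter(10) // analogous to defaultWaitingWALWrites = 10

	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			limiter.Take() // excess writers queue here instead of all writing at once
			defer limiter.Release()
			fmt.Println("writing WAL data for writer", id)
		}(i)
	}
	wg.Wait()
}
```

Raising the limit lets more writers hold WAL buffers simultaneously (more memory), while lowering it makes writers queue sooner, which matches the trade-off described above.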
I brought memory under control. Still, it crashes every two hours with:
@rubycut I'm not sure what to make of that. That call trace is what happens at startup when loading shards. I'm not sure how you would see that just before it crashed. BTW, how did you generate that image? What tool is that?
A big pool can lead to huge memory usage under certain loads. See influxdata#7640 for detailed discussion.
@jwilder , you are right, this could already be startup since the instance is automatically restarted by supervisor. The tool is relatively new: https://stackimpact.com/
@jwilder I've been digging into the memory usage. At least in this 500-database config, the size of the requested buffers varies quite a bit, resulting in a large amount of reallocating, which prevents the pool from doing a good job. I've been experimenting with making 4 different pools, each with a different buffer size. All buffers in a pool are allocated with the same capacity, so there isn't a need for reallocating. After analyzing the requested buffers, here are the pool sizes that work well for this test config.
I see a handful of 11MB and 13MB requests, so I figured XL can just use whatever they requested. Under the load test for 40 minutes I'm only seeing 151MB being retained by the pools in total. Most of the activity is in the small and large pools, at about 50-100 requests per second each. The XL pool only has 1 item and has only been accessed 189 times in 40 minutes. I've thought about not pooling the XL requests, but it is only using up 13MB of RAM and it prevents heap fragmentation by not having to reallocate that same block. The medium pool got used a lot at startup time and then leveled off. I think that size must get used while reading in the cacheloader. Here are the pool stats after running for about 40 minutes under load:
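A minimal sketch of the bucketed-pool idea described above, assuming one sync.Pool per size class with every buffer in a class allocated at that class's fixed capacity, so reuse never requires reallocating. The class names and sizes are illustrative guesses, not the values from the actual WIP commit:

```go
package pool

import "sync"

// Illustrative size classes; the real experiment derived these from observed request sizes.
var classes = [4]int{
	64 * 1024,        // small
	512 * 1024,       // medium
	1 << 20,          // large
	16 * 1024 * 1024, // XL
}

var pools [4]sync.Pool

func init() {
	for i := range pools {
		size := classes[i] // fixed capacity for this class
		pools[i].New = func() interface{} { return make([]byte, 0, size) }
	}
}

// Get returns an empty buffer with at least sz capacity from the smallest matching class.
func Get(sz int) []byte {
	for i, size := range classes {
		if sz <= size {
			return pools[i].Get().([]byte)[:0]
		}
	}
	// Bigger than every class: allocate exactly what was asked for, unpooled.
	return make([]byte, 0, sz)
}

// Put returns a buffer to the pool matching its capacity class, if any.
func Put(b []byte) {
	for i, size := range classes {
		if cap(b) == size {
			pools[i].Put(b[:0])
			return
		}
	}
	// Buffers of unpooled sizes are dropped and left to the GC.
}
```

Sizing each class to the capacities actually requested is what avoids the capacity-check reallocation churn discussed later in the thread.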
@allenpetersen Interesting. Would you be able to try reverting the pool back to a `sync.Pool`-based implementation?
I was looking at that code and I agree that sync.Pool might not be the best fit because the pools can get cleared often. The thrashing I was seeing with bufPool is gone with the bucketed sizes. Here is the WIP commit; there is a lot of logging code I've been using to monitor the pools: https://github.com/allenpetersen/influxdb/commit/cf76c3ee4f14055a4bb1a08dace9d3879f6c2721 I'm still looking into what seems like a heap fragmentation issue. Linux is reporting 30GB virtual and 11GB resident, and there is 16GB of HeapReleased RAM.
@jwilder I quickly swapped out the pool implementation: https://github.com/allenpetersen/influxdb/commit/dd71b23513e42edd3ff25cfb918faac1491768a0 I'll get exact numbers here in a bit, but for example, with the older code it took almost 2 minutes to finish the cacheloader stage of startup. With the new code, Linux is only reporting 10GB virtual and 4.4GB resident, each about 1/3rd of the older code.
@jwilder After 70 minutes with the bucketed pools, here are the results:
I'll do a similar run with 1.1.0 now.
Here is a comparison of 1.1.0 to master+sync.Pool buckets
To be honest, I don't understand the 1.9GB of HeapReleased; my guess is something like a fragmented heap, so there are no full pages the kernel can take back.
@jwilder I ran a few more tests and using fixed-size sync.Pools worked best. Unfortunately there is a lot of variability between the test runs. I want to put together a repeatable workload.
@rubycut @allenpetersen there are a couple of problems with the current pool in `pkg/pool`. @rubycut's fix in #7683 will help with memory management for systems under load with occasional large buffer requirements, but the cost is increased allocations due to more contention over the now smaller pool. @allenpetersen we originally used `sync.Pool`. I've been working on a new pool implementation in a separate branch.
@e-dard What I noticed while trying to simulate the workload for this issue was a very high number of Get() requests failing the capacity check: https://github.com/influxdata/influxdb/blob/master/pkg/pool/bytes.go#L28
There were 200-400 calls to Bytes.Get() per second. Half of those requests were for small buffers of various sizes under 300KB. The other half were for a single size of 1048576 bytes. In a very short period of time, buffers in the pool channel were essentially randomly sized, resulting in a lot of reallocations when the capacity check failed. Looking back at the sync.Pool implementation from September, there was similar logic. My guess is the problem was more about discarding these pooled buffers when the capacity check fails rather than a GC emptying the pool. Maintaining a few pools of specific-sized buffers removed this thrashing and the allocations went way down. Actually, the total memory in all the pools went down quite a bit as well. I've been busy this week and haven't been able to simplify my test procedure. I've looked at stress and stress/v2, but this workload is rather specific and I think it will be easier to just write a simple program to generate the load. Let me know when your new branch is ready; I'm happy to test it.
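To make the thrashing concrete, here is a simplified sketch of a channel-backed byte pool with the kind of capacity check being discussed: a pooled buffer that is smaller than the request gets discarded and reallocated, so a pool full of randomly sized buffers turns into constant reallocation. This illustrates the pattern only and is not a copy of the actual pkg/pool/bytes.go code:

```go
package pool

// Bytes is a bounded pool of byte slices backed by a channel.
type Bytes struct {
	pool chan []byte
}

// NewBytes creates a pool that retains at most max buffers.
func NewBytes(max int) *Bytes {
	return &Bytes{pool: make(chan []byte, max)}
}

// Get returns a buffer of length sz, reusing a pooled buffer when it is big enough.
func (p *Bytes) Get(sz int) []byte {
	var b []byte
	select {
	case b = <-p.pool:
	default:
		// Pool empty: allocate a fresh buffer.
		return make([]byte, sz)
	}

	// Capacity check: a pooled buffer that is too small is simply discarded
	// and a new one allocated. With a mix of request sizes, this branch is
	// hit constantly, which is the reallocation churn described above.
	if cap(b) < sz {
		return make([]byte, sz)
	}
	return b[:sz]
}

// Put returns a buffer to the pool, dropping it when the pool is already full.
func (p *Bytes) Put(b []byte) {
	select {
	case p.pool <- b:
	default:
	}
}
```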
@e-dard Here is the latest version of my experiment with pools of specific sizes: https://github.com/allenpetersen/influxdb/commit/7a86d160ac2bba3e97c2ddcdf14c7e108eddac6d [edit] updated to latest commit
@e-dard, @allenpetersen , reducing the pool size with my patch solved the memory issues for us, but if you want us to test something, I am available.
@jwilder , after I changed the shard group duration, the number of shards has been decreasing little by little every day over the last two weeks. Now, sudden crashes of influxdb are not happening as often. It's quite possible that once thousands of these 24-hour shards expire, influxdb will not be crashing anymore.
Upgrading to influx 1.2.0-rc1 fixed the memory leak issue we were having. We've also dropped ~250 subscriptions, which brought the number of goroutines down to a reasonable level. However, Influx still crashes after 90-180 minutes of uptime and it seems it's not related to usage (write queries vs mixed queries).
@rubycut Has stability improved since lowering the number of shards?
@jwilder , we still have crashes. We are running 1.2.0-rc1 and my colleague @weshmashian is running tests and trying various things. The number of shards has decreased significantly since we switched to 14-day shard groups in mid-December, but crashes still happen, usually after a few hours; the longest period we were able to run without a crash is 48 hours. @weshmashian can provide details.
We're actually running three versions - 1.1.0, 1.2.0-rc1 and 1.2.0. So far rc1 has had no crashes in the past three days. We're trying to get the same behavior on 1.2.0 (matching configs and using the same dataset), but so far without success. We've also started getting the following panic on one 1.2.0-rc1 instance:
Full version: 1.2.0~rc1, branch master, commit bb029b5
@weshmashian how often does that panic occur? Has it happened on the final 1.2.0 release?
@e-dard it happened on final 1.2.0 as well:
I've synced a known good datadir and started up rc1 on it. It's not panicking any more, but it's still too early to tell if it's going to crash. Updated the panic in the previous comment as I only now realized it was missing the first line.
@weshmashian thanks. We need to open a new issue for this panic.
Should be fixed via #8348
My influx version is 1.5.2-1. My influx instance goes down every hour or so, and this started happening just 2-3 weeks ago. Attached some files for details.
Bug report
System info:
Influxdb 1.1.0
OS: Debian
Steps to reproduce:
Expected behavior:
Should run normally
Actual behavior:
Crashes after two hours