Influxdb crashes after 2 hours #7640

Closed

rubycut opened this issue Nov 18, 2016 · 60 comments

rubycut commented Nov 18, 2016

Bug report

System info:
InfluxDB 1.1.0
OS: Debian

Steps to reproduce:

  1. Start InfluxDB
  2. Create 500 different databases with 500 users
  3. Enable authentication

Expected behavior:

InfluxDB should run normally.

Actual behavior:

InfluxDB crashes after two hours.

rubycut commented Nov 18, 2016

At the time of the crash, there were about 114k goroutines in the system:

grep "goroutine " error2.log|wc -l
114551

Roughly half of them were stuck at the same place:

grep engine.go:979 error2.log|wc -l
56118
goroutine 19985 [sleep]:
time.Sleep(0x3b9aca00)
    /usr/local/go/src/runtime/time.go:59 +0xe1
github.com/influxdata/influxdb/tsdb/engine/tsm1.(*Engine).compactTSMLevel(0xc429826100, 0xb74601, 0x1, 0xc428dd2300)
    /root/go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/engine.go:979 +0x90

jwilder commented Nov 18, 2016

@rubycut Can you attach the full panic? The two links provided cut off the actual cause of the panic.

Can you also run the following and attach the output to the issue:

curl -o block.txt "http://localhost:8086/debug/pprof/block?debug=1" 
curl -o goroutine.txt "http://localhost:8086/debug/pprof/goroutine?debug=1" 
curl -o heap.txt "http://localhost:8086/debug/pprof/heap?debug=1" 
curl -o vars.txt "http://localhost:8086/debug/vars" 
iostat -xd 1 30 > iostat.txt
uname -a > uname.txt
influx -execute "show shards" > shards.txt
influx -execute "show stats" > stats.txt
influx -execute "show diagnostics" > diagnostics.txt

rubycut commented Nov 18, 2016

uname -a
Linux dedxxxx 3.10.95xx #30 SMP Fri Jan 29 11:54:46 CST 2016 x86_64 GNU/Linux

rubycut commented Nov 18, 2016

iostat.txt

rubycut commented Nov 18, 2016

shards.txt

rubycut commented Nov 18, 2016

diagnostics:

name: build
-----------
Branch  Build Time      Commit                                          Version
master                  800da5732b91c816b0a097acf8887fa2af1efa1a        1.1.0


name: network
-------------
hostname
dedxxxx


name: runtime
-------------
GOARCH  GOMAXPROCS      GOOS    version
amd64   24              linux   go1.7.3


name: system
------------
PID     currentTime                     started                         uptime
5324    2016-11-18T20:27:10.806259576Z  2016-11-18T20:01:05.251612142Z  26m5.554647524s

jwilder commented Nov 18, 2016

@rubycut Can you attach the output of show retention policies as well? The number of goroutines is high because it looks like you have 1-day shard group durations for all your shards but a fairly large retention range. You also have a large number of databases.

rubycut commented Nov 18, 2016

@jwilder , show retention policies doesn't work on its own; a database must be selected first. The retention policy is the same in all databases; it looks like this:

> show retention policies
name            duration        shardGroupDuration      replicaN        default
default         0s              168h0m0s                1               false
four_months     3000h0m0s       24h0m0s                 1               true

rubycut commented Nov 18, 2016

@jwilder , I attached the full panic. There is nothing important before the panic; InfluxDB was writing, business as usual:

[httpd] <ip_hidden> - c3941_telegraf [18/Nov/2016:12:16:03 +0000] "POST /write?consistency=&db=c3941&precision=ns&rp= HTTP/1.1" 204 0 "-" "InfluxDBClient" c6823a4f-ad88-11e6-9ab6-000000000000 98278
[httpd] <ip_hidden> - c3503_telegraf [18/Nov/2016:12:16:04 +0000] "GET /query?db=&q=CREATE+DATABASE+IF+NOT+EXISTS+%22c3503%22 HTTP/1.1" 400 94 "-" "InfluxDBClient" c6939fb5-ad88-11e6-9ac5-000000000000 993
[httpd] <ip_hidden> - c3439_telegraf [18/Nov/2016:12:16:03 +0000] "POST /write?consistency=&db=c3439&precision=ns&rp= HTTP/1.1" 204 0 "-" "InfluxDBClient" c6870f8e-ad88-11e6-9ab9-000000000000 85296
[httpd] <ip_hidden> - c3623_telegraf [18/Nov/2016:12:16:04 +0000] "POST /write?consistency=&db=c3623&precision=ns&rp= HTTP/1.1" 204 0 "-" "InfluxDBClient" c693d5fa-ad88-11e6-9ac6-000000000000 2071
unexpected fault address 0x0
fatal error: fault
[signal SIGSEGV: segmentation violation code=0x80 addr=0x0 pc=0x40ea9d]

rubycut commented Nov 18, 2016

curl "http://localhost:8086/debug/pprof/goroutine?debug=1"
404 page not found

pprof is not working. I am using stock InfluxDB 1.1.0 downloaded from your site; do I need to recompile or do anything else to turn pprof on?

jwilder commented Nov 18, 2016

@rubycut Can you attach the full trace? That tells me what the panic was, but not where it occurred or what else was going on at the time.

For your shards, I'd recommend using a larger shard group duration. For a four-month retention period, a shard group duration of one month would significantly reduce the number of shards and goroutines running on your system. The number of databases compounds the problem as well. The shards don't look very big, so small shard durations are likely just creating overhead for little benefit.
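
For reference, the shard group duration of an existing retention policy can be changed per database with an InfluxQL statement along these lines (shown against one of the database names from the logs above; 4w is an approximation of the one-month suggestion, and the new duration only applies to shard groups created after the change):

> ALTER RETENTION POLICY "four_months" ON "c3941" SHARD DURATION 4w

With 500 databases this would need to be scripted per database, and existing shards keep their old duration until they expire.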

It looks like you may be using a custom kernel as well, which might be having problems with the number of goroutines/shards/etc. or with other dependent system resources associated with them.

jwilder commented Nov 18, 2016

You may need to set:

[http]
pprof-enabled = true

rubycut commented Nov 18, 2016

@jwilder , when InfluxDB crashes, it writes its state to the log; that is what I already sent you. I don't know any other way to obtain the full trace.

jwilder commented Nov 18, 2016

@rubycut The two Dropbox links only have the end of the trace. The beginning is cut off, which is the part that would indicate what the panic was and where it occurred. This is all I see:

x197

goroutine 119256 [sleep]:
time.Sleep(0x3b9aca00)
    /usr/local/go/src/runtime/time.go:59 +0xe1
github.com/influxdata/influxdb/tsdb/engine/tsm1.(*Engine).compactCache(0xc6aeb83800, 0xc727eece40)
    /root/go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/engine.go:953 +0xd7
github.com/influxdata/influxdb/tsdb/engine/tsm1.

... rest of trace...

rubycut commented Nov 18, 2016

Oh man, Dropbox sucks. Here is a gist of the trace:

https://gist.github.com/rubycut/d35daf01887b5d23ddd4571c500c8e64

rubycut commented Nov 24, 2016

Since we changed shardGroupDuration from one day to two weeks, the number of goroutines is going down; we are at 111k goroutines, and they drop by about 1k each day.

Today, InfluxDB ran for 4.5 hours without going down.

rubycut commented Nov 24, 2016

But we are observing strange memory behavior compared with InfluxDB 0.13.

The blue line is InfluxDB 1.1.0 and the green line is InfluxDB 0.13; both servers are running the same load.

As you can see, InfluxDB 1.1.0 has a constantly rising memory footprint, and I suspect memory could be the cause of the crashes, although the server doesn't seem to be killed by the OOM killer.

The sudden drops in the blue line every couple of hours are when InfluxDB crashes.

(screenshot: memory usage comparison, 2016-11-24 15:48)

rubycut commented Nov 25, 2016

@jwilder , I added all the info you required. I am standing by if you need anything else.

rubycut commented Nov 25, 2016

Here is the pprof heap profile as memory is filling up:

https://gist.github.com/b036527bc1163a7f52b0de6fa7fd2a6b

rubycut commented Nov 28, 2016

@jwilder , it seems that defaultWaitingWALWrites = 10 could be a bottleneck; look at:

https://github.com/influxdata/influxdb/blob/v1.1.0/tsdb/engine/tsm1/wal.go#L71

Is it safe for me to increase this value and recompile?

jwilder commented Nov 28, 2016

@rubycut Increasing it could cause higher memory usage. You may want to try lowering it since you have a lot of databases and shards.
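
For context, a fixed limit like defaultWaitingWALWrites is typically enforced with a counting semaphore. The sketch below is illustrative only (not InfluxDB's actual wal.go code) and assumes the constant bounds how many WAL writes may be in flight at once, which is why raising it lets more write buffers sit in memory under load:

package main

import (
    "fmt"
    "sync"
)

// fixedLimiter is a counting semaphore: at most cap(l) holders at a time.
type fixedLimiter chan struct{}

func (l fixedLimiter) take()    { l <- struct{}{} } // blocks once the limit is reached
func (l fixedLimiter) release() { <-l }

func main() {
    // Hypothetical stand-in for defaultWaitingWALWrites.
    const maxWaitingWALWrites = 10
    limiter := make(fixedLimiter, maxWaitingWALWrites)

    var wg sync.WaitGroup
    for i := 0; i < 50; i++ {
        wg.Add(1)
        go func(id int) {
            defer wg.Done()
            limiter.take() // each waiting write would hold its buffer in memory here
            defer limiter.release()
            fmt.Println("writing WAL entry", id)
        }(i)
    }
    wg.Wait()
}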

rubycut commented Nov 29, 2016

I can't get it to work; whether I reduce it to 5 or increase it, memory always grows, as visible in this heatmap.

rubycut commented Dec 1, 2016

I brought memory under control with cache-snapshot-memory-size = 2621440

Still, it crashes every two hours with:

influxd.log.7-panic: runtime error: invalid memory address or nil pointer dereference
influxd.log.7-[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x0]
influxd.log.7-
influxd.log.7-goroutine 92627 [running]:
influxd.log.7-panic(0x9cf120, 0xc4200120b0)
influxd.log.7-  /usr/local/go/src/runtime/panic.go:500 +0x1a1
influxd.log.7-created by github.com/influxdata/influxdb/tsdb/engine/tsm1.(*Engine).enableLevelCompactions
influxd.log.7-  /root/go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/engine.go:214 +0x113
influxd.log.7-[signal SIGSEGV: segmentation violation code=0x2 addr=0xc6634fb180 pc=0xc6634fb180]
influxd.log.7-
influxd.log.7-goroutine 39672 [running]:
influxd.log.7-runtime.throw(0xa7c76c, 0x5)
influxd.log.7-  /usr/local/go/src/runtime/panic.go:566 +0x95 fp=0xc7e9024e68 sp=0xc7e9024e48
influxd.log.7-runtime.sigpanic()
influxd.log.7-  /usr/local/go/src/runtime/sigpanic_unix.go:27 +0x288 fp=0xc7e9024ec0 sp=0xc7e9024e68
influxd.log.7-created by github.com/influxdata/influxdb/tsdb/engine/tsm1.(*Engine).enableLevelCompactions
influxd.log.7-  /root/go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/engine.go:215 +0x13f
influxd.log.7-

jwilder commented Dec 2, 2016

@rubycut I'm not sure what to make of that. That call trace is what happens at startup when loading shards. I'm not sure how you would see that just before it crashed.

BTW, how did you generate that image? What tool is that?

rubycut pushed a commit to rubycut/influxdb that referenced this issue Dec 2, 2016
Big pool can lead to huge memory usage in certain loads.

See influxdata#7640 for detailed discussion.

rubycut commented Dec 2, 2016

@jwilder , you are right, this could already be the startup, since the instance is automatically restarted by supervisor.

The tool is relatively new: https://stackimpact.com/

allenpetersen commented Dec 2, 2016

@jwilder I've been digging into the memory usage. At least in this 500-database config, the size of the requested buffers varies quite a bit, resulting in a large amount of reallocation, which prevents the pool from doing a good job.

I've been experimenting with making four different pools, each with a different buffer size. All buffers in a given pool are allocated with the same capacity, so there is no need for reallocation. After analyzing the requested buffers, here are the pool sizes that work well for this test config (a sketch of this bucketed approach follows the stats table below):

small      500KB
medium       1MB
large        4MB
XL     requested

I see a handful of 11MB and 13MB requests, so I figured XL can just use whatever size was requested.

Under the load test for 40 minutes I'm only seeing 151MB being retained by the pools in total. Most of the activity is in the small and large pools at about 50-100 requests per second each.

The XL pool only has 1 item and has only been accessed 189 times in 40 minutes. I've thought about not pooling the XL requests, but it only uses 13MB of RAM and it prevents heap fragmentation by not having to reallocate that same block.

The medium pool got used a lot at startup time and then leveled off. I think that size must get used while reading in the cache loader.

Here are the pool stats after running for about 40 minutes under load:

Size    Held (MB)  Requests
Small       14      382803
Medium       8      350591
Large      116       36119
XLarge      13         189
Total      151      769702
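
A minimal sketch of the bucketed-pool idea described above (bucket sizes mirror the table; names and structure are illustrative, not the exact WIP commit):

package main

import "sync"

const (
    smallSize  = 500 * 1024 // 500KB
    mediumSize = 1 << 20    // 1MB
    largeSize  = 4 << 20    // 4MB
)

// One sync.Pool per fixed bucket size; "XL" requests bypass the buckets.
var buckets = [...]struct {
    size int
    pool sync.Pool
}{
    {size: smallSize},
    {size: mediumSize},
    {size: largeSize},
}

// getBuf returns a buffer of length sz, allocated at the bucket's fixed
// capacity so it can be reused for any request that maps to that bucket.
func getBuf(sz int) []byte {
    for i := range buckets {
        b := &buckets[i]
        if sz <= b.size {
            if v := b.pool.Get(); v != nil {
                return v.([]byte)[:sz]
            }
            return make([]byte, sz, b.size)
        }
    }
    // XL: allocate exactly what was requested and don't pool it here.
    return make([]byte, sz)
}

// putBuf returns a buffer to the bucket whose size matches its capacity.
func putBuf(buf []byte) {
    for i := range buckets {
        b := &buckets[i]
        if cap(buf) == b.size {
            b.pool.Put(buf[:0])
            return
        }
    }
    // XL buffers fall through and are left to the garbage collector.
}

func main() {
    b := getBuf(300 * 1024) // served by the small bucket
    putBuf(b)
    _ = getBuf(300 * 1024) // likely reuses the same backing array
}

Because every buffer in a bucket shares the bucket's capacity, a request that maps to that bucket can never fail a capacity check and force a reallocation.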

jwilder commented Dec 2, 2016

@allenpetersen Interesting. Would you be able to try reverting the pool back to sync.Pool? This is the code that was there before: https://github.com/influxdata/influxdb/pull/7348/files#diff-25cc809730c7e22463531e3af8bf5dec

allenpetersen commented

I was looking at that code and I agree that sync.Pool might not be the best fit, because the pools can get cleared often. The thrashing I was seeing with bufPool is gone with the bucketed sizes, and Bytes.Get() is really low on the inuse_objects list. I've had influx under load for 70 minutes and it is quite happy; memory isn't growing.

Here is the WIP commit; there is a lot of logging code I've been using to monitor the pools:

https://github.com/allenpetersen/influxdb/commit/cf76c3ee4f14055a4bb1a08dace9d3879f6c2721

I'm still looking into what seems like a heap fragmentation issue. Linux is reporting 30GB virtual and 11GB resident and there is 16GB of HeapReleased RAM.

# runtime.MemStats
# Alloc = 7887105264
# TotalAlloc = 125086394008
# Sys = 26991342520
# Lookups = 454114
# Mallocs = 1068203357
# Frees = 868244704
# HeapAlloc = 7887105264
# HeapSys = 25709838336
# HeapIdle = 16905437184
# HeapInuse = 8804401152
# HeapReleased = 15752298496
# HeapObjects = 199958653
# Stack = 32309248 / 32309248
# MSpan = 140084000 / 220659712
# MCache = 9600 / 16384
# BuckHashSys = 1721121
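
The "# runtime.MemStats" dumps in this thread match the header format printed by Go's /debug/pprof/heap?debug=1 endpoint; the same counters can also be read in-process with runtime.ReadMemStats. A small standalone example (not InfluxDB code):

package main

import (
    "fmt"
    "runtime"
)

func main() {
    var m runtime.MemStats
    runtime.ReadMemStats(&m)

    // A few of the fields shown in the dumps above.
    fmt.Printf("# Alloc = %d\n", m.Alloc)
    fmt.Printf("# Sys = %d\n", m.Sys)
    fmt.Printf("# HeapAlloc = %d\n", m.HeapAlloc)
    fmt.Printf("# HeapSys = %d\n", m.HeapSys)
    fmt.Printf("# HeapIdle = %d\n", m.HeapIdle)
    fmt.Printf("# HeapInuse = %d\n", m.HeapInuse)
    fmt.Printf("# HeapReleased = %d\n", m.HeapReleased)
    fmt.Printf("# HeapObjects = %d\n", m.HeapObjects)
    fmt.Printf("# NumGC = %d\n", m.NumGC)
}

Against a live instance, the same numbers are available from the heap.txt capture jwilder requested earlier in this thread.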

allenpetersen commented

@jwilder I quickly swapped out pool.Bytes with sync.Pool and it's performing really well.

https://github.com/allenpetersen/influxdb/commit/dd71b23513e42edd3ff25cfb918faac1491768a0

I'll get exact numbers in a bit, but for example, with the older code it took almost 2 minutes to finish the cache-loader stage of startup. With sync.Pool it took 15 seconds.

Linux is only reporting 10GB virtual and 4.4GB resident, each about a third of the older code.

# runtime.MemStats
# Alloc = 2372125720
# TotalAlloc = 32378802448
# Sys = 4349636208
# Lookups = 209778
# Mallocs = 291427402
# Frees = 235183224
# HeapAlloc = 2372125720
# HeapSys = 4078731264
# HeapIdle = 1594015744
# HeapInuse = 2484715520
# HeapReleased = 37691392
# HeapObjects = 56244178
# Stack = 34897920 / 34897920
# MSpan = 50029920 / 57786368
# MCache = 9600 / 16384
# BuckHashSys = 1627555

allenpetersen commented

@jwilder After 70 minutes with sync.Pool things continue to look really good. Overall it seems to be using about half the memory it did before.

Here are the results for a 70-minute test with sync.Pool buckets:

# runtime.MemStats
# Alloc = 4262116424
# TotalAlloc = 91533304016
# Sys = 6153510512
# Lookups = 779259
# Mallocs = 723575666
# Frees = 624965997
# HeapAlloc = 4262116424
# HeapSys = 5792759808
# HeapIdle = 1475100672
# HeapInuse = 4317659136
# HeapReleased = 0
# HeapObjects = 98609669
# Stack = 35028992 / 35028992
# MSpan = 67064000 / 75710464
# MCache = 9600 / 16384
# BuckHashSys = 1696723
# NextGC = 7056486574
# PauseNs = [530289 314619 320596 1096046 399833 308890 266669 471729 483568 400554 509909 550167 513533 301108 91926 68829 290793 56072 103335 109035 81930 103377 92215 52479 55663 106495 106711 69988 215513 222778 226723 52625 208723 56263 248848 225839 111737 269211 265993 80512 351578 256429 186419 198496 343178 462119 217230 2149745 460028 581279 726927 1155523 1180971 1252676 1633985 2201706 3144175 3008831 3424641 3293706 3314497 4204496 3700088 3804114 8579937 4113989 4518797 3754106 3912663 3819891 4386894 3742700 4217162 3866343 4455789 4059505 4565848 4185661 1892691 3760559 4036164 4194639 4772064 4188802 5763421 4162223 4569555 4053164 5886723 3843425 4263090 4310646 7016731 5057678 4185530 3967028 4009587 4095376 4807964 4115051 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
# NumGC = 100
# DebugGC = false

I'll do a similar run with 1.1.0 now.

allenpetersen commented Dec 3, 2016

Here is a comparison of 1.1.0 to master + sync.Pool buckets:

                 1.1.0       sync.Pool    delta
Alloc         7337573088    4262116424    58.09%
TotalAlloc  102558047400   91533304016    89.25%
Sys          10669295144    6153510512    57.67%
Lookups           748249        779259   104.14%
Mallocs        933161203     723575666    77.54%
Frees          767299281     624965997    81.45%
HeapAlloc     7337573088    4262116424    58.09%
HeapSys      10072588288    5792759808    57.51%
HeapIdle      2716852224    1475100672    54.29%
HeapInuse     7355736064    4317659136    58.70%
HeapReleased  1911767040             0          
HeapObjects    165861922      98609669    59.45%

To be honest I don't understand the 1.9GB of HeapReleased; my guess is something like a fragmented heap, so there are no full pages the kernel can take back.

allenpetersen commented

@jwilder I ran a few more tests and using fixed-size sync.Pools worked best. Unfortunately there is a lot of variability between the test runs. I want to make a repeatable workload with influx_stress to better compare the different mechanisms before making a PR.

e-dard commented Dec 7, 2016

@rubycut @allenpetersen there are a couple of problems with the current pool in pkg/pool.

@rubycut's fix in #7683 will help with memory management for systems under load with occasional large buffer requirements, but the cost is increased allocations due to more contention over the now-smaller pool.

@allenpetersen we originally used sync.Pool, but I believe the issue with very high loads was that it aggressively released items back to be garbage collected, so the number of allocations was quite high. I think the tradeoff in going from a sync.Pool-backed pool to a fixed channel-backed pool was to reduce allocations at the cost of increased memory usage.

I've been working on a new pkg/pool which will track requested buffer sizes and dynamically release buffers from the pool if they're not appropriately sized. I'm hoping it will give us the best of both worlds (memory usage and allocations over time). Would you be interested in trying it out when the branch is ready?
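
Roughly, such a pool might look like the sketch below: a channel-backed pool that keeps a running estimate of requested sizes and drops buffers whose capacity has drifted far from current demand. This is illustrative only, under those assumptions; the real branch may differ.

package main

import "sync"

// adaptivePool recycles byte buffers like a channel-backed pool, but
// releases outliers instead of pinning them, roughly as described above.
type adaptivePool struct {
    buffers chan []byte
    mu      sync.Mutex
    avgSize int // moving average of requested sizes
}

func newAdaptivePool(capacity int) *adaptivePool {
    return &adaptivePool{buffers: make(chan []byte, capacity)}
}

func (p *adaptivePool) Get(sz int) []byte {
    p.record(sz)
    select {
    case c := <-p.buffers:
        if cap(c) >= sz {
            return c[:sz]
        }
        // Pooled buffer is too small for this request; drop it and allocate.
    default:
    }
    return make([]byte, sz)
}

func (p *adaptivePool) Put(c []byte) {
    p.mu.Lock()
    avg := p.avgSize
    p.mu.Unlock()
    // Drop buffers much larger than current demand so occasional huge
    // requests don't pin memory in the pool indefinitely.
    if avg > 0 && cap(c) > 2*avg {
        return
    }
    select {
    case p.buffers <- c:
    default: // pool is full; let the GC take it
    }
}

func (p *adaptivePool) record(sz int) {
    p.mu.Lock()
    if p.avgSize == 0 {
        p.avgSize = sz
    } else {
        p.avgSize = (p.avgSize*7 + sz) / 8 // cheap exponential moving average
    }
    p.mu.Unlock()
}

func main() {
    p := newAdaptivePool(16)
    b := p.Get(4096)
    p.Put(b)
}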

allenpetersen commented

@e-dard What I noticed while trying to simulate the workload for this issue was a very high number of Get() requests failing the capacity check at https://github.com/influxdata/influxdb/blob/master/pkg/pool/bytes.go#L28:

	if cap(c) < sz {
		return make([]byte, sz)
	}

There were 200-400 calls to Bytes.Get() per second. Half of those requests were for small buffers of various sizes under 300KB. The other half were for a single size of 1048576 bytes. In a very short period of time, the buffers in the pool channel were essentially randomly sized, resulting in a lot of reallocations when the capacity check failed.

Looking back at the sync.Pool implementation from September, there was similar logic. My guess is the problem was more about discarding these pooled buffers when the capacity check fails, rather than a GC emptying the pool. Maintaining a few pools of specifically sized buffers removed this thrashing and the allocations went way down. Actually, the total memory in all the pools went down quite a bit as well.

I've been busy this week and haven't been able to simplify my test procedure. I've looked at stress and stress/v2 but this workload is rather specific and I think it will be easier to just write a simple program to generate the load.

Let me know when your new branch is ready, I'm happy to test it.

allenpetersen commented Dec 8, 2016

@e-dard Here is the latest version of my experiment with pools of specific sizes.

https://github.com/allenpetersen/influxdb/commit/7a86d160ac2bba3e97c2ddcdf14c7e108eddac6d

[edit] updated to latest commit

rubycut commented Dec 9, 2016

@e-dard, @allenpetersen, reducing the pool size with my patch solved the memory issues for us, but if you want us to test something, I am available.

rubycut commented Dec 20, 2016

@jwilder, after I changed the shard group duration, the number of shards has been decreasing little by little every day over the last two weeks. Sudden crashes of InfluxDB are not happening so often now.

It's quite possible that once thousands of these 24-hour shards expire, InfluxDB will not be crashing anymore.

weshmashian commented

Upgrading to InfluxDB 1.2.0-rc1 fixed the memory leak issue we were having. We've also dropped ~250 subscriptions, which brought the number of goroutines down to a reasonable level.

However, InfluxDB still crashes after 90-180 minutes of uptime, and it seems unrelated to usage (write queries vs. mixed queries).

jwilder commented Jan 26, 2017

@rubycut Has stability improved since lowering the number of shards?

rubycut commented Jan 26, 2017

@jwilder , we still have crashes. We are running 1.2.0-rc1, and my colleague @weshmashian is running tests and trying various things. The number of shards has decreased significantly since we switched to 14-day shard groups in mid-December, but we still have crashes; they usually happen after a few hours, and the longest period we were able to run without a crash is 48 hours.

@weshmashian can provide details.

weshmashian commented Jan 27, 2017

We're actually running three versions: 1.1.0, 1.2.0-rc1, and 1.2.0. So far rc1 has had no crashes in the past three days. We're trying to get the same behavior on 1.2.0 (matching configs and using the same dataset), but so far without success.

We've also started getting the following panic on one 1.2.0-rc1 instance:

panic: count of short block: got 0, exp 1

goroutine 62183 [running]:
panic(0x9cd1e0, 0xc6b6936eb0)
        /home/influx/.gvm/gos/go1.7.4/src/runtime/panic.go:500 +0x1a1
github.com/influxdata/influxdb/tsdb/engine/tsm1.BlockCount(0x0, 0x0, 0x0, 0x8)
        /home/influx/go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/encoding.go:224 +0x315
github.com/influxdata/influxdb/tsdb/engine/tsm1.(*tsmKeyIterator).combine(0xc5b67aca00, 0x0, 0xa763a0, 0x1, 0xc7b9cbfbc0)
        /home/influx/go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/compact.go:1165 +0x6eb
github.com/influxdata/influxdb/tsdb/engine/tsm1.(*tsmKeyIterator).merge(0xc5b67aca00)
        /home/influx/go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/compact.go:1101 +0x100
github.com/influxdata/influxdb/tsdb/engine/tsm1.(*tsmKeyIterator).Next(0xc5b67aca00, 0xc813bb36ea)
        /home/influx/go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/compact.go:1068 +0x2bd
github.com/influxdata/influxdb/tsdb/engine/tsm1.(*Compactor).write(0xc5ae0d8b40, 0xc5bb7f1db0, 0x47, 0xe2c620, 0xc5b67aca00, 0x0, 0x0)
        /home/influx/go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/compact.go:774 +0x29b
github.com/influxdata/influxdb/tsdb/engine/tsm1.(*Compactor).writeNewFiles(0xc5ae0d8b40, 0x22f, 0x6, 0xe2c620, 0xc5b67aca00, 0xe2c620, 0xc5b67aca00, 0x0, 0x0, 0xc5ca243bf8)
        /home/influx/go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/compact.go:728 +0x284
github.com/influxdata/influxdb/tsdb/engine/tsm1.(*Compactor).compact(0xc5ae0d8b40, 0xba8400, 0xc5bb7f5300, 0x2, 0x2, 0x0, 0x0, 0x0, 0x0, 0x0)
        /home/influx/go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/compact.go:654 +0x52a
github.com/influxdata/influxdb/tsdb/engine/tsm1.(*Compactor).CompactFull(0xc5ae0d8b40, 0xc5bb7f5300, 0x2, 0x2, 0x0, 0x0, 0x0, 0x0, 0x0)
        /home/influx/go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/compact.go:672 +0x18e
github.com/influxdata/influxdb/tsdb/engine/tsm1.(*compactionStrategy).compactGroup.func1(0xc5b4dcefc0, 0xc5bb7f5300, 0x2, 0x2, 0x0, 0x0, 0x0, 0x0, 0x0)
        /home/influx/go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/engine.go:1072 +0x15d
github.com/influxdata/influxdb/tsdb/engine/tsm1.(*compactionStrategy).compactGroup(0xc5b4dcefc0, 0x0)
        /home/influx/go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/engine.go:1074 +0x47d
github.com/influxdata/influxdb/tsdb/engine/tsm1.(*compactionStrategy).Apply.func1(0xc5b7551f80, 0xc5b4dcefc0, 0x0)
        /home/influx/go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/engine.go:1047 +0x5d
created by github.com/influxdata/influxdb/tsdb/engine/tsm1.(*compactionStrategy).Apply
        /home/influx/go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/engine.go:1048 +0xcc

Full version: 1.2.0~rc1, branch master, commit bb029b5

e-dard commented Jan 27, 2017

@weshmashian how often does that panic occur? Has it happened in the final 1.2.0 release?

weshmashian commented

@e-dard it happened on the final 1.2.0 as well:

panic: count of short block: got 0, exp 1

goroutine 53530 [running]:
panic(0x9d12c0, 0xc6ad599740)
        /usr/local/go/src/runtime/panic.go:500 +0x1a1
github.com/influxdata/influxdb/tsdb/engine/tsm1.BlockCount(0x0, 0x0, 0x0, 0x8)
        /root/go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/encoding.go:224 +0x315
github.com/influxdata/influxdb/tsdb/engine/tsm1.(*tsmKeyIterator).combine(0xc4a7675b00, 0x0, 0xa7a980, 0x1, 0xc7086c8c00)
        /root/go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/compact.go:1165 +0x6eb
github.com/influxdata/influxdb/tsdb/engine/tsm1.(*tsmKeyIterator).merge(0xc4a7675b00)
        /root/go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/compact.go:1101 +0x100
github.com/influxdata/influxdb/tsdb/engine/tsm1.(*tsmKeyIterator).Next(0xc4a7675b00, 0xc613bb36ea)
        /root/go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/compact.go:1068 +0x2bd
github.com/influxdata/influxdb/tsdb/engine/tsm1.(*Compactor).write(0xc5183ba050, 0xc57fbdbdb0, 0x47, 0xe30720, 0xc4a7675b00, 0x0, 0x0)
        /root/go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/compact.go:774 +0x29b
github.com/influxdata/influxdb/tsdb/engine/tsm1.(*Compactor).writeNewFiles(0xc5183ba050, 0x22f, 0x6, 0xe30720, 0xc4a7675b00, 0xe30720, 0xc4a7675b00, 0x0, 0x0, 0xc57cba1bf8)
        /root/go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/compact.go:728 +0x284
github.com/influxdata/influxdb/tsdb/engine/tsm1.(*Compactor).compact(0xc5183ba050, 0xbacc00, 0xc570c61aa0, 0x2, 0x2, 0x0, 0x0, 0x0, 0x0, 0x0)
        /root/go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/compact.go:654 +0x52a
github.com/influxdata/influxdb/tsdb/engine/tsm1.(*Compactor).CompactFull(0xc5183ba050, 0xc570c61aa0, 0x2, 0x2, 0x0, 0x0, 0x0, 0x0, 0x0)
        /root/go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/compact.go:672 +0x18e
github.com/influxdata/influxdb/tsdb/engine/tsm1.(*compactionStrategy).compactGroup.func1(0xc56eba0930, 0xc570c61aa0, 0x2, 0x2, 0x0, 0x0, 0x0, 0x0, 0x0)
        /root/go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/engine.go:1072 +0x15d
github.com/influxdata/influxdb/tsdb/engine/tsm1.(*compactionStrategy).compactGroup(0xc56eba0930, 0x0)
        /root/go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/engine.go:1074 +0x47d
github.com/influxdata/influxdb/tsdb/engine/tsm1.(*compactionStrategy).Apply.func1(0xc57cac3380, 0xc56eba0930, 0x0)
        /root/go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/engine.go:1047 +0x5d
created by github.com/influxdata/influxdb/tsdb/engine/tsm1.(*compactionStrategy).Apply
        /root/go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/engine.go:1048 +0xcc
[I] 2017-01-27T08:27:13Z InfluxDB starting, version 1.2.0, branch master, commit b7bb7e8359642b6e071735b50ae41f5eb343fd42

I've synced a known-good data directory and started up rc1 on it. It's not panicking any more, but it's still too early to tell whether it's going to crash.

I updated the panic in the previous comment, as I only now realized it was missing the first line.

e-dard commented Jan 27, 2017

@weshmashian thanks. We need to open a new issue for this panic.

jwilder commented May 25, 2017

Should be fixed via #8348

jwilder closed this as completed May 25, 2017

mahaveer1707 commented

My InfluxDB version is 1.5.2-1.
I read this whole thread and the only takeaway I understood was to change the shard duration, but it is no help.

My InfluxDB instance goes down every hour or so, and this started happening just 2-3 weeks ago.
I have 35 databases, each with a retention policy of 1 month and a shard duration of 2 weeks.

Attached are some files for details:

uname.txt
diagnostics.txt
iostat.txt
shards.txt
stats.txt

mahaveer1707 mentioned this issue Sep 19, 2019