Compactor Retention Issue and Resource Usage #824

Closed
PsychoSid opened this issue Feb 8, 2019 · 12 comments
@PsychoSid

I am running the compactor with these retention values: --retention.resolution-raw=14d --retention.resolution-5m=30d --retention.resolution-1h=60d. The stores show a minimum date of 2019-01-14T17:00:00-07:00, and for a while we saw data going back to the 14th on our dashboard. Then, two days ago, queries stopped returning any data beyond January 24, 2019 1:00:00 AM. What could have happened, and why is it happening?
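For reference, the compactor is launched roughly like this (a sketch; the data dir and objstore config path are placeholders, not our exact deployment):

# Sketch of the compactor invocation; paths are placeholders.
thanos compact \
  --data-dir=/var/thanos/compact \
  --objstore.config-file=/etc/thanos/bucket.yml \
  --retention.resolution-raw=14d \
  --retention.resolution-5m=30d \
  --retention.resolution-1h=60d \
  --wait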

We have minimal space available to our Prometheus servers (3 days' worth for relatively high-volume Kubernetes clusters with 000's of nodes), and it's more cost-effective for our company to keep longer retention in object storage rather than on block devices or SAN/NAS-based NFS space.

With 1 TB uploaded by the sidecars, the compactor is now using 120 GB of RAM and 600 GB of disk space.

I am aware of #813

But of concern is why I cannot access data older than 14 days, even though Thanos seems to think it exists, since I started the sidecars on the 14th of January.

@GiedriusS
Member

Does thanos bucket inspect show that the data is still there?
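Something along these lines should do it (a sketch, assuming the same objstore config file the compactor uses):

thanos bucket inspect --objstore.config-file=/etc/thanos/bucket.yml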

@PsychoSid
Author

level=info ts=2019-02-10T10:01:53.246252562Z caller=factory.go:39 msg="loading bucket configuration"

ULID FROM UNTIL RANGE UNTIL-COMP #SERIES #SAMPLES #CHUNKS COMP-LEVEL COMP-FAILED LABELS RESOLUTION SOURCE
01D25JMWSFD557G4T97G4DEF15 14-01-2019 17:00:00 23-01-2019 17:00:00 216h0m0s 24h0m0s 12,206,573 9,177,242,643 73,665,378 4 false cluster=v4/customer/paas/e1,environment=ecp-prometheus-e1_blue,replica=lpdosput50627 5m0s compactor
01D265D25HSQP1YAZV57Z6W12H 14-01-2019 17:00:00 23-01-2019 17:00:00 216h0m0s 24h0m0s 12,207,133 9,177,217,216 73,667,088 4 false cluster=v4/customer/paas/e1,environment=ecp-prometheus-e1_blue,replica=lpdosput50628 5m0s compactor
01D33MWER55PHHE981BCXHHMQN 23-01-2019 17:00:00 06-02-2019 17:00:00 336h0m0s -296h0m0s 13,142,888 123,041,649,307 1,121,859,781 4 false cluster=v4/customer/paas/e1,environment=ecp-prometheus-e1_blue,replica=lpdosput50627 0s compactor
01D34FS6V382Q2Q4DJVH2RSF1V 23-01-2019 17:00:00 06-02-2019 17:00:00 336h0m0s -296h0m0s 13,143,039 123,044,107,111 1,124,200,506 4 false cluster=v4/customer/paas/e1,environment=ecp-prometheus-e1_blue,replica=lpdosput50628 0s compactor
01D359PP5ZM7TWV1HWSKV3RGXQ 23-01-2019 17:00:00 06-02-2019 17:00:00 336h0m0s -96h0m0s 13,142,510 15,322,939,441 116,356,121 4 false cluster=v4/customer/paas/e1,environment=ecp-prometheus-e1_blue,replica=lpdosput50627 5m0s compactor
01D369XTD50BE3GEZKTY3YCKMA 23-01-2019 17:00:00 06-02-2019 17:00:00 336h0m0s -96h0m0s 13,142,661 15,323,053,822 116,355,359 4 false cluster=v4/customer/paas/e1,environment=ecp-prometheus-e1_blue,replica=lpdosput50628 5m0s compactor
01D36STGZXYVYTA36K5WHEG0P3 23-01-2019 17:00:00 06-02-2019 17:00:00 336h0m0s - 13,142,510 1,417,435,026 23,402,380 4 false cluster=v4/customer/paas/e1,environment=ecp-prometheus-e1_blue,replica=lpdosput50627 1h0m0s compactor
01D3710GWDMSQJKGMG7G83ABAH 23-01-2019 17:00:00 06-02-2019 17:00:00 336h0m0s - 13,142,661 1,418,061,660 23,402,914 4 false cluster=v4/customer/paas/e1,environment=ecp-prometheus-e1_blue,replica=lpdosput50628 1h0m0s compactor
01D387A5VGTTYH5TKKMRTSAZV3 06-02-2019 17:00:00 08-02-2019 17:00:00 48h0m0s -8h0m0s 6,405,401 18,626,201,751 169,232,695 3 false cluster=v4/customer/paas/e1,environment=ecp-prometheus-e1_blue,replica=lpdosput50627 0s compactor
01D389J26J6S6RTVNPV4ZQV4D3 06-02-2019 17:00:00 08-02-2019 17:00:00 48h0m0s -8h0m0s 6,405,761 18,655,975,984 169,316,879 3 false cluster=v4/customer/paas/e1,environment=ecp-prometheus-e1_blue,replica=lpdosput50628 0s compactor
01D38BP0SAJV5SCFMNGGHD5AYR 06-02-2019 17:00:00 08-02-2019 17:00:00 48h0m0s 192h0m0s 6,405,340 2,330,400,348 21,586,466 3 false cluster=v4/customer/paas/e1,environment=ecp-prometheus-e1_blue,replica=lpdosput50627 5m0s compactor
01D38EJAEQ27ZDZB12XY83D5PE 06-02-2019 17:00:00 08-02-2019 17:00:00 48h0m0s 192h0m0s 6,405,700 2,332,020,925 21,620,050 3 false cluster=v4/customer/paas/e1,environment=ecp-prometheus-e1_blue,replica=lpdosput50628 5m0s compactor
01D390QS9P32MB147HWJH9PBZP 08-02-2019 17:00:00 09-02-2019 01:00:00 8h0m0s 32h0m0s 4,758,789 3,128,247,584 28,353,957 2 false cluster=v4/customer/paas/e1,environment=ecp-prometheus-e1_blue,replica=lpdosput50627 0s compactor
01D39174030H4CQ4V1EYKDABB8 08-02-2019 17:00:00 09-02-2019 01:00:00 8h0m0s 32h0m0s 4,758,828 3,128,246,632 28,350,301 2 false cluster=v4/customer/paas/e1,environment=ecp-prometheus-e1_blue,replica=lpdosput50628 0s compactor
01D39W6XHSMREP6FE3W5T6F2D0 09-02-2019 01:00:00 09-02-2019 09:00:00 8h0m0s 32h0m0s 4,730,878 3,121,777,120 28,211,924 2 false cluster=v4/customer/paas/e1,environment=ecp-prometheus-e1_blue,replica=lpdosput50627 0s compactor
01D39WPEAREN69DSF6W7F44VGV 09-02-2019 01:00:00 09-02-2019 09:00:00 8h0m0s 32h0m0s 4,730,896 3,121,779,114 28,211,992 2 false cluster=v4/customer/paas/e1,environment=ecp-prometheus-e1_blue,replica=lpdosput50628 0s compactor
01D3AQNCX7EKC0VHJZA9EXDC15 09-02-2019 09:00:00 09-02-2019 17:00:00 8h0m0s 32h0m0s 4,690,366 3,119,310,387 28,043,813 2 false cluster=v4/customer/paas/e1,environment=ecp-prometheus-e1_blue,replica=lpdosput50627 0s compactor
01D3AR4HQ1W5SF523CE7DX0Q37 09-02-2019 09:00:00 09-02-2019 17:00:00 8h0m0s 32h0m0s 4,690,383 3,119,315,054 28,043,940 2 false cluster=v4/customer/paas/e1,environment=ecp-prometheus-e1_blue,replica=lpdosput50628 0s compactor
01D3ANHSYXWCYVPCTSR71GQEZC 09-02-2019 17:00:00 09-02-2019 19:00:00 2h0m0s 38h0m0s 4,534,649 780,472,618 7,036,095 1 false cluster=v4/customer/paas/e1,environment=ecp-prometheus-e1_blue,replica=lpdosput50628 0s sidecar
01D3ANHT1KXTQVCV0EZMAT8S9P 09-02-2019 17:00:00 09-02-2019 19:00:00 2h0m0s 38h0m0s 4,534,670 780,471,870 7,036,116 1 false cluster=v4/customer/paas/e1,environment=ecp-prometheus-e1_blue,replica=lpdosput50627 0s sidecar
01D3AWDH6NENQ67667Y7VNG2MM 09-02-2019 19:00:00 09-02-2019 21:00:00 2h0m0s 38h0m0s 4,621,212 780,405,852 7,123,398 1 false cluster=v4/customer/paas/e1,environment=ecp-prometheus-e1_blue,replica=lpdosput50628 0s sidecar
01D3AWDH8X94297CYZEA9ER7HS 09-02-2019 19:00:00 09-02-2019 21:00:00 2h0m0s 38h0m0s 4,621,233 780,405,157 7,123,419 1 false cluster=v4/customer/paas/e1,environment=ecp-prometheus-e1_blue,replica=lpdosput50627 0s sidecar
01D3B398F5QGJKJM8SZFYQ1VJJ 09-02-2019 21:00:00 09-02-2019 23:00:00 2h0m0s 38h0m0s 4,587,685 780,266,307 7,088,261 1 false cluster=v4/customer/paas/e1,environment=ecp-prometheus-e1_blue,replica=lpdosput50628 0s sidecar
01D3B398H4TD0360RWYPWC0QZ1 09-02-2019 21:00:00 09-02-2019 23:00:00 2h0m0s 38h0m0s 4,587,710 780,265,824 7,088,286 1 false cluster=v4/customer/paas/e1,environment=ecp-prometheus-e1_blue,replica=lpdosput50627 0s sidecar
01D3BA4ZR9WCGATZYNFT5TTAPM 09-02-2019 23:00:00 10-02-2019 01:00:00 2h0m0s 38h0m0s 4,515,394 780,694,678 7,017,329 1 false cluster=v4/customer/paas/e1,environment=ecp-prometheus-e1_blue,replica=lpdosput50628 0s sidecar
01D3BA4ZSGWF6DDF2MW0M3KN0Z 09-02-2019 23:00:00 10-02-2019 01:00:00 2h0m0s 38h0m0s 4,515,419 780,694,401 7,017,354 1 false cluster=v4/customer/paas/e1,environment=ecp-prometheus-e1_blue,replica=lpdosput50627 0s sidecar
level=info ts=2019-02-10T10:01:53.566092588Z caller=main.go:184 msg=exiting

@PsychoSid
Author

So yes. Also, as per the Slack channel, my storage nodes are panicking:

panic: runtime error: slice bounds out of range
goroutine 836 [running]:
github.com/improbable-eng/thanos/pkg/store.(*bucketChunkReader).loadChunks(0xc11b060e40, 0x127d1e0, 0xc104d4d900, 0xc0cd0fc000, 0x13a1, 0x4800, 0x7d, 0x1911011418ddefaf, 0x0, 0x0)
/go/src/github.com/improbable-eng/thanos/pkg/store/bucket.go:1573 +0x6d3
github.com/improbable-eng/thanos/pkg/store.(*bucketChunkReader).preload.func3(0x4346e9, 0x11757b0)
/go/src/github.com/improbable-eng/thanos/pkg/store/bucket.go:1544 +0xb2
github.com/improbable-eng/thanos/vendor/github.com/oklog/run.(*Group).Run.func1(0xc104db0e40, 0xc104db0d80, 0xc0ca97e0f0)
/go/src/github.com/improbable-eng/thanos/vendor/github.com/oklog/run/group.go:38 +0x27
created by github.com/improbable-eng/thanos/vendor/github.com/oklog/run.(*Group).Run
/go/src/github.com/improbable-eng/thanos/vendor/github.com/oklog/run/group.go:37 +0xbe
thanos_storage.service: main process exited, code=exited, status=2/INVALIDARGUMENT

This is happening in 2 environments now.

@PsychoSid
Author

I wonder if these are related at all?

@GiedriusS
Member

GiedriusS commented Feb 10, 2019

It seems like it crashes at the same place as in #829 ... you might be running into the same issue, and it doesn't seem to be related to being out of memory :( So I guess your query node sends a query to Thanos Store, it crashes, and then you only get data from Prometheus. Are you running 0.3.0? What kind of storage provider do you use with Thanos Store?

@PsychoSid
Author

PsychoSid commented Feb 10, 2019

Yes to all those questions, and it's an S3 object store.
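For reference, the bucket configuration we pass via --objstore.config-file looks roughly like this (a sketch; bucket name, endpoint, and credentials are placeholders):

type: S3
config:
  # Placeholder bucket name and internal S3 endpoint.
  bucket: "thanos-metrics"
  endpoint: "s3.internal.example.com"
  access_key: "REDACTED"
  secret_key: "REDACTED"
  insecure: false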

@PsychoSid
Author

Verify tells me I have some issues. I'm not sure how to resolve these, as repair cannot fix them (the invocation I used is sketched below the output):
level=info ts=2019-02-10T12:06:55.976663571Z caller=index_issue.go:130 msg="verified issue" with-repair=true issue=index_issue
level=info ts=2019-02-10T12:06:55.976901997Z caller=overlapped_blocks.go:25 msg="started verifying issue" with-repair=true issue=overlapped_blocks
level=warn ts=2019-02-10T12:06:56.324043501Z caller=overlapped_blocks.go:38 msg="found overlapped blocks" group="0@{cluster="v4/customer/paas/e1",environment="ecp-prometheus-e1_blue",replica="lpdosput50627"}" overlap="[mint: 1549756800000, maxt: 1549756800000, range: 0s, blocks: 2]: <ulid: 01D3ANHT1KXTQVCV0EZMAT8S9P, mint: 1549756800000, maxt: 1549764000000, range: 2h0m0s>, <ulid: 01D3AQNCX7EKC0VHJZA9EXDC15, mint: 1549728000000, maxt: 1549756800000, range: 8h0m0s>"
level=warn ts=2019-02-10T12:06:56.32416754Z caller=overlapped_blocks.go:38 msg="found overlapped blocks" group="0@{cluster="v4/customer/paas/e1",environment="ecp-prometheus-e1_blue",replica="lpdosput50628"}" overlap="[mint: 1549756800000, maxt: 1549756800000, range: 0s, blocks: 2]: <ulid: 01D3ANHSYXWCYVPCTSR71GQEZC, mint: 1549756800000, maxt: 1549764000000, range: 2h0m0s>, <ulid: 01D3AR4HQ1W5SF523CE7DX0Q37, mint: 1549728000000, maxt: 1549756800000, range: 8h0m0s>"
level=warn ts=2019-02-10T12:06:56.324207636Z caller=overlapped_blocks.go:42 msg="repair is not implemented for this issue" issue=overlapped_blocks
level=info ts=2019-02-10T12:06:56.324223267Z caller=verify.go:68 msg="verify completed" issues=2 repair=true
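The run that produced this was roughly (a sketch; the config path is a placeholder):

thanos bucket verify \
  --objstore.config-file=/etc/thanos/bucket.yml \
  --repair

As the warning above says, there is no automated repair for the overlapped_blocks issue.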

@GiedriusS
Member

Did it make any changes to the indices? Was there any output prior to the first line there? There should have been. I wonder if this is related to #816. The issue with overlapping blocks is that the timestamps of different blocks overlap and they have the same set of labels, so it is impossible to determine which one we should choose when responding to a query. You should choose the one that has the real data and delete the other. I wonder, however, how that could have happened. Is the issue still reproducible now?
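In case it helps: a block is just a top-level directory in the bucket named by its ULID, so deleting one from an S3 backend is roughly this (bucket name and ULID are placeholders for the block you decided to drop):

# Remove the chosen block's directory (ULID) from the bucket.
aws s3 rm s3://thanos-metrics/01XXXXXXXXXXXXXXXXXXXXXXXX/ --recursive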

@GiedriusS
Member

As for the resource usage: we are aware of the issues here: #814. It seems that it is especially painful for you, since your index files are probably huge considering you have 000s of K8s nodes (you probably mean pods?). Maybe the compactor was killed in the middle of the uploading/deleting process, considering how big the index files are, although that still shouldn't happen. Those numbers seem excessively large to me for the number of time series that you have.

@PsychoSid
Author

> Did it make any changes to the indices? Was there any output prior to the first line there? There should have been. I wonder if this is related to #816. The issue with overlapping blocks is that the timestamps of different blocks overlap and they have the same set of labels, so it is impossible to determine which one we should choose when responding to a query. You should choose the one that has the real data and delete the other. I wonder, however, how that could have happened. Is the issue still reproducible now?

There are no prior lines, and the issue persists on one of our clusters. On that cluster I removed the blocks that report overlaps, and the storage node still panics.

@PsychoSid
Author

> As for the resource usage: we are aware of the issues here: #814. It seems that it is especially painful for you, since your index files are probably huge considering you have 000s of K8s nodes (you probably mean pods?). Maybe the compactor was killed in the middle of the uploading/deleting process, considering how big the index files are, although that still shouldn't happen. Those numbers seem excessively large to me for the number of time series that you have.

000's of nodes, many 000's of pods :)

We did have issues with running out of quota space on our internal object store, not to mention the OOM issues, so these may have caused the problem.

If I can help with any debugging information I will be happy to provide it. If not I should probably wipe this environment and start again. Please let me know. Thanks for all the guidance.

@bwplotka
Member

In terms of OOM, this should be fixed in the 0.10.0 release, given the improvements we made for Prometheus compaction (:

Closing for now, unless you can still reproduce this in the newest version (:
