Compactor Retention Issue and Resource Usage #824
Does
level=info ts=2019-02-10T10:01:53.246252562Z caller=factory.go:39 msg="loading bucket configuration"
So yes. Also, as per the Slack channel, my storage nodes are panicking: panic: runtime error: slice bounds out of range. This is happening in 2 environments now.
I wonder if these are related at all?
It seems like it crashes at the same place as in #829 ... you might be running into the same issue, and it doesn't seem like this is related to being out of memory :( So I guess your query node sends a query to Thanos Store, it crashes, and then you only get data from Prometheus. Are you running 0.3.0? What kind of storage provider do you use in Thanos Store?
Yes to all those questions, and it's an S3 object store.
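For reference, a minimal sketch of an S3 bucket configuration as consumed by the Thanos components; the file path, bucket name, endpoint, and credentials below are placeholders, not values taken from this thread:
# Hypothetical objstore config; substitute your own bucket, endpoint and credentials.
cat > /etc/thanos/bucket.yml <<'EOF'
type: S3
config:
  bucket: "my-thanos-bucket"
  endpoint: "s3.us-east-1.amazonaws.com"
  access_key: "ACCESS_KEY"
  secret_key: "SECRET_KEY"
  insecure: false
EOF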
Verify tells me I have some issues. I'm not sure how to resolve these, as repair cannot fix them:
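For context, a rough sketch of how bucket verification and repair are typically invoked; flag names can differ between Thanos versions, so treat this as an assumption rather than the exact commands used here:
# Check the bucket for known issues (read-only).
thanos bucket verify --objstore.config-file=/etc/thanos/bucket.yml
# Attempt automatic repair where a repairer exists; some issues (e.g. overlapping blocks) have none.
thanos bucket verify --objstore.config-file=/etc/thanos/bucket.yml --repair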
Did it make any changes to the indices? Was there any output prior to the first line there? There should've been. I wonder if this is related to #816. The issue with overlapping blocks is that the time ranges of different blocks overlap while they have the same set of labels, so it is impossible to determine which one we should choose when responding to a query. You should pick the one that holds the real data and delete the other. I wonder, however, how that could've happened. Is the issue still reproducible now?
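As an illustration of what "overlapping" means here, one way to eyeball it is to compare each block's meta.json time range and external labels; this is only a sketch, and the bucket name is a placeholder:
# List each block's ULID, time range and Thanos labels, sorted by start time,
# so blocks with identical labels and overlapping [minTime, maxTime] stand out.
for b in $(aws s3 ls s3://my-thanos-bucket/ | awk '/PRE/ {print $2}' | tr -d '/'); do
  aws s3 cp "s3://my-thanos-bucket/$b/meta.json" - 2>/dev/null |
    jq -r '[.ulid, .minTime, .maxTime, (.thanos.labels | tostring)] | @tsv'
done | sort -k2 -n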
As for the resource usage, we are aware of the issues here: #814. It seems that it is especially painful for you since your index files are probably huge, considering you have thousands of K8s nodes (you probably mean pods?). Maybe the compactor has been killed in the middle of the uploading/deleting process, considering how big the index files are, though that still shouldn't have happened. Those numbers seem excessively large to me for the amount of time series that you have.
There are no prior lines, and the issue persists on one of our clusters. I removed the blocks that report overlaps and the storage node still panics.
Thousands of nodes, many thousands of pods :) We did have issues with running out of our quota space on our internal object store, not to mention the OOM issues, so these may have caused the problem. If I can help with any debugging information I will be happy to provide it. If not, I should probably wipe this environment and start again. Please let me know. Thanks for all the guidance.
In terms of OOM, this should be fixed in the 0.10.0 release, given the improvements we made to Prometheus compaction (: Closing for now, unless you can still reproduce this in the newest version (:
I am running the compactor with these retention values:
--retention.resolution-raw=14d --retention.resolution-5m=30d --retention.resolution-1h=60d
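For context, a minimal sketch of a full compactor invocation with these retention flags; the data directory and objstore config path are assumed placeholders, not the exact ones used in this setup:
# Run the compactor continuously against the bucket, with the retention settings above.
thanos compact \
  --data-dir=/var/thanos/compact \
  --objstore.config-file=/etc/thanos/bucket.yml \
  --retention.resolution-raw=14d \
  --retention.resolution-5m=30d \
  --retention.resolution-1h=60d \
  --wait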
The stores show a min date of 2019-01-14T17:00:00-07:00.
For a while we saw data go back to the 14th on our dashboard; then, 2 days ago, it stopped returning any data beyond January 24, 2019 1:00:00 AM.
What could have happened and why is it happening? We have minimal space available to our Prometheus servers (3 days' worth for relatively high-volume Kubernetes clusters with thousands of nodes), and it's more cost effective for our company to keep longer-term data in the object store rather than on block devices or SAN/NAS-based NFS space.
With 1 TB uploaded by the sidecars, the compactor is now using 120 GB of RAM and 600 GB of disk space.
I am aware of #813, but what concerns me is why I cannot access data past 14 days, even though Thanos seems to think it exists, since I started the sidecars on the 14th of Jan.