Mark all alternative stores but TSDB as deprecated #9105

Closed
wardbekker opened this issue Apr 11, 2023 · 10 comments · Fixed by #9246
Labels
docs-p2 Non-urgent and important type/docs Issues related to technical documentation; the Docs Squad uses this label across many repositories

Comments

@wardbekker
Member

Describe the bug
Object storage with TSDB (also called single store) is the recommended default going forward with 2.8+. The docs still contain many references to Cassandra/Bigtable/DynamoDB/BoltDB that might set a new Grafana Loki user off on the wrong foot. I recommend explicitly marking all of those references as legacy/deprecated in the docs to remove any confusion.

@wardbekker wardbekker added type/docs Issues related to technical documentation; the Docs Squad uses this label across many repositories docs-p2 Non-urgent and important labels Apr 11, 2023
@periklis
Collaborator

My 2 cents: mark everything but boltdb-shipper as deprecated ❤️

@wardbekker
Member Author

@periklis just to clarify, you prefer both the TSDB index and the boltdb-shipper to be marked supported, and the rest deprecated?

@wardbekker
Member Author

By the way, @JStickler, I'm planning on creating a PR for this.

@periklis
Collaborator

> @periklis just to clarify, you prefer both the TSDB index and the boltdb-shipper to be marked supported, and the rest deprecated?

Yes, that is my intention, since both will need to run in parallel for some installations out there. At least until 3.0, it wouldn't hurt to keep both stores supported.

@timansky

If TSDB is now the primary supported store, I couldn't find any TSDB configs for the Tanka or Helm installation (neither in the docs nor in the config/values files).

@brophyja

As a new Loki user, I can confirm that the list of documented back-end options is confusing, even if someone has decided to use S3/object storage.

For example, I have seen several comments "around the internet" that table-manager is going to be deprecated. My search to find specific details on the Loki roadmap or component life-cycle led me to this issue.

While I would like to be able to use DynamoDB for the Loki index (thus requiring table-manager) to limit the need for Persistent Volumes in a Kubernetes based deployment, I would like to know now if this combination of features is not going to be supported in the near future (we are still evaluating Loki).

If TSDB is the future, and components like table-manager are going to be deprecated, then these decisions should be clearly documented somewhere!

@liguozhong
Contributor

Hi all, this is a very important suggestion.
I found that the upstream Cortex project has an upper limit for a single tenant in blocks mode, of up to 20 million metrics, because the compactor is a bottleneck. Does our TSDB block implementation have a similar problem (the data size of a single tenant cannot exceed a threshold)? Loki's design discourages having too many labels, but once there are too many, will the compactor become the bottleneck of the Loki system?

The Cortex project recently deleted its Cassandra code before the compactor bottleneck was resolved, which prevents us from introducing Cortex into our infrastructure. I'm very worried about something like this happening in the Loki project. We should slow down our pace on removing Cassandra code. cc @owen-d

Currently, I am running a Loki tenant with a huge log ingest rate of about 3Gb/s, but I have no way of knowing how many labels there are. I am running the index in Cassandra. Cassandra scales well with no single point of bottleneck, which gives me peace of mind when operating larger log volumes.

Therefore, my suggestion is that we not delete the Cassandra index code so quickly. Various systems (Cortex, Thanos, Mimir, etc.) have made various attempts at the compactor's single-point bottleneck problem. Can we wait for this compactor problem to be completely resolved before deleting the Cassandra code?

https://aws.amazon.com/cn/blogs/opensource/scaling-cortex-with-parallel-compaction/
cortexproject/cortex#4843

@bmarinov

bmarinov commented May 5, 2023

The situation with the outdated documentation and the feeling of a 'moving target' with regard to recommendations is unfortunate. I almost choked on my coffee reading that boltdb is now apparently deprecated.

I am running 2.7.x versions in several environments, and just today I deployed 2.7.4 yet again (the version running in prod) on staging for some extensive troubleshooting. This is a version from February this year, and at that point in time there were only a few references to TSDB. Boltdb was the way to go.

What I'm trying to communicate is a legitimate problem with the developer UX with regard to documentation, configuration, and getting Loki up and running in a setup suitable for production.

Personally, I am happy to see boltdb go - since we moved to 2.7.x and configured compaction, we've had a ton of issues with queries. Turning on alerts with regular checks for the conditions exacerbated the problem greatly, making the alerts unusable:

  • attempting to query broken chunks ends up in 'database not open' and similar errors, going further back in time is OK
  • aforementioned broken chunks getting lost (gap in historical data which was queryable just a few moments ago)
  • race conditions with the compactor? index_12345/1683...400: no such file or directory
  • (Loki) error rate spikes whenever recent data is being queried (while being compacted?)
  • Error = failed to execute query A: rpc error: code = Unknown desc = database not open

And this is in the simplest possible, monolithic setup. Basically, a setup which should be easy to deploy and operate.

This is not meant as criticism, but having well-organized documentation for several common scenarios, deployment topologies, and the recommended storage backend (at the time) should be a very high priority. Having to slap together the configuration from several different pages and sources is time-intensive and does not inspire much confidence.

TL;DR: sensible defaults should be preconfigured, and complete configuration examples for several common scenarios should be available, discoverable, and kept up to date. Roadmap transparency would also be a big plus.

@jerryjvl

I could not agree more with the sentiment above.

The hardest challenge I've found in setting up Loki in production is that the documentation and examples are not very cohesive (based on flipping through the docs, I expect the same to hold true for Mimir and Tempo once I get around to those). When I try to correlate example configurations or official templates with the recommendations in the docs, in order to extrapolate partial setups into a full, cohesive setup for my own environment, I constantly have to backtrack and iterate. I keep discovering that parts of the advice I had incorporated have been outdated by changes elsewhere that weren't clearly signposted.

I love the power of Loki, but it is extremely difficult to synthesize a functioning configuration file from the docs and examples.

And I'm certainly not averse to putting my effort where my mouth is, but based on the drift in the existing content, I'm not confident any external party could keep the docs aligned with the moving target of internal developmental changes to how key parts of the configuration function together.

@liguozhong
Contributor

> The situation with the outdated documentation and the feeling of a 'moving target' with regard to recommendations is unfortunate. I almost choked on my coffee reading that boltdb is now apparently deprecated.
>
> I am running 2.7.x versions in several environments, and just today I deployed 2.7.4 yet again (the version running in prod) on staging for some extensive troubleshooting. This is a version from February this year, and at that point in time there were only a few references to TSDB. Boltdb was the way to go.
>
> What I'm trying to communicate is a legitimate problem with the developer UX with regard to documentation, configuration, and getting Loki up and running in a setup suitable for production.
>
> Personally, I am happy to see boltdb go - since we moved to 2.7.x and configured compaction, we've had a ton of issues with queries. Turning on alerts with regular checks for the conditions exacerbated the problem greatly, making the alerts unusable:
>
>   • attempting to query broken chunks ends up in 'database not open' and similar errors, going further back in time is OK
>   • aforementioned broken chunks getting lost (gap in historical data which was queryable just a few moments ago)
>   • race conditions with the compactor? index_12345/1683...400: no such file or directory
>   • (Loki) error rate spikes whenever recent data is being queried (while being compacted?)
>   • Error = failed to execute query A: rpc error: code = Unknown desc = database not open
>
> And this is in the simplest possible, monolithic setup. Basically, a setup which should be easy to deploy and operate.
>
> This is not meant as criticism, but having well-organized documentation for several common scenarios, deployment topologies, and the recommended storage backend (at the time) should be a very high priority. Having to slap together the configuration from several different pages and sources is time-intensive and does not inspire much confidence.
>
> TL;DR: sensible defaults should be preconfigured, and complete configuration examples for several common scenarios should be available, discoverable, and kept up to date. Roadmap transparency would also be a big plus.

👍 Thanks for sharing such a great TSDB migration experience.
I am worried about encountering the issues you described in my production Loki cluster, so I am still stuck on Cassandra.
We are still studying the TSDB source code to understand more of the details, so that we can better deal with the problems described above.

MasslessParticle added a commit that referenced this issue Aug 16, 2023
#9246)

**What this PR does / why we need it**:

Make it clearer that TSDB is the new recommended index, and make explicit
that stores other than TSDB and BoltDB shipper are deprecated and not recommended
Fixes #9105 

**Special notes for your reviewer**:

This needs to be changed in more places, but this is a start as it
concerns the main storage docs

**Checklist**
- [X] Reviewed the
[`CONTRIBUTING.md`](https://github.com/grafana/loki/blob/main/CONTRIBUTING.md)
guide (**required**)
- [X] Documentation added
- [ ] Tests updated
- [ ] `CHANGELOG.md` updated
- [ ] Changes that require user attention or interaction to upgrade are
documented in `docs/sources/upgrading/_index.md`

---------

Co-authored-by: Travis Patterson <[email protected]>
Co-authored-by: J Stickler <[email protected]>
Co-authored-by: Travis Patterson <[email protected]>
8 participants