Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement lifetime feature #78

Closed
wants to merge 1 commit into from
Closed

Conversation

lemoer
Copy link

@lemoer lemoer commented Jun 1, 2016

I wrote an implementation for the lifetime feature a few months ago. It's working great for me so I want to bring it back to you. What do you think about this implementation?

Implements #19

In some situations it is very useful if you can submit values to
the pushgateway that disappear after a certain while (if they
are not refreshed).

The lifetime is specified by adding a "Lifetime"-Field to your
HTTP-Header. The value is a string that the "ParseDuration"
(golang-builtin) function accepts as valid format.

Implements prometheus#19
@@ -70,6 +70,7 @@ type MetricStore interface {
type WriteRequest struct {
Labels map[string]string
Timestamp time.Time
Lifetime time.Duration
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would tend to name this Timeout.

@fabxc
Copy link
Contributor

fabxc commented Jun 2, 2016

Thanks for submitting this!
I left a few comments but am not very involved into the PGW. So others might have different stronger/different opinions.

@brian-brazil
Copy link
Contributor

What's the use case for this?

Usually the only time you want to delete data from the Pushgateway is when a job is being turned down.
Automatically deleting it will cause problems, such as alerts not firing. Please see https://prometheus.io/docs/practices/pushing/

@beorn7
Copy link
Member

beorn7 commented Jun 2, 2016

@brian-brazil The use-case discussion was done in #19, and a @brian-brazil said there: “A use case for this has appeared, it may be a way to allow clients who really really want to push to do so; while offering some GC.”
Has anything changed since then? (In which case we should close #19 with an explanation to avoid confusion.)

Personally, I'm not a fan of a timeout. In general, I see the need for the timeout as a sign for a not very “Promethean” design. But enough people seem to be stuck in such design patterns (for reasons) and expressed the need for timeouts, so I'd be willing to add it to the PGW if there is no strong resistance otherwise.

I'll look at the code once we have agreement that we need the feature at all.

@beorn7 beorn7 self-assigned this Jun 2, 2016
@brian-brazil
Copy link
Contributor

In the past year I think we've solidified a fair bit on not encouraging push via the pushgateway, and there's intricacies around getting such a thing to work at the semantic level. I'm against it.

What do others think? @juliusv @fabxc

@lemoer
Copy link
Author

lemoer commented Jun 2, 2016

My usecase:

  • The data is generated on a server were no extra port should be accessible to the public
    • Every extra open port is an additional attack vector (which is important in this scenario)
    • So the easiest way is using the pushgateway + cronjobs
  • Outdated data results in confusion
    • The time until we notice that the values are outdated (e.g. because the network is not reachable anymore) is precious

Or is there any better solution in this scenario?

@brian-brazil
Copy link
Contributor

We generally recommend running Prometheus on the same network as what it's monitoring. Trying to workaround with the pushgateway will run into issues as you've discovered, and this PR is not sufficient to fully resolve them.

@lemoer
Copy link
Author

lemoer commented Jun 2, 2016

Hm. I don't exactly know which problems are left over after with this feature. If I get it right, it is a problem if you wan't to use it in combination with the alert manager. Is it because the the statistics are simply removed and therefore not monitored anymore?

Possible solutions to this problem:

  1. Differentiate between properly removed values and timeouted ones
    • This could be done by simply replacing the value with another token:
      • NaN
  2. Change the behavior of the alert manager to send an alert when a statistic disappears
    • This should be configurable

Are there are any other issues regarding to this feature?

@beorn7
Copy link
Member

beorn7 commented Jun 6, 2016

@lemoer Introducing a semantics of timed-out vs removed into the Prometheus data model would be really confusing. Prometheus is all pull-based. Representing anything that only makes sense in push world would break consistency and is asking for a lot of trouble down the rabbit hole.

Alerting on disappearing metrics should already be possible with the existing expression language. No reason to change anything in Alertmanager.

@juliusv @fabxc I'd still be very interested in your opinion if we should declare a timeout for metrics on the pushgateway a goal or non-goal. I think all aspects have been discussed here, or in the past. I just would like to wrap this up, either by adding the timeout feature (even if not used for @lemoer 's use-case in the end, but quite possibly based on this PR) or by closing #19 as "won't fix" and thus this PR.

@beorn7
Copy link
Member

beorn7 commented Jun 13, 2016

@juliusv @fabxc Would be great to get an explicit answer from you (even if it is "I have no opinion on this") instead of just running into a timeout.

So far, @brian-brazil is pretty clearly against a timeout feature. I don't really like it myself, but I'm not completely decided yet. However, if nobody else of you speaks up for it, we'll toss it.

@juliusv
Copy link
Member

juliusv commented Jun 13, 2016

I see all arguments, but I don't have a strong opinion either way.

@beorn7
Copy link
Member

beorn7 commented Jun 13, 2016

@fabxc expressed not having a strong opinion on this earlier. I've just now talked to the OP of #19 , and he agreed that he never really ran into the use case again.

I conclude for now that the consensus is to not have a timeout feature in PGW. I'll close #19. @lemoer my sincerest apologies for not making up our mind earlier and thus luring you into creating this PR. We are still grateful for your contribution. At the very least, it sparked a discussion and we finally made up our mind.

@beorn7 beorn7 closed this Jun 13, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants