Add deadman's switch #149

Merged: nathanielc merged 1 commit into master from nc-issue#137 on Jan 20, 2016

Conversation

nathanielc
Contributor

Fixes #137

A deadman's switch can now be added to any node within any task. This works by exposing each node's internal statistics to the TICKscript itself through the stats method.

The stats method emits the internal stats of a node at a given interval. The deadman's switch uses the collected stat to trigger an alert if the rate of collected points drops below a threshold.

The deadman method is available on all nodes and is a helper for easily creating a deadman's switch.

This:

 var data = stream.from()...
    // Trigger a critical alert if the throughput drops below 100 events per 10s, checked every 10s.
    data.deadman(100.0, 10s)
    // Do normal processing of data
    data....

is equivalent to this:

 var data = stream.from()...
    // Trigger a critical alert if the throughput drops below 100 events per 10s, checked every 10s.
    data.stats(10s)
          .derivative('collected')
              .nonNegative()
              .unit(10s)
          .alert()
              .id('node \'stream0\' in task \'{{ .TaskName }}\'')
              .message('{{ .ID }} is dead: {{ index .Fields "collected" }} points/10s')
              .crit(lambda: "collected" <= 100.0)
    // Do normal processing of data
    data....
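
To make the expanded pipeline concrete, here is a minimal Python sketch of the arithmetic it appears to perform. This is not Kapacitor code: the function name `deadman_alerts` and its parameters are hypothetical, and it assumes that stats emits a cumulative "collected" counter every interval, that derivative scales the change to the configured unit, and that nonNegative drops negative changes (for example after a counter reset).

```python
from typing import List


def deadman_alerts(collected: List[int], interval_s: float,
                   unit_s: float, threshold: float) -> List[bool]:
    """Return one alert decision per pair of consecutive counter samples.

    collected  -- cumulative counter values, sampled every interval_s seconds
    unit_s     -- unit the rate is normalized to (the .unit(10s) step)
    threshold  -- alert when the normalized rate is <= this value
    """
    alerts = []
    for prev, curr in zip(collected, collected[1:]):
        delta = curr - prev
        if delta < 0:
            # Assumed nonNegative() behavior: skip negative changes.
            continue
        rate = delta / interval_s * unit_s  # points per unit
        alerts.append(rate <= threshold)    # same comparison as the .crit lambda
    return alerts


if __name__ == "__main__":
    # Counter sampled every 10s; throughput falls from 250 to 50 to 0 points/10s.
    print(deadman_alerts([0, 250, 500, 550, 550], 10, 10, 100.0))
```

With the interval and unit both set to 10s, the rate is simply the number of points collected since the previous sample, which is why the alert message reports points/10s.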

In addition to the deadman method, you can globally configure all stream tasks to have deadman's switches.

[deadman]
  global = true
  interval = "10s"
  threshold = 100.0

The id and message fields can also be configured globally.

EDIT: Changed examples to match final implementation

@nathanielc nathanielc force-pushed the nc-issue#137 branch 2 times, most recently from f0fc7de to 94c1649 Compare January 19, 2016 22:07
@nathanielc
Contributor Author

@gunnaraasen @rossmcdonald I know you both had some original input. How does the final result look?

@rossmcdonald
Contributor

@nathanielc Just to make sure, if I wanted to alert if throughput drops below 1 event every 10s, would that be the following?

data.deadman(10s, 0.1)

I can see this notation being a little confusing, since it's not obvious that it samples every second. Without looking at any documentation, I would assume that this:

data.deadman(10s, 100.0)

Means "alert when you receive less than 100 events in 10 seconds".

@nathanielc
Contributor Author

@rossmcdonald Correct. I was torn as well on which I liked.

With this one, data.deadman(10s, 0.1), you can always reason about it in units of seconds. So if you want to increase the frequency at which you check, you only need to change the interval: data.deadman(1s, 0.1).

With this one, data.deadman(10s, 100.0), if you change the time you also have to remember to change the threshold, or you have just broken your deadman's switch: data.deadman(1s, 10.0).
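
The trade-off between the two notations is plain unit conversion, which the following Python sketch spells out. The helper names (`count_to_rate`, `rate_to_count`, `rescale_count`) are hypothetical and exist only for illustration; nothing here is part of Kapacitor.

```python
def count_to_rate(events: float, window_s: float) -> float:
    """Convert 'events per window' into 'events per second'."""
    return events / window_s


def rate_to_count(rate_per_s: float, window_s: float) -> float:
    """Convert 'events per second' into 'events per window'."""
    return rate_per_s * window_s


def rescale_count(events: float, old_window_s: float, new_window_s: float) -> float:
    """Threshold to use after changing the window in the count notation."""
    return rate_to_count(count_to_rate(events, old_window_s), new_window_s)


if __name__ == "__main__":
    # Rate notation: 1 event per 10s is a per-second threshold that does not
    # change when the check interval changes.
    print(count_to_rate(1, 10))
    # Count notation: moving from a 10s window to a 1s window means the
    # threshold of 100.0 has to be rescaled by hand.
    print(rescale_count(100.0, 10, 1))
```

In the rate notation the threshold is independent of the interval, at the cost of converting to events per second up front; in the count notation the threshold reads naturally but has to be rescaled whenever the window changes.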

@nathanielc
Contributor Author

To be clear, I can easily do it either way; I just decided that once you know it's always events/second, it becomes easier to use.

@gunnaraasen
Contributor

I'll second @rossmcdonald's confusion on the arguments to the deadman function.

It would be nice if the rate interval in the deadman helper was configurable. This would make deadman checks over larger time periods, e.g. >15 min, easier and help add context for other people looking at the TICKscript after it's written. It's not immediately obvious that .deadman(5m, 0.0111) means check for a rate less than 10 events over the last 15 minutes (10 / 900s ≈ 0.0111 events/second) every 5 minutes.

Also, what would be the recommendation for alerting on an absolute time since the last event? Say alerting only when a service hasn't sent any events in the last 10 minutes.

@nathanielc
Contributor Author

@gunnaraasen So do you want a third argument, like

.deadman(5m, 10.0, 15m)

or is

.deadman(5m, 3.33) 

good enough?

@nathanielc
Contributor Author

@gunnaraasen As for alerting on an absolute time since the last event, that is the same as a 0 threshold for that time period, correct?

.deadman(10m, 0)

If you want both, it's doable (but not via the configuration):

var data = stream.from()...

data.deadman(10m, 0.0)
data.deadman(1h, 5.0) // using Ross's notation, read as fewer than 5 events for the hour

// Do normal data processing
data....
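
As a rough illustration of how the two checks above would behave under Ross's count-per-window reading, here is a small Python sketch. It is an assumption-laden model, not the Kapacitor implementation: `is_dead` is a hypothetical helper that counts events in a trailing window, whereas the real switch samples a node counter at fixed intervals.

```python
from bisect import bisect_right
from typing import Sequence


def is_dead(event_times: Sequence[float], now: float,
            window_s: float, threshold: float) -> bool:
    """Alert when the number of events in (now - window_s, now] is <= threshold.

    event_times must be sorted ascending and expressed in seconds.
    """
    first = bisect_right(event_times, now - window_s)  # events at or before window start
    last = bisect_right(event_times, now)              # events at or before now
    return (last - first) <= threshold


if __name__ == "__main__":
    events = [0, 600, 1200, 3000]  # arrival times in seconds
    now = 3600
    # Threshold 0 over 10 minutes: fires only when nothing arrived at all.
    print(is_dead(events, now, 10 * 60, 0.0))
    # Threshold 5 over 1 hour: fires when the hourly volume is too low.
    print(is_dead(events, now, 60 * 60, 5.0))
```

A threshold of 0 turns the rate check into a silence check, which is why it covers the "no events in the last 10 minutes" case without a separate mechanism.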

As an aside, I am leaning towards Ross's notation, since otherwise it's a pain to calculate.

@rossmcdonald
Contributor

Regarding:

With this one, data.deadman(10s, 0.1), you can always reason about it in units of seconds.

I like the idea, but I think in practice it will just end up becoming confusing since it's not explicitly obvious from the function name. It also requires a little bit of mental arithmetic that I think may be a bit too cumbersome.

@gunnaraasen
Contributor

I like Ross' notation as well. Although switching the order of the arguments might be slightly more readable:

data.deadman(5.0, 1h) // Then reading left to right it would be 5 events / 1 hour

However, then it's unclear how often the rate is checked. Does it default to checking once an hour?

Also, the 0 rate threshold works perfectly for checking the absolute time since the last event.

@rossmcdonald
Contributor

I also think that switching the parameters would help improve readability.

nathanielc pushed a commit that referenced this pull request Jan 20, 2016
@nathanielc nathanielc merged commit af8e166 into master Jan 20, 2016
@nathanielc nathanielc deleted the nc-issue#137 branch January 20, 2016 18:12