
Time weighted average/mean #7445

Closed
deandob opened this issue Oct 11, 2016 · 20 comments

Comments

@deandob

deandob commented Oct 11, 2016

Proposal:
Provide an aggregate function that computes a time-weighted mean, similar to the interpolation function in OpenTSDB: http://opentsdb.net/docs/build/html/user_guide/query/aggregators.html

I am new to InfluxDB and may have missed something, but after an hour of research it looks like this is not natively possible in InfluxDB. Without the possibility of user-defined functions, this type of calculation has to be done in the host application, negating some of the benefits and performance of InfluxDB.

Use Case:
For IoT there are a lot of sensor scenarios where data isn't sent at regular intervals, mostly because the sensors react to changes (which saves battery and network bandwidth compared to repeatedly sending the same unchanging value). For example, temperature measurements won't change while a freezer door is closed (which is most of the time), so no data is sent, but the sensor will send many data points during the short while the door is open (getting warmer) and just after it is closed (getting colder again). A conventional mean will not represent the average temperature in the freezer, because most of the ingested sensor data comes from the short time the door was open; readings therefore need to be weighted by the time interval between them.

Apologies if there is already a feature request for this but I couldn't find anything similar in the GH issues log. I like the philosophy behind influxdb and the active development - so hoping this is a feature that could be added. Also surprised that this isn't a more requested feature as it should be a common use case. I don't want to use Kapacitor as my servers are Windows not Linux.

Thanks for listening.

@jsternberg
Contributor

I think you might be able to do something similar to this using FILL(). Unfortunately you can't use fill unless you use an aggregate, but you might be able to use moving_average along with mean and an aggregate interval to get something close to what you want. We haven't added a weighted moving average yet, though, which seems like it would be better for your purposes.

@deandob
Author

deandob commented Oct 12, 2016

Hi,

Fill doesn't really help, as it interpolates to produce a regular time series (values at fixed time intervals based on interpolated values) from an event-based series, which gives incorrect readings when the changes are step changes. See the examples below.

A sensor sends the following in response to environment changes at 10:23am and 10:24am:
10:20am value = 3
10:23am value = 10
10:24am value = 8
If we were to get a time weighted average between 10:20am and 10:25am the calculation is:
(3 x 3min + 10 x 1min + 8 x 1min) / 5min = 5.4

If we were to use FILL then a mean we would get:
10:20am value = 3
10:21am value = 5.33 (interpolated)
10:22am value = 7.66 (interpolated)
10:23am value = 10
10:24am value = 8
MEAN would give you the value of (3 + 5.33 + 7.66 + 10 + 8) / 5 = 6.8. Quite different.
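The step-hold calculation above can be sketched in Python (illustrative only, not InfluxDB code; timestamps are expressed as minutes since 10:00 for simplicity):

```python
def time_weighted_mean(points, start, end):
    """Step-interpolated time-weighted mean over [start, end].

    points: (timestamp, value) pairs sorted by time; each value is held
    until the next reading.  Assumes the first reading is at or before
    `start`, so the whole window is covered.
    """
    total = 0.0
    for i, (t, v) in enumerate(points):
        t0 = max(t, start)                                 # clip to window start
        t1 = points[i + 1][0] if i + 1 < len(points) else end
        t1 = min(t1, end)                                  # clip to window end
        if t1 > t0:
            total += v * (t1 - t0)
    return total / (end - start)

# Readings at 10:20, 10:23, 10:24 (minutes since 10:00)
readings = [(20, 3), (23, 10), (24, 8)]
print(time_weighted_mean(readings, 20, 25))  # -> 5.4
```

Each value is weighted by how long it was held, so the result matches the hand calculation (27 value-minutes over a 5-minute window), not the 6.8 obtained from the linearly filled series.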

Why is this a common use case for IoT? Because having sensors send data at regular intervals wastes network bandwidth, battery life, CPU time, and storage when most of the time we only want to capture the changes the sensor is monitoring. Interpolation is useful when change is linear and the sensor's resolution is poor (i.e. it can't capture smaller changes) or intermittent. But where step changes are involved (e.g. common when calculating safety exposure in an industrial setting), only a time-weighted average gives the correct answer.

Is this something you would consider for a future release? The math isn't too hard, but it would require a new aggregate function to be added (I could help with a pull request, but I'm no Go expert...).

@deandob deandob closed this as completed Oct 12, 2016
@desa
Contributor

desa commented Dec 9, 2016

@deandob Why did you close this issue? I've got a couple of questions about how a time-weighted average should work. Namely, in the following case:

tag  time  value
A    1     2
B    2     3
A    2     4
A    3     5

What should the time-weighted average be?

@dreamon-dd

Hello Everyone,

Not sure if the issue creator is still interested in this feature, but we find it quite important.
@desa As for how it should work, I think @deandob explained it in quite some detail: the mean should be calculated according to how long each value was held. Another question is whether the result should be grouped by tags. In my schema I would have a separate field per sensor and no tags at all, so there the calculation is clear, but I would assume that if different values exist for different tags, the result should probably be grouped by the tag values. Therefore, in the aforementioned case, the result for grouping over the interval from time 1 to time 3 would be:

tag  timeWeightedMean(value)
A    (2*(2-1) + 4*(3-2))/(3-1) = 3
B    3

But the result depends on the grouping time interval: for the interval from time 1 to time 4, the result for A would instead be (2*(2-1) + 4*(3-2) + 5*(4-3))/(4-1) = 11/3.
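The grouped calculation can be sketched in Python (illustrative only, not InfluxDB code). One open design question is what to do with B, whose first reading falls inside the window; here the sum is divided by the covered duration rather than the full window length, which reproduces the table above:

```python
from collections import defaultdict

def time_weighted_mean(points, start, end):
    """Step-interpolated time-weighted mean, divided by the covered
    duration, so a tag whose first reading is inside the window (like B
    at time 2) is averaged only over the time it actually has a value."""
    total = covered = 0.0
    for i, (t, v) in enumerate(points):
        t0 = max(t, start)
        t1 = min(points[i + 1][0] if i + 1 < len(points) else end, end)
        if t1 > t0:
            total += v * (t1 - t0)
            covered += t1 - t0
    return total / covered

rows = [("A", 1, 2), ("B", 2, 3), ("A", 2, 4), ("A", 3, 5)]
by_tag = defaultdict(list)
for tag, t, v in rows:
    by_tag[tag].append((t, v))

for tag in sorted(by_tag):
    print(tag, time_weighted_mean(sorted(by_tag[tag]), 1, 3))
# A 3.0
# B 3.0
```

Dividing by the full window length instead would give B a mean of 1.5 over [1, 3], so this choice is a genuine semantic decision, not a detail.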

@rs-blade

rs-blade commented May 11, 2017

There are two aspects that add more complexity to this:

  1. Limiting the duration of a value to the aggregation interval
  2. Having a maximum valid time for values (values older than that are considered as unreliable; maybe there was a loss in communication to the device).

An example to make this clear:
10:18am 5
10:21am 3
10:23am 10
10:24am 8
10:29am 12
10:32am 14

The maximum valid time is 4 minutes and we want to get the weighted mean value for interval 10:20am to 10:30am.
The result would be:
10:20am-10:21am 5 => 5 * 1min !!! This is where the interval boundary limitation kicks in !!!
10:21am-10:23am 3 => 3 * 2min
10:23am-10:24am 10 => 10 * 1min
10:24am-10:28am 8 => 8 * 4min !!! This is where the maximum validation time kicks in !!!!
10:29am-10:30am 12 => 12 * 1min !!! This is where the interval boundary limitation kicks in !!!

=> (5*1 + 3*2 + 10*1 + 8*4 + 12*1) / (1 + 2 + 1 + 4 + 1) = 65 / 9 = 7.222...
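Both clipping rules described above (limiting each value to the aggregation interval, plus a maximum validity time) can be sketched in Python. This is illustrative only; max_valid is a hypothetical parameter name, not an existing InfluxDB option, and timestamps are minutes since 10:00:

```python
def time_weighted_mean(points, start, end, max_valid=None):
    """Step-interpolated time-weighted mean with window clipping and an
    optional validity limit.

    Each value holds until the next reading, the window end, or (if
    max_valid is given) max_valid time units after it was reported,
    whichever comes first.  Dividing by the covered duration excludes
    any gap beyond the validity limit from the average.
    """
    total = covered = 0.0
    for i, (t, v) in enumerate(points):
        t0 = max(t, start)                                 # interval boundary
        t1 = points[i + 1][0] if i + 1 < len(points) else end
        if max_valid is not None:
            t1 = min(t1, t + max_valid)                    # maximum valid time
        t1 = min(t1, end)                                  # interval boundary
        if t1 > t0:
            total += v * (t1 - t0)
            covered += t1 - t0
    return total / covered

readings = [(18, 5), (21, 3), (23, 10), (24, 8), (29, 12), (32, 14)]
print(time_weighted_mean(readings, 20, 30, max_valid=4))  # 65/9, about 7.222
```

Calling the same function without max_valid keeps the 8 alive for the full 10:24am-10:29am gap and yields 73/10 = 7.3, which shows how much the validity limit changes the result on this data.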

@arastoo78

Hi,
the time-average aggregation is also very important for us. Does anyone know whether this feature has been implemented since the issue was opened? Or can it somehow be achieved with the existing API? We have a simple implementation of the time-average algorithm following the calculation above and could contribute it if desired.

@deandob
Author

deandob commented Jul 1, 2018

Apologies for closing this issue prematurely, it was accidental.
I'm still looking for this feature in InfluxDB, and I'm about to launch a new product that could potentially use the enterprise version of InfluxDB if this feature were available. At this point we are looking at either implementing time weighting outside InfluxDB (which reduces its benefits for us) or at alternative products. As mentioned earlier, this would be a common use case for InfluxDB in IoT scenarios.

@gpomykala

@ruediger-stevens' comment about the duration of the grouping interval is valid.
There are 2 main issues:

  • how to pass the interval start/end time to the reducer (10:20 - 10:30)
  • how to determine the value of the point at 10:18, given that the iterator will not contain it because of the grouping interval

@nyhu

nyhu commented Dec 3, 2018

Hello,
I obviously need this too, but I "disagree" with @ruediger-stevens on these two points:

  • Limiting the duration of a value to the aggregation interval
  • Having a maximum valid time for values (values older than that are considered as unreliable; maybe there was a loss in communication to the device).

I also need this in a few cases, so I clearly understand the need, but it adds a lot of business logic to a simple function, which we cannot afford today for IoT.

I suggest a first implementation with time_weighted_average(<field>), and later we can think about time_weighted_average(<field>[, <max_valid_time>]) with no breaking changes. The default could be 0, meaning infinite, like the limit() implementation.
max_valid_time raises a lot of unanswered questions, so I suggest postponing it (in the same PR or the next).

I match max_valid_time with "having a maximum valid time". As for limiting the duration to the aggregation interval, I don't completely understand it; could you give a sample in @gpomykala's proposal as a test case? Is this a parameter for the function or standard behavior?

I'll do my best to follow the implementation and speed this up. I don't want to stand between you and a feature, just to make sure something gets out ;)

We could also discuss workarounds for handling sensor quality (also part of the OPC-UA implementation). I'm not yet sure it should be built into Influx. I would love to exchange ideas on a standard cloud implementation for handling such things in IoT scenarios!

@gpomykala

Hi @ruediger-stevens, max_valid_time could be an optional parameter for an average in a no-grouping scenario, similar to moving_average. It could be added at some point in the future as long as it remains optional. Nevertheless, it goes well beyond the definition of a time-weighted average and seems more like an application-specific requirement than a general-purpose function.

@arastoo78

arastoo78 commented Dec 5, 2018

For @ruediger-stevens' example, I would also suggest going for the algorithm without max_valid_time first. I also agree with @gpomykala that this is an application-specific feature. If you store value changes rather than samples at a fixed rate, it is normal to sometimes have larger gaps.

An example to make this clear:
10:18am 5
10:21am 3
10:23am 10
10:24am 8
10:29am 12
10:32am 14

We want to get the weighted mean value (time average aggregation) for interval 10:20am to 10:30am.
The result would be:
10:20am-10:21am 5 => 5 * 1min !!! This is where the interval boundary limitation kicks in !!!
10:21am-10:23am 3 => 3 * 2min
10:23am-10:24am 10 => 10 * 1min
10:24am-10:29am 8 => 8 * 5min
10:29am-10:30am 12 => 12 * 1min !!! This is where the interval boundary limitation kicks in !!!

=> (5 x 1 + 3 x 2 + 10 x 1 + 8 x 5 + 12 x 1) / (1 + 2 + 1 + 5 + 1) = 73 / 10 = 7.3

This would be the correct value I would focus on first.

@dgnorton dgnorton added the 1.x label Jan 7, 2019
@stale

stale bot commented Jul 23, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Jul 23, 2019
@stale

stale bot commented Jul 31, 2019

This issue has been automatically closed because it has not had recent activity. Please reopen if this issue is still important to you. Thank you for your contributions.

@stale stale bot closed this as completed Jul 31, 2019
@gpomykala

There is a pending PR for this issue; why has it been closed?

@ashishkaransingh

This is an important feature.

@iwita

iwita commented Apr 7, 2020

Yeah, I agree it would be great to be able to give more weight to older/newer values in a time series.

@inselbuch

This is a critical and fundamental capability of a time-series database. If anybody in the process industries is going to use InfluxDB, this feature must be supported.

@Sdelausnay

We are also looking for this feature to make sure we get consistent results across the systems of our IoT solution. The current functionality is too dependent on continuous sampling. When will the feature be available?

@fgheysels

I agree with previous commenters that this is a crucial feature for IoT solutions. I've seen that Azure Time Series Insights does support a time-weighted average.
In TSI, when using the twavg function, you also need to specify an interpolation strategy. This strategy defines how "irregular" intervals are handled: the interpolation kind can be either step or linear, and it determines which values are used for the points in time for which there is no data. After the interpolation has been applied, the TWAVG is calculated over the interpolated set of data.
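The step-versus-linear distinction can be illustrated on the readings from the earlier freezer example (illustrative Python; twavg_step and twavg_linear are hypothetical names, not TSI or InfluxDB APIs, and the points are assumed to span the whole window):

```python
def twavg_step(points):
    """Time-weighted mean with step interpolation: each value is held
    until the next reading (zero-order hold)."""
    total = 0.0
    for (t0, v0), (t1, _) in zip(points, points[1:]):
        total += v0 * (t1 - t0)
    return total / (points[-1][0] - points[0][0])

def twavg_linear(points):
    """Time-weighted mean with linear interpolation: the value ramps
    between consecutive readings (trapezoidal rule)."""
    area = 0.0
    for (t0, v0), (t1, v1) in zip(points, points[1:]):
        area += (v0 + v1) / 2 * (t1 - t0)
    return area / (points[-1][0] - points[0][0])

# Readings at 10:20, 10:23, 10:24 (minutes since 10:00)
readings = [(20, 3), (23, 10), (24, 8)]
print(twavg_step(readings))    # (3*3 + 10*1) / 4 = 4.75
print(twavg_linear(readings))  # (6.5*3 + 9*1) / 4 = 7.125
```

For step-changing sensors such as the freezer example, the step kind matches physical reality; the linear kind over-weights the ramp toward each new reading.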

@christianTF

christianTF commented Oct 21, 2021

I agree with all requesters of the time-weighted average, and it's sad that this commit never made it to master in any form: gpomykala@1b11912

I realized that Influx cannot compute correct time-based averages of time-series data with GROUP BY time(2m) fill(previous).
In my opinion, fill(previous) should do exactly what its name says: fill in missing datapoints with the previous value. That would give exactly a time-weighted average.
But fill(previous) only fills whole group intervals that contain no datapoints; it does not fill the gaps between existing datapoints, and GROUP BY itself only calculates the arithmetic mean of the existing datapoints in each group.

Especially in IoT environments, but really in every environment where data is not pulled but pushed by a change trigger on the sender's side, data arrives on an irregular basis, and only a time-based average yields the real value.

This feature is really important for non-regularly sent data points.
