Group measures by resource metadata history #1059
Conversation
This is an important fix that we had to develop. While reprocessing data for our billing system, we discovered that Gnocchi was always returning the latest metadata for the query, which caused us quite a bit of trouble. Therefore, we would like to contribute this fix back to the community.
Hi, and thank you! I don't see the CI running on this PR, so I suspect it no longer works or has been disabled. There's some real work to be done here before we can merge anything. This PR is quite large, so it may take some time to review, especially since nobody's usually around.
Force-pushed from ae23462 to d01029a
This is quite a large change. Can you please recheck/reopen to trigger a new CI check so we can try getting this back on the road?
@mergify rebase
Hey, I reacted but my real name is @Mergifyio
Force-pushed from d01029a to 4286df8
Force-pushed from b118051 to 8bfb691
Force-pushed from 8bfb691 to 53ea81e
* Update cloudkitty from branch 'master' to 213087869afe60f9d1291ccc017c41871687cd00 - Merge "Create 'use_all_resource_revisions' for Gnocchi collector"

Create 'use_all_resource_revisions' for Gnocchi collector

This option is useful when using Gnocchi with the patch introduced in gnocchixyz/gnocchi#1059. That patch can cause queries to return more than one entry per granularity (timespan), according to the revisions a resource has. This can be problematic when using the 'mutate' option of CloudKitty. Therefore, we propose this option to allow operators to discard all datapoints returned from Gnocchi except the last one in the granularity queried by CloudKitty. The default behavior is maintained, which means CloudKitty always uses all of the data points returned.

Change-Id: I051ae1fa3ef6ace9aa417f4ccdca929dab0274b2
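The discard behaviour described in that commit message can be sketched roughly as follows. This is a minimal illustration with hypothetical field names, not CloudKitty's actual data structures:

```python
def keep_last_revision(datapoints):
    """Keep only the last datapoint per (timestamp, granularity) key.

    Assumes datapoints are ordered so that later resource revisions
    come last, as described for Gnocchi's output above.
    """
    latest = {}
    for point in datapoints:
        # Later revisions overwrite earlier ones for the same slot.
        latest[(point["timestamp"], point["granularity"])] = point
    return list(latest.values())


points = [
    {"timestamp": "18:00", "granularity": 3600, "name": "test"},
    {"timestamp": "18:00", "granularity": 3600, "name": "test1"},
    {"timestamp": "19:00", "granularity": 3600, "name": "test1"},
]
print([p["name"] for p in keep_last_revision(points)])  # ['test1', 'test1']
```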
This is pretty long and complex, so I'm unable to review it deeply. But from a quick glance it looks good, so I think we can merge.
The only thing I'd change is the option, which I would remove.
Force-pushed from aa29c3d to c3dcc66
This patch will enable the feature that is being introduced via PR gnocchixyz/gnocchi#1059.
Force-pushed from c3dcc66 to d5c786d
This patch will enable the feature that is being introduced via PR gnocchixyz/gnocchi#1059.
Force-pushed from d5c786d to 6ab260c
I don't understand this enough to approve, but I'm not against it since it seems to leave current functionality as is.
Force-pushed from 6ab260c to f617dae
Force-pushed from 70c87c6 to 944a66e
Force-pushed from d2dff6c to ddc21b8
Pull request has been modified.
Force-pushed from ddc21b8 to dc04a02
Some more feedback on this patch now that I've managed to fix the CI again. My intent is to merge this PR after walking through it, but I have a few more remarks; see the review.
@jd my intent is to merge this, please provide feedback on this patch if you have any.
Please rebase against master as well to pick up all the changes needed for the CI to work.
Force-pushed from dc04a02 to 5ecae91
Pull request has been modified.
Force-pushed from d7d92b6 to 5a2564c
@jd I will merge this in the next couple of days if you don't have any feedback.
honestly I don't have enough time to review this, I'm sorry
    except indexer.IndexerException as e:
        api.abort(400, six.text_type(e))
    except Exception as e:
        LOG.exception(e)
probably not the right place to log
Hmm, maybe. Where do you suggest I put this log? I can't find another place. Thanks 🙂
i think the suggestion is that it should be propagated to the user since it's a REST call, so something like the api.abort(400, six.text_type(e)) above, if you truly want to catch any error.
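The suggested pattern can be sketched with a stand-in abort() helper. The helper and exception class below are hypothetical illustrations, not Gnocchi's actual api module:

```python
class HTTPError(Exception):
    """Stand-in for the REST framework's abort exception."""
    def __init__(self, status, detail):
        super().__init__(detail)
        self.status = status


def abort(status, detail):
    # Stand-in for api.abort(): turn an internal error into an
    # HTTP response instead of letting a bare stacktrace escape.
    raise HTTPError(status, detail)


def get_measures(fetch):
    try:
        return fetch()
    except ValueError as e:   # stands in for indexer.IndexerException
        abort(400, str(e))
    except Exception as e:    # "if you truly want to catch any error"
        abort(500, str(e))


def bad_fetch():
    raise ValueError("unknown metric")


try:
    get_measures(bad_fetch)
except HTTPError as e:
    print(e.status)  # 400
```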
See my comment below on the same point.
bringing the discussion back up here. :)
as i understand it, previously any unexpected exception would crash, spew a stacktrace, and effectively return some 500 error, but now it will log and return a 404 because of the empty result. i think 404 might be the wrong response code in this case.
It will not return an empty result; it will raise the exception again (and maybe return error 500 or 404, depending on how the caller handles exceptions), but that is how it behaved before, except for the logging.
oh. there's a raise. didn't notice it. my bad.
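For reference, the log-and-re-raise pattern being discussed looks roughly like this. This is a generic sketch, not the exact Gnocchi code:

```python
import logging

LOG = logging.getLogger(__name__)


def run_query(query):
    try:
        return query()
    except Exception as e:
        # Log the traceback for operators, then re-raise so the REST
        # layer can still map the exception to an HTTP status code.
        LOG.exception(e)
        raise


def failing_query():
    raise RuntimeError("backend unavailable")


try:
    run_query(failing_query)
except RuntimeError as e:
    print(e)  # backend unavailable
```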
i imagine this would only make sense for the rate/delta metrics? kind of wondering how it'd be used for other metric types. i keep thinking arbitrarily dividing something like cpu utilisation's aggregated measure across two groups doesn't make sense, and dividing something like disk.storage.bytes even less.
gnocchi/rest/aggregates/api.py
Outdated
from gnocchi import utils

LOG = logging.getLogger(__name__)
just to be consistent
- LOG = logging.getLogger(__name__)
+ LOG = daiquiri.getLogger(__name__)
Ok, thanks.
from gnocchi.rest import api
from gnocchi import storage

LOG = logging.getLogger(__name__)
- LOG = logging.getLogger(__name__)
+ LOG = daiquiri.getLogger(__name__)
Sure, no problem, thanks 🙂
        except indexer.NoSuchMetric:
            pass
    return results

@staticmethod
def _get_measures_by_name(resources, metric_wildcards, operations,
any reason we kept this here?
Including the exception, however, could expose information about internals that should not be exposed. So either we just leave it, return a 500 error, and treat it as a bug, or we return 400 and hope that's valid. I'm more for 500 than returning 400, since it's potentially not an input issue.
Yes, but in that case, we have some problems to discuss here:
- About line 252:
  - If we catch the Exception and abort with an error 500 or 400, in some cases the exception is an instance of webob's HTTPException, i.e. an already-handled exception. So if that exception raises an HTTP 400 and we convert it to an HTTP 500, it will be inconsistent. To solve this, we could catch webob's HTTPException before catching Exception itself and let it propagate unchanged.
  - If we don't catch the generic Exception, we delegate handling to whoever called this method, which is OK in my opinion, and that is what we are doing here right now, except that we log the exception before re-raising it. This is the initial behavior of this procedure (we only added the logging).
- About line 331: no reason, I missed this one, sorry.
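The except-clause ordering being discussed can be sketched like this: catch the framework's HTTP exception first so an already-mapped 400 is not accidentally converted into a 500. The classes below are generic stand-ins, not webob's actual ones:

```python
import logging

LOG = logging.getLogger(__name__)


class HTTPException(Exception):
    """Stand-in for webob's already-handled HTTP exceptions."""
    def __init__(self, status):
        super().__init__(status)
        self.status = status


def handle(call):
    try:
        return call()
    except HTTPException:
        # Already mapped to an HTTP status; propagate unchanged so a
        # 400 is not remapped into a 500.
        raise
    except Exception as e:
        # Truly unexpected: log, then let the caller decide the status.
        LOG.exception(e)
        raise


def raises_400():
    raise HTTPException(400)


try:
    handle(raises_400)
except HTTPException as e:
    print(e.status)  # 400
```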
                   "detail": [(str(e.metric.id), e.method)]})


def _get_measures_by_name(resources, metric_wildcards, operations,
slight preference that the _get_measures_by_name and get_measures_or_abort functions were not part of this grouper file. seems they're just here to avoid an import cycle.
Indeed, it was to fix circular imports 😉. Do you think we should move all the stuff in measures_grouper.py to api.py?
yeah. might be simpler to keep it in api.py for now. haven't really thought about structure, but it is odd that we jumped into this 'grouper' to use common get-measures functionality.
ok, I will move the grouper content to api.py
                self.granularity.astype(float))


class Grouper(object):
it's not entirely clear this needs to be a class tbh. it seems like get_grouped_measures is the entry point and it's pretty linear, but you've split it into logical methods and are using the instance variables as a lookup rather than passing them?
We created a class just to try to organize things while coding this one. Is there any problem? We can change it if necessary.
        if details:
            self.references = aggregated_measures['references']
        for measure in measures:
            self.add_measure(list(measure), date_map, start, stop)
hmm... the performance of this is a bit concerning. it seems to be going through each timeseries point by point and trying to slot it into a group (in python)
Yes, in this step we are recreating all measure entries in the response and overriding the resource metadata with the metadata the resource had when the measure was collected. In some cases, we have multiple resource metadata entries in the same time window (in our case, 1 hour), which means the resource metadata was changed multiple times in an hour.
When that happens, we split the measure N times, where N is the number of times the resource metadata changed in the time window; we calculate how much of the window the resource spent with each metadata value and assign the same ratio to the new measures. For example, suppose an aggregation entry covering all data collected from 18:00 to 19:00 has value 100, and the resource metadata changed at 18:06, 18:12, and 18:24. That entry will be split into 4 new ones:
- 18:00:00, value 10, older metadata; (18:00 - 18:06)
- 18:00:00, value 10, new metadata; (18:06 - 18:12)
- 18:00:00, value 20, new1 metadata; (18:12 - 18:24)
- 18:00:00, value 60, new2 metadata; (18:24 - 19:00)
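The proportional split described above can be sketched in a few lines. This is a simplified illustration of the idea, not the PR's actual grouper code:

```python
from datetime import datetime, timedelta


def split_measure(window_start, window_end, value, revision_times):
    """Split one aggregated value across metadata revisions,
    proportionally to how long each revision was active in the window."""
    boundaries = [window_start] + sorted(revision_times) + [window_end]
    total = (window_end - window_start).total_seconds()
    parts = []
    for start, end in zip(boundaries, boundaries[1:]):
        # Each segment gets the fraction of the value matching its duration.
        parts.append(value * (end - start).total_seconds() / total)
    return parts


# The 18:00-19:00 example from the comment above:
w0 = datetime(2023, 1, 1, 18, 0)
w1 = datetime(2023, 1, 1, 19, 0)
changes = [w0 + timedelta(minutes=m) for m in (6, 12, 24)]
print(split_measure(w0, w1, 100, changes))  # [10.0, 10.0, 20.0, 60.0]
```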
We did some internal tests (and it has already been running in our prod for over 3 years), and the slow part of this new flow is the resource_history list query in the MySQL DB, which is executed just once (the query is slow because we have lots of entries there). This was discovered with the flip-flop issue caused by Nova (https://bugs.launchpad.net/nova/+bug/1853048).
The real use case that led us to this extension was the following.
We have lots of Virtual Machines (VMs) in our environment, and we monitor their uptime every 10 minutes; after that, we aggregate it to hours (6 entries in Gnocchi become one in our rating service). These measures are also enriched with VM metadata, like the flavor (number of CPUs, RAM, storage, and so on); all of this data is stored in Gnocchi.
Our rating service calculates and stores the cost of used resources, not the used resources themselves; so, in case we need to change a resource price, we need to clear all the already-calculated costs in the rating service and fetch all the aggregated data from Gnocchi again to update the price.
The problem is that our clients can change their VMs' flavors any time they want; for example, some clients change their VMs' flavors many times in a month. And this is the problem: the current Gnocchi version always shows us the last resource metadata set, not the resource metadata captured when the measurement was collected.
So when we update the flavor prices and recalculate the costs in our rating service, we will charge the clients for their VMs' last flavor. This means that if a client changed a VM's flavor to a bigger one during the last day of a month in which we need to update flavor prices, the client will be charged as if the VM always had that last flavor, whose cost is higher (or even lower), instead of receiving an accurate value.
got it. (wasn't aware nova kept the same id when the flavour changes.) i do understand the problem you're trying to solve; i guess my question is what the correct behaviour is for attributing ownership when dividing up a measure.
for instance, say we're capturing cpu.time or storage.bytes, which is a gauge metric that (for the most part) only grows. if we split it with the current logic, it might end up being:
- 00:00:00, value 0, meta1
- 10:00:00, value 10, meta1
- 11:00:00, value 1.1, meta1 (11:00 - 11:06)
- 11:00:00, value 8.8, meta2 (11:06 - 11:54)
- 11:00:00, value 1.1, meta3 (11:54 - 12:00)
which would be an odd value considering it was meta1 for 11 hrs and meta2 for 48 min, but the values are near identical. maybe i'm over-complicating this.
Got your point, and yes, you are right: this kind of behavior does not make much sense when using gauge metrics, since they are cumulative.
You should not configure metadata-sensitive resources using gauge metrics when you have floating metadata (each possible metadata value should be a resource itself for a gauge to make sense), because any time the metadata changes you would have to reset the measure values to zero (since it is effectively a new series) and pause collecting for the old metadata until the resource toggles its metadata back to the old value.
For these cases, if you want to use gauge metrics, you should create one metric for every possible metadata value on the resource and measure them separately in your data polling service.
So if this is the case, you can continue using the normal aggregation API without sending the "use_history" parameter in the request, and the default behavior will be maintained.
Force-pushed from 120f065 to aa0a512
if i'm honest, i'm not entirely sure this is the ideal implementation or behaviour, but i'm also not as involved anymore and can see it's a valid use case, so i'm not against it. if others are ok with it, i would suggest docs and some api tests.
Force-pushed from aa0a512 to bbada94
Force-pushed from bbada94 to ceef388
Allow users to specify whether or not to group their measurements taking into account the resource's metadata history.
Force-pushed from ceef388 to 455c1b9
I added some gabbi tests and REST API docs regarding the changes.
This patch will enable the feature that is being introduced via PR gnocchixyz/gnocchi#1059.
* Update cloudkitty from branch 'master' to b96b9afc68af583c5dbbd935bb96f3e41806bfcf - Merge "Use correct metadata for metrics gathered from gnocchi"

Use correct metadata for metrics gathered from gnocchi

Gnocchi fixed its `aggregates` API with PR gnocchixyz/gnocchi#1059. Before that patch, the `aggregates` API would only return the latest metadata for the resource of the metric being handled. Therefore, for CloudKitty processing and reprocessing, we would always risk using the wrong attribute version to rate the computing resources. With this patch, we propose to always use the correct metadata for the processing and reprocessing of CloudKitty. This means we always use the metadata for the timestamp that we are collecting at Gnocchi. The patch was released under version 4.5.0 of Gnocchi.

Change-Id: I31bc2cdf620fb5c0f561dc9de8c10d7882895cce
* Update cloudkitty from branch 'master' to 6075cdce95bc32043e9cf4a4fbfa4b063e60f048 - Merge "Patch for `use_all_resource_revisions` option"

Patch for `use_all_resource_revisions` option

The option 'use_all_resource_revisions' is useful when using Gnocchi with the patch introduced in [1]. That patch can cause queries to return more than one entry per granularity (timespan), according to the revisions a resource has. This can be problematic when using the 'mutate' option of CloudKitty. Therefore, this option allows operators to discard all datapoints returned from Gnocchi except the last one in the granularity queried by CloudKitty. The default behavior is maintained, which means CloudKitty always uses all the data points returned. However, when the 'mutate' option is not used, we need to sum all the quantities and use this value with the latest version of the attributes received; otherwise, we will miss the complete accounting for the time frame where the revision happened.

[1] gnocchixyz/gnocchi#1059

Change-Id: I45bdaa3783ff483d49ecca70571caf529f3ccbc3
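The summing behaviour described in that commit message can be sketched as follows. Field names here are hypothetical, not CloudKitty's actual dataframe handling:

```python
def merge_revisions(datapoints):
    """Sum quantities across revisions sharing a timestamp, keeping the
    attributes of the last revision seen (as the commit message describes)."""
    merged = {}
    for point in datapoints:
        key = point["timestamp"]
        if key in merged:
            merged[key]["qty"] += point["qty"]
            # Latest revision wins for the attributes.
            merged[key]["attributes"] = point["attributes"]
        else:
            merged[key] = dict(point)
    return list(merged.values())


points = [
    {"timestamp": "14:00", "qty": 75.0, "attributes": {"name": "test"}},
    {"timestamp": "14:00", "qty": 25.0, "attributes": {"name": "test1"}},
]
print(merge_revisions(points))
# [{'timestamp': '14:00', 'qty': 100.0, 'attributes': {'name': 'test1'}}]
```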
Problem description
When we aggregate measures using Gnocchi, we can group the measures by resource metadata. However, Gnocchi does not group the measures taking the resource's metadata history into account; so if a resource's metadata changed recently, it will be grouped using only the last metadata value.
Imagine the following situation:
I have a metric with metadata called "name". I take a resource that is monitored by this metric and change the resource's name from "test" to "test1" at 14:45:00. Then I call Gnocchi's aggregation API with an interval from 11:00 to 17:00, grouping by "name", and it returns this:
Here we have a problem: to the system consuming the Gnocchi API, it looks like the resource had the same attributes at all of the timestamps, which is not true. This can have a negative impact on billing systems that rely on Gnocchi as their time series database (e.g. CloudKitty).
Proposal
I propose to allow users to specify whether they want to group their measurements taking into account the resource's metadata history. If they do, they can just add a "use_history=true" parameter to their API request and Gnocchi will do it for them. If users do not specify "use_history" in the API call, the following property will be used instead.
If neither the property nor the query parameter is defined, "use_history" defaults to False (for backward compatibility). The new response expected with this proposal, using the same situation listed in the Problem description, will be (using group_measures_using_history=true):
As you can see, there are 2 entries at 14:00. This happens because the resource's name was changed at 14:45, which is equivalent to 75% of the granularity window (1 hour); so the measure named "test" takes 75% of the total measurement of the granularity window, and the measure named "test1" takes the remaining 25%.
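The 75%/25% split in the example can be checked with a few lines (the dates below are illustrative; only the times come from the example):

```python
from datetime import datetime

# Name changed from "test" to "test1" at 14:45 inside the
# 14:00-15:00 granularity window.
window_start = datetime(2023, 1, 1, 14, 0)
window_end = datetime(2023, 1, 1, 15, 0)
change = datetime(2023, 1, 1, 14, 45)

total = (window_end - window_start).total_seconds()
share_test = (change - window_start).total_seconds() / total    # "test"
share_test1 = (window_end - change).total_seconds() / total     # "test1"
print(share_test, share_test1)  # 0.75 0.25
```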