Catch Redis server errors #1119
Conversation
There seems to be a bad interaction with Redis over HAProxy: Redis connections are reported as closed, but the condition is only temporary.
Branch updated from a9200ba to 3ddee4f.
Thank you for the review. The follow-up patch should address the review comments.
Sorry, it's been a long time since I did Gnocchi stuff, so my brain was a bit slow on the first review.
gnocchi/cli/metricd.py
Outdated
@@ -85,6 +86,8 @@ def run(self):
        with utils.StopWatch() as timer:
            try:
                self._run_job()
            except ConnectionError:
OK, I see another problem here: this violates separation of concerns.
The metricd tool does not need to know about Redis; any other driver can be used. That means the exception must be handled in the Redis driver itself, not here.
The underlying incoming/redis driver does not handle any exceptions, but it probably should. There was a change in redis-py around 3.0 that suggested catching and handling exceptions in the application rather than in redis-py itself. I can't find the issue/PR in redis-py at the moment.
Where would you want the exception handled, then?
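For context, a minimal sketch of what application-side handling looks like (not code from this PR; it assumes a Redis server at localhost:6379): redis-py raises redis.exceptions.ConnectionError (also re-exported as redis.ConnectionError) when the server or a proxy closes the socket, and since redis-py 3.x the application is expected to decide how to recover.

```python
import redis
from redis.exceptions import ConnectionError

client = redis.Redis(host="localhost", port=6379)

try:
    client.ping()
except ConnectionError:
    # The server (or a proxy such as HAProxy) closed the connection;
    # the caller decides whether to retry, reconnect, or give up.
    print("Redis connection lost")
```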
In files such as https://github.com/gnocchixyz/gnocchi/blob/master/gnocchi/incoming/redis.py and their methods.
I'm not sure which parts of Redis raise ConnectionError; if it's all of them, you might need to wrap the operations with a tenacity decorator to retry a few times and bail out if they keep failing.
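As a rough illustration of that suggestion (hypothetical class and method names, not the actual Gnocchi driver), a tenacity decorator on a driver method could retry transient ConnectionErrors with exponential backoff and give up after a few attempts:

```python
import tenacity
from redis.exceptions import ConnectionError


class RedisIncoming(object):
    """Hypothetical stand-in for gnocchi/incoming/redis.py."""

    def __init__(self, client):
        self._client = client

    @tenacity.retry(
        wait=tenacity.wait_exponential(multiplier=0.5, max=10),
        stop=tenacity.stop_after_attempt(5),   # bail out if it keeps failing
        retry=tenacity.retry_if_exception_type(ConnectionError),
        reraise=True)                          # surface the original error at the end
    def list_metric_keys(self, match):
        # Any ConnectionError raised here makes tenacity re-run the whole call.
        return list(self._client.scan_iter(match=match, count=1000))
```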
gnocchi/cli/metricd.py
Outdated
@@ -192,6 +195,8 @@ def _fill_sacks_to_process(self):
                self.wakeup()
        except exceptions.NotImplementedError:
            LOG.info("Incoming driver does not support notification")
        except ConnectionError:
same as above
gnocchi/cli/metricd.py
Outdated
@@ -219,6 +224,7 @@ def _get_sacks_to_process(self):
        finally:
            return self._tasks or self.fallback_tasks

    @utils.retry_on_exception.wraps
That's unrelated to Redis; it should probably not be there, since it'll retry on ANY exception.
setup.cfg
Outdated
@@ -34,6 +34,7 @@ install_requires =
    futures; python_version < '3'
    jsonpatch
    cotyledon>=1.5.0
    redis >= 3.2.0 # MIT
Redis is not mandatory to run Gnocchi; do not make that change.
I agree. Moving the change to the incoming redis driver should not make it a requirement here. I've moved it back to the redis section.
as suggested in the review. Closes gnocchixyz#1120
in tests. s3 was spuriously failing; we turned off testing for Ceph some time ago. Time to turn it on again.
.travis.yml
Outdated
@@ -18,10 +18,10 @@ env:

    - TARGET: py36-mysql-file
    - TARGET: py36-mysql-swift
    - TARGET: py36-mysql-s3
    - TARGET: py36-mysql-ceph
This looks unrelated to this PR :(
Yes, indeed it is.
Ideally, I would like to have all backends enabled. There seems to be a timing(?) issue with CI; lately, the s3 tests were failing more than 50% of the time.
gnocchi/incoming/redis.py
Outdated
    @tenacity.retry(
        wait=utils.wait_exponential,
        # Never retry except when explicitly asked by raising TryAgain
        retry=tenacity.retry_never)
I'd go with:
    @tenacity.retry(
        wait=utils.wait_exponential,
        retry=tenacity.retry_if_exception_type(ConnectionError))
That should be simpler.
You probably want to stop at some point too, though; I'd add a condition to stop retrying after a few tries.
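For instance, a minimal sketch of the decorator with a stop condition bolted on (not the final PR code; tenacity.wait_exponential stands in for Gnocchi's utils.wait_exponential helper, and the variable name is assumed):

```python
import tenacity
from redis.exceptions import ConnectionError

# Reusable decorator: retry only on Redis disconnects, with exponential
# backoff, and give up after a bounded number of attempts.
retry_on_redis_disconnect = tenacity.retry(
    wait=tenacity.wait_exponential(multiplier=0.5, max=10),
    stop=tenacity.stop_after_attempt(5),
    retry=tenacity.retry_if_exception_type(ConnectionError))
```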
That looks better. Thank you for the suggestion.
I've never seen this repeat more than twice so far.
Branch updated from c1c9276 to df18070.
You're still missing a stop condition, right?
gnocchi/incoming/redis.py
Outdated
    try:
        for key in self._client.scan_iter(match=match, count=1000):
            metrics += 1
            pipe.llen(key)
            if details:
                m_list.append(key.split(redis.SEP)[1].decode("utf8"))
            # group 100 commands/call
            if metrics % 100 == 0:
                results = pipe.execute()
                update_report(results, m_list)
                m_list = []
                pipe = self._client.pipeline()
            else:
                results = pipe.execute()
                update_report(results, m_list)
                m_list = []
                pipe = self._client.pipeline()
        else:
            results = pipe.execute()
            update_report(results, m_list)
    except ConnectionError:
        LOG.debug("Redis Server closed connection. Retrying.")
you don't need this anymore (the try/except)
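To illustrate why (a hypothetical standalone function, not the PR code): tenacity only retries when the exception escapes the decorated function, so an inner try/except that logs and swallows the ConnectionError would prevent the retry from ever firing.

```python
import tenacity
from redis.exceptions import ConnectionError

@tenacity.retry(
    wait=tenacity.wait_exponential(multiplier=0.5, max=10),
    stop=tenacity.stop_after_attempt(5),
    retry=tenacity.retry_if_exception_type(ConnectionError))
def count_pending_measures(client, match):
    # No try/except here on purpose: a ConnectionError must propagate out
    # of the function body for tenacity to catch it and re-run the call.
    pipe = client.pipeline()
    for key in client.scan_iter(match=match, count=1000):
        pipe.llen(key)
    return sum(pipe.execute())
```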
gnocchi/incoming/redis.py
Outdated
    try:
        for message in p.listen():
            if message['type'] == 'pmessage' and message['pattern'] == pattern:
                # FIXME(jd) This is awful, we need a better way to extract this
                # Format is defined by _get_sack_name: incoming128-17
                yield self._make_sack(int(message['channel'].split(b"-")[-1]))
    except ConnectionError:
        LOG.debug("Redis Server closed connection. Retrying.")
ditto
gnocchi/incoming/redis.py
Outdated
@@ -176,6 +183,10 @@ def process_measures_for_sack(self, sack):
            pipe.ltrim(key, item_len + 1, -1)
            pipe.execute()

    @tenacity.retry(
        wait=utils.wait_exponential,
        # Never retry except when explicitly asked by raising TryAgain
This comment is now wrong; it no longer matches the retry condition.
@mrunge LGTM
Thank you for your reviews! Yes, I've tested this a couple of times over each iteration; that's why I had debug logs etc. and wasn't that concerned about the stop condition. This latest proposal does not fix the issue; the try/except is apparently required.
That does not make sense. What's the traceback?
Sorry for not getting back earlier. While I got stack traces on Friday, the deployment has now been running flawlessly for over an hour with no stack traces. Before, these kinds of errors happened every 2 to 5 minutes. We should consider this PR to fix the issue.
Just after my last comment, I saw this trace in the logs:
It turned out I had a try/except left over in my test deployment where I also tried this out. It has been running for quite a while now in the form proposed here. Sorry for causing confusion, and thank you for your effort. With this PR merged, I consider the issue fixed.
Your traceback makes little sense, as the
I had wrapped gnocchi/incoming/redis.py, lines 186 to 190 (at eb87188).
My understanding is that my additional changes/leftovers were causing the exception reported in #1119 (comment).
Oh OK, I got confused; I thought this PR was doing that. So it should be fine!
Thank you @jd