Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

redis.exception.WatchError is not really an error #2639

Closed
chrisguidry opened this issue Jun 26, 2024 · 8 comments · Fixed by #2721
Closed

redis.exception.WatchError is not really an error #2639

chrisguidry opened this issue Jun 26, 2024 · 8 comments · Fixed by #2721
Labels

Comments

@chrisguidry
Copy link
Contributor

What problem do you want to solve?

When upgrading opentelemetry-instrumentation-redis from 0.39b0 to 0.46b0, we picked up the (very nice) fixes to instrumentation of the async redis client. However, all of our arq (https://github.com/samuelcolvin/arq/) workers started marking tons of traces as errors because they got a WatchError exception.

The redis WatchError is not really an error, it's a flow-control and concurrency-management mechanism (kind of like a LockError, etc). The fact that we're marking all of those spans as errored introduces a lot of noise to something that's a totally reasonable pattern and use of redis.

Here's an example of how it's used for concurrency control: https://github.com/samuelcolvin/arq/blob/main/arq/connections.py#L178C20-L180

Describe the solution you'd like

I'd recommend that we adjust the instrumentation to not mark spans as failed when they encounter WatchError (or to at least make this behavior configurable). It does make sense to still mark them as events on the current span, though.

Describe alternatives you've considered

No response

Additional Context

No response

Would you like to implement a fix?

None

@xrmx xrmx added the good first issue Good for newcomers label Jun 27, 2024
@xrmx
Copy link
Contributor

xrmx commented Jun 27, 2024

Yeah, makes sense.

@zhihali
Copy link
Contributor

zhihali commented Jun 28, 2024

May I take this one?

@xrmx
Copy link
Contributor

xrmx commented Jun 29, 2024

@Charlie-lizhihan sure

@zhihali
Copy link
Contributor

zhihali commented Jul 4, 2024

Hi, @xrmx @chrisguidry
When an WatchError occur, the current instrumentation will set span as

{
    "name": "SET",
    "context": {
        "trace_id": "0x878718c68c75dbc45f875358f21f1820",
        "span_id": "0x4302e43a40344ddd",
        "trace_state": "[]"
    },
    "kind": "SpanKind.CLIENT",
    "parent_id": null,
    "start_time": "2024-07-04T13:37:05.133151Z",
    "end_time": "2024-07-04T13:37:05.136076Z",
    "status": {
        "status_code": "ERROR",
        "description": "WatchError: Watched variable changed."
    },
    "attributes": {
        "db.statement": "SET ? ?",
        "db.system": "redis",
        "db.redis.database_index": 0,
        "net.peer.name": "localhost",
        "net.peer.port": 6379,
        "net.transport": "ip_tcp",
        "db.redis.pipeline_length": 1
    },
    "events": [
        {
            "name": "exception",
            "timestamp": "2024-07-04T13:37:05.136058Z",
            "attributes": {
                "exception.type": "redis.exceptions.WatchError",
                "exception.message": "Watched variable changed.",
                "exception.stacktrace": "Traceback (most recent call last):\n  File.....",
                "exception.escaped": "False"
            }
        }
    ],
    "links": [],
    "resource": {
        "attributes": {
            "telemetry.sdk.language": "python",
            "telemetry.sdk.name": "opentelemetry",
            "telemetry.sdk.version": "1.25.0",
            "service.name": "unknown_service"
        },
        "schema_url": ""
    }
}

My idea is to change the
Status Code: The status_code should be "UNSET" instead of "ERROR"

The updated span will looks like

{
    "name": "SET",
    "context": {
        "trace_id": "0xa3dfeee4a56d35db967d5009c1cce330",
        "span_id": "0x0843393d6d86562d",
        "trace_state": "[]"
    },
    "kind": "SpanKind.CLIENT",
    "parent_id": null,
    "start_time": "2024-07-04T10:19:59.647866Z",
    "end_time": "2024-07-04T10:19:59.651123Z",
    "status": {
        "status_code": "UNSET"
    },
    "attributes": {
        "db.statement": "SET ? ?",
        "db.system": "redis",
        "db.redis.database_index": 0,
        "net.peer.name": "localhost",
        "net.peer.port": 6379,
        "net.transport": "ip_tcp",
        "db.redis.pipeline_length": 1
    },
    "events": [
        {
            "name": "exception",
            "timestamp": "2024-07-04T10:19:59.651113Z",
            "attributes": {
                "exception.type": "redis.exceptions.WatchError",
                "exception.message": "Watched variable changed.",
                "exception.stacktrace": "Traceback (most recent call last):\n  File......",
                "exception.escaped": "False"
            }
        }
    ],
    "links": [],
    "resource": {
        "attributes": {
            "telemetry.sdk.language": "python",
            "telemetry.sdk.name": "opentelemetry",
            "telemetry.sdk.version": "1.25.0",
            "service.name": "unknown_service"
        },
        "schema_url": ""
    }
}

Does this sounds good?

@chrisguidry
Copy link
Contributor Author

That sounds perfect to me! Thank you for taking this one on

@emdneto
Copy link
Member

emdneto commented Jul 15, 2024

@chrisguidry, would you care to give it a try with the fix from #2668?

@chrisguidry
Copy link
Contributor Author

Will do, likely a little later this week

@chrisguidry
Copy link
Contributor Author

Hi @emdneto I tested and left a note on the PR, I believe we're swallowing the error, but we should really be re-raising it

chrisguidry added a commit to chrisguidry/opentelemetry-python-contrib that referenced this issue Jul 17, 2024
In open-telemetry#2668, we started to avoid having the `redis.WatchError` mark the span as
having failed, but we were inadvertently suppressing that exception from the
calling code.  `redis.WatchError` is used for flow-of-control concurrency in
many applications, and applications depend on catching the error to handle
concurrent changes to keys during redis pipelines.

This re-raises the `WatchError` to keep the instrumentation transparent to the
application.

Fixes open-telemetry#2639
chrisguidry added a commit to chrisguidry/opentelemetry-python-contrib that referenced this issue Jul 18, 2024
In open-telemetry#2668, we started to avoid having the `redis.WatchError` mark the span as
having failed, but we were inadvertently suppressing that exception from the
calling code.  `redis.WatchError` is used for flow-of-control concurrency in
many applications, and applications depend on catching the error to handle
concurrent changes to keys during redis pipelines.

This re-raises the `WatchError` to keep the instrumentation transparent to the
application.

Fixes open-telemetry#2639
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants