Exception during serialization may lead to receiving a response for a key other than was requested #2528

Closed
serejja opened this issue Oct 18, 2023 · 4 comments
Labels: type: documentation (A documentation update)

serejja commented Oct 18, 2023

Bug Report

Current Behavior

Hello maintainers. We've been trying to track down one bug for a long time, and I can now reproduce it reliably.
When using an async client (this happens in both Redis Cluster and standalone modes), it is possible to start receiving responses for keys other than the ones that were requested. This happens when encodeValue throws an exception in a concurrent environment.

The sequence is as follows:

  1. The application writes items to the cache. At some point an exception is thrown during serialization, which leads to the stack trace below.
  2. After this happens, concurrent reads get mixed up: you start getting responses for keys other than the ones requested. For example, thread 1 requests cache key X and thread 2 requests cache key Y, but thread 1 receives the response for cache key Y instead of X.
  3. This does not happen when reads are sequential.
Stack trace
io.netty.handler.codec.EncoderException: Cannot encode command. Please close the connection as the connection state may be out of sync.
	at io.lettuce.core.protocol.CommandEncoder.encode(CommandEncoder.java:96)
	at io.lettuce.core.protocol.CommandEncoder.encode(CommandEncoder.java:78)
	at io.netty.handler.codec.MessageToByteEncoder.write(MessageToByteEncoder.java:107)
	at io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:881)
	at io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:863)
	at io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:968)
	at io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:856)
	at io.lettuce.core.protocol.CommandHandler.writeSingleCommand(CommandHandler.java:421)
	at io.lettuce.core.protocol.CommandHandler.write(CommandHandler.java:387)
	at io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:879)
	at io.netty.channel.AbstractChannelHandlerContext.invokeWriteAndFlush(AbstractChannelHandlerContext.java:940)
	at io.netty.channel.AbstractChannelHandlerContext$WriteTask.run(AbstractChannelHandlerContext.java:1247)
	at io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:174)
	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:167)
	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:569)
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: java.lang.NullPointerException: boom!
	at reproducer.BrokenCodec$.encodeValue(BrokenCodec.scala:19)
	at reproducer.BrokenCodec$.encodeValue(BrokenCodec.scala:8)
	at io.lettuce.core.protocol.CommandArgs$ValueArgument.encode(CommandArgs.java:742)
	at io.lettuce.core.protocol.CommandArgs.encode(CommandArgs.java:362)
	at io.lettuce.core.protocol.Command.encode(Command.java:126)
	at io.lettuce.core.protocol.AsyncCommand.encode(AsyncCommand.java:185)
	at io.lettuce.core.protocol.CommandEncoder.encode(CommandEncoder.java:93)
	... 19 more

Input Code

I have a reproducer project (in Scala) here: https://github.com/serejja/lettuce-reproducer. You will need sbt (https://www.scala-sbt.org/) to run it via sbt run.
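
For readers who don't want to set up sbt, the essence of the reproducer can be sketched in plain Java. This is a hypothetical illustration, analogous to the Scala BrokenCodec in the stack trace above rather than the actual reproducer code: any codec whose encodeValue can throw will surface the exception on the Netty event loop as the EncoderException shown above.

import java.nio.ByteBuffer;

import io.lettuce.core.codec.RedisCodec;
import io.lettuce.core.codec.StringCodec;

// Hypothetical sketch: delegates to StringCodec but throws from encodeValue
// for certain values, mimicking a serialization failure ("boom!").
public class BrokenCodec implements RedisCodec<String, String> {

    private final StringCodec delegate = StringCodec.UTF8;

    @Override
    public String decodeKey(ByteBuffer bytes) {
        return delegate.decodeKey(bytes);
    }

    @Override
    public String decodeValue(ByteBuffer bytes) {
        return delegate.decodeValue(bytes);
    }

    @Override
    public ByteBuffer encodeKey(String key) {
        return delegate.encodeKey(key);
    }

    @Override
    public ByteBuffer encodeValue(String value) {
        if (value.startsWith("poison")) {
            // The exception is thrown on the Netty event loop during encoding,
            // which is what desynchronizes the protocol queue.
            throw new NullPointerException("boom!");
        }
        return delegate.encodeValue(value);
    }
}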

Expected behavior/code

Even though the exception says the connection may be out of sync, my understanding is that you should never receive the response for another key. Another question is why Lettuce does not close the connection on its own rather than just logging this message.

Environment

  • Lettuce version(s): 6.2.6.RELEASE
  • Redis version: redis:6-alpine Docker image; also AWS ElastiCache Redis 7.0.7

mp911de (Collaborator) commented Oct 26, 2023

This is correct and a valid observation. We enqueue a command to be sent, then the encoder (within the channel pipeline) encodes it for the Redis server. Any failure in the codec leads to exceptional completion of the command with the message:

Cannot encode command. Please close the connection as the connection state may be out of sync.

Encoding happens very late in the process, and without a rewrite of the entire processing mechanism we have no chance to remove the command from the protocol queue.

Custom codecs can be very handy; however, they always run on the event loop thread, and once an exception is thrown there, the protocol synchronization is broken.

serejja (Author) commented Oct 30, 2023

Thanks for your reply @mp911de!

Right now I'm trying to understand what the proper and elegant way to handle this would be. Your message sounds like this is not an easy thing to fix, if it is possible at all with the current design.

To make sure we don't encounter this behavior anymore, we wrapped our codec in an "exception-free" version of it. We catch every exception during serialization and fall back to storing an "empty" value in the cache. When reading the data back, we check whether the value we received is that "empty" value and, if so, treat it as no value at all instead of trying to deserialize it. While this works, we still see it as a suboptimal solution because we now have to remember to use the "exception-free" codec, and we potentially have to store some useless data in our Redis.
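
For illustration, a minimal Java sketch of such an "exception-free" wrapper could look like the following (the class name and the empty-buffer sentinel are hypothetical, not the wording of the thread):

import java.nio.ByteBuffer;

import io.lettuce.core.codec.RedisCodec;

// Hypothetical delegating codec: serialization failures are caught on write and
// replaced with an empty sentinel, so no exception ever reaches the event loop.
// Reads translate the sentinel back into "no value" (null).
public class ExceptionFreeCodec<K, V> implements RedisCodec<K, V> {

    private static final byte[] EMPTY_SENTINEL = new byte[0];

    private final RedisCodec<K, V> delegate;

    public ExceptionFreeCodec(RedisCodec<K, V> delegate) {
        this.delegate = delegate;
    }

    @Override
    public K decodeKey(ByteBuffer bytes) {
        return delegate.decodeKey(bytes);
    }

    @Override
    public V decodeValue(ByteBuffer bytes) {
        if (bytes.remaining() == 0) {
            return null; // sentinel value: treat as a cache miss
        }
        return delegate.decodeValue(bytes);
    }

    @Override
    public ByteBuffer encodeKey(K key) {
        return delegate.encodeKey(key);
    }

    @Override
    public ByteBuffer encodeValue(V value) {
        try {
            return delegate.encodeValue(value);
        } catch (RuntimeException e) {
            // Fall back to the sentinel so the command still encodes cleanly.
            return ByteBuffer.wrap(EMPTY_SENTINEL);
        }
    }
}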

Do you have any idea what other workaround could be implemented to make sure we don't encounter these discrepancies? I'm not sure whether we can "close" the cache client when an exception is thrown, but even if we can, we handle several thousand requests per second, and even a brief window in which responses could be mismatched is sensitive for us.

In general, any insight into your plans would be very helpful for us to understand whether there is a chance to expect a fix for this in the foreseeable future.

Thanks!

mp911de (Collaborator) commented Nov 2, 2023

whether there is a chance to expect a fix for this in the foreseeable future.

The system isn't broken so there's nothing to fix.

The core idea here is that a codec should always be able to encode its value unless the system clearly runs out of memory. We're talking about encoding at the level of String to byte[] or byte[] to Base64, although the Base64 variant could fail on decoding if the underlying data isn't encoded properly.

When codecs operate at the level of object graphs, translating these into JSON or using JDK serialization, a whole new category of errors becomes possible. These cannot be handled by the Lettuce driver.
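
One way to keep the codec at that trivially safe level (an illustrative sketch, not a recommendation stated in this thread) is to perform the risky object-graph serialization on the calling thread and hand Lettuce only raw bytes via the built-in ByteArrayCodec, so a serialization failure is thrown before the command is ever enqueued. The sketch assumes a Redis instance at localhost:6379; the toBytes helper is hypothetical.

import java.nio.charset.StandardCharsets;

import io.lettuce.core.RedisClient;
import io.lettuce.core.api.StatefulRedisConnection;
import io.lettuce.core.codec.ByteArrayCodec;

// Illustrative only: serialize on the caller thread, pass raw bytes to Lettuce.
// A failure in toBytes() happens here, before the command reaches the event loop,
// so the protocol stream cannot be corrupted by the codec.
public class PreSerializedWrites {

    public static void main(String[] args) {
        RedisClient client = RedisClient.create("redis://localhost:6379");
        StatefulRedisConnection<byte[], byte[]> connection = client.connect(ByteArrayCodec.INSTANCE);

        byte[] key = "greeting".getBytes(StandardCharsets.UTF_8);
        byte[] value = toBytes("hello"); // may throw, but only on this thread

        connection.async().set(key, value).toCompletableFuture().join();

        connection.close();
        client.shutdown();
    }

    // Hypothetical application-level serializer; substitute JSON, JDK serialization, etc.
    static byte[] toBytes(Object o) {
        return o.toString().getBytes(StandardCharsets.UTF_8);
    }
}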

It makes sense to include this information in the docs (wiki).

From my perspective, this issue is resolved as it isn't actionable anymore.

mp911de added the type: documentation label and removed the status: waiting-for-triage label Nov 2, 2023
mp911de (Collaborator) commented Nov 2, 2023

Docs are updated now, see https://github.com/lettuce-io/lettuce-core/wiki/Codecs#exception-handling-during-encoding
