Exception during serialization may lead to receiving a response for a key other than was requested #2528

Closed
serejja opened this issue Oct 18, 2023 · 4 comments
Labels: type: documentation (A documentation update)

serejja commented Oct 18, 2023

Bug Report

Current Behavior

Hello maintainers. We've been trying to track down one bug for a long time, and I can now reproduce it reliably.
When using an async client (this happens in both Redis Cluster and standalone modes), it is possible to start receiving responses for keys other than the ones that were requested. This happens when encodeValue throws an exception in a concurrent environment.

The sequence is as follows:

  1. The application writes items to the cache. At some point an exception is thrown during serialization, which leads to the stack trace below.
  2. After this happens, concurrent reads get mixed up: you start getting responses for keys other than the ones requested. For example, thread 1 requests cache key X and thread 2 requests cache key Y, but thread 1 receives the response for cache key Y instead of X.
  3. This does not happen when reads are sequential.
Stack trace
io.netty.handler.codec.EncoderException: Cannot encode command. Please close the connection as the connection state may be out of sync.
	at io.lettuce.core.protocol.CommandEncoder.encode(CommandEncoder.java:96)
	at io.lettuce.core.protocol.CommandEncoder.encode(CommandEncoder.java:78)
	at io.netty.handler.codec.MessageToByteEncoder.write(MessageToByteEncoder.java:107)
	at io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:881)
	at io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:863)
	at io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:968)
	at io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:856)
	at io.lettuce.core.protocol.CommandHandler.writeSingleCommand(CommandHandler.java:421)
	at io.lettuce.core.protocol.CommandHandler.write(CommandHandler.java:387)
	at io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:879)
	at io.netty.channel.AbstractChannelHandlerContext.invokeWriteAndFlush(AbstractChannelHandlerContext.java:940)
	at io.netty.channel.AbstractChannelHandlerContext$WriteTask.run(AbstractChannelHandlerContext.java:1247)
	at io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:174)
	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:167)
	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:569)
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: java.lang.NullPointerException: boom!
	at reproducer.BrokenCodec$.encodeValue(BrokenCodec.scala:19)
	at reproducer.BrokenCodec$.encodeValue(BrokenCodec.scala:8)
	at io.lettuce.core.protocol.CommandArgs$ValueArgument.encode(CommandArgs.java:742)
	at io.lettuce.core.protocol.CommandArgs.encode(CommandArgs.java:362)
	at io.lettuce.core.protocol.Command.encode(Command.java:126)
	at io.lettuce.core.protocol.AsyncCommand.encode(AsyncCommand.java:185)
	at io.lettuce.core.protocol.CommandEncoder.encode(CommandEncoder.java:93)
	... 19 more

Input Code

I have a reproducer project (in Scala) here: https://github.com/serejja/lettuce-reproducer. You will need sbt (https://www.scala-sbt.org/) to run it via sbt run.
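
For readers who don't want to set up sbt, the essence of the reproducer can be sketched in plain Java. This is a hypothetical illustration, analogous to the Scala BrokenCodec in the stack trace above rather than the actual reproducer code: any codec whose encodeValue can throw will surface the exception on the Netty event loop as the EncoderException shown above.

import java.nio.ByteBuffer;

import io.lettuce.core.codec.RedisCodec;
import io.lettuce.core.codec.StringCodec;

// Hypothetical sketch: delegates to StringCodec but throws from encodeValue
// for certain values, mimicking a serialization failure ("boom!").
public class BrokenCodec implements RedisCodec<String, String> {

    private final StringCodec delegate = StringCodec.UTF8;

    @Override
    public String decodeKey(ByteBuffer bytes) {
        return delegate.decodeKey(bytes);
    }

    @Override
    public String decodeValue(ByteBuffer bytes) {
        return delegate.decodeValue(bytes);
    }

    @Override
    public ByteBuffer encodeKey(String key) {
        return delegate.encodeKey(key);
    }

    @Override
    public ByteBuffer encodeValue(String value) {
        if (value.startsWith("poison")) {
            // The exception is thrown on the Netty event loop during encoding,
            // which is what desynchronizes the protocol queue.
            throw new NullPointerException("boom!");
        }
        return delegate.encodeValue(value);
    }
}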

Expected behavior/code

Even though the exception says the connection may be out of sync, my understanding is that you should never receive the response for another key. Another question is why Lettuce does not close the connection on its own rather than just logging this message.

Environment

  • Lettuce version(s): 6.2.6.RELEASE
  • Redis version: redis:6-alpine Docker image; also AWS ElastiCache Redis 7.0.7

mp911de (Collaborator) commented Oct 26, 2023

This is correct and a valid observation. We enqueue a command to be sent, then the encoder (within the channel pipeline) encodes it for the Redis server. Any failure in the codec leads to exceptional completion of the command with the message:

Cannot encode command. Please close the connection as the connection state may be out of sync.

Encoding happens very late in the process, and without a rewrite of the entire processing mechanism we have no chance to remove the command from the protocol queue.

Custom codecs can be very handy; however, they always run on the event loop thread, and once an exception is thrown there, the protocol synchronization is broken.

serejja (Author) commented Oct 30, 2023

Thanks for your reply @mp911de!

Right now I'm trying to understand what the proper and elegant way to handle this would be. Your message sounds like this is not an easy thing to fix, if it is possible at all with the current design.

To make sure we don't encounter this behavior anymore, we wrapped our codec in an "exception-free" version of it. We catch every exception during serialization and fall back to storing an "empty" value in the cache. When reading the data back, we check whether the value we received is that "empty" value and, if so, treat it as no value at all instead of trying to deserialize it. While this works, we still see it as a suboptimal solution because we now have to remember to use the "exception-free" codec, and we potentially have to store some useless data in our Redis.
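
For illustration, a minimal Java sketch of such an "exception-free" wrapper could look like the following (the class name and the empty-buffer sentinel are hypothetical, not the wording of the thread):

import java.nio.ByteBuffer;

import io.lettuce.core.codec.RedisCodec;

// Hypothetical delegating codec: serialization failures are caught on write and
// replaced with an empty sentinel, so no exception ever reaches the event loop.
// Reads translate the sentinel back into "no value" (null).
public class ExceptionFreeCodec<K, V> implements RedisCodec<K, V> {

    private static final byte[] EMPTY_SENTINEL = new byte[0];

    private final RedisCodec<K, V> delegate;

    public ExceptionFreeCodec(RedisCodec<K, V> delegate) {
        this.delegate = delegate;
    }

    @Override
    public K decodeKey(ByteBuffer bytes) {
        return delegate.decodeKey(bytes);
    }

    @Override
    public V decodeValue(ByteBuffer bytes) {
        if (bytes.remaining() == 0) {
            return null; // sentinel value: treat as a cache miss
        }
        return delegate.decodeValue(bytes);
    }

    @Override
    public ByteBuffer encodeKey(K key) {
        return delegate.encodeKey(key);
    }

    @Override
    public ByteBuffer encodeValue(V value) {
        try {
            return delegate.encodeValue(value);
        } catch (RuntimeException e) {
            // Fall back to the sentinel so the command still encodes cleanly.
            return ByteBuffer.wrap(EMPTY_SENTINEL);
        }
    }
}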

Do you have any idea what other workaround could be implemented to make sure we don't encounter these discrepancies? I'm not sure whether we can "close" the cache client when an exception is thrown, but even if we can, we handle several thousand requests per second, and even a brief window in which responses could be mismatched is sensitive for us.

In general, any insight into your plans would be very helpful for us to understand whether there is a chance to expect a fix for this in the foreseeable future.

Thanks!

mp911de (Collaborator) commented Nov 2, 2023

whether there is a chance to expect a fix for this in the foreseeable future.

The system isn't broken so there's nothing to fix.

The core idea here is that a codec should always be able to encode its value unless the system clearly runs out of memory. We're talking about encoding at the level of String to byte[] or byte[] to Base64, although the Base64 variant could fail on decoding if the underlying data isn't encoded properly.

When codecs operate at the level of object graphs, translating these into JSON or using JDK serialization, a whole new category of errors becomes possible. These cannot be handled by the Lettuce driver.
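
One way to keep the codec at that trivially safe level (an illustrative sketch, not a recommendation stated in this thread) is to perform the risky object-graph serialization on the calling thread and hand Lettuce only raw bytes via the built-in ByteArrayCodec, so a serialization failure is thrown before the command is ever enqueued. The sketch assumes a Redis instance at localhost:6379; the toBytes helper is hypothetical.

import java.nio.charset.StandardCharsets;

import io.lettuce.core.RedisClient;
import io.lettuce.core.api.StatefulRedisConnection;
import io.lettuce.core.codec.ByteArrayCodec;

// Illustrative only: serialize on the caller thread, pass raw bytes to Lettuce.
// A failure in toBytes() happens here, before the command reaches the event loop,
// so the protocol stream cannot be corrupted by the codec.
public class PreSerializedWrites {

    public static void main(String[] args) {
        RedisClient client = RedisClient.create("redis://localhost:6379");
        StatefulRedisConnection<byte[], byte[]> connection = client.connect(ByteArrayCodec.INSTANCE);

        byte[] key = "greeting".getBytes(StandardCharsets.UTF_8);
        byte[] value = toBytes("hello"); // may throw, but only on this thread

        connection.async().set(key, value).toCompletableFuture().join();

        connection.close();
        client.shutdown();
    }

    // Hypothetical application-level serializer; substitute JSON, JDK serialization, etc.
    static byte[] toBytes(Object o) {
        return o.toString().getBytes(StandardCharsets.UTF_8);
    }
}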

It makes sense to include this information in the docs (wiki).

From my perspective, this issue is resolved as it isn't actionable anymore.

mp911de added the type: documentation label and removed the status: waiting-for-triage label Nov 2, 2023
mp911de (Collaborator) commented Nov 2, 2023

Docs are updated now, see https://github.com/lettuce-io/lettuce-core/wiki/Codecs#exception-handling-during-encoding
