Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

java.util.concurrent.TimeoutException: Futures timed out after [10 seconds] #85

Closed
alejandrod opened this issue May 11, 2017 · 27 comments
Closed
Milestone

Comments

@alejandrod
Copy link

Hi folks,
I'm hoping to get some help. Maybe it's an issue, maybe I'm missing something.

I'm using the sync version of the API and I get the timeout errors quite often (10 seconds timeout). Not always.
At the beginning I thought it was just slow code calculating the entry on cache.getOrElse. But, then I even saw a timeout on removing a key. Why we would get a timeout here at all??

We run on AWS and use a single ElasticCache node. Before the cache was in memory and was just fine.

I know Brando is used under the hood. I'm wondering If I should change something on the akka system.

P/S: I can't provide a test for this as it's not deterministic. It's a simple use case with a small single server with a single redis node.

Thanks for your help

@KarelCemus
Copy link
Owner

Hi @alejandrod,

well, it is hard to tell, but if you say there is Brando under the hood that means you use old version of this library. Recently, I was forced to switch the engine to scredis instead due to the serious bug and no support in Brando.

Running it on AWS suggests, this might be the same case. Look at #29 #44, this might help.

Let me know the result.

Best,
Karel

@alejandrod
Copy link
Author

Thanks for the reply!

We are using 1.4.0. I mentioned Brando because I read it somewhere. So, I guess I was wrong.
Will take a look to those and see.

@KarelCemus
Copy link
Owner

Well, 1.4.0 already uses scredis instead of Brando and those issues are related to Brando, so I am worried they won't help. And without more details, I don't know how to help either.

Only one other thing occurs to me, which might be related. How big is the traffic? I believe that all requests to play-redis (async or not) are internally executed by just one thread, at least by default. There is no Akka dispatching implemented inside the library, so I don't know how many threads it uses I configured it to one to keep it simple (it was enough so far). I don't know that much about Akka to be sure how it works. However, this might result in those timeouts when you have really long execution of getOrElse methods and there are many (or more than one or so) requests at the time. The long computation runs for that certain amount of time and all other enqueued requests timeout meanwhile.

Do you think this is the case? That could explain the non-deterministic behavior.

If so, do you have how Akka works internally? Is there some need for manual dispatching to run the actors in multiple threads? Otherwise, I'll have to look at it.

@KarelCemus
Copy link
Owner

Oh, may be I am wrong about the default configuration, but still, there is no load balancer or dispatcher and I don't know what is the behavior of default akka dispatcher. I believe it runs just in one thread

@alejandrod
Copy link
Author

So, brando should not be the problem as we use scredis:

[info] +-com.github.karelcemus:play-redis_2.11:1.4.0 [S] [info] | +-com.livestream:scredis_2.11:2.0.8 [S] [info] | +-com.typesafe.akka:akka-actor_2.11:2.4.17 [S] [info] | | +-com.typesafe:config:1.2.0 (evicted by: 1.3.1) [info] | | +-com.typesafe:config:1.3.0 (evicted by: 1.3.1) [info] | | +-com.typesafe:config:1.3.1 [info] | | +-org.scala-lang.modules:scala-java8-compat_2.11:0.7.0 [S]

Will check the dispatcher thing. Even so 1 thread with 10seconds of timeout for a single server with a single node with 100 keys sounds like enough.
Will read a bit more about scredis then.

@KarelCemus
Copy link
Owner

I don't think that's it but you may try. I use this library in like 10-20 web apps on-premise and never seen anything like that. I use timeout 3 seconds but my getOrElse run fast.

I thought it was just slow code calculating the entry on cache.getOrElse

How long are running these? Is it like seconds or so? Or is it much faster, like milliseconds? This is where I would start digging. Multiple invocations of this and it timeouts right away.

The only situation when I sometimes run into timeouts is running all tests at once, which might be related to low throughput.

@alejandrod
Copy link
Author

Mainly DB request so should be ms. But, there might be a few http calls. Now, depending how this is being executed internally, one long request might cause issues on other requests.
Just guessing. So, will look into the dispatcher thing and check if there is anything on scredis that I could configure.
Will let you know. Thanks!

@KarelCemus
Copy link
Owner

KarelCemus commented May 11, 2017

Alright. If the dispatcher is the solution, it has to be part of the play-redis, not scredis. The timeout would be caused by this library. Scredis is asynchronous and there should be requests only within ms. Those long running task are executed within play-redis.

Let me know when you find how to support multiple threads, thanks!

@alejandrod
Copy link
Author

For sure. Will take a look and be back to you

@alejandrod
Copy link
Author

alejandrod commented May 26, 2017

Ok, after a bit more digging I think that the problem is that scredis reconnects every now and then.
Not sure why.

2017-05-25 01:30:37,584 - [ERROR] - from scredis.io.ListenerActor in application-scredis.io.akka.listener-dispatcher-4035
Receive timeout

2017-05-25 01:30:40,605 - [ERROR] - from scredis.io.IOActor in application-scredis.io.akka.io-dispatcher-4036
A timeout occurred while resetting the connection. 

In any case, it recovers. Now, I'd like to address a more important factor.
I think we agree that if the cache is not working, the app should continue to work.
I don't think it's the case, as soon as there is a reconnection, requests will fail with a Future timeout. Most likely because we are waiting for redis to reconnect.

Maybe you can help me with the following questions:

1 - How do I make sure that the call to the cache never affects the app. Worst case we perform the underlying request/app logic.
2 - play.cache.redis.timeout only affects the redis call and not the underlaying query/app logic. I think there should not be a timeout on the app code as the behaviour should not be affected by the cache.
Note: I set it to 10s because it was affecting some long running reports taking more than 1s.

Thanks!

@KarelCemus
Copy link
Owner

Hi,

I am sorry to hear about the issue with scredis. Honestly, I cannot imagine what causes that. However, due to several other issues, I am thinking of changing the connector again. So, in the upcoming release there won't be scredis anymore, I believe.

To the second part of your question. Yes, I agree that the cache should not affect the application flow. To resolve your first question, there is Recovery policy exactly for this reason. I check and the default value is log-and-default. So, if it still fails, it is a bug. It should log it and return some default value based on the operation. Could you confirm this issue and write exact scenario to reproduce it? I would check it.

As of your second question, you are right. I'll look at it and try to figure out some solution. However, this might be very difficult. The timeout is applied to convert async API to sync API on all futures regardless the operation. Honestly, I don't think it is possible to apply it only on the underlying calls. And I think (not sure) that Scala API requires the definition of some timeout when waiting for futures to be resolved, which is your case. The only thing I have in mind is to split the timeout into the connection timeout and the synchronization timeout.

What do you think? Would any of these help you?

@alejandrod
Copy link
Author

I can confirm it fails and it doesn't call the orElse part when asking for a cache entry.
Now, will see if I can come up with a scenario to reproduce it.

Regarding the timeout. Just to be clear, the timeout should be for the cache retrieval part I think.
So, basically if I can't get the cache entry in X time, call the app code. In most cases we do cache.getOrElse. So, always call the OrElse part (which is the app code) regardless of what happens with the cache.

Too bad that the connector has to be changed.

@KarelCemus
Copy link
Owner

I'll try to set up some scenario with timeout and we will see what will happen.

Well, one issue is here. There has to be some timeout. However, I'll browse the code and try to figure out how to introduce another internal timeout just for the cache connector.

Regarding changing the connector, I really dislike the solution, but it turned out there are significant issues with the repository maintenance, the lib wasn't released for a long time, there are unreleased features and bugfixes and it does not seem to get better. I think there is no choice, Scala 2.12 is must have.

@KarelCemus
Copy link
Owner

KarelCemus commented May 27, 2017

Alright, I reproduced your issue, I hope. Can you check it?

Note: Default timeout in tests is 3 secs.

"long running orElse" in {
  Cache.get[ String ]( s"$prefix-test-timeout-in-orelse" ) must beNone
  Cache.getOrElse[ String ]( s"$prefix-test-timeout-in-orelse" ) {
    Thread.sleep( 5000 ) // sleep 5 secs to timeout
    "some value"
  } must beEqualTo( "some value" )
}

This test used to fail, now I fixed it a bit (locally, not published) and it works although it shouldn't. I have to investigate it.


Regarding the different connection timeout and invocation timeout. As you are using SyncAPI and everything is implemented in a non-blocking manner, it has to wait for it. It uses Await from standard Scala library, which requires a timeout. So, you cannot get rid of it, but you can set it to some high value to satisfy your needs.

The internal timeout for communication with the Redis can be set directly in scredis configuration. I don't see any need to override this in play-redis.

If I am not wrong, the only thing left to deal with is the reconnection to the Redis, which I am worried I cannot help you with. It seems like an issue related to your environment, there are no other reported issues neither in this nor in the scredis repositories right now.


However, when I finished writing this answer, one thing occurred to me. The play-redis timeout is defined for the entire SyncAPI, for both the redis and the application logic parts together. When the invocation timeouts, it is real failure, I think. It lasted longer than it was expected. For communication with Redis there are another more specific timeout definitions. So right now, I am re-evaluating the statement and I don't think this is a bug, I'd consider it a feature - time-secured synchronization of the async API.

Is it clear or should I explain something better? Does it help you and resolve your issues?

@KarelCemus
Copy link
Owner

KarelCemus commented May 27, 2017

I believe I could improve the documentation clarity when speaking of the timeout.

@KarelCemus
Copy link
Owner

I have investigated the scredis code base and the only place where the receive-timeout is used is here, although the documentation says it is used only for batches. It seems this is the timeout you are looking for.

@alejandrod
Copy link
Author

alejandrod commented May 27, 2017

Ok, just to be sure we are in sync.

// This test should always pass. Even assuming redis is not working

"long running orElse" in {
  Cache.get[ String ]( s"$prefix-test-timeout-in-orelse" ) must beNone
  Cache.getOrElse[ String ]( s"$prefix-test-timeout-in-orelse" ) {
   // Regardless of how long it takes, this should not fail. It's the app code. 
   // If a timeout is needed, should be set to a long value.
    Thread.sleep( 5000 ) 
    "some value"
  } must beEqualTo( "some value" )
}

@KarelCemus
Copy link
Owner

// Regardless of how long it takes, this should not fail

This is not possible with SyncAPI, see the Scala docs I linked for Await. Or if disagree, please, propose a solution for conversion of Future[T] to T without any timeout.

@KarelCemus
Copy link
Owner

I think the solution you are looking for is following:

  1. set play.cache.redis.timeout to something like the eternity, e.g., a day
  2. set scredis.io.connect-timeout to something small, default are 2 secs.
  3. set scredis.io.receive-timeout to something really small, default are 5 secs and this is what bothers you.

Do this and you are fine I think.

@alejandrod
Copy link
Author

alejandrod commented May 27, 2017

Those are the default values. I haven't changed them. I only play with play.cache.redis.timeout.

The problem might be that the whole call is wrapped with a Future. The redis and the app logic.
Then yes, Await will fail if the redis part fails to return quick enough. They should not be part of the same future. Technically they are different things.
This is an idea:

 class MySyncCache extends CacheApi {
    val someAsyncCache = new CacheAsyncApi
    val redisCacheTimeout = 3 seconds
    
    override def getOrElse[A](key: String, expiration: Duration)(orElse: => A)
                             (implicit evidence$1: ClassManifest[A]): A = {

      try {
        Await.result(someAsyncCache.get(key), redisCacheTimeout).get
      } catch {
        case _: Exception  =>
          // Some error getting from async cache
          val v = orElse
          setCacheEntrySomehowSafeNoError(v)
          v
      }
    }
  }

@KarelCemus
Copy link
Owner

Well, you are hitting here a significant limitation - the whole library is intentionally async, including the orElse part, which is wrapped into the Future. The SyncAPI is just a thin wrapper to simplify the use. Your solution would require basically splitting the implementation into sync and async branches, which honestly, I don't see a point. However, I check the code the change should not be that major, I do it.

By the way, you could reach the same effect without changing the code, just by the proper configuration. Right now, there are 3 timeouts.

  1. connection timeout is used to establish the connection to the Redis. I think this lies out of the scope of this issue
  2. receive timeout is exactly the redisCacheTimeout from your example.
  3. play.cache.redis.timeout wraps it all including the application logic. We might argue if this is clean or not, but point is, that when you eliminate this by setting it to something high, in the end, you get exactly the behavior as is in your example because this timeout won't apply while the implementation stays simple and unified.

@KarelCemus
Copy link
Owner

Can you review PR #86?

@KarelCemus KarelCemus added this to the 1.4.1 milestone May 28, 2017
@KarelCemus
Copy link
Owner

@alejandrod, FYI, I'm implementing API for Play 2.6.x right now and found this. The Play framework implements getOrElse in their sync API the same way I did.

@alejandrod
Copy link
Author

Well, the thing is what cacheApi.getOrElseUpdate does. That could be honouring the app code if the cache service is not working or has problems

@KarelCemus
Copy link
Owner

It's not important what it does because, in the end, the whole statement is wrapped into Await.result. So regardless the cache service condition, the top statement is the 5 secs timeout, which includes the orElse clause.

@KarelCemus
Copy link
Owner

Will be released as 1.4.1 later today.

@alejandrod
Copy link
Author

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants