[lutron] LEAP Bridge goes offline and does not return #9178
I don’t see anything unusual in these logs other than that the system lost the connections to two different bridge devices at about the same time. Everything else is the expected behavior when a connection is lost. What happens when it tries to reconnect?
That's it. It never tries to reconnect. I let it sit for 45 minutes that time, and once it sat overnight. For some reason it just stays dead.
What do you have the heartbeat and reconnect parameters set to? I was just looking at the code to remember how it works. If the connection is lost, the reader thread just exits, and it is up to the heartbeat/reconnect logic to initiate a reconnect. With the default settings, that should happen within 10 minutes.
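To make that split of responsibilities concrete, here is a minimal, simplified model of it using plain `java.util.concurrent`. It is not the binding's code: `HeartbeatSketch`, `onEndOfStream`, `onMessage` and `heartbeat` are hypothetical names. The real handler schedules a separate reconnect job after each keepalive query and cancels it when a reply arrives (hence the "Canceling scheduled reconnect job" message mentioned later in this thread); this sketch folds that into a flag checked on the next heartbeat tick. The point it illustrates: the reader thread never reconnects by itself, so if the heartbeat job stops being run, nothing brings the connection back.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical, simplified model of heartbeat-driven reconnect (not the binding's code). */
public class HeartbeatSketch {
    private final AtomicBoolean connected = new AtomicBoolean(false);
    // Set whenever the bridge sends anything; cleared each time a keepalive query goes out.
    private final AtomicBoolean replySinceLastQuery = new AtomicBoolean(false);
    final AtomicInteger connectCount = new AtomicInteger();

    void connect() {
        // A real handler would open the socket and start a new reader thread here.
        connected.set(true);
        replySinceLastQuery.set(true);
        connectCount.incrementAndGet();
    }

    /** Reader thread calls this on "End of input stream detected", then simply exits. */
    void onEndOfStream() {
        connected.set(false);
    }

    /** Reader thread calls this for every message received from the bridge. */
    void onMessage() {
        replySinceLastQuery.set(true);
    }

    /** Periodic keepalive job: the only code path that ever triggers a reconnect. */
    void heartbeat() {
        // Reconnect if the reader reported EOF, or if the previous query got no reply.
        if (!connected.get() || !replySinceLastQuery.getAndSet(false)) {
            connect();
            return;
        }
        // Connection looks healthy: a real handler would send a keepalive query here.
    }

    /** Schedules the keepalive job; if this job stops running, no reconnect ever happens. */
    void start(ScheduledExecutorService scheduler, long period, TimeUnit unit) {
        connect();
        scheduler.scheduleWithFixedDelay(this::heartbeat, period, period, unit);
    }

    public static void main(String[] args) throws InterruptedException {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        HeartbeatSketch bridge = new HeartbeatSketch();
        bridge.start(scheduler, 50, TimeUnit.MILLISECONDS);
        bridge.onEndOfStream(); // simulate the bridge dropping the connection
        Thread.sleep(200);      // give the heartbeat a few ticks to notice
        System.out.println("connect attempts: " + bridge.connectCount.get());
        scheduler.shutdownNow();
    }
}
```

In this model a dead connection is noticed on the next tick, and an unanswered query one tick later, which is why two heartbeat intervals bound the recovery time as long as the scheduler keeps running the job.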
Everything is default. It definitely didn't kick in after an hour of waiting.
Happened again today. Both bridges, ~9 minutes apart this time. I waited over an hour and they never came back. At 19:06 I ran "bundle:restart org.openhab.binding.lutron", but that didn't restore the connection either. I also tried refreshing the things file; still nothing. Restarting OH finally restored the connection.
2020-12-04 17:56:00.729 [TRACE] [n.internal.handler.LeapBridgeHandler] - Zone: 22 level: 50
After bundle restart, all things did:
Touching the things file minutes later did:
Do you see messages like these every 5 minutes in your log file? You should be seeing them as a result of the keepalive job running.
I think I grepped this correctly for what you're looking for. This is everything from 17:00 until 19:59 for the event I just posted. Notice this massive gap:
Matching the messages up by time, it looks like the keepalive job stopped running for both bridge things about 10 minutes before one connection was lost and 18 minutes before the other was lost, so the handlers would have been unable to restart the connections. Can you also grep for the "Canceling scheduled reconnect job" message and see when it was last logged? Are you sure you're not having a more general problem with the system, such as running out of threads in a pool or some other resource?
Ignore my previous post. I was looking at the wrong file; apparently my logs rolled over.
I'm not tracking any thread issues right now. I just moved up to M5 and things have been pretty solid.
Ok. So the heartbeat logic was behaving as expected until it stopped running.
I've built a new OH3 version of the binding that includes some extra debug logging. This may help determine what is going wrong. https://github.com/bobadair/openhab-addons/releases/tag/9178-1
I can probably drop this in tomorrow morning and just let it run until things break again.
My caseta just failed...
Both bridges died this time. I let them sit for almost 45 minutes before I restarted OH.
So it doesn't look like the handler is shutting the connection down, or you would see that in the new debug messages. It seems the connection is being killed by something external. When was the last time the keepalive job (the one that logs "Sending keepalive query") ran before the connections were lost?
I just noticed, when upgrading to RC1 and then S2086, that the "stock" binding also loaded. I'm not 100% sure whether that was the case for the logs above. I've got the bindings squared away, so let's see what happens next time.
Apparently that didn't take very long. The logs at the end are me shutting OH down for a restart. I clipped everything that wasn't a zone doing something.
It seems suspicious to me that the connection closes exactly 10 minutes after the last keepalive query. I thought I saw that in a previous log, too. Can you post all of the log messages from the binding from 2020-12-18 15:31:19 until the line:
Here ya go
Here's an odd one. The bridge went offline. 20 minutes later, something triggered the bridge to send a command to one of my lights (which it shouldn't have done with the bridge offline). That caused some kind of reconnect, but the bridge was effectively unresponsive: despite showing as online, I couldn't send it any commands. In the end it kept trying to send "#OUTPUT,72,1,0,0.25".
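For anyone unfamiliar with that log line: it uses the format of Lutron's Integration Protocol, where, as we read the protocol documentation, "#OUTPUT,<integration ID>,1,<level>,<fade>" means "set output <ID> to <level> percent, fading over <fade> seconds". So the binding was repeatedly asking output 72 to go to 0% (off) with a 0.25 s fade. The parser below is only a hypothetical illustration of that reading, not code from the binding; `OutputCommand` and `parse` are invented names, it requires the fade field, and it only understands the plain-seconds fade form.

```java
/** Hypothetical parser for a Lutron Integration Protocol "#OUTPUT" set-level command. */
public final class OutputCommand {
    static final int ACTION_SET_LEVEL = 1;

    final int integrationId;
    final double levelPercent;
    final double fadeSeconds;

    private OutputCommand(int integrationId, double levelPercent, double fadeSeconds) {
        this.integrationId = integrationId;
        this.levelPercent = levelPercent;
        this.fadeSeconds = fadeSeconds;
    }

    /** Parses "#OUTPUT,<id>,1,<level>,<fade seconds>"; anything else is rejected. */
    static OutputCommand parse(String line) {
        String[] f = line.trim().split(",");
        if (f.length < 5 || !f[0].equals("#OUTPUT")) {
            throw new IllegalArgumentException("not a set-level #OUTPUT command: " + line);
        }
        if (Integer.parseInt(f[2]) != ACTION_SET_LEVEL) {
            throw new IllegalArgumentException("only action 1 (set level) is handled: " + line);
        }
        // Fade in plain seconds only (e.g. "0.25"); MM:SS forms are not handled here.
        return new OutputCommand(Integer.parseInt(f[1]), Double.parseDouble(f[3]),
                Double.parseDouble(f[4]));
    }

    @Override
    public String toString() {
        return "output " + integrationId + " -> " + levelPercent + "% over " + fadeSeconds + " s";
    }

    public static void main(String[] args) {
        System.out.println(parse("#OUTPUT,72,1,0,0.25")); // prints "output 72 -> 0.0% over 0.25 s"
    }
}
```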
As a test, to see whether this is the LEAP code or the move to OH3, I've added IP bridges for my two bridges in parallel to the LEAP bridges. Let's see if one goes offline and the other doesn't.
So I'm starting to wonder if you were right about the thread exhaustion question a while ago. My bridges went offline again this morning. Oddly enough, my ecobee binding also broke at almost the same time and didn't come back. Also at the same time, the rule I have that runs periodic speedtests (every 15 minutes) stopped working. Given that the threading system was changed in OH3, we may need to look there.
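The shared-pool hypothesis is easy to reproduce in isolation. This sketch uses only `java.util.concurrent` and assumes, as the thread suggests, that periodic jobs from several bindings share one small scheduled pool; the pool size of 2, the timings and the `runScenario` helper are arbitrary, hypothetical choices, not openHAB's actual configuration. Tasks that block forever (standing in for a misbehaving binding) occupy every pool thread, so a keepalive job scheduled on the same pool never runs, and neither does anything else queued behind it. That would be consistent with Lutron, ecobee and the speedtest rule all stopping at once.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical demo of scheduled-pool starvation; pool size and timings are arbitrary. */
public class PoolStarvationSketch {
    static final int POOL_SIZE = 2;

    /** Returns how often a 20 ms keepalive job ran during a 200 ms window. */
    static int runScenario(boolean misbehavingTasks) throws InterruptedException {
        ScheduledExecutorService shared = Executors.newScheduledThreadPool(POOL_SIZE);
        CountDownLatch never = new CountDownLatch(1); // never counted down
        AtomicInteger keepaliveRuns = new AtomicInteger();
        try {
            if (misbehavingTasks) {
                // One blocking task per pool thread, like a binding job that never returns.
                for (int i = 0; i < POOL_SIZE; i++) {
                    shared.execute(() -> {
                        try {
                            never.await();
                        } catch (InterruptedException e) {
                            Thread.currentThread().interrupt();
                        }
                    });
                }
            }
            // The "keepalive" job shares the pool with whatever was submitted above.
            shared.scheduleWithFixedDelay(keepaliveRuns::incrementAndGet, 0, 20, TimeUnit.MILLISECONDS);
            Thread.sleep(200);
        } finally {
            shared.shutdownNow(); // interrupts the blocked tasks so the demo can exit
        }
        return keepaliveRuns.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("healthy pool: keepalive ran " + runScenario(false) + " times");
        System.out.println("starved pool: keepalive ran " + runScenario(true) + " times");
    }
}
```

Note that nothing fails loudly in the starved case: the keepalive job is still scheduled, it just never gets a thread, which matches the keepalive log messages silently stopping.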
I was just reading through 1998. It sounds like that could be the issue. I hope it is, because I'm starting to run out of things to check in the Lutron binding. :-)
Yeah, I'd say hold until that Samsung binding is fixed up. It may fix a few things.
I've submitted PR #9554 for the debug logging change.
Cool. I had an interesting event today that I've been watching. I loaded the updated ecobee and denon bindings that resolve their thread issues. This time, instead of the things just going offline, one stayed up and the other flapped up and down. I just compiled and loaded the Samsung TV changes Kai submitted about an hour ago. Let's see if this calms everything down.
This has been completely stable with the updated samsung, denon, and ecobee bindings in place. I'm going to close this for now, as it looks like the lutron binding was just being impacted by the threading issues in those bindings.
@bobadair FYI
From https://community.openhab.org/t/lutron-caseta-support/102310/56
LEAP bridges will go offline and not return. Trace logs below do not seem to indicate any specific reason other than "End of input stream detected" and "Message reader thread exiting". No new connection is established after this message.
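Those two trace messages fit a reader loop of roughly the following shape. This is a hypothetical sketch, not the binding's `LeapBridgeHandler` code; `ReaderLoopSketch`, `readUntilEof` and the callbacks are invented names, and a `StringReader` stands in for the socket. `BufferedReader.readLine()` returns `null` once the remote side closes the stream; the loop then logs and returns, and nothing in the reader itself schedules a reconnect, which is left to the heartbeat job.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical reader loop that exits at end of stream without reconnecting. */
public class ReaderLoopSketch {
    /** Reads lines until EOF, then reports the dead connection and returns what it read. */
    static List<String> readUntilEof(Reader source, Consumer<String> onMessage, Runnable onEndOfStream)
            throws IOException {
        List<String> received = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                received.add(line);
                onMessage.accept(line);
            }
        }
        // readLine() returned null: the remote side closed the stream.
        System.out.println("End of input stream detected");
        onEndOfStream.run(); // mark the connection dead; reconnecting is someone else's job
        System.out.println("Message reader thread exiting");
        return received;
    }

    public static void main(String[] args) throws IOException {
        Reader fakeSocket = new StringReader("message 1\nmessage 2\n");
        List<String> got = readUntilEof(fakeSocket, m -> System.out.println("got " + m),
                () -> System.out.println("connection marked offline"));
        System.out.println(got.size() + " messages before EOF");
    }
}
```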