Cog stops replying to commands or triggers #1153

Closed
agis opened this issue Nov 16, 2016 · 5 comments

@agis
Member

agis commented Nov 16, 2016

Occasionally, Cog stops replying to commands from Slack (even help) and to triggers.

However, the process is still up and Cog keeps emitting presence chat events in the logs. It doesn't emit any events in audit_log when commands are called, though.

Restarting Cog makes the problem go away. The problem typically recurs every ~20 minutes.

We use the following command to trigger the issue sooner:

$ watch -n 5 curl  -s --fail  --data-urlencode 'body=ping' http://cog.xxx.xxx:4001/v1/triggers/29619c59-d990-4205-a7db-de75c73e1fc4

The trigger invoked is the following:

ID              29619c59-d990-4205-a7db-de75c73e1fc4
Name            SimpleEcho
Pipeline        echo $body
Enabled         true
As User         admin
Timeout (sec)   9

We're on Cog 0.16.1 but this happened in 0.16.0 too.
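
For anyone trying to reproduce this, the watch/curl loop above can be replaced by a small script that also records when the first failure happens. This is only a sketch: it assumes a stuck Cog shows up as a non-2xx response or a timeout (which is what curl's --fail flag would surface), and the URL, function names and parameters are placeholders, not anything that ships with Cog.

```python
import time
import urllib.parse
import urllib.request

# Placeholder URL: substitute your own Cog host and trigger ID.
TRIGGER_URL = "http://cog.example.invalid:4001/v1/triggers/<trigger-id>"


def post_ping(url, timeout=10.0):
    """POST body=ping (form-encoded, like curl --data-urlencode); True on 2xx."""
    data = urllib.parse.urlencode({"body": "ping"}).encode()
    try:
        with urllib.request.urlopen(url, data=data, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except OSError:
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        return False


def first_failure(post, attempts, interval=5.0, sleep=time.sleep, clock=time.time):
    """Call post() up to `attempts` times, `interval` seconds apart.

    Returns (attempt_index, timestamp) for the first failed call,
    or None if every call succeeded.
    """
    for i in range(attempts):
        if not post():
            return i, clock()
        sleep(interval)
    return None
```

A run would look like first_failure(lambda: post_ping(TRIGGER_URL), attempts=720), which polls every 5 seconds for about an hour. The returned timestamp can then be lined up against the Cog logs.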

@agis
Member Author

agis commented Nov 16, 2016

The trigger-invoking command pasted in my initial comment might not be relevant to this issue after all. I've been running it for 30 minutes and everything is still normal.

@agis changed the title from "Cog stops replying to commands" to "Cog stops replying to commands or triggers" on Nov 16, 2016
@agis
Member Author

agis commented Nov 17, 2016

We've added some more logging at different points inside Cog and compared the logs of a help command in both scenarios (when Cog works and when it doesn't): https://gist.github.com/agis-/098d3ff27728d40f8a89b29ea965a2ef

It seems that after (Carrier.Messaging.Connection:167) [warn] Connection.sync_publish /bot/commands, something goes wrong along the way and the code never reaches (Cog.Command.Pipeline.Initializer:36), as it does when the command succeeds.
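
The comparison itself is easy to script. Below is a rough sketch of how the last common log point between a working and a failing run could be found. It assumes each log line carries a "(Module:line)" tag like the two quoted above. The helper names are made up for illustration, and the sample lines only reuse the two locations from this comment (the message text of the second one is a placeholder).

```python
import re

# Matches source tags such as "(Carrier.Messaging.Connection:167)".
SOURCE_TAG = re.compile(r"\(([A-Za-z0-9_.]+:\d+)\)")


def source_locations(log_text):
    """Extract the Module:line tag from every log line that has one."""
    locations = []
    for line in log_text.splitlines():
        match = SOURCE_TAG.search(line)
        if match:
            locations.append(match.group(1))
    return locations


def first_divergence(good, bad):
    """Return (last_common, next_expected).

    last_common is the last step both runs share; next_expected is the
    step the good run reached next but the bad run did not (None if the
    bad run follows the good run all the way).
    """
    last_common = None
    for i, step in enumerate(good):
        if i >= len(bad) or bad[i] != step:
            return last_common, step
        last_common = step
    return last_common, None


good_log = """\
(Carrier.Messaging.Connection:167) [warn] Connection.sync_publish /bot/commands
(Cog.Command.Pipeline.Initializer:36) [warn] <message>
"""
bad_log = """\
(Carrier.Messaging.Connection:167) [warn] Connection.sync_publish /bot/commands
"""

print(first_divergence(source_locations(good_log), source_locations(bad_log)))
```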

@kevsmith kevsmith added this to the Cog 0.17 milestone Nov 17, 2016
@kevsmith kevsmith self-assigned this Nov 17, 2016
@kevsmith
Member

During the port from the old to the new chat provider interface, we somehow overlooked sending Slack pings (as documented here). My current theory is that at some point Slack decides Cog is offline/unresponsive due to lack of traffic and we enter this zombie state.

I have reimplemented ping/pong in the kevsmith/liveness-detection branch to mirror what we had in the old provider API. I've also added Logger.warn calls for when Slack takes longer than 250 ms to respond to a ping, which will hopefully give us some indication of slow and/or flaky connection states.
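
To illustrate the idea (this is not the code in the branch, which is Elixir), here is a minimal Python sketch of ping/pong bookkeeping with a slow-response warning. It assumes the Slack RTM message shapes {"id": N, "type": "ping"} and {"reply_to": N, "type": "pong"}. The PingTracker class, its method names, and the overdue() check are hypothetical. Only the 250 ms threshold comes from the description above.

```python
import json
import logging
import time

logger = logging.getLogger("liveness")

SLOW_PONG_SECONDS = 0.250  # warning threshold mentioned in this thread


class PingTracker:
    """Tracks outstanding pings and measures pong round-trip times."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._next_id = 1
        self._pending = {}  # ping id -> time the ping was created

    def make_ping(self):
        """Return a JSON ping message and remember when it was created."""
        ping_id = self._next_id
        self._next_id += 1
        self._pending[ping_id] = self._clock()
        return json.dumps({"id": ping_id, "type": "ping"})

    def handle_message(self, raw):
        """Return round-trip seconds if raw answers a pending ping, else None."""
        msg = json.loads(raw)
        if msg.get("type") != "pong":
            return None
        sent_at = self._pending.pop(msg.get("reply_to"), None)
        if sent_at is None:
            return None
        elapsed = self._clock() - sent_at
        if elapsed > SLOW_PONG_SECONDS:
            logger.warning("Slack pong took %.0f ms", elapsed * 1000)
        return elapsed

    def overdue(self, max_wait):
        """IDs of pings that have gone unanswered for more than max_wait seconds."""
        now = self._clock()
        return [pid for pid, sent in self._pending.items() if now - sent > max_wait]
```

In a real client, make_ping() would be sent over the websocket on a timer and every incoming frame passed to handle_message(). A non-empty overdue() result would then be the signal to reconnect rather than linger in the zombie state.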

@agis- is currently testing my branch in his environment to see if my changes fix the problem for him.

@kevsmith kevsmith modified the milestones: Cog 0.16.2, Cog 0.17 Nov 18, 2016
@kevsmith
Member

It's been about 24 hours and I haven't heard any news (good or bad) regarding kevsmith/liveness-detection. I'm going to go ahead and PR the change with the assumption it fixed Cog's unresponsiveness.

@agis- Please update this ticket if you're still experiencing the issue with the changed code.

@agis
Member Author

agis commented Nov 22, 2016

@kevsmith I was on a trip, hence the delayed answer. The issue hasn't popped up since, so I guess the PR fixed it 🎉
