Unstable Behaviour of Fly.io Deployment #177

Open
kingdomcoding opened this issue Dec 4, 2024 · 16 comments

Comments

@kingdomcoding

kingdomcoding commented Dec 4, 2024

I have a fly.io deployment of an audio streaming service largely adapted from live_ex_webrtc.

There's a heisenbug on both the publisher and player ends: sometimes they can start or join streams, and sometimes that fails.

Using IO.inspect calls in my forks of ex_webrtc and ex_ice, I can confirm that when it fails, valid candidate pairs are absent from the checklists of the ICEAgent processes.

Why this happens on Fly.io but not locally, and how to fix it, is still unclear. Any help would be greatly appreciated.

For context, I'm running on two shared-cpu-1x@1024MB instances. Changing the machine specs didn't resolve the issue. Upgrading to a dedicated IPv4 address also didn't resolve it.

I'm happy to provide any other information that can help.

@mickel8
Member

mickel8 commented Dec 4, 2024

Hi @kingdomcoding,
Is your app deployed right now? Could you share a URL? Also, could you deploy your app with debug logs enabled and capture them (especially from ex_ice) when your connection fails? And one more question: are you connecting via a VPN or some other non-standard network?
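
For reference, a minimal sketch of enabling those debug logs in a release, assuming the standard Logger setup (the goal being that ex_ice's connectivity-check logs survive into production output):

import Config

# config/runtime.exs: allow :debug-level output in production so the
# ex_ice connectivity-check details show up in the captured logs.
config :logger, level: :debug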

@kingdomcoding
Author

Hey @mickel8

The app is live, with the publisher page and the listener page, and a sample of the logs.

The behaviour with a VPN is unchanged.

@mickel8
Member

mickel8 commented Dec 5, 2024

@kingdomcoding I assume you're using https://hexdocs.pm/ex_webrtc/ExWebRTC.ICE.FlyIpFilter.html?

From the logs, it looks like we never get a response to any connectivity check.

However, I cannot reproduce your error.
In my case, sending and receiving from Chromium always works, and Firefox can always hear what Chromium sends. The only problem I noticed is that Firefox cannot transmit: the connection is established and packets are sent and received, but I cannot hear anything. I can observe the same behaviour when using Google Meet, though, so this might be something on my side or on Firefox's side.

What browser do you use?

@kingdomcoding
Author

kingdomcoding commented Dec 5, 2024

@mickel8 Yes, I use the FlyIpFilter in my runtime config, as per:

if System.get_env("FLY_APP_NAME") do
  config :solving_media, ice_ip_filter: &ExWebRTC.ICE.FlyIpFilter.ip_filter/1
end
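
For reference, the FlyIpFilter docs show the filter being passed straight to the PeerConnection. A minimal sketch of wiring the config above through, assuming the app reads the application env when starting a connection (the env lookup and the permissive default are my assumptions, not necessarily how solving_media does it):

# Read the filter set in runtime.exs; outside Fly.io, accept every local IP.
ip_filter = Application.get_env(:solving_media, :ice_ip_filter, fn _ip -> true end)

{:ok, pc} =
  ExWebRTC.PeerConnection.start_link(
    ice_ip_filter: ip_filter,
    ice_servers: [%{urls: "stun:stun.l.google.com:19302"}]
  )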

I've tested with Chrome on Windows and Ubuntu.

I haven't tried different server locations. I wonder if the ping time from my location could be a factor?

PS: I'm happy to create a separate public repo and Fly deployment if that helps

Edit: I just tested Chrome on a Mac; it works fine, but Windows and Ubuntu still fail

@mickel8
Member

mickel8 commented Dec 5, 2024

@kingdomcoding so the problem only happens on Windows and Ubuntu, right?

@mickel8
Member

mickel8 commented Dec 5, 2024

PS: I'm happy to create a separate public repo and Fly deployment if that helps

Let's do that. I would like to solve this problem, as working Fly.io support is one of our priorities.

@mickel8
Member

mickel8 commented Dec 5, 2024

Regarding ping: I don't think it's a factor.

@kingdomcoding
Author

@kingdomcoding so the problem only happens on Windows and Ubuntu, right?

It appears so. But to be clear, it works sometimes and fails sometimes; that inconsistency is the issue.
I've noticed the same inconsistency on mobile browsers as well.

PS: I'm happy to create a separate public repo and Fly deployment if that helps

Let's do that. I would like to solve this problem, as working Fly.io support is one of our priorities.

Just realized LiveBroadcaster exhibits the same issue. Repo. Live

@kingdomcoding
Author

Hi, @mickel8

I wonder if there's any good news on this issue, or any way I can contribute to solving it.

Alternatively, is there something else in the elixir-webrtc/Membrane world I could explore to make progress on our app?

@mickel8
Member

mickel8 commented Dec 12, 2024

@kingdomcoding sorry for the lack of response :( We have some priority work to do and I haven't had time to debug this further :/ The only way forward is to analyze the debug logs and try to catch the problem. Unfortunately, that requires deep knowledge of the ICE protocol :/

You can also try to deploy your app on bare machine according to: https://hexdocs.pm/ex_webrtc/bare.html

@kingdomcoding
Author

Thanks! We'll pursue both and see what's possible.

@mickel8
Member

mickel8 commented Dec 12, 2024

@kingdomcoding thanks! Please keep us updated :) We will get back to this issue ASAP

@kingdomcoding
Author

Hey, @mickel8

From more log comparison, I now know that in the successful case a connectivity check response is received that never arrives in the unsuccessful case. I also know that the response came from the IP and port of a remote ExICE.Candidate of type srflx.

My hunch is that this inconsistent behaviour might come from the ICE server the app depends on. I suspect that the app's intermittent connection might be caused by the (intermittent?) availability of the Google ICE server. Please confirm whether I'm thinking in the right direction.

My attempt at resolving this was to add stun:stun.cloudflare.com:3478 alongside the default stun:stun.l.google.com:19302. It worked for a while, broke again, then worked again.
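
For reference, a sketch of listing both servers, assuming they end up in the PeerConnection's :ice_servers option (the maps mirror the browser's RTCConfiguration shape):

{:ok, pc} =
  ExWebRTC.PeerConnection.start_link(
    ice_ip_filter: &ExWebRTC.ICE.FlyIpFilter.ip_filter/1,
    ice_servers: [
      # the default Google server plus Cloudflare as a fallback
      %{urls: "stun:stun.l.google.com:19302"},
      %{urls: "stun:stun.cloudflare.com:3478"}
    ]
  )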

Are there reliable STUN servers you can recommend, so I can rule this out and check whether it's the root of the problem?

PS: I've now seen the connection fail on macOS too

@mickel8
Member

mickel8 commented Dec 14, 2024

My hunch is that this inconsistent behaviour might come from the ICE server the app depends on. I suspect that the app's intermittent connection might be caused by the (intermittent?) availability of the Google ICE server. Please confirm whether I'm thinking in the right direction.

I don't think so. In every case, Google STUN responds correctly and we are able to gather a srflx candidate on the server side :(

I don't see much difference between these cases except the IP addresses used. Could it be that your hosts are connected to different networks during tests? In particular, could the failing host be behind a symmetric NAT? You can check here: https://www.checkmynat.com/

@kingdomcoding
Author

I don't think so. In every case, Google STUN responds correctly and we are able to gather a srflx candidate on the server side :(

I'm still building my mental model of how WebRTC and ex_webrtc work. At the moment, my understanding is that the server sends a connectivity check to the STUN server and receives a response, which either succeeds or doesn't. In my logs, handle_conn_check_success_response only gets called in the successful case.

Am I missing something in my understanding?

In particular, could the failing host be behind a symmetric NAT? You can check here: https://www.checkmynat.com/

The failing hosts are behind a port-restricted cone NAT.

@mickel8
Member

mickel8 commented Dec 14, 2024

First, we send a STUN binding request to the STUN server to discover our public IP address. This operation always succeeds (look for "new srflx candidate" in the logs). We then send this public IP to the other side. Once we also receive some IP addresses from the other side, we start performing connectivity checks, and those are what sometimes fail.
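
For readers mapping this onto ex_webrtc: both phases surface as messages from the PeerConnection process. A rough sketch, where the message shapes follow the ex_webrtc docs but the module name and the send_to_peer/2 signaling helper are hypothetical:

defmodule MyApp.PeerHandler do
  # Hypothetical GenServer owning the PeerConnection.
  use GenServer
  require Logger

  def init(signaling), do: {:ok, %{signaling: signaling}}

  # Phase 1: gathering. Each local candidate (host, srflx, ...) arrives as an
  # :ice_candidate message and must be signaled to the remote peer.
  def handle_info({:ex_webrtc, _pc, {:ice_candidate, candidate}}, state) do
    send_to_peer(state.signaling, {:candidate, candidate})
    {:noreply, state}
  end

  # Phase 2: checks. After remote candidates are added, connectivity checks
  # run; a :failed state here matches the "no conn check response" case above.
  def handle_info({:ex_webrtc, _pc, {:connection_state_change, conn_state}}, state) do
    Logger.info("peer connection state: #{inspect(conn_state)}")
    {:noreply, state}
  end

  # App-specific signaling (e.g. a Phoenix channel push); stubbed here.
  defp send_to_peer(_signaling, _msg), do: :ok
end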

What is the NAT type of the successful hosts? Also, could you check whether our demos work for you?

https://elixir-webrtc.org/#demos
