headscale server stopped answering after a day of uptime despite listening on all the ports #1572
+1. After ~20 minutes headscale is stuck with timeouts while updating nodes.
After carefully checking my headscale VMs' stats, I found that it is very RAM sensitive.
I was wrong on this; headscale sometimes gets stuck even with enough memory and CPU.
+1 Running on a NixOS server with 24 GB of RAM (so RAM isn't the issue). Headscale randomly (from what I can tell so far) gets stuck using 3% of the CPU, takes 5 minutes for systemctl to restart, and doesn't allow new connections.
I think this might be fixed with the tip of #1564, could you test?
I wasn't inheriting the overlay-ed pkgs to the server host. I'm switching to the patched version of the headscale server and will let you know how it works in a few days.
I'm not sure if this is the right thread for this, but I noticed while …
0.23.0-alpha2 addresses a series of issues with node synchronisation, online status and subnet routers; please test this release and report back if the issue still persists.
@kradalby The issue of the headscale server not responding has gone away. I've noticed a weird issue with the app where I need to log out, change the server and save it, then log back in before I can connect. But I have not had to restart the headscale server at all since I switched to the alpha release.
Well, scratch that: now I keep getting prompted to re-register my phone when I try to connect to the server. But I'm not getting the same issue as before where I couldn't log on at all and the headscale service would hang.
Hi, the stability and responsiveness are much better. Alpha 1 easily got stuck under some stress (300+ nodes) in just minutes. This new alpha is robust enough to handle the same workload for 20+ hours without issues. Thanks @kradalby for the advances and for following up on this issue. I can share any info needed to help with the development of v0.23.
I'm experiencing something similar on 0.23-alpha2 - I haven't had time to debug yet, but I've had to restart headscale twice on a moderately sized install (~30 nodes) in the last two days.
I deleted my phone node, re-registered it, and I have not had any issues connecting in the two days since!
Same issue on 0.23-alpha2 after one hour; I will check if the issue reappears.
I can confirm the issue is still present: after one hour or so there is no response from the gRPC or HTTP API, but headscale still works (I haven't tested whether I can register new nodes).
And I also confirm that new devices or disconnected devices cannot join/reconnect to the tailnet.
I have to add that this bug is very inconsistent; today it did not appear, but yesterday it was there.
Finding myself in a similar situation where Headscale randomly locks up, running the master branch from this commit with postgres. I haven't been able to find exactly where the issue happens while debugging. I'm hoping something like #1701 may solve it. In the meantime, I've put together a small/lazy systemd service that acts as a healthcheck, rebooting Headscale if it locks up. Taking yesterday's data, this fired 9 times with 30 nodes in a for-fun containerized testing environment.

headscale-health

```bash
#!/usr/bin/env bash
set -e

while true; do
  echo "Checking headscale health"
  timeout 5 headscale nodes list >/dev/null ||
    (echo "Failed to get nodes list, rebooting headscale" && timeout 3 systemctl restart headscale) ||
    (echo "Failed to restart, killing process" && kill -9 "$(ps aux | grep 'headscale serve' | grep -v grep | awk '{print $2}')")
  sleep 10
done
```

headscale-health.service
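(The contents of headscale-health.service were not included in the comment above. The following is only a minimal sketch of what a unit that runs the loop as a long-lived service might look like; the unit name and the install path /usr/local/bin/headscale-health are assumptions, not the author's actual file.)

```ini
# Illustrative only -- not the actual unit file shared in this thread.
[Unit]
Description=Headscale health-check loop
After=headscale.service

[Service]
Type=simple
# Assumed install location of the headscale-health script shown above.
ExecStart=/usr/local/bin/headscale-health
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
```

Such a unit could then be enabled with `systemctl enable --now headscale-health.service`.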
@TotoTheDragon Could you have a look at this? I won't have time to get around to it for a while longer.
@kradalby I have been looking into this issue for the past day and agree with Dustin that #1701 is a good contender for solving this issue and #1656. The issue was present before the commit referenced. I do not have the infrastructure to test with 30 or so nodes to recreate the issue, but we could make a build that includes the changes from #1701, and hopefully Dustin is able to test that to see if it makes any difference.
I'm down! If #1701 is considered complete in its current state, I can ship it and see how it does.
I would say that it is complete, but complete as in the tip of main: not tested sufficiently to release as a version. But I interpret the fact that you are currently running main as a sign that your risk appetite is fine with that.
More or less; it helps that the codebase is easy to read, so at least I know what I'm bringing in off main. I'll give this a shot either this week or next :)
@dustinblackman Would you be able to test with the current version of main?
@TotoTheDragon I've been running from 83769ba for the last four days. At first it looked like all was good, but looking at the logs I'm still seeing lockups, though fewer. I can try the latest master later in the week. I also have a set of scripts for a local cluster that I had written for #1725. I can look to PR them if you think they'd be helpful in debugging this.
@dustinblackman seeing as …
Could you give alpha4 a spin: https://github.com/juanfont/headscale/releases/tag/v0.23.0-alpha4
@kradalby I ran this for about two hours and saw no reboots, but I experienced issues where some newly added ephemeral nodes were unable to communicate over the network (port 443 requests), even with tailscale ping showing a direct connection. I'm wondering if nodes are not always being notified when a new node joins the network. I'm going to try again in a localized environment and see if I can repro it.
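(As a side note, a quick spot check along the lines of the symptom described above could look like the following sketch; the peer address is a placeholder and the commands are only an illustration of the described test, not part of the original report.)

```bash
#!/usr/bin/env bash
# Hypothetical connectivity check for a freshly joined ephemeral node.
PEER=100.64.0.42   # placeholder tailnet IP of the new node

# Confirm a path to the peer exists (direct or via DERP relay).
tailscale ping "$PEER"

# Then check the application port (the port 443 requests mentioned above).
curl -skv --max-time 5 "https://$PEER/" -o /dev/null
```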
Thank you @dustinblackman, that's helpful; it does sound like there are some missing updates. There is a debug env flag you can turn on which will dump all the mapresponses sent. If you can repro, that would potentially be helpful info, but it produces a lot of data and might not be suitable if you have a lot of nodes. You can play around with that by setting …
Could you please test if this is still the case with https://github.com/juanfont/headscale/releases/tag/v0.23.0-alpha5 ?
@kradalby No reboots again, but after 30 minutes I get several lines such as the following. Couldn't prove they were actually causing issues. I'll test further.
I'm unable to repro this in a local cluster. :(
Could you please try the newest alpha (https://github.com/juanfont/headscale/releases/tag/v0.23.0-alpha6) and report back?
I think the latest alpha should have improved this a lot, can someone experiencing this give it a try?
I'll look to give this a spin this week if I can slot it in :)
Been running this for a little over a day with no issues! Amazing work, thank you! I appreciate all the effort.
Will close this as fixed.
Went to a local shop and tried to connect to my remote headscale server (v0.23.0-alpha1) that I got working several days ago.
The Tailscale client (1.48.2, Android) got stuck at "connecting",
then I logged out and tried to log in again; it also got stuck at this stage, with no output or warning.
I got home and tried to connect from my desktop computer (1.51, Windows 11) with no success (it got stuck opening listen_addr and port in the browser to authenticate).
Then I connected to my server over SSH and ran 'headscale apikeys list' and 'headscale nodes list' with no success; then I restarted the headscale server with 'service headscale stop' / 'service headscale start' and everything started working just fine.
Is there anything I can do better to help you investigate this issue if I hit it next time?
What I've found so far:
tcpdump on port 8080 (which is my 'listen_addr') showed something like this:
I tried to restart headscale with 'service headscale stop' and the logs looked like this:
Before that, the logs were full of:
```
Oct 9 20:30:15 server headscale[482461]: 2023-10-09T20:30:15Z INF ../../../home/runner/work/headscale/headscale/hscontrol/poll.go:33 > Waiting for update on stream channel node=MikroTik node_key=removed_key_hash_data noise=true omitPeers=false readOnly=false stream=true
```