WIP: try to fix halt channel closed after starting UDP tracker #594
Conversation
Codecov Report

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##           develop     #594      +/-   ##
===========================================
- Coverage    80.64%   80.59%   -0.05%
===========================================
  Files          117      117
  Lines         7831     7853      +22
===========================================
+ Hits          6315     6329      +14
- Misses        1516     1524       +8
```

☔ View full report in Codecov by Sentry.
Force-pushed from 2ba9391 to 75c07ff (Compare)
My branch uses the following UDP service management code (for reference):

```rust
pub async fn start(self, tracker: Arc<Tracker>, form: RegistrationForm) -> Result<UdpServer<Running>, Error> {
    let (tx_start, rx_start) = tokio::sync::oneshot::channel::<Started>();
    let (tx_halt, rx_halt) = tokio::sync::oneshot::channel::<Halted>();

    let task = self.state.launcher.start(tracker, tx_start, rx_halt);

    let binding = rx_start.await.expect("it should be able to start the service").address;

    form.send(Listing::new(binding, super::check))
        .expect("it should be able to send service registration");

    let running_udp_server: UdpServer<Running> = UdpServer {
        state: Running {
            binding,
            halt_task: tx_halt,
            task,
        },
    };

    trace!("Running UDP Tracker on Socket: {}", running_udp_server.state.binding);

    Ok(running_udp_server)
}

pub fn start(
    &self,
    tracker: Arc<Tracker>,
    tx_start: oneshot::Sender<Started>,
    rx_halt: oneshot::Receiver<Halted>,
) -> JoinHandle<Launcher> {
    let launcher = Launcher::new(self.bind_to);

    tokio::spawn(async move {
        Udp::run_with_graceful_shutdown(tracker, launcher.bind_to, tx_start, rx_halt).await;
        launcher
    })
}

struct ActiveRequests {
    rb: LocalRb<Static<AbortHandle, 50>>, // the number of requests we handle at the same time.
}

impl Drop for ActiveRequests {
    fn drop(&mut self) {
        for h in self.rb.pop_iter() {
            if !h.is_finished() {
                h.abort();
            }
        }
    }
}

async fn run_with_graceful_shutdown(
    tracker: Arc<Tracker>,
    bind_to: SocketAddr,
    tx_start: oneshot::Sender<Started>,
    rx_halt: oneshot::Receiver<Halted>,
) {
    let socket = Arc::new(UdpSocket::bind(bind_to).await.expect("Could not bind to {self.socket}."));
    let address = socket.local_addr().expect("Could not get local_addr from {binding}.");

    let halt = shutdown_signal_with_message(rx_halt, format!("Halting Http Service Bound to Socket: {address}"));

    let running = tokio::task::spawn(async move {
        let tracker = tracker.clone();
        let socket = socket.clone();
        let reqs = &mut ActiveRequests::default();

        loop {
            let old = reqs.rb.push_overwrite(
                Self::do_request(Self::receive_request(socket.clone()).await, tracker.clone(), socket.clone()).abort_handle(),
            );
            if let Some(h) = old {
                if !h.is_finished() {
                    h.abort();
                }
            }
        }
    });

    tx_start
        .send(Started { address })
        .expect("the UDP Tracker service should not be dropped");

    let stop = running.abort_handle();

    select! {
        _ = running => {},
        () = halt => {}
    }

    stop.abort();

    task::yield_now().await; // let's allow the other threads to complete.
}
```

This version seems to work well...
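For completeness, here is a minimal sketch of how the `tx_halt` sender retained in the `Running` state could later be used to stop the server. The `stop` method name and the `Halted::Normal` variant are assumptions for illustration, not code quoted from that branch:

```rust
// Hypothetical sketch (not code from the referenced branch): stop the running
// server by sending the halt signal through the oneshot sender that start()
// stored in the Running state, then join the spawned task.
pub async fn stop(self) -> Launcher {
    // Sending fails only if the task already finished and dropped the receiver,
    // in which case there is nothing left to halt.
    let _ = self.state.halt_task.send(Halted::Normal);

    // Join the spawned task to get the launcher back so the service could be restarted.
    self.state
        .task
        .await
        .expect("it should be able to join the stopped UDP service")
}
```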
Hi @da2ce7 OK. Then it does not make sense for me to continue working on this issue.
Hi @da2ce7 could you tell me which branch that is?
Hi @da2ce7 in the end I decided to re-open this PR to fix the problem, even if you are working on a better long-term solution. I want to fix the problems in the live demo as soon as possible. I made some changes similar to what you are doing, although the graceful shutdown is not implemented yet.
…lt" channel 5fd0c84 chore: normalize log ouput (Jose Celano) 0c1f389 fix: [#591] panicking after starting UDP server due to close halt channel (Jose Celano) Pull request description: Relates to: #594 ### Context We use a oneshot channel between the main app that launches the UDP server and the running UDP server. It could be used to send the "halt" message to the UDP server. When the app starts the server this server waits for that signal. That code is not working because when the server starts waiting the channel is closed. That does not affect the server that is still running. And currently, we are not using that signal in production (only for test environments). ### Details The "halt" channel is immediately closed after starting the UDP tracker. So the app panics with: ```console thread 'tokio-runtime-worker' panicked at src/servers/signals.rs:52:25: Failed to install stop signal: channel closed ``` The UDP is started but it's not possible to shut it down gracefully. I've added a lot of debug lines in this PR: ```rust $ cargo run Compiling torrust-tracker v3.0.0-alpha.12-develop (/home/josecelano/Documents/git/committer/me/github/torrust/torrust-tracker) Finished dev [optimized + debuginfo] target(s) in 7.05s Running `target/debug/torrust-tracker` Loading default configuration file: `./share/default/config/tracker.development.sqlite3.toml` ... 2024-01-10T17:46:40.348930411+00:00 [torrust_tracker::bootstrap::logging][INFO] logging initialized. 2024-01-10T17:46:40.349651636+00:00 [UDP Tracker][DEBUG] Launcher starting ... 2024-01-10T17:46:40.349677556+00:00 [UDP Tracker][INFO] Starting on: udp://0.0.0.0:6969 2024-01-10T17:46:40.349682046+00:00 [UDP Tracker][INFO] Started on: udp://0.0.0.0:6969 2024-01-10T17:46:40.349684556+00:00 [UDP Tracker][DEBUG] Launcher finished ... 2024-01-10T17:46:40.349687856+00:00 [UDP Tracker][DEBUG] Waiting for packets ... 2024-01-10T17:46:40.349687736+00:00 [torrust_tracker::bootstrap::jobs::http_tracker][INFO] Note: Not loading Http Tracker Service, Not Enabled in Configuration. 2024-01-10T17:46:40.349691306+00:00 [UDP Tracker][DEBUG] Wait for launcher (UDP service) to finish ... 2024-01-10T17:46:40.349689536+00:00 [UDP Tracker][DEBUG] Server on socket address: udp://0.0.0.0:6969 waiting for halt signal ... 2024-01-10T17:46:40.349700506+00:00 [torrust_tracker::bootstrap::jobs::health_check_api][INFO] Starting Health Check API server: http://127.0.0.1:1313 thread 'tokio-runtime-worker' panicked at src/servers/signals.rs:52:25: Failed to install stop signal: channel closed note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace 2024-01-10T17:46:40.349793665+00:00 [torrust_tracker::bootstrap::jobs::health_check_api][INFO] Torrust Health Check API server started on: http://127.0.0.1:1313 2024-01-10T17:46:40.349793225+00:00 [UDP Tracker][DEBUG] Halt signal spawned task stopped on address: udp://0.0.0.0:6969 ``` The function panicking is: ```rust pub async fn shutdown_signal(rx_halt: tokio::sync::oneshot::Receiver<Halted>) { let halt = async { match rx_halt.await { Ok(signal) => signal, Err(err) => panic!("Failed to install stop signal: {err}"), } }; tokio::select! { signal = halt => { info!("Halt signal processed: {}", signal) }, () = global_shutdown_signal() => { info!("Global shutdown signal processed") } } } ``` When the thread starts waiting for the signal it receives the error "channel closed". 
I fixed a similar problem for the Tracker API and the HTTP Tracker:

- #593
- #589

Apparently the problem was the sender in the channel was dropped (I don't know why, it looks like a compiler optimization). In the end, I think the problem was we were not waiting for the right spawned task. I've reorganized the code but the change that fixed the problem is:

```rust
let task = tokio::spawn(async move {
    debug!(target: "UDP Tracker", "Launcher starting ...");

    let starting = launcher.start(tracker, tx_start, rx_halt).await;

    starting.await.expect("UDP server should have started running");

    launcher
});
```

We needed to wait for the launcher to finish because in this case, the launcher is not the thread that runs the final server. So we were only waiting until the launcher was "launched" but not for the final task running the UDP server. That is the reason why the channel was closed prematurely.

I have also normalized the log output:

```console
$ cargo run
    Finished dev [optimized + debuginfo] target(s) in 0.07s
     Running `target/debug/torrust-tracker`
Loading default configuration file: `./share/default/config/tracker.development.sqlite3.toml` ...
2024-01-11T17:12:11.134816964+00:00 [torrust_tracker::bootstrap::logging][INFO] logging initialized.
2024-01-11T17:12:11.135473883+00:00 [UDP Tracker][INFO] Starting on: udp://0.0.0.0:6969
2024-01-11T17:12:11.135494422+00:00 [UDP Tracker][INFO] Started on: udp://0.0.0.0:6969
2024-01-11T17:12:11.135503672+00:00 [torrust_tracker::bootstrap::jobs][INFO] TLS not enabled
2024-01-11T17:12:11.135587738+00:00 [HTTP Tracker][INFO] Starting on: http://0.0.0.0:7070
2024-01-11T17:12:11.135612497+00:00 [HTTP Tracker][INFO] Started on: http://0.0.0.0:7070
2024-01-11T17:12:11.135619586+00:00 [torrust_tracker::bootstrap::jobs][INFO] TLS not enabled
2024-01-11T17:12:11.135675454+00:00 [API][INFO] Starting on http://127.0.0.1:1212
2024-01-11T17:12:11.135688443+00:00 [API][INFO] Started on http://127.0.0.1:1212
2024-01-11T17:12:11.135701143+00:00 [Health Check API][INFO] Starting on: http://127.0.0.1:1313
2024-01-11T17:12:11.135718012+00:00 [Health Check API][INFO] Started on: http://127.0.0.1:1313
^C2024-01-11T17:12:17.463967431+00:00 [torrust_tracker][INFO] Torrust shutting down..
2024-01-11T17:12:17.463960892+00:00 [torrust_tracker::servers::signals][INFO] Global shutdown signal processed
2024-01-11T17:12:17.464011399+00:00 [torrust_tracker::servers::signals][INFO] Shutting down HTTP server on socket address: 0.0.0.0:7070
2024-01-11T17:12:17.464015379+00:00 [torrust_tracker::servers::signals][INFO] Sending graceful shutdown signal !! shuting down in 90 seconds !!
2024-01-11T17:12:17.464035618+00:00 [torrust_tracker::servers::signals][INFO] Global shutdown signal processed
2024-01-11T17:12:17.464055757+00:00 [torrust_tracker::servers::health_check_api::server][INFO] Stopping Torrust Health Check API server o http://127.0.0.1:1313 ...
2024-01-11T17:12:17.464064287+00:00 [torrust_tracker::servers::signals][INFO] Global shutdown signal processed
2024-01-11T17:12:17.464088366+00:00 [torrust_tracker::servers::signals][INFO] Shutting down UDP server on socket address: udp://0.0.0.0:6969
2024-01-11T17:12:17.464067327+00:00 [torrust_tracker::servers::signals][INFO] Shutting down tracker API server on socket address: 127.0.0.1:1212
2024-01-11T17:12:17.464097095+00:00 [torrust_tracker::servers::signals][INFO] Sending graceful shutdown signal !! shuting down in 90 seconds !!
2024-01-11T17:12:17.464105665+00:00 [Health Check API][INFO] Stopped server running on: http://127.0.0.1:1313
2024-01-11T17:12:17.464093226+00:00 [torrust_tracker::bootstrap::jobs::torrent_cleanup][INFO] Stopping torrent cleanup job..
2024-01-11T17:12:17.464154493+00:00 [torrust_tracker][INFO] Torrust successfully shutdown.
```

I'm using targets for some reasons:

1. Easily identify service in logs
2. Easily filter log by service
3. Autodiscover services from logs

ACKs for top commit:
  josecelano:
    ACK 5fd0c84

Tree-SHA512: 83d50ded031c8fe1505ca68a7b73dc0fac7dd9d5e414ac615f357daf3682e561fda5ddf9c8ae91dab811e2ecb3c8f540d1790ebbe469c218adcdc476c016df9f
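As an aside, the per-service targets mentioned in the quoted description are the standard `target:` argument of the `log` crate macros. The following is only an illustrative sketch (the function name and messages are made up for the example), not code from the PR:

```rust
use log::{debug, info};
use std::net::SocketAddr;

// Illustrative only: logging with an explicit target tags every record with the
// service name, so log lines can be identified and filtered per service.
fn log_udp_startup(addr: SocketAddr) {
    info!(target: "UDP Tracker", "Starting on: udp://{}", addr);
    debug!(target: "UDP Tracker", "Waiting for packets ...");
}
```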
Hi @da2ce7 @WarmBeer I've been debugging this error for some hours but I can't find the problem.
It seems that the "halt" channel is immediately closed after starting the UDP tracker. So the app panics with:
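```console
thread 'tokio-runtime-worker' panicked at src/servers/signals.rs:52:25:
Failed to install stop signal: channel closed
```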
The UDP tracker is started but it's not possible to shut it down gracefully.
I've added a lot of debug lines in this PR:
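```console
$ cargo run
   Compiling torrust-tracker v3.0.0-alpha.12-develop (/home/josecelano/Documents/git/committer/me/github/torrust/torrust-tracker)
    Finished dev [optimized + debuginfo] target(s) in 7.05s
     Running `target/debug/torrust-tracker`
Loading default configuration file: `./share/default/config/tracker.development.sqlite3.toml` ...
2024-01-10T17:46:40.348930411+00:00 [torrust_tracker::bootstrap::logging][INFO] logging initialized.
2024-01-10T17:46:40.349651636+00:00 [UDP Tracker][DEBUG] Launcher starting ...
2024-01-10T17:46:40.349677556+00:00 [UDP Tracker][INFO] Starting on: udp://0.0.0.0:6969
2024-01-10T17:46:40.349682046+00:00 [UDP Tracker][INFO] Started on: udp://0.0.0.0:6969
2024-01-10T17:46:40.349684556+00:00 [UDP Tracker][DEBUG] Launcher finished ...
2024-01-10T17:46:40.349687856+00:00 [UDP Tracker][DEBUG] Waiting for packets ...
2024-01-10T17:46:40.349687736+00:00 [torrust_tracker::bootstrap::jobs::http_tracker][INFO] Note: Not loading Http Tracker Service, Not Enabled in Configuration.
2024-01-10T17:46:40.349691306+00:00 [UDP Tracker][DEBUG] Wait for launcher (UDP service) to finish ...
2024-01-10T17:46:40.349689536+00:00 [UDP Tracker][DEBUG] Server on socket address: udp://0.0.0.0:6969 waiting for halt signal ...
2024-01-10T17:46:40.349700506+00:00 [torrust_tracker::bootstrap::jobs::health_check_api][INFO] Starting Health Check API server: http://127.0.0.1:1313
thread 'tokio-runtime-worker' panicked at src/servers/signals.rs:52:25:
Failed to install stop signal: channel closed
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
2024-01-10T17:46:40.349793665+00:00 [torrust_tracker::bootstrap::jobs::health_check_api][INFO] Torrust Health Check API server started on: http://127.0.0.1:1313
2024-01-10T17:46:40.349793225+00:00 [UDP Tracker][DEBUG] Halt signal spawned task stopped on address: udp://0.0.0.0:6969
```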
The function panicking is:
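```rust
pub async fn shutdown_signal(rx_halt: tokio::sync::oneshot::Receiver<Halted>) {
    let halt = async {
        match rx_halt.await {
            Ok(signal) => signal,
            Err(err) => panic!("Failed to install stop signal: {err}"),
        }
    };

    tokio::select! {
        signal = halt => { info!("Halt signal processed: {}", signal) },
        () = global_shutdown_signal() => { info!("Global shutdown signal processed") }
    }
}
```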
When the thread starts waiting for the signal it receives the error "channel closed".
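The symptom can be reproduced in isolation with a plain tokio oneshot channel. This is a minimal sketch (not project code) showing that the receiver gets a "channel closed" error as soon as the sender is dropped:

```rust
use tokio::sync::oneshot;

#[tokio::main]
async fn main() {
    let (tx_halt, rx_halt) = oneshot::channel::<&str>();

    // Dropping the sender before the receiver is awaited closes the channel,
    // which is what happens when the task holding the sender finishes too early.
    drop(tx_halt);

    if let Err(err) = rx_halt.await {
        // Prints: Failed to install stop signal: channel closed
        println!("Failed to install stop signal: {err}");
    }
}
```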
I fixed a similar problem for the Tracker API and the HTTP Tracker: #593 and #589.
Apparently, the problem was that the sender in the channel was dropped (I don't know why; it looks like a compiler optimization).
Are we using this feature (sending a halt signal to the services to stop them)? I think we are not using it yet and we do not have tests for it. Maybe I should just remove it. It's an interesting topic but it might take a long time to solve.
Could you take a look? Maybe I'm missing something obvious.
The only problem with this bug is that the UDP tracker is forced to close when we close the app and we don't see anything in the logs. But maybe it does not make sense to use the signal if we do nothing with it.