Tracker restarted when memory is full #566

Closed
josecelano opened this issue Jan 2, 2024 · 3 comments
Labels: Bug Incorrect Behavior

@josecelano (Member) commented Jan 2, 2024

Relates to: #567

We are running the live demo in a droplet with 1GB of RAM.

The tracker's health check endpoint is https://tracker.torrust-demo.com/health_check.

On the 29th of December I changed the tracker configuration from

remove_peerless_torrents = true

to:

remove_peerless_torrents = false

With this change, the tracker does not clean up torrents without peers. That makes the data structure containing the torrent data grow indefinitely.

That makes the process run out of memory approximately every 5 hours:

[memory usage graph]

I guess the tracker container crashes and is restarted; otherwise, I suppose the process would simply die.

We should limit the memory usage. This could happen even with the option remove_peerless_torrents enabled.

cc @WarmBeer @da2ce7

Last 24 hours

[Graph: torrust-demo DigitalOcean metrics, last 24 hours]

Last 7 days

[Graph: torrust-demo DigitalOcean metrics, last 7 days]

@josecelano (Member, Author)

Relates to: #567

@WarmBeer is working on it.

cgbosse moved this from BUG & Security to In Progress in Torrust Solution, Jan 17, 2024
@josecelano (Member, Author)

Hi @WarmBeer, I've been thinking about this problem and I want to share some thoughts.

Is the tracker actually crashing?

The tracker was restarted on the demo environment, but I suppose it was because of the container health check. We are using this HEALTHCHECK instruction:

HEALTHCHECK --interval=5s --timeout=5s --start-period=3s --retries=3 \  
  CMD /usr/bin/http_health_check http://localhost:${HEALTH_CHECK_API_PORT}/health_check \
    || exit 1

And this compose configuration:

  tracker:
    image: torrust/tracker:develop
    container_name: tracker
    tty: true
    restart: unless-stopped
    environment:
      - USER_ID=${USER_ID}
      - TORRUST_TRACKER_DATABASE=${TORRUST_TRACKER_DATABASE:-sqlite3}
      - TORRUST_TRACKER_DATABASE_DRIVER=${TORRUST_TRACKER_DATABASE_DRIVER:-sqlite3}
      - TORRUST_TRACKER_API_ADMIN_TOKEN=${TORRUST_TRACKER_API_ADMIN_TOKEN:-MyAccessToken}
    networks:
      - backend_network
    ports:
      - 6969:6969/udp
      - 7070:7070
      - 1212:1212
    volumes:
      - ./storage/tracker/lib:/var/lib/torrust/tracker:Z
      - ./storage/tracker/log:/var/log/torrust/tracker:Z
      - ./storage/tracker/etc:/etc/torrust/tracker:Z
    logging:
      options:
        max-size: "10m"
        max-file: "10"

Notice the restart attribute. See https://github.com/compose-spec/compose-spec/blob/master/spec.md#restart

It might be that the tracker keeps allocating more memory (using swap) without panicking, as you mentioned yesterday in the meeting. It could be the case that the container is restarted by the compose configuration because the container becomes unhealthy.

I have not checked it, but if you want, you can verify it using Docker. You can run the tracker with a memory limit:

docker run -it -m 500m torrust/tracker:develop
docker exec -it CONTAINER_ID /bin/sh
 # free
              total        used        free      shared  buff/cache   available
Mem:       64929416    10815272    10446528     1200488    43667616    52189600
Swap:       8388604           0     8388604

NOTE: the free command shows the host's total memory (64929416 KiB ≈ 61.92 GiB) rather than the 500m limit, because /proc/meminfo inside the container still reflects the host.

You can set the option remove_peerless_torrents = false and make hundreds of announce requests until you hit the 500m limit.
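For example, something like this (an untested sketch; it assumes the HTTP tracker is exposed at http://localhost:7070/announce as in the compose file above, and it uses the reqwest and tokio crates; the synthetic info_hashes and the 1,000-request count are made up just to create many torrents):

use std::fmt::Write as _;

// Percent-encode every byte; always escaping each byte is valid URL encoding.
fn percent_encode(bytes: &[u8]) -> String {
    let mut s = String::new();
    for b in bytes {
        let _ = write!(s, "%{:02X}", b);
    }
    s
}

#[tokio::main]
async fn main() -> Result<(), reqwest::Error> {
    let client = reqwest::Client::new();
    let peer_id = percent_encode(b"-XX0001-000000000000"); // any 20 bytes

    for i in 0u64..1_000 {
        // Build a unique 20-byte info_hash from the counter so that every
        // announce creates a new torrent entry in the repository.
        let mut info_hash = [0u8; 20];
        info_hash[..8].copy_from_slice(&i.to_be_bytes());

        let url = format!(
            "http://localhost:7070/announce?info_hash={}&peer_id={}&port=6881&uploaded=0&downloaded=0&left=0",
            percent_encode(&info_hash),
            peer_id
        );
        let _ = client.get(&url).send().await?;
    }

    Ok(())
}

With remove_peerless_torrents = false, each new info_hash stays in memory, so the container's usage should keep growing run after run.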

Anyway, it seems (from what I've read in some quick research) that Rust aborts the process when it can't allocate more memory. I've also seen a way to "capture" that event (nightly-only):

#![feature(alloc_error_hook)] // nightly-only API

use std::alloc::{set_alloc_error_hook, Layout};

fn main() {
    set_alloc_error_hook(|layout: Layout| {
        // Custom action on allocation error, like logging the failed allocation size
        eprintln!("memory allocation of {} bytes failed", layout.size());
        std::process::abort();
    });

    // Rest of your code
}

Limit memory consumption by limiting concurrent requests

You are now working on this: controlling the amount of used memory and deleting torrents when the limit is reached.

I've been thinking about an alternative. Instead of directly controlling memory consumption, we could control concurrent requests. In theory, limiting the number of concurrent requests would indirectly limit the amount of memory used.

We could check the processing time for each request and set a maximum time. When we go over the maximum response time, we can start rejecting new requests. It would be similar to what @da2ce7 did by limiting active requests to 50, but we could set that limit dynamically. In theory, if we stop accepting new requests, memory consumption should not increase.
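A minimal sketch of that idea, assuming a tokio runtime; MAX_CONCURRENT_REQUESTS, MAX_RESPONSE_TIME and handle_announce are made-up names and numbers, not tracker code:

use std::sync::Arc;
use std::time::{Duration, Instant};
use tokio::sync::Semaphore;

const MAX_CONCURRENT_REQUESTS: usize = 50;                      // hypothetical starting limit
const MAX_RESPONSE_TIME: Duration = Duration::from_millis(50);  // hypothetical threshold

// Placeholder for the real announce handling.
async fn handle_announce() {
    tokio::time::sleep(Duration::from_millis(5)).await;
}

#[tokio::main]
async fn main() {
    // Limit how many announce requests are processed at the same time.
    let limiter = Arc::new(Semaphore::new(MAX_CONCURRENT_REQUESTS));
    let mut handles = Vec::new();

    for _ in 0..200 {
        let limiter = Arc::clone(&limiter);
        handles.push(tokio::spawn(async move {
            // Reject the request if no permit is available instead of queueing it.
            let Ok(_permit) = limiter.try_acquire() else {
                // Overloaded: reply with an error / drop the UDP packet.
                return;
            };

            let start = Instant::now();
            handle_announce().await;

            // A dynamic policy could shrink the limit here when requests get slow
            // and grow it back (Semaphore::add_permits) when they recover.
            if start.elapsed() > MAX_RESPONSE_TIME {
                // not implemented in this sketch
            }
        }));
    }

    for handle in handles {
        let _ = handle.await;
    }
}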

This proposal has other advantages:

  • It also works for CPU consumption or any other resource, because the metric is the response time.
  • We can handle overload gracefully (graceful degradation).
  • We could implement many policies like prioritising the peers that came first to the swarm.
  • We could also implement other measures to avoid DoS attacks.
  • We could reject requests from peers that announce themselves too often (more often than the min interval allows); see the sketch after this list.
  • We don't need to calculate memory consumption, which could decrease performance.
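For the min-interval idea, something like this could work (an untested sketch; AnnounceRateLimiter and the 120-second constant are made up, not existing tracker types):

use std::collections::HashMap;
use std::net::SocketAddr;
use std::time::{Duration, Instant};

const MIN_ANNOUNCE_INTERVAL: Duration = Duration::from_secs(120); // hypothetical

/// Tracks the last announce time per peer and rejects announces
/// that arrive before the minimum interval has elapsed.
struct AnnounceRateLimiter {
    last_announce: HashMap<SocketAddr, Instant>,
}

impl AnnounceRateLimiter {
    fn new() -> Self {
        Self { last_announce: HashMap::new() }
    }

    /// Returns true if the announce should be accepted.
    fn allow(&mut self, peer: SocketAddr) -> bool {
        let now = Instant::now();
        match self.last_announce.get(&peer) {
            Some(last) if now.duration_since(*last) < MIN_ANNOUNCE_INTERVAL => false,
            _ => {
                self.last_announce.insert(peer, now);
                true
            }
        }
    }
}

fn main() {
    let mut limiter = AnnounceRateLimiter::new();
    let peer: SocketAddr = "10.0.0.1:6881".parse().unwrap();

    assert!(limiter.allow(peer));  // first announce is accepted
    assert!(!limiter.allow(peer)); // announcing again immediately is rejected
}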

Disadvantages:

  • When the service degrades, it affects all peers the same way, no matter how long they have been using the service or how well they have been behaving. For example, peers announcing too often can cause problems for peers behaving well.
  • We remove peerless torrents, so if we limit the number of requests, torrents will be removed somewhat randomly depending on which requests we accept. It might be the case that we always reject requests for the most popular torrents. However, I suppose requests would be rejected randomly, so the most popular torrents would still have more peers and would be more likely to stay in the repository.
  • I suppose response time would be measured in the service layer (the HTTP or UDP handler), not in the repository or domain service, because we need to adjust the "demand" to the "offer". But maybe that's not a bad thing: we can have multiple HTTP and UDP trackers, and each would adjust its request load depending on the global resource consumption of all services. I suppose all services would be automatically balanced.

But this would work only if reducing the load means reducing memory consumption.

The question is: can memory consumption still grow if we limit the number of requests?

In theory, peers should announce themselves every 2 minutes. For the normal case (peers behaving well) we can consider different scenarios, like:

  1. Many peers announcing the same torrent
  2. Many peers announcing different torrents

If we limit concurrent requests to 1 per second, in steady state the torrent repository holds:

  • Type 1: one torrent with 120 peers
  • Type 2: 120 torrents with one peer each

Assuming:

  • peer_size = memory used by one peer entry
  • stats_size = fixed memory used per torrent for its stats
  • torrent_size = stats_size + (peer_size * number_of_peers)

With 1 request per second (type 1):

  • size = stats_size + (peer_size * 120)

With 1 request per second (type 2):

  • size = (stats_size + peer_size) * 120

With N requests per second (type 1):

  • size = stats_size + (peer_size * 120 * N)

With N requests per second (type 2):

  • size = (stats_size + peer_size) * 120 * N

If we find the worst-case scenario (the one that consumes more memory), we can limit the number of concurrent requests directly, or indirectly by rejecting requests when the response time is high.
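A back-of-the-envelope version of those formulas, with made-up sizes (peer_size and stats_size here are illustrative constants, not measured values from the tracker):

fn main() {
    // Hypothetical sizes in bytes; the real values depend on the tracker's structs.
    let peer_size: u64 = 200;
    let stats_size: u64 = 48;
    let announce_interval: u64 = 120; // seconds between announces per peer
    let n: u64 = 50; // accepted requests per second

    // Type 1: every request announces the same torrent.
    // Steady state: one torrent holding (announce_interval * N) peers.
    let type_1 = stats_size + peer_size * announce_interval * n;

    // Type 2: every request announces a different torrent.
    // Steady state: (announce_interval * N) torrents with one peer each.
    let type_2 = (stats_size + peer_size) * announce_interval * n;

    println!("type 1 (one big swarm):    {} bytes", type_1);
    println!("type 2 (many tiny swarms): {} bytes", type_2);
}

With these numbers, type 2 is the worst case, because it pays the per-torrent overhead for every peer.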

This approach assumes:

  • Peers only announce themselves once every 120 seconds (adjusted to the min interval in the settings).
  • We remove peerless torrents; otherwise, memory consumption will grow anyway. This solution relies on cleaning up torrents without peers: if we limit the number of requests, only some torrents will keep being updated, and the ones that are not updated will become peerless and be removed.

Does the option remove_peerless_torrents = false make sense?

Now that we are limiting memory consumption, maybe that option does not make sense anymore, because with it memory consumption would grow indefinitely.

Does it even make sense? Why do we want to keep torrents without peers, @WarmBeer? If you want to keep a list of torrents, we already have persisted stats.

What do you think @WarmBeer @da2ce7?

@josecelano (Member, Author)

I'm going to close this issue. I assume the tracker was restarted because of the Docker health check.

We could apply memory consumption limits in the future if we consider it useful but for other reasons.

The system should just degrade gracefully as memory consumption grows.

github-project-automation bot moved this from In Progress to Done in Torrust Solution, Apr 9, 2024