[BUG] Not freeing system sockets when calling srt_close() #1822

Open
andersc opened this issue Feb 19, 2021 · 12 comments · May be fixed by #1829
Labels
[core] Area: Changes in SRT library core
Type: Bug (Indicates an unexpected problem or unintended behavior)
Milestone

Comments

@andersc

andersc commented Feb 19, 2021

Describe the bug

SRT does not free the underlying system socket in a timely way when a socket is closed.
If I start one server and then call ->

st = srt_close(socket);
srt_cleanup();

the system socket is released immediately after srt_cleanup().
However, we dynamically start and stop multiple SRT server instances all the time, so we cannot call srt_cleanup() while other servers in the same process are still active.
The problem is that srt_close(socket) seems to only mark the system socket for deletion, and the garbage collector that runs periodically actually closes it some time later.
As a result, the user of SRT programmatically assumes the port is free to be reused, but bind() on it will fail.

srt_close() should close the system socket.
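
A minimal sketch of that scenario (assumptions: POSIX sockets, port 8000, error handling omitted; this is a condensed illustration, not the bugtest code):

// Minimal sketch of the reported scenario: close an SRT listener, then try to
// bind another protocol (plain UDP here) to the same port right away.
// Assumptions: POSIX headers, port 8000, no error handling.
#include "srt.h"
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <unistd.h>
#include <cstring>

int main()
{
    srt_startup();

    sockaddr_in sa;
    std::memset(&sa, 0, sizeof sa);
    sa.sin_family = AF_INET;
    sa.sin_port = htons(8000);
    inet_pton(AF_INET, "127.0.0.1", &sa.sin_addr);

    SRTSOCKET s = srt_create_socket();
    srt_bind(s, (sockaddr*)&sa, sizeof sa);
    srt_listen(s, 5);

    srt_close(s); // expectation: the underlying UDP port is free after this call

    // In practice the port stays bound until the GC thread recycles the socket,
    // so a plain UDP bind to the same port fails (EADDRINUSE) for a few seconds.
    int udp = socket(AF_INET, SOCK_DGRAM, 0);
    int rc = bind(udp, (sockaddr*)&sa, sizeof sa);

    close(udp);
    srt_cleanup(); // only after this is the system socket reliably released
    return rc == 0 ? 0 : 1;
}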

To Reproduce
(Verified on Ubuntu 20.04 and macOS 11.2.1.)
There is a repo verifying this claim ->
bugtest
bugtestmain.cpp

Run the bugtest executable, and in another terminal check which process has the port bound. On Linux:
netstat -tulpen | grep 8000

Expected behavior
When the bugtest application prints ->
Stopped and destroyed the server, now the port should be free.
run the netstat command above. The socket is still bound to the application for a while (3-5 seconds).

Desktop (please provide the following information):

  • OS: Verified on Ubuntu 20.04 and macOS 11.2.1
  • SRT Version / commit ID: 3754562
@andersc andersc added the Type: Bug label Feb 19, 2021
@jeandube
Collaborator

I encountered this problem early in the development of SRT (circa 2013) and then implemented a workaround in my wrapper lib that was so effective that I forgot about it... until today. Basically, when srt_listen() fails with SRT_EDUPLISTEN, the workaround enters a loop of up to 5 more srt_listen() attempts at 1-second intervals. I have never seen this error again. This timing and delay were (and still are) not critical for me. A rough sketch of the idea is below.
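
In code, the workaround is roughly the following sketch (not the original wrapper code; it assumes `s` is already bound to the contested port):

// Rough sketch of the retry workaround described above. Assumption: `s` is an
// SRT socket that has already been bound; this is not the original wrapper code.
#include "srt.h"
#include <chrono>
#include <thread>

bool listen_with_retry(SRTSOCKET s, int backlog)
{
    for (int attempt = 0; attempt <= 5; ++attempt)   // 1 initial try + 5 retries
    {
        if (srt_listen(s, backlog) != SRT_ERROR)
            return true;                             // listening started
        if (srt_getlasterror(nullptr) != SRT_EDUPLISTEN)
            return false;                            // unrelated failure: give up
        // Port still held by the not-yet-recycled previous listener:
        // wait one second and try again.
        std::this_thread::sleep_for(std::chrono::seconds(1));
    }
    return false;
}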

@andersc
Author

andersc commented Feb 19, 2021

The problem we have is that the user might select other protocols or features, which we do not control, on that previously used SRT port, so we can't fix this issue in any wrapper. We have a user interface where the user can choose between many different protocols, SRT being one of them. The user often selects another protocol on the same port, and then our system fails because SRT still binds the port we just freed.

Anyhow, there are ugly workarounds for things like that, but it boils down to how it should work, and my view is that closing a socket should block the call until the system socket is released.

@ethouris
Collaborator

@jeandube I fixed this as one of the first things I did in SRT :) See CUDTUnited::close(CUDTSocket* s), the section handling the SRTS_LISTENING state.

That was, however, only the first problem: the closed listener kept blocking the port, making it impossible to close one listener and immediately create another one bound to the same port. Now it works, but only because the new socket borrows the multiplexer from the closed-but-not-yet-recycled previous listener socket before that multiplexer loses all its users and gets deleted.

What is likely required now is that the socket is disconnected from its multiplexer immediately when its deletion is requested. This is actually a bigger problem because, in order to disconnect the socket from the multiplexer - and the associated sender and receiver queues - all activities concerning this socket must be stopped first. The biggest problem is with the weird secondary lists that collect sockets to be handled from another thread, which are not exactly well synchronized with socket deletion.

@ethouris
Collaborator

@andersc Just note that what is really required here (and what is possible to do) is to disconnect the SRT socket from the multiplexer early, not to forcefully destroy the multiplexer. The multiplexer will of course be destroyed when the last socket is disconnected from it, but you can never be sure when that will happen. The multiplexer is shared between sockets in two cases:

  • When you explicitly bind a socket on top of an existing multiplexer (use srt_bind_acquire(), or bind a socket to the same port that another socket already uses).
  • When you accept a connection from a listener socket (all sockets accepted off a listener socket share its multiplexer).

To make sure that the multiplexer is destroyed, the application must track all its sockets and know what to do to make it happen: for a caller socket, do not bind explicitly (or at least close all sharing sockets); for a listener, if you close the listener, you also need to close all sockets that were accepted off it, as in the sketch below. Only then will the listener's multiplexer be destroyed.
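
As a sketch of the listener case (the accepted-socket list here is hypothetical application-side bookkeeping; SRT does not provide such a list):

// Sketch of the listener case described above: to let the listener's multiplexer
// be destroyed, close every socket accepted off the listener as well as the
// listener itself. The `accepted` container is the application's own (hypothetical)
// bookkeeping.
#include "srt.h"
#include <vector>

void shutdown_listener(SRTSOCKET listener, std::vector<SRTSOCKET>& accepted)
{
    for (SRTSOCKET s : accepted)
        srt_close(s);       // accepted sockets share the listener's multiplexer
    accepted.clear();

    srt_close(listener);    // last user gone: the multiplexer can now be deleted
}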

@jeandube
Collaborator

> The problem we have is that the user might select other protocols or features, which we do not control, on that previously used SRT port, so we can't fix this issue in any wrapper. We have a user interface where the user can choose between many different protocols, SRT being one of them. The user often selects another protocol on the same port, and then our system fails because SRT still binds the port we just freed.
>
> Anyhow, there are ugly workarounds for things like that, but it boils down to how it should work, and my view is that closing a socket should block the call until the system socket is released.

So my workaround was maybe not so effective after all. "C" workaround vs. C++ fix.

@ethouris
Collaborator

This actually has nothing to do with C++. It is UDT's design, which makes a socket garbage-collected after it is closed, while some activities still need to continue in other threads. But I think it should be possible to move the multiplexer-disconnection code so that it happens immediately on close, and in fact in every place where the socket is moved from m_Sockets to m_ClosedSockets. The problem could be with the "reader-remaining" sockets, that is, sockets that were broken but whose earlier received data have not yet been extracted by the application. Such a socket is not moved to the "trashcan", because sockets there are no longer dispatchable and reading the remaining data would not be possible. But then, such a socket might remain connected to the multiplexer until it is explicitly closed.

@jeandube
Collaborator

> When you accept a connection from a listener socket (all sockets accepted off a listener socket share its multiplexer).

@ethouris Aren't you stretching the definition of the Multiplexer too far here? I thought that all accepted sockets share the local port, but share the Multiplexer only if they are with the same peer.

@ethouris
Collaborator

There's no other way in SRT to share the local port than by sharing the Multiplexer. The Multiplexer owns the Channel, the Channel owns the UDP socket, and the UDP socket owns the port when bound with the default SO_REUSEADDR=false.

The Multiplexer also owns the sender and receiver queues. When you send a packet over a socket it lands in the Multiplexer's sender queue - it doesn't matter that the same queue contains packets to be sent to different peers, nor does it matter that the receiver queue gets packets arriving through this same UDP socket from different peers.

@lcksk

lcksk commented Nov 15, 2022

I also encountered the same problem. In our scenario the listener lives in a service process. Every time a socket returned by srt_accept() is closed, we observe that its internal resources are not released, so each iteration of the following loop causes memory usage to keep growing (by about 100K to 200K):

...
srt_listen()

loop {
  fd = srt_accept()
  ...
  srt_recv(fd,...)
  ...
  srt_close(fd)
}
...

This was observed with a memory-leak tool (https://github.com/bytedance/memory-leak-detector.git): after each srt_accept() -> srt_close() sequence, the tool reports allocations that are never freed:

  939,976	totals
      939,976	libsrt.so

0xc1211030, 409600, 1
0x00122c3d /lib/arm/libsrt.so (srt::CRcvLossList::CRcvLossList(int) + 64)
0x000d5f87 /lib/arm/libsrt.so (srt::CUDT::prepareConnectionObjects(srt::CHandShake const&, srt::HandshakeSide, srt::CUDTException*) + 422)

0xe48b7030, 196608, 1
0x00121883 /lib/arm/libsrt.so (srt::CSndLossList::CSndLossList(int) + 70)

0xc13c0010, 139776, 3
0x001299a7 /lib/arm/libsrt.so (srt::CUnitQueue::increase() + 358)

0xc2f10020, 65536, 1
0x000b72c1 /lib/arm/libsrt.so (srt::FixedArray<srt::CRcvBufferNew::Entry>::FixedArray(unsigned int) + 50)

0xe4d30050, 46592, 1
0x001296e9 /lib/arm/libsrt.so (srt::CUnitQueue::init(int, int, int) + 170)

0xc13e4040, 46592, 1
0x000b3947 /lib/arm/libsrt.so (srt::CSndBuffer::CSndBuffer(int, int) + 130)

0xc2557050, 20904, 1
0x00096049 /lib/arm/libsrt.so (srt::CUDTUnited::newConnection(int, srt::sockaddr_any const&, srt::CPacket const&, srt::CHandShake&, int&, srt::CUDT*&) + 1356)

0xebc4ce50, 6168, 3
0x00129951 /lib/arm/libsrt.so (srt::CUnitQueue::increase() + 272)

0xc252c850, 4096, 1
0x0012b3e9 /lib/arm/libsrt.so (srt::CHash::init(int) + 40)
0x0012d8a3 /lib/arm/libsrt.so (srt::CRcvQueue::init(int, unsigned int, int, int, srt::CChannel*, srt::sync::CTimer*) + 106)

0xebc4ded0, 2056, 1
0x00129695 /lib/arm/libsrt.so (srt::CUnitQueue::init(int, int, int) + 86)

0xebc562d0, 2048, 1
0x00129d97 /lib/arm/libsrt.so (srt::CSndUList::CSndUList(srt::sync::CTimer*) + 110)

Could you please tell me if there is an effective way to avoid this problem? Our main concern is that memory must not keep growing; since this is a service process, I can't simply call srt_cleanup(). In addition, I noticed the following PR: is it related to this problem, and is it stable enough to be merged into master?

@maxsharabayko maxsharabayko modified the milestones: Parking Lot, v1.6.0 Nov 15, 2022
@lcksk

lcksk commented Nov 16, 2022

In addition, after merging #1829 (some minor errors needed to be corrected due to version changes), the memory still continues to grow.

@maxsharabayko maxsharabayko modified the milestones: v1.5.2, v1.6.0 Feb 10, 2023
@maxsharabayko
Collaborator

Status Update

After PR #2643 (SRT v1.5.2) the closure time has been reduced from 4.5 s to 1.5 s.

  • Setting the state to SRTS_CLOSING reduces the time to actually free the listener socket from 4.5 s to 1.5 s.
  • Adding CSync::notify_one_relaxed(m_GCStopCond); at the end of CUDTUnited::close() further reduces the time to 1 second by notifying the GC thread to run a re-check.

@maxsharabayko
Collaborator

Removing the artificial 1 s delay before closing a socket makes it close almost immediately, although still in the background thread.
In the case of a listener, this artificial delay helps to gracefully reject or close any connection requests still in progress. Maybe some flag or a socket option could be used to request an immediate close. In a way, the purpose of SRTO_LINGER is close to this, and maybe it should just be honored without the additional 1 s delay.
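
For reference, setting SRTO_LINGER from the application looks roughly like this (a sketch; as noted above, this by itself does not currently remove the extra 1 s background delay):

// Sketch: configuring SRTO_LINGER on a socket (the option takes a standard
// `struct linger`). This is shown only to illustrate the option whose semantics
// are close to an "immediate close" request; it does not remove the GC delay today.
#include "srt.h"
#ifdef _WIN32
#include <winsock2.h>     // struct linger
#else
#include <sys/socket.h>   // struct linger
#endif

void set_no_linger(SRTSOCKET s)
{
    linger lin;
    lin.l_onoff  = 0;   // do not linger on close
    lin.l_linger = 0;   // linger timeout in seconds (unused when l_onoff is 0)
    srt_setsockflag(s, SRTO_LINGER, &lin, sizeof lin);
}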

The Change

diff --git a/srtcore/api.cpp b/srtcore/api.cpp
index 7ec8ff57..e3015a97 100644
--- a/srtcore/api.cpp
+++ b/srtcore/api.cpp
@@ -2678,7 +2678,7 @@ void srt::CUDTUnited::checkBrokenSockets()
         // RcvUList
         const steady_clock::time_point now        = steady_clock::now();
         const steady_clock::duration   closed_ago = now - j->second->m_tsClosureTimeStamp;
-        if (closed_ago > seconds_from(1))
+        //if (closed_ago > seconds_from(1))
         {
             CRNode* rnode = j->second->core().m_pRNode;
             if (!rnode || !rnode->m_bOnList)

Unit Test

(Or use the bugtest provided by @andersc)

#include <gtest/gtest.h>
#include <thread>
#include <chrono>
#include <cstring>   // memset
#include <iostream>  // std::cerr

#ifdef _WIN32
#define INC_SRT_WIN_WINTIME // exclude gettimeofday from srt headers
#include <ws2tcpip.h>       // sockaddr_in, inet_pton
#else
#include <netinet/in.h>     // sockaddr_in
#include <arpa/inet.h>      // inet_pton
#endif

#include "srt.h"

TEST(Listener, Restart)
{
    srt_startup();
    srt_setloglevel(LOG_NOTICE);

    auto s = srt_create_socket();

    sockaddr_in bind_sa;
    memset(&bind_sa, 0, sizeof bind_sa);
    bind_sa.sin_family = AF_INET;
    ASSERT_EQ(inet_pton(AF_INET, "127.0.0.1", &bind_sa.sin_addr), 1);
    bind_sa.sin_port = htons(5555);

    EXPECT_NE(srt_bind(s, (sockaddr*)&bind_sa, sizeof bind_sa), -1);
    EXPECT_NE(srt_listen(s, 5), -1);

    std::this_thread::sleep_for(std::chrono::milliseconds(500));

    srt_close(s);
    
    s = srt_create_socket();
    int optval = 100;
    int optlen = sizeof optval;
    EXPECT_NE(srt_setsockflag(s, SRTO_IPTTL, (void*)&optval, optlen), SRT_ERROR) << srt_getlasterror_str();

    const auto time_start = std::chrono::steady_clock::now();
    while (srt_bind(s, (sockaddr*)&bind_sa, sizeof bind_sa) == -1)
    {
        std::this_thread::sleep_for(std::chrono::milliseconds(500));
    }

    std::cerr << "Binding took " << std::chrono::duration_cast<std::chrono::milliseconds>((std::chrono::steady_clock::now() - time_start)).count() << '\n';

    //EXPECT_NE(srt_bind(s, (sockaddr*)&bind_sa, sizeof bind_sa), -1);
    EXPECT_NE(srt_listen(s, 5), -1);

    srt_close(s);
    srt_cleanup();
}

@maxsharabayko maxsharabayko modified the milestones: v1.5.4, v1.6.0 Aug 1, 2024