Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OPTIONS pings are sent out all at once #1132

Closed
MartinJPaterson opened this issue Mar 23, 2021 · 12 comments · Fixed by #1413
Closed

OPTIONS pings are sent out all at once #1132

MartinJPaterson opened this issue Mar 23, 2021 · 12 comments · Fixed by #1413
Labels
bug Something isn't working

Comments

@MartinJPaterson
Copy link
Contributor

Describe the bug
When you turn on the configuration to send out options pings (all-reg-options-ping and similar) the first set of OPTIONS pings get sent out all at once. There is code to randomise the interval between one OPTIONS ping and the next, unfortunately the same random offset is applied to all the user agents who are due an OPTIONS ping. So from the point of view of a single user agent, the OPTIONS pings arrive at the randomly-varied intervals specified, but FreeSWITCH still sends out all the OPTIONS pings at the same time.

The code (sofia_reg_check_ping_expire() in sofia_reg.c) checks the ping_expires field in the sip_registrations table to identify user agents whose ping is about to expire. It then sends OPTIONS pings to those destinations. Then it chooses the next ping_expires time by adding the configured time +/- a random amount. (This was the subject of FS-6400.) It then updates those registrations with the new ping_expires time with a single sql update, meaning they all get the same value.

To Reproduce
Steps to reproduce the behavior:

  1. In a system with a number of registered user agents, turn the options ping option on.
  2. Options pings will continue to be sent out to all user agents at the configured interval and random offset, but that interval will apply to all user agents.

Expected behavior
The randomisation in the timing of sending OPTIONS pings was added with FS-6400 and the clear intention was to spread out sending of OPTIONS pings over time to reduce load on FreeSWTICH processing the replies and to be nicer to an upstream sbc or proxy.

Package version or git hash

  • Version 1.10.5 from packages
@MartinJPaterson MartinJPaterson added the bug Something isn't working label Mar 23, 2021
@MartinJPaterson
Copy link
Contributor Author

It's not clear to me the best way of fixing this - one could loop round the user agents and calculate a new random ping_expires value for each one. However then instead of one database update you'd be doing one for each user agent - potentially thousands all at once, which could have a performance hit, but once FreeSWITCH had done that once, then it'll be doing fewer updates each time.

Instead one could break up the list gradually - say set the new ping_expires for half the updated user agents by adding the random offset, and half by subtracting it.

I'm not sure getting the database to apply the option is viable - the random() function in PostgreSQL and sqlite work differently (I don't think FreeSWITCH has a layer to iron out different database sql syntaxes).

@MartinJPaterson
Copy link
Contributor Author

MartinJPaterson commented Jul 7, 2021

This issue has started causing a problem as we've increased the number of phones on the system. When 160 phones all do OPTIONS pings at the same time, a handful of pings fail (and are not sent) with a message in the log:

Ping to sip user xxx failed with code 503

This is a problem if by random chance the same phone fails the options pings a few times in a row, then its registration gets cancelled.
I also noticed (only once) a phone registration fail with FS returning a 500 error at the same time as a bunch of OPTIONS pings were happening.

I have found a workaround solution though which side-steps the problem mentioned above of any change in the SQL statements in FreeSWITCH need to be compatible between the various databases used.

I set the FreeSWITCH options ping time to something long (10 minutes in my case):
<param name="ping-mean-interval" value="600"/>
That means when FreeSWITCH updates the ping_expires it will be between 5 and 15 minutes in the future.

Then I have a systemctl timer running the following SQL every minute:
update sip_registrations set ping_expires=round(extract(epoch from now()))+ floor(random() * 58 + 1)
This (in PostgreSQL-speak) updates all the ping_expires values to be between 1 and 58 seconds in the future. Because this is outside FreeSWITCH code, it can use whatever syntax is required for the database in use.

Doing this means the OPTIONS ping times ignore what FreeSWITCH sets for them because they get overridden by the systemctl timer.

@yois615
Copy link
Contributor

yois615 commented Oct 29, 2021

I came here because I have an endpoint behind a SonicWALL device that I do not control, and UDP timeout is 30 seconds. By default, pings will come out to 20-40 seconds, which exceed the timeout limit for the SonicWALL, despite the fact that I coded 30 seconds for ping interval.

The way FS-6400 was done seems backwards to me. Instead of using a random interval each time, only the FIRST packet needs to be randomized. After that, each packet should come at the exact interval specified, but they all won't come at once since the initial starting time is random.

I would love to contribute to an easy fix like this but the complete lack of basic documentation of this project and it's underlying functions has me scratching my head of where to start. Grepping the project has proven unhelpful so far.

QUESTIONS:

  1. Where is the initial value for ping_expires set when the endpoint registers?
  2. Where is the sofia_reg_check_ping_expire function called from? I see where it's defined but grepping shows me no results for any place that it is called.

@MartinJPaterson
Copy link
Contributor Author

MartinJPaterson commented Oct 29, 2021

@yois615 The function that queries the db to decide whether to send out the pings is this one: sofia_reg_check_ping_expire.

  1. Set from config value ping-mean-interval set in the sip profile
  2. Called from sofia_profile_worker_thread_run.

@yois615
Copy link
Contributor

yois615 commented Oct 29, 2021

@MartinJPaterson - See my edit above, sorry

@MartinJPaterson
Copy link
Contributor Author

@yois615 np. My comment also updated.

@yois615
Copy link
Contributor

yois615 commented Oct 29, 2021

Thanks.

For the first question, I know where it's set in XML. How does that XML value appear in the SQL value ping_expires the first time? If we can set the randomization in the initial value of ping_expires when the extension registers , with the max value being ping_mean_interval (which the name is not appropriate after what I'm suggesting), then the result will be randomization that will persist.

@yois615
Copy link
Contributor

yois615 commented Oct 29, 2021

I got it. It's right here in line 954

 sql = switch_mprintf("update sip_registrations set ping_expires = %ld where hostname='%q' and profile_name='%q' and ping_expires <= %ld ",
                                                                 next, mod_sofia_globals.hostname, profile->name, (long) now);

There's no >0 condition so it will be assigned from the profile.

We can add a condition that the random value should only be assigned if ping_expires = 0

@MartinJPaterson
Copy link
Contributor Author

The SQL used here has to be valid across the several db platforms used as the back-end storage in FS, which precludes using database functions in the sql to generate the random offset, so it has to be done in code (as it is now). The downside is that this function could be updating many registrations at once (1000's potentially) and to set a different value for each ping_expires would require an sql update for each separately, whereas now there's only one select and one update. Even if you do the update only for registrations when ping_expires is 0, that could still be many db queries (say a whole building's worth of phones turn on at the same time after a power cut).

Whether this increase in db traffic is actually a problem, I have no idea, but it's certainly a big increase.

@yois615
Copy link
Contributor

yois615 commented Oct 29, 2021

You're saying that this isn't a good idea?

sql = switch_mprintf("update sip_registrations set ping_expires = ping_expires+round(random()*%d) where hostname='%q' and profile_name='%q' and ping_expires = 0 ", interval, mod_sofia_globals.hostname);


yois615 added a commit to yois615/freeswitch that referenced this issue Oct 31, 2021
In FS-6400, the attempt was made to randomize OPTIONS
packets to be sent at a random interval.  The same random
interval is applied to all endpoints so this doesn't work.

Furthermore, rounding withing the code will ultimately
make the ping times converge over time.

This commmit will apply the random interval only during
registration of the endpoint.  All subsequent pings will
be incremented with the actual value of ping-mean-interval.

Fixes signalwire#1132, Fixes signalwire#1133
yois615 added a commit to yois615/freeswitch that referenced this issue Oct 31, 2021
In FS-6400, the attempt was made to randomize OPTIONS
packets to be sent at a random interval.  The same random
interval is applied to all endpoints so this doesn't work.

Furthermore, rounding withing the code will ultimately
make the ping times converge over time.

This commmit will apply the random interval only during
registration of the endpoint.  All subsequent pings will
be incremented with the actual value of ping-mean-interval.

Fixes signalwire#1132, Fixes signalwire#1133
@MartinJPaterson
Copy link
Contributor Author

MartinJPaterson commented Nov 1, 2021

No. The behaviour of the random() function in SQLite (the database used in the default config) is different to that in PostgreSQL. That returns a value between 0 and 1, and your SQL will work fine, but the SQLite random() function returns an integer between -9223372036854775808 and +9223372036854775807. MySQL/MariaDB don't have it at all (they have rand() instead).
EDIT: mod_lcr does work around the rand()/random() problem by discovering which the db supports with its set_db_random() function, however it uses the function in the ORDER BY part of an SQL statement, so the behaviour difference between SQLite and PostgreSQL doesn't matter here.

@yois615
Copy link
Contributor

yois615 commented Nov 1, 2021

Thanks.

I abandoned that idea and have an open PR where the initial randomization is done during registration. That way we don't hit the database any more than the flood of registrations will, and each registration gets its own random time. All subsequent pings go out exactly at the interval.

After testing I realized that all registrations during the same second get the same random number. This is because srand() is called during each execution. I have a commit in my repo to address this if you want to take a look. It's linked in #1133.

yois615 added a commit to yois615/freeswitch that referenced this issue Nov 5, 2021
In FS-6400, the attempt was made to randomize OPTIONS
packets to be sent at a random interval.  The same random
interval is applied to all endpoints so this doesn't work.
Furthermore, rounding within the code, as well as
reseeding with srand() on each run will ultimately
make the ping times converge over time.  Once the
times converge, they will not separate since the reseeding
will cause the same random number to apply to each
registration.

This commmit will apply the random interval only during
initial registration and update of registration.
All subsequent pings will be incremented with the
actual value of ping-mean-interval. (This parameter
name is no longer accurate, and would be better named
ping-max-interval).

srand() has been moved to the start of the worker
thread, and all repeat calls have been removed,
so that each call of rand(), even during the
same second, generates a different random number.

Fixes signalwire#1132, Fixes signalwire#1133
yois615 added a commit to yois615/freeswitch that referenced this issue Nov 23, 2021
In FS-6400, the attempt was made to randomize OPTIONS
packets to be sent at a random interval.  The same random
interval is applied to all endpoints so this doesn't work.
Furthermore, rounding within the code, as well as
reseeding with srand() on each run will ultimately
make the ping times converge over time.  Once the
times converge, they will not separate since the reseeding
will cause the same random number to apply to each
registration.

This commmit will apply the random interval only during
initial registration and update of registration.
All subsequent pings will be incremented with the
actual value of ping-mean-interval. (This parameter
name is no longer accurate, and would be better named
ping-max-interval).

srand() has been moved to the start of the worker
thread, and all repeat calls have been removed,
so that each call of rand(), even during the
same second, generates a different random number.

Fixes signalwire#1132, Fixes signalwire#1133
yois615 added a commit to yois615/freeswitch that referenced this issue Jul 29, 2022
In FS-6400, the attempt was made to randomize OPTIONS
packets to be sent at a random interval.  The same random
interval is applied to all endpoints so this doesn't work.
Furthermore, rounding within the code, as well as
reseeding with srand() on each run will ultimately
make the ping times converge over time.  Once the
times converge, they will not separate since the reseeding
will cause the same random number to apply to each
registration.

This commmit will apply the random interval only during
initial registration and update of registration.
All subsequent pings will be incremented with the
actual value of ping-mean-interval. (This parameter
name is no longer accurate, and would be better named
ping-max-interval).

srand() has been moved to the start of the worker
thread, and all repeat calls have been removed,
so that each call of rand(), even during the
same second, generates a different random number.

Fixes signalwire#1132, Fixes signalwire#1133
yois615 added a commit to yois615/freeswitch that referenced this issue Nov 20, 2022
In FS-6400, the attempt was made to randomize OPTIONS
packets to be sent at a random interval.  The same random
interval is applied to all endpoints so this doesn't work.
Furthermore, rounding within the code, as well as
reseeding with srand() on each run will ultimately
make the ping times converge over time.  Once the
times converge, they will not separate since the reseeding
will cause the same random number to apply to each
registration.

This commmit will apply the random interval only during
initial registration and update of registration.
All subsequent pings will be incremented with the
actual value of ping-mean-interval. (This parameter
name is no longer accurate, and would be better named
ping-max-interval).

srand() has been moved to the start of the worker
thread, and all repeat calls have been removed,
so that each call of rand(), even during the
same second, generates a different random number.

Fixes signalwire#1132, Fixes signalwire#1133
yois615 added a commit to yois615/freeswitch that referenced this issue Nov 21, 2022
In FS-6400, the attempt was made to randomize OPTIONS
packets to be sent at a random interval.  The same random
interval is applied to all endpoints so this doesn't work.
Furthermore, rounding within the code, as well as
reseeding with srand() on each run will ultimately
make the ping times converge over time.  Once the
times converge, they will not separate since the reseeding
will cause the same random number to apply to each
registration.

This commmit will apply the random interval only during
initial registration and update of registration.
All subsequent pings will be incremented with the
actual value of ping-mean-interval. (This parameter
name is no longer accurate, and would be better named
ping-max-interval).

srand() has been moved to the start of the worker
thread, and all repeat calls have been removed,
so that each call of rand(), even during the
same second, generates a different random number.

Fixes signalwire#1132, Fixes signalwire#1133
yois615 added a commit to yois615/freeswitch that referenced this issue Nov 21, 2022
In FS-6400, the attempt was made to randomize OPTIONS
packets to be sent at a random interval.  The same random
interval is applied to all endpoints so this doesn't work.
Furthermore, rounding within the code, as well as
reseeding with srand() on each run will ultimately
make the ping times converge over time.  Once the
times converge, they will not separate since the reseeding
will cause the same random number to apply to each
registration.

This commmit will apply the random interval only during
initial registration and update of registration.
All subsequent pings will be incremented with the
actual value of ping-mean-interval. (This parameter
name is no longer accurate, and would be better named
ping-max-interval).

srand() has been moved to the start of the worker
thread, and all repeat calls have been removed,
so that each call of rand(), even during the
same second, generates a different random number.

Fixes signalwire#1132, Fixes signalwire#1133
greenbea pushed a commit to greenbea/freeswitch that referenced this issue Mar 28, 2023
In FS-6400, the attempt was made to randomize OPTIONS
packets to be sent at a random interval.  The same random
interval is applied to all endpoints so this doesn't work.
Furthermore, rounding within the code, as well as
reseeding with srand() on each run will ultimately
make the ping times converge over time.  Once the
times converge, they will not separate since the reseeding
will cause the same random number to apply to each
registration.

This commmit will apply the random interval only during
initial registration and update of registration.
All subsequent pings will be incremented with the
actual value of ping-mean-interval. (This parameter
name is no longer accurate, and would be better named
ping-max-interval).

srand() has been moved to the start of the worker
thread, and all repeat calls have been removed,
so that each call of rand(), even during the
same second, generates a different random number.

Fixes signalwire#1132, Fixes signalwire#1133
Preston-PLB pushed a commit to Preston-PLB/freeswitch that referenced this issue Aug 14, 2023
In FS-6400, the attempt was made to randomize OPTIONS
packets to be sent at a random interval.  The same random
interval is applied to all endpoints so this doesn't work.
Furthermore, rounding within the code, as well as
reseeding with srand() on each run will ultimately
make the ping times converge over time.  Once the
times converge, they will not separate since the reseeding
will cause the same random number to apply to each
registration.

This commmit will apply the random interval only during
initial registration and update of registration.
All subsequent pings will be incremented with the
actual value of ping-mean-interval. (This parameter
name is no longer accurate, and would be better named
ping-max-interval).

srand() has been moved to the start of the worker
thread, and all repeat calls have been removed,
so that each call of rand(), even during the
same second, generates a different random number.

Fixes signalwire#1132, Fixes signalwire#1133
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants