
Bring number of users into the server-load #99

Closed
paul1278 opened this issue Mar 25, 2020 · 38 comments
Labels
enhancement New feature or request

Comments

@paul1278
Contributor

For example, if you have two servers and there are three meetings running:

server1:    1 Meeting, 50 active users
server2:    2 Meetings, 5 active users

When a fourth meeting is now created, it will be placed on server1, but server2 would be the better choice, or did I misunderstand something?
The server load appears to be just the number of meetings on a server, as seen here: https://github.com/blindsidenetworks/scalelite/blob/master/lib/tasks/poll.rake#L36

and here (only incremented on meeting create):

@farhatahmad
Collaborator

Hi @paul1278,

No, you're correct. Scalelite was designed to balance meetings equally across servers based only on the number of meetings. Ideally, multiple factors would feed into correctly load balancing across servers. The ones that come to mind are:

  • Server weight
  • Number of meetings
  • Number of users
  • Number of streams

Not quite sure if/when this will be changed, as our current focus is mostly on maintenance of our infrastructure and community support.

@paul1278
Contributor Author

Hi @farhatahmad,

Oh OK, thank you for that information! Great work nevertheless!

@einhirn
Contributor

einhirn commented Mar 26, 2020

I'll look into providing a PR with a slight improvement over the current situation. I think it should be doable: the poller requests "getMeetings" anyway, so I'll only have to dig a little deeper into the XML; there's a count of videos and audios or something in there...
Server weight is a different story completely, because that would mean introducing a new data field into the program. Pretty sure my Ruby isn't good enough to do that 😁

@paul1278
Contributor Author

That would be veeeery nice, thank you so much!

@paul1278 paul1278 reopened this Mar 27, 2020
@paul1278
Contributor Author

paul1278 commented Mar 27, 2020

Oh @einhirn, here's what I did in another script (nothing to do with Scalelite):
Per meeting, there are these nodes:

  • participantCount, holding the total number of participants in the meeting
  • listenerCount, holding the number of people just listening
  • voiceParticipantCount, holding the number of people listening with their microphone turned on
  • videoCount, holding the number of people with their video turned on

My Ruby is not that good either, but if you have any problems, I am happy to help.
I guess you mean the lines here https://github.com/blindsidenetworks/scalelite/blob/master/lib/tasks/poll.rake#L34-L36 ?
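If it helps, here is a rough sketch (not the actual poll.rake code; the weights and the Nokogiri-style field access are just assumptions) of how those fields could feed into the load instead of only counting meetings:

load = 0.0
meetings.each do |meeting|
  users  = meeting.at_xpath('participantCount').text.to_i
  voices = meeting.at_xpath('voiceParticipantCount').text.to_i
  videos = meeting.at_xpath('videoCount').text.to_i
  # placeholder weights: one point per meeting plus a small contribution per user/stream
  load += 1.0 + 0.02 * users + 0.05 * voices + 0.1 * videos
end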

@einhirn
Contributor

einhirn commented Mar 29, 2020

Thanks @paul1278, I found out that the "status" task already queries all the fields I want to use anyway, so I just copied that code with minor adjustments 😁

@paul1278
Contributor Author

paul1278 commented Mar 29, 2020

Oh very nice, I forgot about that! Really appreciate your work! :D

One last thing if you have some spare time left: could you add an option so that, if an environment variable like "overwrite_load_video" or "overwrite_load_participants" is set, it overwrites the default values when calculating the server load?

Just in case anybody needs that!

I guess environment variables are pretty easy to read when using ENV["whatever"].to_i (example)

Another example:
ENV.has_key?('overwrite_load_video') ? ENV['overwrite_load_video'].to_i : 100

@rabser

rabser commented Mar 29, 2020

I believe that every element of the load calculation (not only the new ones) should be weighted via an external env variable, so you can tune the final formula to your needs.
Thanks for your efforts.

@einhirn
Contributor

einhirn commented Mar 29, 2020

[X] done, I think. See #108

@paul1278
Contributor Author

Looking good, but sorry, my fault: Travis shows that has_key? is the wrong function to check whether a key exists, see https://travis-ci.com/github/blindsidenetworks/scalelite/builds/156361390#L317

But it looks like key? works the same way:
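Presumably the same pattern as before would then become:

ENV.key?('overwrite_load_video') ? ENV['overwrite_load_video'].to_i : 100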

@paul1278
Contributor Author

Seems to work now! Thank you very much!

@rabser

rabser commented Mar 29, 2020

Great!

@paul1278
Contributor Author

Just a question to you all: when using server weights, do you think it's enough for the whole server load to be multiplied by a number, e.g. 1 = default weight, 0.5 = the server can handle twice the load of a default server, 2 = the server can handle half the load of a default server, etc.?
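As a minimal sketch of that idea (illustrative names, not actual Scalelite code), the multiplier would simply scale whatever load value is computed:

# 0.5 = can take twice the load of a default server, 2.0 = only half
weighted_load = raw_load * server_load_multiplier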

@rabser

rabser commented Mar 30, 2020

I think so.

@paul1278
Contributor Author

paul1278 commented Mar 30, 2020

OK, I also did some work, and it seems to work: #113 (load multiplier to weight servers)

Well, the CI tests don't pass yet; I will fix them.

Edit: works now

@einhirn
Contributor

einhirn commented Apr 2, 2020

Hi @farhatahmad, thanks for your response in #108 - I can understand that you don't want to introduce something this close to the core of your product just like that.
Especially because the default values I chose off the top of my head will change the behaviour. Maybe setting the default weights for (video, voice, meetings) to (0, 0, 1) would be a way to introduce it without changing the behaviour of existing installs while unconfigured. I went with much lower factors for video and voice streams (7, 3, 1) when deploying it on our farm, and I'm going to keep running it with my patch, so if you need some data, perhaps I can help.
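To make the weighting concrete, a minimal sketch (illustrative names, not the exact PR code); with (0, 0, 1) it degrades to the current meetings-only behaviour:

# example weights for (video, voice, meetings) = (7, 3, 1)
load = 7 * video_streams + 3 * voice_streams + 1 * meeting_count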

Of course Paul's idea has its merits too, especially when your farm is put together from hardware of differing capability...

Anyway, the main issue I see with load balancing this kind of dynamic workload is that you can't move a room from one server to another, or even have a room split across two servers. One can't (at least in my case) know how many users or even video streams a room will have over time, so you can only place the room on the least loaded server and hope for the best...

@jodoma

jodoma commented Apr 2, 2020

Just wanted to add a few observations:

  • Giving different weights to audio and video is important, but not the full story. Having a single meeting with 50 video streams is different from having 50 meetings with a single video stream each, so there should be a non-linear component.
  • Adding weights to servers (I assume to express their capabilities) may not be needed if an overall system load metric that expresses how (over)loaded the server is gets taken into account.
  • When speaking about cloud (EC2) deployments, it may make sense to further incorporate the cpu.steal value in the decision, as it describes how overloaded the physical server hosting the virtual machine is.

Anyway, even with sophisticated metrics in place, the system as built now has some natural inertia, so it may be a bad idea to always pick the least loaded server (assume five new meetings start within a time interval of one minute; it will take some time to see the impact of these new meetings in the metrics, and it will also take some time until all users have joined the meetings, and so on). So there still needs to be a round-robin-like element in the heuristic that ensures that in the above scenario not all new meetings end up on the same host. Currently, with only counting meetings, this round-robin behaviour is built into the algorithm.

@farhatahmad
Collaborator

Coming back around to this now: we have discussed a bunch of different ways to accomplish it. Piggybacking off of what @jodoma said, simply counting the video streams isn't enough. I need to spend some time exploring different approaches, but we'll probably need to factor a video_streams * users number into the load somewhere.

@einhirn
Contributor

einhirn commented Apr 24, 2020

Right, basically @jodoma said that we don't know the future - and we need a way to distribute the load evenly in spite of that. I've been running my patch for a while now and it seems to work quite well, but still, the "round-robin" component needs some work for edge cases, I guess. See a screenshot from our monitoring showing the total participants and the participants on each server. Looks nice and even on a Friday with no obvious lectures...
[Screenshot: participants per server on a quiet Friday]
But even on a busy day, where the lectures for this semester have just started (you can spot them easily), the load seems to be quite well balanced:
[Screenshot: participants per server on a busy lecture day]
EDIT: Oh, the blue line on top is the total amount in both images...

@defnull
Contributor

defnull commented Apr 29, 2020

I did not see that this issue was still open and posted to #108 instead.

We improved upon this idea and added separate load factors for audio/video downstreams (in addition to upstreams). Downstream counts depend on the individual meeting size and must be calculated per meeting, then summed up. But since we are iterating over all meetings anyway, that was easy to add.
Note that audio and video downstreams must be calculated differently, because BBB mixes audio into a single channel, but does not do that for video. A meeting with 10 participants, each transmitting video, has roughly the same video downstream load as a lecture with a single presenter and 100 viewers.
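As a rough per-meeting sketch of that difference (illustrative names, not the code linked in #108):

# audio is mixed into one channel server-side, so roughly one audio downstream per attendee
audio_downstreams = participant_count
# video is not mixed, so roughly every attendee receives every webcam stream
video_downstreams = video_count * participant_count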

For an implementation, see my comment in #108

@TheJJ
Contributor

TheJJ commented Apr 30, 2020

Maybe taking the real server load into account is the simplest solution. By load I mean /proc/loadavg, because this is the load that actually matters. A server is then configured with a max_load (the number of cores, i.e. nproc). Then you calculate loadavg/max_load and use this value to determine the least-loaded server. Which of the three loadavg values is best, I can't tell :)

We wouldn't need to model the load behavior of rooms with videos, listeners, talkers, ..., instead, we query the actual server load. What do you think?
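As a sketch, on the BBB host itself this would boil down to something like the following (how the result gets reported back to Scalelite is an open question):

require 'etc'

one_minute_load = File.read('/proc/loadavg').split.first.to_f
relative_load   = one_minute_load / Etc.nprocessors   # 1.0 roughly means fully loaded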

@einhirn
Contributor

einhirn commented Apr 30, 2020

But you need to get that info back to Scalelite; it's not available via the BBB API.

@defnull
Contributor

defnull commented Apr 30, 2020

The load average is a very bad metric for the actual pressure on a server, and it is also not exposed via the BBB API.

Also, there are scenarios (e.g. scheduled lectures) where meetings are created in advance and within a short timespan, but the actual load only increases some minutes after that. With meeting count as the only factor, this is not a problem: each new meeting increments the load by one and removes the server from the top of the list immediately, so meetings are distributed evenly. But if we use any other metric, there is a risk that the same server stays 'best choice' for a while and all new meetings are created on that server. There MUST be a random factor in server selection if we move away from the meeting-count-only metric.

@TheJJ
Contributor

TheJJ commented Apr 30, 2020

Indeed, meetings created in advance wouldn't be balanced well when their load hits later. That's not handled by the current algorithm either. But the only way to balance them correctly before the meetings are fully utilized is to tell the balancer about the expected size beforehand; otherwise it can't guess the expected load.

If a server stays the best choice, then that's totally OK: it is the least loaded server. To optimize pre-distribution, we can check whether multiple servers' loads are within some bounds and select one of them randomly.

Exposing the Linux load via the BBB API should be very simple. And I wouldn't say it's a very bad metric; it's likely the best metric we can have for "now". It would be way better than the current round-robin-like balancing, which also doesn't look into the future.

If we want to take future load into account, we have to tell Scalelite about it (e.g. via meeting metadata) as an "expected load" and somehow select a server by that value. To distribute, we could compute

(sum(all_expected_loads_on_the_server) + linux_load) / max_load

and then select the server with the smallest value (plus the bounds-check-and-random-distribution optimization from above).
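A minimal sketch of that selection rule (illustrative names; the expected loads and the Linux load would still have to be reported to Scalelite somehow):

scored = servers.map { |s| [s, (s.expected_loads.sum + s.linux_load) / s.max_load] }
best = scored.map { |_, score| score }.min
# pick randomly among servers close to the best score instead of always taking the single minimum
candidates = scored.select { |_, score| score <= best * 1.1 }
chosen_server = candidates.sample.first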

@ryprfpryr

ryprfpryr commented Aug 9, 2020

@einhirn @paul1278 did you see better results or an improvement in performance?

Scalelite should also know whether the servers are processing recordings. If so, it should lower their priority via loadMultiplier.
(Besides videoCount, voiceParticipantCount, listenerCount, participantCount and the number of meetings.)

Opened this issue here: #291

@ichdasich

Any plans on this? I ran into this scenario today: meetings were scheduled on a highly loaded server because there were only two very large meetings on it, while other servers had a couple of meetings with 2-3 users each.

While I get that the solution including Linux load would be nice, I doubt it will make it into BBB soon; on the other hand, the proposal to incorporate users/video streams does not look like too big of a pull request?

@ichdasich

OK, I have finally been bitten by this. I just had a server collapse (main.js dying) during a rather important conference because the LB scheduled another large meeting on the same server, while the other node in the cluster was basically free (~7 users vs. 200 on the active node, but 4 meetings on the nearly empty one vs. 2 on the full one; another meeting with 100 users was then scheduled on top of the second one).

Is there any way we can get this prioritized higher?

@defnull
Contributor

defnull commented Dec 10, 2020

Still not sure how Scalelite should know in advance how large a meeting will grow. What if a couple of new meetings arrive at the same time that will all be very large in a couple of minutes? How do we ensure that these do not end up on the same server?

One idea would be to take meeting age into account and factor in the uncertainty of very fresh meetings. For example: if a meeting was created very recently and is still empty, count it as if it already had as many users as the largest meeting. If a meeting was created more than 15 minutes ago, it will not grow much more, so assume the current user count is correct. Interpolate between these two.

New meetings would then prefer servers with a low user count, but avoid servers with many fresh meetings whose final user count is not known yet. This approach would prevent the issue I mentioned earlier, where too many new meetings could end up on the same server.
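A minimal sketch of that interpolation (names and the 15-minute window are illustrative):

age_minutes   = (Time.now - meeting_created_at) / 60.0
weight        = [age_minutes / 15.0, 1.0].min   # 0.0 for brand-new meetings, 1.0 after 15 minutes
assumed_users = (1 - weight) * largest_meeting_users + weight * current_user_count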

@ichdasich

Good point about the knowing-in-advance part. However, as a basic thing, it would already help if the server with 100 times more users did not get another meeting scheduled on it (regardless of the new meeting's size).

Besides, I like the algorithm you are proposing.

@pielonet
Contributor

pielonet commented Feb 8, 2021

We would also appreciate an enhancement of the load-balancing strategy. I think @defnull's proposal looks best (#99 (comment)).
Until now we have tended to oversize the pool of servers behind Scalelite in order to prevent a single server from getting overloaded. It is FAR FROM COST-EFFECTIVE!
Please consider giving this issue the highest priority!

@ffdixon
Member

ffdixon commented Feb 8, 2021

If a meeting was created very recently and is still empty, count it as if it already had as many users as the largest meeting. If a meeting was created more than 15 minutes ago

One approach is to give each meeting a "warmup period" where Scalelite gives it a minimum size of 15 users for the first 15 minutes, and thereafter uses the actual size.
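In sketch form (illustrative names), that warmup rule would be:

assumed_size = if meeting_age_minutes < 15
                 [current_user_count, 15].max   # never count a fresh meeting as smaller than 15 users
               else
                 current_user_count
               end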

@ichdasich

You could theoretically also track state over time on the LB, i.e. the average maximum users over the past N sessions (maybe using a top-N-percentile approach to keep 'test sessions' with 1-2 users out of the equation).

@pielonet
Contributor

pielonet commented Feb 23, 2021

Hi, wouldn't it be even wiser to use this metric:
Sum over all rooms the product: number of users in the room * (number of active webcams + 1 if screen sharing).
That would be roughly the number of video streams handled by the server, which is IMHO what really "loads" the server.
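As a sketch (illustrative names; whether screen sharing is visible to the poller is an assumption):

load = meetings.sum do |m|
  m.participant_count * (m.video_count + (m.screen_sharing ? 1 : 0))
end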

@glxnet

glxnet commented Mar 1, 2021

Our workload includes great variation in participant numbers, which leads to the already mentioned starvation problem when several "soon to be big" meetings are put on one server at the same time. For the moment we start such meetings in advance and then disable the server in Scalelite until enough people have joined.

We think the KISS approach to this situation would be to create meetings with an expected number of participants and increase the server load in Scalelite by this number, as sketched below.
The expected number of participants would be taken from:

  1. an additional parameter of the create call,
  2. the maxParticipants parameter of the create call,
  3. a global constant given via the environment (medium room size).

in this order.
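A minimal sketch of that fallback chain (parameter and variable names are illustrative, not an existing API):

expected_participants =
  params['expectedParticipants'] ||                 # 1. hypothetical extra create parameter
  params['maxParticipants'] ||                      # 2. maxParticipants from the create call
  ENV.fetch('DEFAULT_EXPECTED_PARTICIPANTS', 25)    # 3. global constant (medium room size)
server_load += expected_participants.to_i           # reserve the capacity immediately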

Rationale:

  • Resource reservation for a conference is immediate. If a current request "loads" a server higher than all the others, any subsequent request will go to one of the other servers, regardless of (laggy) real-time load.
  • There is no need/place for in-advance scheduling of meetings with all its hassles.
  • One has to size the cluster big enough for the maximum expected load anyway (and a little higher); "squeezing" resources out of "unused slots" in running meetings will lead to overload every now and then.

Additionally, a max-users-per-server warn/reject limit could be introduced. This would be a function of:

  • current users, with/without camera, microphone, audio, etc.
  • server load
  • server capability

and could therefore be derived from real-time measurements. For a start, however, we propose a fixed per-server number.

@mbunkus

mbunkus commented Mar 1, 2021

We've been running Scalelite with these changes for nearly a year now for a school district in Germany. We regularly have 2000 concurrent users handled by a cluster of 15 conference nodes (12 dedicated CPU cores each). This works nicely. Servers aren't loaded over capacity, we don't overprovision too much. Without those patches things would be unusable, or we'd have to run three times the number of servers. We do not have any kind of foresight, no planned scheduling, just a weighted load based on the number of active video & audio channels instead of the number of conferences.

I think my point is that we've got a "perfect is the enemy of good" situation. Sure, bringing planned meetings into the fray might make things a bit more precise. However, the real advantage and the real improvement here is in going from "solely use the number of conferences" to "guess current load based on actual usage numbers (audio, video, listen-only)". Going from "guess current load based on actual usage numbers (audio, video, listen-only)" to "guess current load based on actual usage numbers (audio, video, listen-only) and take future conferences into account" is a much, much smaller step, maybe one that turns out to be impossible[1]. So why wait for the latter? Why not implement the former for the time being? It would obviously help out a lot of people, based on the comments posted here and in other issues & PRs.

So pleeeease, think seriously about implementing a hands-off approach of simply taking the current load, weighted by active stream types, into account. I'm pretty sure it'd solve 90% of the problems people have here.

[1] Possible reasons why it might be impossible & why it isn't a panacea:

  1. Requiring users to schedule their meetings is a hurdle a lot of users won't be willing to clear. I know that the teachers in our school district have way too many things to do already, a lot of them aren't tech-savvy, and requiring even more tech interaction & planning from them will simply not work in the real world. We'd have to teach each and every one of them how to estimate conference sizes in advance, how to differentiate the audio types, etc.
  2. Even for scheduled meetings you cannot estimate the load correctly as it isn't clear who'll use video, who joins via audio and who joins as a listen-only client. Those have vastly differing load characteristics. Sure, you could implement scheduling parameters for each type, but see 1: no one will want to do that work.
  3. Resource blocking might lead to waste if the resources aren't needed for some reason, e.g. you schedule a meeting, you get sick, forget to cancel the meeting, and then the server will sit there doing nothing.

@einhirn
Contributor

einhirn commented Mar 4, 2021

I applied @defnull's code proposal from #108 to the current master. Maybe this time the PR (#476) will be accepted.

@jfederico
Member

#476 was merged and is planned to be part of v1.3.

@einhirn
Contributor

einhirn commented May 16, 2022

Hmm... maybe I'm wrong, but it doesn't seem to exist in the code base of 1.3.3.1. Oh well, I'll keep the branch from the PR available; just patch your installation if you need/want this.
