Bring number of users into the server-load #99
Hi @paul1278, no, you're correct. Scalelite was designed to balance meetings equally across servers based only on the number of meetings. Ideally, multiple factors would tie into correctly load balancing across servers. The ones that come to mind are:
Not quite sure if/when this will be changed, as our current focus is mostly on maintenance of our infrastructure and community support.
Hi @farhatahmad, oh OK, thank you for that information! Nevertheless, good work!
I'll look into providing a PR with a slight improvement over the current situation. I think it should be doable, since the poller requests "getMeetings" anyway; I'll only have to dig a little deeper into the XML, there's a count of videos and audios or something in there...
That would be veeeery nice, thank you so much!
Oh @einhirn, here's what I did in another script (it has nothing to do with Scalelite):
My Ruby is not that good either, but if you have any problems, I am happy to help.
Thanks @paul1278, I found out that the "status" task already queries all the fields I want to use anyway, so I just copied that code with minor adjustments 😁
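For readers following along, here is a rough sketch (not the actual patch) of deriving a weighted load from a BBB getMeetings response. participantCount, listenerCount and videoCount are standard BBB API fields; the weights here are invented for illustration only:

```ruby
# Sketch: sum a weighted per-meeting cost from a getMeetings XML response.
# The weight values (0.02, 0.05, 0.01) are made up for illustration.
require 'nokogiri'

def weighted_load(get_meetings_xml)
  doc = Nokogiri::XML(get_meetings_xml)
  doc.xpath('//meeting').sum do |meeting|
    users     = meeting.at_xpath('participantCount')&.text.to_i
    listeners = meeting.at_xpath('listenerCount')&.text.to_i
    videos    = meeting.at_xpath('videoCount')&.text.to_i
    # one "base" unit per meeting, plus weighted per-stream costs
    1.0 + 0.02 * users + 0.05 * videos + 0.01 * listeners
  end
end
```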
Oh very nice, forgot about that! Really appreciate your work! :D One last thing, if you have some spare time left: could you add an option so that if an environment variable like "overwrite_load_video" or "overwrite_load_participants" is set, it overwrites the default values when calculating the server load? Just in case anybody needs that! I guess environment variables are pretty easy to read when using Another example:
I believe that every element (not only the new ones) of the load calculation should be weighted with an external env variable, so you can tune the final formula to your needs.
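A hedged sketch of what that could look like, assuming each weight comes from an environment variable with a default. The variable names here are invented for illustration, not the ones Scalelite actually uses:

```ruby
# Sketch: every term of the load formula is weighted via ENV, with defaults.
MEETING_WEIGHT = ENV.fetch('LOAD_MEETING_WEIGHT', '1.0').to_f
USER_WEIGHT    = ENV.fetch('LOAD_USER_WEIGHT',    '0.02').to_f
VIDEO_WEIGHT   = ENV.fetch('LOAD_VIDEO_WEIGHT',   '0.05').to_f

def server_load(meetings, users, videos)
  MEETING_WEIGHT * meetings + USER_WEIGHT * users + VIDEO_WEIGHT * videos
end
```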
[X] Done, I think. See #108
Looking good, but sorry, my fault: Travis shows that ... But it looks like ...
Seems to work now! Thank you very much!
Great!
Just a question to you all: when using server weights, do you think it's enough if the whole server load is multiplied by a number, e.g. 1 = default weight, 0.5 = the server can handle twice the load of a default server, 2 = the server can handle half the load of a default server, etc.?
I think so.
OK, I also did some work, and it seems to function: #113 (load multiplier to weight servers). Well, the CI tests don't pass, will fix them. Edit: works now
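For anyone skimming, a hedged sketch of the multiplier idea; see #113 for the real PR. The Server struct and the values below are invented for illustration:

```ruby
# Sketch: a per-server factor scales the computed load before comparison.
Server = Struct.new(:id, :raw_load, :load_multiplier)

def effective_load(server)
  server.raw_load * (server.load_multiplier || 1.0)
end

servers = [Server.new('a', 10.0, 1.0), Server.new('b', 10.0, 0.5)]
puts servers.min_by { |s| effective_load(s) }.id
# => b: with multiplier 0.5 it counts as half as loaded, i.e. it can
#    handle twice the load of a default server
```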
Hi @farhatahmad, thanks for your response in #108 - I can understand that you don't want to introduce something so close to the core of your product just like that. Of course Paul's idea has its merits too, especially when you put your farm together from differently 'able' hardware... Anyway, the main issue I see with load balancing this kind of dynamic workload is that you can't move a room from one server to another, or even have a room split across two servers. One can't (at least in my case) know how many users or even video streams a room will have over time, so you can only place the room on the least loaded server and hope for the best...
Just wanted to add a few observations:
Anyway, even with sophisticated metrics in place, the system as built now has some natural inertia, so it may be a bad idea to always pick the least loaded server (assume five new meetings starting within a time interval of 1 minute; it will take some time to see the impact of these new meetings in the metrics, and it will also take some time until all users have joined the meetings, and so on). So there still needs to be a round-robin-like element in the heuristic that ensures that in the above scenario not all new meetings end up on the same host. Currently, with only counting meetings, this round-robin behaviour is built into the algorithm.
Coming back around to visit this now: we have discussed a bunch of different ways to accomplish this. Piggybacking off of what @jodoma said, simply counting the video streams isn't enough. I need to spend some time exploring different ways to do it, but we'll probably need to somehow factor a video_streams * users number into the load.
Right, basically @jodoma said that we don't know the future, and we need a way to distribute the load evenly in spite of that. I've been running my patch for a while now and it seems to work quite well, but the "round-robin" component still needs some work for edge cases, I guess. See a screenshot from our monitoring containing the total participants and the participants on each server. Looks nice and even on a Friday with no obvious lectures...
I did not see that this issue was still open and posted to #108 instead.
For an implementation, see my comment in #108.
Maybe taking the real server load into account is the simplest solution. By load I mean the Linux load average. We wouldn't need to model the load behavior of rooms with videos, listeners, talkers, ...; instead, we query the actual server load. What do you think?
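A minimal sketch of reading that metric, assuming a Linux host (/proc/loadavg and Ruby's Etc.nprocessors are standard; normalizing by CPU count is my own addition for illustration):

```ruby
# Sketch: 1-minute load average, normalized by the number of CPUs,
# so servers of different sizes become comparable.
require 'etc'

def normalized_load
  one_minute = File.read('/proc/loadavg').split.first.to_f
  one_minute / Etc.nprocessors
end
```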
But you need to get that info back to Scalelite - it's not available via the BBB API.
The load average is a very bad metric for this, because it only tells you about the present, not the future. Also, there are scenarios (e.g. scheduled lectures) where meetings are created in advance and in a short timespan, but the actual load only increases some minutes after that. When using meeting count as the only factor, this is not a problem: each new meeting increments the load factor by one and removes the server from the top of the list immediately. Meetings are distributed evenly. But if we use any other metric, there is a risk that the same server stays the 'best choice' for a while and all new meetings are created on the same server. There MUST be a random factor in server selection if we move away from the meeting-count-only metric.
Indeed, meetings created in advance wouldn't be balanced when the meeting load hits later. That's not the case with the current algorithm either. The only way to balance them correctly before the meetings are fully utilized is to tell the balancer about the expected size beforehand; otherwise it can't guess the expected load. If a server stays the best choice, that's totally OK: it is the least loaded server then. To optimize pre-distribution, we can check whether multiple server loads are within some bounds and select one of them randomly. Exposing the Linux load via the BBB API should be very simple. And I wouldn't say it's a very bad metric; it's likely the best metric we can have for "now". It would be way better than the current round-robin balancing, which also doesn't look into the future. If we want to take the future load into account, we have to tell Scalelite the expected load (e.g. via meeting metadata) and somehow select a server by that value. To distribute, we could do
and then select the server which got the smallest value (plus the bounds-check-and-random-distribution optimization from above).
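A hedged sketch of that bounds-check-plus-random-pick idea (the EPSILON value and the Server#load interface are invented for illustration):

```ruby
# Sketch: every server whose load is "close enough" to the minimum is an
# equally good candidate; pick one at random so several meetings created
# at the same moment don't all land on a single host.
EPSILON = 0.1

def pick_server(servers)
  min_load = servers.map(&:load).min
  servers.select { |s| s.load <= min_load + EPSILON }.sample
end
```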
@einhirn @paul1278 did you see better results or a performance improvement? Scalelite should also know whether the servers are processing recordings; if so, it should lower their priority via loadMultiplier. Opened this issue here: #291
Any plan on this? I kind of had this scenario today, where meetings were scheduled on a highly loaded server because there were only two very large meetings on it, while other servers had a couple of meetings with 2-3 users each. While I get that the solution including Linux load would be nice, I doubt it will make it into BBB anytime soon. On the other hand, the proposal to incorporate users/video streams does not look like too big of a pull request?
OK, I have finally been bitten by this. I just had a server collapse (main.js dying) during a rather important conference, because the LB scheduled another large meeting on the same server while the other node in the cluster was basically free (~7 users vs. 200 on the active node, but 4 meetings on the nearly empty one vs. 2 on the full one; another meeting with 100 users was then scheduled on top of the full one). Is there any way we can get this prioritized higher?
Still not sure how Scalelite should know in advance how large a meeting will grow. What if a couple of new meetings arrive at the same time that will all be very large in a couple of minutes? How do we ensure that these do not end up on the same server? One idea would be to take meeting age into account and factor in the uncertainty of very fresh meetings. For example: if a meeting was created very recently and is still empty, count it as if it already had as many users as the largest meeting. If a meeting was created more than 15 minutes ago, it will not grow much more, so assume the current user count is correct. Interpolate between these two. New meetings will now prefer servers with a low user count, but avoid servers with many new meetings for which the final user count is not known yet. This approach would prevent the issue I mentioned earlier, that too many new meetings could end up on the same server.
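A hedged sketch of that interpolation. The 15-minute warmup and the "count fresh meetings as the largest meeting" rule come from the comment above; Meeting#created_at, Meeting#users and the linear blend are one possible reading, not a spec:

```ruby
# Sketch: fresh meetings are counted pessimistically (as large as the
# largest meeting); after WARMUP seconds the actual count is trusted.
WARMUP = 15 * 60 # seconds

def estimated_users(meeting, largest_meeting_users, now = Time.now)
  age = now - meeting.created_at
  return meeting.users if age >= WARMUP

  trust = age.to_f / WARMUP # 0.0 for a brand-new meeting, 1.0 after 15 min
  pessimistic = [meeting.users, largest_meeting_users].max
  (trust * meeting.users + (1 - trust) * pessimistic).round
end
```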
Good point about the knowing-in-advance part. However, as a basic thing, it would already help if the server with 100 times more users did not get another meeting scheduled (regardless of the new meeting's size). Besides, I like the algorithm you are proposing.
We would also appreciate an enhancement of the load-balancing strategy. I think @defnull's proposal looks best (#99 (comment)).
One approach is to give each meeting a "warmup period" where Scalelite gives it a minimum size of 15 users for the first 15 minutes, and thereafter uses the actual size.
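That simpler rule could be as small as this sketch (Meeting#created_at and Meeting#users are assumed interfaces):

```ruby
# Sketch: assume at least 15 users during the first 15 minutes,
# then use the real participant count.
def warmup_size(meeting, now = Time.now)
  (now - meeting.created_at) < 15 * 60 ? [meeting.users, 15].max : meeting.users
end
```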
You could theoretically also track state across time on the LB, i.e., avg. max users over the past N sessions (maybe using a top-N-percentile approach to get 'test sessions' with 1-2 users out of the equation).
Hi, wouldn't it be even wiser to use this metric:
Our workload includes a wide variation in participant numbers, which leads to the already mentioned starvation problem when putting several "to be big" meetings on one server at a time. For the moment we start the meeting in advance and then disable the server in Scalelite until enough people have joined. We think that the KISS approach to handle this situation would be to create meetings with the expected number of participants and increase the server load in Scalelite by this number.
in this order. Rationale:
Additionally, a max-users-per-server warn/reject limit could be introduced. This would be a function of:
and could therefore be derived from real-time measurement. As a starting point, however, we propose a fixed per-server number.
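A hedged sketch of that KISS approach: the create request carries an expected participant count (how it is passed to Scalelite is left open here), the load is bumped immediately, and a fixed per-server cap is enforced. All names and the default cap are invented for illustration:

```ruby
# Sketch: place a meeting by expected size, rejecting servers over a cap
# and counting the expected users before they actually join.
MAX_USERS_PER_SERVER = ENV.fetch('MAX_USERS_PER_SERVER', '250').to_i

def place_meeting(servers, expected_users)
  candidates = servers.select { |s| s.users + expected_users <= MAX_USERS_PER_SERVER }
  raise 'all servers at capacity' if candidates.empty?

  best = candidates.min_by(&:load)
  best.load += expected_users # the expected users count as load right away
  best
end
```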
We've been running Scalelite with these changes for nearly a year now for a school district in Germany. We regularly have 2000 concurrent users handled by a cluster of 15 conference nodes (12 dedicated CPU cores each). This works nicely. Servers aren't loaded over capacity, and we don't overprovision too much. Without those patches things would be unusable, or we'd have to run three times the number of servers. We do not have any kind of foresight, no planned scheduling, just a weighted load based on the number of active video & audio channels instead of the number of conferences.

I think my point is that we've got a "perfect is the enemy of good" situation. Sure, bringing planned meetings into the fray might make things a bit more precise. However, the real advantage and the real improvement here is in going from "solely use the number of conferences" to "guess current load based on actual usage numbers (audio, video, listen-only)". Going from "guess current load based on actual usage numbers (audio, video, listen-only)" to "guess current load based on actual usage numbers (audio, video, listen-only) and take future conferences into account" is a much, much smaller step, and maybe one that turns out to be impossible[1]. So why wait for the latter? Why not implement the former for the time being? It would obviously help out a lot of people, based on the comments posted here and in other issues & PRs.

So pleeeease, think seriously about implementing a hands-off approach of simply taking the current load, weighted by active stream types, into account. I'm pretty sure it'd save 90% of the problems people have here.

[1] Possible reasons why it might be impossible & why it isn't a panacea:
#476 was merged and it is planned to be part of v1.3
Hmm... Maybe I'm wrong, but it doesn't seem to exist in the codebase of 1.3.3.1. Oh well, I'll keep the branch from the PR available; just patch your installation if you need/want this.
For example, if you have two servers and there are three meetings running:
If a fourth meeting now wants to start, it will be placed on server1, but it would be better on server2. Or did I misunderstand something?
That's because the server load is simply the number of meetings on a server, as seen here: https://github.com/blindsidenetworks/scalelite/blob/master/lib/tasks/poll.rake#L36
and here (only incremented on meeting create): scalelite/app/controllers/bigbluebutton_api_controller.rb, line 149 at commit 89a658a
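A tiny illustration of the effect this produces (server names and counts are hypothetical, since the example distribution above isn't shown):

```ruby
# Sketch: with load = meeting count, the server with fewer meetings wins
# even when it carries far more users.
Server = Struct.new(:name, :meetings, :users)

servers = [
  Server.new('server1', 1, 150), # one big meeting
  Server.new('server2', 2, 6)    # two tiny meetings
]

puts servers.min_by(&:meetings).name
# => server1, despite its 150 users; server2 would be the better choice
```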