
Classic scoring growth is too steep after scorev2 changes #23763

Closed
WitherFlower opened this issue Jun 5, 2023 · 32 comments · Fixed by #24924

@WitherFlower
Contributor

WitherFlower commented Jun 5, 2023

Type

Game behaviour

Bug description

Because the previous scoring was linear, a power of 2 was applied to the standardised score to make its growth feel like scorev1.

Since scorev2 follows a growth curve very similar to scorev1 (quadratic), the power of two makes classic scoring grow quartically, which feels very wrong when playing.
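
In rough terms, writing $p$ for map progression: scorev1 and scorev2 both grow like $p^2$, so squaring the (scaled) scorev2 value gives

$$\left(p^2\right)^2 = p^4,$$

a quartic curve instead of the intended quadratic one.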

This should (hopefully) be as easy to fix as moving scaledRawScore outside of the Math.Pow operation here:

return (long)Math.Round(Math.Pow(scaledRawScore * Math.Max(1, maxBasicJudgements), 2) * getStandardisedToClassicMultiplier(rulesetId));

I should also mention that making this change resolves the issue of classic scoring having squared mod multipliers, as well as addressing part of #17824

Screenshots or videos

No response

Version

2023.605.0

Logs

runtime.log

@Akarinnnnn

This scaling proposal seems to bring the classicised score from scorev2 closer to previous versions of lazer. Considering that the current scoring's growth curve is different from the legacy lazer scoring system, accepting this scaling algorithm might narrow the gap in displayed score between older versions and the current one.

@Zyfarok
Contributor

Zyfarok commented Aug 21, 2023

Yep, what @WitherFlower is suggesting makes sense.
To show the line modified with what he suggests:

return (long)Math.Round(Math.Pow(scaledRawScore * Math.Max(1, maxBasicJudgements), 2) * getStandardisedToClassicMultiplier(rulesetId));

becomes (by simply moving scaledRawScore)

return (long)Math.Round(scaledRawScore * Math.Pow(Math.Max(1, maxBasicJudgements), 2) * getStandardisedToClassicMultiplier(rulesetId));

With current lazer using scorev2, it's matching quite well v1's progression.

With square-root scoring, however, it might need to be changed depending on what we want to match:
score progression when FCing (~quadratic in v1) OR score progression when missing "regularly" (somewhat linear in v1).
If we apply and keep the change above it would be the latter, but if we want to match the progression when FCing, then it has to be changed to something like this:

return (long)Math.Round(Math.Pow(scaledRawScore, 2/1.5) * Math.Pow(Math.Max(1, maxBasicJudgements), 2) * getStandardisedToClassicMultiplier(rulesetId));

or alternatively (trading issues for other issues...):

return (long)Math.Round((AccScore + ComboScore * Math.Pow(map_progression, 0.5)) * Math.Pow(Math.Max(1, maxBasicJudgements), 2) * getStandardisedToClassicMultiplier(rulesetId));

So that ComboScore grows quadratically when FCing.
(Note: it's impossible to match both scenarios at the same time, due to the difference in combo scaling)
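
(Rough intuition for the $2/1.5$ exponent above: with square-root scoring, standardised score on an FC grows roughly like $p^{1.5}$ in map progression $p$, so

$$\left(p^{1.5}\right)^{2/1.5} = p^2,$$

recovering v1's quadratic FC progression, assuming that $p^{1.5}$ characterisation holds.)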

However I believe it's not worth going down the road of matching the progression when FCing because it's detrimental in other scenarios, thus it makes more sense to stay with the original proposal from @WitherFlower.

@Zyfarok
Contributor

Zyfarok commented Aug 21, 2023

I updated my above answer to fix a small mistake and add my viewpoint (TL;DR: Wither's original proposal is good, and the "non-quadratic scaling when FCing" issue can't really be fixed without bigger downsides)

Just one more thing:
If the goal of "classic scoring growth" is to be used in total score, to allow "score farming" and to let players compare "how much experience" they have in the game through that total score value, then one issue with the current max score scaling with Math.Pow(object_count, 2) is that non-FCs on long maps (especially marathon maps) will give MUCH MORE points than before, due to AccScore not accounting for map length at all and, with the square-root scoring, ComboScore also being "more linear" than v1/v2. Achieving even 90% on a marathon will give a ridiculous amount of score. Thus, it might make sense to either change classic scoring growth to scale less (maybe Math.Pow(object_count, 1.5) or similar? The power needs to be between 1 and 1.5 for sure) or to propose a third alternative that is used for "total score".

This concerns score farmers, but also all people who use "total score" to compare how much experience they have in the game.
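
As a rough worked example (hypothetical numbers, using the osu! multiplier of 36 from the conversion code quoted later in this thread): on an 8000-object marathon,

$$\text{max classic} \approx 36 \cdot 8000^2 \approx 2.3 \times 10^9,$$

and a play worth 90% of standardised score keeps 90% of that (about 2.07 billion), whereas in v1 the combo resets behind such a play would typically cost far more than 10% of the maximum.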

@bdach
Collaborator

bdach commented Aug 22, 2023

I think we should go with @WitherFlower's change for now - any objections?

As for score farming, not sure. It's a very niche community and I'm not sure how to reach them; is there any good place to gather such feedback?

We should probably look to get this done before ppy/osu-queue-score-statistics#134, because if we want to use classic mode for total score, we should probably get it in a good place first.

@WitherFlower
Contributor Author

WitherFlower commented Aug 22, 2023

As for score farming, not sure. It's a very niche community and I'm not sure how to reach them, is there any good place to gather such feedback?

I'm part of the scorefarming community myself, and I'm also present on most discords related to it, namely osu!alternative's discord server (discord.gg/osualt, currently mostly maintained by @respektive), so I can handle the community feedback part.

Also, regarding what Zyf said in terms of classic scoring balancing after #24166, I'm planning to (very soon) run the score conversion algorithm on the aforementioned osu!alternative score database (which contains around 78 million scores) to check whether classic score needs any further adjustments.

I guess that ppy/osu-queue-score-statistics#134 can be looked at once those final adjustments are made, if any.

@WitherFlower
Contributor Author

WitherFlower commented Aug 23, 2023

After applying the score conversion from #24166 and the change I initially proposed in this issue, that is, moving scaledRawScore out of the square operation, I obtained the following results for the top 50 ranked score players using the osu!alternative score database:

https://docs.google.com/spreadsheets/d/1bnK2JaC1wPVeVwQ6FHAC2O38WvaInoDyZJqnOTUpMeg/edit#gid=1635690359

For the technical details, here is the SQL query I ran on the database (which took 26 minutes to run on my home computer 😅): https://gist.github.com/WitherFlower/ce1b20c61a16902d0f4a6f1d4771fca6

A few remarks regarding sources of error in the calculation:

  • The bonus portion was ignored since I don't have access to the maximum bonus portion
  • Modded scores were rescaled using the mod multiplier, as I only had access to the maximum nomod scorev1 score
  • The mod multipliers used are the ones from osu!stable

If I had to give a conclusion, I'd say that this is totally acceptable as:

  • Maps mostly keep the same maximum nomod score values
  • Score progression stays quite close to the one from scorev1
  • Linearity between the two scoring systems is not sacrificed

I will also be gathering feedback from the scorefarming community to ensure that there is no significant pushback regarding the results I got.

@bdach
Collaborator

bdach commented Aug 29, 2023

@WitherFlower are we good to go with your proposed change, or do you still need some time for gathering feedback?

@WitherFlower
Contributor Author

WitherFlower commented Aug 29, 2023

I've received feedback from a few members of the community, including the current first and second place, and the change was mostly a welcome one, so I think we're good to go for the osu! ruleset for now.

There are 2 other issues I have in mind for classic scoring, though.

  • Spinner "clear bonus" doesn't increment score when set to "classic" display mode #17011, which after this change could be solved by making the minimum possible classic score 100k, meaning the current smallest score increase of +10 would never be scaled below +1.
  • Is applying quadratic scaling a good direction for taiko and mania, which weren't using it in stable? I can make a separate issue thread for this one if you'd like.

@bdach
Collaborator

bdach commented Aug 29, 2023

Is applying quadratic scaling a good direction for taiko and mania, which weren't using it in stable? I can make a separate issue thread for this one if you'd like.

Well, I don't know that, and it kinda falls under the scope of this issue, I'd say? I'm not sure why it would be a separate one. It would also imply that we'd have even more divergent implementations of the standardised->classic conversion, which I'm not sure is something we're willing to abide. I'd say we should not have multiple implementations if it can be avoided, but that'll probably need the players' blessing. That said, this already exists, so it may be fine:

/// <summary>
/// Returns a ballpark multiplier which gives a similar "feel" for how large scores should get when displayed in "classic" mode.
/// This is different per ruleset to match the different algorithms used in the scoring implementation.
/// </summary>
private static double getStandardisedToClassicMultiplier(int rulesetId)
{
    double multiplier;

    switch (rulesetId)
    {
        // For non-legacy rulesets, just go with the same as the osu! ruleset.
        // This is arbitrary, but at least allows the setting to do something to the score.
        default:
        case 0:
            multiplier = 36;
            break;

        case 1:
            multiplier = 22;
            break;

        case 2:
            multiplier = 28;
            break;

        case 3:
            multiplier = 16;
            break;
    }

    return multiplier;
}

@ppy/team-client thoughts?

@bdach
Collaborator

bdach commented Sep 4, 2023

@ppy/team-client bumping this again - tl;dr: are we okay with even further divergence in how standardised -> classic scoring is done than what we already have? Any foreseeable problems? I don't immediately see any, but you may do.

@WitherFlower if we decided that we'd be okay with not applying quadratic scaling for taiko/mania, could we ask for your assistance in establishing a better formula for those rulesets?

@WitherFlower
Contributor Author

WitherFlower commented Sep 4, 2023

Sure, I'd be glad to help for the formulas.

Also, I'd like to mention that I ran a survey asking what the direction should be, and judging by the results, mania scorefarmers definitely prefer to use 1-million-based scoring even for classic.

[image: survey results]

Responses for taiko are more split, and I didn't get any input from taiko scorefarmers as there are barely any to begin with, so I guess we need dev input on that one. I'll try asking around the taiko community to get more feedback in the meantime.

@Zyfarok
Contributor

Zyfarok commented Sep 4, 2023

Since mania (and taiko) are not combo games (acc is much more important in the score), it makes no sense to have a score growing "quadratically" like it does in standard (and even less "exponentially", as asked). It might be interesting to know their opinion on linear growth instead.

I'm not sure it would bring many more people to score-farming no matter what you do, though, because score-farming is and will remain a niche, so it might be good to formulate the question differently ("Would you prefer", "Do you think it would make sense"...). Edit: ignore this, the first question is good enough.

@WitherFlower
Contributor Author

@bdach I asked some of the top 10 ranked score taiko players, and the responses I got ranged from "keeping it the same as stable" to "not caring about the change", so I assume it would be a better idea to ditch quadratic scaling for taiko and mania.
For taiko, I can start working on an approximation once I get confirmation that this direction is approved by the devs. For mania, classic will just end up being equal to standardized.


@Zyfarok I disagree with your comment. Linear growth is basically the same as what taiko currently uses, and seeing the rejection from osu!mania players, I think they'd rather see no change at all.
The second question is precisely what I wanted to ask, as, before the poll, other scorefarmers and I suspected that the scene wasn't developed in those gamemodes precisely because the scoring doesn't use quadratic scaling.
Also, "exponential" is just clearer in people's heads than "quadratic", and the poll also included reference values to give an idea of scale.

@Zyfarok
Contributor

Zyfarok commented Sep 7, 2023

seeing the rejection from osu!mania players, I think they'd rather see no change at all.

The rejection of quadratic scaling? I don't see how that would imply that they don't want linear either.

taiko and mania are very similar games and use very similar scoring, so it would make sense for mania to also offer the same linear scaling as taiko for score-farming. I guess players don't like it when things change, though.

@bdach
Collaborator

bdach commented Sep 11, 2023

Since I can't get any response to my pings, I'm just gonna make a judgement call here and say that we are probably fine with having taiko and mania work differently with respect to classic scoring, since they kinda already do.

@WitherFlower if you are able to help out with an estimation for taiko, it'd be very helpful. As a reminder, the criteria are that the classic score must be:

  • derived from standardised score (plus maybe some beatmap-specific variables)
  • unable to reorder scores (i.e. it must be a monotonic function of standardised score)

It would be appreciated if you could provide something along those lines for taiko too. For mania I presume we'd just be using the identity function (i.e. just standardised score directly as classic too).

@WitherFlower
Contributor Author

WitherFlower commented Sep 11, 2023

After reusing the spreadsheet from when we did the estimations for osu and catch, I arrived at the following estimation for taiko:

classicTaikoScore = standardizedTaikoScore / 1_000_000 * hitobjects * 1100

For mania, using the identity function is most likely the way to go indeed.


I think we should also fix the "score not changing on spinners" issue I mentioned earlier in the thread by adding +0.1 to the classic multiplier in all modes, which would make the smallest increase of +10 give at least +1 in classic scoring.

So in the end we'd have:

  • For osu! :
    classicOsuMultiplier = hitObjectCount * hitObjectCount * 36 / 1_000_000 + 0.1
    classicOsuScore = standardizedOsuScore * classicOsuMultiplier

  • For osu!taiko :
    classicTaikoMultiplier = hitObjectCount * 1100 / 1_000_000 + 0.1
    classicTaikoScore = standardizedTaikoScore * classicTaikoMultiplier

  • For osu!catch :
    classicCatchMultiplier = hitObjectCount * hitObjectCount * 28 / 1_000_000 + 0.1
    classicCatchScore = standardizedCatchScore * classicCatchMultiplier

  • For osu!mania :
    classicManiaScore = standardizedManiaScore


One last thing: the hitObjectCount should always be equal to the maximum amount of 300s for a nomod play, as mods can change that number (this is only an issue with strict tracking in standard for now, afaik).
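
Putting the above together, a minimal sketch of the proposed conversion (a hypothetical helper, not the actual osu! codebase method; assumes standardised score is on the usual 0..1,000,000 scale and hitObjectCount is as defined above):

// Hypothetical illustration of the proposed standardised -> classic conversion.
// hitObjectCount = maximum number of full-judgement-awarding objects for a nomod play.
private static long toProposedClassicScore(double standardisedScore, int hitObjectCount, int rulesetId)
{
    double multiplier;

    switch (rulesetId)
    {
        case 1: // taiko: linear in object count.
            multiplier = hitObjectCount * 1100d / 1_000_000 + 0.1;
            break;

        case 2: // catch: quadratic, like osu!.
            multiplier = (double)hitObjectCount * hitObjectCount * 28 / 1_000_000 + 0.1;
            break;

        case 3: // mania: classic is just the standardised score.
            return (long)Math.Round(standardisedScore);

        default: // osu! (and non-legacy rulesets).
            multiplier = (double)hitObjectCount * hitObjectCount * 36 / 1_000_000 + 0.1;
            break;
    }

    // The +0.1 term guarantees the smallest standardised increment (+10) never scales below +1.
    return (long)Math.Round(standardisedScore * multiplier);
}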

@bdach
Collaborator

bdach commented Sep 11, 2023

should always be equal to the maximum amount of 300s for nomod play, as mods can change that number (this is only an issue with strict tracking in standard for now afaik)

Is this at all negotiable? Where is this particular requirement coming from?

It's something we can probably make happen, but it's rather annoying to do and is yet another complication. Instinctively it also doesn't make all that much sense to me.

@WitherFlower
Contributor Author

WitherFlower commented Sep 11, 2023

Is this at all negotiable? Where is this particular requirement coming from?

If a mod (like strict tracking) changes the number of hitobjects / 300s, then the property of "non-reordering" isn't preserved anymore, as some scores get an unfair advantage.

See #19232

Another solution is to disallow mods from doing this, but I don't know how reasonable that requirement is...

@bdach
Collaborator

bdach commented Sep 11, 2023

I mean, we were midway to fixing that one so that issue can't happen, so "fixing the mod" is alright with me for sure. At least in the short term.

@bdach bdach moved this from Needs discussion to Needs implementation in Path to osu!(lazer) ranked play Sep 11, 2023
@bdach bdach self-assigned this Sep 12, 2023
@bdach bdach moved this from Needs implementation to In Progress in Path to osu!(lazer) ranked play Sep 12, 2023
@bdach
Collaborator

bdach commented Sep 15, 2023

@WitherFlower Just for my reference (and to avoid unnecessary back-and-forth): can you provide some reference - spreadsheet or otherwise - that describes how those formulae were derived?

I've implemented the formulae you provided on top of the test scenes added above, and the results seem pretty good when it comes to the general feel and trend, but it looks like the formulae above may result in classic score being slightly inflated. This of course partially depends on parameters like map length and the mystery "score multiplier", and I don't feel like I have a good enough handle on those parameters yet to draw conclusions, but it's something I want to investigate for sure, and it may be better / easier if I have your source materials to cross-check against.

@WitherFlower
Contributor Author

WitherFlower commented Sep 15, 2023

Here is the spreadsheet I used to get the formulas I sent above: https://docs.google.com/spreadsheets/d/1hYsT3U3b0tg9SIMDhmBmqGBVTXASufQ8W0sQ-toifTg/edit#gid=0

Now for a brief rundown of how I used that.

I first took maps at random across all star ratings and lengths to make sure the approximation would cover most cases.
The maximum stable nomod score was taken from each map's nomod leaderboard #1 score; these are almost all SS scores, so not much precision is lost there.

I then searched for the multiplier that minimised the sum of squared relative errors between the estimation and the expected SV1 value (a closed-form sketch of this fit follows the remarks below).

In mathematical terms, we're looking for the following values:

  • For osu and catch :
    $$\underset{m}{\mathrm{argmin}} \left( \sum \left( \frac{m*hitObjects^2 - scoreV1}{scoreV1}\right)^2 \right)$$
  • For taiko :
    $$\underset{m}{\mathrm{argmin}} \left( \sum \left( \frac{m*hitObjects - scoreV1}{scoreV1}\right)^2 \right)$$

A few clarifications:

  • The relative error is used to give equal weight to maps regardless of length, as the absolute errors get very large for marathons, which would skew the tuning of the estimation in their favor.

  • The estimations for osu and catch use $hitObjects^2$, as I found it gave much closer results than either $combo^2$ or $combo*hitObjects$.

  • A correction (25 -> 28) was applied to catch to make the total ranked score values closer, checking with an external scores database.
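
As referenced above, this one-parameter fit has a closed form: writing $r_i = hitObjects_i^2 / scoreV1_i$ (first power of $hitObjects_i$ for taiko), setting the derivative of the objective to zero gives $m = \sum r_i / \sum r_i^2$. A minimal C# sketch of that calculation (a hypothetical helper, not code from the spreadsheet):

// Closed-form minimiser of sum(((m * x_i - y_i) / y_i)^2) over m,
// where x_i is the (squared) object count and y_i the expected scoreV1 value.
// Substituting r_i = x_i / y_i turns the objective into sum((m * r_i - 1)^2),
// whose derivative vanishes at m = sum(r_i) / sum(r_i^2).
private static double fitMultiplier(IReadOnlyList<(double objectTerm, double scoreV1)> maps)
{
    double sumR = 0, sumRSquared = 0;

    foreach (var (x, y) in maps)
    {
        double r = x / y;
        sumR += r;
        sumRSquared += r * r;
    }

    return sumR / sumRSquared;
}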

One last thing: if you want to compare the values between scoreV1 and the estimation, I suggest you use map/leaderboard data directly instead of the scoring test scene, because afaik that simulates an edge case scenario of a map with only circles, which you will very rarely encounter in actual gameplay.

@bdach
Collaborator

bdach commented Sep 15, 2023

I suggest you use map/leaderboard data directly instead of the scoring test scene, cause afaik that simulates an edge case scenario of a map with only circles that you will very rarely encounter in actual gameplay

That is correct, but I'm not sure I'm following why that's an "edge case". While yes, an average map is rarely like that, the test scene still simulates the score algorithm in terms of the general magnitude of score, rate of growth, etc.? So I'm not seeing why it would be that much of an edge case as to discard any findings based on it.

@WitherFlower
Contributor Author

I'm not sure I'm following why that's an "edge case".

If I remember correctly, the test scene lets you adjust the max combo of the simulated score. However, maps with the same max combo will generally contain fewer hitobjects (because of sliders), resulting in scoreV1 values noticeably smaller than what the test scene would suggest. The estimation is based on that general case, so the difference in maximum score between classic and SV1 can appear much larger in the test scene than it actually is on most maps.

tl;dr: big differences in max score in the test scene don't mean much in practice.

In any case, it's just something to keep in mind when comparing maximum score values. Rate of growth shouldn't be affected.

@Zyfarok
Contributor

Zyfarok commented Sep 16, 2023

If we want to have something closer to scorev1 for SSes, it would have to account for the object counts of each kind separately, and possibly slider ticks too. This might be doable without too much hassle though.
The questions are: how close to v1 do we want to get on that maximum score? What do we want to take into account or not? If the goal is to be as precise as possible, then why not simply compute the true v1 SS score and use that? Do we want to simplify for the sake of being easy to understand?

@WitherFlower
Contributor Author

@Zyfarok I'm pretty sure the only goal of classic scoring is to give a feeling similar to that of scoreV1, aka big numbers of the same magnitude. The proposal I gave above accomplishes that with good accuracy, so I'm not sure it's worth going the extra mile by considering more map characteristics for a 5% improvement, which would also add more potential ways in which classic scoring can break in the future (see the current issue with strict tracking).

Not to mention this requires time that could be spent on all the other stuff needed to get permanent lazer ranking out.

On a personal note, I'd be glad to see scoreV1's difficulty multiplier nonsense get deleted from reality, but maybe that's just me...

@bdach
Collaborator

bdach commented Sep 18, 2023

@WitherFlower I've reviewed the spreadsheets you provided. I've got a few follow-up matters.

First of all, after looking at the least-squares estimation method used to determine the multipliers, I'm not sure the results are valid, since the estimation does not appear to include the 0.1 linear factor, which obviously would skew the final result. After adjusting for this, I think the actual formulae should be closer to:

$$ \texttt{standardisedScore} \cdot \left( \frac{\texttt{hitObjectCount}^2 \cdot 34}{1000000} + 0.1 \right) \tag{osu!} $$

$$ \texttt{standardisedScore} \cdot \left( \frac{\texttt{hitObjectCount} \cdot 860}{1000000} + 0.1 \right) \tag{taiko} $$

$$ \texttt{standardisedScore} \cdot \left( \frac{\texttt{hitObjectCount}^2 \cdot 23}{1000000} + 0.1 \right) \tag{catch} $$

-- although, on that last part, 23 is the number that fell out of least squares, and since I can't reproduce this part:

  • A correction (25 -> 28) was applied to catch to make the total ranked score values closer, checking with an external scores database.

I'm not sure what correction - if any - would be appropriate.
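
(To make "adjusting for this" explicit: assuming an SS baseline where standardised score is 1,000,000, the maximum classic score under the proposed formula is $m \cdot \texttt{hitObjectCount}^2 + 100000$, so the osu!/catch objective presumably becomes

$$\underset{m}{\mathrm{argmin}} \left( \sum \left( \frac{m \cdot \texttt{hitObjects}^2 + 100000 - \texttt{scoreV1}}{\texttt{scoreV1}}\right)^2 \right)$$

with the first power of $\texttt{hitObjects}$ for taiko.)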

Can you take a look at the above and corroborate whether I'm correct about the potentially mistaken estimation, and elaborate on what "closer ranked score values" would entail?

Finally, either way, I'd probably be looking to use the same methodology but on a wider gamut of maps, rather than checking only 19 instances per ruleset. But before I go and try that I'd want to make sure that I understand the method correctly and dispel the doubts mentioned above.

@WitherFlower
Contributor Author

WitherFlower commented Sep 18, 2023

@bdach I ran a query on the catch top 1000 data dump from data.ppy.sh to check the balancing of catch, and with the migration to scoreV2 the values are pretty close when using a *23 multiplier, so no adjustments should be needed for that.

| Ranked Score | Classic Scoring Total | user_id |
| --- | --- | --- |
| 936200855408 | 914320006047 | 2400918 |
| 883815307450 | 857431313234 | 3097304 |
| 772926936438 | 888562412850 | 1375955 |
| 653826926408 | 569725973305 | 2351567 |
| 608653952081 | 551371597047 | 4568537 |
| 551415721587 | 472909287449 | 448547 |
| 442640520302 | 460851611326 | 1101600 |
| 404122717518 | 450147678182 | 538717 |
| 382028597623 | 366018635101 | 1952803 |
| 372748926535 | 346267608491 | 5163523 |

I can also confirm that the multipliers you found are correct after taking the +0.1 used for bonus into consideration. That was an oversight on my part, and I've updated my spreadsheet accordingly.

I'd probably be looking to use the same methodology but on a wider gamut of maps, rather than checking only 19 instances per ruleset.

When doing this, I think it's best to include a wide variety of maps rather than just taking maps at random, which would likely result in low difficulties / "low score maps" being overrepresented in the final estimation, as 40% of all maps in osu are below 3 stars and half of all maps in osu give less than 5 million score.

@bdach
Collaborator

bdach commented Sep 18, 2023

I ran a query on the catch top 1000 data dump from data.ppy.sh to check the balancing of catch, and with the migration to scoreV2 the values are pretty close when using a *23 multiplier, so no adjustments should be needed for that.

Cool, thanks!

When doing this, I think it's best to include a wide variety of maps rather than just taking maps at random, which would likely result in low difficulties / "low score maps" being overrepresented in the final estimation, as 40% of all maps in osu are below 3 stars and half of all maps in osu give less than 5 million score.

I was thinking of running that through the full 150k maps from the data.ppy.sh dumps. That should be about as wide as one could ever wish for - but I'll see how fast I can get the estimation to actually calculate in practice...

Maybe I'll add on some weighting to avoid overfit on low difficulties. Not sure.

@WitherFlower
Contributor Author

Maybe I'll add on some weighting to avoid overfit on low difficulties. Not sure.

You could try using the squares of absolute differences instead of relative ones in order to counter the abundance of low difficulties, but it could also have the opposite effect and "buff" marathons too much. The results of that could be interesting, though.
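
In formula terms, that variant would presumably read (for osu!/catch, keeping the +100000 bonus term from your adjusted fit)

$$\underset{m}{\mathrm{argmin}} \left( \sum \left( m \cdot \texttt{hitObjects}^2 + 100000 - \texttt{scoreV1} \right)^2 \right)$$

which effectively weights each map by its raw score magnitude instead of weighting all maps equally.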

@bdach
Collaborator

bdach commented Sep 21, 2023

@WitherFlower for your spreadsheets, where did you source "max score v1" and "object count" data from? Asking because when recalcing over the set of all maps, I'm seeing discrepancies, and knowing your sources may help explain them.

@WitherFlower
Contributor Author

WitherFlower commented Sep 21, 2023

For max scorev1 I just used the #1 nomod score on the maps, those being high-acc FCs in all cases, so pretty close to the actual max value.

The "object count" should be correct for osu, but might be off by a bit for taiko and catch.

For taiko I used the maximum combo, as it is equal to the amount of don/kat hits.
For catch I used the sum of 300s + misses from the leaderboard scores, as slider repeats are normal fruits in game, meaning that using the circle + slider count doesn't work to get the total fruit count.

It's possible that I made a few mistakes as I did all of this by hand, but the "object count" I aimed for is the amount of full-judgement-awarding objects in all gamemodes.
In lazer terms, it should correspond to every judgement except ticks and bonus.

@bdach
Collaborator

bdach commented Sep 21, 2023

Thanks for the info. I suspected that was the case for the high scores; my program, which uses lazer's score V1 simulators, was returning higher values for most maps. Some of that is the fact that the simulators purposefully overestimate bonus; some of it was actual cases of nomod top scores not yet matching autoplay/max score.

I've also had mismatching object counts at the start, but that was my bad and I've sorted it out.

All in all, I should have something concrete to get back with very soon. I am now able to export estimations of top scores on all beatmaps from data.ppy.sh (converts included), so I should be about ready to do some data crunching on those and see how our iterative attempts at the conversion formula are doing (and maybe suggest better ones yet, we'll see).

Source for everything will be provided and fully transparent for review when I'm done.
