Replace the use of StringBuilder with StringBuffer #77158

Open
wants to merge 1 commit into
base: master

Conversation

komugi1211s
Contributor

Replaces StringBuilder with StringBuffer across the codebase. StringBuffer has been present in the codebase for quite a while and offers similar functionality to StringBuilder, but with improved performance, as measured by the benchmark done by @SlugFiller.

It's mostly a full-on search-and-replace, but there are a few modifications to make it compile as well.

@komugi1211s komugi1211s force-pushed the use_string_buffer branch 2 times, most recently from 218f001 to 49f2d14 Compare May 17, 2023 13:05
@lawnjelly
Member

lawnjelly commented May 17, 2023

I tested this out of interest in Godbolt, and since C++17 it looks like you can omit the brackets, i.e. StringBuffer mybuf instead of StringBuffer<> mybuf.

I think master (at least) is C++17, so that might be worth doing. This is called "class template argument deduction" (CTAD).
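To make the bracket-free form concrete, here is a minimal sketch. `SmallBuffer` and `default_capacity` are hypothetical stand-ins, not the real StringBuffer, and the example assumes a C++17 compiler.

```cpp
// Hypothetical stand-in for a class template whose only parameter has a default.
template <int SHORT_BUFFER_SIZE = 64>
class SmallBuffer {
	char32_t short_buffer[SHORT_BUFFER_SIZE];

public:
	static constexpr int capacity() { return SHORT_BUFFER_SIZE; }
};

// Returns the capacity selected when no template argument list is written.
inline int default_capacity() {
	SmallBuffer<> with_brackets; // accepted before C++17 as well
	SmallBuffer deduced; // C++17 and later: same type as SmallBuffer<64>
	(void)with_brackets;
	return deduced.capacity();
}
```

As far as I know, the deduced form only applies to variable declarations; function parameters and class members still need the explicit `<>`.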

As StringBuffer is so vague, as far as naming is concerned, it might be worth considering going the other way, and renaming StringBuffer to StringBuilder, which seems a more descriptive name (I don't know what others think on this).

P.S. Also, for things like this we should ask @reduz, as he likes to be involved in such decisions, and it can save some effort if he prefers a different approach. 👍

Also as @SlugFiller pointed out, we should benchmark this with some longer / variable sized strings to make sure it's always a win as he was benchmarking with small strings (5 chars).

@@ -35,7 +35,7 @@

template <int SHORT_BUFFER_SIZE = 64>
class StringBuffer {
-	char32_t short_buffer[SHORT_BUFFER_SIZE];
+	char32_t short_buffer[SHORT_BUFFER_SIZE] = {};
Member

I don't know the StringBuffer class, but I would double check this is required.

This is kind of controversial - it might lead to needless extra processing if used in a bottleneck, if not optimized out. Profiling / benchmarking may help, maybe there's no cost in practice but it would be nice to show this first. 🤷‍♂️

People often do this kind of thing thinking it is somehow "safer", but this is an illusion. Provided you aren't reading from uninitialized data, there's no problem. Personally, if it's a small structure and not a bottleneck, I'd be more likely to zero it. If it's a bottleneck, no. If it's a large buffer, no (because zeroing commits the pages).
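As an illustration of "not reading from uninitialized data", here is a minimal sketch in which the storage is left uninitialized and a separate length field guards every read. `TrackedBuffer` is a hypothetical type for this example, not the real StringBuffer.

```cpp
#include <cstddef>

// Hypothetical fixed-size buffer: the array is never zeroed, and only the
// first `length` elements, which have all been written, are ever read.
struct TrackedBuffer {
	static constexpr std::size_t CAPACITY = 64;

	char32_t data[CAPACITY]; // deliberately left uninitialized
	std::size_t length = 0;

	// Appends one character; returns false when the buffer is full.
	bool push(char32_t c) {
		if (length >= CAPACITY) {
			return false;
		}
		data[length++] = c;
		return true;
	}

	// Reads are bounds-checked against `length`, not against CAPACITY.
	char32_t at(std::size_t i) const {
		return i < length ? data[i] : U'\0';
	}
};
```

Whether the extra `= {}` costs anything measurable here would still need profiling, as noted above.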

Contributor Author

Thank you for the review! I was unsure about this line myself, so I am thankful that you pointed it out.
You are right about the observation: this line felt "safe" to me because, when I added a const variation of as_string, I was confused by how the length of the string is treated.

I will work on this as soon as I have time available.

@komugi1211s
Contributor Author

I tried to measure the performance difference between StringBuilder, StringBuffer<>, and String::operator+= with variable-sized strings, as @lawnjelly suggested, using Godot's source code from the core, editor and scene directories.
Each source file (cpp and header files) from the directories listed above was loaded line by line into a LocalVector<String>; I then measured how long each method takes to concatenate all lines into one big contiguous string. Each method was iterated 50 times. Interestingly, on my computer, using String::operator+= was faster than using StringBuffer.

I'm not really sure these results are trustworthy, since I'm unfamiliar with the basics of good benchmarking. They should be taken with a huge grain of salt, but these are the results that I observed:

some info about data:

Total File Processed: 1381 files
Total Line Collected: 8996381 lines
Line, short lines only (length < 50): 8803860 lines
Line, long lines only (length > 1000): 720 lines

File with biggest sizes:  editor/builtin_fonts.gen.h (size 14739878)
File with smallest sizes: core/disabled_classes.gen.h (size 112)
File with longest lines:  editor/editor_icons.gen.h: line 424 (length 52619)

Debug Build

[StringBuilder] iterated 50 times: all lines :  79.781862  sec.
[StringBuilder] iterated 50 times: short lines (length < 50) only :  75.36788  sec.
[StringBuilder] iterated 50 times: long lines (length > 1000) only :  0.222945  sec.

[StringBuffer<>] iterated 50 times: all lines :  19.765698  sec.
[StringBuffer<>] iterated 50 times: short lines (length < 50) only :  18.431823  sec.
[StringBuffer<>] iterated 50 times: long lines (length > 1000) only :  0.067781  sec.

[String::operator+=] iterated 50 times: all lines :  18.939201  sec.
[String::operator+=] iterated 50 times: short lines (length < 50) only :  17.538249  sec.
[String::operator+=] iterated 50 times: long lines (length > 1000) only :  0.017776  sec.

Release Build

[StringBuilder] iterated 50 times: all lines :  22.623217  sec.
[StringBuilder] iterated 50 times: short lines (length < 50) only :  20.618615  sec.
[StringBuilder] iterated 50 times: long lines (length > 1000) only :  0.101997  sec.

[StringBuffer<>] iterated 50 times: all lines :  6.53894  sec.
[StringBuffer<>] iterated 50 times: short lines (length < 50) only :  5.820837  sec.
[StringBuffer<>] iterated 50 times: long lines (length > 1000) only :  0.032712  sec.

[String::operator+=] iterated 50 times: all lines :  4.773253  sec.
[String::operator+=] iterated 50 times: short lines (length < 50) only :  4.164241  sec.
[String::operator+=] iterated 50 times: long lines (length > 1000) only :  0.01728  sec.

@komugi1211s komugi1211s marked this pull request as ready for review May 22, 2023 01:14
@komugi1211s komugi1211s requested review from a team as code owners May 22, 2023 01:14
@lawnjelly
Member

Just to recap: to know how to proceed, we need to do more benchmarking, in case String is faster than either alternative approach. It is indeed very easy to mess things up in a benchmark and have things e.g. optimized out, or have hot / cold cache effects.

Either using String or StringBuffer seem to be quite a bit faster according to the benchmarks above, but we need to know which is best before deciding how to proceed.
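One common way to reduce the risk of the work being optimized out is to make the timed loop return something that depends on every iteration. The sketch below uses `std::string` and `std::chrono` as stand-ins for the engine types; `BenchResult` and `bench_concat` are hypothetical names, and this is not the harness used for the numbers above.

```cpp
#include <chrono>
#include <cstddef>
#include <string>
#include <vector>

struct BenchResult {
	double seconds; // wall-clock time for all iterations
	std::size_t checksum; // sum of result lengths, so the results are observed
};

// Concatenates all lines into one string, `iterations` times, and times it.
inline BenchResult bench_concat(const std::vector<std::string> &lines, int iterations) {
	using clock = std::chrono::steady_clock;
	std::size_t checksum = 0;
	const auto start = clock::now();
	for (int i = 0; i < iterations; i++) {
		std::string out;
		for (const std::string &line : lines) {
			out += line;
		}
		checksum += out.size(); // depends on the work done in this iteration
	}
	const auto stop = clock::now();
	return { std::chrono::duration<double>(stop - start).count(), checksum };
}
```

Printing or otherwise using `checksum` after the run gives the result an observable effect. It does nothing about hot / cold cache effects, which need separate care (for example a warm-up run).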

@komugi1211s
Contributor Author

I am sincerely sorry for the late response; I've been busy lately with my work.
I've been slowly collecting benchmarks and investigating why my previous benchmark came out that way, and I think I have enough information now.

I feel like I should've asked someone to supervise my approach before doing this 😅 I tried my best to make accurate measurements and to see if I can improve performance here and there.


The reason String::operator+= was faster than StringBuffer<> across the board was that the right-hand value I used for concatenation was already a String, not a const char*. StringBuffer<>(String) internally uses a while loop to determine the length of the string before concatenating, as opposed to String::operator+=(String), which just uses String::length under the hood.

To confirm my suspicion, I benchmarked both StringBuffer<> and String+= using const char* and String.
Strings of each size (16, 64, 128, 256, 512) were concatenated into one StringBuffer<> or String 1000 times, and that process was then repeated 1000 times (1000 repeats, 1000 workload).

This was the result before the change:
[image: perf_before]

I updated StringBuffer<>(String) with a similar approach to String+=(String), and measured again.

This was the result after the change in StringBuffer<>(String):
[image: perf_after]


Next, the performance difference of StringBuffer<>(const char*) between debug and release was suspicious. I wasn't sure why it wasn't gaining as much speed as String::operator+=(const char*), so I looked into it and found that these lines are the culprit:

for (const char *c_ptr = p_str; *c_ptr; ++c_ptr) {
	buf[string_length++] = *c_ptr;
}

Because this constantly dereferences the pointer to see whether it has reached the end of the string (instead of relying on the result of strlen prior to this section), this for loop wasn't being optimized at all.

This is the compiler output of said lines for the release build:
[image: image_before_vectorization]

Replacing the loop above with a simple for loop that uses the result of strlen improved performance drastically.
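The two loop shapes can be sketched side by side as follows. `widen_sentinel` and `widen_counted` are hypothetical free functions, not the actual StringBuffer code, and the `unsigned char` cast is my own choice to avoid sign extension. Whether a given compiler vectorizes the counted form depends on the compiler and its flags.

```cpp
#include <cstddef>
#include <cstring>

// Sentinel form: the exit condition depends on each byte that is read,
// so the trip count is unknown when the loop starts.
inline std::size_t widen_sentinel(const char *src, char32_t *dst) {
	std::size_t n = 0;
	for (const char *c = src; *c; ++c) {
		dst[n++] = static_cast<unsigned char>(*c);
	}
	return n;
}

// Counted form: strlen runs first, so the copy loop has a fixed trip count.
inline std::size_t widen_counted(const char *src, char32_t *dst) {
	const std::size_t len = std::strlen(src);
	for (std::size_t i = 0; i < len; i++) {
		dst[i] = static_cast<unsigned char>(src[i]);
	}
	return len;
}
```

Both functions assume `dst` has room for every character of `src`; the caller is responsible for sizing it.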

This was the result after the changes in StringBuffer<>(String) and StringBuffer<>(const char*):
[image: perf_vector]

This was the compiler output after the for-loop change, which is doing what I wanted it to do:
[image: image_aftor_vectorization]

@lawnjelly
Member

Great work! 👍
Just to let you know this is not forgotten, just waiting as I think @reduz may be on holiday, so we can check he is okay with this.

@reduz
Member

reduz commented Aug 18, 2023

I am sorry, but I just don't see how this can have better performance than StringBuilder, especially if some pending optimizations are made to StringBuilder (moving the internal vectors to a LocalVector with reserve usage). The approach in StringBuilder pretty much does not allocate memory until the end, and it does not do unnecessary memory copying. If anything, to me it should be the other way around (no idea why StringBuffer exists, as it has a worse approach).

@SlugFiller
Contributor

@reduz You can try benchmarking. The current benchmarks show a massive difference. Also, even if there was a different yet more efficient approach, it could simply replace StringBuffer as well. There would still be no reason to have two different implementations of the same thing.

I could explain the theory of why StringBuilder is so significantly slower, as I have many times, but such a theoretical explanation is irrelevant in the face of the practical benchmark data.

@lawnjelly
Member

lawnjelly commented Aug 20, 2023

> I could explain the theory of why StringBuilder is so significantly slower, as I have many times, but such a theoretical explanation is irrelevant in the face of the practical benchmark data.

We do try to keep a good explanation of the reasoning for PRs available from the PR, simply so it is easier for reviewers to come in and read all the relevant information rather than just effectively a "trust me, this is better".

Afaik the original discussion took place on rocket chat, I'd forgotten the details myself as it was a while ago:
https://chat.godotengine.org/channel/devel?msg=27YwiqFc8ndSnAnes

StringBuilder: #10860
StringBuffer: #10844

@komugi1211s
Contributor Author

@reduz sorry for taking so long to respond! I wished to post a response sooner, but I was busy with my work.

StringBuffer performs better than StringBuilder because the concatenation is performed in a simple loop, allowing the compiler to optimize it. This comes at the cost of possibly allocating excess memory in some situations. I admit that I did not measure the amount of memory allocated for each case, only the speed in a controlled environment. I would like to measure the memory when I have time.

Given that StringBuilder is used to generate the compiled shader code from the visual shader editor, I hope that this provides a better experience to people who like the feature, although I do think that the gain is insignificant aside from this particular case.

If possible, I would love to get some advice on where to go with this PR. I think from the performance perspective this is better suited for general use, but I am not yet familiar with the codebase so I may be missing something important.

@AThousandShips
Member

> The StringBuffer performs better compared to StringBuilder because the concatenation is performed in a simple loop, allowing for the compiler to optimize

Optimize how and what?

@komugi1211s
Contributor Author

@AThousandShips After re-reading my previous comment, I realized that I did not provide any substance for my claim; I apologize. I will try to explain as much as I can.

I was thinking about the append method that uses a for loop to append a const char*, which the compiler can now replace with SIMD instructions that process multiple bytes at once, resulting in better performance. I am not sure if the same level of optimization is being applied to (or can be applied to) StringBuilder, but looking at the benchmark taken in a release build, it is unlikely. Unfortunately, this seems to apply only to the const char * path.

When appending a String, on the other hand, although the concatenation itself uses an identical method (simply using memcpy), StringBuilder has a slight overhead from copying the string object, because it needs to store a copy of it in order to defer the actual concatenation to the end. This adds up when gathering a lot of small strings (which seems to be the original use case of StringBuilder). StringBuffer does not have the same problem.

I hope this answers the question, and let me know if I missed anything.

@komugi1211s komugi1211s requested a review from a team as a code owner November 5, 2024 14:20
- Replaces the use of StringBuilder with StringBuffer.
- updates StringBuffer to be more performant on certain cases.
@YYF233333
Contributor

I came across this by accident; it seems this PR has a similar goal to #97777.

Firstly I have a question:

> I was thinking about the append method that uses for loop to append const char*, which can now be replaced with SIMD instruction to process multiple bytes at once, resulting in a better performance.

Do you mean this?
https://github.com/godotengine/godot/blob/9e6098432aac35bae42c9089a29ba2a80320d823/core/string/string_buffer.h#L100C1-L102C3

I think StringBuilder does exactly the same thing.
https://github.com/godotengine/godot/blob/9e6098432aac35bae42c9089a29ba2a80320d823/core/string/string_builder.cpp#L84C1-L86C5

And not relevant but I think both of them can utilize memcpy for better performance since they know the length of the c string.


Back to StringBuffer: I think it is ill-designed. It is basically just a String with an on-stack cache, and would be better renamed to InlinableString or similar. This is a pretty common optimization for strings; take this Rust version as an example.

StringBuilder is a fundamentally different thing. What it wants to solve is the repeated resizing during string concatenation. E.g. if you have three strings "aaa", "bbb", "ccc", using += resizes twice, but using StringBuilder resizes only once.
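The "resize once" idea can be sketched as a builder that only records the pieces and their total length, then allocates a single time when the final string is requested. `DeferredBuilder` is a hypothetical illustration built on `std::string` and `std::string_view`, not Godot's StringBuilder.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical deferred builder: append() only records a view of the piece.
class DeferredBuilder {
	std::vector<std::string_view> pieces; // non-owning views of the inputs
	std::size_t total_length = 0;

public:
	DeferredBuilder &append(std::string_view piece) {
		pieces.push_back(piece);
		total_length += piece.size();
		return *this;
	}

	// One reservation for the output, then one copy per recorded piece.
	std::string as_string() const {
		std::string out;
		out.reserve(total_length);
		for (std::string_view piece : pieces) {
			out.append(piece);
		}
		return out;
	}
};
```

Note that `pieces` is itself a growing container, so recording the pieces has its own allocation cost, and the views are only valid while the appended strings are still alive.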

The problem with StringBuilder is that it is too heavyweight: it contains three Vectors and is slow to construct. The solution is to reuse them as much as possible (static/thread_local).

So in summary, I think StringBuffer is another type of string; it is not a builder, so we cannot use it to replace StringBuilder. The overhead of StringBuilder should be optimized, but we should not go back to operator +=.

The inline cache itself is a good thing, and maybe we can add that to String? It should be a big perf gain for small string concatenation.

@SlugFiller
Contributor

It is annoying for me to keep having to explain it, but I'll do it one more time. StringBuilder isn't "different". It is wrong. There is no scenario in which it makes sense to use the algorithm it uses. Now I will explain why.

StringBuilder attempts to take a bunch of strings, pre-allocate the buffer into which they are inserted, and then copy them into said buffer. In theory, this seems to make sense. It avoids any need to reallocate the buffer as you go.

Where it goes wrong is in the "take a bunch" bit. It doesn't accept an iterator that returns the strings on the fly. Rather, the strings are added one by one, before being iterated at the generation step. And the way it "adds" a string is by inserting it into a Vector (actually, 2 different ones for each add, but that just increases the already existing issue).

The issue is that appending an element to a Vector is, in and of itself, just as expensive as enlarging a buffer to be able to copy an extra string into it. This means that just at the "take a bunch" step, before the accumulation even begins, StringBuilder has already used the same amount of resources as StringBuffer uses in its entire operation.

As a result, every single instruction of the accumulation step is overhead. By default, StringBuilder will always, and in every possible scenario, simply be a slower version of StringBuffer. There is nothing it can do, faster or otherwise, that StringBuffer can't.

I honestly, swear to Carmack, thought this is obvious. But if it wasn't, then here, a full explanation as to why StringBuilder uses a suboptimal algorithm. If it's still unclear, I'm open to any clarifying questions.

The only scenario in which StringBuilder's algorithm makes sense is if you somehow already have an array or iterator of strings. An example where this might happen in practice is if you have a pre-made array of strings, and you're constantly replacing one in the middle and rebuilding the entire concatenation. However, StringBuilder doesn't have a method which accepts an array or vector, nor a method for replacing one of the strings in the middle. In other words, there is no way to use it, as it is currently built, in a way that justifies the algorithm it uses.

By contrast, StringBuffer is significantly faster, because it only adds each string once, and doesn't have a final accumulation step. In other words, it has 0 overhead, and is the fastest possible algorithm for appending strings one at a time. And the benchmarks confirm this.

P.S. Making the StringBuilder vectors static threadlocal could help, reducing allocations during the "take a bunch" step. It would come with a cost: You'd only be able to use one StringBuilder at a time (in any given thread), making the process bug-prone. For instance, this would not work:

StringBuilder str1;
StringBuilder str2;
str1 += "prefix_";
str2 += "different_";
if (strs.size() > 0) {
  str1 += strs[0];
  str2 += strs[0];
}
for (int i = 1; i < strs.size(); i++) {
  str1 += "_";
  str2 += "_";
  str1 += strs[i];
  str2 += strs[i];
}
str1 += "_suffix";
str2 += "_and_unique";
String res1 = str1;
String res2 = str2; // If static threadlocal buffers are used, this would be the same as res1, even though it's not supposed to be
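The aliasing problem can be reproduced in a small standalone sketch. `SharedBuilder` is a hypothetical class that keeps its text in one thread-local buffer shared by every instance; it is not how StringBuilder is implemented today, and clearing the buffer in the constructor is my own assumption.

```cpp
#include <string>

// Hypothetical builder whose storage is shared by all instances on a thread.
class SharedBuilder {
	static std::string &storage() {
		static thread_local std::string buffer;
		return buffer;
	}

public:
	// Each new builder resets the shared buffer, discarding earlier content.
	SharedBuilder() { storage().clear(); }

	SharedBuilder &operator+=(const char *text) {
		storage() += text;
		return *this;
	}

	std::string as_string() const { return storage(); }
};
```

With two builders alive at once, `a += "prefix_"` followed by `b += "different_"` leaves both reporting `"prefix_different_"`, because they write into the same buffer.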

But this exact same solution could be applied to StringBuffer - using a static threadlocal buffer to hold the final string, then copying a substring (with a single allocation) as a replacement "accumulation" step. And it would carry the same issue: As soon as you create a new StringBuffer, any existing one becomes invalid, as it overwrites the same buffer. At this point, you might as well have a string-building singleton.

@clayjohn
Member

@YYF233333 Thanks for bringing attention to this. I vaguely remember the discussion from last year on the contributor chat about this, similar to here, the chat there was pretty divided on whether this change would or would not be faster. But notably, nobody tested the performance inside of Godot.

The benchmarking done in this PR is helpful, but not conclusive since it is a synthetic benchmark.

I profiled both the relevant code in this PR and the code in #97777 by running a scene that compiles 300 shaders in one frame. While #97777 is significantly faster than master (about 2x faster), this PR is faster than #97777 by about 3x.

For compiling 300 GLES3 shaders in one frame master spends about 200 ms in StringBuilder, #97777 spends about 90 ms (almost all of that time is in as_string()), and this PR spends about 28 ms (all of that time is in append).

Do you mind testing the same scene in #97777 with the code from this PR to compare performance?

@YYF233333
Contributor

Firstly, thanks for your detailed explanation @SlugFiller. I still have some points to discuss.

> Where it goes wrong is in the "take a bunch" bit. It doesn't accept an iterator that returns the strings on the fly. Rather, the strings are added one by one, before being iterated at the generation step. And the way it "adds" a string is by inserting it into a Vector (actually, 2 different ones for each add, but that just increases the already existing issue).

The problem is that an iterator doesn't exist in this use case; typically the strings you want to concatenate are local variables or string literals scattered across the stack. You need some bookkeeping, a builder, to gather them up. Bookkeeping does have overhead, so we use it only when simple += becomes a bottleneck.

> The issue is that appending an element to a Vector is, in and of itself, just as expensive as enlarging a buffer to be able to copy an extra string into it. This means that just at the "take a bunch" step, before the accumulation even begins, StringBuilder has already used the same amount of resources as StringBuffer uses in its entire operation.

Sorry, I forgot we are still using Vector in master. Vector::push_back resizing to size()+1 is frustrating, but luckily we have LocalVector, which does a normal x2 resize. I optimized this in my PR.
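The difference between the two growth policies can be shown with a toy model that only counts capacity changes. The functions below are hypothetical and model the policies as described in this comment; they do not measure Godot's Vector or LocalVector.

```cpp
#include <cstddef>

// Counts capacity changes when growing to exactly size + 1 on every append.
inline std::size_t reallocs_exact_fit(std::size_t appends) {
	std::size_t capacity = 0;
	std::size_t count = 0;
	for (std::size_t size = 0; size < appends; size++) {
		if (size + 1 > capacity) {
			capacity = size + 1;
			count++;
		}
	}
	return count;
}

// Counts capacity changes when the capacity doubles each time it runs out.
inline std::size_t reallocs_doubling(std::size_t appends) {
	std::size_t capacity = 0;
	std::size_t count = 0;
	for (std::size_t size = 0; size < appends; size++) {
		if (size + 1 > capacity) {
			capacity = capacity == 0 ? 1 : capacity * 2;
			count++;
		}
	}
	return count;
}
```

In this model, 1000 appends cause 1000 capacity changes with exact-fit growth and 11 with doubling (capacities 1, 2, 4, ..., 1024).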

> By contrast, StringBuffer is significantly faster, because it only adds each string once, and doesn't have a final accumulation step. In other words, it has 0 overhead, and is the fastest possible algorithm for appending strings one at a time. And the benchmarks confirm this.

The problem is that += does exactly the same thing, and the benchmark proves this: the performance of StringBuffer and += is at the same level. Then why use StringBuffer instead of direct concatenation?

The only use of StringBuffer, in VariantParser::get_token, shows that. What it actually wants is a stack array to store a small string. If the string is too long, it falls back to normal +=. This is good, and we can use this strategy in more places, but it should be named InlinableString, not StringBuilder/StringBuffer.
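The "stack array first, heap fallback later" strategy can be sketched as follows. `InlineStringSketch` is a hypothetical class using `std::u32string` for the heap part; it illustrates the idea only and is not the StringBuffer implementation.

```cpp
#include <cstddef>
#include <string>

// Hypothetical small-string sketch: characters go into a fixed inline array
// until it is full, then everything moves to a heap-backed string.
template <std::size_t INLINE_CAPACITY = 64>
class InlineStringSketch {
	char32_t inline_buf[INLINE_CAPACITY];
	std::u32string heap; // used only after the inline storage overflows
	std::size_t length = 0;
	bool spilled = false;

public:
	void append(char32_t c) {
		if (!spilled && length < INLINE_CAPACITY) {
			inline_buf[length++] = c;
			return;
		}
		if (!spilled) {
			heap.assign(inline_buf, length); // copy inline content once
			spilled = true;
		}
		heap.push_back(c);
		length++;
	}

	bool uses_heap() const { return spilled; }

	std::u32string as_string() const {
		if (spilled) {
			return heap;
		}
		return std::u32string(inline_buf, length);
	}
};
```

Short strings never touch the allocator until `as_string()` is called; longer ones pay one extra copy at the moment they spill.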

So if the profile shows that StringBuilder is slower than direct concatenation, feel free to fall back to +=, but that should be decided case by case, not by a batch find-and-replace.

@YYF233333
Contributor

YYF233333 commented Nov 21, 2024

> For compiling 300 GLES3 shaders in one frame master spends about 200 ms in StringBuilder, #97777 spends about 90 ms (almost all of that time is in as_string()), and this PR spends about 28 ms (all of that time is in append).

That's expected, and it probably means the bookkeeping overhead is larger than the resize overhead, so we should use += instead. Theoretically, in this case, I think the time cost from low to high is:

  • += with a stack buffer == StringBuffer
  • += only
  • static StringBuilder
  • StringBuilder only

> Do you mind testing the same scene in #97777 with the code from this PR to compare performance?

I will if I have time.

Edit: Test done. I cannot find a performance difference between StringBuffer, StringBuilder and String. The test case is no longer valid. We need a new benchmark.

Rebase #97777 against latest master, merge this PR, and change only tooltip's type. Build with scons production=yes debug_symbols=yes. Test with time .\godot.windows.editor.x86_64.console.exe -e --path ".\ManyNodes\" --quit-after 9. One run warmup, run three times.

  • String: 22.02s, 22.25s, 21.80s
  • StringBuilder: 22.21s, 21.99s, 21.82s
  • StringBuffer: 22.03s, 21.90s, 22.21s

P.S. This PR reminds me that I'm trapped by the old implementation. We should directly use a String as buffer since we need to return one, whatever else buffer we use is adding another memcpy. Will correct this ASAP.

Again, I'm not against this class. I'm against its name and replacing all StringBuilder.

@SlugFiller
Contributor

> You need some bookkeeping, a builder, to gather them up. Bookkeeping do has overhead, so we use it only when simple += becomes bottleneck.

And my point was that the bookkeeping is equally expensive as +=. Except the bookkeeping requires you to do a collection step later, and += does not. Therefore, it is better to do += instead of the bookkeeping.

> Sorry I forget we are still using Vector in master. Vector::push_back resize to size()+1 is frustrating, but luckily we have LocalVector which does normal x2 resize. I optimize this in my PR.

That, in and of itself, is not the issue. It merely contributes to the issue. The issue is that appending to the Vector or LocalVector is as expensive as expanding a buffer.

> Problem is, += does exactly the same thing, and the benchmark proves this, the performance of StringBuffer and += are at same level. Then why use StringBuffer instead of direct concat?

You're suggesting replacing StringBuffer with just String. This is... fair. It largely depends on how String is implemented. I believe, however, that String has additional copy-on-write logic, since it's supposed to be reusable and immutable, whereas StringBuffer does not. But don't take my word for it, I didn't dive into the String source code yet.

> So if the profile shows that StringBuilder is slow than direct concat, feel free to fallback to +=, but that should be case by case, not find-and-replace in a batch.

It's not just the profile, although the profile is pretty conclusive. Logically, StringBuilder will always be slower than +=, in every single possible case. This is an inevitable result of its design. This is why a search-and-replace is appropriate.

It's like replacing bubble sort with quicksort. There's no conceivable condition under which the former would be preferable, other than maybe code simplicity. And StringBuilder's algorithm is anything but simple.

@komugi1211s
Contributor Author

Hi, I am sorry for the lack of activity on my end.

I generally agree with the points made by @SlugFiller, and thank you for the extremely thorough explanation.
I am sorry if I am late to this, but I would like to add my perspective and question to the topic.


> And not relevant but I think both of them can utilize memcpy for better performance since they know the length of the c string.

I don't think we can use memcpy for this part of the code, since the data types of the destination and the source are different (char32_t and char).
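The distinction can be sketched like this: a byte-wise copy works when source and destination have the same element type, while char to char32_t needs a per-element conversion. `widen` and `concat_same_type` are hypothetical helpers using `std::u32string`, not code from this PR.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

static_assert(sizeof(char32_t) != sizeof(char),
		"element sizes differ, so the bytes cannot be copied one-to-one");

// char -> char32_t: every element must be converted individually.
inline std::u32string widen(const char *src) {
	const std::size_t len = std::strlen(src);
	std::u32string out(len, U'\0');
	for (std::size_t i = 0; i < len; i++) {
		out[i] = static_cast<unsigned char>(src[i]);
	}
	return out;
}

// char32_t -> char32_t: same element type, so memcpy is applicable.
inline std::u32string concat_same_type(const std::u32string &a, const std::u32string &b) {
	std::u32string out(a.size() + b.size(), U'\0');
	std::memcpy(out.data(), a.data(), a.size() * sizeof(char32_t));
	std::memcpy(out.data() + a.size(), b.data(), b.size() * sizeof(char32_t));
	return out;
}
```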

> You're suggesting replacing StringBuffer with just String. This is... fair. It largely depends on how String is implemented.

As far as I can tell, String::operator+= should not be faster than StringBuffer::operator+= when handling a const char * as input. This is because String::operator+= performs a validity check on each source character, and replaces the character with NUL if it fails the check, before appending it to the destination buffer. It is almost exactly the same as StringBuffer::operator+= except for this check, which is tucked inside the for loop.

I am honestly not sure why we are not doing the same validation check in either StringBuilder or StringBuffer. This might be worth investigating further.

> Again, I'm not against this class. I'm against its name and replacing all StringBuilder.

I agree that the name StringBuffer does not accurately convey the actual purpose of this code. I feel like this name was apt while the usage was fully contained within the get_token code, but expanding it to other places justifies the name change. I think InlinedString is a good idea for the name.


I am struggling to understand the appeal of keeping StringBuilder. I understand that it does reduce the amount of reallocation during the gathering phase, but does the concern of keeping reallocation low come from the speed of realloc operation? or the amount of memory consumed by the power_of_two resize?

This is another issue that I've been wondering about since I read the code of StringBuilder for the first time: is there a guarantee that the const char * kept in the StringBuilder is still alive when StringBuilder::as_string is called? I am wondering if the following code can break it:

StringBuilder builder;
{ // inside function or something, anything that can invalidate foo
  char foo[10] = {}; // not const, so that it can be filled below
  // fill foo somehow
  builder.append(foo);
}

builder.as_string(); // what would the content be like?

@SlugFiller
Contributor

> I am struggling to understand the appeal of keeping StringBuilder. I understand that it does reduce the amount of reallocation during the gathering phase

...but adds an equal or larger amount of reallocations during the append phase, thereby removing any semblance of having an advantage or purpose.

> but is there a guarantee that the const char * kept in the StringBuilder is alive at the time when StringBuilder::as_string is called?

I think there isn't. But in order for your example code to actually create garbage, you need to thrash the stack in between the append and the call to as_string. One way to do it is to call another function, which uses several stack variables. That would corrupt the result.

If this was Rust or Swift, an escaping or linked lifetime guarantee could be required from the const char * to prevent it. But, generally, the const char * version is intended purely for constant strings which are located in the program code, and therefore are guaranteed to last the entire lifetime of the process.
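For contrast, a builder that copies the characters at append time does not depend on the caller's buffer staying alive. `OwningBuilder` and `build_from_temporary` are hypothetical names in this sketch, which uses `std::string`; the price of the safety is one extra copy per appended piece.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical builder that owns a copy of every appended C string.
class OwningBuilder {
	std::vector<std::string> pieces;

public:
	OwningBuilder &append(const char *text) {
		pieces.emplace_back(text); // the characters are copied here
		return *this;
	}

	std::string as_string() const {
		std::size_t total = 0;
		for (const std::string &piece : pieces) {
			total += piece.size();
		}
		std::string out;
		out.reserve(total);
		for (const std::string &piece : pieces) {
			out += piece;
		}
		return out;
	}
};

// Appends from a buffer that goes out of scope before as_string() is called.
inline std::string build_from_temporary() {
	OwningBuilder builder;
	{
		char foo[10] = "tmp";
		builder.append(foo); // safe: builder no longer refers to foo
	}
	return builder.as_string();
}
```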

@clayjohn
Member

> I am struggling to understand the appeal of keeping StringBuilder. I understand that it does reduce the amount of reallocation during the gathering phase, but does the concern of keeping reallocation low come from the speed of realloc operation? or the amount of memory consumed by the power_of_two resize?

Realistically the end result of these conversations should be to remove either StringBuffer or StringBuilder, settle on an implementation of the remaining class, and ensure that class is named StringBuilder.

godotengine/godot-proposals#11241 is interesting and may help many cases of String concatenation, but there is still value in a purpose-built class for concatenating strings that we know are valid in advance so we can avoid validating individual characters and avoid the bookkeeping that comes from resizing a Vector.

@Ivorforce
Contributor

Ivorforce commented Dec 6, 2024

I believe I have run enough tests to make a recommendation.

I ran some of my own benchmarks against StringBuilder, StringBuffer and simply String +=, taking into account what I've learned from working with the classes over the past few days.

Pre-Test

Before starting, I patched StringBuilder to its (I believe) optimal state, because it currently does some unnecessary allocations, and we cannot draw any final conclusions with half-finished implementations. The optimizations are mainly:

  • Build string directly in buffer, instead of building elsewhere and copying over.
  • LocalVector instead of Vector, to avoid CowData overhead.
  • One vector with variant-like entries instead of 3, to linearize data access.

(I tested all 3 optimizations iteratively top to bottom, and each contributes to better scores).

string_builder.h
/**************************************************************************/
/*  string_builder.h                                                      */
/**************************************************************************/
/*                         This file is part of:                          */
/*                             GODOT ENGINE                               */
/*                        https://godotengine.org                         */
/**************************************************************************/
/* Copyright (c) 2014-present Godot Engine contributors (see AUTHORS.md). */
/* Copyright (c) 2007-2014 Juan Linietsky, Ariel Manzur.                  */
/*                                                                        */
/* Permission is hereby granted, free of charge, to any person obtaining  */
/* a copy of this software and associated documentation files (the        */
/* "Software"), to deal in the Software without restriction, including    */
/* without limitation the rights to use, copy, modify, merge, publish,    */
/* distribute, sublicense, and/or sell copies of the Software, and to     */
/* permit persons to whom the Software is furnished to do so, subject to  */
/* the following conditions:                                              */
/*                                                                        */
/* The above copyright notice and this permission notice shall be         */
/* included in all copies or substantial portions of the Software.        */
/*                                                                        */
/* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,        */
/* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF     */
/* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. */
/* IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY   */
/* CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,   */
/* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE      */
/* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.                 */
/**************************************************************************/

#ifndef STRING_BUILDER_H
#define STRING_BUILDER_H

#include "core/string/ustring.h"
#include "core/templates/local_vector.h"

class StringBuilder {
	struct Item {
		String s;
		const char *ptr;
		size_t ptr_len;
	};

	uint32_t string_length = 0;

	LocalVector<Item> strings;
public:
	StringBuilder &append(const String &p_string);
	StringBuilder &append(const char *p_cstring);

	_FORCE_INLINE_ StringBuilder &operator+(const String &p_string) {
		return append(p_string);
	}

	_FORCE_INLINE_ StringBuilder &operator+(const char *p_cstring) {
		return append(p_cstring);
	}

	_FORCE_INLINE_ void operator+=(const String &p_string) {
		append(p_string);
	}

	_FORCE_INLINE_ void operator+=(const char *p_cstring) {
		append(p_cstring);
	}

	_FORCE_INLINE_ int num_strings_appended() const {
		return strings.size();
	}

	_FORCE_INLINE_ uint32_t get_string_length() const {
		return string_length;
	}

	String as_string() const;

	_FORCE_INLINE_ operator String() const {
		return as_string();
	}
};

#endif // STRING_BUILDER_H
string_builder.cpp
/**************************************************************************/
/*  string_builder.cpp                                                    */
/**************************************************************************/
/*                         This file is part of:                          */
/*                             GODOT ENGINE                               */
/*                        https://godotengine.org                         */
/**************************************************************************/
/* Copyright (c) 2014-present Godot Engine contributors (see AUTHORS.md). */
/* Copyright (c) 2007-2014 Juan Linietsky, Ariel Manzur.                  */
/*                                                                        */
/* Permission is hereby granted, free of charge, to any person obtaining  */
/* a copy of this software and associated documentation files (the        */
/* "Software"), to deal in the Software without restriction, including    */
/* without limitation the rights to use, copy, modify, merge, publish,    */
/* distribute, sublicense, and/or sell copies of the Software, and to     */
/* permit persons to whom the Software is furnished to do so, subject to  */
/* the following conditions:                                              */
/*                                                                        */
/* The above copyright notice and this permission notice shall be         */
/* included in all copies or substantial portions of the Software.        */
/*                                                                        */
/* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,        */
/* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF     */
/* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. */
/* IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY   */
/* CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,   */
/* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE      */
/* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.                 */
/**************************************************************************/

#include "string_builder.h"

#include <string.h>

StringBuilder &StringBuilder::append(const String &p_string) {
	if (p_string.is_empty()) {
		return *this;
	}

	strings.push_back({p_string, nullptr, 0});
	string_length += p_string.length();

	return *this;
}

StringBuilder &StringBuilder::append(const char *p_cstring) {
	size_t len = strlen(p_cstring);
	if (len == 0) {
		return *this;
	}

	strings.push_back({String(), p_cstring, len});
	string_length += len;

	return *this;
}

String StringBuilder::as_string() const {
	if (string_length == 0) {
		return "";
	}

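	String string;
	// String storage includes the null terminator, so reserve one extra slot.
	string.resize(string_length + 1);
	char32_t *buffer_ptr = string.ptrw();

	for (uint32_t i = 0; i < strings.size(); i++) {
		const Item &item = strings[i];

		if (!item.ptr) {
			// Godot string: copy the char32_t data in one go.
			const String &s = item.s;

			memcpy(buffer_ptr, s.ptr(), s.length() * sizeof(char32_t));
			buffer_ptr += s.length();
		} else {
			// C string: widen each char to char32_t.
			const char *s = item.ptr;

			for (size_t j = 0; j < item.ptr_len; ++j, ++buffer_ptr, ++s) {
				*buffer_ptr = *s;
			}
		}
	}

	// Terminate the string after the last appended character.
	*buffer_ptr = 0;

	return string;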
}

The Test

Here's the code I ran:

test_main.cpp

(templating should optimize away StrBuilder, just adding it for convenience).

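#include "core/string/string_buffer.h"
#include "core/string/string_builder.h"
#include "core/string/ustring.h"

#include <chrono>
#include <iostream>

// Thin wrapper around String += so it can be passed to the templated test.
struct StrBuilder {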
	String x;

	void append(const String& s) {
		x += s;
	}

	String as_string() {
		return x;
	}
};

template <typename Builder>
void test(const String &s, int runs, int count) {
	for (int r = 0; r < runs; ++r) {
		Builder b;
		for (int i = 0; i < count; ++i) {
			b.append(s);
		}
		int l = b.as_string().length();
	}
}

int test_main(int argc, char *argv[]) {
	String s = String("A").repeat(1);
	int runs = 100;
	int count = 200;

	test<StringBuilder>(s, runs, count);
	auto t0 = std::chrono::high_resolution_clock::now();
	test<StringBuilder>(s, runs, count);
	auto t1 = std::chrono::high_resolution_clock::now();
	std::cout << std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count() << "us\n";

	test<StringBuffer<>>(s, runs, count);
	t0 = std::chrono::high_resolution_clock::now();
	test<StringBuffer<>>(s, runs, count);
	t1 = std::chrono::high_resolution_clock::now();
	std::cout << std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count() << "us\n";

	test<StrBuilder>(s, runs, count);
	t0 = std::chrono::high_resolution_clock::now();
	test<StrBuilder>(s, runs, count);
	t1 = std::chrono::high_resolution_clock::now();
	std::cout << std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count() << "us\n";

	return 0;
}

I ran the test with 4 configurations, differing in String length and append count. The reason is that I suspected StringBuffer to perform the best with many small appends, StringBuilder to perform best with few big appends, and += performing about on-par with StringBuffer for big strings but worse for small strings.

Indeed, this is exactly what we see:

10000 length, 5 runs, 3 appends
StringBuilder: 10us
StringBuffer: 64us
String +=: 17us

20 length, 100 runs, 20 appends
StringBuilder: 88us
StringBuffer: 51us
String +=: 68us

40 length, 200 runs, 20 appends
StringBuilder: 168us
StringBuffer: 173us
String +=: 148us

1 length, 100 runs, 200 appends
StringBuilder: 1253us
StringBuffer: 284us
String +=: 472us

I hope it's clear what the reason for this is:

  • StringBuilder performs badly on very short strings because the book-keeping overhead overtakes the benefit of avoiding big reallocations.
  • StringBuffer performs badly on very long strings because it allocates more than is needed (power-of-2), and re-allocates long strings several times.
  • String += is never considerably better than the others.
  • The handover happens at about 40-length strings.
  • Both StringBuilder and StringBuffer perform badly, by up to a factor of about 6x, when used for strings that are too short or too long for them.
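
To make the power-of-2 point a bit more concrete, here is a rough model of the growth behavior. It is only a sketch under assumptions: `simulate_po2_growth`, `GrowthStats` and `next_power_of_2` are hypothetical names, the buffer is assumed to grow to the next power of two whenever it runs out of space, and the real StringBuffer may behave differently in detail.

```cpp
#include <cstddef>

// Hypothetical helper: rounds n up to the next power of two (for n >= 1).
inline std::size_t next_power_of_2(std::size_t n) {
	std::size_t p = 1;
	while (p < n) {
		p <<= 1;
	}
	return p;
}

struct GrowthStats {
	std::size_t reallocations = 0; // How often the capacity had to grow.
	std::size_t copied = 0; // Characters moved to new storage while growing.
	std::size_t capacity = 0; // Final capacity, in characters.
};

// Simulates appending `count` strings of `length` characters to a buffer that
// grows to the next power of two whenever it runs out of space.
// This is a model for illustration, not Godot's actual implementation.
inline GrowthStats simulate_po2_growth(std::size_t length, std::size_t count, std::size_t initial_capacity) {
	GrowthStats stats;
	stats.capacity = initial_capacity;
	std::size_t used = 0;
	for (std::size_t i = 0; i < count; ++i) {
		if (used + length > stats.capacity) {
			stats.capacity = next_power_of_2(used + length);
			stats.copied += used; // Existing content moves to the new block.
			++stats.reallocations;
		}
		used += length;
	}
	return stats;
}
```

Under this model, `simulate_po2_growth(10000, 3, 64)` ends with a capacity of 32768 characters for 30000 characters of content and moves 10000 characters while growing, whereas `simulate_po2_growth(1, 200, 64)` moves only 192 characters in total. An exact-size build that knows the final length up front allocates once in both cases.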

Discussion

I think this is enough data to make a judgement call.

I propose the following:

  • Rename StringBuffer to ShortStringBuilder and StringBuilder to LongStringBuilder (or something along those lines).
  • StringBuilder should be addressed ASAP for its inefficiencies.
    • I have a PR with a few pretty unopinionated changes here; #97777 is likely to gain more speed but is more opinionated. We could probably proceed in small steps.

Additionally, here are follow-ups that are likely to further optimize string concatenation:

  • For strings for which the components are known at compile-time, concatenate_strings should be used.
  • Repetitive short-string appends in StringBuilder (such as in shader_gles3.cpp) should be avoided. Instead, a longer string should be appended at once.
  • For multiple appends at once, a function extend should be added to both StringBuilder and StringBuffer, to reduce buffer growth overhead. It should be statically evaluated, modeled after concatenate_strings.
  • In cases where several strings of short length are appended to StringBuilder, append(concatenate_strings) should be used instead.
  • char * arguments should be replaced with StrRange arguments, to avoid unnecessary strlen calls, as proposed in "Soft-deprecate the use of implicit reliance on NULL-terminated strings" (#99806).
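
As a rough illustration of the extend idea, here is a sketch built on the standard library rather than Godot's String. `SketchBuilder` is a hypothetical stand-in, not the actual StringBuilder or StringBuffer API, and the real function would likely look different. The point is only that the total length is summed first, so the storage is reserved once per call instead of once per appended part.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical stand-in for a builder; this is not Godot's StringBuilder.
struct SketchBuilder {
	std::string data;

	// Appends all parts after reserving room for them in a single step.
	// Accepts anything a std::string_view can be constructed from.
	template <typename... Parts>
	SketchBuilder &extend(const Parts &...parts) {
		// Sum the lengths of all parts with a fold expression (C++17).
		const std::size_t extra = (std::size_t(0) + ... + std::string_view(parts).size());
		data.reserve(data.size() + extra);
		// Append each part in order; no further growth is needed here.
		(data.append(std::string_view(parts)), ...);
		return *this;
	}
};
```

A call such as `builder.extend("ab", std::string("cd"), std::string_view("ef"))` leaves `builder.data` equal to `"abcdef"`. Note that building a `std::string_view` from a `const char *` still computes the length at runtime, which is the cost the StrRange proposal aims to avoid.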

@SlugFiller
Contributor

  • StringBuffer performs badly on very long strings

"Performs badly" is a misphrasing here. The difference is in microseconds. It's on the level of measurement error, and is roughly 20 times less than the difference in the many appends version.

To be complete, you should probably also test many appends of long strings, or at least increase the run count until the run time is measured in seconds, not in microseconds.

Or you can flip the stop condition, and measure number of runs that can be completed before a second elapses. This should somewhat reduce measurement error, although overhead from repeatedly getting the current time might take over in the case of a very short run.
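
A minimal sketch of the flipped stop condition, using only the standard library. `count_runs_within` and its `batch` parameter are hypothetical names introduced for illustration; reading the clock once per batch is one possible way to limit the timing overhead mentioned above, not the only one.

```cpp
#include <chrono>
#include <cstddef>

// Runs `work` repeatedly until `budget` has elapsed and returns the number of
// completed runs. The clock is read only once per `batch` runs, to limit the
// overhead of repeatedly getting the current time.
template <typename Work>
std::size_t count_runs_within(std::chrono::steady_clock::duration budget, std::size_t batch, Work &&work) {
	const auto deadline = std::chrono::steady_clock::now() + budget;
	std::size_t runs = 0;
	do {
		for (std::size_t i = 0; i < batch; ++i) {
			work();
		}
		runs += batch;
	} while (std::chrono::steady_clock::now() < deadline);
	return runs;
}
```

For the benchmark in this thread, `work` would be a lambda that builds one string with the builder under test, with a budget of `std::chrono::seconds(1)`. A higher returned count then means a faster implementation.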

At any rate, even if we take these measurements to be perfectly accurate, with no errors introduced from resolution limitation of the system clock, being able to earn 50us for the risk of losing 1000us is not a proper tradeoff. It does not justify the use of StringBuilder, even in its optimized form.

@Ivorforce
Contributor

Ivorforce commented Dec 7, 2024

"Performs badly" is a misphrasing here. The difference is in microseconds. It's on the level of measurement error, and is roughly 20 times less than the difference in the many appends version.

Hrm, you are right. Unfortunately, this does cast a shadow on my conclusions.

But I agree with clay that if we went there and tried to replace StringBuilder entirely, StringBuffer should be renamed to StringBuilder, instead of renaming references.

Also, StringBuffer using a short string cache is not suitable for building strings, since the data has to be copied over later anyhow. This should be addressed before proposing the switch.

@Ivorforce
Contributor

Ivorforce commented Dec 7, 2024

Urgh, I just now realized that String.cow_data already grows in power-of-2 steps. This is likely the reason it's doing decently well for concatenation in the first place.

With that in the picture as well, I wonder why String += is even slower than StringBuffer in the first place, because all StringBuffer does is some additional overhead for resizing by power-of-2 (logic that is repeated in the string resize...).

@clayjohn
Member

clayjohn commented Dec 7, 2024

Urgh, I just now realized that String.cow_data already grows in power-of-2 steps. This is likely the reason it's doing decently well for concatenation in the first place.

With that in the picture as well, I wonder why String += is even slower than StringBuffer in the first place, because all StringBuffer does is some additional overhead for resizing by power-of-2 (logic that is repeated in the string resize...).

String has a lot more bookkeeping steps; look at the number of instructions it takes to get to the Po2 resize logic. StringBuffer does the resize up front because it knows it's going to need to resize a lot. Basically, you skip all the String and CoW logic.
