Replace kstream::to_string() - not just `int`, fix portability #50

generalmimon · 2022-04-03T01:18:49Z

Fix #28, close #39

Current kaitai::kstream::to_string implementation only works for int and uses snprintf, which is not portable, as pointed out in #28. Also, according to the benchmark at https://www.zverovich.net/2020/06/13/fast-int-to-string-revisited.html, sprintf is almost the slowest function that was tested there (slower were only boost::format and on macOS std::ostringstream).

Therefore, I chose the simple decimal_from function as a replacement (also included in the referred benchmark) by Alf P. Steinbach (@alf-p-steinbach), which scores well in the benchmark (not so important, but for a good feeling), but more importantly is elegant and portable (as far as I can say). I've added static_cast<char> around ('0' + number % 10), because without it I get this warning in CLion:

Clang-Tidy: Narrowing conversion from 'unsigned long' to signed type 'char' is implementation-defined

(if the parameter has type unsigned long). There are other instances of this in the runtime library, but I'll deal with them later.

Also, if I wasn't missing something, this part in the original code (https://ideone.com/nrQfA8) looked like an undefined behavior for number = LONG_MIN to me (if signed integers use two's complement):

    inline void to_decimal( long number, char* buffer )
    {
        if( number < 0 )
        {
            buffer[0] = '-';
            unsigned_to_decimal( -number, buffer + 1 );

If LONG_MIN is -0x8000_0000 (which is < 0), -number overflows, because 0x8000_0000 cannot be represented in the long type (the LONG_MAX would be 0x7fff_ffff). And signed integer overflow is undefined behavior in C++. From what I've found, it really is UB - see https://stackoverflow.com/a/8917528. Although -LONG_MIN sometimes yields LONG_MIN again (which is what the original code apparently relies on), so it looks like it works, it's not guaranteed.

Loosely related (because these wild manifestations seem to be specific to division operations), things like INT_MIN / -1 or INT_MIN % -1 do all kinds of stuff that one wants to avoid: https://kqueue.org/blog/2012/12/31/idiv-dos/, https://www.quora.com/Why-does-INT_MIN-1-evaluate-to-0-in-C-in-an-x86-64-Linux-machine

I've added the functions as templates - this causes some trouble in C++98 that we still support (I had to imitate std::make_unsigned with a crude implementation and drop checks whether the type parameter makes sense), but overall I hope that it makes sense. I haven't used C++ templates before, so I would really appreciate some review. I've ran our test suite for both C++98 and C++11 targets and it seems to work, but I can't be sure there aren't some pitfalls.

I was thinking if we should use templates for this at all, but without them, there is still no single "one size fits all" type. We would need multiple versions - at least a version for type int64_t and uint64_t (because only int64_t can represent numbers from -7fff_ffff_ffff_ffff to -1 and only uint64_t can store values from 0x8000_0000_0000_0000 to 0xffff_ffff_ffff_ffff). At first, I thought I will handle these variants with macros, but that feels hacky. Templates are intuitively a better choice.

See #8 (comment) Fix #28, close #39

generalmimon · 2022-04-03T13:28:57Z

If LONG_MIN is -0x8000_0000 (which is < 0), -number overflows, because 0x8000_0000 cannot be represented in the long type (the LONG_MAX would be 0x7fff_ffff). And signed integer overflow is undefined behavior in C++. From what I've found, it really is UB - see https://stackoverflow.com/a/8917528.

Confirmed by https://clang.llvm.org/docs/UndefinedBehaviorSanitizer.html:

$ cat neg_long_min.cpp
#include <iostream>
#include <limits>
int main() {
        long number = std::numeric_limits<long>::min();
        std::cout << -number << std::endl;
}
$ clang++ -fsanitize=undefined neg_long_min.cpp
$ ./a.out
neg_long_min.cpp:5:15: runtime error: negation of -9223372036854775808 cannot be represented in type 'long'; cast to an unsigned type to negate this value to itself
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior neg_long_min.cpp:5:15 in
-9223372036854775808

webbnh · 2022-04-04T16:55:59Z

Hi all!

First, I must embarrassingly admit, I'm now kind of rusty on C++ and templates...I've been working in Python for the past couple of years (how time flies!...). But, with that said, there's no reason not to use templates, if they are needed and they solve the problem.

However, along those lines, I'm wondering if they are needed. That is, suppose to_string() is simply defined to take whatever the largest integral type is...then it seems like the type promotion rules will take care of (nearly) all of the type issues. There is the problem that negative numbers cannot be coerced to unsigned values and the largest unsigned numbers cannot be represented using signed integers, but perhaps we can find a way to finesse that.

Also, -LONG_MIN is a fascinating problem which I had never considered. However, for the purposes here, I'm wondering if a slightly different approach might be worthwhile. For values with a magnitude greater than 10 (or 2 or any other convenient threshold), we can divide them by 10 (or 2 or...) before checking to see if they are positive or negative. Once that's done, we can use much more straightforward code for handling that problem. (So, for instance, if we peel off the lowest-order digit first, and then evaluate the sign, then we can proceed in the familiar way.)

So, with all that as background, I'm wondering if an approach like this would work:

We need two interfaces (accessed presumably via template functions), one for handling unsigned numbers and one for handling signed numbers, each defined to use the largest unsigned and signed types, respectively.
The signed interface determines whether the value is negative or positive and acts accordingly: positive values can be coerced to unsigned values and passed to the other interface; negative values can be divided in half, negated, converted to unsigned, re-doubled, and passed to the other interface, with the buffer pre-loaded with a negative sign. (We'll need special-case code to handle -1, but that's easy.)
The unsigned interface performs the string conversion using the progressive division-with-remainder approach (we might want to consider using the appropriate C STL function, like div if it will work, which performs both the division and finds the modulo in one step).

I think this approach removes the need for templating for unsigned_to_decimal() -- it can be defined to accept the largest unsigned integer type. Then to_string() is defined as a template function whose default is a simple wrapper for unsigned_to_decimal() using an unsigned type, and with a specialization for signed types which does all the extra work.

I'm thinking that this will be more portable and much less reliant on fancy template stuff, which might make C++98 support easier. (But, obviously, you all have delved much more deeply into this stuff than I have, so there might be something big that I'm missing.)

ams-tschoening · 2022-04-05T06:35:56Z

@generalmimon, is there still a problem with C++98 with your implementation or not? Your description reads to me like it works because of an implemented workaround. If there's no real problem right now, your implementation works in the tests and seems to pretty much reflect what others use in the web according your research and provided discussions, I suggest simply using that.

Any other implementation 1. needs to be created at all, 2. needs to work like the current one, 3. needs to provide some benefits and 4. needs to be understood by others as well. The only benefit I see is better C++98 support in theory(!), which doesn't seem worth it in my opinion. C++98 support won't be there forever anyway. OTOH, a totally different approach than what has been researched as done by many other people in similar cases always introduces a risk of doing things wrong.

webbnh · 2022-04-05T15:13:57Z

@ams-tschoening, my principal comment on this PR is that the proposed change seems like it might be more complex than it needs to be and that the added complexity is coming from the C++ support (i.e., the template magic) and especially from the additional code required to support C++98. The fundamental part of this change (i.e., unsigned_to_decimal()) seems good, and I'm not proposing a significant departure from that. Instead, I'm suggesting that the implementation of to_string() itself can be made more understandable by splitting the unsigned and signed cases apart, which removes the need for much of the template and type management which the proposed code employs.

But, if you're comfortable with the understandability and maintainability of the proposed code, I will happily withdraw my suggestion.

webbnh

Looks good to me.

The cast in static_cast<char>('0' + number % 10) is a necessary evil because you want the result of the expression to be a char but one of the operands (number) is an uint64_t. You could presumably apply the cast to just the result of the modulo operation, but I don't think that improves the readability of the code.

The expression, -static_cast<uint64_t>(val) is pretty scary (what does it really mean to negate an unsigned value??), but if the standards guys say that it has well-defined behavior that does what we want, I guess it's OK. 😁

The net result of this change is pretty effective and pretty minimal, so it looks good!

We only check `std::numeric_limits<I>::max()`, because `min()` is negative for signed types, and we need its absolute value (so we can compare it with `std::numeric_limits<uint64_t>::max()`). However, this value won't fit into the same signed type in two's complement representation, so we need to use at least an unsigned type of the same width - but in GCC, `std::make_unsigned<__int128_t>` is not available, and `uintmax_t` is `uint64_t`, not `__uint128_t`. I don't know how to get the "real" `uintmax_t` including implementation-specific extended types like `__uint128_t` - maybe there's no platform-independent way to do it. Fortunately, we don't have to deal with this for the `max()` case at all, which works just fine, so we'll at least check that.

generalmimon · 2022-04-07T20:21:47Z

@webbnh @ams-tschoening Thanks for your comments!

Replace kstream::to_string() - not just int, fix portability

5ee4e19

See #8 (comment) Fix #28, close #39

Simplify to_string(): accept uint64_t in unsigned_to_decimal()

819e595

generalmimon force-pushed the new-to_string branch from 2ff296b to 819e595 Compare April 7, 2022 10:05

webbnh approved these changes Apr 7, 2022

View reviewed changes

generalmimon merged commit da0d91f into master Apr 7, 2022

generalmimon deleted the new-to_string branch April 7, 2022 20:22

generalmimon mentioned this pull request Jun 2, 2023

GitHub Actions CI for running unit tests #56

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Replace kstream::to_string() - not just `int`, fix portability #50

Replace kstream::to_string() - not just `int`, fix portability #50

generalmimon commented Apr 3, 2022 •

edited

Loading

generalmimon commented Apr 3, 2022

webbnh commented Apr 4, 2022

ams-tschoening commented Apr 5, 2022

webbnh commented Apr 5, 2022

webbnh left a comment

generalmimon commented Apr 7, 2022

Replace kstream::to_string() - not just int, fix portability #50

Replace kstream::to_string() - not just int, fix portability #50

Conversation

generalmimon commented Apr 3, 2022 • edited Loading

generalmimon commented Apr 3, 2022

webbnh commented Apr 4, 2022

ams-tschoening commented Apr 5, 2022

webbnh commented Apr 5, 2022

webbnh left a comment

Choose a reason for hiding this comment

generalmimon commented Apr 7, 2022

Replace kstream::to_string() - not just `int`, fix portability #50

Replace kstream::to_string() - not just `int`, fix portability #50

generalmimon commented Apr 3, 2022 •

edited

Loading