Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement #scan_integer to efficiently parse Integer #114

Closed
wants to merge 1 commit into from

Conversation

casperisfine
Copy link

Fix: #113

This allows to directly parse an Integer from a String without needing to first allocate a sub string.

Notes:

The implementation is limited by design, it's meant as a first step, only the most straightforward, based 10 integers are supported.

cc @kou

@casperisfine
Copy link
Author

Arf, forgot this gem had a Java version. Looking at it.

@casperisfine
Copy link
Author

Another thing that in unclear is whether we assume ASCII compatible encoding or not.

@casperisfine
Copy link
Author

Another thing that in unclear is whether we assume ASCII compatible encoding or not.

>> "1234".encode(Encoding::UTF_32BE).to_i
(irb):2:in `to_i': ASCII incompatible encoding: UTF-32BE (Encoding::CompatibilityError)

So I guess we should just do the same here.

@casperisfine
Copy link
Author

PR updated with the Java implementation, and now raising when called with encoding that's not ASCII compatible.

@casperisfine casperisfine force-pushed the scan-integer branch 2 times, most recently from 8c63de8 to f8b7b02 Compare November 14, 2024 09:10
@casperisfine
Copy link
Author

Hum, TruffleRuby fails with

#<NoMethodError: undefined method `scan_integer' for #<StringScanner:0x31c8>>.

But I don't get why, doesn't it uses the C or Java implementation? I don't see any third implementation (cc @eregon )

@casperisfine
Copy link
Author

Apparently, the Truffle implement is in Truffle itself, so I just skip those test like many others.

@kou
Copy link
Member

kou commented Nov 18, 2024

TruffleRuby support isn't included in this repository. We can just ignore TruffleRuby.
See also: #35


ptr = CURPTR(p);

if (ptr[len] == '-' || ptr[len] == '+') {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is ptr[len] safe when S_RESTLEN() is 0?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think so, because (for now?) Ruby strings are guaranteed to be NULL terminated. So if we're EOF, ptr[len] is \0.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that a String created by rb_str_new_static() may not be terminated by \0.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically you can but the documentation clearly states:

Identical to rb_str_new(), except it takes a C string literal.

So it assume it's a literal, hence is NUL terminated.

I can add the checks if you feel strongly about it, but it's a lot of busy code for something I think isn't supposed to happen.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added the extra checks in #115 noetheless.

Comment on lines +1310 to +1314
buffer = ALLOCV_N(char, buffer_v, len + 1);

MEMCPY(buffer, CURPTR(p), char, len);
buffer[len] = '\0';
integer = rb_cstr2inum(buffer, 10);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, we want to avoid this copy...

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, Ruby doesn't provide an API that doesn't require a NULL terminated string. But if you want I can implement a fast path like I did in ruby/json: ruby/json@3a4dc9e

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. We can use rb_cstr2inum() for now. Could you propose the fast path to Ruby itself? It may be useful for extensions.

Fix: ruby#113

This allows to directly parse an Integer from a String without needing
to first allocate a sub string.

Notes:

The implementation is limited by design, it's meant as a first step,
only the most straightforward, based 10 integers are supported.
ByteList bytes = str.getByteList();
int curr = this.curr;

int bite = bytes.get(curr);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

byte?

Copy link
Contributor

@olleolleolle olleolleolle Nov 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for asking that, I now re-learned that in Java, there are certain reserved words: https://en.wikipedia.org/wiki/List_of_Java_keywords and byte is one of them.

EDIT: Oh,

Of these 68 keywords, 17 of them are only contextually reserved, and can sometimes be used as an identifier, unlike standard reserved words

(Ah, right, that list of contextually-reserved didn't include byte.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

!!!

@byroot
Copy link
Member

byroot commented Nov 25, 2024

Closing in favor of #115 for logistical reasons.

@byroot byroot closed this Nov 25, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

Provide an efficient way to parse Integers (and Floats)?
4 participants