Replacement of internal RegEx with PCRE2 #10148
re2 or hyperscan perform much better than pcre, so are there any good reasons I'm missing for choosing pcre over them? Its performance with JIT apparently becomes comparable to theirs, but JIT may not be available on all platforms. Also, re2 guarantees linear-time matching (which comes at the cost of dropping back-references).
The biggest motivation for picking PCRE is that I can easily choose between 16-bit and 32-bit character strings at compile time. In doing so I can get it to play nice with Godot's built-in String type rather than converting back and forth between UTF-8.
Ah, does this String class use UCS-4?
Annoyingly, not exactly. It uses `wchar_t` which is UCS-2 on Windows but
UCS-4 on *Nix systems.
Also, force-pushed some clang-format fixes I forgot to do.
On 8 August 2017 at 02:36, Ferenc Arn wrote:
> Ah, does this String class use UCS-4?
I'm guessing Godot's string class isn't using UTF-8 for historical reasons; maybe someone knows if there are any plans for a switch?
No, there are no plans to switch away from `wchar_t`.
As discussed with reduz on IRC, I've moved away from having SCons detect
the width of `wchar_t`, as doing a test compile can be unreliable at times.
Instead the check is done at run-time, in the hope that optimisers will get
rid of the unneeded parts of the code.
On 8 August 2017 at 08:41, Ferenc Arn wrote:
> I'm guessing godot's string class isn't using utf8 for historical reasons;
> maybe someone knows if there are any plans for a switch?
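The run-time width check described above can be sketched roughly as follows. This is only an illustration, not the code from this PR: `code_unit_bits` and `count_code_points` are hypothetical names, and the sketch assumes `wchar_t` is either 16 or 32 bits wide. Because `sizeof(wchar_t)` is a compile-time constant, an optimiser can typically drop the branch that does not apply on a given platform.

```cpp
#include <climits>
#include <cstddef>
#include <string>

// Hypothetical helper: width of wchar_t in bits
// (typically 16 on Windows, 32 on most *nix systems).
inline int code_unit_bits() {
	return static_cast<int>(sizeof(wchar_t) * CHAR_BIT);
}

// Hypothetical example of width-dependent logic: count code points in a wide string.
inline std::size_t count_code_points(const std::wstring &text) {
	if (sizeof(wchar_t) == 2) {
		// 16-bit code units: a high surrogate followed by a low surrogate
		// forms a single code point.
		std::size_t count = 0;
		for (std::size_t i = 0; i < text.size(); ++i) {
			unsigned long unit = static_cast<unsigned long>(text[i]) & 0xFFFFul;
			if (unit >= 0xD800ul && unit <= 0xDBFFul && i + 1 < text.size()) {
				unsigned long next = static_cast<unsigned long>(text[i + 1]) & 0xFFFFul;
				if (next >= 0xDC00ul && next <= 0xDFFFul) {
					++i; // skip the low surrogate of the pair
				}
			}
			++count;
		}
		return count;
	}
	// 32-bit code units: one unit per code point.
	return text.size();
}
```

The same pattern (branching on `sizeof(wchar_t)`) could be used to pick between a 16-bit and a 32-bit code path without a build-time test compile.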
modules/regex/regex.cpp
}

void RegEx::_bind_methods() {

	ClassDB::bind_method(D_METHOD("clear"), &RegEx::clear);
	ClassDB::bind_method(D_METHOD("compile", "pattern"), &RegEx::compile);
	ClassDB::bind_method(D_METHOD("search:RegExMatch", "text", "start", "end"), &RegEx::search, DEFVAL(0), DEFVAL(-1));
	ClassDB::bind_method(D_METHOD("sub", "text", "replacement", "all", "start", "end"), &RegEx::sub, DEFVAL(false), DEFVAL(0), DEFVAL(-1));
	ClassDB::bind_method(D_METHOD("search:RegExMatch", "subject", "offset", "end"), &RegEx::search, DEFVAL(0), DEFVAL(-1));
Starting today, the `:type` suffix must not be used anymore, except for virtual method binds.
Done! Also fixed the merge conflicts.
There seems to be a linking issue on Windows (at least with the old mingw version on Ubuntu 14.04): https://travis-ci.org/godotengine/godot/jobs/263722203
Yeah, realised it was missing the |
modules/regex/SCsub
env_pcre2.Append(CPPFLAGS=thirdparty_flags)
env_pcre2.Append(CPPFLAGS=["-DPCRE2_CODE_UNIT_WIDTH=" + width])

if (env['builtin_pcre2'] != 'no'):
Most of the above should go inside the `if`, actually: when building against the system pcre2 (typically on Linux), we only want to build the module files themselves, not reference anything in `thirdparty_dir`.
Could you also add the `platform/x11/detect.py` detection of pcre2 with pkgconfig, as done for other opt-out thirdparty deps?
(If you're not on Linux, I could likely add it myself after merge if need be)
So I've moved as much as possible into the `if` block to allow external linking.
On X11, `builtin_pcre2` is still set to yes by default, as pcre2 is not available on Ubuntu Trusty (and thus Travis-CI). Is there a way to automatically select yes/no if the native lib is not-found/found?
The pattern and replacement matching behaviour has changed purely as a result of switching to a standards-compliant library. One mistake in the previous behaviour was that named groups didn't have a number. This has been corrected. As names are actually just aliases of numbered groups, RegExMatch::get_name_dict() is now get_names() and is a dict mapping each name to the group number it represents. Duplicate names are enabled, with the first matching instance used. Due to the lack of a suitable equivalent in PCRE2, RegExMatch::expand() was removed.
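To illustrate the "names are aliases of numbered groups" model described above, here is a small standalone sketch. It does not use PCRE2 or Godot's API: `GroupSpan`, `NameEntry` and `resolve_name` are hypothetical names, and "first matching instance" is interpreted here, as an assumption, to mean the lowest-numbered group with that name that actually took part in the match.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical record of one capture group's span; start == -1 means
// the group did not take part in the match.
struct GroupSpan {
	int start;
	int end;
};

// Hypothetical name table entry: a name is just an alias for a group number.
struct NameEntry {
	std::string name;
	int group;
};

// Returns the lowest group number carrying `name` that participated in the
// match, or -1 if there is none. Duplicate names are allowed in `names`.
inline int resolve_name(const std::vector<NameEntry> &names,
		const std::vector<GroupSpan> &groups,
		const std::string &name) {
	int best = -1;
	for (std::size_t i = 0; i < names.size(); ++i) {
		const NameEntry &entry = names[i];
		if (entry.name != name) {
			continue;
		}
		if (entry.group < 0 || static_cast<std::size_t>(entry.group) >= groups.size()) {
			continue; // ignore entries that point outside the match data
		}
		if (groups[entry.group].start < 0) {
			continue; // this instance of the name did not match
		}
		if (best == -1 || entry.group < best) {
			best = entry.group;
		}
	}
	return best;
}
```

With a table such as `{"year", 1}, {"year", 3}, {"month", 2}` and a match in which group 1 did not participate, looking up `"year"` resolves to group 3 under this interpretation.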
Turns out PCRE2 was not as heavy as I thought and was fairly easy to integrate into Godot. This is the minimal feature-parity-ish commit. Plans to introduce serialised compilations and JIT speed-ups would come separately later.
That said, discussion is needed for `modules/regex/SCsub`. Because `wchar_t` can be 32-bit on *nix but 16-bit on Windows, its width must be known at compile time. Unfortunately, I'm not sure `modules/regex/SCsub` is the right place for the tests (or that they even use the right method). If someone more familiar with SCons can help me out, that'd be great.
Notable minor breaking changes:
- `RegExMatch::get_name_dict()` has been removed. Also, for clarity, `RegExMatch::get_group_array()` has been renamed.
- `RegExMatch::expand()` was removed.