-
Notifications
You must be signed in to change notification settings - Fork 165
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add regexp_match_substring_all function to yaml #469
feat: add regexp_match_substring_all function to yaml #469
Conversation
The existing implementation of The implementation of I understand why there is no But I do not understand:
|
I was wondering the same thing when I submitted this PR. I think this was the case because I was looking at other functions named
I wasn't sure whether or not to put the |
IIUC, the I think we should aim for these two functions |
Sounds good, I've updated both. |
extensions/functions_string.yaml
Outdated
on. The `position` argument should be a positive non-zero integer. The regular | ||
expression capture group can be specified using the `group` argument. Specifying `0` | ||
will match the entire regular expression. The default group value is `1`, which means the | ||
first capture group will be matched. The `group` argument should be a non-negative integer. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@richtia thank you for adding this group
argument to the existing regexp_match_substring
function. This creates better correspondence between this existing function and the new regexp_match_substring_all
function added here. But to be clear, it's not only for the sake of completeness. Multiple real-world functions that extract a single regex match—including R's stringr::str_extract()
and Snowflake's with REGEXP_SUBSTR
—accept an optional group number argument which allows the user to specify which capture group to return. So this is a practical addition.
Is it considered a breaking change when we add a new argument like this? I forget whether or not Substrait technically allows optional arguments or whether any new argument counts as a breaking change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be considered a breaking change. I'm not 100% sure though. @westonpace
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, adding a new positional argument is a breaking change (adding a new option is not).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I understand yet but I haven't looked at the regex functions before so it's likely just ignorance on my part.
extensions/functions_string.yaml
Outdated
Behavior is undefined if the regex fails to compile, the occurrence value is out of range, or | ||
the position value is out of range. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are these things undefined behaviors and not errors?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW the existing regexp functions in functions_string.yaml use this same kind of boilerplate "Behavior is undefined" language.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I think that's just the decision we made during the original PR. But if erroring out is preferred, we can go that route too.
f64048e
to
f5973e8
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One new question and there is one outstanding (related to undefined behaviors). I've resolved the rest of my comments for clarity.
extensions/functions_string.yaml
Outdated
Behavior is undefined if the regex fails to compile, the occurrence value is out of range, or | ||
the position value is out of range. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Behavior is undefined if the regex fails to compile, the occurrence value is out of range, or | |
the position value is out of range. | |
Behavior is undefined if the regex fails to compile, the occurrence value is out of range, | |
the position value is out of range, or the group value is out of range. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@richtia could you please also change the "Behavior is undefined" sentence for the regexp_match_substring
function to mention the group value?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done!
f5973e8
to
10f9d30
Compare
fix: add group to match_substring and position to match_substring_all docs: update descriptions docs: update description docs: update undefined behavior in match substring
10f9d30
to
6440272
Compare
@richtia can you please change the PR description to say: BREAKING CHANGE: |
extensions/functions_string.yaml
Outdated
Behavior is undefined if the regex fails to compile, the occurrence value is out of range, | ||
the position value is out of range, or the group value is out of range. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops, there is no occurrence value for regexp_match_substring_all
Behavior is undefined if the regex fails to compile, the occurrence value is out of range, | |
the position value is out of range, or the group value is out of range. | |
Behavior is undefined if the regex fails to compile, the position value is out of range, | |
or the group value is out of range. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops. Fixed!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm good. @ianmcook let me know when you're content.
It looks good to me |
+1 Since this is now a breaking extension function change (adding new positional arguments is always a breaking change) it requires two SMC votes. @cpcloud / @jacques-n ? |
FYI: DuckDB just added a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
BREAKING CHANGE:
group
argument added toregexp_match_substring
functionAdd regexp_match_substring_all function
Resolves #466