Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve performance of trim for string view (10%) #12395

Merged
merged 26 commits into from
Sep 25, 2024
Merged
Show file tree
Hide file tree
Changes from 15 commits
Commits
Show all changes
26 commits
Select commit Hold shift + click to select a range
736eb11
draft.
Rachelint Sep 9, 2024
a0da2d0
add unit tests for xTrim.
Rachelint Sep 9, 2024
3c8b035
fix fmt.
Rachelint Sep 9, 2024
06d104d
tmp copy for ci.
Rachelint Sep 9, 2024
48cb4db
move `make_and_append_view` to common.
Rachelint Sep 11, 2024
863e9b7
fix sting view trim about the process of empty string.
Rachelint Sep 11, 2024
36a8125
fix compile.
Rachelint Sep 11, 2024
aa2c131
eliminate some repeated codes.
Rachelint Sep 11, 2024
e3e9b53
add sql test case about string view trim.
Rachelint Sep 17, 2024
dbd0f25
Merge branch 'main' into string-view-trim
Rachelint Sep 17, 2024
6d5660f
remove unused imports.
Rachelint Sep 17, 2024
65ef988
Merge branch 'main' into string-view-trim
Rachelint Sep 21, 2024
4129a43
Merge branch 'main' into string-view-trim
Rachelint Sep 22, 2024
522c87f
fix tests.
Rachelint Sep 22, 2024
840ec46
remove stale file.
Rachelint Sep 22, 2024
307850a
Merge remote-tracking branch 'apache/main' into string-view-trim
alamb Sep 23, 2024
064450f
Avoid unecessary unsafe
alamb Sep 23, 2024
c2510de
add unit test cases with a unlined string view output.
Rachelint Sep 23, 2024
38790b2
fix tests.
Rachelint Sep 23, 2024
20197d9
improve comments.
Rachelint Sep 23, 2024
2112bc5
add todo and the related issue.
Rachelint Sep 23, 2024
790f7a9
use the safe way to get `start_offset` after trimming.
Rachelint Sep 25, 2024
f817462
fix comments.
Rachelint Sep 25, 2024
148a991
Merge branch 'main' into string-view-trim
Rachelint Sep 25, 2024
f9c1543
Remove redundant test
alamb Sep 25, 2024
e39f916
Merge remote-tracking branch 'apache/main' into string-view-trim
alamb Sep 25, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
136 changes: 135 additions & 1 deletion datafusion/functions/src/string/btrim.rs
Original file line number Diff line number Diff line change
Expand Up @@ -82,7 +82,11 @@ impl ScalarUDFImpl for BTrimFunc {
}

fn return_type(&self, arg_types: &[DataType]) -> Result<DataType> {
utf8_to_str_type(&arg_types[0], "btrim")
if arg_types[0] == DataType::Utf8View {
Ok(DataType::Utf8View)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Also eventually it would also be possible to return Utf8View when the input was Utf8 and save a copy as well

} else {
utf8_to_str_type(&arg_types[0], "btrim")
}
}

fn invoke(&self, args: &[ColumnarValue]) -> Result<ColumnarValue> {
Expand All @@ -106,3 +110,133 @@ impl ScalarUDFImpl for BTrimFunc {
&self.aliases
}
}

#[cfg(test)]
mod tests {
use arrow::array::{Array, StringArray, StringViewArray};
use arrow::datatypes::DataType::{Utf8, Utf8View};

use datafusion_common::{Result, ScalarValue};
use datafusion_expr::{ColumnarValue, ScalarUDFImpl};

use crate::string::btrim::BTrimFunc;
use crate::utils::test::test_function;

#[test]
fn test_functions() {
test_function!(
BTrimFunc::new(),
&[ColumnarValue::Scalar(ScalarValue::Utf8View(Some(
String::from("alphabet ")
))),],
Ok(Some("alphabet")),
&str,
Utf8View,
StringViewArray
);
test_function!(
BTrimFunc::new(),
&[ColumnarValue::Scalar(ScalarValue::Utf8View(Some(
String::from(" alphabet ")
))),],
Ok(Some("alphabet")),
&str,
Utf8View,
StringViewArray
);
test_function!(
BTrimFunc::new(),
&[
ColumnarValue::Scalar(ScalarValue::Utf8View(Some(String::from(
"alphabet"
)))),
ColumnarValue::Scalar(ScalarValue::Utf8View(Some(String::from("t")))),
],
Ok(Some("alphabe")),
&str,
Utf8View,
StringViewArray
);
test_function!(
BTrimFunc::new(),
&[
ColumnarValue::Scalar(ScalarValue::Utf8View(Some(String::from(
"alphabet"
)))),
ColumnarValue::Scalar(ScalarValue::Utf8View(Some(String::from(
"alphabe"
)))),
],
Ok(Some("t")),
&str,
Utf8View,
StringViewArray
);
test_function!(
BTrimFunc::new(),
&[
ColumnarValue::Scalar(ScalarValue::Utf8View(Some(String::from(
"alphabet"
)))),
ColumnarValue::Scalar(ScalarValue::Utf8View(None)),
],
Ok(None),
&str,
Utf8View,
StringViewArray
);
test_function!(
BTrimFunc::new(),
&[ColumnarValue::Scalar(ScalarValue::Utf8(Some(
String::from("alphabet ")
))),],
Ok(Some("alphabet")),
&str,
Utf8,
StringArray
);
test_function!(
BTrimFunc::new(),
&[ColumnarValue::Scalar(ScalarValue::Utf8(Some(
String::from("alphabet ")
))),],
Ok(Some("alphabet")),
&str,
Utf8,
StringArray
);
test_function!(
BTrimFunc::new(),
&[
ColumnarValue::Scalar(ScalarValue::Utf8(Some(String::from("alphabet")))),
ColumnarValue::Scalar(ScalarValue::Utf8(Some(String::from("t")))),
],
Ok(Some("alphabe")),
&str,
Utf8,
StringArray
);
test_function!(
BTrimFunc::new(),
&[
ColumnarValue::Scalar(ScalarValue::Utf8(Some(String::from("alphabet")))),
ColumnarValue::Scalar(ScalarValue::Utf8(Some(String::from("alphabe")))),
],
Ok(Some("t")),
&str,
Utf8,
StringArray
);
test_function!(
BTrimFunc::new(),
&[
ColumnarValue::Scalar(ScalarValue::Utf8(Some(String::from("alphabet")))),
ColumnarValue::Scalar(ScalarValue::Utf8(None)),
],
Ok(None),
&str,
Utf8,
StringArray
);
}
}
149 changes: 116 additions & 33 deletions datafusion/functions/src/string/common.rs
Original file line number Diff line number Diff line change
Expand Up @@ -21,17 +21,39 @@ use std::fmt::{Display, Formatter};
use std::sync::Arc;

use arrow::array::{
new_null_array, Array, ArrayAccessor, ArrayDataBuilder, ArrayIter, ArrayRef,
GenericStringArray, GenericStringBuilder, LargeStringArray, OffsetSizeTrait,
StringArray, StringBuilder, StringViewArray, StringViewBuilder,
make_view, new_null_array, Array, ArrayAccessor, ArrayDataBuilder, ArrayIter,
ArrayRef, ByteView, GenericStringArray, GenericStringBuilder, LargeStringArray,
OffsetSizeTrait, StringArray, StringBuilder, StringViewArray, StringViewBuilder,
};
use arrow::buffer::{Buffer, MutableBuffer, NullBuffer};
use arrow::datatypes::DataType;
use arrow_buffer::{NullBufferBuilder, ScalarBuffer};
use datafusion_common::cast::{as_generic_string_array, as_string_view_array};
use datafusion_common::Result;
use datafusion_common::{exec_err, ScalarValue};
use datafusion_expr::ColumnarValue;

/// Make a `u128` based on the given substr, start(offset to view.offset), and
/// push into to the given buffers
pub(crate) fn make_and_append_view(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤔 I wonder if we should (as a follow on PR) propose adding this upstream to arrow-rs as it seems valuable for any trim related kernels on stringview

Copy link
Contributor Author

@Rachelint Rachelint Sep 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It sounds great! and #12383 (comment) can be solved if it is function in arrow-rs.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like what would be really useful is a StringViewBuilder that could be modified perhaps 🤔

I started to write a ticket in arrow-rs but I didn't know exactly what API to suggest. I think we would have to try it out

views_buffer: &mut Vec<u128>,
null_builder: &mut NullBufferBuilder,
raw_view: &u128,
substr: &str,
start: u32,
) {
let substr_len = substr.len();
let sub_view = if substr_len > 12 {
let view = ByteView::from(*raw_view);
make_view(substr.as_bytes(), view.buffer_index, view.offset + start)
} else {
// inline value does not need block id or offset
make_view(substr.as_bytes(), 0, 0)
};
views_buffer.push(sub_view);
null_builder.append_non_null();
}

pub(crate) enum TrimType {
Left,
Right,
Expand Down Expand Up @@ -72,65 +94,126 @@ pub(crate) fn general_trim<T: OffsetSizeTrait>(
};

if use_string_view {
string_view_trim::<T>(func, args)
string_view_trim(func, args)
} else {
string_trim::<T>(func, args)
}
}

// removing 'a will cause compiler complaining lifetime of `func`
fn string_view_trim<'a, T: OffsetSizeTrait>(
func: fn(&'a str, &'a str) -> &'a str,
fn string_view_trim<'a>(
trim_func: fn(&'a str, &'a str) -> &'a str,
args: &'a [ArrayRef],
) -> Result<ArrayRef> {
let string_array = as_string_view_array(&args[0])?;
let string_view_array = as_string_view_array(&args[0])?;
let mut views_buf = Vec::with_capacity(string_view_array.len());
let mut null_builder = NullBufferBuilder::new(string_view_array.len());

match args.len() {
1 => {
let result = string_array
.iter()
.map(|string| string.map(|string: &str| func(string, " ")))
.collect::<GenericStringArray<T>>();

Ok(Arc::new(result) as ArrayRef)
let array_iter = string_view_array.iter();
let views_iter = string_view_array.views().iter();
for (src_str_opt, raw_view) in array_iter.zip(views_iter) {
trim_and_append_str(
src_str_opt,
Some(" "),
trim_func,
&mut views_buf,
&mut null_builder,
raw_view,
);
}
}
2 => {
let characters_array = as_string_view_array(&args[1])?;

if characters_array.len() == 1 {
// Only one `trim characters` exist
if characters_array.is_null(0) {
return Ok(new_null_array(
// The schema is expecting utf8 as null
&DataType::Utf8,
string_array.len(),
&DataType::Utf8View,
string_view_array.len(),
));
}

let characters = characters_array.value(0);
let result = string_array
.iter()
.map(|item| item.map(|string| func(string, characters)))
.collect::<GenericStringArray<T>>();
return Ok(Arc::new(result) as ArrayRef);
let array_iter = string_view_array.iter();
let views_iter = string_view_array.views().iter();
for (src_str_opt, raw_view) in array_iter.zip(views_iter) {
trim_and_append_str(
src_str_opt,
Some(characters),
trim_func,
&mut views_buf,
&mut null_builder,
raw_view,
);
}
} else {
// A specific `trim characters` for a row in the string view array
let characters_iter = characters_array.iter();
let array_iter = string_view_array.iter();
let views_iter = string_view_array.views().iter();
for ((src_str_opt, raw_view), characters_opt) in
array_iter.zip(views_iter).zip(characters_iter)
{
trim_and_append_str(
src_str_opt,
characters_opt,
trim_func,
&mut views_buf,
&mut null_builder,
raw_view,
);
}
}

let result = string_array
.iter()
.zip(characters_array.iter())
.map(|(string, characters)| match (string, characters) {
(Some(string), Some(characters)) => Some(func(string, characters)),
_ => None,
})
.collect::<GenericStringArray<T>>();

Ok(Arc::new(result) as ArrayRef)
}
other => {
exec_err!(
return exec_err!(
"Function TRIM was called with {other} arguments. It requires at least 1 and at most 2."
)
);
}
}

let views_buf = ScalarBuffer::from(views_buf);
let nulls_buf = null_builder.finish();

// Safety:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again I think using StringBuilder here might improve performance

Copy link
Contributor Author

@Rachelint Rachelint Sep 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I am seeking way to make in-place modification of array currently, too.
And I found only primitive support it through converting itself to builder.

It is really meaningful to support this ability for more arrays.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Related discussion: apache/arrow-rs#6430

// (1) The blocks of the given views are all provided
// (2) Each of the range `view.offset+start..end` of view in views_buf is within
// the bounds of each of the blocks
unsafe {
let array = StringViewArray::new_unchecked(
views_buf,
string_view_array.data_buffers().to_vec(),
nulls_buf,
);
Ok(Arc::new(array) as ArrayRef)
}
}

fn trim_and_append_str<'a>(
src_str_opt: Option<&'a str>,
trim_characters_opt: Option<&'a str>,
trim_func: fn(&'a str, &'a str) -> &'a str,
views_buf: &mut Vec<u128>,
null_builder: &mut NullBufferBuilder,
raw: &u128,
) {
if let (Some(src_str), Some(characters)) = (src_str_opt, trim_characters_opt) {
let trim_str = trim_func(src_str, characters);

// Safety:
// `trim_str` is computed from `str::trim_xxx_matches`,
// and its addr is ensured to be >= `origin_str`'s
let start = unsafe { trim_str.as_ptr().offset_from(src_str.as_ptr()) as u32 };
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you need the length of this in bytes, I don't think you need unsafe here

How about this:

Suggested change
let start = unsafe { trim_str.as_ptr().offset_from(src_str.as_ptr()) as u32 };
let start = (src_str.as_bytes().iter().len() - trim_str.as_bytes().len()) as u32;

Copy link
Contributor Author

@Rachelint Rachelint Sep 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you need the length of this in bytes, I don't think you need unsafe here

How about this:

I want to get the trim_str's "start offest" from src_str actually.
For example,

  • Assume that src_str: " abc ", after btrim, trim_str: "abc".
  • The start of src_str is ' '(offset 0), the new start of trim_str is 'a' (offset 2), and I expected the relative offset 2 (2 - 0).
  • And len(src_str) - len(trim_str) is 4.

Maybe wen can only make it using unsafe temporarily #12387

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ran this diff:

diff --git a/datafusion/functions/src/string/common.rs b/datafusion/functions/src/string/common.rs
index 4f70374b7..f796d10c2 100644
--- a/datafusion/functions/src/string/common.rs
+++ b/datafusion/functions/src/string/common.rs
@@ -204,10 +204,7 @@ fn trim_and_append_str<'a>(
     if let (Some(src_str), Some(characters)) = (src_str_opt, trim_characters_opt) {
         let trim_str = trim_func(src_str, characters);

-        // Safety:
-        // `trim_str` is computed from `str::trim_xxx_matches`,
-        // and its addr is ensured to be >= `origin_str`'s
-        let start = unsafe { trim_str.as_ptr().offset_from(src_str.as_ptr()) as u32 };
+        let start = (src_str.as_bytes().len() - trim_str.as_bytes().len()) as u32;

         make_and_append_view(views_buf, null_builder, raw, trim_str, start);
     } else {

And all tests passed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for my carelessness, I have added new unit test cases with the unlined string view output (len > 12).
And they seems broken after removing the unsafe codes.
The reason may be #12395 (comment)

I guess maybe we can't remove the unsafe codes until the feature Pattern get stable #12387 (comment)

I will file a issue to check it.


make_and_append_view(views_buf, null_builder, raw, trim_str, start);
} else {
null_builder.append_null();
views_buf.push(0);
}
}

fn string_trim<'a, T: OffsetSizeTrait>(
Expand Down
Loading