Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

perf(source): intro native row format #7612

Merged
merged 7 commits into from
Feb 13, 2023
Merged

perf(source): intro native row format #7612

merged 7 commits into from
Feb 13, 2023

Conversation

waruto210
Copy link
Contributor

@waruto210 waruto210 commented Jan 31, 2023

I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.

What's changed and what's your intention?

  • intorcude new NATIVE ROW FORMAT, which is the default row format of nexmark and datagen connecotr, and it is invisible to user.
  • nexmark only support NATIVE ROW FORMAT, but datagen can support multiple formats(only native and json now)

Checklist

  • I have written necessary rustdoc comments
  • I have added necessary unit tests and integration tests
    - [ ] I have added fuzzing tests or opened an issue to track them. (Optional, recommended for new SQL features).
  • All checks passed in ./risedev check (or alias, ./risedev c)

Documentation

If your pull request contains user-facing changes, please specify the types of the changes, and create a release note. Otherwise, please feel free to remove this section.

Types of user-facing changes

Please keep the types that apply to your changes, and remove those that do not apply.

  • Connector (sources & sinks)

Release note

nexmark: user should not specify row format

create table T (
    schema def...
) with (
    connecotr = 'nexmark'
    ...
);

datagen: If user do not specify a row format, datagen would use NATIVE.

create table T (
    schema def...
) with (
    connecotr = 'datagen'
    ...
) [row foramt json];

Refer to a related PR or issue link (optional)

#6969 ,#4555

@waruto210
Copy link
Contributor Author

waruto210 commented Feb 1, 2023

@fuyufjh @lmatz Should we introduce a new native ROW FORMAT(a breaking change now?) or just ignore the ROW FORMAT?

According to issue #6970, we need to introduce a new native ROW FORMAT and need to rewrite the code for this pr.

@tabVersion
Copy link
Contributor

@fuyufjh @lmatz Should we introduce a new native ROW FORMAT(a breaking change now?) or just ignore the ROW FORMAT?

According to issue #6970, we need to introduce a new native ROW FORMAT and need to rewrite the code for this pr.

since datagen connector and nexmark connector both produce stream chunks directly, I prefer rejecting row format in this case. It is hard to tell users what is native row format.

@waruto210
Copy link
Contributor Author

@fuyufjh @lmatz Should we introduce a new native ROW FORMAT(a breaking change now?) or just ignore the ROW FORMAT?
According to issue #6970, we need to introduce a new native ROW FORMAT and need to rewrite the code for this pr.

since datagen connector and nexmark connector both produce stream chunks directly, I prefer rejecting row format in this case. It is hard to tell users what is native row format.

PTAL #6970, it requires datagen to be able to generate multiple row format.

@codecov
Copy link

codecov bot commented Feb 1, 2023

Codecov Report

Merging #7612 (3a45614) into main (45b5e6b) will decrease coverage by 0.10%.
The diff coverage is 46.96%.

@@            Coverage Diff             @@
##             main    #7612      +/-   ##
==========================================
- Coverage   71.67%   71.58%   -0.10%     
==========================================
  Files        1111     1111              
  Lines      176936   177229     +293     
==========================================
+ Hits       126812   126862      +50     
- Misses      50124    50367     +243     
Flag Coverage Δ
rust 71.58% <46.96%> (-0.10%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
src/batch/src/executor/source.rs 0.00% <0.00%> (ø)
src/connector/src/source/nexmark/source/message.rs 0.00% <ø> (-60.98%) ⬇️
src/sqlparser/src/keywords.rs 100.00% <ø> (ø)
src/sqlparser/src/parser.rs 91.57% <29.41%> (-0.37%) ⬇️
...nector/src/source/nexmark/source/combined_event.rs 27.11% <30.04%> (+27.11%) ⬆️
src/connector/src/parser/mod.rs 55.82% <38.46%> (-7.91%) ⬇️
src/sqlparser/src/ast/statement.rs 72.31% <50.00%> (-0.74%) ⬇️
...c/connector/src/source/datagen/source/generator.rs 89.28% <67.74%> (-4.84%) ⬇️
src/common/src/field_generator/mod.rs 75.00% <75.00%> (+0.50%) ⬆️
src/connector/src/source/datagen/source/reader.rs 86.10% <94.73%> (-1.36%) ⬇️
... and 22 more

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@fuyufjh
Copy link
Member

fuyufjh commented Feb 2, 2023

@fuyufjh @lmatz Should we introduce a new native ROW FORMAT(a breaking change now?) or just ignore the ROW FORMAT?
According to issue #6970, we need to introduce a new native ROW FORMAT and need to rewrite the code for this pr.

since datagen connector and nexmark connector both produce stream chunks directly, I prefer rejecting row format in this case. It is hard to tell users what is native row format.

PTAL #6970, it requires datagen to be able to generate multiple row format.

Hmmmm. I think #6970 is meant to generate data with a nested column i.e. STRUCT, ARRAY, JSON. While the ROW FORMAT here is to define the serialization inside Datagen. They are not related. cc. @lmatz @kwannoel

@waruto210
Copy link
Contributor Author

@fuyufjh @lmatz Should we introduce a new native ROW FORMAT(a breaking change now?) or just ignore the ROW FORMAT?
According to issue #6970, we need to introduce a new native ROW FORMAT and need to rewrite the code for this pr.

since datagen connector and nexmark connector both produce stream chunks directly, I prefer rejecting row format in this case. It is hard to tell users what is native row format.

PTAL #6970, it requires datagen to be able to generate multiple row format.

Hmmmm. I think #6970 is meant to generate data with a nested column i.e. STRUCT, ARRAY, JSON. While the ROW FORMAT here is to define the serialization inside Datagen. They are not related. cc. @lmatz @kwannoel

It seems the main purpose of #6970 is to test source parsing, both nested column and more ROW FORMAT are needed.

For performance, Datagen needs to be able to output native chunk directly, without going through parser; for testing against parsers, Datagen needs to genarate data with more complex columns and serializing data into more formats (e.g. avro, protobuf), then parse the serialized data into chunk.

So I think we need to introduce a native row format to control the Datagen.

@kwannoel
Copy link
Contributor

kwannoel commented Feb 2, 2023

@fuyufjh @lmatz Should we introduce a new native ROW FORMAT(a breaking change now?) or just ignore the ROW FORMAT?
According to issue #6970, we need to introduce a new native ROW FORMAT and need to rewrite the code for this pr.

since datagen connector and nexmark connector both produce stream chunks directly, I prefer rejecting row format in this case. It is hard to tell users what is native row format.

PTAL #6970, it requires datagen to be able to generate multiple row format.

Hmmmm. I think #6970 is meant to generate data with a nested column i.e. STRUCT, ARRAY, JSON. While the ROW FORMAT here is to define the serialization inside Datagen. They are not related. cc. @lmatz @kwannoel

#6970 is meant to test source parsing of rows, I think @waruto210 has the right idea.
I'm thinking of something like this:

CREATE MATERIALIZED SOURCE IF NOT EXISTS source_abc 
WITH (
   connector='datagen',
   ...
)
ROW FORMAT (AVRO | JSON | PROTOBUF)
...;

Instead of (this is done separately, see #7132):

CREATE TABLE t (v1 struct<...>, v2 int[]);

@kwannoel kwannoel self-requested a review February 7, 2023 02:59
@waruto210 waruto210 marked this pull request as ready for review February 10, 2023 05:45
@waruto210 waruto210 force-pushed the waruto/opt-datagen branch 2 times, most recently from a4c62ff to 2b32204 Compare February 10, 2023 08:50
@waruto210 waruto210 changed the title perf(source): gen native StreamChunk perf(source): intro native row format Feb 10, 2023
@waruto210 waruto210 force-pushed the waruto/opt-datagen branch 2 times, most recently from 4bea865 to bc226e6 Compare February 10, 2023 10:17
Copy link
Contributor

@tabVersion tabVersion left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

basically LGTM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants