perf(source): intro native row format #7612

waruto210 · 2023-01-31T06:10:42Z

I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.

What's changed and what's your intention?

Introduce a new NATIVE ROW FORMAT, which is the default row format of the nexmark and datagen connectors and is invisible to users.
nexmark only supports the NATIVE ROW FORMAT, but datagen can support multiple formats (only native and JSON for now).

Checklist

I have written necessary rustdoc comments
I have added necessary unit tests and integration tests
~~- [ ] I have added fuzzing tests or opened an issue to track them. (Optional, recommended for new SQL features).~~
All checks passed in ./risedev check (or alias, ./risedev c)

Documentation

If your pull request contains user-facing changes, please specify the types of the changes, and create a release note. Otherwise, please feel free to remove this section.

Types of user-facing changes

Please keep the types that apply to your changes, and remove those that do not apply.

Connector (sources & sinks)

Release note

nexmark: users should not specify a row format.

create table T (
    schema def...
) with (
    connector = 'nexmark'
    ...
);

datagen: if the user does not specify a row format, datagen uses NATIVE.

create table T (
    schema def...
) with (
    connector = 'datagen'
    ...
) [row format json];
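
The two rules above can be sketched as a small resolution function. This is only an illustration of the release note, not the code in this PR: the `RowFormat` and `UserFormat` enums and `resolve_row_format` are hypothetical names, and returning an error when nexmark is given an explicit row format is an assumption (the note only says users should not specify one).

```rust
/// Effective row formats mentioned in this PR (hypothetical enum).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RowFormat {
    Native,
    Json,
}

/// Formats a user can write in a `ROW FORMAT` clause in this sketch.
/// NATIVE is deliberately absent because it is invisible to users.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum UserFormat {
    Json,
}

/// Resolve the effective row format from the `connector` option and the
/// optional `ROW FORMAT` clause.
pub fn resolve_row_format(
    connector: &str,
    specified: Option<UserFormat>,
) -> Result<RowFormat, String> {
    match (connector, specified) {
        // nexmark only supports the native format.
        ("nexmark", None) => Ok(RowFormat::Native),
        // Assumption: an explicit format on nexmark is rejected.
        ("nexmark", Some(f)) => Err(format!("nexmark does not accept ROW FORMAT {:?}", f)),
        // datagen falls back to native when nothing is specified.
        ("datagen", None) => Ok(RowFormat::Native),
        ("datagen", Some(UserFormat::Json)) => Ok(RowFormat::Json),
        // Other connectors are outside the scope of this sketch.
        (other, _) => Err(format!("connector '{}' is not covered by this sketch", other)),
    }
}
```

Keeping NATIVE out of `UserFormat` is one way to model "invisible to users": the native format can only be reached by omitting the clause.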

Refer to a related PR or issue link (optional)

#6969 ,#4555

e2e_test/compaction/ingest_rows.slt

waruto210 · 2023-02-01T03:27:25Z

@fuyufjh @lmatz ~~Should we introduce a new native ROW FORMAT(a breaking change now?) or just ignore the ROW FORMAT?~~

According to issue #6970, we need to introduce a new native ROW FORMAT and need to rewrite the code for this pr.

tabVersion · 2023-02-01T05:32:18Z

> @fuyufjh @lmatz ~~Should we introduce a new native ROW FORMAT(a breaking change now?) or just ignore the ROW FORMAT?~~
>
> According to issue #6970, we need to introduce a new native ROW FORMAT and need to rewrite the code for this pr.

Since the datagen and nexmark connectors both produce stream chunks directly, I prefer rejecting the row format in this case. It is hard to explain to users what a native row format is.

waruto210 · 2023-02-01T05:37:02Z

> @fuyufjh @lmatz ~~Should we introduce a new native ROW FORMAT(a breaking change now?) or just ignore the ROW FORMAT?~~
> According to issue #6970, we need to introduce a new native ROW FORMAT and need to rewrite the code for this pr.
>
> since datagen connector and nexmark connector both produce stream chunks directly, I prefer rejecting row format in this case. It is hard to tell users what is native row format.

PTAL #6970; it requires datagen to be able to generate multiple row formats.

codecov · 2023-02-01T13:54:55Z

Codecov Report

Merging #7612 (3a45614) into main (45b5e6b) will decrease coverage by 0.10%.
The diff coverage is 46.96%.

@@            Coverage Diff             @@
##             main    #7612      +/-   ##
==========================================
- Coverage   71.67%   71.58%   -0.10%     
==========================================
  Files        1111     1111              
  Lines      176936   177229     +293     
==========================================
+ Hits       126812   126862      +50     
- Misses      50124    50367     +243

| Flag | Coverage Δ | |
|---|---|---|
| rust | `71.58% <46.96%> (-0.10%)` | ⬇️ |


| Impacted Files | Coverage Δ | |
|---|---|---|
| src/batch/src/executor/source.rs | `0.00% <0.00%> (ø)` | |
| src/connector/src/source/nexmark/source/message.rs | `0.00% <ø> (-60.98%)` | ⬇️ |
| src/sqlparser/src/keywords.rs | `100.00% <ø> (ø)` | |
| src/sqlparser/src/parser.rs | `91.57% <29.41%> (-0.37%)` | ⬇️ |
| ...nector/src/source/nexmark/source/combined_event.rs | `27.11% <30.04%> (+27.11%)` | ⬆️ |
| src/connector/src/parser/mod.rs | `55.82% <38.46%> (-7.91%)` | ⬇️ |
| src/sqlparser/src/ast/statement.rs | `72.31% <50.00%> (-0.74%)` | ⬇️ |
| ...c/connector/src/source/datagen/source/generator.rs | `89.28% <67.74%> (-4.84%)` | ⬇️ |
| src/common/src/field_generator/mod.rs | `75.00% <75.00%> (+0.50%)` | ⬆️ |
| src/connector/src/source/datagen/source/reader.rs | `86.10% <94.73%> (-1.36%)` | ⬇️ |

... and 22 more


fuyufjh · 2023-02-02T03:47:36Z

> @fuyufjh @lmatz ~~Should we introduce a new native ROW FORMAT(a breaking change now?) or just ignore the ROW FORMAT?~~
> According to issue #6970, we need to introduce a new native ROW FORMAT and need to rewrite the code for this pr.
>
> since datagen connector and nexmark connector both produce stream chunks directly, I prefer rejecting row format in this case. It is hard to tell users what is native row format.
>
> PTAL #6970, it requires datagen to be able to generate multiple row format.

Hmmmm. I think #6970 is meant to generate data with nested columns, i.e. STRUCT, ARRAY, JSON, while the ROW FORMAT here defines the serialization inside Datagen. They are not related. cc @lmatz @kwannoel

waruto210 · 2023-02-02T04:10:34Z

> @fuyufjh @lmatz ~~Should we introduce a new native ROW FORMAT(a breaking change now?) or just ignore the ROW FORMAT?~~
> According to issue #6970, we need to introduce a new native ROW FORMAT and need to rewrite the code for this pr.
>
> since datagen connector and nexmark connector both produce stream chunks directly, I prefer rejecting row format in this case. It is hard to tell users what is native row format.
>
> PTAL #6970, it requires datagen to be able to generate multiple row format.
>
> Hmmmm. I think #6970 is meant to generate data with a nested column i.e. STRUCT, ARRAY, JSON. While the ROW FORMAT here is to define the serialization inside Datagen. They are not related. cc. @lmatz @kwannoel

It seems the main purpose of #6970 is to test source parsing, so both nested columns and more ROW FORMATs are needed.

For performance, Datagen needs to be able to output native chunks directly, without going through a parser; for testing against parsers, Datagen needs to generate data with more complex columns and serialize it into more formats (e.g. avro, protobuf), then parse the serialized data into chunks.

So I think we need to introduce a native row format to control Datagen.
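
To make the two paths concrete, here is a rough sketch under simplifying assumptions: a single `id` column, a toy `{"id":N}` encoding in place of a real JSON parser, and hypothetical names (`Row`, `Payload`, `generate`, `parse_json_row`, `into_rows`) that do not come from the RisingWave codebase. The native path hands rows over as they are; the JSON path serializes them and then has to parse them back.

```rust
/// One generated row; a single BIGINT-like column keeps the sketch small.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Row {
    pub id: i64,
}

/// What the generator hands downstream in this sketch (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub enum Payload {
    /// Rows built in memory, no serialization step.
    Native(Vec<Row>),
    /// One serialized JSON document per row; must be parsed later.
    Json(Vec<String>),
}

/// Generate `count` rows starting at `start`, either natively or as JSON text.
pub fn generate(start: i64, count: usize, native: bool) -> Payload {
    let rows = (0..count as i64).map(|i| Row { id: start + i });
    if native {
        Payload::Native(rows.collect())
    } else {
        Payload::Json(rows.map(|r| format!("{{\"id\":{}}}", r.id)).collect())
    }
}

/// Toy parser that accepts exactly the `{"id":N}` shape produced above.
pub fn parse_json_row(s: &str) -> Option<Row> {
    let body = s.trim().strip_prefix("{\"id\":")?.strip_suffix('}')?;
    body.trim().parse::<i64>().ok().map(|id| Row { id })
}

/// Turn a payload into rows: native rows pass through, JSON goes via the parser.
pub fn into_rows(payload: Payload) -> Option<Vec<Row>> {
    match payload {
        Payload::Native(rows) => Some(rows),
        Payload::Json(lines) => lines.iter().map(|l| parse_json_row(l)).collect(),
    }
}
```

Both paths yield the same rows; the difference is that the JSON path pays for serialization and parsing, which is what the parser tests want to exercise and what the performance path wants to skip.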

kwannoel · 2023-02-02T04:15:22Z

> @fuyufjh @lmatz ~~Should we introduce a new native ROW FORMAT(a breaking change now?) or just ignore the ROW FORMAT?~~
> According to issue #6970, we need to introduce a new native ROW FORMAT and need to rewrite the code for this pr.
>
> since datagen connector and nexmark connector both produce stream chunks directly, I prefer rejecting row format in this case. It is hard to tell users what is native row format.
>
> PTAL #6970, it requires datagen to be able to generate multiple row format.
>
> Hmmmm. I think #6970 is meant to generate data with a nested column i.e. STRUCT, ARRAY, JSON. While the ROW FORMAT here is to define the serialization inside Datagen. They are not related. cc. @lmatz @kwannoel

#6970 is meant to test source parsing of rows, I think @waruto210 has the right idea.
I'm thinking of something like this:

CREATE MATERIALIZED SOURCE IF NOT EXISTS source_abc 
WITH (
   connector='datagen',
   ...
)
ROW FORMAT (AVRO | JSON | PROTOBUF)
...;

Instead of (this is done separately, see #7132):

CREATE TABLE t (v1 struct<...>, v2 int[]);

tabVersion

basically LGTM

src/connector/src/source/datagen/source/generator.rs

src/connector/src/source/nexmark/source/combined_event.rs

github-actions bot added the type/perf label Jan 31, 2023

tabVersion reviewed Jan 31, 2023


e2e_test/compaction/ingest_rows.slt (resolved)

waruto210 force-pushed the waruto/opt-datagen branch from e6bf3cc to 36134f0 Compare February 1, 2023 03:18

waruto210 force-pushed the waruto/opt-datagen branch from 36134f0 to 317a428 Compare February 1, 2023 04:28

kwannoel mentioned this pull request Feb 1, 2023

sqlsmith: Generate multiple input formats: Protobuf, AVRO, JSON #6970

Open

waruto210 force-pushed the waruto/opt-datagen branch from 317a428 to 1c45f7d Compare February 1, 2023 13:31

kwannoel self-requested a review February 7, 2023 02:59

introduce native format

bf33c8d

waruto210 force-pushed the waruto/opt-datagen branch from 1c45f7d to bf33c8d Compare February 10, 2023 05:20

fix proto

5924aab

waruto210 marked this pull request as ready for review February 10, 2023 05:45

waruto210 added 2 commits February 10, 2023 14:26

rerun

b9010ec

fix test

1268f9a

waruto210 force-pushed the waruto/opt-datagen branch 2 times, most recently from a4c62ff to 2b32204 Compare February 10, 2023 08:50

waruto210 changed the title ~~perf(source): gen native StreamChunk~~ perf(source): intro native row format Feb 10, 2023

waruto210 force-pushed the waruto/opt-datagen branch 2 times, most recently from 4bea865 to bc226e6 Compare February 10, 2023 10:17

fix row_id of nexmark

4ca4d8d

waruto210 force-pushed the waruto/opt-datagen branch from bc226e6 to 4ca4d8d Compare February 10, 2023 11:15

waruto210 requested a review from tabVersion February 10, 2023 12:30

tabVersion approved these changes Feb 13, 2023


src/connector/src/source/datagen/source/generator.rs (resolved)

src/connector/src/source/nexmark/source/combined_event.rs (resolved)

waruto210 added the mergify/can-merge label Feb 13, 2023

mergify bot added 2 commits February 13, 2023 08:36

Merge branch 'main' into waruto/opt-datagen

21c0bff

Merge branch 'main' into waruto/opt-datagen

3a45614

mergify bot merged commit 1bbf7bd into main Feb 13, 2023

mergify bot deleted the waruto/opt-datagen branch February 13, 2023 09:04

lmatz added the breaking-change label Feb 14, 2023

tabVersion mentioned this pull request Feb 19, 2023

validate compatibility between connector type and row format #6984

Closed

BugenZhao mentioned this pull request Mar 17, 2023

fix(test): use correct type for nexmark source planner test #8618

Merged

7 tasks

shanicky mentioned this pull request Mar 21, 2023

bug: datagen connector should not be used with ROW FORMAT JSON #3307

Closed
