
s3 source with csv parser #31

Closed
wants to merge 4 commits

Conversation

waruto210

Support loading CSV-formatted data from Amazon S3 and other S3-compatible storage services.


A file is different from a message queue: a file is read as a series of byte chunks rather than as separate messages.

The parser should try to parse one record from the payload. If the payload does not contain enough bytes to complete a record, the parser buffers it internally and waits for more bytes to be passed in. If any error occurs, the parser should clear its internal buffer.
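A minimal sketch of this buffering behavior, using hypothetical names (`BufferedCsvParser`, `parse_record`) rather than the actual connector types:

```rust
/// Illustrative only: keeps partial input across calls and clears the buffer
/// on error, as described above.
struct BufferedCsvParser {
    buf: Vec<u8>, // bytes carried over from previous payloads
}

impl BufferedCsvParser {
    fn new() -> Self {
        Self { buf: Vec::new() }
    }

    /// Try to parse one CSV record (one line) from `payload`.
    /// Returns `Ok(Some(fields))` on success, `Ok(None)` when more bytes are
    /// needed, and `Err(..)` on a malformed record (the buffer is cleared).
    fn parse_record(&mut self, payload: &[u8]) -> Result<Option<Vec<String>>, String> {
        self.buf.extend_from_slice(payload);
        // No record delimiter yet: keep the bytes buffered and wait for more.
        let Some(pos) = self.buf.iter().position(|&b| b == b'\n') else {
            return Ok(None);
        };
        let line: Vec<u8> = self.buf.drain(..=pos).collect();
        match std::str::from_utf8(&line[..line.len() - 1]) {
            Ok(s) => Ok(Some(s.split(',').map(|f| f.trim().to_string()).collect())),
            Err(e) => {
                // On error, discard any partially buffered input.
                self.buf.clear();
                Err(format!("invalid record: {e}"))
            }
        }
    }
}
```

For example, feeding `b"1,fo"` returns `Ok(None)` and keeps the bytes buffered; a later call with `b"o\n"` completes the record as `["1", "foo"]`.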
Member

+1 for the "statefulness". In the future, I suppose the parser should be stateful not only for buffering some input bytes, but also for storing necessary metadata read from the file's header (as with ORC or Parquet files). Hopefully, this design can live up to those requirements.
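As a rough illustration of that extra state (hypothetical names, not a concrete proposal), the parser could carry schema-like metadata discovered from the file itself, here a CSV header row, analogous to the footer/header metadata of ORC or Parquet:

```rust
/// Illustrative only: parser state that stores metadata read from the file
/// (a CSV header row) in addition to buffered bytes.
struct StatefulParserState {
    buf: Vec<u8>,                 // partially received bytes
    columns: Option<Vec<String>>, // column names parsed from the header row
}

impl StatefulParserState {
    /// Treat the first parsed record as the header and remember it;
    /// map every later record against the remembered column names.
    fn on_record(&mut self, fields: Vec<String>) -> Option<Vec<(String, String)>> {
        match &self.columns {
            None => {
                self.columns = Some(fields);
                None // the header row itself produces no data row
            }
            Some(cols) => Some(cols.iter().cloned().zip(fields).collect()),
        }
    }
}
```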

Member

In addition, in the future we should compose the parser inside the reader to make it more flexible. In that case, NexMark connectors can directly output DataChunk without the redundant serialization.
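A rough sketch of the "compose the parser inside the reader" idea, with hypothetical trait names (`ByteParser`, `ChunkSource`) and a placeholder `DataChunk`: a file reader feeds bytes through a parser it owns, while a generated source such as NexMark can implement the chunk-producing interface directly and skip the serialize-then-parse round trip.

```rust
/// Placeholder standing in for the engine's columnar batch type.
struct DataChunk;

/// Hypothetical: parses raw bytes into chunks (e.g. the CSV parser above).
trait ByteParser {
    fn parse(&mut self, payload: &[u8]) -> Vec<DataChunk>;
}

/// Hypothetical: anything that can hand chunks to the source executor.
trait ChunkSource {
    fn next_chunk(&mut self) -> Option<DataChunk>;
}

/// A file/S3 reader composes a parser internally...
struct FileReader<P: ByteParser> {
    parser: P,
    pending: Vec<DataChunk>,
}

impl<P: ByteParser> FileReader<P> {
    fn feed(&mut self, payload: &[u8]) {
        self.pending.extend(self.parser.parse(payload));
    }
}

impl<P: ByteParser> ChunkSource for FileReader<P> {
    fn next_chunk(&mut self) -> Option<DataChunk> {
        self.pending.pop()
    }
}

/// ...while a generator-style source (e.g. NexMark) builds `DataChunk`s
/// directly, with no intermediate byte serialization.
struct NexmarkSource;

impl ChunkSource for NexmarkSource {
    fn next_chunk(&mut self) -> Option<DataChunk> {
        Some(DataChunk)
    }
}
```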

```rust
    &'a mut self,
    payload: &'a mut &'b [u8],
    writer: SourceStreamChunkRowWriter<'c>,
) -> Self::ParseResult<'a>
```
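For context, a hedged reconstruction of the trait this signature might belong to: only the argument and return types above come from the diff, while the trait name, method name, and bounds are assumptions.

```rust
/// Placeholder for the writer type referenced in the snippet above.
pub struct SourceStreamChunkRowWriter<'a>(std::marker::PhantomData<&'a ()>);

/// Assumed shape of the parser trait; the generic associated type lets each
/// implementation choose its own (possibly async) result type that borrows
/// from `self`.
pub trait SourceParser {
    type ParseResult<'a>
    where
        Self: 'a;

    fn parse<'a, 'b, 'c>(
        &'a mut self,
        payload: &'a mut &'b [u8],
        writer: SourceStreamChunkRowWriter<'c>,
    ) -> Self::ParseResult<'a>
    where
        'b: 'a;
}
```

Taking `payload` as `&mut &[u8]` lets the parser advance the slice in place, i.e. consume exactly the bytes that formed a record and leave the rest for the next call.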
Member

By @hzxa21: We may consider implementing it like a streaming operator that takes in a stream of bytes and outputs a stream of events. By the way, our streaming code adopts futures_async_stream for this, which basically allows you to yield a DataChunk at any time while the compiler and macro handle the rest, such as keeping the state.

https://github.com/risingwavelabs/risingwave/blob/0128ff1b312ae22f51e979a73bd5cb9f61831c39/src/stream/src/executor/project_set.rs#L196
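A sketch of that futures_async_stream style, with placeholder types and made-up chunking logic; note the crate needs a nightly toolchain because it builds on unstable coroutine/generator support.

```rust
// Nightly-only: futures_async_stream relies on unstable coroutine support
// enabled at the crate root.
use futures::{Stream, StreamExt};
use futures_async_stream::try_stream;

struct DataChunk;   // placeholder for the real chunk type
struct ParserError; // placeholder error type

/// Hypothetical: turn a stream of byte payloads into a stream of DataChunks.
/// The macro generates the state machine, so `buf` simply lives across
/// `.await` and `yield` points without hand-written state handling.
#[try_stream(ok = DataChunk, error = ParserError)]
async fn parse_stream(mut payloads: impl Stream<Item = Vec<u8>> + Unpin) {
    let mut buf: Vec<u8> = Vec::new();
    while let Some(bytes) = payloads.next().await {
        buf.extend_from_slice(&bytes);
        // Pretend a chunk is ready once enough bytes have accumulated;
        // real logic would parse records and keep leftovers buffered.
        if buf.len() >= 1024 {
            buf.clear();
            yield DataChunk;
        }
    }
}
```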

Commit "rename header" (Co-authored-by: Eric Fu <[email protected]>)
@neverchanje

Isn't this feature implemented already? Will you finalize this document to reflect the up-to-date design?

@waruto210 closed this Feb 20, 2023