
Parallel file writing #311

Open
mkeskells opened this issue Oct 8, 2024 · 4 comments

Comments

@mkeskells (Contributor)

From the code and from observation, it looks like the file writing is serial. If there are lots of files to be written, throughput appears to be limited by per-file latency.

It seems to me that this could be changed fairly easily.

The record writing is currently done like this (in the GCS sink):

recordGrouper.records().forEach(this::flushFile);

and it could be changed to something like this:

recordGrouper
        .records()
        .entrySet()
        .parallelStream()
        .forEach(entry -> flushFile(entry.getKey(), entry.getValue()));

We probably want some more controls around this, e.g. limiting the degree of parallelism and making it optional; a rough sketch of what that could look like is below.
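For illustration only, here is a minimal sketch of such a control, assuming the existing recordGrouper and flushFile(filename, records) members, a hypothetical "file.flush.threads" setting, and imports of java.util.concurrent.ForkJoinPool and java.util.concurrent.ExecutionException. Running the parallel stream from inside a dedicated ForkJoinPool bounds its parallelism to that pool's size rather than the common pool's:

final int flushThreads = 4; // hypothetical value of a new "file.flush.threads" config
final ForkJoinPool flushPool = new ForkJoinPool(flushThreads);
try {
    // A parallel stream started from a task running in a dedicated ForkJoinPool
    // uses that pool's parallelism instead of the common pool's.
    flushPool.submit(() ->
        recordGrouper
                .records()
                .entrySet()
                .parallelStream()
                .forEach(entry -> flushFile(entry.getKey(), entry.getValue())))
            .get();
} catch (final InterruptedException e) {
    Thread.currentThread().interrupt();
    throw new RuntimeException("Interrupted while flushing files", e);
} catch (final ExecutionException e) {
    throw new RuntimeException("Failed to flush files", e);
} finally {
    flushPool.shutdown();
}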

What are the team's thoughts on this?

@aindriu-aiven (Contributor)

Hi @mkeskells, sorry for the wait in responding. I am currently looking at some improvements for this on the S3 sink connector, or at least I have been for the last couple of weeks. I have a PR open to optimize the use of the API now, and I am working on a follow-up PR that leverages it to reduce the memory overhead.

After that is done, I am hoping to do a similar exercise across the GCS and Azure Blob sinks.

On the specific suggestion of using parallelStream: there would be a small memory overhead, but it does look like an efficient way of speeding up the writes, especially for topic-partition files. I'd definitely want to test whether there are any issues when writing in parallel with key grouping, though, where we could be flushing a few thousand files at once; one way to bound that is sketched below.
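For illustration only, one way to cap how many files are flushed concurrently (relevant to the key-grouping case) is a fixed-size executor. This is a sketch under the same assumptions as above (existing recordGrouper and flushFile members, standard java.util.concurrent classes); maxConcurrentFlushes is a hypothetical tuning knob, not an existing config:

final int maxConcurrentFlushes = 8; // hypothetical tuning knob, not an existing setting
final ExecutorService flushExecutor = Executors.newFixedThreadPool(maxConcurrentFlushes);
final List<Future<?>> pending = new ArrayList<>();
recordGrouper.records().forEach(
        (filename, records) -> pending.add(flushExecutor.submit(() -> flushFile(filename, records))));
try {
    for (final Future<?> flush : pending) {
        flush.get(); // surface any upload failure before offsets are committed
    }
} catch (final InterruptedException e) {
    Thread.currentThread().interrupt();
    throw new RuntimeException("Interrupted while flushing files", e);
} catch (final ExecutionException e) {
    throw new RuntimeException("Failed to flush a file", e);
} finally {
    flushExecutor.shutdown();
}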

@mkeskells (Contributor, Author)

mkeskells commented Oct 29, 2024 via email

@aindriu-aiven (Contributor)

Thanks for letting me know. I think when you are ready we can put the PR up. I probably haven't been working on this project long enough to give the implementation a full thumbs up or down, but I can definitely prod the people who can make those assessments into giving feedback and reviews :)

@mkeskells (Contributor, Author)

Hi,
I have pushed a PR for this. It is a draft intended to discuss the approach, and not yet ready for a detailed review. It's in #319.
