Parallel file writing #311
Hi @mkeskells, sorry for the wait in responding. I am currently looking at some improvements for this on the S3 sink connector, or at least I have been for the last couple of weeks. I have a PR open to optimize the use of the API now, and I am working on a follow-up PR that leverages it to reduce the memory overheads. After that is done, I am hoping to do a similar exercise across the GCS and Azure Blob sinks. On specifically utilising parallelStream: there would be a small memory overhead, but it does look like it would be an efficient way of improving the writes, especially for writing topic-partition files. I'd definitely want to test whether there are any issues when writing in parallel with key grouping, though, where we could be flushing a few thousand files in parallel.
Hi Aindriu,
I have been working on a more radical rewrite of the way that files are written. It allows writes to occur in the background, supports back pressure (so that we can't run out of memory), adds timeouts for records, and requests flushes as files finish being written.
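(Not the actual rewrite, which hasn't been shared yet; just a minimal sketch of the bounded-queue back-pressure pattern described above, with all names hypothetical.)

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class BackgroundWriterSketch {
    // Bounded queue: when the background writers fall behind, put()
    // blocks the caller, and that blocking is the back pressure that
    // caps the number of buffered files (and therefore memory use).
    private final BlockingQueue<Runnable> pendingWrites = new ArrayBlockingQueue<>(64);
    private final ExecutorService writers = Executors.newFixedThreadPool(4);

    void start() {
        for (int i = 0; i < 4; i++) {
            writers.submit(() -> {
                while (!Thread.currentThread().isInterrupted()) {
                    pendingWrites.take().run(); // blocks until a file is queued
                }
                return null;
            });
        }
    }

    // Called from the connector's put() path; blocks when the queue is full.
    void enqueue(final Runnable uploadOneFile) throws InterruptedException {
        pendingWrites.put(uploadOneFile);
    }
}
```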
It's a bigger rewrite, and I think it is close to functionally complete, but it has had no reviews; it's just something that I have been working on. I am not sure of the best way to discuss this. I don't want to fork, but it is a significant change. I would think that all of the changes should also work on S3, since very little of it has to do with the actual writing of the files.
I am happy to share what I have, but I have work timescales to meet, so I may need to fork in the short term.
My work covers a replacement for this parallel writing, enhanced file naming, and the other previous PRs. The major drawback is that the file names and batching are not determined by the flush call, so there is a greater chance of duplicated files, but this should only happen under overload.
Please advise on how best to manage this. I am happy to discuss in any forum.
BTW, the parallel writing would only use as many threads as the executor has, which I believe defaults to the number of cores. In any case, I think that what I have is a significantly better and more reasoned approach to writing.
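(For reference, and assuming the default pool is used: parallelStream() runs on the common ForkJoinPool, whose parallelism defaults to availableProcessors() - 1, with a minimum of 1. A quick way to check on a given box:)

```java
import java.util.concurrent.ForkJoinPool;

public class PoolCheck {
    public static void main(String[] args) {
        // parallelStream() uses the common ForkJoinPool unless run inside
        // another pool; its parallelism can also be overridden with
        //   -Djava.util.concurrent.ForkJoinPool.common.parallelism=N
        System.out.println("cores: " + Runtime.getRuntime().availableProcessors());
        System.out.println("common pool parallelism: " + ForkJoinPool.commonPool().getParallelism());
    }
}
```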
Regards
Mike
Thanks for letting me know. I think when you are ready we can put the PR up. I am probably not working on this project long enough yet to be able to give the implementation a full thumbs up or down, but I can definitely prod the people who can make those assessments into giving feedback and reviews :)
Hi,
It looks, from the code and from observation, like the file writing is serial.
If there are lots of files to be written, then throughput seems to be limited by latency.
It seems to me that this could be easily changed.
The record writing is currently done like this (in the GCS sink):
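(The original snippet is not preserved in this copy of the thread; the sketch below is only an illustrative reconstruction of the serial pattern being described, with flushFile and the grouped-records map as assumed names.)

```java
import java.util.List;
import java.util.Map;

import org.apache.kafka.connect.sink.SinkRecord;

class SerialFlushSketch {
    // Stand-in for the sink's existing per-file upload (name assumed).
    void flushFile(final String filename, final List<SinkRecord> records) {
        // ... build the blob for these records and upload it, blocking until done ...
    }

    // Each grouped file is written one after another, so total flush time
    // is roughly (number of files) x (per-upload latency).
    void flushAll(final Map<String, List<SinkRecord>> groupedRecords) {
        groupedRecords.forEach(this::flushFile);
    }
}
```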
It could be changed like this:
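(Again a reconstruction, not the original snippet: the same flush fanned out with parallelStream(), as discussed above, so several uploads can be in flight at once.)

```java
import java.util.List;
import java.util.Map;

import org.apache.kafka.connect.sink.SinkRecord;

class ParallelFlushSketch {
    void flushFile(final String filename, final List<SinkRecord> records) {
        // ... same blocking upload as before; it must be thread-safe ...
    }

    // Fans the uploads out across the common ForkJoinPool; at most
    // pool-parallelism uploads run concurrently, which bounds the extra
    // memory and connection usage.
    void flushAll(final Map<String, List<SinkRecord>> groupedRecords) {
        groupedRecords.entrySet()
                .parallelStream()
                .forEach(entry -> flushFile(entry.getKey(), entry.getValue()));
    }
}
```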
Probably we want some more controls: limiting the grouping, making it optional, etc.
What are the thoughts of the team on this?