Parallel file writing #311
Hi @mkeskells, sorry for the wait in responding. I am currently looking at some improvements for this on the S3 sink connector, or at least I have been for the last couple of weeks. I have a PR open to optimize the use of the API now, and I am working on a follow-up PR that leverages it to reduce the memory overheads. After that is done, I am hoping to do a similar exercise across the GCS and Azure Blob sinks. On specifically utilising parallelStream: there would be a small memory overhead, but it does look like it would be an efficient way of improving the writes, especially for writing topic-partition files. I'd definitely want to test whether there are any issues when writing in parallel with key grouping, though, where we could be flushing a few thousand files in parallel.
Hi Aindriu,
I have been working on a more radical rewrite of the way that files are written. It allows writes to occur in the background, supports back pressure (so that we can't run out of memory), adds timeouts for records, and requests flushes as files finish being written.
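(Not the actual rewrite, which hasn't been shared yet; just a minimal sketch of the bounded-queue back-pressure pattern described above, with all names hypothetical.)

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class BackgroundWriterSketch {
    // Bounded queue: when the background writers fall behind, put()
    // blocks the caller, and that blocking is the back pressure that
    // caps the number of buffered files (and therefore memory use).
    private final BlockingQueue<Runnable> pendingWrites = new ArrayBlockingQueue<>(64);
    private final ExecutorService writers = Executors.newFixedThreadPool(4);

    void start() {
        for (int i = 0; i < 4; i++) {
            writers.submit(() -> {
                while (!Thread.currentThread().isInterrupted()) {
                    pendingWrites.take().run(); // blocks until a file is queued
                }
                return null;
            });
        }
    }

    // Called from the connector's put() path; blocks when the queue is full.
    void enqueue(final Runnable uploadOneFile) throws InterruptedException {
        pendingWrites.put(uploadOneFile);
    }
}
```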
It's a bigger rewrite, and I think it is close to functionally complete, but it has had no reviews; it's just something that I have been working on. I am not sure of the best way to discuss this. I don't want to fork, but it is a significant change. I would think that all of the changes should also work on S3, since very little of it has to do with the actual writing of the files.
I am happy to share what I have, but I have work timescales to meet, so I may need to fork in the short term.
My work covers a replacement for this parallel writing, enhanced file naming, and the other previous PRs. The major drawback is that the file names and batching are not determined by the flush call, so there is a greater chance of duplicated files, but this should only happen under overload.
Please advise on how best to manage this. I am happy to discuss in any forum.
BTW, the parallel writing would only use as many threads as the executor has, which I believe defaults to the number of cores. In any case, I think that what I have is a significantly better and more reasoned approach to writing.
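(For reference, and assuming the default pool is used: parallelStream() runs on the common ForkJoinPool, whose parallelism defaults to availableProcessors() - 1, with a minimum of 1. A quick way to check on a given box:)

```java
import java.util.concurrent.ForkJoinPool;

public class PoolCheck {
    public static void main(String[] args) {
        // parallelStream() uses the common ForkJoinPool unless run inside
        // another pool; its parallelism can also be overridden with
        //   -Djava.util.concurrent.ForkJoinPool.common.parallelism=N
        System.out.println("cores: " + Runtime.getRuntime().availableProcessors());
        System.out.println("common pool parallelism: " + ForkJoinPool.commonPool().getParallelism());
    }
}
```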
Regards
Mike
Thanks for letting me know. I think when you are ready we can put the PR up. I am probably not working on this project long enough yet to be able to give the implementation a full thumbs up or down, but I can definitely prod the people who can make those assessments into giving feedback and reviews :)
Hi,
It looks, from the code and from observation, like the file writing is serial.
If there are lots of files to be written, then throughput seems to be limited by latency.
It seems to me that this could be easily changed.
The record writing is currently done like this (in the GCS sink):
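(The original snippet is not preserved in this copy of the thread; the sketch below is only an illustrative reconstruction of the serial pattern being described, with flushFile and the grouped-records map as assumed names.)

```java
import java.util.List;
import java.util.Map;

import org.apache.kafka.connect.sink.SinkRecord;

class SerialFlushSketch {
    // Stand-in for the sink's existing per-file upload (name assumed).
    void flushFile(final String filename, final List<SinkRecord> records) {
        // ... build the blob for these records and upload it, blocking until done ...
    }

    // Each grouped file is written one after another, so total flush time
    // is roughly (number of files) x (per-upload latency).
    void flushAll(final Map<String, List<SinkRecord>> groupedRecords) {
        groupedRecords.forEach(this::flushFile);
    }
}
```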
It could be changed like this:
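(Again a reconstruction, not the original snippet: the same flush fanned out with parallelStream(), as discussed above, so several uploads can be in flight at once.)

```java
import java.util.List;
import java.util.Map;

import org.apache.kafka.connect.sink.SinkRecord;

class ParallelFlushSketch {
    void flushFile(final String filename, final List<SinkRecord> records) {
        // ... same blocking upload as before; it must be thread-safe ...
    }

    // Fans the uploads out across the common ForkJoinPool; at most
    // pool-parallelism uploads run concurrently, which bounds the extra
    // memory and connection usage.
    void flushAll(final Map<String, List<SinkRecord>> groupedRecords) {
        groupedRecords.entrySet()
                .parallelStream()
                .forEach(entry -> flushFile(entry.getKey(), entry.getValue()));
    }
}
```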
Probably we want some more controls: limiting the grouping, making it optional, etc.
What are the thoughts of the team on this?