🚀 The feature
Similar to #1044 (thanks @ejguan!) I propose to add a new datapipe that uses ThreadPoolExecutor to multithread mapping.

Motivation, pitch
Speed up mapping by using multithreading.

Alternatives
Three possible implementations come to mind:

1. Only allow batches as input and apply the operation to each element in the batch, then return the processed batch. One disadvantage of this is that the first item can only be returned once all operations in the batch have finished.
2. Use Executor.map. Its eager consumption of the input may change in a future Python version; see Make Executor.map work with infinite/large inputs correctly (python/cpython#74028) and bpo-29842: Make Executor.map less eager so it handles large/unbounded… (python/cpython#18566).
3. Use concurrent.futures.as_completed with a parameter like scheduled_tasks to schedule a finite number of tasks. This returns results as soon as they are completed but does not preserve order.

Which option do you prefer? We could of course also implement both option 1 and option 3.

Additional context
I am not sure how (if at all) the ThreadPoolExecutor interferes/interacts with the multiprocessing used in the DataLoader.
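Option 1 could be sketched roughly as follows. The function name and parameters here are illustrative, not torchdata API; the key point is that `Executor.map` preserves input order, so each batch is yielded deterministically, but only after all of its elements are done.

```python
from concurrent.futures import ThreadPoolExecutor

def threaded_batch_map(batches, map_fn, num_threads=4):
    """Apply map_fn to every element of each batch using a thread pool.

    Executor.map preserves input order, so each processed batch comes back
    in the same order as the source batch (option 1). The trade-off: a
    batch is only yielded once all of its elements have been processed.
    """
    with ThreadPoolExecutor(max_workers=num_threads) as executor:
        for batch in batches:
            yield list(executor.map(map_fn, batch))

# Example: square each element, batch-wise.
batches = [[1, 2, 3], [4, 5]]
result = list(threaded_batch_map(batches, lambda x: x * x))
# result == [[1, 4, 9], [16, 25]]
```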
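Option 3 might look like this sketch, which keeps at most `scheduled_tasks` futures in flight (the parameter name comes from the description above; the function name is an assumption). It uses `concurrent.futures.wait` for the refill loop rather than `as_completed`, since new futures are submitted as old ones finish.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

def threaded_map_unordered(iterable, map_fn, scheduled_tasks=8, num_threads=4):
    """Yield map_fn(x) as soon as each task completes (order NOT preserved).

    At most `scheduled_tasks` futures are in flight at once, so an
    infinite or very large input iterable is never consumed eagerly.
    """
    it = iter(iterable)
    with ThreadPoolExecutor(max_workers=num_threads) as executor:
        pending = set()
        # Prime the window with up to scheduled_tasks tasks.
        for _ in range(scheduled_tasks):
            try:
                pending.add(executor.submit(map_fn, next(it)))
            except StopIteration:
                break
        while pending:
            # Block until at least one task finishes, then refill the window.
            done, pending = wait(pending, return_when=FIRST_COMPLETED)
            for fut in done:
                yield fut.result()
                try:
                    pending.add(executor.submit(map_fn, next(it)))
                except StopIteration:
                    pass
```

Because results are yielded in completion order, downstream consumers see a nondeterministic ordering, which is the trade-off ejguan raises below.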
SvenDS9 changed the title ThreadMapperDatapipe → Add ThreadMapperIterDatapipe on Feb 24, 2023
> One disadvantage of this is that the first item can only be returned once all operations in the batch have finished.

Yeah. But, tbh, if we want to yield whenever the element is ready, users can always do `dp.map(map_fn).prefetch(buffer_size)`.
I think we want to preserve order within a batch (option 1), at least for now. Otherwise, the whole pipeline becomes nondeterministic. In the future, we might be able to design a mechanism like a global switch to enable/disable deterministic training.
> I am not sure how (if at all) the ThreadPoolExecutor interferes/interacts with multiprocessing used in the Dataloader.
I am not aware of any blocker for it. We can add more intensive tests to validate it.
As a follow-up, if you want to implement this DataPipe, we might want to benchmark the LAION example with both asyncio and threading. Then we will be able to give users a better recommendation on which to choose.
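A rough stdlib harness for that comparison could start along these lines. The `time.sleep`/`asyncio.sleep` calls stand in for an I/O-bound fetch such as a LAION image download; all names are illustrative and this is not the LAION example itself.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor

def io_bound(x):
    time.sleep(0.01)  # stand-in for a network fetch
    return x * 2

def run_threaded(items, workers=16):
    # Threading variant: Executor.map over the items.
    with ThreadPoolExecutor(max_workers=workers) as executor:
        return list(executor.map(io_bound, items))

async def _run_async(items, limit=16):
    sem = asyncio.Semaphore(limit)  # cap concurrency like max_workers
    async def one(x):
        async with sem:
            await asyncio.sleep(0.01)  # async stand-in for the same fetch
            return x * 2
    # gather preserves input order.
    return await asyncio.gather(*(one(x) for x in items))

def run_async(items):
    return asyncio.run(_run_async(items))

items = list(range(100))
for name, fn in [("threads", run_threaded), ("asyncio", run_async)]:
    start = time.perf_counter()
    out = fn(items)
    print(f"{name}: {time.perf_counter() - start:.3f}s")
```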