Create a buffering stage #15
Hey @stevenmanton, thanks! There are a couple of ways you could do this. Please update pypeln first.

Iterable Stage

You can slot an ordinary iterable transformation (here cytoolz.partition_all) between pypeln stages:

import functools as ft
import cytoolz as cz
from pypeln import asyncio_task as aio

print(
    range(100)
    | aio.map(lambda x: x)
    | ft.partial(cz.partition_all, 10)  # group elements into tuples of 10
    | aio.map(sum)
    | list
)
# [45, 145, 245, 345, 445, 545, 645, 745, 845, 945]

The performance loss of doing this in a real application should be negligible.

flat_map

from pypeln import asyncio_task as aio

def batch(x, list_acc, n):
    # Emit the accumulated batch once it reaches size n. Note that the
    # element that triggers the flush is dropped and a trailing partial
    # batch is never emitted, which is visible in the output below.
    if len(list_acc) == n:
        list_out = list(list_acc)
        list_acc.clear()
        yield list_out
    else:
        list_acc.append(x)

print(
    range(100)
    | aio.map(lambda x: x)
    | aio.flat_map(lambda x, list_acc: batch(x, list_acc, 10), on_start=lambda: [])
    | aio.map(sum)
    | list
)
# [45, 155, 265, 375, 485, 595, 705, 815, 925]

Here you are accumulating items on a shared list created by on_start.

Implementing this in Pypeline

I think having a ...
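For reference, cytoolz.partition_all on its own keeps the trailing, shorter batch rather than dropping it, so the iterable-stage approach does not lose elements at the end. A quick standalone check (illustrative only):

import cytoolz as cz

# the final group is shorter than 3 but is still emitted
print(list(cz.partition_all(3, range(8))))
# [(0, 1, 2), (3, 4, 5), (6, 7)]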
Yes! This works like a charm. You're the man! This was the missing piece I needed to refactor a bunch of code to use this library. It's really simple and elegant, so I appreciate all your efforts.
Hey @cgarciae, I don't suppose you got round to implementing this buffering/batching feature? I have the following case that would benefit greatly from it:

def windows(): ...                          # returns a list of time windows to look for tasks in
def window_lookup(): ...                    # yields zero or more tasks for each time window
def batch_remove_running_processes(): ...   # yields all tasks that aren't currently running (batch filtering)
def batch_queue_new_work(): ...             # puts new tasks onto a queue

(
    windows()
    | pypeln.thread.flat_map(window_lookup)
    | pypeln.thread.buffer(10)
    | pypeln.thread.flat_map(batch_remove_running_processes)
    | pypeln.thread.buffer(10)
    | pypeln.thread.flat_map(batch_queue_new_work)
    | list
)

# Alternatively
(
    windows()
    | pypeln.thread.flat_map(window_lookup)
    | pypeln.thread.flat_map(batch_remove_running_processes, batch=10)
    | pypeln.thread.flat_map(batch_queue_new_work, batch=10)
    | list
)

Because there is an unknown number of results from each function, we need to buffer at each stage in order to optimise DB operations. The solution proposed above only works if the number of results is divisible by the batch size, and in this case determining that ahead of time is not possible. As each stage finishes it needs to flush the buffer and return any remaining results. Is this still something you think could be added?
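One way to get this flush-on-exhaustion behaviour today, without any new pypeln API, is a plain generator wrapped around a stage's output. A minimal sketch (the batched helper is illustrative, not part of pypeln):

from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")

def batched(items: Iterable[T], size: int) -> Iterator[List[T]]:
    # Group items into lists of `size`, emitting whatever is left over
    # once the upstream iterable is exhausted.
    buf: List[T] = []
    for item in items:
        buf.append(item)
        if len(buf) == size:
            yield buf
            buf = []
    if buf:  # flush the final partial batch
        yield buf

print(list(batched(range(7), 3)))
# [[0, 1, 2], [3, 4, 5], [6]]

If pypeln stages are consumed as ordinary iterables (as the partition_all example above suggests), batched can sit between two stages in the same way.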
@cgarciae I think the ability to customize an iterable, not just map/flat_map over one element at a time from the previous stage, is needed, especially when I want to use pypeln for deep learning. (https://github.com/pytorch/pytorch/blob/47894bb16594fc4bd6045d739fba6e63bdf793a8/torch/utils/data/datapipes/iter/grouping.py#L68)
Love the package! Thanks for writing it.
I have a question that I've spent about a day poking at without any good ideas. I'd like to make a stage that buffers and batches records from previous stages. For example, let's say I have an iterable that emits records and a map stage that does some transformation to each record. What I'm looking for is a stage that would combine records into groups of, say, 100 for batch processing. In other words:
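Roughly this grouping behaviour, shown here as a toy sketch with a batch size of 3 and plain itertools standing in for the stage being asked about:

import itertools as it

records = range(10)                               # an iterable that emits records
mapped = (r * 2 for r in records)                 # the per-record map stage
# the missing piece: regroup the mapped stream into fixed-size batches
batches = iter(lambda: list(it.islice(mapped, 3)), [])
print([sum(b) for b in batches])                  # then process each batch, e.g. sum it
# [6, 24, 42, 18]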
Is this at all possible?
Thanks!