Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stream.fromIterator doesn't create proper chunks #2010

Closed
susuro opened this issue Aug 25, 2020 · 5 comments
Closed

Stream.fromIterator doesn't create proper chunks #2010

susuro opened this issue Aug 25, 2020 · 5 comments

Comments

@susuro
Copy link
Contributor

susuro commented Aug 25, 2020

Stream.fromIterator fetches values one by one, effectively creating single-value chunks. This has negative impact on performance, especially when working with primitive values.
The situation is even worse for Stream.fromBlockingIterator, because context switch for each fetched value increases the performance penalty.

I think this could be improved with following implementation of PartiallyAppliedFromIterator:

private[fs2] final class PartiallyAppliedFromIterator[F[_]](
    private val dummy: Boolean
) extends AnyVal {
  def apply[A](iterator: Iterator[A], chunkSize: Int = 1)(implicit F: Sync[F]): Stream[F, A] = {
    def getNextChunk(i: Iterator[A]): F[Option[(Chunk[A], Iterator[A])]] =
      F.delay {
        for (_ <- 1 to chunkSize if i.hasNext) yield i.next()
      }.map { s =>
        if (s.isEmpty) None else Some((Chunk.seq(s), i))
      }

    Stream.unfoldChunkEval(iterator)(getNextChunk)
  }
}

Similarly with PartiallyAppliedFromBlockingIterator.

If you find this solution OK I would be happy to prepare and submit a PR.

@mpilquist
Copy link
Member

Any thoughts on Stream.fromIterator(iterator).chunkN(n) instead?

@djspiewak
Copy link
Member

Any thoughts on Stream.fromIterator(iterator).chunkN(n) instead?

It's more general, but it would incur the chunk management overhead. It's going to be significantly slower than batching next calls within a tight loop. I'm a little skeptical of using Iterator in performance-sensitive code to begin with, but if you are, then chunkN isn't going to solve the problem.

@susuro
Copy link
Contributor Author

susuro commented Aug 26, 2020

I haven't done proper performance testing, only a few tests within a CSV parsing library, so my claims below my not be fully correct. However, the processing time I've measured for Stream.fromIterator(iterator).chunkN(chunkSize).flatMap(Stream.chunk) is about 2x shorter that using simple Stream.fromIterator(iterator), while the tight loop I proposed gives me 10x better result.
In case of fromBlockingIterator, chunkN doesn't provide any significant improvement while tight loop executes 100x faster.

You're probably right that we should avoid using iterator for performance-sensitive operations. But sometimes, in existing projects, the iterator may be the only interface you have available and even if the code isn't really performance-sensitive, making it more efficient can still be valuable.

@mpilquist
Copy link
Member

OK, let's move on to a PR then. Thanks for testing.

@susuro
Copy link
Contributor Author

susuro commented Aug 29, 2020

Solved wit PR #2013.

@susuro susuro closed this as completed Aug 29, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants