
Add Accelerated Kafka datasource. #330

Merged (28 commits) on Apr 30, 2020

Conversation

@chinmaychandak (Contributor) commented Apr 21, 2020

This PR adds an engine parameter to from_kafka_batched. If engine="cudf", messages (for now, only JSON is supported) will be read from Kafka directly into a cuDF dataframe.

custreamz.kafka has the exact same API as Confluent Kafka, so it serves as a drop-in replacement in from_kafka_batched with minimal duplication of code. But under the hood, it reads messages from librdkafka and directly uploads them to the GPU as a cuDF dataframe instead of gathering all the messages back from C++ into Python.

This essentially avoids the GIL contention we encountered in the Confluent Kafka consumer, enabling faster reads from Kafka with fewer processes. The accelerated reader also adheres to the current checkpointing mechanism.

Folks interested in trying out custreamz would benefit from this accelerated Kafka reader. If someone does not want to use GPUs, they can use streamz as is, with the default engine=None.

I am skipping tests in this PR, since gpuCI tests will run in cuDF/cuStreamz once things consolidate.
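The reader's behavior can be approximated on CPU with a plain-Python sketch (illustrative only; `json_batch_to_columns` is a hypothetical name, and the real `engine="cudf"` path parses JSON in C++ directly from librdkafka buffers into a cuDF dataframe):

```python
import json

# CPU stand-in for what engine="cudf" does on the GPU: turn a batch of
# JSON Kafka messages into one columnar batch. The real reader does this
# in C++ straight from librdkafka into a cuDF dataframe; this sketch
# only mirrors the input/output shape.
def json_batch_to_columns(messages):
    records = [json.loads(m) for m in messages]
    columns = {}
    for rec in records:
        for key, value in rec.items():
            columns.setdefault(key, []).append(value)
    return columns

batch = [b'{"id": 1, "temp": 20.5}', b'{"id": 2, "temp": 21.0}']
cols = json_batch_to_columns(batch)
# cols == {"id": [1, 2], "temp": [20.5, 21.0]}
```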

engine: str (None)
If engine is "cudf", streamz reads data (messages must be JSON) from Kafka
in an accelerated manner directly into cudf dataframes.
Please refer to API here: github.com/jdye64/cudf/blob/kratos/python/custreamz/custreamz/kafka.py
@chinmaychandak (Contributor, Author) commented Apr 21, 2020

This link would be updated in the future.

@chinmaychandak commented Apr 21, 2020

FYI: All Dask tests with @gen_cluster are failing for some reason. This is completely unrelated to this PR.

Any ideas how to resolve this? I am trying to look at this: https://distributed.dask.org/en/latest/develop.html#writing-tests

@chinmaychandak commented Apr 21, 2020

Okay, so installing pytest-tornasync and making the tests with @gen_cluster start with async def worked locally.

Can we install pytest-tornasync on Travis CI? If that's not the issue, I don't know why the tests are still failing.

@martindurant (Member)

Can we install pytest-tornasync on Travis CI?

I don't see why not

@chinmaychandak commented Apr 21, 2020

I don't see why not

I added it to travis, but tests are still failing. Can someone please take a look?

@chinmaychandak commented Apr 22, 2020

I see this got merged last week: dask/distributed#3706. But marking tests with async def should work.

Maybe we need to change from Python 3.6. Nope, tried it. Still doesn't work: https://travis-ci.org/github/python-streamz/streamz/builds/678266572

@@ -15,7 +15,7 @@
 @gen_cluster(client=True)
-def test_map(c, s, a, b):
+async def test_map(c, s, a, b):
Contributor

Why do we suddenly need the async keyword here when it wasn't needed before? I think the @gen_cluster decorator should make this function asynchronous, correct?

Contributor Author

There's been an upstream change in Dask that's causing tests unrelated to this PR to fail. I posted the relevant information and links in the comments.

@@ -503,7 +507,7 @@ def checkpoint_emit(_part):
     try:
         low, high = self.consumer.get_watermark_offsets(
             tp, timeout=0.1)
-    except (RuntimeError, ck.KafkaException):
+    except (RuntimeError, ck.KafkaException, ValueError):
         continue
Contributor

Do we need a sleep statement here to avoid spamming offset queries? Also, I wonder if we need logging here to explain the reason for the ValueError.
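The suggestion could look roughly like this sketch (`fetch_watermarks` and `FlakyConsumer` are hypothetical names for illustration, not code from the PR):

```python
import time

# Hedged sketch of the reviewer's suggestion: back off briefly after a
# failed watermark query instead of retrying immediately in a tight loop.
def fetch_watermarks(consumer, partitions, retry_delay=0.05):
    offsets = {}
    for tp in partitions:
        try:
            low, high = consumer.get_watermark_offsets(tp, timeout=0.1)
        except (RuntimeError, ValueError):
            time.sleep(retry_delay)  # avoid spamming the broker
            continue
        offsets[tp] = (low, high)
    return offsets

class FlakyConsumer:
    """Stand-in consumer: partition 1 raises, the others answer."""
    def get_watermark_offsets(self, tp, timeout=0.1):
        if tp == 1:
            raise ValueError("partition metadata not ready")
        return (0, 42)

offsets = fetch_watermarks(FlakyConsumer(), [0, 1, 2])
# offsets == {0: (0, 42), 2: (0, 42)}
```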

@@ -6,6 +6,7 @@ channels:
 dependencies:
   - python>=3.7
   - pytest
+  - pytest-tornasync
Contributor

Do we need to add this dependency because of the introduction of the async keyword?

.travis.yml Outdated
@@ -7,7 +7,7 @@ language: python
 matrix:
   include:
-    - python: 3.6
+    - python: 3.7
Contributor

What does this fix?

@@ -529,8 +533,14 @@ def checkpoint_emit(_part):
     def start(self):
         import confluent_kafka as ck
+        if self.engine == "cudf":
+            from custreamz import kafka
Member

Is this a complete replacement for ck? (does it include TopicPartition)

Contributor Author

Yes, API calls like commit, committed, and get_watermark_offsets are exactly the same as CK's, including the TopicPartition class. We've done this deliberately to keep code duplication as minimal as possible.

Member

ok, why not do from custreamz import kafka as ck and remove the conditionals?

Contributor Author

Oh, I see what you're saying. A little clarification here: the APIs of custreamz.kafka are exactly the same as CK's in the sense that they take TopicPartition object(s). But the TopicPartition class itself does not exist in custreamz.kafka.

@chinmaychandak (Contributor, Author) commented Apr 22, 2020

Deserialization happens inside these APIs, and there is no concept of TopicPartition since this is a librdkafka C++ implementation underneath: https://github.com/jdye64/cudf/blob/7307cbe4f33a67c7d31a8d48955d755d55b03e9d/python/custreamz/custreamz/kafka.py#L72
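The design point under discussion can be illustrated with stand-in classes (these stubs are not the real libraries): because both consumers expose the same method signatures, from_kafka_batched only needs one branch at construction time, and all later call sites stay identical.

```python
class CKConsumerStub:
    """Stand-in for the confluent_kafka consumer."""
    def get_watermark_offsets(self, tp, timeout=0.1):
        return (0, 100)  # pretend low/high offsets

class CudfConsumerStub:
    """Stand-in for the custreamz.kafka consumer: same signatures,
    so it acts as a drop-in replacement."""
    def get_watermark_offsets(self, tp, timeout=0.1):
        return (0, 100)

def make_consumer(engine):
    # Mirrors the conditional added in start(): one branch when the
    # consumer is constructed, identical calls afterwards.
    return CudfConsumerStub() if engine == "cudf" else CKConsumerStub()

low, high = make_consumer("cudf").get_watermark_offsets("topic-0")
# (low, high) == (0, 100)
```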

@CJ-Wright (Member)

Should these changes also be made in from_kafka?

@chinmaychandak (Contributor, Author)

Should these changes also be made in from_kafka?

There's no checkpointing in from_kafka. Also, I'm assuming people wanting to leverage GPUs for streaming would use from_kafka_batched, since reading from Kafka in batches is the general convention and is faster. Even with CPUs, I think from_kafka should only be used for naive, simple use cases where the incoming data is low-throughput.

@chinmaychandak (Contributor, Author)

I see this got merged last week: dask/distributed#3706. But marking tests with async def should work.

@CJ-Wright Any idea about how we can resolve this? Tests are passing locally.

@jsmaupin (Contributor)

It seems this was an effort to use native Python 3 features now that 2.x is no longer supported. Tornado stayed Python 2/3 compatible precisely by not using these newer Python 3 keywords.
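A minimal illustration of the pattern the change adopts (`fake_test_map` is a hypothetical name): the test body becomes a native coroutine, which an event loop can drive directly, instead of a Tornado-style generator built on yield.

```python
import asyncio

async def fake_test_map():
    # Stand-in for awaiting cluster operations inside the test body.
    await asyncio.sleep(0)
    return "ok"

# pytest-tornasync (or @gen_cluster) runs the coroutine on an event
# loop for you; here we drive it by hand.
result = asyncio.run(fake_test_map())
# result == "ok"
```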

codecov bot commented Apr 24, 2020

Codecov Report

Merging #330 into master will increase coverage by 0.63%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #330      +/-   ##
==========================================
+ Coverage   94.79%   95.42%   +0.63%     
==========================================
  Files          13       13              
  Lines        1671     1661      -10     
==========================================
+ Hits         1584     1585       +1     
+ Misses         87       76      -11     
Impacted Files          Coverage Δ
streamz/sources.py      97.23% <100.00%> (+1.46%) ⬆️
streamz/utils_test.py   94.18% <0.00%> (+6.40%) ⬆️

Continue to review full report at Codecov.

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3131c32...bd3814f. Read the comment docs.

@chinmaychandak

Okay, with the upstream Dask error fixed: dask/distributed#3738, all tests are now passing.

@martindurant (Member)

Clearly coverage suffers here since we don't test for CUDA on Travis. Indeed, that means that we take it on trust that this code works... Can you think of how we can get around that?

@chinmaychandak commented Apr 28, 2020

Clearly coverage suffers here since we don't test for CUDA on Travis. Indeed, that means that we take it on trust that this code works... Can you think of how we can get around that?

Yes, coverage definitely takes a small hit here. All the APIs that serve as a replacement for CK will be tested as part of cudf/custreamz nightly builds on gpuCI (we get a new build every 2-3 hours, so this is pretty reliable). If those pass, it is guaranteed that the individual APIs added in custreamz.kafka work fine; and since the read-in-batches-from-Kafka logic is the same on CPUs and GPUs, engine="cudf" should work fine too.

As for actually having a test for from_kafka_batched with engine="cudf", I do not know of any way we can do that here. @jdye64, any ideas?

@martindurant (Member)

At the minimum, I would suggest adding a badge to the front page (README) linking to those nightly tests, describing this somewhere, and adding no-coverage markers to GPU-specific branches.
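A no-coverage marker is just a comment that coverage.py recognizes; a hedged sketch (`select_engine` is a hypothetical name, not streamz code):

```python
# coverage.py excludes lines tagged "# pragma: no cover" from the
# coverage report, so a GPU-only branch that Travis cannot execute
# does not show up as a miss.
def select_engine(engine):
    if engine == "cudf":  # pragma: no cover  (exercised only on gpuCI)
        return "gpu"
    return "cpu"
```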

@chinmaychandak commented Apr 29, 2020

At the minimum, I would suggest adding a badge to the front page (README) linking to those nightly tests, describing this somewhere, and adding no-coverage markers to GPU-specific branches.

Done. Please have a look.

@@ -21,3 +21,5 @@ BSD-3 Clause
    :alt: Documentation Status
 .. |Version Status| image:: https://img.shields.io/pypi/v/streamz.svg
    :target: https://pypi.python.org/pypi/streamz/
+.. |RAPIDS custreamz gpuCI| image:: https://img.shields.io/badge/gpuCI-custreamz-green
+   :target: https://github.com/jdye64/cudf/blob/kratos/python/custreamz/custreamz/kafka.py
Contributor Author

This link will be updated in the future.

This will install all GPU dependencies, including streamz.

Please refer to RAPIDS custreamz.kafka API here:
github.com/jdye64/cudf/blob/kratos/python/custreamz/custreamz/kafka.py
Contributor Author

This link will be updated in the future.

@martindurant (Member)

@CJ-Wright , do you want another look?

@CJ-Wright (Member)

LGTM to the extent that I understand this.

@martindurant (Member)

OK, putting it in then - thank you.
I hope you accept responsibility for cuda-related issues, @chinmaychandak :)

@martindurant martindurant merged commit 51c3fcf into python-streamz:master Apr 30, 2020
@chinmaychandak

Thanks @martindurant, @CJ-Wright

@chinmaychandak commented Apr 30, 2020

I will keep the links to custreamz.kafka and the gpuCI up to date.
