
Add Accelerated Kafka datasource. #330

Merged (28 commits) on Apr 30, 2020

Conversation

@chinmaychandak (Contributor) commented Apr 21, 2020

This PR adds an engine parameter to from_kafka_batched. If engine="cudf", messages (for now, only JSON is supported) will be read from Kafka directly into a cuDF dataframe.

custreamz.kafka has the exact same API as Confluent Kafka, so it serves as a drop-in replacement in from_kafka_batched with minimal duplication of code. But under the hood, it reads messages from librdkafka and directly uploads them to the GPU as a cuDF dataframe instead of gathering all the messages back from C++ into Python.

This essentially avoids the GIL contention we encountered in the Confluent Kafka consumer, enabling faster reads from Kafka with fewer processes. The accelerated reader also adheres to the current checkpointing mechanism.

Folks interested in trying out custreamz would benefit from this accelerated Kafka reader. If someone does not want to use GPUs, they can use streamz as is, with the default engine=None.

I am skipping tests in this PR, since gpuCI tests will run in cuDF/cuStreamz once things consolidate.
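The reader's behavior can be approximated on CPU with a plain-Python sketch (illustrative only; `json_batch_to_columns` is a hypothetical name, and the real `engine="cudf"` path parses JSON in C++ directly from librdkafka buffers into a cuDF dataframe):

```python
import json

# CPU stand-in for what engine="cudf" does on the GPU: turn a batch of
# JSON Kafka messages into one columnar batch. The real reader does this
# in C++ straight from librdkafka into a cuDF dataframe; this sketch
# only mirrors the input/output shape.
def json_batch_to_columns(messages):
    records = [json.loads(m) for m in messages]
    columns = {}
    for rec in records:
        for key, value in rec.items():
            columns.setdefault(key, []).append(value)
    return columns

batch = [b'{"id": 1, "temp": 20.5}', b'{"id": 2, "temp": 21.0}']
cols = json_batch_to_columns(batch)
# cols == {"id": [1, 2], "temp": [20.5, 21.0]}
```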

engine: str (None)
If engine is "cudf", streamz reads data (messages must be JSON) from Kafka
in an accelerated manner directly into cudf dataframes.
Please refer to API here: github.com/jdye64/cudf/blob/kratos/python/custreamz/custreamz/kafka.py
@chinmaychandak (Contributor, Author) commented Apr 21, 2020

This link would be updated in the future.

@chinmaychandak commented Apr 21, 2020

FYI: All Dask tests with @gen_cluster are failing for some reason. This is completely unrelated to this PR.

Any ideas how to resolve this? I am trying to look at this: https://distributed.dask.org/en/latest/develop.html#writing-tests

@chinmaychandak commented Apr 21, 2020

Okay, so installing pytest-tornasync and making the tests with @gen_cluster start with async def worked locally.

Can we install pytest-tornasync on Travis CI? If that's not the issue, I don't know why the tests are still failing.

@martindurant (Member)

Can we install pytest-tornasync on Travis CI?

I don't see why not

@chinmaychandak commented Apr 21, 2020

I don't see why not

I added it to travis, but tests are still failing. Can someone please take a look?

@chinmaychandak commented Apr 22, 2020

I see this got merged last week: dask/distributed#3706. But marking tests with async def should work.

Maybe we need to change from Python 3.6. Nope, tried it. Still doesn't work: https://travis-ci.org/github/python-streamz/streamz/builds/678266572

@@ -15,7 +15,7 @@
 @gen_cluster(client=True)
-def test_map(c, s, a, b):
+async def test_map(c, s, a, b):
Contributor

Why do we suddenly need the async keyword here when it wasn't needed before? I think the @gen_cluster decorator should make this function asynchronous, correct?

Contributor Author

There's been an upstream change in Dask that's causing tests unrelated to this PR to fail. I posted the relevant information and links in the comments.

@@ -503,7 +507,7 @@ def checkpoint_emit(_part):
     try:
         low, high = self.consumer.get_watermark_offsets(
             tp, timeout=0.1)
-    except (RuntimeError, ck.KafkaException):
+    except (RuntimeError, ck.KafkaException, ValueError):
         continue
Contributor

Do we need a sleep statement here to avoid spamming offset queries? Also, I wonder if we need logging here to explain the reason for the ValueError.
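The suggestion could look roughly like this sketch (`fetch_watermarks` and `FlakyConsumer` are hypothetical names for illustration, not code from the PR):

```python
import time

# Hedged sketch of the reviewer's suggestion: back off briefly after a
# failed watermark query instead of retrying immediately in a tight loop.
def fetch_watermarks(consumer, partitions, retry_delay=0.05):
    offsets = {}
    for tp in partitions:
        try:
            low, high = consumer.get_watermark_offsets(tp, timeout=0.1)
        except (RuntimeError, ValueError):
            time.sleep(retry_delay)  # avoid spamming the broker
            continue
        offsets[tp] = (low, high)
    return offsets

class FlakyConsumer:
    """Stand-in consumer: partition 1 raises, the others answer."""
    def get_watermark_offsets(self, tp, timeout=0.1):
        if tp == 1:
            raise ValueError("partition metadata not ready")
        return (0, 42)

offsets = fetch_watermarks(FlakyConsumer(), [0, 1, 2])
# offsets == {0: (0, 42), 2: (0, 42)}
```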

@@ -6,6 +6,7 @@ channels:
 dependencies:
   - python>=3.7
   - pytest
+  - pytest-tornasync
Contributor

Do we need to add this dependency because of the introduction of the async keyword?

.travis.yml Outdated
@@ -7,7 +7,7 @@ language: python
 matrix:
   include:
-    - python: 3.6
+    - python: 3.7
Contributor

What does this fix?

@@ -529,8 +533,14 @@ def checkpoint_emit(_part):
     def start(self):
         import confluent_kafka as ck
+        if self.engine == "cudf":
+            from custreamz import kafka
Member

Is this a complete replacement for ck? (does it include TopicPartition)

Contributor Author

Yes, API calls like commit, committed, and get_watermark_offsets are exactly the same as CK's, including the TopicPartition class. We've done this deliberately to keep code duplication as minimal as possible.

Member

ok, why not do from custreamz import kafka as ck and remove the conditionals?

Contributor Author

Oh, I see what you're saying. A little clarification here: the APIs of custreamz.kafka are exactly the same as CK's in the sense that they take TopicPartition object(s). But the TopicPartition class itself does not exist in custreamz.kafka.

@chinmaychandak (Contributor, Author) commented Apr 22, 2020

Deserialization happens inside these APIs, and there is no concept of TopicPartition since this is a librdkafka C++ implementation underneath: https://github.com/jdye64/cudf/blob/7307cbe4f33a67c7d31a8d48955d755d55b03e9d/python/custreamz/custreamz/kafka.py#L72
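The design point under discussion can be illustrated with stand-in classes (these stubs are not the real libraries): because both consumers expose the same method signatures, from_kafka_batched only needs one branch at construction time, and all later call sites stay identical.

```python
class CKConsumerStub:
    """Stand-in for the confluent_kafka consumer."""
    def get_watermark_offsets(self, tp, timeout=0.1):
        return (0, 100)  # pretend low/high offsets

class CudfConsumerStub:
    """Stand-in for the custreamz.kafka consumer: same signatures,
    so it acts as a drop-in replacement."""
    def get_watermark_offsets(self, tp, timeout=0.1):
        return (0, 100)

def make_consumer(engine):
    # Mirrors the conditional added in start(): one branch when the
    # consumer is constructed, identical calls afterwards.
    return CudfConsumerStub() if engine == "cudf" else CKConsumerStub()

low, high = make_consumer("cudf").get_watermark_offsets("topic-0")
# (low, high) == (0, 100)
```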

@CJ-Wright (Member)

Should these changes also be made in from_kafka?

@chinmaychandak (Contributor, Author)

Should these changes also be made in from_kafka?

There's no checkpointing in from_kafka. Also, I'm assuming people wanting to leverage GPUs for streaming would use from_kafka_batched, since reading from Kafka in batches is the general convention and is faster. Even with CPUs, I think from_kafka should only be used for naive, simple use cases where the incoming data is low-throughput.

@chinmaychandak (Contributor, Author)

I see this got merged last week: dask/distributed#3706. But marking tests with async def should work.

@CJ-Wright Any idea about how we can resolve this? Tests are passing locally.

@jsmaupin (Contributor)

It seems this was an effort to use native Python 3 features now that 2.x is no longer supported. Tornado stayed Python 2/3 compatible precisely by not using these newer Python 3 keywords.
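A minimal illustration of the pattern the change adopts (`fake_test_map` is a hypothetical name): the test body becomes a native coroutine, which an event loop can drive directly, instead of a Tornado-style generator built on yield.

```python
import asyncio

async def fake_test_map():
    # Stand-in for awaiting cluster operations inside the test body.
    await asyncio.sleep(0)
    return "ok"

# pytest-tornasync (or @gen_cluster) runs the coroutine on an event
# loop for you; here we drive it by hand.
result = asyncio.run(fake_test_map())
# result == "ok"
```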

codecov bot commented Apr 24, 2020

Codecov Report

Merging #330 into master will increase coverage by 0.63%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #330      +/-   ##
==========================================
+ Coverage   94.79%   95.42%   +0.63%     
==========================================
  Files          13       13              
  Lines        1671     1661      -10     
==========================================
+ Hits         1584     1585       +1     
+ Misses         87       76      -11     
Impacted Files          Coverage Δ
streamz/sources.py      97.23% <100.00%> (+1.46%) ⬆️
streamz/utils_test.py   94.18% <0.00%> (+6.40%) ⬆️

Continue to review full report at Codecov.

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3131c32...bd3814f. Read the comment docs.

@chinmaychandak

Okay, with the upstream Dask error fixed: dask/distributed#3738, all tests are now passing.

@martindurant (Member)

Clearly coverage suffers here since we don't test for CUDA on Travis. Indeed, that means that we take it on trust that this code works... Can you think of how we can get around that?

@chinmaychandak commented Apr 28, 2020

Clearly coverage suffers here since we don't test for CUDA on Travis. Indeed, that means that we take it on trust that this code works... Can you think of how we can get around that?

Yes, coverage definitely takes a small hit here. All the APIs that serve as a replacement for CK will be tested as part of cudf/custreamz nightly builds on gpuCI (we get a new build every 2-3 hours, so this is pretty reliable). If those pass, it is guaranteed that the individual APIs added in custreamz.kafka work fine; and since the read-in-batches-from-Kafka logic is the same on CPUs and GPUs, engine="cudf" should work fine too.

As for actually having a test for from_kafka_batched with engine="cudf", I do not know of any way we can do that here. @jdye64, any ideas?

@martindurant (Member)

At the minimum, I would suggest adding a badge to the front page (README) linking to those nightly tests, describing this somewhere, and adding no-coverage markers to GPU-specific branches.
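A no-coverage marker is just a comment that coverage.py recognizes; a hedged sketch (`select_engine` is a hypothetical name, not streamz code):

```python
# coverage.py excludes lines tagged "# pragma: no cover" from the
# coverage report, so a GPU-only branch that Travis cannot execute
# does not show up as a miss.
def select_engine(engine):
    if engine == "cudf":  # pragma: no cover  (exercised only on gpuCI)
        return "gpu"
    return "cpu"
```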

@chinmaychandak commented Apr 29, 2020

At the minimum, I would suggest adding a badge to the front page (README) linking to those nightly tests, describing this somewhere, and adding no-coverage markers to GPU-specific branches.

Done. Please have a look.

@@ -21,3 +21,5 @@ BSD-3 Clause
    :alt: Documentation Status
 .. |Version Status| image:: https://img.shields.io/pypi/v/streamz.svg
    :target: https://pypi.python.org/pypi/streamz/
+.. |RAPIDS custreamz gpuCI| image:: https://img.shields.io/badge/gpuCI-custreamz-green
+   :target: https://github.com/jdye64/cudf/blob/kratos/python/custreamz/custreamz/kafka.py
Contributor Author

This link will be updated in the future.

This will install all GPU dependencies, including streamz.

Please refer to RAPIDS custreamz.kafka API here:
github.com/jdye64/cudf/blob/kratos/python/custreamz/custreamz/kafka.py
Contributor Author

This link will be updated in the future.

@martindurant (Member)

@CJ-Wright , do you want another look?

@CJ-Wright (Member)

LGTM to the extent that I understand this.

@martindurant (Member)

OK, putting it in then - thank you.
I hope you accept responsibility for cuda-related issues, @chinmaychandak :)

@martindurant martindurant merged commit 51c3fcf into python-streamz:master Apr 30, 2020
@chinmaychandak

Thanks @martindurant, @CJ-Wright

@chinmaychandak commented Apr 30, 2020

I will keep the links to custreamz.kafka and the gpuCI up to date.
