Added cyberpandas support #12
Conversation
* Updated example for change in import
* Added dependency
* Refactored dataframe column check to assert dtypes

Waiting on pandas-dev/pandas#20556
Building new pandas & cyberpandas packages for macOS now. This will hopefully pass when those are done.
Do we have any stress tests for this?
```python
known_types = {k: v for k, v in self.dtype.items()
               if k not in ('src_host', 'dst_host')}
df = df.astype(known_types)
df['src_host'] = to_ipaddress(df['src_host'])
```
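A minimal sketch of what this conversion amounts to, using the stdlib `ipaddress` module in place of `cyberpandas.to_ipaddress` (the sample frame and its columns are made up for illustration; the real data comes from the intake source):

```python
import ipaddress

import pandas as pd

# Hypothetical sample frame standing in for the parsed packet data.
df = pd.DataFrame({"src_host": ["192.168.0.1", "10.0.0.2"],
                   "length": ["60", "1500"]})

# Cast the columns with known dtypes first, leaving the IP columns as text.
known_types = {"length": "int64"}
df = df.astype(known_types)

# to_ipaddress does roughly this parse internally: each dotted-quad
# string becomes an integer address.
as_ints = df["src_host"].map(lambda s: int(ipaddress.ip_address(s)))
print(as_ints.tolist())  # [3232235521, 167772162]
```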
What is the input `df['src_host']` here? Are we parsing text, bytes, or just copying binary data? Seems like there ought to be a way of making this not-slow.
AFAICT, at this point `src_host` and `dst_host` are strings.
One issue is that the fastest way to build an IPArray is from a columnar, 64-bit aligned bytestring, whereas these seem to be record-based. I'll look a bit deeper (after profiling things).
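The columnar-vs-record distinction can be sketched with numpy alone (the byte buffer below is a made-up example; the real data would come from packet headers): a record-based buffer of packed addresses can be reinterpreted as a columnar integer array without copying.

```python
import numpy as np

# Hypothetical record-based buffer: each record is one IPv4 address
# packed as 4 big-endian bytes, as it would appear on the wire.
records = bytes([192, 168, 0, 1,
                 10, 0, 0, 2])

# Columnar view: reinterpret the whole buffer as big-endian uint32
# without copying, then widen to uint64 (cyberpandas stores addresses
# as 64-bit integers internally).
as_u32 = np.frombuffer(records, dtype=">u4")
as_u64 = as_u32.astype(np.uint64)
print(as_u64.tolist())  # [3232235521, 167772162]
```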
If the input is not already an array of the right structure (e.g., obtainable with `.view()`), it may be useful here to build the new columns as empty arrays and fill them, perhaps in a tight numba loop. That will generally be true for the rest of the dataframe too.
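The pre-allocate-and-fill pattern suggested above can be sketched as follows (the packet tuples and field names are invented for illustration; the loop body is the part a numba `@njit` decorator would compile for large inputs — plain Python is used here to keep the sketch dependency-light):

```python
import numpy as np

# Hypothetical record-based input, as it might come off a packet parser.
packets = [(3232235521, 60), (167772162, 1500)]

# Pre-allocate one empty column per field, then fill them in one pass
# instead of building intermediate Python lists per column.
src = np.empty(len(packets), dtype=np.uint64)
length = np.empty(len(packets), dtype=np.int64)
for i, (s, n) in enumerate(packets):
    src[i] = s
    length[i] = n

print(src.tolist(), length.tolist())
```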
Some rough timings. For the streaming example like …
Given that we don't have throughput requirements, I wouldn't worry too much about it right now, but clearly we have some ideas for acceleration if it becomes needed.
Yeah, on a larger local PCAP file (55 MB), things seem less drastic. It takes 1.78 / 2.82 seconds (63%), which I think is good enough for me at the moment.
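A fraction like the one quoted can be measured with a small timing harness such as this (the workload below is synthetic; the stage bodies are stand-ins, not the project's actual pipeline):

```python
import time

import pandas as pd

# Synthetic stand-in for a PCAP-derived frame; the real numbers above
# came from a 55 MB capture file.
n = 50_000
df = pd.DataFrame({"src_host": ["10.0.0.%d" % (i % 250) for i in range(n)],
                   "length": [60] * n})

start = time.perf_counter()
lengths = df["length"] * 2              # stand-in for the rest of the pipeline
mid = time.perf_counter()
parsed = df["src_host"].str.split(".")  # stand-in for the IP-conversion step
end = time.perf_counter()

total = end - start
parse_fraction = (end - mid) / total
print(f"IP conversion was {parse_fraction:.0%} of total time")
```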
https://cdn.rawgit.com/TomAugspurger/f87d38a4872621994a0b7d720d6d2c7d/raw/66219a41107bb6780edac964b96eadb83b3c180c/program.html may work. The relevant section is in the top-right. According to that, the …
Yes, I think the performance is fine for now. Are there any other concerns before merging this? Does it need to wait until pandas extensions/cyberpandas become mainstream?
If we're OK with people installing from git or …

Cyberpandas will be released on PyPI and conda-forge once pandas is released (hopefully 2-4 weeks).
That's probably OK for our purposes. I would expect that if you specify the dev (cyber)pandas in the requirements, but don't provide the channel to conda, then it should fall back to installing the previous version without the requirement. In any case, the situation is temporary.
Waiting on pandas-dev/pandas#20556 to be merged and included in the intake pandas conda package.