Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added cyberpandas support #12

Merged
merged 3 commits into from
Apr 25, 2018
Merged

Conversation

TomAugspurger
Copy link
Contributor

  • Updated example for change in import
  • Added dependency
  • refactored dataframe column check to assert dtypes

Waiting on pandas-dev/pandas#20556 to be merged and included in the intake pandas conda package.

* Updated example for change in import
* Added dependency
* refactored dataframe column check to assert dtypes

Waiting on pandas-dev/pandas#20556
@TomAugspurger
Copy link
Contributor Author

Building new pandas & cyberpandas packages for MacOS now. This will hopefully pass when those are done.

@TomAugspurger
Copy link
Contributor Author

Do we have any stress tests for this? cyperpandas.to_ipaddress may be somewhat slow. I haven't written a parser for IP addresses yet, and am just relying on the standard library's.

known_types = {k: v for k, v in self.dtype.items()
if k not in ('src_host', 'dst_host')}
df = df.astype(known_types)
df['src_host'] = to_ipaddress(df['src_host'])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the input df['src_host'] here? Are we parsing text, bytes or just copying binary data? Seems like there ought to be a way of making this not-slow.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAICT, at this point src_host and dst_host are strings.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One issue is that the fastest way to build an IPArray is from a columar 64-bit aligned bytestring, whereas these seem to be record based. I'll look a bit deeper (after profiling things).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the input is not already an array of the right structure (e.g., with .view()), it may be useful here to build the new columns using empty arrays and fill them, perhaps in a tight numba loop.

That will be generally true for the rest of the dataframe too.

@TomAugspurger
Copy link
Contributor Author

Some rough timings.

For python examples/read-pcap.py examples/local.pcap on a file with 523 lines, the to_ipaddress lines take ~20% of the time, which isn't great.

For the streaming example like python examples/read-live.py en0 tcp, the lines only take 0-1% of the time (though not sure how heavy of a load I was putting it under).

@martindurant
Copy link
Member

Given that we don't have throughput requirements, I wouldn't worry too much about it right now, but clearly we have some ideas for acceleration if it becomes needed.

@TomAugspurger TomAugspurger changed the title [WIP]: Added cyberpandas support Added cyberpandas support Apr 2, 2018
@TomAugspurger
Copy link
Contributor Author

Yeah, on a larger local PCAP file (55 M), things seem less drastic. It takes 1.78 / 2.82 seconds (0.63%).

Which I think is good enough for me at the moment.

@TomAugspurger
Copy link
Contributor Author

https://cdn.rawgit.com/TomAugspurger/f87d38a4872621994a0b7d720d6d2c7d/raw/66219a41107bb6780edac964b96eadb83b3c180c/program.html may work. The relevant section is in the top-right

According to that, the to_dataframe spends about 1/3 of it's time decoding, and 2/3 of it's time in to_ipaddress (with the remainder in .astype). Which doesn't look so good, but I think is fine for now.

@martindurant
Copy link
Member

Yes, I think the performance is fine for now. Are there any other concerns before merging this? Does it need to wait until pandas extensions/cyberpandas become mainstream?

@TomAugspurger
Copy link
Contributor Author

If we're OK with people installing from git or anaconda.org/intake then this should be fine to merge. Otherwise, we should wait.

Cyberpandas will be released on PyPI and conda-forge once pandas is released (hopefully 2-4 weeks).

@martindurant
Copy link
Member

That's probably OK for our purposes. I would expect that if you specify the dev (cyber)pandas in the requirements, but don't provide the channel to conda, then it should fall back to installing the previous version without the requirement. In any case, the situation is temporary.

@seibert seibert merged commit f0c4093 into intake:master Apr 25, 2018
@TomAugspurger TomAugspurger deleted the cyberpandas branch June 13, 2018 18:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants