Added cyberpandas support #12
Conversation
* Updated example for change in import
* Added dependency
* Refactored dataframe column check to assert dtypes

Waiting on pandas-dev/pandas#20556
Building new pandas & cyberpandas packages for macOS now. This will hopefully pass when those are done.
Do we have any stress tests for this?
```python
known_types = {k: v for k, v in self.dtype.items()
               if k not in ('src_host', 'dst_host')}
df = df.astype(known_types)
df['src_host'] = to_ipaddress(df['src_host'])
```
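A minimal sketch of what this conversion amounts to, using the stdlib `ipaddress` module in place of `cyberpandas.to_ipaddress` (the sample frame and its columns are made up for illustration; the real data comes from the intake source):

```python
import ipaddress

import pandas as pd

# Hypothetical sample frame standing in for the parsed packet data.
df = pd.DataFrame({"src_host": ["192.168.0.1", "10.0.0.2"],
                   "length": ["60", "1500"]})

# Cast the columns with known dtypes first, leaving the IP columns as text.
known_types = {"length": "int64"}
df = df.astype(known_types)

# to_ipaddress does roughly this parse internally: each dotted-quad
# string becomes an integer address.
as_ints = df["src_host"].map(lambda s: int(ipaddress.ip_address(s)))
print(as_ints.tolist())  # [3232235521, 167772162]
```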
What is the input `df['src_host']` here? Are we parsing text, bytes, or just copying binary data? Seems like there ought to be a way of making this not-slow.
AFAICT, at this point `src_host` and `dst_host` are strings.
One issue is that the fastest way to build an IPArray is from a columnar, 64-bit aligned bytestring, whereas these seem to be record-based. I'll look a bit deeper (after profiling things).
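The columnar-vs-record distinction can be sketched with numpy alone (the byte buffer below is a made-up example; the real data would come from packet headers): a record-based buffer of packed addresses can be reinterpreted as a columnar integer array without copying.

```python
import numpy as np

# Hypothetical record-based buffer: each record is one IPv4 address
# packed as 4 big-endian bytes, as it would appear on the wire.
records = bytes([192, 168, 0, 1,
                 10, 0, 0, 2])

# Columnar view: reinterpret the whole buffer as big-endian uint32
# without copying, then widen to uint64 (cyberpandas stores addresses
# as 64-bit integers internally).
as_u32 = np.frombuffer(records, dtype=">u4")
as_u64 = as_u32.astype(np.uint64)
print(as_u64.tolist())  # [3232235521, 167772162]
```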
If the input is not already an array of the right structure (e.g., obtainable with `.view()`), it may be useful here to build the new columns as empty arrays and fill them, perhaps in a tight numba loop. That will generally be true for the rest of the dataframe too.
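The pre-allocate-and-fill pattern suggested above can be sketched as follows (the packet tuples and field names are invented for illustration; the loop body is the part a numba `@njit` decorator would compile for large inputs — plain Python is used here to keep the sketch dependency-light):

```python
import numpy as np

# Hypothetical record-based input, as it might come off a packet parser.
packets = [(3232235521, 60), (167772162, 1500)]

# Pre-allocate one empty column per field, then fill them in one pass
# instead of building intermediate Python lists per column.
src = np.empty(len(packets), dtype=np.uint64)
length = np.empty(len(packets), dtype=np.int64)
for i, (s, n) in enumerate(packets):
    src[i] = s
    length[i] = n

print(src.tolist(), length.tolist())
```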
Some rough timings. For the streaming example like …
Given that we don't have throughput requirements, I wouldn't worry too much about it right now, but clearly we have some ideas for acceleration if it becomes needed.
Yeah, on a larger local PCAP file (55 MB), things seem less drastic. It takes 1.78 / 2.82 seconds (63%), which I think is good enough for me at the moment.
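A fraction like the one quoted can be measured with a small timing harness such as this (the workload below is synthetic; the stage bodies are stand-ins, not the project's actual pipeline):

```python
import time

import pandas as pd

# Synthetic stand-in for a PCAP-derived frame; the real numbers above
# came from a 55 MB capture file.
n = 50_000
df = pd.DataFrame({"src_host": ["10.0.0.%d" % (i % 250) for i in range(n)],
                   "length": [60] * n})

start = time.perf_counter()
lengths = df["length"] * 2              # stand-in for the rest of the pipeline
mid = time.perf_counter()
parsed = df["src_host"].str.split(".")  # stand-in for the IP-conversion step
end = time.perf_counter()

total = end - start
parse_fraction = (end - mid) / total
print(f"IP conversion was {parse_fraction:.0%} of total time")
```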
https://cdn.rawgit.com/TomAugspurger/f87d38a4872621994a0b7d720d6d2c7d/raw/66219a41107bb6780edac964b96eadb83b3c180c/program.html may work. The relevant section is in the top-right. According to that, the …
Yes, I think the performance is fine for now. Are there any other concerns before merging this? Does it need to wait until pandas extensions/cyberpandas become mainstream?
If we're OK with people installing from git or …

Cyberpandas will be released on PyPI and conda-forge once pandas is released (hopefully 2-4 weeks).
That's probably OK for our purposes. I would expect that if you specify the dev (cyber)pandas in the requirements, but don't provide the channel to conda, then it should fall back to installing the previous version without the requirement. In any case, the situation is temporary.
Waiting on pandas-dev/pandas#20556 to be merged and included in the intake pandas conda package.