Slow processing of large datasets #36

Closed
callumrollo opened this issue Nov 5, 2021 · 2 comments · Fixed by #120
Labels
enhancement New feature or request

Comments

@callumrollo
Collaborator

There appears to be a performance bottleneck in `raw_to_rawnc`; its docstring already notes that it is slow for large datasets. Because the function converts many small files one at a time, it looks like a good candidate for multiprocessing.
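A minimal sketch of the idea, assuming each raw file can be converted independently of the others. `convert_one` and `convert_all` are hypothetical names for illustration, not functions from the package, and the body of `convert_one` is a placeholder for the real per-file parsing and netCDF writing:

```python
from concurrent.futures import Executor, ProcessPoolExecutor
from pathlib import Path
from typing import Iterable, List, Optional


def convert_one(raw_path: str, out_dir: str) -> str:
    """Hypothetical stand-in for the per-file work done in raw_to_rawnc."""
    out_path = Path(out_dir) / (Path(raw_path).name + ".converted")
    # Placeholder: the real step would parse the raw file and write netCDF.
    out_path.write_bytes(Path(raw_path).read_bytes())
    return str(out_path)


def convert_all(
    raw_paths: Iterable[str],
    out_dir: str,
    executor: Optional[Executor] = None,
) -> List[str]:
    """Convert every file in parallel; results follow the sorted input order."""
    paths = sorted(raw_paths)
    owns_pool = executor is None
    # Default to one worker process per CPU core when no executor is supplied.
    pool = ProcessPoolExecutor() if owns_pool else executor
    try:
        # map() hands one file to each worker and preserves input order.
        return list(pool.map(convert_one, paths, [out_dir] * len(paths)))
    finally:
        if owns_pool:
            pool.shutdown()
```

When run as a script, the call to `convert_all` should sit under an `if __name__ == "__main__":` guard so worker processes can import the module safely. Whether this gives a real speed-up depends on how much of the time goes to parsing rather than disk I/O.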

@callumrollo
Collaborator Author

Update: I've recently been introduced to polars, which looks like an ideal solution for this. It's parallel by default and works effectively as a replacement for pandas. I'll work up a PR to use polars in place of pandas for the seaexplorer `raw_to_rawnc` step.

@jklymak
Member

jklymak commented Oct 6, 2022

We also introduced a method to subsample the data, which removes all the redundant data the Alseamars put out by default. I'm not sure whether that fixes the problem, and it doesn't mean we shouldn't also consider polars.
