Slow processing of large datasets #36

callumrollo · 2021-11-05T11:12:23Z

There appears to be a performance bottleneck in raw_to_rawnc; its docstring notes that it is slow for large datasets. Because this function converts many small files one at a time, it looks ideal for multiprocessing.
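As a rough illustration of the idea, here is a minimal sketch of spreading a per-file step across worker processes with the standard library. Both `convert_one` and `convert_all` are hypothetical names, not functions from this repository, and `convert_one` only counts data rows as a stand-in for the real parsing and netCDF writing.

```python
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path


def convert_one(path):
    """Hypothetical stand-in for converting a single raw file.

    It only counts the data rows (every line after the header). A real
    version would parse the file and write the converted output.
    """
    with open(path) as f:
        n_rows = sum(1 for _ in f) - 1  # subtract the header line
    return Path(path).name, max(n_rows, 0)


def convert_all(paths, max_workers=None, executor_cls=ProcessPoolExecutor):
    """Run convert_one on every file, one file per task.

    Files are independent of each other, so they can be handed to a pool
    of workers. executor_cls is a parameter so that a thread pool can be
    swapped in for testing or for I/O-bound work.
    """
    with executor_cls(max_workers=max_workers) as pool:
        # Executor.map returns results in the same order as the inputs.
        return list(pool.map(convert_one, paths))
```

On platforms that start workers by spawning a fresh interpreter (Windows, macOS), the call to `convert_all` should sit under an `if __name__ == "__main__":` guard. Whether this helps in practice depends on how much of the time is spent on per-file parsing rather than on the final merge.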

callumrollo · 2022-10-06T07:49:07Z

Update: I've recently been introduced to polars, which looks like an ideal solution for this. It's parallel by default and works effectively as a replacement for pandas. I'll work up a PR to use polars in place of pandas for the seaexplorer raw_to_rawnc step.

jklymak · 2022-10-06T17:42:05Z

We also introduced a method to subsample the data to remove all the redundant data the Alseamars put out by default. Not sure if that helps fix the problem. Not to say we shouldn't also consider polars.

callumrollo added the enhancement (New feature or request) label Nov 5, 2021

callumrollo mentioned this issue Nov 5, 2021

multiprocessing seaexplorer raw to rawnc #37

Closed

callumrollo mentioned this issue Oct 6, 2022

use polars for seaexplorer data file load #120

Merged

callumrollo closed this as completed in #120 Nov 4, 2022
