Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

use polars for seaexplorer data file load #120

Merged
merged 15 commits into from
Nov 4, 2022

Conversation

callumrollo
Copy link
Collaborator

@callumrollo callumrollo commented Oct 6, 2022

This is a big PR. More work needed to check that it's not changing data as the methods are not the same as pandas.

Testing on our datasets has shown a ~ 10 X speedup of processing large delayed mode datasets, and a ~5 X speedup with nrt data. I'll write some example code comparing the two methods along with more tests. This is almost certainly a sub-optimal implementation of polars, as I'm directly aping the existing flow for pandas.

Using polars also decreases disk usage of intermediate products by using parquet rather than .nc. It also has substantially lower memory usage than pandas, so should decrease overheads when processing large datasets.

This is designed to resolve #36

@callumrollo
Copy link
Collaborator Author

This is failing some tests at the moment. Some of those are expected (they use intermediate ncs which are now parquet files) others look like some errors with timestamp parsing and some other issues. I'll work on these

@callumrollo
Copy link
Collaborator Author

Just got to resolve the optional sensor-sepcific coarsening now

@jklymak
Copy link
Member

jklymak commented Oct 6, 2022

I wonder if this needs to be its own method? I'd need to be convinced that parquet files are a better intermediate format than netcdf. We need to be able to look at the intermediate files to see what is in the raw data - what is the workflow for parquet files? Just polars? I don't think xarray handles them. Does the efficiency go away if you use netcdf and polars?

We recently had this issue with the raw Alseamar files and solved it by subsampling the redundant information and reduced the raw files down from 10Mb each to 40kb

@callumrollo
Copy link
Collaborator Author

These are good points, thanks Jody. We've tried subsampling the raw alseamar files, but with the 16 Hz legato running for 3 week missions, we still end up with huge datasets and the load/merge step is taking several hours per dataset.

I'm starting to profile the code now. polars makes time savings over pandas when loading and merging the data. parquet files are quicker and more storage efficient than ncs to write and read, but they do need to be readable as intermediate products. This is achieved in a very similar way to pandas:

>>> df = pl.read_parquet("sea63_35_nrt_rawnc/sea063.0035.gli.sub.0142.parquet")
>>> df
shape: (107, 23)
┌─────────────────────┬──────────┬───────────────┬─────────┬─────┬────────┬─────────┬──────────┬──────┐
│ time                ┆ NavState ┆ SecurityLevel ┆ Heading ┆ ... ┆ AngPos ┆ Voltage ┆ Altitude ┆ fnum │
│ ---                 ┆ ---      ┆ ---           ┆ ---     ┆     ┆ ---    ┆ ---     ┆ ---      ┆ ---  │
│ datetime[μs]        ┆ i64      ┆ i64           ┆ f64     ┆     ┆ f64    ┆ f64     ┆ f64      ┆ i64  │
╞═════════════════════╪══════════╪═══════════════╪═════════╪═════╪════════╪═════════╪══════════╪══════╡
│ 2022-03-01 23:50:06 ┆ 117      ┆ 0             ┆ 152.82  ┆ ... ┆ -5.0   ┆ 28.5    ┆ -1.0     ┆ 142  │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2022-03-01 23:50:11 ┆ 110      ┆ 0             ┆ 154.58  ┆ ... ┆ 0.5    ┆ 28.5    ┆ -1.0     ┆ 142  │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2022-03-01 23:50:21 ┆ 110      ┆ 0             ┆ 151.92  ┆ ... ┆ 0.5    ┆ 28.4    ┆ -1.0     ┆ 142  │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2022-03-01 23:50:31 ┆ 110      ┆ 0             ┆ 156.13  ┆ ... ┆ 0.5    ┆ 28.4    ┆ -1.0     ┆ 142  │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ ...                 ┆ ...      ┆ ...           ┆ ...     ┆ ... ┆ ...    ┆ ...     ┆ ...      ┆ ...  │

to me this makes more sense, as at the load_merge stage there is no metadata to make the bulkier netcdfs necessary for these intermediate table-like files.

I'm working on a full demo at the moment. I'd like to make use of this speedup, but if it's incompatible with existing workflows I'll make it a separate function or just keep it for internal use at VOTO

@callumrollo
Copy link
Collaborator Author

I've put together a rough notebook profiling the polars implementation against the current main branch. It shows the performance differences using sub and raw data as well as showing what the intermediate parquet products look like when loaded in polars

https://github.com/callumrollo/pyglider_profile/blob/main/profile_pyglider.ipynb

It's pretty ugly, but shows the main points. I can work to make it portable//repeatable if desired

@jklymak
Copy link
Member

jklymak commented Oct 7, 2022

OK 5-10x is a big deal if you have 24 Hz data. If you do this, can you update the docs to explain what a parquet file is and a quick how-to on opening them? I think we can assume pandas and xarray knowledge (we are writing to netcdf), but parquet probably needs a paragraph of explanation.

Final question would be if parquet and pandas have any packaging issues? Does parquet work with a conda install pandas on the big three OS's? (Can't be worse than netcdf ;-)

@callumrollo
Copy link
Collaborator Author

I've had a look and it turns out pandas is able to read and write parquet files, so user shouldn't ever need to interact with polars. I'll add a paragraph to the docs explaining it.

I've not encountered problems adding polars to the build so far. I'll test it out on a windows machine today to be sure though. Polars has mature builds on PyPI and conda-forge

@callumrollo
Copy link
Collaborator Author

No issues making conda environments in pandas. I'm testing it on a range of our SeaExplorer datasets now. I'll mark ready for review once I'm satisfied it's performing well in production

@callumrollo callumrollo marked this pull request as ready for review October 12, 2022 13:19
@callumrollo
Copy link
Collaborator Author

I've tested this pretty extensively on our datasets now. I think it's good to go. @jklymak do you have any other requested changes?

@jklymak
Copy link
Member

jklymak commented Oct 12, 2022

Give me a day or two to look it over. Probably is all good.

Copy link
Member

@jklymak jklymak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This mostly looks good, just a couple of style things, and we probably need to change a top-level name.

I think we need to decide if we are used enough yet to need a deprecation cycle for removing things. I think we are probably fine to do this now, but at some point we can't change a bunch of underlying code that changes the output file type(!) without a deprecation cycle. In which case, code like this would be better as a new API and the old API kept around for users who need it, with a deprecation warning.

try:
out = pl.read_csv(f, sep=';')
except:
_log.warning(f'Could not read {f}')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to add the badfiles here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

badfiles added

except:
_log.warning(f'Could not read {f}')
continue
if "Timestamp" in out.columns:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this get a comment? Why do we need this check?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This check is for corrupted files. I encountered this once in ~ 50 missions I've processed so far. I've added a comment

if ftype == 'gli':
outx.to_netcdf(fnout[:-3] + '.nc', 'w')
if rawsub == 'raw' and dropna_subset is not None:
out = out.with_column(out.select(pl.col(dropna_subset).is_null().cast(pl.Int64))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs a comment, and if it doesn't cause a slowdown or extra writes maybe would benefit from unpacking into separate calls. Looks like you are dropping repeats, but its too many steps in one line for me to follow without spending too much time ;-)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added a comment on this. It's a bit convoluted looking, but it's just the polars equivalent of pandas.dropna. I didn't want to change the functionality of the dropna option in this PR. We can factor it out as a separate PR though?

post_1971 = df.filter(pl.col("time") > dt_1971)
if len(post_1971) == len(df):
return post_1971
return df.filter(pl.col("time") > dt_1971)


def merge_rawnc(indir, outdir, deploymentyaml, incremental=False, kind='raw'):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably rename this?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've renamed this to the more descriptive drop_pre_1971_samples

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant rename merege_rawnc.... Though that screws with older scripts, but I'd rename this merge_parquet or whatever, and then alias merge_rawnc so old scripts work

@callumrollo
Copy link
Collaborator Author

I've added some more comments and used badfiles if a file fails the load. I have no idea how these changes have caused a couple of the slocum tests to start failing

@callumrollo
Copy link
Collaborator Author

OK I think we're good to go now. @jklymak has this met your requested changes? In future I'll make these kind of changes as part of a deprecation cycle as you suggest

Copy link
Member

@jklymak jklymak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks close, just a couple more places to clarify the code.

post_1971 = df.filter(pl.col("time") > dt_1971)
if len(post_1971) == len(df):
return post_1971
return df.filter(pl.col("time") > dt_1971)


def merge_rawnc(indir, outdir, deploymentyaml, incremental=False, kind='raw'):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant rename merege_rawnc.... Though that screws with older scripts, but I'd rename this merge_parquet or whatever, and then alias merge_rawnc so old scripts work

val = _interp_gli_to_pld(sensor_sub, sensor, val2, indctd)
coarse_ints = np.arange(0, len(sensor)/coarsen_time, 1/coarsen_time).astype(int)
sensor_sub = sensor.with_columns(pl.lit(coarse_ints).alias("coarse_ints"))
sensor_sub_grouped = sensor_sub.with_column(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This block needs a comment as well... It says "smooth" oxygen above, but I'm not following this danse with coarse_ints etc.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah that makes more sense! I've renamed to merge_parquet and added merge_rawnc = merge_parquet so hopefully older scripts will still work. I've also added more comments to the variables coarsening

@jklymak
Copy link
Member

jklymak commented Nov 4, 2022

@callumrollo feel free to self merge, perhaps squash commits that you don't need, or just squash on merge. Thanks! we are about to try this out on some data sets that have been giving us problems. ping @hvdosser

@richardsc
Copy link
Collaborator

We have at least one dataset that has 16Hz legato data (plus 1Hz GPCTD), which would probably benefit from this change. Pinging @clayton to try this out on that mission.

@callumrollo callumrollo merged commit 0869839 into c-proof:main Nov 4, 2022
@jklymak
Copy link
Member

jklymak commented Nov 4, 2022

@richardsc @clayton This is now merged, and on master. It's not yet released, so you will need to install from the development branch. Let us know if that is problematic. If it manages to get through a few more glider setups we should release as well as the changes to the slocum processing.

@hvdosser hvdosser linked an issue Nov 16, 2022 that may be closed by this pull request
@callumrollo callumrollo deleted the callum-patch-33 branch November 29, 2024 08:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

SeaExplorer delayed mode time series data loss Slow processing of large datasets
4 participants