Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

replace xarray median and mean with numpy #419

Conversation

Kasra-Shirvanian
Copy link
Contributor

@Kasra-Shirvanian Kasra-Shirvanian commented Feb 16, 2025

This PR improves the performance of an example script by replacing xarray's mean() and median() functions with numpy's nanmean() and nanmedian(). The xarray functions apply reductions iteratively over subbranches, making them inefficient for simple statistical calculations. Additionally, the script was redundantly computing mean and median values twice (once for the plot and again for the legend).

Description

What is this PR

  • Bug fix
  • Addition of a new feature
  • Other

Why is this PR needed?

This change does not affect the main functionality of the toolbox but significantly improves the performance of the example script. It provides a better user experience for those trying out the examples or adapting them to their own datasets, leading to a better impression of the project.

What does this PR do?

Replaces xarray's mean() and median() with numpy.nanmean() and numpy.nanmedian().
Removed redundant calculations in the legend.

How has this PR been tested?

It has been tested on my local machine

Performance Improvement:

Before: Median computation took 13.41 seconds
After: Median computation now takes 0.00025 seconds

Checklist:

  • The code has been tested locally
  • Tests have been added to cover all new functionality
  • The documentation has been updated to reflect any changes
  • The code has been formatted with pre-commit

@Kasra-Shirvanian Kasra-Shirvanian marked this pull request as draft February 16, 2025 23:52
@Kasra-Shirvanian Kasra-Shirvanian marked this pull request as ready for review February 17, 2025 01:38
@Kasra-Shirvanian Kasra-Shirvanian marked this pull request as draft February 17, 2025 01:38
Copy link

codecov bot commented Feb 17, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 99.83%. Comparing base (4269319) to head (4cedd64).
Report is 1 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #419   +/-   ##
=======================================
  Coverage   99.83%   99.83%           
=======================================
  Files          22       22           
  Lines        1219     1219           
=======================================
  Hits         1217     1217           
  Misses          2        2           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@niksirbi
Copy link
Member

The xarray functions apply reductions iteratively over subbranches, making them inefficient for simple statistical calculations.

Thanks for drawing attention to this @Kasra-Shirvanian.

This PR prompted me to time the execution of numpy vs xarray ways of computing the mean and median on my machine. I found that you are right, numpy is faster, by a factor ranging from ~1.4 to ~5 (at least on the small dataset we use in this example).

Some outputs from my tests (run with IPython on an M2 silicon Mac, reported time is average of 1000 executions).

xarray mean: 0.094241 s
numpy mean: 0.066624 s
numpy is ~1.41 times faster

xarray median: 0.235287 s
numpy median: 0.094634 s
numpy is ~2.49 times faster

For some context, the purpose of the examples is mainly educational, and part of their aim is to get our users familiar with xarray syntax and methods, that's why we often reach for them despite the performance penalty. Another reason for preferring xarray is that it makes the code in the examples more readable, but this is more relevant for cases where statistics (like mean and median) are computed over a particular dimension. For example, data.mean(dim = "keypoints") is more explicit than np.nanmean(data, axis=2), at least in my opinion.

That said, the case you've highlighted only concerns 1D data, and perhaps there is value in also using numpy methods in some of our examples, because it highlights the fact that you can actually call numpy methods of xarray objects.

Additionally, the script was redundantly computing mean and median values twice (once for the plot and again for the legend).

You are right about that.

All in all I'm happy to merge this PR. feel free to mark it as "Ready for review" despite the failing action. That failure is unrelated to this PR.

@niksirbi
Copy link
Member

That failure is unrelated to this PR.

Actually, just re-running the failing github action fixed the problem. There was a problematic URL that now seems to be happy again.

@niksirbi niksirbi marked this pull request as ready for review February 17, 2025 15:18
Copy link
Member

@niksirbi niksirbi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is ready to merge from my point of view @Kasra-Shirvanian. Congrats on your first contribution to movement 🎉

@niksirbi niksirbi added this pull request to the merge queue Feb 17, 2025
Merged via the queue into neuroinformatics-unit:main with commit c51ed75 Feb 17, 2025
18 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants