Performance median method regridder worse than other methods #240

Closed
JoerivanEngelen opened this issue May 1, 2024 · 5 comments · Fixed by #270

@JoerivanEngelen
Contributor

JoerivanEngelen commented May 1, 2024

In the iMOD Python test bench, I noticed that changing the method from "mode" to "median" for one specific variable (idomain) caused the tests to take more than an hour instead of 20 minutes. This could be due to a faulty agent on TeamCity, but I thought it would be interesting to conduct some simple performance tests here. I noticed a TODO in the median method noting that this specific method might not be very performant, because np.nanpercentile is used instead of a custom implementation.

I modified the OverlapRegridder example to conduct some simple performance tests:

import time

import numpy as np
from matplotlib.tri import Triangulation

import xugrid as xu


def create_grid(bounds, nx, ny):
    """Create a simple grid of triangles covering a rectangle."""
    xmin, ymin, xmax, ymax = bounds
    dx = (xmax - xmin) / nx
    dy = (ymax - ymin) / ny
    x = np.arange(xmin, xmax + dx, dx)
    y = np.arange(ymin, ymax + dy, dy)
    y, x = [a.ravel() for a in np.meshgrid(y, x, indexing="ij")]
    faces = Triangulation(x, y).triangles
    return xu.Ugrid2d(x, y, -1, faces)

uda = xu.data.elevation_nl().ugrid.sel(
    x=slice(125_000, 225_000), y=slice(440_000, 500_000)
)
grid = create_grid(uda.ugrid.total_bounds, nx=1000, ny=1000)

functions = [
    "mean",
    "mean",  # Repeat first element to hack around JIT performance hits
    "harmonic_mean",
    "geometric_mean",
    "sum",
    "minimum",
    "maximum",
    "mode",
    "median",
    "max_overlap",
]

for f in functions:
    t0 = time.time()
    regridder = xu.OverlapRegridder(source=uda, target=grid, method=f)
    result = regridder.regrid(uda)
    t1 = time.time()
    print(f"{f}: {t1 - t0} seconds")

This printed:

mean: 4.471312522888184 seconds
mean: 0.17104887962341309 seconds
harmonic_mean: 0.717357873916626 seconds
geometric_mean: 0.8859498500823975 seconds
sum: 0.6982800960540771 seconds
minimum: 0.6448187828063965 seconds
maximum: 0.6634867191314697 seconds
mode: 1.0880956649780273 seconds
median: 9.672606706619263 seconds
max_overlap: 0.6971712112426758 seconds

Indeed, the median method is considerably slower than the rest. The next slowest is mode, but I think that is within the margin.

@JoerivanEngelen changed the title from "Performance tests regridder" to "Performance median method regridder slow" on May 1, 2024
@Huite
Collaborator

Huite commented May 1, 2024

I would've expected as much, essentially: median just wraps numpy's nanpercentile.

I even left a TODO there about better performance:

# TODO: more efficient implementation?

We're numba jitting, so it'll call the numba implementation. If you follow that link, you can see in the numba source code that it's allocating several numpy arrays (on the heap).
During regridding, this results in a huge number of tiny heap allocations (one allocation per target index), which obviously has bad performance characteristics.
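
For reference, the reduction boils down to something like this (paraphrasing, not the exact xugrid source), with np.nanpercentile allocating fresh temporaries for every target cell:

import numpy as np

def median(values, indices, weights):
    # TODO: more efficient implementation?
    # np.nanpercentile allocates temporary arrays internally, and this
    # function is called once per target cell during regridding.
    return np.nanpercentile(values[indices], 50)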

I've thought about pre-allocating a buffer and passing it to each function to use as a workspace instead. (In most cases, such a buffer is probably small enough to fit in a CPU cache.)
I think the challenge is that ideally we do the reduction in parallel (see the numba prange in make_regrid), which requires one buffer per parallel process and makes the parallelization less straightforward.

@JoerivanEngelen changed the title from "Performance median method regridder slow" to "Performance median method regridder worse than other methods" on May 1, 2024
@Huite
Collaborator

Huite commented May 1, 2024

Actually, it looks like I'm mistaken.

make_regrid uses a closure to insert the reduction function f here:

    def _regrid(source: FloatArray, A: MatrixCSR, size: int):
        n_extra = source.shape[0]
        out = np.full((n_extra, size), np.nan)
        for extra_index in numba.prange(n_extra):
            source_flat = source[extra_index]
            for target_index in range(A.n):
                slice = row_slice(A, target_index)
                indices = A.indices[slice]
                weights = A.data[slice]
                if len(indices) > 0:
                    out[extra_index, target_index] = f(source_flat, indices, weights)
        return out

Numba parallelization occurs over the n_extra dimension (the flattened extra dimensions, e.g. time and layer).

The following would work fine, since the buffer is allocated once per parallel iteration:

    def _regrid(source: FloatArray, A: MatrixCSR, size: int):
        n_extra = source.shape[0]
        out = np.full((n_extra, size), np.nan)
        for extra_index in numba.prange(n_extra):
            source_flat = source[extra_index]
            buffer = np.empty(buffersize)  # one workspace per parallel iteration
            for target_index in range(A.n):
                slice = row_slice(A, target_index)
                indices = A.indices[slice]
                weights = A.data[slice]
                if len(indices) > 0:
                    out[extra_index, target_index] = f(source_flat, indices, weights, buffer)
        return out

Then we can provide some additional memory to each reduction function.
This would also allow us to write the mode implementation in a better way, since it currently also allocates a copy.

The trickiest part is deciding the size of the buffer.
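
To sketch what a buffer-aware reduction could look like (hypothetical code, not necessarily what the eventual fix will do): median could gather the source values into the workspace and sort in place, avoiding the allocations inside np.nanpercentile.

import numpy as np
import numba

@numba.njit
def median(values, indices, weights, workspace):
    # Gather the non-NaN source values for this target cell into the
    # pre-allocated workspace instead of allocating a temporary array.
    n = 0
    for i in indices:
        v = values[i]
        if not np.isnan(v):
            workspace[n] = v
            n += 1
    if n == 0:
        return np.nan
    # Sort only the filled part of the workspace in place and take the middle.
    filled = workspace[:n]
    filled.sort()
    if n % 2 == 1:
        return filled[n // 2]
    return 0.5 * (filled[n // 2 - 1] + filled[n // 2])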

@Huite
Collaborator

Huite commented May 8, 2024

Actually, deciding on the size of the buffer is probably also quite easy. In the sparse matrix, the largest number of non-zeros in any row suffices.
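
For a CSR matrix that is simply the largest difference between consecutive row pointers. A toy illustration, with scipy standing in for the regridder's own CSR structure:

import numpy as np
from scipy import sparse

# Toy weight matrix standing in for the regridder's sparse matrix.
A = sparse.random(1_000, 1_000, density=0.01, format="csr")

# The largest number of non-zeros in any row bounds the workspace size.
buffersize = int(np.diff(A.indptr).max())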

@Huite
Collaborator

Huite commented Aug 5, 2024

Much better now:

mean: 4.8980393409729 seconds
mean: 0.19949793815612793 seconds
harmonic_mean: 0.20173168182373047 seconds
geometric_mean: 0.2833566665649414 seconds
sum: 0.19977402687072754 seconds
minimum: 0.1972644329071045 seconds
maximum: 0.18817400932312012 seconds
mode: 0.24338030815124512 seconds
median: 0.2827489376068115 seconds
max_overlap: 0.20089316368103027 seconds

@Huite
Collaborator

Huite commented Aug 5, 2024

Making a local copy now; the better memory access patterns speed up some methods a little:

mean: 0.1990649700164795 seconds
mean: 0.1802206039428711 seconds
harmonic_mean: 0.1920933723449707 seconds
geometric_mean: 0.2039186954498291 seconds
sum: 0.17996907234191895 seconds
minimum: 0.17102718353271484 seconds
maximum: 0.17496109008789062 seconds
mode: 0.20006632804870605 seconds
median: 0.19997835159301758 seconds
max_overlap: 0.17799925804138184 seconds
