
Mean opacity for variable kappa based on piecewise-power-law approximation #626

Merged: 50 commits merged into development on May 13, 2024

Conversation

@chongchonghe (Contributor) commented May 2, 2024

Description

This PR implements a piecewise-power-law approximation for multigroup radiation transfer. We calculate the Planck-mean, energy-mean, and flux-mean opacities for an arbitrary function kappa(nu, T, rho) based on piecewise-power-law fits to kappa(nu), E(nu), and F(nu). The user enables this routine by setting OpacityModel to OpacityModel::piecewisePowerLaw. See the examples in test_radhydro_shock_multigroup.cpp and test_radhydro_pulse_MG_int.hpp. The original approach, in which the opacities are defined via ComputePlanckOpacity, ComputeFluxMeanOpacity, etc., is retained by setting OpacityModel to OpacityModel::user, which is the default.

We calculate the power-law slope of a radiation quantity via the following relations:

$$ \alpha_{i} = {\rm minmod}(s_{i-1/2}, s_{i+1/2}) $$

where

$$ s_{i+1/2} = \frac{\ln \frac{X_{i+1}}{\nu_{i+3/2} - \nu_{i+1/2}} - \ln \frac{X_i}{\nu_{i+1/2} - \nu_{i-1/2}}}{\ln \sqrt{\nu_{i+1/2} \nu_{i+3/2}} - \ln \sqrt{\nu_{i-1/2} \nu_{i+1/2}}} $$

I will try to construct a better estimate of the power-law slope in a separate PR. The user can override this behaviour by defining their own RadSystem<problem_t>::ComputeRadQuantityExponents; see the example in test_radhydro_pulse_MG_int.cpp.
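To make the recipe concrete, here is a minimal stand-alone sketch of this slope estimate (the array layout and function names are illustrative assumptions, not the actual RadSystem interface; `X[g]` is the radiation quantity integrated over group `g`, and `nu[g]`, `nu[g+1]` are its edges):

```cpp
#include <cmath>

// minmod limiter: returns 0 if the two slopes disagree in sign, else the smaller-magnitude one
auto minmod(double a, double b) -> double
{
    if (a * b <= 0.0) {
        return 0.0;
    }
    return std::abs(a) < std::abs(b) ? a : b;
}

// s_{g+1/2}: log-log slope of the mean spectral density across the interface between
// groups g and g+1, using geometric bin centers sqrt(nu_left * nu_right)
auto slope_at_interface(const double *X, const double *nu, int g) -> double
{
    const double dens_lo = X[g] / (nu[g + 1] - nu[g]);
    const double dens_hi = X[g + 1] / (nu[g + 2] - nu[g + 1]);
    const double nuc_lo = std::sqrt(nu[g] * nu[g + 1]);
    const double nuc_hi = std::sqrt(nu[g + 1] * nu[g + 2]);
    return (std::log(dens_hi) - std::log(dens_lo)) / (std::log(nuc_hi) - std::log(nuc_lo));
}

// alpha_g = minmod(s_{g-1/2}, s_{g+1/2})
auto quantity_exponent(const double *X, const double *nu, int g) -> double
{
    return minmod(slope_at_interface(X, nu, g - 1), slope_at_interface(X, nu, g));
}
```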

The power-law fit to $\kappa(\nu)$ can be specified in two ways. The first method is to specify, for every photon bin, its power-law slope and its value at the lower bin edge, given $T$ and $\rho$. The second method is to specify an exact definition of $\kappa(\nu, T, \rho)$; a power-law fit to $\kappa(\nu)$ is then constructed on the fly as follows:

$$ \begin{gather} \kappa_{L}^i = \kappa(\nu_{i-1/2}, T, \rho) \\ \alpha^i_{\kappa} = \frac{\ln(\kappa(\nu_{i+1/2}, T, \rho) / \kappa(\nu_{i-1/2},T,\rho))}{\ln(\nu_{i+1/2} / \nu_{i-1/2})} \\ \kappa_{{\rm fit}}^i(\nu) = \kappa_{L}^i \left( \frac{\nu}{\nu_{i-1/2}} \right)^{\alpha_{\kappa}^i} \end{gather} $$

This is certainly not the most accurate approach, since it is not volume-conservative, but it is simple, and simplicity matters because the fit is recomputed for every cell. An alternative is to construct $\kappa_{L,i}$ and $\alpha_{\kappa,i}$ once at initialization if $\kappa$ does not depend on $T$ or $\rho$.
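For illustration, a minimal sketch of the second method's on-the-fly fit (the struct, function names, and the callable signature for kappa are illustrative assumptions, not the PR's actual interface):

```cpp
#include <cmath>

struct PowerLawFit {
    double kappa_lower; // kappa_L^i: opacity at the lower group edge
    double alpha;       // alpha_kappa^i: power-law slope within the group
};

// Fit kappa(nu) in group g by evaluating the user-supplied kappa(nu, T, rho)
// at the two group edges and taking the power law through those two points.
template <typename KappaFn>
auto fit_kappa_in_group(KappaFn &&kappa, const double *nu_edges, int g, double T, double rho) -> PowerLawFit
{
    const double k_lo = kappa(nu_edges[g], T, rho);
    const double k_hi = kappa(nu_edges[g + 1], T, rho);
    const double alpha = std::log(k_hi / k_lo) / std::log(nu_edges[g + 1] / nu_edges[g]);
    return {k_lo, alpha};
}

// kappa_fit^i(nu) = kappa_L^i * (nu / nu_{i-1/2})^{alpha_kappa^i}
auto eval_fit(const PowerLawFit &fit, double nu, double nu_lower_edge) -> double
{
    return fit.kappa_lower * std::pow(nu / nu_lower_edge, fit.alpha);
}
```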

Related issues

N/A

Checklist

Before this pull request can be reviewed, all of these tasks should be completed. Denote completed tasks with an x inside the square brackets [ ] in the Markdown source below:

  • I have added a description (see above).
  • I have added a link to any related issues (see above).
  • I have read the Contributing Guide.
  • I have added tests for any new physics that this PR adds to the code.
  • I have tested this PR on my local computer and all tests pass.
  • I have manually triggered the GPU tests with the magic comment /azp run.
  • I have requested a reviewer for this PR.

@github-actions bot left a comment


clang-tidy made some suggestions

src/radiation_system.hpp (suggestion resolved)

Azure Pipelines successfully started running 5 pipeline(s).

@chongchonghe (Contributor, Author)

/azp run


Azure Pipelines successfully started running 5 pipeline(s).

@chongchonghe (Contributor, Author)

> Is there any way this can be done with an approximate logarithm function? The issue is that transcendental functions on GPUs have to use special hardware units that are not able to process as many threads simultaneously as the normal arithmetic units can, so there can be a severe performance cost.
> If this is not possible (and I think it might not be), can you benchmark the radiation solver performance before/after this PR on either AMD GPUs or NVIDIA GPUs (or both)?

Yes. $\log(x)$ can be approximated with

$$ \log(x) = \lim_{a \to 0} \frac{x^a - 1}{a} $$

By setting $a = 10^{-7}$, the relative error is below 4e-6 for $x$ between 1e-30 and 1e30, according to Mathematica. @BenWibking Do you think this is worth doing?
I was doing the opposite in ComputeGroupMeanOpacity: I need to evaluate this expression, and when a < 1e-8 I switch to using std::log(x). See here:

part1 = std::log(radBoundaryRatios[g]);

Unfortunately, I think std::pow has the same issue, so I don't think that will help.

Any approximate form should either be a rational polynomial interpolant, or else use std::ldexp and std::frexp, which can be used to approximately compute 2^x and log base 2: https://en.cppreference.com/w/cpp/numeric/math/ldexp.

In that case, I'll leave that for future work. If you can approximate log_2(x), then you can approximate log(x) with log(x) = log_2(x) * log(2).
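For reference, a minimal sketch of the frexp-based direction suggested above (illustration only, not code from this PR; the truncated series and its accuracy bound are assumptions):

```cpp
#include <cmath>
#include <cstdio>

// Approximate ln(x) without transcendental instructions: split x = m * 2^e with
// std::frexp (m in [0.5, 1)), then use a short atanh series for ln(m).
auto approx_log(double x) -> double
{
    int e = 0;
    const double m = std::frexp(x, &e);
    const double z = (m - 1.0) / (m + 1.0);
    const double z2 = z * z;
    // truncated atanh series; absolute error in ln(m) is below ~1.5e-4 for m in [0.5, 1)
    const double ln_m = 2.0 * z * (1.0 + z2 / 3.0 + z2 * z2 / 5.0);
    return static_cast<double>(e) * 0.6931471805599453 + ln_m; // e * ln(2) + ln(m)
}

auto main() -> int
{
    std::printf("approx ln(10)   = %.6f (std::log gives %.6f)\n", approx_log(10.0), std::log(10.0));
    std::printf("approx ln(1e-5) = %.6f (std::log gives %.6f)\n", approx_log(1.0e-5), std::log(1.0e-5));
    return 0;
}
```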

For now, I'll do a quick benchmark as you suggested.

@chongchonghe (Contributor, Author)

The 3D GPU test is taking a very long time. This PR is fine: pulse_MG_int takes 2.5 hours on gadi. However, the old pulse_MG test is projected to take 100 hours to finish. I suspect this is because of its tight resid_tol of 1e-13, which I changed to 1e-11 in this PR. I have now adjusted the resid_tol of the old pulse_MG test and will wait until it finishes to compare with the new test.

@chongchonghe (Contributor, Author)

OK. I did a benchmark on gadi gpuvolta of the old pulse-MG test (bin-centered opacity) and the new pulse-MG-int test (bin-integrated opacity with the piecewise power law). The elapsed times are 45 min vs 37 min. The new test is faster, but I realized this is not a fair comparison: the old test runs two simulations, each with 4 photon groups, while the new test runs two simulations, one with 4 photon groups and one with 1 group (the exact solution, where kappa is integrated over the whole spectrum). The amount of computation is roughly proportional to nGroups, so the elapsed times per group are: old: 45/8 = 5.625 min, new: 37/5 = 7.4 min. The difference is not large, so I wouldn't worry too much about transcendental functions on the GPU.

@github-actions bot left a comment


clang-tidy made some suggestions

@chongchonghe (Contributor, Author)

/azp run


Azure Pipelines successfully started running 5 pipeline(s).

@BenWibking (Collaborator) commented May 10, 2024

> OK. I did a benchmark on gadi gpuvolta of the old pulse-MG test (bin-centered opacity) and the new pulse-MG-int test (bin-integrated opacity with the piecewise power law). The elapsed times are 45 min vs 37 min. The new test is faster, but I realized this is not a fair comparison: the old test runs two simulations, each with 4 photon groups, while the new test runs two simulations, one with 4 photon groups and one with 1 group (the exact solution, where kappa is integrated over the whole spectrum). The amount of computation is roughly proportional to nGroups, so the elapsed times per group are: old: 45/8 = 5.625 min, new: 37/5 = 7.4 min. The difference is not large, so I wouldn't worry too much about transcendental functions on the GPU.

Can you run a test with a 128^3 grid (with blocking_factor=128 and max_grid=128)? For a performance benchmark, it is not necessary to run the whole problem; $\sim 10$ timesteps are enough.

@chongchonghe (Contributor, Author)

@BenWibking OK, here are the test results on gadi with 1 GPU, using a 128^3 grid run for 10 time steps. Both the total run time and the μs/zone-update are very similar, so the conclusion is that this PR introduces no significant performance regression.

Old, bin-centered approach, opacity_model = OpacityModel::user:

Performance figure-of-merit: 0.7211298889 μs/zone-update [1.386712734 Mupdates/s]
Zone-updates on level 0: 20971520

Relative L1 error norm = 1.646387658e-05
elapsed time: 1459.157303 seconds.

New, opacity_model = OpacityModel::piecewisePowerLaw:

Performance figure-of-merit: 0.8290508373 μs/zone-update [1.206198649 Mupdates/s]
Zone-updates on level 0: 20971520

Relative L1 error norm = 1.60272655e-05
elapsed time: 1394.70028 seconds.

With 4 GPUs, the performance is similar, too. These runs take about 4x less wall time, although the reported μs/zone-update is larger.

Old:

Performance figure-of-merit: 17.29543015 μs/zone-update [0.05781874122 Mupdates/s]
Zone-updates on level 0: 20971520

New:

Performance figure-of-merit: 16.46246597 μs/zone-update [0.06074424098 Mupdates/s]
Zone-updates on level 0: 20971520

@BenWibking (Collaborator) commented May 12, 2024

Ok, that seems reasonable to me.

As a final check, can you see how many iterations per zone there are on average for your benchmark runs?

When I tested the radiation code in the methods paper using 128^3 grids, I got 40 million updates/sec/GPU. However, that problem was optically thin, so the number of iterations should have been close to 1.

@chongchonghe (Contributor, Author) commented May 12, 2024

> Ok, that seems reasonable to me.
>
> As a final check, can you see how many iterations per zone there are on average for your benchmark runs?
>
> When I tested the radiation code in the methods paper using 128^3 grids, I got 40 million updates/sec/GPU. However, that problem was optically thin, so the number of iterations should have been close to 1.

@BenWibking Do you mean the number of Newton-Raphson iterations? How would you check that? I don't see a global counter for it.

@BenWibking (Collaborator)

> Ok, that seems reasonable to me.
> As a final check, can you see how many iterations per zone there are on average for your benchmark runs?
> When I tested the radiation code in the methods paper using 128^3 grids, I got 40 million updates/sec/GPU. However, that problem was optically thin, so the number of iterations should have been close to 1.
>
> @BenWibking Do you mean the number of Newton-Raphson iterations? How would you check that? I don't see a global counter for it.

Yes. There currently isn't one, but there is one for the number of cooling iterations, which you could very nearly copy and paste from:

amrex::Print() << fmt::format("\tcooling substeps (per cell): avg {}, max {}\n", navg, nmax);
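For concreteness, a minimal host-side sketch of such a diagnostic (purely illustrative; the per-cell counts, variable names, and the printed label are assumptions, and the GPU reduction that a real counter would need is omitted):

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

int main()
{
    // stand-in for per-cell Newton-Raphson iteration counts gathered during the update
    std::vector<int> iters_per_cell = {3, 4, 2, 5, 3};

    long nsum = 0;
    int nmax = 0;
    for (int n : iters_per_cell) {
        nsum += n;
        nmax = std::max(nmax, n);
    }
    const double navg = static_cast<double>(nsum) / static_cast<double>(iters_per_cell.size());

    // report average and maximum, mirroring the cooling-substep diagnostic above
    std::printf("\tNewton-Raphson iterations (per cell): avg %.2f, max %d\n", navg, nmax);
    return 0;
}
```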

@chongchonghe (Contributor, Author)

> Ok, that seems reasonable to me.
> As a final check, can you see how many iterations per zone there are on average for your benchmark runs?
> When I tested the radiation code in the methods paper using 128^3 grids, I got 40 million updates/sec/GPU. However, that problem was optically thin, so the number of iterations should have been close to 1.
>
> @BenWibking Do you mean the number of Newton-Raphson iterations? How would you check that? I don't see a global counter for it.
>
> Yes. There currently isn't one, but there is one for the number of cooling iterations, which you could very nearly copy and paste from:
>
> amrex::Print() << fmt::format("\tcooling substeps (per cell): avg {}, max {}\n", navg, nmax);

OK, I'll try that later. Gadi is currently down.

@chongchonghe (Contributor, Author)

/azp run


Azure Pipelines successfully started running 5 pipeline(s).

@chongchonghe (Contributor, Author)

I did the same test with the RadForce problem instead. This is in the optically-thin limit, and the performance is 8.3 Mupdates/s. Note that this is done with 2 photon groups, and the computational complexity is expected to scale linearly with the number of photon groups (with a ~2.5x jump from 1 group to 2 groups due to the additional setup for multigroup, e.g. interpolating the Planck integral). This brings this test to roughly 20 Mupdates/s/GPU per group. The same applies to the RadhydroPulseMGint problem presented above, which uses 4 groups.

Old:

Performance figure-of-merit: 0.119798939 μs/zone-update [8.347319334 Mupdates/s]
Zone-updates on level 0: 20971520

New (piecewise power law):

Performance figure-of-merit: 0.1201915345 μs/zone-update [8.320053521 Mupdates/s]
Zone-updates on level 0: 20971520

@BenWibking (Collaborator)

> I did the same test with the RadForce problem instead. This is in the optically-thin limit, and the performance is 8.3 Mupdates/s. Note that this is done with 2 photon groups, and the computational complexity is expected to scale linearly with the number of photon groups (with a ~2.5x jump from 1 group to 2 groups due to the additional setup for multigroup, e.g. interpolating the Planck integral). This brings this test to roughly 20 Mupdates/s/GPU per group. The same applies to the RadhydroPulseMGint problem presented above, which uses 4 groups.
>
> Old:
>
> Performance figure-of-merit: 0.119798939 μs/zone-update [8.347319334 Mupdates/s]
> Zone-updates on level 0: 20971520
>
> New (piecewise power law):
>
> Performance figure-of-merit: 0.1201915345 μs/zone-update [8.320053521 Mupdates/s]
> Zone-updates on level 0: 20971520

Ok, I am satisfied with this.

(Sometime in the future, we should figure out why it's a factor of ~2 slower than the version from the original paper.)

@chongchonghe (Contributor, Author)

@markkrumholz Should we merge this PR? I think the only thing I haven't addressed is the piecewise power-law reconstruction of the radiation quantities, which I haven't finished and will leave for the next PR.

@markkrumholz (Collaborator)

> @markkrumholz Should we merge this PR? I think the only thing I haven't addressed is the piecewise power-law reconstruction of the radiation quantities, which I haven't finished and will leave for the next PR.

Will look today and approve. It’s first thing in the morning here.

@markkrumholz added this pull request to the merge queue May 13, 2024
Merged via the queue into development with commit 66da980 May 13, 2024
21 checks passed
@chongchonghe deleted the chongchong/kappa-int-v8.2 branch September 27, 2024 03:44