implement precompute = POLYNOMIAL for blocking = false #94

Closed
JakobAsslaender opened this issue Mar 23, 2022 · 5 comments

Comments

@JakobAsslaender
Contributor

Thanks, @tknopp, for the comments in #93! Indeed, for calling calculateToeplitzKernel! repeatedly, the settings precompute = LINEAR, blocking = false are the fastest. Hence, it would be awesome to have an implementation for precompute = POLYNOMIAL, blocking = false, if you say that is faster!
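For reference, a minimal sketch of the two configurations being compared, assuming the NFFT.jl keyword names m, σ, precompute, and blocking as used in this thread; the problem size and nodes are placeholders:

```julia
using NFFT

# Placeholder problem size and nodes (2D, random nodes in [-0.5, 0.5)).
N = (256, 256)
M = 20_000
k = rand(2, M) .- 0.5

# The two configurations discussed here; m and σ are the default values.
p_linear = plan_nfft(k, N; m = 4, σ = 2.0,
                     precompute = NFFT.LINEAR, blocking = false)
p_poly   = plan_nfft(k, N; m = 4, σ = 2.0,
                     precompute = NFFT.POLYNOMIAL, blocking = false)

# Repeated use, e.g. inside calculateToeplitzKernel!, only updates the nodes:
nodes!(p_linear, rand(2, M) .- 0.5)
```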

Off topic: Many thanks for promptly merging and fixing the other issues! Would you mind releasing a patch version to make it easier to incorporate the bugfixes?

@tknopp
Member

tknopp commented Mar 24, 2022

The release is triggered.

Regarding the other implementation: I will give this a go once I find some time. I want to have it for completeness anyway. Whether it will help in your case is not 100% clear; it depends on the fraction of time the precomputation currently takes.

tknopp added a commit that referenced this issue Mar 26, 2022
@tknopp
Member

tknopp commented Mar 26, 2022

I gave this a go. If you plan to benchmark, it would be great to do so in a systematic way. I would be particularly interested in the impact on precomputation time and runtime. I expect almost zero precomputation cost. The runtime should not change for small m and should be better for large m. In your application, however, I doubt that you need a large m.
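One possible way to probe this m dependence systematically (a sketch only, not from the thread; sizes are placeholders and the measurements use BenchmarkTools):

```julia
using NFFT, LinearAlgebra, BenchmarkTools

N = (256, 256)                 # placeholder problem size
k = rand(2, 20_000) .- 0.5     # placeholder nodes
d = randn(ComplexF64, size(k, 2))
x = zeros(ComplexF64, N)

for pre in (NFFT.LINEAR, NFFT.POLYNOMIAL), m in (2, 4, 6, 8)
    # precomputation cost = time to build the plan
    t_pre = @belapsed plan_nfft($k, $N; m = $m, σ = 2.0,
                                precompute = $pre, blocking = false)
    p = plan_nfft(k, N; m = m, σ = 2.0, precompute = pre, blocking = false)
    # runtime = one adjoint NFFT
    t_run = @belapsed mul!($x, adjoint($p), $d)
    println("precompute = $pre, m = $m: plan $t_pre s, adjoint mul! $t_run s")
end
```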

@JakobAsslaender
Contributor Author

Many thanks! I am not sure along which axes of the large parameter space you want me to run systematic benchmarks. In my real-world example, with the default settings m=4 and sigma=2, LINEAR and POLYNOMIAL seem to be roughly the same speed. In this case, I plan the NFFT once, copy it 40x (one for each thread), and then have a parallel for loop in which I call nodes! followed by one mul!(x, adjoint(p), d).
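A minimal sketch of that workflow, with placeholder sizes; it assumes that copy works on the plan (as the per-thread copies above imply) and uses the :static scheduler so that indexing by threadid() stays valid:

```julia
using NFFT, LinearAlgebra

N     = (256, 256)                     # placeholder image size
M     = 20_000                         # placeholder number of nodes per set
nsets = 40                             # 40 node sets, as described above

p0    = plan_nfft(rand(2, M) .- 0.5, N; m = 4, σ = 2.0,
                  precompute = NFFT.POLYNOMIAL, blocking = false)
plans = [copy(p0) for _ in 1:Threads.nthreads()]   # one plan per thread

ks = [rand(2, M) .- 0.5 for _ in 1:nsets]          # placeholder node sets
ds = [randn(ComplexF64, M) for _ in 1:nsets]       # placeholder k-space data
xs = [zeros(ComplexF64, N) for _ in 1:nsets]       # reconstructed images

# :static pins each iteration to one thread, so each thread reuses its own plan.
Threads.@threads :static for i in 1:nsets
    p = plans[Threads.threadid()]
    nodes!(p, ks[i])                   # update this thread's plan to node set i
    mul!(xs[i], adjoint(p), ds[i])     # adjoint NFFT: data -> image
end
```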

@tknopp
Member

tknopp commented Mar 27, 2022

What would be interesting is:

  • planning time
  • copying time
  • runtime (multi- and single-threaded)

My feeling is that in your case the last part takes more than 90%.
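A sketch of how these three timings could be measured for one configuration (not from the thread; sizes are placeholders, copy on the plan is assumed to be available as in the workflow above, and the script would be run with different `julia --threads` settings to compare single- and multi-threaded runtime):

```julia
using NFFT, LinearAlgebra, BenchmarkTools

N = (256, 256)                 # placeholder problem size
k = rand(2, 20_000) .- 0.5
d = randn(ComplexF64, size(k, 2))
x = zeros(ComplexF64, N)

# planning time
t_plan = @belapsed plan_nfft($k, $N; m = 4, σ = 2.0,
                             precompute = NFFT.POLYNOMIAL, blocking = false)
p = plan_nfft(k, N; m = 4, σ = 2.0,
              precompute = NFFT.POLYNOMIAL, blocking = false)

# copying time (assumes the plan supports copy, as used per thread above)
t_copy = @belapsed copy($p)

# runtime of one adjoint NFFT; rerun with different thread counts to compare
t_run = @belapsed mul!($x, adjoint($p), $d)

println("plan: $t_plan s, copy: $t_copy s, adjoint mul!: $t_run s")
```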

@tknopp
Member

tknopp commented Jun 19, 2022

This feature is implemented.

@tknopp tknopp closed this as completed Jun 19, 2022