Various CUDA Optimizations #1

lilinitsy · 2025-02-16T01:00:12Z

I've added a number of CUDA optimizations for LS/CE/AOV.

The main changes are lightcurve batching for Lomb-Scargle and the use of asynchronous CUDA streams (including async memory transfers), which together give some nice performance improvements.
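For readers unfamiliar with the pattern, here is a minimal sketch of batching work across asynchronous streams. It is not the code from this PR: `scale_kernel`, `process_batched`, and the `CUDA_CHECK` macro are hypothetical names, and the trivial kernel only stands in for a real periodogram kernel. It assumes the host buffers are pinned (allocated with `cudaMallocHost`), since copies from pageable memory generally cannot overlap with kernel execution.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Abort with a message on any CUDA runtime error (hypothetical helper).
#define CUDA_CHECK(call)                                                  \
    do {                                                                  \
        cudaError_t err_ = (call);                                        \
        if (err_ != cudaSuccess) {                                        \
            std::fprintf(stderr, "CUDA error: %s (%s:%d)\n",              \
                         cudaGetErrorString(err_), __FILE__, __LINE__);   \
            std::exit(EXIT_FAILURE);                                      \
        }                                                                 \
    } while (0)

// Hypothetical stand-in for a per-batch kernel: out[i] = in[i] * factor.
__global__ void scale_kernel(const float* in, float* out, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * factor;
}

// Splits n elements into one chunk per stream. Each stream performs
// copy-in, kernel, copy-out for its chunk, so transfers in one stream
// can overlap with compute in another.
// h_in and h_out are assumed to be pinned host memory.
void process_batched(const float* h_in, float* h_out, int n,
                     int n_streams, float factor)
{
    const int chunk = (n + n_streams - 1) / n_streams;
    float* d_in = nullptr;
    float* d_out = nullptr;
    CUDA_CHECK(cudaMalloc(&d_in, n * sizeof(float)));
    CUDA_CHECK(cudaMalloc(&d_out, n * sizeof(float)));

    std::vector<cudaStream_t> streams(n_streams);
    for (auto& s : streams) CUDA_CHECK(cudaStreamCreate(&s));

    for (int s = 0; s < n_streams; ++s) {
        const int offset = s * chunk;
        const int count = (offset + chunk <= n) ? chunk : n - offset;
        if (count <= 0) break;  // more streams than chunks of work

        CUDA_CHECK(cudaMemcpyAsync(d_in + offset, h_in + offset,
                                   count * sizeof(float),
                                   cudaMemcpyHostToDevice, streams[s]));

        const int threads = 256;
        const int blocks = (count + threads - 1) / threads;
        scale_kernel<<<blocks, threads, 0, streams[s]>>>(
            d_in + offset, d_out + offset, count, factor);
        CUDA_CHECK(cudaGetLastError());

        CUDA_CHECK(cudaMemcpyAsync(h_out + offset, d_out + offset,
                                   count * sizeof(float),
                                   cudaMemcpyDeviceToHost, streams[s]));
    }

    // Synchronize once, outside the submission loop, and only then
    // destroy the streams and free device memory.
    for (auto& s : streams) {
        CUDA_CHECK(cudaStreamSynchronize(s));
        CUDA_CHECK(cudaStreamDestroy(s));
    }
    CUDA_CHECK(cudaFree(d_in));
    CUDA_CHECK(cudaFree(d_out));
}
```

A caller would allocate `h_in` and `h_out` with `cudaMallocHost`, fill `h_in`, and call something like `process_batched(h_in, h_out, n, 4, 2.0f)`. The best number of streams and the chunk size depend on the device and are worth tuning per GPU.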

Smaller improvements come from using `__restrict__` on pointers where it is appropriate, and from replacing slow math calls with GPU intrinsic functions (e.g., `__sincosf`).
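As an illustration of these two smaller changes, the sketch below shows a kernel with `__restrict__`-qualified pointers that calls the `__sincosf` intrinsic. The names `phase_sincos` and `compute_phase` are hypothetical and do not come from this PR. The phase is reduced to [-pi, pi] before the call as a precaution, on the assumption that the fast intrinsics lose accuracy for large arguments.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: for each observation time t[i], compute the sine
// and cosine of the phase 2*pi*freq*t[i].
// __restrict__ promises the compiler that the three arrays do not alias,
// which lets it reorder loads/stores and use the read-only data path.
__global__ void phase_sincos(const float* __restrict__ t,
                             float* __restrict__ s_out,
                             float* __restrict__ c_out,
                             int n, float freq)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // Reduce the phase to [-0.5, 0.5] cycles before scaling, so the
        // intrinsic only sees arguments in [-pi, pi].
        float cycles = freq * t[i];
        cycles -= rintf(cycles);
        float s, c;
        __sincosf(6.2831853f * cycles, &s, &c);  // fast, lower-precision
        s_out[i] = s;
        c_out[i] = c;
    }
}

// Host wrapper (hypothetical): copies inputs to the device, launches the
// kernel, copies results back, and returns the first error encountered.
cudaError_t compute_phase(const float* h_t, float* h_s, float* h_c,
                          int n, float freq)
{
    float *d_t = nullptr, *d_s = nullptr, *d_c = nullptr;
    const size_t bytes = n * sizeof(float);
    cudaError_t err = cudaMalloc(&d_t, bytes);
    if (err == cudaSuccess) err = cudaMalloc(&d_s, bytes);
    if (err == cudaSuccess) err = cudaMalloc(&d_c, bytes);
    if (err == cudaSuccess)
        err = cudaMemcpy(d_t, h_t, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        const int threads = 256;
        const int blocks = (n + threads - 1) / threads;
        phase_sincos<<<blocks, threads>>>(d_t, d_s, d_c, n, freq);
        err = cudaGetLastError();
    }
    if (err == cudaSuccess)
        err = cudaMemcpy(h_s, d_s, bytes, cudaMemcpyDeviceToHost);
    if (err == cudaSuccess)
        err = cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_t);  // cudaFree(nullptr) is a no-op
    cudaFree(d_s);
    cudaFree(d_c);
    return err;
}
```

Because the intrinsic trades accuracy for speed, it is worth comparing results against the standard `sincosf` on representative data before adopting it in a kernel.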

All tests were run on a V100 on the SDSC Expanse GPU cluster.

TIMINGS

| METHOD | VERSION | MACHINE | DATA | TIME (s) | % GAIN |
|---|---|---|---|---|---|
| LS | Baseline | EXPANSE | all time, all mags, 100k periods | 139.75323605537415 | |
| LS | OPTIMIZED | EXPANSE | all time, all mags, 100k periods | 122.99888670444 | +11.98% |
| CE | Baseline | EXPANSE | all time, all mags, all | 237.9152114391327 | |
| CE | OPTIMIZED | EXPANSE | all time, all mags, all | 224.75576400756836 | +5.53% |
| AOV | Baseline | EXPANSE | all time, all mags, 100k periods | 244.3776876926422 | |
| AOV | OPTIMIZED | EXPANSE | all time, all mags, 100k periods | 214.41236901283264 | +12.26% |

| METHOD | VERSION | MACHINE | DATA | TIME (s) | % GAIN |
|---|---|---|---|---|---|
| LS | Baseline | EXPANSE | 1000 lightcurves, all periods | 17.119052171707153 | |
| LS | OPTIMIZED | EXPANSE | 1000 lightcurves, all periods | 16.61055612564087 | +2.97% |
| CE | Baseline | EXPANSE | 1000 lightcurves, all periods | 28.453676223754883 | |
| CE | OPTIMIZED | EXPANSE | 1000 lightcurves, all periods | 27.26391577720642 | +4.18% |
| AOV | Baseline | EXPANSE | 1000 lightcurves, all periods | 30.421194076538086 | |
| AOV | OPTIMIZED | EXPANSE | 1000 lightcurves, all periods | 26.356033086776733 | +13.36% |

The performance gains decrease as GPU memory bandwidth increases: on a GTX 1080, the Lomb-Scargle gains were in the mid-20% range.

ejaszewski · 2025-02-19T17:43:03Z

Happy to review the changes this weekend if you would like another pair of eyes on it!

ejaszewski · 2025-02-23T08:39:15Z

The changed .clang-format makes it very difficult to actually tell what has changed because the diff is picking up all of the whitespace changes. If possible, can you re-format this with the original clang-format so the diff is meaningful?

lilinitsy · 2025-02-25T02:01:53Z

@ejaszewski Sure thing. I'll try and do that this weekend.

lilinitsy and others added 30 commits June 17, 2024 12:28

Remove old cuda architectures incompatible with version 12

168217d

buggy (wrong results) of LS kernel batched

e8ee3e6

some updates???

0a649bb

index properly in the curve_bytes calculation...

b8b9e15

Multiple lightcurves batched, but not faster; need some improvements

ace28ad

backup

52e54b2

Batch updates

be405e3

Minor updates

a16e378

ffastmath on and some GPU intrinsics

7c5c7c8

Use two asynchronous cuda streams, which leads to a large performance increase

2f36a81

Updates to improve accuracy; runtime for testing 100k periods decreases from 170s to 148s

a681cd3

Get multiple streams working, minor improvement

7fe18a0

clang-format

ac75555

Move cuda stream synchronization outside loop

310de6a

Remove extraneous chronos usage

04b2955

Synchronize streams before destroying

8419b2f

Increase the number of batched curves

0b60864

Modify inner loop condition

9f6eb21

Move synchronize

18500e0

restrict conditional entropy pointers

b6cc7e2

Pinned memory for CE

b03f020

3 cuda streams for CE

036159c

Use registers instead of shared memory in conditional entropy kernel

8521b23

Copy local hists to local memory

5aa4f3e

Cleanup on CE

1a19575

Cleanup on CE

45031c8

Change block size

21d122d

Pyx speedups using list comprehension

a37cef5

Tune threadcount and batch size

34b8769

More tuning for v100

b60e830

lilinitsy added 21 commits August 13, 2024 19:18

More tuning for running on 1 v100

a5e347b

List comprehension when returning stats in pyx

5083b32

Update pyx to reserve memory beforehand

6c19ff4

update pyx

39b5617

ls.pyx updates

5edbd7e

pyx update, set streams to 4

a16331a

Add OpenMP to Cmakelists

8feafdf

Pyx updates for CE

0593b7a

Adjust thread count

ba4a9e9

Change threads for ce

40854f9

Testing without pinned on v100

f135693

Use shared memory (Faster on V100, slower on 1080)

dfeadbb

ce update

6adb53a

updates to threadcounts per kernel

49d69c3

Pinned memory for aov

4b416c0

Threadcount adjustments

13fed4c

streams for AOV

9593fbc

Move atomic adds in FoldBinKernel

c4c2bf5

Remove numphasebinoverlap loop from FoldBinKernel

54785b6

Finally ran clang-format

7e59a60

Remove extraneous files, clang-format, etc, ready for merge

dd4c9ea

mcoughlin requested review from DanielWarshofsky and ejaszewski February 19, 2025 15:53

DanielWarshofsky approved these changes Feb 19, 2025
