Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance of high-level pollEvent #178

Closed
chrisdone opened this issue Apr 26, 2018 · 5 comments
Closed

Performance of high-level pollEvent #178

chrisdone opened this issue Apr 26, 2018 · 5 comments

Comments

@chrisdone
Copy link
Member

This is a general performance issue where I'll discuss results and issues.

I thought I would do a very simple measurement of the standard SDL event polling loop, with no window or anything:

import SDL
import Weigh

main :: IO ()
main =
  mainWith
    (do setColumns [Case, Allocated, GCs]
        sequence_
          [ action ("pollEvents " ++ show i) (sdl i)
          | i <- [10, 100, 1000, 10000, 100000]
          ])

sdl :: Int -> IO ()
sdl iters = do
  initializeAll
  let go :: Int -> IO ()
      go 0 = pure ()
      go i = do
        _events <- pollEvents
        go (i - 1)
  go iters

It turns out to allocate quite a bit: 😱

Case                Allocated  GCs
pollEvents 10           4,632    0
pollEvents 100         38,416    0
pollEvents 1000       369,112    0
pollEvents 10000    3,680,088    3
pollEvents 100000  36,802,144   35

If instead I do this, i.e. allocate once and repeatedly write to the same pointer and peek when there's something available:

sdl :: Int -> IO ()
sdl iters = do
  initializeAll
  alloca
    (\(e) -> do
       let go :: Int -> IO ()
           go 0 = pure ()
           go i = do
             do n <- Raw.pollEvent e
                if n == 0
                  then return ()
                  else void (peek e)
             go (i - 1)
       go iters)

Allocations are constant, no GCs: 💪

Case               Allocated  GCs
pollEvents 10          1,400    0
pollEvents 100         1,400    0
pollEvents 1000        1,400    0
pollEvents 10000       1,400    0
pollEvents 100000      1,400    0

So currently the example program in the docs which uses pollEvents in a loop is pretty inefficient. ☝️

I suspect waitEvent is probably better, but the SDL docs don't endors it like pollEvent:

SDL_PollEvent() is the favored way of receiving system events since it can be done from the main loop and does not suspend the main loop while waiting on an event to be posted.

But using pollEvent in a loop is guaranteed to allocate whether or not there are any events to be processed.

I realise there are worse things in life than a bit of allocation; I just have a hobby project as part of haskell-perf called "gcless" Haskell, in which I try out little programs and see what useful toys I can make that happen to not allocate beyond a constant initial factor.

Anyway, one alternative is to tweak pollEvent so that it checks whether there is an event before allocating:

{-# LANGUAGE LambdaCase #-}
import Foreign
import SDL
import qualified SDL.Raw as Raw
import Criterion.Main

main :: IO ()
main = do
  initializeAll
  defaultMain
    [ bgroup
        "pollEvents"
        [ bench ("orig") (whnfIO orig)
        , bench ("check") (whnfIO check)
        ]
    ]

check :: IO ()
check = do
  ret <- Raw.pollEvent nullPtr
  if ret == 0
    then pure ()
    else alloca
           (\e ->
              (do Raw.pollEvent e
                  peek e >>= \case
                    Raw.AudioDeviceEvent {} -> pure ()
                    _ -> pure ()))

orig :: IO ()
orig = pollEvent

No discernible speed difference:

benchmarking pollEvents/orig
time                 11.43 μs   (11.37 μs .. 11.49 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 11.41 μs   (11.37 μs .. 11.48 μs)
std dev              167.3 ns   (122.4 ns .. 274.7 ns)
variance introduced by outliers: 11% (moderately inflated)

benchmarking pollEvents/check
time                 11.05 μs   (11.00 μs .. 11.11 μs)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 11.04 μs   (10.99 μs .. 11.13 μs)
std dev              216.5 ns   (154.8 ns .. 296.8 ns)
variance introduced by outliers: 19% (moderately inflated)

So we just avoid allocating where not necessary. 👍

@chrisdone
Copy link
Member Author

chrisdone commented Apr 26, 2018

Just to summarize those numbers for future reference:

> let timeToPoll = 11 * 1e-6 -- from criterion
> let bytesPerPoll = 36802144/100000 -- from weigh
> let pollsPerSec = 1/timeToPoll 
> let bytesPerSecond = bytesPerPoll * pollsPerSec
> pollsPerSec
90909.09090909091
> printf "%f" (pollsPerSec * bytesPerPoll)
33456494.545454547

So that's 90,909 polls and 33.5MB per second of allocations. That's
consistent with the other findings. The GC count for 100,000 polls (36.8MB)
is 35 GCs, so that's roughly 1 GC per 1MB which corresponds to GHC's
default settings
.

If your rendering function is in an SDL timer handled by pollEvent, then this might hinder your top performance.

@chrisdone
Copy link
Member Author

More findings: waitEvent is apparently just a loop that has an interval. It doesn't, it would seem, use OS timers or wait for user input to trigger.

chrisdone added a commit that referenced this issue Apr 26, 2018
Just to summarize the numbers below for future reference:

> let timeToPoll = 11 * 1e-6 -- from criterion
> let bytesPerPoll = 36802144/100000 -- from weigh
> let pollsPerSec = 1/timeToPoll
> let bytesPerSecond = bytesPerPoll * pollsPerSec
> pollsPerSec
90909.09090909091
> printf "%f" (pollsPerSec * bytesPerPoll)
33456494.545454547

So that's 90,909 polls and 33.5MB per second of allocations. That's
consistent with the below findings. The GC count for 100,000
polls (36.8MB) is 35 GCs, so that's roughly 1 GC per 1MB which
corresponds to GHC's default GC settings.

Before:

Case                Allocated  GCs
pollEvents 10           4,632    0
pollEvents 100         38,416    0
pollEvents 1000       369,112    0
pollEvents 10000    3,680,088    3
pollEvents 100000  36,802,144   35

benchmarking pollEvents/orig
time                 11.43 μs   (11.37 μs .. 11.49 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 11.41 μs   (11.37 μs .. 11.48 μs)
std dev              167.3 ns   (122.4 ns .. 274.7 ns)
variance introduced by outliers: 11% (moderately inflated)

After

Case               Allocated  GCs
pollEvents 10          1,400    0
pollEvents 100         1,400    0
pollEvents 1000        1,400    0
pollEvents 10000       1,400    0
pollEvents 100000      1,400    0

benchmarking pollEvents/check
time                 11.05 μs   (11.00 μs .. 11.11 μs)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 11.04 μs   (10.99 μs .. 11.13 μs)
std dev              216.5 ns   (154.8 ns .. 296.8 ns)
variance introduced by outliers: 19% (moderately inflated)
chrisdone added a commit that referenced this issue Apr 26, 2018
Make pollEvent not allocate unnecessarily (#178)
@chrisdone chrisdone changed the title Performance of sdl2 high-level bindings Performance of high-level pollEvent Apr 26, 2018
@chrisdone
Copy link
Member Author

Closing, I'll open different PRs for other operations.

@chrisdone
Copy link
Member Author

Added regression tests for the allocations here: 2bd91b1

@ocharles
Copy link
Member

Hi Chris,

Just got back from holiday but wanted to express my thanks for this work. The follow up commits seem obvious, but thanks for explaining what you're doing at the same time.

Crack on, I say!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants