nvenc encoder improvements - support a common backing encoding session across windows #3286
Conversation
Other ways around this problem are:
That's going to penalize the quadro users. I'll try to find the time to review this PR and see if we can at least merge some of it straight away.
I have merged a bunch of mostly cosmetic changes.
Please rebase and address or answer the review comments.
xpra/codecs/nvenc/encoder.pyx
@@ -13,6 +13,8 @@ import ctypes
 from ctypes import cdll, POINTER
 from threading import Lock
 from pycuda import driver
+from pycuda._driver import LogicError
Should be from pycuda.driver import LogicError, or just refer to it as driver.LogicError.
xpra/codecs/nvenc/encoder.pyx
    def __init__(self, *args, **kwargs):
        log("init uniqueencoder start")
        self.unique_id = random.randint(0,999999999)
This is not a safe way to generate a unique id. Use AtomicInteger instead.
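For illustration, here is a minimal sketch of a thread-safe counter in the spirit of the AtomicInteger suggestion; xpra ships its own helper, and the standalone class and usage below are only an assumed shape, not the project's exact API:

```python
from threading import Lock

class AtomicInteger:
    """Minimal thread-safe counter, sketched after the AtomicInteger suggestion above."""
    def __init__(self, value=0):
        self._value = value
        self._lock = Lock()

    def increase(self, by=1):
        with self._lock:
            self._value += by
            return self._value

# hypothetical usage inside UniqueEncoder.__init__:
#   self.unique_id = ENCODER_ID_GEN.increase()
ENCODER_ID_GEN = AtomicInteger()
```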
        setup_cost = 80
        if USE_SINGLETON_ENCODER:
            codec_class = UniqueEncoder
            setup_cost = 0
Even with these changes, I doubt that the setup cost is as low as that of a pure software encoder.
This was chosen merely to give strong preference to nvenc over the x264 software encoder - I think a small value like 5 or 10 might still keep it preferred over software encoding. Of course the actual setup cost is > 0, but the setup cost of initializing a UniqueEncoder backed by an already-initialized singleton encoder is effectively 0.
@@ -1668,7 +1813,7 @@ cdef class Encoder:
         profile = os.environ.get("XPRA_NVENC_PROFILE", "")
         profile = os.environ.get("XPRA_NVENC_%s_PROFILE" % csc_mode, profile)
         #now see if the client has requested a different value:
-        profile = options.strget("h264.%s.profile" % csc_mode, profile)
+        profile = options.get("h264.%s.profile" % csc_mode, profile)
strget is needed because we get options from the client, through a packet encoder which may mangle bytes and strings (#3229).
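To illustrate the distinction, here is a hedged sketch of what a strget-style lookup does compared to a plain get; the function body and decoding choice below are assumptions for illustration, not xpra's exact typedict implementation:

```python
def strget(options, key, default=""):
    """Sketch of a strget-style lookup: values coming through the packet
    encoder may arrive as bytes, so normalize them back to str."""
    value = options.get(key, default)
    if isinstance(value, bytes):
        # bytes from the packet layer: decode back to a plain string
        return value.decode("latin1")
    return str(value)

# hypothetical usage matching the diff above:
#   profile = strget(options, "h264.%s.profile" % csc_mode, profile)
```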
I'm not sure exactly what is going on here, but months ago using strget caused the nvenc encoder to raise an exception on initialization where I was running it, so I had to add a manual hack changing it to get just to make it work in the first place without crashing and disabling nvenc entirely. Currently I'm using the nvidia/cudagl:11.0-base-ubuntu20.04 docker container, however I believe this was also the case with a vanilla ubuntu 20.04.
log("called compress_image with common encoder: %s, context: %#x, frames: %i", self, <uintptr_t> self.context, self.frames) | ||
log("compress_image%s", (quality, speed, options, retry)) | ||
|
||
#FIXME: do not lock encode_lock on USE_SINGLETON_ENCODER disabled |
Actually, I think we do want a lock in all cases.
With multiple concurrent users each having their own encode thread, this could easily break nvenc.
(but the lock would need to protect more than just compress_image in that case)
I was assuming that multiple threads, each with their own cuda context and encoder, would not need a global lock holding them back (this is a global/module-level var) and could operate in parallel without stepping over each other - the only reason the lock is here is to ensure that parallel threads don't step on each other while writing/reading the same buffers/memory locations (in the singleton case).
Do you suspect that we'd need a global lock in the case where a singleton isn't used?
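As a rough illustration of the two locking regimes being discussed, here is a minimal sketch; all names (encode_lock, do_compress_image, the toggle) are placeholders rather than the PR's actual symbols:

```python
from threading import Lock

encode_lock = Lock()             # module-level lock guarding the shared session
USE_SINGLETON_ENCODER = True     # placeholder for the env-var toggle

def compress_image(encoder, image, options):
    if USE_SINGLETON_ENCODER:
        # shared backing buffers: serialize upload/encode/download
        with encode_lock:
            return encoder.do_compress_image(image, options)
    # per-window encoder owning its own CUDA context and buffers: no global lock
    return encoder.do_compress_image(image, options)
```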
@@ -723,6 +727,8 @@ def matches_video_subregion(self, width, height):
         return vr

     def subregion_is_video(self):
+        if WINDOW_FOCUS_VIDEO_ENCODER and self.has_focus:
+            return True
This feels like the wrong place for it.
What is this meant to do?
probably not ideal, yes - here again the intent was to short-circuit any video region detection for a window and always use nvenc for the active/focused window
@@ -778,7 +784,7 @@ def send_nonvideo(regions=regions, encoding=coding, exclude_region=None, get_bes
         assert not self.full_frames_only

         actual_vr = None
-        if vr in regions:
+        if vr in regions or (WINDOW_FOCUS_VIDEO_ENCODER and self.has_focus):
As above, this looks wrong.
also attempting to short-circuit the logic to use nvenc where possible on the active window
log("update_encoding_options check setregion override") | ||
if WINDOW_FOCUS_VIDEO_ENCODER and self.has_focus: | ||
vs.set_detection(False) | ||
vs.set_region(0, 0, ww, wh) |
So this also bypasses video region detection.
Why was this needed?
Now the whole window is using nvenc, no matter what?
Yes - the intent was to let the nvenc encoding/compression optimize sending the whole window. I noticed that in a lot of cases, since partial window refreshes were being sent, the window refreshed in blocks rather than all at the same time - since we're already going to try to use nvenc to encode the frame (because the window is in focus), we might as well let it handle the optimization and get consistent repaints on the client side.
If there's a better way to do/handle this let me know and I can update.
        else:
            scorelog("check_pipeline_score(%s) change of video input dimensions from %ix%i to %ix%i",
                force_reload, ve.get_width(), ve.get_height(), enc_width, enc_height)
            clean = True
The can_ignore_dim_changes should be added to the interface of all the encoders, after being renamed to something more generic, ie: can_compress_size(w, h).
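As a hedged sketch of what that generic interface method might look like (the class names, attributes, and limits below are assumptions for illustration only, not the project's actual encoder interface):

```python
class VideoEncoderBase:
    def can_compress_size(self, width, height):
        # default: any dimension change requires tearing down the encoder
        return False

class NvencSessionEncoder(VideoEncoderBase):
    def __init__(self, max_width, max_height):
        self.max_width = max_width
        self.max_height = max_height

    def can_compress_size(self, width, height):
        # an initialized nvenc session can be resized within the limits it was
        # created with, without a full re-initialization
        return width <= self.max_width and height <= self.max_height
```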
        else:
            videolog("do_check_pipeline video: window dimensions have changed from %sx%s to %sx%s",
                ve.get_width(), ve.get_height(), encoder_src_width, encoder_src_height)
            return False
        return True
As above.
I missed mentioning that only 2 encoders are allowed on consumer cards specifically - the goal here was to have a solution that would work on consumer cards (working within their limitations). I didn't come across keylase/nvidia-patch in my investigation into this, but that certainly opens up a lot of extra possibilities on its own, making this PR less beneficial. I likely would have structured this a bit differently from the beginning had I been aware.

My goal was to eventually add a pool of encoders, with preference/stickiness for the last encoder a window used in order to keep IDR frames to a minimum, plus an option to specify the number of encoders you'd want to use (a rough sketch of that idea follows below). This would possibly allow for the best of all worlds: it wouldn't limit quadro cards, could use a higher value for patched consumer drivers, and would still work within the 2-encoder limit on non-patched consumer cards. I'd imagine running 1 encoder per window, even in a quadro/patched case, takes up extra VRAM that can't be used for other work or other windows.

I've been trying to use xpra with things like a browser, where it's not just a straight video encode that initializes once, but frequent stops and starts - software encoding/compression in this case just isn't responsive enough to keep up, and the shift between software and nvenc encoding seems to cause extra delays and frame jitter (where it jumps back to an old frame for a second). In addition, having at least one encoder ready to go without taking the initialization time hit seems so far to help keep overall perceived latency down.

Thanks for reviewing this PR and adding feedback - I'll run through those comments soon and make adjustments/comments where appropriate.
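As a rough illustration of that pooling idea (this is not code from the PR; every name and the default session limit below are hypothetical):

```python
from threading import Lock

class EncoderPool:
    """Bounded pool of encoder sessions with per-window stickiness,
    sketched after the idea described in the comment above."""
    def __init__(self, max_sessions=2):      # 2 = unpatched consumer-card limit
        self.max_sessions = max_sessions
        self.sessions = {}                   # session id -> encoder session
        self.last_used = {}                  # window id -> session id
        self.lock = Lock()

    def acquire(self, window_id, create_session):
        with self.lock:
            # prefer the session this window used last, to avoid an extra IDR frame
            sid = self.last_used.get(window_id)
            if sid not in self.sessions:
                if len(self.sessions) < self.max_sessions:
                    sid = len(self.sessions)
                    self.sessions[sid] = create_session()
                else:
                    sid = next(iter(self.sessions))   # reuse any existing session
            self.last_used[window_id] = sid
            return self.sessions[sid]
```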
Force-pushed from c7dffab to b76106e.
Moving to draft as I have pending changes and identified a memory leak resulting from leftover nvenc resources.
Can you update the PR?
Force-pushed from 99b6bf1 to 018e830.
I've been working on this off and on, and have been trying to validate and ensure any potential long-term memory leaks are addressed fully (some of these were a bit hidden and hard to track in the nvenc library).
Any update?
Sorry about the delay in updating this one. I've had to step away from this for a bit, but I've been using this daily and working out the last bits as I notice them. I'll need to rebase again off master but will get the updates mentioned above done soon - I didn't realize that there was also some interest in possibly using this method (at least until nvenc pooling is implemented).
Commit messages:
- add option to disable and use previous behavior
- memory leak fixes cleanup rebase from master
- …nup cdc to avoid memory leaks ensure ui_thread handles damage regions
Force-pushed from 018e830 to 31e9f05.
Rebased off of latest master - it appears to be functional at this point, however I have a few comments to fix/address before considering this ready to merge.
I would like to merge the resizing part first, and the encoder pool separately - ideally in a separate module.
Not heard back 😞
The current implementation of the nvenc encoder instantiates a new encoding session for each new window, or whenever a window switches to using its nvenc video encoder. This incurs a (known) setup time each time the encoder is used, which is why it is prioritized below other encoders. Due to the frequent setup and teardown, the encoder often gets into a state that cannot be recovered without killing and restarting xpra, likely due to 2 reasons:
This PR solves the 2 issues above, and in the process adds a few additional benefits:
However, these changes do incur some minor drawbacks:
High Level Implementation:
To minimize code changes, the Encoder class is kept mostly intact and can act either as a standalone Encoder reinitialized for each window (the previous behavior) or as a common singleton instance (abstracted through the UniqueEncoder class). There's some python trickery here to transparently pass method calls/variable access from UniqueEncoder down to its backing singleton instance, thus not requiring any complex rework of how a video encoder is initialized/called. This also allows selective overriding of any methods/variables in UniqueEncoder to change behavior and/or store local data in each UniqueEncoder.
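As an illustration of that delegation trick, here is a minimal sketch; the attribute names, the encoder_factory parameter, and the singleton bookkeeping are assumptions rather than the PR's exact code:

```python
import itertools

_id_gen = itertools.count(1)        # stand-in for an AtomicInteger-style id source
_singleton_encoder = None           # lazily created, shared by every UniqueEncoder wrapper

class UniqueEncoder:
    # attributes kept on the wrapper itself rather than on the shared singleton:
    _local_attrs = ("unique_id",)

    def __init__(self, encoder_factory):
        global _singleton_encoder
        if _singleton_encoder is None:
            _singleton_encoder = encoder_factory()   # the real nvenc-backed Encoder
        object.__setattr__(self, "unique_id", next(_id_gen))

    def __getattr__(self, name):
        # anything not found on the wrapper is delegated to the singleton
        return getattr(_singleton_encoder, name)

    def __setattr__(self, name, value):
        if name in self._local_attrs:
            object.__setattr__(self, name, value)
        else:
            setattr(_singleton_encoder, name, value)

    def clean(self):
        # deliberately a no-op: the shared singleton must outlive this wrapper
        pass
```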
Each UniqueEncoder has a unique_id - this id is used by the singleton Encoder's compress_image to determine if it needs to reset the backing encoder session back to frame 0 and send an IDR frame. compress_image also checks whether the encoder is large enough to render the frame, and if not, triggers a resize of the encoding session (without fully reinitializing it) before continuing on to compress the image.
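A rough sketch of that check, written as a method body meant to live on the shared singleton Encoder; the attribute and method names (last_unique_id, request_idr, resize_session, do_compress_image) are illustrative assumptions, not the PR's exact ones:

```python
def compress_image(self, image, options, unique_id):
    if unique_id != self.last_unique_id:
        # a different window has taken over the shared session:
        # restart the stream so the client receives a decodable IDR frame first
        self.last_unique_id = unique_id
        self.frames = 0
        self.request_idr = True
    w, h = image.get_width(), image.get_height()
    if w > self.encoder_width or h > self.encoder_height:
        # grow the existing session in place instead of re-creating it
        self.resize_session(w, h)
    return self.do_compress_image(image, options)
```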
When clean() is called on a UniqueEncoder we need to ignore it, as we don't want to clean up the singleton Encoder since it's likely to be used by a future UniqueEncoder.

Notes:
- encode_lock is used to ensure parallel compress_image calls don't step over each other, since they will be using the same backing buffers for IO to the encoder session.
- XPRA_WINDOW_FOCUS_VIDEO_ENCODER - if True, try to use the video encoder for all frames of a window in focus, regardless of other hints that would exclude it from using the video encoder (ie text content-type).
- XPRA_NVENC_USE_SINGLETON_ENCODER - if True, use this PR's implementation of a single backing encoder session for all windows, otherwise use the one-encoder-session-per-window implementation.
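For reference, a minimal sketch of how such toggles are typically read at module load time; the envbool helper shown here and the default values are assumptions for illustration, not necessarily the PR's defaults:

```python
import os

def envbool(name, default=False):
    """Read a boolean toggle from the environment (illustrative helper)."""
    value = os.environ.get(name)
    if value is None:
        return default
    return value.strip().lower() in ("1", "true", "yes", "on")

# defaults below are placeholders:
WINDOW_FOCUS_VIDEO_ENCODER = envbool("XPRA_WINDOW_FOCUS_VIDEO_ENCODER", False)
USE_SINGLETON_ENCODER = envbool("XPRA_NVENC_USE_SINGLETON_ENCODER", False)
```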