nvenc encoder improvements - support a common backing encoding session across windows #3286
Conversation
Other ways around this problem are:
That's going to penalize the quadro users. I'll try to find the time to review this PR and see if we can at least merge some of it straight away.
I have merged a bunch of mostly cosmetic changes.
Please rebase and address or answer the review comments.
xpra/codecs/nvenc/encoder.pyx
@@ -13,6 +13,8 @@ import ctypes
 from ctypes import cdll, POINTER
 from threading import Lock
 from pycuda import driver
+from pycuda._driver import LogicError
Should be from pycuda.driver import LogicError, or just refer to it as driver.LogicError.
xpra/codecs/nvenc/encoder.pyx
    def __init__(self, *args, **kwargs):
        log("init uniqueencoder start")
        self.unique_id = random.randint(0,999999999)
This is not a safe way to generate a unique id. Use AtomicInteger instead.
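For illustration, here is a minimal sketch of a thread-safe counter in the spirit of the AtomicInteger suggestion; xpra ships its own helper, and the standalone class and usage below are only an assumed shape, not the project's exact API:

```python
from threading import Lock

class AtomicInteger:
    """Minimal thread-safe counter, sketched after the AtomicInteger suggestion above."""
    def __init__(self, value=0):
        self._value = value
        self._lock = Lock()

    def increase(self, by=1):
        with self._lock:
            self._value += by
            return self._value

# hypothetical usage inside UniqueEncoder.__init__:
#   self.unique_id = ENCODER_ID_GEN.increase()
ENCODER_ID_GEN = AtomicInteger()
```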
        setup_cost = 80
        if USE_SINGLETON_ENCODER:
            codec_class = UniqueEncoder
            setup_cost = 0
Even with these changes, I doubt that the setup cost is as low as that of a pure software encoder.
This was chosen merely to give strong preference to nvenc over the x264 software encoder - I think a small value like 5 or 10 might still keep it preferred over software encoding. Of course the actual setup cost is > 0, but the setup cost of initializing a UniqueEncoder backed by an already-initialized singleton encoder is effectively 0.
@@ -1668,7 +1813,7 @@ cdef class Encoder:
         profile = os.environ.get("XPRA_NVENC_PROFILE", "")
         profile = os.environ.get("XPRA_NVENC_%s_PROFILE" % csc_mode, profile)
         #now see if the client has requested a different value:
-        profile = options.strget("h264.%s.profile" % csc_mode, profile)
+        profile = options.get("h264.%s.profile" % csc_mode, profile)
strget is needed because we get options from the client, through a packet encoder which may mangle bytes and strings (#3229).
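To illustrate the distinction, here is a hedged sketch of what a strget-style lookup does compared to a plain get; the function body and decoding choice below are assumptions for illustration, not xpra's exact typedict implementation:

```python
def strget(options, key, default=""):
    """Sketch of a strget-style lookup: values coming through the packet
    encoder may arrive as bytes, so normalize them back to str."""
    value = options.get(key, default)
    if isinstance(value, bytes):
        # bytes from the packet layer: decode back to a plain string
        return value.decode("latin1")
    return str(value)

# hypothetical usage matching the diff above:
#   profile = strget(options, "h264.%s.profile" % csc_mode, profile)
```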
I'm not sure exactly what is going on here, but months ago using strget caused the nvenc encoder to raise an exception on initialization where I was running it, so I had to add a manual hack changing it to get just to make it work in the first place without crashing and disabling nvenc entirely. Currently I'm using the nvidia/cudagl:11.0-base-ubuntu20.04 docker container, however I believe this was also the case with a vanilla ubuntu 20.04.
log("called compress_image with common encoder: %s, context: %#x, frames: %i", self, <uintptr_t> self.context, self.frames) | ||
log("compress_image%s", (quality, speed, options, retry)) | ||
|
||
#FIXME: do not lock encode_lock on USE_SINGLETON_ENCODER disabled |
Actually, I think we do want a lock in all cases.
With multiple concurrent users each having their own encode thread, this could easily break nvenc.
(but the lock would need to protect more than just compress_image in that case)
I was assuming that multiple threads, each with their own cuda context and encoder, would not need a global lock holding them back (this is a global/module-level var) and could operate in parallel without stepping over each other - the only reason the lock is here is to ensure that parallel threads don't step on each other while writing/reading the same buffers/memory locations (in the singleton case).
Do you suspect that we'd need a global lock in the case where a singleton isn't used?
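As a rough illustration of the two locking regimes being discussed, here is a minimal sketch; all names (encode_lock, do_compress_image, the toggle) are placeholders rather than the PR's actual symbols:

```python
from threading import Lock

encode_lock = Lock()             # module-level lock guarding the shared session
USE_SINGLETON_ENCODER = True     # placeholder for the env-var toggle

def compress_image(encoder, image, options):
    if USE_SINGLETON_ENCODER:
        # shared backing buffers: serialize upload/encode/download
        with encode_lock:
            return encoder.do_compress_image(image, options)
    # per-window encoder owning its own CUDA context and buffers: no global lock
    return encoder.do_compress_image(image, options)
```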
@@ -723,6 +727,8 @@ def matches_video_subregion(self, width, height):
         return vr

     def subregion_is_video(self):
+        if WINDOW_FOCUS_VIDEO_ENCODER and self.has_focus:
+            return True
This feels like the wrong place for it.
What is this meant to do?
probably not ideal, yes - here again the intent was to short-circuit any video region detection for a window and always use nvenc for the active/focused window
@@ -778,7 +784,7 @@ def send_nonvideo(regions=regions, encoding=coding, exclude_region=None, get_bes
         assert not self.full_frames_only

         actual_vr = None
-        if vr in regions:
+        if vr in regions or (WINDOW_FOCUS_VIDEO_ENCODER and self.has_focus):
As above, this looks wrong.
also attempting to short-circuit the logic to use nvenc where possible on the active window
log("update_encoding_options check setregion override") | ||
if WINDOW_FOCUS_VIDEO_ENCODER and self.has_focus: | ||
vs.set_detection(False) | ||
vs.set_region(0, 0, ww, wh) |
So this also bypasses video region detection.
Why was this needed?
Now the whole window is using nvenc, no matter what?
Yes - the intent was to let the nvenc encoding/compression optimize sending the whole window. I noticed that in a lot of cases, since partial window refreshes were being sent, the window refreshed in blocks rather than all at the same time - since we're already going to try to use nvenc to encode the frame (because the window is in focus), we might as well let it handle the optimization and get consistent repaints on the client side.
If there's a better way to do/handle this let me know and I can update.
        else:
            scorelog("check_pipeline_score(%s) change of video input dimensions from %ix%i to %ix%i",
                force_reload, ve.get_width(), ve.get_height(), enc_width, enc_height)
            clean = True
The can_ignore_dim_changes should be added to the interface of all the encoders, after being renamed to something more generic, ie: can_compress_size(w, h).
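As a hedged sketch of what that generic interface method might look like (the class names, attributes, and limits below are assumptions for illustration only, not the project's actual encoder interface):

```python
class VideoEncoderBase:
    def can_compress_size(self, width, height):
        # default: any dimension change requires tearing down the encoder
        return False

class NvencSessionEncoder(VideoEncoderBase):
    def __init__(self, max_width, max_height):
        self.max_width = max_width
        self.max_height = max_height

    def can_compress_size(self, width, height):
        # an initialized nvenc session can be resized within the limits it was
        # created with, without a full re-initialization
        return width <= self.max_width and height <= self.max_height
```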
        else:
            videolog("do_check_pipeline video: window dimensions have changed from %sx%s to %sx%s",
                ve.get_width(), ve.get_height(), encoder_src_width, encoder_src_height)
            return False
        return True
As above.
I missed mentioning that only 2 encoders are allowed on consumer cards specifically - the goal here was to have a solution that would work on consumer cards (working within their limitations). I didn't come across keylase/nvidia-patch in my investigation into this, but that certainly opens up a lot of extra possibilities on its own, making this PR less beneficial. I likely would have structured this a bit differently from the beginning had I been aware.

My goal was to eventually add a pool of encoders, with preference/stickiness for the last encoder a window used in order to keep IDR frames to a minimum, plus an option to specify the number of encoders you'd want to use (a rough sketch of that idea follows below). This would possibly allow for the best of all worlds: it wouldn't limit quadro cards, could use a higher value for patched consumer drivers, and would still work within the 2-encoder limit on non-patched consumer cards. I'd imagine running 1 encoder per window, even in a quadro/patched case, takes up extra VRAM that can't be used for other work or other windows.

I've been trying to use xpra with things like a browser, where it's not just a straight video encode that initializes once, but frequent stops and starts - software encoding/compression in this case just isn't responsive enough to keep up, and the shift between software and nvenc encoding seems to cause extra delays and frame jitter (where it jumps back to an old frame for a second). In addition, having at least one encoder ready to go without taking the initialization time hit seems so far to help keep overall perceived latency down.

Thanks for reviewing this PR and adding feedback - I'll run through those comments soon and make adjustments/comments where appropriate.
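As a rough illustration of that pooling idea (this is not code from the PR; every name and the default session limit below are hypothetical):

```python
from threading import Lock

class EncoderPool:
    """Bounded pool of encoder sessions with per-window stickiness,
    sketched after the idea described in the comment above."""
    def __init__(self, max_sessions=2):      # 2 = unpatched consumer-card limit
        self.max_sessions = max_sessions
        self.sessions = {}                   # session id -> encoder session
        self.last_used = {}                  # window id -> session id
        self.lock = Lock()

    def acquire(self, window_id, create_session):
        with self.lock:
            # prefer the session this window used last, to avoid an extra IDR frame
            sid = self.last_used.get(window_id)
            if sid not in self.sessions:
                if len(self.sessions) < self.max_sessions:
                    sid = len(self.sessions)
                    self.sessions[sid] = create_session()
                else:
                    sid = next(iter(self.sessions))   # reuse any existing session
            self.last_used[window_id] = sid
            return self.sessions[sid]
```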
Force-pushed from c7dffab to b76106e.
Moving to draft as I have pending changes and identified a memory leak resulting from leftover nvenc resources.
Can you update the PR?
Force-pushed from 99b6bf1 to 018e830.
I've been working on this off and on, and have been trying to validate and ensure any potential long-term memory leaks are addressed fully (some of these were a bit hidden and hard to track in the nvenc library).
Any update?
Sorry about the delay in updating this one. I've had to step away from this for a bit, but I've been using this daily and working out the last bits as I notice them. I'll need to rebase again off master but will get the updates mentioned above done soon - I didn't realize that there was also some interest in possibly using this method (at least until nvenc pooling is implemented).
Commit messages:
- add option to disable and use previous behavior
- memory leak fixes cleanup rebase from master
- …nup cdc to avoid memory leaks ensure ui_thread handles damage regions
Force-pushed from 018e830 to 31e9f05.
Rebased off of latest master - it appears to be functional at this point, however I have a few comments to fix/address before considering this ready to merge.
I would like to merge the resizing part first, and the encoder pool separately - ideally in a separate module.
Not heard back 😞
The current implementation of the nvenc encoder instantiates a new encoding session for each new window, or whenever a window switches to using its nvenc video encoder. This incurs a (known) setup time each time the encoder is used, which is why it is prioritized below other encoders. Due to the frequent setup and teardown, the encoder often gets into a state that cannot be recovered without killing and restarting xpra, likely due to 2 reasons:
This PR solves the 2 issues above, and in the process adds a few additional benefits:
However, these changes do incur some minor drawbacks:
High Level Implementation:
To minimize code changes, the Encoder class is kept mostly intact and can act either as a standalone Encoder reinitialized for each window (the previous behavior) or as a common singleton instance (abstracted through the UniqueEncoder class). There's some python trickery here to transparently pass method calls/variable access from UniqueEncoder down to its backing singleton instance, thus not requiring any complex rework of how a video encoder is initialized/called. This also allows selective overriding of any methods/variables in UniqueEncoder to change behavior and/or store local data in each UniqueEncoder.
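As an illustration of that delegation trick, here is a minimal sketch; the attribute names, the encoder_factory parameter, and the singleton bookkeeping are assumptions rather than the PR's exact code:

```python
import itertools

_id_gen = itertools.count(1)        # stand-in for an AtomicInteger-style id source
_singleton_encoder = None           # lazily created, shared by every UniqueEncoder wrapper

class UniqueEncoder:
    # attributes kept on the wrapper itself rather than on the shared singleton:
    _local_attrs = ("unique_id",)

    def __init__(self, encoder_factory):
        global _singleton_encoder
        if _singleton_encoder is None:
            _singleton_encoder = encoder_factory()   # the real nvenc-backed Encoder
        object.__setattr__(self, "unique_id", next(_id_gen))

    def __getattr__(self, name):
        # anything not found on the wrapper is delegated to the singleton
        return getattr(_singleton_encoder, name)

    def __setattr__(self, name, value):
        if name in self._local_attrs:
            object.__setattr__(self, name, value)
        else:
            setattr(_singleton_encoder, name, value)

    def clean(self):
        # deliberately a no-op: the shared singleton must outlive this wrapper
        pass
```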
Each UniqueEncoder has a unique_id - this id is used by the singleton Encoder's compress_image to determine if it needs to reset the backing encoder session back to frame 0 and send an IDR frame. compress_image also checks whether the encoder is large enough to render the frame, and if not, triggers a resize of the encoding session (without fully reinitializing it) before continuing on to compress the image.
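A rough sketch of that check, written as a method body meant to live on the shared singleton Encoder; the attribute and method names (last_unique_id, request_idr, resize_session, do_compress_image) are illustrative assumptions, not the PR's exact ones:

```python
def compress_image(self, image, options, unique_id):
    if unique_id != self.last_unique_id:
        # a different window has taken over the shared session:
        # restart the stream so the client receives a decodable IDR frame first
        self.last_unique_id = unique_id
        self.frames = 0
        self.request_idr = True
    w, h = image.get_width(), image.get_height()
    if w > self.encoder_width or h > self.encoder_height:
        # grow the existing session in place instead of re-creating it
        self.resize_session(w, h)
    return self.do_compress_image(image, options)
```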
When clean() is called on a UniqueEncoder we need to ignore it, as we don't want to clean up the singleton Encoder since it's likely to be used by a future UniqueEncoder.

Notes:
- encode_lock is used to ensure parallel compress_image calls don't step over each other, since they will be using the same backing buffers for IO to the encoder session.
- XPRA_WINDOW_FOCUS_VIDEO_ENCODER - if True, try to use the video encoder for all frames of a window in focus, regardless of other hints that would exclude it from using the video encoder (ie text content-type).
- XPRA_NVENC_USE_SINGLETON_ENCODER - if True, use this PR's implementation of a single backing encoder session for all windows, otherwise use the one-encoder-session-per-window implementation.
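For reference, a minimal sketch of how such toggles are typically read at module load time; the envbool helper shown here and the default values are assumptions for illustration, not necessarily the PR's defaults:

```python
import os

def envbool(name, default=False):
    """Read a boolean toggle from the environment (illustrative helper)."""
    value = os.environ.get(name)
    if value is None:
        return default
    return value.strip().lower() in ("1", "true", "yes", "on")

# defaults below are placeholders:
WINDOW_FOCUS_VIDEO_ENCODER = envbool("XPRA_WINDOW_FOCUS_VIDEO_ENCODER", False)
USE_SINGLETON_ENCODER = envbool("XPRA_NVENC_USE_SINGLETON_ENCODER", False)
```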