-
-
Notifications
You must be signed in to change notification settings - Fork 854
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add SSE41 version of UpSample #1836
Conversation
Profiling results look pretty good: masterPRTestimage used: |
Codecov Report
@@ Coverage Diff @@
## master #1836 +/- ##
=======================================
Coverage 87% 87%
=======================================
Files 937 937
Lines 48449 48654 +205
Branches 6057 6067 +10
=======================================
+ Hits 42329 42534 +205
Misses 5112 5112
Partials 1008 1008
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report at Codecov.
|
When looking at the profiler measures, the improvements seem huge. I am not sure what im doing wrong here or if im reading the profiler output wrong. I am using dotTrace with |
UpSample32Pixels(topU.Slice(uvPos), curU.Slice(uvPos), ru); | ||
UpSample32Pixels(topV.Slice(uvPos), curV.Slice(uvPos), rv); | ||
ConvertYuvToBgrSse41(topY, bottomY, topDst, bottomDst, ru, rv, pos, xStep); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would avoid slicing and passing spans to these private methods here, and iterate instead using references or pointers. (Pinning is fine with such complex code IMO if it helps keeping readability.)
// Load the bytes into the *upper* part of 16b words. That's "<< 8", basically. | ||
[MethodImpl(InliningOptions.ShortMethod)] | ||
private static Vector128<byte> LoadHigh(Vector64<byte> src) | ||
{ | ||
Vector128<byte> tmp = Unsafe.As<Vector64<byte>, Vector128<byte>>(ref src); | ||
return Sse2.UnpackLow(Vector128<byte>.Zero, tmp); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't you actually load to Vector128<byte>
, then bitshift? Could be much cheaper, not sure what is Unsafe.As<Vector64<byte>, Vector128<byte>>(ref src)
being compiled to.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Its just a mov
, SharpLab, this should be fine, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I cant seem to figure out how to do this differently/better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought this is a vector level shuffle/shift, now I see what you are doing. What I mean is that if you only unpack from the lower half of tmp
, pushing this through Vector64<byte>
can be avoided by adding an 8 byte padding to y
, u
, v
:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was not aware of this, looks much better now, thanks.
Changed with cc5f7af
I'm happy for this to go in as-is but I'll leave it to @antonfirsov for a final review since he had questions/suggestions |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is still some space for improvement, but feel free to move on, if you'd better focus on something else.
// Load the bytes into the *upper* part of 16b words. That's "<< 8", basically. | ||
[MethodImpl(InliningOptions.ShortMethod)] | ||
private static Vector128<byte> LoadHigh(Vector64<byte> src) | ||
{ | ||
Vector128<byte> tmp = Unsafe.As<Vector64<byte>, Vector128<byte>>(ref src); | ||
return Sse2.UnpackLow(Vector128<byte>.Zero, tmp); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought this is a vector level shuffle/shift, now I see what you are doing. What I mean is that if you only unpack from the lower half of tmp
, pushing this through Vector64<byte>
can be avoided by adding an 8 byte padding to y
, u
, v
:
{ | ||
YuvToBgrSse41(topY.Slice(curX), ru, rv, topDst.Slice(curX * step)); | ||
|
||
if (bottomY != null) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This if can be moved outside of the loop for happier pipelines / branch predictor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved the if outside of the loop with 65870b9
@@ -312,6 +586,178 @@ public static void YuvToBgr(int y, int u, int v, Span<byte> bgr) | |||
bgr[0] = (byte)YuvToB(y, u); | |||
} | |||
|
|||
#if SUPPORTS_RUNTIME_INTRINSICS | |||
|
|||
private static void ConvertYuvToBgrSse41(Span<byte> topY, Span<byte> bottomY, Span<byte> topDst, Span<byte> bottomDst, Span<byte> ru, Span<byte> rv, int curX, int step) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These spans can be also converted to ref
-s.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel a bit more comfortable with using spans here, not sure if its really worth changing to ref's
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The main overhead seems to appear here:
ImageSharp/src/ImageSharp/Formats/Webp/Lossy/YuvConversion.cs
Lines 620 to 622 in 984a725
ConvertYuv444ToRgbSse41(y.Slice(8), u.Slice(8), v.Slice(8), out Vector128<short> r1, out Vector128<short> g1, out Vector128<short> b1); | |
ConvertYuv444ToRgbSse41(y.Slice(16), u.Slice(16), v.Slice(16), out Vector128<short> r2, out Vector128<short> g2, out Vector128<short> b2); | |
ConvertYuv444ToRgbSse41(y.Slice(24), u.Slice(24), v.Slice(24), out Vector128<short> r3, out Vector128<short> g3, out Vector128<short> b3); |
You can consider "de-spanning" only in this method and ConvertYuv444ToRgbSse41
replacing .Slice
with Unsafe.Add
to y/u/v base references.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok i am fine with that: 6293f72
// Convert 32 samples of YUV444 to R/G/B | ||
private static void ConvertYuv444ToRgbSse41(Span<byte> y, Span<byte> u, Span<byte> v, out Vector128<short> r, out Vector128<short> g, out Vector128<short> b) | ||
{ | ||
Vector128<byte> y0 = LoadHigh(ref MemoryMarshal.GetReference(y)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ignore me if you already double checked this, but what you really need to make sure now is that if y
,u
,v
have 8 bytes even at the very end, otherwise it's a memory corruption.
The code that allocates these buffers seems to add padding at first glance, but I don't really follow it:
ImageSharp/src/ImageSharp/Formats/Webp/Lossy/Vp8Decoder.cs
Lines 66 to 73 in 099a676
int extraRows = WebpConstants.FilterExtraRows[(int)LoopFilter.Complex]; // assuming worst case: complex filter | |
int extraY = extraRows * this.CacheYStride; | |
int extraUv = extraRows / 2 * this.CacheUvStride; | |
this.YuvBuffer = memoryAllocator.Allocate<byte>((WebpConstants.Bps * 17) + (WebpConstants.Bps * 9) + extraY); | |
this.CacheY = memoryAllocator.Allocate<byte>((16 * this.CacheYStride) + extraY); | |
int cacheUvSize = (16 * this.CacheUvStride) + extraUv; | |
this.CacheU = memoryAllocator.Allocate<byte>(cacheUvSize); | |
this.CacheV = memoryAllocator.Allocate<byte>(cacheUvSize); |
…clean, they will be overwritten anyway This reverts commit cded607.
Vector128<byte> y0 = LoadHigh(ref y); | ||
Vector128<byte> u0 = LoadHigh(ref u); | ||
Vector128<byte> v0 = LoadHigh(ref v); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wanted to stop already but can't resist 😄 Pipelined loads may perform better:
Vector128<byte> y0 = LoadHigh(ref y); | |
Vector128<byte> u0 = LoadHigh(ref u); | |
Vector128<byte> v0 = LoadHigh(ref v); | |
// Load the bytes into the *upper* part of 16b words. That's "<< 8", basically. | |
Vector128<byte> y0 = Unsafe.As<byte, Vector128<byte>>(ref y); | |
Vector128<byte> u0 = Unsafe.As<byte, Vector128<byte>>(ref u); | |
Vector128<byte> v0 = Unsafe.As<byte, Vector128<byte>>(ref v); | |
y0 = Sse2.UnpackLow(Vector128<byte>.Zero, y0); | |
u0 = Sse2.UnpackLow(Vector128<byte>.Zero, u0); | |
v0 = Sse2.UnpackLow(Vector128<byte>.Zero, v0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, makes sense. Changed with 7775c34
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the patience!
Prerequisites
Description
This PR adds a SSE41 version of UpSample, which is used during decoding of lossy webp images to convert from YUV to RGB.
This is where a significant portion of the time is spend (ca. 30%).
Related to #1786
This is still work in progress. This is still left to do:
The last column of each block seems to be incorrect.Fixed with 3f43883Example output:
Output looks correct now: