
Add Message binary decoding support from ContiguousBytes. #914

Merged: 1 commit into apple:master, Nov 18, 2019

Conversation

thomasvl (Collaborator)

ContiguousBytes should allow other libraries that have their own "buffer"
types to directly decode things without having to copy the payloads into
a Data first.

  • Add a new init.
  • Add a new merge.
  • Redirect the Data interfaces through the ContiguousBytes interfaces. (A usage sketch follows below.)
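
A hedged usage sketch of what this enables (MyMessage is a placeholder for any generated message type, and the init label is assumed from the "new init" bullet rather than copied from the diff): callers that hold non-Data buffer types can decode in place instead of copying into a Data first.

import Foundation
import SwiftProtobuf

// Hypothetical example; MyMessage stands in for a generated message type.
func decode(from bytes: [UInt8]) throws -> MyMessage {
  // [UInt8] conforms to ContiguousBytes, so no intermediate Data copy is needed.
  return try MyMessage(contiguousBytes: bytes)
}

func merge(into message: inout MyMessage, bytes: ArraySlice<UInt8>) throws {
  // Same idea for the new merge entry point; its full signature appears later
  // in this thread.
  try message.merge(contiguousBytes: bytes)
}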

@thomasvl requested review from @allevato and @tbkka on November 18, 2019, 15:32
@allevato (Collaborator) left a comment

For some reason I thought this would be a much bigger change than it turned out to be.

@tbkka (Collaborator) left a comment

I'm glad this turned out so simple.

@thomasvl merged commit 7f8daf3 into apple:master on Nov 18, 2019
@thomasvl deleted the contiguousBytes branch on November 18, 2019, 17:58
@Lukasa (Contributor) commented on Nov 19, 2019

Sorry, I was on holiday when this patch was posted.

Can I suggest that rather than taking a ContiguousBytes existential you make this function @inlinable and generic over ContiguousBytes instead? Taking an existential throws away type information unnecessarily, whereas having the function be generic preserves that type information and grants the compiler more opportunities to generate efficient code. This goes double if the generic function is also inlinable, and as it will be small (just a call to withUnsafeBytes that passes the input to another helper function) this actually produces a code size win in most cases.

This advantage particularly accrues to data types whose implementation of ContiguousBytes is itself inlinable, as then it is possible to reduce the code to a minimal case. As an example, let's consider a simplified version of your module containing both options, where doStuff is the equivalent of what your code would be:

import Foundation

@usableFromInline
func doStuff(_ pointer: UnsafeRawBufferPointer) throws { }

public func takesExistential(contiguousBytes bytes: ContiguousBytes) throws {
    try bytes.withUnsafeBytes(doStuff)
}

@inlinable
public func takesGeneric<Bytes: ContiguousBytes>(contiguousBytes bytes: Bytes) throws {
    try bytes.withUnsafeBytes(doStuff)
}

In a different module (necessary to simulate the cross-module optimisation behaviour of this code) we can import NIO and try to pass a ByteBufferView to this code (the @inline(never) shouldn't affect the correctness of the code, but it should stop the linker from stripping our functions and making them hard to see):

import lib
import NIO
import NIOFoundationCompat

@inline(never)
func testWithExistential(buffer: ByteBuffer) throws {
    try takesExistential(contiguousBytes: buffer.readableBytesView)
}

@inline(never)
func testWithGeneric(buffer: ByteBuffer) throws {
    try takesGeneric(contiguousBytes: buffer.readableBytesView)
}

var buffer = ByteBufferAllocator().buffer(capacity: 1024)
try! testWithExistential(buffer: buffer)
try! testWithGeneric(buffer: buffer)

The difference in output here is profound.

The existential-taking code generates two quite large functions. The first is testWithExistential itself:

_$s4test0A15WithExistential6buffery3NIO10ByteBufferV_tKF:        // test.testWithExistential(buffer: NIO.ByteBuffer) throws -> ()
push       rbp                          ; CODE XREF=_main+129
mov        rbp, rsp
push       r14
push       rbx
sub        rsp, 0x50
mov        rbx, r12
movabs     rax, 0xffffffffffffff
and        rdx, rax                     ; argument #3 for method _$s3NIO10ByteBufferV17readableBytesViewAA0bcF0Vvg
lea        rax, qword [rbp+var_38]
call       _$s3NIO10ByteBufferV17readableBytesViewAA0bcF0Vvg ; NIO.ByteBuffer.readableBytesView.getter : NIO.ByteBufferView
lea        rax, qword [_$s3NIO14ByteBufferViewVN] ; _$s3NIO14ByteBufferViewVN
mov        qword [rbp+var_48], rax
mov        rax, qword [_$s3NIO14ByteBufferViewVAC10Foundation15ContiguousBytes19NIOFoundationCompatWL] ; _$s3NIO14ByteBufferViewVAC10Foundation15ContiguousBytes19NIOFoundationCompatWL
test       rax, rax
jne        loc_1000e1a25

lea        rdi, qword [_$s3NIO14ByteBufferViewV10Foundation15ContiguousBytes19NIOFoundationCompatMc] ; _$s3NIO14ByteBufferViewV10Foundation15ContiguousBytes19NIOFoundationCompatMc
lea        rsi, qword [_$s3NIO14ByteBufferViewVN] ; _$s3NIO14ByteBufferViewVN
call       imp___stubs__swift_getWitnessTable ; swift_getWitnessTable
mov        qword [_$s3NIO14ByteBufferViewVAC10Foundation15ContiguousBytes19NIOFoundationCompatWL], rax ; _$s3NIO14ByteBufferViewVAC10Foundation15ContiguousBytes19NIOFoundationCompatWL

loc_1000e1a25:
mov        qword [rbp+var_40], rax      ; CODE XREF=_$s4test0A15WithExistential6buffery3NIO10ByteBufferV_tKF+57
lea        rdi, qword [__swift_FORCE_LOAD_$_swiftCompatibilityDynamicReplacements_$_test+24] ; 0x100111c90
mov        esi, 0x38
mov        edx, 0x7
call       imp___stubs__swift_allocObject ; swift_allocObject
mov        qword [rbp+var_60], rax
mov        rcx, qword [rbp+var_18]
mov        qword [rax+0x30], rcx
mov        rcx, qword [rbp+var_20]
mov        qword [rax+0x28], rcx
mov        rcx, qword [rbp+var_28]
mov        qword [rax+0x20], rcx
mov        rcx, qword [rbp+var_38]
mov        rdx, qword [rbp+var_30]
mov        qword [rax+0x18], rdx
mov        qword [rax+0x10], rcx
lea        r14, qword [rbp+var_60]
mov        rdi, r14                     ; argument #1 for method _$s3lib16takesExistential15contiguousBytesy10Foundation010ContiguousE0_p_tKF
mov        r12, rbx
call       _$s3lib16takesExistential15contiguousBytesy10Foundation010ContiguousE0_p_tKF ; lib.takesExistential(contiguousBytes: Foundation.ContiguousBytes) throws -> ()
mov        rbx, r12
mov        rdi, r14                     ; argument #1 for method ___swift_destroy_boxed_opaque_existential_1
call       ___swift_destroy_boxed_opaque_existential_1 ; ___swift_destroy_boxed_opaque_existential_1
test       rbx, rbx
mov        r12, rbx
add        rsp, 0x50
pop        rbx
pop        r14
pop        rbp
ret

This code is in three parts. The first block creates the ByteBufferView from a ByteBuffer, and is not something we're very interested in. The second part is the thunk that builds and caches the witness table for ByteBufferView's conformance to ContiguousBytes. It will run only once in the program, and is required because the call through the existential needs that witness table.

The third block is interesting, as it does two things. First, it heap-allocates. It does this because the existential being passed to lib.takesExistential is not known not to escape, so a refcounted heap allocation must be made to manage the ByteBufferView. Second, note that we emit an actual call into lib.takesExistential. That means we need to jump into lib, to this code:

_$s3lib16takesExistential15contiguousBytesy10Foundation010ContiguousE0_p_tKF:        // lib.takesExistential(contiguousBytes: Foundation.ContiguousBytes) throws -> ()
push       rbp                          ; CODE XREF=_$s4test0A15WithExistential6buffery3NIO10ByteBufferV_tKF+165
mov        rbp, rsp
push       r15
push       r14
push       r13
push       rbx
mov        r14, r12
mov        rbx, qword [rdi+0x18]
mov        r15, qword [rdi+0x20]
mov        rsi, rbx                     ; argument #2 for method ___swift_project_boxed_opaque_existential_1
call       ___swift_project_boxed_opaque_existential_1 ; ___swift_project_boxed_opaque_existential_1
mov        rdx, qword [_$sytN_100102a88] ; _$sytN_100102a88
add        rdx, 0x8
lea        rdi, qword [_$sSWs5Error_pIgyzo_SWytsAA_pIegyrzo_TR20$s3lib7doStuffyySWKFTf3nnpf_n] ; _$sSWs5Error_pIgyzo_SWytsAA_pIegyrzo_TR20$s3lib7doStuffyySWKFTf3nnpf_n
xor        esi, esi
mov        r13, rax
mov        r12, r14
mov        rcx, rbx
mov        r8, r15
call       imp___stubs__$s10Foundation15ContiguousBytesP010withUnsafeC0yqd__qd__SWKXEKlFTj ; dispatch thunk of Foundation.ContiguousBytes.withUnsafeBytes<A>((Swift.UnsafeRawBufferPointer) throws -> A1) throws -> A1
pop        rbx
pop        r13
pop        r14
pop        r15
pop        rbp
ret

Here we do a partial unwrap of the existential and then jump through the ContiguousBytes protocol witness table for the provided value, passing the address of doStuff as an argument.

The effect here is that we have the following costs: one allocation, an unwrap of an existential, an indirect call via the PWT into NIO, followed by another indirect call in the non-specialised NIO implementation of withUnsafeBytes back into lib.doStuff. These two indirect calls are tricky to predict, and when combined with the allocation make things somewhat expensive, especially given that all we want to do is to ask NIO to vend us a pointer to interior storage.

By comparison, here is the code generated by testWithGeneric:

 _$s4test0A11WithGeneric6buffery3NIO10ByteBufferV_tKF:        // test.testWithGeneric(buffer: NIO.ByteBuffer) throws -> ()
push       rbp                          ; CODE XREF=_main+212
mov        rbp, rsp
push       r15
push       r14
push       r13
push       rbx
sub        rsp, 0x50
mov        qword [rbp+var_50], r12
movabs     rax, 0xffffffffffffff
and        rdx, rax                     ; argument #3 for method _$s3NIO10ByteBufferV17readableBytesViewAA0bcF0Vvg
lea        rax, qword [rbp+var_48]
call       _$s3NIO10ByteBufferV17readableBytesViewAA0bcF0Vvg ; NIO.ByteBuffer.readableBytesView.getter : NIO.ByteBufferView
mov        r15, qword [rbp+var_48]
mov        r14d, dword [rbp+var_38]
movzx      ebx, word [rbp+var_34]
movzx      r13d, byte [rbp+var_32]
lea        rdi, qword [r15+0x18]
lea        rsi, qword [rbp+var_68]
xor        edx, edx
xor        ecx, ecx
call       imp___stubs__swift_beginAccess ; swift_beginAccess
mov        r15, qword [r15+0x18]
lea        rdi, qword [rbp+var_48]      ; argument #1 for method _$s3NIO14ByteBufferViewVWOr
call       _$s3NIO14ByteBufferViewVWOr  ; outlined retain of NIO.ByteBufferView
shl        r13, 0x30
shl        rbx, 0x20
or         rbx, r14
or         rbx, r13
mov        rdi, rbx                     ; argument #1 for method _$s3NIO16_ByteBufferSliceV10lowerBounds6UInt32Vvg
call       _$s3NIO16_ByteBufferSliceV10lowerBounds6UInt32Vvg ; NIO._ByteBufferSlice.lowerBound.getter : Swift.UInt32
cmp        r14d, eax
jb         loc_1000e1bd0

mov        eax, eax
add        r15, rax
je         loc_1000e1bd2

mov        rax, qword [rbp+var_30]
mov        rbx, qword [rbp+var_28]
sub        rbx, rax
jo         loc_1000e1bd4

test       rbx, rbx
js         loc_1000e1bd6

add        r15, rax
add        rbx, r15
lea        rdi, qword [rbp+var_48]      ; argument #1 for method _$s3NIO14ByteBufferViewVWOr
call       _$s3NIO14ByteBufferViewVWOr  ; outlined retain of NIO.ByteBufferView
mov        rdi, r15
mov        rsi, rbx
mov        r12, qword [rbp+var_50]
call       _$s3lib7doStuffyySWKF        ; lib.doStuff(Swift.UnsafeRawBufferPointer) throws -> ()
mov        rbx, r12
test       r12, r12
je         loc_1000e1ba4

mov        r12, rbx
call       imp___stubs__swift_willThrow ; swift_willThrow
lea        r14, qword [rbp+var_48]
mov        rdi, r14                     ; argument #1 for method _$s3NIO14ByteBufferViewVWOs
call       _$s3NIO14ByteBufferViewVWOs  ; outlined release of NIO.ByteBufferView
mov        rdi, r14                     ; argument #1 for method _$s3NIO14ByteBufferViewVWOs
call       _$s3NIO14ByteBufferViewVWOs  ; outlined release of NIO.ByteBufferView
mov        r12, rbx
call       imp___stubs__swift_willThrow ; swift_willThrow
jmp        loc_1000e1bb8

loc_1000e1ba4:
lea        r14, qword [rbp+var_48]      ; CODE XREF=_$s4test0A11WithGeneric6buffery3NIO10ByteBufferV_tKF+188
mov        rdi, r14                     ; argument #1 for method _$s3NIO14ByteBufferViewVWOs
call       _$s3NIO14ByteBufferViewVWOs  ; outlined release of NIO.ByteBufferView
mov        rdi, r14                     ; argument #1 for method _$s3NIO14ByteBufferViewVWOs
call       _$s3NIO14ByteBufferViewVWOs  ; outlined release of NIO.ByteBufferView

loc_1000e1bb8:
mov        rdi, r14                     ; argument #1 for method _$s3NIO14ByteBufferViewVWOs, CODE XREF=_$s4test0A11WithGeneric6buffery3NIO10ByteBufferV_tKF+226
call       _$s3NIO14ByteBufferViewVWOs  ; outlined release of NIO.ByteBufferView
mov        r12, rbx
add        rsp, 0x50
pop        rbx
pop        r13
pop        r14
pop        r15
pop        rbp
ret

This code is a bit larger but has a number of better properties. In particular, it does not heap allocate, and never makes an indirect call: all calls are directly to the relevant function, including directly into lib.doStuff. This code path is much friendlier to the branch predictor and is inclined to run substantially faster.

In fact, the only problem with this code is that the codegen is kinda lousy due to implementation mistakes in NIO: we missed a bunch of inlinables that I'll be adding sometime soon.

In general, having this inlinable generic thunk greatly reduces the costs of obtaining pointers to the backing storage of structures, and I'd strongly recommend doing it.

@thomasvl (Collaborator, Author)

Do we want to just worry about inlining on this, or should we also start to go after #763 (and deal with what it might mean for breaking changes)?

@thomasvl (Collaborator, Author) commented on Nov 19, 2019

The generic part is likely easily doable, but @inlinable runs into problems.

BinaryDecoder is internal to the module, and it is directly used from merge(...), so you can't make merge @inlinable without going back and making BinaryDecoder public.

So the question quickly slides into either expanding the library interface by making more things public, or wedging in more non-inlinable helper functions to keep some types hidden, and the latter is likely to give up some of the wins one is trying to get by allowing things to be inlined.
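
To make the constraint concrete, here is a minimal hypothetical sketch (simplified stand-ins, not the real SwiftProtobuf declarations): the body of an @inlinable function is emitted into client modules, so it may only reference public or @usableFromInline declarations.

internal struct BinaryDecoder {
  internal init(forReadingFrom pointer: UnsafePointer<UInt8>, count: Int) {}
  internal mutating func decodeFullMessage() {}
}

@inlinable
public func merge(bytes: UnsafeRawBufferPointer) {
  // Uncommenting the next line fails to compile with an error along the lines of:
  //   struct 'BinaryDecoder' is internal and cannot be referenced from an '@inlinable' function
  // var decoder = BinaryDecoder(forReadingFrom: ..., count: bytes.count)
}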

@Lukasa (Contributor) commented on Nov 19, 2019

BinaryDecoder is internal to the module, and it is directly used from merge(...), so you can't make merge @inlinable without going back and making BinaryDecoder public

You can if you mark it @usableFromInline.
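
For example (a minimal sketch continuing the hypothetical declarations above), @usableFromInline exposes a declaration to @inlinable bodies without making it public API; note that each member an inlinable body touches needs the annotation as well:

@usableFromInline
internal struct BinaryDecoder {
  @usableFromInline
  internal init(forReadingFrom pointer: UnsafePointer<UInt8>, count: Int) {}

  @usableFromInline
  internal mutating func decodeFullMessage() {}
}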

@thomasvl (Collaborator, Author)

You can if you mark it @usableFromInline.

Yeah, that doesn't really work out: everything needed to support the Decoder protocol then also has to get tagged (as well as the init and the decodeFullMessage(message:) that are called from the method we're trying to make @inlinable).

@Lukasa (Contributor) commented on Nov 19, 2019

No, they don't; you just need to refactor slightly. You have this:

public mutating func merge(
  contiguousBytes bytes: ContiguousBytes,
  extensions: ExtensionMap? = nil,
  partial: Bool = false,
  options: BinaryDecodingOptions = BinaryDecodingOptions()
) throws {
  try bytes.withUnsafeBytes { (body: UnsafeRawBufferPointer) in
    if let baseAddress = body.baseAddress, body.count > 0 {
      let pointer = baseAddress.assumingMemoryBound(to: UInt8.self)
      var decoder = BinaryDecoder(forReadingFrom: pointer,
                                  count: body.count,
                                  options: options,
                                  extensions: extensions)
      try decoder.decodeFullMessage(message: &self)
    }
  }
  if !partial && !isInitialized {
    throw BinaryDecodingError.missingRequiredFields
  }
}

You can write this instead:

@inlinable
public mutating func merge<Bytes: ContiguousBytes>(
  contiguousBytes bytes: Bytes,
  extensions: ExtensionMap? = nil,
  partial: Bool = false,
  options: BinaryDecodingOptions = BinaryDecodingOptions()
) throws {
  try bytes.withUnsafeBytes { (body: UnsafeRawBufferPointer) in
    try _merge(rawBytes: body, extensions: extensions, partial: partial, options: options)
  }
}


@usableFromInline
internal mutating func _merge(
  rawBytes body: UnsafeRawBufferPointer,
  extensions: ExtensionMap?,
  partial: Bool,
  options: BinaryDecodingOptions 
) throws {
  if let baseAddress = body.baseAddress, body.count > 0 {
    let pointer = baseAddress.assumingMemoryBound(to: UInt8.self)
    var decoder = BinaryDecoder(forReadingFrom: pointer,
                                count: body.count,
                                options: options,
                                extensions: extensions)
    try decoder.decodeFullMessage(message: &self)
  }
  if !partial && !isInitialized {
    throw BinaryDecodingError.missingRequiredFields
  }
}

This requires only one point of @usableFromInline, while everything else can remain appropriately access-controlled.

@thomasvl (Collaborator, Author)

That's the other option I was suggesting; I just wasn't sure whether that gets us the same wins, since we're only moving a small amount of code to be reachable now.

@Lukasa (Contributor) commented on Nov 19, 2019

Yeah, this gives nearly 100% of the win. As a rough heuristic, if the function is not generic, the win from making it @inlinable is much lower than when it is. This pattern of having a small inlinable thunk that performs the generic operation and then delegates to an inner function is a very effective strategy for giving the compiler the best opportunity to remove the overhead of generics while keeping your code as compartmentalised as possible.

@allevato (Collaborator)

@Lukasa, do you have any links to good resources where I can read more about the finer performance differences between things like existentials vs. generics? I get bits and pieces of info from discussions like this and from some of the compiler documentation, but it's not an area I've ever needed to personally focus too much on, so I have a lot of gaps in my knowledge and I'd love to know if someone has done a comprehensive write-up anywhere.

@Lukasa (Contributor) commented on Nov 19, 2019

I am not aware of a good writeup. It's definitely something worth having written down. The biggest problem is that the story changes a lot: for example, Swift recently (in the last two or so Swift versions) started stack-allocating existentials where possible, which somewhat changes their performance profile.

Part of the issue as well is that constructing small benchmarks is quite hard because you need to place a module boundary in the way. Within a module, the compiler will usually be able to break down the abstraction and perform the appropriate specialisation work. Only across boundaries do we see the difference.
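
For instance, here is a minimal sketch of the kind of two-target SwiftPM layout that puts the API and the driver on opposite sides of a module boundary (names are illustrative, loosely following the lib/test split above):

// swift-tools-version:5.1
// Package.swift for a hypothetical benchmark harness.
import PackageDescription

let package = Package(
  name: "inline-bench",
  targets: [
    // The API under test: the existential-taking and generic entry points live here.
    .target(name: "lib"),
    // The driver imports lib across a module boundary, so cross-module
    // inlining and specialisation behaviour is what actually gets exercised.
    .target(name: "bench", dependencies: ["lib"]),
  ]
)

Building with swift build -c release and disassembling the resulting binary, as above, then shows the cross-module behaviour.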

In general, though, my first-order theory is that wherever possible it is better to use generics than existentials, because generics can be specialised but existentials cannot. This rule is not 100% accurate, but it's a good starting point: preserving type information gives the compiler more options than throwing it away. Naturally this doesn't work in all cases: generics are tricky for heterogeneous containers, for example. But it's a useful rule of thumb.
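
As a small illustrative sketch of that heterogeneous-container caveat (generic Swift, not code from this thread): a single buffer parameter can stay generic, but a mixed collection of conforming types forces the existential.

import Foundation

// Generic: the concrete buffer type is known at each call site and can be specialised.
func checksum<Bytes: ContiguousBytes>(_ bytes: Bytes) -> Int {
  return bytes.withUnsafeBytes { raw in
    raw.reduce(0) { $0 &+ Int($1) }
  }
}

// Heterogeneous: mixing Data, [UInt8], etc. in one collection requires the
// existential, so each element is reached through its witness table.
func checksum(all buffers: [ContiguousBytes]) -> Int {
  var total = 0
  for buffer in buffers {
    total &+= buffer.withUnsafeBytes { raw in raw.reduce(0) { $0 &+ Int($1) } }
  }
  return total
}

print(checksum(Data([7, 8])))
print(checksum(all: [Data([1, 2, 3]), [4, 5, 6] as [UInt8]]))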
