Bulk receive on HostLink #35

Open
m8pple opened this issue Apr 4, 2018 · 3 comments

Comments

@m8pple
Contributor

m8pple commented Apr 4, 2018

For good performance we need to blast messages into and out of PCIe fairly
fast, and probably want to send/recv multiple messages in one go, possibly
on different threads.

At the moment, on the recv side we have:

  • void recv(void* flit);
  • void recvMsg(void* msg, uint32_t numBytes);

So for every message we have at least one call to get the header, then another
to get the body (unless the message is flit-sized). Ultimately these boil down
to read calls on the PCIe stream, so there is a fair amount of per-call overhead.

Given the receiver has to deal with some kind of parsing, it would make sense
to have a "firehose" type call, whereby the caller can get a whole bunch of flits,
then deal with them later (possibly on another thread). That way we have the best
chance of saturating PCI Express bandwidth.

My suggestion is a function:

/*! Attempts to read up to maxFlits
\retval Number of flits actually read. 0 <= retval <= maxFlits
*/
uint32_t HostLink::tryRecv(void* buffer, uint32_t maxFlits);

On the backend this would hopefully result in just one read getting
a whole bunch of flits.
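
As a purely illustrative sketch of how a caller might use it (BytesPerFlit,
handleFlits and the loop structure here are placeholders, not actual HostLink
names), the idea is to drain flits in bulk and hand them off for parsing,
possibly on another thread:

// Hypothetical caller-side usage of the proposed tryRecv
const uint32_t MaxFlits = 1024;
uint8_t buffer[MaxFlits * BytesPerFlit];
while (running) {
  uint32_t n = hostLink.tryRecv(buffer, MaxFlits);
  if (n == 0) continue;      // nothing available yet
  handleFlits(buffer, n);    // e.g. enqueue for a parser thread
}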

I'm a bit unsure about the interaction with sub-flit-sized results from read() though,
as I see there is the standard looping logic within HostLink::recv. Possibly
it requires a partial buffer within HostLink, so that partial flits are stored
there, then completed and returned when the rest of the bytes turn up.
However, there is no such thing as a partial flit being sent (I think?), so it
could also make sense to loop until any partial flit is completed.
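
To make that concrete, here is a rough sketch of the loop-until-complete option,
assuming a POSIX read() on the PCIe stream file descriptor; pcieFd and
BytesPerFlit are placeholder names rather than the real HostLink members, and
error handling is elided:

#include <unistd.h>
#include <errno.h>

uint32_t HostLink::tryRecv(void* buffer, uint32_t maxFlits)
{
  uint8_t* buf = (uint8_t*) buffer;
  // One large read: grab whatever is currently available, up to maxFlits
  ssize_t n = read(pcieFd, buf, maxFlits * BytesPerFlit);
  if (n <= 0) return 0;
  uint32_t got = (uint32_t) n;
  // If the final flit arrived only partially, keep reading until it is
  // complete, relying on the assumption that partial flits are never sent
  while (got % BytesPerFlit != 0) {
    ssize_t m = read(pcieFd, buf + got, BytesPerFlit - (got % BytesPerFlit));
    if (m < 0 && errno == EINTR) continue;  // retry if interrupted
    if (m <= 0) break;                      // real code would handle errors
    got += (uint32_t) m;
  }
  return got / BytesPerFlit;                // whole flits only
}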

Also, this might be seen as premature optimisation: I'm only looking at
0.3 in detail at the moment, so possibly this isn't an issue in practice and
we are still bottlenecked on raw PCIe performance, rather than API performance.

@mn416
Collaborator

mn416 commented Apr 4, 2018

However, there is no such thing as a partial flit being sent (I think?), so it
could also make sense to loop until any partial flit is completed.

Yes, I think this simple approach would work fine. It sounds like it should be straightforward; I'll take a look.

Thanks for the suggestion. At some point, I'll need to validate the HostLink/PCIe performance to make sure it's as expected, so this is a good thing for me to keep in mind. As I recently discovered, DRAM performance was much worse than expected (and some simple changes made a big difference), so it's really important to measure everything!

@mn416
Collaborator

mn416 commented Oct 13, 2019

Finally supported in commit 20c7f0b.

Performance of graph download in POLite improves by 10x.

@mn416
Collaborator

mn416 commented Nov 11, 2019

Both bulk send and receive are now supported. There is scope to move to 8Gbps PCIe lanes instead of 5Gbps lanes in the bridge board; it's just a matter of clicking a checkbox in QSys. Will leave this issue open as a reminder to try this at some point...
