Bulk receive on HostLink #35

Open
m8pple opened this issue Apr 4, 2018 · 3 comments

Comments

@m8pple
Contributor

m8pple commented Apr 4, 2018

For good performance we need to blast messages into and out of PCIe fairly
fast, and probably want to send/recv multiple messages in one go, possibly
on different threads.

At the moment, on the recv side we have:

  • void recv(void* flit);
  • void recvMsg(void* msg, uint32_t numBytes);

So for every message we have at least one call to get the header, then another
to get the body (unless the message is flit-sized). Ultimately these boil down
to read calls on the PCIe stream, so there is a fair amount of per-call overhead.

Given the receiver has to deal with some kind of parsing, it would make sense
to have a "firehose" type call, whereby the caller can get a whole bunch of flits,
then deal with them later (possibly on another thread). That way we have the best
chance of saturating PCI Express bandwidth.

My suggestion is a function:

/*! Attempts to read up to maxFlits
\retval Number of flits actually read. 0 <= retval <= maxFlits
*/
uint32_t HostLink::tryRecv(void* buffer, uint32_t maxFlits);

On the backend this would hopefully result in just one read getting
a whole bunch of flits.
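
As a purely illustrative sketch of how a caller might use it (BytesPerFlit,
handleFlits and the loop structure here are placeholders, not actual HostLink
names), the idea is to drain flits in bulk and hand them off for parsing,
possibly on another thread:

// Hypothetical caller-side usage of the proposed tryRecv
const uint32_t MaxFlits = 1024;
uint8_t buffer[MaxFlits * BytesPerFlit];
while (running) {
  uint32_t n = hostLink.tryRecv(buffer, MaxFlits);
  if (n == 0) continue;      // nothing available yet
  handleFlits(buffer, n);    // e.g. enqueue for a parser thread
}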

I'm a bit unsure about the interaction with sub-flit-sized results from read() though,
as I see there is the standard looping logic within HostLink::recv. Possibly
it requires a partial buffer within HostLink, so that partial flits are stored
there, then completed and returned when the rest of the bytes turn up.
However, there is no such thing as a partial flit being sent (I think?), so it
could also make sense to loop until any partial flit is completed.
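
To make that concrete, here is a rough sketch of the loop-until-complete option,
assuming a POSIX read() on the PCIe stream file descriptor; pcieFd and
BytesPerFlit are placeholder names rather than the real HostLink members, and
error handling is elided:

#include <unistd.h>
#include <errno.h>

uint32_t HostLink::tryRecv(void* buffer, uint32_t maxFlits)
{
  uint8_t* buf = (uint8_t*) buffer;
  // One large read: grab whatever is currently available, up to maxFlits
  ssize_t n = read(pcieFd, buf, maxFlits * BytesPerFlit);
  if (n <= 0) return 0;
  uint32_t got = (uint32_t) n;
  // If the final flit arrived only partially, keep reading until it is
  // complete, relying on the assumption that partial flits are never sent
  while (got % BytesPerFlit != 0) {
    ssize_t m = read(pcieFd, buf + got, BytesPerFlit - (got % BytesPerFlit));
    if (m < 0 && errno == EINTR) continue;  // retry if interrupted
    if (m <= 0) break;                      // real code would handle errors
    got += (uint32_t) m;
  }
  return got / BytesPerFlit;                // whole flits only
}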

Also, this might be seen as premature optimisation: I'm only looking at
0.3 in detail at the moment, so possibly this isn't an issue in practice and
we are still bottlenecked on raw PCIe performance, rather than API performance.

@mn416
Collaborator

mn416 commented Apr 4, 2018

However, there is no such thing as a partial flit being sent (I think?), so it
could also make sense to loop until any partial flit is completed.

Yes, I think this simple approach would work fine. It sounds like it should be straightforward; I'll take a look.

Thanks for the suggestion. At some point, I'll need to validate the HostLink/PCIe performance to make sure it's as expected, so this is a good thing for me to keep in mind. As I recently discovered, DRAM performance was much worse than expected (and some simple changes made a big difference), so it's really important to measure everything!

@mn416
Collaborator

mn416 commented Oct 13, 2019

Finally supported in commit 20c7f0b.

Performance of graph download in POLite improves by 10x.

@mn416
Collaborator

mn416 commented Nov 11, 2019

Both bulk send and receive are now supported. There is scope to move to 8Gbps PCIe lanes instead of 5Gbps lanes in the bridge board; it's just a matter of clicking a checkbox in QSys. Will leave this issue open as a reminder to try this at some point...
