
Add multi-core concurrent packet processing #2234

Merged
merged 17 commits into dev on Sep 11, 2024

Conversation

@joseph-henry (Contributor) commented Feb 23, 2024

ZeroTier on multiple threads

This patch enables concurrent processing of packets in the RX and TX directions and appears to improve performance significantly on low-powered hardware such as the ARM chips in routers, Raspberry Pis, etc.

This has only been implemented for Linux and FreeBSD.

Example usage (local.conf):

{
   "settings":
   {
       "multicoreEnabled": true,
       "concurrency": 4,
       "cpuPinningEnabled": false
   }
}
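To give a rough idea of the approach, here is a minimal sketch of concurrent packet processing: a pool of worker threads draining a shared queue. This is illustrative only, not the actual implementation; all names are invented.

#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical sketch: N workers pull packets off one queue and process
// them in parallel. Not ZeroTier's actual code.
struct Packet { std::vector<uint8_t> data; };

class PacketPool {
public:
    explicit PacketPool(unsigned concurrency) {
        for (unsigned i = 0; i < concurrency; ++i)
            workers_.emplace_back([this] { run(); });
    }
    ~PacketPool() {
        { std::lock_guard<std::mutex> lk(m_); done_ = true; }
        cv_.notify_all();
        for (auto& t : workers_) t.join();
    }
    void enqueue(Packet p) {
        { std::lock_guard<std::mutex> lk(m_); q_.push(std::move(p)); }
        cv_.notify_one();
    }
private:
    void run() {
        for (;;) {
            std::unique_lock<std::mutex> lk(m_);
            cv_.wait(lk, [this] { return done_ || !q_.empty(); });
            if (q_.empty()) return;     // done_ was set and queue is drained
            Packet p = std::move(q_.front());
            q_.pop();
            lk.unlock();                // release lock while doing per-packet work
            process(p);
        }
    }
    void process(Packet&) { /* decrypt, route, deliver, etc. */ }
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<Packet> q_;
    std::vector<std::thread> workers_;
    bool done_ = false;
};

Per-packet work then runs on whichever worker happens to pick the packet up, which is also why packet re-ordering becomes a concern (discussed further down in this thread).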

@laduke (Contributor) commented Feb 29, 2024

Awesome!
Can you guard it so it doesn't compile on macOS, Windows, etc.? I know you know it doesn't work there, but the ifdefs are worth testing.
I made myself a branch with all the current PRs, and this one makes that branch not build on my mac (obviously).
I'm not sure if it's feasible to make it a local.conf setting, so we can get the code in but not enabled by default, but that would be cool IMO.
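For example, a guard using the standard compiler predefines. This tiny program is just an illustration of the idea, not the actual patch code:

#include <cstdio>

int main() {
#if defined(__linux__) || defined(__FreeBSD__)
    // Platforms where the multicore path is implemented.
    std::printf("multicore packet path available\n");
#else
    // macOS, Windows, etc. fall back to the existing single-threaded path.
    std::printf("single-threaded fallback\n");
#endif
    return 0;
}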

@laduke laduke added this to the 1.14.0 milestone Mar 5, 2024
@joseph-henry joseph-henry removed this from the 1.14.0 milestone Mar 14, 2024
@sandros94

I wonder if this could improve performance on smaller CPUs like the ones in commercial NASes.

@joseph-henry (Contributor, Author)

Update: Packet re-ordering seemed to be an issue in situations where a single TCP stream was being received by a large number of high-performance cores, so the following changes were made, which I believe are a good compromise for the time being:

This latest commit will not have multicore enabled by default; it can be enabled with ZT_ENABLE_MULTICORE=1.

When enabled, it will use only 2 cores, and only if at least 4 logical cores are available; no matter how many more cores are present, it will still use 2. To override this, set ZT_CONCURRENCY=N.
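As a rough sketch of that default logic (hypothetical code; what happens below 4 logical cores isn't specified above, so falling back to a single thread is an assumption):

#include <cstdlib>
#include <thread>

unsigned pickConcurrency() {
    if (const char* s = std::getenv("ZT_CONCURRENCY"))
        return (unsigned)std::atoi(s);                  // explicit override
    unsigned hw = std::thread::hardware_concurrency();  // may return 0 if unknown
    return (hw >= 4) ? 2u : 1u;                         // capped at 2 by default;
                                                        // 1 below 4 cores is an assumption
}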

To experiment with core pinning you can use ZT_CORE_PINNING=1, but this is most likely a bad idea, so try it last.

Suggested default usage:

sudo ZT_ENABLE_MULTICORE=1 ./zerotier-one
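Or, combining the variables described above while experimenting:

sudo ZT_ENABLE_MULTICORE=1 ZT_CONCURRENCY=4 ./zerotier-one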

I am interested in hearing how this performs for people.

Thanks.

@joseph-henry (Contributor, Author)

I wonder if this could improve performance on smaller CPUs like the ones in commercial NASes.

Yes, exactly. This is where I'm seeing the best gains in my testing.


@sandros94

I am interested in hearing how this performs for people.

I'm not entirely sure if I'm building it right, but I just did some tests, in particular for video workflows, using Blackmagic Disk Speed Test.
Source was a Synology DS1522+ (ZeroTier built from source, no other containers or connections active). Destination was a Win11 + Ryzen 5800X machine (ZeroTier 1.14 stable) over a public network.

The connection should reach 100 Mbit/s from Win to Synology and 900 Mbit/s from Synology to Win. All tests are from the Win machine's perspective.

  1. With ZT_ENABLE_MULTICORE=0: upload is 100 Mbit/s; download is ~410 Mbit/s
  2. With ZT_ENABLE_MULTICORE=1 and ZT_CONCURRENCY=2: upload is ~80 Mbit/s; download is ~290 Mbit/s
  3. With ZT_ENABLE_MULTICORE=1 and ZT_CONCURRENCY=4: upload is ~50 Mbit/s; download is ~260 Mbit/s

ZT_CORE_PINNING=1 didn't make a difference, but I also noticed that upload speed is quite inconsistent.

P.S.: container running from this Docker Hub image (tag multicore-64634c9), built with this dockerfile.

@TommyKing


I get the same result with a 50 Mb/s upload connection:

  1. With ZT_ENABLE_MULTICORE=0: upload is 5 MB/s
  2. With ZT_ENABLE_MULTICORE=1 and ZT_CONCURRENCY=2: ~4.4 MB/s
  3. With ZT_ENABLE_MULTICORE=1 and ZT_CONCURRENCY=4: ~4.0 MB/s

@joseph-henry (Contributor, Author)

Thanks for your results, everybody. It's still a work in progress.

Some updates:

  • Packets are sorted by flow to prevent re-ordering, though this doesn't seem to be a full solution (see the sketch after the example config below)
  • Configuration is now done via local.conf, not environment variables

Example config:

{
   "settings":
   {
       "multicoreEnabled": true,
       "concurrency": 4,
       "cpuPinningEnabled": false
   }
}
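To illustrate the flow sorting mentioned above: the general technique (sketched here with invented names, not necessarily exactly what this patch does) is to hash each packet's flow identifier and pin that flow to a single worker queue, so packets within a flow stay ordered while distinct flows still spread across cores.

#include <cstdint>
#include <functional>

// A 5-tuple identifying a flow (hypothetical struct).
struct FlowKey {
    uint32_t srcIp, dstIp;
    uint16_t srcPort, dstPort;
    uint8_t proto;
};

// Same flow -> same worker, so per-flow packet ordering is preserved.
unsigned workerFor(const FlowKey& k, unsigned concurrency) {
    uint64_t a = ((uint64_t)k.srcIp << 32) | k.dstIp;
    uint64_t b = ((uint64_t)k.srcPort << 24) | ((uint64_t)k.dstPort << 8) | k.proto;
    size_t h = std::hash<uint64_t>{}(a) ^ (std::hash<uint64_t>{}(b) << 1);
    return (unsigned)(h % concurrency);
}

Each worker then drains its own FIFO queue. The trade-off, and why this isn't a full solution, is that a single heavy flow is still limited to one core.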

More updates to come.

@joseph-henry (Contributor, Author) commented Sep 6, 2024

Update: At this point we are seeing decent gains in most cases and no worsening of performance in any case. However, this claim is only true for Linux at the moment. Once things are fully ironed out, we plan to port these changes to other platforms.

If anyone is willing to give this another go, it would be very much appreciated. Thank you.

@adamierymenko adamierymenko marked this pull request as ready for review September 9, 2024 20:17
@adamierymenko adamierymenko changed the base branch from main to dev September 10, 2024 14:33
@adamierymenko (Contributor) left a comment

I think this is fine to ship disabled by default. We will continue to test, including release builds, but I want to merge, create the build branch, and start the release process.

@adamierymenko adamierymenko merged commit 4a485df into dev Sep 11, 2024
4 checks passed
@sandros94 commented Sep 11, 2024

Thanks @adamierymenko, just a quick question: where can I follow its development? And to quickly test it on a Synology NAS, should I build my own Docker image based on the dev branch, or is there a CI build for that?

Also: currently it is only configurable via local.conf, and ENV variables are still disabled?

@adamierymenko adamierymenko deleted the jh-zerotier-multithreaded branch September 11, 2024 18:57
@joseph-henry (Contributor, Author)

should I build my own Docker image based on the dev branch, or is there a CI build for that?

Also: currently it is only configurable via local.conf, and ENV variables are still disabled?

Yes, building a Docker image from latest dev would be best for testing this. And also yes, we made the decision to configure via local.conf instead of ENV vars.

@sandros94

And also yes, we made the decision to configure via local.conf instead of ENV vars.

Hopefully it's just a temporary decision; handling ENV vars in a Docker container is such a breeze compared to bind mounts, particularly when A/B testing.

Thanks for the work.

@raryanpur


Maybe both? Having runtime configuration is super helpful.

@joseph-henry (Contributor, Author)

That's actually a good point. We'll consider doing that.

@sandros94 commented Oct 3, 2024

For now I'm going to leave this here, since there isn't an official issue to track this, nor do I have useful information or a reproduction to open one, but I quickly tested 1.14.1 with the local.conf below and got half the speed compared to multicore disabled. However, I didn't experience the ups and downs I was getting the last time I reported.

Same setup as last time:

  • Synology DS1522+ with DSM7 (multicore enabled)
  • Win11+Ryzen 5800X

/var/lib/zerotier-one/local.conf:

{
  "settings":
  {
    "multicoreEnabled": true,
    "concurrency": 2,
    "cpuPinningEnabled": false
  }
}

@jklop123 commented Oct 8, 2024


I tested a few scenarios. To get the performance improvement from multithreading, both ends must be Linux clients and both must have multithreading enabled.

Setup: PVE virtual machines, two ZeroTier clients, Debian 12, x64.

  • Neither client nor server has multithreading enabled: 1.5 Gbit/s
  • Only one side has multithreading enabled: 1.5 Gbit/s
  • Both client and server have multithreading enabled: 2.9-3 Gbit/s, reaching the 2x performance improvement claimed in the official blog

My confusion is that if the server opens another iperf3 instance and I add a client2 -> server test, the combined bandwidth of the two tests is still only 3 Gbit/s. I'm not sure whether this result meets expectations. Maybe the bottleneck now is that the ZeroTier virtual network interface still has only one RX/TX queue? In any case, I did no further research.
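For reference, the two-stream test can be reproduced roughly like this (the ports and the ZeroTier-assigned server address are placeholders):

# on the server: one listener per client
iperf3 -s -p 5201 &
iperf3 -s -p 5202 &
# on client1 and client2 respectively, started simultaneously
iperf3 -c 10.147.17.1 -p 5201
iperf3 -c 10.147.17.1 -p 5202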

P.S.: This test environment is a server environment; the CPU frequency is not high, so the absolute performance is not very good.
