Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

replace numpy dependency in websockify #1926

Closed
totaam opened this issue Aug 1, 2018 · 12 comments
Closed

replace numpy dependency in websockify #1926

totaam opened this issue Aug 1, 2018 · 12 comments

Comments

@totaam
Copy link
Collaborator

totaam commented Aug 1, 2018

Issue migrated from trac ticket # 1926

component: external | priority: major | resolution: fixed

2018-08-01 18:57:50: antoine created the issue


As per #1913#comment:2, importing numpy wastes a lot of memory and the only thing that websockify uses it for is to do a string XOR!

Let's submit a patch to make this more pluggable, we have our own code which can provide it. (our cyxor cython extension)
The import would need to be deferred since the websockify module imports the sub-modules. (ie: WebSocket constructor could get the xor wrapper function from websockify.xor which we would then be able to patch up)

Alternatively, we could use the modules context hack to temporarily hide "numpy" from the module loader, then patch the _unmask method.
This approach would work with older version of websockify - and they have not made any new releases in years now...

See also #1134.

@totaam
Copy link
Collaborator Author

totaam commented Aug 1, 2018

2018-08-01 21:39:26: antoine changed owner from antoine to maxmylyn

@totaam
Copy link
Collaborator Author

totaam commented Aug 1, 2018

2018-08-01 21:39:26: antoine commented


r19989 does this by pretending numpy is not installed and then monkeypatching a cython function to do the work.

This should be faster than the numpy code it replaces:

  • one fewer copy of the buffer data when doing the XOR
  • no longer copying the buffers when reading from the websocket (pass memoryview directly to caller, queue the rest)

The impact should be negligible since the server doesn't do a lot of reading.

importing websockify will now show this warning:

WARNING: no 'numpy' module, HyBi protocol will be slower

This can safely be ignored.

@maxmylyn: please check that this new code does not cause any performance regressions.

@totaam
Copy link
Collaborator Author

totaam commented Aug 30, 2018

2018-08-30 05:31:53: antoine commented


r19989 caused a regression reported in #1941#comment:7 :

disconnect: zlib packet decompression failed must be string or read-only buffer, not memoryview
Fixed in r20229.

@totaam
Copy link
Collaborator Author

totaam commented Oct 23, 2018

2018-10-23 17:00:38: antoine commented


Another regression: #2001

@totaam
Copy link
Collaborator Author

totaam commented Jan 16, 2019

2019-01-16 04:52:37: antoine commented


See also: #2104, with websockify > 0.8 we can silence the numpy warning.

@totaam
Copy link
Collaborator Author

totaam commented Jan 18, 2019

2019-01-18 17:30:22: antoine changed status from new to closed

@totaam
Copy link
Collaborator Author

totaam commented Jan 18, 2019

2019-01-18 17:30:22: antoine set resolution to fixed

@totaam
Copy link
Collaborator Author

totaam commented Jan 18, 2019

2019-01-18 17:30:22: antoine commented


Not heard back, so:

  • r21389 adds a simple command line tool to compare the performance of the default websockify unmask function (which uses numpy) and our cython implementation
  • r21392 adds a unit test to verify that we get the same result as numpy, no matter what the buffer alignment is

Sample output:

$ python ./tests/xpra/net/websockify_numpy_vs_cython.py 
* slice size: 1KB
 - websockify via numpy:      119MB/s
 - xpra cython         :      635MB/s
* slice size: 8KB
 - websockify via numpy:     1189MB/s
 - xpra cython         :     4016MB/s
* slice size: 64KB
 - websockify via numpy:     3148MB/s
 - xpra cython         :     9698MB/s
* slice size: 1024KB
 - websockify via numpy:     1207MB/s
 - xpra cython         :     6416MB/s
* slice size: 16384KB
 - websockify via numpy:     1096MB/s
 - xpra cython         :     1170MB/s
* slice size: 65536KB
 - websockify via numpy:     1152MB/s
 - xpra cython         :     1254MB/s

It's much faster on smaller packets (1KB: ~5 times faster!) than on really big ones (64MB: only 10%).
Typical packets are in the 1-64KB range and this is also where we benefit the most.


But I wasn't happy with those values as lz4 can compress data at that speed.
So r21393 optimizes the code for better performance.
Here are some results after setting cpu governor to performance:

$ python ./tests/xpra/net/websockify_numpy_vs_cython.py 
* slice size: 1KB
 - xpra cython         :      899MB/s
 - websockify via numpy:      141MB/s
* slice size: 8KB
 - xpra cython         :     5330MB/s
 - websockify via numpy:     1458MB/s
* slice size: 64KB
 - xpra cython         :    14233MB/s
 - websockify via numpy:     4138MB/s
* slice size: 1024KB
 - xpra cython         :    11877MB/s
 - websockify via numpy:     4809MB/s
* slice size: 16384KB
 - xpra cython         :     3974MB/s
 - websockify via numpy:     1173MB/s
* slice size: 65536KB
 - xpra cython         :     2833MB/s
 - websockify via numpy:     1166MB/s

Note: when numpy gets vectorized optimizations ([https://lwn.net/Articles/691932/]), we can revisit this.

@totaam totaam closed this as completed Jan 18, 2019
@totaam
Copy link
Collaborator Author

totaam commented Jan 19, 2019

2019-01-19 07:45:33: antoine uploaded file cyxor-64-bit.patch (2.8 KiB)

64-bit is not faster!

@totaam
Copy link
Collaborator Author

totaam commented Jan 19, 2019

2019-01-19 07:45:48: antoine commented


Minor enhancements in r21394 + r21395 + r21396. (unit tests, etc)

Not sure why, but the 64-bit version attached is not faster..

@totaam
Copy link
Collaborator Author

totaam commented Jan 23, 2019

2019-01-23 07:07:56: antoine commented


Fix for building on 32-bit windows with python3: r21454.

@totaam
Copy link
Collaborator Author

totaam commented Jan 23, 2019

2019-01-23 17:16:25: antoine commented


r21465 adds support for websockify > 0.8 in the performance test (see #2104).
The results are the same or even lower than with websockify <= 0.8

See also #2121.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant