request user data taking >10000ms when using websockets #439

Closed
atkinsj opened this issue Feb 10, 2017 · 18 comments

@atkinsj

atkinsj commented Feb 10, 2017

I'm not sure how to begin troubleshooting this but that seems inordinately high. Update users and update map all occur in <5ms but update user data ranges from 8,000 to 10,000ms. How do I go about troubleshooting this?

@hilocz

hilocz commented Feb 11, 2017

If you have a single-CPU system, then the WebSocket server is eating all your CPU cycles. It's a known problem.

@atkinsj
Author

atkinsj commented Feb 11, 2017

Heh, I'm running it on a quad core and it is indeed using 100% CPU. How is that possibly a thing?

@atkinsj
Author

atkinsj commented Feb 11, 2017

So after some research it appears that by default Ratchet falls back to a loop (i.e., while(true)) if libevent is unavailable. Libevent does not yet exist for PHP7.0, thus Ratchet is always going to use 100% CPU until that changes.

There is a fork of php-libevent at https://github.com/expressif/pecl-event-libevent but it generates a coredump for me on the pathfinder_websocket cmd.php.

I'm not sure of a good solution for this going forward.

There's another option that also appears not to work: https://bitbucket.org/osmanov/pecl-event/src/41be9821b69ce996d59b750ecdbd9c07dffe192b/INSTALL.md?fileviewer=file-view-default
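The difference between a fallback loop that spins and a loop that sleeps in the kernel until data arrives can be illustrated with a small sketch. This is not Ratchet's or React's actual code; it is a generic Python illustration, and the helper names (`busy_wait`, `blocking_wait`, `run`) are made up for the example. The spinning variant checks the socket as fast as it can and keeps a core busy, while the `select()` variant only wakes up once there is something to read.

```python
import select
import socket
import threading
import time


def busy_wait(sock, timeout=2.0):
    """Poll a non-blocking socket in a tight loop. Returns (data, spins)."""
    sock.setblocking(False)
    deadline = time.monotonic() + timeout
    spins = 0
    while time.monotonic() < deadline:
        spins += 1
        try:
            return sock.recv(64), spins
        except BlockingIOError:
            continue  # nothing to read yet, so try again immediately
    return b"", spins


def blocking_wait(sock, timeout=2.0):
    """Sleep inside select() until the socket is readable. Returns (data, wakeups)."""
    readable, _, _ = select.select([sock], [], [], timeout)
    data = sock.recv(64) if readable else b""
    return data, 1


def run(wait_fn, delay=0.2):
    """Send one message after `delay` seconds and wait for it with wait_fn."""
    reader, writer = socket.socketpair()
    timer = threading.Timer(delay, writer.send, args=(b"ping",))
    timer.start()
    try:
        return wait_fn(reader)
    finally:
        timer.join()
        reader.close()
        writer.close()


if __name__ == "__main__":
    data, spins = run(busy_wait)
    print("busy loop:", data, "after", spins, "iterations")
    data, wakeups = run(blocking_wait)
    print("select():", data, "after", wakeups, "wakeup")
```

Both variants receive the same message; the busy loop typically runs many thousands of iterations while waiting, which is the kind of behaviour that shows up as 100% CPU on one core.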

@WildStrawberryEVE

root@ap0:/var/log/pathfinder# sudo -u www-data /usr/bin/php /var/www/pathfinder_websocket/cmd.php
2017-02-11 11:54:12 Server START ------------------------------------------
Segmentation fault (core dumped)
Feb 11 11:54:12 ap0 kernel: [1888276.414589] traps: php[24316] general protection ip:7ff4c9cde5d5 sp:7fff40eb01d0 error:0 in libevent.so[7ff4c9cda000+7000]

Got the same results :<
Osmanov's implementation fails a lot of tests on my system.

@atkinsj
Author

atkinsj commented Feb 11, 2017

Yup. I'm not sure whether this is a problem in Ratchet, react/event, pathfinder_websocket or the guy's libevent fork. I'll poke around with strace and see what I can hunt down tomorrow.

@exodus4d
Owner

@atkinsj "update client map" and "update client user data" are client-side timings. They are just the time your web browser takes to update/sync your map with updated data.

  • UPDATE_CLIENT_MAP -> time (ms) from when your browser receives new map data by Ajax or WebSocket until it has finished updating the DOM
  • UPDATE_CLIENT_USER_DATA -> time (ms) from when your browser receives new userData (current location/system for all active characters on your current map) until it has finished updating the DOM

The other 2 charts are Ajax timings. They show how long your server takes to respond:

  • UPDATE_SERVER_MAP -> time (ms) from sending your updated map data (Ajax) to the server (e.g. when you move a system around) until the server has finished all the DB work and sent the data back to you (and to other characters, when WebSockets are in use). Your browser then takes the data and updates your map (see UPDATE_CLIENT_MAP).
  • UPDATE_SERVER_USER_DATA -> time (ms) between sending a "ping" (Ajax) to your server and receiving its response. The server receives that ping and queries CCP's API for location updates for your character. (This is not affected by the WebSocket configuration and is sent every 5s.) The response to that Ajax call includes userData (current location) for all active characters on your map. Your browser then takes the data and updates your map (see UPDATE_CLIENT_USER_DATA).
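The split between the server-side and client-side timings above can be sketched as two separately measured spans. Pathfinder's real client is browser JavaScript, so the following Python is only an illustration of the idea; `TimingLog`, `fake_request` and `fake_render` are hypothetical names, and the sleeps stand in for the Ajax round trip and the DOM update.

```python
import time
from contextlib import contextmanager


class TimingLog:
    """Collects elapsed times in milliseconds under named keys."""

    def __init__(self):
        self.samples = {}

    @contextmanager
    def measure(self, key):
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000.0
            self.samples.setdefault(key, []).append(elapsed_ms)

    def last(self, key):
        return self.samples[key][-1]


def fake_request():
    time.sleep(0.05)  # stand-in for the Ajax round trip to the server


def fake_render():
    time.sleep(0.01)  # stand-in for updating the page with the new data


if __name__ == "__main__":
    log = TimingLog()
    with log.measure("UPDATE_SERVER_USER_DATA"):
        fake_request()
    with log.measure("UPDATE_CLIENT_USER_DATA"):
        fake_render()
    for key, values in log.samples.items():
        print(key, round(values[-1], 1), "ms")
```

Measured this way, a large UPDATE_CLIENT_USER_DATA value points at work done in the browser after the data has arrived, not at the server's response time.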

@atkinsj
Author

atkinsj commented Feb 12, 2017

@exodus4d Thanks for the explanation on the variables.

The latest version with the recommended configuration is pretty much unusable. I'm running this on a 4 core 2.4GHz box with 4GB of ram. The Websocket server is using 100% CPU from the bug we mentioned above (lack of libevent support in php7.0) and redis-server is using 33-35%. The php-fpm process is also using a lot of CPU which makes me think that route calculations are not being cached but I haven't had a chance to confirm this.

This is using the latest development builds of everything on an Ubuntu 16.04 box. Setting CACHE to folder= and disabling the websocket server brings Pathfinder back to being much more responsive overall and stops smashing the CPU.

@InternetPseudonym

InternetPseudonym commented Apr 13, 2017

Just throwing a useless opinion around: never use any library or application which has while(true) in it. That's (very) bad programming, a beginner's mistake and a frequent source of problems. Use atomics, volatile or something similar in your loop header and you're good to go; at least your app or library will not deadlock anymore as long as your variable is being updated from somewhere else. Just saiyan.

@hilocz

hilocz commented Apr 13, 2017

@danielmack we all know that, it would be fine if you can fix it ;)

@atkinsj
Author

atkinsj commented Apr 22, 2017

The libevent library only has a while(true) as a fallback on systems where event polling isn't available.

Unfortunately PHP7 has no event polling framework available to it, putting us in this situation. That's one of the issues with using the bleeding edge.

@atkinsj
Copy link
Author

atkinsj commented Apr 24, 2017

@danielmack It does make sense and this isn't an issue with Pathfinder. Pathfinder uses Ratchet, a PHP library to provide WebSockets. That library uses pecl-event-libevent which implements a variety of different polling and loop mechanisms (including inotify, libevent and a fallback to a while(true) when no system libraries are available). Your busy waiting Wikipedia link is completely irrelevant FWIW.

There is a fix to make pecl-event-libevent compatible with PHP7.0 that has not yet been merged at https://github.com/expressif/pecl-event-libevent. I am currently using that fork with Ratchet + Pathfinder_Websocket and it's behaving /kind of okay/.

If you have issues with the implementation, take it up with the Ratchet and React PHP projects. Relevant issues are ratchetphp/Ratchet#357 and reactphp/event-loop#40.
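The idea of preferring a kernel-backed polling mechanism and only falling back to something simpler when nothing better is available can be sketched with Python's standard `selectors` module. This is an analogy under that assumption, not the code used by Ratchet, React or pecl-event-libevent; the `PREFERRED` order and the `pick_selector_class` helper are made up for the example.

```python
import selectors

# Most capable backends first; which ones exist depends on the platform.
PREFERRED = (
    "EpollSelector",    # Linux
    "KqueueSelector",   # BSD, macOS
    "DevpollSelector",  # Solaris
    "PollSelector",
    "SelectSelector",   # always defined by the selectors module
)


def pick_selector_class(available=selectors):
    """Return the first backend class in PREFERRED that `available` provides."""
    for name in PREFERRED:
        cls = getattr(available, name, None)
        if cls is not None:
            return cls
    raise RuntimeError("no selector backend available")


if __name__ == "__main__":
    cls = pick_selector_class()
    print("using", cls.__name__)
    selector = cls()
    selector.close()
```

The standard library already exposes this choice as `selectors.DefaultSelector`; the explicit loop is only there to make the fallback order visible.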

@InternetPseudonym

InternetPseudonym commented Apr 24, 2017

That's like saying it makes sense to attack a wild bear with a dull stick because your favourite supermarket currently offers no lemonade, and that informational hyperlinks to assault rifles are completely irrelevant.

@atkinsj
Author

atkinsj commented Apr 24, 2017

You're welcome to submit a pull request to ratchetphp and reactphp with your no-system-library-required perfect event loop. Maybe leave this issue alone for now, though.

@InternetPseudonym

I'll probably do that sometime ... and it will get rejected because hobbyists don't like software dev reality; they make up their own.

@hilocz

hilocz commented Apr 24, 2017

@danielmack there is nothing stopping you from creating a special fork, like Daniel's Ratchet, for the PF community.

@atkinsj
Author

atkinsj commented May 1, 2017

Ignore the comment I just deleted; I thought this had been fixed in upstream but the second a client connects it spikes to 100% CPU. It's unusable right now -- I have hundreds of users and even though I don't mind sacrificing a core to get WebSockets, it's not able to keep up with the users.

This would be a really cool feature if we could figure it out.

@InternetPseudonym

Well, it'd be more of a bugfix, really ... I can confirm this: my instance is permanently pegging one core with a running websocket script.

@Tupsi
Contributor

Tupsi commented Jun 1, 2017

I ended up removing WebSockets from our PF instance as well last week, because users were complaining about bad behavior. What is strange, though, is that the complaints only started after I moved to the master 1.2 release. It was running just fine, with one core hogged at 100%, on an earlier 1.2.x dev version, but it may well be that we just reached some critical user level that made it unusable.
