TCP SYN backlog change likely has concurrent global var access issues #729
Bisecting...
@pfl: FYI ^^
D'oh. So, by default there's 1 entry in the backlog table. When we send the SYN/ACK, we take that entry and set up a delayed worker timer on it. When we receive the ACK, we're done with that entry, but we can't remove it, because the worker timer is still sitting there counting down. At that point any new connection that arrives gets dropped, over and over, until the long ACK delay expires and the worker handler clears the entry. That's how the patch developed to allow multiple concurrent connections regressed multiple sequential connections. Using any other arbitrary size for the backlog queue isn't going to help, of course; the culprit is the usage of the worker API and its deficiency (#682) in Zephyr.
Actually, I guess there's confusion about how k_delayed_work_cancel() works. Let's consider this specific example, which uses an ACK timeout of 1000ms. That's a big timeout, and for all of those 1000ms the delayed work just counts down and can be perfectly well cancelled. Only at the end of the 1000ms does it get added to the scheduler queue, where it can't be cancelled. It spends maybe 100us there (well, depending on the well-behavedness of other delayed work handlers and perhaps other components of the system; e.g. #672 calls send_reset(), and I don't know how fast that may finish). So, the following patch works as expected:
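To make the cancellation-window logic concrete, here is a minimal sketch in plain C. It is not the actual Zephyr patch: fake_delayed_work_cancel() stands in for k_delayed_work_cancel() (which fails once the work has been queued to run), and the entry layout and function names are illustrative.

```c
#include <assert.h>
#include <stdbool.h>

struct backlog_entry {
	bool in_use;     /* slot holds a half-open connection */
	bool cancelled;  /* ACK arrived but the timer could not be cancelled */
};

/* Stand-in for k_delayed_work_cancel(): succeeds (returns 0) while the
 * work is still counting down its delay, fails once it has already been
 * queued for execution. */
static int fake_delayed_work_cancel(bool already_queued)
{
	return already_queued ? -1 : 0;
}

/* Called when the final ACK of the handshake arrives. */
void tcp_backlog_ack(struct backlog_entry *e, bool timer_already_queued)
{
	if (fake_delayed_work_cancel(timer_already_queued) == 0) {
		e->in_use = false;    /* timer cancelled: free the slot now */
	} else {
		e->cancelled = true;  /* handler will still run; let it free the slot */
	}
}

/* The delayed work handler: frees the slot whether or not it was flagged. */
void backlog_timeout_handler(struct backlog_entry *e)
{
	e->in_use = false;
	e->cancelled = false;
}
```

The point of the sketch is the asymmetry discussed above: the cancel path succeeds throughout the 1000ms countdown, and the "cancelled" flag only matters during the brief window after the work has been queued.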
As can be seen, I was skeptical about the need for a "cancelled" flag at all, given the window of 1000ms when the work can be cancelled vs. 0.1ms when it can't. But here's another issue: I'm not sure from which contexts this function (tcp_backlog_ack()) is called, so the conservative assumption is that it may be preempted by the work handler. But then tcp_backlog_ack() both reads and writes the global backlog table, which raises the concern of how to synchronize access to it.
I would appreciate any comments.
@pfl: Ping about the concerns with concurrent access to the backlog table.
The TCP backlog code will not have any issues with IRQs. It is part of the TCP stack, and thus shielded from others because it is surrounded by FIFOs, both from applications and from driver interrupts. So there should not be any synchronization issues, if that was the question. Now with #777 merged, there is no longer any need for the cancelled flag either. But I'm not sure I'm able to follow the discussion 100% in this issue anymore, as the Zephyr code has evolved since the issue was opened.
@pfl: Thanks for the response.
Well, the question is not just about IRQs, but about thread-safety in general.
Yeah, but your code, running inside the TCP stack, accesses the same global variable without any obvious attempt to synchronize this access. So this is a very generic concern, which applies to any multithreaded (in the sense of "multiple execution contexts") system. Let's have a look at https://en.wikipedia.org/wiki/Thread_safety#Implementation_approaches , second subheading there:
This is exactly our case - there's a global array, and we can't easily drop it.
So, Wikipedia (and the common knowledge of anyone who has worked with multithreading) says that the above measures should be used. Your code doesn't have them. That's the concern. If you say "there should not be any synchronization issues", then maybe that's so, and I should learn why the worst traits of preemptive multithreading, which apply to any such system out there, suddenly don't apply to Zephyr. Alternatively, there's still a chance to think about whether we have a problem here. Thanks.
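For reference, the mutual exclusion being asked for would look roughly like this: a lock held around every read-modify-write of the shared table. This is a minimal sketch using pthreads as a stand-in for Zephyr's k_mutex; backlog_alloc(), backlog_free(), and the table layout are illustrative names, not the actual Zephyr code.

```c
#include <assert.h>
#include <pthread.h>

#define BACKLOG_SIZE 4

struct backlog_entry {
	int in_use;
	unsigned int seq;
};

/* the shared global array from the discussion */
static struct backlog_entry backlog[BACKLOG_SIZE];
static pthread_mutex_t backlog_lock = PTHREAD_MUTEX_INITIALIZER;

/* Claim a free slot under the lock; returns the index or -1 if full. */
int backlog_alloc(unsigned int seq)
{
	int idx = -1;

	pthread_mutex_lock(&backlog_lock);
	for (int i = 0; i < BACKLOG_SIZE; i++) {
		if (!backlog[i].in_use) {
			backlog[i].in_use = 1;
			backlog[i].seq = seq;
			idx = i;
			break;
		}
	}
	pthread_mutex_unlock(&backlog_lock);
	return idx;
}

/* Release a slot under the same lock, so no read-modify-write can be torn
 * by a context switch between the "find free slot" scan and the update. */
void backlog_free(int idx)
{
	pthread_mutex_lock(&backlog_lock);
	backlog[idx].in_use = 0;
	pthread_mutex_unlock(&backlog_lock);
}
```

Whether this is actually needed depends on the preemption question discussed below; if only one execution context ever touches the table, the lock is redundant.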
AFAIK the IP and TCP stacks were not preemptible the last time I checked, a few years ago. The initial design was for access to the TCP/IP stack to go through FIFOs, and I think that still holds. I don't see any reason to have more than one thread handling TCP/IP packets; it will just use up more memory for minimal speed gains, unless numbers prove me wrong. I must assume that @jukkar knew what he was doing in the initial implementation. If the preemption assumption turns out to be wrong now, please open a new issue with @jukkar instead of beating on an already closed issue that fixed the usage of k_delayed_work_cancel(). And I do know what nice fireworks are to be seen when a global array is accessed while the thread is preempted...
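The design described here, serializing all stack access through FIFOs into a single handling thread so the global state has exactly one writer, can be sketched as below. This is an illustration of the idea in portable C with pthreads, not the actual Zephyr FIFO API; all names are made up for the example.

```c
#include <assert.h>
#include <pthread.h>

#define QLEN 8

struct pkt { int id; };

/* a bounded FIFO protected by its own lock; only the queue is shared */
static struct pkt queue[QLEN];
static int q_head, q_tail, q_count;
static pthread_mutex_t q_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t q_nonempty = PTHREAD_COND_INITIALIZER;

/* "global TCP state": touched only by the single worker thread, so it
 * needs no lock of its own */
static int packets_handled;

void fifo_put(struct pkt p)          /* callable from any context */
{
	pthread_mutex_lock(&q_lock);
	queue[q_tail] = p;
	q_tail = (q_tail + 1) % QLEN;
	q_count++;
	pthread_cond_signal(&q_nonempty);
	pthread_mutex_unlock(&q_lock);
}

struct pkt fifo_get(void)            /* called only by the worker */
{
	pthread_mutex_lock(&q_lock);
	while (q_count == 0)
		pthread_cond_wait(&q_nonempty, &q_lock);
	struct pkt p = queue[q_head];
	q_head = (q_head + 1) % QLEN;
	q_count--;
	pthread_mutex_unlock(&q_lock);
	return p;
}

void *tcp_worker(void *arg)          /* the one thread handling packets */
{
	int n = *(int *)arg;

	for (int i = 0; i < n; i++) {
		fifo_get();
		packets_handled++;   /* unsynchronized, but single-writer */
	}
	return NULL;
}
```

The synchronization lives entirely at the FIFO boundary; past it, the stack state is single-threaded by construction, which is the property the comment above relies on.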
@nashif: Currently there are known issues with TCP support in Zephyr, and this one is on my list for extra-detailed checking. (@pfl reviewed it and responded that everything should be OK, nor do I have any specific issues to point at, except for the lack of explicitly synchronized access.) This is low priority, assigned directly to me to hopefully avoid confusion.
If the IP stack is made re-entrant, this issue needs looking into. |
This is quite old and we have recently added locking to net_context access -> closing it. |
ab -n1000 http://192.0.2.1:8080/
run against http_server.py (see #728) of MicroPython built against tag v1.8.0
ab -n10 http://192.0.2.1:8080/
(note - 10 requests!) run against a uPy build against Zephyr master