Dead loop of the kernel during Bluetooth Mesh pressure communication #12726
Comments
My version is:
|
I can't quite decipher your gdb logs. Seems like they're pointing at the scheduler? |
When I pause with gdb, it stops on SYS_DLIST_FOR_EACH_CONTAINER, and the next of the adv_thread list node points to itself, causing SYS_DLIST_FOR_EACH_CONTAINER to become a dead loop. This seems to happen when adv_thread calls k_sleep. |
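To illustrate the symptom described above: in a circular doubly linked list, a node whose `next` pointer refers to itself makes any "walk until we are back at the head" loop spin forever. The sketch below is a self-contained illustration only. `struct node`, `list_init`, `list_append`, and `list_walk_terminates` are hypothetical names, not Zephyr's `sys_dlist` API, and the step limit is an arbitrary debugging aid.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a circular doubly linked list node. */
struct node {
    struct node *next;
    struct node *prev;
};

/* An empty circular list is a head that points at itself. */
static void list_init(struct node *head)
{
    head->next = head;
    head->prev = head;
}

/* Append n at the tail of the list rooted at head. */
static void list_append(struct node *head, struct node *n)
{
    n->next = head;
    n->prev = head->prev;
    head->prev->next = n;
    head->prev = n;
}

/*
 * Walk forward from head. Returns true if the walk gets back to head
 * within max_steps, false if it appears to be stuck (for example on a
 * node whose next pointer refers to itself).
 */
static bool list_walk_terminates(const struct node *head, size_t max_steps)
{
    const struct node *n = head->next;

    for (size_t i = 0; i < max_steps; i++) {
        if (n == head) {
            return true;
        }
        n = n->next;
    }
    return false;
}
```

With a healthy list of two nodes the walk returns to the head after a few steps. If the first node's `next` is overwritten to point at itself, the bounded walk gives up and returns false, which is the state an unbounded iteration macro would loop on indefinitely.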
There shouldn't be anything wrong with a thread calling k_sleep, so seems like a possible scheduler bug (or something like that)? |
@andyross fyi |
That's a corrupt list for sure. There actually was a race that could cause this, partially fixed in commit e664c78 (which I think you have, if I found that HEAD correctly). But just Monday @dcpleung found another spot where we had introduced a recursive spinlock (which on UP systems results in the lock being released early) and could plausibly cause this. I should have a fix for that one (and some robustness changes to detect the case when CONFIG_ASSERT is active, though I don't know if that would have saved you or not) up in the next day or two. We can hope. |
Thanks, looking forward to the fix. |
I might also be running into this problem (under heavy Mesh load), but unfortunately I am not very familiar with gdb. I have also noticed next elements pointing to the referring element itself, but in my case in wait_q -> waitq. The symptoms are that my application freezes (i.e. it does not send / process advertisement messages) but it does not print a kernel panic or anything similar on the serial console. Is it possible that I am hit by the same bug? @xiaoliang314 does this sound familiar to you? I'm on ff88b7f. It happens on PCA10040 and also on PCA10059 (but I can't debug the latter due to the missing J-Link)
|
@pirast Yes, they may be caused by the same bug. |
Hi @xiaoliang314, thanks! I have seen some commits in master that may tackle the issue. However, I still seem to encounter it but am not able to reproduce it right now with a debugger attached (trying over the weekend). In case you have time: Could you cross-check? You seem to have a more appropriate test setup at hand. |
@pirast I modified the sample code so that it can be controlled by the host computer. I send the next packet as soon as the previous one has been sent successfully. My UART baud rate is 1M. |
I was able to reproduce it on latest master (c2d5e7b). This is the backtrace I could obtain with CONFIG_DEBUG=y. After increasing the log buffer, I also get an exception message: @andyross Is there any chance that you can come up with a solution? Is there any more information I can provide so that you can tackle this?
|
I don't see how the scheduling corruption is Bluetooth related, so I'm removing myself from the reviewers. At the same time I don't know who the right person would be, so this will be unassigned for a while. |
@andyross, assigning to you based on #12726 (comment): Please ask the reporter to re-verify once the fixes you are working on are merged |
Is this still reproducible? Note that there have been several fixes to scheduler and locking code since the HEAD referenced above (dff6b71 seems most likely) which could produce symptoms exactly like that. Would be good to check again. Or maybe it was a single spurious failure? Has it happened since? |
@andyross I will update to the latest version for testing. I have set aside at least 12 hours of testing time to see whether the problem still occurs. |
I have run into this same infinite loop recently. After doing some investigation, I believe in my case it is caused by k_spin_lock not masking all interrupts (Arch=STM32L1). It appears that both higher-priority interrupts (31 is what I tested with) and IRQ_ZERO_LATENCY interrupts do not get masked. That can cause issues if the ISR interrupts _add_timeout or _remove_timeout, and the ISR itself calls k_poll_signal_raise or other kernel functions which call _add_timeout or _remove_timeout. In my case, _add_timeout was being interrupted, other calls were being made to _add_timeout/_remove_timeout, and this ultimately ended up breaking the linked list. I'm not sure if this is a bug or not, but I think it might be worthwhile to have k_spin_lock assert if it catches itself being re-entrant. Or at least have the kernel timer locally record whether or not the list is locked and have it bugcheck if it finds that it has somehow become re-entrant - this was difficult to track down given how hard it is to reproduce. |
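The failure mode described in this comment can be simulated in plain C. The sketch below is not Zephyr's timeout code (the real timeout list is organised differently). It is a hypothetical doubly linked list whose insert is split into two halves, so that a simulated "ISR" can perform a complete insert at the point where an unmasked interrupt would land. All names (`struct tnode`, `insert_begin`, `insert_finish`, and so on) are assumptions made for this illustration.

```c
#include <stddef.h>

/* Hypothetical list node used only for this illustration. */
struct tnode {
    struct tnode *next;
    struct tnode *prev;
};

static void tlist_init(struct tnode *head)
{
    head->next = head;
    head->prev = head;
}

/* First half of inserting n right after pos: only n's own links are set. */
static void insert_begin(struct tnode *pos, struct tnode *n)
{
    n->next = pos->next;
    n->prev = pos;
}

/* Second half: make the neighbours point at n. */
static void insert_finish(struct tnode *pos, struct tnode *n)
{
    n->next->prev = n;
    pos->next = n;
}

/* A complete, uninterrupted insert (what the simulated ISR performs). */
static void tlist_insert_after(struct tnode *pos, struct tnode *n)
{
    insert_begin(pos, n);
    insert_finish(pos, n);
}

static void tlist_remove(struct tnode *n)
{
    n->prev->next = n->next;
    n->next->prev = n->prev;
}

/* Count nodes reachable by walking forward from head. */
static size_t tlist_count(const struct tnode *head)
{
    size_t count = 0;

    for (const struct tnode *n = head->next; n != head; n = n->next) {
        count++;
    }
    return count;
}
```

If `insert_begin(&head, &a)` runs, a simulated ISR then calls `tlist_insert_after(&head, &b)`, and `insert_finish(&head, &a)` completes afterwards, the list ends up containing only `a`: node `b` is silently dropped even though its own pointers still reference the list. A later `tlist_remove(&b)` then unlinks `a` as a side effect. Real corruption can take other shapes, including the self-referencing node reported earlier in the thread, depending on where the interruption lands.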
One other option may be to have a number of the kernel functions assert if they're ever called from an NMI. |
@bdrlamb Are you using the latest version? |
@bdrlamb: The rule is that _arch_irq_lock() (the underlying primitive used by the spinlocks) must mask any interrupt used to inspect or modify Zephyr kernel state. It is allowed for there to be some levels higher than that, but they have to be for driver-specific purposes and only touch things that Zephyr never looks at. Some of the radio code on nRF5x works this way, as far as I know. The ZERO_LATENCY feature is deliberately placed at that priority. And FWIW: when CONFIG_ASSERT is enabled, k_spin_lock() does indeed include a validation layer that will catch reentrant locking and releases of unlocked locks. |
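As a rough idea of what such a validation layer checks, here is a minimal sketch. It is not Zephyr's implementation: `struct checked_lock`, `checked_lock_acquire`, and `checked_lock_release` are hypothetical names, the functions report problems through a return value rather than asserting so that the behaviour is easy to observe, and the sketch assumes a single CPU, where the only way to see the flag already set is re-entry from the same context or from an interrupt that was not masked.

```c
#include <stdbool.h>

/* Hypothetical lock wrapper that records whether the lock is held. */
struct checked_lock {
    bool held;
};

/*
 * Returns false if the lock is already held, i.e. this acquisition is
 * re-entrant. A debug build could assert on that condition instead.
 */
static bool checked_lock_acquire(struct checked_lock *l)
{
    if (l->held) {
        return false;
    }
    l->held = true;
    return true;
}

/* Returns false if asked to release a lock that is not held. */
static bool checked_lock_release(struct checked_lock *l)
{
    if (!l->held) {
        return false;
    }
    l->held = false;
    return true;
}
```

A second `checked_lock_acquire()` on a held lock returns false, as does `checked_lock_release()` on a lock that is not held. These are the two conditions named in the comment above.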
I'm going to close this one. I'm really quite certain this is the issue resolved with commit dff6b71. Please reopen if the symptom reappears. |
@xiaoliang314 I wasn't using the absolute latest version, but it definitely included commit dff6b71. The issue was my fault (calling kernel functions in NMIs) and I have not encountered it since. |
Describe the bug
I created two Bluetooth Mesh nodes and put them in the same network. I use the serial port to control one of the nodes so that it sends packets to the other node. The next packet is sent immediately after a reply is received from the other node, or after a 5 second timeout. After about 800 iterations, the node stopped responding to my serial commands. I used gdb to check the current running status. The information is as follows.
To Reproduce
Steps to reproduce the behavior: