Falco with eBPF probe locks 4.19 kernel mainline when running lsiutil #896
Comments
I tested this on the latest kernels today. It appears that 4.19 is the only branch that hangs. This was tested with falco version 0.18.0.
stable 5.3.8: works |
To help with reproduction, I ported the Vagrantfile to use virtualbox and have more detailed steps:
# host commands
cd
mkdir reproduce896
cd reproduce896
vi Vagrantfile #copy contents from below
vagrant up
# wait for ~20 minutes for kernel to compile
vagrant ssh
#guest commands
sudo su -
# start a tmux session with two windows, or have 2 ssh sessions up
/start.sh
# wait for falco to download the image, and to boot
/hang.sh
# see lsiutil fail at wait4, and eventually (~10s), the kernel will begin to softlock
Vagrantfile
|
Thanks for the update @a0145 - I will try this |
I had some spare time to try this out on the latest 4.19 kernel (4.19.98), and it still hangs with the bpf sysdig probe. I don't think this is related to the kernel lockdown feature - the bpf program is able to load and function, however it hangs at the wait4 syscall |
I'm looking at this again, here are my repro steps using the Vagrant file from @a0145 but using driverkit to build the BPF probe:
Prepare the driverkit file
cd /root
cat <<EOT >> vanilla.yaml
kernelrelease: 4.19.82
kernelversion: 1
target: vanilla
driverversion: be1ea2d9482d0e6e2cb14a0fd7e08cbecf517f94
output:
  probe: /build/falco-probe.o
EOT
Add the kernel config data to it
cat /boot/config-4.19.82 | base64 -w0 | awk '{print "kernelconfigdata: " $1;}' >> vanilla.yaml
Build the probe
docker run -v /root/vanilla.yaml:/vanilla.yaml -v /build:/build -it -v /var/run/docker.sock:/var/run/docker.sock falcosecurity/driverkit:latest driverkit -c /vanilla.yaml docker
Start Falco
setenforce 0
docker run --rm -i -t -v /build:/build --net=host --privileged -e FALCO_BPF_PROBE="/build/falco-probe.o" -v /root:/root -v /var/run/docker.sock:/host/var/run/docker.sock -v /dev:/host/dev -v /proc:/host/proc:ro -v /boot:/host/boot:ro -v /lib/modules:/host/lib/modules:ro -v /usr:/host/usr:ro -v /etc:/host/etc falcosecurity/falco |
@a0145 I created a small C program to do a raw wait4 syscall, and with your Vagrantfile (it uses 4.19.82) I'm not able to reproduce the soft lockup.
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void) {
    pid_t pid = fork();
    if (pid < 0) {
        perror("fork");
        return 1;
    }
    if (pid == 0) {
        /* Child: spin forever so the parent stays blocked in wait4. */
        for (;;) {
        }
    }
    /* Parent: issue the raw wait4 syscall on the child. */
    syscall(SYS_wait4, pid, NULL, 0, NULL);
    return 0;
}
Here is my output of
At this point I think we could have two roads here:
|
Ok, suddenly it happens to me too after leaving it there for a while: Look at CPU 5:
|
While it was happening I was compiling the newer kernel version. |
This seems to be related to event frequency: the higher the frequency, the higher the probability that this crashes the kernel. |
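For anyone who wants to raise the event rate while testing, here is a minimal sketch. It is not the method used in this thread: `spawn_burst` is a hypothetical helper, and it assumes that the shell reaping a child ends up in a wait-family syscall on your system. It only generates process churn; run it on a disposable VM if the probe is loaded.

```shell
# Hypothetical helper (not from the thread): spawn N short-lived children
# one after another, reap each one, and print how many were reaped.
# Assumption: every fork/exit/reap cycle produces several syscall events
# that the probe has to handle, which raises the overall event rate.
spawn_burst() {
    burst_n="$1"
    burst_done=0
    while [ "$burst_done" -lt "$burst_n" ]; do
        true &          # fork a child that exits immediately
        wait "$!"       # reap that child before starting the next one
        burst_done=$((burst_done + 1))
    done
    echo "$burst_done"
}
```

Usage could look like `spawn_burst 10000` in one terminal while Falco runs in the other.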
It seems to happen regardless of the syscall; here I'm getting it with a
|
Hey @fntlnz - thanks so much for looking into this. I was able to reproduce in multiple 4.19 LTS patch releases, and on VMs and bare metal with the same results, and my digging into the issue led me to believe it’s a problem with the BPF facility in 4.19. In 5.4 LTS I’m unable to reproduce the lockup however so that led me to suspect the JIT in 4.19 due to the changes between 5.4 and 4.19. |
I will try to debug more to discover where exactly this is happening but everything 100% points to a kernel bug. Thanks for making us aware @a0145 |
Here's a full stack trace I've been able to extract with @gnosek and @nathan-b
|
Just did another experiment with @leodido and this also happens in kernels that are compiled without eBPF JIT support. |
Stack trace without JIT
|
Another try in this direction (to rule out the JIT as the cause) would be to enable the JIT in debugging mode ( And using
Wait, it only seems to work for BPF, not eBPF. Never mind. |
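To check whether a given kernel was built with the eBPF JIT, and what the runtime setting is, something like the following sketch may help. The function names are hypothetical. `CONFIG_BPF_JIT` and `/proc/sys/net/core/bpf_jit_enable` are the standard kernel option and sysctl file, but treat the path handling here as an assumption about your system layout.

```shell
# Hypothetical helpers (names are not from the thread).

# Succeeds if the given kernel config file enables the eBPF JIT compiler.
config_has_bpf_jit() {
    grep -q '^CONFIG_BPF_JIT=y' "$1"
}

# Prints the runtime JIT setting (0 = off, 1 = on, 2 = on with debug output),
# or "unknown" if the sysctl file cannot be read, for example on a kernel
# built without JIT support. The optional argument overrides the path.
bpf_jit_runtime_state() {
    jit_file="${1:-/proc/sys/net/core/bpf_jit_enable}"
    if [ -r "$jit_file" ]; then
        cat "$jit_file"
    else
        echo unknown
    fi
}
```

For the kernel used in the repro steps this would be `config_has_bpf_jit /boot/config-4.19.82 && echo "JIT built in"`, followed by `bpf_jit_runtime_state`.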
I am also getting a soft lockup: Running kernel |
Bump version of the driver to (commit: cd3d10123eef161d9f4e237581c1056fca29c130) that fixes #896 Summary of the needed fix can be found at patch [0] [0] https://patch-diff.githubusercontent.com/raw/draios/sysdig/pull/1612.patch Co-Authored-By: Leonardo Di Donato <[email protected]> Signed-off-by: Lorenzo Fontana <[email protected]>
This bug is also present in CentOS 8.1 (4.18.0-147.8.1). This fix appears to work for that as well...running overnight on a k8s 1.18.0 node to verify the driver is working properly. |
@smijolovic thanks for reporting it. Could you provide a log for that kernel version? I’m curious :) |
Interesting! @smijolovic I didn’t expect this to be on a 4.18 - probably some backport on the kernel. For the log @leodido is asking for, please do a |
Fix ran overnight and no issues so far. Seeing a few "Falco internal: syscall event drop. 1 system calls dropped in last second" and looking to tweak the syscall_event_drops rate. I also built a systemd service file for falco vs an init script for service control.
As requested:
[42328.030993] watchdog: BUG: soft lockup - CPU#31 stuck for 23s! [sh:6839]
a bit further down the stack there's the bad RIP values
[42507.970565] watchdog: BUG: soft lockup - CPU#16 stuck for 22s! [migration/16:108] |
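When collecting logs like the ones above, a small filter can pull out just the soft-lockup reports. This is a sketch: `soft_lockups` is a hypothetical helper, and it assumes the watchdog message format matches the two lines quoted in this comment.

```shell
# Hypothetical helper: read kernel log text on stdin and print
# "<cpu> <seconds> <task:pid>" for every soft-lockup report.
# Assumes the message format quoted above:
#   watchdog: BUG: soft lockup - CPU#31 stuck for 23s! [sh:6839]
soft_lockups() {
    sed -n 's/.*BUG: soft lockup - CPU#\([0-9][0-9]*\) stuck for \([0-9][0-9]*\)s! \[\([^]]*\)\].*/\1 \2 \3/p'
}
```

Typical use would be `dmesg | soft_lockups`, which makes it easier to see whether the same CPU or task keeps showing up.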
This looks like a second issue identified in the 4.18.0-147 kernel. These backports can be tricky because they pull in major bug and security fixes from mainline branches...and can be a real pain to pinpoint the inclusions. While backports are great for uniformity...this is where they hurt. This is the one you pinged me on yesterday: |
No longer an issue on 0.22.1 |
Thanks for confirming :)
/close
|
@smijolovic Same problem here. I am using the libbpf library to write an eBPF kernel program, which uses bpf_probe_read_user_str() like this:
I don't understand how to resolve it using your patch; could you please give some details? |
Hi @opsnull! This issue is focused specifically on getting the open-source project Falco working with eBPF. If you are having problems with your own eBPF program then I suggest the Linux kernel's eBPF mailing list. It's possible that the pointers in this ticket will help you figure out if you are being hit by the same kernel backport issue, but we can't really help you debug your own eBPF program. Best of luck! |
What happened:
While running the stock config of falco on kernel 4.19.80 (longterm), falco operates normally and alerts as expected. Then, after running the binary lsiutil, the system hangs. This behavior only happens after engaging falco + bpf probe.
Falco running properly:
Invocation of lsiutil after bpf instrumentation:
What you expected to happen:
Falco to run with the bpf probe without soft-locking the system after a copy of lsiutil is launched. See output from above with kernel:watchdog: BUG
Clean output of lsiutil without ebpf sysdig falco running:
How to reproduce it (as minimally and precisely as possible):
Manual Steps:
lsiutil binary
strace ./lsiutil, and see a hang. The system will begin soft-locking and panicking shortly after:
Example Vagrant File:
Anything else we need to know?:
When using mainline kernel 5.3.7, using the bpf probe, falco and the sysdig probe function correctly.
The hang was tested in 4.19.61, 4.19.80.
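A quick way to check whether a machine runs one of the kernel series reported here, as a sketch only: the helper names are hypothetical, and the mapping reflects nothing more than the reports in this issue (4.19.x hangs, 5.3.7 works). Distribution kernels with backports, such as the CentOS 8.1 4.18.0-147 kernel mentioned in the comments, are not covered by this check.

```shell
# Hypothetical helpers (not part of Falco).

# "4.19.82" -> "4.19", "4.18.0-147.8.1" -> "4.18"
kernel_series() {
    printf '%s\n' "$1" | cut -d. -f1,2
}

# Succeeds if the release belongs to a series reported to hang in this issue.
# Assumption: only mainline 4.19.x is listed; anything else returns non-zero,
# which does not prove that a kernel is safe.
reported_to_hang() {
    case "$(kernel_series "$1")" in
        4.19) return 0 ;;
        *)    return 1 ;;
    esac
}
```

On the affected host this could be run as `reported_to_hang "$(uname -r)" && echo "kernel series reported to hang with the bpf probe"`.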
Environment:
Falco version (use falco --version):
Falco version: 0.17.1
System info
Cloud provider or hardware configuration:
centos7/vmware
OS (e.g: cat /etc/os-release):
uname -a):
Not a Contribution