Bug #623
closedust crash, bus error
100%
Description
Commit id:
UST: 0ff43c44ea52d0d833ea17f15777622ec64d1b1a
tools: 95f3c88c5b889bb3d16bb6857f137c84e06b84c2
urcu: b2f20dff3e0ce8db5f4f0067e2d8810420f16346
Problem:
A program linked with lttng-ust crashes at startup. The crash is hard to reproduce, happens very seldom.
Sessiond is started and configured, tracing started.
See backtrace below,
Program terminated with signal 7, Bus error. #0 0x0fd1f300 in wait_for_sessiond (sock_info=<optimized out>) at lttng-ust-comm.c:875 875 if (uatomic_read((int32_t *) sock_info->wait_shm_mmap) == 0) { (gdb) bt full #0 0x0fd1f300 in wait_for_sessiond (sock_info=<optimized out>) at lttng-ust-comm.c:875 ret = <optimized out> #1 ust_listener_thread (arg=0xfd50550 <local_apps>) at lttng-ust-comm.c:923 sock_info = 0xfd50550 <local_apps> sock = <optimized out> ret = <optimized out> prev_connect_failed = <optimized out> has_waited = 0 i = <optimized out> timeout = 822235928 __func__ = "ust_listener_thread" __PRETTY_FUNCTION__ = "ust_listener_thread" #2 0x0ffac84c in ?? () from /lib/libpthread.so.0 No symbol table info available. #3 0x0fe4e6f4 in clone () from /lib/libc.so.6
Reproducable:
1%
Files
Updated by Mathieu Desnoyers over 11 years ago
Hi Stanislav,
Can you compile lttng-ust with:
make CFLAGS=-g
To disable optimisations and add debug info, and give us the backtrace ?
Thanks,
Mathieu
Updated by Mathieu Desnoyers over 11 years ago
Also, can you try starting your first lttng-ust application before lttng-sessiond is started on your machine, and then also test starting the first lttng-ust app after lttng-sessiond and see how this impacts the issue.
It looks like there is some problem accessing the shared memory map used for synchronization at application start time between lttng-ust and lttng-sessiond.
Thanks,
Mathieu
Updated by Mathieu Desnoyers over 11 years ago
Also, can you tell us what is your Linux kernel version exactly ?
Thanks,
Mathieu
Updated by Mathieu Desnoyers over 11 years ago
Oh, another thing to check:
Where does your /dev/shm point to ? Here it is:
lrwxrwxrwx 1 root root 8 Aug 29 09:56 shm -> /run/shm
and a quick look at mount:
tmpfs on /run/shm type tmpfs (rw,nosuid,nodev,noexec,relatime,size=1578140k)
So basically, my system config limits posix shared memory to 1.5GB. So with per-uid buffers (the default since 2.3), you could reach this situation if you have very, very large buffers. What tracing configuration (buffer size, number of channels, per-pid or per-uid channels, number of users doing tracing, number of CPUs) do you have prior to the error ?
As far as I know, the operating system will trigger a sigbus if you try to access a posix shared memory for which it cannot allocate the backing pages in the page fault handler.
Thoughts ?
Thanks,
Mathieu
Updated by Mathieu Desnoyers about 11 years ago
- Status changed from New to Feedback
Updated by Stanislav Vovk about 11 years ago
The crash was not seen since we switched to debug libs.
Adding info about other questions.
Linux kernel version,
root@du1:~# uname -a Linux du1 3.4.34-rt40+ #e8430f3 SMP PREEMPT RT Tue, 24 Sep 2013 14:05:53 +0200 ppc GNU/Linux
/dev/shm is not a symlink
root@du1:~# ls -la /dev/shm total 8 drwxrwxrwt 2 root root 80 Sep 27 08:25 . drwxr-xr-x 12 root root 3540 Sep 27 08:25 .. -rw-rw-r-- 1 root root 4096 Sep 27 08:25 lttng-ust-wait-5 -rw-r----- 1 root root 4096 Sep 27 08:25 lttng-ust-wait-5-0
mount gives
proc on /proc type proc (rw,relatime) sysfs on /sys type sysfs (rw,relatime) debugfs on /sys/kernel/debug type debugfs (rw,relatime) none on /dev type tmpfs (rw,relatime,mode=755) devpts on /dev/pts type devpts (rw,relatime,gid=5,mode=620) tmpfs on /var/volatile type tmpfs (rw,relatime)
df -h
tmpfs_for_tmp 694M 76K 694M 1% /var/volatile/tmp none 694M 5.3M 689M 1% /dev tmpfs 694M 76K 694M 1% /var/volatile
Prior to the error there were just two programs linked with ust and configuration was the default one, one per-uid channel, default buffer size, 4 cores MP.
kernels shm config,
root@du1:~# ipcs -lm ------ Shared Memory Limits -------- max number of segments = 4096 max seg size (kbytes) = 32768 max total shared memory (kbytes) = 8388608 min seg size (bytes) = 1
Updated by Stanislav Vovk about 11 years ago
Got the crash back during the weekeend. See backtrace below with debug libs.
Commits used for the debug libs, build non-optimized:
ust: 07d2f704d61649a36bbe0dd71d48ea811d301f9d
tools: 3e618c7aca65a13d290c0eace8136ef5c468cace
urcu: b2f20dff3e0ce8db5f4f0067e2d8810420f16346
FYI, at system shutdown the lttng sessions were not stopped and destroyed.
This time different programs(2) crashed then earlier in the same system boot.
Program terminated with signal 7, Bus error. #0 0x0f962ff8 in wait_for_sessiond (sock_info=0xf9c667c <local_apps>) at lttng-ust-comm.c:875 875 lttng-ust-comm.c: No such file or directory. (gdb) bt full #0 0x0f962ff8 in wait_for_sessiond (sock_info=0xf9c667c <local_apps>) at lttng-ust-comm.c:875 ret = 0 __func__ = "wait_for_sessiond" #1 0x0f963308 in ust_listener_thread (arg=0xf9c667c <local_apps>) at lttng-ust-comm.c:923 sock_info = 0xf9c667c <local_apps> sock = 0 ret = 0 prev_connect_failed = 1 has_waited = 0 open_sock = {0, 0} i = 0 timeout = 0 __func__ = "ust_listener_thread" __PRETTY_FUNCTION__ = "ust_listener_thread" #2 0x0ffa484c in ?? () from /lib/libpthread.so.0 No symbol table info available. #3 0x0fea26f4 in clone () from /lib/libc.so.6 No symbol table info available. (gdb) q
Correcting my previous statement about configuration. There is two snapshot sessions created at system boot and configured with default channels, one channel in each session.
The programs that crashed this time were not instrumented with lttng tracepoints but loaded libraries which were instrumented.
Updated by Mathieu Desnoyers about 11 years ago
- Status changed from Feedback to Confirmed
- Assignee set to Mathieu Desnoyers
- Priority changed from Normal to High
OK, I'll come up with a fix shortly. Thanks for narrowing this one down.
Updated by Stanislav Vovk about 11 years ago
Hi,
Any updates/news on this issue?
Thanks,
Stanislav
Updated by Mathieu Desnoyers about 11 years ago
Hi Stan,
I've prepared two patches to fix this, one for lttng-ust, one for lttng-tools. I'm mostly done with the testing. I'll push them here so you can let me know if it fixes your problem before pushing them upstream.
Thanks,
Mathieu
Updated by Mathieu Desnoyers about 11 years ago
- File fix-sigbus-race.patch fix-sigbus-race.patch added
I've been able to reproduce the SIGBUS by removing the shm ftruncate() (artificially) in lttng-ust, which confirms the cause I identified. It's a race between application startup and sessiond startup, where there is an intermediate state where applications can SIGBUS if they see a zero-sized shm (it's been created, but not ftruncated yet).
I've pushed the following UST commit in stable-2.3:
commit afd1b711a8e494087f108550675698479fc457f4 Author: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Date: Wed Nov 6 07:03:10 2013 -0500 Fix: application SIGBUS when starting in parallel with sessiond There is a race between application startup and sessiond startup, where there is an intermediate state where applications can SIGBUS if they see a zero-sized shm, if the shm has been created, but not ftruncated yet. On the UST side, fix this by ensuring that UST can read the shared memory file descriptor with a read() system call before they try accessing it through a memory map (which triggers the SIGBUS if the access goes beyond the file size). On the sessiond side, another commit needs to ensure that the shared memory is writeable by applications as long as its size is 0, which allow applications to perform ftruncate and extend its size. Fixes #623 Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
I also have a patch for lttng-tools, which I am attaching here. Both patches are needed. I'll post the lttng-tools fix for inclusion.
Thanks,
Mathieu
Updated by Mathieu Desnoyers about 11 years ago
- Status changed from Confirmed to Resolved
- % Done changed from 0 to 100
Applied in changeset afd1b711a8e494087f108550675698479fc457f4.
Updated by Mathieu Desnoyers about 11 years ago
lttng-tools fix:
commit 7972d61991f4c7e25772502a3a7ebfe57a57645d Author: David Goulet <dgoulet@efficios.com> Date: Wed Nov 6 10:49:43 2013 -0500 Fix: application SIGBUS when starting in parallel with sessiond There is a race between application startup and sessiond startup, where there is an intermediate state where applications can SIGBUS if they see a zero-sized shm, if the shm has been created, but not ftruncated yet. On the sessiond side, we need to ensure that the shared memory is writeable by applications as long as its size is 0, which allow applications to perform ftruncate and extend its size. On the UST side, another commit needs to ensure that UST can read the shared memory file descriptor with a read() system call before they try accessing it through a memory map (which triggers the SIGBUS if the access goes beyond the file size) Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: David Goulet <dgoulet@efficios.com>