Project

General

Profile

Actions

Bug #623

closed

ust crash, bus error

Added by Stanislav Vovk over 11 years ago. Updated about 11 years ago.

Status:
Resolved
Priority:
High
Target version:
-
Start date:
08/29/2013
Due date:
% Done:

100%

Estimated time:

Description

Commit id:
UST: 0ff43c44ea52d0d833ea17f15777622ec64d1b1a
tools: 95f3c88c5b889bb3d16bb6857f137c84e06b84c2
urcu: b2f20dff3e0ce8db5f4f0067e2d8810420f16346

Problem:
A program linked with lttng-ust crashes at startup. The crash is hard to reproduce, happens very seldom.
Sessiond is started and configured, tracing started.

See backtrace below,

Program terminated with signal 7, Bus error.
#0  0x0fd1f300 in wait_for_sessiond (sock_info=<optimized out>) at lttng-ust-comm.c:875
875        if (uatomic_read((int32_t *) sock_info->wait_shm_mmap) == 0) {
(gdb) bt full
#0  0x0fd1f300 in wait_for_sessiond (sock_info=<optimized out>) at lttng-ust-comm.c:875
        ret = <optimized out>
#1  ust_listener_thread (arg=0xfd50550 <local_apps>) at lttng-ust-comm.c:923
        sock_info = 0xfd50550 <local_apps>
        sock = <optimized out>
        ret = <optimized out>
        prev_connect_failed = <optimized out>
        has_waited = 0
        i = <optimized out>
        timeout = 822235928
        __func__ = "ust_listener_thread" 
        __PRETTY_FUNCTION__ = "ust_listener_thread" 
#2  0x0ffac84c in ?? () from /lib/libpthread.so.0
No symbol table info available.
#3  0x0fe4e6f4 in clone () from /lib/libc.so.6

Reproducable:
1%


Files

fix-sigbus-race.patch (3.28 KB) fix-sigbus-race.patch Mathieu Desnoyers, 11/06/2013 07:16 AM
Actions #1

Updated by Mathieu Desnoyers over 11 years ago

Hi Stanislav,

Can you compile lttng-ust with:

make CFLAGS=-g

To disable optimisations and add debug info, and give us the backtrace ?

Thanks,

Mathieu

Actions #2

Updated by Mathieu Desnoyers over 11 years ago

Also, can you try starting your first lttng-ust application before lttng-sessiond is started on your machine, and then also test starting the first lttng-ust app after lttng-sessiond and see how this impacts the issue.

It looks like there is some problem accessing the shared memory map used for synchronization at application start time between lttng-ust and lttng-sessiond.

Thanks,

Mathieu

Actions #3

Updated by Mathieu Desnoyers over 11 years ago

Also, can you tell us what is your Linux kernel version exactly ?

Thanks,

Mathieu

Actions #4

Updated by Mathieu Desnoyers over 11 years ago

Oh, another thing to check:

Where does your /dev/shm point to ? Here it is:

lrwxrwxrwx 1 root root 8 Aug 29 09:56 shm -> /run/shm

and a quick look at mount:

tmpfs on /run/shm type tmpfs (rw,nosuid,nodev,noexec,relatime,size=1578140k)

So basically, my system config limits posix shared memory to 1.5GB. So with per-uid buffers (the default since 2.3), you could reach this situation if you have very, very large buffers. What tracing configuration (buffer size, number of channels, per-pid or per-uid channels, number of users doing tracing, number of CPUs) do you have prior to the error ?

As far as I know, the operating system will trigger a sigbus if you try to access a posix shared memory for which it cannot allocate the backing pages in the page fault handler.

Thoughts ?

Thanks,

Mathieu

Actions #5

Updated by Mathieu Desnoyers about 11 years ago

  • Status changed from New to Feedback
Actions #6

Updated by Stanislav Vovk about 11 years ago

The crash was not seen since we switched to debug libs.
Adding info about other questions.

Linux kernel version,

root@du1:~# uname -a          
Linux du1 3.4.34-rt40+ #e8430f3 SMP PREEMPT RT Tue, 24 Sep 2013 14:05:53 +0200 ppc GNU/Linux

/dev/shm is not a symlink

root@du1:~# ls -la /dev/shm
total 8
drwxrwxrwt  2 root root   80 Sep 27 08:25 .
drwxr-xr-x 12 root root 3540 Sep 27 08:25 ..
-rw-rw-r--  1 root root 4096 Sep 27 08:25 lttng-ust-wait-5
-rw-r-----  1 root root 4096 Sep 27 08:25 lttng-ust-wait-5-0

mount gives

proc on /proc type proc (rw,relatime)
sysfs on /sys type sysfs (rw,relatime)
debugfs on /sys/kernel/debug type debugfs (rw,relatime)
none on /dev type tmpfs (rw,relatime,mode=755)
devpts on /dev/pts type devpts (rw,relatime,gid=5,mode=620)
tmpfs on /var/volatile type tmpfs (rw,relatime)

df -h

tmpfs_for_tmp                              694M   76K  694M   1% /var/volatile/tmp
none                                       694M  5.3M  689M   1% /dev
tmpfs                                      694M   76K  694M   1% /var/volatile

Prior to the error there were just two programs linked with ust and configuration was the default one, one per-uid channel, default buffer size, 4 cores MP.

kernels shm config,

root@du1:~# ipcs -lm

------ Shared Memory Limits --------
max number of segments = 4096
max seg size (kbytes) = 32768
max total shared memory (kbytes) = 8388608
min seg size (bytes) = 1

Actions #7

Updated by Stanislav Vovk about 11 years ago

Got the crash back during the weekeend. See backtrace below with debug libs.

Commits used for the debug libs, build non-optimized:
ust: 07d2f704d61649a36bbe0dd71d48ea811d301f9d
tools: 3e618c7aca65a13d290c0eace8136ef5c468cace
urcu: b2f20dff3e0ce8db5f4f0067e2d8810420f16346

FYI, at system shutdown the lttng sessions were not stopped and destroyed.
This time different programs(2) crashed then earlier in the same system boot.

Program terminated with signal 7, Bus error.
#0  0x0f962ff8 in wait_for_sessiond (sock_info=0xf9c667c <local_apps>) at lttng-ust-comm.c:875
875    lttng-ust-comm.c: No such file or directory.
(gdb) bt full
#0  0x0f962ff8 in wait_for_sessiond (sock_info=0xf9c667c <local_apps>) at lttng-ust-comm.c:875
        ret = 0
        __func__ = "wait_for_sessiond" 
#1  0x0f963308 in ust_listener_thread (arg=0xf9c667c <local_apps>) at lttng-ust-comm.c:923
        sock_info = 0xf9c667c <local_apps>
        sock = 0
        ret = 0
        prev_connect_failed = 1
        has_waited = 0
        open_sock = {0, 0}
        i = 0
        timeout = 0
        __func__ = "ust_listener_thread" 
        __PRETTY_FUNCTION__ = "ust_listener_thread" 
#2  0x0ffa484c in ?? () from /lib/libpthread.so.0
No symbol table info available.
#3  0x0fea26f4 in clone () from /lib/libc.so.6
No symbol table info available.
(gdb) q

Correcting my previous statement about configuration. There is two snapshot sessions created at system boot and configured with default channels, one channel in each session.
The programs that crashed this time were not instrumented with lttng tracepoints but loaded libraries which were instrumented.

Actions #8

Updated by Mathieu Desnoyers about 11 years ago

  • Status changed from Feedback to Confirmed
  • Assignee set to Mathieu Desnoyers
  • Priority changed from Normal to High

OK, I'll come up with a fix shortly. Thanks for narrowing this one down.

Actions #9

Updated by Stanislav Vovk about 11 years ago

Hi,
Any updates/news on this issue?
Thanks,
Stanislav

Actions #10

Updated by Mathieu Desnoyers about 11 years ago

Hi Stan,

I've prepared two patches to fix this, one for lttng-ust, one for lttng-tools. I'm mostly done with the testing. I'll push them here so you can let me know if it fixes your problem before pushing them upstream.

Thanks,

Mathieu

Actions #11

Updated by Mathieu Desnoyers about 11 years ago

I've been able to reproduce the SIGBUS by removing the shm ftruncate() (artificially) in lttng-ust, which confirms the cause I identified. It's a race between application startup and sessiond startup, where there is an intermediate state where applications can SIGBUS if they see a zero-sized shm (it's been created, but not ftruncated yet).

I've pushed the following UST commit in stable-2.3:

commit afd1b711a8e494087f108550675698479fc457f4
Author: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Date:   Wed Nov 6 07:03:10 2013 -0500

    Fix: application SIGBUS when starting in parallel with sessiond

    There is a race between application startup and sessiond startup, where
    there is an intermediate state where applications can SIGBUS if they see
    a zero-sized shm, if the shm has been created, but not ftruncated yet.

    On the UST side, fix this by ensuring that UST can read the shared
    memory file descriptor with a read() system call before they try
    accessing it through a memory map (which triggers the SIGBUS if the
    access goes beyond the file size).

    On the sessiond side, another commit needs to ensure that the shared
    memory is writeable by applications as long as its size is 0, which
    allow applications to perform ftruncate and extend its size.

    Fixes #623

    Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>

I also have a patch for lttng-tools, which I am attaching here. Both patches are needed. I'll post the lttng-tools fix for inclusion.

Thanks,

Mathieu

Actions #12

Updated by Mathieu Desnoyers about 11 years ago

  • Status changed from Confirmed to Resolved
  • % Done changed from 0 to 100
Actions #13

Updated by Mathieu Desnoyers about 11 years ago

lttng-tools fix:

commit 7972d61991f4c7e25772502a3a7ebfe57a57645d
Author: David Goulet <dgoulet@efficios.com>
Date:   Wed Nov 6 10:49:43 2013 -0500

    Fix: application SIGBUS when starting in parallel with sessiond

    There is a race between application startup and sessiond startup, where there
    is an intermediate state where applications can SIGBUS if they see a zero-sized
    shm, if the shm has been created, but not ftruncated yet.

    On the sessiond side, we need to ensure that the shared memory is writeable by
    applications as long as its size is 0, which allow applications to perform
    ftruncate and extend its size.

    On the UST side, another commit needs to ensure that UST can read the shared
    memory file descriptor with a read() system call before they try accessing it
    through a memory map (which triggers the SIGBUS if the access goes beyond the
    file size)

    Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
    Signed-off-by: David Goulet <dgoulet@efficios.com>
Actions

Also available in: Atom PDF