Project

General

Profile

Bug #865

UST deadlocks on probe registration

Added by Francis Giraldeau about 2 years ago. Updated about 2 years ago.

Status:
Resolved
Priority:
Normal
Target version:
-
Start date:
11/28/2014
Due date:
% Done:

0%


Description

I did a prototype of UST instrumentation to record all function
entry/return of Python programs (works with CPython >= 3).

Here is an example of the trace:

$ ./go.sh
...
[16:50:17.443042690] (+0.000053969) berta python:call: { cpu_id = 6 }, {
co_name = "<module>" }
[16:50:17.443056837] (+0.000014147) berta python:call: { cpu_id = 6 }, {
co_name = "cafe" }

It works when only the event python:call is enabled. When all events are
enabled (lttng enable-event -a -u), then the python program hangs on a call
to futex wait. It does not respond to CTRL-C signal, and must be killed.
Here is the full stack trace in GDB, showing that it seems related to the
registration of tracef() probe. Below that, there is the lttng-sessiond
log. I reproduced the problem with lttng 2.5x stable and the git master.

Below is the debug info from GDB showing the two threads involved in the deadlock. Basically, it seems that tracing nests with probe registration:

Thread 1:
pthread_mutex_lock
tls_get_addr_tail
_rcu_read_lock_bp
tracepoint_cb_ust_baddr_statedump_soinfo
trace_baddr

Thread 2:
pthread_mutex_lock
ust_lock_nocheck
lttng_probe_register
_lttng_events_init_lttng_ust_tracef

Could it be possible that rcu_read_lock() is not lock-free because of the TLS?

Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff58be700 (LWP 314)]
[New Thread 0x7ffff50bd700 (LWP 315)]
Program received signal SIGSTOP, Stopped (signal).
0x00007ffff7bcaf2c in __lll_lock_wait () from /lib/x86_64-linux-gnu/libpthread.so.0
(gdb) bt
#0  0x00007ffff7bcaf2c in __lll_lock_wait () from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x00007ffff7bc6657 in _L_lock_909 () from /lib/x86_64-linux-gnu/libpthread.so.0
#2  0x00007ffff7bc6480 in pthread_mutex_lock () from /lib/x86_64-linux-gnu/libpthread.so.0
#3  0x00007ffff6102aec in ust_lock_nocheck () at lttng-ust-comm.c:150
#4  0x00007ffff61083a5 in lttng_probe_register (
    desc=0x7ffff633e9c0 <__probe_desc___lttng_ust_tracef>) at lttng-probes.c:204
#5  0x00007ffff60fc301 in __lttng_events_init__lttng_ust_tracef ()
    at ../include/lttng/ust-tracepoint-event.h:797
#6  0x00007ffff7dea13a in call_init (l=<optimized out>, argc=argc@entry=2, 
    argv=argv@entry=0x7fffffffdcf8, env=env@entry=0x7fffffffdd10) at dl-init.c:78
#7  0x00007ffff7dea223 in call_init (env=<optimized out>, argv=<optimized out>, 
    argc=<optimized out>, l=<optimized out>) at dl-init.c:36
#8  _dl_init (main_map=main_map@entry=0xafa210, argc=2, argv=0x7fffffffdcf8, env=0x7fffffffdd10)
    at dl-init.c:126
#9  0x00007ffff7deec70 in dl_open_worker (a=a@entry=0x7fffffffc1f8) at dl-open.c:577
#10 0x00007ffff7de9ff4 in _dl_catch_error (objname=objname@entry=0x7fffffffc1e8, 
    errstring=errstring@entry=0x7fffffffc1f0, mallocedp=mallocedp@entry=0x7fffffffc1e0, 
    operate=operate@entry=0x7ffff7dee9a0 <dl_open_worker>, args=args@entry=0x7fffffffc1f8)
    at dl-error.c:187
#11 0x00007ffff7dee3bb in _dl_open (
    file=0x7ffff6593f50 "./build/lib.linux-x86_64-3.4/lttngProfile.cpython-34m.so", 
    mode=-2147483646, caller_dlopen=<optimized out>, nsid=-2, argc=2, argv=0x7fffffffdcf8, 
    env=0x7fffffffdd10) at dl-open.c:661
#12 0x00007ffff75f302b in dlopen_doit (a=a@entry=0x7fffffffc410) at dlopen.c:66
#13 0x00007ffff7de9ff4 in _dl_catch_error (objname=0xa7bcb0, errstring=0xa7bcb8, mallocedp=0xa7bca8, 
    operate=0x7ffff75f2fd0 <dlopen_doit>, args=0x7fffffffc410) at dl-error.c:187
#14 0x00007ffff75f362d in _dlerror_run (operate=operate@entry=0x7ffff75f2fd0 <dlopen_doit>, 
    args=args@entry=0x7fffffffc410) at dlerror.c:163
#15 0x00007ffff75f30c1 in __dlopen (file=<optimized out>, mode=<optimized out>) at dlopen.c:87
#16 0x000000000050fa6e in _PyImport_GetDynLoadFunc ()

(gdb) print &ust_mutex
$1 = (pthread_mutex_t *) 0x7ffff633eee0 <ust_mutex>

(gdb) print *(&ust_mutex)
$2 = {__data = {__lock = 2, __count = 0, __owner = 315, __nusers = 1, __kind = 0, __spins = 0, 
    __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, 
  __size = "\002\000\000\000\000\000\000\000;\001\000\000\001", '\000' <repeats 26 times>, 
  __align = 2}

(gdb) info threads
  Id   Target Id         Frame 
  3    Thread 0x7ffff50bd700 (LWP 315) "python3" 0x00007ffff7bcaf2c in __lll_lock_wait ()
   from /lib/x86_64-linux-gnu/libpthread.so.0
  2    Thread 0x7ffff58be700 (LWP 314) "python3" syscall ()
    at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
* 1    Thread 0x7ffff7fd1740 (LWP 310) "python3" 0x00007ffff7bcaf2c in __lll_lock_wait ()
   from /lib/x86_64-linux-gnu/libpthread.so.0

(gdb) thread 3
[Switching to thread 3 (Thread 0x7ffff50bd700 (LWP 315))]
#0  0x00007ffff7bcaf2c in __lll_lock_wait () from /lib/x86_64-linux-gnu/libpthread.so.0
(gdb) bt
#0  0x00007ffff7bcaf2c in __lll_lock_wait () from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x00007ffff7bc6672 in _L_lock_953 () from /lib/x86_64-linux-gnu/libpthread.so.0
#2  0x00007ffff7bc64da in pthread_mutex_lock () from /lib/x86_64-linux-gnu/libpthread.so.0
#3  0x00007ffff7ddb150 in tls_get_addr_tail (ti=0x7ffff6339f58, dtv=0xafcde0, the_map=0xafb510)
    at dl-tls.c:722
#4  0x00007ffff6110a35 in _rcu_read_lock_bp ()
    at /usr/include/x86_64-linux-gnu/urcu/static/urcu-bp.h:158
#5  __tracepoint_cb_ust_baddr_statedump___soinfo (mtime=-1, size=0, sopath=0x7ffff50bb3c0 "[vdso]", 
    baddr=0x7ffff7ffa000, session=0x7fffe8000930) at ust_baddr_statedump.h:40
#6  trace_baddr (base_addr_ptr=base_addr_ptr@entry=0x7ffff7ffa000, 
    resolved_path=resolved_path@entry=0x7ffff50bb3c0 "[vdso]", vdso=<optimized out>, 
    owner=owner@entry=0x7ffff633a880 <local_apps>) at lttng-ust-baddr.c:82
#7  0x00007ffff6110efe in extract_soinfo_events (info=info@entry=0x7ffff50bc420, size=size@entry=64, 
    _data=_data@entry=0x7ffff50bc4a0) at lttng-ust-baddr.c:141
#8  0x00007ffff792c75c in __GI___dl_iterate_phdr (
    callback=callback@entry=0x7ffff6110e60 <extract_soinfo_events>, data=data@entry=0x7ffff50bc4a0)
    at dl-iteratephdr.c:76
#9  0x00007ffff6110fc4 in lttng_ust_baddr_statedump (owner=owner@entry=0x7ffff633a880 <local_apps>)
    at lttng-ust-baddr.c:189
#10 0x00007ffff610b209 in lttng_handle_pending_statedump (
    owner=owner@entry=0x7ffff633a880 <local_apps>) at lttng-events.c:686
#11 0x00007ffff610318e in handle_pending_statedump (sock_info=0x7ffff633a880 <local_apps>)
    at lttng-ust-comm.c:523
#12 handle_message (lum=0x7ffff50bc940, sock=3, sock_info=0x7ffff633a880 <local_apps>)
    at lttng-ust-comm.c:850
#13 ust_listener_thread (arg=0x7ffff633a880 <local_apps>) at lttng-ust-comm.c:1357
#14 0x00007ffff7bc4182 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#15 0x00007ffff78f0fbd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

python-profile-ust.tar.gz (1.78 KB) Francis Giraldeau, 11/28/2014 12:33 PM

fix-baddr-python.patch View (789 Bytes) Mathieu Desnoyers, 01/12/2015 12:00 PM

History

#1 Updated by Francis Giraldeau about 2 years ago

Here are all the files to reproduce the bug. Just execute ./go.sh. Requires python3-dev and lttng-ust, of course.

#2 Updated by Francis Giraldeau about 2 years ago

I confirm that disabling baddr statedump does not triggers the bug.

export LTTNG_UST_WITHOUT_BADDR_STATEDUMP=true

Thanks!

#3 Updated by Mathieu Desnoyers about 2 years ago

Can you try this out ?

#4 Updated by Mathieu Desnoyers about 2 years ago

  • Status changed from Needs feedback to Resolved

Fixed by master commit:

commit 1556a5495c92bdd99e57faa87e21eb0b6f95c3a2
Author: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Date:   Mon Jan 12 13:27:53 2015 -0500

    Fix: add urcu-bp TLS fixup

    When performing a library base address statedump, the UST threads can
    grab the RCU read-side lock, which touch a TLS. We should "pre-fault" 
    this TLS in the constructor to eliminate a deadlock involving libc
    dynamic loader lock.

    Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>

#5 Updated by Francis Giraldeau about 2 years ago

I confirm that the patch fixes the issue. Here is a trace example:

[14:21:48.675076983] (+0.000024696) berta python:call: { cpu_id = 4 }, { co_name = "cafe" }
[14:21:48.675093285] (+0.000016302) berta ust_baddr_statedump:soinfo: { cpu_id = 0 }, { baddr = 0x7F1F77C13000, sopath = "/home/francis/tmp/python-profile-ust/build/lib.linux-x86_64-3.4/lttngProfile.cpython-34m.so", size = 50706, mtime = 1421090508 }

Thanks!

Also available in: Atom PDF