Project

General

Profile

Bug #1064

Deadlock while dlopen-ing a provider

Added by Jérémie Galarneau almost 4 years ago. Updated almost 4 years ago.

Status:
Resolved
Priority:
High
Target version:
Start date:
09/30/2016
Due date:
% Done:

0%

Estimated time:

Description

I have been observing spurious deadlocks in the ust-dl test on the LTTng CI.

I finally managed to capture a backtrace which helps understand the problem:

(gdb) thread apply all bt

Thread 3 (Thread 0xf6c4fb40 (LWP 712439)):
#0  0xf76eebe9 in __kernel_vsyscall ()
#1  0xf76bfd32 in __lll_lock_wait () at ../sysdeps/unix/sysv/linux/i386/lowlevellock.S:144
#2  0xf76b9939 in __GI___pthread_mutex_lock (mutex=0xf77134e4 <_rtld_global+1220>) at ../nptl/pthread_mutex_lock.c:115
#3  0xf7700828 in tls_get_addr_tail (ti=0xf74eff24, dtv=0x9c858b8, the_map=0xf74fa198) at dl-tls.c:765
#4  0xf74addfc in _rcu_read_lock_bp () at /home/jenkins/workspace/lttng-ust_master_build/arch/x86-32/build/std/conf/std/liburcu_version/master/deps/liburcu/build/include/urcu/static/urcu-bp.h:168
#5  __tracepoint_cb_lttng_ust_statedump___start (session=0xf6300498) at lttng-ust-statedump-provider.h:41
#6  trace_start_cb (session=0xf6300498, priv=0x0) at lttng-ust-statedump.c:251
#7  0xf74ac836 in trace_statedump_event (tp_cb=tp_cb@entry=0xf74addd0 <trace_start_cb>, owner=owner@entry=0xf74f04e0 <local_apps>, priv=priv@entry=0x0) at lttng-ust-statedump.c:212
#8  0xf74ae7ea in trace_statedump_start (owner=0xf74f04e0 <local_apps>) at lttng-ust-statedump.c:347
#9  do_lttng_ust_statedump (owner=0xf74f04e0 <local_apps>) at lttng-ust-statedump.c:601
#10 0xf74a383a in lttng_handle_pending_statedump (owner=0xf74f04e0 <local_apps>) at lttng-events.c:869
#11 0xf7498ece in handle_pending_statedump (sock_info=<optimized out>) at lttng-ust-comm.c:589
#12 handle_message (lum=0xf6c4ea80, sock=<optimized out>, sock_info=<optimized out>) at lttng-ust-comm.c:974
#13 ust_listener_thread (arg=0xf74f04e0 <local_apps>) at lttng-ust-comm.c:1507
#14 0xf76b72b5 in start_thread (arg=0xf6c4fb40) at pthread_create.c:333
#15 0xf75e216e in clone () at ../sysdeps/unix/sysv/linux/i386/clone.S:114

Thread 2 (Thread 0xf7450b40 (LWP 712438)):
#0  0xf76eebe9 in __kernel_vsyscall ()
#1  0xf75ddbb7 in syscall () at ../sysdeps/unix/sysv/linux/i386/syscall.S:29
#2  0xf74990b2 in futex (val3=0, uaddr2=0x0, timeout=0x0, val=0, op=0, uaddr=<optimized out>) at /home/jenkins/workspace/lttng-ust_master_build/arch/x86-32/build/std/conf/std/liburcu_version/master/deps/liburcu/build/include/urcu/futex.h:65
#3  futex_async (op=0, val=0, timeout=0x0, uaddr2=0x0, val3=0, uaddr=<optimized out>) at /home/jenkins/workspace/lttng-ust_master_build/arch/x86-32/build/std/conf/std/liburcu_version/master/deps/liburcu/build/include/urcu/futex.h:97
#4  wait_for_sessiond (sock_info=<optimized out>) at lttng-ust-comm.c:1258
#5  ust_listener_thread (arg=0xf74f2520 <global_apps>) at lttng-ust-comm.c:1319
#6  0xf76b72b5 in start_thread (arg=0xf7450b40) at pthread_create.c:333
#7  0xf75e216e in clone () at ../sysdeps/unix/sysv/linux/i386/clone.S:114

Thread 1 (Thread 0xf7451740 (LWP 712437)):
#0  0xf76eebe9 in __kernel_vsyscall ()
#1  0xf76bfd32 in __lll_lock_wait () at ../sysdeps/unix/sysv/linux/i386/lowlevellock.S:144
#2  0xf76b98be in __GI___pthread_mutex_lock (mutex=0xf74f5038 <ust_mutex>) at ../nptl/pthread_mutex_lock.c:80
#3  0xf7497e27 in ust_lock () at lttng-ust-comm.c:140
#4  0xf74ad547 in iter_begin (data=0xff88cb54) at lttng-ust-statedump.c:368
#5  0xf74adce8 in extract_bin_info_events (info=0xff88cab8, size=40, _data=0xff88cb54) at lttng-ust-statedump.c:453
#6  0xf7617afe in __GI___dl_iterate_phdr (callback=0xf74adca0 <extract_bin_info_events>, data=0xff88cb54) at dl-iteratephdr.c:76
#7  0xf74ae1da in lttng_ust_dl_update (ip=0xf76d4a03 <__tracepoints__init+114>) at lttng-ust-statedump.c:565
#8  0xf76e3597 in dlopen (filename=0xf76d4e08 "liblttng-ust-tracepoint.so.0", flags=258) at lttng-ust-dl.c:222
#9  0xf76d4a03 in __tracepoints__init () at /home/jenkins/workspace/dev_jgalar_lttng-tools_master-staging_build/arch/x86-32/babeltrace_version/master/build/oot/conf/python-bindings/liburcu_version/master/deps/lttng-ust/build/include/lttng/tracepoint.h:321
#10 0xf76fe365 in call_init (l=<optimized out>, argc=argc@entry=1, argv=argv@entry=0xff88d0d4, env=0xff88d0dc) at dl-init.c:72
#11 0xf76fe48e in call_init (env=0xff88d0dc, argv=0xff88d0d4, argc=1, l=<optimized out>) at dl-init.c:30
#12 _dl_init (main_map=main_map@entry=0x9c86d68, argc=1, argv=0xff88d0d4, env=0xff88d0dc) at dl-init.c:120
#13 0xf770275a in dl_open_worker (a=0xff88cdec) at dl-open.c:575
#14 0xf76fe204 in _dl_catch_error (objname=objname@entry=0xff88cde4, errstring=errstring@entry=0xff88cde8, mallocedp=mallocedp@entry=0xff88cde3, operate=0xf7702370 <dl_open_worker>, args=0xff88cdec) at dl-error.c:187
#15 0xf7701f02 in _dl_open (file=0x8048860 "libbar.so", mode=-2147483647, caller_dlopen=0xf76e354a <dlopen+58>, nsid=<optimized out>, argc=1, argv=0xff88d0d4, env=0xff88d0dc) at dl-open.c:660
#16 0xf76cf8bd in dlopen_doit (a=0xff88cfa0) at dlopenold.c:54
#17 0xf76fe204 in _dl_catch_error (objname=0x9c79014, errstring=0x9c79018, mallocedp=0x9c79010, operate=0xf76cf850 <dlopen_doit>, args=0xff88cfa0) at dl-error.c:187
#18 0xf76cf30d in _dlerror_run (operate=operate@entry=0xf76cf850 <dlopen_doit>, args=args@entry=0xff88cfa0) at dlerror.c:163
#19 0xf76cf939 in __dlopen_nocheck (file=0x8048860 "libbar.so", mode=1) at dlopenold.c:76
#20 0xf76e354a in _lttng_ust_dl_libc_dlopen (flags=1, filename=0x8048860 "libbar.so") at lttng-ust-dl.c:55
#21 dlopen (filename=0x8048860 "libbar.so", flags=1) at lttng-ust-dl.c:211
#22 0x0804866b in main (argc=1, argv=0xff88d0d4) at /home/jenkins/workspace/dev_jgalar_lttng-tools_master-staging_build/arch/x86-32/babeltrace_version/master/build/oot/conf/python-bindings/liburcu_version/master/tests/regression/ust/ust-dl/prog.c:20

We can see that:

Thread 1 has acquired the glibc dynamic loader lock (by dlopening a provider, libbar.so).
Thread 1 is trying to acquire ust_lock in extract_bin_info_events().

Thread 3 has acquired the ust_lock (taken in handle_message()).
Thread 3 then triggers a tracepoint in trace_start_cb(), which calls into _rcu_read_lock_bp(), accessing/faulting a TLS which causes a call into tls_get_addr_tail(), causing the thread to try to acquire the glibc dynamic loader lock.

Therefore, it is easy to make the problem reproducible by making the following modification to lttng-ust:

diff --git a/liblttng-ust/lttng-ust-statedump.c b/liblttng-ust/lttng-ust-statedump.c
index 644af30..e372bc2 100644
--- a/liblttng-ust/lttng-ust-statedump.c
+++ b/liblttng-ust/lttng-ust-statedump.c
@@ -365,6 +365,7 @@ void iter_begin(struct dl_iterate_data *data)
         * protect memory allocation at early process start, due to
         * interactions with libc-wrapper lttng malloc instrumentation.
         */
+       sleep(1);
        if (ust_lock()) {
                data->cancel = true;
                return;

#1

Updated by Jérémie Galarneau almost 4 years ago

  • Status changed from New to In Progress

Disregard the note about the sleep-based reproducer, it is not reliable.

#2

Updated by Jérémie Galarneau almost 4 years ago

  • Status changed from In Progress to Resolved

Fixed by:

commit c362addf1c50e1d3202a26ff96738b71b4f9a531
Author: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Date:   Sat Oct 1 18:59:17 2016 -0400

    Fix: perform TLS fixup in all UST entry points from each thread

    Each entry point into lttng-ust that end up taking the ust lock need to
    perform a TLS fixup for each thread. Add a TLS fixup in both listener
    threads, in fork and base address dump helper libs, and in app context
    and tracepoint probe registration/unregistration functions, which can be
    called from application threads.

    Those ensure we don't take the libc dl lock within the ust lock when
    performing the TLS lazy fixup.

    Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>

Also available in: Atom PDF