Bug #1064
Deadlock while dlopen-ing a provider
Status: Closed
Description
I have been observing spurious deadlocks in the ust-dl test on the LTTng CI.
I finally managed to capture a backtrace which helps understand the problem:
(gdb) thread apply all bt

Thread 3 (Thread 0xf6c4fb40 (LWP 712439)):
#0  0xf76eebe9 in __kernel_vsyscall ()
#1  0xf76bfd32 in __lll_lock_wait () at ../sysdeps/unix/sysv/linux/i386/lowlevellock.S:144
#2  0xf76b9939 in __GI___pthread_mutex_lock (mutex=0xf77134e4 <_rtld_global+1220>) at ../nptl/pthread_mutex_lock.c:115
#3  0xf7700828 in tls_get_addr_tail (ti=0xf74eff24, dtv=0x9c858b8, the_map=0xf74fa198) at dl-tls.c:765
#4  0xf74addfc in _rcu_read_lock_bp () at /home/jenkins/workspace/lttng-ust_master_build/arch/x86-32/build/std/conf/std/liburcu_version/master/deps/liburcu/build/include/urcu/static/urcu-bp.h:168
#5  __tracepoint_cb_lttng_ust_statedump___start (session=0xf6300498) at lttng-ust-statedump-provider.h:41
#6  trace_start_cb (session=0xf6300498, priv=0x0) at lttng-ust-statedump.c:251
#7  0xf74ac836 in trace_statedump_event (tp_cb=tp_cb@entry=0xf74addd0 <trace_start_cb>, owner=owner@entry=0xf74f04e0 <local_apps>, priv=priv@entry=0x0) at lttng-ust-statedump.c:212
#8  0xf74ae7ea in trace_statedump_start (owner=0xf74f04e0 <local_apps>) at lttng-ust-statedump.c:347
#9  do_lttng_ust_statedump (owner=0xf74f04e0 <local_apps>) at lttng-ust-statedump.c:601
#10 0xf74a383a in lttng_handle_pending_statedump (owner=0xf74f04e0 <local_apps>) at lttng-events.c:869
#11 0xf7498ece in handle_pending_statedump (sock_info=<optimized out>) at lttng-ust-comm.c:589
#12 handle_message (lum=0xf6c4ea80, sock=<optimized out>, sock_info=<optimized out>) at lttng-ust-comm.c:974
#13 ust_listener_thread (arg=0xf74f04e0 <local_apps>) at lttng-ust-comm.c:1507
#14 0xf76b72b5 in start_thread (arg=0xf6c4fb40) at pthread_create.c:333
#15 0xf75e216e in clone () at ../sysdeps/unix/sysv/linux/i386/clone.S:114

Thread 2 (Thread 0xf7450b40 (LWP 712438)):
#0  0xf76eebe9 in __kernel_vsyscall ()
#1  0xf75ddbb7 in syscall () at ../sysdeps/unix/sysv/linux/i386/syscall.S:29
#2  0xf74990b2 in futex (val3=0, uaddr2=0x0, timeout=0x0, val=0, op=0, uaddr=<optimized out>) at /home/jenkins/workspace/lttng-ust_master_build/arch/x86-32/build/std/conf/std/liburcu_version/master/deps/liburcu/build/include/urcu/futex.h:65
#3  futex_async (op=0, val=0, timeout=0x0, uaddr2=0x0, val3=0, uaddr=<optimized out>) at /home/jenkins/workspace/lttng-ust_master_build/arch/x86-32/build/std/conf/std/liburcu_version/master/deps/liburcu/build/include/urcu/futex.h:97
#4  wait_for_sessiond (sock_info=<optimized out>) at lttng-ust-comm.c:1258
#5  ust_listener_thread (arg=0xf74f2520 <global_apps>) at lttng-ust-comm.c:1319
#6  0xf76b72b5 in start_thread (arg=0xf7450b40) at pthread_create.c:333
#7  0xf75e216e in clone () at ../sysdeps/unix/sysv/linux/i386/clone.S:114

Thread 1 (Thread 0xf7451740 (LWP 712437)):
#0  0xf76eebe9 in __kernel_vsyscall ()
#1  0xf76bfd32 in __lll_lock_wait () at ../sysdeps/unix/sysv/linux/i386/lowlevellock.S:144
#2  0xf76b98be in __GI___pthread_mutex_lock (mutex=0xf74f5038 <ust_mutex>) at ../nptl/pthread_mutex_lock.c:80
#3  0xf7497e27 in ust_lock () at lttng-ust-comm.c:140
#4  0xf74ad547 in iter_begin (data=0xff88cb54) at lttng-ust-statedump.c:368
#5  0xf74adce8 in extract_bin_info_events (info=0xff88cab8, size=40, _data=0xff88cb54) at lttng-ust-statedump.c:453
#6  0xf7617afe in __GI___dl_iterate_phdr (callback=0xf74adca0 <extract_bin_info_events>, data=0xff88cb54) at dl-iteratephdr.c:76
#7  0xf74ae1da in lttng_ust_dl_update (ip=0xf76d4a03 <__tracepoints__init+114>) at lttng-ust-statedump.c:565
#8  0xf76e3597 in dlopen (filename=0xf76d4e08 "liblttng-ust-tracepoint.so.0", flags=258) at lttng-ust-dl.c:222
#9  0xf76d4a03 in __tracepoints__init () at /home/jenkins/workspace/dev_jgalar_lttng-tools_master-staging_build/arch/x86-32/babeltrace_version/master/build/oot/conf/python-bindings/liburcu_version/master/deps/lttng-ust/build/include/lttng/tracepoint.h:321
#10 0xf76fe365 in call_init (l=<optimized out>, argc=argc@entry=1, argv=argv@entry=0xff88d0d4, env=0xff88d0dc) at dl-init.c:72
#11 0xf76fe48e in call_init (env=0xff88d0dc, argv=0xff88d0d4, argc=1, l=<optimized out>) at dl-init.c:30
#12 _dl_init (main_map=main_map@entry=0x9c86d68, argc=1, argv=0xff88d0d4, env=0xff88d0dc) at dl-init.c:120
#13 0xf770275a in dl_open_worker (a=0xff88cdec) at dl-open.c:575
#14 0xf76fe204 in _dl_catch_error (objname=objname@entry=0xff88cde4, errstring=errstring@entry=0xff88cde8, mallocedp=mallocedp@entry=0xff88cde3, operate=0xf7702370 <dl_open_worker>, args=0xff88cdec) at dl-error.c:187
#15 0xf7701f02 in _dl_open (file=0x8048860 "libbar.so", mode=-2147483647, caller_dlopen=0xf76e354a <dlopen+58>, nsid=<optimized out>, argc=1, argv=0xff88d0d4, env=0xff88d0dc) at dl-open.c:660
#16 0xf76cf8bd in dlopen_doit (a=0xff88cfa0) at dlopenold.c:54
#17 0xf76fe204 in _dl_catch_error (objname=0x9c79014, errstring=0x9c79018, mallocedp=0x9c79010, operate=0xf76cf850 <dlopen_doit>, args=0xff88cfa0) at dl-error.c:187
#18 0xf76cf30d in _dlerror_run (operate=operate@entry=0xf76cf850 <dlopen_doit>, args=args@entry=0xff88cfa0) at dlerror.c:163
#19 0xf76cf939 in __dlopen_nocheck (file=0x8048860 "libbar.so", mode=1) at dlopenold.c:76
#20 0xf76e354a in _lttng_ust_dl_libc_dlopen (flags=1, filename=0x8048860 "libbar.so") at lttng-ust-dl.c:55
#21 dlopen (filename=0x8048860 "libbar.so", flags=1) at lttng-ust-dl.c:211
#22 0x0804866b in main (argc=1, argv=0xff88d0d4) at /home/jenkins/workspace/dev_jgalar_lttng-tools_master-staging_build/arch/x86-32/babeltrace_version/master/build/oot/conf/python-bindings/liburcu_version/master/tests/regression/ust/ust-dl/prog.c:20
We can see that:
- Thread 1 has acquired the glibc dynamic loader lock (by dlopen-ing a provider, libbar.so).
- Thread 1 is trying to acquire ust_lock in extract_bin_info_events().
- Thread 3 has acquired ust_lock (taken in handle_message()).
- Thread 3 then triggers a tracepoint in trace_start_cb(), which calls into _rcu_read_lock_bp(); that call faults a lazy TLS access into tls_get_addr_tail(), so the thread tries to acquire the glibc dynamic loader lock.
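In other words, each thread is waiting for a lock the other one already holds: a lock-order inversion between the glibc dynamic loader lock and ust_lock. Below is a minimal sketch of that inversion, using plain pthread mutexes as stand-ins for the real locks; the names and the main() driver are illustrative only, not the actual lttng-ust or glibc code.

#include <pthread.h>

/* Stand-ins for the two locks involved in the deadlock. */
static pthread_mutex_t loader_lock = PTHREAD_MUTEX_INITIALIZER; /* glibc dynamic loader lock */
static pthread_mutex_t ust_mutex = PTHREAD_MUTEX_INITIALIZER;   /* lttng-ust ust_mutex */

/* Thread 1's path: dlopen() of an instrumented library. */
static void *dlopen_path(void *arg)
{
	pthread_mutex_lock(&loader_lock);  /* taken by the dynamic loader for dlopen("libbar.so") */
	pthread_mutex_lock(&ust_mutex);    /* iter_begin() -> ust_lock(): blocks here */
	pthread_mutex_unlock(&ust_mutex);
	pthread_mutex_unlock(&loader_lock);
	return NULL;
}

/* Thread 3's path: statedump from the listener thread. */
static void *statedump_path(void *arg)
{
	pthread_mutex_lock(&ust_mutex);    /* taken in handle_message() */
	pthread_mutex_lock(&loader_lock);  /* lazy TLS access -> tls_get_addr_tail(): blocks here */
	pthread_mutex_unlock(&loader_lock);
	pthread_mutex_unlock(&ust_mutex);
	return NULL;
}

int main(void)
{
	pthread_t t1, t3;

	pthread_create(&t1, NULL, dlopen_path, NULL);
	pthread_create(&t3, NULL, statedump_path, NULL);
	pthread_join(t1, NULL); /* with the wrong interleaving, this never returns */
	pthread_join(t3, NULL);
	return 0;
}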
It is therefore easy to reproduce the problem by applying the following modification to lttng-ust:
diff --git a/liblttng-ust/lttng-ust-statedump.c b/liblttng-ust/lttng-ust-statedump.c
index 644af30..e372bc2 100644
--- a/liblttng-ust/lttng-ust-statedump.c
+++ b/liblttng-ust/lttng-ust-statedump.c
@@ -365,6 +365,7 @@ void iter_begin(struct dl_iterate_data *data)
 	 * protect memory allocation at early process start, due to
 	 * interactions with libc-wrapper lttng malloc instrumentation.
 	 */
+	sleep(1);
 	if (ust_lock()) {
 		data->cancel = true;
 		return;
Updated by Jérémie Galarneau about 8 years ago
- Status changed from New to In Progress
Disregard the note about the sleep-based reproducer; it is not reliable.
Updated by Jérémie Galarneau about 8 years ago
- Status changed from In Progress to Resolved
Fixed by:
commit c362addf1c50e1d3202a26ff96738b71b4f9a531
Author: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Date:   Sat Oct 1 18:59:17 2016 -0400

    Fix: perform TLS fixup in all UST entry points from each thread

    Each entry point into lttng-ust that end up taking the ust lock need
    to perform a TLS fixup for each thread. Add a TLS fixup in both
    listener threads, in fork and base address dump helper libs, and in
    app context and tracepoint probe registration/unregistration
    functions, which can be called from application threads.

    Those ensure we don't take the libc dl lock within the ust lock when
    performing the TLS lazy fixup.

    Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
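For reference, the fix boils down to touching the thread-local state from every entry point before any lock is taken, so the lazy TLS allocation (and its trip through the dynamic loader lock) happens up front. A minimal sketch of that pattern follows; the names tracer_tls_state, fixup_tracer_tls and ust_mutex are hypothetical, not the actual lttng-ust symbols.

#include <pthread.h>

static __thread int tracer_tls_state;	/* hypothetical __thread variable (think URCU reader state) */
static pthread_mutex_t ust_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Force the lazy TLS allocation now, while no tracing lock is held. */
static void fixup_tracer_tls(void)
{
	/*
	 * The volatile asm forces a real access to the TLS variable, so any
	 * call into the dynamic loader (tls_get_addr_tail) happens here,
	 * outside ust_mutex, rather than later while ust_mutex is held.
	 */
	asm volatile ("" : : "m" (tracer_tls_state));
}

static void *listener_thread(void *arg)
{
	fixup_tracer_tls();	/* first thing done in each entry point */

	pthread_mutex_lock(&ust_mutex);
	tracer_tls_state++;	/* TLS accesses here can no longer fault into the dl lock */
	pthread_mutex_unlock(&ust_mutex);
	return NULL;
}

int main(void)
{
	pthread_t t;

	pthread_create(&t, NULL, listener_thread, NULL);
	pthread_join(t, NULL);
	return 0;
}

The actual commit applies this idea to the places listed in the message above: both listener threads, the fork and base address dump helper libs, and the app context and tracepoint probe registration/unregistration functions callable from application threads.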