Project

General

Profile

Actions

Bug #1202

closed

Deadlock in probe record code when perf counters enabled

Added by Anton Smyk over 5 years ago. Updated almost 5 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
-
Target version:
-
Start date:
10/07/2019
Due date:
% Done:

100%

Estimated time:

Description

We experience deadlock in UST probe recording code when perf counters were enabled.

Managed to reproduce the problem in clean minimal install of Oracle Linux 7.7 with latest vanilla RPMs from Efficios site:

[root@tb97647 ~]# rpm -qa 'userspace-rcu*' 'lttng-*' | sort
lttng-tools-2.10.7-1.el7.x86_64
lttng-tools-debuginfo-2.10.7-1.el7.x86_64
lttng-tools-devel-2.10.7-1.el7.x86_64
lttng-ust-2.10.5-1.el7.x86_64
lttng-ust-debuginfo-2.10.5-1.el7.x86_64
lttng-ust-devel-2.10.5-1.el7.x86_64
lttng-ust-java-2.10.5-1.el7.x86_64
lttng-ust-java-agent-2.10.5-1.el7.x86_64
userspace-rcu-0.10.2-1.el7.x86_64
userspace-rcu-debuginfo-0.10.2-1.el7.x86_64
userspace-rcu-devel-0.10.2-1.el7.x86_64

Here are steps to reproduce:

1. In a terminal session start lttng spam program

[antons@tb97647 spam-lttng-ust]$ export LTTNG_HOME=$HOME/TB-97647/test03/lttng-home
[antons@tb97647 spam-lttng-ust]$ ~/spam-lttng-ust/spam 100 300 4
Started spamming with 100 trace events per second for 300 seconds using 4 concurrent threads...

Apparently, for the deadlock to reproduce, it is crucial to start the spam program BEFORE starting the LTTng session.

2. In another terminal window create, configure and start LTTng session:

[antons@tb97647 test03]$ export LTTNG_HOME=$HOME/TB-97647/test03/lttng-home
[antons@tb97647 test03]$ mkdir -p $LTTNG_HOME
[antons@tb97647 test03]$ lttng-sessiond -vvv --verbose-consumer > lttng-sessiond.log 2>&1 &
[antons@tb97647 test03]$ lttng create testtrace
Session testtrace created.
Traces will be written in /home/antons/TB-97647/test03/lttng-home/lttng-traces/testtrace-20191007-134454
[antons@tb97647 test03]$ lttng enable-event -u 'spam:*'
UST event spam:* created in channel channel0
[antons@tb97647 test03]$ for i in perf:thread:cycles perf:thread:stalled-cycles-frontend perf:thread:instructions perf:thread:cache-references perf:thread:cache-misses perf:thread:bus-cycles; do lttng add-context --userspace --type=$i; done
UST context perf:thread:cycles added to all channels
UST context perf:thread:stalled-cycles-frontend added to all channels
UST context perf:thread:instructions added to all channels
UST context perf:thread:cache-references added to all channels
UST context perf:thread:cache-misses added to all channels
UST context perf:thread:bus-cycles added to all channels
[antons@tb97647 test03]$ lttng start testtrace
Tracing started for session testtrace

3. In no time the spam program exhibits deadlock:

[root@tb97647 ~]# pstack `pgrep spam`
Thread 7 (Thread 0x7f6ac0915700 (LWP 22774)):
#0  0x00007f6ac1449be9 in syscall () from /lib64/libc.so.6
#1  0x00007f6ac1b5b60b in futex (val3=0, uaddr2=0x0, timeout=0x0, val=0, op=0, uaddr=0x7f6ac1fd4000) at /usr/include/urcu/futex.h:65
#2  futex_async (op=0, val=0, timeout=0x0, uaddr2=0x0, val3=0, uaddr=0x7f6ac1fd4000) at /usr/include/urcu/futex.h:97
#3  wait_for_sessiond (sock_info=0x7f6ac1da9a20 <global_apps>, sock_info=0x7f6ac1da9a20 <global_apps>) at lttng-ust-comm.c:1405
#4  ust_listener_thread (arg=0x7f6ac1da9a20 <global_apps>) at lttng-ust-comm.c:1467
#5  0x00007f6ac1726ea5 in start_thread () from /lib64/libpthread.so.0
#6  0x00007f6ac144f8cd in clone () from /lib64/libc.so.6
Thread 6 (Thread 0x7f6ac0114700 (LWP 22775)):
#0  0x00007f6ac1444c2d in poll () from /lib64/libc.so.6
#1  0x00007f6ac0d27646 in poll (__timeout=10, __nfds=0, __fds=0x0) at /usr/include/bits/poll2.h:46
#2  wait_for_readers (input_readers=input_readers@entry=0x7f6ac0113250, cur_snap_readers=cur_snap_readers@entry=0x0, qsreaders=qsreaders@entry=0x7f6ac0113260) at urcu-bp.c:250
#3  0x00007f6ac0d2772c in synchronize_rcu_bp () at urcu-bp.c:319
#4  0x00007f6ac1b64155 in synchronize_trace () at lttng-events.c:135
#5  0x00007f6ac1b5edb5 in lttng_cmd (objd=<optimized out>, cmd=<optimized out>, arg=<optimized out>, uargs=<optimized out>, owner=<optimized out>) at lttng-ust-abi.c:385
#6  0x00007f6ac1b5b3c2 in handle_message (lum=0x7f6ac01135b0, sock=3, sock_info=0x7f6ac1da79e0 <local_apps>) at lttng-ust-comm.c:1000
#7  ust_listener_thread (arg=0x7f6ac1da79e0 <local_apps>) at lttng-ust-comm.c:1709
#8  0x00007f6ac1726ea5 in start_thread () from /lib64/libpthread.so.0
#9  0x00007f6ac144f8cd in clone () from /lib64/libc.so.6
Thread 5 (Thread 0x7f6abf913700 (LWP 22776)):
#0  0x00007f6ac172d54d in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f6ac1728e9b in _L_lock_883 () from /lib64/libpthread.so.0
#2  0x00007f6ac1728d68 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00007f6ac1b5a4cc in ust_lock_nocheck () at lttng-ust-comm.c:178
#4  0x00007f6ac1b732ef in add_thread_field (perf_thread=0x7f6aac0008c0, perf_field=0x7f6ab8012350) at lttng-context-perf-counters.c:311
#5  get_thread_field (field=0x7f6ab8012350) at lttng-context-perf-counters.c:340
#6  0x00007f6ac1b73478 in wrapper_perf_counter_read (field=<optimized out>) at lttng-context-perf-counters.c:350
#7  perf_counter_record (field=<optimized out>, ctx=0x7f6abf912d30, chan=0x7f6ab8010dc0) at lttng-context-perf-counters.c:361
#8  0x00007f6ac1b4ea40 in ctx_record (bufctx=bufctx@entry=0x7f6abf912d30, chan=chan@entry=0x7f6ab8010dc0, ctx=0x7f6ab80114a0, mode=mode@entry=APP_CTX_ENABLED) at lttng-ring-buffer-client.h:140
#9  0x00007f6ac1b76052 in ctx_record (mode=APP_CTX_ENABLED, ctx=<optimized out>, chan=0x7f6ab8010dc0, bufctx=0x7f6abf912d30) at ../libringbuffer/backend.h:91
#10 lttng_write_event_header_slow (config=<optimized out>, event_id=<optimized out>, ctx=<optimized out>) at lttng-ring-buffer-client.h:361
#11 lttng_write_event_header (config=0x7f6ac1da6640 <client_config>, event_id=<optimized out>, ctx=0x7f6abf912d30) at lttng-ring-buffer-client.h:292
#12 lttng_event_reserve (ctx=0x7f6abf912d30, event_id=<optimized out>) at lttng-ring-buffer-client.h:728
#13 0x00000000004019fc in __event_probe__spam___dummy_event ()
#14 0x0000000000400f39 in spammer_thread ()
#15 0x00007f6ac1726ea5 in start_thread () from /lib64/libpthread.so.0
#16 0x00007f6ac144f8cd in clone () from /lib64/libc.so.6
Thread 4 (Thread 0x7f6abf112700 (LWP 22777)):
#0  0x00007f6ac172d54d in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f6ac1728e9b in _L_lock_883 () from /lib64/libpthread.so.0
#2  0x00007f6ac1728d68 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00007f6ac1b5a4cc in ust_lock_nocheck () at lttng-ust-comm.c:178
#4  0x00007f6ac1b732ef in add_thread_field (perf_thread=0x7f6ab00008c0, perf_field=0x7f6ab8012350) at lttng-context-perf-counters.c:311
#5  get_thread_field (field=0x7f6ab8012350) at lttng-context-perf-counters.c:340
#6  0x00007f6ac1b73478 in wrapper_perf_counter_read (field=<optimized out>) at lttng-context-perf-counters.c:350
#7  perf_counter_record (field=<optimized out>, ctx=0x7f6abf111d30, chan=0x7f6ab8010dc0) at lttng-context-perf-counters.c:361
#8  0x00007f6ac1b4ea40 in ctx_record (bufctx=bufctx@entry=0x7f6abf111d30, chan=chan@entry=0x7f6ab8010dc0, ctx=0x7f6ab80114a0, mode=mode@entry=APP_CTX_ENABLED) at lttng-ring-buffer-client.h:140
#9  0x00007f6ac1b76052 in ctx_record (mode=APP_CTX_ENABLED, ctx=<optimized out>, chan=0x7f6ab8010dc0, bufctx=0x7f6abf111d30) at ../libringbuffer/backend.h:91
#10 lttng_write_event_header_slow (config=<optimized out>, event_id=<optimized out>, ctx=<optimized out>) at lttng-ring-buffer-client.h:361
#11 lttng_write_event_header (config=0x7f6ac1da6640 <client_config>, event_id=<optimized out>, ctx=0x7f6abf111d30) at lttng-ring-buffer-client.h:292
#12 lttng_event_reserve (ctx=0x7f6abf111d30, event_id=<optimized out>) at lttng-ring-buffer-client.h:728
#13 0x00000000004019fc in __event_probe__spam___dummy_event ()
#14 0x0000000000400f39 in spammer_thread ()
#15 0x00007f6ac1726ea5 in start_thread () from /lib64/libpthread.so.0
#16 0x00007f6ac144f8cd in clone () from /lib64/libc.so.6
Thread 3 (Thread 0x7f6abe911700 (LWP 22778)):
#0  0x00007f6ac172d54d in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f6ac1728e9b in _L_lock_883 () from /lib64/libpthread.so.0
#2  0x00007f6ac1728d68 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00007f6ac1b5a4cc in ust_lock_nocheck () at lttng-ust-comm.c:178
#4  0x00007f6ac1b732ef in add_thread_field (perf_thread=0x7f6aa80008c0, perf_field=0x7f6ab8012350) at lttng-context-perf-counters.c:311
#5  get_thread_field (field=0x7f6ab8012350) at lttng-context-perf-counters.c:340
#6  0x00007f6ac1b73478 in wrapper_perf_counter_read (field=<optimized out>) at lttng-context-perf-counters.c:350
#7  perf_counter_record (field=<optimized out>, ctx=0x7f6abe910d30, chan=0x7f6ab8010dc0) at lttng-context-perf-counters.c:361
#8  0x00007f6ac1b4ea40 in ctx_record (bufctx=bufctx@entry=0x7f6abe910d30, chan=chan@entry=0x7f6ab8010dc0, ctx=0x7f6ab80114a0, mode=mode@entry=APP_CTX_ENABLED) at lttng-ring-buffer-client.h:140
#9  0x00007f6ac1b76052 in ctx_record (mode=APP_CTX_ENABLED, ctx=<optimized out>, chan=0x7f6ab8010dc0, bufctx=0x7f6abe910d30) at ../libringbuffer/backend.h:91
#10 lttng_write_event_header_slow (config=<optimized out>, event_id=<optimized out>, ctx=<optimized out>) at lttng-ring-buffer-client.h:361
#11 lttng_write_event_header (config=0x7f6ac1da6640 <client_config>, event_id=<optimized out>, ctx=0x7f6abe910d30) at lttng-ring-buffer-client.h:292
#12 lttng_event_reserve (ctx=0x7f6abe910d30, event_id=<optimized out>) at lttng-ring-buffer-client.h:728
#13 0x00000000004019fc in __event_probe__spam___dummy_event ()
#14 0x0000000000400f39 in spammer_thread ()
#15 0x00007f6ac1726ea5 in start_thread () from /lib64/libpthread.so.0
#16 0x00007f6ac144f8cd in clone () from /lib64/libc.so.6
Thread 2 (Thread 0x7f6abe110700 (LWP 22779)):
#0  0x00007f6ac172d54d in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f6ac1728e9b in _L_lock_883 () from /lib64/libpthread.so.0
#2  0x00007f6ac1728d68 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00007f6ac1b5a4cc in ust_lock_nocheck () at lttng-ust-comm.c:178
#4  0x00007f6ac1b732ef in add_thread_field (perf_thread=0x7f6ab40008c0, perf_field=0x7f6ab8012350) at lttng-context-perf-counters.c:311
#5  get_thread_field (field=0x7f6ab8012350) at lttng-context-perf-counters.c:340
#6  0x00007f6ac1b73478 in wrapper_perf_counter_read (field=<optimized out>) at lttng-context-perf-counters.c:350
#7  perf_counter_record (field=<optimized out>, ctx=0x7f6abe10fd30, chan=0x7f6ab8010dc0) at lttng-context-perf-counters.c:361
#8  0x00007f6ac1b4ea40 in ctx_record (bufctx=bufctx@entry=0x7f6abe10fd30, chan=chan@entry=0x7f6ab8010dc0, ctx=0x7f6ab80114a0, mode=mode@entry=APP_CTX_ENABLED) at lttng-ring-buffer-client.h:140
#9  0x00007f6ac1b76052 in ctx_record (mode=APP_CTX_ENABLED, ctx=<optimized out>, chan=0x7f6ab8010dc0, bufctx=0x7f6abe10fd30) at ../libringbuffer/backend.h:91
#10 lttng_write_event_header_slow (config=<optimized out>, event_id=<optimized out>, ctx=<optimized out>) at lttng-ring-buffer-client.h:361
#11 lttng_write_event_header (config=0x7f6ac1da6640 <client_config>, event_id=<optimized out>, ctx=0x7f6abe10fd30) at lttng-ring-buffer-client.h:292
#12 lttng_event_reserve (ctx=0x7f6abe10fd30, event_id=<optimized out>) at lttng-ring-buffer-client.h:728
#13 0x00000000004019fc in __event_probe__spam___dummy_event ()
#14 0x0000000000400f39 in spammer_thread ()
#15 0x00007f6ac1726ea5 in start_thread () from /lib64/libpthread.so.0
#16 0x00007f6ac144f8cd in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x7f6ac1fc77c0 (LWP 22773)):
#0  0x00007f6ac1728017 in pthread_join () from /lib64/libpthread.so.0
#1  0x0000000000401127 in main ()


Files

lttng-sessiond.log.gz (39.5 KB) lttng-sessiond.log.gz Anton Smyk, 10/07/2019 08:16 AM
0001-Fix-lttng-perf-counter-deadlock.patch (7.81 KB) 0001-Fix-lttng-perf-counter-deadlock.patch Mathieu Desnoyers, 10/09/2019 12:44 PM
Actions #1

Updated by Mathieu Desnoyers over 5 years ago

(for the records: I just saw the link to the spam program in the bug description, so removed my previous question)

Actions #2

Updated by Mathieu Desnoyers over 5 years ago

Hi Anton,

Thanks for reporting this issue. Can you test with the attached patch ?

I tried reproducing the issue following your instructions on my side, but I suspect my machine is unlucky enough not to reproduce the specific timings to trigger the deadlock.

Thanks,

Mathieu

Actions #3

Updated by Mathieu Desnoyers over 5 years ago

  • Status changed from New to Feedback
Actions #4

Updated by Anton Smyk over 5 years ago

Thank you for such quick response!

It seems that you have made this patch from latest commit (48a143c09cc97bf7a2ace811277e7d60b294b5f6?) from master branch, since the patch couldn't be cleanly applied to stable-2.11 nor stable-2.10.

I have built lttng-ust from that commit together with latest urcu 0.11.1 and lttng-tools 2.11.0-rc (from stable-2.11) and the plot thickens here! Indeed, the deadlock could be reproduced there. Nor any message appeared in the trace if perf counters were enabled. I guess something related might have been accidentally broken recently in 2.11 development.

I was able to backport your patch to 2.10.5 and it appears to help! No more deadlock! I see perf counters in the trace!

So, two points in the summary:
  • The problem seems fixed for us (judging by this simple example program) in 2.10.5 with backported patch.
  • You may want to doublecheck perf counters in 2.11 in scenario described in this issue (start app before the LTTng session.)
Actions #5

Updated by Anton Smyk over 5 years ago

*Indeed, the deadlock could NOT be reproduced there

Actions #6

Updated by Mathieu Desnoyers over 5 years ago

Thanks for confirming that the proposed fix actually fixes the issue.

I succeeded in reproducing the deadlock with the master branch as well, but using fewer performance counters (3).

Regarding the difference in behavior you observe between your 2.10 and 2.11/master scenarios, I suspect it is caused by using more performance counters than are available in your system.

This means that previously, you might have experienced "working" tracing, but missing some performance counters fields (without really knowing it). I suspect that error handling has been changed in tools 2.11, so when adding a performance counter fails, it puts the channel into a failed state so nothing is traced into it. Can you confirm whether this is what you are observing ?

If this is indeed what is happening, we should open a separate ticket against lttng-tools to discuss this.

Actions #7

Updated by Mathieu Desnoyers over 5 years ago

  • Status changed from Feedback to Resolved
  • % Done changed from 0 to 100
Actions #8

Updated by Jérémie Galarneau over 5 years ago

The pre-2.11 lttng-tools behaviour should be restored by

commit d767977450c6bf1fb18ce8bb653d6b6dfdb6bb68 (perf-counters-fix)
Author: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Date:   Tue Oct 15 16:56:22 2019 -0400

    Fix: sessiond: application channel creation failure stops start cmd

    The creation of an application's channel can fail when, for instance,
    a context can't be created. This causes applications that would have
    been started _after_ it to never be started.

    This keeps the iteration going on error and starts all applications
    that could be started. This is more in line with the behaviour of
    2.10 (and earlier) since those channel creations would occur as
    applications registered and not on tracing "start".

    Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Actions #9

Updated by Anton Smyk almost 5 years ago

Hi again, sorry for the late update, I just realized I should have answered on this point:

you might have experienced "working" tracing, but missing some performance counters fields

Yes, this is indeed what happens for us in 2.10 lttng-ust/lttng-tools. In the example in this case 6 perf counters are added to context, but only 5 of them are logged with each event: the perf:thread:stalled-cycles-frontend is not available on my machine, nor is it present in the saved trace. IMO it is pretty good behavior, although the error message in lttng-sessiond output could be less cryptic.

I've also tried recent lttng-ust 2.11.1 + lttng-tools 2.11.2 against the same example test. And that was different story: no events at all were saved in the trace. When I removed perf:thread:stalled-cycles-frontend counter from the add-context command, all went well. IMO this is less desirable behavior. At least add-context should have printed some warning.

We are using 2.10 at the moment, and we don't have immediate plans to upgrade to 2.11. But this issue in 2.11 would become problem when we do.

Do you agree that this behavior in 2.11 is a problem? Shall I raise separate big report for this?

Actions

Also available in: Atom PDF