Bug #1078
lttng enable-event throws the error "UST create channel failed"

Added by Sébastien Boisvert almost 8 years ago. Updated almost 8 years ago.

Status: Invalid
Priority: Normal
Target version:
Start date: 11/28/2016
Due date: 12/20/2016
% Done: 0%
Estimated time: 10.00 h

Description

Hi all,

I am tracing a C++ application on a 32-core machine (2 x AMD Opteron 6376 16 cores, 256 GB RAM).

I found a bug in lttng-tools.

After 5 tracing sessions, the sixth session always crashes at "lttng enable-event" with "UST create channel failed"
(n = 10 runs, each starting from a state where no lttng* daemons are running).
When this happens, lttng reports the error but does not seem to run a health check or to be able to recover on its own.
I know that the problem is in one of the daemons, because if I kill them all the problem goes away.

I searched for "UST create channel failed" on https://bugs.lttng.org/, and got 1 search result (no open bugs):

- https://bugs.lttng.org/issues/339 ("Subbuffer size of < 4096 are accepted but unable to use the channel afterwards")

Here is my bug report:

1/4 - Software component versions
--------------------------------

lttng-tools 2.8.2
lttng-ust 2.8.1
popt 1.10.4
userspace-rcu 0.9.2

babeltrace 1.5.0-rc1
libuuid 1.0.3
libxml2 2.9.4
asciidoc 8.6.9

linux 3.13.0-101-generic
g++ 4.8.4-2ubuntu1~14.04.3
cmake 2.8.12.2
make 3.81
binutils 2.24
coreutils 8.21
Ubuntu 14.04.5

2/4 - Steps to reproduce the problem
------------------------------------

1. Make sure that no lttng daemon processes are running, to have a clean state:

bolt$ killall -s SIGTERM lttng-sessiond lttng-runas lttng-consumerd lttng-relayd &> /dev/null
bolt$ ps aux|grep lttng|grep -v grep|wc -l
0

2. Trace the app 10 times. Each time, it creates a new session, which is kept for later.

Repeat this command 10 times:

bolt$ lttng create alcatraz-trace-`date -Ins`; lttng track -u --pid `pidof alcatraz-backbone-builder`; lttng enable-event -u gydle_om:DeNovo*;lttng start;sleep 1;lttng stop
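
For convenience, the ten repetitions can be scripted. A minimal sketch, assuming the traced application keeps running for the whole run (quoting the event pattern keeps the shell from glob-expanding gydle_om:DeNovo* against files in the current directory):

#!/bin/bash
# Repeat the create/track/enable/start/stop cycle 10 times.
for i in $(seq 10); do
    lttng create "alcatraz-trace-$(date -Ins)"
    lttng track -u --pid "$(pidof alcatraz-backbone-builder)"
    lttng enable-event -u 'gydle_om:DeNovo*'
    lttng start
    sleep 1
    lttng stop
done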

3/4 - Expected result
---------------------

It is expected that 10 sessions will be created and stopped, one after the other.

There should not be any error. lttng processes are only killed before the run (at the very beginning, before
the first session), not between sessions.

The log should be similar to this (repeated 10 times, with a different session name):

bolt$ lttng create alcatraz-trace-`date -Ins`; lttng track -u --pid `pidof alcatraz-backbone-builder`; lttng enable-event -u gydle_om:DeNovo*;lttng start;sleep 1;lttng stop
Session alcatraz-trace-2016-11-28T10:12:38,791131449-0500 created.
Traces will be written in /home/sboisvert/lttng-traces/alcatraz-trace-2016-11-28T10:12:38,791131449-0500-20161128-101238
PID 20337 tracked in session alcatraz-trace-2016-11-28T10:12:38,791131449-0500
UST event gydle_om:DeNovo* created in channel channel0
Tracing started for session alcatraz-trace-2016-11-28T10:12:38,791131449-0500
Waiting for data availability.
Tracing stopped for session alcatraz-trace-2016-11-28T10:12:38,791131449-0500

4/4 - Actual result
-------------------

Session number 6 crashed (the 5 sessions before it did not). The experiment was run 5 times, and it is always session 6 that crashes.

Tracing for 2 seconds should, in theory, generate twice the number of events, which could affect this bug.
The experiment was also run 5 times with "sleep 2", and it always crashes at session 6 as well, just like with "sleep 1".

After 5 successful sessions, session 6 (the sixth session) crashes:

bolt$ lttng create alcatraz-trace-`date -Ins`; lttng track -u --pid `pidof alcatraz-backbone-builder`; lttng enable-event -u gydle_om:DeNovo*;lttng start;sleep 1;lttng stop
Session alcatraz-trace-2016-11-28T09:49:45,434535118-0500 created.
Traces will be written in /home/sboisvert/lttng-traces/alcatraz-trace-2016-11-28T09:49:45,434535118-0500-20161128-094945
PID 20337 tracked in session alcatraz-trace-2016-11-28T09:49:45,434535118-0500
UST event gydle_om:DeNovo* created in channel channel0
Error: No session daemon is available
Error: No session daemon is available

A backtrace can be generated.

Enable the debug mode:

bolt$ export LTTNG_ABORT_ON_ERROR=1
bolt$ ulimit -c
unlimited
bolt$ rm -f core
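
If no core file appears in the working directory after the crash, the kernel may be piping cores elsewhere (on Ubuntu, apport often does this); checking the core pattern first can save time:

bolt$ cat /proc/sys/kernel/core_pattern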

Run this 5 times:

bolt$ lttng create alcatraz-trace-`date -Ins`; lttng track -u --pid `pidof alcatraz-backbone-builder`; lttng enable-event -u gydle_om:DeNovo*;lttng start;sleep 1;lttng stop

Then, for session 6, run the commands one by one, until it crashes.

bolt$ lttng create alcatraz-trace-`date -Ins`
Session alcatraz-trace-2016-11-28T10:49:23,480808644-0500 created.
Traces will be written in /home/sboisvert/lttng-traces/alcatraz-trace-2016-11-28T10:49:23,480808644-0500-20161128-104923
bolt$ lttng track -u --pid `pidof alcatraz-backbone-builder`
PID 20337 tracked in session alcatraz-trace-2016-11-28T10:49:23,480808644-0500
bolt$ lttng enable-event -u gydle_om:DeNovo*
Error: Event gydle_om:DeNovo*: UST create channel failed (channel channel0, session alcatraz-trace-2016-11-28T10:49:23,480808644-0500)
Aborted (core dumped)

The backtrace:

bolt$ gdb -batch -ex bt lttng core
[New LWP 28837]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `lttng enable-event -u gydle_om:DeNovo*'.
Program terminated with signal SIGABRT, Aborted.
#0 0x00002b9eac6c6c37 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
56 ../nptl/sysdeps/unix/sysv/linux/raise.c: No such file or directory.
#0 0x00002b9eac6c6c37 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1 0x00002b9eac6ca028 in __GI_abort () at abort.c:89
#2 0x000000000041ba2b in lttng_abort_on_error () at ../../../lttng-tools-2.8.2/src/common/error.c:224
#3 0x00000000004127e4 in __lttng_print_check_abort (type=PRINT_ERR) at ../../../../lttng-tools-2.8.2/src/common/error.h:131
#4 enable_events (session_name=session_name@entry=0x1d23440 "alcatraz-trace-2016-11-28T10:49:23,480808644-0500") at ../../../../lttng-tools-2.8.2/src/bin/lttng/commands/enable_events.c:1071
#5 0x0000000000413322 in cmd_enable_events (argc=<optimized out>, argv=<optimized out>) at ../../../../lttng-tools-2.8.2/src/bin/lttng/commands/enable_events.c:1362
#6 0x00000000004096bc in handle_command (argv=0x7ffff39d4df0, argc=3) at ../../../../lttng-tools-2.8.2/src/bin/lttng/lttng.c:231
#7 parse_args (argv=<optimized out>, argc=<optimized out>) at ../../../../lttng-tools-2.8.2/src/bin/lttng/lttng.c:403
#8 main (argc=<optimized out>, argv=<optimized out>) at ../../../../lttng-tools-2.8.2/src/bin/lttng/lttng.c:452

The error LTTNG_ERR_UST_CHAN_FAIL ("UST create channel failed") occurs in two places in src/bin/lttng-sessiond/channel.c:

https://github.com/lttng/lttng-tools/blob/v2.8.2/src/bin/lttng-sessiond/channel.c#L393

https://github.com/lttng/lttng-tools/blob/v2.8.2/src/bin/lttng-sessiond/channel.c#L414
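
(A quick way to locate these error sites in an lttng-tools checkout, output omitted here:)

bolt$ grep -rn LTTNG_ERR_UST_CHAN_FAIL src/bin/lttng-sessiond/channel.c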

Actions #1

Updated by Mathieu Desnoyers almost 8 years ago

Are you running your lttng-sessiond as a simple user or as root?

If running as a simple user, you are probably running out of file descriptors.
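
A quick way to check how close the consumer daemon is to the limit: compare the soft limit with the number of open file descriptors held by lttng-consumerd (a sketch; this assumes a single lttng-consumerd process, and on Ubuntu 14.04 the default soft limit is typically 1024):

bolt$ ulimit -n
bolt$ ls /proc/$(pidof -s lttng-consumerd)/fd | wc -l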

Actions #2

Updated by Sébastien Boisvert almost 8 years ago

Mathieu Desnoyers wrote:

Are you running your lttng-sessiond as a simple user or as root?

If running as a simple user, you are probably running out of file descriptors.

Simple user.

bolt$ lsof &> lsof.txt
bolt$ grep lttng lsof.txt|grep ust-shm-consumer|wc -l
4400
bolt$ grep lttng lsof.txt|grep ust-shm-consumer|tail
lttng-con 28776 28786 sboisvert 725u REG 0,20 660864 129840 /run/shm/ust-shm-consumer-28776 (deleted)
lttng-con 28776 28786 sboisvert 726u REG 0,20 660864 129841 /run/shm/ust-shm-consumer-28776 (deleted)
lttng-con 28776 28786 sboisvert 727u REG 0,20 660864 129842 /run/shm/ust-shm-consumer-28776 (deleted)
lttng-con 28776 28786 sboisvert 728u REG 0,20 660864 129843 /run/shm/ust-shm-consumer-28776 (deleted)
lttng-con 28776 28786 sboisvert 729u REG 0,20 660864 129844 /run/shm/ust-shm-consumer-28776 (deleted)
lttng-con 28776 28786 sboisvert 730u REG 0,20 660864 129845 /run/shm/ust-shm-consumer-28776 (deleted)
lttng-con 28776 28786 sboisvert 731u REG 0,20 660864 129846 /run/shm/ust-shm-consumer-28776 (deleted)
lttng-con 28776 28786 sboisvert 732u REG 0,20 660864 129847 /run/shm/ust-shm-consumer-28776 (deleted)
lttng-con 28776 28786 sboisvert 733u REG 0,20 660864 129848 /run/shm/ust-shm-consumer-28776 (deleted)
lttng-con 28776 28786 sboisvert 735u REG 0,20 17152 128858 /run/shm/ust-shm-consumer-28776 (deleted)

Actions #3

Updated by Mathieu Desnoyers almost 8 years ago

You should try running lttng-sessiond as root, or increase the max open files ulimit manually.

Thanks,

Mathieu
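
For instance, raising the soft limit in the shell that starts the session daemon (a sketch; the daemon inherits the limit, and the soft limit can only be raised up to the hard limit, see "ulimit -Hn"):

bolt$ ulimit -n 4096
bolt$ lttng-sessiond --daemonize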

Actions #4

Updated by Sébastien Boisvert almost 8 years ago

Mathieu Desnoyers wrote:

You should try running lttng-sessiond as root, or increase the max open files ulimit manually.

Thanks,

Mathieu

That solves the problem ("ulimit -n 4096").

Could this be managed automatically, since userspace tracing is done without root access?

Why are all these open files needed?
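
(For reference, the usual way to make a higher per-user limit persistent on Ubuntu is /etc/security/limits.conf, which does require a one-time root edit; a sketch, assuming pam_limits is enabled, with example values:)

# appended to /etc/security/limits.conf (example values; needs a one-time root edit)
sboisvert  soft  nofile  4096
sboisvert  hard  nofile  8192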

Actions #5

Updated by Mathieu Desnoyers almost 8 years ago

We already bump the ulimit when running as root, because we always have the right to do so.

In your situation, what are the limits imposed by the system on your user?

We need to keep one open file per POSIX shared memory object (one per CPU, per channel, per session) in the consumer daemon (lttng-consumerd).
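
As a back-of-the-envelope check of that accounting on this 32-core machine: with one channel per session and six sessions alive, the shm files alone account for at least 32 x 1 x 6 descriptors, and in practice each stream needs several descriptors (pipes in addition to the shm object), which is consistent with the much larger lsof count above. A sketch:

bolt$ echo $((32 * 1 * 6))    # CPUs x channels x sessions: lower bound on consumer fds
192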

Actions #6

Updated by Sébastien Boisvert almost 8 years ago

Mathieu Desnoyers wrote:

We already bump the ulimit when running as root, because we always have the right to do so.

In your situation, what are the limits imposed by the system on your user?

We need to keep one open file per POSIX shared memory object (one per CPU, per channel, per session) in the consumer daemon (lttng-consumerd).

I don't have/need root access, and I don't have any limits that I can't change myself.

So are these files kept open because a previous session could be restarted (not destroyed yet)?

I found my problem in the manual of lttng-destroy:

"Destroying a tracing session does not destroy the recorded trace data, if any; it frees resources acquired by the session daemon and tracer side, making sure to flush all trace data."

The problem I see is that there is no continuity: destroy does not destroy the trace data, sure,
but "lttng view" does not work on a destroyed session. One needs to use babeltrace with the full path to the CTF data:

bolt$ lttng destroy alcatraz-trace-2016-11-28T11:36:10,299050169-0500
Session alcatraz-trace-2016-11-28T11:36:10,299050169-0500 destroyed

bolt$ lttng view alcatraz-trace-2016-11-28T11:36:10,299050169-0500
Session name alcatraz-trace-2016-11-28T11:36:10,299050169-0500 not found
Error: Command error

bolt$ lttng view /home/sboisvert/lttng-traces/alcatraz-trace-2016-11-28T11:36:10,299050169-0500-20161128-113610
Session name /home/sboisvert/lttng-traces/alcatraz-trace-2016-11-28T11:36:10,299050169-0500-20161128-113610 not found
Error: Command error

bolt$ babeltrace /home/sboisvert/lttng-traces/alcatraz-trace-2016-11-28T11:36:10,299050169-0500-20161128-113610|wc -l
4200877

That's fine too. I thought that destroy would wipe my trace data.
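
(Incidentally, lttng view can also be pointed at a trace directory with its --trace-path option, assuming the 2.8 version supports it; check "lttng view --help". It still launches babeltrace underneath:)

bolt$ lttng view --trace-path /home/sboisvert/lttng-traces/alcatraz-trace-2016-11-28T11:36:10,299050169-0500-20161128-113610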

Actions #7

Updated by Sébastien Boisvert almost 8 years ago

This can be closed as a WORKSFORME or READTHEMANUAL.

Actions #8

Updated by Mathieu Desnoyers almost 8 years ago

  • Status changed from New to Invalid