Project

General

Profile

Actions

Bug #1319

open

Assert in emit_inactivity_message

Added by Sergei Dyshel 6 months ago. Updated 3 months ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
-
Start date:
06/17/2021
Due date:
% Done:

0%

Estimated time:

Description

While using babeltrace 2.0.3 with lttng-live (latest lttng version) we encountered the following assert:

(gdb) bt
#0  0x00007f9023fc3495 in raise () from /lib64/libc.so.6
#1  0x00007f9023fc4c75 in abort () from /lib64/libc.so.6
#2  0x00007f902391d8d3 in bt_common_abort () at common.c:2111
#3  0x00007f902391b9ff in bt_common_assert_failed (file=0x7f9023921495 "lttng-live.c", line=613, func=0x7f902393f500 <__func__.17246> "emit_inactivity_message", assertion=0x7f9023931c70 "stream_iter->trace->clock_class")
    at assert.c:40
#4  0x00007f902390be67 in emit_inactivity_message (stream_iter=0x472100, timestamp=<optimized out>, message=0x7ffe76aa8d98, lttng_live_msg_iter=0x471810) at lttng-live.c:613
#5  lttng_live_iterator_next_handle_one_quiescent_stream (message=0x7ffe76aa8d98, lttng_live_stream=0x472100, lttng_live_msg_iter=0x471810) at lttng-live.c:653
#6  lttng_live_iterator_next_msg_on_stream (curr_msg=0x7ffe76aa8d98, stream_iter=0x472100, lttng_live_msg_iter=0x471810) at lttng-live.c:961
#7  next_stream_iterator_for_trace (live_trace=0x4721a0, live_trace=0x4721a0, youngest_trace_stream_iter=<synthetic pointer>, lttng_live_msg_iter=0x471810) at lttng-live.c:1093
#8  next_stream_iterator_for_session (youngest_session_stream_iter=<synthetic pointer>, session=0x471d70, lttng_live_msg_iter=0x471810) at lttng-live.c:1325
#9  lttng_live_msg_iter_next (self_msg_it=<optimized out>, msgs=0x471780, capacity=15, count=0x7ffe76aa8ee8) at lttng-live.c:1518
#10 0x00007f90247031e2 in call_iterator_next_method (capacity=15, user_count=0x7ffe76aa8ee8, msgs=0x471780, iterator=0x4716a0) at iterator.c:803
#11 bt_message_iterator_next (iterator=0x4716a0, msgs=msgs@entry=0x7ffe76aa8ee0, user_count=user_count@entry=0x7ffe76aa8ee8) at iterator.c:859
#12 0x00007f90238841e9 in muxer_upstream_msg_iter_next (is_ended=<synthetic pointer>, muxer_upstream_msg_iter=0x471fc0) at muxer.c:399
#13 validate_muxer_upstream_msg_iter (is_ended=<synthetic pointer>, muxer_upstream_msg_iter=0x471fc0) at muxer.c:958
#14 validate_muxer_upstream_msg_iters (muxer_msg_iter=0x471650) at muxer.c:985
#15 muxer_msg_iter_do_next_one (msg=0x449850, muxer_msg_iter=0x471650, muxer_comp=0x4524f0) at muxer.c:1048
#16 muxer_msg_iter_do_next (count=0x7ffe76aa8f70, capacity=15, msgs=0x449850, muxer_msg_iter=0x471650, muxer_comp=0x4524f0) at muxer.c:1132
#17 muxer_msg_iter_next (self_msg_iter=0x471570, msgs=0x449850, capacity=15, count=0x7ffe76aa8f70) at muxer.c:1414
#18 0x00007f90247031e2 in call_iterator_next_method (capacity=15, user_count=0x7ffe76aa8f70, msgs=0x449850, iterator=0x471570) at iterator.c:803
#19 bt_message_iterator_next (iterator=0x471570, msgs=msgs@entry=0x7ffe76aa8f78, user_count=user_count@entry=0x7ffe76aa8f70) at iterator.c:859
#20 0x00007f9023821514 in bx_view_consume (comp=0x43a7d0) at traces//babeltrace-plugin/component.cc:272
#21 0x00007f90246ff3d8 in consume_graph_sink (comp=0x43a7d0) at graph.c:447
#22 consume_sink_node (graph=0x43b7d0, node=0x442e00) at graph.c:491
#23 consume_no_check (api_func=<synthetic pointer>, graph=0x43b7d0) at graph.c:566
#24 bt_graph_run (graph=0x43b7d0) at graph.c:710
#25 0x0000000000408a81 in cmd_run (cfg=0x451eb0) at babeltrace2.c:2485
#26 main (argc=<optimized out>, argv=<optimized out>) at babeltrace2.c:2673

I did initial prints in gdb to see that this is not some kind of obvious memory corruption:

(gdb) p stream_iter->trace
$1 = (struct lttng_live_trace *) 0x4721a0
(gdb) p stream_iter->trace->clock_class
$2 = (const bt_clock_class *) 0x0
(gdb) p stream_iter->trace[0]
$3 = {log_level = BT_LOGGING_LEVEL_WARNING, self_comp = 0x43c050, session = 0x471d70, id = 1, trace = 0x66fc30,
  trace_class = 0x66bef0, metadata = 0x66bb00, clock_class = 0x0, stream_iterators = 0x442d20,
  metadata_stream_state = LTTNG_LIVE_METADATA_STREAM_STATE_NOT_NEEDED}
(gdb) p stream_iter->trace->trace_class
$4 = (bt_trace_class *) 0x66bef0
(gdb) p stream_iter->trace->trace_class->stream_classes
$5 = (GPtrArray *) 0x43a4a0
(gdb) p stream_iter->trace->trace_class->stream_classes->len
$6 = 0

Unfortunately I can't attach the coredump here but I can provide any relevant information from it.

Actions #1

Updated by Jonathan Rajotte Julien 6 months ago

  • Description updated (diff)

Hi,

Is there any error messages on lttng-relayd or lttng-sessiond side?

Could you provide us with a bit more context on how the tracing session is configured?

Is "lttng clear" used?

Cheers

Actions #2

Updated by Sergei Dyshel 6 months ago

No other error messages.

babeltrace2 crashed several times until our monitoring script detected it and restarted the whole chain of lttng-sessiond/relayd/babeltrace2.

We're using LTTNG framework for continuous low-latency tracing, i.e. our service constantly emits traces which are then processed live
by our babeltrace2 plugin.

The session is created with these commands:

lttng create XXX --live
lttng enable-channel channel0 -u -s XXX --subbuf-size 1M --num-subbuf 8 --overwrite --tracefile-size 104857600 --tracefile-count 3
lttng enable-event 'XXX_*' -u -s XXX -c channel0
lttng add-context -u -t vpid -t vtid -s XXX -c channel0
lttng start XXX

relayd runs with this command:

lttng-relayd --output /var/log/lttng-XXX/live

babeltrace2 runs with:

babeltrace2 lttng-live net://localhost/host/DIRECTORY/XXX --component XXX:sink.XXX.view --log-level WARN

Actions #3

Updated by Jonathan Rajotte Julien 6 months ago

Do you see any pattern for when the error occurs? (i.e tracing is ok for over 2 hours etc, only happens on hooking babeltrace2 and happens 1 time in a 100 etc.)

That could help us pinpoint the problem.

Actions #4

Updated by Sergei Dyshel 5 months ago

No pattern, the error happened only once.

Actions #5

Updated by Jérémie Galarneau 4 months ago

In the interest of reproducing your overall setup, can you give us some rough indication of the throughput of the application(s) being traced?

Are the network links between the consumer and relay daemons, or between the relay daemon and viewer severely congested/throttled?

Can you reproduce these crashes with the (default) text sink?

Actions #6

Updated by Sergei Dyshel 3 months ago

I couldn't reproduce the crash in many runs. Seems one-off issue.

What kind of throughput would you want me to measure? The application produces (usually) no more than 25MB of trace messages per hour.

Actions

Also available in: Atom PDF