Project

General

Profile

Actions

Bug #1421

open

lttng session daemon hanging

Added by Mikael Beckius about 1 month ago. Updated 3 days ago.

Status:
New
Priority:
Normal
Assignee:
-
Target version:
-
Start date:
03/05/2025
Due date:
% Done:

0%

Estimated time:

Description

This problem was discovered while doing live streaming in a network with high latency and it appears that resources aren't released properly when a session is destroyed. It should be possible to reproduce using the following steps:

/usr/lib64/lttng-tools/ptest/tests/utils/testapp/gen-ust-events/gen-ust-events -i 1000000 -w 1000000 &
groupadd tracing
useradd -d /home/lttnguser -m -s /bin/sh -g tracing lttnguser
su lttnguser
/usr/lib64/lttng-tools/ptest/tests/utils/testapp/gen-ust-events/gen-ust-events -i 1000000 -w 1000000 &

lttng create micke --live=2000000
lttng enable-channel s micke -u --num-subbuf=4 --subbuf-size=1048576 ch1
lttng start micke
tc qdisc add dev lo root netem delay 300ms
-
> Monitor signal blocked by live signal
lttng destroy micke
Starting the session daemon in verbose mode will make it clear that the session is still alive. Otherwise, it will be noticed when trying to create a new session and the lttng commands appear to hang.

Removing the delay will make the monitor signal unblocked. To work around this the attached patch was created where the priorities are switched between live and monitor.

#######################################

While investigating this issue another issue that is somewhat similar was also encountered. This is currently not a problem, but I thought might be worth mentioning anyway. It involves two live sessions and should be possible to reproduce using the following steps:
lttng create micke --live=2000000
lttng enable-channel -s micke -u --num-subbuf=4 --subbuf-size=1048576 ch1
lttng start micke
lttng create micke2--live=2000000
lttng enable-channel -s micke2-u --num-subbuf=4 --subbuf-size=1048576 ch1
lttng start micke2

tc qdisc add dev lo root netem delay 300ms
lttng destroy micke2
--> live timer destruction blocked, pending, second live

Removing delay will make destruction unblocked
Removing booth sessions will also make destruction unblocked

It appears that in this case timer destruction can't proceed while there are outstanding signals connected to the timer in question.

#######################################

In both cases a 150ms delay is usually sufficient.

Problems were discovered on version 2.13.13.


Files

0001-Fix-consumerd-swap-live-and-monitor-signal-prioritie.patch (2.56 KB) 0001-Fix-consumerd-swap-live-and-monitor-signal-prioritie.patch Mikael Beckius, 03/05/2025 03:46 PM
repro.sh (2.64 KB) repro.sh Kienan Stewart, 04/11/2025 12:01 PM
matrix.sh (241 Bytes) matrix.sh Kienan Stewart, 04/11/2025 12:01 PM
results-2.13-with-prio-swap (1.58 KB) results-2.13-with-prio-swap Kienan Stewart, 04/11/2025 12:01 PM
results-2.13 (1.58 KB) results-2.13 Kienan Stewart, 04/11/2025 12:01 PM
stable-2.13.png (45.2 KB) stable-2.13.png Kienan Stewart, 04/11/2025 12:02 PM
stable-2.13-with-prio-patch.png (49.8 KB) stable-2.13-with-prio-patch.png Kienan Stewart, 04/11/2025 12:02 PM
Actions #1

Updated by Kienan Stewart about 1 month ago

  • Project changed from LTTng to LTTng-tools
Actions #2

Updated by Kienan Stewart about 1 month ago

Hi Micke,

thanks for the bug report. In the first case you demonstrate: when you say hang, do you mean an indefinite hang? I haven't been able to reproduce that with 2.13.13, stable-2.13, or master.

thanks,
kienan

Actions #3

Updated by Kienan Stewart 18 days ago

I think in the original report the sequence was repeated twice in order to actually reach a visible hang and a remote lttng-relayd was used, but I was able to deduce from the logs that the hang actually occurs during the destroy of the first session. The call sequence I received was embedded with the verbose logs from the session daemon, so it wasn't available for copy paste and it was like 2 months or something since I worked on this but right before opening the ticket I tested the sequence I reported.

If you run the lttng-session with logs enabled like "-vvv --verbose-consumer" you should see that the logging never settles and that the live timer keeps on running for ever. In the normal case logging will eventually stop even though it may take 1-2 minutes due to the latency. This is how I reproduced the issue.

I don't think I was ever able to reproduce it without a second user running an app with tracing enabled. Not sure why that was needed but this was something I picked up from the original logs. Before that, things just worked whatever I tried.

I guess you could also test using a remote relay daemon, but I am pretty sure I reproduced the issue using a local relay daemon. Possibly I should have emphasized the fact that you either need to check the verbose logging or run the sequence twice.

Anyway, as far as I could tell the live timer rt signal would always arrive before the monitor signal was processed, always putting itself first in the queue. After creating the patch, I even wrote a C test program to prove this. I think I even created a variant using separate threads for the signals which would not hang.

Updated by Kienan Stewart 8 days ago ยท Edited

Hi Mikael,

I spent some more time with this and while I've never had an indefinite hang, I see the time taken to destroy the session increase. I tested on stable 2.13 at caa7f0f590fa64ad2585631740c082e7e370c40f with an without the patch you proposed and didn't see a meaningful difference.

I've attached the scripts and results.


Actions #5

Updated by Mikael Beckius 3 days ago

Hello Kienan!

Let me have a look at your reproducer and results but I vaguely remember that I couldn't reproduce the issue on a dual core system but on systems with more cores it was 100%.

Micke

Actions #6

Updated by Mikael Beckius 3 days ago

I think reproduction is fine using your script but you can't shut down the session daemon or delete the netem delay as you will not see the logs then.

Maybe if you do:
...
echo -e "$TRACED_APPS_COUNT\t$QDISC_DELAY\t$START\t$END\t$(bc <<< "$END - $START")" >> results

  1. >>> ADDED THIS LINE <<<
    tail -f "${SESSIOND_LOG_FILE}"
  1. Cleanup
    stop_sessiond
    ...

And then from another shell after some time you can do:
tc qdisc del dev lo root

Or whatever way you want to do it but you need to observe the logs from the session daemon after the destroy call has finished.

It should look something like this:
DBG1 - 15:02:30.638588687 [4510/4522]: Live timer for channel 1 (in live_timer() at ../../../../git/src/common/consumer/consumer-timer.c:291)
DBG1 - 15:02:30.638617535 [4510/4522]: Stream 58 empty, sending beacon (in consumer_flush_ust_index() at ../../../../git/src/common/consumer/consumer-timer.c:257)
DBG1 - 15:02:30.638633460 [4510/4522]: Relayd sending index for stream ID 70 (in relayd_send_index() at ../../../../git/src/common/relayd/relayd.c:1101)
DBG3 - 15:02:30.638647933 [4510/4522]: Relayd sending command 13 of size 104 (in send_command() at ../../../../git/src/common/relayd/relayd.c:90)
...
over and over again.

Actions

Also available in: Atom PDF