Project

General

Profile

Actions

Bug #1407

open

Hang when stopping a live session with a low live value

Added by Kienan Stewart about 1 year ago. Updated about 1 year ago.

Status:
New
Priority:
Low
Assignee:
-
Target version:
-
Start date:
12/19/2023
Due date:
% Done:

0%

Estimated time:

Description

While investigating #1403, I noticed some test runs would intermittently hang when the live delay value was set very low (eg. 1us).

For example, in tests/regression/tools/clear/test_ust the test test_ust_streaming_live would hang executing lttng stop sessionX. After investigating the issue I noted the following:

  • the lttng client is hanging, pending a response from the sessiond
  • the consumer's consumer_thread_data_poll thread appear to be looping in consumer_timer_signal_thread_qs while trying to remove the monitor timer during channel deletion:
    Thread 7 (Thread 0x7f999bfff680 (LWP 3060295) "lttng-consumerd"):                                                  
    #0  sigpending (set=0x7f999bffdea0) at ../sysdeps/unix/sysv/linux/sigpending.c:27                                  
    #1  0x00005559299f07ec in consumer_timer_signal_thread_qs (signr=47) at consumer/consumer-timer.cpp:325            
    #2  consumer_channel_timer_stop (timer_id=timer_id@entry=0x7f998c0021c0, signal=47) at consumer/consumer-timer.cpp:422
    #3  0x00005559299f122f in consumer_timer_monitor_stop (channel=channel@entry=0x7f998c000f40) at consumer/consumer-timer.cpp:531             
    #4  0x00005559299dabbf in consumer_del_channel (channel=channel@entry=0x7f998c000f40) at consumer/consumer.cpp:379                                                                                                                     
    #5  0x00005559299ee83a in consumer_stream_destroy (stream=stream@entry=0x7f998c01a830, ht=<optimized out>) at consumer/consumer-stream.cpp:1070
    #6  0x00005559299e3c61 in consumer_del_stream (ht=<optimized out>, stream=0x7f998c01a830) at consumer/consumer.cpp:547
    #7  consumer_thread_data_poll (data=0x55592b498ba0) at consumer/consumer.cpp:2747                                  
    #8  0x00007f99aaaa63ec in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:444                        
    #9  0x00007f99aab26a5c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81   
    
  • the consumerd's consumer_timer_thread was often handling the live timer:
    Thread 4 (Thread 0x7f99a991b680 (LWP 3060292) "lttng-consumerd"):
    #0  __GI___libc_write (nbytes=124, buf=0x7f99a9919e90, fd=2) at ../sysdeps/unix/sysv/linux/write.c:26
    #1  __GI___libc_write (fd=2, buf=0x7f99a9919e90, nbytes=124) at ../sysdeps/unix/sysv/linux/write.c:24
    #2  0x00007f99aaa9dbb5 in _IO_new_file_write (f=0x7f99aabf26a0 <_IO_2_1_stderr_>, data=0x7f99a9919e90, n=124) at ./libio/fileops.c:1180
    #3  0x00007f99aaa9cf70 in new_do_write (fp=fp@entry=0x7f99aabf26a0 <_IO_2_1_stderr_>, data=data@entry=0x7f99a9919e9
    0 "DBG1 - 10:31:33.730698608 [3060287/3060292]: Live timer for channel 17 (in live_timer() at consumer/consumer-timer.cpp:284)\n", to_do=to_do@entry=124) at ./libio/libioP.h:946
    #4  0x00007f99aaa9e2a1 in _IO_new_file_xsputn (n=124, data=<optimized out>, f=0x7f99aabf26a0 <_IO_2_1_stderr_>) at ./libio/fileops.c:1254
    #5  _IO_new_file_xsputn (f=0x7f99aabf26a0 <_IO_2_1_stderr_>, data=<optimized out>, n=124) at ./libio/fileops.c:1196
    #6  0x00007f99aaa71409 in __printf_buffer_flush_to_file (buf=buf@entry=0x7f99a9919e60) at ../libio/libioP.h:946
    #7  0x00007f99aaa714c0 in __printf_buffer_to_file_done (buf=buf@entry=0x7f99a9919e60) at ./stdio-common/printf_buffer_to_file.c:120
    #8  0x00007f99aaa7ac0d in __vfprintf_internal (s=0x7f99aabf26a0 <_IO_2_1_stderr_>, format=0x555929a57cd8 "DBG1 - %s [%s]: Live timer for channel %lu (in %s() at consumer/consumer-timer.cpp:284)\n", ap=ap@entry=0x7f99a9919f60, mode_
    flags=mode_flags@entry=0) at ./stdio-common/vfprintf-internal.c:1475                                               
    #9  0x00007f99aaa70296 in __fprintf (stream=<optimized out>, format=<optimized out>) at ./stdio-common/fprintf.c:32
    #10 0x00005559299f2280 in live_timer (si=0x7f99a991a160, ctx=0x55592b498ba0) at consumer/consumer-timer.cpp:284
    #11 consumer_timer_thread (data=0x55592b498ba0) at consumer/consumer-timer.cpp:789
    #12 0x00007f99aaaa63ec in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:444
    #13 0x00007f99aab26a5c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
    

consumer_timer_signal_thread_qs is busy loop until the the the signal in question is no longer pending. In consumer-timer.cpp::consumer_timer_thread, the signals are waited on by sigwaitinfo, and there are some ordering rules (see man 7 signal):

  • for RT signals, by signal priority
  • for standard signals, an unspecified order
  • in Linux standard signals are delivered before RT signals

The signal priorities are their signal number as defined [[https://github.com/lttng/lttng-tools/blob/dff46b736afb7945aaf635de661b920dd0d01d7f/src/common/consumer/consumer-timer.hpp#L17|here]]. Lower values have a higher priority.

When the live session is configured with a low delay, eg --live=1, the live timer is firing relatively rapidly and can prevent the signal fired by the monitor timer from being handled. This in turn means that the session destruction may hang until (hopefully) the monitor signal is handled and no longer pending.

A few options were discussed:

  • changing signal ordering (eg. reducing the priority of live signal): while it would address this particular case, it then means a session with a low value for the monitor timer could equally starve out the handling of the live timer's signal
  • using sigpending in combination with sigwaitinfo: while it's possible to get all pending signals as a set, it seems to not be possible to retrieve the details (siginfo_t) which are required some some data (eg. pointers) to properly handle the signal
  • signalfd: very linux specific
  • timer_fd: very linux specific
  • custom scheduler: eg., https://github.com/nsec/badge-conf-2023/blob/nsec2023/lib/scheduling/scheduler.hpp
  • switch from using channel pointers in the sigval_ptr to passing IDs that can be used for lookups which may fail gracefully instead; that way the pending signals wouldn't have to all be cleared in consumer_timer_signal_thread_qs

It seems the most promising path is to implement a small deadline scheduler.


Related issues 1 (0 open1 closed)

Related to LTTng-tools - Bug #1403: Investigate why events are no longer recorded by the live view when `DELAYUS` in `tests/regression/tools/clear/test_ust` is reduced too lowResolvedKienan Stewart12/05/2023

Actions
Actions #1

Updated by Kienan Stewart about 1 year ago

  • Related to Bug #1403: Investigate why events are no longer recorded by the live view when `DELAYUS` in `tests/regression/tools/clear/test_ust` is reduced too low added
Actions #2

Updated by Kienan Stewart about 1 year ago

  • Description updated (diff)
Actions

Also available in: Atom PDF