Project

General

Profile

Actions

Bug #1405

closed

lttng (session and consumer daemons) freezes on channel creation after quickly destorying and creating a sessiond

Added by Erica Bugden 5 months ago. Updated 5 months ago.

Status:
Resolved
Priority:
Normal
Assignee:
-
Target version:
-
Start date:
12/12/2023
Due date:
% Done:

100%

Estimated time:

Description

When using lttng via a script, the session and consumer daemons appear to completely lock up when we request that a channel be created. The conditions for this lockup seem to be created by destroying a sessiond and then creating a sessiond in quick sequence.

Setup information

  • lttng-tools version: lttng (LTTng Trace Control) 2.14.0-pre - O-Beer - v2.12.0-rc1-1550-g6ca1df2f0-dirty (Note: I believe it's dirty because of added comments. No functional changes were made.)

Reproducer

Run the script below twice:

sudo killall lttng-sessiond
sudo lttng-sessiond --daemonize
lttng create my_session --snapshot --output /tmp/demo-output
lttng enable-channel --kernel my_channel

Normally, the channel creation should be successful, but when the lockup is triggered, the channel creation command will never complete. The race seems to occur when destroying a sessiond that has at least one channel and then quickly creating a sessiond. The conditions for the lockup occur in the first two commands (kill + create), but the lockup happens when the channel is created.

Reset

To reset after triggering the lockup, you need to use the big guns (SIGKILL) on both the sessiond and consumerd (they won't respond to SIGTERM), for example:

sudo killall -9 lttng-sessiond
sudo killall -9 lttng-consumerd

Hypothesis

We suspect that the bind-related error is the cause of the lockup (see annex for log and bug hunt narrative). We believe this error is caused by the incoming session daemon incorrectly detecting that the outgoing session daemon still exists. Possible scenario that could lead to the crash:

  • sessiondA is tearing down - The socket-related folder is not yet destroyed.
  • sessiondB incorrectly detects that there is no other sessiond and starts building up - It sees that the socket-related folder exists, so it does not create one.
  • sessiondA finishes tearing down - It destroys the socket-related folder. (The conditions for the crash are created.)
  • sessiondA attempts to create a channel - The socket-related folder doesn't exist --> boom

Desired behaviour

Assuming the hypothesis is correct, the expected behaviour is that if a 2nd session daemon attempts startup when an existing sessiond is still running, the new sessiond daemon would correctly detect this and abort startup. A new session daemon should not start up using the structures of an old sessiond.

Annex

Example output when reproducing

# Round 1 of script: No lockup
erica@luna:~/lttng-tools$ ./run-demo.sh 
Snapshot session my_session created.
Default snapshot output set to /tmp/demo-output
Every channel enabled for this session will be set to mmap output and default to overwrite mode.
kernel channel my_channel enabled for session my_session

# Round 2 of script: Lockup 
erica@luna:~/lttng-tools$ ./run-demo.sh 
Snapshot session my_session created.
Default snapshot output set to /tmp/demo-output
Every channel enabled for this session will be set to mmap output and default to overwrite mode.
# Channel enabling command never completes

Sessiond log after reproducing bug

Error: Unable to set group on /var/run/lttng/kconsumerd/error
PERROR - 17:03:06.117881285 [Main]: chown: No such file or directory (in set_permissions() at main.cpp:1010)
Warning: Cannot connect to error socket (is lttng-sessiond started?)
PERROR - 17:03:06.153464497 [9415/9420]: bind: No such file or directory (in lttcomm_create_unix_sock() at unix.cpp:136)
Error: Unable to create health check Unix socket
Error: Health error occurred in thread_manage_health_consumerd

Sessiond thread backtrace after freeze

Thread 8 (Thread 0x7f2ef5bf9600 (LWP 6199) "lttng-consumerd"):
#0  0x00007f2ef8d18dbf in __GI___poll (fds=0x7f2ef5bf80e0, nfds=2, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
#1  0x00005565e88bf29f in lttng_consumer_poll_socket (consumer_sockpoll=0x7f2ef5bf80e0) at consumer/consumer.cpp:1218
#2  0x00005565e88c7450 in consumer_thread_sessiond_poll (data=0x5565ea19dd10) at consumer/consumer.cpp:3244
#3  0x00007f2ef8c94ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#4  0x00007f2ef8d26a40 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

Thread 7 (Thread 0x7f2ef63fa600 (LWP 6198) "lttng-consumerd"):
#0  0x00007f2ef8d18dbf in __GI___poll (fds=0x7f2ee0000b90, nfds=2, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
#1  0x00005565e88c43b3 in consumer_thread_data_poll (data=0x5565ea19dd10) at consumer/consumer.cpp:2586
#2  0x00007f2ef8c94ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#3  0x00007f2ef8d26a40 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

Thread 6 (Thread 0x7f2ef6bfb600 (LWP 6197) "lttng-consumerd"):
#0  0x00007f2ef8d2601e in epoll_wait (epfd=28, events=0x7f2eec000b70, maxevents=1, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1  0x00005565e890a79a in compat_epoll_wait (events=0x7f2ef6bfa0d0, timeout=-1, interruptible=false) at compat/poll.cpp:277
#2  0x00005565e88c3057 in consumer_thread_metadata_poll (data=0x5565ea19dd10) at consumer/consumer.cpp:2321
#3  0x00007f2ef8c94ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#4  0x00007f2ef8d26a40 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

Thread 5 (Thread 0x7f2ef73fc600 (LWP 6196) "lttng-consumerd"):
#0  0x00007f2ef8d2601e in epoll_wait (epfd=29, events=0x7f2ee8001700, maxevents=1, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1  0x00005565e890a79a in compat_epoll_wait (events=0x7f2ef73fb1a0, timeout=-1, interruptible=false) at compat/poll.cpp:277
#2  0x00005565e88c5d7a in consumer_thread_channel_poll (data=0x5565ea19dd10) at consumer/consumer.cpp:2927
#3  0x00007f2ef8c94ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#4  0x00007f2ef8d26a40 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

Thread 4 (Thread 0x7f2ef7bfd600 (LWP 6195) "lttng-consumerd"):
#0  0x00007f2ef8c4321a in __GI___sigtimedwait (set=0x7f2ef7bfbff0, info=0x7f2ef7bfc070, timeout=0x0) at ../sysdeps/unix/sysv/linux/sigtimedwait.c:61
#1  0x00005565e88d5432 in consumer_timer_thread (data=0x5565ea19dd10) at consumer/consumer-timer.cpp:768
#2  0x00007f2ef8c94ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#3  0x00007f2ef8d26a40 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

Thread 3 (Thread 0x7f2ef83fe600 (LWP 6180) "lttng-consumerd"):
#0  0x00007f2ef8d2601e in epoll_wait (epfd=27, events=0x7f2ef00012c0, maxevents=2, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1  0x00005565e890a79a in compat_epoll_wait (events=0x7f2ef83fd0d0, timeout=-1, interruptible=false) at compat/poll.cpp:277
#2  0x00005565e88ba825 in thread_manage_health_consumerd (data=0x0) at health-consumerd.cpp:232
#3  0x00007f2ef8c94ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#4  0x00007f2ef8d26a40 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

Thread 2 (Thread 0x7f2ef8bff600 (LWP 6179) "lttng-consumerd"):
#0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
#1  0x00007f2ef9291f24 in futex (val3=0, uaddr2=0x0, timeout=0x0, val=-1, op=0, uaddr=0x5565ea197e60) at ../include/urcu/futex.h:67
#2  futex_async (timeout=0x0, uaddr2=0x0, val3=0, val=-1, op=0, uaddr=0x5565ea197e60) at ../include/urcu/futex.h:99
#3  futex_wait (futex=futex@entry=0x5565ea197e60) at workqueue.c:121
#4  0x00007f2ef9292346 in workqueue_thread (arg=0x5565ea197e20) at workqueue.c:232
#5  0x00007f2ef8c94ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#6  0x00007f2ef8d26a40 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

Thread 1 (Thread 0x7f2ef9232300 (LWP 6176) "lttng-consumerd"):
#0  __futex_abstimed_wait_common64 (private=128, cancel=true, abstime=0x0, op=265, expected=6199, futex_word=0x7f2ef5bf98d0) at ./nptl/futex-internal.c:57
#1  __futex_abstimed_wait_common (cancel=true, private=128, abstime=0x0, clockid=0, expected=6199, futex_word=0x7f2ef5bf98d0) at ./nptl/futex-internal.c:87
#2  __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0x7f2ef5bf98d0, expected=6199, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=private@entry=128) at ./nptl/futex-internal.c:139
#3  0x00007f2ef8c96624 in __pthread_clockjoin_ex (threadid=139839668196864, thread_return=0x7ffca4bdfcd0, clockid=0, abstime=0x0, block=<optimized out>) at ./nptl/pthread_join_common.c:105
#4  0x00005565e88b937b in main (argc=9, argv=0x7ffca4bdff88) at lttng-consumerd.cpp:613

Bug hunt narrative

Main sessiond thread

After triggering the lockup as described in the reproducer section above, with gdb we can see in the backtrace that the health check thread wasn't created successfully, the thread supposed to create it is still waiting for a response.

Thread 3 (Thread 0x7f2ef83fe600 (LWP 6180) "lttng-consumerd"):
#0  0x00007f2ef8d2601e in epoll_wait (epfd=27, events=0x7f2ef00012c0, maxevents=2, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1  0x00005565e890a79a in compat_epoll_wait (events=0x7f2ef83fd0d0, timeout=-1, interruptible=false) at compat/poll.cpp:277
#2  0x00005565e88ba825 in thread_manage_health_consumerd (data=0x0) at health-consumerd.cpp:232
#3  0x00007f2ef8c94ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#4  0x00007f2ef8d26a40 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

The above thread launches the health check thread creation:

    /* Create thread to manage the client socket */
    ret = pthread_create(&health_thread,
                 default_pthread_attr(),
                 thread_manage_health_consumerd,
                 (void *) nullptr);

and then is stuck waiting here:

    /*
     * Wait for health thread to be initialized before letting the
     * sessiond thread reply to the sessiond that we are ready.
     */
    while (uatomic_read(&lttng_consumer_ready)) {
        usleep(100000);
    }

Health check thread

So, we know the health check thread (thread_manage_health_consumerd) is dying before it's able to set uatomic_dec(&lttng_consumer_ready); at line 225. The sessiond logs give us some hints as to why. These are the lines that stood out the most to us:

PERROR - 17:03:06.153464497 [9415/9420]: bind: No such file or directory (in lttcomm_create_unix_sock() at unix.cpp:136)
Error: Unable to create health check Unix socket
Error: Health error occurred in thread_manage_health_consumerd

The socket creation lttcomm_create_unix_sock() 🧦 in the health thread fails:

    /* Create unix socket */
    sock = lttcomm_create_unix_sock(health_unix_sock_path);
    if (sock < 0) {
        ERR("Unable to create health check Unix socket");
        err = -1;
        goto error;
    }

And the most likely candidate for what failed while trying to create the socket is the bind():

    /* Unlink the old file if present */
    (void) unlink(pathname);
    ret = bind(fd, (struct sockaddr *) &s_un, sizeof(s_un));
    if (ret < 0) {
        PERROR("bind");
        goto error;
    }

This is also consistent with the file and line number in the bind error in the log: "bind: No such file or directory (in lttcomm_create_unix_sock() at unix.cpp:136". My understanding is that "No such file or directory" corresponds to ENOENT (lttng associates error numbers with text on its end). In bind's man page, this is what ENOENT means:

       ENOENT A component in the directory prefix of the socket pathname
              does not exist

So we think when it tries to create the socket, the parent folder or some other part of the path is missing. We hypothesize that this is the sequence of events that leads to the crash:

  • sessiondA is tearing down - The socket-related folder is not yet destroyed.
  • sessiondB incorrectly detects that there is no other sessiond and starts building up - It sees that the socket-related folder exists, so it does not create one.
  • sessiondA finishes tearing down - It destroys the socket-related folder. (The conditions for the crash are created.)
  • sessiondA attempts to create a channel - The socket-related folder doesn't exist --> boom
Actions #1

Updated by Erica Bugden 5 months ago

  • Description updated (diff)
Actions #2

Updated by Jérémie Galarneau 5 months ago

  • Status changed from New to Feedback

I have submitted a tentative fix:
https://review.lttng.org/c/lttng-tools/+/11559/

The whole series (3 changes) is related, but this is the crux of the changes.

Let me know if it fixes the problem on your end!

Actions #3

Updated by Jérémie Galarneau 5 months ago

  • Status changed from Feedback to Resolved
  • % Done changed from 0 to 100
Actions

Also available in: Atom PDF