Bug #1405
closedlttng (session and consumer daemons) freezes on channel creation after quickly destorying and creating a sessiond
100%
Description
When using lttng via a script, the session and consumer daemons appear to completely lock up when we request that a channel be created. The conditions for this lockup seem to be created by destroying a sessiond and then creating a sessiond in quick sequence.
Setup information¶
- lttng-tools version: lttng (LTTng Trace Control) 2.14.0-pre - O-Beer - v2.12.0-rc1-1550-g6ca1df2f0-dirty (Note: I believe it's dirty because of added comments. No functional changes were made.)
Reproducer¶
Run the script below twice:
sudo killall lttng-sessiond sudo lttng-sessiond --daemonize lttng create my_session --snapshot --output /tmp/demo-output lttng enable-channel --kernel my_channel
Normally, the channel creation should be successful, but when the lockup is triggered, the channel creation command will never complete. The race seems to occur when destroying a sessiond that has at least one channel and then quickly creating a sessiond. The conditions for the lockup occur in the first two commands (kill + create), but the lockup happens when the channel is created.
Reset¶
To reset after triggering the lockup, you need to use the big guns (SIGKILL) on both the sessiond and consumerd (they won't respond to SIGTERM), for example:
sudo killall -9 lttng-sessiond sudo killall -9 lttng-consumerd
Hypothesis¶
We suspect that the bind-related error is the cause of the lockup (see annex for log and bug hunt narrative). We believe this error is caused by the incoming session daemon incorrectly detecting that the outgoing session daemon still exists. Possible scenario that could lead to the crash:
- sessiondA is tearing down - The socket-related folder is not yet destroyed.
- sessiondB incorrectly detects that there is no other sessiond and starts building up - It sees that the socket-related folder exists, so it does not create one.
- sessiondA finishes tearing down - It destroys the socket-related folder. (The conditions for the crash are created.)
- sessiondA attempts to create a channel - The socket-related folder doesn't exist --> boom
Desired behaviour¶
Assuming the hypothesis is correct, the expected behaviour is that if a 2nd session daemon attempts startup when an existing sessiond is still running, the new sessiond daemon would correctly detect this and abort startup. A new session daemon should not start up using the structures of an old sessiond.
Annex¶
Example output when reproducing¶
# Round 1 of script: No lockup erica@luna:~/lttng-tools$ ./run-demo.sh Snapshot session my_session created. Default snapshot output set to /tmp/demo-output Every channel enabled for this session will be set to mmap output and default to overwrite mode. kernel channel my_channel enabled for session my_session # Round 2 of script: Lockup erica@luna:~/lttng-tools$ ./run-demo.sh Snapshot session my_session created. Default snapshot output set to /tmp/demo-output Every channel enabled for this session will be set to mmap output and default to overwrite mode. # Channel enabling command never completes
Sessiond log after reproducing bug¶
Error: Unable to set group on /var/run/lttng/kconsumerd/error PERROR - 17:03:06.117881285 [Main]: chown: No such file or directory (in set_permissions() at main.cpp:1010) Warning: Cannot connect to error socket (is lttng-sessiond started?) PERROR - 17:03:06.153464497 [9415/9420]: bind: No such file or directory (in lttcomm_create_unix_sock() at unix.cpp:136) Error: Unable to create health check Unix socket Error: Health error occurred in thread_manage_health_consumerd
Sessiond thread backtrace after freeze¶
Thread 8 (Thread 0x7f2ef5bf9600 (LWP 6199) "lttng-consumerd"): #0 0x00007f2ef8d18dbf in __GI___poll (fds=0x7f2ef5bf80e0, nfds=2, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29 #1 0x00005565e88bf29f in lttng_consumer_poll_socket (consumer_sockpoll=0x7f2ef5bf80e0) at consumer/consumer.cpp:1218 #2 0x00005565e88c7450 in consumer_thread_sessiond_poll (data=0x5565ea19dd10) at consumer/consumer.cpp:3244 #3 0x00007f2ef8c94ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442 #4 0x00007f2ef8d26a40 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81 Thread 7 (Thread 0x7f2ef63fa600 (LWP 6198) "lttng-consumerd"): #0 0x00007f2ef8d18dbf in __GI___poll (fds=0x7f2ee0000b90, nfds=2, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29 #1 0x00005565e88c43b3 in consumer_thread_data_poll (data=0x5565ea19dd10) at consumer/consumer.cpp:2586 #2 0x00007f2ef8c94ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442 #3 0x00007f2ef8d26a40 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81 Thread 6 (Thread 0x7f2ef6bfb600 (LWP 6197) "lttng-consumerd"): #0 0x00007f2ef8d2601e in epoll_wait (epfd=28, events=0x7f2eec000b70, maxevents=1, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30 #1 0x00005565e890a79a in compat_epoll_wait (events=0x7f2ef6bfa0d0, timeout=-1, interruptible=false) at compat/poll.cpp:277 #2 0x00005565e88c3057 in consumer_thread_metadata_poll (data=0x5565ea19dd10) at consumer/consumer.cpp:2321 #3 0x00007f2ef8c94ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442 #4 0x00007f2ef8d26a40 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81 Thread 5 (Thread 0x7f2ef73fc600 (LWP 6196) "lttng-consumerd"): #0 0x00007f2ef8d2601e in epoll_wait (epfd=29, events=0x7f2ee8001700, maxevents=1, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30 #1 0x00005565e890a79a in compat_epoll_wait (events=0x7f2ef73fb1a0, timeout=-1, interruptible=false) at compat/poll.cpp:277 #2 0x00005565e88c5d7a in consumer_thread_channel_poll (data=0x5565ea19dd10) at consumer/consumer.cpp:2927 #3 0x00007f2ef8c94ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442 #4 0x00007f2ef8d26a40 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81 Thread 4 (Thread 0x7f2ef7bfd600 (LWP 6195) "lttng-consumerd"): #0 0x00007f2ef8c4321a in __GI___sigtimedwait (set=0x7f2ef7bfbff0, info=0x7f2ef7bfc070, timeout=0x0) at ../sysdeps/unix/sysv/linux/sigtimedwait.c:61 #1 0x00005565e88d5432 in consumer_timer_thread (data=0x5565ea19dd10) at consumer/consumer-timer.cpp:768 #2 0x00007f2ef8c94ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442 #3 0x00007f2ef8d26a40 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81 Thread 3 (Thread 0x7f2ef83fe600 (LWP 6180) "lttng-consumerd"): #0 0x00007f2ef8d2601e in epoll_wait (epfd=27, events=0x7f2ef00012c0, maxevents=2, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30 #1 0x00005565e890a79a in compat_epoll_wait (events=0x7f2ef83fd0d0, timeout=-1, interruptible=false) at compat/poll.cpp:277 #2 0x00005565e88ba825 in thread_manage_health_consumerd (data=0x0) at health-consumerd.cpp:232 #3 0x00007f2ef8c94ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442 #4 0x00007f2ef8d26a40 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81 Thread 2 (Thread 0x7f2ef8bff600 (LWP 6179) "lttng-consumerd"): #0 syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38 #1 0x00007f2ef9291f24 in futex (val3=0, uaddr2=0x0, timeout=0x0, val=-1, op=0, uaddr=0x5565ea197e60) at ../include/urcu/futex.h:67 #2 futex_async (timeout=0x0, uaddr2=0x0, val3=0, val=-1, op=0, uaddr=0x5565ea197e60) at ../include/urcu/futex.h:99 #3 futex_wait (futex=futex@entry=0x5565ea197e60) at workqueue.c:121 #4 0x00007f2ef9292346 in workqueue_thread (arg=0x5565ea197e20) at workqueue.c:232 #5 0x00007f2ef8c94ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442 #6 0x00007f2ef8d26a40 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81 Thread 1 (Thread 0x7f2ef9232300 (LWP 6176) "lttng-consumerd"): #0 __futex_abstimed_wait_common64 (private=128, cancel=true, abstime=0x0, op=265, expected=6199, futex_word=0x7f2ef5bf98d0) at ./nptl/futex-internal.c:57 #1 __futex_abstimed_wait_common (cancel=true, private=128, abstime=0x0, clockid=0, expected=6199, futex_word=0x7f2ef5bf98d0) at ./nptl/futex-internal.c:87 #2 __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0x7f2ef5bf98d0, expected=6199, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=private@entry=128) at ./nptl/futex-internal.c:139 #3 0x00007f2ef8c96624 in __pthread_clockjoin_ex (threadid=139839668196864, thread_return=0x7ffca4bdfcd0, clockid=0, abstime=0x0, block=<optimized out>) at ./nptl/pthread_join_common.c:105 #4 0x00005565e88b937b in main (argc=9, argv=0x7ffca4bdff88) at lttng-consumerd.cpp:613
Bug hunt narrative¶
Main sessiond thread¶
After triggering the lockup as described in the reproducer section above, with gdb we can see in the backtrace that the health check thread wasn't created successfully, the thread supposed to create it is still waiting for a response.
Thread 3 (Thread 0x7f2ef83fe600 (LWP 6180) "lttng-consumerd"): #0 0x00007f2ef8d2601e in epoll_wait (epfd=27, events=0x7f2ef00012c0, maxevents=2, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30 #1 0x00005565e890a79a in compat_epoll_wait (events=0x7f2ef83fd0d0, timeout=-1, interruptible=false) at compat/poll.cpp:277 #2 0x00005565e88ba825 in thread_manage_health_consumerd (data=0x0) at health-consumerd.cpp:232 #3 0x00007f2ef8c94ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442 #4 0x00007f2ef8d26a40 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
The above thread launches the health check thread creation:
/* Create thread to manage the client socket */ ret = pthread_create(&health_thread, default_pthread_attr(), thread_manage_health_consumerd, (void *) nullptr);
and then is stuck waiting here:
/* * Wait for health thread to be initialized before letting the * sessiond thread reply to the sessiond that we are ready. */ while (uatomic_read(<tng_consumer_ready)) { usleep(100000); }
Health check thread¶
So, we know the health check thread (thread_manage_health_consumerd) is dying before it's able to set uatomic_dec(<tng_consumer_ready);
at line 225. The sessiond logs give us some hints as to why. These are the lines that stood out the most to us:
PERROR - 17:03:06.153464497 [9415/9420]: bind: No such file or directory (in lttcomm_create_unix_sock() at unix.cpp:136) Error: Unable to create health check Unix socket Error: Health error occurred in thread_manage_health_consumerd
The socket creation lttcomm_create_unix_sock()
🧦 in the health thread fails:
/* Create unix socket */ sock = lttcomm_create_unix_sock(health_unix_sock_path); if (sock < 0) { ERR("Unable to create health check Unix socket"); err = -1; goto error; }
And the most likely candidate for what failed while trying to create the socket is the bind()
:
/* Unlink the old file if present */ (void) unlink(pathname); ret = bind(fd, (struct sockaddr *) &s_un, sizeof(s_un)); if (ret < 0) { PERROR("bind"); goto error; }
This is also consistent with the file and line number in the bind error in the log: "bind: No such file or directory (in lttcomm_create_unix_sock() at unix.cpp:136". My understanding is that "No such file or directory" corresponds to ENOENT (lttng associates error numbers with text on its end). In bind's man page, this is what ENOENT means:
ENOENT A component in the directory prefix of the socket pathname does not exist
So we think when it tries to create the socket, the parent folder or some other part of the path is missing. We hypothesize that this is the sequence of events that leads to the crash:
- sessiondA is tearing down - The socket-related folder is not yet destroyed.
- sessiondB incorrectly detects that there is no other sessiond and starts building up - It sees that the socket-related folder exists, so it does not create one.
- sessiondA finishes tearing down - It destroys the socket-related folder. (The conditions for the crash are created.)
- sessiondA attempts to create a channel - The socket-related folder doesn't exist --> boom
Updated by Jérémie Galarneau 5 months ago
- Status changed from New to Feedback
I have submitted a tentative fix:
https://review.lttng.org/c/lttng-tools/+/11559/
The whole series (3 changes) is related, but this is the crux of the changes.
Let me know if it fixes the problem on your end!
Updated by Jérémie Galarneau 5 months ago
- Status changed from Feedback to Resolved
- % Done changed from 0 to 100
Applied in changeset lttng-tools|e560c525031d7aba481e0b8655c48a14bade24c2.