Project

General

Profile

Bug #1405

Updated by Erica Bugden 5 months ago

When using lttng via a script, the session and consumer daemons appear to completely lock up when we request that a channel be created. The conditions for this lockup seem to be created by destroying a sessiond and then creating a sessiond in quick sequence. 

 h1. Setup information 

 * lttng-tools version: lttng (LTTng Trace Control) 2.14.0-pre - O-Beer - v2.12.0-rc1-1550-g6ca1df2f0-dirty (Note: I believe it's dirty because of added comments. No functional changes were made.) 

 h1. Reproducer  

 Run the script below twice: 

 <pre> 
 sudo killall lttng-sessiond 
 sudo lttng-sessiond --daemonize 
 lttng create my_session --snapshot --output /tmp/demo-output 
 lttng enable-channel --kernel my_channel 
 </pre> 

 Normally, the channel creation should be successful, but when the lockup is triggered, the channel creation command will never complete. The race seems to occur when destroying a sessiond that has at least one channel and then quickly creating a sessiond. The conditions for the lockup occur in the first two commands (kill + create), but the lockup happens when the channel is created. 

 h3. Reset 

 To reset after triggering the lockup, you need to use the big guns (SIGKILL) on both the sessiond and consumerd (they won't respond to SIGTERM), for example: 

 <pre> 
 sudo killall -9 lttng-sessiond 
 sudo killall -9 lttng-consumerd 
 </pre> 

 h1. Hypothesis 

 We suspect that the bind-related error is the cause of the lockup (see annex for log and bug hunt narrative). We believe this error is caused by the incoming session daemon incorrectly detecting that the outgoing session daemon still exists. Possible scenario that could lead to the crash: 

 * sessiondA is tearing down - The socket-related folder is not yet destroyed. 
 * sessiondB incorrectly detects that there is no other sessiond and starts building up - It sees that the socket-related folder exists, so it does not create one. 
 * sessiondA finishes tearing down - It destroys the socket-related folder. (The conditions for the crash are created.) 
 * sessiondA attempts to create a channel - The socket-related folder doesn't exist --> boom 


 h3. Desired behaviour 

 Assuming the hypothesis is correct, the expected behaviour is that if a 2nd session daemon attempts startup when an existing sessiond is still running, the new sessiond daemon would correctly detect this and abort startup. A new session daemon should not start up using the structures of an old sessiond. 

 h1. Annex 
 
 h3. Example output when reproducing 

 <pre> 
 # Round 1 of script: No lockup 
 erica@luna:~/lttng-tools$ ./run-demo.sh  
 Snapshot session my_session created. 
 Default snapshot output set to /tmp/demo-output 
 Every channel enabled for this session will be set to mmap output and default to overwrite mode. 
 kernel channel my_channel enabled for session my_session 

 # Round 2 of script: Lockup  
 erica@luna:~/lttng-tools$ ./run-demo.sh  
 Snapshot session my_session created. 
 Default snapshot output set to /tmp/demo-output 
 Every channel enabled for this session will be set to mmap output and default to overwrite mode. 
 # Channel enabling command never completes 
 </pre> 

 h3. Sessiond log after reproducing bug 

 <pre> 
 Error: Unable to set group on /var/run/lttng/kconsumerd/error 
 PERROR - 17:03:06.117881285 [Main]: chown: No such file or directory (in set_permissions() at main.cpp:1010) 
 Warning: Cannot connect to error socket (is lttng-sessiond started?) 
 PERROR - 17:03:06.153464497 [9415/9420]: bind: No such file or directory (in lttcomm_create_unix_sock() at unix.cpp:136) 
 Error: Unable to create health check Unix socket 
 Error: Health error occurred in thread_manage_health_consumerd 
 </pre> 

 h3. Sessiond thread backtrace after freeze 

 <pre> 
 Thread 8 (Thread 0x7f2ef5bf9600 (LWP 6199) "lttng-consumerd"): 
 #0    0x00007f2ef8d18dbf in __GI___poll (fds=0x7f2ef5bf80e0, nfds=2, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29 
 #1    0x00005565e88bf29f in lttng_consumer_poll_socket (consumer_sockpoll=0x7f2ef5bf80e0) at consumer/consumer.cpp:1218 
 #2    0x00005565e88c7450 in consumer_thread_sessiond_poll (data=0x5565ea19dd10) at consumer/consumer.cpp:3244 
 #3    0x00007f2ef8c94ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442 
 #4    0x00007f2ef8d26a40 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81 

 Thread 7 (Thread 0x7f2ef63fa600 (LWP 6198) "lttng-consumerd"): 
 #0    0x00007f2ef8d18dbf in __GI___poll (fds=0x7f2ee0000b90, nfds=2, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29 
 #1    0x00005565e88c43b3 in consumer_thread_data_poll (data=0x5565ea19dd10) at consumer/consumer.cpp:2586 
 #2    0x00007f2ef8c94ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442 
 #3    0x00007f2ef8d26a40 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81 

 Thread 6 (Thread 0x7f2ef6bfb600 (LWP 6197) "lttng-consumerd"): 
 #0    0x00007f2ef8d2601e in epoll_wait (epfd=28, events=0x7f2eec000b70, maxevents=1, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30 
 #1    0x00005565e890a79a in compat_epoll_wait (events=0x7f2ef6bfa0d0, timeout=-1, interruptible=false) at compat/poll.cpp:277 
 #2    0x00005565e88c3057 in consumer_thread_metadata_poll (data=0x5565ea19dd10) at consumer/consumer.cpp:2321 
 #3    0x00007f2ef8c94ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442 
 #4    0x00007f2ef8d26a40 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81 

 Thread 5 (Thread 0x7f2ef73fc600 (LWP 6196) "lttng-consumerd"): 
 #0    0x00007f2ef8d2601e in epoll_wait (epfd=29, events=0x7f2ee8001700, maxevents=1, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30 
 #1    0x00005565e890a79a in compat_epoll_wait (events=0x7f2ef73fb1a0, timeout=-1, interruptible=false) at compat/poll.cpp:277 
 #2    0x00005565e88c5d7a in consumer_thread_channel_poll (data=0x5565ea19dd10) at consumer/consumer.cpp:2927 
 #3    0x00007f2ef8c94ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442 
 #4    0x00007f2ef8d26a40 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81 

 Thread 4 (Thread 0x7f2ef7bfd600 (LWP 6195) "lttng-consumerd"): 
 #0    0x00007f2ef8c4321a in __GI___sigtimedwait (set=0x7f2ef7bfbff0, info=0x7f2ef7bfc070, timeout=0x0) at ../sysdeps/unix/sysv/linux/sigtimedwait.c:61 
 #1    0x00005565e88d5432 in consumer_timer_thread (data=0x5565ea19dd10) at consumer/consumer-timer.cpp:768 
 #2    0x00007f2ef8c94ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442 
 #3    0x00007f2ef8d26a40 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81 

 Thread 3 (Thread 0x7f2ef83fe600 (LWP 6180) "lttng-consumerd"): 
 #0    0x00007f2ef8d2601e in epoll_wait (epfd=27, events=0x7f2ef00012c0, maxevents=2, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30 
 #1    0x00005565e890a79a in compat_epoll_wait (events=0x7f2ef83fd0d0, timeout=-1, interruptible=false) at compat/poll.cpp:277 
 #2    0x00005565e88ba825 in thread_manage_health_consumerd (data=0x0) at health-consumerd.cpp:232 
 #3    0x00007f2ef8c94ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442 
 #4    0x00007f2ef8d26a40 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81 

 Thread 2 (Thread 0x7f2ef8bff600 (LWP 6179) "lttng-consumerd"): 
 #0    syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38 
 #1    0x00007f2ef9291f24 in futex (val3=0, uaddr2=0x0, timeout=0x0, val=-1, op=0, uaddr=0x5565ea197e60) at ../include/urcu/futex.h:67 
 #2    futex_async (timeout=0x0, uaddr2=0x0, val3=0, val=-1, op=0, uaddr=0x5565ea197e60) at ../include/urcu/futex.h:99 
 #3    futex_wait (futex=futex@entry=0x5565ea197e60) at workqueue.c:121 
 #4    0x00007f2ef9292346 in workqueue_thread (arg=0x5565ea197e20) at workqueue.c:232 
 #5    0x00007f2ef8c94ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442 
 #6    0x00007f2ef8d26a40 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81 

 Thread 1 (Thread 0x7f2ef9232300 (LWP 6176) "lttng-consumerd"): 
 #0    __futex_abstimed_wait_common64 (private=128, cancel=true, abstime=0x0, op=265, expected=6199, futex_word=0x7f2ef5bf98d0) at ./nptl/futex-internal.c:57 
 #1    __futex_abstimed_wait_common (cancel=true, private=128, abstime=0x0, clockid=0, expected=6199, futex_word=0x7f2ef5bf98d0) at ./nptl/futex-internal.c:87 
 #2    __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0x7f2ef5bf98d0, expected=6199, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=private@entry=128) at ./nptl/futex-internal.c:139 
 #3    0x00007f2ef8c96624 in __pthread_clockjoin_ex (threadid=139839668196864, thread_return=0x7ffca4bdfcd0, clockid=0, abstime=0x0, block=<optimized out>) at ./nptl/pthread_join_common.c:105 
 #4    0x00005565e88b937b in main (argc=9, argv=0x7ffca4bdff88) at lttng-consumerd.cpp:613 
 </pre> 

 h3. Bug hunt narrative 

 h4. Main sessiond thread 

 After triggering the lockup as described in the reproducer section above, with gdb we can see in the backtrace that the health check thread wasn't created successfully, the thread supposed to create it is still waiting for a response. 

 <pre> 
 Thread 3 (Thread 0x7f2ef83fe600 (LWP 6180) "lttng-consumerd"): 
 #0    0x00007f2ef8d2601e in epoll_wait (epfd=27, events=0x7f2ef00012c0, maxevents=2, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30 
 #1    0x00005565e890a79a in compat_epoll_wait (events=0x7f2ef83fd0d0, timeout=-1, interruptible=false) at compat/poll.cpp:277 
 #2    0x00005565e88ba825 in thread_manage_health_consumerd (data=0x0) at health-consumerd.cpp:232 
 #3    0x00007f2ef8c94ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442 
 #4    0x00007f2ef8d26a40 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81 
 </pre> 

 The above thread launches the health check thread creation: 

 <pre> 
	 /* Create thread to manage the client socket */ 
	 ret = pthread_create(&health_thread, 
			      default_pthread_attr(), 
			      thread_manage_health_consumerd, 
			      (void *) nullptr); 
 </pre> 

 and then is stuck waiting here: 

 <pre> 
	 /* 
	  * Wait for health thread to be initialized before letting the 
	  * sessiond thread reply to the sessiond that we are ready. 
	  */ 
	 while (uatomic_read(&lttng_consumer_ready)) { 
		 usleep(100000); 
	 } 
 </pre> 

 h4. h3. Health check thread 

 So, we know the health check thread (thread_manage_health_consumerd) is dying before it's able to set @uatomic_dec(&lttng_consumer_ready);@ at line 225. The sessiond logs give us some hints as to why. These are the lines that stood out the most to us: 

 <pre> 
 PERROR - 17:03:06.153464497 [9415/9420]: bind: No such file or directory (in lttcomm_create_unix_sock() at unix.cpp:136) 
 Error: Unable to create health check Unix socket 
 Error: Health error occurred in thread_manage_health_consumerd 
 </pre> 

 The socket creation @lttcomm_create_unix_sock()@ 🧦 in the health thread fails: 

 <pre> 
	 /* Create unix socket */ 
	 sock = lttcomm_create_unix_sock(health_unix_sock_path); 
	 if (sock < 0) { 
		 ERR("Unable to create health check Unix socket"); 
		 err = -1; 
		 goto error; 
	 } 
 </pre> 

 And the most likely candidate for what failed while trying to create the socket is the @bind()@: 

 <pre> 
	 /* Unlink the old file if present */ 
	 (void) unlink(pathname); 
	 ret = bind(fd, (struct sockaddr *) &s_un, sizeof(s_un)); 
	 if (ret < 0) { 
		 PERROR("bind"); 
		 goto error; 
	 } 
 </pre> 

 This is also consistent with the file and line number in the bind error in the log: "bind: No such file or directory (in lttcomm_create_unix_sock() at unix.cpp:136". My understanding is that "No such file or directory" corresponds to ENOENT (lttng associates error numbers with text on its end). In bind's man page, this is what ENOENT means: 

 <pre> 
        ENOENT A component in the directory prefix of the socket pathname 
               does not exist 
 </pre> 

 So we think when it tries to create the socket, the parent folder or some other part of the path is missing. We hypothesize that this is the sequence of events that leads to the crash: 

 * sessiondA is tearing down - The socket-related folder is not yet destroyed. 
 * sessiondB incorrectly detects that there is no other sessiond and starts building up - It sees that the socket-related folder exists, so it does not create one. 
 * sessiondA finishes tearing down - It destroys the socket-related folder. (The conditions for the crash are created.) 
 * sessiondA attempts to create a channel - The socket-related folder doesn't exist --> boom 

Back