Bug #546
lttng 2.2.0-rc2: SessionD hang after receiving lttng_enable_channel() API call
Status: Closed
% Done: 100%
Description
Commit used:
============
babeltrace : 9eaf254 Version 1.0.3
tools      : 094d169 (HEAD, origin/master, origin/HEAD) Fix: dereference after NULL check
ust        : 996aead (HEAD, origin/master, origin/HEAD) Add parameter -f to rm in Makefile clean target
userspace  : 264716f (HEAD, origin/stable-0.7, stable-0.7) Fix: Use a filled signal mask to disable all signals

Problem Description:
====================
* SessionD hangs (does not respond) during stability test activity.
* The hangs start to happen after about 30 min of the stability run.
* Our log shows that every hang happens when the lttng_enable_channel() API is invoked:
:
86520 09:30:23 05/29/2013 IN TraceP_SC-1 "DBG: activateSession: invoke lttng_create_session for session STAB001_1 with url net://192.168.0.1:54176:53483/./.
86521 09:30:23 05/29/2013 IN TraceP_SC-1 "DBG: activateSession: Invoking lttng_create_handle for session STAB001_1
86522 09:30:23 05/29/2013 IN TraceP_SC-1 "DBG: activateSession: Invoking lttng_set_default_attr for session STAB001_1.
86523 09:30:23 05/29/2013 IN TraceP_SC-1 "DBG: activateSession: Invoking lttng_enable_channel for session STAB001_1
:
#--- No reply is received for this. After about a 60 sec (+- 5) timeout, our process
#--- does "kill -9" on sessiond and consumerd.
# We have modified the code to "kill -SIGABRT" sessiond and consumerD instead so that
# coredumps can be generated. The corresponding gdb printouts are attached with
# this report.

Is problem reproducible?
========================
* Yes.

How to reproduce (if reproducible):
===================================
* Our stability test consists of 4 users. Each user has a different set of trace commands (such as create session, activate session, stop session, etc.). Each user then executes its set of commands through multiple iterations. All sessions are created using streaming and per-UID buffers, tracing userspace only. After about a 30 min run (each user has finished a few iterations), we start noticing that sessionD hangs. Our health check process does "kill -9" on sessiond and consumerD and restarts new ones. After a while, sessionD hangs again with the same kind of symptom. During an overnight run, this happens many times. To gather more data, we have modified our code to "kill -SIGABRT" sessiond and consumerd so that coredumps can be obtained when the hang occurs. The corresponding gdb printouts are included in this report.

Any other information:
======================
-
Updated by David Goulet over 11 years ago
It seems that the sessiond is unable to reach the relayd for some reason. Two threads are stuck: one on connect() to the relayd and one on recvmsg() from it.
Was the relayd killed or malfunctioning during (or before) this hang?
Updated by Tan le tran over 11 years ago
- File gdb_printous.log added
We have noticed that the corresponding relayd port is in SYN_SENT state when sessionD is hanging. So, we have modified our code to do the following when a hang has been detected:

1)_ Run "netstat -etanp".
2)_ Run "ps auxwwwf".
3)_ Search for "SYN_SENT" and get the corresponding port number and, from there, the corresponding relayd PID.
4)_ Send "kill -SIGABRT" to the relayd PID; do the same for sessionD and consumerD.

The corresponding gdb printouts are attached in this report.

The following shows the netstat printout when the hanging sessionD was detected. Note that we have 3 sessions ongoing; each session has its own relayd.

Fri May 31 01:59:25 EDT 2013
netstat -etanp |grep lttng
tcp   0     0 192.168.0.2:50925 0.0.0.0:*         LISTEN      0 6515877 2587/lttng-relayd
tcp   0     0 192.168.0.2:55804 0.0.0.0:*         LISTEN      0 9301056 10773/lttng-relayd
tcp   0     0 192.168.0.2:58751 0.0.0.0:*         LISTEN      0 9301057 10773/lttng-relayd   <------------
tcp   0     0 192.168.0.2:50946 0.0.0.0:*         LISTEN      0 9286713 10250/lttng-relayd
tcp   0     0 192.168.0.2:57123 0.0.0.0:*         LISTEN      0 9286714 10250/lttng-relayd
tcp   0     0 192.168.0.2:56328 0.0.0.0:*         LISTEN      0 6515876 2587/lttng-relayd
tcp   0     0 192.168.0.2:57123 192.168.0.2:54737 ESTABLISHED 0 9267909 10250/lttng-relayd
tcp   0     0 192.168.0.2:50925 192.168.0.3:35658 ESTABLISHED 0 6511587 2587/lttng-relayd
tcp   0     0 192.168.0.2:55804 192.168.0.2:43508 ESTABLISHED 0 9301061 10773/lttng-relayd
tcp   0 98304 192.168.0.2:54737 192.168.0.2:57123 ESTABLISHED 0 9297075 989/lttng-consumerd
tcp   0     0 192.168.0.2:50925 192.168.0.4:34087 ESTABLISHED 0 6512384 2587/lttng-relayd
tcp 113     0 192.168.0.2:50946 192.168.0.2:55902 ESTABLISHED 0 9267907 10250/lttng-relayd
tcp   0     0 192.168.0.2:56328 192.168.0.3:37172 ESTABLISHED 0 6512379 2587/lttng-relayd
tcp   0     0 192.168.0.2:50925 192.168.0.1:36151 ESTABLISHED 0 6512385 2587/lttng-relayd
tcp   0     0 192.168.0.2:56328 192.168.0.4:54327 ESTABLISHED 0 6510256 2587/lttng-relayd
tcp   0     0 192.168.0.2:56328 192.168.0.1:47063 ESTABLISHED 0 6512380 2587/lttng-relayd
tcp   0     1 192.168.0.2:58527 192.168.0.2:58751 SYN_SENT    0 9302155 542/lttng-sessiond
tcp   0     0 192.168.0.2:55902 192.168.0.2:50946 ESTABLISHED 0 9282303 989/lttng-consumerd
tcp   0     0 192.168.0.2:43508 192.168.0.2:55804 ESTABLISHED 0 9298218 989/lttng-consumerd
======================================================
ps auxwwwf
USER  PID   %CPU %MEM VSZ    RSS  TTY STAT START TIME COMMAND
:
root  27656 1.2  0.0  195228 4200 ?   Ssl  May30 6:05 /opt/trace/sbin/trace_c
root  2587  0.0  0.0  117156 716  ?   Sl   May30 0:03  \_ lttng-relayd -C tcp://192.168.0.2:56328 -D tcp://192.168.0.2:50925 -o /storage/no-backup/tracecc-apr9010500/STAB004_1
root  10250 0.1  0.0  51620  716  ?   Sl   01:57 0:00  \_ lttng-relayd -C tcp://192.168.0.2:50946 -D tcp://192.168.0.2:57123 -o /storage/no-backup/tracecc-apr9010500/STAB002_3
root  10773 0.0  0.0  51620  720  ?   Sl   01:58 0:00  \_ lttng-relayd -C tcp://192.168.0.2:55804 -D tcp://192.168.0.2:58751 -o /storage/no-backup/tracecc-apr9010500/STAB003_a_1
root  1747  0.0  0.0  42280  2288 ?   Sl   May30 0:02 /opt/trace_ea/bin/TraceEa
root  10954 0.0  0.0  11300  1320 ?   S    01:59 0:00  \_ sh -c ps auxwwwf >> /cluster/temp/tdlt/bug546_extra_info.txt
root  10955 0.0  0.0  4940   1096 ?   R    01:59 0:00      \_ ps auxwwwf
root  542   0.0  0.0  169140 3240 ?   Ssl  00:21 0:02 lttng-sessiond -d
root  989   0.0  0.0  209480 3536 ?   Sl   00:21 0:03  \_ lttng-consumerd --quiet -u --consumerd-cmd-sock /var/run/lttng/ustconsumerd64/command --consumerd-err-sock /var/run/lttng/ustconsumerd64/error
:
======================================================
Updated by David Goulet over 11 years ago
What would be very useful is the stdout/stderr of the session daemon: before the hang, was there any error log?
This can help us pinpoint which code path triggered the issue.
Updated by Tan le tran over 11 years ago
- File sessiond_vvv.log.tar added
- File netstat_ps_printout.txt added
- File gdb_printouts.log added
Hi David,

We got the hang again during our stability run. Here are the updates:

Commit used:
============
babeltrace : 9eaf254 Version 1.0.3
tools      : fe81e5c (HEAD, origin/master, origin/HEAD) Fix: add more conditions and a lock to metadata cache flushed
ust        : 844a160 (HEAD, origin/stable-2.2) Fix: liblttng-ust process startup hang....
userspace  : 264716f (HEAD, origin/stable-0.7, stable-0.7) Fix: Use a filled signal mask to disable all signals

How log has been obtained:
==========================
- The data has been collected as described in update #2 of this bug report, except that this time the corresponding "sessiond -vvv" output is also attached. Unfortunately, I forgot to run the "TimeMe" script (which inserts a timestamp into the "sessiond -vvv" log every second), so the log is a bit hard to analyse.
- We have restarted the stability run on the node. If the hang occurs again, a new log will be added.
Updated by David Goulet over 11 years ago
- Status changed from New to Confirmed
Is it possible to get the relayd logs during that hang?
From what I can see, the session daemon connects to the relayd control port successfully and then tries the data port, which does not respond to the connect()... Both threads of the relayd are in epoll_wait(), so the relayd might have output some errors.
Updated by Tan le tran over 11 years ago
Hi David,
I have tried to modify our code to launch relayd with "relayd -vvv".
But it gets a bit complicated. I got some hits last night, but I just realized that our log limits the size of error messages, so the stderr output from relayd -vvv gets chopped.
We are still looking into how to make it work. Once we have it, we will update this report with the output from relayd.
Regards,
Tan
Updated by Tan le tran over 11 years ago
- File sessiond_vvv.log added
- File relayd_vvv.log added
We finally got the relayd -vvv printout for the session that causes sessiond to hang. From the log, we see the following:

* At 12:17:55, SC-1 sends lttng_enable_channel() for session STAB003_a_1.
* At 12:18:52, TraceEA detects that sessiond has not been responsive for the last 60 seconds.
  - A printout of netstat is taken.
  - SIGABRT is sent to relayd, sessiond and consumerd to produce coredumps.

Attached are the corresponding "relayd -vvv" log of session STAB003_a_1 and the "sessiond -vvv" printout around that time frame.
Updated by David Goulet over 11 years ago
- File bug546.diff added
- Status changed from Confirmed to Feedback
Let's try this patch. It sets the socket in non-blocking mode just before the connect(), so we detect whether a blocking state could occur; if so, an error is returned, otherwise the non-blocking flag is removed from the socket and we continue.
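In C, the approach described above looks roughly like the following sketch. It is illustrative only, not the contents of the attached bug546.diff; the helper name connect_no_block() is made up.

/*
 * Illustrative sketch of the approach described above, not the actual
 * bug546.diff: switch the socket to O_NONBLOCK right before connect(),
 * treat EINPROGRESS as "this would have blocked" and fail, otherwise
 * restore the original flags and continue.
 */
#include <errno.h>
#include <fcntl.h>
#include <sys/socket.h>

static int connect_no_block(int fd, const struct sockaddr *addr, socklen_t len)
{
    int flags, ret;

    flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0) {
        return -1;
    }
    if (fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0) {
        return -1;
    }

    ret = connect(fd, addr, len);
    if (ret < 0 && errno == EINPROGRESS) {
        /* The connect would have blocked: report an error instead. */
        return -1;
    }

    /* Restore the original (blocking) flags and continue. */
    if (fcntl(fd, F_SETFL, flags) < 0) {
        return -1;
    }
    return ret;
}

One caveat with this kind of check: a non-blocking connect() over TCP normally returns EINPROGRESS even when the handshake would complete almost immediately, which may explain why every connection attempt failed in the follow-up test below.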
Updated by Tan le tran over 11 years ago
- File sessiond_vvv.log.tar added
Hi David,

I have applied the patch, and now no session can be activated because they all fail to connect to relayd. The "sessiond -vvv" log is attached.

Commit used for this time:
==========================
rcu  : 4b832c2 (origin/master, origin/HEAD) hash table test: don't redefine CACHE_LINE_SIZE
ust  : df56b23 (HEAD, tag: v2.2.0-rc3, origin/stable-2.2) Version 2.2.0-rc3
tools: 821fffb (HEAD, origin/master, origin/HEAD) Fix: consumer add channel return ...
Then apply patch bug546.diff
Updated by David Goulet over 11 years ago
Hrm... OK, so the patch tried to put the connect() in non-blocking mode, but it
seems that it has other undesired effects on the whole system...
This is a bit out of our reach, since a blocking connect() means that the TCP
handshake is taking too long or the kernel queue is full.
One thing that would be interesting when this happens is to look at the
number of sockets opened by the sessiond/consumerd/relayd (lsof) and the current
backlog in the kernel:
- cat /proc/net/sockstat
- wc -l /proc/net/tcp
- cat /proc/sys/net/ipv4/tcp_max_syn_backlog
That would help confirm whether the kernel is stuck on a full backlog.
There might also be some TIME_WAIT sockets on the relayd preventing any new
connect(), so a socket listing of the lttng toolchain will help us.
Updated by Tan le tran over 11 years ago
- File 1850_Netstat.log added
- File 1850_relayd_vvv.log added
- File 1850_extra_printouts.log added
The scenario has been rerun and the requested extra info has been collected:

1)_ cat /proc/net/sockstat
    sockets: used 650
    TCP: inuse 44 orphan 0 tw 1 alloc 51 mem 218
    UDP: inuse 21 mem 11
    UDPLITE: inuse 0
    RAW: inuse 1
    FRAG: inuse 0 memory 0

2)_ wc -l /proc/net/tcp
    46

3)_ cat /proc/sys/net/ipv4/tcp_max_syn_backlog
    1024

The attached files also contain all data from /proc/sys/net/ipv4/
Updated by David Goulet over 11 years ago
Can you set your health check to something like 250 seconds? Normally, the connect() should time out after 120 or 180 sec (not sure which) and unblock. This will tell us whether this is a normal kernel TCP block or something different!
Thanks
Updated by Mathieu Desnoyers over 11 years ago
- File socket-timeout-1.patch added
- File socket-timeout-2.patch added
- File socket-timeout-3.patch added
- Assignee set to Mathieu Desnoyers
Hi Tan,
I'm attaching 3 patches for you to try. They allow overriding the OS socket timeout for the sessiond, consumerd and relayd.
To use it with, for example, a 20-second timeout, launch sessiond with:
LTTNG_NETWORK_SOCKET_TIMEOUT=20000 lttng-sessiond
See man lttng-sessiond and man lttng-relayd for details.
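As a rough illustration of how an environment-variable timeout can be attached to a socket, here is a sketch built on setsockopt() timeouts. This is only an assumption about the mechanism, not the contents of the attached patches, and apply_env_socket_timeout() is a made-up helper name.

/*
 * Sketch: read LTTNG_NETWORK_SOCKET_TIMEOUT (milliseconds) and bound the
 * socket's send/receive operations with setsockopt(). On Linux, SO_SNDTIMEO
 * also bounds connect(). Illustrative only; the real patches may differ.
 */
#include <stdlib.h>
#include <sys/socket.h>
#include <sys/time.h>

static int apply_env_socket_timeout(int fd)
{
    const char *val = getenv("LTTNG_NETWORK_SOCKET_TIMEOUT");
    struct timeval tv;
    long ms;

    if (!val) {
        return 0;    /* No override: keep the OS defaults. */
    }
    ms = strtol(val, NULL, 10);
    if (ms <= 0) {
        return 0;
    }
    tv.tv_sec = ms / 1000;
    tv.tv_usec = (ms % 1000) * 1000;

    if (setsockopt(fd, SOL_SOCKET, SO_SNDTIMEO, &tv, sizeof(tv)) < 0) {
        return -1;
    }
    if (setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv)) < 0) {
        return -1;
    }
    return 0;
}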
the patches are:
socket-timeout-1.patch
socket-timeout-2.patch
socket-timeout-3.patch
If it helps with your issue, we'll consider pulling them.
Feedback is welcome.
Thanks,
Mathieu
Updated by Tan le tran over 11 years ago
- File sessiond_vvv_combined_1.log.tar added
- File sessiond_vvv_combined_2.log.tar added
Hi Mathieu,

We finally got the chance to tweak our timer so that we only time out on sessionD after 400 sec (instead of the original 60 sec). We have re-run our test and have seen a couple of PERRORs from sessionD instead of us timing out and restarting sessiond:

PERROR [14953/14958]: connect: Connection timed out (in lttcomm_connect_inet_sock() at inet.c:105)
Error: Unable to reach lttng-relayd

(I have included the "sessiond -vvv" log in case you want to take a look.)

Now that the above is confirmed, we will look into your suggestion (as outlined in update #14 above). We will keep you posted.

Thanks,
Tan
Updated by Tan le tran over 11 years ago
- File sessiond_vvv.log.tar added
Hi Mathieu,

I have loaded the latest commits of lttng-tools as you suggested and then applied the 3 patches (from update #14) on top of that commit. Then I re-ran our test.

What I have noticed is that when LTTNG_NETWORK_SOCKET_TIMEOUT is used (set to 30000), we experience a lot of "Error: Unable to reach lttng-relayd" when we send the lttng_enable_channel() API call to sessiond. Here is a small stat (the 3 patches are loaded; let LNST be LTTNG_NETWORK_SOCKET_TIMEOUT):

                #of lttng_enable_channel   #of "Error: Unable to reach lttng-relayd"
                =========================  ==========================================
LNST not used             ~90                                  0
LNST=30000                ~90                                 30

Note that the "Error: Unable to reach lttng-relayd" comes within one second after lttng_enable_channel() is called. According to my understanding, LNST=30000 means 30 sec; therefore, if there is any hiccup, the timeout should occur 30 sec later. I have tried to set LNST=300000 (300 sec), but observed the same sudden increase in the number of "Error: Unable to reach lttng-relayd". This error also comes within 1 sec after lttng_enable_channel() is called.

I have included a short clip of the sessiond -vvv log, in which lttng_enable_channel() is sent at 14:49:12 (for session STAB003_a_1) and at 14:49:18 (for session STAB004_1). Both failed with "Error: Unable to reach lttng-relayd".

In short, I have the 3 patches loaded. When I am not using LNST, I see no failure with "Error: Unable to reach lttng-relayd". When I launch sessiond with LNST=30000, I start to see a lot of "Error: Unable to reach lttng-relayd". Could you double-check whether the patches cause this behaviour? If so, are they supposed to work that way or slightly differently?

Commit used:
============
Tools: 3823e84 (origin/stable-2.2) test: test_periodical_metadata_flush shrink timer period, kill app
       with: Bug546 work around: add 3 patches from update #14
ust  : 481d245 (HEAD, tag: v2.2.1) Version 2.2.1
rcu  : b2f20df (HEAD, tag: v0.7.7) Version 0.7.7
babel: 87353f2 (HEAD, tag: v1.1.1) Version 1.1.1

Thanks,
Tan
Updated by Mathieu Desnoyers over 11 years ago
Here is a replacement patch for socket-timeout-2.patch (see connect-timeout-2-replacement.patch).
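For contrast with the earlier detect-and-fail sketch, the usual way to bound a connect() with a timeout is to treat EINPROGRESS as "in progress" and wait for writability with poll(). The sketch below is a general illustration of that technique only, not the attached connect-timeout-2-replacement.patch; connect_with_timeout() and its timeout_ms parameter are made-up names.

/*
 * General illustration of a connect() bounded by a caller-supplied timeout
 * (in milliseconds). EINPROGRESS is expected for a non-blocking TCP connect
 * and is handled by waiting with poll() instead of being treated as a failure.
 */
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <sys/socket.h>

static int connect_with_timeout(int fd, const struct sockaddr *addr,
        socklen_t len, int timeout_ms)
{
    int flags, ret, err = 0;
    socklen_t errlen = sizeof(err);
    struct pollfd pfd = { .fd = fd, .events = POLLOUT };

    flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0) {
        return -1;
    }

    ret = connect(fd, addr, len);
    if (ret < 0 && errno == EINPROGRESS) {
        /* Handshake in progress: wait up to timeout_ms for completion. */
        ret = poll(&pfd, 1, timeout_ms);
        if (ret <= 0) {
            return -1;    /* Timeout or poll error. */
        }
        /* Fetch the final status of the connection attempt. */
        if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &errlen) < 0 || err) {
            return -1;
        }
        ret = 0;
    }

    /* Restore blocking mode for subsequent I/O. */
    if (fcntl(fd, F_SETFL, flags) < 0) {
        return -1;
    }
    return ret;
}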
Updated by Tan le tran over 11 years ago
After adding the patch from update #17 and executing the stability run, the problem is no longer observed. The traffic could run for more than 24 hrs without the health check timing out on sessiond.
Thanks a bunch for fixing this.
Updated by Mathieu Desnoyers over 11 years ago
- Status changed from Feedback to Resolved
- % Done changed from 0 to 100
Applied in changeset 783a3b9aa609d6805e1463655d25c7ae176c4859.
Updated by Tan le tran over 11 years ago
Hi Mathieu,
Can you port it back to lttng-tools origin/stable-2.2?
Thanks,
Tan