Project

General

Profile

Actions

Bug #546

closed

lttng2.2.0rc2: SessionD hang after receiving lttng_enable_channel() API call

Added by Tan le tran over 11 years ago. Updated about 11 years ago.

Status:
Resolved
Priority:
High
Target version:
Start date:
05/29/2013
Due date:
% Done:

100%

Estimated time:

Description

Commit used:
============
babeltrace  : 9eaf254 Version 1.0.3
tools       : 094d169 (HEAD, origin/master, origin/HEAD) Fix: dereference after NULL check
ust         : 996aead (HEAD, origin/master, origin/HEAD) Add parameter -f to rm in Makefile clean target
userspace   : 264716f (HEAD, origin/stable-0.7, stable-0.7) Fix: Use a filled signal mask to disable all signals

Problem Description:
====================
 * SessionD hang (does not response) during stability test activity.
 * The hanging start to happen after about 30 min of Stability run.
 * From our log, it shows that all hanging happens when lttng_enable_channel() API is invoked.
     :
     86520 09:30:23 05/29/2013 IN TraceP_SC-1 "DBG: activateSession: invoke lttng_create_session for session STAB001_1 with url net://192.168.0.1:54176:53483/./.
     86521 09:30:23 05/29/2013 IN TraceP_SC-1 "DBG: activateSession: Invoking lttng_create_handle for session STAB001_1
     86522 09:30:23 05/29/2013 IN TraceP_SC-1 "DBG: activateSession: Invoking lttng_set_default_attr for session STAB001_1.
     86523 09:30:23 05/29/2013 IN TraceP_SC-1 "DBG: activateSession: Invoking lttng_enable_channel for session STAB001_1
     :  #--- no reply receives for this....after about 60 sec (+- 5) timeout, our process start to
        #--- "kill -9" sessiond and consumerd .
        #    We have modified the code to use "kill -SIGABRT" sessiond and consumerD so that
        #    coredump can be generated. The corresponding gdb printouts are attached with
        #    this report.

Is problem reproducible ?
=========================
  * yes 

How to reproduce (if reproducible):
===================================
  * Our stability test consist of 4 users. Each user has a different set of trace commands
    (such as create session, activate session, stop session, etc). Each user then executes
    its set of commands through multiple iterations.
    All sessions are created using streamming and perUID buffer and tracing on userspace only.

    After about 30 min run (Each user has finished a few iterations), we start noticing 
    that sessionD start to hang. Our heatlh check process will "kill -9" sessiond and consumerD
    and restart a new one. After about another while, sessionD hang again with the same kind
    of symptomp. During an overnight run, that happens a lot of time.

    For the purpose of gathering more data, we have modified our code to use "kill -SIGABRT" to
    kill sessiond and consumerd so that coredump can be obtained when hanging occur.
    The corresponding gdb printout are included in this report.

Any other information:
======================
-   

Files

may29_sessiond_consumerd_gdb_printout.log (22.8 KB) may29_sessiond_consumerd_gdb_printout.log gdb printout for sessionD and consumerD after sending "kill -SIGABRT" Tan le tran, 05/29/2013 01:04 PM
gdb_printous.log (27.6 KB) gdb_printous.log gdb printout of relayd, sessiond and consumerd when hanging occurs Tan le tran, 05/31/2013 10:58 AM
sessiond_vvv.log.tar (429 KB) sessiond_vvv.log.tar "sessiond -vvv" (from 09:36:04 to when sessiond got restarted) Tan le tran, 06/04/2013 12:14 PM
netstat_ps_printout.txt (3.63 KB) netstat_ps_printout.txt netstat printout when hanging occurs Tan le tran, 06/04/2013 12:14 PM
gdb_printouts.log (15.6 KB) gdb_printouts.log gdb printout of ralyd and sessiond after SIGABRT sent Tan le tran, 06/04/2013 12:14 PM
sessiond_vvv.log (459 KB) sessiond_vvv.log printout of "sessiond -vvv" Tan le tran, 06/07/2013 07:50 PM
relayd_vvv.log (129 KB) relayd_vvv.log print out of "relayd -vvv" Tan le tran, 06/07/2013 07:50 PM
bug546.diff (1.26 KB) bug546.diff David Goulet, 06/18/2013 03:51 PM
sessiond_vvv.log.tar (7.72 KB) sessiond_vvv.log.tar sessiond -vvv log when attempting to activate session Tan le tran, 06/20/2013 03:30 PM
1850_Netstat.log (2.09 KB) 1850_Netstat.log netstat -etanp printout Tan le tran, 06/25/2013 09:53 AM
1850_relayd_vvv.log (175 KB) 1850_relayd_vvv.log relayd -vvv printout Tan le tran, 06/25/2013 09:53 AM
1850_extra_printouts.log (3.82 KB) 1850_extra_printouts.log requested data printout Tan le tran, 06/25/2013 09:53 AM
socket-timeout-1.patch (5.13 KB) socket-timeout-1.patch socket-timeout-1.patch Mathieu Desnoyers, 07/11/2013 01:49 PM
socket-timeout-2.patch (6.62 KB) socket-timeout-2.patch socket-timeout-2.patch Mathieu Desnoyers, 07/11/2013 01:49 PM
socket-timeout-3.patch (6.05 KB) socket-timeout-3.patch socket-timeout-3.patch Mathieu Desnoyers, 07/11/2013 01:49 PM
sessiond_vvv_combined_1.log.tar (311 KB) sessiond_vvv_combined_1.log.tar sessiond -vvv log (part1) Tan le tran, 07/16/2013 09:13 AM
sessiond_vvv_combined_2.log.tar (425 KB) sessiond_vvv_combined_2.log.tar sessiond -vvv log (part2) Tan le tran, 07/16/2013 09:13 AM
sessiond_vvv.log.tar (21.6 KB) sessiond_vvv.log.tar sessiond -vvv log Tan le tran, 07/23/2013 09:59 AM
connect-timeout-2-replacement.patch (8 KB) connect-timeout-2-replacement.patch Mathieu Desnoyers, 07/23/2013 04:13 PM
Actions #1

Updated by David Goulet over 11 years ago

It seems that the sessiond is unable to reach the relayd for some reason. Two threads are stuck on connect() to relayd and recvmsg() from one.

Is the relayd kill or malfunctioning during (or before) this hang ?

Actions #2

Updated by Tan le tran over 11 years ago


We have noticed that the corresponding relayd port is in SYN_SENT state
when sessionD is hanging. So, we have mmodified our code to do the following
when the hanging has been detected:
  1)_ run "netstat -etanp" 
  2)_ run "ps auxwwwf" 
  3)_ Search for "SYN_SENT" and get the corresponding port number and
      from there, the corresponding relayd PID.
  4)_ Send "Kill -SIGABRT" to relayd PID
      do the same for sessionD and consumerD.  

The corresponding gdb printouts are attached in this report.

The following shows the netstat printout when the hanging sessionD was detected:
Note that we have 3 sessions ongoing. Each session has its own relayd.

Fri May 31 01:59:25 EDT 2013
netstat -etanp |grep lttng
tcp        0      0 192.168.0.2:50925       0.0.0.0:*               LISTEN      0          6515877    2587/lttng-relayd   
tcp        0      0 192.168.0.2:55804       0.0.0.0:*               LISTEN      0          9301056    10773/lttng-relayd  
tcp        0      0 192.168.0.2:58751       0.0.0.0:*               LISTEN      0          9301057    10773/lttng-relayd  <------------
tcp        0      0 192.168.0.2:50946       0.0.0.0:*               LISTEN      0          9286713    10250/lttng-relayd  
tcp        0      0 192.168.0.2:57123       0.0.0.0:*               LISTEN      0          9286714    10250/lttng-relayd  
tcp        0      0 192.168.0.2:56328       0.0.0.0:*               LISTEN      0          6515876    2587/lttng-relayd   
tcp        0      0 192.168.0.2:57123       192.168.0.2:54737       ESTABLISHED 0          9267909    10250/lttng-relayd  
tcp        0      0 192.168.0.2:50925       192.168.0.3:35658       ESTABLISHED 0          6511587    2587/lttng-relayd   
tcp        0      0 192.168.0.2:55804       192.168.0.2:43508       ESTABLISHED 0          9301061    10773/lttng-relayd  
tcp        0  98304 192.168.0.2:54737       192.168.0.2:57123       ESTABLISHED 0          9297075    989/lttng-consumerd 
tcp        0      0 192.168.0.2:50925       192.168.0.4:34087       ESTABLISHED 0          6512384    2587/lttng-relayd   
tcp      113      0 192.168.0.2:50946       192.168.0.2:55902       ESTABLISHED 0          9267907    10250/lttng-relayd  
tcp        0      0 192.168.0.2:56328       192.168.0.3:37172       ESTABLISHED 0          6512379    2587/lttng-relayd   
tcp        0      0 192.168.0.2:50925       192.168.0.1:36151       ESTABLISHED 0          6512385    2587/lttng-relayd   
tcp        0      0 192.168.0.2:56328       192.168.0.4:54327       ESTABLISHED 0          6510256    2587/lttng-relayd   
tcp        0      0 192.168.0.2:56328       192.168.0.1:47063       ESTABLISHED 0          6512380    2587/lttng-relayd   
tcp        0      1 192.168.0.2:58527       192.168.0.2:58751       SYN_SENT    0          9302155    542/lttng-sessiond  
tcp        0      0 192.168.0.2:55902       192.168.0.2:50946       ESTABLISHED 0          9282303    989/lttng-consumerd 
tcp        0      0 192.168.0.2:43508       192.168.0.2:55804       ESTABLISHED 0          9298218    989/lttng-consumerd 
======================================================
ps auxwwwf 
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
:
root     27656  1.2  0.0 195228  4200 ?        Ssl  May30   6:05 /opt/trace/sbin/trace_c
root      2587  0.0  0.0 117156   716 ?        Sl   May30   0:03  \_ lttng-relayd -C tcp://192.168.0.2:56328 -D tcp://192.168.0.2:50925 -o /storage/no-backup/tracecc-apr9010500/STAB004_1
root     10250  0.1  0.0  51620   716 ?        Sl   01:57   0:00  \_ lttng-relayd -C tcp://192.168.0.2:50946 -D tcp://192.168.0.2:57123 -o /storage/no-backup/tracecc-apr9010500/STAB002_3
root     10773  0.0  0.0  51620   720 ?        Sl   01:58   0:00  \_ lttng-relayd -C tcp://192.168.0.2:55804 -D tcp://192.168.0.2:58751 -o /storage/no-backup/tracecc-apr9010500/STAB003_a_1
root      1747  0.0  0.0  42280  2288 ?        Sl   May30   0:02 /opt/trace_ea/bin/TraceEa
root     10954  0.0  0.0  11300  1320 ?        S    01:59   0:00  \_ sh -c ps auxwwwf >> /cluster/temp/tdlt/bug546_extra_info.txt
root     10955  0.0  0.0   4940  1096 ?        R    01:59   0:00      \_ ps auxwwwf
root       542  0.0  0.0 169140  3240 ?        Ssl  00:21   0:02 lttng-sessiond -d
root       989  0.0  0.0 209480  3536 ?        Sl   00:21   0:03  \_ lttng-consumerd --quiet -u --consumerd-cmd-sock /var/run/lttng/ustconsumerd64/command --consumerd-err-sock /var/run/lttng/ustconsumerd64/error
:
======================================================
Actions #3

Updated by David Goulet over 11 years ago

  • Target version set to 2.2
Actions #4

Updated by David Goulet over 11 years ago

What could be very useful is to get the stdout/stderr of the session daemon meaning before the hang, was there any error log ?

This can help us pin point which code path triggered the issue.

Updated by Tan le tran over 11 years ago


Hi David,

We got the hanging again during our stability run.
Here are the updates:

Commit used:
============
babeltrace  : 9eaf254 Version 1.0.3
tools       : fe81e5c (HEAD, origin/master, origin/HEAD) Fix: add more conditions and a lock to metadata cache flushed
ust         : 844a160 (HEAD, origin/stable-2.2) Fix: liblttng-ust process startup hang....
userspace   : 264716f (HEAD, origin/stable-0.7, stable-0.7) Fix: Use a filled signal mask to disable all signals

How log has been obtained:
==========================
  - The data has been collected as described in update #2 from this bug report.
    Except, this time, the corresponding "sessiond -vvv" output is also attached.
    Unfortunately, I have forgot to run the "TimeMe" script (which insert a 
    time stamp into the "sessiond -vvv" log at every second); therefore, it is 
    a bit hard to analyse. 

  - We have restarted the stability run on the node. If the hanging occurs again,
    new log will be added.

Actions #6

Updated by David Goulet over 11 years ago

  • Status changed from New to Confirmed

Is it possible to have the relayd logs during that hang.

Of what I can see, the session daemon connects to the relayd on the control port successfully then tries the data port which does not respond to the connect()... Both threads of the relayd are in epoll_wait() so the relayd might have outputted some errors.

Actions #7

Updated by Tan le tran over 11 years ago

Hi David,

I have tried to modified our code to use "relayd -vvv" when we launch it.
But it get a bit complicated. I got some hit last night but I just relalized that our log limitted the size of the error message causing the stderr output from relayd -vvv being chopped .

We are still looking into how to make it work. Once we got it, we will update this report with the ouput from relayd.
Regards,
Tan

Updated by Tan le tran over 11 years ago

We finally got the relayd -vvv printout for the session that causes sessiond to hang.

From the log, we see the following:
* At 12:17:55, SC-1 sends lttng_enable_channel() for session STAB003_a_1.
* At 12:18:52, TraceEA detects that sessiond is not responsive in the last 60 second.
  - printout of netsat has been taken
  - SIGABRT is sent to relayd, sessiond and consumerd to cause coredump.

Included are the corresponding "relayd -vvv" of session STAB003_a_1 .
The printout of "sessiond -vvv" around that time frame is also included.

Actions #9

Updated by David Goulet about 11 years ago

Let's try this patch. We set the socket in non block mode just before the connect() so we'll detect if a blocking state can occur and if so, an error is returned or else we remove the non block state of the socket and continue.

Actions #10

Updated by Tan le tran about 11 years ago

Hi David,

I have applied the patch and no session can be activated because they all fail to connect to relayd.
"sessiond -vvv" log are attached.

Commit used for this time:
==========================
rcu  : 4b832c2 (origin/master, origin/HEAD) hash table test: don't redefine CACHE_LINE_SIZE
ust  : df56b23 (HEAD, tag: v2.2.0-rc3, origin/stable-2.2) Version 2.2.0-rc3
tools: 821fffb (HEAD, origin/master, origin/HEAD) Fix: consumer add channel return ...
       Then apply patch bug546.diff
Actions #11

Updated by David Goulet about 11 years ago

Hrm... Ok so the patch tried to put the connect in non block mode but it seems
that it has some other undesired effect for the whole system...

This is a bit out of our reach since a blocking connect() means that the TCP
handshake is taking too long or the kernel queue is full.

One thing that would be interesting when this happens is to have a look at the
number of socket opened by the sessiond/consumerd/relayd (lsof) and the current
backlog in the kernel.

  1. cat /proc/net/sockstat
  2. wc -l /proc/net/tcp
  3. cat /proc/sys/net/ipv4/tcp_max_syn_backlog

That would be useful to try to confirm if the kernel is stuck in a backlog.
There might be some TIME_WAIT socket on the relayd that prevents any new
connect() so the socket listing of the lttng toolchain will help us.

Updated by Tan le tran about 11 years ago

The scenario has been rerun and the requested extra info have been collected:
  1)_ cat /proc/net/sockstat
      sockets: used 650
      TCP: inuse 44 orphan 0 tw 1 alloc 51 mem 218
      UDP: inuse 21 mem 11
      UDPLITE: inuse 0
      RAW: inuse 1
      FRAG: inuse 0 memory 0

  2)_ wc -l /proc/net/tcp
      46

  3)_ cat /proc/sys/net/ipv4/tcp_max_syn_backlog
      1024

The attached files also contain all data from /proc/sys/net/ipv4/

Actions #13

Updated by David Goulet about 11 years ago

Can you set your health check to something like 250 seconds. Normally, the connect() should timeout after 120 or 180 sec (not sure) and unblock. This will tell us if this is a normal kernel TCP block or something different!

Thanks

Updated by Mathieu Desnoyers about 11 years ago

Hi Tan,

I'm attaching 3 patches for you to try. They allow overriding the OS socket timeout for the sessiond, consumerd and relayd.

To use it, with e.g. a 20 seconds timeout, launch sessiond with:

LTTNG_NETWORK_SOCKET_TIMEOUT=20000 lttng-sessiond

See man lttng-sessiond and man lttng-relayd for details.

the patches are:

socket-timeout-1.patch
socket-timeout-2.patch
socket-timeout-3.patch

It it helps with your issue, we'll consider pulling them.

Feedback is welcome.

Thanks,

Mathieu

Updated by Tan le tran about 11 years ago

Hi Mathieu,

We finally got the chance to tweak our timer so that we only timeout with sessionD after 400sec (instead of the original 60sec).
We have re-ran our test and have seen a couple of PERROR from sessionD instead of we timing out and restart sessiond:

  PERROR [14953/14958]: connect: Connection timed out (in lttcomm_connect_inet_sock() at inet.c:105)
  Error: Unable to reach lttng-relayd
  (I have included the "sessiond -vvv" log in case you want to take a look)

Now that the above is confirmed, we will look into your suggestion (as outlined in update#14 above).
We will keep you posted.

Thanks,
Tan

Actions #16

Updated by Tan le tran about 11 years ago

Hi Mathieu,

I have loaded the latest commits of lttng-tools as you have suggested and then apply the 
3 patches (from update#14) on top of that commit. Then I re-ran our test. 
What I have noticed is that when the LTTNG_NETWORK_SOCKET_TIMEOUT is 
used (set to 30000), we are experience a lot of "Error: Unable to reach lttng-relayd" 
when we send lttng_enable_channel() API call to sessiond.

Here is a small stat:
   The 3 patches are loaded. Lets call LNST be LTTNG_NETWORK_SOCKET_TIMEOUT.

                   #of lttng_enable_channel   #of "Error: Unable to reach lttng-relayd" 
                   =========================   ==========================================
   LNST not used        ~90                        0
   LNST=30000           ~90                        30

   Note that the "Error: Unable to reach lttng-relayd" comes within one second after the
   lttng_enable_channel() is called. According to my understanding, LNST=30000 ie: 30sec;
   therefore, if there is any hickup, the timeout will be 30 sec later.

I have tried to set LNST=300 000 (300 sec), but observe the same suddend increase of number of
"Error: Unable to reach lttng-relayd" . This error also comes within 1 sec after
lttng_enable_channel() is called.

I have included a short clips of sessiond -vvv log, in which lttng_enable_channel() are
sent at 14:49:12 (for session STAB003_a_1) and at 14:49:18 (for session STAB004_1). Both
failed with "Error: Unable to reach lttng-relayd".

In short, I have the 3 patches loaded. When I am not using LNST, I see no failure with
"Error: Unable to reach lttng-relayd". When I launch sessiond with LNST=30000, I start to
see a lot of "Error: Unable to reach lttng-relayd".
Could you doublecheck to see whether the pacthes cause this behaviour. If so, are
they supposed to be that way or sligthly different ?   

Commit used:
============
Tools: 3823e84 (origin/stable-2.2) test: test_periodical_metadata_flush shrink timer period, kill app
       with: Bug546 work around: add 3 pacthes from update #14
ust  : 481d245 (HEAD, tag: v2.2.1) Version 2.2.1
rcu  : b2f20df (HEAD, tag: v0.7.7) Version 0.7.7
babel: 87353f2 (HEAD, tag: v1.1.1) Version 1.1.1

Thanks,
Tan

Actions #17

Updated by Mathieu Desnoyers about 11 years ago

Here is a replacement patch for socket-timeout-2.patch (see connect-timeout-2-replacement.patch).

Actions #18

Updated by Tan le tran about 11 years ago

After adding the patch from update#17 and execute the stability run, the problem is no longer observed. The traffic could run more than 24hrs without having healthcheck timeout with sessiond.

Thanks a bunch for fixing this.

Actions #19

Updated by Mathieu Desnoyers about 11 years ago

  • Status changed from Feedback to Resolved
  • % Done changed from 0 to 100
Actions #20

Updated by Tan le tran about 11 years ago

Hi Mathieu,

Can you port it back to lttng-tools origin/stable-2.2 ?
Thanks,
Tan

Actions

Also available in: Atom PDF