Bug #545 (closed)

lttng 2.2.0-rc2: Instrumented apps hang (cannot execute their main()) when sessiond is not responsive (simulated via kill -STOP)

Added by Tan le tran over 11 years ago. Updated over 11 years ago.

Status: Resolved
Priority: Critical
Assignee: -
Target version: -
Start date: 05/27/2013
Due date: -
% Done: 100%
Estimated time: -

Description

Not quite sure whether this should go under lttng-tools or lttng-ust.
If it is registered under the wrong queue, please forward it to the right one.

Commit used:
============
userspace   : 56e676d Document: rculfhash destroy and resize side-effect in 0.7
lttng-ust   : 996aead Add parameter -f to rm in Makefile clean target
lttng-tools : dae1096 Fix: use lttng pipe to send new stream to thread
babeltrace  : 9eaf254 Version 1.0.3

Problem Description:
====================
 * When sessiond is hanging (simulated via "pkill -STOP sessiond"), newly launched instrumented
   applications get stuck and cannot execute their main() until sessiond is killed.
   According to our understanding, when an instrumented app is launched and fails to register
   with sessiond, it should be able to execute its main() once the LTTNG_UST_REGISTER_TIMEOUT has
   elapsed. This seems to hold only when the influx of launched instr apps is low. When the influx
   is high, it seems that only the first 32 instr apps can execute their main();
   the remaining instr apps hang forever until the sessiond process is killed.
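
   Note: as context for the expected behaviour, the sketch below shows the general
   "register or time out" startup pattern described above. It is a simplified
   assumption for illustration, not the actual lttng-ust code: a library
   constructor waits on a semaphore with sem_timedwait(3) and gives up after the
   registration timeout, so main() can run untraced.

      /* Minimal sketch (assumed pattern, not lttng-ust source): block in a
       * constructor until registration completes or a timeout expires, then
       * let main() run either way.  Build with: gcc -pthread sketch.c */
      #include <semaphore.h>
      #include <stdio.h>
      #include <time.h>

      static sem_t registration_done; /* would be posted once sessiond replies */

      __attribute__((constructor))
      static void wait_for_sessiond(void)
      {
          struct timespec deadline;

          sem_init(&registration_done, 0, 0);
          clock_gettime(CLOCK_REALTIME, &deadline);
          deadline.tv_sec += 3;       /* 3000 ms, the documented default */

          /* Nobody posts the semaphore here, standing in for a stopped
           * sessiond: the wait simply expires after 3 seconds. */
          if (sem_timedwait(&registration_done, &deadline) == -1)
              fprintf(stderr, "timed out waiting for registration; continuing untraced\n");
      }

      int main(void)
      {
          puts("main() reached");
          return 0;
      }

   In the real library the timeout comes from the LTTNG_UST_REGISTER_TIMEOUT
   environment variable (in milliseconds; its 3000 ms default matches the
   3 sec observed in step 4 below).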

Is the problem reproducible?
============================
  * yes 

How to reproduce (if reproducible):
===================================
  1)_ Kill the current sessiond and launch a new one
  2)_ ps -e |grep lttng
  3)_ pkill -STOP sessiond
  4)_ Launch 1 instance of the instr app (ex: ./TestApp_100perSecOnly 5 & ).
      A minimal stand-in for this test app is sketched after this list.
      After 3 sec (the default LTTNG_UST_REGISTER_TIMEOUT), we receive:
         libust[9890/9890]: Error: Timed out waiting for lttng-sessiond (in lttng_ust_init() at lttng-ust-comm.c:1198)
      Then we receive the printout from TestApp_100perSecOnly (once per sec for the next 5 sec).
      This works as intended.

  5)_ Launch about 3000 instances of instr apps in the next few seconds.
      ex:
          for a in $(seq 1 30); do (for n in $(seq 1 100); do (/home/test_apps/TestApp_100perSecOnly 3 np > /dev/null &); done; usleep 200000); done &

      Note: the above command launches the instr app with a life span of only 3 sec for each instance.

  6)_ Will see plenty of "Error: Timed out waiting for lttng-sessiond" 
  7)_ After a few seconds, ps -e |grep -c TestApp
      See the magic number "2969".
      Wait a few sec and repeat the above ps command:
      still 2969.
      It looks like only the first 32 instances of the instr app were able to execute (and exit) despite sessiond hanging.
      The remaining instances hang.

  8)_ pkill -CONT sessiond
  9)_ ps -e |grep -c TestApp
      See "2928" remaining.

  10)_ Launch a new instance of instr app: (ex: ./TestApp_100perSecOnly 5 &)
       get: "Timed out waiting for lttng-sessiond" 
       and no periodic printout from this test app, which means it hangs too.
  11)_ pkill -9 sessiond
  12)_ Start getting printout of the instr app launched at step 10 above.
  13)_ ps -e |grep -c TestApp
       get "0" 

Any other information:
======================
-   

Files

terminal_log.log (334 KB) - log of the terminal when the above steps were carried out. Tan le tran, 05/27/2013 02:53 PM
May29_gdb_printout.log (345 KB) - terminal log and gdb printout. Tan le tran, 05/29/2013 11:08 AM
patch-hang-startup.patch (2.51 KB) - Jérémie Galarneau, 05/30/2013 11:36 AM
#1

Updated by Tan le tran over 11 years ago

The problem has been reproduced based on the steps outlined in this report (up to step 10).
The instr app instance from step 10 was then killed with SIGABRT to generate a core dump.
Attached is the corresponding gdb printout of that core dump.

#2

Updated by Jérémie Galarneau over 11 years ago

That's very helpful, thanks!

#3

Updated by Jérémie Galarneau over 11 years ago

I think we have identified the cause of this deadlock. I'm testing a potential fix right now.

#4

Updated by Jérémie Galarneau over 11 years ago

I can confirm that I have reproduced the issue here and that this patch fixes the problem.
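
Note for later readers: the thread never spells out the deadlock, and the
following is an assumption, not something confirmed here or taken from
patch-hang-startup.patch. The "32" above is suggestive of a fixed-size
pending-connection queue: once the SIGSTOPped sessiond stops calling accept(2),
the listen(2) backlog of its Unix socket fills, and every later blocking
connect(2) blocks indefinitely, beyond the reach of any application-level
registration timeout. The usual defensive pattern is a non-blocking connect(2)
bounded by poll(2), sketched here with a hypothetical helper name:

    #include <errno.h>
    #include <fcntl.h>
    #include <poll.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/un.h>
    #include <unistd.h>

    /* Connect to a Unix socket without ever blocking longer than
     * timeout_ms, even if the peer's accept queue is full.
     * Illustrative sketch only. */
    int connect_with_timeout(const char *path, int timeout_ms)
    {
        struct sockaddr_un addr;
        int fd = socket(AF_UNIX, SOCK_STREAM, 0);

        if (fd < 0)
            return -1;
        memset(&addr, 0, sizeof(addr));
        addr.sun_family = AF_UNIX;
        strncpy(addr.sun_path, path, sizeof(addr.sun_path) - 1);

        fcntl(fd, F_SETFL, O_NONBLOCK); /* never block inside connect() */

        if (connect(fd, (struct sockaddr *) &addr, sizeof(addr)) < 0) {
            struct pollfd pfd = { .fd = fd, .events = POLLOUT };
            int err = 0;
            socklen_t len = sizeof(err);

            /* With O_NONBLOCK, a full accept backlog on an AF_UNIX
             * socket fails fast with EAGAIN instead of hanging, which
             * is exactly the failure mode described in this report. */
            if (errno != EINPROGRESS)
                goto fail;
            if (poll(&pfd, 1, timeout_ms) <= 0) /* bounded wait */
                goto fail;
            if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len) < 0 || err)
                goto fail;
        }
        return fd; /* caller may clear O_NONBLOCK before use */
    fail:
        close(fd);
        return -1;
    }

With this pattern an unresponsive peer costs at most timeout_ms per attempt
instead of wedging the caller forever.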

#5

Updated by Tan le tran over 11 years ago

Do I need to recompile the instrumented app in this case?

#6

Updated by Jérémie Galarneau over 11 years ago

I did. But it shouldn't be necessary.

#7

Updated by Tan le tran over 11 years ago

Hi Jérémie,

I just verified with the commit on "origin/stable-2.2", and the
problem is no longer reproducible.
Thanks for the fix,
Tan

#8

Updated by Mathieu Desnoyers over 11 years ago

  • Status changed from New to Resolved
  • % Done changed from 0 to 100
