Actions
Bug #545
closedlttng2.2.0rc2: Instrumented apps hung (can not execute their main) when sessionD is not responsive (simulated via kill -STOP)
Status:
Resolved
Priority:
Critical
Assignee:
-
Target version:
-
Start date:
05/27/2013
Due date:
% Done:
100%
Estimated time:
Description
Not quite sure whether this should go under lttng-tools or lttng-ust. If it is registered under the wrong queue, please, forward it to the right one. Commit used: ============ userspace : 56e676d Document: rculfhash destroy and resize side-effect in 0.7 lttng-ust : 996aead Add parameter -f to rm in Makefile clean target lttng-tools : dae1096 Fix: use lttng pipe to send new stream to thread babeltrace : 9eaf254 Version 1.0.3 Problem Description: ==================== * When sessionD is hanging (simulated via a "pkill -STOP sessiond"), launched instrumented applications got stuck and can not execute their main until sessiond is killed . According to our understanding, when an instrumented app is launched, if it fails to register with sessiond, it should be able to execute its main after the LTTNG_UST_REGISTER_TIMEOUT has elapsed. This seems to be valid only if the influx of launched instr app is low. When the influx of launched instr app is high, it seems that only the first 32 instr apps can execute their main, the remaining instr app hung forever until the sessonD process is killed. Is problem reproducible ? ========================= * yes How to reproduce (if reproducible): =================================== 1)_ Kill the current sessionD and relaunch a new one 2)_ ps -e |grep lttng 3)_ pkill -STOP sessiond 4)_ Launch 1 instance of instr app (ex: ./TestApp_100perSecOnly 5 & ) After 3 sec (default value of LTTNG_UST_REGISTER_TIMEOUT), receives: libust[9890/9890]: Error: Timed out waiting for lttng-sessiond (in lttng_ust_init() at lttng-ust-comm.c:1198) Then receives the print out from TestApp_100perSecOnly (at every sec for the next 5 sec). This works as intended. 5)_ Launch about 3000 instances of instr apps in the next few seconds. ex: for a in $(seq 1 30); do (for n in $(seq 1 100); do (/home/test_apps/TestApp_100perSecOnly 3 np > /dev/null &); done; usleep 200000); done & Note: the above command launch the instr app with life span of only 3 sec for each instance. 6)_ Will see plenty of "Error: Timed out waiting for lttng-sessiond" 7)_ After a few seconds, ps -e |grep -c TestApp See a magic number "2969" . Wait for a few sec, repeat the above ps command Still see 2969. It looks like only the first 32 instances of instr app were able to execute despite the hanging of sessiond. The remaining instances of instr apps hang. 8)_ pkill -CONT sessond 9)_ ps -e |grep -c TestApp See "2928" remaining. 10)_ Launch a new instance of instr app: (ex: ./TestApp_100perSecOnly 5 &) get: "Timed out waiting for lttng-sessiond" and no periodic printout from this testapp ...that means it got hang too. 11)_ pkill -9 sessiond 12)_ Start getting printout of the instr app launched at step 10 above. 13)_ ps -e |grep -c TestApp get "0" Any other information: ====================== -
Files
Updated by Tan le tran over 11 years ago
- File May29_gdb_printout.log May29_gdb_printout.log added
The problem has been reproduced based on the steps outlined in this report (upto step10). The instance of instr app from step-10 is then "kill -SIGABRT" to generate a coredump. Included is the corresponding gdb printout of that coredump.
Updated by Jérémie Galarneau over 11 years ago
I think we have identified the cause of this deadlock. I'm testing a potential fix right now.
Updated by Jérémie Galarneau over 11 years ago
- File patch-hang-startup.patch patch-hang-startup.patch added
I can confirm that I have reproduced the issue here and that this patch fixes the problem.
Updated by Tan le tran over 11 years ago
Do I need to recompile the instrumented app in this case ?
Updated by Jérémie Galarneau over 11 years ago
I did. But it shouldn't be necessary.
Updated by Tan le tran over 11 years ago
Hi Jérémie,
I just verified with commit on "origin/stable-2.2" and the
problem is no longer reproducible.
Thanks for the fix,
Tan
Updated by Mathieu Desnoyers over 11 years ago
- Status changed from New to Resolved
- % Done changed from 0 to 100
Applied in changeset 5b14aab38abe095cf5885aba2f0a1cdaed36729b.
Actions