Project

General

Profile

Actions

Bug #386

closed

A hanging instrumented application can cause consummerD to hang, which in turn will cause sessiond to hang

Added by Tan le tran over 11 years ago. Updated over 11 years ago.

Status:
Resolved
Priority:
Critical
Assignee:
Target version:
Start date:
10/26/2012
Due date:
% Done:

100%

Estimated time:

Description

Description: 
============ 
  After a session has been activated, if an instrumented app (that is currently
  being traced on by the session) hang, then it will cause consumerD to hang
  as well. Any lttng command towards sessiond that needs interaction with 
  consumerD will also cause sessionD to hang and that will lead to unavailability
  of tracing.

  Killing sessonD will not help (see bug-367) .

Commit used: 
============ 
  userspace:   e7e6ff7 rculfhash test: fix trivial memleak and return node leak and errors
  lttng-ust:   1c7b4a9 Fix: memcpy of string is larger than source
  lttng-tools: dda67f6  Fix: Error handling when sending relayd sockets to consumer
  babeltrace : 6ca30a4 Cleanup: fix cppcheck warning

Scenario: 
========= 
# lttng create ses1
# lttng enable-event com_ericsson_cba_trace_testapp_lowtraf:OnePerSecB -u
# lttng start
# lttng list
    Available tracing sessions:
      1) ses1 (/root/lttng-traces/ses1-20121026-153048) [active]

# lttng list -u
    UST events:
    -------------
    None

# /home/test_apps/TestApp_type1 5555 np &
# ps -e |egrep -i 'lttng|TestApp|trace'
      646 pts/2    00:00:00 TestApp_type1
     5469 ?        00:00:00 TraceEa
     9961 ?        00:00:20 TraceEa
    31730 ?        00:00:00 lttng-sessiond
    31823 ?        00:00:00 lttng-consumerd
# /home/test_apps/TestApp_Thread1 44444 np &
# ps -e |egrep -i 'lttng|TestApp|trace'
      646 pts/2    00:00:01 TestApp_type1
      690 pts/2    00:00:00 TestApp_Thread1
     5469 ?        00:00:00 TraceEa
     9961 ?        00:00:20 TraceEa
    31730 ?        00:00:00 lttng-sessiond
    31823 ?        00:00:00 lttng-consumerd
# pkill -STOP TestApp_type1
    [1]+  Stopped                 /home/test_apps/TestApp_type1 5555 np
# ps -e |egrep -i 'lttng|TestApp|trace'
      646 pts/2    00:00:02 TestApp_type1
      690 pts/2    00:00:00 TestApp_Thread1
     5469 ?        00:00:00 TraceEa
     9961 ?        00:00:20 TraceEa
    31730 ?        00:00:00 lttng-sessiond
    31823 ?        00:00:00 lttng-consumerd
# lttng create ses2
     Session ses2 created.
     Traces will be written in /root/lttng-traces/ses2-20121026-153239
# lttng enable-event com_ericsson_cba_trace_testapp_thread_main:OnePerSecA -u
     Command hang !!!
# ps -e |egrep -i 'lttng|TestApp|trace'
      646 pts/2    00:00:02 TestApp_type1
      690 pts/2    00:00:05 TestApp_Thread1
     1176 pts/2    00:00:00 lttng   <-------------------
     5469 ?        00:00:00 TraceEa
     9961 ?        00:00:20 TraceEa
    31730 ?        00:00:00 lttng-sessiond
    31823 ?        00:00:00 lttng-consumerd

# pkill -9 sessiond
# ps -e |egrep -i 'lttng|TestApp|trace'
      646 pts/2    00:00:02 TestApp_type1
      690 pts/2    00:00:08 TestApp_Thread1
     1233 ?        00:00:00 TraceEa
     9961 ?        00:00:20 TraceEa
    31823 ?        00:00:00 lttng-consumerd

# lttng-sessiond -vvv &
    DEBUG3: Creating LTTng run directory: /var/run/lttng [in create_lttng_rundir() at main.c:3525]
    DEBUG2: Kernel consumer err path: /var/run/lttng/kconsumerd/error [in main() at main.c:3773]
    DEBUG2: Kernel consumer cmd path: /var/run/lttng/kconsumerd/command [in main() at main.c:3775]
    DEBUG1: Client socket path /var/run/lttng/client-lttng-sessiond [in main() at main.c:3828]
    DEBUG1: Application socket path /var/run/lttng/apps-lttng-sessiond [in main() at main.c:3829]
    DEBUG1: LTTng run directory path: /var/run/lttng [in main() at main.c:3830]
    DEBUG2: UST consumer 32 bits err path: /var/run/lttng/ustconsumerd32/error [in main() at main.c:3839]
    DEBUG2: UST consumer 32 bits cmd path: /var/run/lttng/ustconsumerd32/command [in main() at main.c:3841]
    DEBUG2: UST consumer 64 bits err path: /var/run/lttng/ustconsumerd64/error [in main() at main.c:3850]
    DEBUG2: UST consumer 64 bits cmd path: /var/run/lttng/ustconsumerd64/command [in main() at main.c:3852]
    Error: Already running daemon.

    [3]-  Exit 1                  lttng-sessiond -vvv

# ps -e |egrep -i 'lttng|TestApp|trace'
      646 pts/2    00:00:02 TestApp_type1
      690 pts/2    00:00:08 TestApp_Thread1
     1233 ?        00:00:00 TraceEa
     9961 ?        00:00:20 TraceEa
    31823 ?        00:00:00 lttng-consumerd
# pkill -9 lttng-consumerd      (on the background, there is a process continously attempt to launch a new sessiond)
# ps -e |egrep -i 'lttng|TestApp|trace'
  646 pts/2    00:00:02 TestApp_type1
  690 pts/2    00:00:09 TestApp_Thread1
 1233 ?        00:00:00 TraceEa
 1775 ?        00:00:00 lttng-sessiond    (sessiond can finally be launched)
 9961 ?        00:00:20 TraceEa

Actions #1

Updated by David Goulet over 11 years ago

  • Status changed from New to Confirmed
  • Assignee set to David Goulet
  • Priority changed from High to Critical
  • Target version set to 2.1 stable

The session daemon is actually blocked on lttcomm_send_unix_sock() and can not be stopped using a SIGINT.

This is critical and should be fix quickly.

The only thing I can see here is to add a timeout to the send() call. I'll experiment a bit if we can detect this otherwise.

Thanks!

Actions #2

Updated by David Goulet over 11 years ago

Adding a timeout to recvmsg() and sendmsg() to the application socket can possibly have some "unknown" repercussions over time...

There is now way of detecting a SIGSTOP on the app socket so I wonder if we should either timeout or make the tracer sends a notification of a sigstop for which I'm not sure is that simple....?

Thoughts?

Actions #3

Updated by David Goulet over 11 years ago

So here is what I'm going to do.

For each application socket, I'll add a default timeout, that can be specified by a ENV var to the session daemon, and if the recvmsg() do timeout, the command will be aborted but the application will still be registered.

For now, it's the only possible solution we ended up with for 2.1-stable.

Actions #4

Updated by Tan le tran over 11 years ago

Just to clarify:
1)_ does it mean that after the timeout, the "hanging" app will still be registered but it will not participate in that session and therefore there should be no <pid> directory under this session directory for that hanging app, right ?

2)_ Also, let's say the above hanging app comes out from its hanging state, if a new session is created, will it be able to participate then ?
Actions #5

Updated by Tan le tran over 11 years ago

Hmmm, somehow not all my comments got submitted .

2)_ Also, let's say the above hanging app comes out of its hanging state, if another session got created, will that app be able to participate at this point ?

Actions #6

Updated by David Goulet over 11 years ago

Just to clarify:
1)_ does it mean that after the timeout, the "hanging" app will still be registered but it will not participate in that session and therefore there should be no <pid> directory under this session directory for that hanging app, right ?

I just spoke with Mathieu about this issue and there is actually more
problems.

When sending a command to a stopped application, the sendmsg() succeeds
but the recvmsg() times out so the command is aborted but the
application stays registered (on the session daemon side). However, once
the app. comes back to life, the command previously sent is handled and
a reply is sent back to the session daemon which is not waiting for
this (and for now is creating an infinite loop in the epoll thread).

And to be honest, we don't have a way to fix that without changing a
large part of the application and command subsystems.

So the conclusion was to unregister the application if detected so the
session daemon can at least continue to operate.

Now the question is can we do that on the trace side or do we have to
dot it on the session daemon.

We will sleep the night on this one and figure out the best way to
handle the issue here.

Thanks!
David

Actions #7

Updated by Tan le tran over 11 years ago

Hi David,

It looks like we have a couple of things to consider now:

1) How can we bring to the user attention that the app has been
inresponsive during event enabling ?
I am guessing that if it fail to reply during the event enabling, it will
not participate in the sesson and therefore it has no PID directory under
the session directory. But is there another better way ?

2) How to perform the "unregistration" of the app .
(Can we assume that "unregister" means the app is no longer traceable ?)

3) How to register the app again (without restarting the app of course).

Regards,
Tan

Actions #8

Updated by David Goulet over 11 years ago

So here are the two scenarios we can envision. Either keep the app registered or unregister it. Here is what we'll have to do in terms of dev.

Keep app:
  • We have to make significant changes to the command and application sub systems to handle this possible asynchronous command.
  • Detect the stopped state by a socket timeout so other subsequent commands will be queued (not trivial).
  • Flag the app "stopped" in the application hash table which will also tell the user on "lttng list" (new flag, new state).
Unregister app:
  • We have to notify the app thread of this possible error so the thread unregisters the app. It is crucial for synchronization.
  • Some minor changes to libustctl to send back the errno code on recv/send error so the sessiond is able to know what went wrong.
  • App is no longer traceable and needs to re-register again. (This should probably be done automatically just after the libust detects that the previous sockets died and that it should return in a wait mode).

Also, the SIGSTOP signal can't be caught by the application (see signal(7)) so this needs to be handled on the session daemon side.

Btw, this also affects 2.0.x-stable.

Thanks!

Actions #9

Updated by David Goulet over 11 years ago

UST and tools now got a fixed that shutdown the socket when a socket failure is detected making the session daemon automatically delete the application.

The branch 'bug386' of my personal repository adds the timeout so can you please test it and see if it fixes your use case(s)? On my side, we have the desired effect which is a timeout on the recvmsg() of the UST command on the session daemon side followed by the application unregistering.

http://git.dorsal.polymtl.ca/~dgoulet?p=lttng-tools-next.git;a=shortlog;h=refs/heads/bug386

Thanks!

Actions #10

Updated by David Goulet over 11 years ago

  • Status changed from Confirmed to Feedback
Actions #11

Updated by David Goulet over 11 years ago

I pushed this one upstream (commit: ae9e45b342636e7b42eafc15cd105bccfbbbe373).

Any issue you come across or show stopper, we'll revert it or try to fix it as quickly as possible.

Thanks!

Actions #12

Updated by David Goulet over 11 years ago

  • Status changed from Feedback to Resolved
  • % Done changed from 0 to 100
Actions

Also available in: Atom PDF