Bug #367
closed
After "pkill -9 lttng-sessiond", unable to launch sessiond again unless all instrumented app exit
100%
Description
This might be the same as bug-365
Commit used:
============
userspace : (oct08) f94061a wfcqueue documentation: hint at for_each iterators
lttng-ust : (oct09) 1c7b4a9 Fix: memcpy of string is larger than source
lttng-tools: (oct02) 4dbc372 ABI with support for compat 32/64 bits
babeltrace : (oct02) 052606b Document that list.h is LGPLv2.1, but entirely trivial
Scenario:
=========
1)_ ./demo-trace 500 &
2)_ lttng create s1
3)_ lttng enable-event event ust_tests_demo2:loop -u
4)_ lttng start
5)_ pkill -9 lttng-sessiond
6)_ ps -ef |grep lttng
Still see: lttng-consumerd --quiet -u --consumerd-cmd-sock .....
7)_ Try to launch a new lttng-sessiond
lttng-sessiond -vvv &
get:
Error: Already running daemon
8)_ No new sessiond can be launched.
9)_ Once the instrumented app is killed or exits,
then a new sessiond can be launched.
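The "Already running daemon" error in step 7 is consistent with the explanation given further down in this thread: the socket is still considered "valid" as long as something can connect to it. The following is only a minimal sketch of that kind of liveness probe, not the actual lttng-sessiond code. The function names and the socket path are hypothetical, chosen for illustration.

```python
import os
import socket
import tempfile


def daemon_seems_alive(sock_path: str) -> bool:
    """Return True if a connect() to the Unix socket at sock_path succeeds."""
    probe = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        probe.connect(sock_path)
        return True
    except OSError:
        # Typically ENOENT (no socket file) or ECONNREFUSED
        # (socket file exists but no process holds the listening fd).
        return False
    finally:
        probe.close()


def probe_demo() -> list:
    """Probe a hypothetical socket path before, during and after listening."""
    results = []
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "client.sock")  # hypothetical path
        results.append(daemon_seems_alive(path))  # no socket file yet

        listener = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        listener.bind(path)
        listener.listen(1)
        results.append(daemon_seems_alive(path))  # a listening fd is open

        listener.close()  # last copy of the listening fd is gone
        results.append(daemon_seems_alive(path))  # file remains, nobody listens
    return results
```

The point of the sketch is that the probe cannot tell which process holds the listening descriptor. If any other process still has a copy of it open, the connect keeps succeeding even though the original daemon is gone.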
Updated by David Goulet about 12 years ago
- Status changed from New to Confirmed
- Assignee set to David Goulet
- Target version set to 2.1 stable
This is actually a VERY good catch! Took me a while to get it but here goes.
The client and application sockets are NOT set with CLOEXEC, so once we fork/exec the consumer daemon, it keeps a copy of those sockets, making them "valid" (meaning that we can connect). Until the consumerd is stopped/killed, the session daemon does not restart, thinking that another session daemon is alive because of those copies.
I will have to test this thoroughly since I'm not sure a new session daemon will behave well with an old consumer daemon waiting on an application to stop.
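For readers unfamiliar with the flag, here is a minimal sketch of marking a socket close-on-exec so that an exec'd child does not keep a copy of it. This is an illustration using the standard `fcntl` interface from Python, not the patch applied to lttng-tools. The helper names are hypothetical, and `os.set_inheritable(..., True)` is only there to mimic the C default, since Python creates non-inheritable descriptors by default.

```python
import fcntl
import os
import socket


def set_cloexec(fd: int) -> None:
    """Set FD_CLOEXEC on fd so it is closed automatically across exec()."""
    flags = fcntl.fcntl(fd, fcntl.F_GETFD)
    fcntl.fcntl(fd, fcntl.F_SETFD, flags | fcntl.FD_CLOEXEC)


def has_cloexec(fd: int) -> bool:
    """Return True if FD_CLOEXEC is currently set on fd."""
    return bool(fcntl.fcntl(fd, fcntl.F_GETFD) & fcntl.FD_CLOEXEC)


def cloexec_demo() -> tuple:
    """Show the flag before and after set_cloexec() on a fresh socket."""
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        # Assumption for illustration: start from the C-style default,
        # where a descriptor survives exec() unless told otherwise.
        os.set_inheritable(sock.fileno(), True)
        before = has_cloexec(sock.fileno())
        set_cloexec(sock.fileno())
        after = has_cloexec(sock.fileno())
        return before, after
    finally:
        sock.close()
```

In C, the same effect is usually obtained with `fcntl(fd, F_SETFD, FD_CLOEXEC)` right after creating the socket, or atomically with `SOCK_CLOEXEC` where the platform supports it.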
Updated by David Goulet about 12 years ago
I've cooked a patch that fixes this issue. So no more socket leaks between the two daemons.
However, I wanted to discuss a behavior of lttng-tools that this bugfix made me realize.
Consider a running tracing session and a kill -9 of the session daemon: the consumer data thread drops dead after a grace period (currently 2 seconds). Note that the metadata thread does not do that, which creates two possible scenarios that we may or may not want to fix:
1) Without any modification, as it stands now, if the session daemon dies badly, the tracing data is NOT recoverable for the running session, since the consumer simply stops consuming the data streams once the grace period expires after the session daemon dies. Furthermore, the consumer daemon stays alive because the metadata thread does not stop until the stream hangs up (if we choose this scenario, we'll need a quick fix for this).
2) Removing the grace period makes the consumer stay alive on its own and continue to extract data until the application is stopped or dies, which triggers a graceful quit of the consumer daemon.
So, what do we want here? On a fatal session daemon failure (kill -9), is every tracing session lost (and hence its data), or do we want a rogue consumer daemon continuing its work for every available stream? Please note that this brings some issues, especially with kernel tracing: if the session daemon goes away, there is no way to stop the session short of unloading the module by hand. The same goes for the UST tracer: either the application is stopped or the tracing continues ad vitam aeternam.
Updated by David Goulet about 12 years ago
For the record: with the new patch in lttng-ust that cleans up every stream associated with an lttng-sessiond socket, the consumer timeout grace period will be removed, giving the consumer time to extract the remaining tracing data and finally quit when the session daemon hangs up.
Best of both worlds.
Patch should be pushed in a jiffy.
Updated by David Goulet about 12 years ago
- Status changed from Confirmed to Resolved
- % Done changed from 0 to 100
Applied in changeset b662582bf448d2fad2f5990580771733a3b33d16.
Updated by Tan le tran about 12 years ago
Nice David,
I have re-run my test case and your patch proves to work nicely.
All the logs collected for the session (prior to the crash of sessionD) were convertible.
Thanks for your help,
Tan