Bug #428
Status: closed
SessionD occasionally returns error for lttng_health_check (LTTNG_HEALTH_CMD)
100%
Description
During our stability test, we encountered a number of health
check failures. Most of them involve LTTNG_HEALTH_CMD and
LTTNG_HEALTH_APP_MANAGE. A health check failure occurs on average
once every 10 minutes over a 12-hour test run.
We used "sessionD -vvv" to collect more data for the investigation.
Here is a short summary of what we observed in the attached log:
17:06:10  Invoke lttng_start_tracing (session STAB002_2).
17:06:11  Reply received for lttng_start_tracing for session STAB002_2.
17:06:16  Around this time, we launch 10 new instances of the
          instrumented app TestApp_Mini1. Each instance has a
          lifetime of only 5 seconds.
   :
17:06:24  The TestCase then tries to deactivate the session STAB002_2.
          Before deactivation, we check whether sessionD
          has that session, so list session is called first:
          invoke lttng_list_session.
17:06:24  The TestCase then prepares to cancel the session:
          the relayD of that session is killed first, then
          an lttng_destroy is to be sent to sessionD.
          However, after relayD is killed, TraceP is still waiting for
          the reply to the list session call above and is therefore
          unable to handle the next command in the queue
          (i.e. the cancel command).
          So the sequence in this situation is:
            10 apps are launched in quick succession.
            5 seconds later, those 10 apps terminate.
            List session is called.
            relayd is killed.
            Then, for some reason, TraceP does not receive any reply
            from sessionD (for the list session call).
   :
17:06:48  SessionD health check failed: LTTNG_HEALTH_CMD!
          SessionD is then killed and restarted.
17:06:48  TraceP then gets the reply from sessionD for the
          list session call above:
          Reply received from lttng_list_session with count 0
          (count=0, i.e. no session is seen by sessionD).
          (Note that at this point, neither lttng_stop nor
          lttng_destroy has been sent to sessionD.)
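To make the delays in the timeline above explicit, here is a small sketch that computes the gaps between the reported timestamps. The helper name seconds_between is only illustrative; the timestamps are the ones quoted in the summary.

```python
from datetime import datetime

def seconds_between(start, end, fmt="%H:%M:%S"):
    """Return the number of whole seconds from `start` to `end` (same day)."""
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds())

# Time from the app burst to the list session call.
burst_to_list = seconds_between("17:06:16", "17:06:24")

# Time TraceP waited for the list session reply, which only arrived
# once the health check had failed and sessionD was restarted.
list_to_reply = seconds_between("17:06:24", "17:06:48")

print(burst_to_list)   # 8
print(list_to_reply)   # 24
```

So the list session call stayed unanswered for about 24 seconds, and it was issued roughly 8 seconds after the burst, i.e. after the 5-second apps should already have exited.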
We would like to know why sessionD fails the health check
(in this case, it seems not to be responding to the list session call).
We have disabled the activity where 10 apps are launched in a burst
during our stability test and also disabled the "cancel" command
(i.e. relayd is no longer killed midway).
Under these conditions, we see only 3 health check failures in a 12-hour run.
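As a rough comparison of the two runs, the sketch below turns "once every 10 minutes over 12 hours" into a failure count and compares it with the 3 failures seen once the burst and the cancel were disabled. The first figure is derived from the stated average, not counted from the log.

```python
TEST_DURATION_MIN = 12 * 60      # 12-hour stability run
FAILURE_INTERVAL_MIN = 10        # "on average once every 10 minutes"

def expected_failures(duration_min, interval_min):
    """Failure count if one failure occurs every `interval_min` minutes."""
    return duration_min // interval_min

with_burst = expected_failures(TEST_DURATION_MIN, FAILURE_INTERVAL_MIN)
without_burst = 3                # observed with burst and cancel disabled

print(with_burst)                    # 72
print(with_burst / without_burst)    # 24.0
```

In other words, disabling the burst and the cancel reduces the failures by a factor of about 24, but does not eliminate them.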
The attached log is the combined output of "tail -f our_log" and
"tail -f sessiond_log" redirected to the same file. Every second, we also
insert a timestamp to help identify the sequence of events.
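For anyone who wants to reproduce a log in this shape, here is a minimal sketch of the timestamp-marker idea. It is not the script we used: the function name, the marker format and the sample lines are all made up for illustration, and it emits a marker only when a line arrives in a new second rather than on a fixed one-second timer.

```python
from datetime import datetime

def with_second_markers(entries):
    """Yield log lines, preceded by a '=== HH:MM:SS ===' marker each time
    the second changes. `entries` is an iterable of (datetime, line)."""
    last_stamp = None
    for when, line in entries:
        stamp = when.strftime("%H:%M:%S")
        if stamp != last_stamp:
            yield f"=== {stamp} ==="
            last_stamp = stamp
        yield line

# Hypothetical sample input: two lines in the same second, one in the next.
sample = [
    (datetime(2013, 1, 1, 17, 6, 10), "our_log: invoke start"),
    (datetime(2013, 1, 1, 17, 6, 10), "sessiond_log: start received"),
    (datetime(2013, 1, 1, 17, 6, 11), "our_log: reply received"),
]
for out_line in with_second_markers(sample):
    print(out_line)
```

With markers like these, lines coming from the two tail commands can be placed in time to within a second even when they carry no timestamp of their own.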
Files
Added by David Goulet almost 12 years ago
Fix: remove consumer health poll update on startup
With the TLS health state, the consumer thread has to register in order
to be validated during the health check, so the poll update workaround
is no longer needed and is replaced with a simple code update just after
the health registration of the thread.
This was reported after the TLS feature (ticket #411) was
implemented.
Fixes #428
Signed-off-by: David Goulet <dgoulet@efficios.com>
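The idea behind the fix can be illustrated with a simplified model. This is a sketch only and not the lttng-tools implementation: the class and method names are hypothetical, a plain dictionary stands in for the per-thread (TLS) state, and it assumes the health check works by comparing a progress counter against its value at the previous check.

```python
import threading

class HealthRegistry:
    """Toy model: only registered threads are validated by check()."""

    def __init__(self):
        self._lock = threading.Lock()
        # thread name -> [current counter, counter seen at last check]
        self._states = {}

    def register(self, name):
        with self._lock:
            self._states[name] = [0, 0]
        # Record progress right after registration, so the first check
        # does not see a freshly registered thread as stalled.
        self.code_update(name)

    def code_update(self, name):
        with self._lock:
            self._states[name][0] += 1

    def check(self):
        """Return the registered threads that made no progress since the
        previous call to check()."""
        stalled = []
        with self._lock:
            for name, state in self._states.items():
                current, last = state
                if current == last:
                    stalled.append(name)
                state[1] = current
        return stalled

registry = HealthRegistry()
registry.register("consumer")
print(registry.check())   # []
print(registry.check())   # ['consumer']
```

In this model a thread that never registers is never reported, and a registered thread passes its first check because of the code update done at registration. The second check reports it only because the toy thread never updates its counter again.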