Project

General

Profile

Actions

Bug #428

closed

SessionD occasionally returns error for lttng_health_check (LTTNG_HEALTH_CMD)

Added by Tan le tran almost 12 years ago. Updated almost 12 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Target version:
Start date:
01/23/2013
Due date:
% Done:

100%

Estimated time:

Description

During our stability test, we have encountered a bunch of health
check failure. Most of them deal with LTTNG_HEALTH_CMD and
LTTNG_HEALTH_APP_MANAGE . The healthcheck failure occurs averagely
once every 10 min from a 12hrs test duration.

We have used "sessionD -vvv" to collect more data for the investigation.

Here is a small sumary of what we have observed from the log attached:

17:06:10 invoking lttng_start_tracing (session STAB002_2)
17:06:11 Reply recieved for lttng_start_tracing for session STAB002_2
17:06:16 Around this time, we launch 10 new instances of the
instrumented app: TestApp_Mini1. Each instance has only
5 sec life time.
:
17:06:24 The TestCase then tries to deactivate the session STAB002_2.
But before deactivation, we check whether sessionD
has that session or not. So list session is called first.

Invoke lttng_list_session

17:06:24 The TestCase then prepare to cancel the session.
Where the relayD of that session will be deleted
first then a lttng_destroy will be sent to sessionD later.
However, after killing relayD, TraceP is still waiting for
the list session above and therefore, not able to handle the
next command in the queue (ie: the cancel command).

So the impact in this sittuation is:
There are 10 apps quickly launched.
5 sec later, that 10 apps terminate.
List session is being called.
relayd is being killed.
Then somehow, TraceP does not detect any reply
from sessionD (for the list session).
:
17:06:48 SessionD healthcheck failed: LTTNG_HEALTH_CMD !
SessionD then got killed and restarted
17:06:48 TraceP then get the reply from sessionD regarding to
list trace event above:
Reply recieved from lttng_list_session with count 0
(count=0 ie: no session is seen by sessionD)
(Note that at this point, neither lttng_stop nor
lttng_destroy has been sent to sessionD)

We would like to know why the sessionD fail the healthcheck
(in this case, it seems to not reponding to the list session).

We have disabled the activity where 10 apps are launch in a burst
during our stability test and also disabled the "cancel" command
(ie: no relayd killing in midway).
Under this condition, we see only 3 healtch check failures within 12hrs run .

The log attached is the result of "tail -f our_log" and "tail -f sessiond_log"
being redirected to. Every second, we also insert a time stamp to help
identifying the sequence.


Files

combined_log_shortened.log (1.32 MB) combined_log_shortened.log Combined log from sessionD and our internal logs Tan le tran, 01/23/2013 03:10 PM
0001-Fix-remove-consumer-health-poll-update-on-startup.patch (1.9 KB) 0001-Fix-remove-consumer-health-poll-update-on-startup.patch David Goulet, 01/29/2013 12:33 PM

Related issues 1 (0 open1 closed)

Related to LTTng-tools - Bug #411: health check reporting should use TLSResolved12/13/2012

Actions
Actions

Also available in: Atom PDF