Project

General

Profile

Actions

Bug #433

closed

SessionD occasionally returns error for lttng_health_check (LTTNG_HEALTH_CMD)

Added by Mathieu Bain over 11 years ago. Updated over 10 years ago.

Status:
Resolved
Priority:
High
Assignee:
Target version:
Start date:
01/23/2013
Due date:
% Done:

0%

Estimated time:

Description

Previous bug #428 was not fix. The submitted patch resolved the problem concerning the previous patch you submitted but the bug is still there.

Using lttng-tools commit a6eab824de8cf6d9c7ae0fc6aaef950d67bd756e
lttng-ust commit 05356aa2a4dca0bc9bfd716d2d6723e3941851dc
userspace-rcu commit da9bed2581279d3a333a7eb0ce3a395e0f59b3e1

Thanks

===================================================================================

During our stability test, we have encountered a bunch of health
check failure. Most of them deal with LTTNG_HEALTH_CMD and
LTTNG_HEALTH_APP_MANAGE . The healthcheck failure occurs averagely
once every 10 min from a 12hrs test duration.

We have used "sessionD -vvv" to collect more data for the investigation.

Here is a small sumary of what we have observed from the log attached:

17:06:10 invoking lttng_start_tracing (session STAB002_2)
17:06:11 Reply recieved for lttng_start_tracing for session STAB002_2
17:06:16 Around this time, we launch 10 new instances of the
instrumented app: TestApp_Mini1. Each instance has only
5 sec life time.
:
17:06:24 The TestCase then tries to deactivate the session STAB002_2.
But before deactivation, we check whether sessionD
has that session or not. So list session is called first.

Invoke lttng_list_session

17:06:24 The TestCase then prepare to cancel the session.
Where the relayD of that session will be deleted
first then a lttng_destroy will be sent to sessionD later.
However, after killing relayD, TraceP is still waiting for
the list session above and therefore, not able to handle the
next command in the queue (ie: the cancel command).

So the impact in this sittuation is:
There are 10 apps quickly launched.
5 sec later, that 10 apps terminate.
List session is being called.
relayd is being killed.
Then somehow, TraceP does not detect any reply
from sessionD (for the list session).
:
17:06:48 SessionD healthcheck failed: LTTNG_HEALTH_CMD !
SessionD then got killed and restarted
17:06:48 TraceP then get the reply from sessionD regarding to
list trace event above:
Reply recieved from lttng_list_session with count 0
(count=0 ie: no session is seen by sessionD)
(Note that at this point, neither lttng_stop nor
lttng_destroy has been sent to sessionD)

We would like to know why the sessionD fail the healthcheck
(in this case, it seems to not reponding to the list session).

We have disabled the activity where 10 apps are launch in a burst
during our stability test and also disabled the "cancel" command
(ie: no relayd killing in midway).
Under this condition, we see only 3 healtch check failures within 12hrs run .

The log attached is the result of "tail -f our_log" and "tail -f sessiond_log"
being redirected to. Every second, we also insert a time stamp to help
identifying the sequence.


Files

combined_log_shortened.log (1.32 MB) combined_log_shortened.log Combined log from sessionD and our internal logs Mathieu Bain, 01/29/2013 04:00 PM
log.txt (46.3 KB) log.txt new logs Mathieu Bain, 01/30/2013 02:04 PM
log_done.txt (3.6 MB) log_done.txt complete logs Mathieu Bain, 01/30/2013 02:17 PM
0001-Fix-add-health-poll-entry-exit-to-relayd-connect.patch (1.17 KB) 0001-Fix-add-health-poll-entry-exit-to-relayd-connect.patch David Goulet, 01/30/2013 04:39 PM
new_log.txt (855 KB) new_log.txt new logs from 2012-01-31 Mathieu Bain, 01/31/2013 03:17 PM
0001-Fix-add-health-state-update.patch (22.2 KB) 0001-Fix-add-health-state-update.patch David Goulet, 02/01/2013 10:40 AM
0001-Fix-health-check-now-validates-deadlock-and-lock-wai.patch (38.1 KB) 0001-Fix-health-check-now-validates-deadlock-and-lock-wai.patch David Goulet, 02/01/2013 03:55 PM
bug433_saflog_SC1_1736pm.log (124 KB) bug433_saflog_SC1_1736pm.log Log from our system Tan le tran, 02/26/2013 04:25 PM
combined_APP_MANAGE_1736pm.log (1.69 MB) combined_APP_MANAGE_1736pm.log printout of "lttng-sessiond -vvv" Tan le tran, 02/26/2013 04:25 PM
Actions

Also available in: Atom PDF