Bug #433

SessionD occasionally returns error for lttng_health_check (LTTNG_HEALTH_CMD)

Added by Mathieu Bain almost 12 years ago. Updated over 11 years ago.

Status:
Resolved
Priority:
High
Assignee:
Target version:
Start date:
01/23/2013
Due date:
% Done:

0%

Estimated time:

Description

Previous bug #428 was not fixed. The submitted patch resolved the problem with the previous patch you submitted, but the bug itself is still there.

Using lttng-tools commit a6eab824de8cf6d9c7ae0fc6aaef950d67bd756e
lttng-ust commit 05356aa2a4dca0bc9bfd716d2d6723e3941851dc
userspace-rcu commit da9bed2581279d3a333a7eb0ce3a395e0f59b3e1

Thanks

===================================================================================

During our stability test, we have encountered a number of health
check failures. Most of them involve LTTNG_HEALTH_CMD and
LTTNG_HEALTH_APP_MANAGE. The health check failures occur on average
once every 10 minutes over a 12-hour test run.
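
For reference, a minimal sketch of the kind of monitor loop that produces these failures, using the lttng-tools 2.1 health API from lttng/lttng.h; the loop structure, error handling and poll interval are assumptions, not code from this ticket:

#include <stdio.h>
#include <unistd.h>
#include <lttng/lttng.h>

/* Hypothetical monitor loop: poll two sessiond subsystems through the
 * 2.1 health API. Any nonzero return is treated as a failure here,
 * which folds together "bad health" and "cannot reach the daemon". */
int main(void)
{
    for (;;) {
        if (lttng_health_check(LTTNG_HEALTH_CMD) != 0) {
            fprintf(stderr, "SessionD healthcheck failed: LTTNG_HEALTH_CMD!\n");
            return 1;
        }
        if (lttng_health_check(LTTNG_HEALTH_APP_MANAGE) != 0) {
            fprintf(stderr, "SessionD healthcheck failed: LTTNG_HEALTH_APP_MANAGE!\n");
            return 1;
        }
        sleep(1); /* poll interval is an assumption */
    }
}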

We have used "sessionD -vvv" to collect more data for the investigation.

Here is a small summary of what we have observed from the attached log:

17:06:10 invoking lttng_start_tracing (session STAB002_2)
17:06:11 Reply received for lttng_start_tracing for session STAB002_2
17:06:16 Around this time, we launch 10 new instances of the
         instrumented app TestApp_Mini1. Each instance has a
         lifetime of only 5 seconds.
:
17:06:24 The TestCase then tries to deactivate the session STAB002_2.
         Before deactivation, we check whether sessionD
         has that session or not, so list session is called first.

Invoke lttng_list_session

17:06:24 The TestCase then prepares to cancel the session,
         whereby the relayD of that session is killed
         first and an lttng_destroy is sent to sessionD later.
         However, after killing relayD, TraceP is still waiting for
         the list session reply above and is therefore unable to handle
         the next command in the queue (i.e. the cancel command).

So the impact in this situation is:
There are 10 apps quickly launched.
5 seconds later, those 10 apps terminate.
List session is being called.
relayD is being killed.
Then, somehow, TraceP does not detect any reply
from sessionD (for the list session).
:
17:06:48 SessionD health check failed: LTTNG_HEALTH_CMD!
         SessionD then got killed and restarted.
17:06:48 TraceP then gets the reply from sessionD regarding the
         list session above:
         Reply received from lttng_list_session with count 0
         (count=0, i.e. no session is seen by sessionD)
         (Note that at this point, neither lttng_stop nor
         lttng_destroy has been sent to sessionD.)

We would like to know why sessionD fails the health check
(in this case, it seems not to respond to the list session).

We have disabled the activity where 10 apps are launched in a burst
during our stability test and also disabled the "cancel" command
(i.e. no relayD killing midway).
Under these conditions, we see only 3 health check failures within a 12-hour run.

The attached log is the result of "tail -f our_log" and "tail -f sessiond_log"
redirected to a single file. Every second, we also insert a timestamp to help
identify the sequence.


Files

combined_log_shortened.log (1.32 MB): Combined log from sessionD and our internal logs. Mathieu Bain, 01/29/2013 04:00 PM
log.txt (46.3 KB): New logs. Mathieu Bain, 01/30/2013 02:04 PM
log_done.txt (3.6 MB): Complete logs. Mathieu Bain, 01/30/2013 02:17 PM
0001-Fix-add-health-poll-entry-exit-to-relayd-connect.patch (1.17 KB): David Goulet, 01/30/2013 04:39 PM
new_log.txt (855 KB): New logs from 2013-01-31. Mathieu Bain, 01/31/2013 03:17 PM
0001-Fix-add-health-state-update.patch (22.2 KB): David Goulet, 02/01/2013 10:40 AM
0001-Fix-health-check-now-validates-deadlock-and-lock-wai.patch (38.1 KB): David Goulet, 02/01/2013 03:55 PM
bug433_saflog_SC1_1736pm.log (124 KB): Log from our system. Tan le tran, 02/26/2013 04:25 PM
combined_APP_MANAGE_1736pm.log (1.69 MB): Printout of "lttng-sessiond -vvv". Tan le tran, 02/26/2013 04:25 PM
Actions #1

Updated by David Goulet almost 12 years ago

  • Status changed from New to Feedback

Are you sure you are running the right lttng-tools version?

The commit given is currently the HEAD of stable-2.1, but it does not fit the log you sent. Can you double check? This debug statement, for instance, is not on line 127 anymore. It is in 2.1.1 but not in the current HEAD of stable-2.1.

DEBUG1: Health state current 61644, last 61644, ret 0 [in health_check_state() at health.c:127]

Furthermore, we can see one of the health checks looking at three different counters, which is the consumer check testing all three possible consumers (ust32, ust64 and kernel).

DEBUG1: Receiving data from client for health... [in thread_manage_health() at main.c:3041]
DEBUG1: Health state current 1, last 1, ret 0 [in health_check_state() at health.c:127]
DEBUG1: Health state current 1, last 1, ret 0 [in health_check_state() at health.c:127]
DEBUG1: Health state current 15, last 15, ret 0 [in health_check_state() at health.c:127]
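
For reference, the check behind these lines compares a per-thread progress counter against the value sampled at the previous check; a thread is only flagged bad if the counter did not move and the thread is not parked in poll(). A simplified sketch, with illustrative names rather than the actual health.c code:

#include <stdbool.h>

/* Illustrative per-thread health state; the real implementation keeps
 * these counters in TLS. */
struct health_state {
    unsigned long current; /* bumped at each progress point */
    unsigned long last;    /* value sampled at the previous check */
    bool in_poll;          /* set between poll entry/exit markers */
};

/* A thread is considered healthy if it made progress since the last
 * sample, or if it is legitimately blocked in poll(). */
static bool health_ok(struct health_state *s)
{
    bool ok = s->in_poll || s->current != s->last;

    s->last = s->current; /* remember the sample for the next check */
    return ok;
}

int main(void)
{
    struct health_state s = { .current = 1, .last = 0, .in_poll = false };
    return health_ok(&s) ? 0 : 1; /* progress was made: healthy */
}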

This is definitely a log NOT from git head with the health TLS patch.

Thanks!

Actions #2

Updated by David Goulet almost 12 years ago

  • % Done changed from 100 to 0
Actions #3

Updated by Mathieu Bain almost 12 years ago

The logs attached to this bug are in fact the same as the ones attached to the previous bug (#428). I duplicated the report because bug #428 was closed a bit too promptly. The second patch you provided corrected the issue I had with the first patch, but it did not solve the bug.

I can get new logs now with the patches applied on this commit and send them.

Actions #4

Updated by David Goulet almost 12 years ago

Yes, new logs would be quite important here. The health system has
changed considerably, now using TLS health counters.

Thanks

Actions #5

Updated by Mathieu Bain almost 12 years ago

Hi David,

Here are new logs.

Actions #6

Updated by David Goulet almost 12 years ago

Why is this log so small?

In order to understand the health problem, I have to know what happened before in the command subsystem, especially whether the thread managing it quit at some point and why. My bet is that it did not, but I have to see the last command that triggered this health counter.

Actions #7

Updated by Mathieu Bain almost 12 years ago

I didn't attach the entire log because it is around 35,000 lines, so I cut it down to the part where the failure happened. Since you need more, I have attached the entire log file.

Actions #8

Updated by David Goulet almost 12 years ago

  • Status changed from Feedback to Confirmed

Thanks Mathieu! I'll flag this back to Confirmed since it is clearly happening in your use case.

Actions #9

Updated by David Goulet almost 12 years ago

It appears that the "lttng start trace" for session STAB003_a_1 never finished and, looking at the logs, relayd_connect() seems to be taking a lot of time.

Can you decipher these logs for me?

4850 13:56:09 01/30/2013 IN TraceP_SC-2 "DBG: activateSession: Reply recieved for lttng_start_tracing for session STAB003_a_1.
4851 13:56:09 01/30/2013 IN TraceP_SC-2 "DBG: Return for invocation -208 value: 1, return code: 1 , SA_AIS_OK
4852 13:56:09 01/30/2013 IN TraceP_SC-2 "Activation completed successfully for session traceSessionRecordingId=STAB003_a_1,traceMId=1
4853 13:56:09 01/30/2013 IN TraceP_SC-2 "DBG: !!! Jobs took excessively long to complete: 3244706us !!!
4854 13:56:09 01/30/2013 IN TraceP_PL-3 "DBG: activateSession: Reply recieved for lttng_start_tracing for session STAB003_a_1.
4855 13:56:09 01/30/2013 IN TraceP_PL-3 "DBG: Return for invocation -205 value: 1, return code: 1 , SA_AIS_OK
4856 13:56:09 01/30/2013 IN TraceP_PL-3 "Activation completed successfully for session traceSessionRecordingId=STAB003_a_1,traceMId=1
4857 13:56:09 01/30/2013 IN TraceP_PL-3 "DBG: !!! Jobs took excessively long to complete: 3249539us !!!
4858 13:56:09 01/30/2013 IN TraceC "receiveTracePActivateResult result: 1 op_result 1
4859 13:56:09 01/30/2013 NO TraceC "Session STAB003_a_1 activated.
4860 13:56:09 01/30/2013 IN TraceC "receiveTracePActivateResult result: 1 op_result 1
4861 13:56:09 01/30/2013 IN TraceP_PL-4 "DBG: activateSession: Reply recieved for lttng_start_tracing for session STAB003_a_1.
4862 13:56:09 01/30/2013 IN TraceP_PL-4 "DBG: Return for invocation -206 value: 1, return code: 1 , SA_AIS_OK
4863 13:56:09 01/30/2013 IN TraceP_PL-4 "Activation completed successfully for session traceSessionRecordingId=STAB003_a_1,traceMId=1
4864 13:56:09 01/30/2013 IN TraceP_PL-4 "DBG: !!! Jobs took excessively long to complete: 3250196us !!!

Does this mean that lttng_start_tracing() returned? The logs from the session daemon suggest otherwise: the session daemon is waiting to connect to the relayd, which would explain why there are no more client command logs after that.

Actions #10

Updated by David Goulet almost 12 years ago

Can you try this? It should help. It puts the connect() to the relayd in a poll execution state for the health check.
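
As a sketch of the pattern the patch title describes, with stubbed helpers since the real functions live in the sessiond sources:

#include <stdio.h>

/* Stubs standing in for lttng-tools' health helpers and for the
 * relayd connection call. */
static void health_poll_entry(void) { puts("health: poll entry"); }
static void health_poll_exit(void) { puts("health: poll exit"); }
static int relayd_connect(void) { return 0; /* stand-in for the blocking connect() */ }

int main(void)
{
    int ret;

    /* Bracket the potentially long blocking connect so the health
     * checker treats the thread as legitimately waiting rather than
     * stuck. */
    health_poll_entry();
    ret = relayd_connect();
    health_poll_exit();

    return ret;
}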

Actions #11

Updated by Mathieu Bain almost 12 years ago

Hi David. About our logs: those are just debug statements from the other nodes. As the problem happens with sessionD on SC-1, those statements do not mean anything; the important ones are those concerning TraceP_SC-1. The activation seems to be completed (so lttng_start_tracing is responding), but the problem seems to happen with lttng_list_sessions, because of this line:

Line 34650 : 5163 13:56:46 01/30/2013 ER TraceP_SC-1 "Error: unable to list session from SessionD while checking for session. Error: -19

I am currently testing the patch. I still see the problem, so I am running again to get new logs for you.

Actions #12

Updated by Mathieu Bain almost 12 years ago

Here are some new logs

Actions #13

Updated by David Goulet almost 12 years ago

  • Priority changed from Normal to High

Ok, so in this case everything is still fine, but the command triggering the bad health state does so because the session list lock is taken by the application registration update process, and it appears to take more than 20 seconds until the lock is released.

I'll provide you with a patch for that and will try to find more possible blocking calls in this thread; since you've hit at least two so far, there may be more.
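
The same bracketing idea applies to a contended lock; a hypothetical sketch, with a pthread mutex standing in for the session list lock:

#include <pthread.h>

static pthread_mutex_t session_list_lock = PTHREAD_MUTEX_INITIALIZER;

/* Stubs for the health helpers (see the sketch in update #10). */
static void health_poll_entry(void) { /* mark thread as waiting */ }
static void health_poll_exit(void) { /* resume progress accounting */ }

/* Hypothetical illustration: if app registration holds the session
 * list lock for 20+ seconds, the client command thread blocks here;
 * wrapping the wait keeps the checker from reporting a false BAD
 * state. */
static void lock_session_list(void)
{
    health_poll_entry();
    pthread_mutex_lock(&session_list_lock);
    health_poll_exit();
}

int main(void)
{
    lock_session_list();
    pthread_mutex_unlock(&session_list_lock);
    return 0;
}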

Thanks!

Actions #14

Updated by David Goulet almost 12 years ago

This should help. I've added a bunch of health code updates throughout the client command subsystem, and some poll entry/exit markers at potentially blocking call sites such as the recv() from the consumer and the session and session list locking.

This should at least fix the issues observed in the attached logs. Apply this patch on top of stable-2.1 HEAD.

I hope to have covered every possible long blocking call that could trigger the BAD health state even though the daemon is not actually stuck.

Let me know! Thanks.

Actions #15

Updated by David Goulet almost 12 years ago

An improved version is coming up from Mathieu. I'll attach the patch to this ticket in a jiffy. Stay tuned! ;)

Actions #16

Updated by David Goulet almost 12 years ago

Ok Mathieu, can you try that on stable-2.1 HEAD and tell us if it helps/fixes the problem?

Thanks!

Actions #17

Updated by Tan le tran almost 12 years ago

Hi David,

Due to the workload we had on our side, we could not continue working
on this bug until yesterday.

Unfortunately, the problems are still there (at a lower frequency than before: maybe 1 health check failure per 15 minutes instead of 1 per 10 minutes).
The failures we have seen so far are LTTNG_HEALTH_APP_MANAGE and LTTNG_HEALTH_CMD.

The scenario is still the same as described in this bug report.

I have not tried to combine all the logs into a single file like Mathieu did, but I hope the data collected are still analyzable.

The commits used this time were:
===================================
lttng-ust:    d480871 (HEAD, origin/stable-2.1) Fix package: don't distribute generated headers
lttng-tools:  ed22248 (HEAD, temp_bug429_bug433_patches) Apply patches for bug429 and 433
              b325dc7 (origin/stable-2.1) Fix: put session list lock around the app registration
userspace-rcu: da9bed2 (HEAD, tag: v0.7.6) Version 0.7.6

The general overview of the log:
================================
User2:   Mon Feb 25 17:35:58 EST 2013: Activate Profile again (2nd) 
User2:   Mon Feb 25 17:36:04 EST 2013: 2nd Activate completed, sleep for 10 sec. 
                   ~17:36:06           launch 20 TestApp_Mini (each has lifetime = 5 sec) 
User2:   Mon Feb 25 17:36:07 EST 2013: Deactivate Profile 
User2:   Mon Feb 25 17:36:07 EST 2013: Cancel during DEACTIVATE   (invoke lttng_list_session)
User1:   Mon Feb 25 17:36:08 EST 2013: Deactivate ...(invoke lttng_list_session)

         Feb 25 17:36:41 SC-1 TraceEa: SessionD healthcheck failed: LTTNG_HEALTH_APP_MANAGE !
         Feb 25 17:36:42 SC-1 TraceEa: sessiond faulted - Trying to Stop LTTNG
         Feb 25 17:36:42 SC-1 logger: lttng-sessiond is stopped

Enclosed are 2 logs:
====================
"bug433_saflog_SC1_1736pm.log"  containing the log from our system.
                                "TraceP_SC-1" is the interface towards sessionD.
                                All api call/reply to and from this process are logged.

"combined_APP_MANAGE_1736pm.log" contains the output from "lttng-sessiond -vvv" .
                                 A timestamp is inserted at every second.
                                 It shows that the health check was still good at 17:36:36.

Please let us know if further info is needed.

Actions #18

Updated by Tan le tran almost 12 years ago

Oops,

I forgot to indicate the patches I used:

Patch from Update #10 of bug-433
Patch from Update #3 of bug-429
Actions #19

Updated by Mathieu Desnoyers over 11 years ago

  • Status changed from Confirmed to Feedback

Having a backtrace of all sessiond threads when the health issue is detected would be helpful here: it would let us know exactly what the sessiond was doing that took so long (beyond what the health state report tells us).
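
For instance, assuming gdb is available on the target, running gdb -p <sessiond pid> -batch -ex "thread apply all bt" at the moment the monitor detects the failure would capture such a backtrace.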

Thanks,

Mathieu

Actions #20

Updated by David Goulet over 11 years ago

Not enough info; this has been 3 months in "Needs Feedback".

Please reopen if you can reproduce it.

Actions #21

Updated by David Goulet over 11 years ago

  • Status changed from Feedback to Resolved