Bug #428
closed
SessionD occasionally returns error for lttng_health_check (LTTNG_HEALTH_CMD)
100%
Description
During our stability test, we have encountered a number of health
check failures. Most of them involve LTTNG_HEALTH_CMD and
LTTNG_HEALTH_APP_MANAGE. The health check failure occurs on average
once every 10 minutes over a 12-hour test run.
We have used "sessionD -vvv" to collect more data for the investigation.
Here is a small summary of what we have observed from the attached log:
17:06:10 invoking lttng_start_tracing (session STAB002_2)
17:06:11 Reply received for lttng_start_tracing for session STAB002_2
17:06:16 Around this time, we launch 10 new instances of the
instrumented app: TestApp_Mini1. Each instance has only
a 5-second lifetime.
:
17:06:24 The TestCase then tries to deactivate the session STAB002_2.
But before deactivation, we check whether sessionD
has that session or not, so list session is called first.
Invoke lttng_list_session
17:06:24 The TestCase then prepares to cancel the session,
where the relayD of that session will be deleted
first and an lttng_destroy will be sent to sessionD later.
However, after killing relayD, TraceP is still waiting for
the list session above and is therefore not able to handle the
next command in the queue (i.e. the cancel command).
So the impact in this situation is:
There are 10 apps quickly launched.
5 seconds later, those 10 apps terminate.
List session is being called.
relayD is being killed.
Then, somehow, TraceP does not detect any reply
from sessionD (for the list session).
:
17:06:48 SessionD health check failed: LTTNG_HEALTH_CMD!
SessionD then got killed and restarted.
17:06:48 TraceP then gets the reply from sessionD for the
lttng_list_session call above:
Reply received from lttng_list_session with count 0
(count=0, i.e. no session is seen by sessionD)
(Note that at this point, neither lttng_stop nor
lttng_destroy has been sent to sessionD)
We would like to know why sessionD fails the health check
(in this case, it seems to not be responding to the list session).
We have disabled the activity where 10 apps are launched in a burst
during our stability test and also disabled the "cancel" command
(i.e. no relayD killing midway).
Under these conditions, we see only 3 health check failures within a 12-hour run.
The attached log is the result of redirecting "tail -f our_log" and
"tail -f sessiond_log". Every second, we also insert a timestamp to help
identify the sequence.
Files
Updated by David Goulet almost 12 years ago
- Status changed from New to Confirmed
- Target version set to 2.1 stable
Ok, so I found out that the health counter of LTTNG_HEALTH_CMD can be updated when an application registers, which happens in another thread that uses the health counter LTTNG_HEALTH_APP_MANAGE.
I think the real long term solution would be to address #411 instead of trying to move around this issue.
My guess is that those two counters are out of sync with what is really going on, so fixing that would most probably resolve your issue, because I can't find anything in the logs that would explain bad CMD health.
Updated by David Goulet almost 12 years ago
- Assignee set to David Goulet
I've just pushed the fix for bug #411.
I'm pretty sure this issue will not happen again. Please let me know and I'll close the bug if so.
Thanks!
Updated by Tan le tran almost 12 years ago
Unfortunately, we are tight at the moment for our release activities.
I hope you won't mind if I try this patch by next Monday (Jan 28) night.
Updated by David Goulet almost 12 years ago
Not at all.
It's upstream right now so take the time you need.
Updated by Mathieu Bain almost 12 years ago
Hi David, I just tested your patch. When I create a session (with lttng create my_session) and then enable some events, the consumerd health check (lttng_health_check(LTTNG_HEALTH_CONSUMER)) now fails all the time by returning a value different from 0.
I was testing your patch with:
lttng-tools commit 769b7d7ec9067a192a01a8d0c884256e9fa25165 : Fix: health subsystem issues with shared code
lttng-ust commit 235ebca232ef222c905550a8c787dd76c2e0f338 : Fix: missing test for lttng_ust_comm_should_quit in lttng-ust-comm.c
userspace-rcu commit da9bed2581279d3a333a7eb0ce3a395e0f59b3e1 : Version 0.7.6
Updated by David Goulet almost 12 years ago
- File 0001-Fix-remove-consumer-health-poll-update-on-startup.patch 0001-Fix-remove-consumer-health-poll-update-on-startup.patch added
This patch should fix the issue where the poll update work around was still being used for the consumer thread.
Can you tell me if this fixes your problem? If so, I'll merge it upstream and into stable-2.1.
Apply this patch on 14c18e9de7f95b4665280a338495488744832560 (currently HEAD of stable-2.1):
$ git am FILE
Thanks!
Updated by Mathieu Bain almost 12 years ago
It seems to solve the problem.
Thanks
Updated by David Goulet almost 12 years ago
- Status changed from Confirmed to Resolved
- % Done changed from 0 to 100
Applied in changeset a6eab824de8cf6d9c7ae0fc6aaef950d67bd756e.