Bug #324 (Closed): LTTng does not trace after a cluster reboot
Added by Jesus Garcia over 12 years ago. Updated about 12 years ago.
100% done
Description
After a cluster reboot we get no traces from lttng in some nodes. In order to isolate the fault and rule out our application and test apps, we reproduced the problem using the lttng CLI and the lttng demo application. This problem is not tied to any particular node. It can happen on any node after a cluster reboot.
I have attached two log files:
1. lttng_not_tracing_after_reboot_PL4.txt: This is for PL-4, the node where this time lttng fails to produce traces. We activated a session with all available events in the demo app. No files were produced.
2. lttng_tracing_ok_after_reboot_PL3.txt: This is for PL-3, where lttng was working normally and the same steps were followed to successfully produce logs. This was done for comparison purposes.
In both cases the demo app was executed with LTTNG_UST_DEBUG=1.
lttng SW versions:
babeltrace 2ae35d4 (HEAD, tag: v1.0.0-rc4) Update version to 1.0.0-rc4
lttng-tools 6c12303 (HEAD, tag: v2.0.4, origin/stable-2.0) Update version to 2.0.4 stable
lttng-ust f8b9341 (HEAD, tag: v2.0.4) Update version to 2.0.4
userspace-rcu 3227f2c (HEAD, tag: v0.7.3) Update version to 0.7.3
NOTE: This build does not include the patch "Fix UST SIGPIPE handling".
Files
lttng_not_tracing_after_reboot_PL4.txt (22.2 KB) - Jesus Garcia, 08/15/2012 02:31 PM
lttng_tracing_ok_after_reboot_PL3.txt (20.8 KB) - Jesus Garcia, 08/15/2012 02:31 PM
lttngoutput_pl3.txt (918 KB) - Jesus Garcia, 09/09/2012 10:55 PM
lttng_not_tracing_after_reboot_PL3_DemoApp_Sep7.txt (17.2 KB) - Jesus Garcia, 09/09/2012 10:55 PM
pthread_cond_patch.diff (5.15 KB) - David Goulet, 09/18/2012 04:52 PM
Updated by David Goulet over 12 years ago
- Status changed from New to Feedback
- Assignee set to David Goulet
Hi,
We'll need the session daemon and consumer logs to pinpoint the problem.
Please use "lttng-sessiond -vvv --verbose-consumer".
Cheers!
David
Updated by Jesus Garcia over 12 years ago
- File lttngoutput_pl3.txt lttngoutput_pl3.txt added
- File lttng_not_tracing_after_reboot_PL3_DemoApp_Sep7.txt lttng_not_tracing_after_reboot_PL3_DemoApp_Sep7.txt added
Hi David,
Sorry for the late reply. We had some issues reproducing the bug. Also, I ran into problems when using the "--verbose-consumer" option (I get empty CTF files as opposed to no CTF files at all, which is what this bug is about), and I also get errors from babeltrace. So I generated a trace with verbose (-vvv), but without --verbose-consumer. Please let me know if the provided info is sufficient to get to the bottom of the issue.
The first attachment is the actual output from sessiond, and the second attachment lists the commands that were executed. At the top of the first file there may be output from when I was running our own test apps (not shown in the second file), so to better correlate the two files, start by looking at the end of the first file.
Regards,
Jesus
Updated by David Goulet over 12 years ago
- Status changed from Feedback to In Progress
It seems that this is caused by a timeout on the semaphore wait in src/bin/lttng-sessiond/main.c +1582, which kills the consumer management thread, leaving the consumer unable to connect to the error socket and causing it to die. However, the big issue here is that the timeout happened very quickly rather than after the 30-second default in lttng-tools (the debug outputs are too close together for a 30-second wait).
Looking at the sem_timedwait() man page, ETIMEDOUT can be returned in the following case:
If the timeout has already expired by the time of the call, and the semaphore could not be locked immediately, then sem_timedwait() fails with a timeout error (errno set to ETIMEDOUT).
I'm just puzzled about how this could happen, since the semaphore is created only a few lines earlier in the same thread...
One possible explanation is that the clock_gettime() call happened while the system clock had been set backwards (which is possible), and by the time the semaphore wait call occurred the clock had been stepped forward again, so the absolute timeout had already expired. I have to say that this is highly unlikely, but possible...
We'll have to investigate this possibility further and look into a fix, or consider other possible explanations.
In the meantime, can you tell us whether you have some kind of non-standard libc that could interfere with sem_timedwait() and/or clock_gettime(), or any kernel patches that could affect clocks?
Thanks!
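For reference, here is a minimal sketch of the fragile pattern being described (hypothetical names and timeout value, not the actual lttng-sessiond code): sem_timedwait() interprets its absolute deadline against CLOCK_REALTIME, so a step of the system clock between computing the deadline and waiting can expire the deadline almost immediately.

/*
 * Hedged sketch, not the actual lttng-sessiond code.
 */
#include <errno.h>
#include <semaphore.h>
#include <stdio.h>
#include <time.h>

#define CONSUMER_STARTUP_TIMEOUT_SEC 30  /* illustrative value */

static int wait_for_consumer_ready(sem_t *sem)
{
	struct timespec abstime;

	/* Absolute deadline computed from the wall clock. */
	if (clock_gettime(CLOCK_REALTIME, &abstime) < 0) {
		return -1;
	}
	abstime.tv_sec += CONSUMER_STARTUP_TIMEOUT_SEC;

	/*
	 * If the system clock is stepped forward here (e.g. an NTP sync right
	 * after a cluster reboot), the deadline may already be in the past and
	 * sem_timedwait() returns ETIMEDOUT right away instead of waiting.
	 */
	while (sem_timedwait(sem, &abstime) < 0) {
		if (errno == EINTR) {
			continue;  /* interrupted by a signal: retry */
		}
		if (errno == ETIMEDOUT) {
			fprintf(stderr, "timed out waiting for the consumer\n");
		}
		return -1;
	}
	return 0;
}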
Updated by David Goulet over 12 years ago
After some discussion on IRC in the #lttng channel, it turns out that using a non-monotonic clock source can produce the race we are witnessing here. However, the semaphore API does not provide a way to use a monotonic clock source; it only accepts an absolute epoch time value...
Mathieu Desnoyers found out that we can change the pthread_cond_timedwait() clock source to a monotonic one.
Replacing the sem_wait/post scheme with a pthread_cond using this clock source can be done without much code, and it then relies on a clock that can NOT go backwards.
I'll work on a patch for that in the coming days.
Thanks
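To illustrate the approach (a hedged sketch with assumed names, not the actual pthread_cond_patch.diff), a condition variable can be created with pthread_condattr_setclock() set to CLOCK_MONOTONIC, so that pthread_cond_timedwait() measures its deadline on a clock that cannot be stepped:

#include <errno.h>
#include <pthread.h>
#include <time.h>

struct consumer_wait {
	pthread_mutex_t lock;
	pthread_cond_t cond;
	int ready;  /* set once the consumer has connected */
};

static int consumer_wait_init(struct consumer_wait *w)
{
	pthread_condattr_t attr;
	int ret;

	pthread_mutex_init(&w->lock, NULL);
	pthread_condattr_init(&attr);
	/* Make pthread_cond_timedwait() measure its deadline on CLOCK_MONOTONIC. */
	ret = pthread_condattr_setclock(&attr, CLOCK_MONOTONIC);
	if (ret == 0) {
		ret = pthread_cond_init(&w->cond, &attr);
	}
	pthread_condattr_destroy(&attr);
	w->ready = 0;
	return ret;
}

/* Replacement for sem_timedwait(): wait up to timeout_sec on the monotonic clock. */
static int consumer_timedwait(struct consumer_wait *w, unsigned int timeout_sec)
{
	struct timespec abstime;
	int ret = 0;

	clock_gettime(CLOCK_MONOTONIC, &abstime);
	abstime.tv_sec += timeout_sec;

	pthread_mutex_lock(&w->lock);
	while (!w->ready && ret != ETIMEDOUT) {
		ret = pthread_cond_timedwait(&w->cond, &w->lock, &abstime);
	}
	pthread_mutex_unlock(&w->lock);
	return ret == ETIMEDOUT ? -1 : 0;
}

/* Replacement for sem_post(): called by the consumer thread once it is ready. */
static void consumer_signal_ready(struct consumer_wait *w)
{
	pthread_mutex_lock(&w->lock);
	w->ready = 1;
	pthread_cond_signal(&w->cond);
	pthread_mutex_unlock(&w->lock);
}

The key point is that a step of the wall clock no longer affects the computed deadline, which is exactly the property the sem_timedwait()-based scheme lacked.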
Updated by Jesus Garcia over 12 years ago
Hi David,
Thanks for the info. I hope this can solve the issue. Just in case, to answer your question, we are using a standard libc library:
GNU C Library stable release version 2.11.3 (20110527)
Regards,
Jesus
Updated by David Goulet over 12 years ago
- File pthread_cond_patch.diff pthread_cond_patch.diff added
I've attached a patch here that should be reviewed before being merged upstream.
It works well: I've tested it by jumping my clock forward by an hour, and everything goes smoothly with the monotonic clock.
Of course, I realize that it's a bit difficult to test in production since this is quite the race, so I'll wait for some Acked-by before pushing it upstream.
In theory, this should not affect the behavior of the session daemon at all.
Thanks
Updated by David Goulet about 12 years ago
The latest patch is under review by the community:
http://lists.lttng.org/pipermail/lttng-dev/2012-September/018705.html
This will be applied to both 2.0.x and 2.1-stable.
Updated by David Goulet about 12 years ago
- Status changed from In Progress to Resolved
- % Done changed from 0 to 100
Applied in changeset a23ec3a78becfaeca0041c4a6747d30b3ac44bfd.
Updated by David Goulet about 12 years ago
According to the logs, this should resolve the issue we saw.
If the same issue comes back, I'll reopen the bug.
Thanks!