Project

General

Profile

Bug #324

LTTng does not trace after a cluster reboot

Added by Jesus Garcia almost 8 years ago. Updated almost 8 years ago.

Status:
Resolved
Priority:
High
Assignee:
Target version:
Start date:
08/15/2012
Due date:
% Done:

100%

Estimated time:

Description

After a cluster reboot we get no traces from lttng in some nodes. In order to isolate the fault and rule out our application and test apps, we reproduced the problem using the lttng CLI and the lttng demo application. This problem is not tied to any particular node. It can happen on any node after a cluster reboot.
I have attached two log files:
1. lttng_not_tracing_after_reboot_PL4.txt: This is for PL-4, the node where this time lttng fails to produce traces. We activated a session with all available events in the demo app. No files were produced.
2. lttng_tracing_ok_after_reboot_PL3.txt: This is for PL-3, where lttng was working normally and the same steps were followed to succesfully produce logs. This was done for comparison purposes.

In both cases the demo app was executed with LTTNG_UST_DEBUG=1.

lttng SW versions:
babeltrace 2ae35d4 (HEAD, tag: v1.0.0-rc4) Update version to 1.0.0-rc4
lttng-tools 6c12303 (HEAD, tag: v2.0.4, origin/stable-2.0) Update version to 2.0.4 stable
lttng-ust f8b9341 (HEAD, tag: v2.0.4) Update version to 2.0.4
userspace-rcu 3227f2c (HEAD, tag: v0.7.3) Update version to 0.7.3

NOTE: This build does not include the patch "Fix UST SIGPIPE handling".


Files

lttng_not_tracing_after_reboot_PL4.txt (22.2 KB) lttng_not_tracing_after_reboot_PL4.txt Jesus Garcia, 08/15/2012 02:31 PM
lttng_tracing_ok_after_reboot_PL3.txt (20.8 KB) lttng_tracing_ok_after_reboot_PL3.txt Jesus Garcia, 08/15/2012 02:31 PM
lttngoutput_pl3.txt (918 KB) lttngoutput_pl3.txt Jesus Garcia, 09/09/2012 10:55 PM
lttng_not_tracing_after_reboot_PL3_DemoApp_Sep7.txt (17.2 KB) lttng_not_tracing_after_reboot_PL3_DemoApp_Sep7.txt Jesus Garcia, 09/09/2012 10:55 PM
pthread_cond_patch.diff (5.15 KB) pthread_cond_patch.diff David Goulet, 09/18/2012 04:52 PM
#1

Updated by David Goulet almost 8 years ago

  • Status changed from New to Feedback
  • Assignee set to David Goulet

Hi,

We'll need the session daemon and consumer logs to pin point the problem.

Please use "lttng-sessiond -vvv --verbose-consumer".

Cheers!
David

#2

Updated by Jesus Garcia almost 8 years ago

Hi David,

Sorry for the late reply. We had some issues to reproduce the bug. Also, I run into problems when I use the "verbose-consumer" option (I get empty CTF files as opposed to no CTF files at all, which is what this bug is about). I also get errors from babeltrace. So, I generated a trace with verbose (-vvv), but without verbose-consumer. Please let me know if the provided info is sufficient to get to the bottom of the issue.

The 1st attachment is the actual output from sessiond and the 2nd attachment is the commands that were executed. At the top of the 1st file there may be output from when I was running our own test apps (not shown in the 2nd file), so to better correlate between the two files, start by looking at the end of the 1st file.

Regards,
Jesus

-----Original Message-----
From: [mailto:]
Sent: August-22-12 10:11 AM
Subject: [LTTng Tools - Bug #324] (Needs feedback) LTTng does not trace after a cluster reboot

Issue #324 has been updated by David Goulet.

Status changed from New to Needs feedback
Assignee set to David Goulet

Hi,

We'll need the session daemon and consumer logs to pin point the problem.

Please use "lttng-sessiond -vvv --verbose-consumer".

Cheers!
David
----------------------------------------
Bug #324: LTTng does not trace after a cluster reboot
https://bugs.lttng.org/issues/324#change-640

Author: Jesus Garcia

Priority: High

Category:

After a cluster reboot we get no traces from lttng in some nodes. In order to isolate the fault and rule out our application and test apps, we reproduced the problem using the lttng CLI and the lttng demo application. This problem is not tied to any particular node. It can happen on any node after a cluster reboot.
I have attached two log files:
1. lttng_not_tracing_after_reboot_PL4.txt: This is for PL-4, the node where this time lttng fails to produce traces. We activated a session with all available events in the demo app. No files were produced.
2. lttng_tracing_ok_after_reboot_PL3.txt: This is for PL-3, where lttng was working normally and the same steps were followed to succesfully produce logs. This was done for comparison purposes.

In both cases the demo app was executed with LTTNG_UST_DEBUG=1.

lttng SW versions:
babeltrace 2ae35d4 (HEAD, tag: v1.0.0-rc4) Update version to 1.0.0-rc4
lttng-tools 6c12303 (HEAD, tag: v2.0.4, origin/stable-2.0) Update version to 2.0.4 stable
lttng-ust f8b9341 (HEAD, tag: v2.0.4) Update version to 2.0.4
userspace-rcu 3227f2c (HEAD, tag: v0.7.3) Update version to 0.7.3

NOTE: This build does not include the patch "Fix UST SIGPIPE handling".

--
You have received this notification because you have either subscribed to it, or are involved in it.
To change your notification preferences, please click here: http://bugs.lttng.org/my/account

#3

Updated by David Goulet almost 8 years ago

  • Status changed from Feedback to In Progress

It seems that this is caused by a timeout on the semaphore wait in src/bin/lttng-sessiond/main.c +1582 which kills the consumer thread making the consumer unable to connect to the error socket and dying. However, the big issue here is that the timeout happened really quickly and not using the 30 seconds default in lttng-tools. (The debug output are too close for a 30 seconds wait).

Looking at the sem_timedwait() man page, ETIMEDOUT can be returned in the following case:

If  the  timeout  has  already expired by the time of the call, and the
semaphore could not be locked immediately, then sem_timedwait() fails
with a timeout error (errno set to ETIMEDOUT).

I'm just puzzled on how this could happen since the semaphore is created a few lines of code before in the same thread...

One possible explanation is that the clock_gettime() call went backwards in time (which is possible...) and by the time the semaphore wait call occurred, the time was set back forward in time thus expiring....... I have to say that this is highly unlikely but possible...

We'll have to investigate more this possibility and see for a fix or look for other possible explanation(s).

In the meantime, are you able to tell us if you have some kind of non standard libc that could mess up sem_timedwait() or/and clock_gettime() or any kernel patches that can affect clocks?

Thanks!

#4

Updated by David Goulet almost 8 years ago

After some talks on IRC in channel #lttng, using a non-monotonic clock source can produce the race we are witnessing here. However, the semaphore API does not provide a way to use a monotonic source only accepting an absolute EPOCH time value....

Mathieu Desnoyers find out that we can change the pthread_cond_timedwait clock source to use a monotonic one.

https://www.linuxquestions.org/questions/programming-9/clock_monotonic-and-sem_timedwait-pthread_mutex_timedlock-745769/

Using pthread_cond with this clock source replacing the sem_wait/post scheme can be done without too much code hence relying on a clock that can NOT go backwards.

I'll work on a patch for that in the coming days.

Thanks

#5

Updated by Jesus Garcia almost 8 years ago

Hi David,

Thanks for the info. I hope this can solve the issue. Just in case, to answer your question, we are using a standard libc library:

GNU C Library stable release version 2.11.3 (20110527)

Regards,
Jesus

-----Original Message-----
From: [mailto:]
Sent: September-17-12 5:19 PM
Subject: [LTTng Tools - Bug #324] LTTng does not trace after a cluster reboot

Issue #324 has been updated by David Goulet.

After some talks on IRC in channel #lttng, using a non-monotonic clock source can produce the race we are witnessing here. However, the semaphore API does not provide a way to use a monotonic source only accepting an absolute EPOCH time value....

Mathieu Desnoyers find out that we can change the pthread_cond_timedwait clock source to use a monotonic one.

https://www.linuxquestions.org/questions/programming-9/clock_monotonic-and-sem_timedwait-pthread_mutex_timedlock-745769/

Using pthread_cond with this clock source replacing the sem_wait/post scheme can be done without too much code hence relying on a clock that can NOT go backwards.

I'll work on a patch for that in the coming days.

Thanks
----------------------------------------
Bug #324: LTTng does not trace after a cluster reboot
https://bugs.lttng.org/issues/324#change-679

Author: Jesus Garcia

Priority: High

Category:

After a cluster reboot we get no traces from lttng in some nodes. In order to isolate the fault and rule out our application and test apps, we reproduced the problem using the lttng CLI and the lttng demo application. This problem is not tied to any particular node. It can happen on any node after a cluster reboot.
I have attached two log files:
1. lttng_not_tracing_after_reboot_PL4.txt: This is for PL-4, the node where this time lttng fails to produce traces. We activated a session with all available events in the demo app. No files were produced.
2. lttng_tracing_ok_after_reboot_PL3.txt: This is for PL-3, where lttng was working normally and the same steps were followed to succesfully produce logs. This was done for comparison purposes.

In both cases the demo app was executed with LTTNG_UST_DEBUG=1.

lttng SW versions:
babeltrace 2ae35d4 (HEAD, tag: v1.0.0-rc4) Update version to 1.0.0-rc4
lttng-tools 6c12303 (HEAD, tag: v2.0.4, origin/stable-2.0) Update version to 2.0.4 stable
lttng-ust f8b9341 (HEAD, tag: v2.0.4) Update version to 2.0.4
userspace-rcu 3227f2c (HEAD, tag: v0.7.3) Update version to 0.7.3

NOTE: This build does not include the patch "Fix UST SIGPIPE handling".

--
You have received this notification because you have either subscribed to it, or are involved in it.
To change your notification preferences, please click here: http://bugs.lttng.org/my/account

#6

Updated by David Goulet almost 8 years ago

I've added a patch here that should be reviewed before merging it upstream.

It works well and I've tested it by jumping forward my clock by an hour and everything goes smooth with the monotonic clock.

Of course, I realize that it's a bit difficult to test in production since this is quite the race! so I'll wait for some Acked-by before pushing it upstream.

In theory, this should not affect at all the behavior of the session daemon.

Thanks

#7

Updated by David Goulet almost 8 years ago

Latest patch is under review by the community:

http://lists.lttng.org/pipermail/lttng-dev/2012-September/018705.html

This will both be applied to 2.0.x and 2.1-stable.

#8

Updated by David Goulet almost 8 years ago

  • Status changed from In Progress to Resolved
  • % Done changed from 0 to 100
#9

Updated by David Goulet almost 8 years ago

This should resolve the issue we saw according to the logs.

If this same issue comes back, I'll open back the bug.

Thanks!

Also available in: Atom PDF