Bug #536

closed

ConsumerD segfault when closing metadata

Added by David Goulet almost 11 years ago. Updated almost 11 years ago.

Status: Resolved
Priority: Critical
Assignee:
Target version:
Start date: 05/17/2013
Due date:
% Done: 100%
Estimated time:
Description

Hi David,

It now occurs on multiple nodes, and it also recurs about every 3-7 min.
We are still running the same test suite described at the top of this
bug report.

New commits used:
lttng-tools: c5854b1 (HEAD, origin/master, origin/HEAD) Fix: use memset instead of poll ...
+ Apply bug530 patch (from update #2)

lttng-ust : 352fce3 (HEAD, origin/master, origin/HEAD) Remove 0.x TODO
rcu : 56e676d (HEAD, origin/stable-0.7) Document: rculfhash destroy and resize...
babeltrace : 5bfcad9 (HEAD, origin/master, origin/HEAD) Fix: handling of empty streams

A new gdb printout is attached.
I quickly checked the other coredumps in gdb, and they all have very similar backtraces.

Please let us know if further information is needed.


Files

May17_Pacth1_applied_gdb_printout.log (29.7 KB), David Goulet, 05/17/2013 11:20 AM
bug536.patch (4.28 KB), David Goulet, 05/22/2013 01:02 PM

Actions #1

Updated by David Goulet almost 11 years ago

I've found why the segfault happens. I'll try to send you a patch based on master as soon as possible. Here is a short analysis.

When the endpoint of the trace hangs up or dies (e.g. lttng-relayd), the streams associated with it are flagged with an INACTIVE endpoint status. After that, every stream is validated and, in this case, the metadata stream gets deleted because the relayd is gone, even though the close metadata command from the session daemon has not yet been triggered.

So the stream is deleted, and when the close metadata command arrives, the stream is no longer available. The command still reaches the stream because we keep a reference to the metadata stream directly in the channel, and therefore access it without any RCU protection.
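
To illustrate the race described above, here is a minimal, single-threaded sketch in C. It is not the lttng-tools code and not the content of the attached patch: all names (`stream_table`, `validate_streams`, `close_metadata`, `simulate_relayd_hangup`) are hypothetical, and a plain array stands in for the RCU-protected stream hash table. It shows one possible way to avoid the dangling access: the channel stores the key of its metadata stream and looks it up when the close command arrives, instead of dereferencing a cached pointer.

```c
#include <assert.h>
#include <stdlib.h>

#define MAX_STREAMS 8

enum endpoint_status { ENDPOINT_ACTIVE, ENDPOINT_INACTIVE };

struct stream {
    int key;
    enum endpoint_status status;
};

/* Hypothetical stand-in for the consumer's stream hash table. */
static struct stream *stream_table[MAX_STREAMS];

struct channel {
    /* Assumption: keep a lookup key, not a raw pointer to the stream. */
    int metadata_stream_key;
};

static struct stream *stream_add(int key)
{
    struct stream *s;

    if (key < 0 || key >= MAX_STREAMS || stream_table[key]) {
        return NULL;
    }
    s = calloc(1, sizeof(*s));
    if (!s) {
        return NULL;
    }
    s->key = key;
    s->status = ENDPOINT_ACTIVE;
    stream_table[key] = s;
    return s;
}

static struct stream *stream_lookup(int key)
{
    if (key < 0 || key >= MAX_STREAMS) {
        return NULL;
    }
    return stream_table[key];
}

/* Validation pass: delete every stream whose endpoint is gone. */
static int validate_streams(void)
{
    int i, deleted = 0;

    for (i = 0; i < MAX_STREAMS; i++) {
        struct stream *s = stream_table[i];

        if (s && s->status == ENDPOINT_INACTIVE) {
            stream_table[i] = NULL;
            free(s);
            deleted++;
        }
    }
    return deleted;
}

/* Close command: returns 0 if closed here, -1 if the stream is already gone. */
static int close_metadata(const struct channel *chan)
{
    struct stream *s = stream_lookup(chan->metadata_stream_key);

    if (!s) {
        return -1; /* Deleted by the validation pass: nothing to touch. */
    }
    stream_table[s->key] = NULL;
    free(s);
    return 0;
}

/* Replays the sequence from the analysis: hang up, validate, then close. */
static int simulate_relayd_hangup(void)
{
    struct channel chan = { .metadata_stream_key = 3 };
    struct stream *meta = stream_add(chan.metadata_stream_key);

    assert(meta != NULL);
    meta->status = ENDPOINT_INACTIVE;  /* relayd died */
    validate_streams();                /* metadata stream is freed here */
    return close_metadata(&chan);      /* late close command */
}
```

Calling `simulate_relayd_hangup()` returns -1: the late close command finds no stream and returns cleanly. Had the channel cached the `struct stream *` itself, the same call would have dereferenced freed memory. In a multi-threaded consumer the lookup would additionally need to run inside an RCU read-side critical section (or under a lock), which this sketch leaves out.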

Actions #2

Updated by Tan le tran almost 11 years ago


This coredump can also be reproduced with the following scenario:
  - Create a session using streaming (with per-UID buffers)
  - Launch the instrumented app
  - Start the session
  - Sleep a few seconds
  - Kill the corresponding relayd
  - Result: a coredump with a very similar gdb printout.

Commits used:
userspace-rcu: 56e676d (HEAD, origin/stable-0.7) Document: rculfhash ...
lttng-ust    : 352fce3 (HEAD, origin/master, origin/HEAD) Remove 0.x TODO
lttng-tools  : b31398b (HEAD, origin/master, origin/HEAD) Fix: increment ...

 
Actions #3

Updated by David Goulet almost 11 years ago

Can you try this patch to see if this fixes the issue?

Thanks

Actions #4

Updated by Tan le tran almost 11 years ago

Hi David,

So far it looks very promising.
After 3 hours of our test run, there is no coredump yet :-)
We will let the traffic run overnight and see if anything else happens.
But for now, we can safely assume that this bug has been fixed by the patch above.

Thank You very much for your support,
Tan

Actions #5

Updated by David Goulet almost 11 years ago

  • % Done changed from 0 to 100
  • Status changed from Feedback to Resolved