Bug #536
ConsumerD segfault when closing metadata
Status: Closed, 100% done
Description
Hi David,
It now occurs on multiple nodes, roughly every 3-7 minutes. We are still running the same test suite described at the
top of this bug report.
New commits used:
lttng-tools: c5854b1 (HEAD, origin/master, origin/HEAD) Fix: use memset instead of poll ...
+ Apply bug530 patch (from update #2)
lttng-ust: 352fce3 (HEAD, origin/master, origin/HEAD) Remove 0.x TODO
rcu: 56e676d (HEAD, origin/stable-0.7) Document: rculfhash destroy and resize...
babeltrace: 5bfcad9 (HEAD, origin/master, origin/HEAD) Fix: handling of empty streams
New gdb_printout is attached.
I quickly checked the other coredumps with gdb, and they all have very similar backtraces.
Please let us know if further information is needed.
Updated by David Goulet over 11 years ago
I've found why the segfault happens. I'll try to send you a patch based on master as soon as possible. Here is a short analysis.
When the endpoint of the trace hangs up or dies (e.g. lttng-relayd), the streams associated with it are flagged with an INACTIVE endpoint status. After that, every stream is validated and, in this case, the metadata stream gets deleted because the relayd is gone, even though the session daemon has not yet sent the close metadata command.
So the stream is deleted, then the close metadata command arrives, but the stream no longer exists at that point. The command still reaches the freed stream because we keep a reference to the metadata stream directly in the channel, so it is accessed without any RCU protection.
Updated by Tan le tran over 11 years ago
This coredump can also be seen with the following scenario:
- Create a session using streaming (with per-UID buffers)
- Launch the instrumented app
- Start the session
- Sleep a few seconds
- Kill the corresponding relayd
- A coredump occurs with a very similar gdb printout
Commits used:
userspace-rcu: 56e676d (HEAD, origin/stable-0.7) Document: rculfhash ...
lttng-ust: 352fce3 (HEAD, origin/master, origin/HEAD) Remove 0.x TODO
lttng-tools: b31398b (HEAD, origin/master, origin/HEAD) Fix: increment ...
Updated by David Goulet over 11 years ago
- File bug536.patch added
- Status changed from Confirmed to Feedback
Can you try this patch to see if this fixes the issue?
Thanks
Updated by Tan le tran over 11 years ago
Hi David,
So far it looks very promising.
After 3hrs of our test run, no coredump yet :-)
We will let the traffic run overnight and see if anything else happens.
But for now, we can safely assume that this bug has been fixed by the patch above.
Thank you very much for your support,
Tan
Updated by David Goulet over 11 years ago
- % Done changed from 0 to 100
- Status changed from Feedback to Resolved
Applied in changeset 73811eccc9599ebf62e5f5bee49039cecc25c3eb.