Bug #429
closed
lttng-consumerd hits a segmentation fault
Added by Mathieu Bain almost 12 years ago.
Updated over 11 years ago.
Description
While running some tests, we hit a segmentation fault in lttng-consumerd.
The core file points to line 338 in relayd.c, in the function relayd_close().
So far I have not been able to reproduce it, but when it happened (we are working on clusters) it hit all 4 nodes at the same time.
So we got 4 core dumps from the segmentation fault.
The problem with the core file is that the code was compiled with optimizations, but I have attached the backtrace; maybe you will be able to find something.
Thanks, Mathieu
lttng-tools version is 2.1.1 stable
- Status changed from New to Confirmed
- Assignee set to David Goulet
- Target version set to 2.1 stable
So it appears that the socket operations function pointer is invalid (NULL or a bad address) in this case, which causes the segfault.
I found one code path that could trigger that, but it does not fit the one I see in the coredump, which is a relayd_close() inside a call_rcu... If a relayd socket is in the relayd hash table, it means the sockets were all successfully allocated and initialized. Seeing an invalid operations pointer is therefore "not possible", because for a call_rcu to be triggered, the relayd object has to be in the hash table, meaning it was initialized correctly.
Anyway, I'll push a fix to make sure we have a valid operations function pointer for the socket when calling relayd_close(), because I see at least one code path that needs it. Hopefully, avoiding the dereference of the operations pointer will keep the problem you've been having from triggering again.
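To illustrate the code path being discussed, here is a rough sketch of how an object removed from an RCU hash table gets torn down through call_rcu with liburcu. The relayd struct and function names below are simplified stand-ins, not the actual lttng-tools code:

#include <urcu.h>
#include <urcu/rculfhash.h>
#include <urcu/compiler.h>
#include <stdint.h>
#include <stdlib.h>

/* Simplified stand-in for the consumer's relayd object. */
struct relayd {
        uint64_t id;
        struct cds_lfht_node node;      /* hash table linkage */
        struct rcu_head rcu_node;       /* deferred-free handle */
};

/*
 * Deferred destruction: this runs from the call_rcu worker after a grace
 * period, which is where the crashing relayd_close() was observed.
 */
static void destroy_relayd_rcu(struct rcu_head *head)
{
        struct relayd *relayd = caa_container_of(head, struct relayd, rcu_node);

        /* relayd_close() on the object's sockets would happen here. */
        free(relayd);
}

static void remove_relayd(struct cds_lfht *ht, struct relayd *relayd)
{
        rcu_read_lock();
        /*
         * Only objects that were fully initialized and added to the hash
         * table can reach this point, which is why an invalid ops pointer
         * in the deferred path is surprising.
         */
        cds_lfht_del(ht, &relayd->node);
        rcu_read_unlock();
        call_rcu(&relayd->rcu_node, destroy_relayd_rcu);
}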
Btw, again, if you can provide coredumps from a build without optimization, that would help a lot, because we can never be sure of the root cause when things are optimized out.
Thanks!
I've attached a patch to the bug that improves relayd_close() and should fix the invalid ops pointer.
This needs review because it adds a default fallback to close(3) when no ops are found, which can be debated. Also, if the fd is not valid, it will still return success.
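For reference, here is a minimal sketch of the kind of defensive check described above, assuming a simplified socket object; the struct layout and names are illustrative, not the actual lttng-tools patch:

#include <unistd.h>
#include <stdio.h>

/* Hypothetical, simplified view of the relayd socket object. */
struct lttcomm_sock_ops {
        int (*close)(int fd);
};

struct relayd_sock {
        int fd;
        struct lttcomm_sock_ops *ops;
};

/*
 * Close the relayd socket defensively: if the operations pointer is NULL
 * (or was never set), fall back to a plain close(3) instead of
 * dereferencing it. As noted above, this still returns 0 even when the
 * fd itself is invalid, which is part of what needs review.
 */
static int relayd_close_sketch(struct relayd_sock *sock)
{
        int ret;

        if (!sock || sock->fd < 0) {
                /* Nothing to close; treat as success. */
                return 0;
        }

        if (sock->ops && sock->ops->close) {
                ret = sock->ops->close(sock->fd);
        } else {
                /* Default fallback when no ops are available. */
                ret = close(sock->fd);
        }
        if (ret < 0) {
                perror("relayd close");
        }
        sock->fd = -1;
        return ret;
}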
- Status changed from Confirmed to Feedback
We tried to re-run our stress test to see whether this is reproducible, but our VM crashed midway through. So far, from the leftover data, we have not seen the consumerd problem.
We are planning to run the stress test over the weekend and will give you an update early next week.
The stress test results from last weekend were inconclusive.
The test has been stuck in "deactivating" since Saturday.
lttng_data_pending() still returns 1 (as of today) for all sessions, even though "lttng list" shows all sessions as "inactive".
We are investigating this new issue; more info to follow.
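For context, lttng_data_pending() from liblttng-ctl is typically polled during session teardown with a loop like the sketch below; if the call keeps returning 1, teardown stays stuck in "deactivating" as described above. The helper name and polling delay are made up for illustration:

#include <lttng/lttng.h>
#include <unistd.h>
#include <stdio.h>

/*
 * Poll until the consumer reports that all data for the session has been
 * flushed. lttng_data_pending() returns 1 while data is still pending,
 * 0 once it is flushed, and a negative error code on failure.
 */
static int wait_for_data_flush(const char *session_name)
{
        int ret;

        while ((ret = lttng_data_pending(session_name)) == 1) {
                usleep(200000);         /* 200 ms between polls */
        }
        if (ret < 0) {
                fprintf(stderr, "lttng_data_pending failed: %s\n",
                        lttng_strerror(ret));
        }
        return ret;
}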
Hi David,
We have re-run our test and confirm that the consumerd crash described above can no longer be observed.
The lttng 2.1.1 commits we were using were:
rcu : da9bed2 (HEAD, tag: v0.7.6) Version 0.7.6
ust : 164931d (HEAD, origin/stable-2.1) Fix: refcount issue in ...
tools : b325dc7 (origin/stable-2.1) Fix: put session list lock...
+ bug429-fix (as from Update#3 of bug429)
+ bug433-patch (as from Update#10 of bug433)
Are you planning to merge the fix into lttng-tools stable-2.1?
- Status changed from Feedback to Confirmed
Yep, I'll push that in 2.1-stable and port the fix to 2.2 if necessary.
Thanks!
- Status changed from Confirmed to Resolved
- % Done changed from 0 to 100
As a final note, this fix is in 2.2-rc1 already.
Thanks for the info, David.