Bug #429

closed

lttng-consumerd hits a segmentation fault

Added by Mathieu Bain over 11 years ago. Updated about 11 years ago.

Status: Resolved
Priority: Normal
Assignee:
Target version:
Start date: 01/23/2013
Due date:
% Done: 100%
Estimated time:

Description

While running some tests, we hit a segmentation fault in lttng-consumerd.
The core file points to line 338 in relayd.c, in function relayd_close().

So far I have not been able to reproduce it. But when it happened, since we are working on clusters, it hit all 4 nodes at the same time.
So we got 4 core dumps due to the segmentation fault.

The problem with the core file is the way the code was compiled, but I attached the backtrace. Maybe you will be able to find something.

Thanks, Mathieu


Files

gdb.log (3.33 KB), Mathieu Bain, 01/23/2013 03:22 PM
bug429.diff (3.79 KB), David Goulet, 01/23/2013 04:28 PM
Actions #1

Updated by Mathieu Bain over 11 years ago

lttng-tools version is 2.1.1 stable

Actions #2

Updated by David Goulet over 11 years ago

  • Status changed from New to Confirmed
  • Assignee set to David Goulet
  • Target version set to 2.1 stable

So it appears that the socket's operations function pointer is invalid (NULL or a bad address) in this case, which causes the segfault.

I found one code path that could trigger that, but it does not match the one I see in the coredump, which is a relayd_close() in a call_rcu... If a relayd socket is in the relayd hash table, it means that the sockets were all successfully allocated and initialized. Seeing invalid data in the operations pointer is kind of "not possible", because to trigger a call_rcu a relayd object has to be in the hash table, meaning it was initialized correctly.

Anyway, I'll push a fix to make sure we have a valid operations function pointer for the socket when calling relayd_close(), because I see at least one code path that needs that. Hopefully, by no longer dereferencing an invalid operations pointer, it will also keep the problem you've been having from triggering again.

Btw, again, if you can provide coredumps with no optimization, that would help a lot because we can never be sure of the root cause with stuff optimized out.

Thanks!

Actions #3

Updated by David Goulet over 11 years ago

I've attached a patch to the bug that improves relayd_close() and should fix the invalid ops pointer case.

This needs review because it adds a default fallback on close(3) if no ops is found, which can be debated. Also, if the fd is not valid, it will still return success.
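
For readers following along, the idea described above can be sketched roughly as follows. This is not the attached bug429.diff; the type and function names (`sketch_sock`, `sketch_sock_ops`, `sketch_relayd_close`) are hypothetical stand-ins and do not match the lttng-tools definitions. It only illustrates checking the ops pointer before use, falling back on a plain close() when no ops is found, and reporting success when the fd is not valid. Note that a NULL check like this cannot detect a non-NULL bad address.

```c
#include <stddef.h>
#include <unistd.h>

/* Hypothetical stand-ins for the socket types; not the lttng-tools structs. */
struct sketch_sock;

struct sketch_sock_ops {
	int (*close)(struct sketch_sock *sock);
};

struct sketch_sock {
	int fd;                            /* -1 when no descriptor is open */
	const struct sketch_sock_ops *ops; /* may be NULL if never initialized */
};

/*
 * Close a socket without dereferencing a NULL ops table.
 * Returns 0 on success (including when there is nothing to close),
 * or a negative value if the underlying close fails.
 */
int sketch_relayd_close(struct sketch_sock *sock)
{
	int ret;

	if (sock == NULL || sock->fd < 0) {
		/* Invalid fd: nothing to close, still reported as success. */
		return 0;
	}

	if (sock->ops != NULL && sock->ops->close != NULL) {
		/* Normal path: use the protocol-specific close operation. */
		ret = sock->ops->close(sock);
	} else {
		/* Fallback: no usable ops table, close the descriptor directly. */
		ret = close(sock->fd);
	}

	/* Mark the descriptor as closed so a second call is a no-op. */
	sock->fd = -1;
	return ret;
}
```

The debatable part is visible in the first branch: a caller cannot tell "closed cleanly" apart from "there was never a valid fd", which is the trade-off the review request points at.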

Actions #4

Updated by David Goulet about 11 years ago

  • Status changed from Confirmed to Feedback
Actions #5

Updated by Tan le tran about 11 years ago

We tried to re-run our stress test to see whether this is reproducible, but our VM crashed midway. So far, from the data that was left over, we have not seen the consumerD problem.

We are planning to run the stress test over the weekend and will give you an update early next week.

Actions #6

Updated by Tan le tran about 11 years ago

The stress test result from last weekend was inconclusive.
The test has been stuck in "deactivating" since Saturday.
lttng_data_pending() still returns 1 (as of today) for all sessions even though
"lttng list" shows all sessions as "inactive".

We are investigating the new issue; more info will follow.

Actions #7

Updated by Tan le tran about 11 years ago

Hi David,

We have re-run our test and we confirm that the above consumerD crash can 
no longer be observed.

The lttng 2.1.1 commits we were using were:

   rcu        : da9bed2 (HEAD, tag: v0.7.6) Version 0.7.6
   ust        : 164931d (HEAD, origin/stable-2.1) Fix: refcount issue in ...
   tools      : b325dc7 (origin/stable-2.1) Fix: put session list lock...
                 + bug429-fix (as from Update#3 of bug429)
                 + bug433-patch (as from Update#10 of bug433)

Are you planning to merge the fix into lttng-tools stable-2.1?

Actions #8

Updated by David Goulet about 11 years ago

  • Status changed from Feedback to Confirmed

Yep, I'll push that in 2.1-stable and port the fix to 2.2 if necessary.

Thanks!

Actions #9

Updated by David Goulet about 11 years ago

  • Status changed from Confirmed to Resolved
  • % Done changed from 0 to 100
Actions #10

Updated by David Goulet about 11 years ago

As a final note, this fix is in 2.2-rc1 already.

Actions #11

Updated by Tan le tran about 11 years ago

Thanks for the info, David.
