Project

General

Profile

Actions

Bug #426

closed

Instrumented application segfault ( in lttng_enabler_release (objd=<optimized out>) at lttng-ust-abi.c:1033)

Added by Tan le tran about 11 years ago. Updated about 11 years ago.

Status:
Resolved
Priority:
Normal
Target version:
-
Start date:
01/21/2013
Due date:
% Done:

0%

Estimated time:

Description

We received this coredump very often while running the so called "stability" testing.
This activity consist of 3 users. Each user has a set of trace commands to execute.
They keep executing one command (such as create, stop, destroy, list, etc) after the the other and keep repeating until we stop the activity. At all time, there are 4 instances of instrumented app running (without exiting).

Once in a while, we launch another 10 new instances of "TestApp_Mini1" (an insturmented app that simply wait for 5 sec , print a line and exit).

We have noticed that once in a while, we see a segfault from the TestApp_Mini1.

"LTTNG_UST_DEBUG=1" has been used to launch TestApp_Mini1 to get more data for
the segfault. "gdb bt" printout is also provided in the attachement.

Please, note that we are using a separate relayD for each session.

From our internal log, this is what we have observed (put into time order based on their time stamps) In the scenario below, user2 is using session STAB002_2 and user3 uses session STAB003_b_1:
    :
            21:27:26 Launch 10 TestApp_Mini by using:
                     for a in $(seq 1 10); do (LTTNG_UST_DEBUG=1 /home/test_apps/TestApp_Mini1 5 &); done
                    (each instance has 5 sec life time)
   (user-3) 21:27:26 invoke lttng_list_session (STAB003_b_1)
   (user-2) 21:27:27 RelayD is killed for session STAB002_2
   (user-3) 21:27:28 reply received from lttng_list_session with count 2
   (user-3) 21:27:28 Invoke lttng_stop_tracing_no_wait for session STAB003_b_1
   (user-3) 21:27:28 Reply received for lttng_stop_tracing_no_wait for session STAB003_b_1
   (user-2) 21:27:28 Invoke lttng_list_session (for session STAB002_2)
   (user-2) 21:27:28 Reply received from lttng_list_session (for session STAB002_2)
   (user-2) 21:27:28 Invoke lttng_stop_tracing_no_wait for session STAB002_2
   (user-2) 21:27:28 Reply received for lttng_stop_tracing_no_wait for session STAB002_2
   (user-2) 21:27:28 Invoke lttng_list_session (for session STAB002_2)
   (user-2) 21:27:28 Reply received from lttng_list_session (for session STAB002_2)
   (user-2) 21:27:28 lttng_destroy_session for session STAB002_2
   (user-2) 21:27:28 Reply recieved for lttng_destroy_session for session STAB002_2.
   (user-3) 21:27:28 Invoke lttng_data_pending for session STAB003_b_1
   (user-3) 21:27:29 Reply recieved for lttng_data_pending for session STAB003_b_1: 0
   (user-3) 21:27:29 Invoke lttng_destroy_session for session STAB003_b_1
       ******
       Jan 20 21:27:34 SC-1 kernel: [133553.807697] TestApp_Mini1[24519]: segfault 
          at 7fe883e5d2d8 ip 00007fe8835f39be sp 00007fff67488b40 error 4 in 
          liblttng-ust.so.0.0.0[7fe8835e3000+38000]
       ******
   (user-3) 21:28:03 Reply recieved for lttng_destroy_session for session STAB003_b_1.

   We are not sure what about what would have made the instrumented app
   (TestApp_Mini1) to seg fault. Regardless where the fault might be, 
   instrumented application should not be effected and should not give 
   a segmentation fault.

Lttng version used:
===================
  * userspace:   da9bed2 (HEAD, tag: v0.7.6) Version 0.7.6
  * lttng-ust:   05356aa (HEAD, tag: v2.1.1) Version 2.1.1
  * lttng-tools: 959f036 (HEAD, tag: v2.1.1) Update version to v2.1.1


Files

gdb_bt_jan20_2127.log (5.82 KB) gdb_bt_jan20_2127.log Backtrace from the coredump Tan le tran, 01/21/2013 05:12 PM
mini_jan20_212726.log (101 KB) mini_jan20_212726.log LTTNG_UST_DEBUG information when launching the inst. app Tan le tran, 01/21/2013 05:12 PM
TestApp_Mini.tar (720 KB) TestApp_Mini.tar TestApp_Mini source Tan le tran, 01/21/2013 05:12 PM
patch-ust-quit-check.patch (955 Bytes) patch-ust-quit-check.patch Mathieu Desnoyers, 01/21/2013 06:43 PM
gdb_TestAppMini_feb27.log (39.5 KB) gdb_TestAppMini_feb27.log GDB printout for 4 selected coredumps Tan le tran, 02/27/2013 11:12 AM
launchAppMini_shorten.tar (200 KB) launchAppMini_shorten.tar printout of LTTNG_UST_DEBUG=1 when launching the TestApp_Mini1 Tan le tran, 02/27/2013 11:12 AM
Actions #1

Updated by Mathieu Desnoyers about 11 years ago

Hi Tan,

I would need the following additional info:

- build the instrumented application without optimisations (O0 instead
of -O2).
output, with gdb on core file:

"thread apply all bt full"

So I get the backtrace from all threads (there appears to be 3 threads).

print lttng_ust_comm_should_quit

You might want to try the patch I am attaching to this bug report, I
think it should help.

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

Actions #3

Updated by Tan le tran about 11 years ago

Hi Mathieu,

We are pretty busy at the moment with our release activities.
I will give a try by early next week .
Will keep you posted.
/Tan

Actions #4

Updated by Mathieu Desnoyers about 11 years ago

No problem, it is pushed into master/stable-2.1 now.

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

Actions #5

Updated by Mathieu Desnoyers about 11 years ago

  • Status changed from New to Feedback

Updated by Tan le tran about 11 years ago


Hi Mathieu,

We finally were able to revisit this case . Sorry for all the delay.

I have recompiled lttng and our TestApps using "-O0 -g" option .

The stress test sceanrio has been re-run with the following commits:

lttng-ust:    d480871 (HEAD, origin/stable-2.1) Fix package: don't distribute generated headers
lttng-tools:  ed22248 (HEAD, temp_bug429_bug433_patches) Apply patches for bug429 and 433
                        using patch in <Update #10> of bug-429
                        and   patch in <Update #3>  of bug-433
              b325dc7 (origin/stable-2.1) Fix: put session list lock around the app registration
userspace  :  da9bed2 (HEAD, tag: v0.7.6) Version 0.7.6

The TestApp_Mini1 still occasionally segfault.

Enclosed files are:
  "gdb_TestAppMini_feb27.log"    is the requested gdb printouts. 
                                 (only 2 threads seen instead of 3)
                                 with:
                                       "thread apply all bt full" 
                                       "print lttng_ust_comm_should_quit" 

  "launchAppMini_shorten.tar"    This file is generated based on:
                                    echo $(date) >> /cluster/temp/tdlt/bug426/launchAppMini.log
                                    for a in $(seq 1 20); do (usleep 10000; \
                                      LTTNG_UST_DEBUG=1 /home/test_apps/TestApp_Mini1 5  >> \
                                      /cluster/temp/tdlt/bug426/launchAppMini.log 2>&1 &); \
                                    done
                                    echo $(date) >> /cluster/temp/tdlt/bug426/launchAppMini.log

Please, let us know if further information is required.
Thanks,
Tan
Actions #7

Updated by Mathieu Desnoyers about 11 years ago

Can you try with the following stable-2.1 commit ?

commit 164931d23d430b1770611ec0300720350261b346
Author: Mathieu Desnoyers <>
Date: Wed Mar 6 10:07:12 2013 -0500

Fix: refcount issue in lttng-ust-abi.c
The following scenario could lead to a segmentation fault in
applications when the sessiond disappears or when the application try to
exit. This is caused by incorrect handling of reference counts.
This can happen in very particular race scenario, described as follows:
1) The sessiond asks for "release" of one or more objects (e.g. a
session object), but without asking for release of all objects
referencing the session.
2) The application exits, thus calling objd_table_destroy(). It walks on
all objects, decrementing their reference count, freeing memory when
their reference count reaches "1".
However, here is the issue: since "release" has already been performed
by sessiond on the session object, this extra reference count unref
performed by objd_table_destroy() can make the session object disappear
while it is still needed by either the channel object or the enabler
object. Therefore, we can experience a segmentation fault when we try to
unref and free the channel or enabler objects within
objd_table_destroy().
Fix this issue by adding the concept of an "owner reference". Only
objd_table_destroy(), lttng_ust_objd_table_owner_cleanup(), "release"
commands, and failure paths of object creation are allowed to decrement
the owner reference. We restrict objd_table_destroy() and
lttng_ust_objd_table_owner_cleanup() to only decrement refcount of
objects if their owner reference is still held.
Signed-off-by: Mathieu Desnoyers &lt;&gt;
Actions #8

Updated by Tan le tran about 11 years ago

Hi Mathieu,

We have re-run our stress test scenario last weekend.
Unfortunately, the whole trace traffic got stuck in "deactivating" state on early morning of
saturday due to some issue where sessionD which keeps reporting to have more data pending (currently, it is still reporting more data pending even if "lttng list" shows that all sessions are in "inactive" state).
Therefore the run lasted only about 12hrs. So far, it looks very promissing. No segfault from testApp_Mini has been detected after that 12hrs run (usually we would have at least a couple).

We will try again with longer run (at least 48hrs) and let you know (hopefully, it will not got hit by the above deactivating issue). We will look into the log data to see what causes the "deactivating" issue.
Thanks,
Tan

Actions #9

Updated by Tan le tran about 11 years ago

Our stress test has been ran for more than 60hrs and the issue reported in this bug report is no longer seen. The fix works well.

Actions #10

Updated by Mathieu Desnoyers about 11 years ago

  • Status changed from Feedback to Resolved
  • Assignee set to Mathieu Desnoyers
Actions

Also available in: Atom PDF