Bug #549

closed

lttng2.2.0rc2: Session can not complete deactivating process due to return value 1 of lttng_data_pending()

Added by Tan le tran over 8 years ago. Updated over 8 years ago.

Status: Resolved
Priority: High
Assignee: -
Target version:
Start date: 05/30/2013
Due date:
% Done: 0%
Estimated time:

Description


Commit used:
============
babeltrace  : 9eaf254 Version 1.0.3
tools       : 094d169 (HEAD, origin/master, origin/HEAD) Fix: dereference after NULL check
ust         : 996aead (HEAD, origin/master, origin/HEAD) Add parameter -f to rm in Makefile clean target
userspace   : 264716f (HEAD, origin/stable-0.7, stable-0.7) Fix: Use a filled signal mask to disable all signals

Problem Description:
====================
 * During stability testing, the deactivation process sometimes hangs because consumerD
   keeps reporting a value of 1 for lttng_data_pending() even though the session is
   already inactive (as seen via "lttng list").

   When our code deactivates a session, it uses the following sequence of API calls:
      lttng_stop_tracing_no_wait()
      lttng_data_pending()
        If the return value is > 0, repeat every 100 ms until it is 0 (i.e. no more data)
   According to our log, the session holds about 6 MB of data. The deactivation has been
   running for more than 5 hours and lttng_data_pending() still returns 1.
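
A caller can avoid waiting forever on this sequence by bounding the polling loop. The following is a minimal sketch, not the reporter's actual code: wait_until_drained(), the callback types, and the fake_* stand-ins are hypothetical names introduced for illustration. The check callback is assumed to follow the return convention described above (> 0 while data is pending, 0 when none remains, negative on error).

```c
#include <stddef.h>

/* Hypothetical callback types, introduced for this sketch only.
 * pending_fn mirrors the convention described in the report:
 * > 0 while data is pending, 0 when drained, < 0 on error. */
typedef int (*pending_fn)(const char *session_name);
typedef void (*sleep_fn)(long milliseconds);

enum wait_result { WAIT_DONE = 0, WAIT_TIMEOUT = 1, WAIT_ERROR = 2 };

/* Poll `check` every `poll_ms` until it reports no pending data,
 * giving up once `timeout_ms` of sleeping has accumulated. */
static enum wait_result wait_until_drained(pending_fn check, sleep_fn do_sleep,
                                           const char *session_name,
                                           long poll_ms, long timeout_ms)
{
    long waited_ms = 0;

    for (;;) {
        int ret = check(session_name);

        if (ret == 0)
            return WAIT_DONE;      /* no more data pending */
        if (ret < 0)
            return WAIT_ERROR;     /* the check itself failed */
        if (waited_ms >= timeout_ms)
            return WAIT_TIMEOUT;   /* still pending after the deadline */

        do_sleep(poll_ms);
        waited_ms += poll_ms;
    }
}

/* Stand-ins used to exercise the loop without a running session daemon. */
static int fake_remaining = 0;   /* number of polls that still report "pending" */
static long fake_slept_ms = 0;   /* total time the loop asked to sleep */

static int fake_pending(const char *session_name)
{
    (void)session_name;
    if (fake_remaining > 0) {
        fake_remaining--;
        return 1;
    }
    return 0;
}

static void fake_sleep(long milliseconds)
{
    fake_slept_ms += milliseconds;
}
```

In a real client, lttng_data_pending() from lttng/lttng.h could be passed as the check callback, since it takes the session name and returns an int, together with a sleep function built on nanosleep(). On WAIT_TIMEOUT the caller can log the stuck session and collect diagnostics instead of polling for hours.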

   "lttng list" shows that the session is already inactive.

   We periodically checked the size of the session directory; no further data had been
   written to it for a long time.

   "kill -SIGABRT" is used to kill the consumerD. The corresponding gdb printout is
   attached with this report. Unfortunately, we did not manage to use "kill -SIGABRT" 
   on sessionD as it was automatically killed by our health check process once it 
   detected that consumerD was no longer healthy.

   This is the first time we have observed this behaviour in lttng2.2.0rc2.
   We have seen it a couple of times in lttng2.1. Note that in lttng2.1, we used to
   see the deactivation process last more than 2 days (it never completed).

Is problem reproducible ?
=========================
  * Maybe

How to reproduce (if reproducible):
===================================
  * Our stability test consists of 4 users. Each user has a different set of trace commands
    (such as create session, activate session, stop session, etc). Each user then executes
    its set of commands through multiple iterations.
    All sessions are created using streaming and per-UID buffers, and trace userspace only.

    After the overnight run, we started seeing that one session on one node could not
    complete the deactivation process.

Any other information:
======================
-   

Files

terminal_log2.log (33.3 KB): terminal log. Tan le tran, 05/30/2013 10:43 AM
gdb_printout.log (12 KB): gdb printout of consumerD when abort signal is sent. Tan le tran, 05/30/2013 10:43 AM
Actions #1

Updated by David Goulet over 8 years ago

  • Status changed from New to Confirmed
  • Target version set to 2.2
Actions #2

Updated by Tan le tran over 8 years ago

After loading lttng-tools 2.2.3 and lttng-ust 2.2.1, this bug has not been reproduced so far. However, since the scenario needed to reproduce the fault is still unknown, this does not necessarily mean that it has been fixed.

Therefore, before closing this bug report, it would be great if some information can be stated here regarding how data should be collected once the fault occurs again, so that we don't lose our chance to capture the info to find the root cause.
Many Thanks.

Actions #3

Updated by David Goulet over 8 years ago

  • Status changed from Confirmed to Resolved

Killing with SIGABRT is fine as long as ulimit -c is set to unlimited so that a core dump is produced.

Otherwise, attach with gdb and print the full backtrace of every available thread:
gdb> info threads
gdb> thread 1
gdb> bt full
[repeat for each thread]
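
For the core-dump route, the limit has to be in place before the daemons start, because child processes inherit resource limits from their parent. As a rough sketch only, a launcher program could raise its own soft core-file limit to the hard limit before spawning the session daemon. The function names below are hypothetical, and the result is only "unlimited" if the hard limit itself is unlimited.

```c
#include <sys/resource.h>

/* Raise the soft RLIMIT_CORE to the hard limit for the calling process.
 * Returns 0 on success and -1 on failure, like getrlimit()/setrlimit(). */
static int raise_core_limit_to_hard_max(void)
{
    struct rlimit lim;

    if (getrlimit(RLIMIT_CORE, &lim) != 0)
        return -1;

    lim.rlim_cur = lim.rlim_max;   /* soft limit may not exceed the hard limit */
    return setrlimit(RLIMIT_CORE, &lim);
}

/* Returns 1 if the soft core limit equals the hard limit, 0 if it does not,
 * and -1 if the limits could not be read. */
static int core_soft_limit_matches_hard(void)
{
    struct rlimit lim;

    if (getrlimit(RLIMIT_CORE, &lim) != 0)
        return -1;
    return lim.rlim_cur == lim.rlim_max ? 1 : 0;
}
```

Running ulimit -c unlimited in the shell that starts the session daemon has the same effect when the hard limit allows it.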
