Project

General

Profile

Actions

Bug #419

closed

Consumerd gets stuck while stopping session

Added by Jesus Garcia almost 12 years ago. Updated almost 12 years ago.

Status:
Resolved
Priority:
High
Assignee:
Target version:
Start date:
01/03/2013
Due date:
% Done:

100%

Estimated time:

Description

When two cluster-wide sessions are active and one of them is deactivated, consumerd seems to get stuck for a long time.

My build was made from all 2.1.0 tagged commits, but my colleagues have also seen this issue with the latest commits.

The scenario is as follows:
- 4-node cluster each with a sessiond/consumerd pair
- 4 test apps running on every node in the cluster before the sessions are created. They are never stopped
- 2 cluster-wide sessions are created with several events and allowed to run for 10 sec.
The specific input for both sessions is as follows:
com_ericsson_cba_trace_testapp_lowtraf:* (loglevel: TRACE_DEBUG_SYSTEM (7)) (type: tracepoint) [enabled]
com_ericsson_cba_trace_testapp_lowtraf_* (loglevel: TRACE_DEBUG_SYSTEM (7)) (type: tracepoint) [enabled]
This yields the following (low-output) events:
com_ericsson_cba_testapp_fork_parent_lowtraf:OnePerSecA,TRACE_DEBUG_PROGRAM(8) LOGLEVEL_ONLY
com_ericsson_cba_testapp_fork_parent_lowtraf:OnePerSecB,TRACE_DEBUG_PROCESS(9) LOGLEVEL_ONLY
com_ericsson_cba_testapp_fork_parent_lowtraf:TenPerSecA,TRACE_DEBUG_MODULE(10) LOGLEVEL_ONLY
com_ericsson_cba_testapp_fork_parent_lowtraf:TenPerSecB,TRACE_DEBUG_UNIT(11) LOGLEVEL_ONLY
- The 1st session is stopped

A log was collected with sessiond/consumerd verbose mode on one of the nodes (SC-1). The issue takes place in several of the nodes.

Notes about the log: ====================
The stop of the 1st session is invoked on the following lines:
(line 2473) 88439 16:55:10 01/03/2013 IN TraceP_PL-3 "DBG: deactivateSession: Invoke lttng_stop_tracing_no_wait for session TestProf1_1
(line 2479) 88445 16:55:10 01/03/2013 IN TraceP_PL-4 "DBG: deactivateSession: Invoke lttng_stop_tracing_no_wait for session TestProf1_1
(line 2488) 88454 16:55:10 01/03/2013 IN TraceP_SC-1 "DBG: deactivateSession: Invoke lttng_stop_tracing_no_wait for session TestProf1_1
(line 2492) 88458 16:55:10 01/03/2013 IN TraceP_SC-2 "DBG: deactivateSession: Invoke lttng_stop_tracing_no_wait for session TestProf1_1

From that point on, the log shows that our code is checking for data pending:
88468 16:55:11 01/03/2013 IN TraceP_SC-1 "DBG: deactivateSession: Reply received for lttng_stop_tracing_no_wait for session TestProf1_1
88469 16:55:11 01/03/2013 IN TraceP_SC-1 "DBG: deactivatingSessionInProgress: Invoke lttng_data_pending for session TestProf1_1

And the result is that there is more data pending:
88476 16:55:11 01/03/2013 IN TraceP_SC-1 "DBG: deactivatingSessionInProgress: Reply recieved for lttng_data_pending for session TestProf1_1: 1.

This continues for several minutes:
115637 16:59:15 01/03/2013 IN TraceP_SC-1 "DBG: deactivatingSessionInProgress: Reply recieved for lttng_data_pending for session TestProf1_1: 1.

And it never seems to stop. At this point the saflog got full and it rotated to a new file, so I canceled the cluster-wide session (destroyed the lttng session, killed relayd).


Files

consumerd_stuck_during_session_stop.zip (271 KB) consumerd_stuck_during_session_stop.zip Jesus Garcia, 01/03/2013 05:58 PM
LTTng-bug419_relayd_log.txt (127 KB) LTTng-bug419_relayd_log.txt Jesus Garcia, 01/07/2013 01:56 PM
bug419.zip (807 KB) bug419.zip Jesus Garcia, 01/07/2013 04:19 PM
bug419_2.zip (148 KB) bug419_2.zip Jesus Garcia, 01/07/2013 06:01 PM
bug419_3.zip (127 KB) bug419_3.zip Jesus Garcia, 01/07/2013 10:28 PM
Actions

Also available in: Atom PDF