Bug #419
closedConsumerd gets stuck while stopping session
100%
Description
When two cluster-wide sessions are active and one of them is deactivated, consumerd seems to get stuck for a long time.
My build was made from all 2.1.0 tagged commits, but my colleagues have also seen this issue with the latest commits.
The scenario is as follows:
- 4-node cluster each with a sessiond/consumerd pair
- 4 test apps running on every node in the cluster before the sessions are created. They are never stopped
- 2 cluster-wide sessions are created with several events and allowed to run for 10 sec.
The specific input for both sessions is as follows:
com_ericsson_cba_trace_testapp_lowtraf:* (loglevel: TRACE_DEBUG_SYSTEM (7)) (type: tracepoint) [enabled]
com_ericsson_cba_trace_testapp_lowtraf_* (loglevel: TRACE_DEBUG_SYSTEM (7)) (type: tracepoint) [enabled]
This yields the following (low-output) events:
com_ericsson_cba_testapp_fork_parent_lowtraf:OnePerSecA,TRACE_DEBUG_PROGRAM(8) LOGLEVEL_ONLY
com_ericsson_cba_testapp_fork_parent_lowtraf:OnePerSecB,TRACE_DEBUG_PROCESS(9) LOGLEVEL_ONLY
com_ericsson_cba_testapp_fork_parent_lowtraf:TenPerSecA,TRACE_DEBUG_MODULE(10) LOGLEVEL_ONLY
com_ericsson_cba_testapp_fork_parent_lowtraf:TenPerSecB,TRACE_DEBUG_UNIT(11) LOGLEVEL_ONLY
- The 1st session is stopped
A log was collected with sessiond/consumerd verbose mode on one of the nodes (SC-1). The issue takes place in several of the nodes.
Notes about the log:
====================
The stop of the 1st session is invoked on the following lines:
(line 2473) 88439 16:55:10 01/03/2013 IN TraceP_PL-3 "DBG: deactivateSession: Invoke lttng_stop_tracing_no_wait for session TestProf1_1
(line 2479) 88445 16:55:10 01/03/2013 IN TraceP_PL-4 "DBG: deactivateSession: Invoke lttng_stop_tracing_no_wait for session TestProf1_1
(line 2488) 88454 16:55:10 01/03/2013 IN TraceP_SC-1 "DBG: deactivateSession: Invoke lttng_stop_tracing_no_wait for session TestProf1_1
(line 2492) 88458 16:55:10 01/03/2013 IN TraceP_SC-2 "DBG: deactivateSession: Invoke lttng_stop_tracing_no_wait for session TestProf1_1
From that point on, the log shows that our code is checking for data pending:
88468 16:55:11 01/03/2013 IN TraceP_SC-1 "DBG: deactivateSession: Reply received for lttng_stop_tracing_no_wait for session TestProf1_1
88469 16:55:11 01/03/2013 IN TraceP_SC-1 "DBG: deactivatingSessionInProgress: Invoke lttng_data_pending for session TestProf1_1
And the result is that there is more data pending:
88476 16:55:11 01/03/2013 IN TraceP_SC-1 "DBG: deactivatingSessionInProgress: Reply recieved for lttng_data_pending for session TestProf1_1: 1.
This continues for several minutes:
115637 16:59:15 01/03/2013 IN TraceP_SC-1 "DBG: deactivatingSessionInProgress: Reply recieved for lttng_data_pending for session TestProf1_1: 1.
And it never seems to stop. At this point the saflog got full and it rotated to a new file, so I canceled the cluster-wide session (destroyed the lttng session, killed relayd).
Files