Bug #419
Consumerd gets stuck while stopping session
Status: Closed
% Done: 100%
Description
When two cluster-wide sessions are active and one of them is deactivated, consumerd seems to get stuck for a long time.
My build was made from the 2.1.0 tags of all components, but my colleagues have also seen this issue with the latest commits.
The scenario is as follows:
- 4-node cluster each with a sessiond/consumerd pair
- 4 test apps running on every node in the cluster before the sessions are created. They are never stopped
- 2 cluster-wide sessions are created with several events and allowed to run for 10 sec.
The specific input for both sessions is as follows:
com_ericsson_cba_trace_testapp_lowtraf:* (loglevel: TRACE_DEBUG_SYSTEM (7)) (type: tracepoint) [enabled]
com_ericsson_cba_trace_testapp_lowtraf_* (loglevel: TRACE_DEBUG_SYSTEM (7)) (type: tracepoint) [enabled]
This yields the following (low-output) events:
com_ericsson_cba_testapp_fork_parent_lowtraf:OnePerSecA,TRACE_DEBUG_PROGRAM(8) LOGLEVEL_ONLY
com_ericsson_cba_testapp_fork_parent_lowtraf:OnePerSecB,TRACE_DEBUG_PROCESS(9) LOGLEVEL_ONLY
com_ericsson_cba_testapp_fork_parent_lowtraf:TenPerSecA,TRACE_DEBUG_MODULE(10) LOGLEVEL_ONLY
com_ericsson_cba_testapp_fork_parent_lowtraf:TenPerSecB,TRACE_DEBUG_UNIT(11) LOGLEVEL_ONLY
- The 1st session is stopped
A log was collected with sessiond/consumerd in verbose mode on one of the nodes (SC-1). The issue occurs on several of the nodes.
Notes about the log:
====================
The stop of the 1st session is invoked on the following lines:
(line 2473) 88439 16:55:10 01/03/2013 IN TraceP_PL-3 "DBG: deactivateSession: Invoke lttng_stop_tracing_no_wait for session TestProf1_1
(line 2479) 88445 16:55:10 01/03/2013 IN TraceP_PL-4 "DBG: deactivateSession: Invoke lttng_stop_tracing_no_wait for session TestProf1_1
(line 2488) 88454 16:55:10 01/03/2013 IN TraceP_SC-1 "DBG: deactivateSession: Invoke lttng_stop_tracing_no_wait for session TestProf1_1
(line 2492) 88458 16:55:10 01/03/2013 IN TraceP_SC-2 "DBG: deactivateSession: Invoke lttng_stop_tracing_no_wait for session TestProf1_1
From that point on, the log shows that our code is checking for data pending:
88468 16:55:11 01/03/2013 IN TraceP_SC-1 "DBG: deactivateSession: Reply received for lttng_stop_tracing_no_wait for session TestProf1_1
88469 16:55:11 01/03/2013 IN TraceP_SC-1 "DBG: deactivatingSessionInProgress: Invoke lttng_data_pending for session TestProf1_1
And the result is that there is more data pending:
88476 16:55:11 01/03/2013 IN TraceP_SC-1 "DBG: deactivatingSessionInProgress: Reply recieved for lttng_data_pending for session TestProf1_1: 1.
This continues for several minutes:
115637 16:59:15 01/03/2013 IN TraceP_SC-1 "DBG: deactivatingSessionInProgress: Reply recieved for lttng_data_pending for session TestProf1_1: 1.
And it never seems to stop. At this point the saflog became full and rotated to a new file, so I canceled the cluster-wide session (destroyed the lttng session, killed relayd).
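For reference, this is essentially the loop our deactivateSession/deactivatingSessionInProgress code performs, sketched here against the lttng-ctl API (error handling, naming and the retry interval are illustrative, not our exact code):

#include <stdio.h>
#include <unistd.h>

#include <lttng/lttng.h>

/*
 * Sketch of the deactivation path: stop the session without blocking,
 * then poll until LTTng reports that no trace data is still in flight.
 * In the faulty runs, lttng_data_pending() keeps returning 1 forever.
 */
static int deactivate_session(const char *session_name)
{
    int ret;

    ret = lttng_stop_tracing_no_wait(session_name);
    if (ret < 0) {
        fprintf(stderr, "stop_no_wait failed: %s\n", lttng_strerror(ret));
        return ret;
    }

    for (;;) {
        ret = lttng_data_pending(session_name);
        if (ret < 0) {
            fprintf(stderr, "data_pending failed: %s\n", lttng_strerror(ret));
            return ret;
        }
        if (ret == 0) {
            break;  /* all pending data has reached the relayd/disk */
        }
        sleep(1);   /* still pending; retry */
    }

    return 0;
}

In the logs above, the lttng_data_pending() call in this loop returns 1 on every iteration for over four minutes.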
Updated by David Goulet almost 12 years ago
- Status changed from New to Confirmed
- Assignee set to David Goulet
The relayd logs would greatly help here.
Everything seems fine on the session daemon and consumer side. We need to look at the stream's seq_num the consumer is sending to the relayd; that will tell us whether the relayd is badly out of sync and why.
My guess, looking at the logs, is that there is a data_pending command sent to the relayd before any data is even streamed, so there might be a "bootstrap" problem. I'll keep looking, but the relayd logs are needed to confirm.
Updated by Jesus Garcia almost 12 years ago
Hi David,
Attached is the relayd log for the exact same scenario.
Regards,
Jesus
Updated by David Goulet almost 12 years ago
Does this relayd log fit the above consumer/sessiond log file?
The sessiond log above shows two different relay daemons... and this log shows NO data pending command from the SC-1 node, which makes it difficult to correlate the two logs if they are from a different test case.
Updated by Jesus Garcia almost 12 years ago
Sorry, didn't realize you still needed both sides of the equation. Full set of logs coming up soon...
Updated by Jesus Garcia almost 12 years ago
- File bug419.zip added
I apologize for the delay, I had some trouble collecting the info.
Please let me know if the log shows the info you need.
Updated by David Goulet almost 12 years ago
Hmmm, something weird is going on here. The session daemon is asking for data pending on stream handle 4, which apparently exists but has never received data on the relayd side, so its sequence number was never initialized and the data pending test keeps failing.
The problem I have is that there is NO log at all of stream handle 4 being created on either side, yet it does exist. The stream handles created in these logs begin at 61... so I think there may be a missing session daemon log here?
Now, I wonder how it is possible for a stream to never have sent data to the relayd, leaving the sequence number uninitialized. Even if the stream produces no data, at least some padding should be sent, so at least one packet should have been sent for the stream.
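To make the failure mode concrete, here is a simplified model of a sequence-number based pending check (hypothetical names and layout, not the actual relayd code): a stream that exists but never received a single packet keeps its "never received" sentinel and is reported as pending on every poll.

#include <stdint.h>

/* Hypothetical per-stream counter on the relayd side: UINT64_MAX is used
 * here as a "no data packet ever received for this stream" sentinel. */
struct stream_state_sketch {
    uint64_t prev_seq;
};

/*
 * Simplified pending check: the consumer asks whether everything up to
 * last_net_seq_num has been received and written.  Returns 1 if data is
 * still considered pending, 0 otherwise.
 *
 * A stream that was created but never got a packet keeps the sentinel,
 * so this check answers "pending" on every poll -- consistent with the
 * endless lttng_data_pending() == 1 loop in the logs.
 */
static int data_pending_sketch(const struct stream_state_sketch *stream,
        uint64_t last_net_seq_num)
{
    if (stream->prev_seq == UINT64_MAX) {
        return 1;   /* never initialized: always looks pending */
    }
    return stream->prev_seq < last_net_seq_num ? 1 : 0;
}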
We can also see this problem for "stream id 3", but a packet was clearly sent:
DEBUG3: Receiving data of size 68 for stream id 3 seqnum 0 [in relay_process_data() at mai
946745 16:10:37 01/07/2013 IN TraceC "Previous log message truncated by 2385 character(s)
And this gets repeated twice... so can we get the truncated log? If an error occurred that kept the sequence number from being updated, it is right there. I suspect there is an error during the recvmsg() or when writing the data to disk, which would prevent the prev_seq counter from being updated.
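To illustrate what I mean (a simplified sketch with hypothetical names, not the actual relay_process_data() code): prev_seq would only move forward once both the receive and the disk write fully succeed, so an error on either step leaves it behind and the pending check above never converges.

#include <stdint.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical, trimmed-down per-stream state. */
struct relay_stream_sketch {
    int fd;             /* trace file on disk */
    uint64_t prev_seq;  /* last sequence number fully written */
};

/*
 * Simplified model of handling one data packet: receive the payload,
 * write it to disk, and only then advance prev_seq.  If either step
 * errors out (the suspicion above), prev_seq is left stale and the
 * stream keeps looking "pending" forever.
 */
static int process_data_sketch(struct relay_stream_sketch *stream, int sock,
        uint64_t net_seq_num, char *buf, size_t data_size)
{
    ssize_t ret;

    ret = recv(sock, buf, data_size, MSG_WAITALL);
    if (ret < 0 || (size_t) ret != data_size) {
        return -1;  /* prev_seq is NOT updated on this error path */
    }

    ret = write(stream->fd, buf, data_size);
    if (ret < 0 || (size_t) ret != data_size) {
        return -1;  /* nor on this one */
    }

    stream->prev_seq = net_seq_num; /* only advanced on full success */
    return 0;
}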
I might send you a patch in a jiffy to try to fix a possible issue here. In the meantime, I really need the "truncated" part.
Thanks!
Updated by Jesus Garcia almost 12 years ago
- File bug419_2.zip added
Maybe the reason you are missing data is that I was only collecting sessiond/consumerd logs from the first node in the cluster.
In the newly attached archive I collected logs from all 4 nodes (SC-1, SC-2, PL-3, PL-4), since there is one sessiond/consumerd pair on each node. There is one relayd per session (in this case we have 2 relayds since we have 2 cluster-wide sessions), and each relayd receives data from the 4 sessiond/consumerd pairs.
This one should have everything.
Updated by David Goulet almost 12 years ago
Hmmm, I'm having problems with the log... there is corrupted data in it in several places. See lines 1210 or 1251... lots of unknown bytes... :S
I feel like a lot of information is missing since I don't see some commands that should be there.
Updated by Jesus Garcia almost 12 years ago
- File bug419_3.zip added
Not sure what to do about the corrupt data in the log. I have captured a new log after loading patch bug419-2.patch.
Updated by Jesus Garcia almost 12 years ago
relayd coredump data:
(gdb) bt full
#0 0x0000000000405b54 in relay_process_data (streams_ht=<optimized out>, cmd=<optimized out>) at main.c:1796
ret = <optimized out>
stream = 0x616150
data_hdr = {circuit_id = 0, stream_id = 216172782113783808, net_seq_num = 0, data_size = 1140850688, padding_size = 3155099648}
func = "relay_process_data"
__PRETTY_FUNCTION__ = "relay_process_data"
#1 relay_thread_worker (data=<optimized out>) at main.c:2148
revents = <optimized out>
pollfd = <optimized out>
i = 0
seen_control = 0
last_notdel_data_fd = 13
ret = 1
err = <optimized out>
last_seen_data_fd = -1
relay_connection = 0x614fd0
events = {epfd = 10, nb_fd = 10, alloc_size = 16, init_size = 2, events = 0x616d80}
relay_connections_ht = 0x610010
node = 0x614fe8
iter = {iter = {node = 0x614ff0, next = 0x610030}}
streams_ht = 0x6149d0
recv_hdr = {circuit_id = 0, data_size = 581245826907504640, cmd = 16777216, cmd_version = 0}
func = "relay_thread_worker"
#2 0x00007fe262f6e7b6 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#3 0x00007fe262ccac6d in clone () from /lib64/libc.so.6
No symbol table info available.
#4 0x0000000000000000 in ?? ()
No symbol table info available.
Updated by David Goulet almost 12 years ago
- Status changed from Confirmed to Resolved
- % Done changed from 0 to 100
Applied in changeset 18261bd1f606ed31dd91cf82bbb91af7839573f5.