Bug #510
Status: Closed
lttng 2.2.0rc1: ConsumerD coredumps once in a while after a few iterations of "TC dealing with killing relayd while session is active"
Description
Commits used:
=============
rcu        : 8a97780 (HEAD) Fix: tests/api.h use cpuset.h
ust        : a16877a (HEAD, origin/master, origin/HEAD) Typo fix in README
tools      : 6bf5e7c (HEAD, origin/master, origin/HEAD) Fix: remove mention of trace ...
babeltrace : d088404 (HEAD, origin/master, origin/HEAD) Use objstack for AST allocation

Problem Description:
====================
* ConsumerD coredumps with an assertion failure ("in __assert_fail () from /lib64/libc.so.6").
  So far, two different flavours (in terms of "gdb> bt" printouts) have been seen:

Case 1:
(gdb) bt
#0 0x00007ff900a13b55 in raise () from /lib64/libc.so.6
#1 0x00007ff900a15131 in abort () from /lib64/libc.so.6
#2 0x00007ff900a0ca10 in __assert_fail () from /lib64/libc.so.6
#3 0x000000000041c28f in lttng_ustconsumer_request_metadata (ctx=0x632d10, channel=0x6359b0) at ust-consumer.c:1404
#4 0x000000000040c88f in metadata_switch_timer (ctx=0x632d10, sig=44, si=0x7ff8fe5cdb10, uc=0x0) at consumer-timer.c:68
#5 0x000000000040ce85 in consumer_timer_metadata_thread (data=0x632d10) at consumer-timer.c:215
#6 0x00007ff900d5c7b6 in start_thread () from /lib64/libpthread.so.0
#7 0x00007ff900ab8c6d in clone () from /lib64/libc.so.6
#8 0x0000000000000000 in ?? ()

Case 2:
(gdb) bt
#0 0x00007feadea4bb55 in raise () from /lib64/libc.so.6
#1 0x00007feadea4d131 in abort () from /lib64/libc.so.6
#2 0x00007feadea44a10 in __assert_fail () from /lib64/libc.so.6
#3 0x0000000000413a29 in lttng_ht_add_unique_u64 (ht=0x633fa0, node=0x63c138) at hashtable.c:281
#4 0x000000000040ad1e in consumer_thread_channel_poll (data=0x632d10) at consumer.c:2697
#5 0x00007feaded947b6 in start_thread () from /lib64/libpthread.so.0
#6 0x00007feadeaf0c6d in clone () from /lib64/libc.so.6
#7 0x0000000000000000 in ?? ()

Is the problem reproducible?
============================
* Yes.

How to reproduce (if reproducible):
===================================
* When running the following scenario a couple of times, a consumerD coredump can be observed (NOTE: all interaction with lttng was done via API calls; a rough C sketch of the API sequence follows this description):

1)  lttng-relayd -C tcp://0.0.0.0:53000 -D tcp://0.0.0.0:53001 -o <some output path> &
2)  create a session (using the above ports)
    a) lttng_create_session()
    b) lttng_create_handle()
    c) lttng_channel_set_default_attr()   // one for ctf and one for metadata
    d) lttng_enable_channel()             // subbuf_size = 16384 (for both ctf and metadata)
                                          // switch_timer_interval = 1000000 (for both ctf and metadata)
    e) lttng_add_context(.., LTTNG_EVENT_CONTEXT_VPID, ..)
    f) lttng_enable_event_with_filter()   // enable a tracepoint which has only 1 hit per second
3)  start the instrumented apps to produce events
4)  lttng_start_tracing()
5)  sleep 5
6)  "pkill relayd"
7)  lttng_stop_tracing_no_wait()
8)  loop every 1 sec until lttng_data_pending() returns 0 (time out after 60 sec)
9)  lttng_destroy_session()
10) "pkill TestApp" (kill all running instrumented apps)
11) pause for a couple of seconds, then repeat from step 1 to 11

Sometimes the coredump happens during the 2nd or 3rd iteration.

* Unfortunately, we could not reproduce the above issue using equivalent lttng cli commands!

Any other information:
======================
- "sessiond -vvv" logs have been collected for each of the above 2 cases.
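For reference, the API sequence of steps 2 through 9 roughly corresponds to the C sketch below, written against liblttng-ctl (lttng/lttng.h). This is an illustration only, not the actual test-case code: the session name, tracepoint name and relayd URL are placeholders, error handling is omitted, and 2.2-era function signatures are assumed.

/*
 * Sketch of one test iteration (assumed liblttng-ctl 2.2 API).
 * Placeholder names; link with -llttng-ctl.
 */
#include <lttng/lttng.h>
#include <string.h>
#include <unistd.h>

static int run_one_iteration(void)
{
	struct lttng_domain dom;
	struct lttng_handle *handle;
	struct lttng_channel chan;
	struct lttng_event_context ctx;
	struct lttng_event ev;
	int i;

	memset(&dom, 0, sizeof(dom));
	dom.type = LTTNG_DOMAIN_UST;

	/* Steps 2a-2b: session streamed to the relayd started in step 1
	 * (the custom control/data ports from step 1 go in the URL). */
	lttng_create_session("repro_session", "net://localhost");
	handle = lttng_create_handle("repro_session", &dom);

	/* Steps 2c-2d: 16 kB sub-buffers, 1 s switch timer (the test does
	 * the same for a second, "metadata" channel). */
	memset(&chan, 0, sizeof(chan));
	strcpy(chan.name, "test");
	lttng_channel_set_default_attr(&dom, &chan.attr);
	chan.attr.subbuf_size = 16384;
	chan.attr.switch_timer_interval = 1000000;
	lttng_enable_channel(handle, &chan);

	/* Step 2e: vpid context on the channel. */
	memset(&ctx, 0, sizeof(ctx));
	ctx.ctx = LTTNG_EVENT_CONTEXT_VPID;
	lttng_add_context(handle, &ctx, NULL, "test");

	/* Step 2f: a low-rate tracepoint (about 1 hit/s), with a filter. */
	memset(&ev, 0, sizeof(ev));
	strcpy(ev.name, "provider:slow_event");	/* hypothetical tracepoint */
	ev.type = LTTNG_EVENT_TRACEPOINT;
	lttng_enable_event_with_filter(handle, &ev, "test", "1==1");

	/* Steps 4-6: trace for 5 s; relayd is killed externally here. */
	lttng_start_tracing("repro_session");
	sleep(5);

	/* Steps 7-8: stop without waiting, then poll data_pending for up
	 * to 60 s. */
	lttng_stop_tracing_no_wait("repro_session");
	for (i = 0; i < 60 && lttng_data_pending("repro_session") != 0; i++) {
		sleep(1);
	}

	/* Step 9: tear down. */
	lttng_destroy_session("repro_session");
	lttng_destroy_handle(handle);
	return 0;
}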
Updated by David Goulet over 11 years ago
- Assignee set to Julien Desfossez
- Target version set to 2.2
Updated by Mathieu Desnoyers over 11 years ago
- Status changed from New to Feedback
- Assignee changed from Julien Desfossez to Mathieu Desnoyers
Please check whether the following commits fix your two issues (or either of them):
lttng-tools:
2b8f875 Cleanup: Use own mutex within timer setup/teardown
f2ad556 Fix: channel management thread should hold a refcount
fc64324 Fix: move metadata cache setup before ownership passing
4419b4f Fix: consumer metadata switch timer error handling
lttng-ust:
64bf51a Fix: add internal mutex for timer
Thanks,
Mathieu
Updated by Amer Alhalabi over 11 years ago
- File sessiond.log added
Hi Mathieu,
I tried the new commits, but I am still having the same problem.
I have collected some logs for you (attached).
And for some reason, I now see 3 relayds.
SC-1:~/temp # lttng-sessiond -vvv --verbose-consumer &> /tmp/sessiond.log &
[1] 23690
SC-1:~/temp # ps -e | grep -i lttng
23690 pts/0 00:00:00 lttng-sessiond
SC-1:~/temp # ps -e | grep -i lttng
23690 pts/0 00:00:00 lttng-sessiond
SC-1:~/temp # ps -e | grep -i lttng
23690 pts/0 00:00:00 lttng-sessiond
24167 pts/0 00:00:00 lttng-consumerd
24329 ? 00:00:00 lttng-relayd
25197 ? 00:00:00 lttng-relayd
25916 ? 00:00:00 lttng-relayd
SC-1:~/temp # ps -e | grep -i lttng
23690 pts/0 00:00:00 lttng-sessiond
24167 pts/0 00:00:00 lttng-consumerd <defunct>
/Amer
Updated by Amer Alhalabi over 11 years ago
Please ignore my previous update as it was meant for a different bug.
I am sorry!
Updated by Tan le tran over 11 years ago
- File apr27_terminal.log added
We have rebuilt our lttng using the following commits (as of Apr-27):

userspace-rcu 171ff3a (HEAD, origin/master, origin/HEAD) Add "sparc" host cpu to configure.ac
lttng-ust     64bf51a (HEAD, origin/master, origin/HEAD) Fix: add internal mutex for timer
lttng-tools   2b8f875 (HEAD, origin/master, origin/HEAD) Cleanup: Use own mutex within timer ...
babeltrace    888ec52 (HEAD, origin/master, origin/HEAD) Fix: add missing error return when ....

It now gives a new flavour of the original problem. Instead of having to repeat the test case a couple of times to get the coredump, it now coredumps on a single attempt. However, according to the gdb printout, the coredump now seems to point to a different place (instead of an abort, it is now a segfault):

(gdb) bt
#0 shm_close_wakeup_fd (ref=<optimized out>, handle=<optimized out>) at shm.h:175
#1 ring_buffer_stream_close_wakeup_fd (config=<optimized out>, chan=0x62f190, handle=0x633de0, cpu=<optimized out>) at ring_buffer_frontend.c:946
#2 0x0000000000415f70 in close_metadata (chan_key=<optimized out>) at ust-consumer.c:625
#3 lttng_ustconsumer_recv_cmd (ctx=<optimized out>, sock=21, consumer_sockpoll=0x7fccf1d90c10) at ust-consumer.c:1019
#4 0x000000000040548c in consumer_thread_sessiond_poll (data=<optimized out>) at consumer.c:2892
#5 0x00007fccf3d1e7b6 in start_thread () from /lib64/libpthread.so.0
#6 0x00007fccf3a7ac6d in clone () from /lib64/libc.so.6
#7 0x0000000000000000 in ?? ()

Enclosed is the terminal log with "bt full". Please let us know if further info is needed.
Updated by Mathieu Desnoyers over 11 years ago
I've been able to reproduce from scripts:
console 1:
lttng-sessiond

console 2:
One 100ps test app started.

console 3:
#!/bin/sh
for a in $(seq 1 1000); do
	lttng create -U net://localhost
	lttng enable-channel -u --buffers-uid --switch-timer 1000000 test
	lttng enable-channel -u --buffers-uid --switch-timer 1000000 metadata
	lttng add-context -u -t vpid -c test
	lttng enable-event -u -a --filter '1==1' -c test
	lttng start
	lttng destroy
done

console 4:
for a in $(seq 1 20); do (lttng-relayd &); sleep 1; killall lttng-relayd; done
Updated by Mathieu Desnoyers over 11 years ago
Can you try again with the following upstream commits?
ea88ca2 Fix: consumerd metadata channel/cache/timer races
a0cbdd2 Fix: consumerd channel destroy race
a500c25 Cleanup: document RCU read-side lock better
f96e454 consumer relayd interaction: fix segfaults
It fixes it for me here.
Thanks,
Mathieu
Updated by Tan le tran over 11 years ago
I have re-run my test case multiple times and the bug is no longer seen.
Looks like the fix is good.
Thanks for your help,
Tan
Updated by Mathieu Desnoyers over 11 years ago
- Status changed from Feedback to Resolved