Bug #510
closed
lttng 2.2.0-rc1: ConsumerD coredump once in a while after a few iterations of "TC dealing with killing relayd while session is active"
Added by Tan le tran over 11 years ago.
Updated over 11 years ago.
Description
Commit used:
============
rcu : 8a97780 (HEAD) Fix: tests/api.h use cpuset.h
ust : a16877a (HEAD, origin/master, origin/HEAD) Typo fix in README
tools : 6bf5e7c (HEAD, origin/master, origin/HEAD) Fix: remove mention of trace ...
babeltrace : d088404 (HEAD, origin/master, origin/HEAD) Use objstack for AST allocation
Problem Description:
====================
* ConsumerD coredump with "in __assert_fail () from /lib64/libc.so.6"
So far, 2 different flavours (in terms of "gdb> bt" printouts) have been seen:
Case1:
(gdb) bt
#0 0x00007ff900a13b55 in raise () from /lib64/libc.so.6
#1 0x00007ff900a15131 in abort () from /lib64/libc.so.6
#2 0x00007ff900a0ca10 in __assert_fail () from /lib64/libc.so.6
#3 0x000000000041c28f in lttng_ustconsumer_request_metadata (ctx=0x632d10, channel=0x6359b0) at ust-consumer.c:1404
#4 0x000000000040c88f in metadata_switch_timer (ctx=0x632d10, sig=44, si=0x7ff8fe5cdb10, uc=0x0) at consumer-timer.c:68
#5 0x000000000040ce85 in consumer_timer_metadata_thread (data=0x632d10) at consumer-timer.c:215
#6 0x00007ff900d5c7b6 in start_thread () from /lib64/libpthread.so.0
#7 0x00007ff900ab8c6d in clone () from /lib64/libc.so.6
#8 0x0000000000000000 in ?? ()
Case2:
(gdb) bt
#0 0x00007feadea4bb55 in raise () from /lib64/libc.so.6
#1 0x00007feadea4d131 in abort () from /lib64/libc.so.6
#2 0x00007feadea44a10 in __assert_fail () from /lib64/libc.so.6
#3 0x0000000000413a29 in lttng_ht_add_unique_u64 (ht=0x633fa0, node=0x63c138) at hashtable.c:281
#4 0x000000000040ad1e in consumer_thread_channel_poll (data=0x632d10) at consumer.c:2697
#5 0x00007feaded947b6 in start_thread () from /lib64/libpthread.so.0
#6 0x00007feadeaf0c6d in clone () from /lib64/libc.so.6
#7 0x0000000000000000 in ?? ()
Is the problem reproducible?
============================
* yes
How to reproduce (if reproducible):
===================================
* When running the following scenario a couple of times, a consumerD coredump
can be observed (NOTE: all interactions with lttng were done via API calls;
a minimal C sketch of the API sequence is shown after the steps below):
1)_ lttng-relayd -C tcp://0.0.0.0:53000 -D tcp://0.0.0.0:53001 -o <some output path> &
2)_ create a session (using the above ports)
a)_ lttng_create_session()
b)_ lttng_create_handle()
c)_ lttng_channel_set_default_attr() // one for ctf and one for metadata
d)_ lttng_enable_channel() // subbuf_size = 16384 (for both ctf and metadata)
// switch_timer_interval = 1000000 (for both ctf and metadata)
e)_ lttng_add_context(..,LTTNG_EVENT_CONTEXT_VPID,..)
f)_ lttng_enable_event_with_filter() // enable a tracepoint which has only 1 hit per second
3)_ start the instrumented apps to produce events
4)_ lttng_start_tracing()
5)_ sleep 5
6)_ "pkill relayd"
7)_ lttng_stop_tracing_no_wait()
8)_ loop every 1 sec until lttng_data_pending() returns 0 (timeout after 60 sec)
9)_ lttng_destroy_session()
10)_ "pkill TestApp" (kill all running instrumented apps)
11)_ pause for a couple of seconds, then repeat from step 1 to 11
Sometimes the coredump happens during the 2nd or 3rd iteration.
* Unfortunately, we could not reproduce the above issue using equivalent lttng cli commands!
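For reference, here is a minimal C sketch of the API call sequence from the steps above, written against the lttng-ctl API of the 2.2 tree. It is illustrative only: the session, channel and event names and the filter are placeholders (not the ones from the actual test case), error handling is mostly omitted, and the relayd URL is simplified (the real test passes the custom ports 53000/53001 to lttng-relayd). The real test also sets up a second channel named "metadata" with the same attributes; only one channel is shown here for brevity.
/*
 * repro_sketch.c - illustrative only; build with: gcc repro_sketch.c -llttng-ctl
 * Placeholder names: "relayd_kill_tc", "ch0", "my_app:slow_event".
 */
#include <string.h>
#include <unistd.h>
#include <lttng/lttng.h>

int main(void)
{
    const char *session = "relayd_kill_tc";      /* placeholder session name */
    struct lttng_domain dom;
    struct lttng_handle *handle;
    struct lttng_channel chan;
    struct lttng_event ev;
    struct lttng_event_context ctx;
    int i;

    /* Step 2a: create the session, streaming to the relayd started in step 1. */
    if (lttng_create_session(session, "net://localhost") < 0) {
        return 1;
    }

    /* Step 2b: handle for the UST domain. */
    memset(&dom, 0, sizeof(dom));
    dom.type = LTTNG_DOMAIN_UST;
    handle = lttng_create_handle(session, &dom);
    if (!handle) {
        return 1;
    }

    /* Steps 2c/2d: default attributes, then the test case overrides
     * (16384-byte subbuffers, 1 s switch timer, expressed in usec). */
    memset(&chan, 0, sizeof(chan));
    strcpy(chan.name, "ch0");                    /* placeholder channel name */
    lttng_channel_set_default_attr(&dom, &chan.attr);
    chan.attr.subbuf_size = 16384;
    chan.attr.switch_timer_interval = 1000000;
    lttng_enable_channel(handle, &chan);

    /* Step 2e: add the vpid context on the channel. */
    memset(&ctx, 0, sizeof(ctx));
    ctx.ctx = LTTNG_EVENT_CONTEXT_VPID;
    lttng_add_context(handle, &ctx, NULL, chan.name);

    /* Step 2f: enable one tracepoint with a filter. */
    memset(&ev, 0, sizeof(ev));
    ev.type = LTTNG_EVENT_TRACEPOINT;
    strcpy(ev.name, "my_app:slow_event");        /* placeholder event name */
    lttng_enable_event_with_filter(handle, &ev, chan.name, "1==1");

    /* Steps 4-6: start tracing; the relayd is killed externally after ~5 s. */
    lttng_start_tracing(session);
    sleep(5);

    /* Steps 7/8: stop without waiting, then poll lttng_data_pending()
     * every second, giving up after 60 seconds. */
    lttng_stop_tracing_no_wait(session);
    for (i = 0; i < 60; i++) {
        if (lttng_data_pending(session) <= 0) {
            break;
        }
        sleep(1);
    }

    /* Step 9: destroy the session and release the handle. */
    lttng_destroy_session(session);
    lttng_destroy_handle(handle);
    return 0;
}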
Any other information:
======================
- "sessiond -vvv" have been collected for each of the above 2 cases.
- Assignee set to Julien Desfossez
- Target version set to 2.2
- Status changed from New to Feedback
- Assignee changed from Julien Desfossez to Mathieu Desnoyers
Please check if the following commits fix your 2 issues (or either of the 2):
lttng-tools:
2b8f875 Cleanup: Use own mutex within timer setup/teardown
f2ad556 Fix: channel management thread should hold a refcount
fc64324 Fix: move metadata cache setup before ownership passing
4419b4f Fix: consumer metadata switch timer error handling
lttng-ust:
64bf51a Fix: add internal mutex for timer
Thanks,
Mathieu
Hi Mathieu,
I tried the new commits, but I am still having the same problem.
I have collected some logs for you (attached).
And for some reason, I now see 3 relayds.
SC-1:~/temp # lttng-sessiond -vvv --verbose-consumer &> /tmp/sessiond.log &
[1] 23690
SC-1:~/temp # ps -e | grep -i lttng
23690 pts/0 00:00:00 lttng-sessiond
SC-1:~/temp # ps -e | grep -i lttng
23690 pts/0 00:00:00 lttng-sessiond
SC-1:~/temp # ps -e | grep -i lttng
23690 pts/0 00:00:00 lttng-sessiond
24167 pts/0 00:00:00 lttng-consumerd
24329 ? 00:00:00 lttng-relayd
25197 ? 00:00:00 lttng-relayd
25916 ? 00:00:00 lttng-relayd
SC-1:~/temp # ps -e | grep -i lttng
23690 pts/0 00:00:00 lttng-sessiond
24167 pts/0 00:00:00 lttng-consumerd <defunct>
/Amer
Please ignore my previous update as it was meant for a different bug.
I am sorry!
We have rebuilt our lttng using the following commits (as of Apr-27):
userspace-rcu 171ff3a (HEAD, origin/master, origin/HEAD) Add "sparc" host cpu to configure.ac
lttng-ust 64bf51a (HEAD, origin/master, origin/HEAD) Fix: add internal mutex for timer
lttng-tools 2b8f875 (HEAD, origin/master, origin/HEAD) Cleanup: Use own mutex within timer ...
babeltrace 888ec52 (HEAD, origin/master, origin/HEAD) Fix: add missing error return when ....
This gives a new flavour to the original problem.
Instead of having to repeat the test case a couple of times to get the coredump, it now coredumps on a single attempt. However, according to the gdb printout, the coredump now seems to point to a different place (instead of an abort, it is now a segfault):
(gdb) bt
#0 shm_close_wakeup_fd (ref=<optimized out>, handle=<optimized out>) at shm.h:175
#1 ring_buffer_stream_close_wakeup_fd (config=<optimized out>, chan=0x62f190, handle=0x633de0, cpu=<optimized out>)
at ring_buffer_frontend.c:946
#2 0x0000000000415f70 in close_metadata (chan_key=<optimized out>) at ust-consumer.c:625
#3 lttng_ustconsumer_recv_cmd (ctx=<optimized out>, sock=21, consumer_sockpoll=0x7fccf1d90c10) at ust-consumer.c:1019
#4 0x000000000040548c in consumer_thread_sessiond_poll (data=<optimized out>) at consumer.c:2892
#5 0x00007fccf3d1e7b6 in start_thread () from /lib64/libpthread.so.0
#6 0x00007fccf3a7ac6d in clone () from /lib64/libc.so.6
#7 0x0000000000000000 in ?? ()
Enclosed is the terminal log with "bt full".
Please, let us know if further info is needed.
I've been able to reproduce from scripts:
console 1:
lttng-sessiond
console 2:
One 100ps test app started.
console 3:
#!/bin/sh
for a in $(seq 1 1000); do
lttng create -U net://localhost
lttng enable-channel -u --buffers-uid --switch-timer 1000000 test
lttng enable-channel -u --buffers-uid --switch-timer 1000000 metadata
lttng add-context -u -t vpid -c test
lttng enable-event -u -a --filter '1==1' -c test
lttng start
lttng destroy
done
console 4:
for a in $(seq 1 20); do (lttng-relayd &); sleep 1; killall lttng-relayd; done
Can you try again with the following upstream commits?
ea88ca2 Fix: consumerd metadata channel/cache/timer races
a0cbdd2 Fix: consumerd channel destroy race
a500c25 Cleanup: document RCU read-side lock better
f96e454 consumer relayd interaction: fix segfaults
It fixes it for me here.
Thanks,
Mathieu
I have re-run my test case multiple times and the bug is no longer seen.
Looks like the fix is good.
Thanks for your help,
Tan
- Status changed from Feedback to Resolved