Bug #510

lttng 2.2.0-rc1: ConsumerD coredumps once in a while after a few iterations of "TC dealing with killing relayd while session is active"

Added by Tan le tran over 11 years ago. Updated over 11 years ago.

Status: Resolved
Priority: High
Target version: 2.2
Start date: 04/23/2013
Due date:
% Done: 0%
Estimated time:

Description

Commits used:
=============
rcu        : 8a97780 (HEAD) Fix: tests/api.h use cpuset.h
ust        : a16877a (HEAD, origin/master, origin/HEAD) Typo fix in README
tools      : 6bf5e7c (HEAD, origin/master, origin/HEAD) Fix: remove mention of trace ...
babeltrace : d088404 (HEAD, origin/master, origin/HEAD) Use objstack for AST allocation

Problem Description:
====================
 * ConsumerD coredumps with "in __assert_fail () from /lib64/libc.so.6".
   So far, two different flavours (in terms of "gdb> bt" printouts) have been seen:

   Case1:
        (gdb) bt
        #0  0x00007ff900a13b55 in raise () from /lib64/libc.so.6
        #1  0x00007ff900a15131 in abort () from /lib64/libc.so.6
        #2  0x00007ff900a0ca10 in __assert_fail () from /lib64/libc.so.6
        #3  0x000000000041c28f in lttng_ustconsumer_request_metadata (ctx=0x632d10, channel=0x6359b0) at ust-consumer.c:1404
        #4  0x000000000040c88f in metadata_switch_timer (ctx=0x632d10, sig=44, si=0x7ff8fe5cdb10, uc=0x0) at consumer-timer.c:68
        #5  0x000000000040ce85 in consumer_timer_metadata_thread (data=0x632d10) at consumer-timer.c:215
        #6  0x00007ff900d5c7b6 in start_thread () from /lib64/libpthread.so.0
        #7  0x00007ff900ab8c6d in clone () from /lib64/libc.so.6
        #8  0x0000000000000000 in ?? ()

   Case2:
        (gdb) bt
        #0  0x00007feadea4bb55 in raise () from /lib64/libc.so.6
        #1  0x00007feadea4d131 in abort () from /lib64/libc.so.6
        #2  0x00007feadea44a10 in __assert_fail () from /lib64/libc.so.6
        #3  0x0000000000413a29 in lttng_ht_add_unique_u64 (ht=0x633fa0, node=0x63c138) at hashtable.c:281
        #4  0x000000000040ad1e in consumer_thread_channel_poll (data=0x632d10) at consumer.c:2697
        #5  0x00007feaded947b6 in start_thread () from /lib64/libpthread.so.0
        #6  0x00007feadeaf0c6d in clone () from /lib64/libc.so.6
        #7  0x0000000000000000 in ?? ()

Is the problem reproducible?
============================
  * yes

How to reproduce (if reproducible):
===================================
  * When running the following scenario a couple of times, a consumerD coredump
    can be observed (NOTE: all interactions with lttng were done via API calls;
    a minimal API sketch follows this section):

    1)_ lttng-relayd  -C tcp://0.0.0.0:53000 -D tcp://0.0.0.0:53001  -o <some output path> &
    2)_ create a session (using the above ports)
         a)_ lttng_create_session()
         b)_ lttng_create_handle()
         c)_ lttng_channel_set_default_attr()  // one for ctf and one for metadata
         d)_ lttng_enable_channel()  // subbuf_size = 16384  (for both ctf and metadata)
                                     // switch_timer_interval = 1000000 (for both ctf and metadata)
         e)_ lttng_add_context(..,LTTNG_EVENT_CONTEXT_VPID,..)
         f)_ lttng_enable_event_with_filter()   // enable a tracepoint which has only 1 hit per second
    3)_ start the instrumented apps to produce events 
    4)_ lttng_start_tracing()
    5)_ sleep 5
    6)_ "pkill relayd" 
    7)_ lttng_stop_tracing_no_wait()
    8)_ loop every 1 sec until lttng_data_pending() returns 0 (timeout after 60 sec)
    9)_ lttng_destroy_session()
    10)_ "pkill TestApp" (kill all running instrumented apps)
    11)_ pause for a couple of seconds, then repeat steps 1 to 11
         Sometimes the coredump happens during the 2nd or 3rd iteration.

  * Unfortunately, we could not reproduce the above issue using equivalent lttng cli commands!
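
  * For reference, here is a minimal sketch of how steps 2 through 9 map onto
    the liblttng-ctl API used above (lttng/lttng.h, lttng-tools 2.2). The
    session name "repro", the channel names, the tracepoint name
    "my_app:slow_event", the net:// URL form, and the pkill invocation are
    illustrative assumptions, not values from our actual test harness:

#include <lttng/lttng.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	struct lttng_domain dom;
	struct lttng_handle *handle;
	struct lttng_channel chan;
	struct lttng_event_context ctx;
	struct lttng_event ev;
	const char *channels[] = { "test", "metadata" };
	int i;

	memset(&dom, 0, sizeof(dom));
	dom.type = LTTNG_DOMAIN_UST;

	/* 2a) session streamed to the relayd (ports as in step 1; URL form assumed) */
	if (lttng_create_session("repro", "net://localhost:53000:53001") < 0)
		return 1;
	/* 2b) handle tying the session to the UST domain */
	handle = lttng_create_handle("repro", &dom);
	if (!handle)
		return 1;

	/* 2c/2d) one data channel and one "metadata" channel, both with
	 * subbuf_size = 16384 and switch_timer_interval = 1000000 */
	for (i = 0; i < 2; i++) {
		memset(&chan, 0, sizeof(chan));
		lttng_channel_set_default_attr(&dom, &chan.attr);
		chan.attr.subbuf_size = 16384;
		chan.attr.switch_timer_interval = 1000000;
		strncpy(chan.name, channels[i], sizeof(chan.name) - 1);
		if (lttng_enable_channel(handle, &chan) < 0)
			return 1;
	}

	/* 2e) vpid context on the data channel */
	memset(&ctx, 0, sizeof(ctx));
	ctx.ctx = LTTNG_EVENT_CONTEXT_VPID;
	lttng_add_context(handle, &ctx, NULL, "test");

	/* 2f) one tracepoint (placeholder name) with a trivial filter */
	memset(&ev, 0, sizeof(ev));
	ev.type = LTTNG_EVENT_TRACEPOINT;
	strncpy(ev.name, "my_app:slow_event", sizeof(ev.name) - 1);
	lttng_enable_event_with_filter(handle, &ev, "test", "1==1");

	/* 4) to 7): start, let it run, kill relayd, stop without waiting */
	lttng_start_tracing("repro");
	sleep(5);
	system("pkill lttng-relayd");           /* step 6 */
	lttng_stop_tracing_no_wait("repro");

	/* 8) poll every second until no data is pending (60 sec timeout) */
	for (i = 0; i < 60; i++) {
		if (lttng_data_pending("repro") <= 0)
			break;
		sleep(1);
	}

	/* 9) tear down */
	lttng_destroy_session("repro");
	lttng_destroy_handle(handle);
	return 0;
}

    Build with something like "gcc repro.c -o repro -llttng-ctl". Steps 3 and
    10 (starting and killing the instrumented apps) are handled outside this
    snippet.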

Any other information:
======================
- "sessiond -vvv" have been collected for each of the above 2 cases.


Files

usingTestApp_sessionD_vvv_SC1_flavor1.log (171 KB) - "sessiond -vvv" printout for case1 - Tan le tran, 04/23/2013 02:07 PM
usingTestApp_sessionD_vvv_SC1_flavor2.log (658 KB) - "sessiond -vvv" printout for case2 - Tan le tran, 04/23/2013 02:07 PM
usingTestApp_terminal_and_gdb_info_flavor1.log (19.4 KB) - gdb info for case1 - Tan le tran, 04/23/2013 02:07 PM
usingTestApp_terminal_and_gdb_info_flavor2.log (24.6 KB) - gdb info for case2 - Tan le tran, 04/23/2013 02:07 PM
sessiond.log (756 KB) - Amer Alhalabi, 04/26/2013 04:09 PM
apr27_terminal.log (14.9 KB) - gdb printout - Tan le tran, 04/30/2013 07:30 AM
#1

Updated by David Goulet over 11 years ago

  • Assignee set to Julien Desfossez
  • Target version set to 2.2
#2

Updated by Mathieu Desnoyers over 11 years ago

  • Status changed from New to Feedback
  • Assignee changed from Julien Desfossez to Mathieu Desnoyers

Please check whether the following commits fix your two issues (or either of them):

lttng-tools:
2b8f875 Cleanup: Use own mutex within timer setup/teardown
f2ad556 Fix: channel management thread should hold a refcount
fc64324 Fix: move metadata cache setup before ownership passing
4419b4f Fix: consumer metadata switch timer error handling

lttng-ust:
64bf51a Fix: add internal mutex for timer

Thanks,

Mathieu

#3

Updated by Amer Alhalabi over 11 years ago

Hi Mathieu,

I tried the new commits, but I am still seeing the same problem.

I have collected some logs for you (attached).

And for some reason I now see 3 relayds.

SC-1:~/temp # lttng-sessiond -vvv --verbose-consumer &> /tmp/sessiond.log &
[1] 23690

SC-1:~/temp # ps -e | grep -i lttng
23690 pts/0 00:00:00 lttng-sessiond

SC-1:~/temp # ps -e | grep -i lttng
23690 pts/0 00:00:00 lttng-sessiond

SC-1:~/temp # ps -e | grep -i lttng
23690 pts/0 00:00:00 lttng-sessiond
24167 pts/0 00:00:00 lttng-consumerd
24329 ? 00:00:00 lttng-relayd
25197 ? 00:00:00 lttng-relayd
25916 ? 00:00:00 lttng-relayd

SC-1:~/temp # ps -e | grep -i lttng
23690 pts/0 00:00:00 lttng-sessiond
24167 pts/0 00:00:00 lttng-consumerd <defunct>

/Amer

#4

Updated by Amer Alhalabi over 11 years ago

Please ignore my previous update as it was meant for a different bug.

I am sorry!

#5

Updated by Tan le tran over 11 years ago

We have rebuilt our lttng using the following commits (as of Apr-27):
  userspace-rcu 171ff3a (HEAD, origin/master, origin/HEAD) Add "sparc" host cpu to configure.ac
  lttng-ust 64bf51a (HEAD, origin/master, origin/HEAD) Fix: add internal mutex for timer
  lttng-tools 2b8f875 (HEAD, origin/master, origin/HEAD) Cleanup: Use own mutex within timer ...
  babeltrace 888ec52 (HEAD, origin/master, origin/HEAD) Fix: add missing error return when ....

It now gives a new flavour of the original problem.
Instead of having to repeat the test case a couple of times to get the coredump, it now dumps core on a single attempt. However, according to the gdb printout, the crash now seems to point to a different place (instead of an abort, it is now a segfault):

(gdb) bt
  #0  shm_close_wakeup_fd (ref=<optimized out>, handle=<optimized out>) at shm.h:175
  #1  ring_buffer_stream_close_wakeup_fd (config=<optimized out>, chan=0x62f190, handle=0x633de0, cpu=<optimized out>)
    at ring_buffer_frontend.c:946
  #2  0x0000000000415f70 in close_metadata (chan_key=<optimized out>) at ust-consumer.c:625
  #3  lttng_ustconsumer_recv_cmd (ctx=<optimized out>, sock=21, consumer_sockpoll=0x7fccf1d90c10) at ust-consumer.c:1019
  #4  0x000000000040548c in consumer_thread_sessiond_poll (data=<optimized out>) at consumer.c:2892
  #5  0x00007fccf3d1e7b6 in start_thread () from /lib64/libpthread.so.0
  #6  0x00007fccf3a7ac6d in clone () from /lib64/libc.so.6
  #7  0x0000000000000000 in ?? ()

Enclosed is the terminal log with "bt full".
Please let us know if further info is needed.

#6

Updated by Mathieu Desnoyers over 11 years ago

I've been able to reproduce from scripts:

console 1:
lttng-sessiond

console 2:
One 100ps test app started.

console 3:
#!/bin/sh

for a in $(seq 1 1000); do
    lttng create -U net://localhost
    lttng enable-channel -u --buffers-uid --switch-timer 1000000 test
    lttng enable-channel -u --buffers-uid --switch-timer 1000000 metadata
    lttng add-context -u -t vpid -c test
    lttng enable-event -u -a --filter '1==1' -c test
    lttng start
    lttng destroy
done

console 4:

for a in $(seq 1 20); do (lttng-relayd &); sleep 1; killall lttng-relayd; done

#7

Updated by Mathieu Desnoyers over 11 years ago

Can you try again with the following upstream commits?

ea88ca2 Fix: consumerd metadata channel/cache/timer races
a0cbdd2 Fix: consumerd channel destroy race
a500c25 Cleanup: document RCU read-side lock better
f96e454 consumer relayd interaction: fix segfaults

It fixes it for me here.

Thanks,

Mathieu

#8

Updated by Tan le tran over 11 years ago

I have re-run my test case multiple times and the bug is no longer seen.
Looks like the fix is good.
Thanks for your help,
Tan

#9

Updated by Mathieu Desnoyers over 11 years ago

  • Status changed from Feedback to Resolved