Project

General

Profile

Actions

Bug #510

closed
TL MD

lttng2.2.0rc1: ConsumerD coredump once in a while after a few iteration of "TC dealing with killing relayd while session is active"

Bug #510: lttng2.2.0rc1: ConsumerD coredump once in a while after a few iteration of "TC dealing with killing relayd while session is active"

Added by Tan le tran about 13 years ago. Updated about 13 years ago.

Status:
Resolved
Priority:
High
Target version:
Start date:
04/23/2013
Due date:
% Done:

0%

Estimated time:

Description

Commit used:
============
rcu        : 8a97780 (HEAD) Fix: tests/api.h use cpuset.h
ust        : a16877a (HEAD, origin/master, origin/HEAD) Typo fix in README
tools      : 6bf5e7c (HEAD, origin/master, origin/HEAD) Fix: remove mention of trace ...
babeltrace : d088404 (HEAD, origin/master, origin/HEAD) Use objstack for AST allocation

Problem Description:
====================
 * ConsumerD coredump with "in __assert_fail () from /lib64/libc.so.6" 
   So far, 2 different flavours (in terms of "gdb> bt" printouts) have been seen:

   Case1:
        (gdb) bt
        #0  0x00007ff900a13b55 in raise () from /lib64/libc.so.6
        #1  0x00007ff900a15131 in abort () from /lib64/libc.so.6
        #2  0x00007ff900a0ca10 in __assert_fail () from /lib64/libc.so.6
        #3  0x000000000041c28f in lttng_ustconsumer_request_metadata (ctx=0x632d10, channel=0x6359b0) at ust-consumer.c:1404
        #4  0x000000000040c88f in metadata_switch_timer (ctx=0x632d10, sig=44, si=0x7ff8fe5cdb10, uc=0x0) at consumer-timer.c:68
        #5  0x000000000040ce85 in consumer_timer_metadata_thread (data=0x632d10) at consumer-timer.c:215
        #6  0x00007ff900d5c7b6 in start_thread () from /lib64/libpthread.so.0
        #7  0x00007ff900ab8c6d in clone () from /lib64/libc.so.6
        #8  0x0000000000000000 in ?? ()

   Case2:
       (gdb) bt
        #0  0x00007feadea4bb55 in raise () from /lib64/libc.so.6
        #1  0x00007feadea4d131 in abort () from /lib64/libc.so.6
        #2  0x00007feadea44a10 in __assert_fail () from /lib64/libc.so.6
        #3  0x0000000000413a29 in lttng_ht_add_unique_u64 (ht=0x633fa0, node=0x63c138) at hashtable.c:281
        #4  0x000000000040ad1e in consumer_thread_channel_poll (data=0x632d10) at consumer.c:2697
        #5  0x00007feaded947b6 in start_thread () from /lib64/libpthread.so.0
        #6  0x00007feadeaf0c6d in clone () from /lib64/libc.so.6
        #7  0x0000000000000000 in ?? ()

Is problem reproducible ?
=========================
  * yes 

How to reproduce (if reproducible):
===================================
  * When running the following scenario a couple of times, a consumerD coredump
    can be observed (NOTE: all interaction with lttng were done via API calls):

    1)_ lttng-relayd  -C tcp://0.0.0.0:53000 -D tcp://0.0.0.0:53001  -o <some out put path> &
    2)_ create a session (using the above ports)
         a)_ lttng_create_session()
         b)_ lttng_create_handle()
         c)_ lttng_channel_set_default_attr()  // one for ctf and one for metadata
         d)_ lttng_enable_channel()  // subbuf_size = 16384  (for both ctf and metadata)
                                     // switch_timer_interval = 1000000 (for both ctf and metadata)
         e)_ lttng_add_context(..,LTTNG_EVENT_CONTEXT_VPID,..)
         f)_ lttng_enable_event_with_filter()   // enable a tracepoint which has only 1 hit per second
    3)_ start the instrumented apps to produce events 
    4)_ lttng_start_tracing()
    5)_ sleep 5
    6)_ "pkill relayd" 
    7)_ lttng_stop_tracing_no_wait()
    8)_ loop every 1 sec until lttng_data_pending() returns 0 (timeout at after 60 sec)
    9)_ lttng_destroy_session()
    10)_ "pkill TestApp" (kill all running instrumented apps)
    11)_ pause for a couple of sec then repeat from step-1 to 11
         Sometimes the coredump happens during the 2nd or 3rd iteration.

  * Unfortunately, we could not reproduce the above issue using equivalent lttng cli commands !

Any other information:
======================
- "sessiond -vvv" have been collected for each of the above 2 cases.


Files

usingTestApp_sessionD_vvv_SC1_flavor1.log (171 KB) usingTestApp_sessionD_vvv_SC1_flavor1.log "sessiond -vvv" printout for case1 Tan le tran, 04/23/2013 02:07 PM
usingTestApp_sessionD_vvv_SC1_flavor2.log (658 KB) usingTestApp_sessionD_vvv_SC1_flavor2.log "sessiond -vvv" printout for case2 Tan le tran, 04/23/2013 02:07 PM
usingTestApp_terminal_and_gdb_info_flavor1.log (19.4 KB) usingTestApp_terminal_and_gdb_info_flavor1.log gdb info for case1 Tan le tran, 04/23/2013 02:07 PM
usingTestApp_terminal_and_gdb_info_flavor2.log (24.6 KB) usingTestApp_terminal_and_gdb_info_flavor2.log gdb info for case2 Tan le tran, 04/23/2013 02:07 PM
sessiond.log (756 KB) sessiond.log Amer Alhalabi, 04/26/2013 04:09 PM
apr27_terminal.log (14.9 KB) apr27_terminal.log gdb printout Tan le tran, 04/30/2013 07:30 AM

DG Updated by David Goulet about 13 years ago Actions #1

  • Assignee set to Julien Desfossez
  • Target version set to 2.2

MD Updated by Mathieu Desnoyers about 13 years ago Actions #2

  • Status changed from New to Feedback
  • Assignee changed from Julien Desfossez to Mathieu Desnoyers

Please check if the following commits fix your 2 issues (or any of the 2):

lttng-tools:
2b8f875 Cleanup: Use own mutex within timer setup/teardown
f2ad556 Fix: channel management thread should hold a refcount
fc64324 Fix: move metadata cache setup before ownership passing
4419b4f Fix: consumer metadata switch timer error handling

lttng-ust:
64bf51a Fix: add internal mutex for timer

Thanks,

Mathieu

AA Updated by Amer Alhalabi about 13 years ago Actions #3

Hi Mathieu,

I tried the new commits, but still having the same problem.

I have collected some logs for you (attached).

And for some reason now I see 3 relayds.

SC-1:~/temp # lttng-sessiond -vvv --verbose-consumer &> /tmp/sessiond.log &
[1] 23690

SC-1:~/temp # ps -e | grep -i lttng
23690 pts/0 00:00:00 lttng-sessiond

SC-1:~/temp # ps -e | grep -i lttng
23690 pts/0 00:00:00 lttng-sessiond

SC-1:~/temp # ps -e | grep -i lttng
23690 pts/0 00:00:00 lttng-sessiond
24167 pts/0 00:00:00 lttng-consumerd
24329 ? 00:00:00 lttng-relayd
25197 ? 00:00:00 lttng-relayd
25916 ? 00:00:00 lttng-relayd

SC-1:~/temp # ps -e | grep -i lttng
23690 pts/0 00:00:00 lttng-sessiond
24167 pts/0 00:00:00 lttng-consumerd <defunct>

/Amer

AA Updated by Amer Alhalabi about 13 years ago Actions #4

Please ignore my previous update as it was meant for a different bug.

I am sorry!

TL Updated by Tan le tran about 13 years ago Actions #5

We have rebuilt our lttng using the following commits (as from Apr-27):
  userspace-rcu 171ff3a (HEAD, origin/master, origin/HEAD) Add "sparc" host cpu to configure.ac
  lttng-ust 64bf51a (HEAD, origin/master, origin/HEAD) Fix: add internal mutex for timer
  lttng-tools 2b8f875 (HEAD, origin/master, origin/HEAD) Cleanup: Use own mutex within timer ...
  babeltrace 888ec52 (HEAD, origin/master, origin/HEAD) Fix: add missing error return when ....

It now gives a new flavour to the original problem.
Instead of having to repeat the test case a couple of time to get the coredump, it now gives coredump at a single attempt. However, according to the gdb printout, the coredump seems to point towards a different place now (instead of abort, it now gives segfault):

(gdb) bt
  #0  shm_close_wakeup_fd (ref=<optimized out>, handle=<optimized out>) at shm.h:175
  #1  ring_buffer_stream_close_wakeup_fd (config=<optimized out>, chan=0x62f190, handle=0x633de0, cpu=<optimized out>)
    at ring_buffer_frontend.c:946
  #2  0x0000000000415f70 in close_metadata (chan_key=<optimized out>) at ust-consumer.c:625
  #3  lttng_ustconsumer_recv_cmd (ctx=<optimized out>, sock=21, consumer_sockpoll=0x7fccf1d90c10) at ust-consumer.c:1019
  #4  0x000000000040548c in consumer_thread_sessiond_poll (data=<optimized out>) at consumer.c:2892
  #5  0x00007fccf3d1e7b6 in start_thread () from /lib64/libpthread.so.0
  #6  0x00007fccf3a7ac6d in clone () from /lib64/libc.so.6
  #7  0x0000000000000000 in ?? ()

Enclosed is the termial log with "bt full".
Please, let us know if further info is needed.

MD Updated by Mathieu Desnoyers about 13 years ago Actions #6

I've been able to reproduce from scripts:

console 1:
lttng-sessiond

console 2
One 100ps test app started.

console 3:
#!/bin/sh

for a in $(seq 1 1000); do
    lttng create -U net://localhost
    lttng enable-channel -u --buffers-uid --switch-timer 1000000 test
    lttng enable-channel -u --buffers-uid --switch-timer 1000000 metadata
    lttng add-context -u -t vpid -c test
    lttng enable-event -u -a --filter '1==1' -c test
    lttng start
    lttng destroy
done

console 4:

for a in $(seq 1 20); do (lttng-relayd &); sleep 1; killall lttng-relayd; done

MD Updated by Mathieu Desnoyers about 13 years ago Actions #7

Can you try again with the following upstream commits ?

ea88ca2 Fix: consumerd metadata channel/cache/timer races
a0cbdd2 Fix: consumerd channel destroy race
a500c25 Cleanup: document RCU read-side lock better
f96e454 consumer relayd interaction: fix segfaults

It fixes it for me here.

Thanks,

Mathieu

TL Updated by Tan le tran about 13 years ago Actions #8

I have re-run my test case multiple times and the bug is no longer seen .
Looks like the fix is good.
Thanks for your help,
Tan

MD Updated by Mathieu Desnoyers about 13 years ago Actions #9

  • Status changed from Feedback to Resolved
Actions

Also available in: PDF Atom