Bug #595


don't contact the consumer when it is dead

Added by Julien Desfossez almost 11 years ago. Updated almost 11 years ago.

Target version:
Start date:
Due date:
% Done:


Estimated time:


on ctrl+c with snapshots enabled, the cleanup mechanism tries to contact the consumer even when it is already gone.

steps to reproduce :
start lttng-sessiond in a console (not in daemon mode)
in a separate console :
lttng create --snapshot
lttng enable-event -k sched_switch
lttng start

then ctrl+c in the console with the session

This should appear :
PERROR [15884/15884]: sendmsg: Bad file descriptor (in lttcomm_send_unix_sock() at unix.c:218)

Backtrace leading to this bug :
#4 0x000000000042f5a5 in lttcomm_send_unix_sock (sock=<optimized out>, buf=<optimized out>,
len=<optimized out>) at unix.c:219
#5 0x000000000040ec0c in consumer_send_msg (sock=0x7f3b8c003eb0, msg=<optimized out>)
at consumer.c:676
#6 0x0000000000412253 in kernel_consumer_destroy_channel (socket=0x7f3b8c003eb0,
channel=<optimized out>) at kernel-consumer.c:384
#7 0x000000000040ac6b in kernel_destroy_session (ksess=0x7f3b8c003cb0) at kernel.c:774
#8 0x000000000041524b in cmd_destroy_session (session=0x7f3b8c005cd0, wpipe=-1) at cmd.c:1828
#9 0x000000000040552e in cleanup () at main.c:486
#10 main (argc=<optimized out>, argv=<optimized out>) at main.c:4769

Log starting at the ctrl+c :
^CDEBUG1 [19560/19560]: SIGINT caught (in sighandler() at main.c:4229)
DEBUG1 [19560/19560]: Terminating all threads (in stop_threads() at main.c:382)
DEBUG1 [19701/19701]: ignoring first SIGINT (in sighandler() at lttng-consumerd.c:81)
DEBUG1 [19560/19560]: Futex n to 1 wake done (in futex_nto1_wake() at futex.c:104)
DEBUG1 [19560/19659]: [ust-thread] cleanup complete. (in thread_ht_cleanup() at ht-cleanup.c:130)
DEBUG1 [19560/19662]: Futex n to 1 wait done (in futex_nto1_wait() at futex.c:90)
DEBUG1 [19560/19662]: Dispatch thread dying (in thread_dispatch_ust_registration() at main.c:1684)
DEBUG1 [19560/19702]: consumer thread cleanup completed (in thread_manage_consumer() at main.c:1191)
DEBUG1 [19701/19706]: Incoming command on sock (in consumer_thread_sessiond_poll() at consumer.c:2948)
DEBUG1 [19560/19665]: Application notify communication apps thread cleanup complete (in ust_thread_manage_notify() at ust-thread.c:174)
DEBUG1 [19701/19706]: Communication interrupted on command socket (in consumer_thread_sessiond_poll() at consumer.c:2959)
DEBUG1 [19701/19706]: Consumer thread sessiond poll exiting (in consumer_thread_sessiond_poll() at consumer.c:2969)
DEBUG1 [19560/19660]: Health check thread dying (in thread_manage_health() at main.c:3553)
DEBUG1 [19701/19705]: poll num_rdy : 1 (in consumer_thread_data_poll() at consumer.c:2387)
DEBUG1 [19701/19705]: consumer_data_pipe wake up (in consumer_thread_data_poll() at consumer.c:2411)
DEBUG1 [19560/19663]: Notifying applications of session daemon state: 0 (in notify_ust_apps() at main.c:555)
DEBUG1 [19701/19705]: Consumer delete flagged data stream (in validate_endpoint_status_data_stream() at consumer.c:2059)
DEBUG1 [19701/19705]: polling thread exiting (in consumer_thread_data_poll() at consumer.c:2541)
DEBUG1 [19560/19663]: Got the wait shm fd 5 (in get_wait_shm() at shm.c:119)
DEBUG1 [19701/19703]: Channel event catched in thread (in consumer_thread_channel_poll() at consumer.c:2681)
DEBUG1 [19560/19663]: Futex wait update active 0 (in futex_wait_update() at futex.c:63)
DEBUG1 [19701/19703]: Channel poll thread exiting (in consumer_thread_channel_poll() at consumer.c:2837)
DEBUG1 [19560/19663]: UST Registration thread cleanup complete (in thread_registration_apps() at main.c:1888)
DEBUG1 [19701/19704]: Metadata event catched in thread (in consumer_thread_metadata_poll() at consumer.c:2154)
DEBUG1 [19701/19704]: Metadata thread pipe hung up (in consumer_thread_metadata_poll() at consumer.c:2177)
DEBUG1 [19560/19664]: Application communication apps thread cleanup complete (in thread_manage_apps() at main.c:1351)
DEBUG1 [19701/19704]: Metadata poll thread exiting (in consumer_thread_metadata_poll() at consumer.c:2295)
DEBUG1 [19560/19666]: Kernel thread dying (in thread_manage_kernel() at main.c:881)
PERROR [19701/19701]: sendmsg: Broken pipe (in lttcomm_send_unix_sock() at unix.c:218)
DEBUG1 [19560/19661]: Clean command context structure (in clean_command_ctx() at main.c:535)
lttng-consumerd: unix.c:219: lttcomm_send_unix_sock: Assertion `0' failed.
DEBUG1 [19560/19661]: Client thread dying (in thread_manage_clients() at main.c:3810)
DEBUG1 [19560/19560]: Cleaning up (in cleanup() at main.c:447)
DEBUG1 [19560/19560]: Removing /var/run/lttng directory (in cleanup() at main.c:463)
DEBUG1 [19560/19560]: Cleaning up all sessions (in cleanup() at main.c:477)
DEBUG1 [19560/19560]: Tearing down kernel session (in kernel_destroy_session() at kernel.c:755)
DEBUG1 [19560/19560]: Sending kernel consumer destroy channel key 33 (in kernel_consumer_destroy_channel() at kernel-consumer.c:376)
PERROR [19560/19560]: sendmsg: Bad file descriptor (in lttcomm_send_unix_sock() at unix.c:218)

Actions #1

Updated by David Goulet almost 11 years ago

  • Status changed from New to Confirmed
  • Priority changed from Low to Normal
  • Target version changed from 2.3 to 2.2
Actions #2

Updated by David Goulet almost 11 years ago

  • Priority changed from Normal to Low
  • Target version changed from 2.2 to 2.4

We acknowledge this bug however the fix is a bit ambitious for a RC release.

The issue here is that there is possibly concurrent commands being sent to a dying consumer(s) thus creating a race between state change of that consumer (to an error state) and the commands.

What we need to do is to stop copying the consumer's file descriptor to every consumer output object and add checks to every callsite that tries to contact the consumer to check if that FD is still valid. Furthermore, to avoid the race between the state change and the communication, each callsite that detects an error should change the consumer state.

Actions #3

Updated by David Goulet almost 11 years ago

  • Priority changed from Low to High
  • Target version changed from 2.4 to 2.3
Actions #4

Updated by David Goulet almost 11 years ago

  • Status changed from Confirmed to Resolved
  • % Done changed from 0 to 100

Also available in: Atom PDF