Bug #475

closed

ConsumerD coredump with signal 11, Segfault (in pthread_mutex_lock () from /lib64/libpthread.so.0)

Added by Tan le tran about 11 years ago. Updated almost 11 years ago.

Status: Resolved
Priority: High
Assignee:
Target version:
Start date: 03/20/2013
Due date:
% Done: 100%
Estimated time:

Description

Commit used:
============
  lttng-tools   ed22248 (HEAD, temp_bug429_bug433_patches) Apply patches for bug429 and 433
                b325dc7 (origin/stable-2.1) Fix: put session list lock around the app registration
  lttng-ust     164931d (HEAD, origin/stable-2.1) Fix: refcount issue in lttng-ust-abi.c
  userspace-rcu da9bed2 (HEAD, tag: v0.7.6) Version 0.7.6

Problem description:
====================
  ConsumerD coredump

Scenario:
=========
  During our stress test (4 different users repeatedly create, start,
  stop, destroy, convert sessions), we have occasionally seen a coredump
  from consumerD with the following gdb info:

  gdb ...
  Core was generated by `lttng-consumerd --quiet -u --consumerd-cmd-sock /var/run/lttng/ustconsumerd64/c'.
  Program terminated with signal 11, Segmentation fault.
  #0  0x00007f960a2533f4 in pthread_mutex_lock () from /lib64/libpthread.so.0

(gdb) bt
#0  0x00007f960a2533f4 in pthread_mutex_lock () from /lib64/libpthread.so.0
#1  0x0000000000404d09 in consumer_add_stream (stream=0xa000000, ht=0x627960) at consumer.c:585
#2  0x000000000040903a in consumer_thread_data_poll (data=0x626fd0) at consumer.c:2378
#3  0x00007f960a2517b6 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f9609fadc6d in clone () from /lib64/libc.so.6
#5  0x0000000000000000 in ?? ()

Is the problem reproducible ?
=============================
 Yes, but not at will.
 So far, it has happened 3 times after running the above stress test on 6 different nights.
 (each stress test run lasts at least 12 hours)


Files

gdb_output.log (11.1 KB) - gdb output, Tan le tran, 03/20/2013 06:10 AM
Actions #1

Updated by David Goulet about 11 years ago

  • Status changed from New to Confirmed
  • Assignee set to David Goulet
  • Priority changed from Normal to High
  • Target version set to 2.2
Actions #2

Updated by Mathieu Desnoyers about 11 years ago

  • Target version changed from 2.2 to 2.1 stable
Actions #3

Updated by David Goulet almost 11 years ago

This is a bit weird because, from what I see, "new_stream" is set to 0xa000000, which does NOT look like a valid heap address. But somehow it points to valid memory, because it was dereferenced before calling pthread_mutex_lock().

This value was passed by the sessiond thread because "pipe_readlen" equals 4, the size of the pointer. Also, the heap address space seems to be at 0x620000 and above.

To be honest, I checked the code and can't figure out how this pointer could be invalid. Let's keep this one open and if someone else can reproduce it, we'll be able to connect the dots.

Actions #4

Updated by David Goulet almost 11 years ago

  • Target version changed from 2.1 stable to 2.2

I'll be pushing a fix in master (currently after 2.2-rc2) that should probably fix this issue. We think that it's caused by a race involving a partial read/write on the non-blocking data pipe between the threads, which produces this corrupted value.
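
To illustrate how a partial transfer could leave a bogus pointer behind, here is a minimal sketch. It is not code from lttng-tools: `read_pointer_once` and `demo_short_read` are hypothetical names, and the short write is simulated on purpose. The point is only that a single read() may return fewer bytes than a pointer occupies, so a reader that does not check the count ends up with a half-filled value.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: reads a pointer value with a single read() call and
 * returns how many bytes actually arrived. It deliberately does not retry.
 */
ssize_t read_pointer_once(int fd, void **out)
{
    *out = NULL; /* bytes that are not overwritten stay zero */
    return read(fd, out, sizeof(*out));
}

/*
 * Returns 1 if the reader received only part of a pointer, 0 if it
 * received something else, and -1 if the pipe could not be set up.
 */
int demo_short_read(void)
{
    int fd[2];
    void *sent = (void *)&fd;
    void *received;
    size_t half = sizeof(sent) / 2;
    ssize_t got;

    if (pipe(fd) < 0)
        return -1;

    /* Simulate a writer that only got half of the pointer out. */
    if (write(fd[1], &sent, half) != (ssize_t)half) {
        close(fd[0]);
        close(fd[1]);
        return -1;
    }

    got = read_pointer_once(fd[0], &received);
    close(fd[0]);
    close(fd[1]);

    /* 'received' now holds only half of 'sent' and must not be used. */
    return got == (ssize_t)half && (size_t)got < sizeof(received);
}
```

A caller that treats any positive return value as "a full pointer arrived" would dereference `received` here, which is the kind of corruption suspected above.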

The fix adds a layer over pipes (called lttng_pipe) that handles partial read/write and EINTR as well as protects read and write operations with two independent mutexes.
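
As a rough sketch of what such a layer could look like (this is not the actual lttng_pipe code; `struct pipe_wrapper` and the `pipe_wrapper_*` functions are hypothetical names), the wrapper below loops until the whole buffer has been transferred, retries on EINTR, and serializes readers and writers with two independent mutexes. It assumes blocking descriptors for simplicity; a non-blocking pipe would also need to handle EAGAIN, for example by polling.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical wrapper type; field names are assumptions. */
struct pipe_wrapper {
    int fd[2];                  /* fd[0]: read end, fd[1]: write end */
    pthread_mutex_t read_lock;  /* serializes readers */
    pthread_mutex_t write_lock; /* serializes writers */
};

int pipe_wrapper_init(struct pipe_wrapper *p)
{
    if (pipe(p->fd) < 0)
        return -1;
    pthread_mutex_init(&p->read_lock, NULL);
    pthread_mutex_init(&p->write_lock, NULL);
    return 0;
}

/* Reads exactly count bytes unless EOF or an error occurs. */
ssize_t pipe_wrapper_read(struct pipe_wrapper *p, void *buf, size_t count)
{
    size_t done = 0;
    ssize_t ret = 0;

    pthread_mutex_lock(&p->read_lock);
    while (done < count) {
        ret = read(p->fd[0], (char *)buf + done, count - done);
        if (ret < 0) {
            if (errno == EINTR)
                continue;       /* interrupted by a signal: retry */
            break;              /* real error */
        }
        if (ret == 0)
            break;              /* EOF: writer closed its end */
        done += (size_t)ret;
    }
    pthread_mutex_unlock(&p->read_lock);

    return ret < 0 ? -1 : (ssize_t)done;
}

/* Writes exactly count bytes unless an error occurs. */
ssize_t pipe_wrapper_write(struct pipe_wrapper *p, const void *buf, size_t count)
{
    size_t done = 0;
    ssize_t ret = 0;

    pthread_mutex_lock(&p->write_lock);
    while (done < count) {
        ret = write(p->fd[1], (const char *)buf + done, count - done);
        if (ret < 0) {
            if (errno == EINTR)
                continue;       /* interrupted by a signal: retry */
            break;              /* real error */
        }
        done += (size_t)ret;
    }
    pthread_mutex_unlock(&p->write_lock);

    return ret < 0 ? -1 : (ssize_t)done;
}
```

With this shape, a caller passing a pointer through the pipe can compare the return value against `sizeof(pointer)` and only use the value when the full size was transferred.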

The patches will close this issue. If it ever comes back, we'll reopen it. Also, the backport of this layer conflicts too much with the stable-2.1 code base, so I'm retargeting this issue to 2.2, meaning that it will only be fixed in 2.2-rc3+.

Actions #5

Updated by David Goulet almost 11 years ago

  • Status changed from Confirmed to Resolved
  • % Done changed from 0 to 100