Project

General

Profile

Bug #1110

Fork() test 12 from Linux Test Project fails when traced with userspace events; events are miscounted

Added by Ricardo Nabinger Sanchez about 3 years ago. Updated about 3 years ago.

Status:
Invalid
Priority:
Normal
Assignee:
-
Target version:
-
Start date:
05/22/2017
Due date:
% Done:

0%

Estimated time:

Description

While tracing fork12 test from LTP, the test failed unexpectedly.

LTTng-ust master as of d9ab33856fd7bef5b598c9b6a47309b3bf9b06f2.

Steps to reproduce:

git clone https://github.com/linux-test-project/ltp.git
cd ltp
make autotools
./configure
make

Then, when running as root, the test will abort if the problem is triggered:

lttng create
lttng enable-channel -u chan_ust
lttng enable-channel -k chan_kernel
lttng add-context -u -t vpid -t ip -t procname -t vtid -c chan_ust
lttng enable-event -u -a -c chan_ust
lttng enable-event -k -c chan_kernel --syscall --all
lttng start
LD_PRELOAD=liblttng-ust-dl.so:liblttng-ust-fork.so:liblttng-ust-fd.so ./fork12

After forcibly stopping the test (^C), errors are reported and events could be miscounted when issuing lttng stop:

[warning] 9223372037047268318 events discarded, please refer to the documentation on channel configuration.

When running with blocking, the test still fails but the counting seems possible (events could have been indeed lost during PID starvation, which is what this test does):

ltp/testcases/kernel/syscalls/fork# LTTNG_UST_BLOCKING_RETRY_TIMEOUT=-1 LD_PRELOAD=liblttng-ust-fd.so:liblttng-ust-fork.so:liblttng-ust-dl.so:liblttng-ust-libc-wrapper.so ./fork12
fork12      0  TINFO  :  Forking as many kids as possible
fork12      0  TINFO  :  Number of processes forked is 10717
fork12      1  TPASS  :  fork() eventually failed with 11: Resource temporarily unavailable
  (system pauses for a while after creating forking-havoc; test seems to get stuck, so a ^C is sent)
^Cfork12      1  TBROK  :  fork12.c:132: Unexpected signal 2 received.
fork12      2  TBROK  :  fork12.c:132: Unexpected signal 2 received.
fork12      1  TBROK  :  fork12.c:132: Unexpected signal 2 received.
fork12      2  TBROK  :  fork12.c:132: Remaining cases broken
fork12      3  TBROK  :  fork12.c:132: Remaining cases broken
fork12      1  TBROK  :  fork12.c:132: Unexpected signal 2 received.
fork12      2  TBROK  :  fork12.c:132: Remaining cases broken
fork12      1  TBROK  :  fork12.c:132: Unexpected signal 2 received.
fork12      1  TBROK  :  fork12.c:132: Unexpected signal 2 received.
fork12      2  TBROK  :  fork12.c:132: Remaining cases broken
fork12      1  TBROK  :  fork12.c:132: Unexpected signal 2 received.
fork12      2  TBROK  :  fork12.c:132: Remaining cases broken
fork12      2  TBROK  :  fork12.c:132: Remaining cases broken
(many more...)

When stopping collection:

[warning] 184368 events discarded, please refer to the documentation on channel configuration.

#1

Updated by Jonathan Rajotte Julien about 3 years ago

  • Status changed from New to Feedback

Hi,

The hanging seems to be related to the actual testing code:

/*
 * Introducing a sleep(3) to make sure all children are
 * at pause() when SIGQUIT is sent to them
 */
sleep(3);
kill(0, SIGQUIT);
while (wait(&waitstatus) > 0) ;

Changing sleep(3) to sleep(10) fixes the hang on my machine. The sleep was introduce to prevent a race condition [1]. Using LTTng incurs overhead (lttng ust thread creation, registering, etc.) hence can easily delay execution of child processes and the parent might send SIGQUIT before certain children get to the pause() statement resulting in the observed hang.

As for the value of the discarded event, I was unable to reproduce. Is lttng-modules in sync with HEAD of master ?

[1] http://marc.info/?l=ltp-list&m=117751559908566&w=2

Cheers

#2

Updated by Ricardo Nabinger Sanchez about 3 years ago

Jonathan Rajotte Julien wrote:

Changing sleep(3) to sleep(10) fixes the hang on my machine. The sleep was introduce to prevent a race condition [1]. Using LTTng incurs overhead (lttng ust thread creation, registering, etc.) hence can easily delay execution of child processes and the parent might send SIGQUIT before certain children get to the pause() statement resulting in the observed hang.

I agree with your assessment. The test creates quite a load on the machine. I wasn't aware of this sleep() subtlety.

As for the value of the discarded event, I was unable to reproduce. Is lttng-modules in sync with HEAD of master ?

Yes, all synced during the same time, master branch:
  • babeltrace: v1.2.0-654-g76206d8
  • lttng-modules: v2.10.0-rc1-4-g0147ae6
  • lttng-tools: v2.7.0-rc1-1040-gd9ab3385 (lttng-tools seems to be missing a 2.10 tag)
  • lttng-ust: v2.10.0-rc1-6-gdd77bd5b
  • userspace-rcu: v0.9.0-62-g2748bd6
#3

Updated by Jonathan Rajotte Julien about 3 years ago

  • Status changed from Feedback to Invalid

I was no able to reproduce the miscounted discarded events. Still this is most probably a valid bug see #1114.

Closing this bug as invalid since the problem (race) is not on lttng side.

Thansk for reporting this.

Also available in: Atom PDF