Bug #1109


Fork() test 9 from Linux Test Project fails when traced with userspace events

Added by Ricardo Nabinger Sanchez almost 7 years ago. Updated almost 7 years ago.

Status:
Feedback
Priority:
Low
Assignee:
-
Target version:
-
Start date:
05/22/2017
Due date:
% Done:

0%

Estimated time:

Description

While tracing the fork09 test from LTP, the test failed unexpectedly. LTTng required some of the FDs the test aims to exhaust. It seems such a test will always fail when traced, because it assumes it is the sole user of its FD space.
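For context, here is a minimal sketch (my own illustration, not the actual LTP source) of the FD-exhaustion pattern fork09 relies on; any descriptor already held by a tracer shifts the counts such a loop expects:

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
        int nfiles = 0;

        /* Keep opening until EMFILE; the test assumes it owns every slot. */
        for (;;) {
                int fd = open("/dev/null", O_RDONLY);

                if (fd < 0) {
                        if (errno == EMFILE)
                                break;
                        perror("open");
                        return EXIT_FAILURE;
                }
                nfiles++;
        }
        printf("opened %d files before hitting the limit\n", nfiles);
        return EXIT_SUCCESS;
}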

LTTng-ust master as of dd77bd5b20e95fefa8d857fdd3b7c9bdbbf24bc7.

Steps to reproduce:

git clone https://github.com/linux-test-project/ltp.git
cd ltp
make autotools
./configure
make

Then, when running as root, the test will abort if the problem is triggered:

lttng create
lttng enable-channel -u chan_ust
lttng enable-channel -k chan_kernel
lttng add-context -u -t vpid -t ip -t procname -t vtid -c chan_ust
lttng enable-event -u -a -c chan_ust
lttng enable-event -k -c chan_kernel --syscall --all
lttng start
LD_PRELOAD=liblttng-ust-dl.so:liblttng-ust-fork.so:liblttng-ust-fd.so ./fork09

Example output:

  
ltp/testcases/kernel/syscalls/fork# LD_PRELOAD=liblttng-ust-dl.so:liblttng-ust-fork.so:liblttng-ust-fd.so ./fork09
fork09      0  TINFO  :  OPEN_MAX is 1024
fork09      0  TINFO  :  first file descriptor is 22 
fork09      0  TINFO  :  Parent reporting 1023 files open
fork09      0  TINFO  :  Child opened new file #1023
libgcc_s.so.1 must be installed for pthread_cancel to work
fork09      1  TBROK  :  tst_sig.c:233: unexpected signal SIGIOT/SIGABRT(6) received (pid = 2568).
fork09      2  TBROK  :  tst_sig.c:233: Remaining cases broken
fork09      0  TWARN  :  tst_tmpdir.c:337: tst_rmdir: rmobj(/tmp/forz9gzhZ) failed: unlink(/tmp/forz9gzhZ) failed; errno=21: EISDIR
fork09      1  TFAIL  :  fork09.c:147: test 1 FAILED

When checking the trace, one can see that some calls to open() failed with errno 24 (EMFILE, "Too many open files"):

...
[13:46:47.130018542] (+0.000013569) rns syscall_entry_open: { cpu_id = 1 }, { filename = "/etc/ld.so.cache", flags = 524288, mode = 1 }
[13:46:47.130020273] (+0.000001731) rns syscall_exit_open: { cpu_id = 1 }, { ret = -24 }
[13:46:47.130021801] (+0.000001528) rns syscall_entry_open: { cpu_id = 1 }, { filename = "/lib64/tls/x86_64/libgcc_s.so.1", flags = 524288, mode = 55440 }
[13:46:47.130022801] (+0.000001000) rns syscall_exit_open: { cpu_id = 1 }, { ret = -24 }
[13:46:47.130023703] (+0.000000902) rns syscall_entry_newstat: { cpu_id = 1 }, { filename = "/lib64/tls/x86_64" }
[13:46:47.130026945] (+0.000003242) rns syscall_exit_newstat: { cpu_id = 1 }, { ret = -2, statbuf = 140723106378272 }
[13:46:47.130027652] (+0.000000707) rns syscall_entry_open: { cpu_id = 1 }, { filename = "/lib64/tls/libgcc_s.so.1", flags = 524288, mode = 55440 }
[13:46:47.130028567] (+0.000000915) rns syscall_exit_open: { cpu_id = 1 }, { ret = -24 }
[13:46:47.130028952] (+0.000000385) rns syscall_entry_newstat: { cpu_id = 1 }, { filename = "/lib64/tls" }
[13:46:47.130030619] (+0.000001667) rns syscall_exit_newstat: { cpu_id = 1 }, { ret = -2, statbuf = 140723106378272 }
[13:46:47.130031155] (+0.000000536) rns syscall_entry_open: { cpu_id = 1 }, { filename = "/lib64/x86_64/libgcc_s.so.1", flags = 524288, mode = 55440 }
[13:46:47.130032002] (+0.000000847) rns syscall_exit_open: { cpu_id = 1 }, { ret = -24 }
[13:46:47.130032395] (+0.000000393) rns syscall_entry_newstat: { cpu_id = 1 }, { filename = "/lib64/x86_64" }
[13:46:47.130034017] (+0.000001622) rns syscall_exit_newstat: { cpu_id = 1 }, { ret = -2, statbuf = 140723106378272 }
[13:46:47.130034583] (+0.000000566) rns syscall_entry_open: { cpu_id = 1 }, { filename = "/lib64/libgcc_s.so.1", flags = 524288, mode = 55440 }
[13:46:47.130035432] (+0.000000849) rns syscall_exit_open: { cpu_id = 1 }, { ret = -24 }
[13:46:47.130035793] (+0.000000361) rns syscall_entry_newstat: { cpu_id = 1 }, { filename = "/lib64" }
[13:46:47.130039354] (+0.000003561) rns syscall_exit_newstat: { cpu_id = 1 }, { ret = 0, statbuf = 140723106378272 }
[13:46:47.130045019] (+0.000005665) rns syscall_entry_open: { cpu_id = 1 }, { filename = "/dev/tty", flags = 2306, mode = 0 }
[13:46:47.130046134] (+0.000001115) rns syscall_exit_open: { cpu_id = 1 }, { ret = -24 }
[13:46:47.130046973] (+0.000000839) rns syscall_entry_writev: { cpu_id = 1 }, { fd = 2, vec = 140723106380368, vlen = 1 }
[13:46:47.130060399] (+0.000013426) rns syscall_exit_writev: { cpu_id = 1 }, { ret = 59, vec = 140723106380368 }
[13:46:47.130061293] (+0.000000894) rns syscall_entry_mmap: { cpu_id = 1 }, { addr = 0x0, len = 4096, prot = 3, flags = 34, fd = -1, offset = 0 }
[13:46:47.130067022] (+0.000005729) rns syscall_exit_mmap: { cpu_id = 1 }, { ret = 0x7FB33758B000 }
[13:46:47.130075155] (+0.000008133) rns syscall_entry_rt_sigprocmask: { cpu_id = 1 }, { how = 1, nset = 140723106380064, sigsetsize = 8 }
[13:46:47.130075728] (+0.000000573) rns syscall_exit_rt_sigprocmask: { cpu_id = 1 }, { ret = 0, oset = 0 }
[13:46:47.130126316] (+0.000050588) rns syscall_entry_rt_sigprocmask: { cpu_id = 1 }, { how = 0, nset = 140723106379920, sigsetsize = 8 }
[13:46:47.130128042] (+0.000001726) rns syscall_exit_rt_sigprocmask: { cpu_id = 1 }, { ret = 0, oset = 140723106379792 }
[13:46:47.130128816] (+0.000000774) rns syscall_entry_getpid: { cpu_id = 1 }, { }
[13:46:47.130129698] (+0.000000882) rns syscall_exit_getpid: { cpu_id = 1 }, { ret = 2568 }
[13:46:47.130130243] (+0.000000545) rns syscall_entry_gettid: { cpu_id = 1 }, { }
[13:46:47.130131252] (+0.000001009) rns syscall_exit_gettid: { cpu_id = 1 }, { ret = 2568 }
[13:46:47.130132261] (+0.000001009) rns syscall_entry_tgkill: { cpu_id = 1 }, { tgid = 2568, pid = 2568, sig = 6 }

Actions #1

Updated by Jonathan Rajotte Julien almost 7 years ago

  • Status changed from New to Feedback
  • Priority changed from Normal to Low
  • Target version deleted (2.10)

Hi

What would be the expected behavior? What do you propose as a solution?

The test result is actually not consistent: it can report success when the parent's lttng-ust threads close an FD during the open loop, leaving the open-file table not fully exhausted. That leaves room for pthread_cancel to perform its dlopen() of libgcc_s.so.

Note that the error message:

libgcc_s.so.1 must be installed for pthread_cancel to work

comes from glibc and is a result of FD exhaustion (as you stated). It is also misleading, since the failure has nothing to do with the presence of libgcc_s.so.1 on the host: glibc simply ignores the returned error (-24, too many open files).
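For what it's worth, the glibc behaviour can be reproduced without LTTng at all. The standalone sketch below is my own illustration (and its outcome depends on the glibc version, since newer releases changed how libgcc_s.so.1 is loaded): it exhausts the FD table, then cancels a thread, which should hit the same lazy dlopen() failure.

#include <errno.h>
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static void *sleeper(void *arg)
{
        (void) arg;
        for (;;)
                pause();        /* cancellation point */
        return NULL;
}

int main(void)
{
        pthread_t t;

        pthread_create(&t, NULL, sleeper, NULL);

        /* Exhaust the FD table so glibc's internal dlopen() of
         * libgcc_s.so.1 cannot open the library. */
        while (open("/dev/null", O_RDONLY) >= 0)
                ;
        if (errno != EMFILE)
                perror("open");

        /* On affected glibc versions this aborts with:
         * "libgcc_s.so.1 must be installed for pthread_cancel to work" */
        pthread_cancel(t);
        pthread_join(t, NULL);
        return 0;
}

(Compile with -pthread.)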

The test makes assumptions about how FDs are managed, and using LTTng invalidates those assumptions. There is not much LTTng can do on this side. LTTng could reserve an FD to compensate for the pthread_cancel use, but I'm not really sure such a solution would be pertinent, or even viable, since this all happens in the lttng-ust destructor.
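For illustration only, here is a minimal sketch of that FD-reservation idea (hypothetical names, not actual lttng-ust code): hold one spare descriptor for the process lifetime and release it right before calling pthread_cancel, so glibc's lazy dlopen() of libgcc_s.so.1 has a free slot.

#include <fcntl.h>
#include <pthread.h>
#include <unistd.h>

/* Hypothetical sketch, not lttng-ust code. */
static int reserved_fd = -1;

static void reserve_emergency_fd(void)
{
        /* Called early: keep one descriptor in reserve. */
        reserved_fd = open("/dev/null", O_RDONLY | O_CLOEXEC);
}

static void cancel_listener(pthread_t listener)
{
        /* Release the spare FD just before pthread_cancel() so the lazy
         * dlopen() of libgcc_s.so.1 has a descriptor available. */
        if (reserved_fd >= 0) {
                close(reserved_fd);
                reserved_fd = -1;
        }
        pthread_cancel(listener);
}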

Cheers

Actions #2

Updated by Ricardo Nabinger Sanchez almost 7 years ago

As I briefly mentioned, I think there is hardly a solution. Reserving FDs seems impractical (intercepting events while the application is probing its FD limits could result in more problems, e.g., such events being lost).

Perhaps the solution is to clearly document how many FDs LTTng needs for its internal communication and, if possible, warn the user when enough verbosity/debugging is requested (to minimize frustration when FD exhaustion is the actual problem). I understand this is an uncommon scenario for most applications, though not that uncommon for server applications (for example, Apache HTTPD, Varnish, HAProxy).
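As a side note, an application that manages its own descriptor budget could also leave headroom for a tracer by raising its soft FD limit; a rough sketch (the cushion of 16 is an arbitrary number chosen for illustration, not a documented LTTng requirement):

#include <sys/resource.h>

#define TRACER_FD_HEADROOM 16   /* arbitrary cushion for illustration */

static int raise_fd_limit_for_tracing(void)
{
        struct rlimit rl;

        if (getrlimit(RLIMIT_NOFILE, &rl) < 0)
                return -1;
        if (rl.rlim_cur + TRACER_FD_HEADROOM <= rl.rlim_max)
                rl.rlim_cur += TRACER_FD_HEADROOM;
        else
                rl.rlim_cur = rl.rlim_max;
        return setrlimit(RLIMIT_NOFILE, &rl);
}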

Actions #3

Updated by Jonathan Rajotte Julien almost 7 years ago

Ricardo Nabinger Sanchez wrote:

As I briefly mentioned, I think there is hardly a solution. Reserving FDs seems impractical (intercepting events while the application is probing its FD limits could result in more problems, e.g., such events being lost).

Perhaps the solution is to clearly document how many FDs LTTng needs for its internal communication

FDs for communication are opened at application start (before main), hence lttng-ust still works (see the value of "first file descriptor" when we preload liblttng-ust-fork etc.). The problem here is pthread_cancel, which requires at least one FD at child exit.
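For context, those descriptors are opened from the preloaded library's constructor, which runs before the application's main(); a minimal sketch of the mechanism (not lttng-ust code):

#include <fcntl.h>
#include <unistd.h>

/* Sketch of the mechanism only: a constructor in a preloaded .so runs
 * before main(), so any FDs it opens are already in place when the test
 * starts counting (hence "first file descriptor is 22" in the output). */
static int comm_fd = -1;

__attribute__((constructor))
static void init_before_main(void)
{
        comm_fd = open("/dev/null", O_RDONLY | O_CLOEXEC);
}

__attribute__((destructor))
static void fini_after_main(void)
{
        if (comm_fd >= 0)
                close(comm_fd);
}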

and, if possible, warn the user when enough verbosity/debugging is requested (to minimize frustration when FD exhaustion is the actual problem).

Glibc is the one aborting; I'm not sure how we can provide more information. To be honest, I think glibc is the one dropping the ball by returning a misleading error message.

This is the portion of lttng-ust dealing with pthread_cancel:

if (global_apps.thread_active) {
        ret = pthread_cancel(global_apps.ust_listener);
        if (ret) {
                ERR("Error cancelling global ust listener thread: %s",
                        strerror(ret));
        } else {
                global_apps.thread_active = 0;
        }
}

Not much we can do if pthread_cancel simply aborts.

I understand this is an uncommon scenario for most applications, though not that uncommon for server applications (for example, Apache HTTPD, Varnish, HAProxy).

Any thoughts?

Actions #4

Updated by Mathieu Desnoyers almost 7 years ago

One more piece of information: lttng-ust, during its execution, opens 2 file descriptors per listener thread (and there are 2 of those threads running). The tracepoint infrastructure also uses an FD for dlopen(). If tracing is enabled, we open a "wakeup pipe" FD for each channel as well, and one FD for the channel memory-mapping shm (we close it quickly after mmaping it, though).
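To check this empirically, one can count a process's open descriptors by listing /proc/self/fd (Linux-specific); a small sketch of my own:

#include <dirent.h>
#include <stdio.h>

/* Count the descriptors currently open in this process by walking
 * /proc/self/fd; subtract ".", ".." and the DIR stream's own fd. */
static int count_open_fds(void)
{
        DIR *dir = opendir("/proc/self/fd");
        int count = 0;

        if (!dir)
                return -1;
        while (readdir(dir))
                count++;
        closedir(dir);
        return count - 3;
}

int main(void)
{
        printf("open fds: %d\n", count_open_fds());
        return 0;
}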

It appears that, given the nature of the test case, lttng-ust had already reserved those FDs before the test tried to exhaust the remaining ones, which is why it does not trigger lttng-ust's own FD-exhaustion error messages.

What triggers the issue here is indeed pthread_cancel in glibc trying to internally open a .so. It is only the visible problem because it executes when the application quits. However, in different FD-exhaustion scenarios, lttng-ust could fail in many more code paths (wherever it tries to open a file descriptor). Also, the number of FDs is not fixed (it depends on the number of channels).

Perhaps describing lttng-ust's FD requirements in the lttng-ust.3 man page could be an approach to deal with this.

Actions #5

Updated by Jonathan Rajotte Julien almost 7 years ago

Thanks for correcting me regarding FD allocation.

I will sync with eepp (for style) regarding the man page update.

Cheers
