Bug #1109
openFork() test 9 from Linux Test Project fails when traced with userspace events
Description
While tracing the fork09 test from LTP, the test failed unexpectedly. LTTng required some of the FDs the test aims at exhausting. It seems such a test will always fail when traced, for it assumes it will be the sole user of its FD space.
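For illustration, here is a minimal, hypothetical sketch (not the actual LTP source) of the pattern fork09 relies on: probe the first free descriptor, open files until EMFILE, and assume the process owns every remaining slot in its FD table. Anything a preloaded library already holds, or needs to open later, breaks that arithmetic.

/*
 * Hypothetical sketch of the fork09-style assumption; not the LTP source.
 * Build: gcc -o fd_assumption fd_assumption.c
 */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/resource.h>

int main(void)
{
	struct rlimit rl;
	int first, fd, nopen = 0;

	if (getrlimit(RLIMIT_NOFILE, &rl) < 0) {
		perror("getrlimit");
		return EXIT_FAILURE;
	}

	/* Probe the first free descriptor (22 in the report below, because
	 * the preloaded lttng-ust helpers already hold some). */
	first = open("/dev/null", O_RDONLY);
	if (first < 0) {
		perror("open");
		return EXIT_FAILURE;
	}
	printf("first free fd: %d, RLIMIT_NOFILE: %ld\n", first, (long)rl.rlim_cur);

	/* Exhaust the remaining slots, as the test does with real files. */
	while ((fd = open("/dev/null", O_RDONLY)) >= 0)
		nopen++;
	printf("opened %d more fds before failing with errno %d (EMFILE)\n",
	       nopen, errno);

	/* The test's implicit assumption: first + 1 + nopen == rl.rlim_cur.
	 * From this point on, *any* open() in the process -- including
	 * glibc's internal dlopen of libgcc_s.so.1 during pthread_cancel --
	 * fails, as seen in the trace below. */
	return EXIT_SUCCESS;
}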
LTTng-ust master as of dd77bd5b20e95fefa8d857fdd3b7c9bdbbf24bc7.
Steps to reproduce:
git clone https://github.com/linux-test-project/ltp.git
cd ltp
make autotools
./configure
make
Then, when running as root, the test will abort if the problem is triggered:
lttng create
lttng enable-channel -u chan_ust
lttng enable-channel -k chan_kernel
lttng add-context -u -t vpid -t ip -t procname -t vtid -c chan_ust
lttng enable-event -u -a -c chan_ust
lttng enable-event -k -c chan_kernel --syscall --all
lttng start
LD_PRELOAD=liblttng-ust-dl.so:liblttng-ust-fork.so:liblttng-ust-fd.so ./fork09
Example output:
ltp/testcases/kernel/syscalls/fork# LD_PRELOAD=liblttng-ust-dl.so:liblttng-ust-fork.so:liblttng-ust-fd.so ./fork09
fork09 0 TINFO : OPEN_MAX is 1024
fork09 0 TINFO : first file descriptor is 22
fork09 0 TINFO : Parent reporting 1023 files open
fork09 0 TINFO : Child opened new file #1023
libgcc_s.so.1 must be installed for pthread_cancel to work
fork09 1 TBROK : tst_sig.c:233: unexpected signal SIGIOT/SIGABRT(6) received (pid = 2568).
fork09 2 TBROK : tst_sig.c:233: Remaining cases broken
fork09 0 TWARN : tst_tmpdir.c:337: tst_rmdir: rmobj(/tmp/forz9gzhZ) failed: unlink(/tmp/forz9gzhZ) failed; errno=21: EISDIR
fork09 1 TFAIL : fork09.c:147: test 1 FAILED
When checking the trace, one can see that some calls to open() failed with errno == 24, "Too many open files":
...
[13:46:47.130018542] (+0.000013569) rns syscall_entry_open: { cpu_id = 1 }, { filename = "/etc/ld.so.cache", flags = 524288, mode = 1 }
[13:46:47.130020273] (+0.000001731) rns syscall_exit_open: { cpu_id = 1 }, { ret = -24 }
[13:46:47.130021801] (+0.000001528) rns syscall_entry_open: { cpu_id = 1 }, { filename = "/lib64/tls/x86_64/libgcc_s.so.1", flags = 524288, mode = 55440 }
[13:46:47.130022801] (+0.000001000) rns syscall_exit_open: { cpu_id = 1 }, { ret = -24 }
[13:46:47.130023703] (+0.000000902) rns syscall_entry_newstat: { cpu_id = 1 }, { filename = "/lib64/tls/x86_64" }
[13:46:47.130026945] (+0.000003242) rns syscall_exit_newstat: { cpu_id = 1 }, { ret = -2, statbuf = 140723106378272 }
[13:46:47.130027652] (+0.000000707) rns syscall_entry_open: { cpu_id = 1 }, { filename = "/lib64/tls/libgcc_s.so.1", flags = 524288, mode = 55440 }
[13:46:47.130028567] (+0.000000915) rns syscall_exit_open: { cpu_id = 1 }, { ret = -24 }
[13:46:47.130028952] (+0.000000385) rns syscall_entry_newstat: { cpu_id = 1 }, { filename = "/lib64/tls" }
[13:46:47.130030619] (+0.000001667) rns syscall_exit_newstat: { cpu_id = 1 }, { ret = -2, statbuf = 140723106378272 }
[13:46:47.130031155] (+0.000000536) rns syscall_entry_open: { cpu_id = 1 }, { filename = "/lib64/x86_64/libgcc_s.so.1", flags = 524288, mode = 55440 }
[13:46:47.130032002] (+0.000000847) rns syscall_exit_open: { cpu_id = 1 }, { ret = -24 }
[13:46:47.130032395] (+0.000000393) rns syscall_entry_newstat: { cpu_id = 1 }, { filename = "/lib64/x86_64" }
[13:46:47.130034017] (+0.000001622) rns syscall_exit_newstat: { cpu_id = 1 }, { ret = -2, statbuf = 140723106378272 }
[13:46:47.130034583] (+0.000000566) rns syscall_entry_open: { cpu_id = 1 }, { filename = "/lib64/libgcc_s.so.1", flags = 524288, mode = 55440 }
[13:46:47.130035432] (+0.000000849) rns syscall_exit_open: { cpu_id = 1 }, { ret = -24 }
[13:46:47.130035793] (+0.000000361) rns syscall_entry_newstat: { cpu_id = 1 }, { filename = "/lib64" }
[13:46:47.130039354] (+0.000003561) rns syscall_exit_newstat: { cpu_id = 1 }, { ret = 0, statbuf = 140723106378272 }
[13:46:47.130045019] (+0.000005665) rns syscall_entry_open: { cpu_id = 1 }, { filename = "/dev/tty", flags = 2306, mode = 0 }
[13:46:47.130046134] (+0.000001115) rns syscall_exit_open: { cpu_id = 1 }, { ret = -24 }
[13:46:47.130046973] (+0.000000839) rns syscall_entry_writev: { cpu_id = 1 }, { fd = 2, vec = 140723106380368, vlen = 1 }
[13:46:47.130060399] (+0.000013426) rns syscall_exit_writev: { cpu_id = 1 }, { ret = 59, vec = 140723106380368 }
[13:46:47.130061293] (+0.000000894) rns syscall_entry_mmap: { cpu_id = 1 }, { addr = 0x0, len = 4096, prot = 3, flags = 34, fd = -1, offset = 0 }
[13:46:47.130067022] (+0.000005729) rns syscall_exit_mmap: { cpu_id = 1 }, { ret = 0x7FB33758B000 }
[13:46:47.130075155] (+0.000008133) rns syscall_entry_rt_sigprocmask: { cpu_id = 1 }, { how = 1, nset = 140723106380064, sigsetsize = 8 }
[13:46:47.130075728] (+0.000000573) rns syscall_exit_rt_sigprocmask: { cpu_id = 1 }, { ret = 0, oset = 0 }
[13:46:47.130126316] (+0.000050588) rns syscall_entry_rt_sigprocmask: { cpu_id = 1 }, { how = 0, nset = 140723106379920, sigsetsize = 8 }
[13:46:47.130128042] (+0.000001726) rns syscall_exit_rt_sigprocmask: { cpu_id = 1 }, { ret = 0, oset = 140723106379792 }
[13:46:47.130128816] (+0.000000774) rns syscall_entry_getpid: { cpu_id = 1 }, { }
[13:46:47.130129698] (+0.000000882) rns syscall_exit_getpid: { cpu_id = 1 }, { ret = 2568 }
[13:46:47.130130243] (+0.000000545) rns syscall_entry_gettid: { cpu_id = 1 }, { }
[13:46:47.130131252] (+0.000001009) rns syscall_exit_gettid: { cpu_id = 1 }, { ret = 2568 }
[13:46:47.130132261] (+0.000001009) rns syscall_entry_tgkill: { cpu_id = 1 }, { tgid = 2568, pid = 2568, sig = 6 }
Updated by Jonathan Rajotte Julien over 7 years ago
- Status changed from New to Feedback
- Priority changed from Normal to Low
- Target version deleted (2.10)
Hi
What would be the expected behavior? What do you propose as a solution?
The test result is actually not consistent: it can report success if the parent's lttng-ust threads happen to close an FD during the open loop, leaving the open-file table not fully exhausted. That leaves room for pthread_cancel to perform its dlopen of libgcc_s.so.
Note that the error message:
libgcc_s.so.1 must be installed for pthread_cancel to work
comes from glibc and is a result of FD exhaustion (as you stated). It is also misleading, since the reason it fails has nothing to do with the presence of libgcc_s.so.1 on the host: glibc simply ignores the returned error (-24, too many open files).
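To make that glibc behaviour concrete, here is a hedged, stand-alone sketch (independent of LTTng) that can trigger the same abort on glibc builds where libgcc_s.so.1 is not already mapped; whether it actually aborts depends on the glibc version, so treat it as an illustration rather than a guaranteed reproducer:

/* Build: gcc -pthread -o cancel_emfile cancel_emfile.c */
#include <errno.h>
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static void *worker(void *arg)
{
	(void)arg;
	for (;;)
		pause();	/* pause() is a cancellation point */
	return NULL;
}

int main(void)
{
	pthread_t tid;

	pthread_create(&tid, NULL, worker, NULL);

	/* Exhaust the FD table, as fork09 does with its open loop. */
	while (open("/dev/null", O_RDONLY) >= 0)
		;
	fprintf(stderr, "open stopped with errno %d (EMFILE is %d)\n",
		errno, EMFILE);

	/* With no free FD, glibc cannot dlopen libgcc_s.so.1 for unwinding
	 * and may print "libgcc_s.so.1 must be installed for pthread_cancel
	 * to work" and abort, even though the library is installed. */
	pthread_cancel(tid);
	pthread_join(tid, NULL);
	return 0;
}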
The test makes assumptions about its control of the FD table, and tracing with LTTng makes those assumptions invalid. There is not much LTTng can do on this side. LTTng could reserve an FD to compensate for pthread_cancel's needs (a rough sketch of that idea follows), but I am not really sure of the pertinence of such a solution, since this all happens in the lttng-ust destructor, nor of its viability.
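For reference, the "reserve an FD" idea would look roughly like the following; this is a hypothetical sketch, not lttng-ust code, and the names are invented:

#include <fcntl.h>
#include <pthread.h>
#include <unistd.h>

static int reserved_fd = -1;

/* Park a descriptor at library load time. */
__attribute__((constructor))
static void reserve_fd(void)
{
	reserved_fd = open("/dev/null", O_RDONLY | O_CLOEXEC);
}

/* Release it right before pthread_cancel() so glibc's internal dlopen of
 * libgcc_s.so.1 has one free slot. Note the obvious race: the traced
 * application may grab the slot between close() and the dlopen. */
static void cancel_listener(pthread_t listener)
{
	if (reserved_fd >= 0) {
		close(reserved_fd);
		reserved_fd = -1;
	}
	pthread_cancel(listener);
}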
Cheers
Updated by Ricardo Nabinger Sanchez over 7 years ago
As I briefly mentioned, I think there is hardly a solution. Reserving FDs seems impractical (intercepting events while the application is probing its FD limits could result in more problems, i.e., such events being lost).
Perhaps the solution is to clearly document how many FDs LTTng needs for its internal communication and, if possible, warn the user when enough verbosity/debugging is requested (so as to minimize frustration when FD exhaustion is the actual problem). I understand this is an uncommon scenario for most applications, though not that uncommon for server applications (for example, Apache HTTPD, Varnish, HAProxy).
Updated by Jonathan Rajotte Julien over 7 years ago
Ricardo Nabinger Sanchez wrote:
As I briefly mentioned, I think there is hardly a solution. Reserving FDs seems impractical (intercepting events when the application is finding out about FD limits could result in more problems, i.e., such events are lost).
Perhaps the solution is to clearly document how many FDs LTTng needs for its internal communication
The FDs for communication are opened at application start (before main), hence lttng-ust still works (see the value of "first file descriptor" when we preload liblttng-ust-fork etc.). The problem here is pthread_cancel, which requires at least one free FD on child exit.
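For illustration, here is a minimal, hypothetical example (not lttng-ust code) of how a preloaded library acquires its descriptors before main() runs, which is why the test's "first file descriptor" is already shifted:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

static int comm_fd = -1;

/* Runs at shared-object load time, before the traced program's main(). */
__attribute__((constructor))
static void lib_init(void)
{
	/* In lttng-ust this would be a socket/pipe to the session daemon;
	 * /dev/null merely stands in for the idea here. */
	comm_fd = open("/dev/null", O_RDWR | O_CLOEXEC);
	fprintf(stderr, "preloaded lib grabbed fd %d before main()\n", comm_fd);
}

__attribute__((destructor))
static void lib_fini(void)
{
	if (comm_fd >= 0)
		close(comm_fd);
}

Built with gcc -shared -fPIC and injected with LD_PRELOAD, the same way the liblttng-ust-* helpers are in the reproduction steps above.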
and, if possible, warn the user when enough verbosity/debugging is requested (as to minimize frustration in the case FD-exhaustion is the actual problem).
Glibc is the one aborting; I am not sure how we can provide more information. To be honest, I think glibc is the one dropping the ball by returning a phony error message.
This is the portion inside lttng-ust dealing with pthread_cancel.
if (global_apps.thread_active) {
	ret = pthread_cancel(global_apps.ust_listener);
	if (ret) {
		ERR("Error cancelling global ust listener thread: %s", strerror(ret));
	} else {
		global_apps.thread_active = 0;
	}
}
Not much we can do if pthread_cancel simply aborts.
I understand this is an uncommon scenario for most applications, though not that uncommon for server applications (for example, Apache HTTPD, Varnish, HAProxy).
Any thoughts?
Updated by Mathieu Desnoyers over 7 years ago
One more piece of information: during its execution, lttng-ust opens 2 file descriptors per listener thread (and there are 2 of those threads running). The tracepoint infrastructure also uses an FD for dlopen(). If tracing is enabled, we also open a "wakeup pipe" FD for each channel, and one FD for the channel memory-mapping shm (though we close it shortly after mmapping it).
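As a back-of-the-envelope illustration of those counts (the exact numbers depend on the lttng-ust version and the session configuration, so this is not a contract):

#include <stdio.h>

int main(void)
{
	int listener_threads = 2;	/* two listener threads are running */
	int fds_per_listener = 2;	/* 2 FDs each */
	int dlopen_fd = 1;		/* tracepoint infrastructure dlopen() */
	int channels = 1;		/* e.g. chan_ust in the reproduction above */
	int wakeup_pipe_per_chan = 1;	/* "wakeup pipe" FD per channel */
	int shm_per_chan = 1;		/* channel shm FD, closed after mmap() */

	int peak = listener_threads * fds_per_listener + dlopen_fd
		+ channels * (wakeup_pipe_per_chan + shm_per_chan);

	printf("estimated peak extra FDs with %d channel(s): %d\n",
	       channels, peak);
	return 0;
}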
It appears that, given the nature of the test case, lttng-ust had already reserved those FDs before the test tried to exhaust those left, which is why it does not trigger lttng-ust's FD-exhaustion error messages.
What triggers the issue here is indeed pthread_cancel in glibc trying to internally open a .so. It appears to be the visible problem only because it executes when the application quits. However, in different FD exhaustion scenarios, lttng-ust could fail in many more code paths (anywhere it tries to open a file descriptor). Also, the number of FDs is not fixed (it depends on the number of channels).
Perhaps describing lttng-ust's FD requirements in the lttng-ust.3 man page could be an approach to deal with this.
Updated by Jonathan Rajotte Julien over 7 years ago
Thanks for correcting me regarding FD allocation.
I will sync with eepp (for style) regarding the man page update.
Cheers