Project

General

Profile

Bug #1032

Bug #1031: Cannot create buffers of size 4 GiB or greater

LTTng session daemon may hang under certain circumstances

Added by Daniel U. Thibault over 3 years ago. Updated over 3 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
-
Target version:
Start date:
06/16/2016
Due date:
% Done:

0%

Estimated time:

Description

(LTTng 2.8.0 lttng-tools a0f518a on Ubuntu 14.04.4, kernel 4.2.0-27-generic)
I ran into this while working with bug 1031.

After trying to load a session with apparently too greedy buffer memory demands (11 channels of twice 64 MiB each), I stopped the session daemon completely (sudo service lttng-sessiond stop), stop confirmed by the System Monitor) and restarted it. I then tried to load a less demanding session (which normally succeeds on a freshly booted system) and the LTTng command line client hanged, apparently because the session daemon hanged as well.

~$ lttng load demanding-session
Error: Failed to load session demanding-session: Kernel create channel failed
Error: Kernel create channel failed
$ sudo service lttng-sessiond stop
lttng-sessiond stop/waiting
$ sudo service lttng-sessiond start
lttng-sessiond start/running, process 2693
$ lttng load less-demanding-session
^C
$ sudo service lttng-sessiond stop
lttng-sessiond stop/waiting

The verbose session daemon log is attached. The last part (in response to the lttng load less-demanding-session) is:

DEBUG1 - 10:20:53.405425 [2842/2938]: Wait for client response (in thread_manage_clients() at main.c:4489)
DEBUG1 - 10:20:53.405520 [2842/2938]: Receiving data from client ... (in thread_manage_clients() at main.c:4534)
DEBUG1 - 10:20:53.405612 [2842/2938]: Nothing recv() from client... continuing (in thread_manage_clients() at main.c:4538)
DEBUG1 - 10:20:53.405630 [2842/2938]: Clean command context structure (in clean_command_ctx() at main.c:831)
DEBUG1 - 10:20:53.405638 [2842/2938]: Accepting client command ... (in thread_manage_clients() at main.c:4437)
DEBUG1 - 10:20:53.405714 [2842/2938]: Wait for client response (in thread_manage_clients() at main.c:4489)
DEBUG1 - 10:20:53.405732 [2842/2938]: Receiving data from client ... (in thread_manage_clients() at main.c:4534)
DEBUG1 - 10:20:53.405819 [2842/2938]: Processing client command 8 (in process_client_msg() at main.c:3020)
DEBUG1 - 10:20:53.405849 [2842/2938]: Waiting for 1 URIs from client ... (in process_client_msg() at main.c:3795)
DEBUG2 - 10:20:53.405911 [2842/2938]: Trying to find session by name less-demanding-session (in session_find_by_name() at session.c:276)
DEBUG3 - 10:20:53.405961 [2842/2938]: Created hashtable size 4 at 0x7f5048006f00 of type 1 (in lttng_ht_new() at hashtable.c:145)
DEBUG1 - 10:20:53.405973 [2842/2938]: Allocating ltt_sessions_ht_by_id (in ltt_sessions_ht_alloc() at session.c:153)
DEBUG3 - 10:20:53.405978 [2842/2938]: Created hashtable size 4 at 0x7f50480072c0 of type 2 (in lttng_ht_new() at hashtable.c:145)
DEBUG1 - 10:20:53.405998 [2842/2938]: Tracing session less-demanding-session created with ID 0 by UID 1000 GID 1000 (in session_create() at session.c:420)
DEBUG2 - 10:20:53.406003 [2842/2938]: Trying to find session by name less-demanding-session (in session_find_by_name() at session.c:276)
DEBUG3 - 10:20:53.406032 [2842/2938]: Created hashtable size 4 at 0x7f504800c6f0 of type 1 (in lttng_ht_new() at hashtable.c:145)
DEBUG2 - 10:20:53.406039 [2842/2938]: Setting trace directory path from URI to /media/sf_MD1200bas/Ubuntu 14.04.4/ltraces/less-demanding-session (in add_uri_to_consumer() at cmd.c:761)
DEBUG1 - 10:20:53.406049 [2842/2938]: Sending response (size: 20, retcode: Success (10)) (in thread_manage_clients() at main.c:4584)
DEBUG1 - 10:20:53.406065 [2842/2938]: Clean command context structure (in clean_command_ctx() at main.c:831)
DEBUG1 - 10:20:53.406071 [2842/2938]: Accepting client command ... (in thread_manage_clients() at main.c:4437)
DEBUG1 - 10:20:53.406343 [2842/2938]: Wait for client response (in thread_manage_clients() at main.c:4489)
DEBUG1 - 10:20:53.406364 [2842/2938]: Receiving data from client ... (in thread_manage_clients() at main.c:4534)
DEBUG1 - 10:20:53.406425 [2842/2938]: Nothing recv() from client... continuing (in thread_manage_clients() at main.c:4538)
DEBUG1 - 10:20:53.406439 [2842/2938]: Clean command context structure (in clean_command_ctx() at main.c:831)
DEBUG1 - 10:20:53.406444 [2842/2938]: Accepting client command ... (in thread_manage_clients() at main.c:4437)
DEBUG1 - 10:20:53.406507 [2842/2938]: Wait for client response (in thread_manage_clients() at main.c:4489)
DEBUG1 - 10:20:53.406524 [2842/2938]: Receiving data from client ... (in thread_manage_clients() at main.c:4534)
DEBUG1 - 10:20:53.406586 [2842/2938]: Processing client command 5 (in process_client_msg() at main.c:3020)
DEBUG1 - 10:20:53.406597 [2842/2938]: Getting session less-demanding-session by name (in process_client_msg() at main.c:3108)
DEBUG2 - 10:20:53.406601 [2842/2938]: Trying to find session by name less-demanding-session (in session_find_by_name() at session.c:276)
DEBUG1 - 10:20:53.406606 [2842/2938]: Creating kernel session (in create_kernel_session() at main.c:2936)
DEBUG3 - 10:20:53.406635 [2842/2938]: Created hashtable size 4 at 0x7f5048003d30 of type 1 (in lttng_ht_new() at hashtable.c:145)

Somehow the session daemon hangs shortly after lttng_ht_new() at hashtable.c:145

This is weird because it has to come from either persistent (file-based) LTTng data or from messing up the kernel itself.


Files

lttng-hanging-problem-sessiond-log (17.2 KB) lttng-hanging-problem-sessiond-log Full verbose session daemon log Daniel U. Thibault, 06/16/2016 10:42 AM
lttng-hanging-problem-dmesg (45.5 KB) lttng-hanging-problem-dmesg Daniel U. Thibault, 06/16/2016 10:48 AM
lttng-hanging-problem-dmesg-second (42.5 KB) lttng-hanging-problem-dmesg-second Daniel U. Thibault, 06/16/2016 11:07 AM
#1

Updated by Daniel U. Thibault over 3 years ago

If I then ctrl+c the session daemon, it hangs during its shut-down sequence as well!

DEBUG1 - 10:46:00.826510 [2843/2843]: run_as worker received signal SIGINT (in worker_sighandler() at runas.c:566)
DEBUG1 - 10:46:00.826667 [2842/2842]: SIGINT caught (in sighandler() at main.c:5419)
DEBUG1 - 10:46:00.826694 [2842/2842]: Terminating all threads (in stop_threads() at main.c:510)
DEBUG1 - 10:46:00.826771 [2842/2842]: Futex n to 1 wake done (in futex_nto1_wake() at futex.c:123)
DEBUG1 - 10:46:00.826839 [2842/2944]: Thread kernel return from poll on 2 fds (in thread_manage_kernel() at main.c:1152)
DEBUG1 - 10:46:00.826914 [2842/2944]: Kernel thread dying (in thread_manage_kernel() at main.c:1236)
DEBUG3 - 10:46:00.827083 [2842/2942]: [ust-thread] Manage notify return from poll on 2 fds (in ust_thread_manage_notify() at ust-thread.c:76)
DEBUG1 - 10:46:00.827122 [2842/2942]: Application notify communication apps thread cleanup complete (in ust_thread_manage_notify() at ust-thread.c:187)
DEBUG1 - 10:46:00.827190 [2842/2941]: Apps thread return from poll on 2 fds (in thread_manage_apps() at main.c:1640)
DEBUG1 - 10:46:00.827229 [2842/2941]: Application communication apps thread cleanup complete (in thread_manage_apps() at main.c:1748)
DEBUG3 - 10:46:00.827319 [2842/2943]: [agent-thread] Manage agent return from poll on 2 fds (in agent_thread_manage_registration() at agent-thread.c:269)
DEBUG3 - 10:46:00.827341 [2842/2943]: [agent-thread] 1 fd ready (in agent_thread_manage_registration() at agent-thread.c:280)
DEBUG3 - 10:46:00.827354 [2842/2943]: [agent-thread] Destroy TCP socket on port 5345 (in destroy_tcp_socket() at agent-thread.c:137)
DEBUG1 - 10:46:00.827416 [2842/2943]: [agent-thread] is cleaning up and stopping. (in agent_thread_manage_registration() at agent-thread.c:357)
DEBUG1 - 10:46:00.827540 [2842/2940]: Notifying applications of session daemon state: 0 (in notify_ust_apps() at main.c:851)
DEBUG1 - 10:46:00.827653 [2842/2940]: Got the wait shm fd 18 (in get_wait_shm() at shm.c:115)
DEBUG1 - 10:46:00.827702 [2842/2940]: Futex wait update active 0 (in futex_wait_update() at futex.c:66)
DEBUG1 - 10:46:00.827765 [2842/2940]: UST Registration thread cleanup complete (in thread_registration_apps() at main.c:2355)
DEBUG1 - 10:46:00.827826 [2842/2939]: Futex n to 1 wait done (in futex_nto1_wait() at futex.c:106)
DEBUG1 - 10:46:00.827854 [2842/2939]: Dispatch thread dying (in thread_dispatch_ust_registration() at main.c:2132)
DEBUG1 - 10:46:00.827957 [2842/2937]: Health check thread dying (in thread_manage_health() at main.c:4363)
#2

Updated by Daniel U. Thibault over 3 years ago

And here are the dmesg contents.

#3

Updated by Daniel U. Thibault over 3 years ago

Once the less-demanding-session has run successfully (10 minutes, a mere total 65 MiB of data --all system calls of a quiescent system) and has been destroyed, launching a second run fails with the dreaded Error: Kernel create channel failed. Attached is the dmesg log of this run (up to the error).

#4

Updated by Sébastien Boisvert over 3 years ago

I will try to reproduce this bug. Then, I will come up with a solution.

#5

Updated by Sébastien Boisvert over 3 years ago

Actual result:

I reproduced the problem without even starting my session.


Error: Channel channel_8: Kernel create channel failed (session demanding-session-bug-1032)
Error: Command error

Commands: http://pastebin.ubuntu.com/18022152/
Output: http://pastebin.ubuntu.com/18022206/

Expected result:

LTTng creates 11 channels, and traces the kernel.

#6

Updated by Sébastien Boisvert over 3 years ago

Daniel U. Thibault wrote:

This is weird because it has to come from either persistent (file-based) LTTng data or from messing up the kernel itself.

It is related to the kernel. I think that kmalloc_node fails to allocate memory.

lttng-hanging-problem-dmesg

[82924.737489] lttng-sessiond: page allocation failure: order:7, mode:0x20c0d0
[82924.737506] CPU: 0 PID: 648 Comm: lttng-sessiond Tainted: G           OE   4.2.0-27-generic #32~14.04.1-Ubuntu
[82924.737508] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[82924.737514]  0000000000000000 ffff880079b9f6e8 ffffffff817b5013 0000000000000001
[82924.737518]  000000000020c0d0 ffff880079b9f778 ffffffff8117bdf2 ffff880000000080
[82924.737520]  000000070020c0d0 ffff880000000042 0000000000000000 0000000000000007
[82924.737523] Call Trace:
[82924.737542]  [<ffffffff817b5013>] dump_stack+0x45/0x57
[82924.737551]  [<ffffffff8117bdf2>] warn_alloc_failed+0xd2/0x120
[82924.737555]  [<ffffffff8117fac9>] __alloc_pages_nodemask+0x7f9/0x990
[82924.737557]  [<ffffffff8117fe5a>] alloc_kmem_pages_node+0x6a/0x120
[82924.737563]  [<ffffffff817b12de>] kmalloc_large_node+0x2c/0x59
[82924.737581]  [<ffffffff811cf345>] __kmalloc_node+0x245/0x2b0
[82924.737596]  [<ffffffffc00a4187>] lib_ring_buffer_backend_create+0x207/0x610 [lttng_lib_ring_buffer]
[82924.737599]  [<ffffffffc00a72f3>] lib_ring_buffer_create+0x93/0x2e0 [lttng_lib_ring_buffer]
[82924.737627]  [<ffffffff8134ba8b>] ? apparmor_file_alloc_security+0x5b/0x160
[82924.737640]  [<ffffffff817ba6d6>] ? mutex_lock+0x16/0x37
[82924.737645]  [<ffffffffc00a4a34>] channel_backend_init+0x2b4/0x340 [lttng_lib_ring_buffer]
[82924.737652]  [<ffffffffc00a54b0>] channel_create+0x70/0x210 [lttng_lib_ring_buffer]
[82924.737656]  [<ffffffffc04134d9>] _channel_create+0x39/0x90 [lttng_ring_buffer_client_discard]
[82924.737689]  [<ffffffffc0169e1d>] lttng_channel_create+0xfd/0x1b0 [lttng_tracer]
[82924.737700]  [<ffffffffc016ccd4>] lttng_abi_create_channel+0xd4/0x230 [lttng_tracer]
[82924.737708]  [<ffffffffc016d12a>] lttng_session_ioctl+0x12a/0x300 [lttng_tracer]
[82924.737711]  [<ffffffff81354898>] ? aa_sock_msg_perm+0x78/0x140
[82924.737715]  [<ffffffff8174fc52>] ? unix_stream_recvmsg+0x42/0x50
[82924.737719]  [<ffffffff8174cd80>] ? unix_state_double_lock+0x70/0x70
[82924.737725]  [<ffffffff8169513b>] ? sock_recvmsg+0x3b/0x50
[82924.737733]  [<ffffffff811fdfbd>] do_vfs_ioctl+0x2cd/0x4b0
[82924.737741]  [<ffffffff8109e804>] ? finish_task_switch+0x64/0x230
[82924.737743]  [<ffffffff811fe219>] SyS_ioctl+0x79/0x90
[82924.737748]  [<ffffffff817bc3b2>] entry_SYSCALL_64_fastpath+0x16/0x75

kmalloc_node is called by lttng-lib-ring-buffer.ko.
The source file is lttng-modules/lib/ringbuffer/ring_buffer_backend.c

kmalloc_node is called on two lines in function lib_ring_buffer_backend_allocate() which is called by lib_ring_buffer_backend_create():

- https://github.com/lttng/lttng-modules/blob/3c8ebbc8658330fcd7794aecccdec16c46193da1/lib/ringbuffer/ring_buffer_backend.c#L67
- https://github.com/lttng/lttng-modules/blob/3c8ebbc8658330fcd7794aecccdec16c46193da1/lib/ringbuffer/ring_buffer_backend.c#L73

#7

Updated by Jérémie Galarneau over 3 years ago

  • Parent task set to #1031
#8

Updated by Jérémie Galarneau over 3 years ago

  • Status changed from New to Resolved

See #1031.

Also available in: Atom PDF