Bug #1032
closedBug #1031: Cannot create buffers of size 4 GiB or greater
LTTng session daemon may hang under certain circumstances
0%
Description
(LTTng 2.8.0 lttng-tools a0f518a on Ubuntu 14.04.4, kernel 4.2.0-27-generic)
I ran into this while working with bug 1031.
After trying to load a session with apparently too greedy buffer memory demands (11 channels of twice 64 MiB each), I stopped the session daemon completely (sudo service lttng-sessiond stop
), stop confirmed by the System Monitor) and restarted it. I then tried to load a less demanding session (which normally succeeds on a freshly booted system) and the LTTng command line client hanged, apparently because the session daemon hanged as well.
~$ lttng load demanding-session Error: Failed to load session demanding-session: Kernel create channel failed Error: Kernel create channel failed $ sudo service lttng-sessiond stop lttng-sessiond stop/waiting $ sudo service lttng-sessiond start lttng-sessiond start/running, process 2693 $ lttng load less-demanding-session ^C $ sudo service lttng-sessiond stop lttng-sessiond stop/waiting
The verbose session daemon log is attached. The last part (in response to the lttng load less-demanding-session
) is:
DEBUG1 - 10:20:53.405425 [2842/2938]: Wait for client response (in thread_manage_clients() at main.c:4489) DEBUG1 - 10:20:53.405520 [2842/2938]: Receiving data from client ... (in thread_manage_clients() at main.c:4534) DEBUG1 - 10:20:53.405612 [2842/2938]: Nothing recv() from client... continuing (in thread_manage_clients() at main.c:4538) DEBUG1 - 10:20:53.405630 [2842/2938]: Clean command context structure (in clean_command_ctx() at main.c:831) DEBUG1 - 10:20:53.405638 [2842/2938]: Accepting client command ... (in thread_manage_clients() at main.c:4437) DEBUG1 - 10:20:53.405714 [2842/2938]: Wait for client response (in thread_manage_clients() at main.c:4489) DEBUG1 - 10:20:53.405732 [2842/2938]: Receiving data from client ... (in thread_manage_clients() at main.c:4534) DEBUG1 - 10:20:53.405819 [2842/2938]: Processing client command 8 (in process_client_msg() at main.c:3020) DEBUG1 - 10:20:53.405849 [2842/2938]: Waiting for 1 URIs from client ... (in process_client_msg() at main.c:3795) DEBUG2 - 10:20:53.405911 [2842/2938]: Trying to find session by name less-demanding-session (in session_find_by_name() at session.c:276) DEBUG3 - 10:20:53.405961 [2842/2938]: Created hashtable size 4 at 0x7f5048006f00 of type 1 (in lttng_ht_new() at hashtable.c:145) DEBUG1 - 10:20:53.405973 [2842/2938]: Allocating ltt_sessions_ht_by_id (in ltt_sessions_ht_alloc() at session.c:153) DEBUG3 - 10:20:53.405978 [2842/2938]: Created hashtable size 4 at 0x7f50480072c0 of type 2 (in lttng_ht_new() at hashtable.c:145) DEBUG1 - 10:20:53.405998 [2842/2938]: Tracing session less-demanding-session created with ID 0 by UID 1000 GID 1000 (in session_create() at session.c:420) DEBUG2 - 10:20:53.406003 [2842/2938]: Trying to find session by name less-demanding-session (in session_find_by_name() at session.c:276) DEBUG3 - 10:20:53.406032 [2842/2938]: Created hashtable size 4 at 0x7f504800c6f0 of type 1 (in lttng_ht_new() at hashtable.c:145) DEBUG2 - 10:20:53.406039 [2842/2938]: Setting trace directory path from URI to /media/sf_MD1200bas/Ubuntu 14.04.4/ltraces/less-demanding-session (in add_uri_to_consumer() at cmd.c:761) DEBUG1 - 10:20:53.406049 [2842/2938]: Sending response (size: 20, retcode: Success (10)) (in thread_manage_clients() at main.c:4584) DEBUG1 - 10:20:53.406065 [2842/2938]: Clean command context structure (in clean_command_ctx() at main.c:831) DEBUG1 - 10:20:53.406071 [2842/2938]: Accepting client command ... (in thread_manage_clients() at main.c:4437) DEBUG1 - 10:20:53.406343 [2842/2938]: Wait for client response (in thread_manage_clients() at main.c:4489) DEBUG1 - 10:20:53.406364 [2842/2938]: Receiving data from client ... (in thread_manage_clients() at main.c:4534) DEBUG1 - 10:20:53.406425 [2842/2938]: Nothing recv() from client... continuing (in thread_manage_clients() at main.c:4538) DEBUG1 - 10:20:53.406439 [2842/2938]: Clean command context structure (in clean_command_ctx() at main.c:831) DEBUG1 - 10:20:53.406444 [2842/2938]: Accepting client command ... (in thread_manage_clients() at main.c:4437) DEBUG1 - 10:20:53.406507 [2842/2938]: Wait for client response (in thread_manage_clients() at main.c:4489) DEBUG1 - 10:20:53.406524 [2842/2938]: Receiving data from client ... (in thread_manage_clients() at main.c:4534) DEBUG1 - 10:20:53.406586 [2842/2938]: Processing client command 5 (in process_client_msg() at main.c:3020) DEBUG1 - 10:20:53.406597 [2842/2938]: Getting session less-demanding-session by name (in process_client_msg() at main.c:3108) DEBUG2 - 10:20:53.406601 [2842/2938]: Trying to find session by name less-demanding-session (in session_find_by_name() at session.c:276) DEBUG1 - 10:20:53.406606 [2842/2938]: Creating kernel session (in create_kernel_session() at main.c:2936) DEBUG3 - 10:20:53.406635 [2842/2938]: Created hashtable size 4 at 0x7f5048003d30 of type 1 (in lttng_ht_new() at hashtable.c:145)
Somehow the session daemon hangs shortly after lttng_ht_new() at hashtable.c:145
This is weird because it has to come from either persistent (file-based) LTTng data or from messing up the kernel itself.
Files
Updated by Daniel U. Thibault over 8 years ago
If I then ctrl+c the session daemon, it hangs during its shut-down sequence as well!
DEBUG1 - 10:46:00.826510 [2843/2843]: run_as worker received signal SIGINT (in worker_sighandler() at runas.c:566) DEBUG1 - 10:46:00.826667 [2842/2842]: SIGINT caught (in sighandler() at main.c:5419) DEBUG1 - 10:46:00.826694 [2842/2842]: Terminating all threads (in stop_threads() at main.c:510) DEBUG1 - 10:46:00.826771 [2842/2842]: Futex n to 1 wake done (in futex_nto1_wake() at futex.c:123) DEBUG1 - 10:46:00.826839 [2842/2944]: Thread kernel return from poll on 2 fds (in thread_manage_kernel() at main.c:1152) DEBUG1 - 10:46:00.826914 [2842/2944]: Kernel thread dying (in thread_manage_kernel() at main.c:1236) DEBUG3 - 10:46:00.827083 [2842/2942]: [ust-thread] Manage notify return from poll on 2 fds (in ust_thread_manage_notify() at ust-thread.c:76) DEBUG1 - 10:46:00.827122 [2842/2942]: Application notify communication apps thread cleanup complete (in ust_thread_manage_notify() at ust-thread.c:187) DEBUG1 - 10:46:00.827190 [2842/2941]: Apps thread return from poll on 2 fds (in thread_manage_apps() at main.c:1640) DEBUG1 - 10:46:00.827229 [2842/2941]: Application communication apps thread cleanup complete (in thread_manage_apps() at main.c:1748) DEBUG3 - 10:46:00.827319 [2842/2943]: [agent-thread] Manage agent return from poll on 2 fds (in agent_thread_manage_registration() at agent-thread.c:269) DEBUG3 - 10:46:00.827341 [2842/2943]: [agent-thread] 1 fd ready (in agent_thread_manage_registration() at agent-thread.c:280) DEBUG3 - 10:46:00.827354 [2842/2943]: [agent-thread] Destroy TCP socket on port 5345 (in destroy_tcp_socket() at agent-thread.c:137) DEBUG1 - 10:46:00.827416 [2842/2943]: [agent-thread] is cleaning up and stopping. (in agent_thread_manage_registration() at agent-thread.c:357) DEBUG1 - 10:46:00.827540 [2842/2940]: Notifying applications of session daemon state: 0 (in notify_ust_apps() at main.c:851) DEBUG1 - 10:46:00.827653 [2842/2940]: Got the wait shm fd 18 (in get_wait_shm() at shm.c:115) DEBUG1 - 10:46:00.827702 [2842/2940]: Futex wait update active 0 (in futex_wait_update() at futex.c:66) DEBUG1 - 10:46:00.827765 [2842/2940]: UST Registration thread cleanup complete (in thread_registration_apps() at main.c:2355) DEBUG1 - 10:46:00.827826 [2842/2939]: Futex n to 1 wait done (in futex_nto1_wait() at futex.c:106) DEBUG1 - 10:46:00.827854 [2842/2939]: Dispatch thread dying (in thread_dispatch_ust_registration() at main.c:2132) DEBUG1 - 10:46:00.827957 [2842/2937]: Health check thread dying (in thread_manage_health() at main.c:4363)
Updated by Daniel U. Thibault over 8 years ago
And here are the dmesg contents.
Updated by Daniel U. Thibault over 8 years ago
Once the less-demanding-session has run successfully (10 minutes, a mere total 65 MiB of data --all system calls of a quiescent system) and has been destroyed, launching a second run fails with the dreaded Error: Kernel create channel failed
. Attached is the dmesg log of this run (up to the error).
Updated by Sébastien Boisvert over 8 years ago
I will try to reproduce this bug. Then, I will come up with a solution.
Updated by Sébastien Boisvert over 8 years ago
Actual result:
I reproduced the problem without even starting my session.
Error: Channel channel_8: Kernel create channel failed (session demanding-session-bug-1032)
Error: Command error
Commands: http://pastebin.ubuntu.com/18022152/
Output: http://pastebin.ubuntu.com/18022206/
Expected result:
LTTng creates 11 channels, and traces the kernel.
Updated by Sébastien Boisvert over 8 years ago
Daniel U. Thibault wrote:
This is weird because it has to come from either persistent (file-based) LTTng data or from messing up the kernel itself.
It is related to the kernel. I think that kmalloc_node fails to allocate memory.
lttng-hanging-problem-dmesg
[82924.737489] lttng-sessiond: page allocation failure: order:7, mode:0x20c0d0 [82924.737506] CPU: 0 PID: 648 Comm: lttng-sessiond Tainted: G OE 4.2.0-27-generic #32~14.04.1-Ubuntu [82924.737508] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006 [82924.737514] 0000000000000000 ffff880079b9f6e8 ffffffff817b5013 0000000000000001 [82924.737518] 000000000020c0d0 ffff880079b9f778 ffffffff8117bdf2 ffff880000000080 [82924.737520] 000000070020c0d0 ffff880000000042 0000000000000000 0000000000000007 [82924.737523] Call Trace: [82924.737542] [<ffffffff817b5013>] dump_stack+0x45/0x57 [82924.737551] [<ffffffff8117bdf2>] warn_alloc_failed+0xd2/0x120 [82924.737555] [<ffffffff8117fac9>] __alloc_pages_nodemask+0x7f9/0x990 [82924.737557] [<ffffffff8117fe5a>] alloc_kmem_pages_node+0x6a/0x120 [82924.737563] [<ffffffff817b12de>] kmalloc_large_node+0x2c/0x59 [82924.737581] [<ffffffff811cf345>] __kmalloc_node+0x245/0x2b0 [82924.737596] [<ffffffffc00a4187>] lib_ring_buffer_backend_create+0x207/0x610 [lttng_lib_ring_buffer] [82924.737599] [<ffffffffc00a72f3>] lib_ring_buffer_create+0x93/0x2e0 [lttng_lib_ring_buffer] [82924.737627] [<ffffffff8134ba8b>] ? apparmor_file_alloc_security+0x5b/0x160 [82924.737640] [<ffffffff817ba6d6>] ? mutex_lock+0x16/0x37 [82924.737645] [<ffffffffc00a4a34>] channel_backend_init+0x2b4/0x340 [lttng_lib_ring_buffer] [82924.737652] [<ffffffffc00a54b0>] channel_create+0x70/0x210 [lttng_lib_ring_buffer] [82924.737656] [<ffffffffc04134d9>] _channel_create+0x39/0x90 [lttng_ring_buffer_client_discard] [82924.737689] [<ffffffffc0169e1d>] lttng_channel_create+0xfd/0x1b0 [lttng_tracer] [82924.737700] [<ffffffffc016ccd4>] lttng_abi_create_channel+0xd4/0x230 [lttng_tracer] [82924.737708] [<ffffffffc016d12a>] lttng_session_ioctl+0x12a/0x300 [lttng_tracer] [82924.737711] [<ffffffff81354898>] ? aa_sock_msg_perm+0x78/0x140 [82924.737715] [<ffffffff8174fc52>] ? unix_stream_recvmsg+0x42/0x50 [82924.737719] [<ffffffff8174cd80>] ? unix_state_double_lock+0x70/0x70 [82924.737725] [<ffffffff8169513b>] ? sock_recvmsg+0x3b/0x50 [82924.737733] [<ffffffff811fdfbd>] do_vfs_ioctl+0x2cd/0x4b0 [82924.737741] [<ffffffff8109e804>] ? finish_task_switch+0x64/0x230 [82924.737743] [<ffffffff811fe219>] SyS_ioctl+0x79/0x90 [82924.737748] [<ffffffff817bc3b2>] entry_SYSCALL_64_fastpath+0x16/0x75
kmalloc_node is called by lttng-lib-ring-buffer.ko.
The source file is lttng-modules/lib/ringbuffer/ring_buffer_backend.c
kmalloc_node is called on two lines in function lib_ring_buffer_backend_allocate() which is called by lib_ring_buffer_backend_create():
- https://github.com/lttng/lttng-modules/blob/3c8ebbc8658330fcd7794aecccdec16c46193da1/lib/ringbuffer/ring_buffer_backend.c#L67
- https://github.com/lttng/lttng-modules/blob/3c8ebbc8658330fcd7794aecccdec16c46193da1/lib/ringbuffer/ring_buffer_backend.c#L73