Bug #631
hard lockup with lttng-modules and kernel 3.10+ (closed)
Description
Starting from Linux kernel commit
06c017fdd4dc48451a29ac37fc1db4a3f86b7f40 "timekeeping: Hold
timekeepering locks in do_adjtimex and hardpps" (3.10 kernels), the
xtime write seqlock is held across calls to __do_adjtimex(), which
includes a call to notify_cmos_timer(), and hence
schedule_delayed_work().
This has a side effect on a set of tracepoints, mainly the workqueue
tracepoints: a tracer that hooks those tracepoints and reads the current
time with ktime_get() spins in the seqlock read side, waiting for a writer
that is running on the same CPU and can never finish. The result is a hard
system lockup such as:
WARNING: CPU: 6 PID: 2258 at kernel/watchdog.c:245 watchdog_overflow_callback+0x93/0x9e()
Watchdog detected hard LOCKUP on cpu 6
Modules linked in: lttng_probe_workqueue(O) lttng_probe_vmscan(O) lttng_probe_udp(O) lttng_probe_timer(O) lttng_probe_s]
CPU: 6 PID: 2258 Comm: ntpd Tainted: G O 3.11.0 #158
Hardware name: Supermicro X7DAL/X7DAL, BIOS 6.00 12/03/2007
 0000000000000000 ffffffff814f83eb ffffffff813b206a ffff88042fd87c78
 ffffffff8106a07c 0000000000000000 ffffffff810c94c2 0000000000000000
 ffff88041f31bc00 0000000000000000 ffff88042fd87d68 ffff88042fd87ef8
Call Trace:
 <NMI> [<ffffffff813b206a>] ? dump_stack+0x41/0x51
 [<ffffffff8106a07c>] ? warn_slowpath_common+0x79/0x92
 [<ffffffff810c94c2>] ? watchdog_overflow_callback+0x93/0x9e
 [<ffffffff8106a12d>] ? warn_slowpath_fmt+0x45/0x4a
 [<ffffffff810c94c2>] ? watchdog_overflow_callback+0x93/0x9e
 [<ffffffff810c942f>] ? watchdog_enable_all_cpus.part.2+0x31/0x31
 [<ffffffff810ecc66>] ? __perf_event_overflow+0x12c/0x1ae
 [<ffffffff810eab60>] ? perf_event_update_userpage+0x13/0xc2
 [<ffffffff81016820>] ? intel_pmu_handle_irq+0x26a/0x2fd
 [<ffffffff813b7a0b>] ? perf_event_nmi_handler+0x24/0x3d
 [<ffffffff813b728f>] ? nmi_handle.isra.3+0x58/0x12f
 [<ffffffff813b7a59>] ? perf_ibs_nmi_handler+0x35/0x35
 [<ffffffff813b7404>] ? do_nmi+0x9e/0x2bc
 [<ffffffff813b6af7>] ? end_repeat_nmi+0x1e/0x2e
 [<ffffffff810a2a33>] ? read_seqcount_begin.constprop.4+0x8/0xf
 [<ffffffff810a2a33>] ? read_seqcount_begin.constprop.4+0x8/0xf
 [<ffffffff810a2a33>] ? read_seqcount_begin.constprop.4+0x8/0xf
 <<EOE>> [<ffffffff810a2d6c>] ? ktime_get+0x23/0x5e
 [<ffffffffa0314670>] ? lib_ring_buffer_clock_read.isra.28+0x1f/0x21 [lttng_ring_buffer_client_discard]
 [<ffffffffa0314786>] ? lttng_event_reserve+0x112/0x3f3 [lttng_ring_buffer_client_discard]
 [<ffffffffa045b1c5>] ? __event_probe__workqueue_queue_work+0x72/0xe0 [lttng_probe_workqueue]
 [<ffffffff812ef7e9>] ? sock_aio_read.part.10+0x110/0x124
 [<ffffffff81133a36>] ? do_sync_readv_writev+0x50/0x76
 [<ffffffff8107d514>] ? __queue_work+0x1ab/0x265
 [<ffffffff8107da7e>] ? queue_delayed_work_on+0x3f/0x4e
 [<ffffffff810a473d>] ? __do_adjtimex+0x408/0x413
 [<ffffffff810a3e9a>] ? do_adjtimex+0x98/0xee
 [<ffffffff8106cec6>] ? SYSC_adjtimex+0x32/0x5d
 [<ffffffff813bb74b>] ? tracesys+0xdd/0xe2
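The failure mode in the description can be modelled in user space. The sketch below is only an illustration, not kernel code: every model_* name is hypothetical, and the reader gives up after a bounded number of spins so that the demonstration terminates, whereas the read side seen in the trace (read_seqcount_begin) keeps spinning while the sequence is odd.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical user-space model of a sequence counter (not the kernel type). */
struct model_seqcount {
    unsigned int sequence; /* odd while a writer is inside its critical section */
};

static void model_write_begin(struct model_seqcount *s)
{
    assert((s->sequence & 1) == 0); /* the model allows no nested writers */
    s->sequence++;
}

static void model_write_end(struct model_seqcount *s)
{
    assert((s->sequence & 1) == 1); /* must pair with model_write_begin() */
    s->sequence++;
}

/*
 * Reader entry: wait until no writer is active (sequence is even).
 * Returns true and stores the observed sequence if the reader can start,
 * false if it was still blocked after max_spins attempts.
 */
static bool model_read_begin(const struct model_seqcount *s,
                             unsigned int max_spins, unsigned int *seq)
{
    for (unsigned int i = 0; i < max_spins; i++) {
        unsigned int v = s->sequence;

        if ((v & 1) == 0) {
            *seq = v;
            return true;
        }
    }
    return false;
}

/* Stand-in for a tracepoint probe that timestamps an event via the clock. */
static bool model_probe_reads_clock(const struct model_seqcount *s)
{
    unsigned int seq;

    return model_read_begin(s, 1000, &seq);
}

/*
 * Stand-in for the adjtimex path: the probe fires while the write side is
 * held by the very same thread, so the writer cannot finish while the
 * reader waits. Returns whether the probe managed to read the clock.
 */
static bool model_adjtimex_with_probe(struct model_seqcount *s)
{
    bool ok;

    model_write_begin(s);
    ok = model_probe_reads_clock(s);
    model_write_end(s);
    return ok;
}
```

In this model, model_probe_reads_clock() succeeds when called outside the write section, while model_adjtimex_with_probe() reports that the probe never got past the read-side entry. With an unbounded spin, that second case is the lockup shown in the trace.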
Updated by Mathieu Desnoyers over 11 years ago
Proposed the following two patches to fix this issue. One of them needs to be applied to the Linux kernel.
http://lists.lttng.org/pipermail/lttng-dev/2013-September/021367.html
http://lists.lttng.org/pipermail/lttng-dev/2013-September/021368.html
Updated by Mathieu Desnoyers over 11 years ago
The problem has been identified by John Stultz, and recognized as a Linux kernel issue. An actual fix for the issue (rather than the work-arounds I proposed) was proposed by John Stultz. I am blacklisting kernels 3.10 and 3.11, as well as master, until the fix gets in.
commit fc8216ae9ec5d18172d8227d179475e7cc1fb45c
Author: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Date:   Mon Sep 16 11:10:04 2013 -0500

    Blacklist Linux kernels 3.10+

    Linux kernels 3.10 and 3.11 introduce a deadlock in the timekeeping
    subsystem. See
    http://lkml.kernel.org/r/1378943457-27314-1-git-send-email-john.stultz@linaro.org
    for details. Awaiting patch merge into Linux master, stable-3.10 and
    stable-3.11 for fine-grained kernel version blacklisting.

    Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Updated by Mathieu Desnoyers about 11 years ago
- Status changed from Confirmed to Resolved
The kernel fix was merged into the 3.10, 3.11 and 3.12 kernels.
The following commit takes care of blacklisting only the appropriate versions:
commit e14bf96416c39675a5f785b032d1c5279020b93d
Author: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Date:   Fri Sep 27 14:10:40 2013 -0400

    Blacklist kernels 3.10.13 and 3.11.2

    It looks like my guessing on kernel version at which Greg will pull
    this fix was wrong. It will probably appear in the next round of stable
    releases.

    The fix that needs to reach stable-3.10 and stable-3.11 before we can
    remove those from the backlist:

    commit 7bd36014460f793c19e7d6c94dab67b0afcfcb7f
    Author: John Stultz <john.stultz@linaro.org>
    Date:   Wed Sep 11 16:50:56 2013 -0700

        timekeeping: Fix HRTICK related deadlock from ntp lock changes

    Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
    CC: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    CC: John Stultz <john.stultz@linaro.org>
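As a rough illustration of version-range blacklisting, here is a minimal sketch in plain C. It is not the code from the commit above. The model_* names are hypothetical, and the ranges (3.10.0 through 3.10.13 and 3.11.0 through 3.11.2, inclusive) are an assumption based on one reading of the commit title, not something this report states.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: packs a version as (major << 16) + (minor << 8) + patch. */
#define MODEL_KERNEL_VERSION(a, b, c) (((a) << 16) + ((b) << 8) + (c))

/* Inclusive range of packed kernel versions. */
struct model_version_range {
    unsigned int first;
    unsigned int last;
};

/* ASSUMPTION: affected ranges inferred from the commit title, for illustration only. */
static const struct model_version_range model_blacklist[] = {
    { MODEL_KERNEL_VERSION(3, 10, 0), MODEL_KERNEL_VERSION(3, 10, 13) },
    { MODEL_KERNEL_VERSION(3, 11, 0), MODEL_KERNEL_VERSION(3, 11, 2) },
};

/* Returns true if the given kernel version falls inside a blacklisted range. */
static bool model_kernel_is_blacklisted(unsigned int major, unsigned int minor,
                                        unsigned int patch)
{
    unsigned int v = MODEL_KERNEL_VERSION(major, minor, patch);
    size_t n = sizeof(model_blacklist) / sizeof(model_blacklist[0]);

    for (size_t i = 0; i < n; i++) {
        if (v >= model_blacklist[i].first && v <= model_blacklist[i].last)
            return true;
    }
    return false;
}
```

A build-time check would more likely compare a compile-time kernel version constant in a preprocessor conditional and stop the build with an error. The run-time function here only makes the range logic easy to exercise.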