Bug #1189

closed

Dead loop in rculfhash with urcu-signal in userspace-rcu-0.10.2

Added by Ying Luo almost 5 years ago. Updated over 4 years ago.

Status: Resolved
Priority: Normal
Assignee: -
Target version: -
Start date: 07/12/2019
Due date:
% Done: 0%
Estimated time:

Description

->System information:
Component: Userspace RCU
Component version: 0.10.2
System information:
x86, 24 CPUs
3.10.0-862.14.1.6_59.x86_64 GNU/Linux

->Steps to reproduce the issue:
1. Install the userspace-rcu-0.10.2 package
2. Apply the attached patch to the userspace-rcu-0.10.2 source code
3. cd userspace-rcu-0.10.2/doc/examples/rculfhash/
4. make
5. ./cds_lfht_destroy

->Expected behavior: the program run in step 5 should exit after 200,000 loop iterations
->Actual behavior: the program gets stuck in a dead loop, which can be observed with gdb

->There is no such issue in userspace-rcu-0.9.3

->Root cause:
1. userspace-rcu-0.10.2 uses a separate thread (cds_lfht_workqueue) to handle hash table auto-resize, and this thread blocks all signals at initialization.
2. In the urcu-signal flavor, call_rcu_thread sends the SIGRCU signal to all registered threads to make them execute cmm_smp_mb.
3. As a result, call_rcu_thread and the workqueue thread get stuck in a dead loop: call_rcu_thread holds rcu_registry_lock while waiting for all registered threads to execute cmm_smp_mb, whereas the workqueue thread never executes cmm_smp_mb and waits to acquire rcu_registry_lock in order to unregister from RCU.
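
The root cause above can be illustrated with a small standalone sketch. This is not liburcu code: every demo_* name is hypothetical, DEMO_SIGRCU is assumed to be SIGUSR1 as a stand-in for SIGRCU, and the requester polls for a bounded time instead of waiting indefinitely as described in the report. A worker thread that blocks every signal never runs the handler that clears the "need_mb"-style flag, while a worker that leaves the signal unblocked acknowledges the request.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <signal.h>
#include <stdatomic.h>
#include <string.h>
#include <time.h>

/* Assumption: stand-in for SIGRCU, taken here to be SIGUSR1. */
#define DEMO_SIGRCU SIGUSR1

static atomic_int demo_need_mb; /* set by the requester, cleared by the handler */
static atomic_int demo_ready;   /* worker has installed its signal mask */
static atomic_int demo_stop;    /* requester asks the worker to exit */

/* Runs in the worker thread, and only if DEMO_SIGRCU is not blocked there. */
static void demo_sigrcu_handler(int signo)
{
	(void)signo;
	atomic_store(&demo_need_mb, 0);
}

static void demo_sleep_ms(long ms)
{
	struct timespec ts;

	ts.tv_sec = ms / 1000;
	ts.tv_nsec = (ms % 1000) * 1000000L;
	nanosleep(&ts, NULL); /* may return early on EINTR; callers loop */
}

/* Worker: block all signals, optionally leaving DEMO_SIGRCU unblocked. */
static void *demo_worker(void *arg)
{
	sigset_t mask;
	int unblock_sigrcu;

	assert(arg != NULL);
	unblock_sigrcu = *(int *)arg;
	sigfillset(&mask);
	if (unblock_sigrcu)
		sigdelset(&mask, DEMO_SIGRCU);
	pthread_sigmask(SIG_SETMASK, &mask, NULL);
	atomic_store(&demo_ready, 1);
	while (!atomic_load(&demo_stop))
		demo_sleep_ms(1);
	return NULL;
}

/*
 * Returns 1 if the worker acknowledged the request within about 200 ms,
 * 0 if it did not, and -1 on setup error.
 */
static int demo_handshake(int unblock_sigrcu)
{
	struct sigaction sa;
	pthread_t tid;
	int i, acked;

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = demo_sigrcu_handler;
	sigemptyset(&sa.sa_mask);
	if (sigaction(DEMO_SIGRCU, &sa, NULL))
		return -1;

	atomic_store(&demo_ready, 0);
	atomic_store(&demo_stop, 0);
	if (pthread_create(&tid, NULL, demo_worker, &unblock_sigrcu))
		return -1;
	while (!atomic_load(&demo_ready))
		demo_sleep_ms(1);

	/* Request a barrier from the worker, then poll for the acknowledgement. */
	atomic_store(&demo_need_mb, 1);
	pthread_kill(tid, DEMO_SIGRCU);
	for (i = 0; i < 200 && atomic_load(&demo_need_mb); i++)
		demo_sleep_ms(1);
	acked = !atomic_load(&demo_need_mb);

	atomic_store(&demo_stop, 1);
	pthread_join(tid, NULL);
	return acked;
}
```

Calling demo_handshake(0) models the worker described in point 1 (all signals blocked), and demo_handshake(1) models a worker that keeps the signal unblocked. In the real library the requester has no timeout, which is why the unacknowledged case shows up as a hang.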


Files

reproduce_bug.patch (3.46 KB): patch to reproduce issue. Ying Luo, 07/12/2019 05:46 AM
#1

Updated by Mathieu Desnoyers over 4 years ago

  • Status changed from New to Resolved

Fixed by the following commits in master:

commit 9fd30396a597942084b007f33cc7f2c279f746e9
Author: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Date:   Thu Sep 19 10:10:31 2019 -0400

    Fix: provide errno as argument to urcu_die()

    commit 1a990de3add "Fix: rculfhash worker needs to unblock to SIGRCU" 
    provides "ret" (-1) as argument to urcu_die(), but should rather provide
    errno.

    Reported by Coverity:

    ** CID 1405700:  Error handling issues  (NEGATIVE_RETURNS) /src/rculfhash.c: 2171 in cds_lfht_worker_init()

    Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>

commit 1a990de3addad89fc397f57bb359175d307e6960
Author: hewenliang <hewenliang4@huawei.com>
Date:   Tue Sep 17 10:59:18 2019 -0400

    Fix: rculfhash worker needs to unblock to SIGRCU

    In urcu-signal flavor, call_rcu_thread calls synchronize_rcu which
    will send SIGRCU signal to all registed threads, and then loops to
    wait need_mb to be cleared. However, the registed workqueue_thread
    does not process the SIGRCU signal, and never clear the need_mb.
    Based on above, call_rcu_thread and workqueue_thread will wait
    forever for completion of the grace period: call_rcu_thread which holds
    the rcu_registry_lock, waits for workqueue_thread to do cmm_smp_mb.
    While workqueue thread never does cmm_smp_mb because of signal blocking,
    and it will eventually wait to get rcu_registry_lock in do_resize_cb.

    The phenomenon is as follows, which is easy to be triggered:

    (gdb) t 2
    [Switching to thread 2 (Thread 0xffff83c3b080 (LWP 27116))]
    0  0x0000ffff845296c4 in poll () from /lib64/libc.so.6
    (gdb) bt
    0 0x0000ffff845296c4 in poll () from /lib64/libc.so.6
    1 0x0000ffff8461b93c in force_mb_all_readers () at urcu.c:241
    2 0x0000ffff8461c748 in smp_mb_master () at urcu.c:249
    3 urcu_signal_synchronize_rcu () at urcu.c:445
    4 0x0000ffff8461d004 in call_rcu_thread  at urcu-call-rcu-impl.h:364
    5 0x0000ffff845eb8bc in start_thread () from /lib64/libpthread.so.0
    6 0x0000ffff845335cc in thread_start () from /lib64/libc.so.6
    (gdb) t 3
    [Switching to thread 3 (Thread 0xffff8443c080 (LWP 27191))]
    0 0x0000ffff845f51c4 in __lll_lock_wait () from /lib64/libpthread.so.0
    (gdb) bt
    0 0x0000ffff845f51c4 in __lll_lock_wait () from /lib64/libpthread.so.0
    1 0x0000ffff845ee048 in pthread_mutex_lock () from /lib64/libpthread.so.0
    2 0x0000ffff8461b814 in mutex_lock ( <rcu_registry_lock>) at urcu.c:157
    3 0x0000ffff8461b9e4 in urcu_signal_unregister_thread () at urcu.c:564
    4 0x0000ffff8463e62c in do_resize_cb (work=0x11e2e790) at rculfhash.c:2042
    5 0x0000ffff8463c940 in workqueue_thread (arg=0x11e1d260) at workqueue.c:228
    6 0x0000ffff845eb8bc in start_thread () from /lib64/libpthread.so.0
    7 0x0000ffff845335cc in thread_start () from /lib64/libc.so.6

    So we should not block SIGRCU in workqueue thread to avoid blocking
    forever in the grace period awaiting on the worker thread when using
    urcu-signal flavor.

    Signed-off-by: hewenliang <hewenliang4@huawei.com>
    Co-developed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
    Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
    Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

backported into stable-0.10 as:

commit ef728ceea316503bdfd75c386512045fc8aa8285
Author: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Date:   Thu Sep 19 10:10:31 2019 -0400

    Fix: provide errno as argument to urcu_die()

    commit 1a990de3add "Fix: rculfhash worker needs to unblock to SIGRCU" 
    provides "ret" (-1) as argument to urcu_die(), but should rather provide
    errno.

    Reported by Coverity:

    ** CID 1405700:  Error handling issues  (NEGATIVE_RETURNS) /src/rculfhash.c: 2171 in cds_lfht_worker_init()

    Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>

commit 23da24d2d03a22841e7f76bf25a52a803b403362
Author: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Date:   Wed Sep 18 11:13:24 2019 -0400

    Fix: include urcu-signal-nr.h

    Erroneous include file name in backport of commit 1a990de3add
    "Fix: rculfhash worker needs to unblock to SIGRCU".

    Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>

commit 22e3a77f65d53b84dbe653fa6cb83181cafe2482
Author: hewenliang <hewenliang4@huawei.com>
Date:   Tue Sep 17 10:59:18 2019 -0400

    Fix: rculfhash worker needs to unblock to SIGRCU

    In urcu-signal flavor, call_rcu_thread calls synchronize_rcu which
    will send SIGRCU signal to all registed threads, and then loops to
    wait need_mb to be cleared. However, the registed workqueue_thread
    does not process the SIGRCU signal, and never clear the need_mb.
    Based on above, call_rcu_thread and workqueue_thread will wait
    forever for completion of the grace period: call_rcu_thread which holds
    the rcu_registry_lock, waits for workqueue_thread to do cmm_smp_mb.
    While workqueue thread never does cmm_smp_mb because of signal blocking,
    and it will eventually wait to get rcu_registry_lock in do_resize_cb.

    The phenomenon is as follows, which is easy to be triggered:

    (gdb) t 2
    [Switching to thread 2 (Thread 0xffff83c3b080 (LWP 27116))]
    0  0x0000ffff845296c4 in poll () from /lib64/libc.so.6
    (gdb) bt
    0 0x0000ffff845296c4 in poll () from /lib64/libc.so.6
    1 0x0000ffff8461b93c in force_mb_all_readers () at urcu.c:241
    2 0x0000ffff8461c748 in smp_mb_master () at urcu.c:249
    3 urcu_signal_synchronize_rcu () at urcu.c:445
    4 0x0000ffff8461d004 in call_rcu_thread  at urcu-call-rcu-impl.h:364
    5 0x0000ffff845eb8bc in start_thread () from /lib64/libpthread.so.0
    6 0x0000ffff845335cc in thread_start () from /lib64/libc.so.6
    (gdb) t 3
    [Switching to thread 3 (Thread 0xffff8443c080 (LWP 27191))]
    0 0x0000ffff845f51c4 in __lll_lock_wait () from /lib64/libpthread.so.0
    (gdb) bt
    0 0x0000ffff845f51c4 in __lll_lock_wait () from /lib64/libpthread.so.0
    1 0x0000ffff845ee048 in pthread_mutex_lock () from /lib64/libpthread.so.0
    2 0x0000ffff8461b814 in mutex_lock ( <rcu_registry_lock>) at urcu.c:157
    3 0x0000ffff8461b9e4 in urcu_signal_unregister_thread () at urcu.c:564
    4 0x0000ffff8463e62c in do_resize_cb (work=0x11e2e790) at rculfhash.c:2042
    5 0x0000ffff8463c940 in workqueue_thread (arg=0x11e1d260) at workqueue.c:228
    6 0x0000ffff845eb8bc in start_thread () from /lib64/libpthread.so.0
    7 0x0000ffff845335cc in thread_start () from /lib64/libc.so.6

    So we should not block SIGRCU in workqueue thread to avoid blocking
    forever in the grace period awaiting on the worker thread when using
    urcu-signal flavor.

    Signed-off-by: hewenliang <hewenliang4@huawei.com>
    Co-developed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
    Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
    Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>