Project

General

Profile

Actions

Bug #1314

closed

lttng_enable_channel fails because of EAGAIN

Added by Heng Guo almost 3 years ago. Updated over 2 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
-
Target version:
-
Start date:
05/06/2021
Due date:
% Done:

0%

Estimated time:

Description

Our customer is using lttng and gets this issue. Share the customer info below.

Customer info :

In our test loops we are constantly seeing this issue. It is always around ntpd clock sync - few examples below.
Is it possible that ntp is causing the issue? Could it be fixed?

It depends on timing, so not with 100% reproducibility.
It is always seen during node boot, and ntpd trace is always near.

Example 1)
2020-11-12T18:10:12.882072+00:00 du1 ntpd9022: 0.0.0.0 c61c 0c clock_step +2.065932 s
2020-11-12T18:10:14.948341+00:00 du1 ntpd9022: 0.0.0.0 c615 05 clock_sync
2020-11-12T18:10:14.957275+00:00 du1 kernel: [ 96.890291] hrtimer: interrupt took 14734 ns
2020-11-12T18:10:15.632289+00:00 du1 ttad: tta wake-up, restarting traces
2020-11-12T18:10:15.978724+00:00 du1 ted4247: Error[TED]:recreate_session_in_lttng(): lttng_enable_channel, UST create channel failed

Example 2)
2020-11-12T16:33:39.753979+00:00 du1 ted4314: Error[TED]:recreate_session_in_lttng(): lttng_enable_channel, UST create channel failed
2020-11-12T16:33:43.507062+00:00 du1 ntpd9311: 0.0.0.0 c61c 0c clock_step +3.991664 s
2020-11-12T16:33:47.499274+00:00 du1 ntpd9311: 0.0.0.0 c615 05 clock_sync

Example 3)
2020-10-12T06:24:52.747019+00:00 du1 ntpd9433: 0.0.0.0 c61c 0c clock_step +4.648429 s
2020-10-12T06:24:57.395974+00:00 du1 ntpd9433: 0.0.0.0 c615 05 clock_sync
2020-10-12T06:24:57.402268+00:00 du1 kernel: [ 100.756555] hrtimer: interrupt took 9625 ns
2020-10-12T06:24:58.080346+00:00 du1 ttad: tta wake-up, restarting traces
2020-10-12T06:24:58.318945+00:00 du1 ted4716: Error[TED]:recreate_session_in_lttng(): lttng_enable_channel, UST create channel failed

Lttng_tools: 2.10.11
Lttng_ust: 2.10.7

I add some debug log in lttng-tools and get the attached log - lttng-tools.log. So because EAGAIN error is got at SOCKET API in lttng-ust function - ustcomm_recv_unix_sock().

2021-02-22T19:15:25.832314+00:00 du1 lttng-sessiond: ustctl_send_stream_to_ust, ret=-11, app->sock=964(in ust_consumer_send_stream_to_ust() at ../../../../git/src/bin/lttng-sessiond/ust-consumer.c:431)

I made a patch 0001-lttng-ust-retry-sendmsg-or-recvmsg-when-EAGAIN-is-go.patch in which both sendmsg and recvmsg API are retried if EAGAIN error is got. The issue is not reproduced with it and log is lttng-ust.log

I double check the linux kernel source codes and share unix_stream_read_generic() below. EAGAIN is return at timeout, the possible reason could be temporary system busy.

static int unix_stream_read_generic(struct unix_stream_read_state *state,
                                    bool freezable)
{
......
                last = skb = skb_peek(&sk->sk_receive_queue);
                last_len = last ? last->len : 0;
again:
                if (skb == NULL) {
......
                        if (!timeo) {
                                err = -EAGAIN;
                                break;
                        }

                        mutex_unlock(&u->iolock);

                        timeo = unix_stream_data_wait(sk, timeo, last,
                                                      last_len, freezable);
......
}

So could lttng expert check if this patch is reasonable?


Files

0001-lttng-ust-retry-sendmsg-or-recvmsg-when-EAGAIN-is-go.patch (1.98 KB) 0001-lttng-ust-retry-sendmsg-or-recvmsg-when-EAGAIN-is-go.patch Heng Guo, 05/06/2021 10:31 PM
0001-retry-to-send-stream-or-channel-to-ust-at-EAGAIN.patch (3.67 KB) 0001-retry-to-send-stream-or-channel-to-ust-at-EAGAIN.patch Heng Guo, 06/11/2021 05:08 AM
0002-Fix-ust-app-stuck-on-recv-message-during-UST-comm-ti.patch (4.48 KB) 0002-Fix-ust-app-stuck-on-recv-message-during-UST-comm-ti.patch Jonathan Rajotte Julien, 07/14/2021 08:06 PM
0001-Fix-ust-UST-communication-can-return-EAGAIN.patch (34.6 KB) 0001-Fix-ust-UST-communication-can-return-EAGAIN.patch Jonathan Rajotte Julien, 07/14/2021 08:06 PM
0002-Fix-ust-app-stuck-on-recv-message-during-UST-comm-ti.patch (4.48 KB) 0002-Fix-ust-app-stuck-on-recv-message-during-UST-comm-ti.patch Updated commit msg Jonathan Rajotte Julien, 07/15/2021 04:24 PM
0001-Fix-UST-communication-can-return-EAGAIN.patch (34.1 KB) 0001-Fix-UST-communication-can-return-EAGAIN.patch Updated commit msg Jonathan Rajotte Julien, 07/15/2021 04:24 PM
0001-Fix-ust-segfault-on-lttng-start-on-filter-bytecode-c.patch (5.27 KB) 0001-Fix-ust-segfault-on-lttng-start-on-filter-bytecode-c.patch Jonathan Rajotte Julien, 10/21/2021 04:04 PM
0003-Fix-ust-app-stuck-on-recv-message-during-UST-comm-ti.patch (4.64 KB) 0003-Fix-ust-app-stuck-on-recv-message-during-UST-comm-ti.patch Jonathan Rajotte Julien, 10/21/2021 04:04 PM
0002-Fix-ust-UST-communication-can-return-EAGAIN.patch (37 KB) 0002-Fix-ust-UST-communication-can-return-EAGAIN.patch Jonathan Rajotte Julien, 10/21/2021 04:04 PM
Actions

Also available in: Atom PDF