Project

General

Profile

Actions

Bug #1314

closed

lttng_enable_channel fails because of EAGAIN

Added by Heng Guo almost 3 years ago. Updated over 2 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
-
Target version:
-
Start date:
05/06/2021
Due date:
% Done:

0%

Estimated time:

Description

Our customer is using lttng and gets this issue. Share the customer info below.

Customer info :

In our test loops we are constantly seeing this issue. It is always around ntpd clock sync - few examples below.
Is it possible that ntp is causing the issue? Could it be fixed?

It depends on timing, so not with 100% reproducibility.
It is always seen during node boot, and ntpd trace is always near.

Example 1)
2020-11-12T18:10:12.882072+00:00 du1 ntpd9022: 0.0.0.0 c61c 0c clock_step +2.065932 s
2020-11-12T18:10:14.948341+00:00 du1 ntpd9022: 0.0.0.0 c615 05 clock_sync
2020-11-12T18:10:14.957275+00:00 du1 kernel: [ 96.890291] hrtimer: interrupt took 14734 ns
2020-11-12T18:10:15.632289+00:00 du1 ttad: tta wake-up, restarting traces
2020-11-12T18:10:15.978724+00:00 du1 ted4247: Error[TED]:recreate_session_in_lttng(): lttng_enable_channel, UST create channel failed

Example 2)
2020-11-12T16:33:39.753979+00:00 du1 ted4314: Error[TED]:recreate_session_in_lttng(): lttng_enable_channel, UST create channel failed
2020-11-12T16:33:43.507062+00:00 du1 ntpd9311: 0.0.0.0 c61c 0c clock_step +3.991664 s
2020-11-12T16:33:47.499274+00:00 du1 ntpd9311: 0.0.0.0 c615 05 clock_sync

Example 3)
2020-10-12T06:24:52.747019+00:00 du1 ntpd9433: 0.0.0.0 c61c 0c clock_step +4.648429 s
2020-10-12T06:24:57.395974+00:00 du1 ntpd9433: 0.0.0.0 c615 05 clock_sync
2020-10-12T06:24:57.402268+00:00 du1 kernel: [ 100.756555] hrtimer: interrupt took 9625 ns
2020-10-12T06:24:58.080346+00:00 du1 ttad: tta wake-up, restarting traces
2020-10-12T06:24:58.318945+00:00 du1 ted4716: Error[TED]:recreate_session_in_lttng(): lttng_enable_channel, UST create channel failed

Lttng_tools: 2.10.11
Lttng_ust: 2.10.7

I add some debug log in lttng-tools and get the attached log - lttng-tools.log. So because EAGAIN error is got at SOCKET API in lttng-ust function - ustcomm_recv_unix_sock().

2021-02-22T19:15:25.832314+00:00 du1 lttng-sessiond: ustctl_send_stream_to_ust, ret=-11, app->sock=964(in ust_consumer_send_stream_to_ust() at ../../../../git/src/bin/lttng-sessiond/ust-consumer.c:431)

I made a patch 0001-lttng-ust-retry-sendmsg-or-recvmsg-when-EAGAIN-is-go.patch in which both sendmsg and recvmsg API are retried if EAGAIN error is got. The issue is not reproduced with it and log is lttng-ust.log

I double check the linux kernel source codes and share unix_stream_read_generic() below. EAGAIN is return at timeout, the possible reason could be temporary system busy.

static int unix_stream_read_generic(struct unix_stream_read_state *state,
                                    bool freezable)
{
......
                last = skb = skb_peek(&sk->sk_receive_queue);
                last_len = last ? last->len : 0;
again:
                if (skb == NULL) {
......
                        if (!timeo) {
                                err = -EAGAIN;
                                break;
                        }

                        mutex_unlock(&u->iolock);

                        timeo = unix_stream_data_wait(sk, timeo, last,
                                                      last_len, freezable);
......
}

So could lttng expert check if this patch is reasonable?


Files

0001-lttng-ust-retry-sendmsg-or-recvmsg-when-EAGAIN-is-go.patch (1.98 KB) 0001-lttng-ust-retry-sendmsg-or-recvmsg-when-EAGAIN-is-go.patch Heng Guo, 05/06/2021 10:31 PM
0001-retry-to-send-stream-or-channel-to-ust-at-EAGAIN.patch (3.67 KB) 0001-retry-to-send-stream-or-channel-to-ust-at-EAGAIN.patch Heng Guo, 06/11/2021 05:08 AM
0002-Fix-ust-app-stuck-on-recv-message-during-UST-comm-ti.patch (4.48 KB) 0002-Fix-ust-app-stuck-on-recv-message-during-UST-comm-ti.patch Jonathan Rajotte Julien, 07/14/2021 08:06 PM
0001-Fix-ust-UST-communication-can-return-EAGAIN.patch (34.6 KB) 0001-Fix-ust-UST-communication-can-return-EAGAIN.patch Jonathan Rajotte Julien, 07/14/2021 08:06 PM
0002-Fix-ust-app-stuck-on-recv-message-during-UST-comm-ti.patch (4.48 KB) 0002-Fix-ust-app-stuck-on-recv-message-during-UST-comm-ti.patch Updated commit msg Jonathan Rajotte Julien, 07/15/2021 04:24 PM
0001-Fix-UST-communication-can-return-EAGAIN.patch (34.1 KB) 0001-Fix-UST-communication-can-return-EAGAIN.patch Updated commit msg Jonathan Rajotte Julien, 07/15/2021 04:24 PM
0001-Fix-ust-segfault-on-lttng-start-on-filter-bytecode-c.patch (5.27 KB) 0001-Fix-ust-segfault-on-lttng-start-on-filter-bytecode-c.patch Jonathan Rajotte Julien, 10/21/2021 04:04 PM
0003-Fix-ust-app-stuck-on-recv-message-during-UST-comm-ti.patch (4.64 KB) 0003-Fix-ust-app-stuck-on-recv-message-during-UST-comm-ti.patch Jonathan Rajotte Julien, 10/21/2021 04:04 PM
0002-Fix-ust-UST-communication-can-return-EAGAIN.patch (37 KB) 0002-Fix-ust-UST-communication-can-return-EAGAIN.patch Jonathan Rajotte Julien, 10/21/2021 04:04 PM
Actions #1

Updated by Jonathan Rajotte Julien almost 3 years ago

  • Description updated (diff)

Hi,

Based on a quick overview, no the patch is not reasonable since lttng-ust does not manipulate the socket options but lttng-sessiond does. Hence the EAGAIN handling should not be done in lttng-ust but in lttng-tools.

Note that a quick lookup at the recv and send man page provided the same info in regards to EAGAIN being possible for socket with configured timeout, which is the case for lttng most of the time (lttcomm_setsockopt_rcv_timeout/lttcomm_setsockopt_snd_timeout in thread_registration_apps):

       EAGAIN or EWOULDBLOCK
              The socket is marked nonblocking and the receive operation
              would block, or a receive timeout had been set and the
              timeout expired before data was received.  POSIX.1 allows
              either error to be returned for this case, and does not
              require these constants to have the same value, so a
              portable application should check for both possibilities.

And please use the bug tracker formatting feature for code block (pre) [1].

[1] https://bugs.lttng.org/help/en/wiki_syntax_textile.html

Still I think you are indeed onto something here.

Actions #2

Updated by Jonathan Rajotte Julien almost 3 years ago

  • Status changed from New to Feedback

Hi Heng Guo,

Did you have a chance to put some effort into this? With lttng 2.13 coming it would be nice to have an external contribution for this bug. Your contribution would be very appreciated. It would be a change from the usual "Dump and Pray for it to be resolved the right way" of Wind River.

You are on the right track but concretely the EAGAIN must be handled on lttng-sessiond side at the ustctl_send_stream_to_ust callsite. On EAGAIN we probably need to consider the app as "unresponsive" and handle it like a communication failure.

Cheers

Actions #4

Updated by Jonathan Rajotte Julien almost 3 years ago

  • File deleted (lttng-tools.log)
Actions #5

Updated by Jonathan Rajotte Julien almost 3 years ago

  • File deleted (lttng-ust.log)
Actions #6

Updated by Heng Guo almost 3 years ago

Thanks Jonathan for your help.

I have a workaround in src/bin/lttng-sessiond/ust-consumer.c and is already verified by our customer.

But this patch is only to resend stream or channel to ust again, so not sure if it is the patch that you expect.

My workaround is to Lttng-tools 2.10, so I merge it to latest master and attached.

Thanks again.
Heng

Actions #7

Updated by Jonathan Rajotte Julien almost 3 years ago

Thanks for the update.

This is indeed a workaround and not a fix we can merge as is. As far as I understand, the workaround provided basicly boil down to doubling the LTTNG_APP_SOCKET_TIMEOUT value for only certain ust app communication. Is your customer aware of that?

And this does not resolve the underlying issue (at least in 2.10) which is that a failure for a single app should not force an error on the whole channel setup.

I guess we'll take it from here.

Cheers

Actions #8

Updated by Heng Guo almost 3 years ago

Yes I have explained too many times about this workaround which is the same to LTTNG_APP_SOCKET_TIMEOUT. But my customer still insist this workaround and reject LTTNG_APP_SOCKET_TIMEOUT.

Heng

Updated by Jonathan Rajotte Julien almost 3 years ago

Hi Heng,

See attached our proposed solution. This solution should, overall, be much more robust than performing a single retrying with the hope that the second communication try does not timeout.

Note that stable 2.10 is not supported anymore and we will not merge those fix upstream in the stable 2.10 branch.

We will be applying a variation of those fixes, and more, for the 2.11, 2.12, 2.13 and master branch. We also performed a lot of work on the testing side that will only be part upstreamed in supported branches.

Cheers

Updated by Jonathan Rajotte Julien almost 3 years ago

Hi,

The previous patches changeset are valid but the commit message is not.

Please use these updated version.

Cheers

Actions #11

Updated by Heng Guo almost 3 years ago

Hi Cheers,

I have forward this official patches to our customer, in order to verify them.
Waiting for test result.

Thanks,
Heng

Actions #12

Updated by Mathieu Desnoyers almost 3 years ago

  • Project changed from LTTng-UST to LTTng-tools

Updated by Jonathan Rajotte Julien over 2 years ago

Hi,

The patchset addressing this issue was released for the following release: 2.11.8 [1], 2.12.6 [2], 2.13.1 [3].

[1] https://lists.lttng.org/pipermail/lttng-dev/2021-October/030059.html
[2] https://lists.lttng.org/pipermail/lttng-dev/2021-October/030060.html
[3] https://lists.lttng.org/pipermail/lttng-dev/2021-October/030061.html

* Fix: ust: app stuck on recv message during UST comm timeout scenario 
* Fix: ust: UST communication can return -EAGAIN 
* Fix: ust: segfault on lttng start on filter bytecode copy 

See attached for the patches backported to 2.10.

Marking as resolved.

Cheers

Actions

Also available in: Atom PDF