Bug #1314

lttng_enable_channel fails because of EAGAIN

Added by Heng Guo about 1 month ago. Updated 1 day ago.

Status: Feedback
Priority: Normal
Assignee: -
Target version: -
Start date: 05/06/2021
Due date:
% Done: 0%
Estimated time:
Description

Our customer is using LTTng and has hit this issue. The customer's information is shared below.

Customer info:

In our test loops we are constantly seeing this issue. It always occurs around an ntpd clock sync; a few examples are below.
Is it possible that ntp is causing the issue? Could it be fixed?

It depends on timing, so it is not 100% reproducible.
It is always seen during node boot, and an ntpd trace is always nearby.

Example 1)
2020-11-12T18:10:12.882072+00:00 du1 ntpd[9022]: 0.0.0.0 c61c 0c clock_step +2.065932 s
2020-11-12T18:10:14.948341+00:00 du1 ntpd[9022]: 0.0.0.0 c615 05 clock_sync
2020-11-12T18:10:14.957275+00:00 du1 kernel: [ 96.890291] hrtimer: interrupt took 14734 ns
2020-11-12T18:10:15.632289+00:00 du1 ttad: tta wake-up, restarting traces
2020-11-12T18:10:15.978724+00:00 du1 ted[4247]: Error[TED]:recreate_session_in_lttng(): lttng_enable_channel, UST create channel failed

Example 2)
2020-11-12T16:33:39.753979+00:00 du1 ted[4314]: Error[TED]:recreate_session_in_lttng(): lttng_enable_channel, UST create channel failed
2020-11-12T16:33:43.507062+00:00 du1 ntpd[9311]: 0.0.0.0 c61c 0c clock_step +3.991664 s
2020-11-12T16:33:47.499274+00:00 du1 ntpd[9311]: 0.0.0.0 c615 05 clock_sync

Example 3)
2020-10-12T06:24:52.747019+00:00 du1 ntpd[9433]: 0.0.0.0 c61c 0c clock_step +4.648429 s
2020-10-12T06:24:57.395974+00:00 du1 ntpd[9433]: 0.0.0.0 c615 05 clock_sync
2020-10-12T06:24:57.402268+00:00 du1 kernel: [ 100.756555] hrtimer: interrupt took 9625 ns
2020-10-12T06:24:58.080346+00:00 du1 ttad: tta wake-up, restarting traces
2020-10-12T06:24:58.318945+00:00 du1 ted[4716]: Error[TED]:recreate_session_in_lttng(): lttng_enable_channel, UST create channel failed

lttng-tools: 2.10.11
lttng-ust: 2.10.7

I added some debug logging in lttng-tools and obtained the attached log, lttng-tools.log. It shows that the failure happens because an EAGAIN error is returned by the socket API inside the lttng-ust function ustcomm_recv_unix_sock().

2021-02-22T19:15:25.832314+00:00 du1 lttng-sessiond: ustctl_send_stream_to_ust, ret=-11, app->sock=964(in ust_consumer_send_stream_to_ust() at ../../../../git/src/bin/lttng-sessiond/ust-consumer.c:431)

I made a patch, 0001-lttng-ust-retry-sendmsg-or-recvmsg-when-EAGAIN-is-go.patch, in which both the sendmsg and recvmsg calls are retried when EAGAIN is returned. With the patch applied, the issue is no longer reproduced; the corresponding log is lttng-ust.log.
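
For illustration, a minimal sketch of what such a retry loop could look like, assuming the patch wraps the recvmsg()/sendmsg() calls in lttng-ust's socket helpers (the retry_recvmsg name and the retry budget are my own choices for this sketch, not taken from the patch):

#include <errno.h>
#include <sys/socket.h>

/* Sketch only: retry recvmsg() when the socket's receive timeout
 * expires and it fails with EAGAIN/EWOULDBLOCK. */
static ssize_t retry_recvmsg(int sock, struct msghdr *msg, int flags)
{
        ssize_t ret;
        int attempts = 5;       /* arbitrary budget for this sketch */

        do {
                ret = recvmsg(sock, msg, flags);
        } while (ret < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)
                 && --attempts > 0);

        return ret;
}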

I double-checked the Linux kernel source code and share unix_stream_read_generic() below. EAGAIN is returned when the timeout expires, so the possible cause is that the system is temporarily too busy to deliver data within the timeout.

static int unix_stream_read_generic(struct unix_stream_read_state *state,
                                    bool freezable)
{
......
                /* Peek at the receive queue without dequeuing. */
                last = skb = skb_peek(&sk->sk_receive_queue);
                last_len = last ? last->len : 0;
again:
                if (skb == NULL) {
......
                        /* No data and the timeout budget is spent:
                         * recvmsg() fails with -EAGAIN. */
                        if (!timeo) {
                                err = -EAGAIN;
                                break;
                        }

                        mutex_unlock(&u->iolock);

                        /* Sleep until data arrives or the remaining
                         * timeout elapses. */
                        timeo = unix_stream_data_wait(sk, timeo, last,
                                                      last_len, freezable);
......
}

Could an LTTng expert check whether this patch is reasonable?


#1

Updated by Jonathan Rajotte Julien about 1 month ago

  • Description updated (diff)

Hi,

Based on a quick overview, no, the patch is not reasonable, since lttng-ust does not manipulate the socket options; lttng-sessiond does. Hence the EAGAIN handling should be done not in lttng-ust but in lttng-tools.

Note that a quick look at the recv and send man pages provides the same information regarding EAGAIN being possible for sockets with a configured timeout, which is the case for LTTng most of the time (see lttcomm_setsockopt_rcv_timeout/lttcomm_setsockopt_snd_timeout in thread_registration_apps):

       EAGAIN or EWOULDBLOCK
              The socket is marked nonblocking and the receive operation
              would block, or a receive timeout had been set and the
              timeout expired before data was received.  POSIX.1 allows
              either error to be returned for this case, and does not
              require these constants to have the same value, so a
              portable application should check for both possibilities.
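
For context, such a receive timeout is typically installed with setsockopt(SO_RCVTIMEO). A minimal sketch of the mechanism, assuming a plain Unix socket (the helper name and any timeout value passed to it are illustrative, not what lttng-sessiond actually configures):

#include <sys/socket.h>
#include <sys/time.h>
#include <sys/types.h>

/* After this call, a recvmsg() on sock that waits longer than
 * 'seconds' fails with EAGAIN/EWOULDBLOCK instead of blocking,
 * matching the man page excerpt quoted above. */
static int set_rcv_timeout(int sock, time_t seconds)
{
        struct timeval tv = { .tv_sec = seconds, .tv_usec = 0 };

        return setsockopt(sock, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));
}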

Also, please use the bug tracker's formatting feature (pre) for code blocks [1].

[1] https://bugs.lttng.org/help/en/wiki_syntax_textile.html

Still, I think you are indeed onto something here.

#2

Updated by Jonathan Rajotte Julien 12 days ago

  • Status changed from New to Feedback

Hi Heng Guo,

Did you have a chance to put some effort into this? With lttng 2.13 coming, it would be nice to have an external contribution for this bug; your contribution would be much appreciated. It would be a change from the usual "Dump and Pray for it to be resolved the right way" approach of Wind River.

You are on the right track, but concretely the EAGAIN must be handled on the lttng-sessiond side, at the ustctl_send_stream_to_ust call site. On EAGAIN we probably need to consider the app "unresponsive" and handle it like a communication failure.
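
Purely as a sketch of that idea, not the real lttng-tools code (the call's arguments and the mark_app_unresponsive() helper are hypothetical):

/* Hypothetical call-site fragment, inside the loop that sets up
 * streams for each registered app: */
ret = ustctl_send_stream_to_ust(app->sock, channel_handle, stream);
if (ret < 0) {
        if (ret == -EAGAIN) {
                /* The app missed its socket timeout: consider it
                 * unresponsive and fail only this app instead of
                 * failing the whole channel setup. */
                mark_app_unresponsive(app);     /* hypothetical helper */
                continue;
        }
        goto error;     /* other errors remain fatal */
}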

Cheers

#4

Updated by Jonathan Rajotte Julien 6 days ago

  • File deleted (lttng-tools.log)
#5

Updated by Jonathan Rajotte Julien 6 days ago

  • File deleted (lttng-ust.log)
#6

Updated by Heng Guo 5 days ago

Thanks, Jonathan, for your help.

I have a workaround in src/bin/lttng-sessiond/ust-consumer.c, and it has already been verified by our customer.

However, this patch only resends the stream or channel to UST again, so I am not sure it is the patch you expect.

My workaround is against lttng-tools 2.10, so I merged it onto the latest master and attached it.

Thanks again.
Heng

#7

Updated by Jonathan Rajotte Julien 5 days ago

Thanks for the update.

This is indeed a workaround and not a fix we can merge as is. As far as I understand, the workaround provided basically boils down to doubling the LTTNG_APP_SOCKET_TIMEOUT value for only certain UST app communications. Is your customer aware of that?

And this does not resolve the underlying issue (at least in 2.10), which is that a failure for a single app should not force an error on the whole channel setup.

I guess we'll take it from here.

Cheers

#8

Updated by Heng Guo 1 day ago

Yes, I have explained too many times that this workaround is equivalent to raising LTTNG_APP_SOCKET_TIMEOUT, but my customer still insists on this workaround and rejects LTTNG_APP_SOCKET_TIMEOUT.

Heng
