Bug #1314


lttng_enable_channel fails because of EAGAIN

Added by Heng Guo over 1 year ago. Updated 10 months ago.

Target version:
Start date:
Due date:
% Done:


Estimated time:


Our customer is using lttng and gets this issue. Share the customer info below.

Customer info :

In our test loops we are constantly seeing this issue. It is always around ntpd clock sync - few examples below.
Is it possible that ntp is causing the issue? Could it be fixed?

It depends on timing, so not with 100% reproducibility.
It is always seen during node boot, and ntpd trace is always near.

Example 1)
2020-11-12T18:10:12.882072+00:00 du1 ntpd9022: c61c 0c clock_step +2.065932 s
2020-11-12T18:10:14.948341+00:00 du1 ntpd9022: c615 05 clock_sync
2020-11-12T18:10:14.957275+00:00 du1 kernel: [ 96.890291] hrtimer: interrupt took 14734 ns
2020-11-12T18:10:15.632289+00:00 du1 ttad: tta wake-up, restarting traces
2020-11-12T18:10:15.978724+00:00 du1 ted4247: Error[TED]:recreate_session_in_lttng(): lttng_enable_channel, UST create channel failed

Example 2)
2020-11-12T16:33:39.753979+00:00 du1 ted4314: Error[TED]:recreate_session_in_lttng(): lttng_enable_channel, UST create channel failed
2020-11-12T16:33:43.507062+00:00 du1 ntpd9311: c61c 0c clock_step +3.991664 s
2020-11-12T16:33:47.499274+00:00 du1 ntpd9311: c615 05 clock_sync

Example 3)
2020-10-12T06:24:52.747019+00:00 du1 ntpd9433: c61c 0c clock_step +4.648429 s
2020-10-12T06:24:57.395974+00:00 du1 ntpd9433: c615 05 clock_sync
2020-10-12T06:24:57.402268+00:00 du1 kernel: [ 100.756555] hrtimer: interrupt took 9625 ns
2020-10-12T06:24:58.080346+00:00 du1 ttad: tta wake-up, restarting traces
2020-10-12T06:24:58.318945+00:00 du1 ted4716: Error[TED]:recreate_session_in_lttng(): lttng_enable_channel, UST create channel failed

Lttng_tools: 2.10.11
Lttng_ust: 2.10.7

I add some debug log in lttng-tools and get the attached log - lttng-tools.log. So because EAGAIN error is got at SOCKET API in lttng-ust function - ustcomm_recv_unix_sock().

2021-02-22T19:15:25.832314+00:00 du1 lttng-sessiond: ustctl_send_stream_to_ust, ret=-11, app->sock=964(in ust_consumer_send_stream_to_ust() at ../../../../git/src/bin/lttng-sessiond/ust-consumer.c:431)

I made a patch 0001-lttng-ust-retry-sendmsg-or-recvmsg-when-EAGAIN-is-go.patch in which both sendmsg and recvmsg API are retried if EAGAIN error is got. The issue is not reproduced with it and log is lttng-ust.log

I double check the linux kernel source codes and share unix_stream_read_generic() below. EAGAIN is return at timeout, the possible reason could be temporary system busy.

static int unix_stream_read_generic(struct unix_stream_read_state *state,
                                    bool freezable)
                last = skb = skb_peek(&sk->sk_receive_queue);
                last_len = last ? last->len : 0;
                if (skb == NULL) {
                        if (!timeo) {
                                err = -EAGAIN;


                        timeo = unix_stream_data_wait(sk, timeo, last,
                                                      last_len, freezable);

So could lttng expert check if this patch is reasonable?


0001-lttng-ust-retry-sendmsg-or-recvmsg-when-EAGAIN-is-go.patch (1.98 KB) 0001-lttng-ust-retry-sendmsg-or-recvmsg-when-EAGAIN-is-go.patch Heng Guo, 05/06/2021 10:31 PM
0001-retry-to-send-stream-or-channel-to-ust-at-EAGAIN.patch (3.67 KB) 0001-retry-to-send-stream-or-channel-to-ust-at-EAGAIN.patch Heng Guo, 06/11/2021 05:08 AM
0002-Fix-ust-app-stuck-on-recv-message-during-UST-comm-ti.patch (4.48 KB) 0002-Fix-ust-app-stuck-on-recv-message-during-UST-comm-ti.patch Jonathan Rajotte Julien, 07/14/2021 08:06 PM
0001-Fix-ust-UST-communication-can-return-EAGAIN.patch (34.6 KB) 0001-Fix-ust-UST-communication-can-return-EAGAIN.patch Jonathan Rajotte Julien, 07/14/2021 08:06 PM
0002-Fix-ust-app-stuck-on-recv-message-during-UST-comm-ti.patch (4.48 KB) 0002-Fix-ust-app-stuck-on-recv-message-during-UST-comm-ti.patch Updated commit msg Jonathan Rajotte Julien, 07/15/2021 04:24 PM
0001-Fix-UST-communication-can-return-EAGAIN.patch (34.1 KB) 0001-Fix-UST-communication-can-return-EAGAIN.patch Updated commit msg Jonathan Rajotte Julien, 07/15/2021 04:24 PM
0001-Fix-ust-segfault-on-lttng-start-on-filter-bytecode-c.patch (5.27 KB) 0001-Fix-ust-segfault-on-lttng-start-on-filter-bytecode-c.patch Jonathan Rajotte Julien, 10/21/2021 04:04 PM
0003-Fix-ust-app-stuck-on-recv-message-during-UST-comm-ti.patch (4.64 KB) 0003-Fix-ust-app-stuck-on-recv-message-during-UST-comm-ti.patch Jonathan Rajotte Julien, 10/21/2021 04:04 PM
0002-Fix-ust-UST-communication-can-return-EAGAIN.patch (37 KB) 0002-Fix-ust-UST-communication-can-return-EAGAIN.patch Jonathan Rajotte Julien, 10/21/2021 04:04 PM
Actions #1

Updated by Jonathan Rajotte Julien over 1 year ago

  • Description updated (diff)


Based on a quick overview, no the patch is not reasonable since lttng-ust does not manipulate the socket options but lttng-sessiond does. Hence the EAGAIN handling should not be done in lttng-ust but in lttng-tools.

Note that a quick lookup at the recv and send man page provided the same info in regards to EAGAIN being possible for socket with configured timeout, which is the case for lttng most of the time (lttcomm_setsockopt_rcv_timeout/lttcomm_setsockopt_snd_timeout in thread_registration_apps):

              The socket is marked nonblocking and the receive operation
              would block, or a receive timeout had been set and the
              timeout expired before data was received.  POSIX.1 allows
              either error to be returned for this case, and does not
              require these constants to have the same value, so a
              portable application should check for both possibilities.

And please use the bug tracker formatting feature for code block (pre) [1].


Still I think you are indeed onto something here.

Actions #2

Updated by Jonathan Rajotte Julien about 1 year ago

  • Status changed from New to Feedback

Hi Heng Guo,

Did you have a chance to put some effort into this? With lttng 2.13 coming it would be nice to have an external contribution for this bug. Your contribution would be very appreciated. It would be a change from the usual "Dump and Pray for it to be resolved the right way" of Wind River.

You are on the right track but concretely the EAGAIN must be handled on lttng-sessiond side at the ustctl_send_stream_to_ust callsite. On EAGAIN we probably need to consider the app as "unresponsive" and handle it like a communication failure.


Actions #4

Updated by Jonathan Rajotte Julien about 1 year ago

  • File deleted (lttng-tools.log)
Actions #5

Updated by Jonathan Rajotte Julien about 1 year ago

  • File deleted (lttng-ust.log)
Actions #6

Updated by Heng Guo about 1 year ago

Thanks Jonathan for your help.

I have a workaround in src/bin/lttng-sessiond/ust-consumer.c and is already verified by our customer.

But this patch is only to resend stream or channel to ust again, so not sure if it is the patch that you expect.

My workaround is to Lttng-tools 2.10, so I merge it to latest master and attached.

Thanks again.

Actions #7

Updated by Jonathan Rajotte Julien about 1 year ago

Thanks for the update.

This is indeed a workaround and not a fix we can merge as is. As far as I understand, the workaround provided basicly boil down to doubling the LTTNG_APP_SOCKET_TIMEOUT value for only certain ust app communication. Is your customer aware of that?

And this does not resolve the underlying issue (at least in 2.10) which is that a failure for a single app should not force an error on the whole channel setup.

I guess we'll take it from here.


Actions #8

Updated by Heng Guo about 1 year ago

Yes I have explained too many times about this workaround which is the same to LTTNG_APP_SOCKET_TIMEOUT. But my customer still insist this workaround and reject LTTNG_APP_SOCKET_TIMEOUT.


Updated by Jonathan Rajotte Julien about 1 year ago

Hi Heng,

See attached our proposed solution. This solution should, overall, be much more robust than performing a single retrying with the hope that the second communication try does not timeout.

Note that stable 2.10 is not supported anymore and we will not merge those fix upstream in the stable 2.10 branch.

We will be applying a variation of those fixes, and more, for the 2.11, 2.12, 2.13 and master branch. We also performed a lot of work on the testing side that will only be part upstreamed in supported branches.


Actions #11

Updated by Heng Guo about 1 year ago

Hi Cheers,

I have forward this official patches to our customer, in order to verify them.
Waiting for test result.


Actions #12

Updated by Mathieu Desnoyers about 1 year ago

  • Project changed from LTTng-UST to LTTng-tools

Updated by Jonathan Rajotte Julien 10 months ago


The patchset addressing this issue was released for the following release: 2.11.8 [1], 2.12.6 [2], 2.13.1 [3].


* Fix: ust: app stuck on recv message during UST comm timeout scenario 
* Fix: ust: UST communication can return -EAGAIN 
* Fix: ust: segfault on lttng start on filter bytecode copy 

See attached for the patches backported to 2.10.

Marking as resolved.



Also available in: Atom PDF