Bug #1314
closedlttng_enable_channel fails because of EAGAIN
0%
Description
Our customer is using lttng and gets this issue. Share the customer info below.
Customer info :
In our test loops we are constantly seeing this issue. It is always around ntpd clock sync - few examples below.
Is it possible that ntp is causing the issue? Could it be fixed?
It depends on timing, so not with 100% reproducibility.
It is always seen during node boot, and ntpd trace is always near.
Example 1)
2020-11-12T18:10:12.882072+00:00 du1 ntpd9022: 0.0.0.0 c61c 0c clock_step +2.065932 s
2020-11-12T18:10:14.948341+00:00 du1 ntpd9022: 0.0.0.0 c615 05 clock_sync
2020-11-12T18:10:14.957275+00:00 du1 kernel: [ 96.890291] hrtimer: interrupt took 14734 ns
2020-11-12T18:10:15.632289+00:00 du1 ttad: tta wake-up, restarting traces
2020-11-12T18:10:15.978724+00:00 du1 ted4247: Error[TED]:recreate_session_in_lttng(): lttng_enable_channel, UST create channel failed
Example 2)
2020-11-12T16:33:39.753979+00:00 du1 ted4314: Error[TED]:recreate_session_in_lttng(): lttng_enable_channel, UST create channel failed
2020-11-12T16:33:43.507062+00:00 du1 ntpd9311: 0.0.0.0 c61c 0c clock_step +3.991664 s
2020-11-12T16:33:47.499274+00:00 du1 ntpd9311: 0.0.0.0 c615 05 clock_sync
Example 3)
2020-10-12T06:24:52.747019+00:00 du1 ntpd9433: 0.0.0.0 c61c 0c clock_step +4.648429 s
2020-10-12T06:24:57.395974+00:00 du1 ntpd9433: 0.0.0.0 c615 05 clock_sync
2020-10-12T06:24:57.402268+00:00 du1 kernel: [ 100.756555] hrtimer: interrupt took 9625 ns
2020-10-12T06:24:58.080346+00:00 du1 ttad: tta wake-up, restarting traces
2020-10-12T06:24:58.318945+00:00 du1 ted4716: Error[TED]:recreate_session_in_lttng(): lttng_enable_channel, UST create channel failed
Lttng_tools: 2.10.11
Lttng_ust: 2.10.7
I add some debug log in lttng-tools and get the attached log - lttng-tools.log. So because EAGAIN error is got at SOCKET API in lttng-ust function - ustcomm_recv_unix_sock().
2021-02-22T19:15:25.832314+00:00 du1 lttng-sessiond: ustctl_send_stream_to_ust, ret=-11, app->sock=964(in ust_consumer_send_stream_to_ust() at ../../../../git/src/bin/lttng-sessiond/ust-consumer.c:431)
I made a patch 0001-lttng-ust-retry-sendmsg-or-recvmsg-when-EAGAIN-is-go.patch in which both sendmsg and recvmsg API are retried if EAGAIN error is got. The issue is not reproduced with it and log is lttng-ust.log
I double check the linux kernel source codes and share unix_stream_read_generic() below. EAGAIN is return at timeout, the possible reason could be temporary system busy.
static int unix_stream_read_generic(struct unix_stream_read_state *state, bool freezable) { ...... last = skb = skb_peek(&sk->sk_receive_queue); last_len = last ? last->len : 0; again: if (skb == NULL) { ...... if (!timeo) { err = -EAGAIN; break; } mutex_unlock(&u->iolock); timeo = unix_stream_data_wait(sk, timeo, last, last_len, freezable); ...... }
So could lttng expert check if this patch is reasonable?
Files