Bug #1314
Updated by Jonathan Rajotte Julien almost 3 years ago
Our customer is using lttng and gets this issue. Share the customer info below. ---------------------------------- Customer info : _In our test loops we are constantly seeing this issue. It is always around ntpd clock sync - few examples below._ _Is it possible that ntp is causing the issue? Could it be fixed?_ _It depends on timing, so not with 100% reproducibility. It is always seen during node boot, and ntpd trace is always near._ _Example 1)_ _2020-11-12T18:10:12.882072+00:00 du1 ntpd[9022]: 0.0.0.0 c61c 0c clock_step +2.065932 s_ _2020-11-12T18:10:14.948341+00:00 du1 ntpd[9022]: 0.0.0.0 c615 05 clock_sync_ _2020-11-12T18:10:14.957275+00:00 du1 kernel: [ 96.890291] hrtimer: interrupt took 14734 ns_ _2020-11-12T18:10:15.632289+00:00 du1 ttad: tta wake-up, restarting traces_ _2020-11-12T18:10:15.978724+00:00 du1 ted[4247]: Error[TED]:recreate_session_in_lttng(): lttng_enable_channel, UST create channel failed_ _Example 2)_ _2020-11-12T16:33:39.753979+00:00 du1 ted[4314]: Error[TED]:recreate_session_in_lttng(): lttng_enable_channel, UST create channel failed_ _2020-11-12T16:33:43.507062+00:00 du1 ntpd[9311]: 0.0.0.0 c61c 0c clock_step +3.991664 s_ _2020-11-12T16:33:47.499274+00:00 du1 ntpd[9311]: 0.0.0.0 c615 05 clock_sync_ _Example 3)_ _2020-10-12T06:24:52.747019+00:00 du1 ntpd[9433]: 0.0.0.0 c61c 0c clock_step +4.648429 s_ _2020-10-12T06:24:57.395974+00:00 du1 ntpd[9433]: 0.0.0.0 c615 05 clock_sync_ _2020-10-12T06:24:57.402268+00:00 du1 kernel: [ 100.756555] hrtimer: interrupt took 9625 ns_ _2020-10-12T06:24:58.080346+00:00 du1 ttad: tta wake-up, restarting traces_ _2020-10-12T06:24:58.318945+00:00 du1 ted[4716]: Error[TED]:recreate_session_in_lttng(): lttng_enable_channel, UST create channel failed_ --------------------------- *Lttng_tools: 2.10.11* *Lttng_ust: 2.10.7* I add some debug log in lttng-tools and get the attached log - lttng-tools.log. So because EAGAIN error is got at SOCKET API in lttng-ust function - ustcomm_recv_unix_sock(). <pre> ------- 2021-02-22T19:15:25.832314+00:00 du1 lttng-sessiond: ustctl_send_stream_to_ust, ret=-11, app->sock=964(in ust_consumer_send_stream_to_ust() at ../../../../git/src/bin/lttng-sessiond/ust-consumer.c:431) </pre> ------- I made a patch 0001-lttng-ust-retry-sendmsg-or-recvmsg-when-EAGAIN-is-go.patch in which both sendmsg and recvmsg API are retried if EAGAIN error is got. The issue is not reproduced with it and log is lttng-ust.log I double check the linux kernel source codes and share unix_stream_read_generic() below. EAGAIN is return at timeout, the possible reason could be temporary system busy. <pre> --------------- static int unix_stream_read_generic(struct unix_stream_read_state *state, bool freezable) { ...... last = skb = skb_peek(&sk->sk_receive_queue); last_len = last ? last->len : 0; again: if (skb == NULL) { ...... if (!timeo) { err = -EAGAIN; break; } mutex_unlock(&u->iolock); timeo = unix_stream_data_wait(sk, timeo, last, last_len, freezable); ...... } </pre> ------------- So could lttng expert check if this patch is reasonable?