Bug #1314

Updated by Jonathan Rajotte Julien almost 3 years ago

Our customer is using LTTng and hit this issue. Sharing the customer's info below.

 ---------------------------------- 
 Customer info: 

 _In our test loops we are constantly seeing this issue. It is always around the ntpd clock sync; a few examples are below._ 
  _Is it possible that ntp is causing the issue? Could it be fixed?_ 

 _It depends on timing, so it is not 100% reproducible. 
 It is always seen during node boot, and an ntpd trace is always nearby._ 

 _Example 1)_ 
  _2020-11-12T18:10:12.882072+00:00 du1 ntpd[9022]: 0.0.0.0 c61c 0c clock_step +2.065932 s_ 
  _2020-11-12T18:10:14.948341+00:00 du1 ntpd[9022]: 0.0.0.0 c615 05 clock_sync_ 
  _2020-11-12T18:10:14.957275+00:00 du1 kernel: [ 96.890291] hrtimer: interrupt took 14734 ns_ 
  _2020-11-12T18:10:15.632289+00:00 du1 ttad: tta wake-up, restarting traces_ 
  _2020-11-12T18:10:15.978724+00:00 du1 ted[4247]: Error[TED]:recreate_session_in_lttng(): lttng_enable_channel, UST create channel failed_ 

 _Example 2)_ 
  _2020-11-12T16:33:39.753979+00:00 du1 ted[4314]: Error[TED]:recreate_session_in_lttng(): lttng_enable_channel, UST create channel failed_ 
  _2020-11-12T16:33:43.507062+00:00 du1 ntpd[9311]: 0.0.0.0 c61c 0c clock_step +3.991664 s_ 
  _2020-11-12T16:33:47.499274+00:00 du1 ntpd[9311]: 0.0.0.0 c615 05 clock_sync_ 

 _Example 3)_ 
  _2020-10-12T06:24:52.747019+00:00 du1 ntpd[9433]: 0.0.0.0 c61c 0c clock_step +4.648429 s_ 
  _2020-10-12T06:24:57.395974+00:00 du1 ntpd[9433]: 0.0.0.0 c615 05 clock_sync_ 
  _2020-10-12T06:24:57.402268+00:00 du1 kernel: [ 100.756555] hrtimer: interrupt took 9625 ns_ 
  _2020-10-12T06:24:58.080346+00:00 du1 ttad: tta wake-up, restarting traces_ 
  _2020-10-12T06:24:58.318945+00:00 du1 ted[4716]: Error[TED]:recreate_session_in_lttng(): lttng_enable_channel, UST create channel failed_ 



 
 --------------------------- 

 *LTTng-tools: 2.10.11* 
 *LTTng-UST: 2.10.7* 

 I added some debug logging in lttng-tools and captured the attached log, lttng-tools.log. It shows that an EAGAIN error (errno 11 on Linux, hence ret=-11) is returned by the socket API in the lttng-ust function ustcomm_recv_unix_sock(): 
 <pre> 
 2021-02-22T19:15:25.832314+00:00 du1 lttng-sessiond: ustctl_send_stream_to_ust, ret=-11, app->sock=964 (in ust_consumer_send_stream_to_ust() at ../../../../git/src/bin/lttng-sessiond/ust-consumer.c:431) 
 </pre> 

 I made a patch, 0001-lttng-ust-retry-sendmsg-or-recvmsg-when-EAGAIN-is-go.patch, in which both the sendmsg and recvmsg calls are retried when they fail with EAGAIN. With the patch applied the issue no longer reproduces; the corresponding log is lttng-ust.log. 

 I double-checked the Linux kernel source; unix_stream_read_generic() is shown below. EAGAIN is returned when the receive timeout expires, so a possible cause is the system being temporarily busy. 
 <pre> 
 static int unix_stream_read_generic(struct unix_stream_read_state *state, 
                                     bool freezable) 
 { 
 ...... 
                 last = skb = skb_peek(&sk->sk_receive_queue); 
                 last_len = last ? last->len : 0; 
 again: 
                 if (skb == NULL) { 
 ...... 
                         if (!timeo) { 
                                 err = -EAGAIN; 
                                 break; 
                         } 

                         mutex_unlock(&u->iolock); 

                         timeo = unix_stream_data_wait(sk, timeo, last, 
                                                       last_len, freezable); 
 ...... 
 } 
 </pre> 

 Could an LTTng expert check whether this patch is reasonable? 
