Bug #892

closed

lttng commands stuck when option --set-url is used to send the traces over network

Added by vinay kambam over 9 years ago. Updated about 9 years ago.

Status: Resolved
Priority: Normal
Target version:
Start date: 07/07/2015
Due date:
% Done: 0%
Estimated time:

Description

lttng commands get stuck when the --set-url option is used to send traces over the network. We suspect that the relay daemon is unable to process requests from the session daemon. After we restarted the relay daemon, we have not seen the problem again.

I have a few questions regarding this:
- Does the relay daemon started on the remote machine have a limit on the number of connections it can handle?
- Does the relay daemon clean up the connection after an unordered crash of the target on which the session daemon is running?
- The netstat output shows queued packets on the relay daemon side; interestingly, there are no lttng sessions on the target.

The netstat output is as follows:

tcp 0 0 0.0.0.0:5342 0.0.0.0:* LISTEN
tcp 32 0 137.58.215.17:5342 10.74.59.170:36740 ESTABLISHED
tcp 32 0 137.58.215.17:5342 10.74.58.20:40543 ESTABLISHED
tcp 32 0 137.58.215.17:5342 10.74.59.170:43537 ESTABLISHED
tcp 32 0 137.58.215.17:5342 10.74.58.20:43347 ESTABLISHED
tcp 32 0 137.58.215.17:5342 10.74.58.20:45642 ESTABLISHED
tcp 32 0 137.58.215.17:5342 10.74.58.20:45433 ESTABLISHED
tcp 33 0 137.58.215.17:5342 10.74.58.20:57832 CLOSE_WAIT
tcp 32 0 137.58.215.17:5342 10.67.29.116:53648 ESTABLISHED
tcp 32 0 137.58.215.17:5342 10.74.58.20:54402 ESTABLISHED
tcp 32 0 137.58.215.17:5342 10.74.59.170:55397 ESTABLISHED
tcp 32 0 137.58.215.17:5342 10.74.59.170:34960 ESTABLISHED
tcp 32 0 137.58.215.17:5342 10.74.58.20:37913 ESTABLISHED
tcp 32 0 137.58.215.17:5342 10.74.59.170:43167 ESTABLISHED
tcp 32 0 137.58.215.17:5342 10.74.59.170:51673 ESTABLISHED
tcp 32 0 137.58.215.17:5342 10.74.58.20:48492 ESTABLISHED
tcp 32 0 137.58.215.17:5342 172.31.98.221:39518 ESTABLISHED
tcp 32 0 137.58.215.17:5342 10.74.59.170:47521 ESTABLISHED
tcp 32 0 137.58.215.17:5342 10.74.59.170:51916 ESTABLISHED
tcp 32 0 137.58.215.17:5342 10.74.58.20:48219 ESTABLISHED
tcp 33 0 137.58.215.17:5342 10.74.58.20:44167 CLOSE_WAIT
tcp 32 0 137.58.215.17:5342 10.74.59.170:47477 ESTABLISHED
tcp 32 0 137.58.215.17:5342 137.58.191.89:34615 ESTABLISHED
tcp 32 0 137.58.215.17:5342 10.67.30.52:50288 ESTABLISHED
tcp 32 0 137.58.215.17:5342 10.74.59.170:55543 ESTABLISHED
tcp 89 0 137.58.215.17:5342 10.74.58.20:37882 CLOSE_WAIT
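
For readers triaging a similar capture: in netstat's TCP listing the second column is Recv-Q, the number of bytes received by the kernel that the owning process has not yet read. A non-zero Recv-Q on many connections at once is consistent with (though not proof of) a relay daemon that has stopped servicing its sockets. The following is a minimal Python sketch for summarizing such a capture. The function names and the `SAMPLE` text are illustrative and not part of LTTng; the sample reuses a few rows from the output above, and the sketch assumes the six-column layout shown there.

```python
from collections import Counter

# A few rows copied from the netstat capture in this report (illustrative subset).
SAMPLE = """\
tcp 0 0 0.0.0.0:5342 0.0.0.0:* LISTEN
tcp 32 0 137.58.215.17:5342 10.74.59.170:36740 ESTABLISHED
tcp 33 0 137.58.215.17:5342 10.74.58.20:57832 CLOSE_WAIT
tcp 89 0 137.58.215.17:5342 10.74.58.20:37882 CLOSE_WAIT
"""


def parse_netstat(text):
    """Parse lines of the form: proto recv-q send-q local remote state."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Skip headers, blank lines, and anything that is not a TCP row.
        if len(fields) != 6 or not fields[0].startswith("tcp"):
            continue
        _proto, recv_q, send_q, local, remote, state = fields
        rows.append({
            "recv_q": int(recv_q),
            "send_q": int(send_q),
            "local": local,
            # Drop the port; keep only the peer address.
            "remote_host": remote.rsplit(":", 1)[0],
            "state": state,
        })
    return rows


def summarize_unread(rows):
    """Count connections with unread data, grouped by state and by peer."""
    unread = [r for r in rows if r["recv_q"] > 0]
    by_state = Counter(r["state"] for r in unread)
    by_host = Counter(r["remote_host"] for r in unread)
    return by_state, by_host


if __name__ == "__main__":
    by_state, by_host = summarize_unread(parse_netstat(SAMPLE))
    print("unread data by state:", dict(by_state))
    print("unread data by peer:", dict(by_host))
```

Grouping by peer makes it easier to see whether the stale connections come from the targets that crashed or restarted.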

- When the relay daemon is in this state, we are unable to create new sessions that send traces to the same daemon, and lttng commands get stuck.

In our case, the relay daemon was started a few days ago. It was working fine, collecting traces from sessions on different machines, and somehow it got into this "bad" state.

- Is this because the relay daemon had been running for a few days?

We are using lttng version 2.6:

# which lttng
/usr/bin/lttng
# lttng --version
lttng (LTTng Trace Control) 2.6.0 - Gaia - v2.6.0

Could you please find the root cause of this issue and suggest a workaround?


Files

relayd-logs.zip (21.5 KB): logs of the relay daemon run in verbose mode when the issue occurred. vinay kambam, 08/27/2015 05:15 AM
Actions #1

Updated by Jérémie Galarneau over 9 years ago

  • Assignee set to Jonathan Rajotte Julien
Actions #2

Updated by Jonathan Rajotte Julien over 9 years ago

  • Status changed from New to Feedback
  • Priority changed from Critical to Normal

Hi Vinay,

There is a lot going on here.

Were you able to reproduce it on demand? If so, could you describe your methodology so we can investigate further?


- Does the relay daemon started on the remote machine have a limit on the number of connections it can handle?

There is no hard-coded limit as far as I know.


- Does the relay daemon clean up the connection after an unordered crash of the target on which the session daemon is running?

It is supposed to, but as with anything, bugs can happen.


- Is this because the relay daemon had been running for a few days?

Uptime should not be a factor, but anything can be the source of a bug.

We are currently trying to consolidate the sessiond <-> relayd <-> client link, as multiple issues were reported.

In the meantime, the best you can do is come up with a reproduction scenario.

Thanks
Jonathan

Actions #3

Updated by vinay kambam over 9 years ago

Hi Jonathan,

We are unable to reproduce it on demand. There are no particular steps to reproduce it, but I have the logs from the relay daemon, run in verbose mode, from when this issue occurred.

Whenever we faced this issue, we observed that the relay daemon had been running for a few days and that the target on which the session daemon was started had been subjected to unordered crashes/restarts.

Thanks,
Vinay

Actions #4

Updated by Jonathan Rajotte Julien over 9 years ago

  • Status changed from Feedback to In Progress

Hi Vinay,

As I said, we are currently putting a lot of effort into hardening the relayd and sessiond for over-the-network tracing.

You can check the evolution of this work here:

https://github.com/compudj/lttng-tools-dev/tree/staging-test-fixes-v2

This is still a work in progress that will most probably be applied to lttng stable 2.7.

I will keep you updated on this.

Thanks

Actions #5

Updated by Jérémie Galarneau over 9 years ago

  • Assignee changed from Jonathan Rajotte Julien to Mathieu Desnoyers
Actions #6

Updated by KRANTHI KUMAR ADARI over 9 years ago

Hi Jérémie,

Do you have any new updates on this bug? We can provide test support if you want.
We can reproduce the fault with repeated restarts (>40) of the trace-generating system towards the same relayd on the remote host.

BR
Kranthi.

Actions #7

Updated by Jérémie Galarneau over 9 years ago

A number of fixes addressing stability problems of the relay daemon have been merged in master and stable-2.7. Can you try to reproduce on those branches?

Actions #8

Updated by Jonathan Rajotte Julien about 9 years ago

  • Status changed from In Progress to Feedback

Hi Kranthi,

If you can provide us with a test case, it is always appreciated.

Were you able to run these tests on master and the 2.7 branch?

Thanks

Actions #9

Updated by KRANTHI KUMAR ADARI about 9 years ago

Hi Jonathan,

We have tried on master and the problem is fixed :-)
Thanks a lot.

BR
Kranthi.

Actions #10

Updated by Jérémie Galarneau about 9 years ago

  • Status changed from Feedback to Resolved