Bug #892
Status: closed
lttng commands stuck when option --set-url is used to send the traces over network
Description
lttng commands get stuck when the --set-url option is used to send traces over the network. We suspect that the relay daemon is unable to process requests from the session daemon. After we restarted the relay daemon, we have not seen the problem again.
I have a few questions regarding this:
- Does the relay daemon started on the remote machine have a limit on the number of connections it can handle?
- Does the relay daemon clean up the connection after an unclean crash of the target on which the session daemon is running?
- The netstat output shows queued packets on the relay daemon side; interestingly, there are no lttng sessions on the target.
The netstat output is as follows:
tcp 0 0 0.0.0.0:5342 0.0.0.0:* LISTEN
tcp 32 0 137.58.215.17:5342 10.74.59.170:36740 ESTABLISHED
tcp 32 0 137.58.215.17:5342 10.74.58.20:40543 ESTABLISHED
tcp 32 0 137.58.215.17:5342 10.74.59.170:43537 ESTABLISHED
tcp 32 0 137.58.215.17:5342 10.74.58.20:43347 ESTABLISHED
tcp 32 0 137.58.215.17:5342 10.74.58.20:45642 ESTABLISHED
tcp 32 0 137.58.215.17:5342 10.74.58.20:45433 ESTABLISHED
tcp 33 0 137.58.215.17:5342 10.74.58.20:57832 CLOSE_WAIT
tcp 32 0 137.58.215.17:5342 10.67.29.116:53648 ESTABLISHED
tcp 32 0 137.58.215.17:5342 10.74.58.20:54402 ESTABLISHED
tcp 32 0 137.58.215.17:5342 10.74.59.170:55397 ESTABLISHED
tcp 32 0 137.58.215.17:5342 10.74.59.170:34960 ESTABLISHED
tcp 32 0 137.58.215.17:5342 10.74.58.20:37913 ESTABLISHED
tcp 32 0 137.58.215.17:5342 10.74.59.170:43167 ESTABLISHED
tcp 32 0 137.58.215.17:5342 10.74.59.170:51673 ESTABLISHED
tcp 32 0 137.58.215.17:5342 10.74.58.20:48492 ESTABLISHED
tcp 32 0 137.58.215.17:5342 172.31.98.221:39518 ESTABLISHED
tcp 32 0 137.58.215.17:5342 10.74.59.170:47521 ESTABLISHED
tcp 32 0 137.58.215.17:5342 10.74.59.170:51916 ESTABLISHED
tcp 32 0 137.58.215.17:5342 10.74.58.20:48219 ESTABLISHED
tcp 33 0 137.58.215.17:5342 10.74.58.20:44167 CLOSE_WAIT
tcp 32 0 137.58.215.17:5342 10.74.59.170:47477 ESTABLISHED
tcp 32 0 137.58.215.17:5342 137.58.191.89:34615 ESTABLISHED
tcp 32 0 137.58.215.17:5342 10.67.30.52:50288 ESTABLISHED
tcp 32 0 137.58.215.17:5342 10.74.59.170:55543 ESTABLISHED
tcp 89 0 137.58.215.17:5342 10.74.58.20:37882 CLOSE_WAIT
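A listing like the one above can be summarized with a short script. The following is only a minimal sketch, assuming the six-column `netstat -tn` style layout shown here (protocol, Recv-Q, Send-Q, local address, remote address, state). The function name `summarize` and the `SAMPLE` excerpt are illustrative, not part of LTTng. It counts connections per TCP state on the relay daemon port and lists the peers whose Recv-Q is non-zero, meaning data has arrived that the application has not read yet.

```python
from collections import Counter

# Hypothetical excerpt in the same layout as the netstat output above.
SAMPLE = """\
tcp 0 0 0.0.0.0:5342 0.0.0.0:* LISTEN
tcp 32 0 137.58.215.17:5342 10.74.59.170:36740 ESTABLISHED
tcp 33 0 137.58.215.17:5342 10.74.58.20:57832 CLOSE_WAIT
tcp 89 0 137.58.215.17:5342 10.74.58.20:37882 CLOSE_WAIT
"""


def summarize(netstat_text, port=5342):
    """Count connections per state and collect peers with unread data."""
    states = Counter()
    pending = []
    for line in netstat_text.splitlines():
        fields = line.split()
        # Skip headers and anything that is not a six-column TCP line.
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        _proto, recv_q, _send_q, local, remote, state = fields[:6]
        if not local.endswith(f":{port}"):
            continue
        states[state] += 1
        # Recv-Q > 0 on a connected socket: bytes received but not yet read.
        if state != "LISTEN" and int(recv_q) > 0:
            pending.append((remote, int(recv_q), state))
    return states, pending


if __name__ == "__main__":
    states, pending = summarize(SAMPLE)
    print(dict(states))
    for remote, recv_q, state in pending:
        print(f"{remote}: {recv_q} bytes unread ({state})")
```

Sockets in CLOSE_WAIT are ones the remote side has already closed while the local process has not yet closed its end, so a growing number of them alongside unread Recv-Q data suggests the daemon is no longer servicing those connections.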
- When the relay daemon is in this state, we are unable to create new sessions that send traces to the same daemon, and lttng commands get stuck.
In our case, the relay daemon was started a few days ago. It was working fine, collecting traces from sessions on different machines, and somehow it went into this "bad" state.
- Is this because the relay daemon had been running for a few days?
We are using lttng version 2.6:
- which lttng
/usr/bin/lttng
- lttng --version
lttng (LTTng Trace Control) 2.6.0 - Gaia - v2.6.0
Could you please find the root cause of this issue and a workaround?
Updated by Jérémie Galarneau over 9 years ago
- Assignee set to Jonathan Rajotte Julien
Updated by Jonathan Rajotte Julien over 9 years ago
- Status changed from New to Feedback
- Priority changed from Critical to Normal
Hi Vinay,
There is a lot going on here.
Were you able to reproduce it on demand? If so, could you provide us with your methodology so we can investigate further?
- Does the relay daemon started on the remote machine have a limit on the number of connections it can handle?
There is no hard-coded limit as far as I know.
- Does the relay daemon clean up the connection after an unclean crash of the target on which the session daemon is running?
It is supposed to, but as with anything, bugs can happen.
- Is this because the relay daemon had been running for a few days?
Time should not be a variable, but anything can be the source of a bug.
We are currently trying to consolidate the sessiond <-> relayd <-> client link, as multiple issues were reported.
In the meantime, the best you can do is come up with a reproduction scenario.
Thanks
Jonathan
Updated by vinay kambam over 9 years ago
- File relayd-logs.zip added
Hi Jonathan,
We are unable to reproduce it on demand. There are no particular steps to reproduce it, but I have the logs from the relay daemon, which was running in verbose mode when this issue occurred.
Whenever we faced this issue, we observed that the relay daemon had been running for a few days and that the target on which the session daemon was started had been subjected to unclean crashes/restarts.
Thanks,
Vinay
Updated by Jonathan Rajotte Julien over 9 years ago
- Status changed from Feedback to In Progress
Hi Vinay,
As I said, we are currently putting a lot of effort into hardening the relayd and sessiond for over-the-network tracing.
You can check the evolution of this work here:
https://github.com/compudj/lttng-tools-dev/tree/staging-test-fixes-v2
This is still a work in progress that will most probably be applied on lttng stable 2.7.
I will keep you updated on this.
Thanks
Updated by Jérémie Galarneau over 9 years ago
- Assignee changed from Jonathan Rajotte Julien to Mathieu Desnoyers
Updated by KRANTHI KUMAR ADARI over 9 years ago
Hi Jérémie,
Do you have any new updates on this bug? We can provide test support if you want.
We can reproduce the fault with repeated restarts (>40) of the trace-generating system towards the same relayd on the remote host.
BR
Kranthi.
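When trying to reproduce a hang like this, it can help to run each lttng command under a watchdog so that a stuck command is reported instead of blocking the test loop. The following is a minimal sketch only. The helper name `run_with_timeout`, the session name `probe`, and the host `relayd-host` are assumptions for illustration, not part of LTTng or of the reporters' setup.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd; return its exit code, or None if it did not finish in time."""
    try:
        result = subprocess.run(
            cmd,
            timeout=timeout_s,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising TimeoutExpired.
        return None
    return result.returncode


# Hypothetical usage against a relay daemon (session and host names are assumed):
#   rc = run_with_timeout(
#       ["lttng", "create", "probe", "--set-url=net://relayd-host"], 30)
#   if rc is None:
#       print("lttng create appears stuck", file=sys.stderr)
```

A return value of None only says the command exceeded the chosen timeout; whether that corresponds to the relay daemon state described in this report still has to be confirmed on the relayd side, for example from its verbose logs.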
Updated by Jérémie Galarneau over 9 years ago
A number of fixes addressing stability problems of the relay daemon have been merged in master and stable-2.7. Can you try to reproduce on those branches?
Updated by Jonathan Rajotte Julien about 9 years ago
- Status changed from In Progress to Feedback
Hi Kranthi,
Well, if you can provide us with a test case, it is always appreciated.
Were you able to run these tests on master and the 2.7 branch?
Thanks
Updated by KRANTHI KUMAR ADARI about 9 years ago
Hi Jonathan,
We have tried on master and the problem is fixed :-)
Thanks a lot.
BR
Kranthi.
Updated by Jérémie Galarneau about 9 years ago
- Status changed from Feedback to Resolved