Bug #501
closedconsumerd defuncts - 2.2 rc1
0%
Description
Test Case Procedure:
1. create a Per-UID session with UST domain type and url=/tmp/<some dir>
2. create CTF channel
3. enable events with filter
4. destroy session
5. create another Per-UID session with UST domain type and url= <IP address and port numbers>
6. create CTF channel with switch timer 1 sec
7. create metadata channel with switch timer 1 sec
8. add VPID context
9. enable events with filter
10. start tracing
step 6 fails often because no UST consumer was detected.
I have attached some logs, and below there is more info about the version.
- ps -e | grep -i lttng
21490 pts/0 00:00:00 lttng-sessiond
21971 pts/0 00:00:01 lttng-consumerd <defunct>
28781 ? 00:00:00 lttng-relayd
- 500c239 (HEAD, tag: v2.2.0-rc1, origin/master, origin/HEAD) Update version to v2.2.0-rc1
- 61411ca Fix: Add missing header for make dist
- ee69440 Fix: add missing declaration for disable-lttng-ust
- af6142c Fix: don't keep vpid, procname, nor patchlevel environment for per-uid buffers
- 0f907de Relayd add_stream command handle tracefile rotation
- fe4477e Move stream file rotation functions to utils
- bdd8ca8 (HEAD, origin/master, origin/HEAD) connect: don't report EACCES
- 256e8c9 connect: don't print error on EPERM
- 76bacff Bump liblttng-ust-ctl lib version major number
- 3469bbb Optimisation: implement callsite hash table in tracepoint.c
- 5da1090 Optimisation: only update added library in tracepoint.c
- 5f73392 Optimisation: only fix pending events once per lazy update
- 9c915ee (tag: v2.2.0-rc1) Version 2.2.0-rc1
- 46c881a Fix: tracepoint instrumentation constructor order issue
- 77ca146 Documentation: update 2.0 to 2.x
- 60302ad Add demo test back as an example
- d107390 (HEAD, origin/master, origin/HEAD) Add tab to output in order to allow easy nesting of tables.
- f1e42e5 Remove urcu-api-list.sh from dist tarball
- 10f55ae Add urcu-api-list.sh script
- 7e5b9a4 list: implement cds_list_for_each_safe()
- 108a92e Fix: tests/api.h use cpuset.h
- b3231c1 Fix hurd-i386: move cpuset tests outside of sched_setaffinity conditional
- 2953b50 Fix tests: finer-grained use of CPU_SET, CPU_ZERO and cpu_set_t
- fe1f986 Test for CPU_SET
- 63b495d Fix build on architectures with HAVE_SCHED_GETCPU but without HAVE_SYSCONF
- 957be43 README: document that Clang 3.0 (based on LLVM 3.0) is supported
- 79a8e6c clang: silence "unused expression result" warning
Files
Updated by Mathieu Desnoyers over 11 years ago
- Project changed from LTTng to LTTng-tools
Updated by Mathieu Desnoyers over 11 years ago
- Description updated (diff)
- Status changed from New to Feedback
- Assignee set to Mathieu Desnoyers
- Target version set to 2.2
Please check if the following commits fix your 2 issues (or any of the 2):
lttng-tools:
2b8f875 Cleanup: Use own mutex within timer setup/teardown
f2ad556 Fix: channel management thread should hold a refcount
fc64324 Fix: move metadata cache setup before ownership passing
4419b4f Fix: consumer metadata switch timer error handling
lttng-ust:
64bf51a Fix: add internal mutex for timer
Thanks,
Mathieu
Updated by Amer Alhalabi over 11 years ago
- File sessiond.log sessiond.log added
i Mathieu,
I tried the new commits, but still having the same problem.
I have collected some logs for you (attached).
And for some reason now I see 3 relayds.
SC-1:~/temp # lttng-sessiond -vvv --verbose-consumer &> /tmp/sessiond.log &
[1] 23690
SC-1:~/temp # ps -e | grep -i lttng
23690 pts/0 00:00:00 lttng-sessiond
SC-1:~/temp # ps -e | grep -i lttng
23690 pts/0 00:00:00 lttng-sessiond
SC-1:~/temp # ps -e | grep -i lttng
23690 pts/0 00:00:00 lttng-sessiond
24167 pts/0 00:00:00 lttng-consumerd
24329 ? 00:00:00 lttng-relayd
25197 ? 00:00:00 lttng-relayd
25916 ? 00:00:00 lttng-relayd
SC-1:~/temp # ps -e | grep -i lttng
23690 pts/0 00:00:00 lttng-sessiond
24167 pts/0 00:00:00 lttng-consumerd <defunct>
/Amer
Updated by Mathieu Desnoyers over 11 years ago
Can you try with the following commits ?
ea88ca2 Fix: consumerd metadata channel/cache/timer races
a0cbdd2 Fix: consumerd channel destroy race
a500c25 Cleanup: document RCU read-side lock better
f96e454 consumer relayd interaction: fix segfaults
If it does not fix your issue, I will need more info:
- detailed script triggering your issue,
- if possible, a backtrace of all threads within the defunct consumerd.
Thanks,
Mathieu
Updated by Amer Alhalabi over 11 years ago
- File sessiond3.log sessiond3.log added
Hi Mathieu,
I confirm that those commits fix the consumerd defuncts issue. Thanks!
one minor tiny issue: at session destroy, I am getting the following error:
"Error: UST consumer close metadata 9 not found"
I have attached the log.
Thanks,
Amer
Updated by Mathieu Desnoyers over 11 years ago
Please provide the script that triggers this issue.
I'm able to generate those errors if I kill the lttng-relayd by hand before issuing the destroy command, but in that case the error is expected: the remote end closed the connection.
However, if you can trigger this error in "normal" non-faulty scenario, I want to know how, because it may hide something important.
Thanks,
Mathieu
Updated by David Goulet over 11 years ago
Looking at the logs, the relayd is closed before the destroy command.
DEBUG1 [31281/31285]: Consumer set delete flag on stream by idx 2 (in update_endpoint_status_by_netidx() at consumer.c:351)
[...]
DEBUG3 [31281/31294]: Relayd closing socket 20 (in relayd_close() at relayd.c:386)
DEBUG3 [31281/31294]: Relayd closing socket 23 (in relayd_close() at relayd.c:386)
[...]
DEBUG1 [30806/30810]: Processing client command 9 (in process_client_msg() at main.c:2358) <-- DESTROY command
[...]
Error: UST consumer close metadata 9 not found
So this is expected behavior. Maybe it should be a DBG instead of an ERR since this is expected in at least one use case?
Updated by Mathieu Desnoyers over 11 years ago
- Status changed from Feedback to Resolved
10:43 < Compudj> since from the POV of the sessiond, it's really an error, no ?
10:43 < Compudj> the relayd went out improperly
10:44 < dgoulet> well not if the relayd dies
10:44 < dgoulet> ah right
10:44 < dgoulet> improper shutdown, I agree, that should indicate an error
Updated by Amer Alhalabi over 11 years ago
Hi guys!
I agree with both of you. It should be an error.
The TC kills relayd before destroy, so that error is expected.
We can be put this issue to bed.
Thank you guys for your support.
Amer