Bug #1394
closedAssertion `cmd_ctx.reply_payload.buffer.size >= sizeof(*llm)' failed
0%
Description
We start the tracing with:
lttng create xxx_trace --snapshot --no-output
lttng enable-channel xxx_ktrace -s xxx_trace -k --subbuf-size=256k --num-subbuf=16
lttng enable-event -s xxx_trace -c xxx_ktrace -k sched_switch,irq_handler_entry,irq_handler_exit,softirq_entry,softirq_exit,softirq_raise
lttng enable-event -s xxx_trace -c xxx_ktrace -k --syscall restart_syscall,exit,read,write,open,close,execve,ioctl,fcntl,ptrace,mmap,munmap,fs
lttng start xxx_trace
Then we do reboot application, without rebooting the kernel, we start tracing again with same commands as above, without using lttng-stop, or lttng-destroy.
We always get the following output:
2023-09-21T11:39:49.534934+00:00 Error: Session name already exists
2023-09-21T11:39:49.538108+00:00 Error: Channel xxx_ktrace: Tracing has already been started once (session xxx_trace)
2023-09-21T11:39:49.540223+00:00 Error: Event sched_switch: The command tried to enable an event in a new domain for a session that has already been started once. (channel xxx_ktrace, session xxx_trace)
2023-09-21T11:39:49.540388+00:00 Error: Event irq_handler_entry: The command tried to enable an event in a new domain for a session that has already been started once. (channel xxx_ktrace, session xxx_trace)
2023-09-21T11:39:49.545139+00:00 Error: Event irq_handler_exit: The command tried to enable an event in a new domain for a session that has already been started once. (channel xxx_ktrace, session xxx_trace)
2023-09-21T11:39:49.545881+00:00 Error: Event softirq_entry: The command tried to enable an event in a new domain for a session that has already been started once. (channel xxx_ktrace, session xxx_trace)
2023-09-21T11:39:49.546041+00:00 Error: Event softirq_exit: The command tried to enable an event in a new domain for a session that has already been started once. (channel xxx_ktrace, session xxx_trace)
2023-09-21T11:39:49.546161+00:00 Error: Event softirq_raise: The command tried to enable an event in a new domain for a session that has already been started once. (channel xxx_ktrace, session xxx_trace)
2023-09-21T11:39:49.548267+00:00 Error: Event restart_syscall: The command tried to enable an event in a new domain for a session that has already been started once. (channel xxx_ktrace, session xxx_trace)
2023-09-21T11:39:49.548357+00:00 Error: Event exit: The command tried to enable an event in a new domain for a session that has already been started once. (channel xxx_ktrace, session xxx_trace)
2023-09-21T11:39:49.548458+00:00 Error: Event read: The command tried to enable an event in a new domain for a session that has already been started once. (channel xxx_ktrace, session xxx_trace)
2023-09-21T11:39:49.548549+00:00 Error: Event write: The command tried to enable an event in a new domain for a session that has already been started once. (channel xxx_ktrace, session xxx_trace)
2023-09-21T11:39:49.548634+00:00 Error: Event open: The command tried to enable an event in a new domain for a session that has already been started once. (channel xxx_ktrace, session xxx_trace)
2023-09-21T11:39:49.548720+00:00 Error: Event close: The command tried to enable an event in a new domain for a session that has already been started once. (channel xxx_ktrace, session xxx_trace)
2023-09-21T11:39:49.548804+00:00 Error: Event execve: The command tried to enable an event in a new domain for a session that has already been started once. (channel xxx_ktrace, session xxx_trace)
2023-09-21T11:39:49.548930+00:00 Error: Event ioctl: The command tried to enable an event in a new domain for a session that has already been started once. (channel xxx_ktrace, session xxx_trace)
2023-09-21T11:39:49.549027+00:00 Error: Event fcntl: The command tried to enable an event in a new domain for a session that has already been started once. (channel xxx_ktrace, session xxx_trace)
2023-09-21T11:39:49.549118+00:00 Error: Event ptrace: The command tried to enable an event in a new domain for a session that has already been started once. (channel xxx_ktrace, session xxx_trace)
2023-09-21T11:39:49.549204+00:00 Error: Event mmap: The command tried to enable an event in a new domain for a session that has already been started once. (channel xxx_ktrace, session xxx_trace)
2023-09-21T11:39:49.549289+00:00 Error: Event munmap: The command tried to enable an event in a new domain for a session that has already been started once. (channel xxx_ktrace, session xxx_trace)
2023-09-21T11:39:49.549374+00:00 Error: Event fs: The command tried to enable an event in a new domain for a session that has already been started once. (channel xxx_ktrace, session xxx_trace)
This is expected.
But the problem is that sometime, not always, above errors are followed by this assert error:
2023-09-21T11:39:49.592244+00:00 lttng-sessiond: ../../../../git/src/bin/lttng-sessiond/client.c:2678: thread_manage_clients: Assertion `cmd_ctx.reply_payload.buffer.size >= sizeof(*llm)' failed.
2023-09-21T11:39:49.592258+00:00 Tracing started for session xxx_trace
and consequently lttng-sessiond process aborts.
What is causing assert error at lttng-sessiond/client.c:2678? Can starting same session name without stop, or destroy, lead to this assert error? We did not experience `cmd_ctx.reply_payload.buffer.size >= sizeof(*llm)' asserts with previous versions of lttng.
version info:
lttng-tools 2.13.9
lttng-ust 2.13.5
lttng-ust212 2.12.6
architecture:
x86, Intel Atom 12 cores
OS version:
Linux version 5.10.176
Updated by Kienan Stewart about 1 year ago
Hi Sani,
I've been unable to reproduce the issue you're seeing. Are you at any other point during the tracing running any other lttng commands? Would you be able to share more of, or the complete, sessiond log which has the assertion being triggered?
Do you have other sessions or channels configured (eg. userspace tracing of your application?)
thanks,
kienan
Updated by Sani Rus about 1 year ago
Hi Kienan,
Thank you very much for following on this bug.
Here is additional information from our side:
- yes, there are several application processes trying to create same sessions, and enable same channels, and events. We know that this is not recommended, but it a part of our very complex testing set-up.
- so far the crash happened only once, we will search for the sessiond log file, if it was stored somewhere
- yes, we do have other sessions, and channels configured for user-space tracing
Best regards,
Sani
Updated by Kienan Stewart about 1 year ago
- Status changed from New to On pause
Hi Sani,
my first thought is that it's some kind of race condition when one process is running a command with the session in an unexpected state.
My attempt at a reproducer script runs a number of sub-processes in parallel looping through the commands that you have posted. I also tried removing --no-output
and using lttng snapshot record
, in case it happened while the snapshots are writing to disk.
If you see the crash again, a coredump and/or backtrace would be useful to understand the state that lead to the assertion being triggered.
I'll put the ticket "on pause" until there's more information available.
thanks,
kienan
Updated by Sani Rus almost 1 year ago
Hi Kienan,
Ok, thank you very much for the information provided. I will try to provide more information from the test team.
Best regards,
Sani
Updated by Sani Rus 12 months ago
Hi Kienan,
Here is more information:
- No lttng commands are issued at the time when the lttng-sessiond abort occurs, the abort happens during application processes restart. Only application processes are restarted, nothing else.
- There is an interesting behaviour. When we got the output "Warning: Tracing already started for session xxx_trace", lttng survives, there is no abort in the lttng-sessiond process. But when we got output "Tracing started for session xxx_trace" lttng-sessiond process aborts.
- Interesting is that the message comes even after lttng-sessiond was already supposed to be crashed as assert already fired.
Best regards,
Sani
Updated by Kienan Stewart 10 months ago
- Status changed from On pause to In Progress
Updated by Erica Bugden 10 months ago
- Status changed from In Progress to Feedback
Hello Sani,
We believe we have resolved the issue observed here. The fix is available on lttng-tools master and was also backported to the stable-2.13 branch (stable-2.13 fix)
Feel free to let us know if this resolves the issue for you.
Regards,
Erica
Updated by Kienan Stewart 7 months ago
- Status changed from Feedback to Resolved