Project

General

Profile

Actions

Bug #1394

closed

Assertion `cmd_ctx.reply_payload.buffer.size >= sizeof(*llm)' failed

Added by Sani Rus 8 months ago. Updated 3 months ago.

Status:
Resolved
Priority:
Normal
Assignee:
-
Target version:
-
Start date:
10/10/2023
Due date:
% Done:

0%

Estimated time:

Description

We start the tracing with:

lttng create xxx_trace --snapshot --no-output
lttng enable-channel xxx_ktrace -s xxx_trace -k --subbuf-size=256k --num-subbuf=16
lttng enable-event -s xxx_trace -c xxx_ktrace -k sched_switch,irq_handler_entry,irq_handler_exit,softirq_entry,softirq_exit,softirq_raise
lttng enable-event -s xxx_trace -c xxx_ktrace -k --syscall restart_syscall,exit,read,write,open,close,execve,ioctl,fcntl,ptrace,mmap,munmap,fs
lttng start xxx_trace

Then we do reboot application, without rebooting the kernel, we start tracing again with same commands as above, without using lttng-stop, or lttng-destroy.

We always get the following output:

2023-09-21T11:39:49.534934+00:00 Error: Session name already exists
2023-09-21T11:39:49.538108+00:00 Error: Channel xxx_ktrace: Tracing has already been started once (session xxx_trace)
2023-09-21T11:39:49.540223+00:00 Error: Event sched_switch: The command tried to enable an event in a new domain for a session that has already been started once. (channel xxx_ktrace, session xxx_trace)
2023-09-21T11:39:49.540388+00:00 Error: Event irq_handler_entry: The command tried to enable an event in a new domain for a session that has already been started once. (channel xxx_ktrace, session xxx_trace)
2023-09-21T11:39:49.545139+00:00 Error: Event irq_handler_exit: The command tried to enable an event in a new domain for a session that has already been started once. (channel xxx_ktrace, session xxx_trace)
2023-09-21T11:39:49.545881+00:00 Error: Event softirq_entry: The command tried to enable an event in a new domain for a session that has already been started once. (channel xxx_ktrace, session xxx_trace)
2023-09-21T11:39:49.546041+00:00 Error: Event softirq_exit: The command tried to enable an event in a new domain for a session that has already been started once. (channel xxx_ktrace, session xxx_trace)
2023-09-21T11:39:49.546161+00:00 Error: Event softirq_raise: The command tried to enable an event in a new domain for a session that has already been started once. (channel xxx_ktrace, session xxx_trace)
2023-09-21T11:39:49.548267+00:00 Error: Event restart_syscall: The command tried to enable an event in a new domain for a session that has already been started once. (channel xxx_ktrace, session xxx_trace)
2023-09-21T11:39:49.548357+00:00 Error: Event exit: The command tried to enable an event in a new domain for a session that has already been started once. (channel xxx_ktrace, session xxx_trace)
2023-09-21T11:39:49.548458+00:00 Error: Event read: The command tried to enable an event in a new domain for a session that has already been started once. (channel xxx_ktrace, session xxx_trace)
2023-09-21T11:39:49.548549+00:00 Error: Event write: The command tried to enable an event in a new domain for a session that has already been started once. (channel xxx_ktrace, session xxx_trace)
2023-09-21T11:39:49.548634+00:00 Error: Event open: The command tried to enable an event in a new domain for a session that has already been started once. (channel xxx_ktrace, session xxx_trace)
2023-09-21T11:39:49.548720+00:00 Error: Event close: The command tried to enable an event in a new domain for a session that has already been started once. (channel xxx_ktrace, session xxx_trace)
2023-09-21T11:39:49.548804+00:00 Error: Event execve: The command tried to enable an event in a new domain for a session that has already been started once. (channel xxx_ktrace, session xxx_trace)
2023-09-21T11:39:49.548930+00:00 Error: Event ioctl: The command tried to enable an event in a new domain for a session that has already been started once. (channel xxx_ktrace, session xxx_trace)
2023-09-21T11:39:49.549027+00:00 Error: Event fcntl: The command tried to enable an event in a new domain for a session that has already been started once. (channel xxx_ktrace, session xxx_trace)
2023-09-21T11:39:49.549118+00:00 Error: Event ptrace: The command tried to enable an event in a new domain for a session that has already been started once. (channel xxx_ktrace, session xxx_trace)
2023-09-21T11:39:49.549204+00:00 Error: Event mmap: The command tried to enable an event in a new domain for a session that has already been started once. (channel xxx_ktrace, session xxx_trace)
2023-09-21T11:39:49.549289+00:00 Error: Event munmap: The command tried to enable an event in a new domain for a session that has already been started once. (channel xxx_ktrace, session xxx_trace)
2023-09-21T11:39:49.549374+00:00 Error: Event fs: The command tried to enable an event in a new domain for a session that has already been started once. (channel xxx_ktrace, session xxx_trace)

This is expected.

But the problem is that sometime, not always, above errors are followed by this assert error:

2023-09-21T11:39:49.592244+00:00 lttng-sessiond: ../../../../git/src/bin/lttng-sessiond/client.c:2678: thread_manage_clients: Assertion `cmd_ctx.reply_payload.buffer.size >= sizeof(*llm)' failed.
2023-09-21T11:39:49.592258+00:00 Tracing started for session xxx_trace

and consequently lttng-sessiond process aborts.

What is causing assert error at lttng-sessiond/client.c:2678? Can starting same session name without stop, or destroy, lead to this assert error? We did not experience `cmd_ctx.reply_payload.buffer.size >= sizeof(*llm)' asserts with previous versions of lttng.

version info:

lttng-tools 2.13.9
lttng-ust 2.13.5
lttng-ust212 2.12.6

architecture:
x86, Intel Atom 12 cores

OS version:
Linux version 5.10.176

Actions #1

Updated by Kienan Stewart 8 months ago

Hi Sani,

I've been unable to reproduce the issue you're seeing. Are you at any other point during the tracing running any other lttng commands? Would you be able to share more of, or the complete, sessiond log which has the assertion being triggered?

Do you have other sessions or channels configured (eg. userspace tracing of your application?)

thanks,
kienan

Actions #2

Updated by Sani Rus 8 months ago

Hi Kienan,

Thank you very much for following on this bug.

Here is additional information from our side:
- yes, there are several application processes trying to create same sessions, and enable same channels, and events. We know that this is not recommended, but it a part of our very complex testing set-up.
- so far the crash happened only once, we will search for the sessiond log file, if it was stored somewhere
- yes, we do have other sessions, and channels configured for user-space tracing

Best regards,
Sani

Actions #4

Updated by Kienan Stewart 8 months ago

  • Status changed from New to On pause

Hi Sani,

my first thought is that it's some kind of race condition when one process is running a command with the session in an unexpected state.

My attempt at a reproducer script runs a number of sub-processes in parallel looping through the commands that you have posted. I also tried removing --no-output and using lttng snapshot record, in case it happened while the snapshots are writing to disk.

If you see the crash again, a coredump and/or backtrace would be useful to understand the state that lead to the assertion being triggered.

I'll put the ticket "on pause" until there's more information available.

thanks,
kienan

Actions #5

Updated by Sani Rus 7 months ago

Hi Kienan,

Ok, thank you very much for the information provided. I will try to provide more information from the test team.

Best regards,
Sani

Actions #10

Updated by Sani Rus 7 months ago

Hi Kienan,

Here is more information:
- No lttng commands are issued at the time when the lttng-sessiond abort occurs, the abort happens during application processes restart. Only application processes are restarted, nothing else.
- There is an interesting behaviour. When we got the output "Warning: Tracing already started for session xxx_trace", lttng survives, there is no abort in the lttng-sessiond process. But when we got output "Tracing started for session xxx_trace" lttng-sessiond process aborts.
- Interesting is that the message comes even after lttng-sessiond was already supposed to be crashed as assert already fired.

Best regards,

Sani

Actions #11

Updated by Sani Rus 7 months ago

Hi Kienan,

In addition to information above - would it be possible to add more debug information in the assert statement, like outputting the actual value of buffer size.

Best regards,

Sani

Actions #12

Updated by Kienan Stewart 6 months ago

  • Status changed from On pause to In Progress
Actions #13

Updated by Erica Bugden 6 months ago

  • Status changed from In Progress to Feedback

Hello Sani,

We believe we have resolved the issue observed here. The fix is available on lttng-tools master and was also backported to the stable-2.13 branch (stable-2.13 fix)

Feel free to let us know if this resolves the issue for you.

Regards,
Erica

Actions #14

Updated by Sani Rus 6 months ago

Hi Erica,

Thank you very much for providing us with the fix. We will integrate it, and come back in case that this problem occurs again.

You can close this bug.

Best regards,

Sani

Actions #15

Updated by Kienan Stewart 3 months ago

  • Status changed from Feedback to Resolved
Actions

Also available in: Atom PDF