Bug #1429: Rotation Hangs Indefinitely When Traced Process Restarts

Added by Evan Lu 4 months ago. Updated about 2 months ago.

Status: New
Priority: Normal
Assignee: -
Target version: -
Start date: 07/21/2025
Due date: -
% Done: 0%
Estimated time: -

Description

Hi LTTng developers,

I'm encountering an issue where session rotation gets stuck indefinitely (`LTTNG_ROTATION_STATE_ONGOING`) under a specific condition. This appears related to how stream rotation state is managed when the traced process is restarted frequently.

In the syslog, the session "session-20250717-061512-logging" repeatedly received the response "TRACE_CHUNK_EXISTS command: trace chunk exists locally", starting at 2025-07-17 06:24:02.060602.

Reproduction Scenario
  • A long-running process manages an LTTng session.
  • A short-lived user process is repeatedly started and stopped every few seconds.
  • The session is set up for rotation (manual or periodic).
  • Over time, the session rotation hangs indefinitely.
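
Roughly, the setup looks like this; the session name, provider name, and timing values below are simplified placeholders rather than our exact scripts:

```
#!/bin/sh
# Simplified reproduction sketch: a long-lived session with rotation while a
# short-lived instrumented app is repeatedly started and killed abruptly.
lttng create repro-session --output=/tmp/repro-traces
lttng enable-event --userspace 'my_provider:*'
lttng start

while true; do
    ./short_lived_traced_app &    # app instrumented with lttng-ust
    sleep 3
    kill -9 $!                    # abrupt termination, no cleanup
    lttng rotate repro-session    # eventually blocks with the rotation ONGOING
    sleep 2
done
```
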
Environment
  • LTTng: 2.13.11
  • Kernel: Linux 6.8.0-52-generic
  • Arch: x86_64
  • OS: Ubuntu 18

Files

syslog.zip (2.77 MB) - Evan Lu, 07/21/2025 08:36 PM

Updated by Kienan Stewart 4 months ago · #1

Hi Evan,

how is the user process started and stopped? In particular I'm wondering if it's a graceful exit or something abrupt like SIGSTOP, kill, or similar.

At first blush, it sounds to me like it could be a case where an application is killed after an event reservation has been made, but before the event commit finishes. That particular scenario can leave individual ring buffers in a state where they will not be consumable. This would be the type of scenario addressed by the upcoming buffer-stall recovery feature in LTTng-UST and LTTng-tools.

I haven't had a chance to read through your logs yet. Once I do, I will reply again.

thanks,
kienan

Updated by Evan Lu 4 months ago · #2

Kienan Stewart wrote in #note-1:

Hi Evan,

how is the user process started and stopped? In particular I'm wondering if it's a graceful exit or something abrupt like SIGSTOP, kill, or similar.

At first blush, it sounds to me like it could be a case where an application is killed after an event reservation has been made, but before the event commit finishes. That particular scenario can leave individual ring buffers in a state where they will not be consumable. This would be the type of scenario addressed by the upcoming buffer-stall recovery feature in LTTng-UST and LTTng-tools.

I haven't had a chance to read through your logs yet. Once I do, I will reply again.

thanks,
kienan

Hi Kienan,

Thanks for the explanation.

You're correct — the user process is currently exited abruptly using `kill`, and in some cases, also via `LOG`, which, as you know, results in an immediate termination without cleanup. I'd appreciate it if you could share more details about the upcoming buffer-stall recovery feature. Additionally, is there any way to detect this issue and recover from it in the current version?

Looking forward to your insights after you've had a chance to review the logs.

Best,
Evan

Updated by Kienan Stewart 4 months ago · Edited · #3

Hi Evan,

it seems probable to me that this is a case of buffer stall. From the LTTng-tools logs alone, it's hard to say with 100% certainty.

One approach I've used in the past to validate this is to apply this commit https://review.lttng.org/c/lttng-ust/+/13177 (originally for 2.12, but the idea carries over), which adds some extra debugging output to lttng-ust, and then run both the traced application(s) and lttng-tools with LTTNG_UST_DEBUG=1. In the logs I would look for something like the following:

```
$ grep 'Reservation failed' *.log | grep -E -o 'CPU [0-9]' | sort | uniq -c
   2497 CPU 0
   2054 CPU 1
   3785 CPU 2
   1751 CPU 3
```

Some reservation failures are normal on a loaded system. In the cases I've seen where a particular sub-buffer becomes unusable, it looks more like this:

```
$ grep 'Reservation failed' *.log | grep -E -o 'CPU [0-9]' | sort | uniq -c
 649336 CPU 0
   2292 CPU 1
   7262 CPU 2
   2595 CPU 3
```
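
If you want to script that check, something like the following rough sketch could work; the 100x-over-the-minimum threshold is just an arbitrary heuristic, not a documented limit:

```
# Flag any CPU whose reservation-failure count is far above the quietest CPU
# (the 100x factor is an arbitrary heuristic).
grep 'Reservation failed' *.log | grep -E -o 'CPU [0-9]+' | sort | uniq -c |
awk '{ count[$3] = $1; if (min == "" || $1 < min) min = $1 }
     END {
         for (cpu in count)
             if (min > 0 && count[cpu] > 100 * min)
                 printf "CPU %s looks stalled: %d failures vs. minimum %d\n", cpu, count[cpu], min
     }'
```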

In LTTng 2.14.0 and earlier, the only way to recover is to destroy the session and re-create it.
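
For reference, the destroy/re-create cycle from the CLI looks roughly like this; the session, channel, and event names below are placeholders to adapt to your setup:

```
# Recovery in current releases: tear the session down and set it up again
# (any data still stuck in the stalled buffers is lost).
lttng destroy my-session
lttng create my-session --output=/path/to/traces
lttng enable-channel --userspace my-channel
lttng enable-event --userspace --channel=my-channel 'my_provider:*'
lttng start my-session
```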

If you'd like to try the upcoming buffer-stall recovery, you could build a patched lttng-ust and lttng-tools. These changes aren't yet merged into master.

The patchset for tools: https://review.lttng.org/c/lttng-tools/+/15040/3
The patchset for UST: https://review.lttng.org/c/lttng-ust/+/14995/4

Your instrumented applications will need to be recompiled, as there are ABI changes in UST to add this functionality.

At a high level, the consumer daemon will periodically check for sub-buffers that have been stuck in a given state for too long and attempt to modify them, so a stall that is currently permanent should resolve after a given period. There are options to enable this when adding channels to a session.

Updated by Evan Lu about 2 months ago · #4

Kienan Stewart wrote in #note-3:

Hi Evan,

it seems probable to me that this is a case of buffer stall. From the LTTng-tools logs alone, it's hard to say with 100% certainty.

One approach I've used in the past to validate this is to apply this commit https://review.lttng.org/c/lttng-ust/+/13177 (originally for 2.12, but the idea carries over), which adds some extra debugging output to lttng-ust, and then run both the traced application(s) and lttng-tools with LTTNG_UST_DEBUG=1. In the logs I would look for something like the following:

[...]

Some reservation failures are normal on a loaded system. In the cases I've seen where a particular sub-buffer becomes unusable, it looks more like this: [...]

In LTTng 2.14.0 and earlier, the only way to recover is to destroy the session and re-create it.

If you'd like to try the upcoming buffer-stall recovery, you could build a patched lttng-ust and lttng-tools. These changes aren't yet merged into master.

The patchset for tools: https://review.lttng.org/c/lttng-tools/+/15040/3
The patchset for UST: https://review.lttng.org/c/lttng-ust/+/14995/4

Your instrumented applications will need to be recompiled, as there are ABI changes in UST to add this functionality.

At a high level, the consumer daemon will periodically check for sub-buffers that have been stuck in a given state for too long and attempt to modify them, so a stall that is currently permanent should resolve after a given period. There are options to enable this when adding channels to a session.

Hi Kienan,

I reproduced the issue with your patch applied and observed a large number of messages like:

```
Reservation failed (-5): Input/output error
```

It seems this is indeed caused by a buffer stall.

Since destroying and recreating the session results in data loss, do you have a recommended rule of thumb for how long I should wait before being confident the session is truly stalled and safe to destroy/recreate? Alternatively, is there an API or counter I can check to detect a stall more reliably?
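
For now I'm considering something crude like the following, where the 60-second cutoff is just a guess on my part rather than a documented limit:

```
# Crude stall heuristic: if a blocking rotate doesn't finish within the
# cutoff, treat the session as stalled and destroy/re-create it.
if ! timeout 60 lttng rotate my-session; then
    echo "rotation still ongoing after 60 s; treating session as stalled" >&2
fi
```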

On buffer-stall recovery: is this feature already available in the current master branch? And do you have an estimate for when the next release with this recovery mechanism will be available?

Thanks again for your help.

Best,
Evan

Updated by Kienan Stewart about 2 months ago · #5

Hi Evan,

buffer stall recovery is now in the master branch. I think we're aiming to get an RC for the next release by the end of October. I can let you know here when the RC is available.

thanks,
kienan
