Project

General

Profile

Bug #1274

"[error] Event id ____ is outside range." when reading a trace

Added by Christophe Bedard 3 months ago. Updated 3 months ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
-
Start date:
06/23/2020
Due date:
% Done:

0%

Estimated time:

Description

I'm using custom LTTng branches off of stable-2.11 inside a Ubuntu 18.04 Docker container (5.3.0-53-generic).

This worked fine with commit 6f84b7ea107c913475e789ab1c86c254c0e35e0d from the custom lttng-ust branch above, but the next commit defines a lot more events, and seems to break something. I looked over the last commit but I don't see any problems.

I've attached a userspace trace that is generated when LD_PRELOADing the custom liblttng-ust-libc-wrapper.so and running an application. Reading it with babeltrace prints some events but stops at some point and gives the following output:

[error] Event id 19144 is outside range.
[error] Reading event failed.
Error printing trace.

in case it's helpful, babeltrace2 also errors out (although I'm not using bt2 and currently cannot use it):

06-23 10:40:15.487 24237 24237 E PLUGIN/FLT.UTILS.MUXER muxer_msg_iter_next@muxer.c:1415 [muxer] Cannot get next message: comp-addr=0xb85980, muxer-comp-addr=0xb7f2e0, muxer-msg-iter-addr=0xb80d60, msg-iter-addr=0xb80c80, status=ERROR
06-23 10:40:15.487 24237 24237 W LIB/MSG-ITER bt_message_iterator_next@iterator.c:864 Component input port message iterator's "next" method failed: iter-addr=0xb80c80, iter-upstream-comp-name="muxer", iter-upstream-comp-log-level=WARNING, iter-upstream-comp-class-type=FILTER, iter-upstream-comp-class-name="muxer", iter-upstream-comp-class-partial-descr="Sort messages from multiple inpu", iter-upstream-port-type=OUTPUT, iter-upstream-port-name="out", status=ERROR
06-23 10:40:15.487 24237 24237 W LIB/GRAPH consume_graph_sink@graph.c:473 Component's "consume" method failed: status=ERROR, comp-addr=0xb7f4f0, comp-name="pretty", comp-log-level=WARNING, comp-class-type=SINK, comp-class-name="pretty", comp-class-partial-descr="Pretty-print messages (`text` fo", comp-class-is-frozen=0, comp-class-so-handle-addr=0xa2d230, comp-class-so-handle-path="/usr/lib/x86_64-linux-gnu/babeltrace2/plugins/babeltrace-plugin-text.so", comp-input-port-count=1, comp-output-port-count=0
06-23 10:40:15.487 24237 24237 E CLI cmd_run@babeltrace2.c:2548 Graph failed to complete successfully

ERROR:    [Babeltrace CLI] (babeltrace2.c:2548)
  Graph failed to complete successfully
CAUSED BY [libbabeltrace2] (graph.c:473)
  Component's "consume" method failed: status=ERROR, comp-addr=0xb7f4f0, comp-name="pretty", comp-log-level=WARNING, comp-class-type=SINK, comp-class-name="pretty",
  comp-class-partial-descr="Pretty-print messages (`text` fo", comp-class-is-frozen=0, comp-class-so-handle-addr=0xa2d230,
  comp-class-so-handle-path="/usr/lib/x86_64-linux-gnu/babeltrace2/plugins/babeltrace-plugin-text.so", comp-input-port-count=1, comp-output-port-count=0
CAUSED BY [libbabeltrace2] (iterator.c:864)
  Component input port message iterator's "next" method failed: iter-addr=0xb80c80, iter-upstream-comp-name="muxer", iter-upstream-comp-log-level=WARNING,
  iter-upstream-comp-class-type=FILTER, iter-upstream-comp-class-name="muxer", iter-upstream-comp-class-partial-descr="Sort messages from multiple inpu",
  iter-upstream-port-type=OUTPUT, iter-upstream-port-name="out", status=ERROR
CAUSED BY [muxer: 'filter.utils.muxer'] (muxer.c:991)
  Cannot validate muxer's upstream message iterator wrapper: muxer-msg-iter-addr=0xb80d60, muxer-upstream-msg-iter-wrap-addr=0xb82450
CAUSED BY [muxer: 'filter.utils.muxer'] (muxer.c:454)
  Upstream iterator's next method returned an error: status=ERROR
CAUSED BY [libbabeltrace2] (iterator.c:864)
  Component input port message iterator's "next" method failed: iter-addr=0xb81f00, iter-upstream-comp-name="auto-disc-source-ctf-fs", iter-upstream-comp-log-level=WARNING,
  iter-upstream-comp-class-type=SOURCE, iter-upstream-comp-class-name="fs", iter-upstream-comp-class-partial-descr="Read CTF traces from the file sy", iter-upstream-port-type=OUTPUT,
  iter-upstream-port-name="06ec4ff1-bba0-4773-a749-ef866a19f1d7 | 0 | 11", status=ERROR
CAUSED BY [auto-disc-source-ctf-fs (06ec4ff1-bba0-4773-a749-ef866a19f1d7 | 0 | 11): 'source.ctf.fs'] (fs.c:109)
  Failed to get next message from CTF message iterator.
CAUSED BY [auto-disc-source-ctf-fs: 'source.ctf.fs'] (msg-iter.c:2899)
  Cannot handle state: msg-it-addr=0xb820f0, state=AFTER_EVENT_HEADER
CAUSED BY [auto-disc-source-ctf-fs: 'source.ctf.fs'] (msg-iter.c:1282)
  No event class with ID of event class ID to use in stream class: msg-it-addr=0xb820f0, stream-class-id=0, event-class-id=19144, trace-class-addr=0xa23f50

Perhaps this is unrelated, but the metadata file looks weird: event IDs go from 0 to 31 then reset to 17 and go up to 23, all on the same stream (+ there is an event that has its metadata cut in two).

I assume this is more of a lttng-ust problem rather than a babeltrace problem, but this is where I encounter it. There are a lot of similarities with bug #648, but since that one is quite old and was closed without any fix, here is another issue.


Files

ust_trace.tar.xz (31.6 KB) ust_trace.tar.xz Christophe Bedard, 06/23/2020 10:38 AM
#1

Updated by Christophe Bedard 3 months ago

using commit 6f84b7ea107c913475e789ab1c86c254c0e35e0d instead of 269ea68b2c01d21c7711d0376b89989389301675 for lttng-ust (from this branch: https://github.com/ApexAI/lttng-ust/commits/ust-libc-wrapper-additions) indeed works, so perhaps it has something to do with having a lot of events (30 total events in the modified lttng_ust_libc_wrapper shared lib).

#2

Updated by Mathieu Desnoyers 3 months ago

The usual cause of this kind of issue is a corrupted trace. Can you reproduce without the custom commits on top of our supported stable lttng-ust branch ?

#3

Updated by Christophe Bedard 3 months ago

i.e. without the added callstack userspace context but with the added libc wrapper tracepoints (since this should be a normal use of lttng-ust and could be split out)?

#4

Updated by Mathieu Desnoyers 3 months ago

Yes, this is what I mean.

#5

Updated by Christophe Bedard 3 months ago

That seems to work. But from what I understand, the custom commits can work if I do not have too many events (either simply tracepoints that are enabled or events that get generated, I'm not sure).

Would you mind explaining (just briefly) how the custom commits might corrupt a trace? I'm just wondering if it's a defect/bug in the custom code, or if it's a limitation of lttng, etc. I'm just trying to figure out what my next steps should be, given that I'm using an unsupported/custom feature and that I kind of need it.

#6

Updated by Mathieu Desnoyers 3 months ago

The first thing to do is to double-check the custom commits to ensure there is no mismatch between the field description layout (which ends up in the metadata) and the field size/alignment. The field alignment would be the first thing to double-check, as it's forced to 1 byte on x86 (which is very likely the only architecture on which the custom callstack feature was tested), so alignment issues don't show up.

On what architecture are you reproducing this issue ?

#7

Updated by Christophe Bedard 3 months ago

I'm on an x86_64/amd64 system, so I assume it should be fine, but I'll double-check the layout.

#8

Updated by Christophe Bedard 3 months ago

After more investigation, I've fixed the "problem," or part of it, at least. Here is an explanation for the sake of documenting it.

I realized that I was only getting a corrupted trace if I had the `write`/`read` glibc functions supported in my modified `ust-libc-wrapper`. Looking at the userspace callstack information of the last event of the trace before it got corrupted was interesting. It showed that tracepoints were getting triggered inside `lttng-ust` itself, i.e. internal code inside `tracepoint()` was calling `read`/`write`, which was triggering other tracepoints, and so on.

I'm not sure how that didn't lead to a stack overflow, though. I'm also not sure how exactly that corrupted the trace and what part the custom callstack ust context played in it. I did see a bunch of zeros (hex) in some of the binary channel files, as if an event had a lot of (invalid) padding, but I didn't spend enough time reading the CTF specs to understand what it did/how that might affect the trace/where it might've come from. I'm guessing something is wrong in the custom ust callstack code.

The "fix" was to simply not allow nesting of tracepoints. It's not pretty but it works: https://github.com/ApexAI/lttng-ust/commit/264dc7b7e730b4f2dfb0673359a7e12d3be8c9f7#diff-d9d4a38112a88e8366f9f5f54b0a06acR317

This issue can be closed (unsupported/user error).

Also available in: Atom PDF