Bug #1352
closed
LTTng consumer daemon crashed with error: "Attempt to send invalid file descriptor to master (fd = -1)"
Description
Below are the LTTng packages used in our project:
babeltrace 1.5.8
liburcu 0.9.7
lttng-tools 2.12.6 + some upstream patches
lttng-ust 2.12.2
2022-03-15T08:16:12.389523+00:00 [lttng.sh Error: Attempt to send invalid file descriptor to master (fd = -1)
2022-03-15T08:16:12.389538+00:00 [lttng.sh PERROR - 08:16:12.389492137 [2524/2524]: Failed to close result file descriptor: Bad file descriptor (in send_fds_to_master() at ../../../git/src/common/runas.c:758)
2022-03-15T08:16:12.389545+00:00 [lttng.sh PERROR - 08:16:12.389522720 [2520/2545]: Failed to open file relative to trace chunk file_path = "xxx/x/64-bit/xxxx_0", flags = 577, mode = 432: No such file or directory (in _lttng_trace_chunk_open_fs_handle_locked() at ../../../git/src/common/trace-chunk.c:1410)
2022-03-15T08:16:12.389551+00:00 [lttng.sh Error: Failed to open stream file "xxxx_0"
2022-03-15T08:16:12.389555+00:00 [lttng.sh Error: Snapshot channel failed
2022-03-15T08:16:12.398688+00:00 [lttng.sh lttng-consumerd: ../../../../git/src/common/ust-consumer/ust-consumer.c:1141: snapshot_channel: Assertion `!stream->trace_chunk' failed.
After this error, the subsequent errors below are seen. They occurred while creating a snapshot for the "xxx" LTTng trace session, which was created in snapshot mode.
2022-03-15T08:16:14.060086+00:00 [lttng.sh Error: Handling metadata request
2022-03-15T08:16:14.060095+00:00 [lttng.sh Error: Health error occurred in thread_consumer_management
2022-03-15T08:16:14.060100+00:00 [lttng.sh Error: Failed to close trace chunk on user space consumer
2022-03-15T08:16:14.060105+00:00 [lttng.sh Error: Failed to close snapshot trace chunk of session "xxx"
2022-03-15T08:16:14.062867+00:00 [lttng.sh Error: Trace chunk creation error on consumer
2022-03-15T08:16:14.062876+00:00 [lttng.sh Error: Failed to set temporary trace chunk to record a snapshot of session "xxx"
2022-03-15T08:16:14.844002+00:00 [lttng.sh Error: Trace chunk creation error on consumer
2022-03-15T08:16:14.844011+00:00 [lttng.sh Error: Failed to set temporary trace chunk to record a snapshot of session "xxxxeventlog"
2022-03-15T08:16:35.224333+00:00 take_snapshot(): lttng_snapshot_record failed for xxx, Trace chunk creation failed on consumer
2022-03-15T08:16:35.224499+00:00 [lttng.sh Error: Trace chunk creation error on consumer
2022-03-15T08:16:35.224509+00:00 [lttng.sh Error: Failed to set temporary trace chunk to record a snapshot of session "xxx"
2022-03-15T08:16:35.227241+00:00 [lttng.sh Error: Trace chunk creation error on consumer
2022-03-15T08:16:35.227250+00:00 [lttng.sh Error: Failed to set temporary trace chunk to record a snapshot of session "xxx"
Here are some assumptions:
Here is the sequence of events...
Multiple applications have crashed, so specific logs are generated for each of these application crashes
- There are multiple dump functions, implemented as scripts invoked by xxxdump (which is registered as core_pattern), that generate files with debug information (about the crashed program, system state, etc.).
Generating snapshots for the created and started LTTng trace sessions is one such operation.
- A top-level temporary directory is created when a program crashes, and it is used to store the debug info generated by each of these dumps
- Each dump is expected to complete within 3 seconds; if it doesn't, it is aborted
- The dump for collecting LTTng traces (a script that runs "lttng snapshot record") is one of them
- As multiple applications crash at the same time, it looks like LTTng snapshot generation takes longer than 3 seconds,
so the dump is aborted
- After the dump is aborted and all the remaining dumps have executed, the temporary collecting directory is removed (after being tar-ed)
- As the path for the LTTng traces, which is passed to LTTng for snapshot generation, is inside this temporary collecting directory,
lttng-consumerd will continue to write to it. We believe this causes lttng-consumerd to hit the invalid FD error ("No such file or directory")
once the snapshot folder is removed, and eventually to crash on the assertion
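
The failure mode suspected above can be imitated outside LTTng. The following is a minimal sketch, assuming a Linux system; it does not call lttng at all. A process keeps a handle on an output directory, the directory is removed underneath it, and a later open relative to that handle fails with "No such file or directory". The flags and mode values are the ones from the log line (on Linux, 577 is O_WRONLY | O_CREAT | O_TRUNC and 432 is octal 660). The directory names and the function name are hypothetical.

```python
import errno
import os
import shutil
import tempfile

# Values copied from the "Failed to open file relative to trace chunk" log line.
FLAGS = 577  # O_WRONLY | O_CREAT | O_TRUNC on Linux
MODE = 432   # 0o660


def open_stream_after_removal():
    """Open a file relative to a directory that was removed in the meantime.

    Returns the errno of the failed open, or 0 if the open succeeded.
    """
    collect_dir = tempfile.mkdtemp(prefix="collect-")  # hypothetical dump directory
    chunk_dir = os.path.join(collect_dir, "snapshot")
    os.mkdir(chunk_dir)
    # Stand-in for a daemon that still holds a handle on the output directory.
    chunk_fd = os.open(chunk_dir, os.O_RDONLY | os.O_DIRECTORY)
    try:
        # Stand-in for the dump framework removing the collecting directory.
        shutil.rmtree(collect_dir)
        try:
            fd = os.open("stream_0", FLAGS, MODE, dir_fd=chunk_fd)
        except OSError as exc:
            return exc.errno
        os.close(fd)
        return 0
    finally:
        os.close(chunk_fd)
```

Calling open_stream_after_removal() on Linux is expected to return errno.ENOENT, the same error string that appears in the PERROR line of the log.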
They don't think it is appropriate for lttng-consumerd to crash when the snapshot folder is removed.
They think LTTng should be able to handle this gracefully and return an error/warning rather than abort.
Could you say whether it is possible to improve the handling of the snapshot folder being removed while snapshot recording is in progress?
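
To illustrate the kind of behaviour being requested, here is a minimal sketch of an error path that reports the failure and cleans up instead of aborting. It is not LTTng code and not a confirmed diagnosis. It assumes that the failed assertion expects a stream to have no trace chunk attached when a snapshot starts, so the sketch always detaches the chunk, even when the open fails. The Stream class, snapshot_stream() and demo() are hypothetical names.

```python
import errno
import os
import shutil
import tempfile


class Stream:
    """Hypothetical stand-in for a consumer stream; not an LTTng type."""

    def __init__(self, name):
        self.name = name
        self.trace_chunk_fd = None  # directory handle of the attached chunk


def snapshot_stream(stream, chunk_fd):
    """Try to create the stream's output file; return 0 or a negative errno."""
    # Report an error instead of asserting when the precondition is violated.
    if stream.trace_chunk_fd is not None:
        return -errno.EBUSY
    stream.trace_chunk_fd = chunk_fd
    try:
        fd = os.open(stream.name, os.O_WRONLY | os.O_CREAT | os.O_TRUNC,
                     0o660, dir_fd=chunk_fd)
    except OSError as exc:
        return -exc.errno
    else:
        os.close(fd)
        return 0
    finally:
        # Always detach the chunk so a later snapshot starts from a clean state.
        stream.trace_chunk_fd = None


def demo():
    """Take one snapshot with the directory present and one after its removal."""
    out_dir = tempfile.mkdtemp(prefix="snapshot-")
    chunk_fd = os.open(out_dir, os.O_RDONLY | os.O_DIRECTORY)
    stream = Stream("stream_0")
    try:
        first = snapshot_stream(stream, chunk_fd)
        shutil.rmtree(out_dir)  # output folder removed while still in use
        second = snapshot_stream(stream, chunk_fd)
    finally:
        os.close(chunk_fd)
    return first, second, stream.trace_chunk_fd
```

In this sketch the second attempt returns a negative ENOENT to the caller and leaves the stream with no chunk attached, so a later snapshot to a valid directory could still proceed.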