Bug #1352

closed

LTTng consumer daemon crashed with error: "Attempt to send invalid file descriptor to master (fd = -1)"

Added by Bogdan Codres over 2 years ago. Updated about 2 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
-
Target version:
-
Start date:
04/04/2022
Due date:
% Done:

100%

Estimated time:

Description

Below are the LTTng packages used in our project.

babeltrace 1.5.8
liburcu 0.9.7
lttng-tools 2.12.6 + some upstream patches
lttng-ust 2.12.2

2022-03-15T08:16:12.389523+00:00  [lttng.sh Error: Attempt to send invalid file descriptor to master (fd = -1)
2022-03-15T08:16:12.389538+00:00  [lttng.sh PERROR - 08:16:12.389492137 [2524/2524]: Failed to close result file descriptor: Bad file descriptor (in send_fds_to_master() at ../../../git/src/common/runas.c:758)
2022-03-15T08:16:12.389545+00:00  [lttng.sh PERROR - 08:16:12.389522720 [2520/2545]: Failed to open file relative to trace chunk file_path = "xxx/x/64-bit/xxxx_0", flags = 577, mode = 432: No such file or directory (in _lttng_trace_chunk_open_fs_handle_locked() at ../../../git/src/common/trace-chunk.c:1410)
2022-03-15T08:16:12.389551+00:00  [lttng.sh Error: Failed to open stream file "xxxx_0" 
2022-03-15T08:16:12.389555+00:00  [lttng.sh Error: Snapshot channel failed
2022-03-15T08:16:12.398688+00:00  [lttng.sh lttng-consumerd: ../../../../git/src/common/ust-consumer/ust-consumer.c:1141: snapshot_channel: Assertion `!stream->trace_chunk' failed.

After this error, they see the subsequent errors below, which occurred while creating a snapshot of the "xxx" LTTng trace session (created in snapshot mode).

2022-03-15T08:16:14.060086+00:00  [lttng.sh Error: Handling metadata request
2022-03-15T08:16:14.060095+00:00  [lttng.sh Error: Health error occurred in thread_consumer_management
2022-03-15T08:16:14.060100+00:00  [lttng.sh Error: Failed to close trace chunk on user space consumer
2022-03-15T08:16:14.060105+00:00  [lttng.sh Error: Failed to close snapshot trace chunk of session "xxx" 
2022-03-15T08:16:14.062867+00:00  [lttng.sh Error: Trace chunk creation error on consumer
2022-03-15T08:16:14.062876+00:00  [lttng.sh Error: Failed to set temporary trace chunk to record a snapshot of session "xxx" 
2022-03-15T08:16:14.844002+00:00  [lttng.sh Error: Trace chunk creation error on consumer
2022-03-15T08:16:14.844011+00:00  [lttng.sh Error: Failed to set temporary trace chunk to record a snapshot of session "xxxxeventlog" 
2022-03-15T08:16:35.224333+00:00  take_snapshot(): lttng_snapshot_record failed for xxx, Trace chunk creation failed on consumer
2022-03-15T08:16:35.224499+00:00  [lttng.sh Error: Trace chunk creation error on consumer
2022-03-15T08:16:35.224509+00:00  [lttng.sh Error: Failed to set temporary trace chunk to record a snapshot of session "xxx" 
2022-03-15T08:16:35.227241+00:00  [lttng.sh Error: Trace chunk creation error on consumer
2022-03-15T08:16:35.227250+00:00  [lttng.sh Error: Failed to set temporary trace chunk to record a snapshot of session "xxx" 

Here are some assumptions about the sequence of events:

Multiple applications crashed, and specific logs are generated for each of these application crashes.

- There are multiple dump functions, implemented as scripts invoked by xxxdump (which is registered as core_pattern), that generate files with debug information (about the crashed program, system state, etc.). Generating snapshots of the created and started LTTng trace sessions is one such operation.
- A top-level temporary directory is created at program crash and is used to store the debug info generated by each of these dumps.
- Each dump is expected to complete within 3 seconds; if it doesn't, it is aborted.
- The dump for collecting LTTng traces (a script that runs "lttng snapshot record") is one of them.
- As there are multiple application crashes at the same time, LTTng snapshot generation appears to take more than 3 seconds, so that dump is aborted.
- After the dump is aborted and all the remaining dumps have executed, the temporary collecting directory is removed (after being tar-ed).
- Since the path hosting the LTTng traces, which is passed to LTTng for snapshot generation, is inside this temporary collecting directory, lttng-consumerd keeps writing to it. We believe this causes lttng-consumerd to hit an invalid FD ("no such file or directory") once the snapshot folder is removed, and eventually to crash on the assertion.

They don't think it is appropriate for lttng-consumerd to crash when the snapshot folder is removed. They think LTTng should handle this gracefully and return an error/warning rather than abort.

Could you say whether it is possible to improve the handling of the case where the snapshot folder is removed while snapshot recording is in progress?
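
For illustration, here is a minimal, hypothetical sketch in plain C (not lttng-tools code; the function name and error reporting are made up for this example) of the kind of defensive behavior they have in mind: detect that the snapshot output directory has disappeared and report an error instead of aborting on an assertion.

    #include <errno.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/stat.h>

    /*
     * Hypothetical example only: return -1 when the snapshot output directory
     * has disappeared, so the caller can log "Snapshot channel failed" and
     * keep running instead of hitting an assertion.
     */
    static int check_snapshot_output_dir(const char *path)
    {
        struct stat st;

        if (stat(path, &st) < 0) {
            fprintf(stderr, "Snapshot output directory is gone: %s: %s\n",
                    path, strerror(errno));
            return -1;
        }
        if (!S_ISDIR(st.st_mode)) {
            fprintf(stderr, "Snapshot output path is not a directory: %s\n", path);
            return -1;
        }
        return 0;
    }

    int main(int argc, char **argv)
    {
        const char *path = argc > 1 ? argv[1] : "/tmp/snapshot-out";
        return check_snapshot_output_dir(path) == 0 ? 0 : 1;
    }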


Files

run.sh (656 Bytes) Bogdan Codres, 04/07/2022 08:55 AM
Actions #1

Updated by Jonathan Rajotte Julien over 2 years ago

  • Status changed from New to Feedback

Hi Bogdan,

Please provide a dumbed-down reproducer with all the steps necessary to reproduce the observed behavior.

Cheers.

Actions #2

Updated by Bogdan Codres over 2 years ago

Hello Julien,

Unfortunately I cannot provide more info on the reproducer beyond what I already provided above.
Basically, I cannot reproduce it on my side; only the client can. I think the details
offer some kind of reproduction steps:
- multiple snapshots of the LTTng sessions are generated at the same time
- a dump isn't finished in 3 seconds, so a directory deletion is triggered
- "no such file or directory" is triggered on the LTTng side

I have some assumptions regarding what might trigger this issue:

- a missing mutex (pthread_mutex_lock(&stream->lock);) around these two lines in consumer_del_metadata_stream (see the sketch below):

        lttng_trace_chunk_put(stream->trace_chunk);
        stream->trace_chunk = NULL;

- a missing mutex lock for the channel in lttng_ustconsumer_recv_cmd
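
To make the first point concrete, here is a minimal sketch (not a proposed upstream patch; it assumes stream->lock is the per-stream mutex quoted above) of taking the lock around the trace chunk release in consumer_del_metadata_stream:

        /* Sketch only: hold the stream lock while releasing the trace chunk
         * reference, so a concurrent snapshot cannot observe a half-updated
         * stream->trace_chunk. */
        pthread_mutex_lock(&stream->lock);
        lttng_trace_chunk_put(stream->trace_chunk);
        stream->trace_chunk = NULL;
        pthread_mutex_unlock(&stream->lock);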

Bogdan

Actions #3

Updated by Jonathan Rajotte Julien over 2 years ago

Hi Bogdan,

Well, reproducing the issue artificially is the hard part here. In any case, we will need a good way to reproduce the problem in a synthetic manner; otherwise it will be quite difficult to confirm that whatever fix we come up with actually fixes the problem, especially for those of us who do not have access to your client's production/test system.

Provide a reproducer and we will have another look.

Cheers

Actions #4

Updated by Bogdan Codres over 2 years ago

One more concern that I have. Based on my client's use case, which
deletes folders if the snapshot isn't finished in 3 seconds, I wonder whether

assert(path); --> from snapshot_channel is sufficient?
Isn't this assert only checking that the path (char *) is different from NULL?
Shouldn't it be something like this?

    /* Check existence and permissions */
    ret = access(path, F_OK | X_OK);
    if (ret < 0) {
        ERR("No such file or access denied: %s", path);
        goto end;
    }

Bogdan

Actions #5

Updated by Bogdan Codres over 2 years ago

Hello Julien !

I've managed to reproduce a log similar to the client's by removing the directory
where the traces should be saved. Please check the attached script.

Session snap-test created.
Traces will be output to /root/lttng-traces/snap-test-20220407-044000
UST channel ch1 enabled for session snap-test
All UST events are enabled in channel ch1
Tracing started for session snap-test
PERROR - 04:40:05.379797033 [1576/1576]: Failed to fstat directory file descriptor 3: Stale file handle (in lttng_directory_handle_create_from_dirfd() at ../../../../git/src/common/compat/directory-handle.c:165)
PERROR - 04:40:05.380028748 [1575/1584]: Failed to create trace chunk subdirectory "ust/uid/0/64-bit/index": Stale file handle (in lttng_trace_chunk_create_subdirectory() at ../../../git/src/common/trace-chunk.c:1264)
Error: ask_channel_creation consumer command failed
Error: Error creating UST channel "ch1" on the consumer daemon
PERROR - 04:40:05.380433829 [1576/1576]: Failed to fstat directory file descriptor 0: Stale file handle (in lttng_directory_handle_create_from_dirfd() at ../../../../git/src/common/compat/directory-handle.c:165)
PERROR - 04:40:05.380445197 [1576/1576]: Failed to close file descriptor received fd in run-as worker: Bad file descriptor (in cleanup_received_fds() at ../../../git/src/common/runas.c:820)
Error: Error cleaning up FD
Error: Run-as worker has hung-up during run_as_cmd
Error: run_as worker exiting (ret = 1)
Error: lttng-runas terminated with status code 1
PERROR - 04:40:05.451847229 [1593/1593]: Failed to fstat directory file descriptor 3: Stale file handle (in lttng_directory_handle_create_from_dirfd() at ../../../../git/src/common/compat/directory-handle.c:165)
Error: Attempt to send invalid file descriptor to master (fd = -1)
PERROR - 04:40:05.452126650 [1593/1593]: Failed to close result file descriptor: Bad file descriptor (in send_fds_to_master() at ../../../git/src/common/runas.c:758)
PERROR - 04:40:05.452132810 [1591/1600]: Failed to open file relative to trace chunk file_path = "uid/0/64-bit/ch1_0", flags = 577, mode = 432: Stale file handle (in _lttng_trace_chunk_open_fs_handle_locked() at ../../../git/src/common/trace-chunk.c:1410)
Error: Failed to open stream file "ch1_0"
Error: ask_channel_creation consumer command failed
Error: Error creating UST channel "ch1" on the consumer daemon
Waiting for data availability
Tracing stopped for session snap-test

Bogdan

Actions #6

Updated by Jonathan Rajotte Julien about 2 years ago

  • Status changed from Feedback to Resolved
  • % Done changed from 0 to 100