Bug #1372
openConsumer crashes during rotation (write to bad file descriptor)
0%
Description
Hi,
We are observing seemingly random crashes in the LTTng consumer daemon when tracing a C++ application with LTTng-UST. Our workload has a single printf
-like tracepoint, where each string is in the order of 1kb and the total output is around 30MB/s. LTTng is set up with a single session and channel enabling this tracepoint, and we enabled rotation with a maximum size of 100MB or every 30 seconds. We are periodically starting new traced processes and the system runs close to 100% CPU load. This ran on an AWS Graviton2 (ARM) instance with CentOS 7 and a 5.4 kernel, using LTTng-UST 2.13.5 and LTTng-tools 2.13.8.
The first reported error is a write to a bad file descriptor (-1), apparently when waking up the metadata poll thread during a rotation.
I looked through the LTTng logs with -vvv --verbose-consumer
and I suspect this might be caused by some sort of race condition; looking at the following log fragments (I inserted two additional debug lines in LTTng so the line numbers may not exactly match those in the 2.13.8 source code):
DBG1 - 15:12:12.865621802 [Rotation]: Consumer rotate channel key 574 (in consumer_rotate_channel() at consumer.c:1694)
A rotation is requested for the channel with key 574. Then the metadata pipe of this channel is closed:
DBG3 - 15:12:13.269805401 [UST application management]: Buffer registry per PID find id: 288 (in buffer_reg_pid_find() at buffer-registry.c:308) DBG3 - 15:12:13.269811366 [UST application management]: No metadata to push for metadata key 574 (in ust_app_push_metadata() at ust-app.c:689) DBG2 - 15:12:13.269816979 [UST application management]: Consumer close metadata channel key 574 (in consumer_close_metadata() at consumer.c:1394) DBG1 - 15:12:13.269826899 [6593/6605]: Incoming command on sock (in consumer_thread_sessiond_poll() at consumer.c:3283) DBG1 - 15:12:13.269835259 [6593/6605]: UST consumer close metadata key 574 (in close_metadata() at ust-consumer.c:804) DBG1 - 15:12:13.269838591 [6593/6605]: Closing metadata channel key 574 (in lttng_ustconsumer_close_metadata() at ust-consumer.c:3235) DBG3 - 15:12:13.269841167 [6593/6605]: close() fd = 603 (in lttng_ustconsumer_close_metadata() at ust-consumer.c:3250) DBG1 - 15:12:13.269847075 [6593/6605]: Received command on sock (in consumer_thread_sessiond_poll() at consumer.c:3299)
But then LTTng unsuccessfully attempts to wake up the metadata poll thread by writing to the file descriptor that was just closed (the
channel key = 574
in the metadata poll thread message below is channel->key
):DBG1 - 15:12:13.271001175 [6593/6605]: Waking up metadata poll thread (writing to pipe): channel name = 'metadata', channel key = 574 (in consumer_metadata_wakeup_pipe() at consumer.c:888) DBG3 - 15:12:13.271010093 [6593/6605]: write() fd = -1 (in consumer_metadata_wakeup_pipe() at consumer.c:892) PERROR - 15:12:13.271014655 [6593/6605]: Failed to write to UST metadata pipe while attempting to wake-up the metadata poll thread: Bad file descriptor (in consumer_metadata_wakeup_pipe() at consumer.c:907) Error: Failed to dump the metadata cache Error: Rotate channel failed
After this, there are a bunch more errors in the logs, but I guess they are ultimately caused by the issue above:
DBG1 - 15:12:13.273572566 [6593/6604]: Consumer mmap write() ret 446464 (len 446464) (in lttng_consumer_on_read_subbuffer_mmap() at consumer.c:1726) lttng-consumerd: consumer.c:1639: lttng_consumer_on_read_subbuffer_mmap: Assertion `stream->net_seq_idx != (uint64_t) -1ULL || stream->trace_chunk' failed. ... DBG1 - 15:12:13.475559588 [Consumer management]: Error when receiving data from the consumer socket 61 (in consumer_socket_recv() at consumer.c:159) Error: Handling metadata request DBG1 - 15:12:13.476382551 [Rotation]: Error when receiving data from the consumer socket 60 (in consumer_socket_recv() at consumer.c:159) Error: Error pushing metadata to consumer DBG1 - 15:12:13.476418727 [Rotation]: Consumer rotate channel key 604 (in consumer_rotate_channel() at consumer.c:1694) DBG1 - 15:12:13.476424962 [Rotation]: Setting trace chunk close command to "move to completed chunk folder" (in lttng_trace_chunk_set_close_command() at trace-chunk.c:1797) DBG1 - 15:12:13.476428671 [Rotation]: lttng_trace_chunk_rename_path from .tmp_new_chunk to (null) (in lttng_trace_chunk_rename_path_no_lock() at trace-chunk.c:755) DBG3 - 15:12:13.476437516 [Rotation]: renameat() old_dirfd = 946, old_name = .tmp_new_chunk, new_dirfd = 946, new_name = 20230421T151158+0000-5, uid = 1004, gid = 1004 (in run_as_renameat() at runas.c:1883) DBG1 - 15:12:13.476444999 [Rotation]: Using run_as worker (in run_as() at runas.c:1646) Error: Error pushing metadata to consumer DBG2 - 15:12:13.476569797 [UST application management]: Consumer close metadata channel key 624 (in consumer_close_metadata() at consumer.c:1394) DBG1 - 15:12:13.476585469 [UST application management]: PID 7729 unregistering with sock 1060 (in ust_app_unregister() at ust-app.c:4253) DBG1 - 15:12:13.476589957 [UST application management]: Flushing app session buffers for ust app pid 7729 (in ust_app_flush_app_session() at ust-app.c:5315) DBG2 - 15:12:13.476611364 [UST application management]: Consumer flush channel key 651 (in consumer_flush_channel() at consumer.c:1328) Error: Error flushing consumer channel DBG3 - 15:12:13.476616188 [UST application management]: Buffer registry per PID find id: 327 (in buffer_reg_pid_find() at buffer-registry.c:308) DBG3 - 15:12:13.476620406 [UST application management]: No metadata to push for metadata key 652 (in ust_app_push_metadata() at ust-app.c:689) DBG2 - 15:12:13.476623277 [UST application management]: Consumer close metadata channel key 652 (in consumer_close_metadata() at consumer.c:1394) DBG1 - 15:12:13.476633312 [UST application management]: PID 11467 unregistering with sock 457 (in ust_app_unregister() at ust-app.c:4253) DBG1 - 15:12:13.476636996 [UST application management]: Flushing app session buffers for ust app pid 11467 (in ust_app_flush_app_session() at ust-app.c:5315) DBG2 - 15:12:13.476640508 [UST application management]: Consumer flush channel key 661 (in consumer_flush_channel() at consumer.c:1328) Error: Error flushing consumer channel DBG3 - 15:12:13.476643946 [UST application management]: Buffer registry per PID find id: 332 (in buffer_reg_pid_find() at buffer-registry.c:308) DBG2 - 15:12:13.476649501 [UST application management]: Consumer push metadata to consumer socket -1 (in consumer_push_metadata() at consumer.c:1462) Error: Error pushing metadata to consumer ...
We currently do not have simple steps to reliably reproduce the issue, but this happens somewhat irregularly after 5-10 minutes (sometimes earlier, sometimes later) of running our workload. I attached a longer fragment of the logs around the time the error happens, let us know if we can provide any more information to help.
Files
Updated by Maximilian Fickert over 1 year ago
Another example of this with less noise between the sequence of rotate/close/wakeup log messages:
DBG2 - 11:38:02.567715442 [UST application management]: Consumer close metadata channel key 136 (in consumer_close_metadata() at consumer.c:1394) DBG1 - 11:38:02.567741993 [6549/6558]: Incoming command on sock (in consumer_thread_sessiond_poll() at consumer.c:3283) DBG1 - 11:38:02.567764795 [6549/6558]: UST consumer push metadata key 136 of len 2845 (in lttng_ustconsumer_recv_cmd() at ust-consumer.c:1776) DBG3 - 11:38:02.567785218 [Rotation]: Consumer pushing metadata on sock 60 of len 2845 (in consumer_push_metadata() at consumer.c:1479) DBG1 - 11:38:02.567805254 [6549/6558]: UST consumer push metadata key 136 of len 2845 (in lttng_ustconsumer_recv_metadata() at ust-consumer.c:1284) DBG1 - 11:38:02.567818087 [6549/6558]: Writing 2845 bytes from offset 0 in metadata cache (in consumer_metadata_cache_write() at consumer-metadata-cache.c:96) DBG1 - 11:38:02.567826924 [6549/6558]: Received command on sock (in consumer_thread_sessiond_poll() at consumer.c:3299) DBG1 - 11:38:02.567836360 [Rotation]: Consumer rotate channel key 136 (in consumer_rotate_channel() at consumer.c:1694) DBG1 - 11:38:02.567855855 [6549/6558]: Incoming command on sock (in consumer_thread_sessiond_poll() at consumer.c:3283) DBG1 - 11:38:02.567865004 [6549/6558]: UST consumer close metadata key 136 (in close_metadata() at ust-consumer.c:804) DBG1 - 11:38:02.567868318 [6549/6558]: Closing metadata channel key 136 (in lttng_ustconsumer_close_metadata() at ust-consumer.c:3235) DBG3 - 11:38:02.567870829 [6549/6558]: close() fd = 254 (in lttng_ustconsumer_close_metadata() at ust-consumer.c:3250) DBG1 - 11:38:02.567883342 [6549/6558]: Received command on sock (in consumer_thread_sessiond_poll() at consumer.c:3299) DBG1 - 11:38:02.567903928 [6549/6558]: Incoming command on sock (in consumer_thread_sessiond_poll() at consumer.c:3283) DBG1 - 11:38:02.567911264 [6549/6558]: Consumer sample rotate position for channel 136 (in lttng_consumer_rotate_channel() at consumer.c:4002) DBG1 - 11:38:02.567919461 [6549/6558]: Set rotate ready for stream 253 produced = 0 consumed = 0 (in lttng_consumer_rotate_channel() at consumer.c:4160) DBG1 - 11:38:02.567918402 [UST application management]: PID 2051 unregistering with sock 779 (in ust_app_unregister() at ust-app.c:4253) DBG1 - 11:38:02.567929578 [UST application management]: Flushing app session buffers for ust app pid 2051 (in ust_app_flush_app_session() at ust-app.c:5315) DBG2 - 11:38:02.567934468 [UST application management]: Consumer flush channel key 129 (in consumer_flush_channel() at consumer.c:1328) DBG1 - 11:38:02.567922431 [6549/6558]: Set rotation position for stream 253 at position 0 (in lttng_consumer_rotate_channel() at consumer.c:4191) DBG1 - 11:38:02.567941311 [6549/6558]: Consumer rotate ready streams in channel 136 (in lttng_consumer_rotate_ready_streams() at consumer.c:4663) DBG1 - 11:38:02.567944404 [6549/6558]: Consumer rotate ready stream 253 (in lttng_consumer_rotate_ready_streams() at consumer.c:4679) DBG1 - 11:38:02.567946718 [6549/6558]: Consumer rotate stream 253 (in lttng_consumer_rotate_stream() at consumer.c:4579) DBG1 - 11:38:02.567949803 [6549/6558]: Rotate local stream: stream key 253, channel key 136 (in rotate_local_stream() at consumer.c:4539) DBG1 - 11:38:02.567957393 [6549/6558]: Opening stream output file "pid/sf_awsuswest2te-3512-20230425-113748/metadata" (in consumer_stream_create_output_files() at consumer-stream.c:1198) DBG1 - 11:38:02.567960625 [6549/6558]: Opening trace chunk file "pid/sf_awsuswest2te-3512-20230425-113748/metadata" (in _lttng_trace_chunk_open_fs_handle_locked() at trace-chunk.c:1364) DBG1 - 11:38:02.568008002 [6549/6558]: Adding new file "pid/sf_awsuswest2te-3512-20230425-113748/metadata" to trace chunk "20230425T113749+0000-3" (in lttng_trace_chunk_add_file() at trace-chunk.c:1314) DBG1 - 11:38:02.568016125 [6549/6556]: Metadata poll return from wait with 31 fd(s) (in consumer_thread_metadata_poll() at consumer.c:2369) DBG1 - 11:38:02.568029425 [6549/6556]: Metadata event caught in thread (in consumer_thread_metadata_poll() at consumer.c:2372) DBG1 - 11:38:02.568038844 [6549/6556]: Metadata fd 253 is hup|err. (in consumer_thread_metadata_poll() at consumer.c:2481) DBG3 - 11:38:02.568039197 [Rotation]: Buffer registry per PID find id: 58 (in buffer_reg_pid_find() at buffer-registry.c:308) DBG1 - 11:38:02.568041831 [6549/6556]: Attempting to flush and consume the UST buffers (in consumer_thread_metadata_poll() at consumer.c:2486) DBG1 - 11:38:02.568048584 [Rotation]: Consumer rotate channel key 115 (in consumer_rotate_channel() at consumer.c:1694) DBG3 - 11:38:02.568017536 [6549/6558]: openat() fd = 3318, path = pid/sf_awsuswest2te-3512-20230425-113748/metadata, flags = 241, mode = 432, uid 1004, gid 1004 (in run_as_openat() at runas.c:1758) DBG1 - 11:38:02.568061794 [6549/6558]: Using run_as worker (in run_as() at runas.c:1646) DBG1 - 11:38:02.568241165 [6549/6558]: Waking up metadata poll thread (writing to pipe): channel name = 'metadata', channel key = 136 (in consumer_metadata_wakeup_pipe() at consumer.c:888) DBG3 - 11:38:02.568249214 [6549/6558]: write() fd = -1 (in consumer_metadata_wakeup_pipe() at consumer.c:892) PERROR - 11:38:02.568253399 [6549/6558]: Failed to write to UST metadata pipe while attempting to wake-up the metadata poll thread: Bad file descriptor (in consumer_metadata_wakeup_pipe() at consumer.c:907) Error: Failed to dump the metadata cache Error: Rotate channel failed
Updated by Erica Bugden over 1 year ago
- Status changed from New to In Progress
A yet unmerged fix was developed for the stable-2.13
branch:
A future step would be to port the fix to master
.