Project

General

Profile

Actions

Bug #1372

open

Consumer crashes during rotation (write to bad file descriptor)

Added by Maximilian Fickert about 1 year ago. Updated 10 months ago.

Status:
In Progress
Priority:
Normal
Assignee:
-
Target version:
-
Start date:
04/25/2023
Due date:
% Done:

0%

Estimated time:

Description

Hi,

We are observing seemingly random crashes in the LTTng consumer daemon when tracing a C++ application with LTTng-UST. Our workload has a single printf-like tracepoint, where each string is in the order of 1kb and the total output is around 30MB/s. LTTng is set up with a single session and channel enabling this tracepoint, and we enabled rotation with a maximum size of 100MB or every 30 seconds. We are periodically starting new traced processes and the system runs close to 100% CPU load. This ran on an AWS Graviton2 (ARM) instance with CentOS 7 and a 5.4 kernel, using LTTng-UST 2.13.5 and LTTng-tools 2.13.8.

The first reported error is a write to a bad file descriptor (-1), apparently when waking up the metadata poll thread during a rotation.

I looked through the LTTng logs with -vvv --verbose-consumer and I suspect this might be caused by some sort of race condition; looking at the following log fragments (I inserted two additional debug lines in LTTng so the line numbers may not exactly match those in the 2.13.8 source code):

DBG1 - 15:12:12.865621802 [Rotation]: Consumer rotate channel key 574 (in consumer_rotate_channel() at consumer.c:1694)

A rotation is requested for the channel with key 574. Then the metadata pipe of this channel is closed:
DBG3 - 15:12:13.269805401 [UST application management]: Buffer registry per PID find id: 288 (in buffer_reg_pid_find() at buffer-registry.c:308)
DBG3 - 15:12:13.269811366 [UST application management]: No metadata to push for metadata key 574 (in ust_app_push_metadata() at ust-app.c:689)
DBG2 - 15:12:13.269816979 [UST application management]: Consumer close metadata channel key 574 (in consumer_close_metadata() at consumer.c:1394)
DBG1 - 15:12:13.269826899 [6593/6605]: Incoming command on sock (in consumer_thread_sessiond_poll() at consumer.c:3283)
DBG1 - 15:12:13.269835259 [6593/6605]: UST consumer close metadata key 574 (in close_metadata() at ust-consumer.c:804)
DBG1 - 15:12:13.269838591 [6593/6605]: Closing metadata channel key 574 (in lttng_ustconsumer_close_metadata() at ust-consumer.c:3235)
DBG3 - 15:12:13.269841167 [6593/6605]: close() fd = 603 (in lttng_ustconsumer_close_metadata() at ust-consumer.c:3250)
DBG1 - 15:12:13.269847075 [6593/6605]: Received command on sock (in consumer_thread_sessiond_poll() at consumer.c:3299)

But then LTTng unsuccessfully attempts to wake up the metadata poll thread by writing to the file descriptor that was just closed (the channel key = 574 in the metadata poll thread message below is channel->key):
DBG1 - 15:12:13.271001175 [6593/6605]: Waking up metadata poll thread (writing to pipe): channel name = 'metadata', channel key = 574 (in consumer_metadata_wakeup_pipe() at consumer.c:888)
DBG3 - 15:12:13.271010093 [6593/6605]: write() fd = -1 (in consumer_metadata_wakeup_pipe() at consumer.c:892)
PERROR - 15:12:13.271014655 [6593/6605]: Failed to write to UST metadata pipe while attempting to wake-up the metadata poll thread: Bad file descriptor (in consumer_metadata_wakeup_pipe() at consumer.c:907)
Error: Failed to dump the metadata cache
Error: Rotate channel failed

After this, there are a bunch more errors in the logs, but I guess they are ultimately caused by the issue above:
DBG1 - 15:12:13.273572566 [6593/6604]: Consumer mmap write() ret 446464 (len 446464) (in lttng_consumer_on_read_subbuffer_mmap() at consumer.c:1726)
lttng-consumerd: consumer.c:1639: lttng_consumer_on_read_subbuffer_mmap: Assertion `stream->net_seq_idx != (uint64_t) -1ULL || stream->trace_chunk' failed.
...
DBG1 - 15:12:13.475559588 [Consumer management]: Error when receiving data from the consumer socket 61 (in consumer_socket_recv() at consumer.c:159)
Error: Handling metadata request
DBG1 - 15:12:13.476382551 [Rotation]: Error when receiving data from the consumer socket 60 (in consumer_socket_recv() at consumer.c:159)
Error: Error pushing metadata to consumer
DBG1 - 15:12:13.476418727 [Rotation]: Consumer rotate channel key 604 (in consumer_rotate_channel() at consumer.c:1694)
DBG1 - 15:12:13.476424962 [Rotation]: Setting trace chunk close command to "move to completed chunk folder" (in lttng_trace_chunk_set_close_command() at trace-chunk.c:1797)
DBG1 - 15:12:13.476428671 [Rotation]: lttng_trace_chunk_rename_path from .tmp_new_chunk to (null) (in lttng_trace_chunk_rename_path_no_lock() at trace-chunk.c:755)
DBG3 - 15:12:13.476437516 [Rotation]: renameat() old_dirfd = 946, old_name = .tmp_new_chunk, new_dirfd = 946, new_name = 20230421T151158+0000-5, uid = 1004, gid = 1004 (in run_as_renameat() at runas.c:1883)
DBG1 - 15:12:13.476444999 [Rotation]: Using run_as worker (in run_as() at runas.c:1646)
Error: Error pushing metadata to consumer
DBG2 - 15:12:13.476569797 [UST application management]: Consumer close metadata channel key 624 (in consumer_close_metadata() at consumer.c:1394)
DBG1 - 15:12:13.476585469 [UST application management]: PID 7729 unregistering with sock 1060 (in ust_app_unregister() at ust-app.c:4253)
DBG1 - 15:12:13.476589957 [UST application management]: Flushing app session buffers for ust app pid 7729 (in ust_app_flush_app_session() at ust-app.c:5315)
DBG2 - 15:12:13.476611364 [UST application management]: Consumer flush channel key 651 (in consumer_flush_channel() at consumer.c:1328)
Error: Error flushing consumer channel
DBG3 - 15:12:13.476616188 [UST application management]: Buffer registry per PID find id: 327 (in buffer_reg_pid_find() at buffer-registry.c:308)
DBG3 - 15:12:13.476620406 [UST application management]: No metadata to push for metadata key 652 (in ust_app_push_metadata() at ust-app.c:689)
DBG2 - 15:12:13.476623277 [UST application management]: Consumer close metadata channel key 652 (in consumer_close_metadata() at consumer.c:1394)
DBG1 - 15:12:13.476633312 [UST application management]: PID 11467 unregistering with sock 457 (in ust_app_unregister() at ust-app.c:4253)
DBG1 - 15:12:13.476636996 [UST application management]: Flushing app session buffers for ust app pid 11467 (in ust_app_flush_app_session() at ust-app.c:5315)
DBG2 - 15:12:13.476640508 [UST application management]: Consumer flush channel key 661 (in consumer_flush_channel() at consumer.c:1328)
Error: Error flushing consumer channel
DBG3 - 15:12:13.476643946 [UST application management]: Buffer registry per PID find id: 332 (in buffer_reg_pid_find() at buffer-registry.c:308)
DBG2 - 15:12:13.476649501 [UST application management]: Consumer push metadata to consumer socket -1 (in consumer_push_metadata() at consumer.c:1462)
Error: Error pushing metadata to consumer
...

We currently do not have simple steps to reliably reproduce the issue, but this happens somewhat irregularly after 5-10 minutes (sometimes earlier, sometimes later) of running our workload. I attached a longer fragment of the logs around the time the error happens, let us know if we can provide any more information to help.


Files

lttng.log (290 KB) lttng.log Maximilian Fickert, 04/25/2023 07:30 AM
Actions #1

Updated by Maximilian Fickert about 1 year ago

Another example of this with less noise between the sequence of rotate/close/wakeup log messages:

DBG2 - 11:38:02.567715442 [UST application management]: Consumer close metadata channel key 136 (in consumer_close_metadata() at consumer.c:1394)
DBG1 - 11:38:02.567741993 [6549/6558]: Incoming command on sock (in consumer_thread_sessiond_poll() at consumer.c:3283)
DBG1 - 11:38:02.567764795 [6549/6558]: UST consumer push metadata key 136 of len 2845 (in lttng_ustconsumer_recv_cmd() at ust-consumer.c:1776)
DBG3 - 11:38:02.567785218 [Rotation]: Consumer pushing metadata on sock 60 of len 2845 (in consumer_push_metadata() at consumer.c:1479)
DBG1 - 11:38:02.567805254 [6549/6558]: UST consumer push metadata key 136 of len 2845 (in lttng_ustconsumer_recv_metadata() at ust-consumer.c:1284)
DBG1 - 11:38:02.567818087 [6549/6558]: Writing 2845 bytes from offset 0 in metadata cache (in consumer_metadata_cache_write() at consumer-metadata-cache.c:96)
DBG1 - 11:38:02.567826924 [6549/6558]: Received command on sock (in consumer_thread_sessiond_poll() at consumer.c:3299)
DBG1 - 11:38:02.567836360 [Rotation]: Consumer rotate channel key 136 (in consumer_rotate_channel() at consumer.c:1694)
DBG1 - 11:38:02.567855855 [6549/6558]: Incoming command on sock (in consumer_thread_sessiond_poll() at consumer.c:3283)
DBG1 - 11:38:02.567865004 [6549/6558]: UST consumer close metadata key 136 (in close_metadata() at ust-consumer.c:804)
DBG1 - 11:38:02.567868318 [6549/6558]: Closing metadata channel key 136 (in lttng_ustconsumer_close_metadata() at ust-consumer.c:3235)
DBG3 - 11:38:02.567870829 [6549/6558]: close() fd = 254 (in lttng_ustconsumer_close_metadata() at ust-consumer.c:3250)
DBG1 - 11:38:02.567883342 [6549/6558]: Received command on sock (in consumer_thread_sessiond_poll() at consumer.c:3299)
DBG1 - 11:38:02.567903928 [6549/6558]: Incoming command on sock (in consumer_thread_sessiond_poll() at consumer.c:3283)
DBG1 - 11:38:02.567911264 [6549/6558]: Consumer sample rotate position for channel 136 (in lttng_consumer_rotate_channel() at consumer.c:4002)
DBG1 - 11:38:02.567919461 [6549/6558]: Set rotate ready for stream 253 produced = 0 consumed = 0 (in lttng_consumer_rotate_channel() at consumer.c:4160)
DBG1 - 11:38:02.567918402 [UST application management]: PID 2051 unregistering with sock 779 (in ust_app_unregister() at ust-app.c:4253)
DBG1 - 11:38:02.567929578 [UST application management]: Flushing app session buffers for ust app pid 2051 (in ust_app_flush_app_session() at ust-app.c:5315)
DBG2 - 11:38:02.567934468 [UST application management]: Consumer flush channel key 129 (in consumer_flush_channel() at consumer.c:1328)
DBG1 - 11:38:02.567922431 [6549/6558]: Set rotation position for stream 253 at position 0 (in lttng_consumer_rotate_channel() at consumer.c:4191)
DBG1 - 11:38:02.567941311 [6549/6558]: Consumer rotate ready streams in channel 136 (in lttng_consumer_rotate_ready_streams() at consumer.c:4663)
DBG1 - 11:38:02.567944404 [6549/6558]: Consumer rotate ready stream 253 (in lttng_consumer_rotate_ready_streams() at consumer.c:4679)
DBG1 - 11:38:02.567946718 [6549/6558]: Consumer rotate stream 253 (in lttng_consumer_rotate_stream() at consumer.c:4579)
DBG1 - 11:38:02.567949803 [6549/6558]: Rotate local stream: stream key 253, channel key 136 (in rotate_local_stream() at consumer.c:4539)
DBG1 - 11:38:02.567957393 [6549/6558]: Opening stream output file "pid/sf_awsuswest2te-3512-20230425-113748/metadata" (in consumer_stream_create_output_files() at consumer-stream.c:1198)
DBG1 - 11:38:02.567960625 [6549/6558]: Opening trace chunk file "pid/sf_awsuswest2te-3512-20230425-113748/metadata" (in _lttng_trace_chunk_open_fs_handle_locked() at trace-chunk.c:1364)
DBG1 - 11:38:02.568008002 [6549/6558]: Adding new file "pid/sf_awsuswest2te-3512-20230425-113748/metadata" to trace chunk "20230425T113749+0000-3" (in lttng_trace_chunk_add_file() at trace-chunk.c:1314)
DBG1 - 11:38:02.568016125 [6549/6556]: Metadata poll return from wait with 31 fd(s) (in consumer_thread_metadata_poll() at consumer.c:2369)
DBG1 - 11:38:02.568029425 [6549/6556]: Metadata event caught in thread (in consumer_thread_metadata_poll() at consumer.c:2372)
DBG1 - 11:38:02.568038844 [6549/6556]: Metadata fd 253 is hup|err. (in consumer_thread_metadata_poll() at consumer.c:2481)
DBG3 - 11:38:02.568039197 [Rotation]: Buffer registry per PID find id: 58 (in buffer_reg_pid_find() at buffer-registry.c:308)
DBG1 - 11:38:02.568041831 [6549/6556]: Attempting to flush and consume the UST buffers (in consumer_thread_metadata_poll() at consumer.c:2486)
DBG1 - 11:38:02.568048584 [Rotation]: Consumer rotate channel key 115 (in consumer_rotate_channel() at consumer.c:1694)
DBG3 - 11:38:02.568017536 [6549/6558]: openat() fd = 3318, path = pid/sf_awsuswest2te-3512-20230425-113748/metadata, flags = 241, mode = 432, uid 1004, gid 1004 (in run_as_openat() at runas.c:1758)
DBG1 - 11:38:02.568061794 [6549/6558]: Using run_as worker (in run_as() at runas.c:1646)
DBG1 - 11:38:02.568241165 [6549/6558]: Waking up metadata poll thread (writing to pipe): channel name = 'metadata', channel key = 136 (in consumer_metadata_wakeup_pipe() at consumer.c:888)
DBG3 - 11:38:02.568249214 [6549/6558]: write() fd = -1 (in consumer_metadata_wakeup_pipe() at consumer.c:892)
PERROR - 11:38:02.568253399 [6549/6558]: Failed to write to UST metadata pipe while attempting to wake-up the metadata poll thread: Bad file descriptor (in consumer_metadata_wakeup_pipe() at consumer.c:907)
Error: Failed to dump the metadata cache
Error: Rotate channel failed

Actions #2

Updated by Erica Bugden 10 months ago

  • Status changed from New to In Progress

A yet unmerged fix was developed for the stable-2.13 branch:

A future step would be to port the fix to master.

Actions

Also available in: Atom PDF