Assert in lttng-ring-buffer-client.h:437: client_buffer_begin()
We observed a single case where a process instrumented with LTTng crashed on start with the following assert:
#0 0x00007f361008b495 in raise () from /lib64/libc.so.6
#1 0x00007f361008cc75 in abort () from /lib64/libc.so.6
#2 0x00007f361008460e in __assert_fail_base () from /lib64/libc.so.6
#3 0x00007f36100846d0 in __assert_fail () from /lib64/libc.so.6
#4 0x00007f361127c349 in client_buffer_begin (buf=0x7f35fbee7000, tsc=141244066318, subbuf_idx=0, handle=0x7f36080008c0) at lttng-ring-buffer-client.h:437
#5 0x00007f361128e5a0 in lib_ring_buffer_switch_old_start (buf=0x7f35fbee7000, chan=0x7f3608010a40, tsc=141244066318, handle=0x7f36080008c0, offsets=<optimized out>) at ring_buffer_frontend.c:1775
#6 0x00007f361128eb7c in lib_ring_buffer_reserve_slow (ctx=ctx@entry=0x7ffee79ab120, client_ctx=client_ctx@entry=0x7ffee79aadf0) at ring_buffer_frontend.c:2385
#7 0x00007f361127fd4f in lib_ring_buffer_reserve (config=0x7f36112b0da0 <client_config>, client_ctx=0x7ffee79aadf0, ctx=0x7ffee79ab120) at ../libringbuffer/frontend_api.h:212
#8 lttng_event_reserve (ctx=<optimized out>, event_id=<optimized out>) at lttng-ring-buffer-client.h:760
LTTNG version: 2.12.2
lttng create XXX --live
lttng enable-channel channel0 -u -s XXX --subbuf-size 1M --num-subbuf 8 --overwrite --tracefile-size 104857600 --tracefile-count 3
lttng enable-event 'XXX_*' -u -s XXX -c channel0
lttng add-context -u -t vpid -t vtid -s XXX -c channel0
lttng start XXX
Updated by Mathieu Desnoyers 2 months ago
Can you reproduce it with the same configuration, or was it a one-time thing?
Also, can you tell us more about the environment? One likely culprit is a mismatch between the number of configured processors and what sched_getcpu() returns (or its lttng-ust override, if you have such a .so loaded), so that sched_getcpu() returns an index beyond the number of configured processors. It could also be that lttng-consumerd sees a different number of configured processors than the application does. Is this running under a container or virtualization? If so, how are the LTTng sessiond/consumerd and the instrumented application each configured?
I also remember that some VMware virtualization products, for a while, reported a number of processors to user-space that was inconsistent with sched_getcpu(), which could cause this assert to trigger.
Updated by Sergei Dyshel 2 months ago
This is a one-off issue; it does not reproduce. We have been running LTTng in production for a long time.
There is no virtualization involved; the system is an x86_64-based server.
What kind of LTTng sessiond/consumerd configuration do you want me to provide?
Updated by Mathieu Desnoyers about 2 months ago
- Status changed from New to Feedback
About the sessiond/consumerd configuration: mainly whether they run in a different container than the traced application, and whether that container's view of the possible CPUs is the same as the application's. The error and debug logs from sessiond and consumerd at the time the problem occurs would also be useful.
Finally, the output of the affected application when run with LTTNG_UST_DEBUG=1 would help.
But unless you can reproduce the issue, I suspect providing those logs will be tricky.