Bug #1331

open

test_unix_socket fails for 64 bit arches on alpine linux but passes on 32bit arches

Added by Duncan Bellamy 3 months ago. Updated 21 days ago.

Status: Feedback
Priority: Normal
Assignee: -
Target version: -
Start date: 11/02/2021
Due date:
% Done: 0%
Estimated time:

Description

Build log for x86_64:

https://gitlab.alpinelinux.org/a16bitsysop/aports/-/jobs/523491#L2351

FAIL: test_unix_socket 3 - Sent test payload file descriptors

Log:
PERROR - 17:52:06.330866344 [70399/70399]: sendmsg: Out of memory (in lttcomm_send_fds_unix_sock() at unix.c:453)
not ok 3 - Sent test payload file descriptors
FAIL: test_unix_socket 3 - Sent test payload file descriptors
  1. Failed test (test_unix_socket.c:test_high_fd_count() at line 111)
    PERROR - 17:52:06.331082468 [70399/70399]: Failed to send test payload file descriptors: ret = -1, expected = 1: Out of memory (in test_high_fd_count() at test_unix_socket.c:114)
Actions #1

Updated by Jonathan Rajotte Julien 3 months ago

  • Status changed from New to Feedback

Hi,

This looks like it is a limitation of your libc:

https://git.musl-libc.org/cgit/musl/tree/src/network/sendmsg.c

#if LONG_MAX > INT_MAX
    struct msghdr h;
    struct cmsghdr chbuf[1024/sizeof(struct cmsghdr)+1], *c;
    if (msg) {
        h = *msg;
        h.__pad1 = h.__pad2 = 0;
        msg = &h;
        if (h.msg_controllen) {
            if (h.msg_controllen > 1024) {
                errno = ENOMEM;
                return -1;
            }
            memcpy(chbuf, h.msg_control, h.msg_controllen);
            h.msg_control = chbuf;
            for (c=CMSG_FIRSTHDR(&h); c; c=CMSG_NXTHDR(&h,c))
                c->__pad1 = 0;
        }
    }
#endif
    return socketcall_cp(sendmsg, fd, msg, flags, 0, 0, 0);

On 32-bit it ends up calling only socketcall_cp, but on 64-bit it limits msg_controllen to 1024 bytes and fails with ENOMEM above that.

Based on commits 96107564e2eabbc13800fe7a7d930b67216d0805 and 7168790763cdeb794df52be6e3b39fbb021c5a64, this preprocessor block is a workaround.

Note that here we are pushing the envelope (as the test should) in terms of passed FDs, but we are within our rights according to the unix(7) man page:

              The kernel constant SCM_MAX_FD defines a limit on the number of file descriptors
              in the array. Attempting to send an array larger than this limit causes
              sendmsg(2) to fail with the error EINVAL. SCM_MAX_FD has the value 253 (or
              255 in kernels before 2.6.38)

As of today we do not send that many fds across the unix sockets for real usage. Most of the time the number of fds is less than 5.

Not sure what the next step is here for either of us, considering that musl does not expose that the libc is musl (see the second question of the FAQ: https://wiki.musl-libc.org/faq.html). It is not the first time we have hit a quirk in musl; most of the time they are "right", but in this case it does not look like it.

Let me know if something does not sound right or if I have missed something pertinent for this.

Actions #2

Updated by Duncan Bellamy 3 months ago

Thanks for looking into this, I checked with an aarch64 build and LTTCOMM_MAX_SEND_FDS is 253

If I change the define to: #define HIGH_FD_COUNT (LTTCOMM_MAX_SEND_FDS - 1)

That test then passes on all arches, does it invalidate the test or does it mean musl has an off by one error?

But now: s390x has 1 error:

PASS: test_event_rule 212 - Log level rule "as least as severe as" accepted by python logging event rule: level = -657
ERROR: test_event_rule - exited with status 2

and 2 fails:

PASS: test_event_rule 32 - Serializing.
FAIL: test_event_rule 33 - Deserializing.
FAIL: test_event_rule 34 - serialized and from buffer are equal.
PASS: test_event_rule 35 - syscall object.

Actions #3

Updated by Jonathan Rajotte Julien 3 months ago

Duncan Bellamy wrote in #note-2:

> Thanks for looking into this, I checked with an aarch64 build and LTTCOMM_MAX_SEND_FDS is 253
>
> [...]
>
> That test then passes on all arches, does it invalidate the test or does it mean musl has an off by one error?

That, or we end up in the msg_controllen <= 1024 case.

To be honest, I'm not sure, but considering that the tests pass on glibc with the 253 value, I would tend to say that musl is the problem here.

> But now: s390x has 1 error:
>
> PASS: test_event_rule 212 - Log level rule "as least as severe as" accepted by python logging event rule: level = -657
> ERROR: test_event_rule - exited with status 2
>
> and 2 fails:
>
> PASS: test_event_rule 32 - Serializing.
> FAIL: test_event_rule 33 - Deserializing.
> FAIL: test_event_rule 34 - serialized and from buffer are equal.
> PASS: test_event_rule 35 - syscall object.

That should be relatively straightforward to debug. Do you have an easy way to spawn a system equivalent to the test environment here? I could look into it.

Actions #4

Updated by Duncan Bellamy 3 months ago

I can have a go at creating a "test build" repo on GitLab, or you are welcome to create a merge request based on mine in the alpine CI to test out fixes and create a working merge request:

https://gitlab.alpinelinux.org/alpine/aports/-/merge_requests/26883/diffs

Actions #5

Updated by Jonathan Rajotte Julien 3 months ago

I was able to narrow it down a bit using the Alpine CI. For some reason we fail on a buffer view validity check during the deserialization.

I'll put some effort into it in the upcoming week and keep you updated. It would be much faster if I could simply spawn a VM/container of this image and use gdb.

Cheers

Actions #6

Updated by Duncan Bellamy 3 months ago

Okay, thanks. I already have some Docker builds that build some Alpine packages from source. I can modify one to make a Dockerfile-based build of lttng-tools from source with gdb, then you can clone it.

I can do it later. Is it better to build from git master or 2.13.0?

Actions #7

Updated by Jonathan Rajotte Julien 3 months ago

Okay, thanks, that would speed all this up.

Let's stay on the stable 2.13 branch. The same problem is seen on 2.13.1, so 2.13.0 is okay as long as it is built with "-O0 -g".

Actions #8

Updated by Duncan Bellamy 2 months ago

Here it is:
https://github.com/a16bitsysop/docker-lttng

It’s not the actual alpine build environment but it uses abuild to build the packages.

It downloads the tarball and applies the patches during the build process, all with the single command:
abuild -r

This can be split into the following steps, run individually, so it isn’t built from scratch every time:
abuild installdeps
abuild unpack
abuild prepare
abuild build
abuild check

Let me know if it needs changing to make it easier to debug, or saving source.

I added a dbg subpackage, which has the debug symbols, and moved autoreconf into prepare, where it belongs.

The GitHub action builds for the same arches as Alpine.

Actions #9

Updated by Jonathan Rajotte Julien 2 months ago

Bear with me here, how do I run the s390x "container" locally with a good old shell at the end as the result?

Actions #10

Updated by Duncan Bellamy 2 months ago

With QEMU. I can create an extra script to set up QEMU and run the build job, ending in a shell.

Actions #11

Updated by Duncan Bellamy 2 months ago

I had a look at installing QEMU, but it is actually already set up if you are using Docker Desktop.

I have updated the GitHub repo so it disables checks for lttng-tools and does not delete build artefacts after success.

So you can pull with:
docker image pull ghcr.io/a16bitsysop/docker-lttng

This will pull the s390x image as it is only built for s390x now.

Then if you run with:
docker container run -it --rm ghcr.io/a16bitsysop/docker-lttng /bin/sh

It will start a shell in / under QEMU, so you can run:

uname -m
cd /tmp
abuild check

to run the tests, and edit source in /tmp/src/lttng-tools-2.13.1 before running

abuild build
abuild check

To try new changes.

Or you could just call make directly inside /tmp/src/lttng-tools-2.13.1

Actions #12

Updated by Duncan Bellamy 2 months ago

It is actually broken, as it ends up with lttng-ust-dev 2.12 installed. I will rebuild it.

Actions #13

Updated by Duncan Bellamy 2 months ago

The lttng-ust-dev version it finishes with is fixed now. If you pulled the old image, you have to pull again to get the new one.

Actions #14

Updated by Duncan Bellamy 21 days ago

I have updated my PR https://gitlab.alpinelinux.org/alpine/aports/-/merge_requests/26883, moving autoreconf to prepare for lttng-tools and patching out the failing test for s390x so the package can be updated.
