What happens with unix stream ancillary data on partial reads?

I’ve read a lot about unix-stream ancillary data, but one thing missing from all the documentation is what is supposed to happen on a partial read.

Suppose I’m receiving the following messages into a 24-byte buffer:

msg1 [20 bytes]  (no ancillary data)
msg2 [7 bytes]   (2 file descriptors)
msg3 [7 bytes]   (1 file descriptor)
msg4 [10 bytes]  (no ancillary data)
msg5 [7 bytes]   (5 file descriptors)
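
For concreteness, a message such as msg2 could be sent with a helper along the following lines. This is only a minimal sketch, not the code from any real test program: the names send_with_fds and MAX_FDS are invented for illustration, and it assumes all descriptors for one message travel in a single SCM_RIGHTS control message.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

#define MAX_FDS 8   /* illustrative limit, an assumption of this sketch */

/* Send len bytes on a unix socket, attaching nfds descriptors as one
 * SCM_RIGHTS control message. nfds may be 0 for a plain data message.
 * Returns the sendmsg() result, or -1 if nfds is out of range. */
ssize_t send_with_fds(int sock, const void *data, size_t len,
                      const int *fds, int nfds)
{
    union {
        char buf[CMSG_SPACE(sizeof(int) * MAX_FDS)];
        struct cmsghdr align;           /* forces suitable alignment */
    } ctl;
    struct iovec iov;
    struct msghdr msg;

    if (nfds < 0 || nfds > MAX_FDS)
        return -1;

    iov.iov_base = (void *)data;
    iov.iov_len = len;

    memset(&msg, 0, sizeof msg);
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;

    if (nfds > 0) {
        struct cmsghdr *c;

        memset(&ctl, 0, sizeof ctl);
        msg.msg_control = ctl.buf;
        msg.msg_controllen = CMSG_SPACE(sizeof(int) * nfds);

        c = CMSG_FIRSTHDR(&msg);
        c->cmsg_level = SOL_SOCKET;
        c->cmsg_type = SCM_RIGHTS;
        c->cmsg_len = CMSG_LEN(sizeof(int) * nfds);
        memcpy(CMSG_DATA(c), fds, sizeof(int) * nfds);
    }
    return sendmsg(sock, &msg, 0);
}
```

With this helper, msg1 would be send_with_fds(sock, data, 20, NULL, 0) and msg2 would be send_with_fds(sock, data, 7, fds, 2), each as a separate sendmsg call.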

On the first call to recvmsg, do I get all of msg1 and part of msg2, or will the OS never do that? If I do get part of msg2, do I receive its ancillary data right away, so that I have to hold on to it until the next read, when I finally know what the message was telling me to do with it? If I free up the 20 bytes from msg1 and call recvmsg again, will it ever deliver msg3 and msg4 in the same call? Does the ancillary data from msg3 and msg4 get concatenated in the control message struct?

While I could write test programs to experimentally find this out, I’m looking for documentation about how ancillary data behaves in a streaming context. It seems odd that I can’t find anything official on it.


I’m going to add my experimental findings here, which I got from this test program:

https://github.com/nrdvana/daemonproxy/blob/master/src/ancillary_test.c

Linux 3.2.59, 3.17.6

It appears that Linux will append portions of ancillary-bearing messages to the end of other messages as long as no prior ancillary payload needed to be delivered during this call to recvmsg. Once one message’s ancillary data is being delivered, it will return a short read rather than starting the next ancillary-data message. So, in the example above, the reads I get are:

recv1: [24 bytes] (msg1 + partial msg2 with msg2's 2 file descriptors)
recv2: [10 bytes] (remainder of msg2 + msg3 with msg3's 1 file descriptor)
recv3: [17 bytes] (msg4 + msg5 with msg5's 5 file descriptors)
recv4: [0 bytes]

BSD 4.4, 10.0

BSD keeps reads aligned to message boundaries more than Linux does: it gives a short read immediately before the start of a message with ancillary data. But it will happily append a non-ancillary-bearing message to the end of an ancillary-bearing message. So on BSD, if your buffer is larger than the ancillary-bearing message, you get almost packet-like behavior. The reads I get are:

recv1: [20 bytes] (msg1)
recv2: [7 bytes]  (msg2, with msg2's 2 file descriptors)
recv3: [17 bytes] (msg3, and msg4, with msg3's 1 file descriptor)
recv4: [7 bytes]  (msg5 with 5 file descriptors)
recv5: [0 bytes]
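
Read logs like the ones above can be gathered with a single-read helper that reports the byte count and the number of descriptors delivered by each recvmsg call. The following is a minimal sketch and not the code from the linked test program: recv_once, struct read_result and MAX_FDS are names made up here, and it assumes descriptors only ever arrive as SCM_RIGHTS control messages.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

#define MAX_FDS 16  /* illustrative capacity, an assumption of this sketch */

struct read_result {
    ssize_t nbytes;         /* return value of recvmsg() */
    int     nfds;           /* descriptors delivered by this call */
    int     fds[MAX_FDS];
    int     ctrunc;         /* nonzero if the control buffer was too small */
};

/* Perform one recvmsg() into buf and collect any SCM_RIGHTS descriptors. */
struct read_result recv_once(int sock, void *buf, size_t buflen)
{
    union {
        char buf[CMSG_SPACE(sizeof(int) * MAX_FDS)];
        struct cmsghdr align;           /* forces suitable alignment */
    } ctl;
    struct iovec iov;
    struct msghdr msg;
    struct cmsghdr *c;
    struct read_result r;

    memset(&r, 0, sizeof r);
    memset(&ctl, 0, sizeof ctl);
    memset(&msg, 0, sizeof msg);
    iov.iov_base = buf;
    iov.iov_len = buflen;
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof ctl.buf;

    r.nbytes = recvmsg(sock, &msg, 0);
    if (r.nbytes < 0)
        return r;                       /* errno describes the failure */
    r.ctrunc = (msg.msg_flags & MSG_CTRUNC) != 0;

    /* Walk every control message; copy out the descriptors in order. */
    for (c = CMSG_FIRSTHDR(&msg); c != NULL; c = CMSG_NXTHDR(&msg, c)) {
        size_t i, n;

        if (c->cmsg_level != SOL_SOCKET || c->cmsg_type != SCM_RIGHTS)
            continue;
        n = (c->cmsg_len - CMSG_LEN(0)) / sizeof(int);
        for (i = 0; i < n && r.nfds < MAX_FDS; i++)
            memcpy(&r.fds[r.nfds++], CMSG_DATA(c) + i * sizeof(int),
                   sizeof(int));
    }
    return r;
}
```

Calling recv_once(sock, buf, 24) in a loop until nbytes is 0 and printing nbytes and nfds each time yields one log line per read. Note that the call blocks if the socket is empty and the peer has not closed its end.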

TODO:

I would still like to know how this behaves on older Linux, iOS, Solaris, etc., and how it can be expected to behave in the future.

Asked By: M Conrad

||

Ancillary data is received as if it were queued along with the first normal data octet in the segment (if any).

POSIX.1-2017

For the rest of your question, things get a bit hairy.

…For the purposes of this section, a datagram is considered to be a data segment that terminates a record, and that includes a source address as a special type of ancillary data.

Data segments are placed into the queue as data is delivered to the socket by the protocol. Normal data segments are placed at the end of the queue as they are delivered. If a new segment contains the same type of data as the preceding segment and includes no ancillary data, and if the preceding segment does not terminate a record, the segments are logically merged into a single segment…

A receive operation shall never return data or ancillary data from more than one segment.

So modern BSD sockets exactly match this extract. This is not surprising :-).

Remember that the POSIX standard was written after UNIX, and after splits like BSD vs. System V. One of the main goals was to help understand the existing range of behaviour, and to prevent even more splits in existing features.

Linux was implemented without reference to BSD code. It appears to behave differently here.

  1. If I read you correctly, it sounds like Linux is additionally merging “segments” when a new segment does include ancillary data, but the previous segment does not.

  2. Your point that “Linux will append portions of ancillary-bearing messages to the end of other messages as long as no prior ancillary payload needed to be delivered during this call to recvmsg” does not seem entirely explained by the standard. One possible explanation would involve a race condition. If you read part of a “segment”, you will receive the ancillary data. Perhaps Linux interpreted this as meaning the remainder of the segment no longer counts as including ancillary data! So when a new segment is received, it is merged, either as per the standard or as per difference 1 above.

If you want to write a maximally portable program, you should avoid this area altogether. When using ancillary data, it is much more common to use datagram sockets. If you want to work on all the strange platforms that technically aspire to provide something mostly like POSIX, your question seems to be venturing into a dark and untested corner.
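
To illustrate the datagram alternative: on a datagram socket each receive returns at most one message, so message boundaries never blur the way they do on a stream. A minimal sketch, assuming a socket created with socketpair(AF_UNIX, SOCK_DGRAM, 0, sv); the helper name recv_record is invented here:

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Receive exactly one datagram into buf.
 * *truncated is set to 1 if the datagram was longer than buflen
 * (the excess is discarded), otherwise to 0.
 * Returns the recvmsg() result. */
ssize_t recv_record(int sock, void *buf, size_t buflen, int *truncated)
{
    struct iovec iov;
    struct msghdr msg;
    ssize_t n;

    iov.iov_base = buf;
    iov.iov_len = buflen;

    memset(&msg, 0, sizeof msg);
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;

    n = recvmsg(sock, &msg, 0);
    *truncated = (n >= 0) && (msg.msg_flags & MSG_TRUNC) != 0;
    return n;
}
```

With a 24-byte buffer, a 7-byte datagram followed by a 10-byte datagram comes back as two separate reads of 7 and 10 bytes, never as one read of 17. The price is that a datagram longer than the buffer is cut off rather than continued in the next read.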


You could argue Linux still follows several significant principles:

  1. “Ancillary data is received as if it were queued along with the first normal data octet in the segment”.
  2. Ancillary data is never “concatenated”, as you put it.

However, I am not convinced the Linux behaviour is particularly useful, when you compare it to the BSD behaviour. It seems like the program you describe would need to add a Linux-specific workaround. And I don’t know a justification for why Linux would expect you to do that.
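
One possible shape for such a workaround is to decouple descriptors from reads entirely: every descriptor that arrives goes into a FIFO, and the parser takes descriptors out only once it has decoded a complete message. This is a sketch under a loud assumption that the question does not confirm, namely that each message's own bytes tell the parser how many descriptors it carries. The names fd_queue, fdq_push, fdq_pop and FDQ_CAP are invented for illustration.

```c
#include <stddef.h>

#define FDQ_CAP 64  /* illustrative capacity, an assumption of this sketch */

/* Ring buffer of received descriptors, oldest first. */
struct fd_queue {
    int    fds[FDQ_CAP];
    size_t head;    /* index of the oldest entry */
    size_t count;   /* number of queued descriptors */
};

void fdq_init(struct fd_queue *q)
{
    q->head = 0;
    q->count = 0;
}

/* Queue the n descriptors delivered by one recvmsg() call.
 * Returns 0 on success, or -1 (queue unchanged) if they do not fit. */
int fdq_push(struct fd_queue *q, const int *fds, size_t n)
{
    size_t i;

    if (n > FDQ_CAP - q->count)
        return -1;
    for (i = 0; i < n; i++)
        q->fds[(q->head + q->count + i) % FDQ_CAP] = fds[i];
    q->count += n;
    return 0;
}

/* Called by the parser after it has decoded a complete message that
 * declares n descriptors. Copies the n oldest descriptors to out.
 * Returns 0 on success, or -1 (queue unchanged) if fewer are queued. */
int fdq_pop(struct fd_queue *q, int *out, size_t n)
{
    size_t i;

    if (n > q->count)
        return -1;
    for (i = 0; i < n; i++)
        out[i] = q->fds[(q->head + i) % FDQ_CAP];
    q->head = (q->head + n) % FDQ_CAP;
    q->count -= n;
    return 0;
}
```

In the Linux read sequence from the question, the first read would push msg2's two descriptors before msg2 is complete and the second read would push msg3's one descriptor. The parser then pops two when it finishes msg2 and one when it finishes msg3. Because descriptors arrive no later than the first byte of their message, the same code should also work unchanged with the BSD read sequence.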

The behaviour might have looked sensible to whoever wrote the Linux kernel code, without ever having been tested or exercised by any program.

Or it might be exercised by some program code which mostly works under this subset, but in principle could have edge-case “bugs” or race conditions.

If you cannot make sense of the Linux behaviour and its intended usage, I think that argues for treating this as a “dark, untested corner” on Linux.

Answered By: sourcejedi