Why is 'H' / 72 / 0x48 the second most common byte in executables?
(If the score of this question is 72, please don’t upvote!)
I ran this:
cat /usr/bin/* |
perl -ne 'map {$a{$_}++} split//; END{print map { "$a{$_}\t$_\n" } keys %a}' |
grep --text . | sort -n | plotpipe --log y {1}
and got this:
(Even with a log y-axis it still looks exponential! There is more than 100x between the top and the bottom)
Looking at the numbers:
:
31919597 ^H
32983719 ^B
33943030 ^O
39130281 \213
39893389 $
52237360 \211
53229196 ^A
76884442 \377
100776756 H
746405320 ^@
It is hardly surprising that ^@ (NUL) is the most common byte in executables. \377 (255) and ^A (1) also make intuitive sense to me.
But what causes ‘H’ (72) to be the second most common byte in executables – far more common than 255 and 1?
Background
For a Perl script, I needed to find the least common byte in Perl scripts. By accident, I didn’t grep out only Perl scripts but ran the command on all binaries. I expected a few bytes to stand out, such as NUL, 1, and 255, but never ‘H’.
The input for the graph is the count of each byte, sorted. The y-axis represents the count, and the x-axis represents the line number (1-256, as a byte can only take on 256 different values). The y-axis is log scale, so the difference is bigger than exponential.
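For reference, the same per-byte count can be sketched with POSIX tools only, without Perl or plotpipe. This is a minimal sketch, not the command used for the graph; byte_hist is a hypothetical helper name, and it prints byte values in hex rather than as raw characters.

```shell
# Count how often each byte value occurs on stdin.
# Prints "count hex" lines, rarest byte first, most common byte last.
# byte_hist is a hypothetical helper name used only for this sketch.
byte_hist() {
  od -An -v -tx1 |     # dump every byte as two hex digits, no offsets
    tr -s ' ' '\n' |   # put each byte on its own line
    grep . |           # drop the empty lines left by the split
    sort | uniq -c |   # count identical byte values
    sort -n            # order by count, ascending
}

# Tiny demo input: three 'H' bytes, two NUL bytes, one 'a'.
printf 'HHH\000\000a' | byte_hist
```

Piping cat /usr/bin/* into byte_hist | tail should list the most common bytes last, comparable to the listing above but in hex.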
That would be the 64 bit operand size prefix of amd64 machine code instructions.
You’ll notice it only happens on amd64 executables.
If you compare the /bin/* files of
http://ftp.debian.org/debian/pool/main/c/coreutils/coreutils_9.1-1_arm64.deb,
http://ftp.debian.org/debian/pool/main/c/coreutils/coreutils_9.1-1_amd64.deb and
http://ftp.debian.org/debian/pool/main/c/coreutils/coreutils_9.1-1_i386.deb, you’ll see:
$ for f (coreutils_9.1-1_*.deb) bsdtar xOf $f da* | bsdtar xO ./bin/* | xxd -p -c1 | sort | uniq -c | sort -rn | head -n 5 | grep -H --label="${${f:r}##*_}" .
amd64: 692417 00
amd64: 145689 ff
amd64: 81911 48
amd64: 48006 89
amd64: 45331 0f
arm64:1409826 00
arm64: 70391 ff
arm64: 67915 03
arm64: 49380 20
arm64: 41655 40
i386: 515346 00
i386: 171643 ff
i386: 78361 0e
i386: 69317 24
i386: 50497 83
0x48 (72, ‘H’) is only in the top 3 on amd64.
For ls on my amd64 Debian system:
$ xxd -p -c1 =ls | sort | uniq -c | sort -rn | head -n 5
39187 00
7827 ff
5565 48
4181 20
3393 0f
If we disassemble the code in that executable, we find a lot of 0x48 bytes in the instructions:
$ objdump -d =ls | grep -cw 48
5353
Most of them are in first position:
$ objdump -d =ls | grep -wm10 48
4000: 48 83 ec 08 sub $0x8,%rsp
4004: 48 8b 05 ad ff 01 00 mov 0x1ffad(%rip),%rax # 23fb8 <__gmon_start__@Base>
400b: 48 85 c0 test %rax,%rax
4012: 48 83 c4 08 add $0x8,%rsp
44b6: 68 48 00 00 00 push $0x48
4751: 48 89 f3 mov %rsi,%rbx
4754: 48 83 ec 68 sub $0x68,%rsp
4758: 48 8b 3e mov (%rsi),%rdi
475b: 64 48 8b 04 25 28 00 mov %fs:0x28,%rax
4764: 48 89 44 24 58 mov %rax,0x58(%rsp)
$ objdump -d =ls | grep -Pc '^\s*[\da-f]+:\s+48'
5113
According to http://ref.x86asm.net/geek.html#x48, that 0x48 is the 64-bit operand size REX.W opcode prefix, which specifies that the operation is to be performed on 64-bit operands instead of the instruction's default operand size.
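To see why it is 0x48 in particular, it helps to look at the bits. In the x86-64 encoding, REX prefixes occupy 0x40 to 0x4f: the high nibble is 0100 and the low four bits are the W, R, X and B flags, so 0x48 is the prefix with only W set. The sketch below decodes that; rex_decode is a hypothetical helper name, and it only inspects a single byte value, not a real instruction stream.

```shell
# Decode a single x86-64 REX prefix byte (0x40-0x4f) into its flag bits.
# rex_decode is a hypothetical helper name used only for this sketch.
rex_decode() {
  byte=$(( $1 ))                 # accepts decimal (72) or hex (0x48)
  if [ $(( byte & 0xf0 )) -ne $(( 0x40 )) ]; then
    echo "not a REX prefix"
    return 1
  fi
  w=$(( (byte >> 3) & 1 ))       # W: 64-bit operand size
  r=$(( (byte >> 2) & 1 ))       # R: extends the ModRM reg field
  x=$(( (byte >> 1) & 1 ))       # X: extends the SIB index field
  b=$(( byte & 1 ))              # B: extends ModRM rm / SIB base / opcode reg
  echo "W=$w R=$r X=$x B=$b"
}

rex_decode 0x48   # prints W=1 R=0 X=0 B=0
rex_decode 0x4c   # prints W=1 R=1 X=0 B=0
```

Any 64-bit operation that only touches the eight original registers needs no R, X or B bit, which would fit with plain 0x48 being so much more common than the other REX values.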
$ objdump -d =ls | pcregrep -o1 -o2 '^\s*[\da-f]+:\s+(48 .. ).*?\t(\S+)' | sort | uniq -c | sort -rn | head
1512 48 89 mov
1040 48 8b mov
630 48 8d lea
372 48 85 test
326 48 83 add
198 48 39 cmp
158 48 83 sub
79 48 01 add
72 48 83 cmp
69 48 c7 movq
All of these are instructions operating on 64-bit operands.