I ran sudo perf record -F 99 find / followed by sudo perf report, then selected “Annotate fdopendir”. Here are the first seven instructions:
push %rbp
push %rbx
mov %edi,%esi
mov %edi,%ebx
mov $0x1,%edi
sub $0xa8,%rsp
mov %rsp,%rbp
The first instruction appears to be saving the caller’s base frame pointer. I believe instructions 2 through 5 are irrelevant to this question, but they are here for completeness. Instructions 6 and 7 are confusing to me. Shouldn’t the copy of rsp into rbp occur before subtracting 0xa8 from rsp?
Answer
The x86-64 System V ABI doesn’t require making a traditional / legacy stack frame. This looks close to a traditional stack-frame setup, but it’s definitely not one, because there’s no mov %rsp, %rbp right after the first push %rbp.
We’re seeing compiler-generated code that simply uses RBP as a temporary register, and is using it to hold a pointer to a local on the stack. It’s just a coincidence that this happens to involve the instruction mov %rsp, %rbp sometime after push %rbp. This is not making a stack frame.
In x86-64 System V, RBX and RBP are the only 2 “low 8” registers that are call-preserved, and thus usable without REX prefixes in some cases (e.g. for the push/pop, and when used in addressing modes), saving code-size. GCC prefers to use them before saving/restoring any of R12..R15 (see “What registers are preserved through a linux x86-64 function call”). (For pointers, copying them with mov always requires a REX prefix for 64-bit operand-size, so there are fewer savings than for 32-bit integers, but gcc still goes for RBX then RBP, in that order, when it needs to save/restore call-preserved regs in a function.)
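The code-size argument can be made concrete with a small C sketch that computes instruction lengths from the standard x86-64 encoding rules. The function names here (push64_length, push64_opcode, mov_rr_length) are made up for this illustration and come from no real library; registers are identified by their hardware encoding number.

```c
#include <assert.h>
#include <stdint.h>

/* Register numbers follow the hardware encoding:
   rax=0 rcx=1 rdx=2 rbx=3 rsp=4 rbp=5 rsi=6 rdi=7, r8..r15 = 8..15.
   All helper names below are hypothetical, for illustration only. */

/* Size in bytes of `push %reg` for a 64-bit register. */
unsigned push64_length(unsigned reg)
{
    assert(reg < 16);
    return reg >= 8 ? 2u : 1u;   /* r8..r15 need a REX.B prefix (0x41) */
}

/* The opcode byte of `push %reg` (the byte after any REX prefix). */
uint8_t push64_opcode(unsigned reg)
{
    assert(reg < 16);
    return (uint8_t)(0x50 + (reg & 7));
}

/* Size in bytes of a register-to-register `mov` (opcode 0x89 + ModRM).
   A REX prefix is needed for 64-bit operand size (REX.W) or whenever
   either register is r8..r15 (REX.R / REX.B). */
unsigned mov_rr_length(unsigned dst, unsigned src, int is_64bit)
{
    assert(dst < 16 && src < 16);
    int needs_rex = is_64bit || dst >= 8 || src >= 8;
    return 2u + (needs_rex ? 1u : 0u);
}
```

So push %rbx is the single byte 53, while push %r12 would be 41 54. For copies, mov %edi,%ebx is 2 bytes (89 fb), but mov %rsp,%rbp is 3 bytes (48 89 e5) because of the REX.W prefix, which is why the saving is smaller for pointers.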
Disassembly of /lib/libc.so.6 (glibc) on my system (Arch Linux) shows similar but not identical code-gen for fdopendir. You stopped the disassembly too soon, before it makes a function call. That sheds some light on why it wanted a call-preserved temporary register: it wanted the var in a reg across the call.
00000000000c1260 <fdopendir>:
   c1260: 55                            push   %rbp
   c1261: 89 fe                         mov    %edi,%esi
   c1263: 53                            push   %rbx
   c1264: 89 fb                         mov    %edi,%ebx
   c1266: bf 01 00 00 00                mov    $0x1,%edi
   c126b: 48 81 ec a8 00 00 00          sub    $0xa8,%rsp
   c1272: 64 48 8b 04 25 28 00 00 00    mov    %fs:0x28,%rax        # stack-check cookie
   c127b: 48 89 84 24 98 00 00 00       mov    %rax,0x98(%rsp)
   c1283: 31 c0                         xor    %eax,%eax
   c1285: 48 89 e5                      mov    %rsp,%rbp            # save a pointer
   c1288: 48 89 ea                      mov    %rbp,%rdx            # and pass it as a function arg
   c128b: e8 90 7d 02 00                callq  e9020 <__fxstat>
   c1290: 85 c0                         test   %eax,%eax
   c1292: 78 6a                         js     c12fe <fdopendir+0x9e>
   c1294: 8b 44 24 18                   mov    0x18(%rsp),%eax
   c1298: 25 00 f0 00 00                and    $0xf000,%eax
   c129d: 3d 00 40 00 00                cmp    $0x4000,%eax
   c12a2: 75 4c                         jne    c12f0 <fdopendir+0x90>
   ....
   c12c1: 48 89 e9                      mov    %rbp,%rcx            # pass the pointer as the 4th arg
   c12c4: 89 c2                         mov    %eax,%edx
   c12c6: 31 f6                         xor    %esi,%esi
   c12c8: 89 df                         mov    %ebx,%edi
   c12ca: e8 d1 f7 ff ff                callq  c0aa0 <__alloc_dir>
   c12cf: 48 8b 8c 24 98 00 00 00       mov    0x98(%rsp),%rcx
   c12d7: 64 48 33 0c 25 28 00 00 00    xor    %fs:0x28,%rcx        # check the stack cookie
   c12e0: 75 38                         jne    c131a <fdopendir+0xba>
   c12e2: 48 81 c4 a8 00 00 00          add    $0xa8,%rsp
   c12e9: 5b                            pop    %rbx
   c12ea: 5d                            pop    %rbp
   c12eb: c3                            retq
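For orientation, here is a rough C sketch of the source shape that would lead a compiler to want the address of a stack local both before and after a call. This is not glibc’s actual fdopendir source: fdopendir_shape and use_stat are hypothetical names, use_stat merely stands in for the internal __alloc_dir call, and the public fstat is used where the disassembly calls __fxstat.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/stat.h>

/* Hypothetical stand-in for glibc's internal __alloc_dir; only the
   pointer-to-local argument matters for this illustration. */
static int use_stat(int fd, const struct stat *st)
{
    assert(st != NULL);
    return S_ISDIR(st->st_mode) ? fd : -1;
}

/* Sketch only: mirrors the call pattern in the disassembly, not the
   real implementation. Returns fd on success, -1 on failure. */
int fdopendir_shape(int fd)
{
    struct stat st;              /* a local inside the stack area reserved by sub $0xa8,%rsp */

    if (fstat(fd, &st) < 0)      /* first use of &st (the js branch) */
        return -1;
    if (!S_ISDIR(st.st_mode))    /* the and $0xf000 / cmp $0x4000 test */
        return -1;
    return use_stat(fd, &st);    /* second use of &st, after a call */
}
```

Because &st is needed on both sides of the first call, the compiler has to either keep it in a call-preserved register or recompute it from RSP. On Linux the file-type mask is 0xf000 and the directory type is 0x4000, which is where the two constants in the listing come from.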
This is pretty silly code-gen; gcc could have simply used mov %rsp, %rcx the 2nd time it needed it. I’d call this a missed optimization. It never needed that pointer in a call-preserved register because it always knew where it was relative to RSP.
(Even if it hadn’t been exactly at RSP+0, lea something(%rsp), %rdx and lea something(%rsp), %rcx would have been totally fine the two times it was needed, with probably less total cost than saving/restoring RBP + the required mov instructions.)
Or, having put the pointer in RBP anyway, it could have used mov 0x18(%rbp),%eax instead of addressing relative to RSP, saving a byte of code-size in that addressing mode (RSP as a base register requires a SIB byte). Avoiding direct references to RSP between function calls also reduces the number of stack-sync uops Intel CPUs need to insert.
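The one-byte difference follows from the ModRM/SIB encoding rules, which this small C sketch spells out. The function name load32_disp8_length is hypothetical, and the sketch only covers the simple case of low-8 registers with an 8-bit displacement.

```c
#include <assert.h>

/* Length in bytes of `mov disp8(%base), %r32` where both the base and
   the destination are low-8 registers, so no REX prefix is involved.
   Register numbers use the hardware encoding (rsp=4, rbp=5).
   Hypothetical helper, for illustration only. */
unsigned load32_disp8_length(unsigned base)
{
    assert(base < 8);
    unsigned len = 1u   /* opcode 0x8b */
                 + 1u   /* ModRM byte  */
                 + 1u;  /* disp8       */
    if (base == 4)      /* ModRM r/m=100 means "SIB follows", so rsp needs one */
        len += 1u;
    return len;
}
```

With RSP as the base the load is the 4 bytes seen in the listing (8b 44 24 18); with RBP as the base it would be 3 bytes (8b 45 18).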