Linux perf_events annotation frame pointer confusion

Question

I ran sudo perf record -F 99 find / followed by sudo perf report and selected “Annotate fdopendir”, and here are the first seven instructions:

    push %rbp
    push %rbx
    mov %edi,%esi
    mov %edi,%ebx
    mov $0x1,%edi
    sub $0xa8,%rsp
    mov %rsp,%rbp

The first instruction appears to be saving the caller’s bas…

Accepted Answer

The x86-64 System V ABI doesn’t require a traditional (legacy) stack frame. This looks close to a traditional stack-frame setup, but it definitely isn’t one, because there’s no mov %rsp, %rbp right after the first push %rbp.

We’re seeing compiler-generated code that simply uses RBP as a temporary register, here to hold a pointer to a local on the stack. It’s just a coincidence that this happens to involve the instruction mov %rsp, %rbp sometime after push %rbp. This is not making a stack frame.

In x86-64 System V, RBX and RBP are the only two “low 8” registers that are call-preserved, and thus usable without REX prefixes in some cases (e.g. for the push/pop, and when used in addressing modes), which saves code size. GCC prefers to use them before saving/restoring any of R12..R15. (See “What registers are preserved through a linux x86-64 function call”.)

(For pointers, copying them with mov always requires a REX prefix for 64-bit operand-size, so there are fewer savings than for 32-bit integers, but gcc still goes for RBX then RBP, in that order, when it needs to save/restore call-preserved regs in a function.)

Disassembly of /lib/libc.so.6 (glibc) on my system (Arch Linux) shows similar but different code-gen for fdopendir. You stopped the disassembly too soon, before it makes a function call.
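As a rough illustration of the code-size point about RBX and RBP, the sketch below hand-encodes push and register-to-register mov in Python. The helper names (encode_push, encode_mov_rr) are made up for this example and cover only these two instruction forms; this is not a general assembler.

```python
# Register numbers as used in x86-64 instruction encoding.
REG_NUM = {
    "rax": 0, "rcx": 1, "rdx": 2, "rbx": 3,
    "rsp": 4, "rbp": 5, "rsi": 6, "rdi": 7,
    "r8": 8, "r9": 9, "r10": 10, "r11": 11,
    "r12": 12, "r13": 13, "r14": 14, "r15": 15,
}

# Call-preserved general-purpose registers in the x86-64 System V ABI.
# (RSP is preserved too, but it is the stack pointer, not a free temporary.)
CALL_PRESERVED = ["rbx", "rbp", "r12", "r13", "r14", "r15"]


def encode_push(reg: str) -> bytes:
    """Encode `push %reg` (hypothetical helper): opcode 0x50 + low 3 bits
    of the register number, with a REX.B prefix (0x41) for R8..R15."""
    n = REG_NUM[reg]
    prefix = b"\x41" if n >= 8 else b""
    return prefix + bytes([0x50 | (n & 7)])


def encode_mov_rr(src: str, dst: str, bits: int = 64) -> bytes:
    """Encode AT&T `mov %src, %dst` between registers with opcode 0x89.

    Registers are named by their 64-bit names; pass bits=32 for the
    32-bit form (e.g. src="rdi", dst="rbx" for `mov %edi,%ebx`).
    """
    s, d = REG_NUM[src], REG_NUM[dst]
    rex = 0x40
    if bits == 64:
        rex |= 0x08  # REX.W: 64-bit operand size always needs a prefix
    if s >= 8:
        rex |= 0x04  # REX.R extends the ModRM reg field
    if d >= 8:
        rex |= 0x01  # REX.B extends the ModRM r/m field
    modrm = 0xC0 | ((s & 7) << 3) | (d & 7)
    prefix = bytes([rex]) if rex != 0x40 else b""
    return prefix + bytes([0x89, modrm])


for reg in CALL_PRESERVED:
    code = encode_push(reg)
    print(f"push %{reg}: {code.hex(' ')} ({len(code)} byte(s))")

print("mov %rsp,%rbp:", encode_mov_rr("rsp", "rbp").hex(" "))
print("mov %edi,%ebx:", encode_mov_rr("rdi", "rbx", bits=32).hex(" "))
```

Only RBX and RBP get a one-byte push here; R12..R15 need the extra 0x41 prefix. The 64-bit mov carries a REX.W byte (0x48) no matter which registers are involved, which is why the saving is smaller for pointers than for 32-bit integers.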
Seeing the function call sheds some light on why gcc wanted a call-preserved temporary register: it wanted the variable in a register across the call.

    00000000000c1260 <fdopendir>:
       c1260: 55                            push   %rbp
       c1261: 89 fe                         mov    %edi,%esi
       c1263: 53                            push   %rbx
       c1264: 89 fb                         mov    %edi,%ebx
       c1266: bf 01 00 00 00                mov    $0x1,%edi
       c126b: 48 81 ec a8 00 00 00          sub    $0xa8,%rsp
       c1272: 64 48 8b 04 25 28 00 00 00    mov    %fs:0x28,%rax        # stack-check cookie
       c127b: 48 89 84 24 98 00 00 00       mov    %rax,0x98(%rsp)
       c1283: 31 c0                         xor    %eax,%eax
       c1285: 48 89 e5                      mov    %rsp,%rbp            # save a pointer
       c1288: 48 89 ea                      mov    %rbp,%rdx            # and pass it as a function arg
       c128b: e8 90 7d 02 00                callq  e9020 <__fxstat>
       c1290: 85 c0                         test   %eax,%eax
       c1292: 78 6a                         js     c12fe
       c1294: 8b 44 24 18                   mov    0x18(%rsp),%eax
       c1298: 25 00 f0 00 00                and    $0xf000,%eax
       c129d: 3d 00 40 00 00                cmp    $0x4000,%eax
       c12a2: 75 4c                         jne    c12f0
       ....
       c12c1: 48 89 e9                      mov    %rbp,%rcx            # pass the pointer as the 4th arg
       c12c4: 89 c2                         mov    %eax,%edx
       c12c6: 31 f6                         xor    %esi,%esi
       c12c8: 89 df                         mov    %ebx,%edi
       c12ca: e8 d1 f7 ff ff                callq  c0aa0 <__alloc_dir>
       c12cf: 48 8b 8c 24 98 00 00 00       mov    0x98(%rsp),%rcx
       c12d7: 64 48 33 0c 25 28 00 00 00    xor    %fs:0x28,%rcx        # check the stack cookie
       c12e0: 75 38                         jne    c131a
       c12e2: 48 81 c4 a8 00 00 00          add    $0xa8,%rsp
       c12e9: 5b                            pop    %rbx
       c12ea: 5d                            pop    %rbp
       c12eb: c3                            retq

This is pretty silly code-gen; gcc could have simply used mov %rsp, %rcx the second time it needed the pointer. I’d call this a missed optimization. It never needed that pointer in a call-preserved register, because it always knew where it was relative to RSP.

(Even if it hadn’t been exactly at RSP+0, lea something(%rsp), %rdx and lea something(%rsp), %rcx would have been totally fine the two times it was needed, with probably less total cost than saving/restoring RBP plus the required mov instructions.)

Or, having set up RBP, it could have used mov 0x18(%rbp),%eax instead of addressing relative to RSP, to save a byte of code size in that addressing mode. Avoiding direct references to RSP between function calls reduces the number of stack-sync uops Intel CPUs need to insert.
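To make the one-byte saving in that addressing mode concrete, here is a small Python sketch that encodes mov disp8(%base),%eax by hand. The function name encode_load32_disp8 is hypothetical, and the sketch only handles the eight legacy registers with an 8-bit displacement.

```python
# Encoding numbers for the eight legacy ("low 8") registers.
LOW8 = {
    "rax": 0, "rcx": 1, "rdx": 2, "rbx": 3,
    "rsp": 4, "rbp": 5, "rsi": 6, "rdi": 7,
}


def encode_load32_disp8(base: str, disp: int, dst: str = "rax") -> bytes:
    """Encode `mov disp8(%base), %dst32` using opcode 0x8B (illustrative only).

    ModRM is mod=01 (8-bit displacement), reg=dst, r/m=base. The r/m value
    100 does not mean RSP: it means "a SIB byte follows", so an RSP base
    costs one extra byte.
    """
    if not -128 <= disp <= 127:
        raise ValueError("this sketch only supports 8-bit displacements")
    b, d = LOW8[base], LOW8[dst]
    disp8 = disp & 0xFF
    if b == 4:
        # SIB 0x24: scale=1, no index register, base=RSP
        return bytes([0x8B, 0x40 | (d << 3) | 0b100, 0x24, disp8])
    return bytes([0x8B, 0x40 | (d << 3) | b, disp8])


via_rsp = encode_load32_disp8("rsp", 0x18)
via_rbp = encode_load32_disp8("rbp", 0x18)
print("mov 0x18(%rsp),%eax:", via_rsp.hex(" "), f"({len(via_rsp)} bytes)")
print("mov 0x18(%rbp),%eax:", via_rbp.hex(" "), f"({len(via_rbp)} bytes)")
```

The RSP-based form comes out as 8b 44 24 18, the same bytes the glibc listing shows at c1294, while the RBP-based form would be one byte shorter because it needs no SIB byte.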

Answer