Linux perf_events annotation frame pointer confusion

I ran sudo perf record -F 99 find / followed by sudo perf report and selected “Annotate fdopendir” and here are the first seven instructions:

push %rbp
push %rbx
mov %edi,%esi
mov %edi,%ebx
mov $0x1,%edi
sub $0xa8,%rsp
mov %rsp,%rbp

The first instruction appears to be saving the caller’s base frame pointer. I believe instructions 2 through 5 are irrelevant to this question but are here for completeness. Instructions 6 and 7 are confusing to me. Shouldn’t the copy of rsp into rbp occur before subtracting 0xa8 from rsp?


Answer

The x86-64 System V ABI doesn’t require making a traditional / legacy stack-frame. This looks close to a traditional stack frame setup, but it’s definitely not because there’s no mov %rsp, %rbp right after the first push %rbp.

We’re seeing compiler-generated code that simply uses RBP as a temporary register, and is using it to hold a pointer to a local on the stack. It’s just a coincidence that this happens to involve the instruction mov %rsp, %rbp sometime after push %rbp. This is not making a stack frame.

In x86-64 System V, RBX and RBP are the only two “low 8” registers that are call-preserved, and thus usable without REX prefixes in some cases (e.g. for push/pop, and when used in addressing modes), saving code size. GCC prefers to use them before saving/restoring any of R12..R15 (see “What registers are preserved through a linux x86-64 function call”). For pointers, copying them with mov always requires a REX prefix for 64-bit operand-size, so there are fewer savings than for 32-bit integers, but gcc still goes for RBX then RBP, in that order, when it needs to save/restore call-preserved regs in a function.
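To make the code-size argument concrete, here is a minimal C sketch of how the REX prefix rule plays out for push and for register-to-register mov. The function names encode_push_r64 and encode_mov_rr are made up for illustration, and the sketch only covers these two instruction forms; it is not a general assembler.

```c
#include <stddef.h>
#include <stdint.h>

/* Hardware register numbers: 0=rax 1=rcx 2=rdx 3=rbx 4=rsp 5=rbp 6=rsi 7=rdi,
 * 8..15 = r8..r15. Both helpers are hypothetical, for illustration only. */

/* Encode "push %reg" (64-bit). Returns the number of bytes written. */
size_t encode_push_r64(unsigned reg, uint8_t out[2])
{
    size_t n = 0;
    if (reg >= 8)
        out[n++] = 0x41;                    /* REX.B selects r8..r15 */
    out[n++] = (uint8_t)(0x50 + (reg & 7)); /* opcode 50+rd */
    return n;
}

/* Encode "mov %src,%dst" between registers (opcode 0x89, AT&T operand order).
 * is64 selects 64-bit operand size, which always needs REX.W. */
size_t encode_mov_rr(unsigned src, unsigned dst, int is64, uint8_t out[3])
{
    size_t n = 0;
    uint8_t rex = (uint8_t)(0x40 | (is64 ? 0x08 : 0)
                            | ((src >> 3) << 2)   /* REX.R extends the reg field */
                            | (dst >> 3));        /* REX.B extends the r/m field */
    if (rex != 0x40)
        out[n++] = rex;                     /* emit a prefix only when needed */
    out[n++] = 0x89;
    out[n++] = (uint8_t)(0xC0 | ((src & 7) << 3) | (dst & 7)); /* ModRM, mod=11 */
    return n;
}
```

With these, push %rbp and push %rbx come out as one byte each (55 and 53), while push %r12 takes two (41 54). A 32-bit mov %edi,%ebx is two bytes (89 fb), but the pointer copy mov %rsp,%rbp is three (48 89 e5) because of the mandatory REX.W.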

Disassembly of /lib/libc.so.6 (glibc) on my system (Arch Linux) shows similar but different code-gen for fdopendir. You stopped the disassembly too soon, before it makes a function call. That sheds some light on why it wanted a call-preserved temporary register: it wanted the var in a reg across the call.

00000000000c1260 <fdopendir>:
   c1260:       55                      push   %rbp
   c1261:       89 fe                   mov    %edi,%esi
   c1263:       53                      push   %rbx
   c1264:       89 fb                   mov    %edi,%ebx
   c1266:       bf 01 00 00 00          mov    $0x1,%edi
   c126b:       48 81 ec a8 00 00 00    sub    $0xa8,%rsp
   c1272:       64 48 8b 04 25 28 00 00 00      mov    %fs:0x28,%rax    # stack-check cookie
   c127b:       48 89 84 24 98 00 00 00         mov    %rax,0x98(%rsp)
   c1283:       31 c0                   xor    %eax,%eax
   c1285:       48 89 e5                mov    %rsp,%rbp      # save a pointer
   c1288:       48 89 ea                mov    %rbp,%rdx      # and pass it as a function arg
   c128b:       e8 90 7d 02 00          callq  e9020 <__fxstat>
   c1290:       85 c0                   test   %eax,%eax
   c1292:       78 6a                   js     c12fe <fdopendir+0x9e>
   c1294:       8b 44 24 18             mov    0x18(%rsp),%eax
   c1298:       25 00 f0 00 00          and    $0xf000,%eax
   c129d:       3d 00 40 00 00          cmp    $0x4000,%eax
   c12a2:       75 4c                   jne    c12f0 <fdopendir+0x90>
   ....

   c12c1:       48 89 e9                mov    %rbp,%rcx      # pass the pointer as the 4th arg
   c12c4:       89 c2                   mov    %eax,%edx
   c12c6:       31 f6                   xor    %esi,%esi
   c12c8:       89 df                   mov    %ebx,%edi
   c12ca:       e8 d1 f7 ff ff          callq  c0aa0 <__alloc_dir>
   c12cf:       48 8b 8c 24 98 00 00 00         mov    0x98(%rsp),%rcx
   c12d7:       64 48 33 0c 25 28 00 00 00      xor    %fs:0x28,%rcx     # check the stack cookie
   c12e0:       75 38                   jne    c131a <fdopendir+0xba>
   c12e2:       48 81 c4 a8 00 00 00    add    $0xa8,%rsp
   c12e9:       5b                      pop    %rbx
   c12ea:       5d                      pop    %rbp
   c12eb:       c3                      retq   
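For orientation, the shape of the listing (one local struct on the stack whose address is handed to two different calls, with a file-type check in between) can be sketched in C roughly as follows. This is not glibc's source: fd_is_directory, second_consumer, mode_is_dir_like_asm and path_is_directory are hypothetical names, second_consumer merely stands in for the second callee, and the elided middle of the listing is not modelled. The literal masks are the Linux values of S_IFMT and S_IFDIR.

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Mirrors the "and $0xf000 / cmp $0x4000" pair from the listing.
 * Assumption: Linux values, where S_IFMT == 0xf000 and S_IFDIR == 0x4000. */
int mode_is_dir_like_asm(unsigned mode)
{
    return (mode & 0xf000u) == 0x4000u;
}

/* Hypothetical stand-in for the second callee that receives &st. */
static int second_consumer(int fd, const struct stat *st)
{
    (void)fd;
    return S_ISDIR(st->st_mode) ? 1 : 0;
}

/* Returns 1 if fd refers to a directory, 0 if not, -1 if fstat fails. */
int fd_is_directory(int fd)
{
    struct stat st;                  /* local: lives in this function's frame */

    if (fstat(fd, &st) < 0)          /* first use of &st, passed as an argument */
        return -1;
    if (!mode_is_dir_like_asm((unsigned)st.st_mode))
        return 0;
    return second_consumer(fd, &st); /* second use of &st, after the first call */
}

/* Convenience wrapper so the sketch can be tried on a path. */
int path_is_directory(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    int result = fd_is_directory(fd);
    close(fd);
    return result;
}
```

Because st sits at a fixed offset from RSP for the whole function body, the compiler can recompute &st from RSP at either call site; nothing in the source forces it into a call-preserved register.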

The code-gen in this listing is pretty silly: gcc could have simply used mov %rsp, %rcx the second time it needed the pointer. I’d call this a missed optimization. It never needed that pointer in a call-preserved register because it always knew where it was relative to RSP.

(Even if it hadn’t been exactly at RSP+0, lea something(%rsp), %rdx and lea something(%rsp), %rcx would have been totally fine the two times it was needed, with probably less total cost than saving/restoring RBP + the required mov instructions.)

Or, having set up RBP anyway, it could have used mov 0x18(%rbp),%eax instead of mov 0x18(%rsp),%eax to save a byte of code size: an addressing mode with RSP as the base always needs a SIB byte. Avoiding direct references to RSP between function calls also reduces the number of stack-sync uops Intel CPUs need to insert.
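The one-byte difference can be checked with a small C sketch that encodes a 32-bit load from disp8(base). The helper encode_load32_disp8 is a hypothetical name, and it only handles the low eight registers with an 8-bit displacement, which is all the example needs.

```c
#include <stddef.h>
#include <stdint.h>

/* Hardware register numbers: 0=rax 1=rcx 2=rdx 3=rbx 4=rsp 5=rbp 6=rsi 7=rdi.
 * Encodes "mov disp8(%base),%dst" with 32-bit operand size (opcode 0x8B).
 * Hypothetical helper: only registers 0..7 and 8-bit displacements. */
size_t encode_load32_disp8(unsigned dst, unsigned base, int8_t disp,
                           uint8_t out[4])
{
    size_t n = 0;
    out[n++] = 0x8B;
    out[n++] = (uint8_t)(0x40 | ((dst & 7) << 3) | (base & 7)); /* ModRM, mod=01 */
    if ((base & 7) == 4)
        out[n++] = 0x24;          /* r/m=100 means "SIB follows": no index, base=rsp */
    out[n++] = (uint8_t)disp;
    return n;
}
```

For base = rsp this produces 8b 44 24 18 (four bytes, matching the listing), while base = rbp gives 8b 45 18 (three bytes).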

User contributions licensed under: CC BY-SA