Skip to content

Clang 11 and GCC 8 O2 Breaks Inline Assembly

I have a short snippet of code, with some inline assembly that prints argv[0] properly in O0, but does not print anything in O2 (when using Clang. GCC, on the other hand, prints the string stored in envp[0] when printing argv[0]). This problem is also restricted to only argv (the other two function parameters can be used as expected with or without optimizations enabled). I tested this with both GCC and Clang, and both compilers have this issue.

Here is the code:

void exit(unsigned long long status) {
    asm volatile("movq $60, %%rax;" //system call 60 is exit
        "movq %0, %%rdi;" //return code 0
        : //no outputs
        :"rax", "rdi");

int open(const char *pathname, unsigned long long flags) {
    asm volatile("movq $2, %%rax;" //system call 2 is open
        "movq %0, %%rdi;"
        "movq %1, %%rsi;"
        : //no outputs
        :"r"(pathname), "r"(flags)
        :"rax", "rdi", "rsi");
        return 1;

int write(unsigned long long fd, const void *buf, size_t count) {
    asm volatile("movq $1, %%rax;" //system call 1 is write
        "movq %0, %%rdi;"
        "movq %1, %%rsi;"
        "movq %2, %%rdx;"
        : //no outputs
        :"r"(fd), "r"(buf), "r"(count)
        :"rax", "rdi", "rsi", "rdx");
        return 1;

static void entry(unsigned long long argc, char** argv, char** envp);

/* "The calling convention of the System V AMD64 ABI is followed on GNU/Linux. The registers RDI, RSI, RDX, RCX, R8, and R9 are used for integer and memory address arguments
and XMM0, XMM1, XMM2, XMM3, XMM4, XMM5, XMM6 and XMM7 are used for floating point arguments.
For system calls, R10 is used instead of RCX. Additional arguments are passed on the stack and the return value is stored in RAX."*/

//__attribute__((naked)) defines a pure-assembly function
__attribute__((naked)) void _start() {
    asm volatile("xor %%rbp,%%rbp;" // "%ebp,%ebp sets %ebp to zero. This is suggested by the ABI (Application Binary Interface specification), to mark the outermost frame."
    "pop %%rdi;" //rdi: arg1: argc -- can be popped off the stack because it is copied onto register
    "mov %%rsp, %%rsi;" //rsi: arg2: argv
    "mov %%rdi, %%rdx;"
    "shl $3, %%rdx;" //each argv pointer takes up 8 bytes (so multiply argc by 8)
    "add $8, %%rdx;" //add size of null word at end of argv-pointer array (8 bytes)
    "add %%rsp, %%rdx;" //rdx: arg3: envp
    "andq $-16, %%rsp;" //align stack to 16-bits (which is required on x86-64)
    "jmp %P0" // "After looking at the GCC source code, it's not exactly clear what the code P in front of a constraint means. But, among other things, it prevents GCC from putting a $ in front of constant values. Which is exactly what I need in this case."
    :"rdi", "rsp", "rsi", "rdx", "rbp", "memory");

//Function cannot be optimized-away, since it is passed-in as an argument to asm-block above
//Compiler Options: -fno-asynchronous-unwind-tables;-O2;-Wall;-nostdlibinc;-nobuiltininc;-fno-builtin;-nostdlib; -nodefaultlibs;--no-standard-libraries;-nostartfiles;-nostdinc++
//Linker Options: -nostdlib; -nodefaultlibs
static void entry(unsigned long long argc, char** argv, char** envp) {
    int ttyfd = open("/dev/tty", O_WRONLY);

    write(ttyfd, argv[0], 9);
    write(ttyfd, "n", 1);


Edit: Added syscall definitions.

Edit: Adding rcx and r11 to the clobber list for the syscalls fixed the issue for clang, but gcc to have the error.

Edit: GCC actually was not having an error, but some kind of strange error in my build system (CodeLite) made it so that the program ran some kind of partially-built program, even though GCC reported errors about it not recognizing two of the compiler flags passed-in. For GCC, use these flags instead: -fomit-frame-pointer;-fno-asynchronous-unwind-tables;-O2;-Wall;-nostdinc;-fno-builtin;-nostdlib; -nodefaultlibs;–no-standard-libraries;-nostartfiles;-nostdinc++. You can also use these flags for Clang, due to Clang’s support for the above GCC options.



  1. You can’t use extended asm in a naked function, only basic asm, according to the gcc manual. You don’t need to inform the compiler of clobbered registers (since it won’t do anything about them anyway; in a naked function you are responsible for all register management). And passing the address of entry in an extended operand is unnecessary; just do jmp entry.

    (In my tests your code doesn’t compile at all, so I assume you weren’t showing us your exact code – next time please do, so as to avoid wasting people’s time.)

  2. Linux x86-64 syscall system calls are allowed to clobber the rcx and r11 registers, so you need to add those to the clobber lists of your system calls.

  3. You align the stack to a 16-byte boundary before jumping to entry. However, the 16-byte alignment rule is based on the assumption that you will be calling the function with call, which would push an additional 8 bytes onto the stack. As such, the called function actually expects the stack to initially be, not a multiple of 16, but 8 more or less than a multiple of 16. So you are actually aligning the stack incorrectly, and this can be a cause of all sorts of mysterious trouble.

    So either replace your jmp with call, or else subtract a further 8 bytes from rsp (or just push some 64-bit register of your choice).

  4. Style note: unsigned long is already 64 bits on Linux x86-64, so it would be more idiomatic to use that in place of unsigned long long everywhere.

  5. General hint: learn about register constraints in extended asm. You can have the compiler load your desired registers for you, instead of writing instructions in your asm to do it yourself. So your exit function could instead look like:

    void exit(unsigned long status) {
        asm volatile("syscall"
            : //no outputs
            :"a"(60), "D" (status)
            :"rcx", "r11");

This in particular saves you a few instructions, since status is already in the %rdi register on function entry. With your original code, the compiler has to move it somewhere else so that you can then load it into %rdi yourself.

  1. Your open function always returns 1, which will typically not be the fd that was actually opened. So if your program is run with standard output redirected, your program will write to the redirected stdout, instead of to the tty as it seems to want to do. Indeed, this makes the open syscall completely pointless, because you never use the file you opened.

    You should arrange for open to return the value that was actually returned by the system call, which will be left in the %rax register when syscall returns. You can use an output operand to have this stored in a temporary variable (which the compiler will likely optimize out), and return that. You’ll need to use a digit constraint since it is going in the same register as an input operand. I leave this as an exercise for you. It would likewise be nice if your write function actually returned the number of bytes written.

7 People found this is helpful