Windows C Runtime toupper slow when locale set

Question

I'm diagnosing an edge case in a cross-platform (Windows and Linux) application where toupper is substantially slower on Windows. I'm assuming the same holds for tolower. Originally I tested this with a simple C program on each platform, without setting any locale information or even including the header file, and there was very little performance difference. Test was

Accepted Answer

Identical (and fairly good) performance with LANG=C vs. LANG=anything else is expected for the glibc implementation used by Linux. Your Linux results make sense, and your testing method is probably OK. Use a profiler to see how much time your microbenchmark spends inside the Windows functions. If the Windows implementation does turn out to be the problem, maybe there’s a Windows function that can convert whole strings, like the C++ boost::to_upper_copy (unless that’s even slower, see below).

Also note that upcasing ASCII strings can be SIMD-vectorized pretty efficiently. I wrote a case-flip function for a single vector in another answer, using C SSE intrinsics; it can be adapted to upcase instead of flipcase. This should be a huge speedup if you spend a lot of time upcasing strings that are more than 16 bytes long and that you know are ASCII.

Actually, Boost’s to_upper_copy() appears to compile to extremely slow code, like 10x slower than toupper. See that link for my vectorized strtoupper(dst,src), which is ASCII-only but could be extended with a fallback when non-ASCII src bytes are detected.

How does your current code handle UTF-8? There’s not much gain in supporting non-ASCII locales if you assume that all characters are a single byte. IIRC, Windows uses UTF-16 for most stuff, which is unfortunate because it turned out that the world wanted more than 2^16 codepoints. UTF-16 is a variable-length encoding of Unicode, like UTF-8 but without the advantage of being ASCII-compatible. Fixed-width has a lot of advantages, but unfortunately you can’t assume that even with UTF-16. Java made this mistake, too, and is stuck with UTF-16.

The glibc source is:

    #define __ctype_toupper ((int32_t *) _NL_CURRENT (LC_CTYPE, _NL_CTYPE_TOUPPER) + 128)

    int toupper (int c)
    {
      return c >= -128 && c < 256 ? __ctype_toupper[c] : c;
    }

The asm from x86-64 Ubuntu 15.10’s /lib/x86_64-linux-gnu/libc.so.6 is:

    ## disassembly from  objconv -fyasm -v2 /lib/x86_64-linux-gnu/libc.so.6 /dev/stdout 2>&1
    toupper:
        lea     edx, [rdi+80H]              ; 0002E300 _ 8D. 97, 00000080
        movsxd  rax, edi                    ; 0002E306 _ 48: 63. C7
        cmp     edx, 383                    ; 0002E309 _ 81. FA, 0000017F
        ja      ?_01766                     ; 0002E30F _ 77, 19
        mov     rdx, qword [rel ?_37923]    ; 0002E311 _ 48: 8B. 15, 00395AA8(rel)
        sub     rax, -128                   ; 0002E318 _ 48: 83. E8, 80
        mov     rdx, qword [fs:rdx]         ; 0002E31C _ 64 48: 8B. 12
        mov     rdx, qword [rdx]            ; 0002E320 _ 48: 8B. 12
        mov     rdx, qword [rdx+48H]        ; 0002E323 _ 48: 8B. 52, 48
        mov     eax, dword [rdx+rax*4]      ; 0002E327 _ 8B. 04 82    ## the final table lookup, indexing an array of 4B ints
    ?_01766:
        rep ret                             ; actual objconv output shows the prefix on a separate line

So it takes an early-out if the arg isn’t in the -128 to 255 range that the source checks for (so this branch should predict perfectly as not-taken). Otherwise it finds the table for the current locale, which involves three pointer dereferences: one load from a global, one thread-local load, and one more dereference. Then it actually indexes into the table, which has one 4-byte entry for each of the 384 values in that range.

This is the entire library function; the toupper label in the disassembly is what your code calls. (Well, through a layer of indirection through the PLT because of dynamic linking, but after the first call triggers lazy symbol lookup, it’s just one extra jmp instruction between your code and those 11 instructions in the library.)
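
To make the shape of that glibc function concrete, here is a minimal sketch in portable C of the same idea: one range check followed by one load from a table whose base pointer is offset by 128, so that indices from -128 to 255 are valid. This is not glibc's code. The names upper_storage, upper_table, init_upper_table and sketch_toupper are hypothetical, and the table is filled with ASCII-only mappings as an assumption, where glibc would instead locate the current locale's table through the pointer chain shown in the disassembly.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a locale's table: 384 entries, one for each
   c in -128 .. 255, mirroring the "+ 128" offset in the glibc macro. */
static int32_t upper_storage[384];

/* Points at the entry for c == 0, so negative indices down to -128 work. */
static const int32_t *upper_table = upper_storage + 128;

/* Set once the table has been filled (assumption: single-threaded setup). */
static int table_ready = 0;

/* Fill the table with ASCII-only mappings; everything else maps to itself. */
static void init_upper_table(void)
{
    for (int c = -128; c < 256; c++) {
        int32_t mapped = c;
        if (c >= 'a' && c <= 'z')
            mapped = c - ('a' - 'A');
        upper_storage[c + 128] = mapped;
    }
    table_ready = 1;
}

/* Same shape as the glibc function: a range check, then a table load.
   The assert only catches use before init and compiles out with NDEBUG. */
static int sketch_toupper(int c)
{
    assert(table_ready);
    return c >= -128 && c < 256 ? upper_table[c] : c;
}
```

After a single call to init_upper_table(), sketch_toupper('a') returns 'A', and any argument outside -128 to 255 comes back unchanged. The per-call cost here is one compare and one load; the real function adds the extra dereferences needed to find the current thread's locale table.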

Answer