Skip to content
Advertisement

Windows C Runtime toupper slow when locale set

I’m diagnosing an edge case in a cross platform (Windows and Linux) application where toupper is substantially slower on Windows. I’m assuming this is the same for tolower as well.

Originally I tested this with a simple C program on each without locale information set or even including the header file and there was very little performance difference. Test was a million iteration loop calling each character for a string to the toupper() function.

After including the header file and including the line below it’s much slower and calls a lot of the MS C runtime library locale specific functions. This is fine but the performance hit is really bad. On Linux this doesn’t appear to have any affect at all on performance.

setlocale(LC_ALL, ""); // system default locale

If I set the following it runs as fast as linux but does appear to skip all the locale functions.

setlocale(LC_ALL, NULL); // should be interpreted as the same as below?
OR
setlocale(LC_ALL, "C"); 

Note: Visual Studio 2015 for Windows 10 G++ for Linux running Cent OS

Have tried dutch settings settings and same outcome, slow on Windows no speed difference on Linux.

Am I doing something wrong or is there a bug with the locale settings on Windows or is it the other way where linux isn’t doing what it should? I haven’t done a debug on the linux app as I’m not as familiar with linux so do not know exactly what it’s doing internally. What should I test next to sort this out?

Code below for testing (Linux):

// C++ is only used for timing.  The original program is in C.
#include <stdio.h>
#include <stdlib.h>
#include <ctype.h>
#include <chrono>
#include <locale.h>

using namespace std::chrono;

void strToUpper(char *strVal);

int main()
{

    typedef high_resolution_clock Clock;
    high_resolution_clock::time_point t1 = Clock::now();

    // set locale
    //setlocale(LC_ALL,"nl_NL");
    setlocale(LC_ALL,"en_US");

    // testing string
    char str[] = "the quick brown fox jumps over the lazy dog";

    for (int i = 0; i < 1000000; i++)
    {
        strToUpper(str);
    }

    high_resolution_clock::time_point t2 = Clock::now();
    duration<double> time_span = duration_cast<duration<double>>(t2 - t1);
    printf("chrono time %2.6f:n",time_span.count());
}

void strToUpper(char *strVal)
{
    unsigned char *t;
    t = (unsigned char *)strVal;

    while (*t)
    {
        *t = toupper(*t);
        *t++;
    }
}

For windows change the local information to:

// set locale
//setlocale(LC_ALL,"nld_nld");
setlocale(LC_ALL, "english_us");

You can see the locale change from the separator in the time completed, full stop vs comma.

EDIT – Profiling data application function calls profiling As you can see above most of the time spent in child system calls from _toupper_l. Without the locale information set the toupper call does NOT call the child _toupper_l which makes it very quick.

Advertisement

Answer

Identical (and fairly good) performance with LANG=C vs. LANG=anything else is expected for the glibc implementation used by Linux.

Your Linux results make sense. Your testing method is probably ok. Use a profiler to see how much time your microbenchmark spends inside the Windows functions. If the Windows implementation does turn out to be the problem, maybe there’s a Windows function that can convert whole strings, like the C++ boost::to_upper_copy<std::string> (unless that’s even slower, see below).


Also note that upcasing ASCII strings can be SIMD vectorized pretty efficiently. I wrote a case-flip function for a single vector in another answer, using C SSE intrinsics; it can be adapted to upcase instead of flipcase. This should be a huge speedup if you spend a lot of time upcasing strings that are more than 16 bytes long, and that you know are ASCII.

Actually, Boost’s to_upper_copy() appears to compile to extremely slow code, like 10x slower than toupper. See that link for my vectorized strtoupper(dst,src), which is ASCII-only but could be extended with a fallback when non-ASCII src bytes are detected.


How does your current code handle UTF-8? There’s not much gain in supporting non-ASCII locales if you assume that all characters are a single byte. IIRC, Windows uses UTF-16 for most stuff, which is unfortunate because it turned out that the world wanted more than 2^16 codepoints. UTF-16 is a variable-length encoding of Unicode, like UTF-8 but without the advantage of reading ASCII. Fixed-width has a lot of advantage, but unfortunately you can’t assume that even with UTF-16. Java made this mistake, too, and is stuck with UTF-16.


The glibc source is:

#define __ctype_toupper 
     ((int32_t *) _NL_CURRENT (LC_CTYPE, _NL_CTYPE_TOUPPER) + 128)
int toupper (int c) {
    return c >= -128 && c < 256 ? __ctype_toupper[c] : c;
}

The asm from the x86-64 Ubuntu 15.10’s /lib/x86_64-linux-gnu/libc.so.6 is:

## disassembly from  objconv -fyasm -v2 /lib/x86_64-linux-gnu/libc.so.6 /dev/stdout 2>&1
toupper:
    lea     edx, [rdi+80H]                          ; 0002E300 _ 8D. 97, 00000080
    movsxd  rax, edi                                ; 0002E306 _ 48: 63. C7
    cmp     edx, 383                                ; 0002E309 _ 81. FA, 0000017F
    ja      ?_01766                                 ; 0002E30F _ 77, 19
    mov     rdx, qword [rel ?_37923]                ; 0002E311 _ 48: 8B. 15, 00395AA8(rel)
    sub     rax, -128                               ; 0002E318 _ 48: 83. E8, 80
    mov     rdx, qword [fs:rdx]                     ; 0002E31C _ 64 48: 8B. 12
    mov     rdx, qword [rdx]                        ; 0002E320 _ 48: 8B. 12
    mov     rdx, qword [rdx+48H]                    ; 0002E323 _ 48: 8B. 52, 48
    mov     eax, dword [rdx+rax*4]                  ; 0002E327 _ 8B. 04 82   ## the final table lookup, indexing an array of 4B ints
?_01766:
    rep ret                                         ; actual objconv output shows the prefix on a separate line

So it takes an early-out if the arg isn’t in the 0 – 0xFF range (so this branch should predict perfectly not-taken), otherwise it finds the table for the current locale, which involves three pointer dereferences: one load from a global, and one thread-local, and one more dereference. Then it actually indexes into the 256-entry table.

This is the entire library function; the toupper label in the disassembly is what your code calls. (Well, through a layer of indirection through the PLT because of dynamic linking, but after the first call triggers lazy symbol lookup, it’s just one extra jmp instruction between your code and those 11 insns in the library.)

Advertisement