Why can I not read a UTF-16 file longer than 4094 characters?

Question

Some information:

I've only tried this on Linux. I've tried both with GCC (7.2.0) and Clang (3.8.1). It requires C++11 or higher, to my understanding.

What happens when I run it: I get the expected string "abcd" repeated until it hits the position of 4094 characters. After that, all it outputs is this sign "?" until the end of the

Accepted Answer

This looks like a library bug to me. Stepping through the sample program as compiled by gcc 7.1.1 using gdb:

    (gdb) n
    28      while (getline(file,line)) {
    (gdb) n
    29      std::wcout << line << std::endl;
    (gdb) p line.size()
    $1 = 8000

8000 characters read, as expected. But then:

    (gdb) p line[4092]
    $18 = (__gnu_cxx::__alloc_traits >::value_type &) @0x628240: 97 L'a'
    (gdb) p line[4093]
    $19 = (__gnu_cxx::__alloc_traits >::value_type &) @0x628244: 98 L'b'
    (gdb) p line[4094]
    $20 = (__gnu_cxx::__alloc_traits >::value_type &) @0x628248: 25344 L'挀'
    (gdb) p line[4095]
    $21 = (__gnu_cxx::__alloc_traits >::value_type &) @0x62824c: 25600 L'搀'
    (gdb) p line[4096]
    $22 = (__gnu_cxx::__alloc_traits >::value_type &) @0x628250: 24832 L'愀'

line[4092] and line[4093] look OK. But line[4094], line[4095], and line[4096] contain 0x6300, 0x6400, and 0x6100 (25344, 25600, and 24832 in decimal) instead of 0x0063, 0x0064, and 0x0061. In other words, the two bytes of each UTF-16 code unit are swapped.

So the corruption actually starts at character 4094, not 4096. I dumped the binary UTF-16 file, and it looks correct to me. The BOM is followed by consistent endianness for the entire contents of the file.

The only puzzling thing is why both clang and gcc are supposedly affected, but a quick Google search indicates that clang also uses gcc's libstdc++, at least up until recently. So, this looks like a libstdc++ bug to me.
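The byte-swap diagnosis above can also be checked in code rather than by hand in gdb. The following is only a minimal sketch: `swap16` and `first_byte_swapped` are hypothetical helper names introduced here for illustration, not part of the original program or of any library. It assumes the line is supposed to be a repetition of a known pattern such as "abcd".

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Swap the two bytes of a 16-bit code unit, e.g. 0x6300 -> 0x0063.
constexpr std::uint16_t swap16(std::uint16_t u) {
    return static_cast<std::uint16_t>((u << 8) | (u >> 8));
}

// Hypothetical helper: return the index of the first character that differs
// from the expected repeating pattern but matches it once its two bytes are
// swapped. Returns std::wstring::npos if no such character exists.
std::size_t first_byte_swapped(const std::wstring& line,
                               const std::wstring& pattern) {
    assert(!pattern.empty());  // precondition: there must be a pattern to compare against
    for (std::size_t i = 0; i < line.size(); ++i) {
        const wchar_t expected = pattern[i % pattern.size()];
        if (line[i] == expected) continue;  // decoded correctly
        // Only values that fit in one 16-bit code unit can be byte-swapped units.
        if (static_cast<std::uint32_t>(line[i]) > 0xFFFFu) continue;
        const std::uint16_t unit = static_cast<std::uint16_t>(line[i]);
        if (static_cast<wchar_t>(swap16(unit)) == expected) return i;
    }
    return std::wstring::npos;
}
```

For a line shaped like the one in the gdb session (correct up to index 4093 and byte-swapped from there on), calling `first_byte_swapped(line, L"abcd")` returns 4094, and `swap16(25344)` yields 0x0063, the code for 'c'.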

Answer