Why can I not read a UTF-16 file longer than 4094 characters?

Question

Some information:

- I've only tried this on Linux.
- I've tried both GCC (7.2.0) and Clang (3.8.1).
- It requires C++11 or higher, to my understanding.

What happens when I run it:

I get the expected string "abcd" repeated until it hits position 4094. After that all it outpu…

Accepted Answer

This looks like a library bug to me. Stepping through the sample program, as compiled by gcc 7.1.1, using gdb:

    (gdb) n
    28	    while (getline(file,line)) {
    (gdb) n
    29	        std::wcout << line << std::endl;
    (gdb) p line.size()
    $1 = 8000

8000 characters read, as expected. But then:

    (gdb) p line[4092]
    $18 = (__gnu_cxx::__alloc_traits >::value_type &) @0x628240: 97 L'a'
    (gdb) p line[4093]
    $19 = (__gnu_cxx::__alloc_traits >::value_type &) @0x628244: 98 L'b'
    (gdb) p line[4094]
    $20 = (__gnu_cxx::__alloc_traits >::value_type &) @0x628248: 25344 L'挀'
    (gdb) p line[4095]
    $21 = (__gnu_cxx::__alloc_traits >::value_type &) @0x62824c: 25600 L'搀'
    (gdb) p line[4096]
    $22 = (__gnu_cxx::__alloc_traits >::value_type &) @0x628250: 24832 L'愀'

line[4092] and line[4093] look OK. But line[4094], line[4095], and line[4096] contain 0x6300, 0x6400, and 0x6100 (25344, 25600, and 24832 in decimal) instead of 0x0063, 0x0064, and 0x0061. In other words, the two bytes of each UTF-16 code unit are swapped: these are 'c', 'd', and 'a' decoded with the wrong endianness.

So this is getting messed up starting with character 4094, and not 4096, actually. I dumped the binary UTF-16 file, and it looks correct to me. The BOM is followed by consistent endianness for the entire contents of the file.

The only thing that's puzzling is why both clang and gcc are supposedly affected, but a quick Google search indicates that clang also uses gcc's libstdc++, at least up until recently. So this looks like a libstdc++ bug to me.


Answer