
Why can I not read a UTF-16 file longer than 4094 characters?

Some information:

  • I’ve only tried this on Linux
  • I’ve tried both with GCC (7.2.0) and Clang (3.8.1)
  • To my understanding, it requires C++11 or higher

What happens when I run it

I get the expected string “abcd” repeated up to the 4094-character position. After that, all it outputs is the “?” character until the end of the file.
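
The test file itself isn’t shown here, but a UTF-16LE file matching this description (a BOM followed by “abcd” repeated) can be generated with a short helper program; the file name and the 8000-character length are assumptions for illustration:

    #include <fstream>

    int main() {
        std::ofstream out("test.txt", std::ios::binary);
        out.put('\xFF');                   // UTF-16LE byte order mark
        out.put('\xFE');
        const char pattern[] = "abcd";
        for (int i = 0; i < 2000; ++i)     // 8000 characters in total
            for (int j = 0; j < 4; ++j) {
                out.put(pattern[j]);       // low byte of the code unit
                out.put('\0');             // high byte (little-endian zero)
            }
    }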

What do I think about this?

I think this is not the expected behavior and that there must be a bug somewhere.

Code you can test with:

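The code block itself was lost from this copy of the page. Judging from the answer below, which inspects a std::wstring named line read from the stream, it was presumably close to the following sketch; the file name and the exact codecvt flags are assumptions:

    #include <codecvt>
    #include <fstream>
    #include <iostream>
    #include <locale>
    #include <string>

    int main() {
        std::wifstream file("test.txt");

        // Tell the stream to decode UTF-16; consume_header reads the BOM
        // and picks the endianness from it.
        file.imbue(std::locale(file.getloc(),
            new std::codecvt_utf16<wchar_t, 0x10ffff, std::consume_header>));

        std::wstring line;
        std::getline(file, line);
        std::wcout << line << std::endl;
    }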


Answer

This looks like a library bug to me. Stepping through the sample program as compiled by gcc 7.1.1 using gdb:

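The gdb transcript is missing here as well; given the sentence that follows, it presumably showed the length of line, along these lines:

    (gdb) p line.size()
    $1 = 8000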

8000 characters read, as expected. But then:

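This transcript is also missing; based on the description below, it would have shown something like the following (the values for line[4092] and line[4093] are assumed from the pattern):

    (gdb) p/x line[4092]
    $2 = 0x61
    (gdb) p/x line[4093]
    $3 = 0x62
    (gdb) p/x line[4094]
    $4 = 0x6300
    (gdb) p/x line[4095]
    $5 = 0x6400
    (gdb) p/x line[4096]
    $6 = 0x6500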

line[4092] and line[4093] look ok. But then I see line[4094], line[4095], and line[4096] containing 6300, 6400, and 6500 instead of 0063, 0064, and 0065. That is, the two bytes of each UTF-16 code unit have been swapped.

So this is getting messed up starting with character 4094, not 4096. I dumped the binary UTF-16 file, and it looks correct to me: the BOM is followed by consistent endianness for the entire contents of the file.

The only puzzling thing is why both clang and gcc are affected, but a quick Google search indicates that clang also used gcc’s libstdc++, at least until recently. So this looks like a libstdc++ bug to me.
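
One workaround consistent with this diagnosis (not part of the original answer) is to bypass the stream’s conversion entirely: read the raw bytes and convert them in a single call with std::wstring_convert; the file name is again an assumption:

    #include <codecvt>
    #include <fstream>
    #include <iostream>
    #include <iterator>
    #include <locale>
    #include <string>

    int main() {
        // Read the file as raw bytes, with no codecvt facet involved.
        std::ifstream in("test.txt", std::ios::binary);
        std::string bytes((std::istreambuf_iterator<char>(in)),
                          std::istreambuf_iterator<char>());

        // Convert the whole UTF-16 byte sequence in one call.
        std::wstring_convert<std::codecvt_utf16<wchar_t, 0x10ffff,
                                                std::consume_header>> conv;
        std::wstring text = conv.from_bytes(bytes);

        std::wcout << text.size() << std::endl;
    }

Since the conversion sees the entire byte sequence at once, the buffer boundary at character 4094 never comes into play.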
