I need to convert between UTF-8, UTF-16 and UTF-32 for different APIs/modules, and since I now have the option to use C++11 I am looking at the new string types.
It looks like I can use string, u16string and u32string for UTF-8, UTF-16 and UTF-32. I also found codecvt_utf8 and codecvt_utf16, which look to be able to do a conversion between char or char16_t and char32_t, and what looks like a higher-level wstring_convert, but that only appears to work with bytes/std::string, and there is not a great deal of documentation.
Am I meant to use a wstring_convert somehow for the UTF-16 ↔ UTF-32 and UTF-8 ↔ UTF-32 cases? I only really found examples for UTF-8 to UTF-16, which I am not even sure will be correct on Linux, where wchar_t is normally considered UTF-32… Or do I need to do something more complex with those codecvt facets directly?
Or is this just still not really in a usable state, and should I stick with my own existing small routines using 8-, 16- and 32-bit unsigned integers?
Answer
If you read the documentation at CppReference.com for wstring_convert, codecvt_utf8, codecvt_utf16, and codecvt_utf8_utf16, the pages include a table that tells you exactly what you can use for the various UTF conversions.
And yes, you would use std::wstring_convert to facilitate the conversion between the various UTFs. Despite its name, it is not limited to just std::wstring; it operates on any std::basic_string specialization (which std::string, std::wstring, std::u16string and std::u32string all are).
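The general pattern is to instantiate std::wstring_convert with the codecvt facet for the conversion you want and the matching wide character type, then call to_bytes() for wide → narrow and from_bytes() for narrow → wide. A minimal sketch of that pattern (the UTF-8 ↔ UTF-32 pairing and the string literal are just illustrative):

#include <codecvt>
#include <locale>
#include <string>

int main()
{
    // Facet for UTF-8 <-> UTF-32/UCS4, element type char32_t
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;

    std::string    utf8 = conv.to_bytes(U"example");  // UTF-32 -> UTF-8 bytes
    std::u32string u32  = conv.from_bytes(utf8);      // UTF-8 bytes -> UTF-32
    return 0;
}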
From that documentation: class template std::wstring_convert performs conversions between a byte string std::string and a wide string std::basic_string<Elem>, using an individual code conversion facet Codecvt. std::wstring_convert assumes ownership of the conversion facet, and cannot use a facet managed by a locale. The standard facets suitable for use with std::wstring_convert are std::codecvt_utf8 for UTF-8/UCS2 and UTF-8/UCS4 conversions, and std::codecvt_utf8_utf16 for UTF-8/UTF-16 conversions.
For example:
#include <codecvt>
#include <locale>
#include <string>

typedef std::string u8string;

u8string To_UTF8(const std::u16string &s)
{
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
    return conv.to_bytes(s);
}

u8string To_UTF8(const std::u32string &s)
{
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    return conv.to_bytes(s);
}

std::u16string To_UTF16(const u8string &s)
{
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
    return conv.from_bytes(s);
}

std::u16string To_UTF16(const std::u32string &s)
{
    // codecvt_utf16 converts between UCS4 and UTF-16 *bytes* (big-endian unless
    // std::little_endian is passed as its Mode template argument), so the byte
    // string it produces is reinterpreted as char16_t units here.
    std::wstring_convert<std::codecvt_utf16<char32_t>, char32_t> conv;
    std::string bytes = conv.to_bytes(s);
    return std::u16string(reinterpret_cast<const char16_t*>(bytes.c_str()),
                          bytes.length() / sizeof(char16_t));
}

std::u32string To_UTF32(const u8string &s)
{
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    return conv.from_bytes(s);
}

std::u32string To_UTF32(const std::u16string &s)
{
    // Likewise, the char16_t data is handed to codecvt_utf16 as raw bytes.
    const char16_t *pData = s.c_str();
    std::wstring_convert<std::codecvt_utf16<char32_t>, char32_t> conv;
    return conv.from_bytes(reinterpret_cast<const char*>(pData),
                           reinterpret_cast<const char*>(pData + s.length()));
}
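As a quick usage sketch, assuming the helpers above are in scope (the sample text and the particular round trip are just illustrative; this route goes through the UTF-8-based facets, which are not affected by the endianness note above):

#include <cassert>

int main()
{
    // Arbitrary sample text; universal-character-names keep the source encoding-neutral.
    const std::u32string original = U"Caf\u00E9 \u2615";

    u8string       utf8      = To_UTF8(original);    // UTF-32 -> UTF-8
    std::u16string utf16     = To_UTF16(utf8);       // UTF-8  -> UTF-16
    u8string       utf8Again = To_UTF8(utf16);       // UTF-16 -> UTF-8
    std::u32string roundTrip = To_UTF32(utf8Again);  // UTF-8  -> UTF-32

    assert(roundTrip == original);
    return 0;
}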