
iconv does not completely convert to UTF-8

When I convert my text on this site, it is converted correctly:
http://string-functions.com/encodedecode.aspx
I chose source ‘Windows-1252’ and target ‘utf-8’.
See it in the screenshot below:
https://i.stack.imgur.com/2Pn4E.png

But when I convert with the following command, some letters are not converted and the text is corrupted.

iconv -c -f UTF-8 -t WINDOWS-1252 < mytext.txt > fixed_mytext.txt

A phrase that should be converted:

آموزش Ùˆ نرم اÙزارهای تعمیر مانیتور

If converted correctly, it should be this phrase:

 آموزش و نرم افزارهای تعمیر مانیتور 

Please help me. Thank you.

My original text:

http://www.todaymagazine.ir/forum.txt


Answer

The original text was in UTF-8. It was mistakenly interpreted as Windows-1252 text and converted from Windows-1252 to UTF-8. This should never have been done. To undo the damage we need to convert the file from UTF-8 to Windows-1252, and then simply treat the result as a UTF-8 file.
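To make the round trip concrete, here is a small sketch that damages and repairs a word from the question. The function names damage and undo are made up for this illustration, and it assumes an iconv that knows the CP1252 and UTF-8 names (GNU iconv does). The word نرم is chosen because none of its UTF-8 bytes fall on an undefined Windows-1252 code.

```shell
# Simulate the damage: take UTF-8 bytes, pretend they are Windows-1252,
# and convert them "to UTF-8" a second time.
damage() { iconv -f CP1252 -t UTF-8; }

# Undo it: converting back to Windows-1252 restores the original bytes,
# which are then simply read as UTF-8 again.
undo() { iconv -f UTF-8 -t CP1252; }

printf '%s\n' 'نرم' | damage          # prints mojibake
printf '%s\n' 'نرم' | damage | undo   # prints نرم again
```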

There’s a problem, however. The letter ف is encoded in UTF-8 as 0xd9 0x81, and the code 0x81 is not assigned to any character in Windows-1252.

Luckily, when the first erroneous conversion was made, that byte was not lost or replaced with a question mark. It was converted to the control character U+0081, which is 0xc2 0x81 in UTF-8.

The 0xd9 code is in Windows-1252; it’s the letter Ù, which in UTF-8 is 0xc3 0x99. So the final byte sequence for ف in the converted file is 0xc3 0x99 0xc2 0x81.
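The byte values above can be checked from a shell. This is only a sketch: hex is a throwaway helper defined here, and ISO-8859-1 is used for the last line merely because it maps byte 0x81 to U+0081, which shows how that control character is encoded in UTF-8. It is not a claim about what the website did internally.

```shell
# Print the bytes read from stdin as lowercase hex, without separators.
hex() { od -An -tx1 | tr -d ' \n'; }

# UTF-8 encoding of the letter ف (U+0641).
printf '%s' 'ف' | hex; echo                                   # d981

# Byte 0xd9 read as Windows-1252 is Ù; re-encoded as UTF-8 it becomes c3 99.
printf '\331' | iconv -f CP1252 -t UTF-8 | hex; echo          # c399

# The control character U+0081 is c2 81 in UTF-8.
printf '\201' | iconv -f ISO-8859-1 -t UTF-8 | hex; echo      # c281
```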

We can just replace this byte sequence with an ASCII-only placeholder using a sed script, do the inverse conversion, and then replace the placeholder with ف.

LANG=C sed $'s/\xc3\x99\xc2\x81/===FE===/g' forum.txt |
       iconv -f utf8 -t cp1252 |
       sed $'s/===FE===/\xd9\x81/g'

The result is the original file encoded in UTF-8.

(Make sure first that ===FE=== does not already occur in the text!)
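If you want the placeholder check built in, the pipeline could be wrapped roughly like this. It is a sketch, not a tested tool: repair_mojibake is a hypothetical name, the placeholder must not contain / or regex metacharacters, and LC_ALL=C is used instead of LANG=C only because LC_ALL cannot be overridden by other locale variables. The bytes are written with octal escapes so the function also works in shells without $'...' quoting.

```shell
# Hypothetical helper: undo "UTF-8 read as Windows-1252" damage in one file.
# Usage: repair_mojibake FILE [PLACEHOLDER]   (repaired text goes to stdout)
repair_mojibake() {
    file=$1
    placeholder=${2:-===FE===}

    # Refuse to run if the placeholder already occurs in the input,
    # because the last sed could not tell it apart from a real substitution.
    if LC_ALL=C grep -q -F -e "$placeholder" "$file"; then
        echo "placeholder '$placeholder' already occurs in $file" >&2
        return 1
    fi

    damaged=$(printf '\303\231\302\201')   # c3 99 c2 81: the mangled form of ف
    repaired=$(printf '\331\201')          # d9 81: UTF-8 for ف

    LC_ALL=C sed "s/$damaged/$placeholder/g" "$file" |
        iconv -f UTF-8 -t CP1252 |
        LC_ALL=C sed "s/$placeholder/$repaired/g"
}
```

Usage would be something like repair_mojibake forum.txt > forum_fixed.txt. Note that 0x81 is not the only unassigned code in Windows-1252: 0x8d, 0x8f, 0x90 and 0x9d are unassigned too, so if the text contains characters whose UTF-8 encoding includes one of those bytes, they would presumably need the same placeholder treatment.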

User contributions licensed under: CC BY-SA