
Reading a text file with unicode characters – Python3

I am trying to read a text file that contains Unicode escape sequences (\u) and other escapes (\n, \u) in the text. Here is an example:

(u'B9781437714227000962', u'Definition\u2014Human papillomavirus (HPV)\u2013related proliferation of the vaginal mucosa that leads to extensive, full-thickness loss of maturation of the vaginal epithelium.\n')

How can I remove these Unicode tags using Python 3 on Linux?


Answer

To remove the Unicode escape sequences (or better, to translate them) in Python 3:

a.encode('utf-8').decode('unicode_escape')

The decode step translates the Unicode escape sequences into the corresponding Unicode characters. Unfortunately, this (un-)escaping does not work on str objects directly, so you need to encode the string to bytes first and then decode it.
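As a minimal sketch of that round trip, assuming the text read from the file contains literal backslash escapes (the variable names `raw` and `text` and the shortened sample string are illustrative, not from the question):

```python
# `raw` stands in for text read from the file: the backslashes are literal
# characters here (raw string), not yet interpreted as escapes.
raw = r"Definition\u2014Human papillomavirus (HPV)\u2013related proliferation.\n"

# str has no .decode() in Python 3, so go through bytes first,
# then let the unicode_escape codec interpret \uXXXX and \n.
text = raw.encode("utf-8").decode("unicode_escape")

print(text)
```

One caveat: the unicode_escape codec treats the input bytes as Latin-1, so this is safe for pure-ASCII input like the sample, but text that already contains non-ASCII characters will come out garbled.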

But as pointed out in the comment on the question, you have a serialized document. Try to deserialize it with the correct tools, and the Unicode “unescaping” will happen automatically as part of that.
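Which tool is "correct" depends on how the file was written, which the question does not say. If each line is the repr of a Python tuple of strings, as the sample suggests (this is an assumption), a sketch using the standard library's ast.literal_eval could look like this; `line`, `doc_id`, and `definition` are illustrative names:

```python
import ast

# Hypothetical line as it might appear in the file: the repr of a tuple of
# unicode strings (raw string so the backslashes stay literal here).
line = r"(u'B9781437714227000962', u'Definition\u2014Human papillomavirus (HPV)\u2013related proliferation.\n')"

# literal_eval parses Python literals without executing arbitrary code;
# Python 3.3+ accepts the u'' prefix, and escapes are resolved during parsing.
doc_id, definition = ast.literal_eval(line)

print(doc_id)
print(definition)
```

Unlike the encode/decode round trip, this approach leaves any non-ASCII characters already present in the text intact.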
