I am trying to read a text file which has Unicode escapes (\u) and other tags (\n, \u) in the text. Here is an example:
(u'B9781437714227000962', u'Definition\u2014Human papillomavirus (HPV)\u2013related proliferation of the vaginal mucosa that leads to extensive, full-thickness loss of maturation of the vaginal epithelium.\n')
How can I remove these Unicode tags using Python 3 on Linux?
Answer
To remove Unicode escape sequences (or better, to translate them) in Python 3:
a.encode('utf-8').decode('unicode_escape')
The decode step translates the Unicode escape sequences into the corresponding Unicode characters. Unfortunately, this kind of (un-)escaping does not work on strings directly, because a Python 3 str has no decode method, so you need to encode the string to bytes first and then decode it.
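As a minimal sketch of that round trip, assuming the text read from the file contains literal backslash sequences such as \u2014 and \n (the variable names and the shortened sample text here are only illustrative):

```python
# Text as it would come out of the file: the backslashes are literal characters,
# which is why a raw string is used here.
raw = r"Definition\u2014Human papillomavirus (HPV)\u2013related proliferation.\n"

# str has no .decode() in Python 3, so convert to bytes first,
# then let the unicode_escape codec interpret the escape sequences.
decoded = raw.encode("utf-8").decode("unicode_escape")

print(decoded)
```

One caveat: the unicode_escape codec treats unescaped bytes as Latin-1, so if the text already contains real non-ASCII characters (for example an é), they will come out garbled. This approach is probably safe only when the input is pure ASCII plus escape sequences.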
But as pointed out in the comments on the question, you have a serialized document. Try to deserialize it with the correct tools, and the Unicode unescaping will be handled automatically.
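For instance, the sample line looks like the repr of a Python tuple. If that is what the file really contains (an assumption, since the question does not say how the file was produced), the standard library's ast.literal_eval can parse each record and resolve the escapes in one step. A rough sketch, with a shortened sample line:

```python
import ast

# One record from the file, as raw text: a Python tuple literal with u'' prefixes.
line = r"(u'B9781437714227000962', u'Definition\u2014Human papillomavirus (HPV)\u2013related proliferation.\n')"

# literal_eval safely evaluates Python literals (no arbitrary code execution)
# and interprets the \u and \n escapes while parsing the strings.
doc_id, text = ast.literal_eval(line)

print(doc_id)
print(text)
```

If the file was written in some other format (JSON, pickle, and so on), the matching parser for that format would be the right tool instead.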