
How to replace [^u0009u000Au000Du0020-uD7FFuE000-uFFFDu10000-u10FFF]+ with "" in a file using sed or anything else?

I find and replace some strange characters in an XML file with a text editor, using this regular expression:

[^u0009u000Au000Du0020-uD7FFuE000-uFFFDu10000-u10FFF]+ ---> "" 

Now I need to do it on the Linux command line.

How can I do the same find-and-replace job with sed, or any other tool, on the Linux command line?

Thank you in advance.


Answer

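You can try this (the -E option enables extended regular expressions, which the grouping and alternation need, and a plain u matches the literal letter):

sed -E 's/u(0009|000A|000D|0020|D7FF|E000|FFFD|10000|10FFF)//g' <<< "[^u0009u000Au000Du0020-uD7FFuE000-uFFFDu10000-u10FFF]"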

Before replacing, make sure you really want to remove these characters, as some of them are tabs, newlines and spaces.
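
One way to act on that caution is to look at which codes actually occur before deleting anything. The following is only a sketch: it assumes the codes appear in the file as literal text such as u0009 (the same assumption the sed command above makes), and the function name list_u_codes and the file name input.xml are made up for illustration.

```shell
# Hypothetical helper: print each distinct u-code found on stdin, once per code.
# Assumes the codes are literal text (the letter u plus hex digits),
# not the actual control characters they stand for.
list_u_codes() {
  grep -oE 'u(0009|000A|000D|0020|D7FF|E000|FFFD|10000|10FFF)' | sort -u
}

# Usage on a real file (input.xml is a placeholder name):
#   list_u_codes < input.xml
printf '%s\n' 'au0009bu0009cuD7FF' | list_u_codes
```

If the list contains codes you want to keep, drop them from the alternation before running the replacement.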

Update:

A more generic pattern, based on the 4- and 5-digit hex codes in your sample (again with -E, which the {4} and ? quantifiers need):

sed -E 's/u[0-9A-F]{4}[0-9A-F]?//g' <<< "[^u0009u000Au000Du0020-uD7FFuE000-uFFFDu10000-u10FFF]"

This removes every u followed by 4 or 5 uppercase hexadecimal digits.

Please note that a word starting with an uppercase letter from A to F (e.g. Foo) directly after a 4-digit hex code will be partly matched:

u0000Foo will be changed to oo, because the F of Foo matches the optional 5th hex digit.
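
The caveat can be reproduced with a small sketch. The function name strip_u_codes is hypothetical, and the pattern is the generic one from the update, written with sed -E and a literal u.

```shell
# Hypothetical helper: remove every literal "u" followed by 4 or 5
# uppercase hex digits from stdin.
strip_u_codes() {
  sed -E 's/u[0-9A-F]{4}[0-9A-F]?//g'
}

# The F of Foo is consumed as the optional 5th hex digit.
printf '%s\n' 'u0000Foo' | strip_u_codes

# A separator after the code stops the match, so Foo survives intact.
printf '%s\n' 'u0000 Foo' | strip_u_codes
```

The first command prints oo; the second prints Foo with its leading space.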

User contributions licensed under: CC BY-SA