How to clean a data file from binary junk?

I have this data file, which is supposed to be a normal ASCII file. However, it has some junk at the end of the first line, which only shows up when I look at it with vi or less:

  y mon d  h XX11 XX22 XX33 XX44 XX55 XX66^@
2011  6 6 10 14.0 15.5 14.3 11.3 16.2 16.1

grep also reports it as a binary file: Binary file data.dat matches

This is causing trouble in my parsing script. I split each line and put the fields into an array. The last element (XX66) of the first array is corrupted by the junk, so I can't match against it.

How can I clean that line or the array? I have tried running dos2unix on the file and stripping the array members with s/\s+$//. What is that junk anyway? Unfortunately I have no control over the data; it comes from a third party.

Any ideas?


Answer

grep is trying to be smart: when it sees an unprintable character, it switches to "binary" mode. Add -a (or --text) to force grep to stay in text mode.
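For instance, forcing text mode on the file from your question would look something like this:

  # treat the file as text even though it contains a NUL byte
  grep -a XX66 data.dat
  # long form of the same option
  grep --text XX66 data.dat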

As for sed, try sed -e 's/[^ -~]*//g', which says "change every character that isn't between space and tilde (0x20 and 0x7E, respectively) into nothing". That will strip tabs too, but you can insert a literal tab character just before the space in the bracket expression to keep them (or any other special character you want to preserve).
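A quick sketch of applying that to the whole file (the output filename here is just an example):

  # delete every byte outside the printable ASCII range 0x20-0x7E
  sed -e 's/[^ -~]*//g' data.dat > data_clean.dat
  # to keep tabs as well, type a literal Tab (e.g. Ctrl-V then Tab)
  # right before the space inside the bracket expression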

The "^@" is how vi and less represent a NUL byte (ASCII 0, often written "\0"). Some programs may also treat it as an end-of-file marker if they were implemented naively.
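If you want to verify what the stray byte actually is, od can show it (just a quick check, not part of the fix):

  # dump the first line character by character; the junk shows up as \0
  head -n 1 data.dat | od -c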
