everyone!
I’d like to remove duplicates and keep lines with the highest value from one column (4th column) in a file with 4 fields. I must do this in a Linux server.
Before
gene subj e-value ident g1 h1 0.05 75.5 g1 h2 0.03 60.6 g2 h7 0.00 80.5 g2 h9 0.00 50.3 g2 h4 0.03 90.7 g3 h5 0.10 30.5 g3 h8 0.00 76.8 g4 h11 0.00 80.7
After
gene subj e-value ident g1 h1 0.05 75.5 g2 h4 0.03 90.7 g3 h8 0.00 76.8 g4 h11 0.00 80.7
Thank you so much and I’m sorry if I asked something repeated! But I didn’t find an answer for my problem.
Advertisement
Answer
You can try this, if it is no problem to get the output without the header:
tail -n +2 file.txt | sort -k1,1 -k4,4rn | sort -uk1,1
Explanation:
tail -n +2 file.txt
will remove the headers so they don’t get involved in all the sorting.
sort -k1,1 -k4,4rn
will sort by column 1 first (-k1,1
) and then by column 4 numerically and in reverse order (-k4,4rn
)
Finally:
sort -uk1,1
Will remove duplicates taking into account just the first column.
Be aware that -k1,1
means from column one to column one, hence -k4,4
is from column 4 to column 4. Adjust to fit your columns.