Remove duplicates and keep line which contains max value from one column – LINUX

Hi everyone!

I’d like to remove duplicate genes from a file with 4 fields and, for each gene, keep the line with the highest value in the 4th column (ident). I need to do this on a Linux server.

Before

gene  subj  e-value ident
  g1    h1    0.05   75.5
  g1    h2    0.03   60.6
  g2    h7    0.00   80.5
  g2    h9    0.00   50.3
  g2    h4    0.03   90.7
  g3    h5    0.10   30.5
  g3    h8    0.00   76.8
  g4    h11   0.00   80.7

After

gene  subj  e-value ident
  g1    h1    0.05   75.5
  g2    h4    0.03   90.7
  g3    h8    0.00   76.8
  g4    h11   0.00   80.7

Thank you so much, and sorry if this has been asked before! I couldn’t find an answer to my problem.

Answer

You can try this if you don’t mind getting the output without the header:

tail -n +2 file.txt | sort -k1,1 -k4,4rn | sort -uk1,1

Explanation:

tail -n +2 file.txt

removes the header line so it doesn’t take part in the sorting.

sort -k1,1 -k4,4rn

sorts by column 1 first (-k1,1) and then by column 4 numerically in reverse order (-k4,4rn), so within each gene the line with the highest value in column 4 comes first.
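To make the effect of this intermediate step visible, here is a small sketch. The `data` and `sorted` variables are only for illustration, the rows mirror the question's sample with the header already removed, and `LC_ALL=C` is an added assumption so that the decimal point is parsed the same way regardless of locale.

```shell
# Hypothetical sample rows mirroring the question's table (header removed).
data='g1 h1 0.05 75.5
g1 h2 0.03 60.6
g2 h7 0.00 80.5
g2 h9 0.00 50.3
g2 h4 0.03 90.7
g3 h5 0.10 30.5
g3 h8 0.00 76.8
g4 h11 0.00 80.7'

# Sort by gene (column 1), then by column 4 numerically, highest first.
# LC_ALL=C is an assumption here, to keep "." as the decimal separator.
sorted=$(printf '%s\n' "$data" | LC_ALL=C sort -k1,1 -k4,4rn)

printf '%s\n' "$sorted"
```

After this step the first line of every gene group is the one with the highest value in column 4 (for example, `g2 h4 0.03 90.7` now leads the g2 group).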

Finally:

sort -uk1,1

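removes duplicates, comparing only the first column. With GNU sort, the line kept for each gene is the first one of its group, which after the previous step is the one with the highest value in column 4.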

Note that -k1,1 means “from column 1 to column 1”, so -k4,4 means “from column 4 to column 4”. Adjust the numbers to fit your columns.
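If you do need the header in the output, one possible variation is to print it separately with head and run the same pipeline on the rest. This is only a sketch under a few assumptions: the temporary file and the `result` variable are hypothetical, `LC_ALL=C` is added to keep the decimal point locale-independent, and `-s` (stable sort, available in GNU sort) is added to make explicit that the first line of each group should be kept.

```shell
# Hypothetical input file recreating the question's sample data.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
gene  subj  e-value ident
  g1    h1    0.05   75.5
  g1    h2    0.03   60.6
  g2    h7    0.00   80.5
  g2    h9    0.00   50.3
  g2    h4    0.03   90.7
  g3    h5    0.10   30.5
  g3    h8    0.00   76.8
  g4    h11   0.00   80.7
EOF

result=$(
  # Print the header line unchanged.
  head -n 1 "$tmp"
  # Sort the data lines, then keep the first line per gene.
  # -s is a GNU extension, assumed available on a typical Linux server.
  tail -n +2 "$tmp" | LC_ALL=C sort -k1,1 -k4,4rn | LC_ALL=C sort -s -u -k1,1
)

printf '%s\n' "$result"
rm -f "$tmp"
```

With the sample data this should print the header followed by one line per gene (h1 for g1, h4 for g2, h8 for g3, h11 for g4), matching the "After" table in the question.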

User contributions licensed under: CC BY-SA