Extracting a set of characters from a column in a txt file

I have a bed file (which is a text file made up of tab-separated columns). The fourth column holds a name followed by numbers. Using the command line (Linux), I would like to get these names without repetition. I have provided an example below.

This is my file:

$ head example.bed
1       2160195 2161184 SKI_1.2160205.2161174
1       2234406 2234552 SKI_1.2234416.2234542
1       2234713 2234849 SKI_1.2234723.2234839
1       2235268 2235551 SKI_1.2235278.2235541
1       2235721 2236034 SKI_1.2235731.2236024
1       2237448 2237699 SKI_1.2237458.2237689
1       2238005 2238214 SKI_1.2238015.2238204
1       9770503 9770664 PIK3CD_1.9770513.9770654
1       9775588 9775837 PIK3CD_1.9775598.9775827
1       9775896 9776146 PIK3CD_1.9775906.9776136
...

My list should look like this:

SKI_1
PIK3CD_1
...

Could you please help me with the command I need to use?

I found a solution years ago using grep, but I have lost the document where I kept all my useful commands.


Answer

Given so.txt:

1       2160195 2161184 SKI_1.2160205.2161174
1       2234406 2234552 SKI_1.2234416.2234542
1       2234713 2234849 SKI_1.2234723.2234839
1       2235268 2235551 SKI_1.2235278.2235541
1       2235721 2236034 SKI_1.2235731.2236024
1       2237448 2237699 SKI_1.2237458.2237689
1       2238005 2238214 SKI_1.2238015.2238204
1       9770503 9770664 PIK3CD_1.9770513.9770654
1       9775588 9775837 PIK3CD_1.9775598.9775827
1       9775896 9776146 PIK3CD_1.9775906.9776136

Then the following command should do the trick:

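awk '{split($4,f,"."); print f[1]}' so.txt | sort -u
  1. $4 is the 4th column
  2. split() breaks the 4th column on the . character and stores the pieces in the f array, so f[1] is the name before the first dot
  3. Finally, sort -u sorts the names and filters out the duplicates, so the output is in sorted order rather than file order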
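
If the names should come out in the order they first appear in the file (SKI_1 before PIK3CD_1, as in the question), a minimal sketch is to let awk do the de-duplication itself instead of piping to sort -u. The function name unique_names and the array name seen are just illustrative choices, not part of any tool:

```shell
# unique_names is a hypothetical helper name used for illustration only.
# It prints the part of column 4 before the first ".", once per name,
# in order of first appearance.
unique_names() {
    # seen[] counts how often each name has occurred; a name is printed
    # only the first time, while its count is still 0.
    awk '{ split($4, f, "."); if (!seen[f[1]]++) print f[1] }' "$1"
}
```

With the sample file above, running unique_names so.txt should print SKI_1 followed by PIK3CD_1. Because awk splits fields on any run of whitespace by default, this works whether the columns are separated by tabs or spaces.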