I have a BED file (a text file made up of tab-separated columns). The fourth column holds a name followed by numbers. Using the command line (Linux), I would like to get these names without repetition. I have provided an example below.
This is my file:
$ head example.bed
1	2160195	2161184	SKI_1.2160205.2161174
1	2234406	2234552	SKI_1.2234416.2234542
1	2234713	2234849	SKI_1.2234723.2234839
1	2235268	2235551	SKI_1.2235278.2235541
1	2235721	2236034	SKI_1.2235731.2236024
1	2237448	2237699	SKI_1.2237458.2237689
1	2238005	2238214	SKI_1.2238015.2238204
1	9770503	9770664	PIK3CD_1.9770513.9770654
1	9775588	9775837	PIK3CD_1.9775598.9775827
1	9775896	9776146	PIK3CD_1.9775906.9776136
...
My list should look like this:
SKI_1
PIK3CD_1
...
Could you please help me with the command I need to use?
I found a solution years ago using grep, but I have lost the document where I used to save all my useful commands.
Answer
Given so.txt:
1	2160195	2161184	SKI_1.2160205.2161174
1	2234406	2234552	SKI_1.2234416.2234542
1	2234713	2234849	SKI_1.2234723.2234839
1	2235268	2235551	SKI_1.2235278.2235541
1	2235721	2236034	SKI_1.2235731.2236024
1	2237448	2237699	SKI_1.2237458.2237689
1	2238005	2238214	SKI_1.2238015.2238204
1	9770503	9770664	PIK3CD_1.9770513.9770654
1	9775588	9775837	PIK3CD_1.9775598.9775827
1	9775896	9776146	PIK3CD_1.9775906.9776136
Then the following command should do the trick:
cat so.txt | awk '{split($4,f,".");print f[1];}' | sort -u
- $4 is the 4th column.
- We split the 4th column on the . character. The result is put into the f array, so f[1] is the name before the first dot.
- Finally, we filter out the duplicates with sort -u.
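
Two variations on the same idea may be handy; this is only a sketch, not the one right way to do it. It assumes the file really is tab-separated (cut splits on tabs by default, whereas awk splits on any whitespace), and the temporary sample file and the variable names sample, sorted_names and ordered_names are made up for illustration. The first variant drops cat and awk in favour of cut; the second keeps the names in the order they first appear, because sort -u returns them sorted rather than in file order.

```shell
# Hypothetical sample file with three rows taken from the question (tab-separated).
sample=$(mktemp)
printf '1\t2160195\t2161184\tSKI_1.2160205.2161174\n' >  "$sample"
printf '1\t2234406\t2234552\tSKI_1.2234416.2234542\n' >> "$sample"
printf '1\t9770503\t9770664\tPIK3CD_1.9770513.9770654\n' >> "$sample"

# Variant 1: take column 4, keep the text before the first dot, then
# sort and drop duplicates. The result is sorted, not in file order.
sorted_names=$(cut -f4 "$sample" | cut -d. -f1 | sort -u)

# Variant 2: print a name only the first time it is seen, which
# preserves the order of first appearance in the file.
ordered_names=$(awk '{split($4, f, "."); if (!seen[f[1]]++) print f[1]}' "$sample")

echo "$sorted_names"
echo "$ordered_names"

rm -f "$sample"
```

With the real file, replace "$sample" with example.bed (or so.txt).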