Extract substrings starting with the same pattern in a file

I have a fileA.txt which contains strings:

>AMP_16 RS0247 CENPF__ENST00000366955.7__3251__30__0.43333__66.8488__1 RS0255
>AMP_16 RS0332 CENPF__ENST00000366955.7__2262__30__0.43333__67.9513__1 RS0451
>AMP_16 RS0332 CENPF__ENST00000366955.7__1108__30__0.43333__67.4673__1 RS0247
and so on .....

I would like to extract all the substrings that start with RS. Desired output:

RS0247_RS0255
RS0332_RS0451
RS0332_RS0247

I tried something like this:

while read line
do
        for word in $line
        do
                if [[ "$word" =~ ^RS ]]
                then
                        [ -z "$str" ] && str="$word"
                fi
        done
done < fileA.txt

echo "$str" 

However, I only get the first string, RS0247, printed when I run echo.

Answer

Given the three sample lines above, saved in a file named f:

Assuming a fixed format:

awk '{print $2"_"$4}' f
RS0247_RS0255
RS0332_RS0451
RS0332_RS0247

Assuming a flexible format (and that you're only interested in the first two occurrences of fields that start with RS):

awk '{f=1;split("",a);for (i=1;i<=NF;i++){if($i~/^RS/){a[f]=$i;f++}}print a[1]"_"a[2]}' f
RS0247_RS0255
RS0332_RS0451
RS0332_RS0247
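
If a line may contain more than two fields starting with RS and all of them are wanted, the same loop can build the joined string directly instead of storing the first two matches. The following is only a sketch under that assumption; join_rs is a hypothetical helper name, not something from the question, and it reads from standard input so it can be tried without a file:

```shell
# join_rs (hypothetical name): for each input line, print every
# whitespace-separated field that starts with RS, joined by "_".
join_rs() {
    awk '{
        out = ""
        for (i = 1; i <= NF; i++)
            if ($i ~ /^RS/)
                out = (out == "" ? $i : out "_" $i)
        print out
    }'
}

# Try it on the three sample lines from the question.
join_rs <<'EOF'
>AMP_16 RS0247 CENPF__ENST00000366955.7__3251__30__0.43333__66.8488__1 RS0255
>AMP_16 RS0332 CENPF__ENST00000366955.7__2262__30__0.43333__67.9513__1 RS0451
>AMP_16 RS0332 CENPF__ENST00000366955.7__1108__30__0.43333__67.4673__1 RS0247
EOF
```

With a real file this would be called as join_rs < fileA.txt. A line with no RS fields produces an empty output line.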

Edit 1:

And assuming that you want your own script patched rather than an efficient solution:

#!/bin/bash
while read -r line
do
        str=""
        for word in $line
        do
                if [[ "$word" =~ ^RS ]]
                then
                        if [[ -z $str ]]
                        then
                                str=$word
                        else
                                str+="_${word}"
                        fi
                fi
        done
        echo "$str"
done < fileA.txt
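
For context, the original loop printed only RS0247 because str was assigned only while it was empty and was never reset or printed per line. As an alternative to appending to a string, the matches on each line can be collected in a bash array and joined at the end. This is a rough sketch, assuming bash (arrays and [[ ]] are not POSIX sh); join_rs_words is a hypothetical function name:

```shell
#!/bin/bash
# join_rs_words (hypothetical name): read lines from standard input,
# keep the words that start with RS, and print them joined by "_".
join_rs_words() {
    local word
    local -a words hits
    # read -a splits each line into the array "words" on whitespace.
    while read -r -a words; do
        hits=()
        for word in "${words[@]}"; do
            if [[ $word == RS* ]]; then
                hits+=("$word")
            fi
        done
        # Inside the subshell, IFS=_ makes "${hits[*]}" join with underscores
        # without changing IFS for the rest of the script.
        (IFS=_; echo "${hits[*]}")
    done
}

# Try it on one sample line from the question.
join_rs_words <<'EOF'
>AMP_16 RS0247 CENPF__ENST00000366955.7__3251__30__0.43333__66.8488__1 RS0255
EOF
```

With the real file it would be run as join_rs_words < fileA.txt.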

Edit 2:

In terms of efficiency: I copied and pasted those 3 lines into fileA.txt 60 times (180 lines total). The runtimes for the three approaches above, in the same order, are:

  • real 0m0.002s
  • real 0m0.002s
  • real 0m0.011s
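
A measurement like this can be reproduced roughly as follows. This is only a sketch, assuming bash (for the time keyword and brace expansion) and a temporary file created with mktemp; the variable names are made up for illustration, and the timings will differ from machine to machine:

```shell
#!/bin/bash
# The three sample lines from the question.
sample='>AMP_16 RS0247 CENPF__ENST00000366955.7__3251__30__0.43333__66.8488__1 RS0255
>AMP_16 RS0332 CENPF__ENST00000366955.7__2262__30__0.43333__67.9513__1 RS0451
>AMP_16 RS0332 CENPF__ENST00000366955.7__1108__30__0.43333__67.4673__1 RS0247'

# Repeat them 60 times into a temporary test file (180 lines total).
bench=$(mktemp)
for i in {1..60}; do
    printf '%s\n' "$sample"
done > "$bench"

# Time the fixed-format awk command; output is discarded so that only
# the processing cost is measured. The timing report goes to stderr.
time awk '{print $2"_"$4}' "$bench" > /dev/null
```

The other two approaches can be timed the same way by substituting the command after time.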
User contributions licensed under: CC BY-SA