I have a fileA.txt which contains strings:
>AMP_16 RS0247 CENPF__ENST00000366955.7__3251__30__0.43333__66.8488__1 RS0255
>AMP_16 RS0332 CENPF__ENST00000366955.7__2262__30__0.43333__67.9513__1 RS0451
>AMP_16 RS0332 CENPF__ENST00000366955.7__1108__30__0.43333__67.4673__1 RS0247
and so on ...
I would like to extract all the substrings that start with RS and join the ones from each line with an underscore. Expected output:
RS0247_RS0255
RS0332_RS0451
RS0332_RS0247
I tried something like this:
while read line
do
    for word in $line
    do
        if [[ "$word" =~ ^RS ]]
        then
            [ -z "$str" ] && str="$word"
        fi
    done
done < fileA.txt
echo "$str"
However, only the first string, RS0247, is printed when I run the echo.
Answer
Given the three sample lines pasted above in the file f…
Assuming a fixed format:
awk '{print $2"_"$4}' f
RS0247_RS0255
RS0332_RS0451
RS0332_RS0247
Assuming a flexible format (and that you're only interested in the first two occurrences of fields that start with RS):
awk '{f=1;for (i=1;i<=NF;i++){if($i~/^RS/){a[f]=$i;f++}}print a[1]"_"a[2]}' f
RS0247_RS0255
RS0332_RS0451
RS0332_RS0247
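If a line may carry more or fewer than two RS fields, a variant that joins every match and rebuilds its result from scratch for each line may be safer. The following is only a minimal sketch, assuming whitespace-separated fields; the function name `join_rs_fields` is hypothetical and not part of the original answer:

```shell
#!/bin/bash
# join_rs_fields (hypothetical name): for each input line, print every
# whitespace-separated field that starts with "RS", joined by underscores.
# Reads the files given as arguments, or stdin if none are given.
join_rs_fields() {
    awk '{
        out = ""                          # reset per line so nothing leaks from the previous record
        for (i = 1; i <= NF; i++)
            if ($i ~ /^RS/)
                out = (out == "" ? $i : out "_" $i)
        if (out != "") print out          # skip lines without any RS field
    }' "$@"
}
```

It would be called as `join_rs_fields fileA.txt`. Unlike the two-element array version, it does not print a stale second value when a line contains only one RS field.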
Edit 1:
And assuming that you want your own script patched rather than an efficient solution:
#!/bin/bash
while read line
do
    str=""
    for word in $line
    do
        if [[ "$word" =~ ^RS ]]
        then
            if [[ -z $str ]]
            then
                str=$word
            else
                str+="_${word}"
            fi
        fi
    done
    echo "$str"
done < fileA.txt
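The two changes that matter in the patched script are that str is reset at the start of every line and echoed inside the outer loop; the original set str only while it was empty and printed it once after the loop, so only the very first RS word survived. The same idea can be sketched a little more defensively. This is an illustrative variant, not the answer's own code: the function name `rs_pairs` is hypothetical, and it assumes the input arrives on stdin.

```shell
#!/bin/bash
# rs_pairs (hypothetical name): read lines from stdin and print the
# RS-prefixed words of each line joined by underscores.
# Note: line, word and str are global variables in this sketch.
rs_pairs() {
    # "|| [ -n "$line" ]" also processes a last line that lacks a trailing newline
    while IFS= read -r line || [ -n "$line" ]; do
        str=""                        # reset for every input line
        set -f                        # disable globbing while $line is word-split
        for word in $line; do
            case $word in
                RS*) str="${str:+${str}_}$word" ;;   # append with "_" only if str is non-empty
            esac
        done
        set +f                        # re-enable globbing
        if [ -n "$str" ]; then
            printf '%s\n' "$str"      # print once per line, inside the loop
        fi
    done
}
```

A call such as `rs_pairs < fileA.txt` should produce one joined string per input line that contains at least one RS word.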
Edit 2:
In terms of efficiency: I copied and pasted those three lines into fileA.txt 60 times (180 lines total). The runtimes for the three approaches above, in the same order, are:
- real 0m0.002s
- real 0m0.002s
- real 0m0.011s
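A comparison like this could be reproduced roughly as follows. This is a sketch under assumptions: `make_bench_file`, `sample.txt` and `bench.txt` are hypothetical names, `sample.txt` is assumed to hold the three sample lines, and the timings will differ from machine to machine.

```shell
#!/bin/bash
# make_bench_file (hypothetical helper): write 60 copies of the file SRC
# into DST, so a 3-line sample becomes a 180-line test file.
make_bench_file() {
    src=$1
    dst=$2
    i=0
    : > "$dst"                        # start from an empty output file
    while [ "$i" -lt 60 ]; do
        cat "$src" >> "$dst"
        i=$((i + 1))
    done
}

# Example timing run (output discarded so only the runtime is measured):
#   make_bench_file sample.txt bench.txt
#   time awk '{print $2"_"$4}' bench.txt > /dev/null
```

With only 180 lines the differences are in the millisecond range, so a larger input would likely be needed before the gap between awk and the shell loop becomes noticeable in practice.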