I have a fileA.txt which contains strings:
    >AMP_16 RS0247 CENPF__ENST00000366955.7__3251__30__0.43333__66.8488__1 RS0255
    >AMP_16 RS0332 CENPF__ENST00000366955.7__2262__30__0.43333__67.9513__1 RS0451
    >AMP_16 RS0332 CENPF__ENST00000366955.7__1108__30__0.43333__67.4673__1 RS0247
    and so on .....
I would like to extract all the substrings that start with RS. Expected output:
    RS0247_RS0255
    RS0332_RS0451
    RS0332_RS0247
I tried something like this:
    while read line
    do
        for word in $line
        do
            if [[ "$word" =~ ^RS ]]
            then
                [ -z "$str" ] && str="$word"
            fi
        done
    done < fileA.txt
    echo "$str"
However, I only get the first string, RS0247, printed when I run echo.
Answer
Given the three sample lines pasted above, saved in the file f:
…
Assuming a fixed format:
    awk '{print $2"_"$4}' f

Output:

    RS0247_RS0255
    RS0332_RS0451
    RS0332_RS0247
Assuming a flexible format (and that you're only interested in the first two occurrences of fields that start with RS):
    awk '{
        f = 1
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^RS/) {
                a[f] = $i
                f++
            }
        }
        print a[1] "_" a[2]
    }' f

Output:

    RS0247_RS0255
    RS0332_RS0451
    RS0332_RS0247
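One caveat with the flexible version: the array a is never cleared between lines, so a line with fewer than two RS fields would print a value left over from an earlier line. As a hedged sketch (not part of the original answer), the variant below builds the joined string in a per-line variable instead and joins every RS field it finds, not just the first two. The temporary file and the variable names `sample`, `out`, and `result` are assumptions made for illustration.

```shell
# Hypothetical sample file reproducing the three lines from the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
>AMP_16 RS0247 CENPF__ENST00000366955.7__3251__30__0.43333__66.8488__1 RS0255
>AMP_16 RS0332 CENPF__ENST00000366955.7__2262__30__0.43333__67.9513__1 RS0451
>AMP_16 RS0332 CENPF__ENST00000366955.7__1108__30__0.43333__67.4673__1 RS0247
EOF

# Join every field that starts with RS using "_".
# "out" is reset for each input line, so nothing carries over between lines.
result=$(awk '{
    out = ""
    for (i = 1; i <= NF; i++)
        if ($i ~ /^RS/)
            out = (out == "" ? $i : out "_" $i)
    if (out != "") print out    # skip lines with no RS field at all
}' "$sample")

printf '%s\n' "$result"
rm -f "$sample"
```

On the three sample lines this prints the same three pairs as the commands above; the difference only shows up on lines with one RS field or with more than two.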
Edit 1:
And assuming that you want your own script patched rather than an efficient solution:
    #!/bin/bash
    while read line
    do
        str=""
        for word in $line
        do
            if [[ "$word" =~ ^RS ]]
            then
                if [[ -z $str ]]
                then
                    str=$word
                else
                    str+="_${word}"
                fi
            fi
        done
        echo "$str"
    done < fileA.txt
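The patched script differs from the original attempt in two ways: str is reset at the start of every line, and the echo sits inside the outer loop, so one result is printed per input line. If the script has to run under a plain POSIX sh rather than bash, the same logic can be sketched without `[[ ]]` or `+=`. This is an illustrative variant, not the answer's own code; the temporary file and the names `sample` and `result` are assumptions.

```shell
# Hypothetical sample file reproducing the three lines from the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
>AMP_16 RS0247 CENPF__ENST00000366955.7__3251__30__0.43333__66.8488__1 RS0255
>AMP_16 RS0332 CENPF__ENST00000366955.7__2262__30__0.43333__67.9513__1 RS0451
>AMP_16 RS0332 CENPF__ENST00000366955.7__1108__30__0.43333__67.4673__1 RS0247
EOF

result=$(
    set -f                          # disable globbing: $line is only word-split
    while read -r line; do
        str=""                      # reset for every input line
        for word in $line; do       # unquoted on purpose, to split on whitespace
            case $word in
                # Prepend "<str>_" only when str is already non-empty.
                RS*) str=${str:+${str}_}$word ;;
            esac
        done
        [ -n "$str" ] && printf '%s\n' "$str"
    done < "$sample"
)

printf '%s\n' "$result"
rm -f "$sample"
```

The `${str:+${str}_}` expansion yields an empty string while str is empty and `<str>_` afterwards, which replaces the explicit if/else in the bash version.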
Edit 2:
In terms of efficiency: I copied and pasted those 3 lines into fileA.txt 60 times (180 lines total). The runtimes for the three approaches above, in the same order, are:
- real 0m0.002s
- real 0m0.002s
- real 0m0.011s
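To reproduce a comparison like this, the test file can be built by repeating the three sample lines 60 times and timing each command against it. The sketch below assumes bash (where `time` is a shell keyword) and uses temporary files named `sample` and `bigfile`, both hypothetical; absolute timings will differ from machine to machine, so only the relative order is meaningful.

```shell
# Hypothetical sample file reproducing the three lines from the question.
sample=$(mktemp)
bigfile=$(mktemp)
cat > "$sample" <<'EOF'
>AMP_16 RS0247 CENPF__ENST00000366955.7__3251__30__0.43333__66.8488__1 RS0255
>AMP_16 RS0332 CENPF__ENST00000366955.7__2262__30__0.43333__67.9513__1 RS0451
>AMP_16 RS0332 CENPF__ENST00000366955.7__1108__30__0.43333__67.4673__1 RS0247
EOF

# Repeat the 3 sample lines 60 times to get a 180-line input file.
i=0
while [ "$i" -lt 60 ]; do
    cat "$sample"
    i=$((i + 1))
done > "$bigfile"

# Count the lines; tr strips the padding some wc implementations add.
lines=$(wc -l < "$bigfile" | tr -d ' ')
echo "lines: $lines"

# Time the fixed-format awk command; output is discarded so that
# terminal I/O does not dominate the measurement.
time awk '{print $2"_"$4}' "$bigfile" > /dev/null

rm -f "$sample" "$bigfile"
```

The other two approaches can be timed the same way by substituting their commands after `time`.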