I'm having trouble extracting the data after the colons on multiple lines using a while loop and awk.
This is my data structure:
Identifiers:BioSample:SAMD00019077
Identifiers:BioSample:SAMD00019076
Identifiers:BioSample:SAMD00019075
Identifiers:BioSample:SAMD00019074
Identifiers:BioSample:SAMD00019073
Identifiers:BioSample:SAMD00019072
Identifiers:BioSample:SAMD00019071;SRA:DRS051563
Identifiers:BioSample:SAMD00019070;SRA:DRS051562
Identifiers:BioSample:SAMD00019069;SRA:DRS051561
...
Identifiers:BioSample:SAMD00019005;SRA:DRS051497
Identifiers:BioSample:SAMD00015713;SRA:DRS012785
What I want to get is the BioSample ID, which looks like SAMD00019077.
Scripts I tried:
while read line ; do echo $line | awk -F':' '{print $3}' > 1.tmp2 ; done < 1.tmp
for line in `cat 1.tmp`; do echo $line | awk -F':' '{print $3}' > 1.tmp2 ; done
for line in `cat 1.tmp`; do echo $line | awk -F: '{print $3 > "1.tmp2"}' ; done
They only gave the BioSample ID of the last line:
$ while read line ; do echo $line | awk -F':' '{print $3}' > 1.tmp2 ; done < 1.tmp
$ head 1.tmp2
SAMD00015713;SRA
I read the posts below, and it looks like my problem has something to do with stdin, stdout and stderr.
bash read loop only reading first line of input variable
bash while loop read only one line
The solution I tried still gave only one line of output:
$ exec 3<&1
$ exec 1<&2
$ while read line ; do echo $line | awk -F':' '{print $3}' > 1.tmp2 ; done< 1.tmp
$ head 1.tmp2
SAMD00015713;SRA
$ exec 1<&3 3<&-
I also tried exec < 1.tmp to redirect the file to stdin, but it led to an error.
I found these scripts worked very well for me. But I really want to know why the scripts I tried above fail.
cat 1.tmp | awk -F: '{print $3}' | head
awk -F: '{print $3}' 1.tmp | head
Answer
First of all, awk can loop through the lines itself, and the field separator can be a regex.
So, your script can be reduced to this optimized format:
awk -F'[;:]' '{print $3}' 1.tmp > 1.tmp2
Because the separator [;:] splits on both the colon and the semicolon, $3 is the bare ID (SAMD00019071 rather than SAMD00019071;SRA).
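To see how the choice of field separator changes the third field, here is a minimal sketch using two sample lines from the question. The scratch directory and the file name sample.txt are made up for illustration and are not part of the original post.

```shell
# Scratch directory and input file (both names are illustrative assumptions).
workdir=$(mktemp -d)
printf '%s\n' \
  'Identifiers:BioSample:SAMD00019077' \
  'Identifiers:BioSample:SAMD00019071;SRA:DRS051563' > "$workdir/sample.txt"

# With -F: only ':' separates fields, so ";SRA" stays attached to field 3.
colon_only=$(awk -F: '{print $3}' "$workdir/sample.txt")

# With -F'[;:]' both ':' and ';' separate fields, so field 3 is the bare ID.
colon_or_semi=$(awk -F'[;:]' '{print $3}' "$workdir/sample.txt")

printf 'colon only:\n%s\n' "$colon_only"
printf 'colon or semicolon:\n%s\n' "$colon_or_semi"
```

The first command reproduces the SAMD...;SRA form seen in the question's output, while the second yields only the sample IDs.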
Having said that, you might want to know what was wrong in your script.
while read line ; do echo $line | awk -F':' '{print $3}' > 1.tmp2 ; done < 1.tmp
                                                          ^ here
The > marked above is the redirection operator.
It writes the stdout of the command (awk in this case) to the specified file. It does not append; it overwrites.
So, in every iteration of the loop, the file is cleared and the output of the command is written to it. Hence it leaves only the last entry.
To fix that, you can use the append redirection: >>.
while read line ; do echo $line | awk -F':' '{print $3}' >> 1.tmp2 ; done < 1.tmp
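The difference between the two operators can be seen in a small sketch with a three-line input. The scratch directory and the file names in.txt, overwrite.txt and append.txt are hypothetical; the sketch also uses read -r and quotes "$line", which is a defensive habit rather than part of the fix.

```shell
# Scratch directory and a three-line input file (names are illustrative).
workdir=$(mktemp -d)
printf '%s\n' one two three > "$workdir/in.txt"

# '>' inside the loop: the target is truncated on every iteration,
# so only the output of the last iteration survives.
while read -r line; do
  echo "$line" > "$workdir/overwrite.txt"
done < "$workdir/in.txt"

# '>>' inside the loop: each iteration adds to the end of the file.
while read -r line; do
  echo "$line" >> "$workdir/append.txt"
done < "$workdir/in.txt"

overwrite_result=$(cat "$workdir/overwrite.txt")
append_count=$(wc -l < "$workdir/append.txt" | tr -d ' ')

echo "overwrite.txt holds: $overwrite_result"
echo "append.txt has $append_count lines"
```

The overwritten file ends up with only the last input line, which matches the single SAMD00015713;SRA line seen in the question.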
Now, there is a caveat. What if the file is not originally empty? This loop will append to the file, without clearing the file first. To fix that, you can first clear the file with:
>1.tmp2; while read line ; do echo $line | awk -F':' '{print $3}' >> 1.tmp2 ; done < 1.tmp
However, if we are sure that all the stdout produced by the loop needs to go into the file, we can simply move the redirection out of the loop. That way, the shell opens the output file once instead of on every iteration.
while read line ; do echo $line | awk -F':' '{print $3}'; done < 1.tmp > 1.tmp2
Note that these loop-based options are unoptimized but still work. The optimized option is to let awk itself do the line-by-line processing, as in the first snippet of this answer.
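As a rough check that the loop with the redirection moved onto done produces the same file as the plain awk call, here is a sketch under a few assumptions: the scratch directory and the output names loop.out and awk.out are invented for the example, and read -r with a quoted "$line" is used instead of the unquoted form above.

```shell
# Scratch directory with a two-line copy of the question's data (illustrative).
workdir=$(mktemp -d)
printf '%s\n' \
  'Identifiers:BioSample:SAMD00019077' \
  'Identifiers:BioSample:SAMD00019076' > "$workdir/1.tmp"

# Redirection attached to 'done': the output file is opened once
# and receives the stdout of every iteration.
while read -r line; do
  echo "$line" | awk -F: '{print $3}'
done < "$workdir/1.tmp" > "$workdir/loop.out"

# awk reading the file directly, with no shell loop at all.
awk -F: '{print $3}' "$workdir/1.tmp" > "$workdir/awk.out"

# cmp -s is silent and only sets the exit status.
if cmp -s "$workdir/loop.out" "$workdir/awk.out"; then
  same=yes
else
  same=no
fi
echo "outputs identical: $same"
```

Both approaches write every ID, but the loop starts one awk process per input line, whereas the direct call starts a single one.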