
Bash loop only read the last line

I am having problems extracting the data after the colons on multiple lines using a while loop and awk.

This is my data structure:

Identifiers:BioSample:SAMD00019077
Identifiers:BioSample:SAMD00019076
Identifiers:BioSample:SAMD00019075
Identifiers:BioSample:SAMD00019074
Identifiers:BioSample:SAMD00019073
Identifiers:BioSample:SAMD00019072
Identifiers:BioSample:SAMD00019071;SRA:DRS051563
Identifiers:BioSample:SAMD00019070;SRA:DRS051562
Identifiers:BioSample:SAMD00019069;SRA:DRS051561
...
Identifiers:BioSample:SAMD00019005;SRA:DRS051497
Identifiers:BioSample:SAMD00015713;SRA:DRS012785

What I want to get is the BioSample ID, which is like SAMD00019077.
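For clarity, the output I expect (one BioSample ID per input line) looks like this:

SAMD00019077
SAMD00019076
SAMD00019075
...
SAMD00015713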

Scripts I tried:

  1. while read line ; do echo $line | awk -F':' '{print $3}' > 1.tmp2 ; done < 1.tmp
  2. for line in $(cat 1.tmp); do echo $line | awk -F':' '{print $3}' > 1.tmp2 ; done
  3. for line in $(cat 1.tmp); do echo $line | awk -F: '{print $3 > "1.tmp2"}' ; done

They only gave the BioSample ID of the last line:

$ while read line ; do echo $line | 
  awk -F':' '{print $3}' > 1.tmp2 ; done < 1.tmp
$ head 1.tmp2
SAMD00015713;SRA

I read these posts, and it looks like my problem has something to do with stdin, stdout, and stderr:

bash read loop only reading first line of input variable

bash while loop read only one line

A solution I tried from those posts; it still gave only one line:

$ exec 3<&1
$ exec 1<&2
$ while read line ; do echo $line |  
  awk -F':' '{print $3}' > 1.tmp2 ; done< 1.tmp
$ head 1.tmp2
SAMD00015713;SRA
$ exec 1<&3 3<&-

I also tried exec < 1.tmp to redirect the file to stdin, but it led to an error.

I found that the following scripts work very well for me, but I really want to know why the scripts I tried above failed.

cat 1.tmp | awk -F: '{print $3}' | head

awk -F: '{print $3}' 1.tmp | head


Answer

First of all, awk loops through the input lines on its own, and its field separator can be a regex.

So, your whole script can be reduced to this single awk command:

awk -F'[;:]' '{print $3}' 1.tmp > 1.tmp2

awk reads the file line by line by itself, so no shell loop is needed.
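As a quick check (a sketch using one of the sample lines from the question), the [;:] separator also splits off the ;SRA:... suffix, so $3 is the bare BioSample ID:

$ echo 'Identifiers:BioSample:SAMD00019071;SRA:DRS051563' | awk -F'[;:]' '{print $3}'
SAMD00019071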

Having said that, you might want to know what was wrong with your script.

while read line ; do echo $line | awk -F':' '{print $3}' > 1.tmp2 ; done < 1.tmp
                                                         ^ here

The > marked above is the redirection operator. It writes the stdout of the command (awk in this case) to the specified file. It does not append; it overwrites. So, on every iteration of the loop, the file is truncated and only the output of that single command is written to it. Hence only the last entry is left.
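A minimal illustration of that truncation, using a hypothetical out.txt:

$ for i in 1 2 3; do echo "$i" > out.txt; done
$ cat out.txt
3

Each iteration truncates out.txt before writing, so only the last value survives.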

To fix that, you can use the append redirection: >>.

while read line ; do echo $line | awk -F':' '{print $3}' >> 1.tmp2 ; done < 1.tmp

Now, there is a caveat. What if the file is not empty to begin with? This loop appends to it without clearing it first. To fix that, truncate the file before the loop:

>1.tmp2; while read line ; do echo $line | awk -F':' '{print $3}' >> 1.tmp2 ; done < 1.tmp

However, if we are sure that all the stdout produced by the loop needs to go into the file, you can simply move the redirection outside the loop. That way, the shell does not have to keep opening and closing the file on every iteration.

while read line ; do echo $line | awk -F':' '{print $3}'; done < 1.tmp > 1.tmp2
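Checking with the same head command used in the question (a sketch, assuming the sample data shown above):

$ head -3 1.tmp2
SAMD00019077
SAMD00019076
SAMD00019075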

Note that these loop-based options are unoptimized, but they still work. The optimized option is to let awk itself do the line-by-line processing, as shown in the first snippet of this answer.
