I have two files (recode.txt and reads.fastq) that were created and saved with the nano command, and I want to compare what is in recode.txt against reads.fastq and extract the matching lines from reads.fastq. I have been trying to build a while loop with that logic in mind, but without success so far: the output data does not match the pattern given to grep inside the while loop. The script is supposed to read each line of recode.txt, compare it to reads.fastq, extract each matching line plus one line before and two after it, and save the output to a separate file per line of recode.txt (all matching records for that line combined). Here are the files and the code:
File recode.txt:

GTGTCTTA+ATCACGAC
GTGTCTTA+ACAGTGGT
GTGTCTTA+CAGATCCA
GTGTCTTA+ACAAACGG
GTGTCTTA+ACCCAGCA
GTGTCTTA+AACCCCTC
GTGTCTTA+CCCAACCT
ATCACGAC+AAGGTTCA
GTGTCTTA+GAAACCCA
File reads.fastq:

###################################
@NB500931:113:HW53WBGX2:1:11101:11338:1049 1:N:0:ATCACGAC+AAGGTTCA
GTAGTNCCAGCTGCAGAGCTGGAAGGATCGCTTGAGCGCAGAGGTAGAGGCTACAGTGAGCCGTGATCATGCCAT
+
AAAAA#EAAEEEEE6EAEAEEEEEEEEEEEEEEEAEEEEEE/EEEEEEEEEE/EEEEEEEEEEEEEEEAEEEEEA
@NB500931:113:HW53WBGX2:1:11101:6116:1049 1:N:0:ACAAACGG+AAGGTTCA
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
+
###################################
@NB500931:113:HW53WBGX2:1:11101:6885:1049 1:N:0:ACCCAGCA+ACTTAGCA
GAGGGNGCTGTCCCAGTAATTGGGTTCAGATGACATTTGCTTGATTTTAGGGATGTACGAGATTTTCGTGGATC
+
AAA/A#EAEEEEEAEAEEA///EEAEEEEE///AEEAEE/AA//EAA<EEE/E//AEEEAAA//E/A<6//EEA
@NB500931:113:HW53WBGX2:1:11101:8246:1049 1:N:0:ATCACGAC+AAGGTTCA
CTTGTNAGACACGATGCAGAGAATTAGCTGTTTGATGCCTATCTTCCCAACTCAGAGGCAAGCTGCCCAAAGGC
+
Script:
#!/bin/bash
#PBS -l nodes=1:ppn=8,walltime=96:00:00
while read line
do
  echo "working on $line"
  grep -A3 "$line" reads.fastq | grep -v "^--$" >> "$line"_sorted.fastq
done < recode.txt
So, both files are in UNIX format, and the following command (without a loop) works smoothly.
According to that non-loop version:
grep -A3 "ATCACGAC+AAGGTTCA" reads.fastq | grep -v "^--$" > sorted_file.fastq
my output should be:
@NB500931:113:HW53WBGX2:1:11101:11338:1049 1:N:0:ATCACGAC+AAGGTTCA
GTAGTNCCAGCTGCAGAGCTGGAAGGATCGCTTGAGCGCAGAGGTAGAGGCTACAGTGAGCCGTGATCATGCCAT
+
@NB500931:113:HW53WBGX2:1:11101:8246:1049 1:N:0:ATCACGAC+AAGGTTCA
CTTGTNAGACACGATGCAGAGAATTAGCTGTTTGATGCCTATCTTCCCAACTCAGAGGCAAGCTGCCCAAAGGC
+
However, my output using the while loop gives me an empty file with the correct name. Can you please help me?
UPDATE: I have tried dos2unix to convert my files and it didn't work.
UPDATE: I edited the question to include my expected output.
Answer
Without seeing the expected output it’s a guess but it sounds like this is what you’re trying to do:
$ awk -F: 'NR==FNR{a[$0];next} $NF in a{c=3} c&&c--' recode.txt reads.fastq
@NB500931:113:HW53WBGX2:1:11101:11338:1049 1:N:0:ATCACGAC+AAGGTTCA
GTAGTNCCAGCTGCAGAGCTGGAAGGATCGCTTGAGCGCAGAGGTAGAGGCTACAGTGAGCCGTGATCATGCCAT
+
@NB500931:113:HW53WBGX2:1:11101:8246:1049 1:N:0:ATCACGAC+AAGGTTCA
CTTGTNAGACACGATGCAGAGAATTAGCTGTTTGATGCCTATCTTCCCAACTCAGAGGCAAGCTGCCCAAAGGC
+
No shell loop required (see why-is-using-a-shell-loop-to-process-text-considered-bad-practice for SOME of the reasons why that matters). It just saves the values from recode.txt as array indices; then, while reading reads.fastq, if the last :-separated field of a line is an index of the array (i.e. it existed in recode.txt), it sets a counter to 3 and prints every line while the counter is greater than zero, decrementing the counter each time (see printing-with-sed-or-awk-a-line-following-a-matching-pattern for other examples of printing text starting from a match).
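The lookup-plus-counter idiom can be sketched on throwaway input; the file names and contents below are invented for illustration, not taken from the question:

```shell
# Demo of the NR==FNR lookup table plus the c&&c-- counter idiom.
# /tmp/keys_demo.txt and /tmp/data_demo.txt are made-up illustration files.
printf 'foo\n' > /tmp/keys_demo.txt
printf 'x:foo\nline2\nline3\nx:bar\nline5\nline6\n' > /tmp/data_demo.txt

# First pass (NR==FNR): store every line of the keys file as an array index.
# Second pass: when a line's last :-separated field is a stored key, set c=3,
# then print each line while c is non-zero, decrementing c each time.
awk -F: 'NR==FNR{a[$0];next} $NF in a{c=3} c&&c--' /tmp/keys_demo.txt /tmp/data_demo.txt
# prints:
# x:foo
# line2
# line3
```

Note that "x:bar" and the lines after it are not printed because "bar" was never stored in the array.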
To save each found record in a file based on the string name in that final field as it looks like you might be trying to do in your shell loop would be:
awk -F: '
NR==FNR  { a[$0]; next }
$NF in a { c=3; close(out); out=$NF"_sorted.fastq" }
c&&c--   { print >> out }
' recode.txt reads.fastq
Note that this reads reads.fastq once in total, not once per line of recode.txt as your shell loop was doing, so you can expect a vast performance improvement from that aspect alone.
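On tiny synthetic input (the demo file names and contents are invented), the splitting version behaves like this:

```shell
# Made-up miniature inputs to illustrate the file-splitting awk.
cd /tmp
rm -f foo_sorted.fastq
printf 'foo\n' > recode_demo.txt
printf 'x:foo\nseq\n+\nx:bar\nseq2\n+\n' > reads_demo.txt

# close(out) releases the previous output file's handle; print >> appends,
# so all matching records sharing a key accumulate in one file per key.
awk -F: '
NR==FNR  { a[$0]; next }
$NF in a { c=3; close(out); out=$NF"_sorted.fastq" }
c&&c--   { print >> out }
' recode_demo.txt reads_demo.txt

cat foo_sorted.fastq
# x:foo
# seq
# +
```

Only foo_sorted.fastq is created; the "x:bar" record has no key in recode_demo.txt, so no bar_sorted.fastq appears.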
Finally, if recode.txt is just a list of ALL of the final fields that exist in reads.fastq, then you simply don't need it. This is all you need to split reads.fastq into separate files of 3 lines per record, named based on the value after the last : on each line that starts with @:

awk -F: '
/^@/   { c=3; close(out); out=$NF"_sorted.fastq" }
c&&c-- { print >> out }
' reads.fastq
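As a quick sanity check, here is that header-driven splitter run on a made-up four-line record (the AAGG barcode and file names are invented for the demo):

```shell
cd /tmp
rm -f AAGG_sorted.fastq
printf '@r1 1:N:0:AAGG\nACGT\n+\nEEEE\n' > mini_demo.fastq

# Each @ header starts a new output file named after its last :-field;
# c=3 copies the header, the sequence, and the + separator (not the
# quality line, matching the 3-line records in the question's output).
awk -F: '/^@/{c=3; close(out); out=$NF"_sorted.fastq"} c&&c--{print >> out}' mini_demo.fastq

cat AAGG_sorted.fastq
# @r1 1:N:0:AAGG
# ACGT
# +
```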