Skip to content
Advertisement

grep issues when using two files – I’ve tried everything

I have two files (recode and reads) that were built and saved with nano command and I want to compare what has on recode to reads and extract the lines in reads that overlaps. I have been trying to create a when loop with the previous logic on mind, but without success so far. The output data is not matching with the pattern specified in the loop while with grep/recode. The script was supposed to read each line in recode.txt compare to reads.fastq, extract each match line plus one line before and 2 after in the reads.txt and save the output in different files (for all combined match lines per line of the recode.txt). Here are the tables and code:

File recode.txt:

GTGTCTTA+ATCACGAC
GTGTCTTA+ACAGTGGT
GTGTCTTA+CAGATCCA
GTGTCTTA+ACAAACGG
GTGTCTTA+ACCCAGCA
GTGTCTTA+AACCCCTC
GTGTCTTA+CCCAACCT
ATCACGAC+AAGGTTCA
GTGTCTTA+GAAACCCA

File reads.fastq:

###################################
@NB500931:113:HW53WBGX2:1:11101:11338:1049 1:N:0:ATCACGAC+AAGGTTCA
GTAGTNCCAGCTGCAGAGCTGGAAGGATCGCTTGAGCGCAGAGGTAGAGGCTACAGTGAGCCGTGATCATGCCAT
+
AAAAA#EAAEEEEE6EAEAEEEEEEEEEEEEEEEAEEEEEE/EEEEEEEEEE/EEEEEEEEEEEEEEEAEEEEEA
@NB500931:113:HW53WBGX2:1:11101:6116:1049 1:N:0:ACAAACGG+AAGGTTCA
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
+
###################################
@NB500931:113:HW53WBGX2:1:11101:6885:1049 1:N:0:ACCCAGCA+ACTTAGCA
GAGGGNGCTGTCCCAGTAATTGGGTTCAGATGACATTTGCTTGATTTTAGGGATGTACGAGATTTTCGTGGATC
+
AAA/A#EAEEEEEAEAEEA///EEAEEEEE///AEEAEE/AA//EAA<EEE/E//AEEEAAA//E/A<6//EEA
@NB500931:113:HW53WBGX2:1:11101:8246:1049 1:N:0:ATCACGAC+AAGGTTCA
CTTGTNAGACACGATGCAGAGAATTAGCTGTTTGATGCCTATCTTCCCAACTCAGAGGCAAGCTGCCCAAAGGC
+

Script:

#!/bin/bash
#PBS -l nodes=1:ppn=8,walltime=96:00:00

while read line
do
echo "working on $line"
grep -A3 "$line" reads.fastq | grep -v "^--$" >> "$line"_sorted.fastq
done<recode.txt

So, both files are in UNIX format and the following script (without a loop) works smooth

According to the script without the looping:

grep -A3 "ATCACGAC+AAGGTTCA" reads.fastq | grep -v "^--$" > sorted_file.fastq

my output should be:

            @NB500931:113:HW53WBGX2:1:11101:11338:1049 1:N:0:ATCACGAC+AAGGTTCA
   GTAGTNCCAGCTGCAGAGCTGGAAGGATCGCTTGAGCGCAGAGGTAGAGGCTACAGTGAGCCGTGATCATGCCAT
            +

    @NB500931:113:HW53WBGX2:1:11101:8246:1049 1:N:0:ATCACGAC+AAGGTTCA
    CTTGTNAGACACGATGCAGAGAATTAGCTGTTTGATGCCTATCTTCCCAACTCAGAGGCAAGCTGCCCAAAGGC
            +

However, my output using the loop while give me a empty file with the correct name. Can you please help me?

UPDATE: I have tried dos2unix to convert my files and it didn’t work. UPDATE: I edited the question to include my expected output

Advertisement

Answer

Without seeing the expected output it’s a guess but it sounds like this is what you’re trying to do:

$ awk -F: 'NR==FNR{a[$0];next} $NF in a{c=3} c&&c--' recode.txt reads.fastq
@NB500931:113:HW53WBGX2:1:11101:8246:1049 1:N:0:ATCACGAC+AAGGTTCA
CTTGTNAGACACGATGCAGAGAATTAGCTGTTTGATGCCTATCTTCCCAACTCAGAGGCAAGCTGCCCAAAGGC
+

No shell loop required (see why-is-using-a-shell-loop-to-process-text-considered-bad-practice for SOME of the reasons why that matters), just saves the values from recode.txt as array indices and then when reading reads.fastq if the last :-separated field is an index of the array (i.e. existed in recode.txt) then set a counter to 3 and then print every line while the counter is greater than zero, decrementing the counter each time (see printing-with-sed-or-awk-a-line-following-a-matching-pattern for other examples of printing text starting from a match).

To save each found record in a file based on the string name in that final field as it looks like you might be trying to do in your shell loop would be:

awk -F: '
    NR==FNR  { a[$0]; next }
    $NF in a { c=3; close(out); out=$NF"_sorted.fastq" }
    c&&c--   { print >> out }
' recode.txt reads.fastq

Note that that just reads “reads.fastq” once total, not once per line of “recode.txt” as your shell loop was doing, so you can expect a vast performance improvement from that aspect alone.

Finally – if recode.txt is just a list of ALL of the final fields that exist in reads.fastq then you simply don’t need it, this is all you need to split reads.fastq into separate files of 3 lines per record named based on the value after the last : on each line that starts with @:

awk -F: '
    /^@/   { c=3; close(out); out=$NF"_sorted.fastq" }
    c&&c-- { print >> out }
' reads.fastq
User contributions licensed under: CC BY-SA
2 People found this is helpful
Advertisement