Looping over pairs of files

Hello, I need to iterate over pairs of files and run a command on each pair.

For example, I have 4 files named AA2234_1.fastq.gz, AA2234_2.fastq.gz, AA3945_1.fastq.gz and AA3945_2.fastq.gz.

As you can probably tell, the pairs are AA2234_1.fastq.gz <-> AA2234_2.fastq.gz and AA3945_1.fastq.gz <-> AA3945_2.fastq.gz (the files in a pair share the part of the name before the _ character).

I have a command with syntax looking like this:

initialize_of_command file1 file2 output_a output_b output_c output_d parameters

I want the script to count the files with the fastq.gz extension in a directory and divide that count by 2 to get the number of pairs. It should then match the pairs together, probably with a regex (maybe into two variables), and execute the command once for each pair.

I have no idea how to pair up the files using a regex, or how to iterate over the pairs so that the script knows which pairs it has already processed.

Here is my unfinished script:

#!/bin/bash
raw_count_of_files=$(ls | grep -c "fastq.gz")
count_of_files=$((raw_count_of_files / 2))

for ((i=1;i<=count_of_files;i++));
do
java -jar /home/aa/git/trimmomatic/src/Trimmomatic/trimmomatic-0.39.jar PE -phred33 AA2234_1.fastq.gz AA2234_2.fastq.gz AA2234_forward_paired.fq.gz AA2234_forward_unpaired.fq.gz AA2234_reverse_paired.fq.gz AA2234_reverse_unpaired.fq.gz SLIDINGWINDOW:4:20 MINLEN:20;
done

I would also like the output files to be named after the shared part of the input file names, which in this case is AA2234 and AA3945.

The desired output of the script is 8 files, named according to the pairs:

AA2234_forward_paired.fq.gz 
AA2234_forward_unpaired.fq.gz 
AA2234_reverse_paired.fq.gz 
AA2234_reverse_unpaired.fq.gz

and

AA3945_forward_paired.fq.gz 
AA3945_forward_unpaired.fq.gz 
AA3945_reverse_paired.fq.gz 
AA3945_reverse_unpaired.fq.gz

Answer

Assuming the filenames do not contain whitespace, would you please try:

#!/bin/bash

declare -A hash                         # associative array mapping base name to its files
for f in *fastq.gz; do                  # loop over files with the suffix
    [[ -e $f ]] || continue             # skip the literal pattern if nothing matched
    base=${f%_*}                        # strip the last "_" and everything after it
    if [[ -z ${hash[$base]} ]]; then    # if nothing is stored for this base name yet
        hash[$base]=$f                  # then store the filename
    else
        hash[$base]+=" $f"              # else append the filename, delimited by whitespace
    fi
done

for base in "${!hash[@]}"; do           # loop over the hash keys (basename)
    read -r f1 f2 <<< "${hash[$base]}"  # split into filenames

    echo java -jar /home/aa/git/trimmomatic/src/Trimmomatic/trimmomatic-0.39.jar PE -phred33 "$f1" "$f2" "$base"_forward_paired.fq.gz "$base"_forward_unpaired.fq.gz "$base"_reverse_paired.fq.gz "$base"_reverse_unpaired.fq.gz SLIDINGWINDOW:4:20 MINLEN:20;
done

The script prints the java command lines as a dry run. If the output looks good, remove the echo and run the script again.
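As a possible alternative, here is a minimal sketch that skips the associative array: it loops only over the *_1.fastq.gz files and derives the expected mate name with parameter expansion. It assumes every pair follows the NAME_1.fastq.gz / NAME_2.fastq.gz pattern from the question. The function name trim_pairs_dry_run and the variable jar are made up for illustration; the jar path is the one from the question. It is also a dry run that only prints the commands.

```shell
#!/bin/bash

# Path copied from the question; adjust it to your installation.
jar=/home/aa/git/trimmomatic/src/Trimmomatic/trimmomatic-0.39.jar

# trim_pairs_dry_run is a hypothetical helper name, not part of Trimmomatic.
# It prints one command per complete pair found in the current directory.
trim_pairs_dry_run() {
    local f1 f2 base
    for f1 in *_1.fastq.gz; do
        [[ -e $f1 ]] || continue              # glob matched nothing
        base=${f1%_1.fastq.gz}                # shared name, e.g. AA2234
        f2=${base}_2.fastq.gz                 # expected mate of f1
        if [[ ! -e $f2 ]]; then
            echo "skipping $f1: $f2 not found" >&2
            continue
        fi
        echo java -jar "$jar" PE -phred33 "$f1" "$f2" \
            "${base}_forward_paired.fq.gz" "${base}_forward_unpaired.fq.gz" \
            "${base}_reverse_paired.fq.gz" "${base}_reverse_unpaired.fq.gz" \
            SLIDINGWINDOW:4:20 MINLEN:20
    done
}

trim_pairs_dry_run
```

Because each filename stays in its own quoted variable, this version should also cope with whitespace in filenames once the echo is removed, and a file without a mate is reported on stderr instead of being passed to the command.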

User contributions licensed under: CC BY-SA