Hello, I need to iterate over pairs of files and run a command on each pair.
For example, I have 4 files named:
AA2234_1.fastq.gz
AA2234_2.fastq.gz
AA3945_1.fastq.gz
AA3945_2.fastq.gz
As you can probably tell, the pairs are AA2234_1.fastq.gz <-> AA2234_2.fastq.gz and AA3945_1.fastq.gz <-> AA3945_2.fastq.gz (they share the part of the name before the _ sign).
I have a command with syntax like this:
initialize_of_command file1 file2 output_a output_b output_c output_d parameters
I want this script to count the files with the fastq.gz extension in a directory, divide that count by 2 to get the number of pairs, match the pairs together, probably using a regex (maybe into two variables), and execute the command once for each pair.
I have no idea how to pair up those files using a regex, or how to iterate over the pairs so that the script knows which pairs it has already processed.
Here is my unfinished script:
#!/bin/bash

raw_count_of_files=$(ls | grep -c "fastq.gz")
count_of_files=$((raw_count_of_files / 2))

for ((i=1;i<=count_of_files;i++)); do
    java -jar /home/aa/git/trimmomatic/src/Trimmomatic/trimmomatic-0.39.jar PE -phred33 \
        AA2234_1.fastq.gz AA2234_2.fastq.gz \
        AA2234_forward_paired.fq.gz AA2234_forward_unpaired.fq.gz \
        AA2234_reverse_paired.fq.gz AA2234_reverse_unpaired.fq.gz \
        SLIDINGWINDOW:4:20 MINLEN:20
done
Also, I would like the output files to be named after the shared name of the input files, which in this case is AA2234 and AA3945.
The desired output of this script is 8 files, named according to the pairs:
AA2234_forward_paired.fq.gz AA2234_forward_unpaired.fq.gz AA2234_reverse_paired.fq.gz AA2234_reverse_unpaired.fq.gz
and
AA3945_forward_paired.fq.gz AA3945_forward_unpaired.fq.gz AA3945_reverse_paired.fq.gz AA3945_reverse_unpaired.fq.gz
Answer
Assuming the filenames do not contain whitespace, would you please try:
#!/bin/bash

declare -A hash                         # associative array to tie basename with files
for f in *fastq.gz; do                  # search the files with the suffix
    base=${f%_*}                        # remove after "_"
    if [[ -z ${hash[$base]} ]]; then    # if the variable is not defined
        hash[$base]=$f                  # then store the filename
    else
        hash[$base]+=" $f"              # else append the filename delimited by the whitespace
    fi
done

for base in "${!hash[@]}"; do           # loop over the hash keys (basename)
    read -r f1 f2 <<< "${hash[$base]}"  # split into filenames
    echo java -jar /home/aa/git/trimmomatic/src/Trimmomatic/trimmomatic-0.39.jar PE -phred33 \
        "$f1" "$f2" \
        "$base"_forward_paired.fq.gz "$base"_forward_unpaired.fq.gz \
        "$base"_reverse_paired.fq.gz "$base"_reverse_unpaired.fq.gz \
        SLIDINGWINDOW:4:20 MINLEN:20
done
The script prints the java command lines as a dry run. If the output looks good, drop echo and run it again.
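If you would rather not collect the names in an associative array, a possible alternative is to loop over the *_1.fastq.gz files only and derive the mate and the shared name with parameter expansion. This is only a minimal sketch, assuming every pair follows the NAME_1.fastq.gz / NAME_2.fastq.gz pattern from the question. The function name print_pair_commands and the variable TRIMMOMATIC_JAR are hypothetical names introduced for illustration. Like the script above, it only prints the commands as a dry run.

```bash
#!/bin/bash
# Sketch: print one Trimmomatic command per read pair found in a directory.
# print_pair_commands and TRIMMOMATIC_JAR are hypothetical names for this example.

TRIMMOMATIC_JAR=${TRIMMOMATIC_JAR:-/home/aa/git/trimmomatic/src/Trimmomatic/trimmomatic-0.39.jar}

print_pair_commands() {
    local dir=${1:-.} f1 f2 base
    for f1 in "$dir"/*_1.fastq.gz; do       # visit only the first file of each pair
        [[ -e $f1 ]] || continue            # the glob matched nothing
        f2=${f1%_1.fastq.gz}_2.fastq.gz     # expected name of the second file
        if [[ ! -e $f2 ]]; then
            echo "warning: no mate found for $f1, skipping" >&2
            continue
        fi
        base=${f1##*/}                      # strip the directory part
        base=${base%_1.fastq.gz}            # strip the suffix, e.g. AA2234
        echo java -jar "$TRIMMOMATIC_JAR" PE -phred33 "$f1" "$f2" \
            "${base}_forward_paired.fq.gz" "${base}_forward_unpaired.fq.gz" \
            "${base}_reverse_paired.fq.gz" "${base}_reverse_unpaired.fq.gz" \
            SLIDINGWINDOW:4:20 MINLEN:20
    done
}

print_pair_commands "${1:-.}"               # default to the current directory
```

Because each pair is visited exactly once through its _1 file, there is no need to count the files or to remember which pairs were already processed. As before, drop echo once the printed commands look right.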