How to extract strings with awk from multiple text files and summarize them in one file

I have 70 input files with names like slurm-22801576.out, slurm-22801573.out, slurm-26801571.out, and so on. I want to extract the desired strings from all of them into one file. I tried the following, but it only works for a single file. How can I do this for multiple files?

awk 'BEGIN{printf "file,reads,file,sample\n"}NR==32{printf "%s,%s,",FILENAME,$3}NR==2{printf "%s,%s,",FILENAME,$18}' slurm-22801576.out > summary/total_reads.csv


But my output file has only one row

file,reads,file,sample
slurm-22801576.out,outm=2006_42_aligned.sam,slurm-22801576.out,480789160,


Inside each input file, the text looks like this:

job starting at 23:42:13
java -ea -Xmx57039m -Xms57039m -cp /sw/bioinfo/bbmap/38.61b/rackham/current/ align2.BBMap build=1 overwrite=true fastareadlen=500 pairedonly=t ambiguous=toss secondary=t killbadpairs=t perfectmode=t minid=1 mappedonly=t outm=2006_40_aligned.sam scafstats=2006_40_fulllength.scafstats in=/crex/proj/datasets/human_depleted/Ki-2006-40-226_unmapped_R1.fq.gz in2=/crex/proj/datasets/human_depleted/Ki-2006-40-226_unmapped_R2.fq.gz threads=auto
Executing align2.BBMap [build=1, overwrite=true, fastareadlen=500, pairedonly=t, ambiguous=toss, secondary=t, killbadpairs=t, perfectmode=t, minid=1, mappedonly=t, outm=2006_40_aligned.sam, scafstats=2006_40_fulllength.scafstats, in=/crex/proj/datasets/human_depleted/Ki-2006-40-226_unmapped_R1.fq.gz, in2=/crex/proj/datasets/human_depleted/Ki-2006-40-226_unmapped_R2.fq.gz, threads=auto]
Version 38.61

Set OUTPUT_MAPPED_ONLY to true
Scaffold statistics will be written to 2006_40_fulllength.scafstats
Set threads to 10
Ambiguously mapped reads will be considered unmapped.
Set MINIMUM_ALIGNMENT_SCORE_RATIO to 1.000
Set genome to 1

Loaded Reference:       2.275 seconds.
Loading index for chunk 1-1, build 1
Generated Index:        4.174 seconds.
Analyzed Index:         3.394 seconds.
Started output stream:  0.264 seconds.
Creating scaffold statistics table:     0.064 seconds.
Cleared Memory:         1.216 seconds.
Processing reads in paired-ended mode.
Started read stream.
Started 10 mapping threads.
Detecting finished threads: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9

   ------------------   Results   ------------------

Genome:                 1
Key Length:             13
Max Indel:              0
Minimum Score Ratio:    1.0
Mapping Mode:           perfect
Reads Used:             461344228       (57756075474 bases)

Mapping:                595.769 seconds.
Reads/sec:              774367.86
kBases/sec:             96943.77


The expected output file should look like this:

file,reads,file,sample
slurm-22801576.out,outm=2006_42_aligned.sam,slurm-22801576.out,480789160,
slurm-22801576.out,outm=2006_42_aligned.sam,slurm-22801576.out,480789160,
slurm-22801576.out,outm=2006_42_aligned.sam,slurm-22801576.out,480789160,
slurm-22801576.out,outm=2006_42_aligned.sam,slurm-22801576.out,480789160,


Answer

You may use this awk:

awk '
   BEGIN{print "file,reads,file,sample"}
   FNR==2 {printf "%s,%s,", FILENAME, $18}
   FNR==32 {printf "%s,%s,\n", FILENAME, $3}
' slurm-*.out > summary/total_reads.csv
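
The switch from NR to FNR is what makes this work across several files: NR counts records over all input combined, while FNR restarts at 1 for each file. A minimal sketch of the difference, assuming two made-up three-line files in a temporary directory (the file names and contents are hypothetical, not taken from the slurm logs):

```shell
# Hypothetical demo files in a throwaway directory; names and contents are made up.
demo_dir=$(mktemp -d)
printf 'a1\na2\na3\n' > "$demo_dir/one.out"
printf 'b1\nb2\nb3\n' > "$demo_dir/two.out"

# NR keeps counting across files, so NR==2 is true once in total.
nr_count=$(awk 'NR==2 {n++} END {print n+0}' "$demo_dir/one.out" "$demo_dir/two.out")

# FNR restarts at 1 for every file, so FNR==2 is true once per file.
fnr_count=$(awk 'FNR==2 {n++} END {print n+0}' "$demo_dir/one.out" "$demo_dir/two.out")

echo "NR==2 matched $nr_count line(s); FNR==2 matched $fnr_count line(s)"
rm -r "$demo_dir"
```

With two files, the NR condition fires once and the FNR condition fires twice, which is why the FNR version yields one row per input file.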


Answer