I’m trying to run the following command on a large number of samples.
java -jar GenomeAnalysisTK.jar \
    -R scaffs_HAPSgracilaria92_50REF.fasta \
    -T HaplotypeCaller \
    -I assembled_reads/{sample_name}.sorted.bam \
    --emitRefConfidence GVCF \
    -ploidy 1 \
    -nt {number of cores} \
    -nct {number of threads} \
    -o {sample_name}.raw.snps.indels.g.vcf
I have:
3312 cores,
20 PB of RAM,
110 TFLOPS of compute power
but I have thousands of these samples to process.
Each sample takes about a day or two to finish on my local computer.
I’m using a shared Linux cluster with a job scheduling system called Slurm, if that helps.
Answer
Write a submission script such as the following and submit it with the sbatch command.
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=<nb of threads your Java application is able to use>
#SBATCH --mem=<number of MB of RAM your job needs>
#SBATCH --time=<duration of your job>
#SBATCH --array=0-<number of samples minus 1>

# Each array task processes one BAM file; bash arrays are 0-indexed, hence the 0-based range above.
FILES=(assembled_reads/*.sorted.bam)
INFILE=${FILES[$SLURM_ARRAY_TASK_ID]}
OUTFILE=$(basename "$INFILE" .sorted.bam).raw.snps.indels.g.vcf

srun java -jar GenomeAnalysisTK.jar -R scaffs_HAPSgracilaria92_50REF.fasta -T HaplotypeCaller \
    -I "$INFILE" --emitRefConfidence GVCF -ploidy 1 \
    -nt 1 -nct "$SLURM_CPUS_PER_TASK" -o "$OUTFILE"
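Submission could then look something like this. This is only a minimal sketch of my own; the script name gatk_array.sbatch and the idea of computing the array size from the number of BAM files are assumptions, not part of the script above. Note that an --array option given on the sbatch command line overrides the #SBATCH --array directive in the script.

# Count the BAM files and submit one array task per sample (hypothetical script name).
NSAMPLES=$(ls assembled_reads/*.sorted.bam | wc -l)
sbatch --array=0-$((NSAMPLES - 1)) gatk_array.sbatch

# Check the state of the array tasks.
squeue -u $USER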
This is totally untested and only aims to give you a first direction.
I am sure the administrators of the cluster you use have written some documentation; the first step would be to read it cover to cover.
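To fill in the --mem and --time placeholders, one approach (my suggestion, not part of the original answer) is to run a single sample first and then inspect what it actually used with sacct:

# Show state, elapsed time and peak memory of a finished test job (replace <jobid>).
sacct -j <jobid> --format=JobID,State,Elapsed,MaxRSS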