I’m trying to run the following command on a large number of samples.
java -jar GenomeAnalysisTK.jar -R scaffs_HAPSgracilaria92_50REF.fasta -T HaplotypeCaller -I assembled_reads/{sample_name}.sorted.bam --emitRefConfidence GVCF -ploidy 1 -nt {number of cores} -nct {number of threads} -o {sample_name}.raw.snps.indels.g.vcf
I have:
3312 cores, 20 PB of RAM, and 110 TFLOPS of compute power,
but I have thousands of these samples to process.
Each sample takes about a day or two to finish on my local computer.
I’m using a shared Linux cluster with a job scheduling system called Slurm, if that helps.
Answer
Write a submission script such as the following and submit it with the sbatch command.
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=<nb of threads your Java application is able to use>
#SBATCH --mem=<number of MB of RAM your job needs>
#SBATCH --time=<duration of your job>
#SBATCH --array=0-<number of samples minus 1>

# Each array task picks one BAM from the glob by its task index.
# Bash arrays are zero-based, hence the array range starting at 0 above.
FILES=(assembled_reads/*.sorted.bam)
INFILE=${FILES[$SLURM_ARRAY_TASK_ID]}
OUTFILE=$(basename "$INFILE" .sorted.bam).raw.snps.indels.g.vcf

# -nct lets the HaplotypeCaller use the CPUs reserved for this task.
srun java -jar GenomeAnalysisTK.jar -R scaffs_HAPSgracilaria92_50REF.fasta -T HaplotypeCaller -I "$INFILE" --emitRefConfidence GVCF -ploidy 1 -nt 1 -nct $SLURM_CPUS_PER_TASK -o "$OUTFILE"
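Once saved (the file name gatk_array.sbatch below is just an example, use whatever you like), a single sbatch call submits one array task per sample, and the standard Slurm tools let you monitor them:

sbatch gatk_array.sbatch        # one submission covers every sample in the array
squeue -u $USER                 # list your pending and running array tasks
sacct -j <jobid> --format=JobID,State,Elapsed,MaxRSS    # per-task accounting once tasks finish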
This is totally untested and only aims to give you a first direction.
I am sure the administrators of the cluster you use have written some documentation; the first step would be to read it cover to cover.