How to feed a large number of samples in parallel on Linux?

I’m trying to run the following command on a large number of samples.

java -jar GenomeAnalysisTK.jar \
     -R   scaffs_HAPSgracilaria92_50REF.fasta \
     -T   HaplotypeCaller \
     -I   assembled_reads/{sample_name}.sorted.bam \
     --emitRefConfidence GVCF \
     -ploidy 1 \
     -nt  {number of cores} \
     -nct {number of threads} \
     -o   {sample_name}.raw.snps.indels.g.vcf

I have:

3312 cores,
  20 PB of RAM,
 110 TFLOPS of compute power

but I have thousands of these samples to process.

Each sample takes about a day or two to finish on my local computer.

I’m using a shared Linux cluster with a job scheduling system called Slurm, if that helps.


Answer

Write a submission script such as the following and submit it with the sbatch command.

#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=<nb of threads your Java application is able to use>
#SBATCH --mem=<number of MB of RAM your job needs>
#SBATCH --time=<duration of your job>
#SBATCH --array=1-<number of samples>

# All input BAM files, collected in a (0-indexed) bash array
FILES=(assembled_reads/*.sorted.bam)

# Array task IDs start at 1, bash array indices at 0, hence the -1
INFILE=${FILES[$SLURM_ARRAY_TASK_ID - 1]}
OUTFILE=$(basename "$INFILE" .sorted.bam).raw.snps.indels.g.vcf

srun java -jar GenomeAnalysisTK.jar -R scaffs_HAPSgracilaria92_50REF.fasta -T HaplotypeCaller \
     -I "$INFILE" --emitRefConfidence GVCF -ploidy 1 -nt 1 -nct "$SLURM_CPUS_PER_TASK" -o "$OUTFILE"

This is totally untested, and only aims at giving you a first direction.
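
As a rough follow-up sketch (the script name gatk_array.sh, the sample count of 2000 and the throttle of 50 are made-up placeholders, not recommendations): the %N suffix of --array caps how many array tasks run at the same time, which matters when you submit thousands of samples, and squeue/sacct are the usual Slurm commands for watching them.

# Submit the array, keeping at most 50 tasks running at once
# (a command-line --array overrides the one in the script header)
sbatch --array=1-2000%50 gatk_array.sh

# Watch your jobs in the queue
squeue -u "$USER"

# After tasks finish, check their state, run time and memory use
sacct -j <jobid> --format=JobID,State,Elapsed,MaxRSS

Running a single sample first and checking its sacct output is a reasonable way to pick realistic values for the --mem and --time placeholders in the script header.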

I am sure the administrators of the cluster you use have written some documentation; the first step would be to read it cover to cover.

User contributions licensed under: CC BY-SA