Hello everyone, I'm trying to set up a new HPC cluster. I created an account, added users, and I'm using a partition, but whenever I run a job it fails with the error that the requested node configuration is not available. I checked my slurm.conf but it looks fine to me, so I need some help.
The error:

Batch job submission failed: Requested node configuration is not available
This is my slurm.conf:

#
# See the slurm.conf man page for more information.
#
SlurmUser=slurm
#SlurmdUser=root
SlurmctldPort=6817
SlurmdPort=6818
AuthType=auth/munge
#JobCredentialPrivateKey=
#JobCredentialPublicCertificate=
SlurmdSpoolDir=/cm/local/apps/slurm/var/spool
SwitchType=switch/none
MpiDefault=none
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid
#ProctrackType=proctrack/pgid
ProctrackType=proctrack/cgroup
#PluginDir=
#FirstJobId=
ReturnToService=2
#MaxJobCount=
#PlugStackConfig=
#PropagatePrioProcess=
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#SrunProlog=
#SrunEpilog=
#TaskProlog=
#TaskEpilog=
TaskPlugin=task/cgroup
#TrackWCKey=no
#TreeWidth=50
#TmpFs=
#UsePAM=
#
# TIMERS
SlurmctldTimeout=300
SlurmdTimeout=300
InactiveLimit=0
MinJobAge=300
KillWait=30
Waittime=0
#
# SCHEDULING
#SchedulerAuth=
#SchedulerPort=
#SchedulerRootFilter=
#PriorityType=priority/multifactor
#PriorityDecayHalfLife=14-0
#PriorityUsageResetPeriod=14-0
#PriorityWeightFairshare=100000
#PriorityWeightAge=1000
#PriorityWeightPartition=10000
#PriorityWeightJobSize=1000
#PriorityMaxAge=1-0
#
# LOGGING
SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurmctld
SlurmdDebug=3
SlurmdLogFile=/var/log/slurmd
#JobCompType=jobcomp/filetxt
#JobCompLoc=/cm/local/apps/slurm/var/spool/job_comp.log
#
# ACCOUNTING
JobAcctGatherType=jobacct_gather/linux
#JobAcctGatherType=jobacct_gather/cgroup
#JobAcctGatherFrequency=30
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageUser=slurm
# AccountingStorageLoc=slurm_acct_db
# AccountingStoragePass=SLURMDBD_USERPASS
# This section of this file was automatically generated by cmd. Do not edit manually!
# BEGIN AUTOGENERATED SECTION -- DO NOT REMOVE
# Server nodes
SlurmctldHost=omics-master
AccountingStorageHost=master
# Nodes
NodeName=omics[01-05] Procs=48 Feature=local
# Partitions
PartitionName=defq Default=YES MinNodes=1 DefaultTime=UNLIMITED MaxTime=UNLIMITED AllowGroups=ALL PriorityJobFactor=1 PriorityTier=1 OverSubscribe=NO PreemptMode=OFF AllowAccounts=ALL AllowQos=ALL Nodes=omics[01-05]
ClusterName=omics
# Scheduler
SchedulerType=sched/backfill
# Statesave
StateSaveLocation=/cm/shared/apps/slurm/var/cm/statesave/omics
PrologFlags=Alloc
# Generic resources types
GresTypes=gpu
# Epilog/Prolog section
Prolog=/cm/local/apps/cmd/scripts/prolog
Epilog=/cm/local/apps/cmd/scripts/epilog
# Power saving section (disabled)
# END AUTOGENERATED SECTION -- DO NOT REMOVE
And this is my sinfo output:
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
defq*        up   infinite      5   idle omics[01-05]
And this is my test script:
#!/bin/bash
#SBATCH --nodes=2              # Number of nodes
#SBATCH --ntasks-per-node=4
#SBATCH --ntasks-per-socket=2
#SBATCH --output=std.out
#SBATCH --error=std.err
#SBATCH --mem-per-cpu=1gb

echo "hello from:"
hostname; pwd; date;
sleep 10
echo "going to sleep for 10 seconds"
echo "wake up, exiting"
Thanks in advance.
Answer
In the node definition, you do not specify RealMemory,
so Slurm assumes the default of 1 MB (!) per node. Your script requests --mem-per-cpu=1gb with four tasks per node, i.e. 4 GB per node, which can never be satisfied by a node that Slurm believes has only 1 MB, hence the "Requested node configuration is not available" error.
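As a quick sanity check (using one of the node names from your configuration; adjust as needed), you can ask Slurm how much memory it currently credits a node with:

$ scontrol show node omics01 | grep -i RealMemory

If that reports RealMemory=1, the nodes are indeed advertising the 1 MB default.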
You should run slurmd -C
on the compute nodes; it prints the line to insert in the slurm.conf
file so that Slurm correctly knows the hardware resources it can allocate, for example:
$ slurmd -C | head -1
NodeName=node002 CPUs=16 Boards=1 SocketsPerBoard=2 CoresPerSocket=8 ThreadsPerCore=1 RealMemory=128547
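In your case the node definition would then look something like the line below; the RealMemory value here is only a placeholder, use the number that slurmd -C reports on your own omics nodes:

NodeName=omics[01-05] Procs=48 RealMemory=191000 Feature=local

Since that line lives in the cmd-managed autogenerated block of your slurm.conf, you may need to make the change through your cluster manager rather than by hand. After slurm.conf is updated on the controller and compute nodes, reload the configuration with:

$ scontrol reconfigure

and restart slurmctld/slurmd if the new memory value does not show up in scontrol show node.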