
Open MPI error when setting up a cluster with more than 3 hosts

We cannot run a program in an Open MPI cluster with more than 3 machines.

If we run:

mpirun --host master,slave5,slave3 ./cluster

it works.

If we run:

mpirun --host master,slave4,slave3,slave5 ./cluster 

We get the following error:

ssh: Could not resolve hostname slave5: Temporary failure in name resolution

Although it looks like a name resolution error, it is not: slave5 resolves fine in the first command.

We’ve seen other people report the same error, with no solution so far.

Any ideas?


Answer

The issue is likely because Open MPI defaults to a tree-based spawn, meaning that it ssh’s from node A to node B, and then ssh’s from node B to node C. See https://blogs.cisco.com/performance/tree-based-launch-in-open-mpi and https://blogs.cisco.com/performance/tree-based-launch-in-open-mpi-part-2 for more details.

Hence, if you disable the tree-based spawn, which causes all ssh invocations to originate from node A, your launch will work as expected:

mpirun --mca plm_rsh_no_tree_spawn 1 --host master,slave4,slave3,slave5 ./cluster

However, the better solution is to make all your cluster machine names resolvable from all machines. E.g., when a loop like the following runs successfully from any machine, Open MPI’s launch should also succeed:

foreach node (Node1 Node2 Node3 Node4 ...)
    foreach other (Node1 Node2 Node3 Node4 ...)
        echo from $node to $other
        ssh $node ssh $other hostname
    end
end
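The same all-pairs check can be sketched in bash, for those not using csh. The hostnames below are taken from the question and are placeholders for your own cluster; the leading echo makes it a dry run (remove it to actually ssh), and unlike the csh version above, it skips each node's connection to itself:

```shell
# All-pairs ssh reachability check (bash sketch).
# Replace HOSTS with your actual cluster hostnames.
HOSTS="master slave3 slave4 slave5"
for node in $HOSTS; do
    for other in $HOSTS; do
        [ "$node" = "$other" ] && continue   # skip self-to-self
        # Dry run: prints the command instead of running it.
        echo ssh "$node" ssh "$other" hostname
    done
done
```

If every printed command also works when run for real (i.e., without the echo), every node can both reach and resolve every other node, which is what Open MPI's tree-based launch requires.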
User contributions licensed under: CC BY-SA