Running on HPC
This guide is based on the example described in PyTorch MNIST.
Parallel optimization using arrays
For simplicity, we will only use Slurm in the examples, but the same applies to PBS-based systems with the argument -t.
Oríon synchronises workers transparently based on the experiment name. Thanks to this, there is no master to set up and we can focus solely on submitting the workers. Also, since all synchronisation is done through the database, no special setup is required to connect workers together. A minimal Slurm script to launch 10 workers thus only requires the following 2 lines.
#SBATCH --array=1-10
orion hunt -n parallel-exp python main.py --lr~'loguniform(1e-5, 1.0)'
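Saved to a file with a shebang, these two lines form a complete job script. The sketch below writes such a script; the filename submit.sh is arbitrary and used here only for illustration.

```shell
# Write the minimal job script; the filename submit.sh is just an example.
cat > submit.sh <<'EOF'
#!/bin/bash
#SBATCH --array=1-10
orion hunt -n parallel-exp python main.py --lr~'loguniform(1e-5, 1.0)'
EOF
```

The script is then submitted once with sbatch submit.sh, and Slurm launches the 10 array tasks, i.e. the 10 workers.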
All workers optimize the experiment parallel-exp in parallel, each holding a copy of the optimization algorithm. Adding Slurm options to execute the MNIST example with proper resources gives the following
#SBATCH --array=1-10
#SBATCH --cpus-per-task=2
#SBATCH --output=/path/to/some/log/parallel-exp.%A.%a.out
#SBATCH --error=/path/to/some/log/parallel-exp.%A.%a.err
#SBATCH --gres=gpu:1
#SBATCH --job-name=parallel-exp
#SBATCH --mem=10GB
#SBATCH --time=2:59:00
orion hunt -n parallel-exp --worker-trials 1 python main.py --lr~'loguniform(1e-5, 1.0)'
For now, Oríon does not detect lost trials when a worker gets killed due to a timeout. Such a trial would be marked as pending in the DB indefinitely and thus could not be executed again unless its state were fixed manually. To avoid this, set the timeout large enough for a single trial and use the argument --worker-trials 1 to make each worker execute only one trial and then quit. If you have a large number of tasks to execute but do not want as many simultaneous workers, you can limit the number of simultaneous jobs with the character % (ex: #SBATCH --array=1-100%10).
#SBATCH --array=1-100%10
#SBATCH --cpus-per-task=2
#SBATCH --output=/path/to/some/log/parallel-exp.%A.%a.out
#SBATCH --error=/path/to/some/log/parallel-exp.%A.%a.err
#SBATCH --gres=gpu:1
#SBATCH --job-name=parallel-exp
#SBATCH --mem=10GB
#SBATCH --time=2:59:00
orion hunt -n parallel-exp --worker-trials 1 python main.py --lr~'loguniform(1e-5, 1.0)'
SSH tunnels
Some HPC infrastructures do not provide internet access from the compute nodes. To access the database from the compute nodes, it is necessary to open SSH tunnels to a gateway (typically a login node). The SSH tunnel redirects traffic through a different address and port, so the database configuration needs to be modified accordingly. Suppose our config was the following without an SSH tunnel ($HOME/.config/orion.core/orion_config.yaml)
database:
  type: 'mongodb'
  name: 'db_name'
  host: 'mongodb://user:pass@<db address>:27017'
Using port 42883, the config would now look like this
database:
  type: 'mongodb'
  name: 'db_name'
  host: 'mongodb://user:pass@localhost'
  port: '42883'
Note that the port number was removed from host because it would take precedence over port. Also, the host address is changed to localhost, because the traffic is sent to localhost:42883 and then forwarded to <db address>:27017 at the other end of the SSH tunnel.
Now, to open the ssh tunnel from the compute node, use this command
ssh -o StrictHostKeyChecking=no <gateway address> -L 42883:<db address>:27017 -n -N -f
Where <gateway address> is the hostname of the gateway (login node) that you want to connect to. The option -L forwards the local port 42883 to <db address>:27017 through the gateway, while -N -f keep the tunnel open in the background without running a remote command (-n redirects stdin from /dev/null).
This works for a single job, but it is likely to cause trouble if many jobs end up on the same compute node. The first job would open the SSH tunnel, and the following ones would fail to do so because the port would no longer be available. They could still all use the first job's tunnel; however, when the first job ended, the tunnel would close with it and the remaining jobs would lose access to the DB. To get around this problem, we instead pick an available port at random, so that two jobs running on the same node use different ports. Here's how
export ORION_DB_PORT=$(python -c "from socket import socket; s = socket(); s.bind((\"\", 0)); print(s.getsockname()[1])")
ssh -o StrictHostKeyChecking=no <gateway address> -L $ORION_DB_PORT:<db address>:27017 -n -N -f
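The Python one-liner works because binding a socket to port 0 makes the OS assign any currently free ephemeral port. Expanded, and assuming python3 is on the PATH of the compute node (an assumption of this sketch; adjust the interpreter name as needed), the port selection step looks like this:

```shell
# Ask the OS for a free port: binding to port 0 lets the kernel pick one.
# Assumes python3 is available on the compute node.
export ORION_DB_PORT=$(python3 -c "from socket import socket; s = socket(); s.bind((\"\", 0)); print(s.getsockname()[1])")
echo "forwarding through local port $ORION_DB_PORT"
```

Note that the port is released before ssh binds it again, so a collision between two jobs is possible in principle, but unlikely in practice.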
These lines can then be added to the script to submit workers in parallel.
#SBATCH --array=1-100%10
#SBATCH --cpus-per-task=2
#SBATCH --output=/path/to/some/log/parallel-exp.%A.%a.out
#SBATCH --error=/path/to/some/log/parallel-exp.%A.%a.err
#SBATCH --gres=gpu:1
#SBATCH --job-name=parallel-exp
#SBATCH --mem=10GB
#SBATCH --time=2:59:00
export ORION_DB_PORT=$(python -c "from socket import socket; s = socket(); s.bind((\"\", 0)); print(s.getsockname()[1])")
ssh -o StrictHostKeyChecking=no <gateway address> -L $ORION_DB_PORT:<db address>:27017 -n -N -f
orion hunt -n parallel-exp --worker-trials 1 python main.py --lr~'loguniform(1e-5, 1.0)'
Notes for MongoDB
You may experience problems with MongoDB if you are using an encrypted connection with SSL and if you are using replica sets (both of which are highly recommended for security and high availability).
SSL
You will need to set the option ssl_match_hostname=false in your URI to bypass the SSL hostname check. This is because the address used with the tunnel is localhost, which will not be recognised by your SSL certificate. From pymongo's documentation
Think very carefully before setting this to False as that could make your application vulnerable to man-in-the-middle attacks
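Combined with the tunnel config from the previous section, the host URI might look like the following sketch. The ssl=true option is included here only to show where ssl_match_hostname=false fits; exact option names vary across pymongo versions (newer releases spell these tls=true and tlsAllowInvalidHostnames=true), so check the documentation of the version you run.

```yaml
database:
  type: 'mongodb'
  name: 'db_name'
  host: 'mongodb://user:pass@localhost/?ssl=true&ssl_match_hostname=false'
  port: '42883'
```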
Replica Sets
So far, we know of no simple method for using replica sets with SSH tunnels, and therefore cannot recommend anything better than not setting up replica sets on your MongoDB servers if you need SSH tunnels. When dealing with replica sets, the local process tries to open a direct connection to each secondary server, and these servers are normally on different hosts. These connections, which point to different addresses, cannot pass through the SSH tunnel that was opened for the address of the primary MongoDB server.