Deep dive tutorial
With your VPN activated and your development environment set up, let's launch our first "Hello World" experiment. To follow this guide, ensure you have accessed your development environment as described in Section 2, which will mean your home directory (`~`) is at `/root`.
1. Create a virtual environment and install dependencies
The first step when commencing a new project on the ISC is to create and activate a virtual environment as follows.
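A minimal sketch of this step is shown below. The environment location `~/.venv` is an illustrative choice, not mandated by the ISC.

```bash
# Create a virtual environment in your home directory (/root) and activate it
python3 -m venv ~/.venv
source ~/.venv/bin/activate
```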
By default your virtual environment is created in `/root`, which is a volume mounted in read/write mode to your development container. Storing your virtual environment in an externally mounted volume rather than inside your container means that your container will start and your experiments will launch as fast as possible.
Install PyTorch to your virtual environment with the following.
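For example, with the virtual environment created above still active (see pytorch.org if you need a build matched to a specific CUDA version):

```bash
# Install PyTorch into the active virtual environment
pip install torch
```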
We won't need PyTorch straight away for this Hello World example, but we'll go through the motions you'll repeat each time you start a new project.
2. Tutorial experiment
With this experiment we will demonstrate how the ISC works at a basic level. It will help to understand that the ISC cycling cluster comprises 12 compute nodes, each with 6 GPUs, for a total of 72 GPUs.
In this first experiment we will ask each machine to print out some of the environment variables set on the nodes when launched by the ISC.
2.1 Develop experiment files
Create a new project directory called `tutorial` and navigate to it.
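For example, from your home directory:

```bash
mkdir ~/tutorial
cd ~/tutorial
```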
Inside your `tutorial` directory create the following file called `tutorial.py`. This will be our main project file describing the work that our project is intended to do.
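The original file contents are not reproduced here; the following is a minimal sketch consistent with the output described later in this tutorial. It reads the `RANK`, `MASTER_ADDR`, `MASTER_PORT`, and `OUTPUT_PATH` environment variables set by the ISC on each node, prints them, and writes a per-rank output file (the `rank_N.txt` naming is an assumption for illustration).

```python
import os

# Environment variables set on each node by the ISC at launch
rank = os.environ["RANK"]
master_addr = os.environ["MASTER_ADDR"]
master_port = os.environ["MASTER_PORT"]
output_path = os.environ["OUTPUT_PATH"]

# Compose a short report of this node's view of the cluster
report = (
    f"Rank: {rank}\n"
    f"Master address: {master_addr}\n"
    f"Master port: {master_port}\n"
    f"Output path: {output_path}\n"
)

# Print the report and also write it to a per-rank file in the output directory
print(report)
with open(os.path.join(output_path, f"rank_{rank}.txt"), "w") as file:
    file.write(report)
```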
Experiments are launched on the ISC using a TOML format file which communicates important details of your experiment to the ISC. This file can be named anything you like; by convention we suggest using the file extension `.isc` to distinguish it from other files.
Inside your `tutorial` directory create the following file called `tutorial.isc`.
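A minimal sketch using the fields described in the list below; all values are placeholders, and the 18-GPU request matches the experiment discussed later in this tutorial.

```toml
isc_project_id = "<your-project-id>"   # placeholder: copy from the Projects page in Control Plane
experiment_name = "tutorial"
gpus = 18
output_path = "/root/outputs/tutorial"
compute_mode = "cycle"
command = "source /root/.venv/bin/activate && python tutorial.py"
```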
- `isc_project_id` is required and can be obtained from the Projects page in Control Plane.
- `experiment_name` is required and can be any string you like.
- `gpus` is a required field that must be an integer between 1 and 72 inclusive and describes the number of GPUs you want to use for your experiment.
- `output_path` is required and must describe a path to a directory. The ISC will create a directory inside the path provided named with the experiment ID. The full path to that output directory will then be set as the `$OUTPUT_PATH` environment variable, which you can pass in as a command line argument or access from within your training script (as above).
- `compute_mode` is optional and must be either "cycle" (default) or "interruptible". See below for more information about these compute modes.
- `command` is required and describes the operation(s) that each node will execute when started, typically including, as a minimum, sourcing your virtual environment and launching a training script.
2.2 Compute mode cycle vs interruptible
Compute mode cycle means that your experiment will interrupt any interruptible experiment that is currently running on the ISC, run for 90 seconds and then terminate. If your experiment does not return an error within that time your experiment status will show "completed". If your experiment returns an error within that time your experiment status will show "failed". The purpose of this mode is to provide immediate feedback on the viability of your code.
Experiments with compute mode interruptible launch into a queue which behaves as follows.
- Every 2 hours the ISC will list the active interruptible experiments and apportion the next 2-hour period in contiguous blocks to each.
- The interruptible experiments will then run in order for their apportioned time unless interrupted by a cycle experiment.
- If interrupted, the interruptible experiment will wait until there are no further cycle experiments enqueued before resuming and completing its apportioned time.
- Interruptible experiments will cycle in this manner indefinitely until the experiment completes or an error is encountered.
2.3 Launch and track experiment
Run the following commands to launch your experiment.
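Assuming the `isc` CLI exposes a `train` subcommand that accepts the path to your `.isc` file (the subcommand name here is an assumption for illustration), the launch would look something like:

```bash
cd ~/tutorial
isc train tutorial.isc
```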
Run the following command to track the progress of your experiment.
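This is the `isc experiments` CLI call referred to below:

```bash
isc experiments
```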
In your terminal you will see the following report.
Because we launched our experiment in compute mode "cycle", our experiment will run in a matter of seconds. Refreshing this report by repeatedly running `isc experiments`, you will see the `Status` of the experiment change from `enqueued` to `running` and then `completed`.
Navigate to the directory reported under Output Path and list the contents of that directory with the following commands.
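For example, substituting the Output Path reported for your experiment:

```bash
cd <output-path-from-report>
ls
```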
You will see that 3 of the 12 nodes have generated an output file as follows.
Inspect the contents of these files and you will find the output of the print statement in our `tutorial.py` file (synthesised from each file as one output below).
Each node reports its unique rank and the common Master address, Master port, and Output path.
2.4 What about my 18 GPUs?
We asked for 18 GPUs and we got a report from 3 nodes. Because your experiment ran on the ISC cycling cluster with 6 GPUs on each node, only 3 nodes were necessary to fulfil your 18-GPU request. We just didn't do much with those GPUs this time around.
3. Torchrun experiment
Next we will demonstrate launching an experiment using torchrun. Inside your `tutorial` directory create the following file called `torchrun.py`. Note that we are now also accessing the `LOCAL_RANK` and `WORLD_SIZE` environment variables that will be set by torchrun.
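A minimal sketch of what `torchrun.py` might contain, extending the earlier `tutorial.py` sketch with the `LOCAL_RANK` and `WORLD_SIZE` variables (the per-rank output file naming is an assumption carried over from that sketch):

```python
import os

# Environment variables set for each process by torchrun (and inherited from the ISC)
rank = os.environ["RANK"]              # global index of this process across the cluster
local_rank = os.environ["LOCAL_RANK"]  # index of this process on the local node
world_size = os.environ["WORLD_SIZE"]  # total number of processes across the cluster
master_addr = os.environ["MASTER_ADDR"]
master_port = os.environ["MASTER_PORT"]
output_path = os.environ["OUTPUT_PATH"]

report = (
    f"Rank: {rank}\n"
    f"Local rank: {local_rank}\n"
    f"World size: {world_size}\n"
    f"Master address: {master_addr}\n"
    f"Master port: {master_port}\n"
    f"Output path: {output_path}\n"
)

# Print the report and write a per-process file to the output directory
print(report)
with open(os.path.join(output_path, f"rank_{rank}.txt"), "w") as file:
    file.write(report)
```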
Inside your `tutorial` directory create the following file called `torchrun.isc`. Note that we are now launching our script using `torchrun`, passing the `$NNODES`, `$N_PROC`, `$MASTER_ADDR`, `$MASTER_PORT`, and `$RANK` environment variables.
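A sketch of what `torchrun.isc` might look like. The non-`command` fields follow the same pattern as `tutorial.isc`, the values are placeholders, and the flag spellings are standard torchrun options rather than taken from the original file:

```toml
isc_project_id = "<your-project-id>"   # placeholder
experiment_name = "torchrun-tutorial"
gpus = 18
output_path = "/root/outputs/torchrun"
compute_mode = "cycle"
command = "source /root/.venv/bin/activate && torchrun --nnodes=$NNODES --nproc_per_node=$N_PROC --node_rank=$RANK --master_addr=$MASTER_ADDR --master_port=$MASTER_PORT torchrun.py"
```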
Note that the ISC determines suitable values to set for `$NNODES` and `$N_PROC` based on the number of `gpus` requested and the number of GPUs per node. In this case `$NNODES` will be set to 3 and `$N_PROC` will be set to 6 on each node, so that torchrun will start 6 processes in parallel on each node, each of which will independently run our `torchrun.py` script.
Run the following commands to launch your experiment.
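Again assuming the same `isc train` subcommand used earlier:

```bash
cd ~/tutorial
isc train torchrun.isc
```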
Inspecting just the contents of the `rank_0.txt` and `rank_1.txt` files generated by this experiment, we find the following.
Note that the `RANK` environment variable has been re-set by torchrun and now indexes processes across the entire cluster, whereas the new `LOCAL_RANK` environment variable indexes processes locally on each node. The `$RANK` environment variable is helpful for coordinating the activity of the entire cluster, for example to ensure each process accesses a unique subset of an overall dataset. The `$LOCAL_RANK` environment variable is helpful for indexing each GPU on the local node.