Deep dive tutorial

With your VPN activated and your development environment set up, let's launch our first "Hello World" experiment. To follow this guide, ensure you have accessed your development environment as described in Section 2, which means your home directory (~) is /root.

1. Create a virtual environment and install dependencies

The first step when commencing a new project on the ISC is to create and activate a virtual environment as follows.

cd ~
python3 -m virtualenv ~/.tutorial
source ~/.tutorial/bin/activate

By default your virtual environment is created in /root, which is a volume mounted in read/write mode to your development container. Storing your virtual environment in an externally mounted volume rather than inside your container means that your container will start and your experiments will launch as fast as possible.

Install PyTorch into your virtual environment with the following.

pip install torch

We won't need PyTorch straight away for this Hello World example, but we'll go through the motions you'll repeat each time you start a new project.

2. Tutorial experiment

With this experiment we will demonstrate how the ISC works at a basic level. It will help to understand that the ISC cycling cluster comprises 12 compute nodes, each with 6 GPUs, for a total of 72 GPUs.

In this first experiment we will ask each machine to print out some of the environment variables set on the nodes when launched by the ISC.

2.1 Develop experiment files

Create a new project directory called tutorial and navigate to it as follows.
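
mkdir ~/tutorial
cd ~/tutorial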

Inside your tutorial directory create the following file called tutorial.py. This will be our main project file describing the work that our project is intended to do.

import os

def main():
    # Useful for node communication
    master_address = os.environ["MASTER_ADDR"]
    master_port = os.environ["MASTER_PORT"]

    # Useful for training coordination
    rank = os.environ["RANK"]

    # Useful for saving experiment artefacts
    output_path = os.environ["OUTPUT_PATH"]

    print(f"Master address: {master_address}, Master port: {master_port}, Rank: {rank}, Output path: {output_path}")

if __name__ == "__main__":
    main()

Experiments are launched on the ISC using a TOML-format file which communicates the important details of your experiment to the ISC. This file can be named anything you like; by convention we suggest using the file extension .isc to distinguish it from other files.

Inside your tutorial directory create the following file called tutorial.isc.

isc_project_id = "<isc-project-id>"
experiment_name = "tutorial"
gpu_type = "24GB VRAM GPU"
gpus = 18
output_path = "~/outputs/tutorial"
compute_mode = "cycle"
command = "source ~/.tutorial/bin/activate && cd ~/tutorial && python tutorial.py"

  • isc_project_id is required and can be obtained from the Projects page in Control Plane.

  • experiment_name is required and can be any string you like.

  • gpus is required and must be an integer between 1 and 72 inclusive, describing the number of GPUs you want to use for your experiment.

  • output_path is required and must describe a path to a directory. The ISC will create a directory inside the path provided, named with the experiment ID. The full path to that output directory is then set as the $OUTPUT_PATH environment variable, which you can pass in as a command line argument or access from within your training script (as above, and see the sketch after this list).

  • compute_mode is optional and must be either "cycle" (default) or "interruptible". See below for more information about these compute modes.

  • command is required and describes the operation(s) that each node will execute when started, typically including, as a minimum, sourcing your virtual environment and launching a training script.
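
For example, your training script can use $OUTPUT_PATH to save artefacts into the experiment's output directory. A minimal sketch (the file name and contents are illustrative):

import os

# $OUTPUT_PATH is set by the ISC to <output_path>/<experiment-id>
output_path = os.environ["OUTPUT_PATH"]

# Write an artefact into the experiment's output directory
with open(os.path.join(output_path, "metrics.txt"), "w") as f:
    f.write("example artefact\n")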

2.2 Compute mode cycle vs interruptible

Compute mode "cycle" means that your experiment will interrupt any interruptible experiment currently running on the ISC, run for 90 seconds, and then terminate. If your experiment does not return an error within that time, its status will show "completed"; if it does return an error, its status will show "failed". The purpose of this mode is to provide immediate feedback on the viability of your code.

Experiments with compute mode interruptible launch into a queue which behaves as follows.

  • Every 2 hours the ISC will list the active interruptible experiments and apportion the next 2-hour period in contiguous blocks to each.

  • The interruptible experiments will then run in order for their apportioned time unless interrupted by a cycle experiment.

  • If interrupted, the interruptible experiment will then wait until there are no further cycle experiments enqueued before resuming and completing its apportioned time.

  • Interruptible experiments will cycle in this manner indefinitely until the experiment completes or an error is encountered.

2.3 Launch and track experiment

Run the following commands to launch your experiment.

cd ~/tutorial
isc train tutorial.isc

Run the following command to track the progress of your experiment.

isc experiments

In your terminal you will see the following report.

                               ISC Experiments                                                                                                                          
┏━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━┓
┃ ID              ┃ Name     ┃ Output Path                            ┃ NNodes ┃ Compute Mode ┃ Status   ┃
┡━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━┩
│ <experiment-id> │ tutorial │ /root/outputs/tutorial/<experiment-id> │ 3      │ cycle        │ enqueued │
└─────────────────┴──────────┴────────────────────────────────────────┴────────┴──────────────┴──────────┘

Because we launched our experiment in compute mode "cycle", it will run in a matter of seconds. Refreshing this report by repeatedly running isc experiments, you will see the Status of the experiment change from enqueued to running and then completed.

Navigate to the directory reported under Output Path and list the contents of that directory with the following commands.

cd /root/outputs/tutorial/<experiment-id>
ls

You will see that 3 of the 12 nodes have generated an output file as follows.

rank_0.txt  rank_1.txt  rank_2.txt

Inspect the contents of these files and you will find the output of the print statement in our tutorial.py file (combined from the three files below).
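
For example:

cat rank_*.txt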

# rank_0.txt:
Master address: <ip-address>, Master port: <port>, Rank: 0, Output path: /root/outputs/tutorial/<experiment-id>

# rank_1.txt:
Master address: <ip-address>, Master port: <port>, Rank: 1, Output path: /root/outputs/tutorial/<experiment-id>

# rank_2.txt:
Master address: <ip-address>, Master port: <port>, Rank: 2, Output path: /root/outputs/tutorial/<experiment-id>

Each node reports its unique rank and common Master address, Master port, and Output path.
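
The unique rank is what lets you divide work between the nodes. As a minimal sketch (not one of the tutorial files), you might have only the rank-0 node perform a one-off task:

import os

rank = int(os.environ["RANK"])  # set by the ISC on each node

if rank == 0:
    # e.g. only the first node prepares shared resources or writes a summary
    print("Rank 0 performing coordination work")
else:
    print(f"Rank {rank} standing by")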

2.4 What about my 18 GPUs?

We asked for 18 GPUs and we got a report from 3 nodes. Because your experiment ran on the ISC cycling cluster with 6 GPUs on each node, only 3 nodes were necessary to fulfil your 18 GPU request. We just didn't do much with those GPUs this time around.

3. Torchrun experiment

Next we will demonstrate launching an experiment using torchrun. Inside your tutorial directory create the following file called torchrun.py. Note that we are now also accessing the LOCAL_RANK and WORLD_SIZE environment variables that will be set by torchrun.

import os

def main():
    # Useful for node communication
    master_address = os.environ["MASTER_ADDR"]
    master_port = os.environ["MASTER_PORT"]

    # Useful for training coordination
    rank = os.environ["RANK"]
    local_rank = os.environ["LOCAL_RANK"]
    world_size = os.environ["WORLD_SIZE"]

    # Useful for saving experiment artefacts
    output_path = os.environ["OUTPUT_PATH"]

    print(f"Master address: {master_address}, Master port: {master_port}, Rank: {rank}, Local rank: {local_rank}, World size: {world_size}, Output path: {output_path}")

if __name__ == "__main__":
    main()

Inside your tutorial directory create the following file called torchrun.isc. Note that we are now launching our script using torchrun, passing the $NNODES, $N_PROC, $MASTER_ADDR, $MASTER_PORT, and $RANK environment variables.

isc_project_id = "<isc-project-id>"
experiment_name = "torchrun"
gpu_type = "24GB VRAM GPU"
gpus = 18
output_path = "~/outputs/torchrun"
compute_mode = "cycle"
command = "source ~/.tutorial/bin/activate && cd ~/tutorial && torchrun --nnodes=$NNODES --nproc-per-node=$N_PROC --master_addr=$MASTER_ADDR --master_port=$MASTER_PORT --node_rank=$RANK torchrun.py"

Note that the ISC determines suitable values for $NNODES and $N_PROC based on the number of GPUs requested and the number of GPUs per node. In this case $NNODES will be set to 3 and $N_PROC will be set to 6 on each node, so torchrun will start 6 processes in parallel on each node, each independently running our torchrun.py script.
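
The arithmetic is straightforward, as sketched below (assuming the GPU request is a whole multiple of the 6 GPUs per node):

gpus = 18                        # from the .isc file
gpus_per_node = 6                # fixed by the cluster hardware
nnodes = gpus // gpus_per_node   # $NNODES = 3
n_proc = gpus_per_node           # $N_PROC = 6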

Run the following commands to launch your experiment.

cd ~/tutorial
isc train torchrun.isc

Inspecting just the contents of the rank_0.txt and rank_1.txt files generated by this experiment, we find the following.

# rank_0.txt:
Master address: <ip-address>, Master port: <port>, Rank: 0, Local rank: 0, World size: 18, Output path: /root/outputs/torchrun/<experiment-id>
Master address: <ip-address>, Master port: <port>, Rank: 1, Local rank: 1, World size: 18, Output path: /root/outputs/torchrun/<experiment-id>
Master address: <ip-address>, Master port: <port>, Rank: 2, Local rank: 2, World size: 18, Output path: /root/outputs/torchrun/<experiment-id>
Master address: <ip-address>, Master port: <port>, Rank: 3, Local rank: 3, World size: 18, Output path: /root/outputs/torchrun/<experiment-id>
Master address: <ip-address>, Master port: <port>, Rank: 4, Local rank: 4, World size: 18, Output path: /root/outputs/torchrun/<experiment-id>
Master address: <ip-address>, Master port: <port>, Rank: 5, Local rank: 5, World size: 18, Output path: /root/outputs/torchrun/<experiment-id>

# rank_1.txt:
Master address: <ip-address>, Master port: <port>, Rank: 6, Local rank: 0, World size: 18, Output path: /root/outputs/torchrun/<experiment-id>
Master address: <ip-address>, Master port: <port>, Rank: 7, Local rank: 1, World size: 18, Output path: /root/outputs/torchrun/<experiment-id>
Master address: <ip-address>, Master port: <port>, Rank: 8, Local rank: 2, World size: 18, Output path: /root/outputs/torchrun/<experiment-id>
Master address: <ip-address>, Master port: <port>, Rank: 9, Local rank: 3, World size: 18, Output path: /root/outputs/torchrun/<experiment-id>
Master address: <ip-address>, Master port: <port>, Rank: 10, Local rank: 4, World size: 18, Output path: /root/outputs/torchrun/<experiment-id>
Master address: <ip-address>, Master port: <port>, Rank: 11, Local rank: 5, World size: 18, Output path: /root/outputs/torchrun/<experiment-id>

...

Note that the RANK environment variable has been reset by torchrun and now indexes processes across the entire cluster, whereas the new LOCAL_RANK environment variable indexes processes locally on each node. The $RANK environment variable is helpful for coordinating the activity of the entire cluster, for example to ensure each process accesses a unique subset of the overall dataset. The $LOCAL_RANK environment variable is helpful for indexing each GPU on the local node.
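
As a sketch of how these variables are typically used inside a training script (the list here is a stand-in for a real dataset):

import os
import torch

rank = int(os.environ["RANK"])
local_rank = int(os.environ["LOCAL_RANK"])
world_size = int(os.environ["WORLD_SIZE"])

# Pin this process to one GPU on its node using the local rank
device = torch.device(f"cuda:{local_rank}")

# Take a disjoint shard of the dataset across the cluster using the global rank
dataset = list(range(1_000))       # stand-in for a real dataset
shard = dataset[rank::world_size]  # each process sees a unique subset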
