3. Hello World

With your VPN activated and your development environment set up, let's launch our first "Hello World" experiment. To follow this guide, ensure you have accessed your development environment as described in Section 2, in which case your home directory (~) is /root.

It will help to understand that the ISC cycling cluster comprises 12 compute nodes, each with 6 GPUs onboard, for a total of 72 GPUs. In this guide we're going to train a Convolutional Neural Network (CNN) on the FashionMNIST dataset using 8 of the 12 nodes, for 48 GPUs in total.
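The node and GPU arithmetic above can be checked with a short sketch. The two constants come from this guide; the function name nodes_needed and the assumption that GPUs are allocated in whole nodes are illustrative only and not confirmed ISC behaviour.

```python
import math

NODES_IN_CLUSTER = 12  # from this guide
GPUS_PER_NODE = 6      # from this guide


def nodes_needed(gpus: int) -> int:
    """Nodes required for `gpus`, assuming (hypothetically) whole-node allocation."""
    total_gpus = NODES_IN_CLUSTER * GPUS_PER_NODE
    if not 1 <= gpus <= total_gpus:
        raise ValueError(f"gpus must be between 1 and {total_gpus}")
    # Round up so that a partially used node still counts as one node.
    return math.ceil(gpus / GPUS_PER_NODE)


print(NODES_IN_CLUSTER * GPUS_PER_NODE)  # total GPUs in the cluster
print(nodes_needed(48))                  # nodes used in this guide
```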

3.1 Create a Project on Control Plane

Visit the "Projects" page on Control Plane (https://cp.strongcompute.ai). Click on "New Project" and give your new project a name such as "Hello World". Make a note of the ID of your new Project, which you will need later.

All experiments launched on the ISC must be associated with a Project which is used for compute consumption tracking and cost control. To successfully launch experiments you will need the help of your Organisation Owner or Admins to ensure your Organisation has sufficient budget and that any applied cost controls permit experiments to be launched under your new Project.

You can view your Organisation's budget and review applied cost controls by visiting the Billing tab on Control Plane.

3.2 Create and activate a virtual environment

When starting a new project on the ISC, it is important to create and activate a virtual environment as follows. As we're training on the FashionMNIST dataset, we'll call our virtual environment .fashion.

cd ~
python3 -m virtualenv ~/.fashion
source ~/.fashion/bin/activate

Your virtual environment is created under /root, which is a volume mounted in read/write mode to your development container. This means your container will start and your experiments will launch as fast as possible.
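To confirm that the virtual environment is active before installing anything, a check along the following lines may help. This is a minimal sketch using only the Python standard library; the function name in_virtual_environment is our own and not part of any ISC tooling.

```python
import sys


def in_virtual_environment() -> bool:
    """Return True when the running interpreter belongs to a virtual environment."""
    # venv and current virtualenv releases point sys.prefix at the environment
    # directory, while sys.base_prefix keeps pointing at the base installation.
    base_prefix = getattr(sys, "base_prefix", sys.prefix)
    # Legacy virtualenv releases set sys.real_prefix instead.
    return hasattr(sys, "real_prefix") or sys.prefix != base_prefix


if __name__ == "__main__":
    print("virtual environment active:", in_virtual_environment())
    print("interpreter prefix:", sys.prefix)
```

After running source ~/.fashion/bin/activate, the reported prefix would be expected to be the .fashion directory in your home directory.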

3.3 Clone the Strong Compute ISC Demos GitHub repository and install dependencies

In your terminal, run the following commands to clone the ISC Demos repo.

cd ~
git clone https://github.com/StrongResearch/isc-demos.git
cd ~/isc-demos

The ISC Demos repo includes a project subdirectory for our FashionMNIST example. Navigate to that subdirectory and install the necessary dependencies with the following.

cd ~/isc-demos/fashion_mnist
pip install -r requirements.txt

You will notice that in addition to PyTorch and other dependencies, we are installing another GitHub repository called cycling_utils. This is a repository developed by Strong Compute that offers simple utilities for saving and resuming your training from checkpoints.

3.4 Update the experiment launch file

Experiments are launched on the ISC using a TOML file which communicates important details of your experiment to the ISC. This file can be named anything you like. We suggest using the file extension .isc to distinguish it from other files.

Open the fashion_mnist launch file for editing with the following command (or open it for editing in VSCode).

cd ~/isc-demos/fashion_mnist
nano fashion_mnist.isc

Update the fashion_mnist.isc file with the ID of the Project you created above.

isc_project_id = "<isc-project-id>"
experiment_name = "fashion_mnist"
gpu_type = "24GB VRAM GPU"
gpus = 48
output_path = "~/outputs/fashion_mnist"
dataset_id = "8d2de5b2-d07f-47ce-a6d6-d217a1cfa369"
command = '''
source ~/.fashion/bin/activate && 
cd ~/isc-demos/fashion_mnist/ && 
torchrun --nnodes=$NNODES --nproc-per-node=$N_PROC 
--master_addr=$MASTER_ADDR --master_port=$MASTER_PORT --node_rank=$RANK 
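train.py --lr 0.001 --batch-size 16 --save-dir $OUTPUT_PATH --tboard-path $OUTPUT_PATH/tb'''

  • isc_project_id is a required field that must be a string and must be the ID of the Project you created above.

  • experiment_name is a required field that must be a string and can be anything you like.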

  • gpu_type is a required field that for now must always be set to "24GB VRAM GPU".

  • gpus is a required field that must be an integer between 1 and 72 inclusive and describes the number of GPUs you want to use for your experiment.

  • output_path is a required field that must be a string and must be a valid path description. The path does not need to exist when you launch your experiment; the ISC will create it for you if necessary. When you launch an experiment, it will be assigned an experiment ID. The ISC will then create a subdirectory inside the output_path you nominated, named with your experiment ID, and set the OUTPUT_PATH environment variable to the full path of that subdirectory. You can then pass this in as a command line argument to your training script as demonstrated above, or access this environment variable from inside your training script.

  • dataset_id is an optional field that must be a string and must be the ID for a Dataset that you have access to in Control Plane. This example is based on the FashionMNIST Open Dataset. For more information about Datasets see the Datasets section under Basics.

  • command is a required field that must be a string and describes the sequence of operations you want each node to perform when it is started to run your experiment. In this example, we are activating our .fashion virtual environment, navigating into our fashion_mnist project directory, and calling torchrun to start our distributed training routine described in train.py. Note that the torchrun arguments include --nnodes=$NNODES and --nproc-per-node=$N_PROC. These environment variables are set by the ISC based on the requested gpus and the number of GPUs per node in the cluster.

Another optional argument you can include in your launch file is the following.

  • compute_mode is an optional field that must be a string and must be either "cycle" (default) or "interruptible". For an explanation of these options and general ISC dynamics see the Compute mode heading of Experiments under Basics.
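The field rules listed above can be checked locally before launching. The following is a minimal sketch under stated assumptions, not the validation the ISC itself performs: the function names are hypothetical, and treating isc_project_id as required is inferred from the rule that every experiment must belong to a Project. Parsing relies on tomllib, which is in the standard library from Python 3.11.

```python
REQUIRED_FIELDS = {
    "isc_project_id": str,  # assumed required: every experiment needs a Project
    "experiment_name": str,
    "gpu_type": str,
    "gpus": int,
    "output_path": str,
    "command": str,
}
OPTIONAL_FIELDS = {"dataset_id": str, "compute_mode": str}
GPU_TYPE = "24GB VRAM GPU"
COMPUTE_MODES = {"cycle", "interruptible"}
MAX_GPUS = 72


def validate_launch_config(config: dict) -> list:
    """Return a list of problems found in a parsed launch file (empty if none)."""
    problems = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in config:
            problems.append(f"missing required field: {name}")
        elif type(config[name]) is not expected:
            problems.append(f"{name} must be of type {expected.__name__}")
    for name, expected in OPTIONAL_FIELDS.items():
        if name in config and type(config[name]) is not expected:
            problems.append(f"{name} must be of type {expected.__name__}")
    gpus = config.get("gpus")
    if type(gpus) is int and not 1 <= gpus <= MAX_GPUS:
        problems.append(f"gpus must be between 1 and {MAX_GPUS} inclusive")
    gpu_type = config.get("gpu_type")
    if isinstance(gpu_type, str) and gpu_type != GPU_TYPE:
        problems.append(f'gpu_type must be "{GPU_TYPE}"')
    mode = config.get("compute_mode", "cycle")
    if isinstance(mode, str) and mode not in COMPUTE_MODES:
        problems.append('compute_mode must be "cycle" or "interruptible"')
    return problems


def load_launch_file(path: str) -> dict:
    """Parse a TOML launch file such as fashion_mnist.isc (Python 3.11+)."""
    import tomllib  # imported here so the validator also works on older Pythons

    with open(path, "rb") as handle:  # tomllib requires a binary file handle
        return tomllib.load(handle)


example = {
    "isc_project_id": "<isc-project-id>",
    "experiment_name": "fashion_mnist",
    "gpu_type": GPU_TYPE,
    "gpus": 48,
    "output_path": "~/outputs/fashion_mnist",
    "command": "<command>",
}
print(validate_launch_config(example))
```

To check the real file, pass the result of load_launch_file("fashion_mnist.isc") to validate_launch_config.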

3.5 Launch and track an experiment

Launch your experiment by running the following commands.

cd ~/isc-demos/fashion_mnist
isc train fashion_mnist.isc

You will receive the following response.

Using credentials file /root/credentials.isc
Success: Experiment created: <experiment-id>

Track the status of your experiment from the terminal by running the following command.

isc experiments

The following report will be displayed in your terminal.

                               ISC Experiments                                                                                                                          
┏━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━┓
┃ ID              ┃ Name          ┃ Output Path                                 ┃ NNodes ┃ Compute Mode ┃ Status   ┃
┡━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━┩
│ <experiment-id> │ fashion_mnist │ /root/outputs/fashion_mnist/<experiment-id> │ 8      │ cycle        │ enqueued │
└─────────────────┴───────────────┴─────────────────────────────────────────────┴────────┴──────────────┴──────────┘

Because we launched our experiment in the default compute mode "cycle", our experiment will start running within seconds. Refresh this report by re-running isc experiments and you will see the Status of the experiment change from enqueued to running and, after about 90 seconds, to completed.
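If you would rather read the status programmatically than by eye, the report can be parsed as text. The sketch below assumes the report keeps exactly the layout shown above (a header row delimited by "┃" and data rows delimited by "│"); the function name parse_experiments_table is hypothetical, and whether the isc CLI prints the same layout when its output is piped is not confirmed here.

```python
def parse_experiments_table(report: str) -> list:
    """Convert the text table printed by `isc experiments` into a list of dicts."""
    header, rows = None, []
    for line in report.splitlines():
        line = line.strip()
        if line.startswith("┃"):
            # The header row uses heavy vertical bars as column separators.
            header = [cell.strip() for cell in line.strip("┃").split("┃")]
        elif line.startswith("│") and header:
            # Data rows use light vertical bars as column separators.
            cells = [cell.strip() for cell in line.strip("│").split("│")]
            rows.append(dict(zip(header, cells)))
    return rows


# Header and data row copied from the report shown in this guide.
SAMPLE_REPORT = """
┃ ID              ┃ Name          ┃ Output Path                                 ┃ NNodes ┃ Compute Mode ┃ Status   ┃
│ <experiment-id> │ fashion_mnist │ /root/outputs/fashion_mnist/<experiment-id> │ 8      │ cycle        │ enqueued │
"""

for experiment in parse_experiments_table(SAMPLE_REPORT):
    print(experiment["Name"], experiment["Status"])
```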

3.6 Inspect experiment outputs in the development container

Navigate to the directory reported under Output Path and list the contents of that directory with the following commands.

cd /root/outputs/fashion_mnist/<experiment-id>
ls

You will find the following files in your output directory.

  • rank_X.txt files for X from 0 to 7. Each of these files contains a report from CUDA (your development container is based on the CUDA container image), logs from torch to indicate torchrun has started, and then logs of each print statement included in the train.py script. When developing your own projects, you will find any errors or tracebacks returned by your code reported in these files.

  • checkpoint.pt and (maybe) checkpoint.pt.temp files. The checkpoint.pt file contains the checkpoint of our model as well as the state of our optimizer, distributed samplers, learning rate scheduler, and metrics tracker. For more information about why and how to implement robust checkpointing see the Compute mode heading of Experiments under Basics.

  • tb directory which contains tensorboard logs as implemented in the train.py script.
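The occasional checkpoint.pt.temp file is consistent with a common pattern in which a checkpoint is first written to a temporary file and then renamed, so that an interruption never leaves a half-written checkpoint.pt. The sketch below illustrates that pattern only; it is an assumption, not the confirmed cycling_utils implementation, and it uses pickle so that it runs without PyTorch (a training script would call torch.save and torch.load instead). The function names are our own.

```python
import os
import pickle
import tempfile


def save_checkpoint_atomically(state: dict, path: str) -> None:
    """Write `state` to `path` so that readers never see a partial file."""
    temp_path = path + ".temp"
    with open(temp_path, "wb") as handle:
        pickle.dump(state, handle)  # a PyTorch script would use torch.save here
        handle.flush()
        os.fsync(handle.fileno())   # push the bytes to disk before renaming
    # os.replace is atomic when both paths are on the same filesystem.
    os.replace(temp_path, path)


def load_checkpoint(path: str):
    """Return the saved state, or None when no checkpoint exists yet."""
    if not os.path.isfile(path):
        return None
    with open(path, "rb") as handle:
        return pickle.load(handle)


if __name__ == "__main__":
    # OUTPUT_PATH is set by the ISC; fall back to a temp directory elsewhere.
    save_dir = os.environ.get("OUTPUT_PATH") or tempfile.mkdtemp()
    os.makedirs(save_dir, exist_ok=True)
    checkpoint_path = os.path.join(save_dir, "checkpoint.pt")

    state = load_checkpoint(checkpoint_path) or {"epoch": 0}
    state["epoch"] += 1  # stand-in for one epoch of training
    save_checkpoint_atomically(state, checkpoint_path)
    print("saved checkpoint at epoch", state["epoch"])
```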

3.7 Launch tensorboard

Launch tensorboard to view the training results graphically with the following command.

tensorboard --logdir /root/outputs/fashion_mnist/<experiment-id>

Enter the following URL in your browser to view your tensorboard.

http://localhost:6006/

Your tensorboard will resemble the following.

[Screenshot: tensorboard showing the training results]

3.8 Inspect experiment outputs in Control Plane

Visit Control Plane (https://cp.strongcompute.ai) and click on "Experiments". You will find a record of the experiment you launched in the Experiments table.

Click the "View" button for your experiment, and then on the experiment page click the "View" button next to "Training Logs". In the window that appears you will see the last 100 lines from the rank_0.txt file in your experiment outputs.
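The same view can be approximated from the development container by reading the end of rank_0.txt yourself. This is a minimal sketch using only the standard library; the function name tail is our own, and the demonstration writes a throwaway file rather than touching real experiment outputs.

```python
import os
import tempfile
from collections import deque


def tail(path: str, n: int = 100) -> list:
    """Return the last `n` lines of a text file without holding the whole file."""
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        # A bounded deque keeps only the most recent `n` lines while iterating.
        return [line.rstrip("\n") for line in deque(handle, maxlen=n)]


if __name__ == "__main__":
    # Demonstrate on a temporary file standing in for rank_0.txt.
    demo_path = os.path.join(tempfile.mkdtemp(), "rank_0.txt")
    with open(demo_path, "w", encoding="utf-8") as handle:
        for number in range(1, 251):
            handle.write(f"line {number}\n")
    last_lines = tail(demo_path)
    print(len(last_lines), last_lines[0], last_lines[-1])
```

For a real experiment, point tail at the rank_0.txt file inside the directory reported under Output Path.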

Congratulations, you have successfully launched your first experiment on the ISC!
