3. Hello World
With your VPN activated and your development environment set up, let's launch our first "Hello World" experiment. To follow this guide, ensure you have accessed your development environment as described in Section 2, which means your home directory (`~`) is at `/root`.
It will help to understand that the ISC cycling cluster comprises 12 compute nodes each with 6 GPUs onboard for a total of 72 GPUs. In this guide we're going to train a Convolutional Neural Network (CNN) on the FashionMNIST dataset using 8 of the 12 nodes for 48 GPUs in total.
3.1 Create a Project on Control Plane
Visit the "Projects" page on Control Plane (https://cp.strongcompute.ai). Click on "New Project" and give your new project a name such as "Hello World". Make a note of the ID of your new Project which you will need later.
All experiments launched on the ISC must be associated with a Project which is used for compute consumption tracking and cost control. To successfully launch experiments you will need the help of your Organisation Owner or Admins to ensure your Organisation has sufficient budget and that any applied cost controls permit experiments to be launched under your new Project.
You can view your Organisation's budget and review applied cost controls by visiting the Billing tab on Control Plane.
3.2 Install Python
Your new Container is a blank slate and will need some basic software (like Python) installed for this demo, which you can install with the following.
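The exact packages depend on your Container's base image; assuming a Debian or Ubuntu base, a minimal install might look like:

```shell
# Install Python, pip, and the venv module
# (assumes a Debian/Ubuntu base image, running as root as in the
#  development container where ~ is /root)
apt-get update
apt-get install -y python3 python3-pip python3-venv
```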
3.3 Create and activate a virtual environment
When starting a new project on the ISC it is always important to create and activate a virtual environment as follows. As we're training on the FashionMNIST dataset, we'll call our virtual environment `.fashion`.
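A minimal sketch of creating and activating the environment (using the `.fashion` name above, with paths relative to your home directory):

```shell
# Create a virtual environment named ".fashion" in the home directory
python3 -m venv ~/.fashion

# Activate it so that python and pip resolve inside the environment
source ~/.fashion/bin/activate
```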
By default your virtual environment is created in `/root`, which is a volume mounted in read/write mode to your development container. This means your container will start and your experiments will launch as fast as possible.
3.4 Clone the Strong Compute ISC Demos GitHub repository and install dependencies
In your terminal run the following commands to clone the ISC Demos repo and install the dependencies.
The ISC Demos repo includes a project subdirectory for our FashionMNIST example. Navigate to that subdirectory and install the necessary dependencies with the following.
You will notice that in addition to PyTorch and other dependencies, we are installing another GitHub repository called `cycling_utils`. This is a repository developed by Strong Compute that provides simple, helpful utilities for saving and resuming your training from checkpoints.
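For illustration, that dependency might appear in the project's requirements file as a git line alongside the usual packages (the exact URL and package set here are assumptions):

```
# hypothetical excerpt of requirements.txt
torch
torchvision
git+https://github.com/StrongResearch/cycling_utils.git
```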
3.5 Update the experiment launch file
Experiments are launched on the ISC using a TOML file which communicates important details of your experiment to the ISC. This file can be named anything you like. We suggest using the file extension `.isc` to distinguish it from other files.
Open the fashion_mnist launch file for editing with the following command (or open it for editing in VSCode).
Update the `fashion_mnist.isc` file with the ID of the Project you created above.
- `experiment_name` is a required field that must be a string and can be anything you like.
- `gpus` is a required field that must be an integer between 1 and 72 inclusive and describes the number of GPUs you want to use for your experiment.
- `dataset_id_list` is an optional field that must be a list of unique strings corresponding to the IDs of Datasets that you have access to in Control Plane. This example is based on the FashionMNIST Open Dataset. For more information about Datasets see the Datasets section under Basics.
- `command` is a required field that must be a string and describes the sequence of operations you want each node to perform when it is started to run your experiment. In this example, we are activating our `.fashion` virtual environment, navigating into our `fashion_mnist` project directory, and calling `torchrun` to start our distributed training routine described in `train.py`. Note that the torchrun arguments include `--nnodes=$NNODES` and `--nproc-per-node=$N_PROC`. These environment variables are set by the ISC based on the required `gpus` and the number of GPUs per node in the cluster.
Another optional argument you can include in your launch file is the following.
`compute_mode` must be a string and must be either `"cycle"` (default) or `"interruptible"`. For an explanation of these options and general ISC dynamics see the Compute mode heading of Experiments under Basic Concepts.
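Putting the fields together, a launch file might look like the following sketch. The project ID field name, dataset ID placeholder, and the exact paths in `command` are assumptions for illustration; substitute your own values:

```toml
# fashion_mnist.isc — illustrative sketch, not a verbatim copy of the repo's file
experiment_name = "fashion_mnist_hello_world"
gpus = 48
compute_mode = "cycle"
# field name assumed — use the Project ID you noted from Control Plane
project_id = "<your-project-id>"
# placeholder — use the FashionMNIST Open Dataset ID from Control Plane
dataset_id_list = ["<your-fashionmnist-dataset-id>"]
command = """
source /root/.fashion/bin/activate &&
cd /root/isc-demos/fashion_mnist &&
torchrun --nnodes=$NNODES --nproc-per-node=$N_PROC train.py"""
```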
3.6 Launch and track an experiment
Launch your experiment by running the following commands.
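Assuming the `isc` CLI is available in your container (as set up in Section 2) and that `isc train` is the launch subcommand, the launch might look like:

```shell
# Launch the experiment described by the launch file
# (`isc train` subcommand name assumed)
cd ~/isc-demos/fashion_mnist
isc train fashion_mnist.isc
```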
You will receive the following response.
Track the status of your experiment from the terminal by running the following command.
The following report will be displayed in your terminal.
Because we launched our experiment in the default compute mode `"cycle"`, our experiment will start running in seconds. Refreshing this report by repeatedly running `isc experiments`, you will see the `Status` of the experiment change from `enqueued` to `running` and, after suspending and resuming 3 times, the status of the experiment will show `completed`.
3.7 Synchronising experiment artifacts
Four artifacts have been created to collect data generated by the experiment. To download artifacts from your experiment to your workstation, visit the Experiments page on Control Plane https://cp.strongcompute.ai and click on the "Outputs" button for your experiment, then click "Sync to workstation" for each artifact you want to download. The three artifact types that are important for this experiment are as follows.
Logs
Logs artifacts contain text files for each node running the experiment (e.g. `rank_N.txt`) with anything printed to standard out or standard error. The logs artifact should be the first place to look for information to assist in debugging training code. Updates to the logs artifact are synchronised from running experiments every 10 seconds, and at the end of training (e.g. at the end of a 90-second `cycle` experiment).
Checkpoints
Checkpoint artifacts are intended to contain larger files such as model weights. Updates to checkpoint artifacts are synchronised from running experiments every 10 minutes, and at the end of training (e.g. at the end of a 90-second `cycle` experiment). Note that we are passing in the `CHECKPOINT_ARTIFACT_PATH` environment variable set by the ISC in the experiment launch file above as the path for saving our model checkpoints.
Lossy
Lossy artifacts are intended to contain smaller files that we update more frequently, such as tensorboard logs. Updates to lossy artifacts are synchronised from running experiments every 30 seconds, and at the end of training (e.g. at the end of a 90-second `cycle` experiment). Note that we are passing in the `LOSSY_ARTIFACT_PATH` environment variable set by the ISC in the experiment launch file above as the path for saving our tensorboard logs.
Accessing artifacts in your container
After the artifact has downloaded to your workstation, the contents of the artifact will be available to retrieve from the following location inside your container.
When you click "Sync to Workstation", the experiment artifacts are downloaded in their state at that moment in time. If the experiment is still running, you will need to click "Sync to Workstation" again to update the artifacts with the latest changes from your running experiment.
3.8 Launch tensorboard
To launch the tensorboard view of logs generated by your experiment, first download the "lossy" logs to your workstation, then run the following command in your container terminal.
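Assuming tensorboard was installed with the project dependencies, the command might look like the following (the artifact path is a placeholder for wherever the synced "lossy" artifact lives in your container):

```shell
# Serve tensorboard so it is reachable from your browser
# (logdir is a placeholder; 6006 is tensorboard's default port)
tensorboard --logdir <path-to-synced-lossy-artifact> --host 0.0.0.0 --port 6006
```

With the default port, the tensorboard URL is typically http://localhost:6006 (or your container's address).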
Enter the following URL in your browser to view your tensorboard.
Your tensorboard will resemble the following.
Congratulations, you have successfully launched and tracked your first experiment on the ISC!