Launching Experiments
You can view all experiments associated with your organisation on the "Experiments" page in Control Plane (https://cp.strongcompute.ai), and filter experiments by compute mode and status. From the Experiments page you can view the most recently printed 100 lines of the master node log file for each experiment, and cancel experiments.
Please note that, from the Experiments page, anyone in an organisation can view or cancel jobs launched by anyone else within the same organisation.
Experiments are launched using an experiment launch file in TOML format which communicates important details of your experiment to the ISC. This file can be named anything you like; by convention we suggest using the file extension .isc for distinction. An example of such a file is shown below.
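The following is a minimal sketch of a launch file; all values are placeholders, and the command assumes a Python virtual environment and a PyTorch training script launched with torchrun.

```toml
# example.isc - all values below are illustrative placeholders
isc_project_id = "your-project-id"    # from the Projects page in Control Plane
experiment_name = "my-first-experiment"
gpu_type = "24GB VRAM GPU"
gpus = 8
output_path = "~/outputs"
compute_mode = "cycle"                # optional: "cycle" (default), "interruptible", or "burst"
command = "source /root/.venv/bin/activate && torchrun --nnodes=$NNODES --nproc_per_node=$N_PROC train.py"
```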
isc_project_id is required and can be obtained from the Projects page in Control Plane.
experiment_name is required and can be any string you like.
gpu_type is required and must always be set to "24GB VRAM GPU". Other GPU types will be made available in future.
gpus is required and must be an integer between 1 and 72 inclusive, describing the number of GPUs you want to use for your experiment.
output_path is required and must describe a path to a directory. The ISC will create an output directory inside the path provided, named with the Experiment ID. The full path to that output directory will then be set as the $OUTPUT_PATH environment variable, which you can pass in as a command line argument or access from within your training script.
compute_mode is optional and must be "cycle" (default), "interruptible", or "burst". See below for more information about these compute modes.
max_rapid_cycles is optional and must be an integer describing, for experiments with compute_mode="cycle", the number of times the experiment will cycle before completing. See below for more information about this compute mode.
burst_shape_priority_list is optional and must be a list of burst shape IDs in quotation marks, e.g. burst_shape_priority_list = [ "gcp-desired-shape" ]. This field should only be specified when compute_mode="burst". See below for more information about this optional argument.
command is required and describes the operation that each node will execute when started, typically including, as a minimum, sourcing your virtual environment and launching a training script.
Advanced note: if you want to change the node or process counts in your command, always use the environment variables $NNODES and $N_PROC for those flags (as above), unless you know the machine shape will match the values you hard-code!
Experiments with compute_mode="cycle" will interrupt any currently running experiment with compute_mode="interruptible" and take priority. Cycle mode experiments will cycle between running for 90s and paused a number of times. Users can specify the number of times their cycle mode experiment should cycle using the max_rapid_cycles argument in the experiment launch file, with a minimum of 1 and a maximum of 5 cycles.
If your experiment does not return an error during cycling and resuming, the experiment status will show "completed". If your experiment returns an error within that time your experiment status will show "failed". The purpose of this compute mode is to provide immediate developer feedback on code viability, and to verify that your experiment is able to successfully pause and resume.
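For example, the following launch file fragment (values illustrative) requests a cycle mode experiment that pauses and resumes three times before completing:

```toml
compute_mode = "cycle"
max_rapid_cycles = 3   # integer between 1 and 5
```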
Experiments with compute_mode="interruptible" launch into a queue which behaves as follows. Every 2 hours the ISC will enumerate the active interruptible experiments and apportion the next 2-hour period in contiguous blocks to each. The interruptible experiments will then run in order for their apportioned time unless interrupted by an experiment with compute_mode="cycle". If interrupted, the interruptible experiment will wait until there are no further cycle experiments enqueued before resuming and completing its apportioned time. Interruptible experiments will cycle in this manner indefinitely until the experiment completes or an error is encountered.
A third available compute mode is burst, which currently requires approval and supervision by Strong Compute to execute. Experiments launched with compute_mode="burst" will have a dedicated cluster provisioned for them on a commercial cloud and will run uninterrupted for as long as the User allows or while available credits permit.
Commercial clouds offer a number of different types of compute node, equipped with different types and numbers of GPUs. These types of commercial cloud compute node are referred to as "shapes". Each shape of compute node is typically a unique combination of the following.
- Commercial cloud provider
- Geographic region
- Processor type
- Type and number of GPUs
- Provisioning model (spot / on-demand)
When an experiment is launched with compute_mode="burst", the ISC will search for a cluster on a commercial cloud with a suitable shape. Users can also specify their preferred shapes by including the following in their experiment launch file.
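For example (the shape IDs below are placeholders; obtain real values from the Burst page on Control Plane):

```toml
compute_mode = "burst"
burst_shape_priority_list = [ "gcp-desired-shape", "gcp-fallback-shape" ]
```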
Users can obtain the necessary shape-id values from the Burst page on Control Plane. If no valid shape-id can be parsed from the burst_shape_priority_list argument, or if the User does not include the burst_shape_priority_list argument in the experiment launch file, the ISC will try all shapes listed on the Burst page on Control Plane. Each shape will be charged at a different rate.
Note: Burst experiments are currently limited to a maximum of 48 GPUs. Attempts to launch burst experiments with more than 48 GPUs will return an error in the terminal.
When launching experiments with compute_mode="burst" it is also necessary to visit the Experiments page in Control Plane after the experiment has been created and click "Launch Burst" to confirm the experiment. After clicking "Launch Burst", the ISC will search for a suitable cluster on a commercial cloud, download your experiment data to that cluster, and start your burst experiment. Click "View" for your burst experiment to track the progress of this startup process. Be patient; this can take a few minutes.
After your burst experiment has started and its status is running, your experiment is running in an isolated environment on a remote cluster.
To access artefacts such as training logs and checkpoints from your burst experiment, visit the User Credentials page on Control Plane, then Stop and Start your container. When your container starts, it will have a directory mounted at /root/exports which will contain your burst experiment artefacts.
Interacting with the /root/exports directory is slow because it is a FUSE-mounted bucket, so please be patient when exploring this directory with commands like ls. For the best experience viewing performance metrics in rank_0.txt or accessing checkpoints, first copy the files you need from /root/exports to another subdirectory in /root.
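For example, a minimal sketch of copying artefacts out of the bucket before inspecting them (the Experiment ID and file layout shown are illustrative):

```bash
# Copy artefacts out of the slow FUSE-mounted bucket before working with them.
# "abc123" stands in for your actual Experiment ID; adjust paths to suit.
mkdir -p /root/artefacts
cp -r /root/exports/abc123 /root/artefacts/
less /root/artefacts/abc123/rank_0.txt   # view performance metrics locally
```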
When your experiment is launched to run on the ISC, a copy of your container is started on each node, and a number of environment variables are set which are helpful for coordinating distributed computing operations. Environment variables can be accessed from within training scripts during training, or from within experiment launch file arguments (as above).
The ISC creates an output directory inside the path described by the output_path argument in the experiment launch file, named with the Experiment ID. The full path to that output directory will then be set as the $OUTPUT_PATH environment variable. Saving experiment artefacts to the path described by the $OUTPUT_PATH environment variable is a convenient way of ensuring artefacts from each experiment are saved in separate directories.
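For example, a training script might read $OUTPUT_PATH and save checkpoints there; a minimal sketch, with a stand-in model:

```python
import os

import torch
import torch.nn as nn

# $OUTPUT_PATH is set by the ISC to a per-experiment output directory,
# so artefacts from each experiment land in their own directory.
output_path = os.environ["OUTPUT_PATH"]

model = nn.Linear(10, 2)  # stand-in for your real model
torch.save(model.state_dict(), os.path.join(output_path, "checkpoint.pt"))
```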
The Experiment ID of the currently running experiment. This is automatically generated by the ISC when the experiment is submitted for launch.
The user-provided name for the experiment, described by the experiment_name argument in the experiment launch file.
Compute mode of the currently running experiment, corresponding to the compute_mode argument in the experiment launch file.
The number of times this experiment has been started or resumed. For example, an experiment with compute_mode="interruptible" will have an incremented $STRONG_CYCLE_COUNT each 2-hour period in which it is allowed to continue to cycle, or when resuming after being interrupted by an experiment with compute_mode="cycle".
The amount of time allocated to the experiment to run as scheduled. For experiments with compute_mode="cycle" this will typically be 90s. For experiments with compute_mode="interruptible" this will be the share of the current 2-hour period apportioned to the experiment. Notwithstanding the allocated $STRONG_CYCLE_TIME, experiments with compute_mode="interruptible" will still be interrupted by experiments with compute_mode="cycle".
A dictionary mapping the IP addresses of each node in the cluster to the $RANK of that machine in the cluster.
An integer index for each node in the cluster.
The IP address for the node in the cluster with $RANK=0.
The available port associated with the IP address for the node in the cluster with $RANK=0.
The number of nodes that are running this experiment. Based on the gpus argument in the experiment launch file and used in conjunction with the $N_PROC variable, the $NNODES variable describes the number of nodes for the experiment to be launched on in order that the total number of processes started on the cluster is equal to gpus. This is helpful for launching distributed training with popular utilities such as torchrun and HuggingFace accelerate.
Based on the gpus argument in the experiment launch file, when used in conjunction with the $NNODES variable the $N_PROC variable describes the number of processes to be started on each node in order that the total number of processes started on the cluster is equal to gpus. This is helpful for launching distributed training with popular utilities such as torchrun (as above).
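For example, with gpus = 16 on a cluster whose nodes have 8 GPUs each, the ISC would set $NNODES=2 and $N_PROC=8, so that 2 × 8 = 16 processes are started in total. A command field might then launch training as in the sketch below; note that $MASTER_ADDR and $MASTER_PORT are assumed names for the rank-0 address and port variables, following torchrun convention, and may differ on your cluster.

```bash
# Illustrative torchrun invocation for the launch file's "command" field.
# $MASTER_ADDR and $MASTER_PORT are hypothetical variable names for the
# rank-0 node's IP address and port; verify the actual names on your cluster.
torchrun \
  --nnodes=$NNODES \
  --nproc_per_node=$N_PROC \
  --node_rank=$RANK \
  --master_addr=$MASTER_ADDR \
  --master_port=$MASTER_PORT \
  train.py
```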