Launching Experiments
You can view all experiments associated with your organisation on the "Experiments" page in Control Plane (https://cp.strongcompute.ai) and filter experiments by compute mode and status. From the Experiments page you can view the most recently printed 100 lines of the master node log file for each experiment, and cancel experiments.
Please note that, from the Experiments page, anyone in an organisation can view or cancel experiments launched by anyone else within the same organisation.
Experiments are launched using an experiment launch file in TOML format, which communicates the important details of your experiment to the ISC. This file can be named anything you like; by convention we suggest using the file extension .isc for distinction. An example of such a file is shown below.
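For illustration, a minimal launch file might look like the following sketch. The project ID, dataset ID, virtual environment path, and script name are placeholders; each field is explained below.

isc_project_id = "<isc-project-id>"
experiment_name = "my-experiment"
gpus = 16
compute_mode = "cycle"
dataset_id_list = [ "<dataset-id>" ]
command = "source /opt/venv/bin/activate && torchrun --nnodes=$NNODES --nproc_per_node=$N_PROC train.py"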
isc_project_id is required and can be obtained from the Projects page in Control Plane.
experiment_name is required and can be any string you like.
gpus is required and must be an integer between 1 and 72 inclusive, describing the number of GPUs you want to use for your experiment.
command is required and describes the operation that each node will execute when started, typically including, as a minimum, sourcing your virtual environment and launching a training script.
compute_mode is optional and must be "cycle" (default), "interruptible", or "burst". See below for more information about these compute modes.
max_rapid_cycles is optional and must be an integer describing, for experiments with compute_mode="cycle", the number of times the experiment will cycle before completing. See below for more information about these compute modes.
dataset_id_list is optional and must be a list of Dataset IDs in quotation marks, e.g. dataset_id_list = [ "dataset-id" ]. These will be available within your container at runtime at /data/<dataset-id>.
burst_shape_priority_list is optional and must be a list of Burst Shape IDs in quotation marks, e.g. burst_shape_priority_list = [ "gcp-desired-shape" ]. This field should only be specified when compute_mode="burst". See below for more information about this optional argument.
input_artifact_id_list is optional and must be a list of at most 3 artifact IDs in quotation marks, e.g. input_artifact_id_list = [ "<artifact-id>", "<artifact-id>", "<artifact-id>" ].
Advanced note: if you want to change the node and process counts passed in your command, always use the environment variables $NNODES and $N_PROC for those flags (as in the launch file example above), unless you know the machine shape will match what you're changing it to!
For information about retrieving experiment artifacts, see the Artifacts (Job Results) page.
Experiments with compute_mode="cycle" will interrupt any currently running experiment with compute_mode="interruptible" and take priority. Compute mode cycle experiments will cycle between running for 90s and being paused a number of times. Users can specify the number of times their cycle mode experiment should cycle using the max_rapid_cycles argument in the experiment launch file, with a minimum of 1 and a maximum of 5 cycles.
If your experiment does not return an error during cycling and resuming, the experiment status will show "completed". If your experiment returns an error within that time, your experiment status will show "failed". The purpose of this compute mode is to provide immediate developer feedback on code viability, and to verify that your experiment is able to successfully pause and resume.
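For example, a launch file for a cycle mode experiment that should cycle three times before completing would include the following two lines.

compute_mode = "cycle"
max_rapid_cycles = 3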
Experiments with compute_mode="interruptible" launch into a queue which behaves as follows. Every 2 hours the ISC will enumerate the active interruptible experiments and apportion the next 2-hour period in contiguous blocks to each. The interruptible experiments will then run in order for their apportioned time unless interrupted by an experiment with compute_mode="cycle". If interrupted, the interruptible experiment will wait until there are no further cycle experiments enqueued before resuming and completing its apportioned time. Interruptible experiments will cycle in this manner indefinitely until the experiment completes or an error is encountered.
A third available compute mode is burst, which currently requires approval and supervision by Strong Compute to execute. Experiments launched with compute_mode="burst" will have a dedicated cluster provisioned for them on a commercial cloud and will run uninterrupted for as long as the User allows or while available credits permit.
Commercial clouds offer a number of different types of compute node, equipped with different types and numbers of GPUs. These types of commercial cloud compute node are referred to as "shapes". Each shape of compute node is typically a unique combination of the following.
Commercial cloud provider,
Geographic region,
Processor type,
Type and number of GPUs,
Provisioning model (spot / on-demand).
When an experiment is launched with compute_mode="burst", the ISC will search for a cluster on a commercial cloud with a suitable shape. Users can also specify their preferred shapes by including the burst_shape_priority_list argument in their experiment launch file, as shown below.
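For example, to prefer one shape and fall back to another (the shape IDs below are placeholders for values copied from the Burst page):

compute_mode = "burst"
burst_shape_priority_list = [ "gcp-desired-shape", "gcp-fallback-shape" ]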
Users can obtain the necessary shape-id values from the Burst page on Control Plane. If no valid shape-id can be parsed from the burst_shape_priority_list argument, or if the User does not include the burst_shape_priority_list argument in the experiment launch file, the ISC will try all shapes listed on the Burst page on Control Plane. Each shape will be charged at a different rate.
Note: Burst experiments are currently limited to a maximum of 48 GPUs. Attempts to launch burst experiments with more than 48 GPUs will return an error in the terminal.
When your experiment is launched to run on the ISC, a copy of your container is started on each node, and a number of environment variables are set which are helpful for coordinating distributed computing operations. Environment variables can be accessed from within training scripts during training, or from within experiment launch file arguments (as above).
CHECKPOINT_ARTIFACT_PATH
Resolves to /mnt/checkpoints. Synchronises every 10 minutes. Useful for saving large files such as model weights.
CRUD_ARTIFACT_PATH
Resolves to /mnt/crud. Synchronises every 10 minutes. Useful for saving large files such as database images.
LOSSY_ARTIFACT_PATH
Resolves to /mnt/lossy. Synchronises every 30 seconds. Useful for small files such as tensorboard logs.
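As a sketch of how these paths can be used, the command below passes two of them to a training script. The virtual environment path is a placeholder, the --save-dir and --log-dir flags are hypothetical and depend on your own script, and the $NNODES and $N_PROC variables are described further below.

command = "source /opt/venv/bin/activate && torchrun --nnodes=$NNODES --nproc_per_node=$N_PROC train.py --save-dir $CHECKPOINT_ARTIFACT_PATH --log-dir $LOSSY_ARTIFACT_PATH"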
The Experiment ID of the currently running experiment. This is automatically generated by the ISC when the experiment is submitted for launch.
The user-provided name for the experiment, described by the experiment_name argument in the experiment launch file.
The compute mode of the currently running experiment, corresponding to the compute_mode argument in the experiment launch file.
STRONG_CYCLE_COUNT
The number of times this experiment has been started or resumed. For example, an experiment with compute_mode="interruptible" will have an incremented STRONG_CYCLE_COUNT each 2-hour period in which it is allowed to continue to cycle, or when resuming after being interrupted by an experiment with compute_mode="cycle".
STRONG_CYCLE_TIME
The amount of time allocated to the experiment to run as scheduled. For experiments with compute_mode="cycle" this will typically be 90s. For experiments with compute_mode="interruptible" this will be the share of the current 2-hour period apportioned to the experiment. Notwithstanding the allocated $STRONG_CYCLE_TIME, experiments with compute_mode="interruptible" will still be interrupted by experiments with compute_mode="cycle".
A dictionary mapping the IP addresses of each node in the cluster to the $RANK of that machine in the cluster.
RANK
An integer index for each node in the cluster.
MASTER_ADDR
The IP address of the node in the cluster with $RANK=0.
MASTER_PORT
An available port associated with the IP address of the node in the cluster with $RANK=0.
NNODES
The number of nodes that are running this experiment. Based on the gpus argument in the experiment launch file and used in conjunction with the $N_PROC variable, the $NNODES variable describes the number of nodes for the experiment to be launched on so that the total number of processes started on the cluster equals gpus. This is helpful for launching distributed training with popular utilities such as torchrun and HuggingFace accelerate.
N_PROC
Based on the gpus argument in the experiment launch file and used in conjunction with the $NNODES variable, the $N_PROC variable describes the number of processes to be started on each node so that the total number of processes started on the cluster equals gpus. This is helpful for launching distributed training with popular utilities such as torchrun (as above).
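Putting these together, a minimal sketch of a command field that starts one process per GPU on each node might look like the following; the virtual environment path and script name are placeholders, and it assumes the MASTER_ADDR and MASTER_PORT variables described above.

command = "source /opt/venv/bin/activate && torchrun --nnodes=$NNODES --nproc_per_node=$N_PROC --node_rank=$RANK --master_addr=$MASTER_ADDR --master_port=$MASTER_PORT train.py"

For example, with gpus = 16 on a shape that has 8 GPUs per node, the ISC would set $NNODES=2 and $N_PROC=8 so that 16 processes are started in total.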