Job Management
Understanding failed jobs
There are many reasons why a job may fail; often there is a quick fix if you check the logs.
To check the logs of a cycle experiment, run the following command:
    cat ~/outputs/PROJECT_NAME/EXPERIMENT_ID/rank_0.txt
To check the logs of a burst experiment, first restart your container (this makes the exports folder accessible). Then run:
    cat ~/exports/EXPERIMENT_ID/outputs/rank_0.txt # different from cycle!
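The two log locations differ only in their directory layout, so a small helper can keep them straight. The following is a minimal sketch in Python and is not part of the ISC tooling: the function name `rank0_log_path` and its parameters are hypothetical, and the paths simply mirror the two `cat` commands above.

```python
from pathlib import Path


def rank0_log_path(compute_mode: str, experiment_id: str, project_name: str = "") -> Path:
    """Return the expected rank_0.txt location (hypothetical helper, not an ISC API)."""
    home = Path.home()
    if compute_mode == "burst":
        # Burst outputs sit under ~/exports, which is only accessible
        # after the container has been restarted.
        return home / "exports" / experiment_id / "outputs" / "rank_0.txt"
    if compute_mode == "cycle":
        # Cycle outputs are grouped by project name, then experiment ID.
        if not project_name:
            raise ValueError("project_name is required for cycle experiments")
        return home / "outputs" / project_name / experiment_id / "rank_0.txt"
    # Only the two layouts documented above are covered by this sketch.
    raise ValueError(f"no documented log path for compute_mode={compute_mode!r}")
```

Calling `rank0_log_path("burst", "EXPERIMENT_ID")` yields the burst path, while the cycle form also needs the project name.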
Experiment tracking
The status of experiments will be one of the following:
enqueued: The experiment is in the queue to run. Experiments that are enqueued take priority over experiments that are paused. Experiments that are enqueued with compute_mode="cycle" take priority over experiments that are enqueued with compute_mode="interruptible".
interrupting: An experiment with compute_mode="cycle" has been enqueued and your experiment with compute_mode="interruptible" is being interrupted. Your experiment will be returned to the queue and will resume when there are no more experiments with compute_mode="cycle" to be run.
paused: The experiment has completed its cycle or been interrupted and is now waiting in the queue.
running: The experiment is currently running on the ISC. From this state, you can check the logs and see them updating in real time.
cancelling / cancelled: The experiment has been cancelled using the web UI or by running isc cancel <experiment_id>. The experiment will stop if it was running, and will be removed from the queue.
failed: The experiment has encountered an exception during runtime and has been stopped and removed from the queue.
completed: The experiment has completed and exited without an error.
strong_failed: In rare cases, the experiment may fail due to errors occurring in Strong Compute systems or due to hardware failure. If your experiment status is strong_failed, this is our fault, not yours. Be assured we will be investigating what has gone wrong. You can try resubmitting the experiment and resuming from your last checkpoint.
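The priority rules described for enqueued and paused experiments can be read as a sort order. The sketch below is an illustration only and is not the ISC scheduler: the `Experiment` class, `queue_key`, and `queue_order` are hypothetical names, and the handling of ties (original order is kept) is an assumption the text does not specify.

```python
from dataclasses import dataclass


@dataclass
class Experiment:
    experiment_id: str
    status: str        # "enqueued" or "paused"
    compute_mode: str  # "cycle" or "interruptible"


def queue_key(exp: Experiment) -> tuple[int, int]:
    """Smaller keys run first (hypothetical ordering, derived from the status notes)."""
    if exp.status == "enqueued":
        # Among enqueued experiments, cycle runs ahead of interruptible.
        return (0, 0 if exp.compute_mode == "cycle" else 1)
    # Paused experiments wait behind every enqueued experiment.
    return (1, 0)


def queue_order(waiting: list[Experiment]) -> list[Experiment]:
    # sorted() is stable, so experiments with equal keys keep their input order.
    return sorted(waiting, key=queue_key)
```

Under these assumptions, an enqueued cycle experiment sorts ahead of an enqueued interruptible one, and both sort ahead of any paused experiment.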
Burst statuses
For experiments with compute_mode="burst", the following additional experiment statuses will be displayed before and during experiment initialisation.
burst_staged: The experiment has been staged with isc train <experiment-launch-file> but still requires confirmation to launch. In this state, the experiment has incurred no cost to the authorising Organisation. Users must visit the Experiments page on Control Plane and click Launch Burst to start the experiment.
burst_allocating: The experiment has been launched and the ISC is in the process of finding and provisioning a suitable cluster on a commercial cloud provider to run the experiment.
burst_starting: A suitable cluster has been found and provisioned for the experiment, and the ISC is in the process of downloading all necessary data for the experiment to the cluster, including the User's container data and any datasets flagged for use in training.
Experiments are considered "active" if they are "enqueued", "interrupting", "running", or "paused".
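This definition translates directly into a membership check. A minimal sketch follows; `ACTIVE_STATUSES` and `is_active` are hypothetical names chosen for illustration and are not part of the isc CLI.

```python
# The four statuses the documentation counts as "active".
ACTIVE_STATUSES = frozenset({"enqueued", "interrupting", "running", "paused"})


def is_active(status: str) -> bool:
    """Return True if the status string is one of the four active statuses."""
    return status in ACTIVE_STATUSES
```

A script that polls experiment statuses could use such a check to decide whether an experiment still needs watching.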
You can view Training Logs by clicking on View on the Experiments page, and then View next to Training Logs on the Experiment details page. In the window that appears you will see the last 100 lines from the rank_0.txt file in your experiment outputs (refer to Section 3.8 of the Hello World example).
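A similar view can be approximated from inside your container. The sketch below returns the last 100 lines of a log file using a bounded `collections.deque`; the function name `tail_lines` is hypothetical, and the path you pass in is whichever rank_0.txt location applies to your experiment.

```python
from collections import deque
from pathlib import Path


def tail_lines(log_file: Path, count: int = 100) -> list[str]:
    """Return up to the last `count` lines of a text file, without trailing newlines."""
    with log_file.open("r", encoding="utf-8", errors="replace") as handle:
        # A deque with maxlen discards the oldest lines as newer ones are read,
        # so only `count` lines are held in memory at any time.
        return [line.rstrip("\n") for line in deque(handle, maxlen=count)]
```

Reading the file line by line keeps memory use small even when a long training run has produced a large log.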
The Experiment details page also displays logs to help track the progress of experiments running in burst mode, lets you export training artefacts, and lets you view the experiment's TensorBoard logs (if applicable).