Job Management

Understanding failed jobs

A job may fail for many reasons, and checking the logs often points to a quick fix.

  • To check the logs of a cycle experiment, run the following command:

    • cat ~/outputs/PROJECT_NAME/EXPERIMENT_ID/rank_0.txt

  • To check the logs of a burst experiment:

    • First, restart your container (this makes the exports folder accessible). Then, run:

    • cat ~/exports/EXPERIMENT_ID/outputs/rank_0.txt # different from cycle!
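The two log locations above can also be resolved programmatically. The following is a minimal sketch, not part of the ISC tooling: the helper names rank0_log_path and tail are hypothetical, PROJECT_NAME and EXPERIMENT_ID remain placeholders that you supply, and only the cycle and burst layouts described above are covered.

```python
from collections import deque
from pathlib import Path


def rank0_log_path(compute_mode, experiment_id, project_name=None, home=None):
    """Return the rank_0.txt path for a cycle or burst experiment.

    Hypothetical helper: it only mirrors the two paths shown in this section.
    """
    home = Path(home) if home is not None else Path.home()
    if compute_mode == "cycle":
        if project_name is None:
            raise ValueError("cycle logs are stored under a project name")
        return home / "outputs" / project_name / experiment_id / "rank_0.txt"
    if compute_mode == "burst":
        # Burst logs live under ~/exports, which is only accessible
        # after the container has been restarted.
        return home / "exports" / experiment_id / "outputs" / "rank_0.txt"
    raise ValueError(f"no log location documented for compute_mode={compute_mode!r}")


def tail(path, n=100):
    """Return the last n lines of a text file without loading it all into memory."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return [line.rstrip("\n") for line in deque(handle, maxlen=n)]
```

For example, tail(rank0_log_path("cycle", "EXPERIMENT_ID", "PROJECT_NAME"), 20) would return the last 20 log lines of a cycle experiment, provided the file exists.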

Experiment tracking

The status of experiments will be one of the following:

  • enqueued: The experiment is in the queue to run. Enqueued experiments take priority over paused experiments, and enqueued experiments with compute_mode="cycle" take priority over enqueued experiments with compute_mode="interruptible".

  • interrupting: An experiment with compute_mode="cycle" has been enqueued and your experiment with compute_mode="interruptible" is being interrupted. Your experiment will be returned to the queue and will resume when there are no more experiments with compute_mode="cycle" to run.

  • paused: The experiment has completed its cycle or been interrupted and is now waiting in the queue.

  • running: The experiment is currently running on the ISC. In this state, you can check the logs and watch them update in real time.

  • cancelling/cancelled: The experiment has been cancelled using the web UI or by running isc cancel <experiment_id>. The experiment will stop if it was running, and will be removed from the queue.

  • failed: The experiment has encountered an exception during runtime and has been stopped and removed from the queue.

  • completed: The experiment has completed and exited without an error.

  • strong_failed: In rare cases, the experiment may fail due to errors in Strong Compute systems or due to hardware failure. If your experiment status is strong_failed, this is our fault, not yours, and we will investigate what went wrong. You can try resubmitting the experiment and resuming from your last checkpoint.
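The priority rules in the enqueued entry can be illustrated with a small sort. This is only a sketch of the documented ordering, not the ISC scheduler itself: the function name queue_order and the (experiment_id, status, compute_mode) tuple layout are assumptions made for the example. Paused experiments are left in their original relative order because the text above does not state how they are ranked among themselves.

```python
def queue_order(experiments):
    """Order waiting experiments by the priority rules described above.

    `experiments` is a list of (experiment_id, status, compute_mode) tuples.
    Hypothetical illustration only; the real scheduler may apply other rules.
    """
    status_rank = {"enqueued": 0, "paused": 1}       # enqueued before paused
    mode_rank = {"cycle": 0, "interruptible": 1}     # cycle before interruptible

    def key(experiment):
        _, status, mode = experiment
        # Mode priority is only documented for enqueued experiments, so
        # paused experiments all share the same secondary key.
        if status == "enqueued":
            mode_key = mode_rank.get(mode, len(mode_rank))
        else:
            mode_key = 0
        return (status_rank[status], mode_key)

    # Only experiments waiting in the queue are ordered; others are ignored.
    waiting = [e for e in experiments if e[1] in status_rank]
    # sorted() is stable, so ties keep their original relative order.
    return sorted(waiting, key=key)
```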

Burst statuses

For experiments with compute_mode="burst", the following additional statuses are displayed before and during experiment initialisation.

  • burst_staged: The experiment has been staged with isc train <experiment-launch-file> but still requires confirmation to launch. In this state, the experiment has incurred no cost to the authorising Organisation. Users must visit the Experiments page on Control Plane and click Launch Burst to start the experiment.

  • burst_allocating: The experiment has been launched and the ISC is in the process of finding and provisioning a suitable cluster on a commercial cloud provider to run the experiment.

  • burst_starting: A suitable cluster has been found and provisioned for the experiment, and the ISC is downloading all necessary data to the cluster, including the User's container data and any datasets flagged for use in training.
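The three burst statuses describe successive initialisation steps, which a script that monitors experiments might want to tell apart. The sketch below assumes you already have the status string from somewhere (for example, copied from the Experiments page); the names BURST_INIT_STATUSES, burst_init_step, and awaiting_launch are hypothetical.

```python
# Burst initialisation statuses, in the order described above.
BURST_INIT_STATUSES = ("burst_staged", "burst_allocating", "burst_starting")


def burst_init_step(status):
    """Return the 1-based initialisation step for a burst status, else None."""
    if status in BURST_INIT_STATUSES:
        return BURST_INIT_STATUSES.index(status) + 1
    return None


def awaiting_launch(status):
    """True while the experiment is staged and still needs Launch Burst to be clicked."""
    return status == "burst_staged"
```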

Experiments are considered "active" if they are "enqueued", "interrupting", "running", or "paused".
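The definition of "active" translates directly into a membership check. This is a minimal sketch with hypothetical names (ACTIVE_STATUSES, is_active, count_active); the status strings are the ones listed in this section.

```python
# Statuses that this section counts as "active".
ACTIVE_STATUSES = frozenset({"enqueued", "interrupting", "running", "paused"})


def is_active(status):
    """Return True if the given experiment status counts as active."""
    return status in ACTIVE_STATUSES


def count_active(statuses):
    """Count how many statuses in an iterable are active."""
    return sum(1 for status in statuses if is_active(status))
```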

You can view Training Logs by clicking on View on the Experiments page, and then View next to Training Logs on the Experiment details page. In the window that appears you will see the last 100 lines from the rank_0.txt file in your experiment outputs (refer to Section 3.8 of the Hello World example).

The Experiment details page also displays logs that help you track the progress of experiments running in burst mode, lets you export training artefacts, and lets you view the experiment's TensorBoard logs (if applicable).
