Artifacts (Job Results)
Each experiment launched on Strong Compute has four (4) artifacts created for it. Artifacts are volumes for storing data generated by the experiment. Any data generated (or changed) by the experiment that is not saved to an artifact will not persist after the experiment completes.
The four (4) types of artifact are checkpoints, logs, lossy, and CRUD, described below.
To save data to the checkpoints artifact for your experiment, users should save data to the path stored in the environment variable OUTPUT_PATH or CHECKPOINT_ARTIFACT_PATH (which both resolve to the same directory).
Updates to checkpoint artifacts are synchronised from running experiments every 10 minutes, and at the end of training (e.g. at the end of a 90 second cycle experiment).
Checkpoint artifacts are suitable for storing large files such as model weights.
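For example, a PyTorch training script might write its checkpoint into this directory as shown below. This is a minimal sketch: the model, optimizer, and the filename checkpoint.pt are illustrative, not part of the ISC API.

```python
import os

import torch
import torch.nn as nn

# OUTPUT_PATH and CHECKPOINT_ARTIFACT_PATH resolve to the same directory.
checkpoint_dir = os.environ["CHECKPOINT_ARTIFACT_PATH"]

# Illustrative model and optimizer; substitute your own training objects.
model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Save model and optimizer state into the checkpoint artifact so it is
# synchronised every 10 minutes and at the end of training.
torch.save(
    {
        "model_state_dict": model.state_dict(),
        "optimizer_state_dict": optimizer.state_dict(),
    },
    os.path.join(checkpoint_dir, "checkpoint.pt"),
)
```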
Logs artifacts are not intended for users to save data to themselves. Logs artifacts contain a text file for each node running the experiment (e.g. rank_N.txt) containing anything printed to standard out or standard error. The logs artifact should be the first place to look for information to assist in debugging training code.
Updates to the logs artifact are synchronised from running experiments every 10 seconds, and at the end of training (e.g. at the end of a 90 second cycle experiment).
To save data to the lossy artifact for your experiment, users should save data to the path stored in the environment variable LOSSY_ARTIFACT_PATH.
Updates to lossy artifacts are synchronised from running experiments every 30 seconds, and at the end of training (e.g. at the end of a 90 second cycle experiment).
Lossy artifacts are suitable for storing small files that the user needs more frequent access to, such as TensorBoard logs. Despite the name, data stored in the lossy artifact will not be lost.
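For example, a TensorBoard SummaryWriter can be pointed directly at this directory. A minimal sketch; the metric name and values are illustrative.

```python
import os

from torch.utils.tensorboard import SummaryWriter

# Write TensorBoard event files into the lossy artifact so they are
# synchronised back every 30 seconds while the experiment runs.
writer = SummaryWriter(log_dir=os.environ["LOSSY_ARTIFACT_PATH"])

for step in range(100):
    writer.add_scalar("train/loss", 1.0 / (step + 1), step)  # illustrative metric

writer.close()
```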
To save data to the CRUD artifact for your experiment, users should save data to the path stored in the environment variable CRUD_ARTIFACT_PATH.
Updates to CRUD artifacts are synchronised from running experiments every 10 minutes, and at the end of training (e.g. at the end of a 90 second cycle experiment).
CRUD artifacts are suitable for storing large files such as database images.
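For example, an experiment could build a small SQLite database directly inside this directory. A minimal sketch; the filename and table are illustrative.

```python
import os
import sqlite3

crud_dir = os.environ["CRUD_ARTIFACT_PATH"]

# Create (or update) a SQLite database inside the CRUD artifact so it is
# synchronised every 10 minutes and at the end of training.
conn = sqlite3.connect(os.path.join(crud_dir, "results.db"))
conn.execute("CREATE TABLE IF NOT EXISTS metrics (step INTEGER, loss REAL)")
conn.execute("INSERT INTO metrics VALUES (?, ?)", (1, 0.25))
conn.commit()
conn.close()
```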
Click "Sync to Workstation" to download each artifact set to your workstation. Once downloaded, the full path to your experiment artifacts will be as follows.
When you click "Sync to Workstation", the experiment artifacts are downloaded in their state as of that moment. If the experiment is still running, you will need to click "Sync to Workstation" again to update the downloaded artifacts with the latest changes from your running experiment.
You can resume from a previously stopped experiment by passing an optional argument in the experiment launch file as follows.
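A minimal sketch of how this might look in the launch file, assuming it accepts a list of artifact ID strings under this field; the IDs shown are hypothetical placeholders.

```
input_artifact_id_list = ["<checkpoint-artifact-id>", "<lossy-artifact-id>"]
```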
By passing this argument, when your experiment launches, the ISC will first copy the contents of each artifact identified in the input_artifact_id_list into the corresponding output artifact for the new experiment.
Note: if users pass two artifact IDs of the same type (e.g. two checkpoint artifact IDs) then the ISC will attempt to copy the contents of both into the output artifact for the new experiment. This can cause file naming collisions if not handled carefully.
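In training code, resuming then typically amounts to checking whether a checkpoint file is already present in the checkpoint artifact directory when the experiment starts. A minimal sketch that pairs with the save example above; the filename checkpoint.pt is illustrative.

```python
import os

import torch
import torch.nn as nn

checkpoint_dir = os.environ["CHECKPOINT_ARTIFACT_PATH"]
checkpoint_file = os.path.join(checkpoint_dir, "checkpoint.pt")  # illustrative filename

model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# If the launch copied a previous checkpoint artifact into this experiment's
# output artifact, the file will already exist and can be loaded to resume.
if os.path.exists(checkpoint_file):
    state = torch.load(checkpoint_file, map_location="cpu")
    model.load_state_dict(state["model_state_dict"])
    optimizer.load_state_dict(state["optimizer_state_dict"])
```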