Artifacts
Artifacts are payloads of data generated by an experiment and saved in the specific ways outlined below. Any data generated (or changed) by the experiment that is not saved in an artifact will not persist after the completion of the experiment.
Artifacts for an experiment can be found by clicking the Details button for the experiment on the Experiments page.

The four types of artifacts (Checkpoints, Logs, Lossy, and CRUD) are detailed below.
Checkpoint artifacts
Checkpoint Artifacts are the primary mechanism for saving and retrieving outputs of experiments that take the form of large files that must be saved and recovered atomically, such as model training checkpoints.
It is required to use the AtomicDirectory saver from the public Strong Compute cycling_utils GitHub repository to save data to a Checkpoint Artifact.
The AtomicDirectory saver is designed for use by a distributed process group, wherein each process initializes a local instance of the AtomicDirectory saver and may (see below) act synchronously or asynchronously to save checkpoints.
The AtomicDirectory saver works by saving each checkpoint to a new directory, then saving a symlink to that directory indicating it is the latest checkpoint. The symlink can be read upon resuming the experiment to obtain the path to the latest checkpoint directory.
The AtomicDirectory saver accepts the following arguments at initialization (a minimal initialization sketch follows the list):
output_directory: root directory for all Checkpoint outputs from the experiment; this must always be set to the path provided in the $CHECKPOINT_ARTIFACT_PATH environment variable when training on the Strong Compute ISC.
is_master: boolean indicating whether the process running the AtomicDirectory saver is the master rank in the process group.
name: a name for the AtomicDirectory saver; if the user is running multiple AtomicDirectory savers in parallel, each must be given a unique name.
keep_last: the number of previous checkpoints to retain on disk; this should always be set to -1 (the default) when saving Checkpoint Artifacts on Strong Compute. When set to -1, the AtomicDirectory saver will not delete any previous checkpoints from local storage, instead allowing the ISC to ship and delete redundant checkpoints. If set to a value N > 0, then the most recent N checkpoints will be retained locally.
strategy: determines the behaviour of the AtomicDirectory saver within a distributed process group.
strategy = "sync_any" (default) will force_save the checkpoint if any process passes force_save = True. This strategy assumes that the distributed process group is created using torchrun.
strategy = "sync_all" will force_save the checkpoint if and only if all processes pass force_save = True. This strategy assumes that the distributed process group is created using torchrun.
strategy = "async" allows each process to save its own sequence of checkpoints, and each process will force_save its own checkpoint if and only if it passes force_save = True. This strategy makes no assumption about how the process group is created, as no process synchronisation is necessary.
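A minimal initialization sketch, assuming the keyword names documented above and a script launched with torchrun (the name value is a hypothetical example):
>>> import os
>>> from cycling_utils import AtomicDirectory
>>> rank = int(os.environ["RANK"])  # set by torchrun
>>> saver = AtomicDirectory(
>>>     output_directory=os.environ["CHECKPOINT_ARTIFACT_PATH"],  # required on the Strong Compute ISC
>>>     is_master=(rank == 0),   # treat global rank 0 as the master rank
>>>     name="model",            # hypothetical; must be unique if multiple savers run in parallel
>>>     keep_last=-1,            # default; let the ISC ship and delete redundant checkpoints
>>>     strategy="sync_any",     # default; assumes the process group is created using torchrun
>>> )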
Checkpoint Artifacts are synchronized from the training cluster every 10 minutes and/or at the end of each cycle on the Strong Compute ISC. Upon synchronization, the latest symlinked checkpoint/s saved by AtomicDirectory saver/s in the $CHECKPOINT_ARTIFACT_PATH directory will be shipped to individual Checkpoint Artifacts for the experiment. Any non-latest checkpoints saved since the previous Checkpoint Artifact synchronization will be deleted and not shipped.
Saving checkpoints with the AtomicDirectory saver
Example usage of the AtomicDirectory saver on Strong Compute, launching with torchrun, is shown below.
>>> import os
>>> import torch
>>> import torch.distributed as dist
>>> from cycling_utils import AtomicDirectory, atomic_torch_save
>>> dist.init_process_group("nccl")
>>> rank = int(os.environ["RANK"])
>>> output_directory = os.environ["CHECKPOINT_ARTIFACT_PATH"]
>>> # Initialize the AtomicDirectory saver - called by ALL ranks
>>> saver = AtomicDirectory(output_directory, is_master=rank==0)
>>> # Resume from the latest checkpoint if one exists
>>> latest_symlink_file_path = os.path.join(output_directory, saver.symlink_name)
>>> if os.path.exists(latest_symlink_file_path):
>>>     latest_checkpoint_path = os.readlink(latest_symlink_file_path)
>>>     # Load files from latest_checkpoint_path
>>>     checkpoint_path = os.path.join(latest_checkpoint_path, "checkpoint.pt")
>>>     checkpoint = torch.load(checkpoint_path)
>>>     ...
>>> for epoch in epochs:
>>>     for step, batch in enumerate(batches):
>>>         # ...training...
>>>         if is_save_step:
>>>             # Prepare the checkpoint directory - called by ALL ranks
>>>             checkpoint_directory = saver.prepare_checkpoint_directory()
>>>             # Save files to the checkpoint_directory - master rank only
>>>             if rank == 0:
>>>                 checkpoint = {...}
>>>                 checkpoint_path = os.path.join(checkpoint_directory, "checkpoint.pt")
>>>                 atomic_torch_save(checkpoint, checkpoint_path)
>>>             # Finalize the checkpoint with the latest symlink - called by ALL ranks
>>>             saver.symlink_latest(checkpoint_directory)
Saving strategies
The user can force non-latest checkpoints to also ship to Checkpoint Artifacts by calling saver.prepare_checkpoint_directory(force_save=True). This can be used, for example (see the sketch after this list):
to ensure every Nth saved checkpoint is archived for later analysis, or
to ensure that checkpoints are saved each time model performance improves.
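A hedged sketch of both patterns, continuing from the example above (the archive interval, the val_loss and save_count variables, and the improvement test are illustrative, not part of the AtomicDirectory API):
>>> ARCHIVE_EVERY = 10              # hypothetical: archive every 10th saved checkpoint
>>> best_val_loss = float("inf")
>>> ...
>>> if is_save_step:
>>>     improved = val_loss < best_val_loss
>>>     best_val_loss = min(best_val_loss, val_loss)
>>>     force = improved or (save_count % ARCHIVE_EVERY == 0)
>>>     checkpoint_directory = saver.prepare_checkpoint_directory(force_save=force)
>>>     # ...save checkpoint files as in the example above...
>>>     saver.symlink_latest(checkpoint_directory)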
It is important to note the impact of the strategy argument passed to the AtomicDirectory saver at initialization.
If initialized with strategy = "sync_any" (requires that the process group is created with torch.distributed as above):
The call to saver.prepare_checkpoint_directory() will block until all processes in the group reach that point in the code.
One checkpoint will be created as output of the synchronous saving step.
The checkpoint will be tagged with force_save if any process passes saver.prepare_checkpoint_directory(force_save=True).
If initialized with strategy = "sync_all" (requires that the process group is created with torch.distributed as above):
The call to saver.prepare_checkpoint_directory() will block until all processes in the group reach that point in the code.
One checkpoint will be created as output of the synchronous saving step.
The checkpoint will be tagged with force_save if and only if all processes pass saver.prepare_checkpoint_directory(force_save=True).
If initialized with strategy = "async" (does not depend on the method of process group creation):
The call to saver.prepare_checkpoint_directory() will not block processes in the process group.
Each process will create its own unique checkpoint, and can do so asynchronously of the other processes in the group.
Each checkpoint will be tagged with force_save if that process passes saver.prepare_checkpoint_directory(force_save=True).
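A hedged sketch of the "async" strategy, continuing from the example above; the per-rank name and file name are illustrative, and treating each process as the master of its own saver is an assumption rather than documented behaviour:
>>> saver = AtomicDirectory(
>>>     output_directory=os.environ["CHECKPOINT_ARTIFACT_PATH"],
>>>     is_master=True,          # assumption: each process finalizes its own checkpoint sequence
>>>     name=f"rank_{rank}",     # unique name per parallel saver, as required above
>>>     strategy="async",
>>> )
>>> checkpoint_directory = saver.prepare_checkpoint_directory()  # does not block other ranks
>>> atomic_torch_save({"rank": rank}, os.path.join(checkpoint_directory, "state.pt"))
>>> saver.symlink_latest(checkpoint_directory)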
Logs artifacts
Logs artifacts are not intended for users to save data to directly. Logs artifacts contain text files for each node running the experiment (e.g. rank_N.txt) with anything printed to standard out or standard error. The logs artifact should be the first place to look for information to assist in debugging training code.
Updates to the logs artifact are synchronised from running experiments every 10 seconds, and at the end of training (e.g. at the end of a 90 second cycle experiment).
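For illustration, nothing specific to the Logs artifact needs to be called; ordinary standard out and standard error output is captured:
>>> import sys
>>> print("starting epoch 1")                                             # appears in this node's rank_N.txt
>>> print("warning: resuming without optimizer state", file=sys.stderr)   # standard error is captured too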
Lossy artifacts
To save data to the lossy artifact for your experiment, users should save data to the path stored in the environment variable LOSSY_ARTIFACT_PATH.
Updates to the lossy artifact are synchronised from running experiments every 30 seconds, and at the end of training (e.g. at the end of a 90 second cycle experiment). Lossy artifacts are suitable for storing small files that the user needs more frequent access to, such as tensorboard logs.
Contrary to the name, data stored in the lossy artifact will not be lost.
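For example, a minimal sketch of writing TensorBoard logs directly into the lossy artifact path (the tag and values are illustrative):
>>> import os
>>> from torch.utils.tensorboard import SummaryWriter
>>> writer = SummaryWriter(log_dir=os.environ["LOSSY_ARTIFACT_PATH"])
>>> writer.add_scalar("train/loss", 0.25, global_step=100)
>>> writer.flush()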
CRUD artifacts
To save data to the CRUD artifact for your experiment, users should save data to the path stored in the environment variable CRUD_ARTIFACT_PATH.
Updates to CRUD artifacts are synchronised from running experiments every 10 minutes, and at the end of training (e.g. at the end of a 90 second cycle experiment). CRUD artifacts are suitable for storing large files such as database images.
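For example, a minimal sketch of copying a file produced by the experiment into the CRUD artifact path (the file name is hypothetical):
>>> import os
>>> import shutil
>>> crud_dir = os.environ["CRUD_ARTIFACT_PATH"]
>>> shutil.copy("database_dump.img", os.path.join(crud_dir, "database_dump.img"))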
Accessing artifacts in your container
Click "Sync to Workstation" to download each artifact to your workstation. Once downloaded, your experiment artifact will be available at the following location within your container.
/shared/artifacts/<experiment-id>/<type>/
When you click "Sync to Workstation", the experiment artifacts are downloaded in their state as of that moment in time. If the experiment is still running, you will need to click "Sync to Workstation" again to update the artifacts with the latest changes from your running experiment.
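A small sketch of inspecting synced artifacts from inside the container (the experiment ID is a hypothetical placeholder):
>>> import glob
>>> experiment_id = "your-experiment-id"  # replace with the ID shown on the Experiments page
>>> for path in glob.glob(f"/shared/artifacts/{experiment_id}/*/*"):
>>>     print(path)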
Resuming from stopped experiments
You can resume from a previously stopped experiment by passing an optional argument in the experiment launch file as follows.
input_artifact_id_list = [ "checkpoint-artifact-id" ]
By passing this argument, when your experiment launches, the ISC will first copy the contents of the checkpoint artifact identified in the input_artifact_id_list into the $CHECKPOINT_ARTIFACT_PATH for the new experiment.
Experiments that implement the pattern outlined above - wherein a check is performed for checkpoints to resume from prior to training - will then be able to resume from that checkpoint.
Note: if users pass two artifact IDs of the same type (e.g. two checkpoint artifact IDs) then the ISC will attempt to copy the contents of both into the output artifact for the new experiment. This can cause file naming collisions and is not advised.