Artifacts

Artifacts are payloads of data generated by an experiment and saved in the specific ways outlined below. Any data generated (or changed) by the experiment that is not saved in an artifact will not persist after the completion of the experiment.

Artifacts for an experiment can be found by clicking the Details button for the experiment on the Experiments page.

The four types of artifacts (Checkpoints, Logs, Lossy, and CRUD) are detailed below.

Checkpoint artifacts

Checkpoint Artifacts are the primary mechanism for saving and retrieving outputs of experiments that take the form of large files that must be saved and recovered atomically, such as model training checkpoints.

To save data to a Checkpoint Artifact, you must use the AtomicDirectory saver from the public Strong Compute cycling_utils GitHub repository.

The AtomicDirectory saver is designed for use by a distributed process group, wherein each process initializes a local instance of the AtomicDirectory saver and may (see below) act synchronously or asynchronously to save checkpoint/s.

The AtomicDirectory saver works by saving each checkpoint to a new directory, then saving a symlink to that directory indicating it is the latest checkpoint. The symlink can be read upon resuming the experiment to obtain the path to the latest checkpoint directory.

The AtomicDirectory saver accepts the following arguments at initialization (a minimal initialization sketch follows this list):

  • output_directory: root directory for all Checkpoint outputs from the experiment; this must always be set to the path provided in the $CHECKPOINT_ARTIFACT_PATH environment variable when training on the Strong Compute ISC.

  • is_master: boolean to indicate whether the process running the AtomicDirectory saver is the master rank in the process group.

  • name: a name for the AtomicDirectory saver; if the user is running multiple AtomicDirectory savers in parallel, each must be given a unique name.

  • keep_last: the number of previous checkpoints to retain on disk; this should always be set to -1 (the default) when saving Checkpoint Artifacts on Strong Compute. When set to -1, the AtomicDirectory saver will not delete any previous checkpoints from local storage, instead allowing the ISC to ship and delete redundant checkpoints. If set to a value N greater than 0, the most recent N checkpoints will be retained locally.

  • strategy: determines behaviour of the AtomicDirectory saver in its operation within a distributed process group.

    • strategy = "sync_any" (default) will force_save the checkpoint if any process passes force_save = True. This strategy assumes that the distributed process group is created using torchrun.

    • strategy = "sync_all" will force_save the checkpoint if and only if all processes pass force_save = True . This strategy assumes that the distributed process group is created using torchrun.

    • strategy = "async" allows each process to save its own sequence of checkpoints and each process will force_save its own checkpoint if and only if passed force_save = True. This strategy makes no assumption about how the process group is created as no process synchronisation is necessary.

Checkpoint Artifacts are synchronized from the training cluster every 10 minutes and/or at the end of each cycle on the Strong Compute ISC. Upon synchronization, the latest symlinked checkpoint/s saved by AtomicDirectory saver/s in the $CHECKPOINT_ARTIFACT_PATH directory will be shipped to individual Checkpoint Artifacts for the experiment. Any non-latest checkpoints saved since the previous Checkpoint Artifact synchronization will be deleted and not shipped.

Saving checkpoints with the AtomicDirectory saver

Example usage of the AtomicDirectory saver on Strong Compute, launching with torchrun, is as follows.

>>> import os 
>>> import torch
>>> import torch.distributed as dist
>>> from cycling_utils import AtomicDirectory, atomic_torch_save

>>> dist.init_process_group("nccl")
>>> rank = int(os.environ["RANK"]) 
>>> output_directory = os.environ["CHECKPOINT_ARTIFACT_PATH"]

>>> # Initialize the AtomicDirectory - called by ALL ranks
>>> saver = AtomicDirectory(output_directory, is_master=rank==0)

>>> # Resume from checkpoint
>>> latest_symlink_file_path = os.path.join(output_directory, saver.symlink_name)
>>> if os.path.exists(latest_symlink_file_path):
>>>     latest_checkpoint_path = os.readlink(latest_symlink_file_path)

>>>     # Load files from latest_checkpoint_path
>>>     checkpoint_path = os.path.join(latest_checkpoint_path, "checkpoint.pt")
>>>     checkpoint = torch.load(checkpoint_path)
>>>     ...

>>> for epoch in epochs:
>>>     for step, batch in enumerate(batches):

>>>         ...training...

>>>         if is_save_step:
>>>             # prepare the checkpoint directory - called by ALL ranks
>>>             checkpoint_directory = saver.prepare_checkpoint_directory()

>>>             # saving files to the checkpoint_directory
>>>             if rank == 0:
>>>                 checkpoint = {...}
>>>                 checkpoint_path = os.path.join(checkpoint_directory, "checkpoint.pt")
>>>                 atomic_torch_save(checkpoint, checkpoint_path)

>>>             # finalize checkpoint with symlink - called by ALL ranks
>>>             saver.symlink_latest(checkpoint_directory)

Saving strategies

The user can force non-latest checkpoints to also be shipped to Checkpoint Artifacts by calling saver.prepare_checkpoint_directory(force_save=True). This can be used, for example (see the sketch after this list):

  • to ensure every Nth saved checkpoint is archived for later analysis, or

  • to ensure that checkpoints are saved each time model performance improves.
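The following sketch shows the first pattern, reusing saver, rank, and the training loop from the example above under the default "sync_any" strategy; the archiving cadence and the save_count counter are illustrative.

>>> ARCHIVE_EVERY = 10  # illustrative: archive every 10th saved checkpoint
>>> save_count = 0

>>> # inside the training loop, replacing the plain prepare_checkpoint_directory() call
>>> if is_save_step:
>>>     force = (save_count % ARCHIVE_EVERY == 0)
>>>     checkpoint_directory = saver.prepare_checkpoint_directory(force_save=force)

>>>     if rank == 0:
>>>         checkpoint = {...}
>>>         atomic_torch_save(checkpoint, os.path.join(checkpoint_directory, "checkpoint.pt"))

>>>     saver.symlink_latest(checkpoint_directory)
>>>     save_count += 1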

It is important to note the impact of the strategy argument passed to AtomicDirectory at initialization (a sketch of the "sync_all" case follows the lists below).

If initialized with strategy = "sync_any" (requires that the process group is created with torch.distributed as above):

  • The call to saver.prepare_checkpoint_directory() will block until all processes in the group reach that point in the code.

  • One checkpoint will be created as output of the synchronous saving step.

  • The checkpoint will be tagged with force_save if any process passes saver.prepare_checkpoint_directory(force_save=True).

If initialized with strategy = "sync_all" (requires that the process group is created with torch.distributed as above):

  • The call to saver.prepare_checkpoint_directory() will block until all processes in the group reach that point in the code.

  • One checkpoint will be created as output of the synchronous saving step.

  • The checkpoint will be tagged with force_save if and only if all processes pass saver.prepare_checkpoint_directory(force_save=True).

If initialized with strategy = "async" (does not depend on method of process group creation):

  • The call to saver.prepare_checkpoint_directory() will not block processes in the process group.

  • Each process will create its own unique checkpoint, and can do so asynchronously with respect to the other processes in the group.

  • Each checkpoint will be tagged with force_save if that process passes saver.prepare_checkpoint_directory(force_save=True).
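As an illustration of the "sync_all" case, the sketch below has the master rank decide whether validation performance has improved and broadcast that decision so that every rank passes the same force_save value. It reuses rank, saver, dist, and the torchrun process group from the example above; the validation metric names are illustrative.

>>> # assumes the saver was initialized with strategy="sync_all"
>>> improved = [False]
>>> if rank == 0:
>>>     improved[0] = bool(val_loss < best_val_loss)  # illustrative metric check
>>> dist.broadcast_object_list(improved, src=0)

>>> # every rank passes the same force_save value, so the checkpoint is force_saved
>>> # if and only if all ranks agree that performance improved
>>> checkpoint_directory = saver.prepare_checkpoint_directory(force_save=improved[0])
>>> if rank == 0:
>>>     atomic_torch_save({...}, os.path.join(checkpoint_directory, "checkpoint.pt"))
>>> saver.symlink_latest(checkpoint_directory)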

Logs artifacts

Logs artifacts are not intended for users to save data to directly. Logs artifacts contain text files for each node running the experiment (e.g. rank_N.txt) capturing anything printed to standard out or standard error. The Logs artifact should be the first place to look for information to assist in debugging training code.

Updates to the logs artifact are synchronised from running experiments every 10 seconds, and at the end of training (e.g. at the end of a 90 second cycle experiment).

Lossy artifacts

To save data to the Lossy artifact for your experiment, save the data to the path stored in the environment variable $LOSSY_ARTIFACT_PATH.

  • Updates to the lossy artifacts are synchronised from running experiments every 30 seconds, and at the end of training (e.g. at the end of a 90 second cycle experiment).

  • Lossy artifacts are suitable for storing small files that the user needs more frequent access to, such as TensorBoard logs (see the sketch after this list).

  • Contrary to the name, data stored in the lossy artifact will not be lost.
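For example, a minimal sketch of writing TensorBoard logs directly into the Lossy artifact path is shown below; the metric name, value, and step are illustrative.

>>> import os
>>> from torch.utils.tensorboard import SummaryWriter

>>> # write TensorBoard event files directly into the Lossy artifact path
>>> writer = SummaryWriter(log_dir=os.environ["LOSSY_ARTIFACT_PATH"])
>>> writer.add_scalar("train/loss", 0.123, 0)  # illustrative metric, value, and step
>>> writer.flush()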

CRUD artifacts

To save data to the CRUD artifact for your experiment, save the data to the path stored in the environment variable $CRUD_ARTIFACT_PATH.

  • Updates to CRUD artifacts are synchronised from running experiments every 10 minutes, and at the end of training (e.g. at the end of a 90 second cycle experiment).

  • CRUD artifacts are suitable for storing large files such as database images (see the sketch below).
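A minimal sketch of copying a large file into the CRUD artifact path is shown below; the source file path is an illustrative placeholder.

>>> import os
>>> import shutil

>>> # copy a large file (e.g. a database image) into the CRUD artifact path
>>> crud_path = os.environ["CRUD_ARTIFACT_PATH"]
>>> shutil.copy("/path/to/database.img", os.path.join(crud_path, "database.img"))  # illustrative source path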

Accessing artifacts in your container

Click "Sync to Workstation" to download each artifact to your workstation. Once downloaded, your experiment artifact will be available at the following location within your container.

/shared/artifacts/<experiment-id>/<type>/

When you click "Sync to Workstation", the experiment artifacts are downloaded in their state as of that moment in time. If the experiment is still running, you will need to click "Sync to Workstation" again to update the artifacts with the latest changes from your running experiment.
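For example, a quick way to inspect a synced artifact from inside your container is to list its contents; replace the placeholders with your experiment ID and artifact type.

>>> import glob

>>> # list the files synced for one artifact of the experiment
>>> artifact_files = glob.glob("/shared/artifacts/<experiment-id>/<type>/*")
>>> print(artifact_files)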

Resuming from stopped experiments

You can resume from a previously stopped experiment by passing an optional argument in the experiment launch file as follows.

input_artifact_id_list = [ "checkpoint-artifact-id" ]

By passing this argument, when your experiment launches, the ISC will first copy the contents of the checkpoint artifact identified in the input_artifact_id_list into the $CHECKPOINT_ARTIFACT_PATH for the new experiment.

Experiments that implement the pattern outlined above - wherein a check is performed for checkpoints to resume from prior to training - will then be able to resume from that checkpoint.

Note: if users pass two artifact IDs of the same type (e.g. two checkpoint artifact IDs) then the ISC will attempt to copy the contents of both into the output artifact for the new experiment. This can cause file naming collisions and is not advised.
