Artifacts
Experiments launched on Strong Compute each have four (4) artifacts created for them. Artifacts are volumes for storing data generated by the experiment; any data generated (or changed) by the experiment that is not saved in an artifact will not persist after the experiment completes.
The four (4) types of artifacts are: Checkpoint, Logs, Lossy, and CRUD.
Checkpoint Artifacts are the primary mechanism for saving and retrieving outputs of experiments that take the form of large files that must be saved and retrieved atomically, such as model training checkpoints.
The AtomicDirectory saver accepts the following arguments at initialization:

- output_directory: root directory for all Checkpoint outputs from the experiment; this should always be set to the path provided in the $CHECKPOINT_ARTIFACT_PATH environment variable when training on the Strong Compute ISC.
- is_master: boolean indicating whether the process running the AtomicDirectory saver is the master rank in the process group.
- name: a name for the AtomicDirectory saver; if the user is running multiple AtomicDirectory savers in parallel, each must be given a unique name.
- keep_last: the number of previous checkpoints to retain on disk; this should always be set to -1 (the default) when saving Checkpoint Artifacts on Strong Compute. When set to -1, the AtomicDirectory saver will not delete any previous checkpoints from local storage, instead allowing the ISC to ship and delete redundant checkpoints. If set to a value N > 0, the most recent N checkpoints will be retained locally.
Checkpoint Artifacts are synchronized from the training cluster every 10 minutes and/or at the end of each cycle on the Strong Compute ISC. Upon synchronization, the latest symlinked checkpoint(s) saved by AtomicDirectory saver(s) in the $CHECKPOINT_ARTIFACT_PATH directory will be shipped to individual Checkpoint Artifacts for the experiment. Any non-latest checkpoints saved since the previous Checkpoint Artifact synchronization will be deleted and not shipped.
Example usage of the AtomicDirectory saver on Strong Compute, launching with torchrun, is sketched below.
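This sketch assumes the AtomicDirectory saver is importable from the Strong Compute cycling_utils package; the symlink_latest finalize method name, the model, and the training loop are illustrative placeholders, so consult the repository for the exact API.

```python
import os
import torch
import torch.distributed as dist

# Assumption: the AtomicDirectory saver ships in the public cycling_utils package.
from cycling_utils import AtomicDirectory

def main():
    dist.init_process_group(backend="nccl")  # torchrun provides RANK / WORLD_SIZE / LOCAL_RANK
    rank = int(os.environ["RANK"])
    device = torch.device("cuda", int(os.environ["LOCAL_RANK"]))

    # Root all checkpoint output at the path provided by the ISC.
    output_directory = os.environ["CHECKPOINT_ARTIFACT_PATH"]
    saver = AtomicDirectory(
        output_directory=output_directory,
        is_master=(rank == 0),
        name="model",   # must be unique if running multiple savers in parallel
        keep_last=-1,   # let the ISC ship and delete redundant checkpoints
    )

    model = torch.nn.Linear(128, 10).to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    for step in range(1000):
        # ... forward pass, loss, backward pass, optimizer.step() ...

        if step % 50 == 0:
            # Ask the saver for a fresh directory to write this checkpoint into.
            checkpoint_directory = saver.prepare_checkpoint_directory()

            if rank == 0:
                torch.save(
                    {
                        "model": model.state_dict(),
                        "optimizer": optimizer.state_dict(),
                        "step": step,
                    },
                    os.path.join(checkpoint_directory, "checkpoint.pt"),
                )

            # Publish the new directory as the latest checkpoint via symlink.
            # Assumption: the finalize method is named symlink_latest.
            saver.symlink_latest(checkpoint_directory)

if __name__ == "__main__":
    main()
```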
The user can force non-latest checkpoints to also ship to Checkpoint Artifacts by calling saver.prepare_checkpoint_directory(force_save=True). This can be used, for example:

- to ensure every Nth saved checkpoint is archived for later analysis, or
- to ensure that checkpoints are saved each time model performance improves.
It is important to note that when calling saver.prepare_checkpoint_directory(force_save=True), each rank must agree on the force_save argument. If this argument is set on the basis of a loss or accuracy measure, for example, it is recommended to all-reduce that measure across the process group so that all ranks agree.
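A minimal sketch of that recommendation follows, assuming a validation loss measure; the helper name and variables are illustrative.

```python
import torch
import torch.distributed as dist

def improved_globally(local_val_loss: float, best_val_loss: float, device) -> bool:
    """Return True on every rank if the group-averaged validation loss improved."""
    loss = torch.tensor([local_val_loss], device=device)
    # Sum the local measures across the process group, then average, so every
    # rank compares exactly the same number and agrees on force_save.
    dist.all_reduce(loss, op=dist.ReduceOp.SUM)
    mean_loss = loss.item() / dist.get_world_size()
    return mean_loss < best_val_loss

# Every rank computes the same boolean, so force_save is consistent group-wide:
# checkpoint_directory = saver.prepare_checkpoint_directory(
#     force_save=improved_globally(val_loss, best_val_loss, device)
# )
```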
Logs artifacts are not intended for users to save data to directly. Logs artifacts contain text files for each node running the experiment (e.g. rank_N.txt) with anything printed to standard out or standard error. The logs artifact should be the first place to look for information to assist in debugging training code.

Updates to the logs artifact are synchronized from running experiments every 10 seconds, and at the end of training (e.g. at the end of a 90 second cycle experiment).
To save data to the lossy artifact for your experiment, users should save data to the path stored in the environment variable LOSSY_ARTIFACT_PATH.

Updates to the lossy artifact are synchronized from running experiments every 30 seconds, and at the end of training (e.g. at the end of a 90 second cycle experiment).

Lossy artifacts are suitable for storing small files that the user needs more frequent access to, such as TensorBoard logs. Contrary to the name, data stored in the lossy artifact will not be lost.
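For example, a minimal sketch of writing TensorBoard event files directly into the lossy artifact path, assuming PyTorch's bundled SummaryWriter is available:

```python
import os
from torch.utils.tensorboard import SummaryWriter

# Point TensorBoard's event files at the lossy artifact path so they are
# picked up by the 30 second synchronization.
writer = SummaryWriter(log_dir=os.environ["LOSSY_ARTIFACT_PATH"])

for step in range(100):
    loss = 1.0 / (step + 1)  # placeholder metric
    writer.add_scalar("train/loss", loss, global_step=step)

writer.flush()
writer.close()
```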
To save data to the CRUD artifact for your experiment, users should save data to the path stored in the environment variable CRUD_ARTIFACT_PATH.

Updates to the CRUD artifact are synchronized from running experiments every 10 minutes, and at the end of training (e.g. at the end of a 90 second cycle experiment).
CRUD artifacts are suitable for storing large files such as database images.
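For instance, a minimal sketch of placing a large file into the CRUD artifact path; the source filename here is illustrative.

```python
import os
import shutil

# Copy a large file (e.g. a database image) into the CRUD artifact path so it
# is shipped with the 10 minute synchronization.
crud_directory = os.environ["CRUD_ARTIFACT_PATH"]
shutil.copy2("/data/experiment.sqlite", os.path.join(crud_directory, "experiment.sqlite"))
```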
Click "Sync to Workstation" to download each artifact set to your workstation. Once downloaded, the full path to your experiment artifacts will be as follows.
When you click "Sync to Workstation", the experiment artifacts are downloaded in their state at that moment in time. If the experiment is still running, you will need to click "Sync to Workstation" again to update the artifacts with the latest changes from your running experiment.
You can resume from a previously stopped experiment by passing an optional argument in the experiment launch file as follows.
By passing this argument, when your experiment launches, the ISC will first copy the contents of each artifact identified in the input_artifact_id_list into the output artifact for the new experiment.
Note: if users pass two artifact IDs of the same type (e.g. two checkpoint artifact IDs) then the ISC will attempt to copy the contents of both into the output artifact for the new experiment. This can cause file naming collisions if not handled carefully.
To save data to a Checkpoint Artifact, it is necessary to use the AtomicDirectory saver from the public Strong Compute GitHub repository. The AtomicDirectory saver works by saving each checkpoint to a new directory, then saving a symlink to that directory indicating it is the latest checkpoint. The symlink can be read upon resume to obtain the path to the latest checkpoint directory.
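On resume, a hedged sketch of locating the latest checkpoint via that symlink might look as follows; the symlink name and checkpoint filename are assumptions, so check the AtomicDirectory saver for the exact names it writes.

```python
import os
import torch

output_directory = os.environ["CHECKPOINT_ARTIFACT_PATH"]

# Assumption: the saver publishes a symlink named "latest" in the output directory.
latest_symlink = os.path.join(output_directory, "latest")

if os.path.exists(latest_symlink):
    # Resolve the symlink to the most recent checkpoint directory and restore state.
    latest_checkpoint_directory = os.path.realpath(latest_symlink)
    checkpoint = torch.load(os.path.join(latest_checkpoint_directory, "checkpoint.pt"))
    # model.load_state_dict(checkpoint["model"])
    # optimizer.load_state_dict(checkpoint["optimizer"])
else:
    checkpoint = None  # no checkpoint yet: start training from scratch
```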