Resuming Experiments
Why You Might Want to Do This:
If your job failed but saved checkpoint progress that you want to reuse
If your job is still running or has succeeded, but you want to launch a new job from one of its checkpoints
Steps to Resume
A → Retrieving your latest checkpoint
Click the green "Outputs" button for the experiment you'd like to resume from.

Find the line which says "checkpoints".
Click the "Sync to Workstation" button on the far right side of that line.
Wait for the notification in the bottom right to change to "Syncing Complete!".

B → Add to ISC file and relaunch!
Resume Training:
Retrieve the artifact ID for the checkpoint from the UI.
Pass the ID into the ISC file using the argument input_artifact_id_list.
Example ISC file:
isc_project_id = "sameAsBefore"
# ... existing details (replace this line with the rest of your ISC file)
input_artifact_id_list = [ "ID_OF_CHECKPOINT_ARTIFACT_IN_UI" ]
Relaunch the experiment and it should resume from the checkpoint! 🎉
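If you relaunch often, editing the ISC file by hand each time can be error-prone. The sketch below shows one way the edit in step B might be scripted. It is not part of the ISC tooling: the helper name set_input_artifacts is hypothetical, and it assumes the ISC file is a flat list of key = value lines like the example above, with input_artifact_id_list written on a single line.

```python
import json
import re

# Matches an existing single-line `input_artifact_id_list = [...]` assignment.
_KEY_LINE = re.compile(r"^[ \t]*input_artifact_id_list[ \t]*=.*$", re.MULTILINE)


def set_input_artifacts(isc_text: str, artifact_ids: list[str]) -> str:
    """Return isc_text with input_artifact_id_list set to artifact_ids.

    Hypothetical helper. Assumes a flat key = value file as in the example
    ISC file; it does not handle sections or multi-line lists.
    """
    if not artifact_ids or not all(isinstance(a, str) and a for a in artifact_ids):
        raise ValueError("artifact_ids must be a non-empty list of non-empty strings")
    # json.dumps yields a double-quoted, comma-separated list such as ["abc"].
    new_line = f"input_artifact_id_list = {json.dumps(artifact_ids)}"
    if _KEY_LINE.search(isc_text):
        # Replace the existing assignment rather than adding a duplicate key.
        return _KEY_LINE.sub(lambda _m: new_line, isc_text, count=1)
    # Otherwise append the assignment as a new final line.
    return isc_text.rstrip("\n") + "\n" + new_line + "\n"


if __name__ == "__main__":
    original = 'isc_project_id = "sameAsBefore"\n'
    # Placeholder ID from the example above; use the real ID shown in the UI.
    print(set_input_artifacts(original, ["ID_OF_CHECKPOINT_ARTIFACT_IN_UI"]))
```

Replacing an existing line instead of always appending keeps the key from appearing twice when you resume a second time from a newer checkpoint.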