More Examples & Demos
The following examples further demonstrate how to implement interruptibility in distributed training scripts using checkpointing, atomic saving, and stateful samplers.
These examples are under active development toward three milestones: [1] interruptibility in distributed training, [2] verified completion of a full training run, and [3] reproduction of benchmark performance published by others (where applicable). Each example listed below is annotated with its degree of completion. Examples annotated with [0] are "coming soon".
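As a rough orientation before the list, the sketch below shows how checkpointing, atomic saving, and a stateful sampler can fit together. It is not taken from any of the examples listed here. It uses only the Python standard library, and the names `atomic_save`, `StatefulSampler`, and `train` are hypothetical, chosen for illustration. `pickle` and the `seen` list stand in for a framework serializer and for model and optimizer state.

```python
import os
import pickle
import random
import tempfile


def atomic_save(state, path):
    """Write to a temp file in the target directory, then rename it over `path`."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(state, f)
            f.flush()
            os.fsync(f.fileno())  # push bytes to disk before the rename
        # The rename is atomic on POSIX when both paths share a filesystem,
        # so a reader sees either the old checkpoint or the new one.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


class StatefulSampler:
    """Hypothetical sampler that remembers its position in a seeded shuffle."""

    def __init__(self, num_items, seed=0):
        self.num_items = num_items
        self.seed = seed
        self.epoch = 0
        self.position = 0

    def __iter__(self):
        # The same (seed, epoch) pair always yields the same order,
        # so a resumed run can skip exactly the indices already consumed.
        order = list(range(self.num_items))
        random.Random(self.seed + self.epoch).shuffle(order)
        while self.position < self.num_items:
            index = order[self.position]
            self.position += 1
            yield index
        self.epoch += 1
        self.position = 0

    def state_dict(self):
        return {"epoch": self.epoch, "position": self.position}

    def load_state_dict(self, state):
        self.epoch = state["epoch"]
        self.position = state["position"]


def train(checkpoint_path, num_items=10, num_epochs=2, stop_after=None):
    """Toy loop: resume from a checkpoint if present, save after every step."""
    sampler = StatefulSampler(num_items)
    seen = []  # stand-in for model and optimizer state
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path, "rb") as f:
            checkpoint = pickle.load(f)
        sampler.load_state_dict(checkpoint["sampler"])
        seen = checkpoint["seen"]
    steps = 0
    while sampler.epoch < num_epochs:
        for index in sampler:
            seen.append(index)  # stand-in for one training step
            atomic_save({"sampler": sampler.state_dict(), "seen": seen},
                        checkpoint_path)
            steps += 1
            if stop_after is not None and steps >= stop_after:
                return seen  # simulate an interruption
    return seen
```

Calling `train(path, stop_after=7)` and then `train(path)` again processes the same indices in the same order as a single uninterrupted call to `train` on a fresh path. Saving after every step is only for illustration; a real script would typically checkpoint at a chosen interval.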
Hello World
pytorch-image-models (timm)
(from https://github.com/huggingface/pytorch-image-models)
Torchvision segmentation
(from https://github.com/pytorch/vision/tree/main/references/segmentation)
Torchvision detection
(from https://github.com/pytorch/vision/tree/main/references/detection)
Detectron2
(from https://github.com/facebookresearch/detectron2)
Large Language Models (LLM)