More Examples & Demos
Last updated
The following examples further demonstrate how to implement interruptibility in distributed training scripts using checkpointing, atomic saving, and stateful samplers.
These examples are under active development toward three milestones: [1] interruptibility in distributed training, [2] verified completion of a full training run, and [3] reproduction of benchmark performance published by others (where applicable). Each example below is annotated with the highest milestone it has reached. Examples annotated with [0] are "coming soon".
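The three techniques named above (checkpointing, atomic saving, and a stateful sampler) can be illustrated with a minimal, framework-free sketch. Everything in it is an assumption made for illustration: `atomic_save`, `StatefulSampler`, and `train` are hypothetical names, not the API used by the examples listed on this page, and `pickle` stands in for whatever serializer a real training script would use (for instance `torch.save`).

```python
import os
import pickle
import random
import tempfile


def atomic_save(state, path):
    """Write to a temp file in the target directory, then rename it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(state, f)
            f.flush()
            os.fsync(f.fileno())
        # os.replace is atomic when source and target share a filesystem,
        # so a reader sees either the old checkpoint or the new one.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


class StatefulSampler:
    """Hypothetical sampler that remembers its position within the epoch."""

    def __init__(self, num_samples, seed=0):
        self.num_samples = num_samples
        self.seed = seed
        self.epoch = 0
        self.position = 0

    def __iter__(self):
        # The shuffle depends only on seed and epoch, so it is reproducible
        # after a restart.
        order = list(range(self.num_samples))
        random.Random(self.seed + self.epoch).shuffle(order)
        while self.position < self.num_samples:
            index = order[self.position]
            self.position += 1
            yield index
        self.epoch += 1
        self.position = 0

    def state_dict(self):
        return {"epoch": self.epoch, "position": self.position}

    def load_state_dict(self, state):
        self.epoch = state["epoch"]
        self.position = state["position"]


def train(path, num_samples=8, stop_after=None):
    """Toy loop: resume from `path` if it exists, checkpoint after every step."""
    sampler = StatefulSampler(num_samples)
    seen = []
    if os.path.exists(path):
        with open(path, "rb") as f:
            checkpoint = pickle.load(f)
        sampler.load_state_dict(checkpoint["sampler"])
        seen = checkpoint["seen"]
    for step, index in enumerate(sampler):
        seen.append(index)  # stand-in for one training step
        atomic_save({"sampler": sampler.state_dict(), "seen": seen}, path)
        if stop_after is not None and step + 1 == stop_after:
            return seen  # simulate an interruption
    return seen
```

Calling `train(path, stop_after=3)` and then `train(path)` visits the samples in the same order as a single uninterrupted run, which is the property a stateful sampler is meant to provide. Writing to a temporary file first means an interruption during saving leaves the previous checkpoint intact.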
Hello World
Fashion MNIST | Hello World | CNN | [3]
ImageNet | Image classification | ResNet50 | [3]
DeepSeek | Large Language Models | DeepSeek-R1 Distillation | [2]
Chess Hackathon | Regression | Various | [3]
CIFAR100 | Image classification | ResNet50 | [2]
pytorch-image-models (timm)
resnet50 | Image classification | ResNet50 | [2]
resnet152 | Image classification | ResNet152 | [2]
efficientnet_b0 | Image classification | EfficientNet B0 | [2]
efficientnet_b7 | Image classification | EfficientNet B7 | [2]
efficientnetv2_s | Image classification | EfficientNetV2 S | [2]
efficientnetv2_xl | Image classification | EfficientNetV2 XL | [2]
vit_base_patch16_224 | Image classification | ViT Base Patch16 224 | [2]
vit_large_patch16_224 | Image classification | ViT Large Patch16 224 | [2]
Torchvision segmentation
fcn_resnet101 | Image segmentation | ResNet101 | [2]
deeplabv3_mobilenet_v3_large | Image segmentation | MobileNetV3 Large | [2]
Torchvision detection
maskrcnn_resnet101_fpn | Object detection | Mask RCNN (ResNet101 FPN) | [2]
retinanet_resnet101_fpn | Object detection | RetinaNet (ResNet101 FPN) | [2]
Detectron2
detectron2 | TBC | Detectron2 | [2]
detectron2_densepose | TBC | Detectron2 | [2]
Large Language Models (LLM)
Llama2 | LoRA | Llama2 | [0]
Mistral | TBC | Mistral | [0]