More Examples & Demos
The following examples further demonstrate how to make distributed training scripts interruptible using checkpointing, atomic saving, and stateful samplers.

These examples are under active development toward three milestones: [1] interruptibility in distributed training, [2] a verified, complete training run, and [3] benchmark performance matching results published elsewhere (where applicable). Each example below is annotated with the milestone it has reached; examples annotated [0] are coming soon.
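For orientation, the sketch below shows the general pattern the examples build on. It is illustrative only, not this project's API: the names `SequentialStatefulSampler`, `atomic_save`, and `train` are assumptions made for this sketch. It bundles model, optimizer, and sampler state into one checkpoint, writes the checkpoint atomically, and records how far the sampler has progressed through the current epoch so a restart can resume mid-epoch.

```python
# Minimal, generic sketch of interruptible training (illustrative names only).
import os
import tempfile

import torch
from torch.utils.data import DataLoader, Sampler


class SequentialStatefulSampler(Sampler):
    """A sampler whose position can be saved and restored across restarts."""

    def __init__(self, data_source):
        self.data_source = data_source
        self.start_index = 0  # where to resume within the current epoch

    def __iter__(self):
        for idx in range(self.start_index, len(self.data_source)):
            self.start_index = idx + 1
            yield idx
        self.start_index = 0  # epoch finished; the next epoch starts from 0

    def __len__(self):
        return len(self.data_source) - self.start_index

    def state_dict(self):
        return {"start_index": self.start_index}

    def load_state_dict(self, state):
        self.start_index = state["start_index"]


def atomic_save(obj, path):
    """Write to a temporary file, then rename, so an interruption mid-write
    never leaves a corrupt checkpoint behind."""
    tmp_fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    os.close(tmp_fd)
    torch.save(obj, tmp_path)
    os.replace(tmp_path, path)  # atomic rename on POSIX filesystems


def train(model, optimizer, dataset, ckpt_path="checkpoint.pt", epochs=3):
    sampler = SequentialStatefulSampler(dataset)
    start_epoch = 0

    # Resume if an earlier (possibly interrupted) run left a checkpoint.
    if os.path.exists(ckpt_path):
        ckpt = torch.load(ckpt_path)
        model.load_state_dict(ckpt["model"])
        optimizer.load_state_dict(ckpt["optimizer"])
        sampler.load_state_dict(ckpt["sampler"])
        start_epoch = ckpt["epoch"]

    loader = DataLoader(dataset, batch_size=8, sampler=sampler)
    for epoch in range(start_epoch, epochs):
        for x, y in loader:
            loss = torch.nn.functional.cross_entropy(model(x), y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            # Checkpoint every step for simplicity; a real script would save
            # less often, or on a preemption signal.
            atomic_save(
                {
                    "model": model.state_dict(),
                    "optimizer": optimizer.state_dict(),
                    "sampler": sampler.state_dict(),
                    "epoch": epoch,
                },
                ckpt_path,
            )
```

The rename step is what makes the save atomic: a restarted job only ever sees either the previous complete checkpoint or the new one, never a partially written file.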
Hello World
| Example | Task | Model | Status |
| --- | --- | --- | --- |
| Fashion MNIST | Image classification | CNN | [3] |
| CIFAR100 | Image classification | ResNet50 | [2] |
| Distributed Model Parallel | TBC | TBC | [0] |
pytorch-image-models (timm)
(from https://github.com/huggingface/pytorch-image-models)
| Example | Task | Model | Status |
| --- | --- | --- | --- |
| resnet50 | Image classification | ResNet50 | [2] |
| resnet152 | Image classification | ResNet152 | [2] |
| efficientnet_b0 | Image classification | EfficientNet B0 | [2] |
| efficientnet_b7 | Image classification | EfficientNet B7 | [2] |
| efficientnetv2_s | Image classification | EfficientNetV2 S | [2] |
| efficientnetv2_xl | Image classification | EfficientNetV2 XL | [2] |
| vit_base_patch16_224 | Image classification | ViT Base Patch16 224 | [2] |
| vit_large_patch16_224 | Image classification | ViT Large Patch16 224 | [2] |
Torchvision segmentation
(from https://github.com/pytorch/vision/tree/main/references/segmentation)
| Example | Task | Model | Status |
| --- | --- | --- | --- |
| fcn_resnet101 | Image segmentation | ResNet101 | [2] |
| deeplabv3_mobilenet_v3_large | Image segmentation | MobileNetV3 Large | [2] |
Torchvision detection
(from https://github.com/pytorch/vision/tree/main/references/detection)
| Example | Task | Model | Status |
| --- | --- | --- | --- |
| maskrcnn_resnet101_fpn | Object detection | Mask R-CNN (ResNet101 FPN) | [2] |
| retinanet_resnet101_fpn | Object detection | RetinaNet (ResNet101 FPN) | [2] |
Detectron2
(from https://github.com/facebookresearch/detectron2)
| Example | Task | Model | Status |
| --- | --- | --- | --- |
| detectron2 | TBC | Detectron2 | [2] |
| detectron2_densepose | TBC | Detectron2 | [2] |
Large Language Models (LLM)
| Example | Task | Model | Status |
| --- | --- | --- | --- |
| Llama2 | LoRA | Llama2 | [0] |
| Mistral | TBC | Mistral | [0] |