More Examples & Demos

The following examples further demonstrate how to implement interruptibility in distributed training scripts using checkpointing, atomic saving, and stateful samplers.

These examples are under active development toward three milestones: [1] interruptibility in distributed training, [2] verified completion of a full training run, and [3] reproduction of benchmark performance published by others (where applicable). Each example below is annotated with its degree of completion. Examples annotated with [0] are "coming soon".
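As a rough illustration of the three techniques named above (checkpointing, atomic saving, and a stateful sampler), the following sketch uses only the Python standard library. It is not taken from any of the examples listed on this page: `StatefulSampler`, `atomic_save`, and `run` are hypothetical names, `pickle` stands in for whatever serializer a real training script would use, and the "training step" is just a list append.

```python
import os
import pickle
import random
import tempfile


class StatefulSampler:
    """Hypothetical sampler that remembers its position within an epoch."""

    def __init__(self, num_samples, seed=0):
        self.num_samples = num_samples
        self.seed = seed
        self.epoch = 0
        self.position = 0  # index of the next sample to yield in this epoch

    def _order(self):
        # Deterministic shuffle per epoch, so a resumed run sees the same order.
        rng = random.Random(self.seed + self.epoch)
        order = list(range(self.num_samples))
        rng.shuffle(order)
        return order

    def __iter__(self):
        order = self._order()
        while self.position < self.num_samples:
            idx = order[self.position]
            self.position += 1
            yield idx
        # Epoch finished: move on to the next one.
        self.epoch += 1
        self.position = 0

    def state_dict(self):
        return {"epoch": self.epoch, "position": self.position}

    def load_state_dict(self, state):
        self.epoch = state["epoch"]
        self.position = state["position"]


def atomic_save(obj, path):
    """Write to a temporary file, then rename it over the target path."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(obj, f)
            f.flush()
            os.fsync(f.fileno())
        # os.replace is atomic when source and target share a filesystem,
        # so a reader sees either the old checkpoint or the new one.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def run(path, num_samples, stop_after=None):
    """Process one epoch, resuming from `path` if a checkpoint exists."""
    sampler = StatefulSampler(num_samples)
    seen = []
    if os.path.exists(path):
        with open(path, "rb") as f:
            ckpt = pickle.load(f)
        sampler.load_state_dict(ckpt["sampler"])
        seen = ckpt["seen"]
    for idx in sampler:
        seen.append(idx)  # stand-in for a real training step
        atomic_save({"sampler": sampler.state_dict(), "seen": seen}, path)
        if stop_after is not None and len(seen) >= stop_after:
            break  # simulate an interruption
    return seen
```

Calling `run(path, 8, stop_after=3)` and then `run(path, 8)` with the same path should visit the same samples, in the same order, as a single uninterrupted `run` against a fresh path. Checkpointing after every step keeps the sketch short; a real script would likely save less often.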

Hello World

| Title | Description | Model | Status | Link |
| --- | --- | --- | --- | --- |
| Fashion MNIST | Image classification | CNN | [3] | |
| CIFAR100 | Image classification | ResNet50 | [2] | |
| Distributed Model Parallel | TBC | TBC | [0] | |

pytorch-image-models (timm)

(from https://github.com/huggingface/pytorch-image-models)

| Title | Description | Model | Status | Link |
| --- | --- | --- | --- | --- |
| resnet50 | Image classification | ResNet50 | [2] | |
| resnet152 | Image classification | ResNet152 | [2] | |
| efficientnet_b0 | Image classification | EfficientNet B0 | [2] | |
| efficientnet_b7 | Image classification | EfficientNet B7 | [2] | |
| efficientnetv2_s | Image classification | EfficientNetV2 S | [2] | |
| efficientnetv2_xl | Image classification | EfficientNetV2 XL | [2] | |
| vit_base_patch16_224 | Image classification | ViT Base Patch16 224 | [2] | |
| vit_large_patch16_224 | Image classification | ViT Large Patch16 224 | [2] | |

Torchvision segmentation

(from https://github.com/pytorch/vision/tree/main/references/segmentation)

| Title | Description | Model | Status | Link |
| --- | --- | --- | --- | --- |
| fcn_resnet101 | Image segmentation | ResNet101 | [2] | |
| deeplabv3_mobilenet_v3_large | Image segmentation | MobileNetV3 Large | [2] | |

Torchvision detection

(from https://github.com/pytorch/vision/tree/main/references/detection)

| Title | Description | Model | Status | Link |
| --- | --- | --- | --- | --- |
| maskrcnn_resnet101_fpn | Object detection | Mask RCNN (ResNet101 FPN) | [2] | |
| retinanet_resnet101_fpn | Object detection | RetinaNet (ResNet101 FPN) | [2] | |

Detectron2

(from https://github.com/facebookresearch/detectron2)

| Title | Description | Model | Status | Link |
| --- | --- | --- | --- | --- |
| detectron2 | TBC | Detectron2 | [2] | |
| detectron2_densepose | TBC | Detectron2 | [2] | |

Large Language Models (LLM)

| Title | Description | Model | Status | Link |
| --- | --- | --- | --- | --- |
| Llama2 | LoRA | Llama2 | [0] | |
| Mistral | TBC | Mistral | [0] | |
