Alexandre Strube
May 23, 2024
from fastai.vision.all import *
from fastai.distributed import *
from fastai.vision.models.xresnet import *

path = untar_data(URLs.IMAGEWOOF_320)
dls = DataBlock(
    blocks=(ImageBlock, CategoryBlock),
    splitter=GrandparentSplitter(valid_name='val'),
    get_items=get_image_files, get_y=parent_label,
    item_tfms=[RandomResizedCrop(160), FlipItem(0.5)],
    batch_tfms=Normalize.from_stats(*imagenet_stats)
).dataloaders(path, path=path, bs=64)

learn = Learner(dls, xresnet50(n_out=10), metrics=[accuracy, top_k_accuracy]).to_fp16()
learn.fine_tune(6)
#!/bin/bash
#SBATCH --account=training2410
#SBATCH --nodes=1
#SBATCH --job-name=ai-serial
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --output=out-serial.%j
#SBATCH --error=err-serial.%j
#SBATCH --time=00:40:00
#SBATCH --partition=develbooster
# Make sure we are on the right directory
cd $HOME/2024-05-talk-intro-to-supercompting-jsc/src
# This loads modules and python packages
source sc_venv_template/activate.sh
# Run the demo
time srun python serial.py
cd $HOME/2024-05-talk-intro-to-supercompting-jsc/src
source sc_venv_template/activate.sh
python serial.py
(Some warnings)
epoch train_loss valid_loss accuracy top_k_accuracy time
Epoch 1/1 : |-------------------------------------------------------------| 0.71% [1/141 00:07<16:40]
from fastai.vision.all import *
from fastai.distributed import *
from fastai.vision.models.xresnet import *

path = rank0_first(untar_data, URLs.IMAGEWOOF_320)
dls = DataBlock(
    blocks=(ImageBlock, CategoryBlock),
    splitter=GrandparentSplitter(valid_name='val'),
    get_items=get_image_files, get_y=parent_label,
    item_tfms=[RandomResizedCrop(160), FlipItem(0.5)],
    batch_tfms=Normalize.from_stats(*imagenet_stats)
).dataloaders(path, path=path, bs=64)

learn = Learner(dls, xresnet50(n_out=10), metrics=[accuracy, top_k_accuracy]).to_fp16()
with learn.distrib_ctx():
    learn.fine_tune(6)
It was:

path = untar_data(URLs.IMAGEWOOF_320)

Became:

path = rank0_first(untar_data, URLs.IMAGEWOOF_320)

It was:

learn.fine_tune(6)

Became:

with learn.distrib_ctx():
    learn.fine_tune(6)
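Why rank0_first? In data-parallel training every process runs the same script, so without coordination every rank would try to download the dataset at the same time. A conceptual sketch of what rank0_first accomplishes (an illustration only, not fastai's actual implementation, which uses the torch.distributed process group rather than an explicit barrier argument):

```python
import threading


def rank0_first_sketch(func, rank, barrier):
    """Run func on rank 0 first, then on all other ranks.

    Conceptual illustration: non-zero ranks wait until rank 0 has
    finished (e.g. downloading a dataset), then run func themselves
    and hit the already-populated cache.
    """
    if rank != 0:
        barrier.wait()   # non-zero ranks wait for rank 0
    result = func()
    if rank == 0:
        barrier.wait()   # rank 0 is done; release the waiting ranks
    return result


# Simulating 3 "ranks" with threads: rank 0 always runs func first.
order = []
barrier = threading.Barrier(3)
threads = [
    threading.Thread(
        target=rank0_first_sketch,
        args=(lambda r=r: order.append(r), r, barrier),
    )
    for r in range(3)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
# order[0] is always 0: the other ranks only run after rank 0 finished
```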
Please check the course repository for the main differences in the submission file: src/distrib.slurm
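The authoritative distrib.slurm is in the course repository; as a rough sketch, using the serial script above as the template, it likely looks something like the following (the task count, GPU request, and the distributed.py script name are assumptions, not taken from the repository):

```shell
#!/bin/bash
#SBATCH --account=training2410
#SBATCH --nodes=1
#SBATCH --job-name=ai-distrib
#SBATCH --ntasks-per-node=4          # one task per GPU (assumption: 4 GPUs per node)
#SBATCH --gres=gpu:4                 # assumption
#SBATCH --output=out-distrib.%j
#SBATCH --error=err-distrib.%j
#SBATCH --time=00:40:00
#SBATCH --partition=develbooster

# Make sure we are in the right directory
cd $HOME/2024-05-talk-intro-to-supercompting-jsc/src
# This loads modules and python packages
source sc_venv_template/activate.sh
# One process per GPU; srun sets up the ranks
time srun python distributed.py
```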
Serial run:

epoch  train_loss  valid_loss  accuracy  top_k_accuracy  time
0      2.249933    2.152813    0.225757  0.750573        01:11

epoch  train_loss  valid_loss  accuracy  top_k_accuracy  time
0      1.882008    1.895813    0.324510  0.832018        00:44
1      1.837312    1.916380    0.374141  0.845253        00:44
2      1.717144    1.739026    0.378722  0.869941        00:43
3      1.594981    1.637526    0.417664  0.891575        00:44
4      1.460454    1.410519    0.507254  0.920336        00:44
5      1.389946    1.304924    0.538814  0.935862        00:43

real   5m44.972s
Distributed run, one node:

epoch  train_loss  valid_loss  accuracy  top_k_accuracy  time
0      2.201540    2.799354    0.202950  0.662513        00:09

epoch  train_loss  valid_loss  accuracy  top_k_accuracy  time
0      1.951004    2.059517    0.294761  0.781282        00:08
1      1.929561    1.999069    0.309512  0.792981        00:08
2      1.854629    1.962271    0.314344  0.840285        00:08
3      1.754019    1.687136    0.404883  0.872330        00:08
4      1.643759    1.499526    0.482706  0.906409        00:08
5      1.554356    1.450976    0.502798  0.914547        00:08

real   1m19.979s
To use more nodes, just set

#SBATCH --nodes=2

on the submission file!

Distributed run, two nodes:

epoch  train_loss  valid_loss  accuracy  top_k_accuracy  time
0      2.242036    2.192690    0.201728  0.681148        00:10

epoch  train_loss  valid_loss  accuracy  top_k_accuracy  time
0      2.035004    2.084082    0.246189  0.748984        00:05
1      1.981432    2.054528    0.247205  0.764482        00:05
2      1.942930    1.918441    0.316057  0.821138        00:05
3      1.898426    1.832725    0.370173  0.839431        00:05
4      1.859066    1.781805    0.375508  0.858740        00:05
5      1.820968    1.743448    0.394055  0.864583        00:05

real   1m15.651s
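The wall-clock times above give a quick speedup calculation (numbers taken from the "real" lines of the runs):

```python
# Wall-clock times from the runs above, in seconds.
serial    = 5 * 60 + 44.972   # serial run
one_node  = 1 * 60 + 19.979   # distributed, one node
two_nodes = 1 * 60 + 15.651   # distributed, two nodes

print(f"one node : {serial / one_node:.1f}x speedup")   # 4.3x
print(f"two nodes: {serial / two_nodes:.1f}x speedup")  # 4.6x
```

Note that at this small model and dataset size the second node adds little: per-epoch time drops from 8 s to 5 s, but fixed startup and communication costs dominate the total.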