Deep Learning on Supercomputers

Alexandre Strube

May 23, 2024

Resources:

Team:

Alexandre Strube
Ilya Zhukov
Jolanta Zjupa

Goal for this talk:

Slides on your own computer:

Please access it now, so you can follow along:

https://go.fzj.de/2024-05-talk-intro-to-supercomputing-jsc

Git clone this repository

  • All slides and source code
  • Connect to the supercomputer and do this:
  • git clone https://gitlab.jsc.fz-juelich.de/strube1/2024-05-talk-intro-to-supercompting-jsc.git

Deep learning is…

Matrix Multiplication Recap

Parallelism in a GPU

  • Operations on each element of the matrix are independent
  • Ergo, each element can be computed in parallel
  • Extremely good at applying the same operation to a lot of data
  • Parallelization is done "for free" by the ML toolkits (see the sketch below)
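As a tiny illustration (a PyTorch sketch, not part of the course code): every output element of a matrix product depends only on one row and one column of the inputs, so the toolkit can compute them all in a single parallel kernel.

import torch

# C[i, j] = sum_k A[i, k] * B[k, j] depends only on row i of A and column j
# of B, so all output elements can be computed independently of each other.
A = torch.randn(1024, 1024)
B = torch.randn(1024, 1024)

# One call, one big parallel kernel. On a GPU the same line would run as
# torch.matmul(A.cuda(), B.cuda()), with no other code changes.
C = torch.matmul(A, B)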

Are we there yet?

Not quite

  • Compute in one GPU is parallel, yes

But what about many GPUs?

  • That's when things get interesting

Data Parallel

Data Parallel - Averaging

Data Parallel

There are other approaches too, e.g.

  • For the sake of completeness:
    • Asynchronous Stochastic Gradient Descent
      • Don't average the parameters; instead, send the updates (the gradients, after learning rate and momentum are applied) asynchronously
      • Advantageous for slow networks
      • Problem: stale gradients (things might change while you are still calculating yours)
      • The more nodes, the worse it gets
      • We won't talk about it any further

Data Parallel

There are other approaches too!

  • Decentralized Asynchronous Stochastic Gradient Descent
    • Updates are peer-to-peer
    • The updates are heavily compressed and quantized
    • Disadvantage: extra computation per minibatch, more memory needed
  • WE DON'T NEED THOSE

That's it for data parallel!

  • Use different data for each GPU
  • Everything else stays the same
  • Average the gradients after each step (see the sketch below)
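For intuition, here is a minimal sketch of that averaging step in plain PyTorch, assuming torch.distributed has already been initialized; in the demo below, fastai (via DistributedDataParallel) does this for you:

import torch.distributed as dist

def average_gradients(model):
    # Sum every parameter's gradient over all GPUs, then divide by the
    # number of GPUs, so each replica applies exactly the same update.
    world_size = dist.get_world_size()
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= world_size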

Well, almost…

There are more levels!

Data Parallel - Multi Node

Before we go further…

  • Data parallel is usually good enough 👌
  • If you need more than this, you should be giving this course, not me 🤷‍♂️

Are we there yet?

Model Parallel

  • The model itself is too big to fit in one single GPU 🐋
  • Each GPU holds a slice of the model 🍕
  • Data moves from one GPU to the next (see the sketch below)
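A minimal sketch of this in PyTorch, with a made-up toy model split by hand over two GPUs (layer sizes and names are just for illustration):

import torch.nn as nn

class TwoGPUModel(nn.Module):
    # Hypothetical model, manually split over cuda:0 and cuda:1
    def __init__(self):
        super().__init__()
        self.stage1 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to("cuda:0")
        self.stage2 = nn.Linear(4096, 10).to("cuda:1")

    def forward(self, x):
        x = self.stage1(x.to("cuda:0"))  # first slice runs on GPU 0
        x = self.stage2(x.to("cuda:1"))  # activations are copied to GPU 1 for the second slice
        return x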

What's the problem here? 🧐

  • While one GPU is computing its slice, the other GPUs sit idle

Model Parallel - Pipelining

  • Keep all GPUs busy: feed new data to the first GPU while the later GPUs are still working on the previous data

This is an oversimplification!

  • Actually, you split the input minibatch into multiple microbatches.
  • There's still idle time: an unavoidable "bubble" 🫧 (see the sketch below)
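A rough sketch of the microbatch idea, reusing the hypothetical two-GPU split from the model-parallel sketch above; real pipeline schedules (e.g. GPipe-style implementations in DeepSpeed or PyTorch) interleave the stages much more carefully, and still cannot remove the bubble entirely:

import torch

def pipelined_forward(stage1, stage2, batch, n_micro=4):
    # Split the minibatch into microbatches and push them through the stages
    # one after another. A proper schedule lets GPU 0 start on the next
    # microbatch while GPU 1 is still busy with the previous one, but the
    # pipeline is still idle while it fills up and drains.
    outputs = []
    for mb in torch.chunk(batch, n_micro):
        a = stage1(mb.to("cuda:0"))             # stage 1 on GPU 0
        outputs.append(stage2(a.to("cuda:1")))  # stage 2 on GPU 1
    return torch.cat(outputs)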

Are we there yet?

Model Parallel - Multi Node

  • In this case, each node does the same as the others.
  • At each step, they all synchronize their weights.

Model Parallel - going bigger

  • You can also have layers spread over multiple GPUs
  • One can even pipeline across nodes…

Recap

  • Data parallelism:
    • Split the data over multiple GPUs
    • Each GPU runs the whole model
    • The gradients are averaged at each step
  • Data parallelism, multi-node:
    • Same, but gradients are averaged across nodes
  • Model parallelism:
    • Split the model over multiple GPUs
    • Each GPU does the forward/backward pass for its slice of the model
    • The gradients are averaged at the end
  • Model parallelism, multi-node:
    • Same, but gradients are averaged across nodes

Are we there yet?

If you haven't done so yet, please access the slides and clone the repository:

  • git clone https://gitlab.jsc.fz-juelich.de/strube1/2024-05-talk-intro-to-supercompting-jsc.git

DEMO TIME!

  • Let's take a simple model
  • Run it "serially" (single GPU)
  • We make it data parallel across multiple GPUs in one node
  • Then we make it multi-node!

Expected imports

from fastai.vision.all import *
from fastai.distributed import *
from fastai.vision.models.xresnet import *

Bringing your data in*

from fastai.vision.all import *
from fastai.distributed import *
from fastai.vision.models.xresnet import *
# DOWNLOADS DATASET - we need to do this on the login node
path = untar_data(URLs.IMAGEWOOF_320)

Loading your data

from fastai.vision.all import *
from fastai.distributed import *
from fastai.vision.models.xresnet import *

path = untar_data(URLs.IMAGEWOOF_320)
dls = DataBlock(
    blocks=(ImageBlock, CategoryBlock),              # inputs are images, targets are categories
    splitter=GrandparentSplitter(valid_name='val'),  # train/val split comes from the folder structure
    get_items=get_image_files, get_y=parent_label,   # the label is the name of the parent folder
    item_tfms=[RandomResizedCrop(160), FlipItem(0.5)],
    batch_tfms=Normalize.from_stats(*imagenet_stats)
).dataloaders(path, path=path, bs=64)

Single-GPU code

from fastai.vision.all import *
from fastai.distributed import *
from fastai.vision.models.xresnet import *

path = untar_data(URLs.IMAGEWOOF_320)
dls = DataBlock(
    blocks=(ImageBlock, CategoryBlock),
    splitter=GrandparentSplitter(valid_name='val'),
    get_items=get_image_files, get_y=parent_label,
    item_tfms=[RandomResizedCrop(160), FlipItem(0.5)],
    batch_tfms=Normalize.from_stats(*imagenet_stats)
).dataloaders(path, path=path, bs=64)

learn = Learner(dls, xresnet50(n_out=10), metrics=[accuracy,top_k_accuracy]).to_fp16()

learn.fine_tune(6)

Venv template

  • EXTREMELY useful for keeping your software environment in order
  • It creates a venv on top of the correct supercomputer modules
  • You only add your extra requirements
  • Link to gitlab repo
  • cd $HOME/2024-05-talk-intro-to-supercompting-jsc/src
    git clone https://gitlab.jsc.fz-juelich.de/kesselheim1/sc_venv_template.git
  • Add this to sc_venv_template/requirements.txt:
  • fastai
    deepspeed
    accelerate
  • Run the setup and activate the environment:
  • sc_venv_template/setup.sh
    source sc_venv_template/activate.sh
  • Done! You have installed everything you need

Submission Script

#!/bin/bash
#SBATCH --account=training2410
#SBATCH --nodes=1
#SBATCH --job-name=ai-serial
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --output=out-serial.%j
#SBATCH --error=err-serial.%j
#SBATCH --time=00:40:00
#SBATCH --partition=develbooster

# Make sure we are on the right directory
cd $HOME/2024-05-talk-intro-to-supercompting-jsc/src

# This loads modules and python packages
source sc_venv_template/activate.sh

# Run the demo
time srun python serial.py

Download dataset

  • Compute nodes have no internet access
  • We need to download the dataset beforehand, on the login node

Download dataset

cd $HOME/2024-05-talk-intro-to-supercompting-jsc/src
source sc_venv_template/activate.sh
python serial.py

(Some warnings)
epoch     train_loss  valid_loss  accuracy  top_k_accuracy  time    
Epoch 1/1 : |-------------------------------------------------------------| 0.71% [1/141 00:07<16:40]
  • It started training on the login node's CPUs (WRONG!!!)
  • That means we have the data!
  • We just cancel with Ctrl+C

Running it

  • cd $HOME/2024-05-talk-intro-to-supercompting-jsc/src
    sbatch serial.slurm
  • On JUWELS Booster, this should take about 5 minutes
  • On a CPU-only system it would take half a day
  • Check the out-serial.XXXXXX and err-serial.XXXXXX files

Going data parallel

  • Almost the same code as before; let's look at the differences

Data parallel

from fastai.vision.all import *
from fastai.distributed import *
from fastai.vision.models.xresnet import *

path = rank0_first(untar_data, URLs.IMAGEWOOF_320)
dls = DataBlock(
    blocks=(ImageBlock, CategoryBlock),
    splitter=GrandparentSplitter(valid_name='val'),
    get_items=get_image_files, get_y=parent_label,
    item_tfms=[RandomResizedCrop(160), FlipItem(0.5)],
    batch_tfms=Normalize.from_stats(*imagenet_stats)
).dataloaders(path, path=path, bs=64)

learn = Learner(dls, xresnet50(n_out=10), metrics=[accuracy,top_k_accuracy]).to_fp16()
with learn.distrib_ctx():
    learn.fine_tune(6)

Data Parallel

What changed?

It was

path = untar_data(URLs.IMAGEWOOF_320)

Became

path = rank0_first(untar_data, URLs.IMAGEWOOF_320)

rank0_first runs the download on rank 0 first, so only one process fetches and extracts the dataset; the other ranks then simply reuse it.

It was

learn.fine_tune(6)

Became

with learn.distrib_ctx():
    learn.fine_tune(6)

distrib_ctx() is a context manager that puts the Learner into distributed data parallel mode for the training call and cleans up afterwards.

Submission script: data parallel

  • Please check the course repository: src/distrib.slurm

  • Main differences:

  • #SBATCH --cpus-per-task=48
    #SBATCH --gres=gpu:4

Letā€™s check the outputs!

Single GPU:

epoch     train_loss  valid_loss  accuracy  top_k_accuracy  time    
0         2.249933    2.152813    0.225757  0.750573        01:11                          
epoch     train_loss  valid_loss  accuracy  top_k_accuracy  time    
0         1.882008    1.895813    0.324510  0.832018        00:44                          
1         1.837312    1.916380    0.374141  0.845253        00:44                          
2         1.717144    1.739026    0.378722  0.869941        00:43                          
3         1.594981    1.637526    0.417664  0.891575        00:44                          
4         1.460454    1.410519    0.507254  0.920336        00:44                          
5         1.389946    1.304924    0.538814  0.935862        00:43  
real    5m44.972s

Multi GPU:

epoch     train_loss  valid_loss  accuracy  top_k_accuracy  time    
0         2.201540    2.799354    0.202950  0.662513        00:09                        
epoch     train_loss  valid_loss  accuracy  top_k_accuracy  time    
0         1.951004    2.059517    0.294761  0.781282        00:08                        
1         1.929561    1.999069    0.309512  0.792981        00:08                        
2         1.854629    1.962271    0.314344  0.840285        00:08                        
3         1.754019    1.687136    0.404883  0.872330        00:08                        
4         1.643759    1.499526    0.482706  0.906409        00:08                        
5         1.554356    1.450976    0.502798  0.914547        00:08  
real    1m19.979s

Some insights

  • The distributed run suffered a bit on accuracy 🎯 and loss 😩
    • In exchange for speed 🏎️
    • Train a bit longer and you're good!
  • It's more than 4x faster because the library is also multi-threaded (and we now use 48 threads)
  • I/O is automatically parallelized / sharded by the fastai library
  • Data parallelism is a simple and effective way to distribute a DL workload 💪
  • This is really just a primer; there's much more to it
  • I/O plays a HUGE role on supercomputers, for example

Multi-node

  • Simply change to #SBATCH --nodes=2 in the submission file!
  • THAT'S IT

Multi-node

epoch     train_loss  valid_loss  accuracy  top_k_accuracy  time
0         2.242036    2.192690    0.201728  0.681148        00:10
epoch     train_loss  valid_loss  accuracy  top_k_accuracy  time
0         2.035004    2.084082    0.246189  0.748984        00:05
1         1.981432    2.054528    0.247205  0.764482        00:05
2         1.942930    1.918441    0.316057  0.821138        00:05
3         1.898426    1.832725    0.370173  0.839431        00:05
4         1.859066    1.781805    0.375508  0.858740        00:05
5         1.820968    1.743448    0.394055  0.864583        00:05
real    1m15.651s

Some insights

  • It's faster per epoch, but not by much (5 seconds vs 8 seconds)
  • Accuracy and loss suffered
  • This is a very simple model, so that's not surprising
    • It fits into 4 GB, and we "stretched" it over a 320 GB system
    • It's not a good fit for this system
  • You need bigger models to really exercise the GPUs and the scaling
  • There's a lot more to it, but for now, let's focus on medium/big-sized models
    • For Gigantic and Humongous-sized models, there's a DL scaling course at JSC!

That's all folks!

  • Thanks for listening!
  • Questions?
