Deep Learning on Supercomputers

Alexandre Strube

May 23, 2024



Alexandre Strube
Ilya Zhukov
Jolanta Zjupa

Goal for this talk:

Slides on your own computer:

Please access it now, so you can follow along:

Git clone this repository

  • All slides and source code
  • Connect to the supercomputer and do this:
  • git clone

Deep learning is…

Matrix Multiplication Recap

Parallelism in a GPU

  • Operations on each element of the matrix are independent
  • Ergo, each element can be computed in parallel
  • Extremely good to apply the same operation to a lot of data
  • Parallelization is done “for free” by the ML toolkits
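
To make the independence concrete, here is a minimal pure-Python sketch. The helper names `matmul_element` and `matmul` are illustrative only and do not come from any ML toolkit. Each output element needs just one row of the first matrix and one column of the second, so the elements can be computed in any order, or all at once on a GPU.

```python
import random

def matmul_element(a, b, i, j):
    """Compute one output element C[i][j] from row i of a and column j of b."""
    return sum(a[i][k] * b[k][j] for k in range(len(b)))

def matmul(a, b, order=None):
    """Multiply a and b, visiting output positions in an arbitrary order."""
    rows, cols = len(a), len(b[0])
    c = [[None] * cols for _ in range(rows)]
    positions = [(i, j) for i in range(rows) for j in range(cols)]
    if order is not None:
        positions = order(positions)
    for i, j in positions:
        # No element depends on any other element of c.
        c[i][j] = matmul_element(a, b, i, j)
    return c

def shuffled(positions):
    """Return the positions in a scrambled (but reproducible) order."""
    positions = list(positions)
    random.Random(0).shuffle(positions)
    return positions

a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]

c_in_order = matmul(a, b)
c_shuffled = matmul(a, b, order=shuffled)
print(c_in_order)                 # [[19, 22], [43, 50]]
print(c_in_order == c_shuffled)   # True: the order does not matter
```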

Are we there yet?

Not quite

  • Compute in one GPU is parallel, yes

But what about many GPUs?

  • It’s when things get interesting

Data Parallel

Data Parallel - Averaging

There are other approaches too, e.g.

  • For the sake of completeness:
    • Asynchronous Stochastic Gradient Descent
      • Don’t average the parameters, but send the updates (gradients post learning rate and momentum) asynchronously
      • Advantageous for slow networks
      • Problem: stale gradient (things might change while calculating gradients)
      • The more nodes, the worse it gets
      • Won’t talk about it anymore

There are other approaches too!

  • Decentralized Asynchronous Stochastic Gradient Descent
    • Updates are peer-to-peer
    • The updates are heavily compressed and quantized
    • Disadvantage: extra computation per minibatch, more memory needed

  • Use different data for each GPU
  • Everything else is the same
  • Average after each epoch

Data Parallel - Multi Node

  • Data parallel is usually good enough 👌
  • If you need more than this, you should be giving this course, not me 🤷‍♂️

Model Parallel

  • Model itself is too big to fit in one single GPU 🐋
  • Each GPU holds a slice of the model 🍕
  • Data moves from one GPU to the next

What’s the problem here? 🧐

Model Parallel - Pipelining

This is an oversimplification!

  • Actually, you split the input minibatch into multiple microbatches.
  • There’s still idle time - an unavoidable “bubble” 🫧
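
The size of the bubble can be estimated with a small sketch. It rests on simplifying assumptions that are not from the slides: forward pass only, every stage takes exactly one time unit per microbatch, and communication is free. The function names are illustrative.

```python
def pipeline_schedule(n_stages, n_microbatches):
    """Return grid[stage][t]: the microbatch processed at time t, or None if idle."""
    total_steps = n_stages + n_microbatches - 1
    grid = [[None] * total_steps for _ in range(n_stages)]
    for stage in range(n_stages):
        for mb in range(n_microbatches):
            # A stage can only start microbatch mb after the previous stage has
            # finished it, so stage s handles microbatch mb at time s + mb.
            grid[stage][stage + mb] = mb
    return grid

def idle_fraction(grid):
    """Fraction of (stage, time) slots in which a GPU has nothing to do."""
    slots = [slot for row in grid for slot in row]
    return slots.count(None) / len(slots)

n_stages = 4  # e.g. a model sliced over 4 GPUs
bubble = {m: idle_fraction(pipeline_schedule(n_stages, m)) for m in (1, 4, 16)}
for n_micro, fraction in bubble.items():
    print(f"{n_micro:2d} microbatches: {fraction:.1%} idle")
```

With a single microbatch only one GPU works at a time, so three quarters of the slots are idle. More microbatches shrink the bubble, but under these assumptions it never reaches zero.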

Model Parallel - Multi Node

  • In this case, each node does the same as the others.
  • At each step, they all synchronize their weights.

Model Parallel - going bigger

  • You can also have layers spread over multiple GPUs
  • One can even pipeline among nodes…


  • Data parallelism:
    • Split the data over multiple GPUs
    • Each GPU runs the whole model
    • The gradients are averaged at each step
  • Data parallelism, multi-node:
    • Same, but gradients are averaged across nodes
  • Model parallelism:
    • Split the model over multiple GPUs
    • Each GPU does the forward/backward pass
    • The gradients are averaged at the end
  • Model parallelism, multi-node:
    • Same, but gradients are averaged across nodes
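
The data-parallel recipe from this recap can be sketched in plain Python. This is a toy illustration, not what the ML toolkits do internally: the “model” is a single weight `w`, the “GPUs” are just list slices, and all names are made up for the example. With equally sized shards, the average of the per-GPU gradients equals the gradient over the whole batch.

```python
def gradient(w, xs, ys):
    """Gradient of the mean squared error of y = w * x with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [3.0 * x for x in xs]  # toy data generated with the "true" weight 3.0

n_gpus = 4
# Split the data over the GPUs: each one gets a different, equally sized shard.
shards = [(xs[r::n_gpus], ys[r::n_gpus]) for r in range(n_gpus)]

# Sanity check at w = 0: averaged shard gradients match the full-batch gradient.
full_grad = gradient(0.0, xs, ys)
avg_of_shards = sum(gradient(0.0, sx, sy) for sx, sy in shards) / n_gpus

w, lr = 0.0, 0.01
for step in range(200):
    # Each GPU runs the whole model on its own shard...
    local_grads = [gradient(w, sx, sy) for sx, sy in shards]
    # ...then the gradients are averaged at each step and applied everywhere.
    w -= lr * sum(local_grads) / n_gpus

print(full_grad, avg_of_shards)
print(round(w, 6))
```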

If you haven’t done so, please access the slides and clone the repository:

  • git clone


  • Let’s take a simple model
  • Run it “serially” (single GPU)
  • We make it data parallel among multiple GPUs in one node
  • Then we make it multi-node!

Expected imports

from fastai.vision.all import *
from fastai.distributed import *
from fastai.vision.models.xresnet import *

Bringing your data in*

from fastai.vision.all import *
from fastai.distributed import *
from fastai.vision.models.xresnet import *
# DOWNLOADS DATASET - we need to do this on the login node
path = untar_data(URLs.IMAGEWOOF_320) 

Loading your data

from fastai.vision.all import *
from fastai.distributed import *
from fastai.vision.models.xresnet import *

path = untar_data(URLs.IMAGEWOOF_320)
dls = DataBlock(
    blocks=(ImageBlock, CategoryBlock),
    get_items=get_image_files, get_y=parent_label,
    item_tfms=[RandomResizedCrop(160), FlipItem(0.5)],
).dataloaders(path, path=path, bs=64)

Single-gpu code

from fastai.vision.all import *
from fastai.distributed import *
from fastai.vision.models.xresnet import *

path = untar_data(URLs.IMAGEWOOF_320)
dls = DataBlock(
    blocks=(ImageBlock, CategoryBlock),
    get_items=get_image_files, get_y=parent_label,
    item_tfms=[RandomResizedCrop(160), FlipItem(0.5)],
).dataloaders(path, path=path, bs=64)

learn = Learner(dls, xresnet50(n_out=10), metrics=[accuracy,top_k_accuracy]).to_fp16()



  • EXTREMELY useful thing to keep your software in order
  • Make a venv with the correct supercomputer modules
  • Only add new requirements
  • Link to gitlab repo
  • cd $HOME/2024-05-talk-intro-to-supercompting-jsc/src
    git clone
  • Add this to sc_venv_template/requirements.txt:
  • fastai
  • sc_venv_template/
    source sc_venv_template/
  • Done! You installed everything you need

Submission Script

#!/bin/bash
#SBATCH --account=training2410
#SBATCH --nodes=1
#SBATCH --job-name=ai-serial
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --output=out-serial.%j
#SBATCH --error=err-serial.%j
#SBATCH --time=00:40:00
#SBATCH --partition=develbooster

# Make sure we are in the right directory
cd $HOME/2024-05-talk-intro-to-supercompting-jsc/src

# This loads modules and python packages
source sc_venv_template/

# Run the demo
time srun python

Download dataset

  • Compute nodes have no internet
  • We need to download the dataset

Download dataset

cd $HOME/2024-05-talk-intro-to-supercompting-jsc/src
source sc_venv_template/

(Some warnings)
epoch     train_loss  valid_loss  accuracy  top_k_accuracy  time    
Epoch 1/1 : |-------------------------------------------------------------| 0.71% [1/141 00:07<16:40]
  • It started training, on the login node’s CPUs (WRONG!!!)
  • That means we have the data!
  • We just cancel with Ctrl+C

Running it

  • cd $HOME/2024-05-talk-intro-to-supercompting-jsc/src
    sbatch serial.slurm
  • On JUWELS Booster, it should take about 5 minutes
  • On a CPU system this would take half a day
  • Check the out-serial.XXXXXX and err-serial.XXXXXX files

Going data parallel

  • Almost the same code as before, let’s show the differences

Data parallel

from fastai.vision.all import *
from fastai.distributed import *
from fastai.vision.models.xresnet import *

path = rank0_first(untar_data, URLs.IMAGEWOOF_320)
dls = DataBlock(
    blocks=(ImageBlock, CategoryBlock),
    get_items=get_image_files, get_y=parent_label,
    item_tfms=[RandomResizedCrop(160), FlipItem(0.5)],
).dataloaders(path, path=path, bs=64)

learn = Learner(dls, xresnet50(n_out=10), metrics=[accuracy,top_k_accuracy]).to_fp16()
with learn.distrib_ctx():

What changed?

It was

path = untar_data(URLs.IMAGEWOOF_320)


path = rank0_first(untar_data, URLs.IMAGEWOOF_320)
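
Why wrap the download at all? With several processes starting at once, all of them would try to download and unpack the dataset simultaneously. The sketch below imitates the idea behind `rank0_first`: rank 0 goes first, and the other ranks only proceed afterwards. It is not fastai’s implementation. Threads stand in for processes, and `rank0_first_sketch`, `fake_untar_data` and the returned path are hypothetical names for this illustration.

```python
import threading

def rank0_first_sketch(func, rank, barrier):
    """Run func on rank 0 first; the other ranks wait, then run it themselves."""
    if rank == 0:
        result = func()
    barrier.wait()  # nobody passes this point until rank 0 has finished func
    if rank != 0:
        result = func()
    return result

call_order = []
lock = threading.Lock()

def fake_untar_data(rank):
    """Stand-in for a download: only records which rank called, and when."""
    with lock:
        call_order.append(rank)
    return "/path/to/dataset"  # hypothetical path

n_ranks = 4
barrier = threading.Barrier(n_ranks)
results = [None] * n_ranks

def worker(rank):
    results[rank] = rank0_first_sketch(lambda: fake_untar_data(rank), rank, barrier)

threads = [threading.Thread(target=worker, args=(r,)) for r in range(n_ranks)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(call_order[0])  # 0: rank 0 always runs first
```

In the real case, the later ranks find the dataset already on disk, so only rank 0 has to do the actual work.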

It was



with learn.distrib_ctx():

Submission script: data parallel

  • Please check the course repository: src/distrib.slurm

  • Main differences:

  • #SBATCH --cpus-per-task=48
    #SBATCH --gres=gpu:4

Let’s check the outputs!

Single GPU:

epoch     train_loss  valid_loss  accuracy  top_k_accuracy  time    
0         2.249933    2.152813    0.225757  0.750573        01:11                          
epoch     train_loss  valid_loss  accuracy  top_k_accuracy  time    
0         1.882008    1.895813    0.324510  0.832018        00:44                          
1         1.837312    1.916380    0.374141  0.845253        00:44                          
2         1.717144    1.739026    0.378722  0.869941        00:43                          
3         1.594981    1.637526    0.417664  0.891575        00:44                          
4         1.460454    1.410519    0.507254  0.920336        00:44                          
5         1.389946    1.304924    0.538814  0.935862        00:43  
real    5m44.972s

Multi GPU:

epoch     train_loss  valid_loss  accuracy  top_k_accuracy  time    
0         2.201540    2.799354    0.202950  0.662513        00:09                        
epoch     train_loss  valid_loss  accuracy  top_k_accuracy  time    
0         1.951004    2.059517    0.294761  0.781282        00:08                        
1         1.929561    1.999069    0.309512  0.792981        00:08                        
2         1.854629    1.962271    0.314344  0.840285        00:08                        
3         1.754019    1.687136    0.404883  0.872330        00:08                        
4         1.643759    1.499526    0.482706  0.906409        00:08                        
5         1.554356    1.450976    0.502798  0.914547        00:08  
real    1m19.979s

Some insights

  • The distributed run suffered a bit on accuracy 🎯 and loss 😩
    • In exchange for speed 🏎️
    • Train a bit longer and you’re good!
  • It’s more than 4x faster because the library is multi-threaded (and now we use 48 threads)
  • I/O is automatically parallelized / sharded by the fastai library
  • Data parallel is a simple and effective way to distribute DL workload 💪
  • This is really just a primer - there’s much more to it
  • I/O plays a HUGE role on Supercomputers, for example
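
The “more than 4x” remark can be checked against the timings in the two outputs. The sketch below only does arithmetic on the reported numbers; the per-epoch times are read off the `time` column (about 44 s and 8 s).

```python
def to_seconds(minutes, seconds):
    """Convert the 'real XmY.YYYs' output of `time` into seconds."""
    return 60 * minutes + seconds

single_gpu_wall = to_seconds(5, 44.972)  # real 5m44.972s
four_gpu_wall = to_seconds(1, 19.979)    # real 1m19.979s

wall_speedup = single_gpu_wall / four_gpu_wall
epoch_speedup = 44 / 8                   # per-epoch time, 1 GPU vs 4 GPUs

print(f"wall-clock speedup: {wall_speedup:.2f}x")  # 4.31x
print(f"per-epoch speedup:  {epoch_speedup:.2f}x")  # 5.50x
```

The wall-clock number is lower than the per-epoch number because it also includes startup work that does not get faster with more GPUs.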


  • Simply change to #SBATCH --nodes=2 in the submission file!
  • THAT’S IT


  • epoch     train_loss  valid_loss  accuracy  top_k_accuracy  time    
    0         2.242036    2.192690    0.201728  0.681148        00:10                      
    epoch     train_loss  valid_loss  accuracy  top_k_accuracy  time    
    0         2.035004    2.084082    0.246189  0.748984        00:05                      
    1         1.981432    2.054528    0.247205  0.764482        00:05                      
    2         1.942930    1.918441    0.316057  0.821138        00:05                      
    3         1.898426    1.832725    0.370173  0.839431        00:05                      
    4         1.859066    1.781805    0.375508  0.858740        00:05                      
    5         1.820968    1.743448    0.394055  0.864583        00:05
    real    1m15.651s    

Some insights

  • It’s faster per epoch, but not by much (5 seconds vs 8 seconds)
  • Accuracy and loss suffered
  • This is a very simple model, so it’s not surprising
    • It fits into 4 GB; we “stretched” it to a 320 GB system
    • It’s not a good fit for this system
  • You need bigger models to really exercise the GPUs and scaling
  • There’s a lot more to it, but for now, let’s focus on medium- and large-sized models
    • For gigantic and humongous-sized models, there’s a DL scaling course at JSC!
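
The numbers behind these insights can be worked out from the outputs shown earlier. This is plain arithmetic on the reported values, assuming 4 GPUs per node as requested with `--gres=gpu:4`.

```python
gpus_one_node = 4
gpus_two_nodes = 2 * gpus_one_node

# Per-epoch time went from about 8 s (one node) to about 5 s (two nodes).
epoch_speedup = 8 / 5
# Ideal speedup would be 2x, because the number of GPUs doubled.
efficiency = epoch_speedup / (gpus_two_nodes / gpus_one_node)

# Total run time: real 1m19.979s (one node) vs real 1m15.651s (two nodes).
wall_speedup = (60 + 19.979) / (60 + 15.651)

# 320 GB of GPU memory in total, spread over all 8 GPUs.
memory_per_gpu_gb = 320 / gpus_two_nodes

print(f"per-epoch speedup: {epoch_speedup:.2f}x, efficiency {efficiency:.0%}")
print(f"wall-clock speedup: {wall_speedup:.2f}x")
print(f"memory per GPU: {memory_per_gpu_gb:.0f} GB for a model that fits into 4 GB")
```

Doubling the hardware bought roughly 6% in total run time, which is why a model this small is a poor fit for multi-node runs.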

That’s all folks!

  • Thanks for listening!
  • Questions?
