An experimental Large Language Model server

Alexandre Strube

September 26, 2023




  • /ˈblæblæˌdɔɹ/
  • Bla-bla-bla 🗣️ + Labrador 🐕‍🦺
  • A stage for deploying and testing large language models
  • Models change constantly (leaderboard ranks keep improving; some models are good, some awful)
  • Usually a code model plus one of the top entries of HF’s Open LLM Leaderboard
  • It is both a web server and an API server; the API server is only available on the intranet.


  • AI is becoming basic infrastructure
  • Which historically is Open Source
  • No experience with dealing with LLMs
  • We train a lot, deploy little
  • From the tools point of view, this is a FAST moving target 🎯💨
  • Acquire local experience with issues such as
    • data loading,
    • quantization,
    • distribution,
    • fine-tuning LLMs for specific tasks,
    • inference speed,
    • deployment
  • The usual: we want to be ready when the time comes

Some facts

  • We have no models of our own deployed (yet)
  • Most models that need 2 GPUs can be quantized to run on 1 GPU with GPTQ
  • Models based on Llama 2 70B 🦙 take 7 GPUs (or 8 with vLLM)
  • vLLM: PagedAttention, batching, etc. Speeds up inference at the cost of more GPUs
  • No data collection at all. I don’t keep ANY data whatsoever
    • We could, there’s code for ranking answers, or running models in parallel and voting
    • So far, it has been simpler for GDPR (as there’s nothing to deal with)
    • (I just wrote “Datenschutz” on Bing Image Creator)
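The GPU counts above follow from simple weight-size arithmetic. A rough sketch (the 13B model size is illustrative; real deployments also need headroom for activations and the KV cache, which is why 70B ends up on 7 GPUs rather than 6):

```python
def model_vram_gb(n_params_billion: float, bits_per_weight: int) -> float:
    """Rough VRAM needed for the weights alone, ignoring activations/KV cache."""
    return n_params_billion * 1e9 * bits_per_weight / 8 / 1e9

GPU_GB = 24  # one Nvidia 3090

# Illustrative 13B model: fp16 weights need 2 GPUs, 4-bit GPTQ fits on 1.
fp16_13b = model_vram_gb(13, 16)  # 26.0 GB -> 2 x 24 GB GPUs
gptq_13b = model_vram_gb(13, 4)   # 6.5 GB  -> 1 GPU

# Llama 2 70B at fp16: 140 GB of weights alone, before any cache overhead.
fp16_70b = model_vram_gb(70, 16)
print(fp16_13b, gptq_13b, fp16_70b)
```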

What is it?

  • A classic MVC web application:
    • Model: large language model(s)
    • View: a web server and api server (openAI-compatible)
    • Controller: coordinates the models
  • A collaboration with LMSYS (of Vicuna 🦙 fame)
  • Python app, runs on bare metal 🤘 with venvs
  • Models run in separate sc_venv_template environments
    • avoids conflicting library versions
  • Website authentication: Helmholtz AAI; no records kept, just to keep the bots out
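The controller’s job can be pictured as a registry mapping model names to the worker processes that serve them; the view (web/API server) asks it where to route each request. A purely illustrative sketch (class and addresses are made up, not Blablador’s or LMSYS’s actual internals):

```python
# Hypothetical sketch of the "Controller" role in the MVC picture:
# a registry from model name to the address of the worker serving it.
class Controller:
    def __init__(self) -> None:
        self.workers: dict[str, str] = {}  # model name -> worker address

    def register(self, model: str, address: str) -> None:
        """A model worker announces which model it serves."""
        self.workers[model] = address

    def route(self, model: str) -> str:
        """The web/API server asks where to forward a request."""
        if model not in self.workers:
            raise KeyError(f"no worker serves {model!r}")
        return self.workers[model]

ctl = Controller()
ctl.register("example-code-model", "http://worker1:21002")
print(ctl.route("example-code-model"))
```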

OpenAI-compatible API

  • Uses openai-python from OpenAI itself
  • All services that can use OpenAI’s API can use Blablador’s API (Jupyter, etc.)
  • Only available on the intranet/VPN (for now)
  • The API is not yet authenticated, rate-limited, logged, monitored, documented, or well-tested.
  • HOWTO:
  • export OPENAI_API_KEY="EMPTY"
    export OPENAI_API_BASE="https://haicluster1.fz-juelich.de:8000/v1"
    (haicluster1 is offline this week, use haicluster2)
  • After auth/key implemented:
  • export OPENAI_API_BASE="https://helmholtz-blablador.fz-juelich.de:8000/v1"
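With those variables set, any OpenAI-style client should work. As a dependency-free illustration, the sketch below builds (but does not send) the corresponding chat-completion HTTP request using only the standard library; the model name is a placeholder, and the base URL is the one from the slides:

```python
import json
import os
import urllib.request

# Assumed defaults from the slides; override via the environment as shown above.
os.environ.setdefault("OPENAI_API_KEY", "EMPTY")
os.environ.setdefault("OPENAI_API_BASE", "https://haicluster1.fz-juelich.de:8000/v1")

def build_chat_request(model: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-compatible chat completion request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=os.environ["OPENAI_API_BASE"] + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer " + os.environ["OPENAI_API_KEY"],
        },
        method="POST",
    )

req = build_chat_request("placeholder-model", "Hello, Blablador!")
print(req.full_url)
```

Sending it is one `urllib.request.urlopen(req)` call (or, as the slides note, use openai-python directly and let it read the same environment variables).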

Jülich Supercomputers

JSC Supercomputer Strategy

Well, no. Not really.



  • 2 nodes with 4× Nvidia 3090 24 GB, 256 GB RAM
  • 1 node with 8× Nvidia 3090 24 GB, 256 GB RAM
  • GlusterFS:
    • 6 TB SSD for /home,
    • 24 TB HDD for data
  • Software: Ubuntu 20.04 LTS and EasyBuild
  • Auth: local NIS


  • Run models in a multi-node way (e.g. Falcon-180B)
  • Add key authentication to the API server
  • Serve the API on the same external address as the website
  • Use the interface to evaluate models
  • Collect usage metrics
  • There’s fine-tuning code, but it’s not running yet (the GPUs are usually busy)