Deploying Django with Kamal (mrsk)

Sep 11, 2023 · @anthonynsimon

Over the past few years, I've mostly gone back to monolithic web apps for my projects.

I'm talking about a web server, a database, and a worker process. Maybe sprinkle in a cache, some object storage, or a CDN, but that's about it.

[Diagram: Django monolith, 3-tier architecture]

I've used a variety of platforms to deploy these apps, from bare EC2 instances to Kubernetes to managed container platforms such as Fargate, ECS, and Render.

However, I often find myself simply running Docker on a trusty Ubuntu VPS. It's easy to understand and there are few moving parts.

So lately, I've been tinkering with Kamal (formerly known as mrsk) for deploying my apps.

Here's what I like:

  • Deploy anywhere. I can throw my entire app into a single $5/mo instance on Hetzner (hello 20TB egress), or anywhere I can SSH into an Ubuntu server.
  • Scale up, then out. Need more oomph? Add more CPU and RAM. Need even more power? Add more servers.
  • It's just Docker. Kamal itself is mostly a bunch of scripts to automate Docker deploys.
  • Batteries included. Automated TLS certs, zero downtime deploys, secrets (because we all have secrets), healthchecks, logs, and co.

Kamal is a welcome tool in my toolbelt. It helps me deploy containers onto remote machines without introducing the complexity of something like Kubernetes, or having to invest in a managed service.

In this post I'll walk you through my template to deploy a Django monolith, but you can also use it for Rails or Laravel. Most of the building blocks are the same.

The Deploy Config

In Kamal, all resources are defined in a single config file. Similar to a docker-compose.yml file, you define containers with ports, environment variables, entrypoint commands, memory limits, and so on.

You can also define which servers each container should be deployed to (all on one host, or spread across multiple hosts).

Then, when you run kamal deploy, it SSHes into each server, prepares it with any necessary dependencies, builds and pushes the latest images to the Docker registry, and finally performs a zero downtime rollout.
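
For reference, the day-to-day workflow is just a handful of commands. Here's a rough sketch (Kamal ships as a Ruby gem; run kamal help to see exactly what's available in your version):

gem install kamal        # install the CLI
kamal init               # generate a starter config/deploy.yml and .env
kamal setup              # first run: bootstrap servers, boot accessories, deploy
kamal deploy             # subsequent zero downtime deploys
kamal app logs --follow  # tail the app container logs
kamal rollback <version> # roll back to a previously deployed image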

For example, below is the config/deploy.yml file I use to start hacking on my projects on a single VM:

service: example
image: anthonynsimon/example

env:
  secret:
    - SECRET_KEY
    - DATABASE_URL
    - REDIS_URL

traefik:
  image: traefik:v2.9
  options:
    publish:
      - "443:443"
    volume:
      - "/etc/traefik/acme/:/etc/traefik/acme/"
    memory: 500m
  args:
    accesslog.format: common
    entryPoints.web.address: ":80"
    entryPoints.websecure.address: ":443"
    entryPoints.web.http.redirections.entryPoint.to: websecure
    entryPoints.web.http.redirections.entryPoint.scheme: https
    entryPoints.web.http.redirections.entrypoint.permanent: true
    certificatesResolvers.letsencrypt.acme.email: "dev@example.com"
    certificatesResolvers.letsencrypt.acme.storage: "/etc/traefik/acme/letsencrypt.json"
    certificatesResolvers.letsencrypt.acme.dnsChallenge: true
    certificatesResolvers.letsencrypt.acme.dnsChallenge.provider: cloudflare
    certificatesResolvers.letsencrypt.acme.dnsChallenge.resolvers: 1.1.1.1:53
    certificatesResolvers.letsencrypt.acme.caServer: https://acme-v02.api.letsencrypt.org/directory
  env:
    secret:
      - CLOUDFLARE_DNS_API_TOKEN
      - CLOUDFLARE_EMAIL

servers:
  web:
    hosts:
      - 1.2.3.4
    cmd: bin/app web
    healthcheck:
      path: /healthz
      port: 8000
      interval: 5s
    labels:
      traefik.http.routers.djangobase-web.rule: 'Host(`example.anthonynsimon.com`)'
      traefik.http.routers.djangobase-web.tls: true
      traefik.http.routers.djangobase-web.tls.certresolver: letsencrypt
    options:
      memory: 500m

  worker:
    hosts:
      - 1.2.3.4
    cmd: bin/app worker
    options:
      memory: 500m

  scheduler:
    hosts:
      - 1.2.3.4
    cmd: bin/app scheduler
    options:
      memory: 250m

accessories:
  db:
    image: postgres:15
    host: 1.2.3.4
    port: 5432:5432
    env:
      secret:
        - POSTGRES_DB
        - POSTGRES_USER
        - POSTGRES_PASSWORD
    volumes:
      - /var/postgres/data:/var/lib/postgresql/data
    options:
      memory: 500m

  redis:
    image: redis:7.0
    host: 1.2.3.4
    port: 6379:6379
    cmd: --maxmemory 200m --maxmemory-policy allkeys-lru
    volumes:
      - /var/redis/data:/data
    options:
      memory: 250m

registry:
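  # Note: DigitalOcean's registry accepts the API token as both username and password, hence the same secret twice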
  server: registry.digitalocean.com
  username:
    - DOCKER_REGISTRY_PASSWORD
  password:
    - DOCKER_REGISTRY_PASSWORD

builder:
  remote:
    arch: amd64
    host: ssh://1.2.3.4

# Bridge fingerprinted assets, like JS and CSS, between versions to avoid
# hitting 404 on in-flight requests. Combines all files from new and old
# version inside the asset_path.
asset_path: /opt/app/staticfiles

Let's explore some of the interesting bits.

The Dockerfile

Below is the Dockerfile I use for my Django projects. It's a multi-stage build that first builds all static assets in a Node.js container, and then copies them into a leaner Python runtime image.

ARG NODE_VERSION=18
ARG PYTHON_VERSION=3.11

# Build assets
FROM node:$NODE_VERSION as build

WORKDIR /opt/app/

COPY package.json yarn.lock ./
RUN yarn install

COPY . .

RUN yarn build:js
RUN yarn build:css

# Runtime
FROM python:$PYTHON_VERSION-slim as runtime

WORKDIR /opt/app/

RUN groupadd -r nonroot
RUN useradd -g nonroot --no-create-home nonroot

RUN apt-get update -y && apt-get install -y curl postgresql-client

# Setup virtualenv
ENV PATH="/opt/venv/bin:$PATH"
ENV PYTHONUNBUFFERED=1
RUN python -m venv /opt/venv/

# Install dependencies before copying application code to leverage docker build caching
COPY requirements.txt .
RUN pip install -U pip -r requirements.txt

# Copy application code (see .dockerignore for excluded files)
COPY . .
COPY --from=build /opt/app/static/assets/ ./static/assets/

RUN python manage.py collectstatic --clear --noinput

RUN chown -R nonroot:nonroot /opt/app/
USER nonroot

EXPOSE 8000

CMD ["bin/app", "web"]
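
The comment in the Dockerfile refers to a .dockerignore file, which isn't shown in the post. A plausible sketch (the entries are assumptions, adjust for your project):

# .dockerignore (sketch): keep the build context small and secrets out of the image
.git
.env
node_modules
__pycache__
*.pyc
staticfiles/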

Reverse Proxy

The web containers are exposed to the public internet via Kamal's default reverse proxy: Traefik. It seems to be an increasingly popular choice for Docker-based deployments.

In my setup, Traefik's role is to terminate TLS connections and load balance traffic to the corresponding containers.

In the configuration below, I'm telling Traefik to listen on ports 80 and 443, and to forward traffic to the web container. It handles TLS termination and automatic certificate renewal via Let's Encrypt, and will redirect all HTTP traffic to HTTPS.

traefik:
  image: traefik:v2.9
  options:
    publish:
      - "443:443"
    volume:
      - "/etc/traefik/acme/:/etc/traefik/acme/"
    memory: 500m
  args:
    accesslog.format: common
    entryPoints.web.address: ":80"
    entryPoints.websecure.address: ":443"
    entryPoints.web.http.redirections.entryPoint.to: websecure
    entryPoints.web.http.redirections.entryPoint.scheme: https
    entryPoints.web.http.redirections.entrypoint.permanent: true
    certificatesResolvers.letsencrypt.acme.email: "dev@example.com"
    certificatesResolvers.letsencrypt.acme.storage: "/etc/traefik/acme/letsencrypt.json"
    certificatesResolvers.letsencrypt.acme.dnsChallenge: true
    certificatesResolvers.letsencrypt.acme.dnsChallenge.provider: cloudflare
    certificatesResolvers.letsencrypt.acme.dnsChallenge.resolvers: 1.1.1.1:53
    certificatesResolvers.letsencrypt.acme.caServer: https://acme-v02.api.letsencrypt.org/directory
  env:
    secret:
      - CLOUDFLARE_DNS_API_TOKEN
      - CLOUDFLARE_EMAIL

I'm using the DNS challenge to automatically issue certificates via Let's Encrypt, but you could use an HTTP or TLS challenge instead. The Traefik docs have a full section on the different ACME challenges available.
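
For example, switching to the HTTP-01 challenge would mean replacing the dnsChallenge args with something along these lines (a sketch; the Cloudflare env secrets would no longer be needed):

certificatesResolvers.letsencrypt.acme.httpChallenge.entryPoint: web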

Web and Worker Containers

My Django web app is served by a single container, while the worker and scheduler containers run background tasks via Celery.

...
servers:
  web:
    hosts:
      - 1.2.3.4
    cmd: bin/app web
    healthcheck:
      path: /healthz
      port: 8000
      interval: 5s
    labels:
      traefik.http.routers.djangobase-web.rule: 'Host(`example.anthonynsimon.com`)'
      traefik.http.routers.djangobase-web.tls: true
      traefik.http.routers.djangobase-web.tls.certresolver: letsencrypt
    options:
      memory: 500m

  worker:
    hosts:
      - 1.2.3.4
    cmd: bin/app worker
    options:
      memory: 500m

  scheduler:
    hosts:
      - 1.2.3.4
    cmd: bin/app scheduler
    options:
      memory: 250m
...
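
Both Celery processes point at the same Celery app. The entrypoint script below runs celery -A app, so the project module is presumably named app; under that assumption, the conventional Django/Celery wiring looks like this:

# app/celery.py (sketch): conventional Django + Celery setup,
# assuming the project module is named "app" (as in `celery -A app`)
import os

from celery import Celery

os.environ.setdefault("DJANGO_SETTINGS_MODULE", "app.settings")

app = Celery("app")
# Read CELERY_* settings (e.g. the Redis broker URL) from Django settings
app.config_from_object("django.conf:settings", namespace="CELERY")
app.autodiscover_tasks()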

I also specified a healthcheck for the web container. It ensures Traefik stops sending traffic to a container that can't serve requests, and that during a deploy the new container only receives traffic once it reports healthy.
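
The /healthz endpoint itself isn't shown in the post. A minimal Django view for it might look like this (a sketch that also verifies the database connection):

# views.py (sketch): the /healthz endpoint the healthcheck probes
from django.db import connections
from django.http import JsonResponse


def healthz(request):
    # Report unhealthy if the database can't be reached
    try:
        with connections["default"].cursor() as cursor:
            cursor.execute("SELECT 1")
    except Exception:
        return JsonResponse({"status": "unhealthy"}, status=503)
    return JsonResponse({"status": "ok"})

You can then wire it up in your urlconf with path("healthz", healthz).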

I use the same container image for the web, worker and scheduler processes. Which process runs is configured via the cmd option, and they all share a common entrypoint script.

This is what the bin/app <cmd> script looks like:

#!/usr/bin/env bash

set -e

# Dispatch on the first argument; any remaining args are passed through.
if [ "$1" == "web" ]; then
  shift
  echo "Checking DB connection"
  scripts/waitfordb

  echo "Applying migrations"
  python manage.py migrate

  echo "Running production server"
  exec gunicorn -c config/gunicorn.py "$@"
elif [ "$1" == "scheduler" ]; then
  shift
  exec celery -A app beat -l info "$@"
elif [ "$1" == "worker" ]; then
  shift
  exec celery -A app worker -l info "$@"
elif [ "$1" == "manage" ]; then
  shift
  exec python manage.py "$@"
else
  exec "$@"
fi
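
The script also calls scripts/waitfordb, which isn't shown in the post. A plausible sketch using pg_isready (available thanks to the postgresql-client package installed in the Dockerfile):

#!/usr/bin/env bash
# scripts/waitfordb (sketch): block until Postgres accepts connections
set -e

until pg_isready --dbname="$DATABASE_URL" --quiet; do
  echo "Database not ready, retrying..."
  sleep 1
done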

You may have noticed that I'm using Gunicorn to serve my Django web process.

Gunicorn with Uvicorn workers has been my trusty performer for several years now. In past projects, it's proven to be very reliable even at 2,000+ req/s across a couple of cheap instances. More than enough for my needs.
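
The post doesn't include config/gunicorn.py, but a minimal sketch of that pairing could look like this (the module path and worker count are assumptions):

# config/gunicorn.py (sketch): Gunicorn supervising Uvicorn workers
import multiprocessing

# "app" is assumed to be the project module, as in `celery -A app`
wsgi_app = "app.asgi:application"
bind = "0.0.0.0:8000"
workers = multiprocessing.cpu_count() * 2 + 1
worker_class = "uvicorn.workers.UvicornWorker"
accesslog = "-"  # write access logs to stdout for Docker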

Database and Cache

I use Postgres for my main database and Redis for caching and as a job queue.

...
accessories:
  db:
    image: postgres:15
    host: 1.2.3.4
    port: 5432:5432
    env:
      secret:
        - POSTGRES_DB
        - POSTGRES_USER
        - POSTGRES_PASSWORD
    volumes:
      - /var/postgres/data:/var/lib/postgresql/data
    options:
      memory: 500m

  redis:
    image: redis:7.0
    host: 1.2.3.4
    port: 6379:6379
    cmd: --maxmemory 200m --maxmemory-policy allkeys-lru
    volumes:
      - /var/redis/data:/data
    options:
      memory: 250m
...

Stateful services like Postgres and Redis are a bit of a pain to manage, so I usually offload them to a managed service. But it's nice to have the option of running them on a single VM when starting out.

In production, you'd almost certainly want to add automated offsite backups for the Postgres database. Losing data is no fun and can happen.
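
A nightly cron job along these lines is a reasonable starting point (a sketch; the bucket name and tooling are assumptions):

# Sketch: dump the database, compress it, and ship it to object storage
pg_dump "$DATABASE_URL" | gzip | aws s3 cp - "s3://my-db-backups/backup-$(date +%F).sql.gz"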

Encrypted Secrets

I keep my secrets encrypted in-repo. There are a variety of tools to achieve this, but you can also store them in your password manager's vault and turn them into a dotenv file before deploying.

Using them with Kamal comes down to populating a .env file with the decrypted secrets (don't forget to gitignore it), and then referencing them in your deploy config.

For example, given the following .env file:

# unencrypted .env file
SECRET_KEY=supersecret123
DATABASE_URL=postgres://postgres:supersecret456@db:5432/db
REDIS_URL=redis://redis:6379/0
CLOUDFLARE_DNS_API_TOKEN=supersecret789

Kamal will pull the secrets from that file and inject them into the containers at runtime as defined in the deploy config:

...
env:
  secret:
    - SECRET_KEY
    - DATABASE_URL
    - REDIS_URL
...
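
If the encrypted file lives in the repo, the pre-deploy flow might look something like this (a sketch assuming SOPS for encryption; check kamal help for the env commands in your version):

# Sketch: decrypt in-repo secrets, push them to the servers, deploy
sops --decrypt .env.enc > .env   # .env itself is gitignored
kamal env push                   # upload the env vars to the hosts
kamal deploy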

Scaling

Scaling a stateless monolith is as straightforward as a sitcom plot: add more resources to the single VM, or scale horizontally by bringing in more servers.

I recommend starting with a single VM and scaling up as needed. It's simple, and you can get a lot done with a single machine.

[Diagram: vertical scaling]

Then, when you want to scale out, just spin up more servers, stick a load balancer in front to distribute the traffic, and re-deploy your app.

[Diagram: horizontal scaling]

For example, to scale the web containers, I'd create a few more instances on Hetzner and add them as the target hosts for the web server in the deploy.yml config:

...
servers:
  # the web container now runs on 3 new servers
  web:
    hosts:
      - 2.2.2.2
      - 3.3.3.3
      - 4.4.4.4
    cmd: bin/app web
  # the worker container still runs on the original server
  worker:
    hosts:
      - 1.2.3.4
    cmd: bin/app worker
...

From there, rolling out the app to these new servers comes down to:

  1. Run kamal deploy to set up and start the web container on the new hosts.
  2. Create a load balancer to distribute the load across the new hosts.
  3. Point my DNS record to the load balancer instead of the single VM.

Yes, it's a manual process compared to using something like Kubernetes or a managed service. But to me, it's a small price to pay to keep things simple, especially when I don't expect to re-scale often.

Security

This is definitely out of scope for this post, but at a minimum you should:

  • Set up a network firewall to restrict access to your instances.
  • Only allow SSH access via public keys (disable passwords).
  • Ensure your reverse proxy is not trusting client-supplied X-Forwarded-* or X-Real-IP headers (see the settings sketch after this list).
  • Keep your OS and packages up-to-date.
  • Set up rate limiting.
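
On the proxy headers point, the Django side might look like this (a sketch; the hostname matches the Traefik router rule above, and it assumes Traefik overwrites client-supplied forwarded headers):

# settings.py (sketch): only trust forwarded headers set by your own proxy
SECURE_PROXY_SSL_HEADER = ("HTTP_X_FORWARDED_PROTO", "https")
USE_X_FORWARDED_HOST = True
ALLOWED_HOSTS = ["example.anthonynsimon.com"]
CSRF_TRUSTED_ORIGINS = ["https://example.anthonynsimon.com"]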

Keeping your servers and your users' data secure is no joke, and it's probably one of the reasons you should consider using a managed service.

It's not that a managed service exempts you from thinking about these things, but it may already provide solutions for many of them.

Conclusion

In short, if you just want to deploy containers on a remote machine, Kamal might be a nice addition to your toolbelt.

It automates many common steps when deploying containers to one or more remote machines, without introducing the complexity of something like Kubernetes or having to use a managed service.

I like that it's uncomplicated, and that you can deploy anywhere you can spin up an Ubuntu VPS and SSH into it.

