How to Self-Host Gemma on Dokploy (The Right Way)

Fredy Acuna / December 8, 2025 / 7 min read

Self-Host Gemma on Dokploy: A Production-Ready Guide

This guide shows you how to properly self-host Google's Gemma AI model on Dokploy using Ollama. I've corrected several issues from an existing tutorial to make this production-ready with proper networking, persistent storage, and concurrency handling.


What You'll Learn

  • Setting up Gemma with proper Traefik networking (no exposed ports)
  • Configuring persistent storage for models
  • Hardware requirements for production
  • Concurrency settings for multiple users
  • Testing your deployment with curl

Prerequisites

Before starting, ensure you have:

  • A Dokploy instance running (check out How to Install Coolify for a similar self-hosting setup)
  • A VPS with at least 1 vCPU, 1 GB RAM, and 5 GB storage (minimum for development/testing)

Understanding Gemma and Ollama

Gemma is Google's family of open-weight AI models. Ollama is a tool that makes running such models locally simple: it handles downloading them, serving them, and exposing an HTTP API automatically.

When you run ollama serve, it starts an HTTP server on port 11434 that accepts requests and returns AI-generated responses. This is what we'll deploy.
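
If you have Ollama installed on your own machine, you can see this in action with a quick local check (assuming the default port):

curl http://localhost:11434/api/version

A small JSON object with the server version confirms the HTTP server is answering.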


Hardware Requirements

The gemma3:270m model is lightweight (~270MB), so it runs on minimal hardware. Choose your setup based on your use case:

Development/Testing (Survival Mode)

Use this for personal projects or cheap VPS instances:

Resource    Specification
CPU         1 vCPU
RAM         1 GB
Storage     5 GB
GPU         Not required

Note: This handles one user at a time quickly. If two people query at the same time, the second request waits a few seconds.

Production (Frequent Use)

Use this if you expect 5-10 concurrent users or automated bots querying frequently:

Resource    Specification
CPU         2 vCPUs
RAM         2-4 GB
Storage     5-10 GB
GPU         Not required

Why more RAM? Long conversations grow the context window (memory of previous messages), which can spike memory usage. 2GB is the safety zone.

Why 2 vCPUs? The HTTP server handling JSON requests and the inference engine compete for CPU. 2 cores keep the API responsive while the model thinks.

If you want better quality responses, consider larger models like gemma:2b (1.7GB) or gemma:7b (requires more RAM/GPU).


Step 1: Create the Service in Dokploy

  1. Log in to your Dokploy dashboard
  2. Click Create Service → Select Compose
  3. Give it a name like gemma-service

Step 2: Configure Docker Compose

Go to the General tab, then click Raw. Paste the following configuration:

version: '3.8'
services:
  gemma:
    image: ollama/ollama:latest
    container_name: gemma-inference
    restart: unless-stopped
    environment:
      - OLLAMA_HOST=0.0.0.0        # listen on all container interfaces
      - OLLAMA_ORIGINS=*           # allow requests from any origin (tighten for production)
      - OLLAMA_NUM_PARALLEL=4      # serve up to 4 requests concurrently
      - OLLAMA_MAX_LOADED_MODELS=1 # keep only one model loaded in RAM
    networks:
      - dokploy-network
    volumes:
      - ollama_storage:/root/.ollama
    # Uncomment if you have GPU available:
    # deploy:
    #   resources:
    #     reservations:
    #       devices:
    #         - driver: nvidia
    #           count: 1
    #           capabilities: [gpu]

volumes:
  ollama_storage:

networks:
  dokploy-network:
    external: true

Click Save.


Key Configuration Explained

Let's break down what makes this configuration production-ready:

Networking

networks:
  - dokploy-network

Instead of exposing port 11434 directly, we connect to Dokploy's internal Traefik network. This allows you to:

  • Use a proper domain with HTTPS
  • Keep the service internal (more secure)
  • Let Traefik handle SSL termination
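
It also means other services attached to dokploy-network (your backend, for example) can call Ollama directly by container name over Docker's internal DNS, without going through the public domain at all. A minimal sketch, assuming the container_name from the compose file above:

# Run from another container on dokploy-network
curl http://gemma-inference:11434/api/tags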

Persistent Storage

volumes:
  - ollama_storage:/root/.ollama

Without this, you'd lose downloaded models every time the container restarts. The original tutorial missed this—meaning you'd have to re-download the model after every deployment.
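
You can confirm the named volume exists from the host after the first deploy (Docker Compose usually prefixes it with the project name, so the exact name may differ):

docker volume ls | grep ollama_storage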

Concurrency Settings

- OLLAMA_NUM_PARALLEL=4
- OLLAMA_MAX_LOADED_MODELS=1

Variable                      Purpose
OLLAMA_NUM_PARALLEL=4         Allows 4 concurrent requests (4 users at the same time)
OLLAMA_MAX_LOADED_MODELS=1    Keeps only 1 model in memory (saves RAM)
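
A rough way to verify this later, once your domain is configured in Step 3, is to fire several requests at once from your shell and confirm they return without queuing strictly one behind another:

# Send 4 requests in parallel against the placeholder domain used below
for i in 1 2 3 4; do
  curl -s https://gemma.yourdomain.com/api/generate \
    -d '{"model": "gemma3:270m", "prompt": "Reply with one word.", "stream": false}' &
done
wait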

CORS Configuration

- OLLAMA_ORIGINS=*

Allows requests from any origin. Useful if you're calling the API from a frontend application.
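
For production you will likely want to lock this down. Ollama accepts a comma-separated list of allowed origins, so you can swap the wildcard for your actual frontend domains (app.yourdomain.com here is a placeholder):

- OLLAMA_ORIGINS=https://app.yourdomain.com,https://admin.yourdomain.com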


Step 3: Configure the Domain

  1. After saving, go to the Domains tab in your service
  2. Click Add Domain
  3. For the Host field, choose one of these options:

Option A: Generate a traefik.me URL (Recommended for Testing)

Click the Generate button in Dokploy. It will automatically create a URL like:

main-ollama-wv9tts-9dc2f9-209-112-91-61.traefik.me

This gives you instant HTTPS without any DNS configuration.

Option B: Use Your Own Domain

Enter your subdomain: gemma.yourdomain.com

Make sure you have a DNS A record pointing to your Dokploy server's IP.

  4. Set the Container Port to 11434 (this is the port Ollama exposes internally)
  5. Leave Path as /
  6. Enable HTTPS for SSL (automatic with traefik.me or your own domain with Let's Encrypt)
  7. Click Save
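
If you went with Option B, you can confirm the A record resolves to your server before deploying:

dig +short gemma.yourdomain.com
# should print your Dokploy server's public IP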

Step 4: Deploy and Download the Model

  1. Click Deploy to start the container
  2. Wait for the deployment to complete (check the logs)

Now, here's the critical step the original tutorial got right—you must download the model manually:

  1. Go to Docker in the Dokploy sidebar
  2. Find the gemma-inference container
  3. Click the three dots → Terminal
  4. Run the following command:
ollama pull gemma3:270m

Wait for the download to complete. You can verify it worked with:

ollama list

You should see:

NAME           ID          SIZE     MODIFIED
gemma3:270m    abc123...   270MB    2 minutes ago

Step 5: Test Your Deployment

Visit your domain in a browser. You should see:

Ollama is running

Now test the API with curl:

curl https://gemma.yourdomain.com/api/generate -d '{
  "model": "gemma3:270m",
  "prompt": "Why is the sky blue?",
  "stream": false
}'

You should receive a JSON response with the AI-generated answer.
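
With "stream": false the reply is a single JSON object whose generated text lives in the response field, so if you have jq installed you can pull out just the answer:

curl -s https://gemma.yourdomain.com/api/generate -d '{
  "model": "gemma3:270m",
  "prompt": "Why is the sky blue?",
  "stream": false
}' | jq -r '.response'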


API Usage Examples

Basic Generation

curl -X POST https://gemma.yourdomain.com/api/generate \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma3:270m",
    "prompt": "Explain Docker in one sentence.",
    "stream": false
  }'

With Temperature Control

curl -X POST https://gemma.yourdomain.com/api/generate \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma3:270m",
    "prompt": "Write a haiku about programming.",
    "stream": false,
    "options": {
      "temperature": 0.7,
      "num_predict": 50
    }
  }'

Chat Format

curl -X POST https://gemma.yourdomain.com/api/chat \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma3:270m",
    "messages": [
      {"role": "user", "content": "What is machine learning?"}
    ],
    "stream": false
  }'
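
The chat endpoint is stateless, so to keep a conversation going you resend the previous turns in the messages array. A minimal follow-up request might look like this (the assistant content shown is just an illustrative earlier reply):

curl -X POST https://gemma.yourdomain.com/api/chat \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma3:270m",
    "messages": [
      {"role": "user", "content": "What is machine learning?"},
      {"role": "assistant", "content": "Machine learning lets computers learn patterns from data instead of being explicitly programmed."},
      {"role": "user", "content": "Give me a one-sentence example."}
    ],
    "stream": false
  }'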

Troubleshooting

Model Not Found Error

If you get model not found, the model was never downloaded. Open the container terminal in Dokploy (as in Step 4) and run:

ollama pull gemma3:270m

Container Keeps Restarting

Check the logs in Dokploy. Common causes:

  • Out of memory: Increase RAM or use a smaller model
  • Volume permissions: The ollama_storage volume may have permission issues

Slow Responses

  • For single-user use, 1 vCPU and 1 GB RAM is sufficient
  • For concurrent users, upgrade to 2 vCPUs and 2-4 GB RAM
  • Reduce OLLAMA_NUM_PARALLEL if RAM is limited (try OLLAMA_NUM_PARALLEL=1 or 2)
  • Consider using a GPU for larger models

Cannot Access Domain

  • Verify the dokploy-network is external and exists
  • Check that port 11434 is set as the container port in Domains
  • Ensure Traefik is running properly
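
The network and Traefik checks can be done quickly from the host shell:

# Does the external network exist?
docker network ls | grep dokploy-network

# Is Traefik running? (the exact container name depends on your Dokploy install)
docker ps --filter name=traefik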

Upgrading to Larger Models

Once your setup is working, you can easily switch models:

# Inside the container terminal
ollama pull gemma:2b      # 1.7 GB, better quality
ollama pull gemma:7b      # 4.2 GB, requires more RAM
ollama pull llama3.2:3b   # Alternative model

Update your API calls to use the new model name.
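
If disk space is tight (the survival-mode VPS above only has 5 GB), you can also remove a model you no longer use from the same container terminal:

ollama rm gemma3:270m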


Security Considerations

For production deployments:

  1. Add authentication: Consider placing an authentication proxy in front of Ollama
  2. Rate limiting: Use Traefik middleware to prevent abuse
  3. Restrict CORS: Change OLLAMA_ORIGINS=* to specific domains
  4. Monitor usage: Set up logging to track API calls

Conclusion

You now have a production-ready Gemma AI service running on Dokploy with:

  • Proper Traefik networking with HTTPS
  • Persistent model storage
  • Concurrent request handling
  • A clean API accessible via your domain

This setup is significantly more robust than exposing ports directly and handles real-world usage patterns.


Related Resources

  • Ollama Documentation
  • Dokploy Documentation
  • Gemma Model Card
  • Traefik Documentation
