
Fredy Acuna / December 8, 2025 / 7 min read
This guide shows you how to properly self-host Google's Gemma AI model on Dokploy using Ollama. I've corrected several issues from an existing tutorial to make this production-ready with proper networking, persistent storage, and concurrency handling.
Before starting, ensure you have a Dokploy instance up and running on a server that meets the hardware requirements below.
Gemma is Google's open-source AI model family. Ollama is a tool that makes running AI models locally simple—it handles downloading, serving, and API endpoints automatically.
When you run `ollama serve`, it starts an HTTP server on port 11434 that accepts requests and returns AI-generated responses. This is what we'll deploy.
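Once that server is up, a plain GET to the root path returns a simple health message, and `/api/tags` lists the models that are installed. A quick sketch, assuming a local `ollama serve` on the default port:

```bash
# Health check: returns "Ollama is running"
curl http://localhost:11434/

# List locally installed models (empty until you pull one)
curl http://localhost:11434/api/tags
```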
The gemma3:270m model is lightweight (~270MB), so it runs on minimal hardware. Choose your setup based on your use case:
Use this for personal projects or cheap VPS instances:
| Resource | Specification |
|---|---|
| CPU | 1 vCPU |
| RAM | 1 GB |
| Storage | 5 GB |
| GPU | Not required |
Note: This handles 1 user quickly. If 2 people query at the same time, the second waits a few seconds.
Use this if you expect 5-10 concurrent users or automated bots querying frequently:
| Resource | Specification |
|---|---|
| CPU | 2 vCPUs |
| RAM | 2-4 GB |
| Storage | 5-10 GB |
| GPU | Not required |
Why more RAM? Long conversations grow the context window (memory of previous messages), which can spike memory usage. 2GB is the safety zone.
Why 2 vCPUs? The HTTP server handling JSON requests and the inference engine compete for CPU. 2 cores keep the API responsive while the model thinks.
If you want better quality responses, consider larger models like gemma:2b (1.7GB) or gemma:7b (requires more RAM/GPU).
In Dokploy, create a new Compose service and name it `gemma-service`. Go to the General tab, then click Raw. Paste the following configuration:
```yaml
version: '3.8'

services:
  gemma:
    image: ollama/ollama:latest
    container_name: gemma-inference
    restart: unless-stopped
    environment:
      - OLLAMA_HOST=0.0.0.0
      - OLLAMA_ORIGINS=*
      - OLLAMA_NUM_PARALLEL=4
      - OLLAMA_MAX_LOADED_MODELS=1
    networks:
      - dokploy-network
    volumes:
      - ollama_storage:/root/.ollama
    # Uncomment if you have a GPU available:
    # deploy:
    #   resources:
    #     reservations:
    #       devices:
    #         - driver: nvidia
    #           count: 1
    #           capabilities: [gpu]

volumes:
  ollama_storage:

networks:
  dokploy-network:
    external: true
```
Click Save.
Let's break down what makes this configuration production-ready:
```yaml
networks:
  - dokploy-network
```
Instead of exposing port 11434 directly, we connect to Dokploy's internal Traefik network. Requests reach Ollama through Traefik (with HTTPS), and the port never has to be published on the host.
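After deployment, you can confirm the container actually joined the shared network from the host. A quick check, assuming the default `dokploy-network` name and the `gemma-inference` container name from the compose file:

```bash
# List the containers attached to Dokploy's shared Traefik network
docker network inspect dokploy-network \
  --format '{{range .Containers}}{{.Name}}{{"\n"}}{{end}}'
```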
```yaml
volumes:
  - ollama_storage:/root/.ollama
```
Without this, you'd lose downloaded models every time the container restarts. The original tutorial missed this—meaning you'd have to re-download the model after every deployment.
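Once you've pulled a model, you can verify the volume exists and that the model files actually land in it. A quick sketch from the host (Dokploy may prefix the volume with the compose project name, so adjust the grep if needed):

```bash
# The named volume should survive container restarts and redeploys
docker volume ls | grep ollama_storage

# Model blobs and manifests live under /root/.ollama inside the container
docker exec gemma-inference ls /root/.ollama/models
```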
```yaml
- OLLAMA_NUM_PARALLEL=4
- OLLAMA_MAX_LOADED_MODELS=1
```

| Variable | Purpose |
|---|---|
| `OLLAMA_NUM_PARALLEL=4` | Allows 4 concurrent requests (4 users at the same time) |
| `OLLAMA_MAX_LOADED_MODELS=1` | Keeps only 1 model in memory (saves RAM) |
```yaml
- OLLAMA_ORIGINS=*
```
Allows requests from any origin. Useful if you're calling the API from a frontend application.
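You can simulate a browser call by sending an Origin header yourself. With the wildcard setting the request should be accepted; if you later restrict `OLLAMA_ORIGINS`, the same request from an unlisted origin should be rejected. A rough check, using the example domain configured later in this guide:

```bash
# Simulate a cross-origin browser request; with OLLAMA_ORIGINS=* it is accepted
curl -i https://gemma.yourdomain.com/api/generate \
  -H "Origin: https://app.example.com" \
  -H "Content-Type: application/json" \
  -d '{"model": "gemma3:270m", "prompt": "ping", "stream": false}'
```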
Option A: Generate a traefik.me URL (Recommended for Testing)
Click the Generate button in Dokploy. It will automatically create a URL like:
main-ollama-wv9tts-9dc2f9-209-112-91-61.traefik.me
This gives you instant HTTPS without any DNS configuration.
Option B: Use Your Own Domain
Enter your subdomain: gemma.yourdomain.com
Make sure you have a DNS A record pointing to your Dokploy server's IP.
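You can confirm the record resolves before assigning the domain in Dokploy (a quick check with standard DNS tooling; substitute your real subdomain):

```bash
# Should print your Dokploy server's public IP
dig +short gemma.yourdomain.com
```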
Set the container port to 11434 (this is the port Ollama exposes internally). Now, here's the critical step the original tutorial got right: you must download the model manually.
Open a terminal in the `gemma-inference` container and run:

```bash
ollama pull gemma3:270m
```
Wait for the download to complete. You can verify it worked with:
```bash
ollama list
```
You should see:
```
NAME            ID          SIZE     MODIFIED
gemma3:270m     abc123...   270MB    2 minutes ago
```
Visit your domain in a browser. You should see:
```
Ollama is running
```
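The same check works from the command line, and `/api/tags` confirms the model you pulled is visible over HTTPS (assuming the domain configured earlier):

```bash
# Root path returns the health message
curl https://gemma.yourdomain.com/

# Lists installed models; gemma3:270m should appear here
curl https://gemma.yourdomain.com/api/tags
```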
Now test the API with curl:
```bash
curl https://gemma.yourdomain.com/api/generate -d '{
  "model": "gemma3:270m",
  "prompt": "Why is the sky blue?",
  "stream": false
}'
```
You should receive a JSON response with the AI-generated answer.
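The generated text comes back in the `response` field of that JSON, so if you have `jq` installed you can extract it directly. A small convenience sketch:

```bash
# Print only the generated answer
curl -s https://gemma.yourdomain.com/api/generate -d '{
  "model": "gemma3:270m",
  "prompt": "Why is the sky blue?",
  "stream": false
}' | jq -r '.response'
```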
```bash
curl -X POST https://gemma.yourdomain.com/api/generate \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma3:270m",
    "prompt": "Explain Docker in one sentence.",
    "stream": false
  }'
```
```bash
curl -X POST https://gemma.yourdomain.com/api/generate \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma3:270m",
    "prompt": "Write a haiku about programming.",
    "stream": false,
    "options": {
      "temperature": 0.7,
      "num_predict": 50
    }
  }'
```
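Since the compose file sets `OLLAMA_NUM_PARALLEL=4`, you can also sanity-check concurrent handling by firing a few requests at once and confirming they all complete. A rough sketch:

```bash
# Fire 4 generate requests in parallel and wait for all of them
for i in 1 2 3 4; do
  curl -s https://gemma.yourdomain.com/api/generate -d '{
    "model": "gemma3:270m",
    "prompt": "Give me one fun fact.",
    "stream": false
  }' > "response_$i.json" &
done
wait
echo "All 4 requests completed"
```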
```bash
curl -X POST https://gemma.yourdomain.com/api/chat \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma3:270m",
    "messages": [
      {"role": "user", "content": "What is machine learning?"}
    ],
    "stream": false
  }'
```
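Both endpoints also support streaming: with `"stream": true`, Ollama returns newline-delimited JSON chunks as tokens are generated, ending with an object marked `"done": true`. A quick way to watch it:

```bash
# Stream the answer token by token as newline-delimited JSON objects
curl -N https://gemma.yourdomain.com/api/chat \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma3:270m",
    "messages": [
      {"role": "user", "content": "What is machine learning?"}
    ],
    "stream": true
  }'
```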
If you get a `model not found` error, you forgot to download it. Open a terminal in the container and run:

```bash
ollama pull gemma3:270m
```
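You can also confirm which models are present from the host without opening a container terminal (assuming the `gemma-inference` container name from the compose file):

```bash
docker exec gemma-inference ollama list
```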
Check the logs in Dokploy. Common causes:
- Lower `OLLAMA_NUM_PARALLEL` if RAM is limited (try `OLLAMA_NUM_PARALLEL=1` or `2`)
- Confirm that `dokploy-network` is external and exists
- Confirm that `11434` is set as the container port in Domains

Once your setup is working, you can easily switch models:
```bash
# Inside the container terminal
ollama pull gemma:2b       # 1.7 GB, better quality
ollama pull gemma:7b       # 4.2 GB, requires more RAM
ollama pull llama3.2:3b    # Alternative model
```
Update your API calls to use the new model name.
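If you're on the smaller 5 GB volume, it's also worth removing models you no longer use; `ollama rm` deletes a model's files from the storage volume (run it in the container terminal, like the pulls above):

```bash
# Free up space by deleting the small model once you've moved to a larger one
ollama rm gemma3:270m
```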
For production deployments, restrict `OLLAMA_ORIGINS=*` to the specific domains that should be allowed to call the API (for example, `OLLAMA_ORIGINS=https://app.example.com`).

You now have a production-ready Gemma AI service running on Dokploy with internal Traefik networking, persistent model storage, and concurrency handling.
This setup is significantly more robust than exposing ports directly and handles real-world usage patterns.