- Consider increasing workers for parallel processing
- Use batching for bulk operations
+## GPU Deployment
+
+For higher throughput, you can run this service on a GPU. GPU inference is typically 10-50x faster than CPU for these embedding models.
+
+### Local GPU (Docker)
+
+```bash
+# Build GPU image
+docker build -f Dockerfile.gpu -t rspamd-embedding-service:gpu .
+
+# Run with GPU access
+docker run --gpus all -p 8080:8080 rspamd-embedding-service:gpu
+
+# With a larger model (the GPU has more memory available)
+docker run --gpus all -p 8080:8080 \
+ -e EMBEDDING_MODEL="BAAI/bge-large-en-v1.5" \
+ rspamd-embedding-service:gpu
+```
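+
+Once the container is running, you can sanity-check it from the host. This assumes the GPU image exposes the same endpoints as the bundled service (a `/health` route and an Ollama-compatible `/api/embeddings` route) on port 8080:
+
+```bash
+# Confirm the model loaded and which device it is using
+curl http://localhost:8080/health
+
+# Request a single embedding via the Ollama-compatible endpoint
+curl -s http://localhost:8080/api/embeddings \
+  -H "Content-Type: application/json" \
+  -d '{"model": "BAAI/bge-large-en-v1.5", "prompt": "test message"}'
+```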
+
+### Vast.ai Cloud GPU
+
+[Vast.ai](https://vast.ai) provides affordable GPU rentals ($0.10-0.50/hr). This is useful for:
+- Testing GPU performance before buying hardware
+- Burst capacity during high-volume periods
+- Running larger models that need more VRAM
+
+#### Quick Start
+
+```bash
+# Install vast.ai CLI
+pip install vastai
+
+# Set your API key (get from https://vast.ai/console/account/)
+vastai set api-key YOUR_API_KEY
+
+# Search for available GPUs
+./vastai-launch.sh --search-only
+
+# Launch an instance
+./vastai-launch.sh --model "BAAI/bge-small-en-v1.5" --gpu RTX_3090
+```
+
+#### Launch Script Options
+
+```bash
+./vastai-launch.sh [options]
+
+Options:
+ --model MODEL Embedding model (default: intfloat/multilingual-e5-large)
+ --gpu GPU_TYPE GPU type filter (default: RTX_3090)
+ --max-price MAX Maximum $/hr (default: 0.30)
+ --disk DISK_GB Disk space in GB (default: 20)
+ --search-only Only search for instances, don't launch
+ --show-url ID Show service URL for a running instance
+```
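+
+For example, to target a faster card for a larger model while raising the price cap (the values here are illustrative, not recommendations):
+
+```bash
+./vastai-launch.sh \
+  --model "intfloat/multilingual-e5-large" \
+  --gpu RTX_4090 \
+  --max-price 0.50 \
+  --disk 30
+```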
+
+#### Getting the Service URL
+
+After launching, get your service URL:
+
+```bash
+# Option 1: Use the helper
+./vastai-launch.sh --show-url <INSTANCE_ID>
+
+# Option 2: Manual lookup
+vastai show instance <INSTANCE_ID>
+# Look for: 8080/tcp -> 0.0.0.0:XXXXX
+# Your URL is: http://<PUBLIC_IP>:XXXXX
+```
+
+**Important:** The mapping for container port 22 is SSH, NOT your service port. Look for the mapping of container port 8080.
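+
+Once you have the mapped port, a quick health check confirms the service is reachable (substitute the IP and port from the step above):
+
+```bash
+curl http://<PUBLIC_IP>:XXXXX/health
+# The response should report "status": "ok" along with the model name and CUDA device
+```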
+
+#### Manual Vast.ai Setup
+
+1. Go to [vast.ai/console/create](https://vast.ai/console/create/)
+2. Select a GPU instance (RTX 3090 or better recommended)
+3. Choose `pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime` as the image
+4. In the on-start script, add:
+
+```bash
+pip install uv
+uv pip install --system "numpy<2" "transformers==4.40.0" "sentence-transformers==2.7.0" fastapi uvicorn pydantic
+# Copy embedding_service.py to /root/
+EMBEDDING_MODEL="intfloat/multilingual-e5-large" python /root/embedding_service.py
+```
+
+5. After the instance starts, find your service URL:
+ ```bash
+ # List your instances
+ vastai show instances
+
+ # Get instance details (replace ID with your instance ID)
+ vastai show instance <ID>
+
+ # Look for port mapping like: 8080/tcp -> 0.0.0.0:41234
+ # Your service URL is: http://<PUBLIC_IP>:41234
+ ```
+
+6. Configure Rspamd to use `http://<PUBLIC_IP>:<MAPPED_PORT>/api/embeddings`
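+
+   The provider block mirrors what the launch script prints after creating an instance; replace the placeholders with your instance's public IP and mapped port:
+
+   ```
+   neural {
+     rules {
+       default {
+         providers = [
+           {
+             type = "llm";
+             llm_type = "ollama";
+             model = "intfloat/multilingual-e5-large";
+             url = "http://<PUBLIC_IP>:<MAPPED_PORT>/api/embeddings";
+           }
+         ];
+       }
+     }
+   }
+   ```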
+
+**Note:** Vast.ai maps container ports to random high ports. The mapping for container port 22 is SSH access and is separate from your service port (container port 8080 mapped to something like 41234).
+
+#### Recommended GPU Instances
+
+| GPU | VRAM | Price | Use Case |
+|-----|------|-------|----------|
+| RTX 3090 | 24GB | $0.15-0.30/hr | Best value, handles all models |
+| RTX 4090 | 24GB | $0.40-0.60/hr | Faster inference |
+| A100 | 40-80GB | $1.00-2.00/hr | Very large models, batch processing |
+
+#### Cost Estimation
+
+| Volume | GPU Cost | Notes |
+|--------|----------|-------|
+| 10K emails/day | ~$3-7/month | RTX 3090, shared instance |
+| 100K emails/day | ~$20-50/month | Dedicated RTX 3090 |
+| 1M emails/day | ~$150-300/month | Multiple GPUs or A100 |
+
+### GPU Requirements
+
+| Model | VRAM | Dims | Notes |
+|-------|------|------|-------|
+| `intfloat/multilingual-e5-large` | 2GB | 1024 | **Recommended** - 100+ languages, excellent Russian |
+| `sentence-transformers/paraphrase-multilingual-mpnet-base-v2` | 1GB | 768 | Good multilingual, smaller |
+| `BAAI/bge-base-en-v1.5` | 1GB | 768 | English only, fast |
+| `BAAI/bge-large-en-v1.5` | 2GB | 1024 | English only, high quality |
+
+### Multilingual Models (Recommended for GPU)
+
+For multilingual support including Russian, use `intfloat/multilingual-e5-large`:
+- 1024-dim embeddings
+- Supports 100+ languages with excellent Russian performance
+- State-of-the-art on multilingual benchmarks
+
+```bash
+# Use multilingual-e5-large (default for vast.ai script)
+./vastai-launch.sh --model "intfloat/multilingual-e5-large"
+```
+
## License
Apache License 2.0
--- /dev/null
+#!/bin/bash
+# Rspamd Neural Embedding Service - Vast.ai Launch Script
+#
+# This script helps launch the embedding service on vast.ai GPU instances.
+#
+# Prerequisites:
+# 1. Install vastai CLI: pip install vastai
+# 2. Set API key: vastai set api-key YOUR_API_KEY
+#
+# Usage:
+# ./vastai-launch.sh [options]
+#
+# Options:
+# --model MODEL Embedding model (default: intfloat/multilingual-e5-large)
+# --gpu GPU_TYPE GPU type filter (default: RTX_3090)
+# --max-price MAX Maximum $/hr (default: 0.30)
+# --disk DISK_GB Disk space in GB (default: 20)
+# --search-only Only search for instances, don't launch
+# --show-url ID Show service URL for a running instance
+# --help Show this help message
+
+set -e
+
+# Defaults - use multilingual-e5-large for GPU (supports Russian and 100+ languages)
+MODEL="${EMBEDDING_MODEL:-intfloat/multilingual-e5-large}"
+GPU_TYPE="RTX_3090"
+MAX_PRICE="0.30"
+DISK_GB="20"
+SEARCH_ONLY=false
+
+# Parse arguments
+while [[ $# -gt 0 ]]; do
+ case $1 in
+ --model)
+ MODEL="$2"
+ shift 2
+ ;;
+ --gpu)
+ GPU_TYPE="$2"
+ shift 2
+ ;;
+ --max-price)
+ MAX_PRICE="$2"
+ shift 2
+ ;;
+ --disk)
+ DISK_GB="$2"
+ shift 2
+ ;;
+ --search-only)
+ SEARCH_ONLY=true
+ shift
+ ;;
+ --show-url)
+ # Show service URL for a running instance
+ if [ -z "$2" ]; then
+ echo "Usage: $0 --show-url <INSTANCE_ID>"
+ exit 1
+ fi
+ echo "Getting connection info for instance $2..."
+ # With set -e, a failing command substitution would abort before the status check,
+ # so test the assignment directly instead of inspecting $? afterwards
+ if ! INFO=$(vastai show instance "$2" --raw 2>/dev/null); then
+ echo "Error: Could not get instance info"
+ exit 1
+ fi
+
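+ # Pull connection details out of the raw JSON output with grep (avoids a jq dependency)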
+ SSH_HOST=$(echo "$INFO" | grep -oE '"ssh_host": "[^"]+"' | cut -d'"' -f4)
+ SSH_PORT=$(echo "$INFO" | grep -oE '"ssh_port": [0-9]+' | grep -oE '[0-9]+')
+ PUBLIC_IP=$(echo "$INFO" | grep -oE '"public_ipaddr": "[^"]+"' | cut -d'"' -f4)
+ STATUS=$(echo "$INFO" | grep -oE '"actual_status": "[^"]+"' | cut -d'"' -f4)
+
+ echo ""
+ echo "Instance Status: $STATUS"
+ echo "Public IP: $PUBLIC_IP"
+ echo "SSH: ssh -p $SSH_PORT root@$SSH_HOST"
+ echo ""
+ echo "=== Access Methods ==="
+ echo ""
+ echo "Option 1: SSH Tunnel (recommended for testing)"
+ echo " Run this in a separate terminal:"
+ echo " ssh -L 8080:localhost:8080 -p $SSH_PORT root@$SSH_HOST"
+ echo " Then access: http://localhost:8080/health"
+ echo ""
+ echo "Option 2: Direct access via public IP"
+ echo " First, SSH in and check if the service is running:"
+ echo " ssh -p $SSH_PORT root@$SSH_HOST"
+ echo " curl localhost:8080/health"
+ echo ""
+ echo " If running, access via: http://$PUBLIC_IP:8080"
+ echo " (Note: May require firewall/port config on vast.ai)"
+ echo ""
+ exit 0
+ ;;
+ --help)
+ # Print the header comment (script lines 2-20) as help text
+ sed -n '2,20p' "$0"
+ exit 0
+ ;;
+ *)
+ echo "Unknown option: $1"
+ exit 1
+ ;;
+ esac
+done
+
+# Check vastai CLI
+if ! command -v vastai &> /dev/null; then
+ echo "Error: vastai CLI not found. Install with: pip install vastai"
+ exit 1
+fi
+
+# Startup script that runs inside the vast.ai instance
+ONSTART_SCRIPT=$(cat << 'SCRIPT'
+#!/bin/bash
+set -e
+
+# Install dependencies using uv (10-100x faster than pip)
+pip install uv
+# Use sentence-transformers with PyTorch CUDA (pinned versions for compatibility)
+uv pip install --system "numpy<2" "transformers==4.40.0" "sentence-transformers==2.7.0" fastapi uvicorn[standard] pydantic
+
+# Create embedding service
+cat > /root/embedding_service.py << 'EOF'
+import os
+import logging
+from typing import List, Union, Optional
+from fastapi import FastAPI, HTTPException
+from pydantic import BaseModel
+import uvicorn
+
+# Use sentence-transformers with PyTorch CUDA
+from sentence_transformers import SentenceTransformer
+
+logging.basicConfig(level=logging.INFO)
+logger = logging.getLogger(__name__)
+
+app = FastAPI(title="Rspamd Embedding Service (GPU)")
+
+# Configuration
+MODEL_NAME = os.environ.get("EMBEDDING_MODEL", "intfloat/multilingual-e5-large")
+
+# Initialize model on CUDA
+logger.info(f"Loading model {MODEL_NAME} on CUDA...")
+model = SentenceTransformer(MODEL_NAME, device="cuda")
+logger.info(f"Loaded {MODEL_NAME} on {model.device}")
+
+class OllamaRequest(BaseModel):
+ model: str
+ prompt: str
+
+class OllamaResponse(BaseModel):
+ embedding: List[float]
+
+class OpenAIRequest(BaseModel):
+ model: str
+ input: Union[str, List[str]]
+
+class EmbeddingData(BaseModel):
+ embedding: List[float]
+ index: int
+ object: str = "embedding"
+
+class OpenAIResponse(BaseModel):
+ object: str = "list"
+ data: List[EmbeddingData]
+ model: str
+ usage: dict
+
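+# Encode a batch of texts on the configured device and return plain Python lists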
+def get_embeddings(texts: List[str]) -> List[List[float]]:
+ embs = model.encode(texts, convert_to_numpy=True)
+ return [emb.tolist() for emb in embs]
+
+@app.get("/health")
+async def health():
+ return {"status": "ok", "model": MODEL_NAME, "device": str(model.device)}
+
+@app.post("/api/embeddings", response_model=OllamaResponse)
+async def ollama_embeddings(request: OllamaRequest):
+ try:
+ embeddings = get_embeddings([request.prompt])
+ return OllamaResponse(embedding=embeddings[0])
+ except Exception as e:
+ logger.error(f"Embedding error: {e}")
+ raise HTTPException(status_code=500, detail=str(e))
+
+@app.post("/v1/embeddings", response_model=OpenAIResponse)
+async def openai_embeddings(request: OpenAIRequest):
+ try:
+ texts = [request.input] if isinstance(request.input, str) else request.input
+ embeddings = get_embeddings(texts)
+ data = [EmbeddingData(embedding=emb, index=i) for i, emb in enumerate(embeddings)]
+ return OpenAIResponse(
+ data=data,
+ model=request.model,
+ usage={"prompt_tokens": len(texts), "total_tokens": len(texts)}
+ )
+ except Exception as e:
+ logger.error(f"Embedding error: {e}")
+ raise HTTPException(status_code=500, detail=str(e))
+
+if __name__ == "__main__":
+ port = int(os.environ.get("EMBEDDING_PORT", "8080"))
+ host = os.environ.get("EMBEDDING_HOST", "0.0.0.0")
+ uvicorn.run(app, host=host, port=port)
+EOF
+
+# Start service in the background and keep a log for debugging
+cd /root
+EMBEDDING_MODEL="${EMBEDDING_MODEL}" nohup python embedding_service.py > /root/embedding_service.log 2>&1 &
+
+echo "Embedding service started on port 8080"
+SCRIPT
+)
+
+echo "=== Rspamd Embedding Service - Vast.ai Launcher ==="
+echo "Model: $MODEL"
+echo "GPU: $GPU_TYPE"
+echo "Max price: \$$MAX_PRICE/hr"
+echo ""
+
+# Search for available instances
+echo "Searching for available instances..."
+QUERY="gpu_name=$GPU_TYPE rentable=true dph<$MAX_PRICE disk_space>=$DISK_GB cuda_vers>=12.0"
+
+vastai search offers "$QUERY" --order 'dph' | head -20
+
+if [ "$SEARCH_ONLY" = true ]; then
+ echo ""
+ echo "Search only mode. To launch, run without --search-only"
+ exit 0
+fi
+
+echo ""
+read -p "Enter instance ID to rent (or 'q' to quit): " INSTANCE_ID
+
+if [ "$INSTANCE_ID" = "q" ]; then
+ echo "Aborted."
+ exit 0
+fi
+
+# Create the instance
+echo "Creating instance $INSTANCE_ID..."
+vastai create instance "$INSTANCE_ID" \
+ --image pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime \
+ --disk "$DISK_GB" \
+ --env "EMBEDDING_MODEL=$MODEL" \
+ --onstart-cmd "$ONSTART_SCRIPT"
+
+echo ""
+echo "Instance created! Monitor with: vastai show instances"
+echo ""
+echo "=== Finding your service URL ==="
+echo ""
+echo "1. Wait for instance to be 'running': vastai show instances"
+echo ""
+echo "2. Get the public URL (port 8080 is mapped to a random port):"
+echo " vastai show instance <INSTANCE_ID>"
+echo ""
+echo " Look for 'ports' section, e.g.:"
+echo " 8080/tcp -> 0.0.0.0:41234"
+echo " This means your service is at: http://<PUBLIC_IP>:41234"
+echo ""
+echo "3. Or use SSH tunnel for testing:"
+echo " vastai ssh-url <INSTANCE_ID>"
+echo " ssh -L 8080:localhost:8080 <SSH_COMMAND>"
+echo " Then use: http://localhost:8080"
+echo ""
+echo "4. Configure Rspamd with the public URL:"
+echo ""
+echo " neural {"
+echo " rules {"
+echo " default {"
+echo " providers = ["
+echo " {"
+echo " type = \"llm\";"
+echo " llm_type = \"ollama\";"
+echo " model = \"$MODEL\";"
+echo " url = \"http://<PUBLIC_IP>:<MAPPED_PORT>/api/embeddings\";"
+echo " }"
+echo " ];"
+echo " }"
+echo " }"
+echo " }"
+echo ""
+echo "5. Test the endpoint:"
+echo " curl http://<PUBLIC_IP>:<MAPPED_PORT>/health"