Advanced RAG Pipeline

Hybrid retrieval-augmented generation with ColBERT v2 reranking, BM25 + vector search, and smart chunking. Fully local, no API keys required.

Quick Start

Install dependencies:
```
pip install "sygen[rag]"
```
Enable in config.json:
```
{
  "rag": {
    "enabled": true
  }
}
```
Restart the bot. RAG initializes lazily on the first message.

By default, RAG uses lightweight BM25 + ONNX vector search (no PyTorch). The reranker is disabled by default — enable it when your memory exceeds ~500 facts (see Scaling Guide).

How It Works

User Message
    |
    v
Query Expansion (keyword extraction, bigrams)
    |
    v
+---+---+
|       |
BM25    Vector Search
|       | (ChromaDB)
+---+---+
    |
    v
RRF Fusion (Reciprocal Rank Fusion)
    |
    v
ColBERT v2 Reranking (multilingual)
    |
    v
Context Injection (into system prompt)

Each user message triggers the pipeline automatically. Relevant context from memory and workspace files is injected into the agent's prompt.

Components

Smart Chunking

Documents are split into semantic chunks that respect natural boundaries:

- Paragraph breaks (double newlines)
- Markdown headings
- Sentence endings
- Configurable overlap between chunks

{
  "rag": {
    "chunk_size": 512,
    "chunk_overlap": 64,
    "min_chunk_size": 50
  }
}
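
The chunking behaviour described above can be illustrated with a minimal sketch. It is not the project's `SmartChunker`: the function name `chunk_text` is hypothetical, only paragraph breaks are used as boundaries (headings and sentence endings are left out for brevity), and sizes are measured in characters as in the configuration reference.

```python
import re


def chunk_text(text, chunk_size=512, chunk_overlap=64, min_chunk_size=50):
    """Hypothetical sketch: pack paragraphs into chunks of about chunk_size characters."""
    # Paragraph breaks (double newlines) are the only boundary used in this sketch.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

    # Pass 1: greedily pack whole paragraphs; an oversized paragraph is kept whole.
    packed = []
    for para in paragraphs:
        if packed and len(packed[-1]) + 2 + len(para) <= chunk_size:
            packed[-1] = f"{packed[-1]}\n\n{para}"
        else:
            packed.append(para)

    # Pass 2: merge a trailing fragment below min_chunk_size into its predecessor.
    if len(packed) > 1 and len(packed[-1]) < min_chunk_size:
        tail = packed.pop()
        packed[-1] = f"{packed[-1]}\n\n{tail}"

    # Pass 3: prefix every chunk except the first with the tail of the previous one.
    chunks = []
    for i, body in enumerate(packed):
        overlap = packed[i - 1][-chunk_overlap:] if i > 0 and chunk_overlap > 0 else ""
        chunks.append(overlap + body)
    return chunks
```

Packing whole paragraphs first and adding overlap last keeps the overlap from influencing where boundaries fall, which is one plausible design; the real chunker may order these steps differently.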

Hybrid Search (BM25 + Vector)

Two search methods run in parallel:

- BM25 (keyword-based): Catches exact term matches that embeddings miss. Uses rank_bm25 (pure Python).
- Vector (semantic): ChromaDB with paraphrase-multilingual-MiniLM-L12-v2 embeddings. Catches meaning even when the words differ.
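
To make the keyword side concrete, here is a small pure-Python sketch of Okapi BM25 scoring. It is an illustration only and does not use `rank_bm25`; the function name `bm25_scores`, the parameters `k1=1.5` and `b=0.75`, and the non-negative IDF variant are assumptions rather than the project's settings.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Hypothetical sketch: score every tokenized document against a tokenized query."""
    n_docs = len(corpus_tokens)
    avg_len = sum(len(doc) for doc in corpus_tokens) / n_docs
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for doc in corpus_tokens for term in set(doc))

    scores = []
    for doc in corpus_tokens:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in term_freq:
                continue
            # IDF variant with +1 inside the log so the weight is never negative.
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            tf = term_freq[term]
            # Length normalization: long documents need more matches to score the same.
            denom = tf + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / denom
        scores.append(score)
    return scores
```

A document that shares no token with the query scores zero, which is why exact-term matches that embeddings blur are easy for BM25 to separate.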

Results are fused using Reciprocal Rank Fusion (RRF), introduced by Cormack, Clarke, and Büttcher (SIGIR 2009). RRF combines rankings by rank position alone, so the BM25 and vector scores need no normalization.

{
  "rag": {
    "bm25_weight": 0.4,
    "vector_weight": 0.6,
    "top_k_retrieval": 20
  }
}
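
A weighted form of RRF can be sketched as follows. The function name `rrf_fuse`, the constant `k=60` (the value commonly used with RRF), and the choice to multiply each list's contribution by its weight are assumptions for illustration; the project's `HybridRetriever` may combine the weights differently.

```python
from collections import defaultdict


def rrf_fuse(rankings, weights, k=60, top_k=20):
    """Hypothetical sketch: fuse ranked id lists (best first) into one ranking."""
    scores = defaultdict(float)
    for name, ranked_ids in rankings.items():
        weight = weights.get(name, 1.0)
        for rank, doc_id in enumerate(ranked_ids, start=1):
            # Only the rank position matters; raw retrieval scores are never used.
            scores[doc_id] += weight / (k + rank)
    # Sort by fused score, breaking ties by id for a deterministic order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))[:top_k]


fused = rrf_fuse(
    {"bm25": ["a", "b", "c"], "vector": ["b", "c", "a"]},
    {"bm25": 0.4, "vector": 0.6},
)
```

Here `fused` puts `b` first: it is ranked first by the heavier vector list and second by BM25, which outweighs `a` being first in BM25 but last in the vector list.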

ColBERT v2 Reranking

After hybrid search, top results are reranked using ColBERT v2 late interaction:

- Query and document tokens are encoded independently
- MaxSim computes relevance (max similarity per query token)
- Batched inference: all documents are scored in one forward pass
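
The MaxSim step can be shown without a model. In this sketch, small hand-written vectors stand in for the per-token embeddings an encoder would produce; the function names `maxsim_score` and `rerank` are hypothetical, and every document is assumed to have at least one token vector.

```python
import math


def _normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def maxsim_score(query_vecs, doc_vecs):
    """For each query token take its best-matching document token, then sum."""
    query = [_normalize(v) for v in query_vecs]
    doc = [_normalize(v) for v in doc_vecs]
    return sum(
        max(sum(a * b for a, b in zip(q_vec, d_vec)) for d_vec in doc)
        for q_vec in query
    )


def rerank(query_vecs, docs, top_k=5):
    """Hypothetical sketch: docs maps a document id to its list of token vectors."""
    scored = sorted(
        ((maxsim_score(query_vecs, vecs), doc_id) for doc_id, vecs in docs.items()),
        key=lambda pair: -pair[0],
    )
    return [doc_id for _, doc_id in scored[:top_k]]
```

Because query and document tokens are encoded independently, document vectors can be computed ahead of time and only this cheap interaction runs per query.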

Model: antoinelouis/colbert-xm (~560MB, multilingual, 50+ languages)

Fallback chain:

1. ColBERT v2 (best quality)
2. Cross-encoder mmarco-mMiniLMv2-L12-H384-v1 (lighter, still multilingual)
3. No reranking (passthrough)
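
One way such a fallback chain could be structured is sketched below. The helper `load_first_available` and the loader callables are hypothetical; the real `reranker.py` may detect failures differently.

```python
def load_first_available(loaders):
    """Hypothetical sketch: try each (name, loader) pair in order.

    Returns (name, reranker). If every loader fails, returns ("passthrough", None),
    meaning results keep their hybrid-search order.
    """
    for name, load in loaders:
        try:
            return name, load()
        except Exception as exc:  # e.g. missing package or failed model download
            print(f"{name} unavailable ({exc}); trying the next option")
    return "passthrough", None
```

Callers would pass the loaders in quality order, for example a ColBERT loader followed by a cross-encoder loader.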

The inference device is auto-detected in order of preference: CUDA, then MPS, then CPU.
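
Device selection in that order can be sketched with standard PyTorch availability checks. The function name `pick_device` is hypothetical, and falling back to CPU when PyTorch is not installed is an assumption of this sketch.

```python
def pick_device():
    """Hypothetical sketch: return "cuda", "mps", or "cpu" in that order of preference."""
    try:
        import torch
    except ImportError:
        # Without PyTorch there is no accelerator to choose from.
        return "cpu"
    if torch.cuda.is_available():
        return "cuda"
    # Older PyTorch builds have no MPS backend, so look it up defensively.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"
```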

{
  "rag": {
    "reranker_enabled": true,
    "reranker_model": "antoinelouis/colbert-xm",
    "reranker_top_k": 5
  }
}

To disable reranking (e.g., on Raspberry Pi):

{
  "rag": {
    "reranker_enabled": false
  }
}

Query Expansion

Queries are expanded for broader recall:

- Keywords only: stopwords removed (EN, RU, DE)
- Bigrams: key phrase extraction

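All expansion methods run locally and need no model inference, so they add negligible latency. Bigram extraction is language-agnostic; stopword removal applies only to the languages listed above.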

{
  "rag": {
    "query_expansion_enabled": true,
    "max_query_variants": 3
  }
}
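
A minimal sketch of keyword and bigram expansion follows. The function name `expand_query` and the tiny English stopword sample are hypothetical, and treating each bigram as its own variant is an assumption about how variants are counted.

```python
import re

# Tiny English sample for illustration; a real list would be far longer.
STOPWORDS = {"the", "a", "an", "is", "of", "to", "in", "how", "do", "i", "what"}


def expand_query(query, max_variants=3):
    """Hypothetical sketch: return the original query plus keyword and bigram variants."""
    tokens = re.findall(r"\w+", query.lower())
    keywords = [t for t in tokens if t not in STOPWORDS]
    bigrams = [f"{a} {b}" for a, b in zip(keywords, keywords[1:])]

    variants = [query]
    seen = {query.lower()}
    for candidate in [" ".join(keywords), *bigrams]:
        # Skip empty candidates and anything already emitted.
        if candidate and candidate not in seen:
            variants.append(candidate)
            seen.add(candidate)
    # The cap includes the original query.
    return variants[:max_variants]


variants = expand_query("How do I configure the ColBERT reranker")
```

With the default cap of three, `variants` holds the original query, the keyword-only form `configure colbert reranker`, and the first bigram `configure colbert`.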

Multi-Source Indexing

The pipeline indexes:

- Memory modules (memory_system/modules/*.md)
- Workspace files (markdown, YAML, text)

Reindexing is incremental: only changed files are re-processed.

{
  "rag": {
    "index_workspace": true,
    "index_memory": true,
    "workspace_glob_patterns": ["*.md", "*.yaml", "*.yml", "*.txt"],
    "workspace_exclude_patterns": ["vector_db/**", "__pycache__/**"]
  }
}
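
Incremental reindexing is often implemented by comparing content hashes, and the sketch below takes that approach. It is an assumption about the mechanism, not a description of `MultiSourceIndexer`; the function name `changed_files`, the use of SHA-256, and the `fnmatch`-style pattern matching are all illustrative choices.

```python
import fnmatch
import hashlib
from pathlib import Path


def changed_files(root, include, exclude, seen_hashes):
    """Hypothetical sketch: list files whose content changed since the last call.

    seen_hashes maps a relative path to its last known digest and is updated in place.
    """
    changed = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        rel = path.relative_to(root).as_posix()
        # Include patterns are matched on the file name, exclude patterns on the path.
        if not any(fnmatch.fnmatch(path.name, pat) for pat in include):
            continue
        if any(fnmatch.fnmatch(rel, pat) for pat in exclude):
            continue
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        if seen_hashes.get(rel) != digest:
            seen_hashes[rel] = digest
            changed.append(rel)
    return changed
```

Persisting `seen_hashes` between runs (for example as JSON next to the index) would let a restart skip unchanged files as well.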

Result Cache

An LRU cache with a per-entry TTL avoids redundant searches for repeated queries.

{
  "rag": {
    "cache_size": 128,
    "cache_ttl_seconds": 300
  }
}
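
An LRU cache with a TTL can be built on `collections.OrderedDict`, as in this sketch. The class name `TTLCache` and the injectable `clock` parameter are hypothetical (the project's class is `RAGCache`, whose interface may differ); a TTL of 0 disables expiry, matching the configuration reference.

```python
import time
from collections import OrderedDict


class TTLCache:
    """Hypothetical sketch: LRU eviction plus per-entry expiry."""

    def __init__(self, capacity=128, ttl_seconds=300, clock=time.monotonic):
        self.capacity = capacity
        self.ttl = ttl_seconds
        self.clock = clock
        self._items = OrderedDict()  # key -> (stored_at, value), oldest first

    def get(self, key):
        entry = self._items.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        # A TTL of 0 means entries never expire.
        if self.ttl and self.clock() - stored_at > self.ttl:
            del self._items[key]
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return value

    def put(self, key, value):
        self._items[key] = (self.clock(), value)
        self._items.move_to_end(key)
        while len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict the least recently used entry
```

Passing the clock in makes expiry testable without sleeping; production code would keep the default `time.monotonic`.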

Full Configuration Reference

Key	Default	Description
`enabled`	`false`	Enable the RAG pipeline
`chunk_size`	`512`	Target chunk size in characters
`chunk_overlap`	`64`	Overlap between consecutive chunks
`min_chunk_size`	`50`	Minimum chunk size (smaller fragments merged)
`bm25_weight`	`0.4`	Weight for BM25 in RRF fusion
`vector_weight`	`0.6`	Weight for vector search in RRF fusion
`top_k_retrieval`	`20`	Candidates from hybrid search
`top_k_final`	`5`	Final results after reranking
`reranker_enabled`	`false`	Enable ColBERT/cross-encoder reranking
`reranker_model`	`antoinelouis/colbert-xm`	Reranker model name
`reranker_top_k`	`5`	Top results from reranker
`query_expansion_enabled`	`true`	Enable query expansion
`max_query_variants`	`3`	Max query variants including original
`cache_size`	`128`	LRU cache capacity
`cache_ttl_seconds`	`300`	Cache entry TTL (0 = no expiry)
`max_context_tokens`	`2000`	Max tokens injected into prompt
`index_workspace`	`true`	Index workspace files
`index_memory`	`true`	Index memory modules
`embedding_model`	`""`	Embedding model (empty = inherit from `memory.vector_model`)

Dependencies

All free, local, no API keys:

Package	Size	Purpose
`rank-bm25`	~15KB	BM25 keyword search
`chromadb`	~50MB	Vector database
`sentence-transformers`	~100MB	Embeddings + cross-encoder
`transformers`	~200MB	ColBERT model loading
`torch`	~800MB	Neural network inference

Models (downloaded on first use):

Model	Size	Purpose
`paraphrase-multilingual-MiniLM-L12-v2`	~120MB	Embeddings (50+ languages)
`antoinelouis/colbert-xm`	~560MB	ColBERT reranker

Architecture

sygen_bot/rag/
├── __init__.py          # Public API exports
├── config.py            # RAGConfig (Pydantic model)
├── chunker.py           # SmartChunker — semantic text splitting
├── bm25.py              # BM25Index — keyword search
├── retrieval.py         # HybridRetriever + RRF fusion
├── reranker.py          # ColBERTReranker (+ cross-encoder fallback)
├── query_expansion.py   # Query expansion (keywords, bigrams)
├── indexer.py           # MultiSourceIndexer — workspace/memory indexing
├── cache.py             # RAGCache — LRU with TTL
└── pipeline.py          # RAGPipeline — orchestrates all components

Scaling Guide

Start lightweight and scale up as your memory grows:

Facts	Recommended Config	RAM Overhead
< 50	Base memory only (`rag.enabled: false`)	~0 MB
50–200	Vector search (`memory.vector_search: true`)	~100–200 MB
200–500	RAG without reranker (`rag.enabled: true`, `reranker_enabled: false`)	~200–500 MB
500+	Full RAG with reranker (`reranker_enabled: true`)	+2–3 GB (CPU) / +11 GB (Apple Silicon GPU)

The monthly memory review cron automatically recommends the appropriate level based on your fact count.
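
The table can be read as a simple threshold function. The sketch below is not the cron's actual logic: the function name `recommend_config` is hypothetical, and because the table's ranges share their endpoints (200 and 500), treating each range as excluding its upper bound is an assumption.

```python
def recommend_config(fact_count):
    """Hypothetical sketch: map a fact count to the config fragment from the table."""
    if fact_count < 50:
        return {"rag": {"enabled": False}}
    if fact_count < 200:
        return {"memory": {"vector_search": True}}
    if fact_count < 500:
        return {"rag": {"enabled": True, "reranker_enabled": False}}
    return {"rag": {"enabled": True, "reranker_enabled": True}}
```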

Priority Chain

Context injection follows this priority:

1. RAG Pipeline (if enabled): hybrid search + reranking
2. Vector search (if memory.vector_search enabled): basic ChromaDB
3. Module dump: raw memory module content (fallback)
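
Expressed as code, the priority order amounts to checking two flags in sequence. The function name `select_context_source` and the returned labels are hypothetical; only the config keys `rag.enabled` and `memory.vector_search` come from this page.

```python
def select_context_source(config):
    """Hypothetical sketch: pick the context source for a parsed config.json dict."""
    if config.get("rag", {}).get("enabled"):
        return "rag_pipeline"
    if config.get("memory", {}).get("vector_search"):
        return "vector_search"
    # Neither feature is enabled, so fall back to raw memory module content.
    return "module_dump"
```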