Turbocharge Pydantic AI + SurrealDB RAG with TurboAgents and TurboQuant

Technical Tutorial
Pydantic AI + SurrealDB + TurboAgents
Google Research released TurboQuant, a vector-compression technique for retrieval systems. Superagentic AI released TurboAgents to showcase TurboQuant in real agentic AI systems. This post walks through a small local demo built with Pydantic AI, SurrealDB, and TurboAgents that starts with a plain RAG app, then swaps only the retriever so the same app uses TurboAgents for compressed retrieval and reranking.
What This Tutorial Is About
A lot of RAG examples become hard to follow because they try to explain too many things at once: the agent framework, the vector store, embeddings, prompting, orchestration, and performance claims. This tutorial takes the opposite approach.
The app is intentionally small. It does one thing clearly:
- Start with a plain local RAG app
- Swap only the retriever
- Show what TurboAgents changes
That makes it easier to see where compressed retrieval actually fits.
What Is TurboAgents
TurboAgents is a Python package for TurboQuant-style compression, retrieval, and reranking in agent and RAG systems. It is designed to plug into an existing stack instead of replacing it.
That design choice matters. In many real systems, the hard question is not "how do I build a new agent framework?" The hard question is "where can I add a new retrieval capability without rewriting the rest of the app?" This tutorial uses TurboAgents exactly that way. It does not replace the agent. It does not replace the vector store. It changes the retrieval layer.
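That seam can be sketched in plain Python: a small retriever interface that both implementations satisfy, so the agent-facing code never changes when the implementation is swapped. All names below are illustrative, not the demo's or TurboAgents' actual API.

```python
from typing import Protocol


class Retriever(Protocol):
    """The seam: anything with this shape can back the agent's retrieval tool."""

    def search(self, query: str, k: int) -> list[str]: ...


class PlainRetriever:
    """Baseline retriever over an in-memory corpus."""

    def __init__(self, docs: list[str]):
        self.docs = docs

    def search(self, query: str, k: int) -> list[str]:
        # Stand-in ranking by keyword overlap; real code runs a vector search.
        words = query.lower().split()
        ranked = sorted(self.docs, key=lambda d: -sum(w in d.lower() for w in words))
        return ranked[:k]


class TurboRetriever:
    """Same interface; would wrap a compressed retrieval + rerank path."""

    def __init__(self, inner: Retriever):
        self.inner = inner

    def search(self, query: str, k: int) -> list[str]:
        # A real version would rerank over compressed payloads here.
        return self.inner.search(query, k)


def answer(retriever: Retriever, question: str) -> str:
    # The agent layer only sees the Retriever interface, so swapping
    # implementations never touches this function.
    context = retriever.search(question, k=2)
    return f"Answering {question!r} with {len(context)} retrieved chunks"


docs = ["SurrealDB stores vectors locally.", "Ollama runs the model locally."]
print(answer(PlainRetriever(docs), "where are vectors stored?"))
print(answer(TurboRetriever(PlainRetriever(docs)), "where are vectors stored?"))
```

The design choice is that `answer` depends only on the interface, which is exactly the property the rest of this tutorial leans on.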
What Is Pydantic AI
Pydantic AI is the agent framework used in this example. It gives a clean Python interface for defining an agent, its instructions, and its tools. The agent in this repo has a single important job: answer a question using retrieved context. That makes Pydantic AI a good fit because it lets the retrieval path stay the center of attention.
What Is SurrealDB
SurrealDB is the vector-backed storage layer used here. The demo uses the embedded surrealkv:// backend from the SurrealDB Python SDK, so there is no separate database server to run.
That keeps the tutorial local and reproducible:
- The language model runs locally through Ollama
- The embedding model runs locally
- The retrieval data is stored locally
The result is a small tutorial that still uses real components rather than placeholders.
Why SurrealDB Instead of LanceDB
LanceDB is also supported by TurboAgents, but LanceDB already has its own quantization and indexing story. For a first tutorial focused on the TurboAgents integration seam, SurrealDB makes the comparison easier to isolate.
That means the reader can look at this demo and understand:
- What the plain retrieval path looks like
- What the TurboAgents retrieval path looks like
- What changed between them
That is a better teaching example than mixing multiple retrieval stories together.
How the Demo Is Structured
Same agent, same documents, same local model, same question. Only the retriever changes.
Prerequisites
Before running the demo you need uv and Ollama installed locally. The Ollama model used by the agent is qwen3.5:9b. The embedding model is Qwen/Qwen3-Embedding-0.6B, truncated to 256 dimensions so it stays compatible with the TurboAgents quantization path.
This repo does not require Docker and does not require a separate SurrealDB server. It uses the embedded surrealkv:// backend.
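The 256-dimension truncation mentioned above is worth seeing concretely: keep the leading dimensions and re-normalize to unit length. This is a minimal sketch assuming the embedding model supports Matryoshka-style dimension reduction; the helper name is hypothetical, not the demo's real wrapper.

```python
import math


def truncate_embedding(vec: list[float], dims: int = 256) -> list[float]:
    """Keep the first `dims` components, then re-normalize to unit length."""
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head)) or 1.0  # avoid divide-by-zero
    return [x / norm for x in head]


full = [0.1] * 1024  # stand-in for a real 1024-dim embedding
small = truncate_embedding(full)
print(len(small))                            # 256
print(round(sum(x * x for x in small), 6))   # 1.0 (unit length again)
```

Re-normalizing matters because cosine-similarity search assumes unit-length vectors; truncation alone would leave the slice shorter than 1.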
Clone and Set Up the Repo
Clone the repo, install dependencies, and pull the Ollama model.
1. Clone and install
git clone https://github.com/SuperagenticAI/turboagent-minimal-demo.git
cd turboagent-minimal-demo
uv sync
2. Pull the Ollama model
ollama pull qwen3.5:9b
ollama list
3. Run the comparison script
uv run python scripts/run_compare.py
The generated retrieval state is stored locally under demo_data/.
What the Repo Contains
The important files are:
| File | Purpose |
|---|---|
| app/config.py | Shared configuration, sample corpus, and the demo question |
| app/embed.py | Real local embedding model wrapper |
| app/retrievers.py | Both the plain and TurboAgents retrievers |
| app/agent.py | Shared Pydantic AI agent and grounded run helper |
| scripts/run_plain_rag.py | Baseline RAG app |
| scripts/run_turbo_rag.py | TurboAgents-backed RAG app |
| scripts/run_compare.py | Runs both and prints the comparison |
This structure keeps the code small enough that the integration seam stays visible.
1. Start with the Plain RAG Version
The baseline retriever uses plain SurrealDB vector search. It embeds the demo corpus, stores those vectors in the local SurrealKV-backed database, and searches it directly.
At a high level, the baseline retriever does three things:
1. Prepare the local SurrealDB-backed storage
2. Seed the demo documents and their embeddings
3. Run vector search for the question
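The actual baseline issues a SurrealDB vector query, but the logic it performs is just cosine similarity over the seeded rows. Here is a stdlib-only stand-in with toy 3-dimensional vectors; the function and variable names are illustrative, not the demo's real code.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0


# Seeded "table": (document text, embedding) rows, as the database would hold.
corpus = [
    ("SurrealDB supports embedded surrealkv storage.", [1.0, 0.0, 0.2]),
    ("Ollama serves the local language model.",        [0.0, 1.0, 0.1]),
    ("The retriever is the only part that changes.",   [0.3, 0.3, 1.0]),
]


def plain_search(query_vec: list[float], k: int = 2) -> list[str]:
    """Brute-force top-k by cosine similarity, standing in for a DB query."""
    ranked = sorted(corpus, key=lambda row: cosine(query_vec, row[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


print(plain_search([0.9, 0.1, 0.1]))
```

In the real demo the embeddings are 256-dimensional and the ranking happens inside SurrealDB rather than in Python, but the "before" picture is exactly this shape.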
This is intentionally simple. The baseline exists so the reader has a clear "before" picture. Run only the baseline version with:
uv run python scripts/run_plain_rag.py
2. Add TurboAgents to the Retrieval Layer
The Turbo version keeps the same high-level app structure, but replaces the baseline retriever with a TurboAgents-backed retriever. That means the new retrieval path:
- Uses the same embedding vectors
- Stores the same document metadata
- Answers the same question
- Adds TurboQuant-style compressed retrieval and reranking
This is the seam many teams care about in practice. The change is not "use a completely different application." The change is "use a different retrieval implementation under the same app."
uv run python scripts/run_turbo_rag.py
3. Compare Both Versions
The main script for this tutorial is:
uv run python scripts/run_compare.py
This runs both versions and prints the answer from each, the retrieval mode, timing, vector storage details, and a short comparison summary. A representative result:
Baseline mode: baseline-surrealdb
Turbo mode: turbo-surrealdb-3.5-bits
Compression gain: about 5.02x smaller rerank payload per vector
Conclusion: same agent flow, compressed retrieval payload,
and only a retriever-level code change.
What Changed in the Code
This is where the tutorial becomes concrete. The demo is designed so that the code difference is easy to trace:
| File | Role | Changes? |
|---|---|---|
| scripts/run_plain_rag.py | Runs the plain version | Baseline |
| scripts/run_turbo_rag.py | Runs the Turbo version | Turbo |
| app/agent.py | Agent wiring | No change |
| app/retrievers.py | Retrieval logic | The swap |
That is the core message: same agent, same app shape, same documents, different retriever.
Why the Grounded Tool Call Matters
One practical issue in local tool-using demos is that the model can sometimes answer without actually calling the retrieval tool. That is a bad failure mode for a tutorial because it makes the output less trustworthy.
The demo handles that by explicitly steering the model to call the retrieval tool first. If the first run skips retrieval, it retries with a stricter prompt. If retrieval still does not happen, the script fails clearly instead of quietly pretending everything worked.
That is the right behavior for a technical tutorial. A retrieval demo should actually retrieve.
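That retry-then-fail behavior can be sketched as a small loop. Here `ask(prompt)` is a hypothetical wrapper around an agent run that reports whether the retrieval tool actually fired; all names are illustrative rather than the demo's real API.

```python
class RetrievalNotGrounded(RuntimeError):
    """Raised when the model never calls the retrieval tool."""


def run_grounded(ask, question: str, max_attempts: int = 2) -> str:
    # `ask(prompt)` is an assumed helper returning (answer, used_retrieval).
    prompt = question
    for _ in range(max_attempts):
        answer, used_retrieval = ask(prompt)
        if used_retrieval:
            return answer
        # Retry with a stricter prompt that steers the model to the tool.
        prompt = f"You MUST call the retrieval tool before answering: {question}"
    # Fail loudly instead of quietly returning an ungrounded answer.
    raise RetrievalNotGrounded("model never called the retrieval tool")


# Illustrative stub: skips retrieval on the first pass, complies on retry.
def stub_ask(prompt):
    return ("grounded answer", prompt.startswith("You MUST"))


print(run_grounded(stub_ask, "What changed in the retriever?"))  # grounded answer
```

The key design choice is the final raise: a retrieval demo that silently falls back to an ungrounded answer would undermine its own point.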
What the Result Means
The most important measurable output in this tutorial is the retrieval payload size. In the current demo, the baseline path shows raw float32 vectors, while the Turbo path reports:
raw=1024 bytes, turbo=204 bytes, compression≈5.02x
That is the visible win in this small example. This tutorial is intentionally not making a blanket claim that every end-to-end RAG flow will be faster. The honest claim is narrower and more useful:
- TurboAgents fits into the retriever layer cleanly
- The integration can be small and readable
- The compressed retrieval payload is measurably smaller
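The 5.02x figure follows directly from the vector shape: 256 float32 dimensions occupy 1024 raw bytes, against the 204-byte compressed payload the demo reports. A quick check:

```python
DIMS = 256                    # embedding dimensions after truncation
RAW_BYTES = DIMS * 4          # float32 = 4 bytes per dimension
TURBO_BYTES = 204             # compressed payload reported by the demo

print(f"raw={RAW_BYTES} bytes")                       # raw=1024 bytes
print(f"compression={RAW_BYTES / TURBO_BYTES:.2f}x")  # compression=5.02x
```

Note that the 204 bytes is the whole per-vector payload the Turbo path reports, which is why it is larger than a bare 3.5-bits-per-dimension code would be on its own.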
Why the Demo Uses Real Components
This repo uses real local components: the language model runs through Ollama, the embeddings come from a real local embedding model, and the retrieval data lives in embedded SurrealKV storage. That matters because the tutorial is meant to be reproducible. It should not depend on fake embeddings, precomputed hidden state, or a hardcoded answer path.
Resetting the Demo
The repo builds its own local retrieval state. If you want to rebuild from scratch, delete the generated data and rerun:
rm -rf demo_data
uv run python scripts/run_compare.py
This recreates the local SurrealKV data and retrieval state.
Why This Pattern
This is a small repo, but it demonstrates a useful pattern for larger systems. Many teams already have an agent layer, a vector store, and a retrieval flow. In those systems, a practical adoption path is often more important than a theoretically perfect one.
This tutorial shows one practical path:
- Keep the agent
- Keep the vector store
- Change the retriever
That is why this example is useful beyond the exact stack shown here.
Closing
The point of this tutorial is not that TurboAgents replaces your stack. The point is that it can fit into an existing stack at the retrieval layer. In this demo, the app stays readable, the code change stays visible, and the compression story stays measurable. That is a good way to evaluate a retrieval-layer integration before moving on to bigger systems.
