Turbocharge Pydantic AI + SurrealDB RAG with TurboAgents and TurboQuant

Technical Tutorial
Pydantic AI + SurrealDB + TurboAgents
Google Research released TurboQuant, a vector-compression technique for retrieval systems. Superagentic AI released TurboAgents to showcase TurboQuant in real agentic AI systems. This post walks through a small local demo built with Pydantic AI, SurrealDB, and TurboAgents that starts with a plain RAG app, then swaps only the retriever so the same app uses TurboAgents for compressed retrieval and reranking.
What This Tutorial Is About
A lot of RAG examples become hard to follow because they try to explain too many things at once: the agent framework, the vector store, embeddings, prompting, orchestration, and performance claims. This tutorial takes the opposite approach.
The app is intentionally small. It does one thing clearly:
- Start with a plain local RAG app
- Swap only the retriever
- Show what TurboAgents changes
That makes it easier to see where compressed retrieval actually fits.
What Is TurboAgents
TurboAgents is a Python package for TurboQuant-style compression, retrieval, and reranking in agent and RAG systems. It is designed to plug into an existing stack instead of replacing it.
That design choice matters. In many real systems, the hard question is not "how do I build a new agent framework?" The hard question is "where can I add a new retrieval capability without rewriting the rest of the app?" This tutorial uses TurboAgents exactly that way. It does not replace the agent. It does not replace the vector store. It changes the retrieval layer.
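That seam can be sketched in plain Python: a small retriever interface that both implementations satisfy, so the agent-facing code never changes when the implementation is swapped. All names below are illustrative, not the demo's or TurboAgents' actual API.

```python
from typing import Protocol


class Retriever(Protocol):
    """The seam: anything with this shape can back the agent's retrieval tool."""

    def search(self, query: str, k: int) -> list[str]: ...


class PlainRetriever:
    """Baseline retriever over an in-memory corpus."""

    def __init__(self, docs: list[str]):
        self.docs = docs

    def search(self, query: str, k: int) -> list[str]:
        # Stand-in ranking by keyword overlap; real code runs a vector search.
        words = query.lower().split()
        ranked = sorted(self.docs, key=lambda d: -sum(w in d.lower() for w in words))
        return ranked[:k]


class TurboRetriever:
    """Same interface; would wrap a compressed retrieval + rerank path."""

    def __init__(self, inner: Retriever):
        self.inner = inner

    def search(self, query: str, k: int) -> list[str]:
        # A real version would rerank over compressed payloads here.
        return self.inner.search(query, k)


def answer(retriever: Retriever, question: str) -> str:
    # The agent layer only sees the Retriever interface, so swapping
    # implementations never touches this function.
    context = retriever.search(question, k=2)
    return f"Answering {question!r} with {len(context)} retrieved chunks"


docs = ["SurrealDB stores vectors locally.", "Ollama runs the model locally."]
print(answer(PlainRetriever(docs), "where are vectors stored?"))
print(answer(TurboRetriever(PlainRetriever(docs)), "where are vectors stored?"))
```

The design choice is that `answer` depends only on the interface, which is exactly the property the rest of this tutorial leans on.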
What Is Pydantic AI
Pydantic AI is the agent framework used in this example. It gives a clean Python interface for defining an agent, its instructions, and its tools. The agent in this repo has a single important job: answer a question using retrieved context. That makes Pydantic AI a good fit because it lets the retrieval path stay the center of attention.
What Is SurrealDB
SurrealDB is the vector-backed storage layer used here. The demo uses the embedded surrealkv:// backend from the SurrealDB Python SDK, so there is no separate database server to run.
That keeps the tutorial local and reproducible:
- The language model runs locally through Ollama
- The embedding model runs locally
- The retrieval data is stored locally
The result is a small tutorial that still uses real components rather than placeholders.
Why SurrealDB Instead of LanceDB
LanceDB is also supported by TurboAgents, but LanceDB already has its own quantization and indexing story. For a first tutorial focused on the TurboAgents integration seam, SurrealDB makes the comparison easier to isolate.
That means the reader can look at this demo and understand:
- What the plain retrieval path looks like
- What the TurboAgents retrieval path looks like
- What changed between them
That is a better teaching example than mixing multiple retrieval stories together.
How the Demo Is Structured
Same agent, same documents, same local model, same question. Only the retriever changes.
Prerequisites
Before running the demo you need uv and Ollama installed locally. The Ollama model used by the agent is qwen3.5:9b. The embedding model is Qwen/Qwen3-Embedding-0.6B, truncated to 256 dimensions so it stays compatible with the TurboAgents quantization path.
This repo does not require Docker and does not require a separate SurrealDB server. It uses the embedded surrealkv:// backend.
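The 256-dimension truncation mentioned above is worth seeing concretely: keep the leading dimensions and re-normalize to unit length. This is a minimal sketch assuming the embedding model supports Matryoshka-style dimension reduction; the helper name is hypothetical, not the demo's real wrapper.

```python
import math


def truncate_embedding(vec: list[float], dims: int = 256) -> list[float]:
    """Keep the first `dims` components, then re-normalize to unit length."""
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head)) or 1.0  # avoid divide-by-zero
    return [x / norm for x in head]


full = [0.1] * 1024  # stand-in for a real 1024-dim embedding
small = truncate_embedding(full)
print(len(small))                            # 256
print(round(sum(x * x for x in small), 6))   # 1.0 (unit length again)
```

Re-normalizing matters because cosine-similarity search assumes unit-length vectors; truncation alone would leave the slice shorter than 1.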
Clone and Set Up the Repo
Clone the repo, install dependencies, and pull the Ollama model.
1. Clone and install
git clone https://github.com/SuperagenticAI/turboagent-minimal-demo.git
cd turboagent-minimal-demo
uv sync
2. Pull the Ollama model
ollama pull qwen3.5:9b
ollama list
3. Run the comparison script
uv run python scripts/run_compare.py
The generated retrieval state is stored locally under demo_data/.
What the Repo Contains
The important files are:
| File | Purpose |
|---|---|
| app/config.py | Shared configuration, sample corpus, and the demo question |
| app/embed.py | Real local embedding model wrapper |
| app/retrievers.py | Both the plain and TurboAgents retrievers |
| app/agent.py | Shared Pydantic AI agent and grounded run helper |
| scripts/run_plain_rag.py | Baseline RAG app |
| scripts/run_turbo_rag.py | TurboAgents-backed RAG app |
| scripts/run_compare.py | Runs both and prints the comparison |
This structure keeps the code small enough that the integration seam stays visible.
1. Start with the Plain RAG Version
The baseline retriever uses plain SurrealDB vector search. It embeds the demo corpus, stores those vectors in the local SurrealKV-backed database, and searches it directly.
At a high level, the baseline retriever does three things:
1. Prepare the local SurrealDB-backed storage
2. Seed the demo documents and their embeddings
3. Run vector search for the question
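The actual baseline issues a SurrealDB vector query, but the logic it performs is just cosine similarity over the seeded rows. Here is a stdlib-only stand-in with toy 3-dimensional vectors; the function and variable names are illustrative, not the demo's real code.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0


# Seeded "table": (document text, embedding) rows, as the database would hold.
corpus = [
    ("SurrealDB supports embedded surrealkv storage.", [1.0, 0.0, 0.2]),
    ("Ollama serves the local language model.",        [0.0, 1.0, 0.1]),
    ("The retriever is the only part that changes.",   [0.3, 0.3, 1.0]),
]


def plain_search(query_vec: list[float], k: int = 2) -> list[str]:
    """Brute-force top-k by cosine similarity, standing in for a DB query."""
    ranked = sorted(corpus, key=lambda row: cosine(query_vec, row[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


print(plain_search([0.9, 0.1, 0.1]))
```

In the real demo the embeddings are 256-dimensional and the ranking happens inside SurrealDB rather than in Python, but the "before" picture is exactly this shape.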
This is intentionally simple. The baseline exists so the reader has a clear "before" picture. Run only the baseline version with:
uv run python scripts/run_plain_rag.py
2. Add TurboAgents to the Retrieval Layer
The Turbo version keeps the same high-level app structure, but replaces the baseline retriever with a TurboAgents-backed retriever. That means the new retrieval path:
- Uses the same embedding vectors
- Stores the same document metadata
- Answers the same question
- Adds TurboQuant-style compressed retrieval and reranking
This is the seam many teams care about in practice. The change is not "use a completely different application." The change is "use a different retrieval implementation under the same app."
uv run python scripts/run_turbo_rag.py
3. Compare Both Versions
The main script for this tutorial is:
uv run python scripts/run_compare.py
This runs both versions and prints the answer from each, the retrieval mode, timing, vector storage details, and a short comparison summary. A representative result:
Baseline mode: baseline-surrealdb
Turbo mode: turbo-surrealdb-3.5-bits
Compression gain: about 5.02x smaller rerank payload per vector
Conclusion: same agent flow, compressed retrieval payload,
and only a retriever-level code change.
What Changed in the Code
This is where the tutorial becomes concrete. The demo is designed so that the code difference is easy to trace:
| File | Role | Changes? |
|---|---|---|
| scripts/run_plain_rag.py | Runs the plain version | Baseline |
| scripts/run_turbo_rag.py | Runs the Turbo version | Turbo |
| app/agent.py | Agent wiring | No change |
| app/retrievers.py | Retrieval logic | The swap |
That is the core message: same agent, same app shape, same documents, different retriever.
Why the Grounded Tool Call Matters
One practical issue in local tool-using demos is that the model can sometimes answer without actually calling the retrieval tool. That is a bad failure mode for a tutorial because it makes the output less trustworthy.
The demo handles that by explicitly steering the model to call the retrieval tool first. If the first run skips retrieval, it retries with a stricter prompt. If retrieval still does not happen, the script fails clearly instead of quietly pretending everything worked.
That is the right behavior for a technical tutorial. A retrieval demo should actually retrieve.
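That retry-then-fail behavior can be sketched as a small loop. Here `ask(prompt)` is a hypothetical wrapper around an agent run that reports whether the retrieval tool actually fired; all names are illustrative rather than the demo's real API.

```python
class RetrievalNotGrounded(RuntimeError):
    """Raised when the model never calls the retrieval tool."""


def run_grounded(ask, question: str, max_attempts: int = 2) -> str:
    # `ask(prompt)` is an assumed helper returning (answer, used_retrieval).
    prompt = question
    for _ in range(max_attempts):
        answer, used_retrieval = ask(prompt)
        if used_retrieval:
            return answer
        # Retry with a stricter prompt that steers the model to the tool.
        prompt = f"You MUST call the retrieval tool before answering: {question}"
    # Fail loudly instead of quietly returning an ungrounded answer.
    raise RetrievalNotGrounded("model never called the retrieval tool")


# Illustrative stub: skips retrieval on the first pass, complies on retry.
def stub_ask(prompt):
    return ("grounded answer", prompt.startswith("You MUST"))


print(run_grounded(stub_ask, "What changed in the retriever?"))  # grounded answer
```

The key design choice is the final raise: a retrieval demo that silently falls back to an ungrounded answer would undermine its own point.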
What the Result Means
The most important measurable output in this tutorial is the retrieval payload size. In the current demo, the baseline path shows raw float32 vectors, while the Turbo path reports:
raw=1024 bytes, turbo=204 bytes, compression≈5.02x
That is the visible win in this small example. This tutorial is intentionally not making a blanket claim that every end-to-end RAG flow will be faster. The honest claim is narrower and more useful:
- TurboAgents fits into the retriever layer cleanly
- The integration can be small and readable
- The compressed retrieval payload is measurably smaller
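The 5.02x figure follows directly from the vector shape: 256 float32 dimensions occupy 1024 raw bytes, against the 204-byte compressed payload the demo reports. A quick check:

```python
DIMS = 256                    # embedding dimensions after truncation
RAW_BYTES = DIMS * 4          # float32 = 4 bytes per dimension
TURBO_BYTES = 204             # compressed payload reported by the demo

print(f"raw={RAW_BYTES} bytes")                       # raw=1024 bytes
print(f"compression={RAW_BYTES / TURBO_BYTES:.2f}x")  # compression=5.02x
```

Note that the 204 bytes is the whole per-vector payload the Turbo path reports, which is why it is larger than a bare 3.5-bits-per-dimension code would be on its own.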
Why the Demo Uses Real Components
This repo uses real local components: the language model runs through Ollama, the embeddings come from a real local embedding model, and the retrieval data lives in embedded SurrealKV storage. That matters because the tutorial is meant to be reproducible. It should not depend on fake embeddings, precomputed hidden state, or a hardcoded answer path.
Resetting the Demo
The repo builds its own local retrieval state. If you want to rebuild from scratch, delete the generated data and rerun:
rm -rf demo_data
uv run python scripts/run_compare.py
This recreates the local SurrealKV data and retrieval state.
Why This Pattern
This is a small repo, but it demonstrates a useful pattern for larger systems. Many teams already have an agent layer, a vector store, and a retrieval flow. In those systems, a practical adoption path is often more important than a theoretically perfect one.
This tutorial shows one practical path:
- Keep the agent
- Keep the vector store
- Change the retriever
That is why this example is useful beyond the exact stack shown here.
Closing
The point of this tutorial is not that TurboAgents replaces your stack. The point is that it can fit into an existing stack at the retrieval layer. In this demo, the app stays readable, the code change stays visible, and the compression story stays measurable. That is a good way to evaluate a retrieval-layer integration before moving on to bigger systems.
