Research-First Evaluation OS

RLM Code

Research Playground for Recursive Language Models

Run real RLM workflows, benchmark them, replay every step, and compare results under controlled budgets and inside secure sandboxes.

Open source under Apache-2.0 and published on PyPI. Explore the project on GitHub.

RLM Code Research Lab

Is RLM revolutionary or just another coding agent?

The debate is active. RLM Code gives researchers a way to test claims directly with reproducible runs, benchmark comparisons, trajectory replay, and cost and token tracking.

You do not need to pick a side first. You can run the same tasks, inspect the traces, and decide with evidence.

Use your preferred stack

Mix runtimes, framework adapters, and observability tools without changing your research workflow.

Runtimes

Docker
Apple Container
Modal
E2B
Daytona

Frameworks

DSPy RLM
ADK
Pydantic AI
Google ADK
DeepAgents

Observability

MLflow
Logfire
LangSmith
LangFuse

Built for research workflows

Implementation friction: get a runnable RLM loop without custom scaffolding

Experiment management: run, compare, replay, and export results in one place

Security controls: use secure runtime profiles and explicit unsafe opt-ins

Reproducibility: keep traces, metrics, artifacts, and benchmark history

Operational visibility: inspect events, rewards, tokens, and runtime health

What is included

Core capabilities for running reproducible RLM experiments end-to-end.

1. Research Lab TUI

Dashboard, Trajectory, Benchmarks, Replay, and Live Events in one focused interface.

2. Benchmark System

Built-in benchmark presets, leaderboard metrics, run comparison, and report export.

3. Session Replay

Step through decisions and outcomes to inspect behavior, not just final answers.

4. Superbox Sandbox Layer

Policy-based runtime selection and fallback across Docker, Apple Container, and cloud runtimes.

5. Framework Adapter Registry

Compare execution across DSPy RLM, ADK, Pydantic AI, Google ADK, and DeepAgents.

6. Observability Integrations

Connect MLflow, Logfire, LangSmith, LangFuse, OpenTelemetry, and local JSONL sinks.
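
As a quick illustration of the local JSONL sink, the sketch below tallies tokens and cost from a trace file using only the Python standard library. The file path and event fields (tokens, cost_usd) are hypothetical placeholders rather than a documented schema; adapt them to whatever your sink actually writes.

    import json
    from pathlib import Path

    # Hypothetical trace file location and field names; match them to your sink output.
    trace_path = Path("traces/run_latest.jsonl")

    events = 0
    total_tokens = 0
    total_cost = 0.0
    with trace_path.open() as f:
        for line in f:
            if not line.strip():
                continue
            record = json.loads(line)
            events += 1
            total_tokens += record.get("tokens", 0)
            total_cost += record.get("cost_usd", 0.0)

    print(f"{events} events, {total_tokens} tokens, ${total_cost:.4f}")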

Watch RLM Code in action

Install, connect, run a bounded RLM task, benchmark, compare, and inspect replay in the Research Lab.

From install to measurable result in minutes

RLM Quick Workflow
Install
$ uv tool install "rlm-code[tui,llm-all]"
Launch
$ rlm-code
Connect
> /connect
Secure Sandbox
> /sandbox profile secure
Run
> /rlm run "small scoped task" steps=4 timeout=30 budget=60
Benchmark
> /rlm bench preset=token_efficiency
Compare
> /rlm bench compare candidate=latest baseline=previous
Observe
> /rlm observability
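
Exported benchmark reports can also be compared outside the TUI. A minimal sketch, assuming hypothetical report filenames and metric keys; the real export schema may differ.

    import json
    from pathlib import Path

    # Hypothetical export filenames and metric keys; adjust to your actual reports.
    baseline = json.loads(Path("reports/baseline.json").read_text())
    candidate = json.loads(Path("reports/candidate.json").read_text())

    for metric in ("total_tokens", "cost_usd", "success_rate"):
        if metric in baseline and metric in candidate:
            delta = candidate[metric] - baseline[metric]
            print(f"{metric}: {baseline[metric]} -> {candidate[metric]} ({delta:+})")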

Safe and bounded by design

Secure profile sets Docker-first execution defaults

Unsafe exec mode requires explicit acknowledgment

Step, timeout, and budget controls prevent runaway runs

/rlm abort allows immediate cancellation

Runtime doctor and status commands expose readiness and misconfigurations

Research-first audience

Who it is for

  • Researchers evaluating long-context reasoning methods
  • Applied AI teams validating execution patterns before product rollout
  • Framework authors testing adapter behavior in a unified benchmark harness

Not the primary target

  • One-click consumer chat workflows
  • Unbounded autonomous production agents without evaluation controls

FAQ

Can I use MLflow and Logfire at the same time?

Yes. Multiple observability sinks can run together.
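
For example, with the MLflow sink enabled alongside Logfire, the logged runs can still be queried afterwards with the standard MLflow client. The tracking URI and experiment name below are placeholders, not values defined by RLM Code.

    import mlflow

    # Placeholder tracking URI and experiment name; substitute your own values.
    mlflow.set_tracking_uri("http://localhost:5000")
    runs = mlflow.search_runs(experiment_names=["rlm-code"])

    # search_runs returns a pandas DataFrame with metrics.* and params.* columns.
    print(runs.filter(regex=r"metrics\.").head())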

Can I compare RLM with other execution paradigms?

Yes. Run benchmarks and compare runs with consistent bounds and metrics.

Can I use local models?

Yes. Local and BYOK routes are supported.

Is secure execution required?

For serious experiments, yes. Use secure sandbox profiles by default.

Stop arguing. Run the experiment.

Use RLM Code to test what works for your tasks, constraints, and models.

RLM Code is a research playground for evaluating Recursive Language Model workflows with reproducibility, safety, and observability.