Refine AI gates your PRs with opinionated behavioral checks. When your agent regresses — more steps, unexpected tool calls, cost spikes — the PR fails. The built-in debugger shows you exactly what changed.
Built for teams shipping AI agents in production
A prompt tweak, a model swap, an infra update — any change can alter how your agent behaves. Output looks correct. Tests pass. But your agent is now taking 22 steps instead of 6, calling Salesforce, and costing 3x more. You won't know until a user reports it.
Your agent used to complete the task in 6-9 steps. A new model version finds a different reasoning path. Now it takes 22. Output is correct. Cost is not.
A developer adds a Salesforce integration and forgets to add guardrails. Your agent starts calling it mid-session. Nobody sees it until the audit.
P95 latency jumps from 3s to 14s. Users churn before anyone correlates the slowdown with the deploy. It happens at companies of every size.
Eval tools ask: "Is my output correct?" — we ask: "Did behavior change?"
Braintrust, LangSmith, and Confident AI are great at evaluating output quality. Refine AI checks something different: structural behavioral change. No golden dataset. No LLM-as-judge. No rubrics. Just: did your agent do something different than before?
Add one decorator. Capture a baseline. Gate every PR. No hosted infrastructure, no cloud required — your traces stay on your CI runner.
Add one decorator to your agent entry point. Works with LangChain, LlamaIndex, AutoGen, or any raw Python agent. Takes under 5 minutes.
```python
from agentdbg import trace

@trace
def run_agent(input: str):
    # your existing code
    # nothing else changes
    ...
```
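For intuition, a tracing decorator of this shape only needs to wrap the call and record what happened. Here is a minimal sketch — illustrative only, not agentdbg's actual implementation; the `TRACE_LOG` list and the recorded fields are assumptions:

```python
import functools
import time

# Hypothetical in-memory log; the real tool persists full execution traces.
TRACE_LOG = []

def trace(fn):
    """Record wall-clock latency and outcome for each call to fn."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            result = fn(*args, **kwargs)
            status = "ok"
            return result
        except Exception:
            status = "error"
            raise
        finally:
            TRACE_LOG.append({
                "fn": fn.__name__,
                "latency_s": time.perf_counter() - start,
                "status": status,
            })
    return wrapper

@trace
def run_agent(input: str):
    return f"handled: {input}"

run_agent("refund request")
```

Because the decorator observes the call from the outside, your agent code itself stays untouched.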
Run your agent on a fixed set of test inputs. Refine AI records the full execution trace — steps, tool calls, cost, latency. This becomes your regression baseline.
```shell
# On main branch:
agentdbg baseline capture \
  --suite ./tests/agent_scenarios/ \
  --save baseline.json

# Baseline stored. Ready to gate.
```
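To make the idea concrete, a recorded baseline might look roughly like the record below. This is a hypothetical shape — the field names are assumptions for illustration, not agentdbg's documented schema:

```python
import json

# Hypothetical baseline record for one test scenario (field names illustrative).
baseline = {
    "scenario": "refund_request",
    "steps": 7,
    "tool_calls": {"search_orders": 2, "issue_refund": 1},
    "cost_usd": 0.012,
    "latency_p95_ms": 2900,
    "terminal_state": "task_complete",
}

with open("baseline.json", "w") as f:
    json.dump({"scenarios": [baseline]}, f, indent=2)
```

Each scenario in the suite contributes one such record, and later runs are diffed against it.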
Add the GitHub Action. On every PR, Refine AI replays the same test suite against the new branch and compares. Behavioral regression = PR fails.
```yaml
# .github/workflows/agent-check.yml
- uses: agentdbg/action@v1
  with:
    baseline: baseline.json
    max-steps: 15
    max-tool-calls: 10
    no-loops: true
    max-cost: 0.05
```
No hosted infra. Runs on your GitHub Actions runner. Traces never leave your environment.
Every check is deterministic. No ML classifiers. No LLM scoring. Each check compares a measured property of the current run against the baseline — and fails the PR if it exceeds your threshold.
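A deterministic check of this kind is just arithmetic against the baseline. A sketch, assuming trace summaries shaped like the dicts below — the function and field names are illustrative, not agentdbg's API:

```python
def check_regression(baseline: dict, current: dict,
                     max_steps: int, max_cost_ratio: float) -> list[str]:
    """Compare measured properties of the current run against the baseline."""
    failures = []
    if current["steps"] > max_steps:
        failures.append(f"steps {current['steps']} exceed limit {max_steps}")
    if current["cost_usd"] > baseline["cost_usd"] * max_cost_ratio:
        failures.append(f"cost grew more than {max_cost_ratio}x over baseline")
    return failures

# The 6-step agent that now takes 22 steps and costs 3x: both checks fire.
failures = check_regression(
    baseline={"steps": 7, "cost_usd": 0.012},
    current={"steps": 22, "cost_usd": 0.041},
    max_steps=15,
    max_cost_ratio=2.0,
)
```

Because every comparison is a plain inequality over measured numbers, the same inputs always produce the same verdict.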
Catches when a code change causes your agent to take a dramatically different number of reasoning steps to complete the same task.
Detects when total tool invocations spike across a test case, indicating a more expensive or less efficient execution path.
Flags any tool call that did not appear in the baseline trace. A new Salesforce call, a new DB read — anything your agent was not doing before.
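This check reduces to a set difference between the tools seen in the baseline and the tools seen now. A sketch with assumed names:

```python
def unexpected_tools(baseline_tools: set, current_tools: set) -> set:
    """Tools invoked in the current run that never appeared in the baseline."""
    return current_tools - baseline_tools

# The forgotten Salesforce integration shows up immediately.
new_tools = unexpected_tools(
    {"search_orders", "issue_refund"},
    {"search_orders", "issue_refund", "salesforce_update"},
)
# new_tools == {"salesforce_update"} -> fail the PR
```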
Detects when the agent revisits the same reasoning subgraph repeatedly — a sign of an infinite loop that will burn tokens and never resolve.
If a guardrail that never triggered on the baseline now triggers on the new branch, Refine AI surfaces it — even if the agent continued.
Compares estimated token cost per test case against the baseline. A 3x cost increase on a code change is a regression, even if output looks correct.
Tracks wall-clock time per run. Latency regressions are often invisible in evals but immediately visible to users. Catch them before merge.
Verifies your agent still reaches its expected terminal state (task_complete, handoff, etc.) across all test cases. Regression = agent never finishes.
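Loop detection in particular needs no ML: it can be a scan of the step sequence for an immediately repeating subsequence. A sketch, assuming each step has been reduced to a hashable signature such as a tool name (the function and threshold are illustrative):

```python
def has_loop(steps: list, min_repeats: int = 3) -> bool:
    """True if some contiguous subsequence repeats min_repeats times in a row."""
    n = len(steps)
    for size in range(1, n // min_repeats + 1):
        for start in range(n - size * min_repeats + 1):
            window = steps[start:start + size]
            if all(steps[start + r * size:start + (r + 1) * size] == window
                   for r in range(min_repeats)):
                return True
    return False

# An agent stuck re-running the same search step is flagged;
# a straight-through run is not.
looping = has_loop(["plan", "search", "search", "search", "answer"])
healthy = has_loop(["plan", "search", "answer"])
```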
Eval platforms evaluate output quality. Refine AI detects structural behavioral change. Different question. Different tool. Complementary, not competing.
| Capability | Refine AI | Braintrust / LangSmith | Confident AI / DeepEval | Arize / Cascade |
|---|---|---|---|---|
| PR gate that blocks on regression | | | | |
| Zero LLM calls in check path | | | | |
| No golden dataset required | | | | |
| Behavioral invariant checks | | | | |
| Output quality evaluation | | | | |
| LLM-as-judge scoring | | | | |
| Custom rubrics / prompts | | | | |
| Step count & tool call tracking | | | | |
| CI integration (GitHub Actions) | | | | |
| Local-first, no cloud required | | | | |
"–" means limited or configuration-dependent support. Different tools, different jobs — use both.
Real signals from engineers using Refine AI in CI.
"A new model version tripled our step count on the summarization agent. Refine AI caught it on the PR. We would never have noticed from evals alone — the output quality didn't change at all."
"We added a new tool to the agent and forgot to test it. Refine AI flagged it as an unexpected tool path on the PR. Two lines of YAML later and it's explicitly approved in our policy."
"Our agent stopped reaching the handoff state after a prompt change. Evals still passed because the early steps were fine. Refine AI blocked the PR because the stop condition wasn't hit."
The local debugger is free, forever. You pay for CI assertions — the same model Codecov and Snyk use.
No credit card for the free tier. The 14-day team trial doesn't require one either. Cancel any time.
Three steps. No sign-up required to start.
```shell
pip install agentdbg
```

```python
# Add @trace to your agent entry point
from agentdbg import trace

@trace
def run_agent(input: str):
    ...
```
```yaml
- name: Refine AI behavioral check
  uses: agentdbg/action@v1
  with:
    baseline: baseline.json
    max-steps: 15
    max-tool-calls: 10
    no-loops: true
    max-cost: 0.05
    max-latency-p95: 5000
```