We Use Claude Code for Coding. Here's Why We Built Our Own Agent for CI.

Sam Alba · 9 min read

A few weeks ago, we published a deep dive on how we built Mendral's agent architecture. We covered the multi-agent system, Firecracker sandboxes, durable execution. What we didn't cover is the question we get most often: "Why can't I just run Claude Code on my CI?"

It's a fair question. Both Mendral and Claude Code run on LLMs. Both are coding agents. Both can read code, write patches, and reason about failures. The answer is that everything around the model is different. The system prompts, the tools, the data, the context. Every token sent to the LLM is optimized for a different job. Claude Code is optimized for writing software. Mendral is optimized for keeping your software delivery green, fast, and secure.

We're not trying to replace Claude Code. We use it every day and we like it. Mendral is just better at everything that happens after you push your code.

The token gap

When you type a short message to Claude Code, the agent doesn't just forward your message to the LLM. It sends a large payload: system prompts describing its capabilities, tool definitions for file operations and shell commands, contextual information about your codebase. All of those tokens are carefully designed to make Claude Code great at producing and editing code.

Mendral does the same thing, but every token is different. Our system prompts encode specific patterns we've learned from debugging CI at Docker and Dagger for over a decade. Things like: a test that passes locally but fails intermittently on CI is almost never random, so check for resource contention, shared state between parallel suites, and timing-sensitive assertions before blaming the code. A sudden spike in failures after a dependency bump is likely a transitive dependency conflict, not a flake. A build that's 30% slower than last week but produces the same output is probably a cache invalidation issue, not a code regression. These aren't generic instructions. They're the kind of judgment that takes years of staring at CI logs to develop, encoded directly into what the model sees on every call.

Our tool definitions describe operations that don't exist in a general coding agent: querying months of CI history, correlating failures across branches, tracing a flaky test to a transitive dependency bump three weeks ago. The contextual information isn't your current file. It's billions of log lines, test execution history, failure patterns, infrastructure conditions, and a living list of known issues in your delivery pipeline.

Same models. Different inputs. Completely different results.

A complete agent harness

Mendral isn't a wrapper around an LLM API. We built a complete agent harness from scratch. The agent loop runs on our Go backend. Every tool is our own implementation.

The tools split into two categories. Some are native Go functions running on our backend: querying ClickHouse for log analysis, fetching GitHub metadata, looking up failure history, correlating test results across runs. These are fast, deterministic, and don't need isolation.
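The two categories can be pictured as a placement flag on each tool definition. The following is a minimal sketch, not our actual harness: the `Tool`, `Registry`, and `Dispatch` names, the tool names, and the stubbed tool bodies are all hypothetical and exist only to illustrate the split.

```go
package main

import (
	"errors"
	"fmt"
)

// Placement says where a tool runs. Both values are illustrative names.
type Placement int

const (
	Native    Placement = iota // plain Go function on the backend
	Sandboxed                  // must run inside an isolated sandbox
)

// Tool is a hypothetical tool definition used only for this sketch.
type Tool struct {
	Name      string
	Placement Placement
	Run       func(args map[string]string) (string, error)
}

// Registry maps tool names to their definitions.
type Registry map[string]Tool

// ErrNoSandbox is returned when a sandboxed tool is called without a sandbox.
var ErrNoSandbox = errors.New("tool requires a sandbox but none is attached")

// Dispatch runs a tool by name and refuses sandboxed tools when no sandbox is attached.
func (r Registry) Dispatch(name string, args map[string]string, hasSandbox bool) (string, error) {
	tool, ok := r[name]
	if !ok {
		return "", fmt.Errorf("unknown tool %q", name)
	}
	if tool.Placement == Sandboxed && !hasSandbox {
		return "", ErrNoSandbox
	}
	return tool.Run(args)
}

// newRegistry builds a registry with one stub tool per category.
func newRegistry() Registry {
	return Registry{
		"failure_history": {
			Name: "failure_history", Placement: Native,
			Run: func(args map[string]string) (string, error) {
				return "history for " + args["test"], nil
			},
		},
		"run_tests": {
			Name: "run_tests", Placement: Sandboxed,
			Run: func(args map[string]string) (string, error) {
				return "ran tests in " + args["repo"], nil
			},
		},
	}
}

func main() {
	reg := newRegistry()
	out, err := reg.Dispatch("failure_history", map[string]string{"test": "TestUserAuthFlow"}, false)
	fmt.Println(out, err)
	_, err = reg.Dispatch("run_tests", map[string]string{"repo": "example/repo"}, false)
	fmt.Println(err)
}
```

Keeping the placement on the definition means the loop decides about isolation once, at dispatch time, rather than inside each tool.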

Some tools require a sandbox. When the agent needs to clone a repository, apply a patch, run tests, or validate a fix, it operates inside a Firecracker microVM. Each sandbox is a lightweight VM with its own kernel and hardware-level isolation between tenants. It boots in under 125ms. Between tool calls, the sandbox suspends. When the LLM responds with the next action, the sandbox resumes in under 25ms with full state preserved. No idle compute during inference. No data leaking between customers.

This matters for CI specifically. The agent sometimes pushes a fix and needs to wait hours for the pipeline to complete. The sandbox suspends during that wait. When CI finishes, the sandbox resumes with full context intact and verifies the result. Without suspend and resume, you'd either burn hours of idle compute or lose your entire execution state.
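The suspend-and-resume cycle described above behaves like a small state machine wrapped around each tool call. The sketch below is a toy model under stated assumptions: `Sandbox`, `Step`, and the in-memory `Memory` map are hypothetical stand-ins, and nothing here talks to Firecracker. It only shows the shape of "resume, run one action, suspend again" with state carried across turns.

```go
package main

import (
	"errors"
	"fmt"
)

// State is the lifecycle state of the toy sandbox.
type State int

const (
	Running State = iota
	Suspended
)

// Sandbox is a toy stand-in for a microVM; Memory stands for its preserved state.
type Sandbox struct {
	State  State
	Memory map[string]string
}

// ErrSuspended is returned when work is attempted on a suspended sandbox.
var ErrSuspended = errors.New("sandbox is suspended")

// NewSandbox returns a running sandbox with empty state.
func NewSandbox() *Sandbox {
	return &Sandbox{State: Running, Memory: map[string]string{}}
}

// Exec records the effect of a tool call; it only works while running.
func (s *Sandbox) Exec(key, value string) error {
	if s.State != Running {
		return ErrSuspended
	}
	s.Memory[key] = value
	return nil
}

// Suspend and Resume flip the state without touching Memory.
func (s *Sandbox) Suspend() { s.State = Suspended }
func (s *Sandbox) Resume()  { s.State = Running }

// Step models one turn of the agent loop: resume, run one tool call, suspend again.
func Step(s *Sandbox, key, value string) error {
	s.Resume()
	defer s.Suspend()
	return s.Exec(key, value)
}

func main() {
	sb := NewSandbox()
	sb.Suspend() // idle while the model is thinking or CI is running
	fmt.Println(Step(sb, "patch", "applied"))
	fmt.Println(Step(sb, "tests", "green"))
	fmt.Println(sb.State == Suspended, sb.Memory)
}
```

The property worth noticing is that the sandbox is suspended whenever control is outside `Step`, which is where the long waits (inference, a multi-hour pipeline) happen.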

Claude Code runs locally, on your machine or in a CI runner. It has access to the files in the current working directory and whatever you can do in a terminal. That's the right model for writing code. It's the wrong model for diagnosing why a test that passes locally fails intermittently on CI, or why your build time regressed 40% last Thursday, or which of your 22,000 tests are actually flaky versus failing due to a real regression.

The data layer

The agent is only as smart as the data it can access. We built a log ingestion pipeline that processes billions of CI log lines per week into ClickHouse, compressed at 35:1, queryable in milliseconds. The agent writes its own SQL queries to investigate failures. No predefined query library. It asks whatever question it needs to ask.

A typical investigation scans 335K rows across 3+ queries. At P95, it scans 940 million rows. The agent can look at a failing test and instantly pull up its pass rate over the last 90 days, identify when it first started failing, find the exact commit that introduced the regression, check if the same test is flaky on other branches, and cross-reference with infrastructure conditions at the time of execution. All in seconds.

Claude Code doesn't have access to any of that. It sees the current failure in the current run. Mendral sees the full picture.
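To make two of those questions concrete (pass rate over a window, and when a test first started failing), here is a small in-memory sketch. It is only an illustration: in Mendral the agent answers these with SQL against ClickHouse, whereas the `Run` type, the `PassRate` and `FirstFailure` helpers, and the sample data below are hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// Run is a hypothetical record of one test execution.
type Run struct {
	Test   string
	At     time.Time
	Passed bool
}

// day returns a fixed base date plus n days, to keep the sample data readable.
func day(n int) time.Time {
	return time.Date(2024, 1, 1, 0, 0, 0, 0, time.UTC).AddDate(0, 0, n)
}

// PassRate returns the fraction of passing runs for a test in the last windowDays
// before now. The second value is false when there are no runs in the window.
func PassRate(runs []Run, test string, now time.Time, windowDays int) (float64, bool) {
	cutoff := now.AddDate(0, 0, -windowDays)
	total, passed := 0, 0
	for _, r := range runs {
		if r.Test != test || r.At.Before(cutoff) || r.At.After(now) {
			continue
		}
		total++
		if r.Passed {
			passed++
		}
	}
	if total == 0 {
		return 0, false
	}
	return float64(passed) / float64(total), true
}

// FirstFailure returns the time of the earliest failing run for a test.
func FirstFailure(runs []Run, test string) (time.Time, bool) {
	var first time.Time
	found := false
	for _, r := range runs {
		if r.Test != test || r.Passed {
			continue
		}
		if !found || r.At.Before(first) {
			first, found = r.At, true
		}
	}
	return first, found
}

// sampleRuns returns made-up data for the example.
func sampleRuns() []Run {
	return []Run{
		{"TestUserAuthFlow", day(5), false},
		{"TestUserAuthFlow", day(20), true},
		{"TestUserAuthFlow", day(40), false},
		{"TestUserAuthFlow", day(60), true},
		{"TestUserAuthFlow", day(80), true},
		{"TestOther", day(50), false},
	}
}

func main() {
	runs := sampleRuns()
	rate, ok := PassRate(runs, "TestUserAuthFlow", day(100), 90)
	fmt.Println(rate, ok)
	first, found := FirstFailure(runs, "TestUserAuthFlow")
	fmt.Println(first.Format("2006-01-02"), found)
}
```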

Insights: the living knowledge of your delivery pipeline

This is the part that makes the biggest practical difference. Mendral maintains a list of insights, a continuously updated view of everything happening in your software delivery.

Every anomaly the agent detects becomes an insight: a flaky test, a CI incident, a security alert from a dependency update, a performance regression in build times, a pattern of failures correlated with a specific runner type. Each insight tracks the full lifecycle of the issue. When it first appeared, how it evolved, whether it's been resolved, and if it comes back.

The insights list is alive. The agent constantly refreshes it. When two insights turn out to be the same root cause, they get deduplicated. When a team member fixes an issue outside of Mendral (manually merging a fix, reverting a bad commit), the agent detects the resolution and auto-closes the corresponding insight. If the problem reoccurs later, the insight reopens with full history intact.
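The lifecycle in the paragraph above (deduplicate, auto-close, reopen with history) can be sketched as a small tracker. This is a simplified illustration, not our implementation: the `Insight` and `Tracker` types are hypothetical, and keying insights by a root-cause string is an assumption made to keep the example short. In practice, discovering that two insights share a root cause is the hard part.

```go
package main

import "fmt"

// Status is the lifecycle state of an insight.
type Status string

const (
	Open   Status = "open"
	Closed Status = "closed"
)

// Insight is a hypothetical record of one tracked issue.
type Insight struct {
	RootCause string
	Status    Status
	History   []string
}

// Tracker keys insights by root cause, so reports that share one are merged.
type Tracker struct {
	byCause map[string]*Insight
}

// NewTracker returns an empty tracker.
func NewTracker() *Tracker {
	return &Tracker{byCause: map[string]*Insight{}}
}

// Report records a detection. A known root cause is deduplicated into the
// existing insight, and a closed insight is reopened with its history kept.
func (t *Tracker) Report(rootCause, note string) *Insight {
	in, ok := t.byCause[rootCause]
	if !ok {
		in = &Insight{RootCause: rootCause, Status: Open}
		t.byCause[rootCause] = in
	} else if in.Status == Closed {
		in.Status = Open
		in.History = append(in.History, "reopened")
	}
	in.History = append(in.History, "seen: "+note)
	return in
}

// Resolve closes an open insight, for example after a fix merged outside the agent.
func (t *Tracker) Resolve(rootCause, note string) bool {
	in, ok := t.byCause[rootCause]
	if !ok || in.Status == Closed {
		return false
	}
	in.Status = Closed
	in.History = append(in.History, "resolved: "+note)
	return true
}

// Len returns the number of distinct insights.
func (t *Tracker) Len() int { return len(t.byCause) }

func main() {
	tr := NewTracker()
	tr.Report("redis-pool", "TestUserAuthFlow flaky")
	tr.Report("redis-pool", "TestLogin timeout") // same root cause, merged
	tr.Resolve("redis-pool", "manual revert detected")
	in := tr.Report("redis-pool", "TestUserAuthFlow flaky again")
	fmt.Println(tr.Len(), in.Status, in.History)
}
```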

At any point, the insights give you a complete state of your software delivery. Not a dashboard of metrics. A prioritized, contextualized list of active issues, their root causes, and what's been done about each one. It's what a great Platform Engineer would keep in their head, except it covers everything and doesn't forget.

The agent learns

Insights aren't just a status board. They're a learning system. Over time, Mendral builds a history of how issues in your pipeline get resolved. Which types of flaky tests tend to be race conditions. Which dependency updates tend to break things. Which parts of your codebase generate the most CI noise. How long different categories of issues typically take to fix.

This means the longer Mendral runs on your codebase, the better it gets. A fresh install diagnoses failures based on our general CI expertise (which is substantial, more on that below). After a week, it starts recognizing patterns specific to your project. After a month, it knows your codebase like a senior engineer who's been on the team for years. It knows that TestUserAuthFlow has been flaky since the Redis connection pooling change in January. It knows that builds on the staging branch tend to fail on Tuesday mornings because of a scheduled job that competes for database connections. It knows that the last three times someone bumped the @testing-library version, it broke two E2E suites. And it keeps getting better from there.

Static analysis on every tool call

Here's something we haven't talked about publicly before. We run static analysis on every tool call the agent makes, both going in and coming out.

On the input side, we filter what the agent sends to tools. On the output side, we filter what comes back. Some tool calls dynamically modify the agent's context at runtime, injecting additional guidance based on what the agent is doing and what it finds.

A concrete example. Say the agent is investigating a test failure and calls the log query tool with arguments targeting a specific workflow and time range. The static analysis layer inspects those arguments together with the tool's output and detects that the query returned sparse results: 3 data points where it would normally expect 50. Instead of letting the agent reason on incomplete data, the analysis layer injects a context update: "Log coverage for this workflow appears incomplete for the requested time range. Consider expanding the window or checking if log ingestion was delayed during this period. Query the ingestion delay metrics before concluding." The agent adjusts its investigation path based on this guidance, without us having to hardcode every possible edge case into the system prompt.

This pattern lets us encode operational knowledge at the tool boundary rather than in the prompt. The prompts stay focused on reasoning. The tool layer handles the domain-specific guardrails.
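A rough sketch of the sparse-results check from the example above might look like the following. Everything here is an assumption for illustration: the `ToolResult` type, the `CheckCoverage` function, and the 20% threshold are hypothetical, and the guidance text paraphrases the message quoted earlier.

```go
package main

import "fmt"

// ToolResult is a hypothetical tool output plus any guidance injected for the agent.
type ToolResult struct {
	Rows     int
	Guidance []string
}

// CheckCoverage runs on the output side of a tool call. When the result has far
// fewer rows than expected, it appends guidance instead of changing the data.
// minFraction is the share of expectedRows below which a result counts as sparse.
func CheckCoverage(res ToolResult, expectedRows int, minFraction float64) ToolResult {
	if expectedRows <= 0 {
		return res // no baseline, nothing to compare against
	}
	if float64(res.Rows) < minFraction*float64(expectedRows) {
		res.Guidance = append(res.Guidance, fmt.Sprintf(
			"Log coverage appears incomplete: got %d rows, expected about %d. "+
				"Consider expanding the window or checking ingestion delay before concluding.",
			res.Rows, expectedRows))
	}
	return res
}

func main() {
	sparse := CheckCoverage(ToolResult{Rows: 3}, 50, 0.2)
	fmt.Println(len(sparse.Guidance), sparse.Guidance)
	healthy := CheckCoverage(ToolResult{Rows: 45}, 50, 0.2)
	fmt.Println(len(healthy.Guidance))
}
```

The check leaves the rows untouched and only adds context, so the agent still sees the raw result and decides what to do with the hint.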

The same static analysis layer also enforces a strict security model. The agent can't perform destructive actions. It can't delete branches, force-push, close PRs it didn't open, or modify CI configuration in ways that could break the pipeline. The analysis layer inspects every tool call and blocks anything outside the agent's scope. This isn't a prompt instruction ("please don't delete anything"). It's a hard enforcement boundary that the LLM can't reason its way around.
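The difference between a prompt instruction and an enforcement boundary is easiest to see in code: the check runs before the tool does, and the model never gets a vote. The sketch below is hypothetical throughout. The `Call` type, the action names, and `Authorize` are illustrative, and a real policy would cover far more than these three rules.

```go
package main

import (
	"errors"
	"fmt"
)

// Call is a hypothetical description of a tool call about to be executed.
type Call struct {
	Action          string // e.g. "open_pr", "close_pr", "delete_branch"
	PROpenedByAgent bool   // only meaningful for actions on a pull request
}

// ErrBlocked wraps every policy rejection.
var ErrBlocked = errors.New("blocked by policy")

// denied lists actions that are never allowed, whatever the arguments.
var denied = map[string]bool{
	"delete_branch": true,
	"force_push":    true,
}

// Authorize runs before a tool executes and returns an error for out-of-scope calls.
func Authorize(c Call) error {
	if denied[c.Action] {
		return fmt.Errorf("%w: %s is never allowed", ErrBlocked, c.Action)
	}
	if c.Action == "close_pr" && !c.PROpenedByAgent {
		return fmt.Errorf("%w: can only close PRs the agent opened", ErrBlocked)
	}
	return nil
}

func main() {
	fmt.Println(Authorize(Call{Action: "open_pr"}))
	fmt.Println(Authorize(Call{Action: "force_push"}))
	fmt.Println(Authorize(Call{Action: "close_pr", PROpenedByAgent: false}))
	fmt.Println(Authorize(Call{Action: "close_pr", PROpenedByAgent: true}))
}
```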

A team of agents

From the outside, Mendral is one agent. Internally, it's a fleet. Different agents use different models, different tool sets, and are scoped to different tasks.

Today we run on Anthropic's model tiers, but the architecture is multi-model by design. We're adding support for other coding LLMs (OpenAI Codex, Gemini) as they prove strong at specific tasks. The right model for log parsing might not be the right model for root cause analysis, and those might come from different providers.

Currently, Haiku handles log parsing and data extraction. Thousands of these run every day. Sonnet collects evidence, writes SQL queries, gathers facts from the repository, and deduplicates issues. Opus handles root cause analysis and writes fixes. When the agent needs to form a hypothesis about why a test is failing, reason about complex interactions between test suites, or write a non-trivial patch that touches CI configuration and test code simultaneously, Opus takes over.

This isn't just cost optimization. It's quality optimization. Using Opus for log parsing would be wasteful. Using Haiku for root cause analysis would produce worse results. Each tier is matched to the cognitive demand of the task.
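The tiering above amounts to a routing table from task type to model tier. A minimal sketch, assuming hypothetical `Task` names and a static map (the tier names come from the description above; the task granularity and the lookup function are illustrative):

```go
package main

import "fmt"

// Task is a hypothetical label for a unit of agent work.
type Task string

// Tier is a model tier name as used in this post.
type Tier string

const (
	LogParsing         Task = "log_parsing"
	DataExtraction     Task = "data_extraction"
	EvidenceCollection Task = "evidence_collection"
	SQLQuery           Task = "sql_query"
	Deduplication      Task = "deduplication"
	RootCauseAnalysis  Task = "root_cause_analysis"
	WriteFix           Task = "write_fix"

	Haiku  Tier = "haiku"
	Sonnet Tier = "sonnet"
	Opus   Tier = "opus"
)

// routing matches each task to the tier described above.
var routing = map[Task]Tier{
	LogParsing:         Haiku,
	DataExtraction:     Haiku,
	EvidenceCollection: Sonnet,
	SQLQuery:           Sonnet,
	Deduplication:      Sonnet,
	RootCauseAnalysis:  Opus,
	WriteFix:           Opus,
}

// TierFor returns the tier for a task, or false when the task is unknown.
func TierFor(t Task) (Tier, bool) {
	tier, ok := routing[t]
	return tier, ok
}

func main() {
	for _, t := range []Task{LogParsing, SQLQuery, RootCauseAnalysis} {
		tier, _ := TierFor(t)
		fmt.Println(t, "->", tier)
	}
}
```

Because the table maps tasks to tiers rather than to a specific provider's model, swapping in a model from another provider for one task is a one-line change.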

Continuous improvement across all teams

There's another advantage a specialist agent has over a general one. We see everything.

Every investigation Mendral runs, across every customer, teaches us something. When a team gives a thumbs down on a diagnosis, we trace the exact reasoning chain that went wrong. When we see a pattern of failures that our agent handles poorly, we update the prompts, tools, and data retrieval logic. When a new category of CI issue emerges (and they do, constantly, especially as AI coding tools change how code gets written), we build support for it.

This improvement cycle runs continuously. Not just model upgrades. Not just prompt tweaks. Deeper changes: new tools, new data sources, new analysis patterns, refined static analysis rules. Every week, the agent gets meaningfully better at handling edge cases that none of our customers could have anticipated individually.

An engineer running Claude Code on their CI gets the same experience today that they'll get next month. Mendral gets better every week, informed by real failures at real scale across production teams.

What we're building next

Everything above is about what Mendral does today. Here's where we're heading.

Custom agentic workflows. We're building the ability for teams to define their own agents on top of our harness and data layer. You'll be able to create custom workflows that use our tools, our data, and our infrastructure, but with your own logic. A custom agent that enforces your team's specific merge policies. A workflow that runs a pre-merge security audit using your internal checklist. An agent that generates release notes from the commits and test results in a release branch.

Pluggable data sources. Today, Mendral's primary data source is CI logs and GitHub metadata. We're adding support for Sentry exceptions, OpenTelemetry traces, deployment logs from Kubernetes clusters, and other operational data. The goal is to give the agent visibility across the full delivery lifecycle, not just CI.

Multi-CI support. Mendral currently runs on GitHub Actions. The architecture is multi-CI by design, and support for Buildkite, CircleCI, and GitLab CI is coming soon. Same agent, same data layer, same insights, regardless of where your pipelines run.

This is where the distinction from SRE tools matters. A traditional SRE agent or incident response tool catches errors after they've hit production. It's reactive by design. Mendral acts on your software delivery pipeline to prevent issues from reaching production in the first place. It catches the flaky test, the regression, the security vulnerability, the performance degradation while it's still in CI, before it ships.

The vision is an agent that understands your entire delivery pipeline, from commit to production. One that doesn't just diagnose failures but anticipates them. One that gets smarter every week. And one that costs a fraction of what a senior Platform Engineer costs, without ever taking a day off.


We're building Mendral (YC W26). We spent a decade building and scaling CI systems at Docker and Dagger. If your team is burning engineering time on CI failures and flaky tests, we'd love to show you what Mendral can do. Book a demo.