Mendral gives their AI agent a raw SQL interface to ClickHouse instead of a rigid tool API, letting it write arbitrary queries across 5 TiB of CI log data (compressed to 154 GiB). The agent averages 4.4 queries per investigation, scanning hundreds of thousands to billions of rows, and resolves flaky test root causes in seconds. The key insight: LLMs are natively good at SQL, so don't constrain them with predefined query functions.
| Field | Value |
|---|---|
| Title | LLMs Are Good at SQL. We Gave Ours Terabytes of CI Logs. |
| Author | Mendral (YC W26) |
| Link | https://www.mendral.com/blog/llms-are-good-at-sql |
| Tags | llm, sql, clickhouse, ci, agents, observability, devtools |
| Date Downloaded | 2026-02-27 |
LLMs are good at SQL. There's an enormous amount of SQL in training data, and the syntax maps well to natural-language questions about data.
— Mendral
A constrained tool API like get_failure_rate(workflow, days) would limit the agent to the questions we anticipated. A SQL interface lets it ask questions we never thought of.
— Mendral
Nobody puts "we built a really good rate limiter" on their landing page. But without fresh, queryable data, your agent can't answer the question that actually matters: did I break this, or was it already broken?
— Mendral
A single commit can spawn hundreds of parallel jobs, each producing logs you need to fetch.
— Mendral
Mendral (YC W26) built an AI agent that debugs CI failures by writing its own SQL queries against ClickHouse — no predefined query library, just a raw SQL interface scoped per org. They ingest ~1.5 billion CI log lines per week, denormalize 48 columns of metadata onto every log line (5.31 TiB uncompressed → 154 GiB on disk at 35:1 compression), and the agent averages 4.4 queries per investigation across 8,500+ sessions. The search pattern is broad-to-narrow: start with job metadata (20ms median), drill into raw logs when something's interesting (110ms median). At P95, a single investigation scans 940 million rows. The whole system depends on ClickHouse's columnar compression (repeated metadata compresses 50-300x), skip indexes with bloom filters, materialized views, and a carefully throttled GitHub API ingestion pipeline running on Inngest for durable execution. The boring but critical detail: they cap ingestion at ~3 req/s to keep 4,000 API calls/hour free for the agent itself.
Instead of predefined query functions like get_failure_rate(), Mendral exposes a raw SQL interface. LLMs have massive SQL training data and map natural-language questions to SQL naturally. This lets the agent ask questions the developers never anticipated. [1]

| Target | Sessions | Avg Queries | Median Rows | P75 | P95 |
|---|---|---|---|---|---|
| Job metadata | 8,210 | 4.0 | 164K | 563K | 4.4M |
| Raw log lines | 5,413 | 3.5 | 4.4M | 69M | 4.3B |
| Combined | 8,534 | 4.4 | 335K | 5.2M | 940M |
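The broad-to-narrow pattern behind these numbers can be sketched as a pair of ClickHouse queries. The table and column names below (ci_jobs, ci_log_lines, conclusion, and the literal org and job values) are illustrative assumptions, not Mendral's published schema:

```sql
-- Step 1: broad pass over cheap job metadata to surface flaky jobs.
-- Table/column names are hypothetical, not Mendral's actual schema.
SELECT
    workflow_name,
    job_name,
    countIf(conclusion = 'failure') AS failures,
    count() AS runs,
    failures / runs AS failure_rate
FROM ci_jobs
WHERE org = 'acme' AND ts > now() - INTERVAL 14 DAY
GROUP BY workflow_name, job_name
HAVING failures > 0
ORDER BY failure_rate DESC
LIMIT 20;

-- Step 2: drill into raw log lines for one suspicious job (more expensive).
SELECT ts, run_id, line_number, line_content
FROM ci_log_lines
WHERE org = 'acme'
  AND job_name = 'integration-tests'
  AND line_content LIKE '%TimeoutError%'
  AND ts > now() - INTERVAL 14 DAY
ORDER BY ts DESC
LIMIT 100;
```

The first query touches only small metadata columns, which is why it sits in the ~20ms band; the second scans the wide line_content column and lands in the more expensive tier.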
Top compression ratios: commit_message at 301:1, display_title at 160:1, workflow_path at 79:1. [1] The three high-entropy columns (line_content, ts, line_number) account for 53% of all storage. The other 45 metadata columns that repeat across thousands of lines are essentially free. [1]

| Rows Scanned | Median Latency | P95 Latency |
|---|---|---|
| < 1K | 10ms | 50ms |
| 10K-100K | 20ms | 50ms |
| 1M-10M | 90ms | 1.2s |
| 100M-1B | 6.8s | 30.6s |
| 1B+ | 31s | 82s |
Primary key (org, ts, repository, run_id, ...) for physical sort order. Bloom filter skip indexes on 14 columns. Ngram bloom filter on line_content for full-text search. Materialized views for pre-computed aggregations. Async inserts for high write throughput. [1]

Mendral's core thesis is simple: LLMs are natively good at SQL, so give them SQL instead of constraining them with predetermined tool APIs. Their agent traced a flaky test to a dependency bump three weeks prior by autonomously writing SQL queries across hundreds of millions of log lines, something that would take a human significant manual effort scrolling through GitHub Actions log viewers.
The system ingests about 1.5 billion CI log lines weekly into ClickHouse, landing with 48 columns of denormalized metadata on every single row. This sounds like a storage disaster in a traditional row-store, but ClickHouse's columnar format makes it essentially free — repeated values like commit_message compress at 301:1 because thousands of log lines from the same CI run share identical values. The total dataset is 5.31 TiB uncompressed, stored in just 154 GiB on disk (35:1 compression), averaging 21 bytes per log line with all metadata included.
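Per-column compression like this is directly observable in ClickHouse via the system.columns system table; only the table name 'ci_log_lines' below is a hypothetical stand-in:

```sql
-- Inspect per-column compression; 'ci_log_lines' is an illustrative table name.
SELECT
    name,
    formatReadableSize(data_uncompressed_bytes) AS uncompressed,
    formatReadableSize(data_compressed_bytes)   AS compressed,
    round(data_uncompressed_bytes / data_compressed_bytes, 1) AS ratio
FROM system.columns
WHERE database = currentDatabase()
  AND table = 'ci_log_lines'
ORDER BY data_uncompressed_bytes DESC;
```

A query like this is presumably how the 301:1 and 160:1 figures in the post were measured.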
The agent's investigation pattern is methodical: start broad with job metadata queries (cheap, 20ms median), identify anomalies, then drill into raw log lines (more expensive, 110ms median). Across 8,534 sessions and 52,312 queries, the typical investigation scans 335K rows across about 4.4 queries. At the extreme end, P95 sessions scan 940 million rows, and the heaviest raw-log investigations hit 4.3 billion rows.
ClickHouse's performance holds up through careful engineering: primary key design for physical sort order matching the access pattern, bloom filter skip indexes on 14 columns (including an ngram filter for full-text search on log content), materialized views for pre-computed aggregations, and async inserts for write throughput. Query latency scales roughly linearly with rows scanned — 10x more rows means about 10x more latency.
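Those four techniques can be sketched in DDL. This is a minimal, hypothetical schema assuming the column names mentioned in the post (the ngram parameters, engines, and the ci_jobs source table are illustrative choices, not Mendral's actual configuration):

```sql
-- Illustrative only; Mendral's real schema is not public.
CREATE TABLE ci_log_lines
(
    org          LowCardinality(String),
    ts           DateTime64(3),
    repository   LowCardinality(String),
    run_id       UInt64,
    job_name     LowCardinality(String),
    line_number  UInt32,
    line_content String,
    -- skip index: prune granules that cannot contain a given run_id
    INDEX idx_run_id run_id TYPE bloom_filter GRANULARITY 4,
    -- ngram bloom filter accelerates LIKE '%substring%' over log text
    INDEX idx_content line_content TYPE ngrambf_v1(3, 10000, 3, 7) GRANULARITY 4
)
ENGINE = MergeTree
ORDER BY (org, ts, repository, run_id);  -- physical sort matches access pattern

-- Pre-aggregation so broad metadata queries stay in the ~20ms band.
CREATE MATERIALIZED VIEW job_failures_daily
ENGINE = SummingMergeTree
ORDER BY (org, repository, job_name, day)
AS SELECT
    org, repository, job_name,
    toDate(ts) AS day,
    countIf(conclusion = 'failure') AS failures,
    count() AS runs
FROM ci_jobs
GROUP BY org, repository, job_name, day;
```

The ORDER BY clause is the most consequential choice: with org first, every per-org query prunes all other tenants' data before any index is consulted.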
The often-overlooked infrastructure piece is GitHub API rate limiting. With 15,000 requests/hour shared between ingestion and the agent, early burst-mode ingestion would exhaust the budget and leave the agent working with 30-minute-old data. The fix was throttling ingestion to ~3 req/s, keeping ~4,000 requests/hour free for agent operations. Both the ingestion pipeline and agent run on Inngest (durable execution), so rate limit hits cause clean suspensions with full state checkpointing rather than crashes or blind retries. The target is under 5 minutes at P95 for ingestion delay, typically achieved in seconds.