Mendral gives their AI agent a raw SQL interface to ClickHouse instead of a rigid tool API, letting it write arbitrary queries across 5 TiB of CI log data (compressed to 154 GiB). The agent averages 4.4 queries per investigation, scanning hundreds of thousands to billions of rows, and resolves flaky test root causes in seconds. The key insight: LLMs are natively good at SQL, so don't constrain them with predefined query functions.
| Field | Value |
|---|---|
| Title | LLMs Are Good at SQL. We Gave Ours Terabytes of CI Logs. |
| Author | Mendral (YC W26) |
| Link | https://www.mendral.com/blog/llms-are-good-at-sql |
| Tags | llm, sql, clickhouse, ci, agents, observability, devtools |
| Date Downloaded | 2026-02-27 |
LLMs are good at SQL. There's an enormous amount of SQL in training data, and the syntax maps well to natural-language questions about data.
— Mendral
A constrained tool API like get_failure_rate(workflow, days) would limit the agent to the questions we anticipated. A SQL interface lets it ask questions we never thought of.
— Mendral
Nobody puts "we built a really good rate limiter" on their landing page. But without fresh, queryable data, your agent can't answer the question that actually matters: did I break this, or was it already broken?
— Mendral
A single commit can spawn hundreds of parallel jobs, each producing logs you need to fetch.
— Mendral
Mendral (YC W26) built an AI agent that debugs CI failures by writing its own SQL queries against ClickHouse — no predefined query library, just a raw SQL interface scoped per org. They ingest ~1.5 billion CI log lines per week, denormalize 48 columns of metadata onto every log line (5.31 TiB uncompressed → 154 GiB on disk at 35:1 compression), and the agent averages 4.4 queries per investigation across 8,500+ sessions. The search pattern is broad-to-narrow: start with job metadata (20ms median), drill into raw logs when something's interesting (110ms median). At P95, a single investigation scans 940 million rows. The whole system depends on ClickHouse's columnar compression (repeated metadata compresses 50-300x), skip indexes with bloom filters, materialized views, and a carefully throttled GitHub API ingestion pipeline running on Inngest for durable execution. The boring but critical detail: they cap ingestion at ~3 req/s to keep 4,000 API calls/hour free for the agent itself.
Instead of predefined query functions like get_failure_rate(), Mendral exposes a raw SQL interface. LLMs have massive SQL training data and map natural-language questions to SQL naturally. This lets the agent ask questions the developers never anticipated. [1]

| Target | Sessions | Avg Queries | Median Rows | P75 | P95 |
|---|---|---|---|---|---|
| Job metadata | 8,210 | 4.0 | 164K | 563K | 4.4M |
| Raw log lines | 5,413 | 3.5 | 4.4M | 69M | 4.3B |
| Combined | 8,534 | 4.4 | 335K | 5.2M | 940M |
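The broad-to-narrow pattern behind these numbers can be sketched as a pair of ClickHouse queries. The table and column names below (ci_jobs, ci_log_lines, conclusion, and the literal org and job values) are illustrative assumptions, not Mendral's published schema:

```sql
-- Step 1: broad pass over cheap job metadata to surface flaky jobs.
-- Table/column names are hypothetical, not Mendral's actual schema.
SELECT
    workflow_name,
    job_name,
    countIf(conclusion = 'failure') AS failures,
    count() AS runs,
    failures / runs AS failure_rate
FROM ci_jobs
WHERE org = 'acme' AND ts > now() - INTERVAL 14 DAY
GROUP BY workflow_name, job_name
HAVING failures > 0
ORDER BY failure_rate DESC
LIMIT 20;

-- Step 2: drill into raw log lines for one suspicious job (more expensive).
SELECT ts, run_id, line_number, line_content
FROM ci_log_lines
WHERE org = 'acme'
  AND job_name = 'integration-tests'
  AND line_content LIKE '%TimeoutError%'
  AND ts > now() - INTERVAL 14 DAY
ORDER BY ts DESC
LIMIT 100;
```

The first query touches only small metadata columns, which is why it sits in the ~20ms band; the second scans the wide line_content column and lands in the more expensive tier.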
Top compression ratios: commit_message at 301:1, display_title at 160:1, workflow_path at 79:1. [1] The three high-entropy columns (line_content, ts, line_number) account for 53% of all storage. The other 45 metadata columns that repeat across thousands of lines are essentially free. [1]

| Rows Scanned | Median Latency | P95 Latency |
|---|---|---|
| < 1K | 10ms | 50ms |
| 10K-100K | 20ms | 50ms |
| 1M-10M | 90ms | 1.2s |
| 100M-1B | 6.8s | 30.6s |
| 1B+ | 31s | 82s |
Primary key (org, ts, repository, run_id, ...) for physical sort order. Bloom filter skip indexes on 14 columns. Ngram bloom filter on line_content for full-text search. Materialized views for pre-computed aggregations. Async inserts for high write throughput. [1]

Mendral's core thesis is simple: LLMs are natively good at SQL, so give them SQL instead of constraining them with predetermined tool APIs. Their agent traced a flaky test to a dependency bump three weeks prior by autonomously writing SQL queries across hundreds of millions of log lines, something that would take a human significant manual effort scrolling through GitHub Actions log viewers.
The system ingests about 1.5 billion CI log lines weekly into ClickHouse, landing with 48 columns of denormalized metadata on every single row. This sounds like a storage disaster in a traditional row-store, but ClickHouse's columnar format makes it essentially free — repeated values like commit_message compress at 301:1 because thousands of log lines from the same CI run share identical values. The total dataset is 5.31 TiB uncompressed, stored in just 154 GiB on disk (35:1 compression), averaging 21 bytes per log line with all metadata included.
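Per-column compression like this is directly observable in ClickHouse via the system.columns system table; only the table name 'ci_log_lines' below is a hypothetical stand-in:

```sql
-- Inspect per-column compression; 'ci_log_lines' is an illustrative table name.
SELECT
    name,
    formatReadableSize(data_uncompressed_bytes) AS uncompressed,
    formatReadableSize(data_compressed_bytes)   AS compressed,
    round(data_uncompressed_bytes / data_compressed_bytes, 1) AS ratio
FROM system.columns
WHERE database = currentDatabase()
  AND table = 'ci_log_lines'
ORDER BY data_uncompressed_bytes DESC;
```

A query like this is presumably how the 301:1 and 160:1 figures in the post were measured.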
The agent's investigation pattern is methodical: start broad with job metadata queries (cheap, 20ms median), identify anomalies, then drill into raw log lines (more expensive, 110ms median). Across 8,534 sessions and 52,312 queries, the typical investigation scans 335K rows across about 4.4 queries. At the extreme end, P95 sessions scan 940 million rows, and the heaviest raw-log investigations hit 4.3 billion rows.
ClickHouse's performance holds up through careful engineering: primary key design for physical sort order matching the access pattern, bloom filter skip indexes on 14 columns (including an ngram filter for full-text search on log content), materialized views for pre-computed aggregations, and async inserts for write throughput. Query latency scales roughly linearly with rows scanned — 10x more rows means about 10x more latency.
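Those four techniques can be sketched in DDL. This is a minimal, hypothetical schema assuming the column names mentioned in the post (the ngram parameters, engines, and the ci_jobs source table are illustrative choices, not Mendral's actual configuration):

```sql
-- Illustrative only; Mendral's real schema is not public.
CREATE TABLE ci_log_lines
(
    org          LowCardinality(String),
    ts           DateTime64(3),
    repository   LowCardinality(String),
    run_id       UInt64,
    job_name     LowCardinality(String),
    line_number  UInt32,
    line_content String,
    -- skip index: prune granules that cannot contain a given run_id
    INDEX idx_run_id run_id TYPE bloom_filter GRANULARITY 4,
    -- ngram bloom filter accelerates LIKE '%substring%' over log text
    INDEX idx_content line_content TYPE ngrambf_v1(3, 10000, 3, 7) GRANULARITY 4
)
ENGINE = MergeTree
ORDER BY (org, ts, repository, run_id);  -- physical sort matches access pattern

-- Pre-aggregation so broad metadata queries stay in the ~20ms band.
CREATE MATERIALIZED VIEW job_failures_daily
ENGINE = SummingMergeTree
ORDER BY (org, repository, job_name, day)
AS SELECT
    org, repository, job_name,
    toDate(ts) AS day,
    countIf(conclusion = 'failure') AS failures,
    count() AS runs
FROM ci_jobs
GROUP BY org, repository, job_name, day;
```

The ORDER BY clause is the most consequential choice: with org first, every per-org query prunes all other tenants' data before any index is consulted.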
The often-overlooked infrastructure piece is GitHub API rate limiting. With 15,000 requests/hour shared between ingestion and the agent, early burst-mode ingestion would exhaust the budget and leave the agent working with 30-minute-old data. The fix was throttling ingestion to ~3 req/s, keeping ~4,000 requests/hour free for agent operations. Both the ingestion pipeline and agent run on Inngest (durable execution), so rate limit hits cause clean suspensions with full state checkpointing rather than crashes or blind retries. The target is under 5 minutes at P95 for ingestion delay, typically achieved in seconds.