
LLMs Are Good at SQL: Mendral's Agent-Driven CI Log Analysis

Metadata

| Field | Value |
| --- | --- |
| Title | LLMs Are Good at SQL. We Gave Ours Terabytes of CI Logs. |
| Author | Mendral (YC W26) |
| Link | https://www.mendral.com/blog/llms-are-good-at-sql |
| Tags | llm, sql, clickhouse, ci, agents, observability, devtools |
| Date Downloaded | 2026-02-27 |

At a Glance

Mendral gives their AI agent a raw SQL interface to ClickHouse instead of a rigid tool API, letting it write arbitrary queries across 5 TiB of CI log data (compressed to 154 GiB). The agent averages 4.4 queries per investigation, scanning hundreds of thousands to billions of rows, and resolves flaky test root causes in seconds. The key insight: LLMs are natively good at SQL, so don't constrain them with predefined query functions.

Quotes

LLMs are good at SQL. There's an enormous amount of SQL in training data, and the syntax maps well to natural-language questions about data.

— Mendral

A constrained tool API like get_failure_rate(workflow, days) would limit the agent to the questions we anticipated. A SQL interface lets it ask questions we never thought of.

— Mendral

Nobody puts "we built a really good rate limiter" on their landing page. But without fresh, queryable data, your agent can't answer the question that actually matters: did I break this, or was it already broken?

— Mendral

A single commit can spawn hundreds of parallel jobs, each producing logs you need to fetch.

— Mendral

Sam's TLDR

Mendral (YC W26) built an AI agent that debugs CI failures by writing its own SQL queries against ClickHouse — no predefined query library, just a raw SQL interface scoped per org. They ingest ~1.5 billion CI log lines per week, denormalize 48 columns of metadata onto every log line (5.31 TiB uncompressed → 154 GiB on disk at 35:1 compression), and the agent averages 4.4 queries per investigation across 8,500+ sessions. The search pattern is broad-to-narrow: start with job metadata (20ms median), drill into raw logs when something's interesting (110ms median). At P95, a single investigation scans 940 million rows. The whole system depends on ClickHouse's columnar compression (repeated metadata compresses 50-300x), skip indexes with bloom filters, materialized views, and a carefully throttled GitHub API ingestion pipeline running on Inngest for durable execution. The boring but critical detail: they cap ingestion at ~3 req/s to keep 4,000 API calls/hour free for the agent itself.

Key Points

| Target | Sessions | Avg Queries | Median Rows | P75 | P95 |
| --- | --- | --- | --- | --- | --- |
| Job metadata | 8,210 | 4.0 | 164K | 563K | 4.4M |
| Raw log lines | 5,413 | 3.5 | 4.4M | 69M | 4.3B |
| Combined | 8,534 | 4.4 | 335K | 5.2M | 940M |

| Rows Scanned | Median Latency | P95 Latency |
| --- | --- | --- |
| < 1K | 10ms | 50ms |
| 10K-100K | 20ms | 50ms |
| 1M-10M | 90ms | 1.2s |
| 100M-1B | 6.8s | 30.6s |
| 1B+ | 31s | 82s |

Full Summary

Mendral's core thesis is simple: LLMs are natively good at SQL, so give them SQL instead of constraining them with predetermined tool APIs. Their agent traced a flaky test to a dependency bump three weeks prior by autonomously writing SQL queries across hundreds of millions of log lines — something that would take a human significant manual effort scrolling through GitHub Actions log viewers.

The system ingests about 1.5 billion CI log lines weekly into ClickHouse, landing with 48 columns of denormalized metadata on every single row. This sounds like a storage disaster in a traditional row-store, but ClickHouse's columnar format makes it essentially free — repeated values like commit_message compress at 301:1 because thousands of log lines from the same CI run share identical values. The total dataset is 5.31 TiB uncompressed, stored in just 154 GiB on disk (35:1 compression), averaging 21 bytes per log line with all metadata included.
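A minimal sketch of what such a denormalized table could look like in ClickHouse. The table name, column names, and codecs here are hypothetical illustrations, not Mendral's actual schema; the point is that repeated per-run metadata columns like `commit_message` are what the columnar codecs collapse:

```sql
-- Hypothetical schema: every log line carries denormalized run metadata.
-- LowCardinality dictionaries plus ZSTD make repeated values nearly free.
CREATE TABLE ci_logs
(
    org_id         UInt64,
    repo           LowCardinality(String),
    workflow       LowCardinality(String),
    job_name       LowCardinality(String),
    run_id         UInt64,
    commit_sha     FixedString(40),
    commit_message String CODEC(ZSTD(3)),  -- identical across a run's lines, so it compresses extremely well
    started_at     DateTime,
    line_no        UInt32,
    content        String CODEC(ZSTD(3))
    -- plus the rest of the ~48 denormalized metadata columns
)
ENGINE = MergeTree
ORDER BY (org_id, repo, workflow, started_at, run_id, line_no);
```

The `ORDER BY` clause doubles as the physical sort order, so queries scoped to one org, repo, and time window read a contiguous slice of disk rather than the whole table.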

The agent's investigation pattern is methodical: start broad with job metadata queries (cheap, 20ms median), identify anomalies, then drill into raw log lines (more expensive, 110ms median). Across 8,534 sessions and 52,312 queries, the typical investigation scans 335K rows across about 4.4 queries. At the extreme end, P95 sessions scan 940 million rows, and the heaviest raw-log investigations hit 4.3 billion rows.
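The broad-to-narrow pattern might look like the following pair of queries. Table names, column names, and filter values are illustrative assumptions, not Mendral's production SQL:

```sql
-- Step 1 (broad, cheap): which jobs started failing, and when?
SELECT
    job_name,
    toStartOfDay(started_at) AS day,
    countIf(conclusion = 'failure') / count() AS failure_rate
FROM ci_jobs
WHERE repo = 'acme/monorepo'
  AND started_at > now() - INTERVAL 30 DAY
GROUP BY job_name, day
ORDER BY failure_rate DESC
LIMIT 20;

-- Step 2 (narrow, heavier): drill into raw log lines for the suspect job.
SELECT started_at, run_id, line_no, content
FROM ci_logs
WHERE repo = 'acme/monorepo'
  AND job_name = 'integration-tests'
  AND content LIKE '%TimeoutError%'
ORDER BY started_at DESC
LIMIT 100;
```

Step 1 aggregates over compact metadata rows; only once an anomaly surfaces does the agent pay for a scan over raw log content.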

ClickHouse's performance holds up through careful engineering: a primary key whose physical sort order matches the access pattern, bloom filter skip indexes on 14 columns (including an ngram filter for full-text search on log content), materialized views for pre-computed aggregations, and async inserts for write throughput. Query latency scales roughly linearly with rows scanned — 10x more rows means about 10x more latency.
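These techniques can be sketched in ClickHouse DDL. Again, the index parameters, table names, and the materialized view below are hypothetical examples of the mechanisms described, not Mendral's actual configuration:

```sql
-- Hypothetical: ngram bloom filter enabling substring search on log content,
-- so LIKE '%TimeoutError%' can skip granules that can't contain the string.
ALTER TABLE ci_logs
    ADD INDEX idx_content content TYPE ngrambf_v1(4, 1024, 3, 0) GRANULARITY 4;

-- Hypothetical: plain bloom filter for point lookups on a metadata column.
ALTER TABLE ci_logs
    ADD INDEX idx_commit commit_sha TYPE bloom_filter(0.01) GRANULARITY 4;

-- Hypothetical materialized view pre-aggregating per-job daily line counts,
-- so broad "what changed recently" queries never touch raw logs.
CREATE MATERIALIZED VIEW ci_job_daily_mv
ENGINE = SummingMergeTree
ORDER BY (org_id, repo, job_name, day)
AS SELECT
    org_id, repo, job_name,
    toStartOfDay(started_at) AS day,
    count() AS lines
FROM ci_logs
GROUP BY org_id, repo, job_name, day;
```

Skip indexes don't locate rows; they let ClickHouse rule out whole granules, which is why they pair well with a sort order that already clusters related rows together.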

The often-overlooked infrastructure piece is GitHub API rate limiting. With 15,000 requests/hour shared between ingestion and the agent, early burst-mode ingestion would exhaust the budget and leave the agent working with 30-minute-old data. The fix was throttling ingestion to ~3 req/s, keeping ~4,000 requests/hour free for agent operations. Both the ingestion pipeline and the agent run on Inngest (durable execution), so rate-limit hits cause clean suspensions with full state checkpointing rather than crashes or blind retries. The ingestion-delay target is under 5 minutes at P95, and it is typically met in seconds.

References

  1. Mendral. "LLMs Are Good at SQL. We Gave Ours Terabytes of CI Logs." https://www.mendral.com/blog/llms-are-good-at-sql