environments to test, evaluate, or interactively explore AI agents. It's the difference between building a tool (MCP
server) and building the test rig that puts the tool in an agent's hands and verifies what happens.
| | MCP Server | MCP Harness |
|---|---|---|
| What it is | Provides tools/resources to an agent | Uses servers to create a controlled environment for testing/evaluating agents |
| Direction | Agent → Server (agent calls tools) | Harness → Server + Agent (harness sets up the world, agent acts, harness verifies) |
| Who builds it | Tool/API developers | Agent developers, QA, evaluators |
| Purpose | Expose capabilities | Validate behavior |
| Analogy | A database driver | A database test fixture + assertions |
An MCP server says: "here are the tools you can use."
An MCP harness says: "here's a controlled world — now prove you can do the task."
Research found the term "MCP harness" used in three distinct ways: (1) agent testing via composed MCP servers, (2) unit testing of MCP servers themselves, and (3) an interactive REPL for exploring servers.
The core pattern: compose several MCP servers into a single environment, point the agent at it, and assert on the resulting state.
Key code patterns:
- `compose_mcp_servers()` merges multiple FastMCP instances via `.mount()`
- `TaskList` is an MCP server that tracks task completion state and has `wait_for_all_completed()`
- `Filesystem` is a mock in-memory filesystem exposed as MCP tools

Example test flow:
1. Create TaskList with: "Multiply 983745 * 29837423 and write to output.txt"
2. Create mock Filesystem (in-memory)
3. Optionally include a Calculator MCP server
4. Compose all servers → single MCP endpoint
5. Start agent → agent connects, reads tasks, uses tools, writes result
6. Assert: filesystem.read("output.txt") == expected product
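Stubbing out the MCP layer, that six-step flow can be sketched in plain Python. `TaskList`, `Filesystem`, and the scripted "agent" below are illustrative stand-ins, not the real mcp_harness API:

```python
# Plain-Python stand-ins for the harness pieces; no real MCP involved.
class Filesystem:
    """In-memory mock filesystem, as the MCP server would expose it."""
    def __init__(self):
        self._files = {}
    def write(self, path, content):
        self._files[path] = content
    def read(self, path):
        return self._files[path]

class TaskList:
    """Tracks which tasks the agent has marked complete."""
    def __init__(self, tasks):
        self.tasks = list(tasks)
        self.completed = set()
    def complete(self, task):
        self.completed.add(task)
    def all_completed(self):
        return set(self.tasks) == self.completed

def scripted_agent(tasks, fs):
    # Stands in for the real agent: read tasks, use "tools", write output.
    for task in tasks.tasks:
        product = 983745 * 29837423  # would be a Calculator tool call
        fs.write("output.txt", str(product))
        tasks.complete(task)

tasks = TaskList(["Multiply 983745 * 29837423 and write to output.txt"])
fs = Filesystem()
scripted_agent(tasks, fs)

# Step 6: assert on the world the agent left behind.
assert tasks.all_completed()
assert fs.read("output.txt") == str(983745 * 29837423)
```

In the real harness the agent is an actual LLM connected over MCP, so the assertions stay the same while the agent's behavior is what's under test.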
This is essentially eval infrastructure for agents — but instead of prompt-in/text-out evaluation, you're evaluating
the agent's ability to use tools correctly in a realistic environment. The MCP protocol is the contract between the test
environment and the agent under test.
This inverts the direction — instead of testing agents, it tests MCP server implementations. Think supertest for
Express, but for MCP:
`createHarness()` takes an `McpServer` instance:

```ts
const harness = await createHarness(server);
const result = await harness.callTool('greet', {name: 'World'});
hasText(result, 'Hello'); // true
```
Also supports subprocess mode (spawns the server as a child process over stdio) for integration testing.
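That subprocess mode can be pictured with a plain-Python stand-in: spawn a child "server", send a request over stdin, and read the reply from stdout. The JSON framing here is invented for illustration and is not real MCP wire format:

```python
import json
import subprocess
import sys

# A stand-in "server": reads one JSON request from stdin, answers on stdout.
SERVER = """
import json, sys
req = json.loads(sys.stdin.readline())
print(json.dumps({"result": "Hello, " + req["name"] + "!"}))
"""

# The harness spawns the server as a child process and talks over stdio.
proc = subprocess.run(
    [sys.executable, "-c", SERVER],
    input=json.dumps({"name": "World"}) + "\n",
    capture_output=True,
    text=True,
)
response = json.loads(proc.stdout)
assert response["result"] == "Hello, World!"
```

The point is the isolation: the server runs as a real process over stdio, so the test exercises the same transport a production client would.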
A REPL that connects to any stdio MCP server and lets you:
- `list tools` — see what's available
- `call` — invoke tools interactively
- `switch` — swap between servers

Like Postman/curl for MCP. Useful for development and debugging, not automated testing.
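A toy sketch of those three commands, with servers modeled as plain dicts of tool-name to callable instead of real stdio MCP connections (all names here are illustrative):

```python
# Fake "servers": tool-name -> callable, standing in for MCP connections.
servers = {
    "calc": {"add": lambda a, b: int(a) + int(b)},
    "greet": {"hello": lambda name: f"Hello, {name}!"},
}
state = {"current": "calc"}

def repl_command(line):
    cmd, *args = line.split()
    tools = servers[state["current"]]
    if cmd == "list":
        return sorted(tools)            # list tools on the current server
    if cmd == "call":
        name, *tool_args = args
        return tools[name](*tool_args)  # invoke a tool interactively
    if cmd == "switch":
        state["current"] = args[0]      # swap to another server
        return args[0]

repl_command("call add 1 2")      # 3
repl_command("switch greet")
repl_command("call hello World")  # 'Hello, World!'
```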
Traditional agent evals are text-in/text-out — give the agent a prompt, check if the output matches. MCP harnesses
enable behavioral evaluation: does the agent use the right tools in the right order to achieve a goal? This is much
closer to how agents actually work in production.
The compose_mcp_servers() pattern is powerful. You can mix and match servers (task lists, mock filesystems, calculators, mock APIs) to build different test environments.
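What composition conceptually gives you can be sketched like this; the names and structure are illustrative, while the real mcp_harness merges FastMCP instances via `.mount()`:

```python
# Merge several servers' tools into one namespace, prefix per server.
def compose(**servers):
    merged = {}
    for prefix, tools in servers.items():
        for name, fn in tools.items():
            merged[f"{prefix}/{name}"] = fn  # prefix avoids name collisions
    return merged

store = {}
endpoint = compose(
    tasks={"list": lambda: ["multiply and write the result"]},
    fs={"write": store.__setitem__, "read": store.__getitem__},
    calc={"multiply": lambda a, b: a * b},
)

# The agent sees one endpoint but can reach every server's tools:
endpoint["fs/write"]("output.txt", endpoint["calc/multiply"](6, 7))
```

Swapping a server in or out changes the agent's world without touching the agent itself, which is what makes the pattern composable.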
If you ship an agent that uses MCP tools, you need to test it. A harness gives you a controlled, repeatable way to do exactly that.
A harness controls exactly what tools the agent has access to. The mock filesystem can't touch real files. The mock API
can't hit production. This is a natural sandbox.
Our existing architecture has parallels.
What we don't have yet: automated agent evaluation. An MCP harness pattern could let us write tests like "give Sam a
task list and a set of tools, verify it completes the tasks correctly."
| Project | Stars | Language | Focus |
|---|---|---|---|
| kindgracekind/mcp_harness | 7 | Python | Agent testing via composed MCP servers |
| gabry-ts/mcp-harness | 2 | TypeScript | MCP server unit testing (supertest for MCP) |
| izaitsevfb/claude-pytorch-treehugger | 4 | Python | Domain-specific MCP wrapper (PyTorch HUD) |
| angusforeman/simple-MCP-harness | 0 | Python/Shell | Interactive REPL for exploring MCP servers |
| parallax-labs/context-harness | 28 | Rust | Context ingestion engine (not really an "MCP harness" — includes MCP server) |
The most valuable interpretation of "MCP harness" is flavor #1: using composed MCP servers as a test environment for
agents. The MCP protocol becomes the interface between your test infrastructure and the agent under test. You control
the world (tools, data, tasks), the agent acts, you verify the results.
This is an emerging pattern. The repos are small and new. But the concept is solid — it's the natural next step once you
have agents that use tools. You need a way to test them that goes beyond "did the text output look right?"