Building Games with AI Agents: How I Built a Harness System for My Own Project

Author: Solo indie developer ｜ Project: 3KLife (Three Kingdoms strategy game, Cocos Creator 3.8)
Keywords: Harness Engineering, AI Agent, Coding Agent, automated validation, small-model-friendly design


Introduction: If AI Agents Write the Code, Who Owns Quality?

Over the past year, I have moved almost my entire game development workflow toward AI Agent-driven execution. GitHub Copilot, Claude, and all kinds of automation scripts now sit directly inside the loop. Efficiency clearly went up, but so did a different problem: agents make mistakes, and the worst mistakes are silent ones.

They do not always crash. They do not always warn you. Sometimes they quietly insert a malformed field into JSON, or point an import toward a module they should never touch. By the time you notice, the mistake may already be buried three days deep in code review.

It was only after reading Martin Fowler’s Harness Engineering for Coding Agent Users that I realized this problem had a proper name and, more importantly, a systematic solution.

This article is my own audit and reflection: what harness I have already built, what is still missing, and how I plan to close the gap.

1. What Is a Harness? The One-Sentence Version

Harness = everything in an AI Agent workflow except the model itself.

There are two parts:

Inner Harness: the system prompt, code retrieval behavior, and orchestration built into tools like Copilot or Claude. I do not control those directly.
Outer Harness: the guidance rules, validation scripts, and self-correction loops I build for my own project. This is the part I can actually design.

A good outer harness does two things:

Increases the chance that the agent gets it right on the first pass (feedforward guidance)
Lets the agent detect and repair issues before a human sees them (feedback sensing)

2. Two Dimensions of a Harness: Timing × Execution Mode

Dimension One: Timing

⬆️ Feedforward — before action

System prompts
Instruction rule sets
Task-card contracts
Consensus docs
Skill routing

⬇️ Feedback — after action

Linter failures
Type-check failures
Data validation scripts
Screenshot regression comparison
Module boundary guards

⚠️ Key trap: feedforward without feedback means the agent never learns whether the rule actually worked. Feedback without feedforward means the agent keeps repeating the same mistakes and wasting tokens. You need both.

Dimension Two: Execution Mode

Mode	Properties	Typical examples	Suggested frequency
Computational	Deterministic, fast, CPU-driven	`tsc`, ESLint, JSON Schema validation	Run on every change
Inferential	Non-deterministic, slower, GPU or LLM-driven	AI code review, semantic analysis	Use selectively

Core rule: if a judgment can be made computationally, do not ask the LLM to make it. Computational checks are faster, cheaper, and deterministic.

3. The Three Governance Buckets

Maintainability	Architecture Fitness	Behaviour
Duplicate-code detection	Performance baseline testing	Functional spec validation
Complexity analysis	Module boundary guards	AI-generated testing
Coverage checks	API quality checks	Approved fixtures

4. Inventory: What My Project Already Has

① Instruction rule sets (10 files)

Under .github/instructions/, I maintain ten guidance files for different situations, including token budget control, UI framework compliance, and image-reading throttling. Each one uses an applyTo path filter so it only loads when relevant files are being edited.
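To make the applyTo idea concrete, here is a minimal sketch of how a path filter decides which instruction files are relevant to an edited file. The real matching is done inside the tool (the inner harness), so this is not its actual matcher. The function names, the file names, and the glob patterns are all hypothetical, and only `**`, `*`, and `?` are handled.

```javascript
// Converts a simple glob into a RegExp. Illustrative only: supports "**", "*" and "?".
function globToRegExp(glob) {
  let pattern = "";
  for (let i = 0; i < glob.length; i++) {
    const ch = glob[i];
    if (ch === "*" && glob[i + 1] === "*") {
      if (glob[i + 2] === "/") {
        pattern += "(?:.*/)?"; // "**/" matches zero or more directories
        i += 2;
      } else {
        pattern += ".*"; // a bare "**" matches anything
        i += 1;
      }
    } else if (ch === "*") {
      pattern += "[^/]*"; // "*" stays inside one path segment
    } else if (ch === "?") {
      pattern += "[^/]";
    } else {
      pattern += ch.replace(/[.+^${}()|[\]\\]/g, "\\$&"); // escape regex metacharacters
    }
  }
  return new RegExp(`^${pattern}$`);
}

// Hypothetical instruction files; only the topics come from the article.
const instructionFiles = [
  { name: "ui-framework.instructions.md", applyTo: "assets/scripts/ui/**/*.ts" },
  { name: "token-budget.instructions.md", applyTo: "**" },
  { name: "generals-data.instructions.md", applyTo: "**/*.json" },
];

// Returns the names of the instruction files whose filter matches the edited path.
function selectInstructions(files, editedPath) {
  return files
    .filter((file) => globToRegExp(file.applyTo).test(editedPath))
    .map((file) => file.name);
}

console.log(selectInstructions(instructionFiles, "assets/scripts/ui/HeroPanel.ts"));
```

With these sample patterns, editing a UI script would load the UI and token-budget guidance but leave the data-related file out of context.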

② Skills (29 workflows)

I package repeated work into callable skills, for example:

cocos-bug-triage: a full debugging pipeline for visual symptoms plus runtime errors
context-budget-guard: automatic compression when context gets too heavy
ui-vibe-pipeline: an end-to-end pipeline from design reference to reviewable UI
encoding-touched-guard: encoding integrity checks after every file edit

The value of a skill is that it formalizes expert execution memory. The agent does not have to improvise the workflow. It reads the skill and knows the operating procedure.

③ Task-card system (142 cards)

Each task card contains:

INPUT_CONTRACT: the preconditions for the task
OUTPUT_CONTRACT: the required deliverable after completion
VALIDATION_CMD: the command that must pass afterward

This means the agent does not need to “understand the whole system” before it can move. It reads the card, follows the contract, runs validation, and finishes the local unit of work.
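As an illustration, a task card can be represented as plain data and checked before the agent starts. This is a sketch under assumptions: only the three contract field names come from the card format above, while the `id` field, the example values, and the `checkTaskCard` helper are hypothetical.

```javascript
// The three contract fields named in the article.
const REQUIRED_FIELDS = ["INPUT_CONTRACT", "OUTPUT_CONTRACT", "VALIDATION_CMD"];

// Returns a list of problems; an empty list means the card is usable.
function checkTaskCard(card) {
  const problems = [];
  for (const field of REQUIRED_FIELDS) {
    const value = card[field];
    if (typeof value !== "string" || value.trim() === "") {
      problems.push(`${card.id ?? "<no id>"}: missing or empty ${field}`);
    }
  }
  return problems;
}

// Hypothetical card: the id, the contract wording and the command path are made up.
const exampleCard = {
  id: "CARD-001",
  INPUT_CONTRACT: "generals.json exists and passes validate-generals-data.js",
  OUTPUT_CONTRACT: "Every general record has a non-empty portrait field",
  VALIDATION_CMD: "node validate-generals-data.js",
};

console.log(checkTaskCard(exampleCard)); // prints []
console.log(checkTaskCard({ id: "CARD-002", INPUT_CONTRACT: "x" }));
```

Rejecting incomplete cards up front keeps the agent from starting work that has no deterministic way to finish.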

④ Consensus docs (`keep.md` plus shard indexes)

Every design decision is documented, then split into smaller shards that are loaded on demand so I do not waste tokens pulling in the whole thing every time.

Pre-flight standard procedure

Read keep.summary.md — a compact index that captures the current shared consensus

Only read full shards if a consensus change is needed — load on demand, not all at once

New decision → write it back into keep → sync cross references so the shared knowledge stays current

⑤ Context budget control

I maintain check-context-budget.js so the workflow can actively inspect token usage mid-task and compress when it gets close to the limit. That design is crucial. LLM quality drops hard when context is overloaded, and most people do not even realize they have hit that wall.
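The internals of check-context-budget.js are not shown here, so the following is only a sketch of the idea. It assumes a rough heuristic of about four characters per token instead of a real tokenizer, and the 70% and 85% thresholds are hypothetical values chosen for illustration.

```javascript
// Rough heuristic (an assumption, not a real tokenizer): about 4 characters per token.
const CHARS_PER_TOKEN = 4;

function estimateTokens(text) {
  return Math.ceil(text.length / CHARS_PER_TOKEN);
}

// Sums the estimated tokens of all loaded chunks and picks an action.
// Hypothetical thresholds: warn at 70% of the limit, compress at 85%.
function checkContextBudget(chunks, limitTokens, warnRatio = 0.7, compressRatio = 0.85) {
  const used = chunks.reduce((sum, chunk) => sum + estimateTokens(chunk), 0);
  const ratio = used / limitTokens;
  let action = "ok";
  if (ratio >= compressRatio) action = "compress";
  else if (ratio >= warnRatio) action = "warn";
  return { used, limitTokens, ratio, action };
}

// 3000 + 600 characters is an estimated 900 tokens against a 1000-token limit.
const budgetReport = checkContextBudget(["a".repeat(3000), "b".repeat(600)], 1000);
console.log(budgetReport.action); // compress
```

The useful property is that the decision is computed rather than felt: the workflow gets an explicit signal before quality starts to degrade.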

5. Current Computational Sensor Inventory

Tool	What it protects	Status
`check-encoding-touched.js`	UTF-8 BOM and mojibake prevention	✅ Present
`validate-ui-specs.js`	UI spec structural integrity	✅ Present
`validate-skin-contracts.js`	Skin-family asset contracts	✅ Present
`validate-generals-data.js`	General-data JSON schema	✅ Present
`validate-bloodline-integrity.js`	Bloodline graph consistency	✅ Present
`ucuf-screenshot-regression.js`	Screenshot regression comparison	✅ Present
ESLint (with typescript-eslint; TSLint is deprecated)	Automatic enforcement of coding rules	❌ Missing
Automated unit-test suite	Functional correctness	❌ Missing
`check-import-boundaries.js`	Cross-module coupling protection	❌ Missing
Approved fixtures	Automatic behaviour snapshot comparison	❌ Missing
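To show what a sensor in this table can look like, here is a minimal sketch of an encoding check. It is not the actual check-encoding-touched.js: the `checkEncoding` function and its messages are hypothetical, and it assumes the file contents are already available as bytes.

```javascript
// Checks raw file bytes for three deterministic problems:
// a UTF-8 BOM, invalid UTF-8 sequences, and saved U+FFFD replacement characters.
function checkEncoding(bytes) {
  const problems = [];
  if (bytes.length >= 3 && bytes[0] === 0xef && bytes[1] === 0xbb && bytes[2] === 0xbf) {
    problems.push("starts with a UTF-8 BOM");
  }
  let text = null;
  try {
    // fatal: true makes the decoder throw instead of silently substituting U+FFFD
    text = new TextDecoder("utf-8", { fatal: true }).decode(bytes);
  } catch {
    problems.push("contains bytes that are not valid UTF-8");
  }
  if (text !== null && text.includes("\uFFFD")) {
    problems.push("contains U+FFFD replacement characters (likely mojibake)");
  }
  return problems;
}

const encoder = new TextEncoder();
console.log(checkEncoding(encoder.encode("趙雲 Zhao Yun"))); // prints []
console.log(checkEncoding(new Uint8Array([0xef, 0xbb, 0xbf, 0x41]))); // BOM reported
console.log(checkEncoding(new Uint8Array([0x41, 0xff, 0x42]))); // invalid byte reported
```

Each result is a plain list of messages, so the same output can fail a gate and be handed back to the agent unchanged.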

6. The Core Problem: I Let the LLM Make Too Many Judgments It Should Not Make

Right now, too much of my quality control still depends on the agent’s inferential ability — asking it to “decide” whether something is correct. That is exactly the anti-pattern Harness Engineering warns about.

Current state (LLM judgment)	Target state (computational judgment)
LLM decides whether an import violates architecture	`check-import-boundaries.js` decides through AST analysis
LLM decides whether JSON is structurally valid	JSON Schema validation scripts
LLM decides whether UI component sizes match the spec	`validate-ui-specs.js` performs numeric comparison
LLM decides whether types are correct	`tsc --noEmit` points to the exact line that fails
LLM decides whether output is correct	Approved fixtures compare against baseline outputs

One-sentence rule: if you are currently asking the LLM to make a judgment, and a script can make that judgment instead, write the script. Let the LLM handle new output and human-language understanding.
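The last row of the table, approved fixtures, is the easiest to picture in code. The following is a minimal sketch, assuming outputs are JSON-like data and the approved baseline is already loaded. The `diffAgainstFixture` helper and the sample record are hypothetical.

```javascript
// Recursively compares an actual value with an approved baseline and returns
// the paths where they differ. Works on JSON-like data (objects, arrays, primitives).
function diffAgainstFixture(actual, approved, path = "$") {
  if (Object.is(actual, approved)) return [];
  const bothContainers =
    typeof actual === "object" && actual !== null &&
    typeof approved === "object" && approved !== null &&
    Array.isArray(actual) === Array.isArray(approved);
  if (!bothContainers) {
    return [`${path}: expected ${JSON.stringify(approved)}, got ${JSON.stringify(actual)}`];
  }
  // Visit keys from both sides so missing and extra fields are both reported.
  const keys = new Set([...Object.keys(approved), ...Object.keys(actual)]);
  const diffs = [];
  for (const key of keys) {
    diffs.push(...diffAgainstFixture(actual[key], approved[key], `${path}.${key}`));
  }
  return diffs;
}

// Made-up sample data: the agent's output turned a number into a string.
const approvedFixture = { name: "Guan Yu", stats: { str: 97, int: 75 } };
const actualOutput = { name: "Guan Yu", stats: { str: 97, int: "75" } };
console.log(diffAgainstFixture(actualOutput, approvedFixture)); // one difference, at $.stats.int
```

An empty result means the behaviour still matches the approved baseline; anything else is a concrete path the agent can repair against.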

7. Improvement Plan: A Compute Gate Architecture

7.1 The full execution loop

Agent execution loop after improvement

Human: complex request → milestone breakdown → atomic task cards. Each card contains INPUT_CONTRACT, OUTPUT_CONTRACT, and VALIDATION_CMD.

Agent reads the task card. It verifies preconditions (INPUT_CONTRACT) and the completion target (OUTPUT_CONTRACT).

Agent edits code. It only does the “write new things” part, not the “decide whether this is correct” part.

Compute Gate runs automatically

tsc --noEmit: syntax and type validation
check-encoding-touched.js: UTF-8 integrity
validate-*.js: data and UI structural validation
check-import-boundaries.js: module-boundary enforcement

✗ If the gate fails → the failure message becomes the next prompt. The agent repairs against a deterministic error signal, with at most three loops and no blind guessing.

✓ If the gate passes → output summary → next task card. Humans only need selective semantic review, not repetitive routine checking.
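The loop above can be sketched in a few lines. compute-gate.js is still on the roadmap, so this is an assumption about its shape rather than existing code: `runGates`, `runTaskWithGate`, and the `agentEdit` callback are hypothetical names, and the gates are injected as plain functions so the control flow can be shown without a real compiler or agent.

```javascript
// Runs every gate in order and returns the first failure message, or null if all pass.
function runGates(gates, workspace) {
  for (const gate of gates) {
    const result = gate.check(workspace);
    if (!result.ok) return `[${gate.name}] ${result.message}`;
  }
  return null;
}

// agentEdit(workspace, prompt) stands in for the agent call (hypothetical).
// The agent gets one initial attempt plus at most maxRepairLoops repair attempts.
function runTaskWithGate(taskPrompt, agentEdit, gates, maxRepairLoops = 3) {
  let workspace = agentEdit({}, taskPrompt);
  for (let attempt = 0; attempt <= maxRepairLoops; attempt++) {
    const failure = runGates(gates, workspace);
    if (failure === null) return { status: "passed", repairs: attempt, workspace };
    if (attempt === maxRepairLoops) return { status: "escalate-to-human", lastFailure: failure };
    workspace = agentEdit(workspace, failure); // the failure message becomes the next prompt
  }
}

// Demo with a fake gate and a fake agent that gets it right on its second call.
const demoGates = [
  {
    name: "types",
    check: (ws) => (ws.typed ? { ok: true } : { ok: false, message: "hp is string, expected number" }),
  },
];
let agentCalls = 0;
const fakeAgent = (ws, prompt) => {
  agentCalls += 1;
  return { ...ws, typed: agentCalls > 1 };
};
const demoRun = runTaskWithGate("implement card", fakeAgent, demoGates);
console.log(demoRun.status, demoRun.repairs); // passed 1
```

When the repair budget runs out, the sketch returns the last failure instead of looping forever, which is the point where a human should step in.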

7.2 Module-boundary rules

Module	Allowed imports	Forbidden imports
`shared/`	None (it should not depend on any business module)	core/, ui/, battle/
`core/`	shared/	ui/, battle/
`ui/`	shared/, core/	battle/
`battle/`	shared/, core/	ui/ (one-way forbidden)

AST-based static analysis can inspect each .ts import path and fail on violations directly. It is fully computational. No LLM is needed.
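The rule table translates almost directly into data. The sketch below is a simplified stand-in for the planned check-import-boundaries.js: it uses a regular expression instead of a real AST, so it can be fooled by import-like text inside comments or strings. It also assumes file paths are given relative to a source root that directly contains shared/, core/, ui/, and battle/; the function names are hypothetical.

```javascript
// Forbidden imports per module, copied from the rule table.
const FORBIDDEN = {
  shared: ["core", "ui", "battle"],
  core: ["ui", "battle"],
  ui: ["battle"],
  battle: ["ui"],
};

// Resolves a relative import specifier against the importing file's directory.
// Returns null for package imports, which are outside the boundary rules.
function resolveImport(fromFile, specifier) {
  if (!specifier.startsWith(".")) return null;
  const parts = fromFile.split("/").slice(0, -1);
  for (const segment of specifier.split("/")) {
    if (segment === "." || segment === "") continue;
    if (segment === "..") parts.pop();
    else parts.push(segment);
  }
  return parts.join("/");
}

// Returns one message per import that crosses a forbidden boundary.
function findBoundaryViolations(filePath, source) {
  const ownModule = filePath.split("/")[0];
  const forbidden = FORBIDDEN[ownModule] ?? [];
  const violations = [];
  // Regex approximation of import detection (a real implementation would parse the AST).
  const importPattern = /\bfrom\s*['"]([^'"]+)['"]|\bimport\s*['"]([^'"]+)['"]/g;
  for (const match of source.matchAll(importPattern)) {
    const specifier = match[1] ?? match[2];
    const resolved = resolveImport(filePath, specifier);
    if (resolved === null) continue;
    const targetModule = resolved.split("/")[0];
    if (forbidden.includes(targetModule)) {
      violations.push(`${filePath}: ${ownModule}/ must not import ${targetModule}/ (${specifier})`);
    }
  }
  return violations;
}

// Hypothetical file and imports.
const sampleSource = [
  'import { Dice } from "../shared/Dice";',
  'import { BattleScene } from "../battle/BattleScene";',
].join("\n");
console.log(findBoundaryViolations("ui/HeroPanel.ts", sampleSource));
```

In this sample, the import from shared/ passes and the import from battle/ is reported, with the file and specifier included so the agent knows exactly what to change.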

8. Why Is This Especially Friendly to Small Models?

Once computational tools take over all the judgment work, the LLM only needs to run a very simple loop:

A loop that even a small model can run stably

Step 1. Read the task card: clear INPUT and OUTPUT contracts, no need to understand the whole system

Step 2. Write code: only do the “create new output” step, not the “judge correctness” step

Step 3. Read the error message: computational tools give deterministic, concrete failures instead of vague guesses

Step 4. Repair: apply a local fix against a concrete error, then go back to step 2

Even a small 1.5B-parameter model can finish complex work reliably if I give it a clear task card and deterministic error messages. The point is not that the model became smarter. The point is that I moved the judgment burden away from the model.

9. Harness Health Score and Reinforcement Roadmap

Current state

Category	Score	Notes
Feedforward Guides	80%	Instructions, skills, task cards, and consensus docs are already solid
Computational Sensors	30%	Data validation is good, but linting, unit tests, and boundary guards are still missing
Behaviour Harness	40%	Screenshot regression is in place, but approved fixtures are not yet built
Overall	~55 / 100	Strong feedforward, but a serious shortage of computational feedback

Reinforcement priority

P0: Integrate tsc --noEmit into compute-gate (0.5 day) to eliminate type-error backtracking

P0: Basic ESLint setup (0.5 day) to turn coding rules from memory into enforced tooling

P1: check-import-boundaries.js (1 day) for automatic detection of cross-module coupling

P1: compute-gate.js unified gate (1 day) to integrate the auto-repair loop

P2: approved-fixture-check.js (1.5 days) to automate behaviour validation

P3: harness-health-report.js (1 day) to visualize harness coverage as a dashboard

10. Five Suggestions for Other Developers

Start with feedforward, because it is the cheapest.
Instructions and skills do not need complicated tooling. They just need you to write down your experience.
Computational sensors beat inferential sensors.
Every time you ask the LLM to judge whether something is correct, ask yourself instead: can I write a script for this? If the answer is yes, write the script.
Error messages are the best prompts.
A computational tool’s failure output can be used directly as the agent’s next prompt. The location and nature of the failure are already explicit, so no human translation is needed.
Consensus docs are the most underrated feedforward tool.
Every design decision you do not want the agent to rediscover on every turn should be written into your consensus docs.
Greenfield projects have a natural advantage, so start now.
Legacy code needs harness the most but is the hardest place to build it. If your project is new, make validation scripts part of every module from day one.
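The third suggestion, using error messages as prompts, can be sketched as follows. It assumes the compiler is run without pretty-printing, so each error is a single line of the form `file(line,col): error TSxxxx: message`. The function names, the prompt wording, and the sample file path are hypothetical.

```javascript
// Matches one line of non-pretty tsc output, e.g.
// src/a.ts(12,5): error TS2322: Type 'string' is not assignable to type 'number'.
const TSC_LINE = /^(.+)\((\d+),(\d+)\): error (TS\d+): (.+)$/;

// Extracts structured errors from raw compiler output; unrelated lines are skipped.
function parseTscErrors(output) {
  const errors = [];
  for (const line of output.split(/\r?\n/)) {
    const match = TSC_LINE.exec(line.trim());
    if (match) {
      errors.push({
        file: match[1],
        line: Number(match[2]),
        column: Number(match[3]),
        code: match[4],
        message: match[5],
      });
    }
  }
  return errors;
}

// Turns the parsed errors into the agent's next prompt, capped to keep it short.
function buildRepairPrompt(errors, maxErrors = 5) {
  const shown = errors.slice(0, maxErrors);
  const lines = shown.map((e) => `- ${e.file}:${e.line}:${e.column} ${e.code} ${e.message}`);
  return [
    `The compute gate failed with ${errors.length} type error(s). Fix only these locations:`,
    ...lines,
    "Do not change unrelated files.",
  ].join("\n");
}

// Hypothetical compiler output for one file.
const sampleTscOutput =
  "assets/scripts/ui/HeroPanel.ts(42,7): error TS2322: Type 'string' is not assignable to type 'number'.";
console.log(buildRepairPrompt(parseTscErrors(sampleTscOutput)));
```

Capping the number of listed errors is a design choice: a short, exact prompt tends to be more useful to a small model than a full dump of cascading failures.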

Conclusion

The essence of Harness Engineering is turning a developer’s tacit knowledge into explicit, formalized structure, so an AI Agent can keep acting in line with your intent even when you are not directly supervising it.

My project is already strong on the feedforward side — 29 skills, 10 instruction files, and 142 task cards. But computational feedback sensors are still far too weak. That is the main direction I need to strengthen next.

One-sentence action guide: if the LLM is currently making a judgment and a script could make that judgment instead, write the script. Let the LLM handle new output and human-language understanding.

This article is based on the core framework from Martin Fowler’s Harness Engineering for Coding Agent Users (2026-04-02), combined with practical engineering lessons from my own Cocos Creator game project, 3KLife.