Author: Solo indie developer | Project: 3KLife (Three Kingdoms strategy game, Cocos Creator 3.8)
Keywords: Harness Engineering, AI Agent, Coding Agent, automated validation, small-model-friendly design
Over the past year, I have moved almost my entire game development workflow toward AI Agent-driven execution. GitHub Copilot, Claude, and all kinds of automation scripts now sit directly inside the loop. Efficiency clearly went up, but so did a different problem: agents make mistakes, and the worst mistakes are silent ones.
They do not always crash. They do not always warn you. Sometimes they quietly insert a malformed field into JSON, or point an import toward a module it should never touch. By the time you notice, the mistake may already be buried three days deep in code review.
It was only after reading Martin Fowler’s Harness Engineering for Coding Agent Users that I realized this problem had a proper name and, more importantly, a systematic solution.
This article is my own audit and reflection: what harness I have already built, what is still missing, and how I plan to close the gap.
Harness = everything in an AI Agent workflow except the model itself.
There are two parts:
A good outer harness does two things:
| Mode | Properties | Typical examples | Suggested frequency |
|---|---|---|---|
| Computational | Deterministic, fast, CPU-driven | tsc, ESLint, JSON Schema validation | Run on every change |
| Inferential | Non-deterministic, slower, GPU or LLM-driven | AI code review, semantic analysis | Use selectively |
| Maintainability | Architecture Fitness | Behaviour |
|---|---|---|
| Duplicate-code detection | Performance baseline testing | Functional spec validation |
| Complexity analysis | Module boundary guards | AI-generated testing |
| Coverage checks | API quality checks | Approved fixtures |
Under .github/instructions/, I maintain ten guidance files for different situations, including token budget control, UI framework compliance, and image-reading throttling. Each one uses an applyTo path filter so it only loads when relevant files are being edited.
I package repeated work into callable skills, for example:
The value of a skill is that it formalizes expert execution memory. The agent does not have to improvise the workflow. It reads the skill and knows the operating procedure.
Each task card contains:
- INPUT_CONTRACT: the preconditions for the task
- OUTPUT_CONTRACT: the required deliverable after completion
- VALIDATION_CMD: the command that must pass afterward

This means the agent does not need to “understand the whole system” before it can move. It reads the card, follows the contract, runs validation, and finishes the local unit of work.
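A task card with these three fields is easy to check mechanically before an agent ever picks it up. The following is a minimal sketch, not the actual 3KLife tooling: only the three field names come from the card format above, while the object shape, the `validateTaskCard` helper, and the example contents are assumptions made for illustration.

```javascript
// Field names come from the task-card format; everything else here is hypothetical.
const REQUIRED_FIELDS = ["INPUT_CONTRACT", "OUTPUT_CONTRACT", "VALIDATION_CMD"];

/**
 * Returns a list of problems found in a task card.
 * An empty list means every required field is a non-empty string.
 */
function validateTaskCard(card) {
  const problems = [];
  for (const field of REQUIRED_FIELDS) {
    const value = card[field];
    if (typeof value !== "string" || value.trim() === "") {
      problems.push(`${field} is missing or empty`);
    }
  }
  return problems;
}

// Example card; the contents are invented for illustration.
const exampleCard = {
  INPUT_CONTRACT: "The generals data file exists and passes schema validation",
  OUTPUT_CONTRACT: "One new general entry with all required fields",
  VALIDATION_CMD: "node validate-generals-data.js",
};

console.log(validateTaskCard(exampleCard)); // prints []
console.log(validateTaskCard({ INPUT_CONTRACT: "only one field" }));
```

Rejecting an incomplete card up front is itself a computational check: the agent never starts work on a task that has no validation command to finish against.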
keep.md plus shard indexes)

Every design decision is documented, then split into smaller shards that are loaded on demand so I do not waste tokens pulling in the whole thing every time.
- keep.summary.md: a compact index that captures the current shared consensus

I maintain check-context-budget.js so the workflow can actively inspect token usage mid-task and compress when it gets close to the limit. That design is crucial. LLM quality drops hard when context is overloaded, and most people do not even realize they have hit that wall.
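The core of a budget check like this can be very small. The sketch below is not the real check-context-budget.js; it assumes a rough heuristic of about four characters per token (a real tokenizer would be more accurate), and the `limit` and `warnRatio` values are illustrative only.

```javascript
// Rough heuristic, not a real tokenizer: assume ~4 characters per token.
const CHARS_PER_TOKEN = 4;

function estimateTokens(text) {
  return Math.ceil(text.length / CHARS_PER_TOKEN);
}

/**
 * Sums estimated tokens over the loaded documents and reports a status:
 *   "ok"       - comfortably under the limit
 *   "compress" - at or above warnRatio of the limit
 *   "over"     - at or above the limit
 * `limit` and `warnRatio` defaults are illustrative, not recommendations.
 */
function checkContextBudget(documents, limit = 100000, warnRatio = 0.8) {
  const used = documents.reduce((sum, doc) => sum + estimateTokens(doc.text), 0);
  const ratio = used / limit;
  let status = "ok";
  if (ratio >= 1) status = "over";
  else if (ratio >= warnRatio) status = "compress";
  return { used, limit, ratio, status };
}

// Placeholder documents; the second file name is hypothetical.
const budgetReport = checkContextBudget(
  [
    { name: "keep.summary.md", text: "x".repeat(2000) },
    { name: "task-card.md", text: "y".repeat(1600) },
  ],
  1000
);
console.log(budgetReport.status, budgetReport.used);
```

Returning a status instead of throwing lets the workflow decide what "compress" means, for example dropping shards that are no longer relevant to the current task card.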
| Tool | What it protects | Status |
|---|---|---|
| check-encoding-touched.js | UTF-8 BOM and mojibake prevention | ✅ Present |
| validate-ui-specs.js | UI spec structural integrity | ✅ Present |
| validate-skin-contracts.js | Skin-family asset contracts | ✅ Present |
| validate-generals-data.js | General-data JSON schema | ✅ Present |
| validate-bloodline-integrity.js | Bloodline graph consistency | ✅ Present |
| ucuf-screenshot-regression.js | Screenshot regression comparison | ✅ Present |
| ESLint / TSLint | Automatic enforcement of coding rules | ❌ Missing |
| Automated unit-test suite | Functional correctness | ❌ Missing |
| check-import-boundaries.js | Cross-module coupling protection | ❌ Missing |
| Approved fixtures | Automatic behaviour snapshot comparison | ❌ Missing |
Right now, too much of my quality control still depends on the agent’s inferential ability — asking it to “decide” whether something is correct. That is exactly the anti-pattern Harness Engineering warns about.
| Current state (LLM judgment) | Target state (computational judgment) |
|---|---|
| LLM decides whether an import violates architecture | check-import-boundaries.js decides through AST analysis |
| LLM decides whether JSON is structurally valid | JSON Schema validation scripts |
| LLM decides whether UI component sizes match the spec | validate-ui-specs.js performs numeric comparison |
| LLM decides whether types are correct | tsc --noEmit points to the exact line that fails |
| LLM decides whether output is correct | Approved fixtures compare against baseline outputs |
One-sentence rule: if you are currently asking the LLM to make a judgment, and a script can make that judgment instead, write the script. Let the LLM handle generating new output and understanding natural language.
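The approved-fixtures row in the table above is a good example of a judgment that needs no model at all. The following sketch assumes fixtures are plain JSON values and compares an agent's output against an approved baseline; the `diffAgainstBaseline` helper and the sample data are hypothetical, not taken from the project.

```javascript
/**
 * Recursively compares `actual` against an approved `baseline` and returns
 * the paths where they differ. Intended for plain JSON values only.
 */
function diffAgainstBaseline(baseline, actual, path = "$") {
  const isObject = (value) => value !== null && typeof value === "object";

  // Primitives (and a missing value on either side) are compared directly.
  if (!isObject(baseline) || !isObject(actual)) {
    return baseline === actual ? [] : [path];
  }
  // An array versus a plain object is a structural mismatch.
  if (Array.isArray(baseline) !== Array.isArray(actual)) {
    return [path];
  }
  // Walk the union of keys so that added and removed fields both show up.
  const keys = new Set([...Object.keys(baseline), ...Object.keys(actual)]);
  const diffs = [];
  for (const key of keys) {
    diffs.push(...diffAgainstBaseline(baseline[key], actual[key], `${path}.${key}`));
  }
  return diffs;
}

// Invented sample data for illustration.
const approvedFixture = { general: "Zhao Yun", stats: { attack: 96, defense: 90 } };
const producedOutput = { general: "Zhao Yun", stats: { attack: 96, defense: 88 } };

console.log(diffAgainstBaseline(approvedFixture, producedOutput)); // one differing path: $.stats.defense
```

The output is a list of exact paths, which is the kind of deterministic error message an agent can act on without interpreting anything.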
- tsc --noEmit: syntax and type validation
- check-encoding-touched.js: UTF-8 integrity
- validate-*.js: data and UI structural validation
- check-import-boundaries.js: module-boundary enforcement

| Module | Allowed imports | Forbidden imports |
|---|---|---|
| shared/ | None (it should not depend on any business module) | core/, ui/, battle/ |
| core/ | shared/ | ui/, battle/ |
| ui/ | shared/, core/ | battle/ |
| battle/ | shared/, core/ | ui/ (one-way forbidden) |
AST-based static analysis can inspect each .ts import path and fail on violations directly. It is fully computational. No LLM is needed.
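To make the boundary rules concrete, here is a simplified sketch. It is not the planned check-import-boundaries.js: it uses a regular expression instead of a real AST, so it can be fooled by import-like text inside strings or comments, and it assumes each module name appears as a directory segment in the file path. The forbidden-import map mirrors the table above; the helper names and sample paths are hypothetical.

```javascript
// Forbidden targets per module, mirroring the boundary table.
const FORBIDDEN_IMPORTS = {
  shared: ["core", "ui", "battle"],
  core: ["ui", "battle"],
  ui: ["battle"],
  battle: ["ui"],
};
const MODULE_NAMES = Object.keys(FORBIDDEN_IMPORTS);

// Resolves a relative specifier against the importing file's directory.
function resolveImport(filePath, specifier) {
  if (!specifier.startsWith(".")) return specifier; // package or alias import, left as-is
  const parts = filePath.split("/").slice(0, -1);
  for (const segment of specifier.split("/")) {
    if (segment === "..") parts.pop();
    else if (segment !== "." && segment !== "") parts.push(segment);
  }
  return parts.join("/");
}

// First path segment that names a known module, or null if there is none.
function moduleOf(path) {
  return path.split("/").find((segment) => MODULE_NAMES.includes(segment)) ?? null;
}

// Returns one record per import that crosses a forbidden boundary.
function findBoundaryViolations(filePath, source) {
  const owner = moduleOf(filePath);
  if (owner === null) return [];
  // Matches `from "x"`, `import "x"`, and `import("x")`. Regex, not AST.
  const importPattern = /\b(?:from|import)\s*\(?\s*["']([^"']+)["']/g;
  const violations = [];
  for (const match of source.matchAll(importPattern)) {
    const target = moduleOf(resolveImport(filePath, match[1]));
    if (target !== null && FORBIDDEN_IMPORTS[owner].includes(target)) {
      violations.push({ file: filePath, specifier: match[1], from: owner, to: target });
    }
  }
  return violations;
}

// Hypothetical file contents and path.
const sampleSource = `
import { Vec3 } from "cc";
import { clamp } from "../shared/math";
import { BattlePanel } from "../ui/BattlePanel";
`;
console.log(findBoundaryViolations("assets/scripts/battle/Combat.ts", sampleSource));
```

A production version would swap the regex for the TypeScript compiler's parser and exit non-zero on any violation, so the result can gate a commit.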
Once computational tools take over all the judgment work, the LLM only needs to run a very simple loop:
Even a small 1.5B-parameter model can finish complex work reliably if I give it a clear task card and deterministic error messages. The point is not that the model became smarter. The point is that I moved the judgment burden away from the model.
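A minimal sketch of such a loop follows, assuming it amounts to: generate, run the deterministic gate, feed the errors back, retry. The `generate` and `runGate` callbacks are hypothetical stand-ins for the model call and the validation command, and the stubs below exist only to show the control flow.

```javascript
/**
 * Runs generate -> validate -> retry until the gate passes or attempts run out.
 * Both callbacks are injected and hypothetical:
 *   generate(taskCard, errors) -> candidate output
 *   runGate(candidate)         -> array of error strings (empty means pass)
 */
function runRepairLoop(taskCard, generate, runGate, maxAttempts = 3) {
  let errors = [];
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const candidate = generate(taskCard, errors);
    errors = runGate(candidate);
    if (errors.length === 0) {
      return { ok: true, attempts: attempt, output: candidate };
    }
  }
  // Give up and surface the last deterministic errors for a human to inspect.
  return { ok: false, attempts: maxAttempts, errors };
}

// Stub "model": produces broken JSON first, valid JSON once it has seen errors.
const stubGenerate = (card, errors) =>
  errors.length === 0 ? "{ bad json" : '{"name":"ok"}';

// Stub gate: a purely computational check that the output parses as JSON.
const stubGate = (candidate) => {
  try {
    JSON.parse(candidate);
    return [];
  } catch (error) {
    return [`invalid JSON: ${error.message}`];
  }
};

const loopOutcome = runRepairLoop({ VALIDATION_CMD: "stub" }, stubGenerate, stubGate);
console.log(loopOutcome.ok, loopOutcome.attempts); // prints: true 2
```

The model never decides whether its own output is acceptable. It only reacts to error strings produced by the gate, which is why a small model can drive the loop.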
| Category | Score | Notes |
|---|---|---|
| Feedforward Guides | 80% | Instructions, skills, task cards, and consensus docs are already solid |
| Computational Sensors | 30% | Data validation is good, but linting, unit tests, and boundary guards are still missing |
| Behaviour Harness | 40% | Screenshot regression is in place, but approved fixtures are not yet built |
| Overall | ~55 / 100 | Strong feedforward, but a serious shortage of computational feedback |
- tsc --noEmit into compute-gate (0.5 day): eliminate type-error backtracking
- check-import-boundaries.js (1 day): automatic detection of cross-module coupling
- compute-gate.js unified gate (1 day): integrate the auto-repair loop
- approved-fixture-check.js (1.5 days): automate behaviour validation
- harness-health-report.js (1 day): visualize harness coverage as a dashboard

The essence of Harness Engineering is turning a developer’s tacit knowledge into explicit, formalized structure, so an AI Agent can keep acting in line with your intent even when you are not directly supervising it.
My project is already strong on the feedforward side — 29 skills, 10 instruction files, and 142 task cards. But computational feedback sensors are still far too weak. That is the main direction I need to strengthen next.
One-sentence action guide: if the LLM is currently making a judgment and a script could make that judgment instead, write the script. Let the LLM handle generating new output and understanding natural language.
This article is based on the core framework from Martin Fowler’s Harness Engineering for Coding Agent Users (2026-04-02), combined with practical engineering lessons from my own Cocos Creator game project, 3KLife.