Building Games with AI Agents: How I Built a Harness System for My Own Project

Author: Solo indie developer | Project: 3KLife (Three Kingdoms strategy game, Cocos Creator 3.8)
Keywords: Harness Engineering, AI Agent, Coding Agent, automated validation, small-model-friendly design

Introduction: If AI Agents Write the Code, Who Owns Quality?

Over the past year, I have moved almost my entire game development workflow toward AI Agent-driven execution. GitHub Copilot, Claude, and all kinds of automation scripts now sit directly inside the loop. Efficiency clearly went up, but a different problem came with it: agents make mistakes, and the worst mistakes are silent ones.

They do not always crash. They do not always warn you. Sometimes an agent quietly inserts a malformed field into JSON, or points an import at a module that file should never depend on. By the time you notice, the mistake may already be buried three days deep in code review.

It was only after reading Martin Fowler’s Harness Engineering for Coding Agent Users that I realized this problem had a proper name and, more importantly, a systematic solution.

This article is my own audit and reflection: what harness I have already built, what is still missing, and how I plan to close the gap.


1. What Is a Harness? The One-Sentence Version

Harness = everything in an AI Agent workflow except the model itself.

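There are two parts: an inner harness that ships with the agent tool itself, and an outer harness that you build around it for your own project.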

A good outer harness does two things:

  1. Increase the chance that the agent gets it right on the first pass (feedforward guidance)
  2. Let the agent detect and repair issues before a human sees them (feedback sensing)

2. Two Dimensions of a Harness: Timing × Execution Mode

Dimension One: Timing

⬆️ Feedforward — before action
  • System prompts
  • Instruction rule sets
  • Task-card contracts
  • Consensus docs
  • Skill routing
⬇️ Feedback — after action
  • Linter failures
  • Type-check failures
  • Data validation scripts
  • Screenshot regression comparison
  • Module boundary guards
⚠️ Key trap: feedforward without feedback means the agent never learns whether the rule actually worked. Feedback without feedforward means the agent keeps repeating the same mistakes and wasting tokens. You need both.

Dimension Two: Execution Mode

| Mode | Properties | Typical examples | Suggested frequency |
| --- | --- | --- | --- |
| Computational | Deterministic, fast, CPU-driven | tsc, ESLint, JSON Schema validation | Run on every change |
| Inferential | Non-deterministic, slower, GPU or LLM-driven | AI code review, semantic analysis | Use selectively |

Core rule: if a judgment can be solved computationally, do not ask the LLM to do it. Computational checks are faster, cheaper, and deterministic.

3. The Three Governance Buckets

| Maintainability | Architecture Fitness | Behaviour |
| --- | --- | --- |
| Duplicate-code detection | Performance baseline testing | Functional spec validation |
| Complexity analysis | Module boundary guards | AI-generated testing |
| Coverage checks | API quality checks | Approved fixtures |

4. Inventory: What My Project Already Has

① Instruction rule sets (10 files)

Under .github/instructions/, I maintain ten guidance files for different situations, including token budget control, UI framework compliance, and image-reading throttling. Each one uses an applyTo path filter so it only loads when relevant files are being edited.
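
To make the applyTo idea concrete, here is a minimal sketch of path-filtered loading. The file names and glob patterns are hypothetical, and the matcher is my own simplified stand-in (it only understands `*` and `**`), not how Copilot evaluates applyTo internally.

```javascript
// Hypothetical instruction files; only the applyTo concept comes from the project.
const INSTRUCTION_FILES = [
  { file: "ui-framework.instructions.md", applyTo: "assets/scripts/ui/**/*.ts" },
  { file: "generals-data.instructions.md", applyTo: "**/*.json" },
  { file: "token-budget.instructions.md", applyTo: "**" },
];

// Converts a simple glob into a RegExp. "**" crosses directories, "*" does not.
function globToRegExp(glob) {
  let re = "";
  for (let i = 0; i < glob.length; i++) {
    const ch = glob[i];
    if (ch === "*") {
      if (glob[i + 1] === "*") {
        re += ".*";
        i++; // consume the second "*"
        if (glob[i + 1] === "/") i++; // let "**/" also match zero directories
      } else {
        re += "[^/]*";
      }
    } else {
      re += ch.replace(/[.+?^${}()|[\]\\]/g, "\\$&"); // escape regex metacharacters
    }
  }
  return new RegExp(`^${re}$`);
}

// Returns the instruction files whose applyTo pattern matches the edited path.
function selectInstructions(instructionFiles, editedPath) {
  return instructionFiles
    .filter((entry) => globToRegExp(entry.applyTo).test(editedPath))
    .map((entry) => entry.file);
}
```

Editing a UI script would pull in the UI rules and the always-on budget rules, while the JSON rules stay unloaded and cost no tokens.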

② Skills (29 workflows)

I package repeated work into callable skills.

The value of a skill is that it formalizes expert execution memory. The agent does not have to improvise the workflow. It reads the skill and knows the operating procedure.

③ Task-card system (142 cards)

Each task card contains an INPUT_CONTRACT, an OUTPUT_CONTRACT, and a VALIDATION_CMD.

This means the agent does not need to “understand the whole system” before it can move. It reads the card, follows the contract, runs validation, and finishes the local unit of work.
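
As a rough illustration, a card can be checked mechanically before the agent ever starts. The three field names are the ones used in the execution loop later in this article; the card contents, the id, and the script path are made up for the example.

```javascript
// Field names from the article; everything else in this sketch is hypothetical.
const REQUIRED_FIELDS = ["INPUT_CONTRACT", "OUTPUT_CONTRACT", "VALIDATION_CMD"];

// Returns a list of problems; an empty list means the card is usable.
function validateTaskCard(card) {
  const problems = [];
  for (const field of REQUIRED_FIELDS) {
    const value = card[field];
    if (typeof value !== "string" || value.trim() === "") {
      problems.push(`${field} is missing or empty`);
    }
  }
  return problems;
}

// Illustrative card, not one of the real 142.
const exampleCard = {
  id: "TASK-0001",
  INPUT_CONTRACT: "generals data passes validate-generals-data.js",
  OUTPUT_CONTRACT: "every general references an existing bloodline node",
  VALIDATION_CMD: "node validate-bloodline-integrity.js",
};
```

A card that fails this check is rejected before any tokens are spent on it, which keeps the contract itself under computational control.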

④ Consensus docs (keep.md plus shard indexes)

Every design decision is documented, then split into smaller shards that are loaded on demand so I do not waste tokens pulling in the whole thing every time.

Pre-flight standard procedure

  1. Read keep.summary.md, a compact index that captures the current shared consensus.
  2. Read full shards only if a consensus change is needed: load on demand, not all at once.
  3. New decision → write it back into keep → sync cross references so the shared knowledge stays current.

⑤ Context budget control

I maintain check-context-budget.js so the workflow can actively inspect token usage mid-task and compress when it gets close to the limit. That design is crucial. LLM quality drops hard when context is overloaded, and most people do not even realize they have hit that wall.
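
The core of such a check can be very small. The sketch below is not the real check-context-budget.js; it assumes a crude four-characters-per-token estimate and illustrative limits, both of which would need tuning for a real tokenizer and model.

```javascript
// Assumption: roughly 4 characters per token for English text.
// Real tokenizers differ, and CJK text usually costs more tokens per character.
const CHARS_PER_TOKEN = 4;

function estimateTokens(text) {
  return Math.ceil(text.length / CHARS_PER_TOKEN);
}

// limit and compressAt are illustrative defaults, not values from the project.
function checkContextBudget(chunks, { limit = 128000, compressAt = 0.8 } = {}) {
  const used = chunks.reduce((sum, chunk) => sum + estimateTokens(chunk), 0);
  const ratio = used / limit;
  return { used, limit, ratio, shouldCompress: ratio >= compressAt };
}
```

The useful part is the `shouldCompress` flag: the workflow can react to it before quality degrades, instead of discovering the problem from bad output.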


5. Current Computational Sensor Inventory

| Tool | What it protects | Status |
| --- | --- | --- |
| check-encoding-touched.js | UTF-8 BOM and mojibake prevention | ✅ Present |
| validate-ui-specs.js | UI spec structural integrity | ✅ Present |
| validate-skin-contracts.js | Skin-family asset contracts | ✅ Present |
| validate-generals-data.js | General-data JSON schema | ✅ Present |
| validate-bloodline-integrity.js | Bloodline graph consistency | ✅ Present |
| ucuf-screenshot-regression.js | Screenshot regression comparison | ✅ Present |
| ESLint (with typescript-eslint; TSLint is deprecated) | Automatic enforcement of coding rules | ❌ Missing |
| Automated unit-test suite | Functional correctness | ❌ Missing |
| check-import-boundaries.js | Cross-module coupling protection | ❌ Missing |
| Approved fixtures | Automatic behaviour snapshot comparison | ❌ Missing |
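
To show how small a computational sensor can be, here is a sketch of the kind of check an encoding guard performs. It is an assumption about the approach, not the actual contents of check-encoding-touched.js: it flags a leading UTF-8 BOM and any U+FFFD replacement character, which Node produces when bytes are not valid UTF-8.

```javascript
// Takes the raw bytes of a file and returns a list of encoding problems.
function checkEncoding(buffer) {
  const issues = [];
  // A UTF-8 byte-order mark is the byte sequence EF BB BF at the start of the file.
  if (buffer.length >= 3 && buffer[0] === 0xef && buffer[1] === 0xbb && buffer[2] === 0xbf) {
    issues.push("file starts with a UTF-8 BOM");
  }
  // Decoding invalid UTF-8 yields U+FFFD, a common sign of mojibake.
  if (buffer.toString("utf8").includes("\uFFFD")) {
    issues.push("file contains U+FFFD (invalid or mis-decoded bytes)");
  }
  return issues;
}
```

In practice the buffer would come from reading each touched file from disk, and a non-empty result would fail the gate with the file name attached.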

6. The Core Problem: I Let the LLM Make Too Many Judgments It Should Not Make

Right now, too much of my quality control still depends on the agent’s inferential ability — asking it to “decide” whether something is correct. That is exactly the anti-pattern Harness Engineering warns about.

| Current state (LLM judgment) | Target state (computational judgment) |
| --- | --- |
| LLM decides whether an import violates architecture | check-import-boundaries.js decides through AST analysis |
| LLM decides whether JSON is structurally valid | JSON Schema validation scripts |
| LLM decides whether UI component sizes match the spec | validate-ui-specs.js performs numeric comparison |
| LLM decides whether types are correct | tsc --noEmit points to the exact line that fails |
| LLM decides whether output is correct | Approved fixtures compare against baseline outputs |

One-sentence rule: if you are currently asking the LLM to make a judgment, and a script can make that judgment instead, write the script. Let the LLM handle new output and human-language understanding.
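
Taking the JSON row as an example, the judgment "is this record structurally valid?" reduces to a few deterministic predicates. The field names and ranges below are hypothetical, since the real general-data schema is not shown here, and a production script would more likely hand the job to a JSON Schema library than use hand-written rules.

```javascript
// Hypothetical rules for one general record; the real schema may differ.
const GENERAL_RULES = {
  id: (v) => typeof v === "string" && v.length > 0,
  name: (v) => typeof v === "string" && v.length > 0,
  strength: (v) => Number.isInteger(v) && v >= 1 && v <= 100,
};

// Returns one message per invalid field, with the exact path of the failure.
function validateGenerals(generals) {
  const errors = [];
  generals.forEach((general, index) => {
    for (const [field, isValid] of Object.entries(GENERAL_RULES)) {
      if (!isValid(general[field])) {
        errors.push(`generals[${index}].${field}: invalid value ${JSON.stringify(general[field])}`);
      }
    }
  });
  return errors;
}
```

Each message names the record and field that failed, so the agent gets a location to repair rather than a general impression that something is wrong.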

7. Improvement Plan: A Compute Gate Architecture

7.1 The full execution loop

Agent execution loop after improvement

  1. Human: complex request → milestone breakdown → atomic task cards.
    Each card contains INPUT_CONTRACT, OUTPUT_CONTRACT, and VALIDATION_CMD.
  2. Agent reads the task card.
    It verifies preconditions (INPUT_CONTRACT) and the completion target (OUTPUT_CONTRACT).
  3. Agent edits code.
    It only does the “write new things” part, not the “decide whether this is correct” part.
  4. Compute Gate runs automatically:
    1. tsc --noEmit: syntax and type validation
    2. check-encoding-touched.js: UTF-8 integrity
    3. validate-*.js: data and UI structural validation
    4. check-import-boundaries.js: module-boundary enforcement

  • If the gate fails → the failure message becomes the next prompt.
    The agent repairs against a deterministic error signal, with at most three loops and no blind guessing.
  • If the gate passes → output summary → next task card.
    Humans only need selective semantic review, not repetitive routine checking.
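
A minimal sketch of the gate runner might look like this. The real compute-gate.js does not exist yet, so the step list, script locations, and prompt wording are all assumptions; the validate-*.js scripts are left out because the sketch does not expand wildcards.

```javascript
const { spawnSync } = require("node:child_process");

// Assumed commands and paths; adjust to wherever the scripts actually live.
const GATE_STEPS = [
  { name: "types", cmd: "npx", args: ["tsc", "--noEmit"] },
  { name: "encoding", cmd: "node", args: ["check-encoding-touched.js"] },
  { name: "imports", cmd: "node", args: ["check-import-boundaries.js"] },
];

// Default runner: executes a real process and captures stdout plus stderr.
function runProcess(step) {
  const result = spawnSync(step.cmd, step.args, { encoding: "utf8" });
  return { ok: result.status === 0, output: `${result.stdout || ""}${result.stderr || ""}` };
}

// Runs the steps in order and stops at the first failure.
// The runner is injectable so the gate logic can be tested without real tools.
function runComputeGate(steps = GATE_STEPS, run = runProcess) {
  for (const step of steps) {
    const { ok, output } = run(step);
    if (!ok) {
      return {
        passed: false,
        prompt: `Gate "${step.name}" failed. Fix only this error:\n${output.trim()}`,
      };
    }
  }
  return { passed: true, prompt: null };
}
```

Stopping at the first failure is a deliberate choice in this sketch: one concrete error per round keeps the next prompt short and keeps the repair local.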

7.2 Module-boundary rules

| Module | Allowed imports | Forbidden imports |
| --- | --- | --- |
| shared/ | None (it should not depend on any business module) | core/, ui/, battle/ |
| core/ | shared/ | ui/, battle/ |
| ui/ | shared/, core/ | battle/ |
| battle/ | shared/, core/ | ui/ (one-way forbidden) |

AST-based static analysis can inspect each .ts import path and fail on violations directly. It is fully computational. No LLM is needed.
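
Here is a simplified sketch of that check. To stay dependency-free it extracts import paths with a regular expression instead of a real AST, so it misses dynamic `import()` calls and can be fooled by import-like text inside strings or comments; a production version would walk the TypeScript AST. It also assumes that a file's module is the first segment of its path relative to the source root, which is a guess about the project layout.

```javascript
const path = require("node:path");

// Forbidden targets per module, taken from the rules table in this section.
const FORBIDDEN = {
  shared: ["core", "ui", "battle"],
  core: ["ui", "battle"],
  ui: ["battle"],
  battle: ["ui"],
};

// Matches `from "x"` and side-effect `import "x"` statements.
const IMPORT_RE = /\bfrom\s+["']([^"']+)["']|\bimport\s+["']([^"']+)["']/g;

// Assumption: the first path segment names the module, e.g. "ui/HeroPanel.ts".
function moduleOf(relPath) {
  return relPath.split("/")[0];
}

function findBoundaryViolations(fileRelPath, source) {
  const owner = moduleOf(fileRelPath);
  const forbidden = FORBIDDEN[owner] || [];
  const violations = [];
  for (const match of source.matchAll(IMPORT_RE)) {
    const specifier = match[1] || match[2];
    if (!specifier.startsWith(".")) continue; // skip package imports such as "cc"
    // Resolve the relative specifier against the importing file's directory.
    const target = path.posix.join(path.posix.dirname(fileRelPath), specifier);
    const targetModule = moduleOf(target);
    if (forbidden.includes(targetModule)) {
      violations.push(`${fileRelPath}: ${owner}/ must not import ${targetModule}/ (${specifier})`);
    }
  }
  return violations;
}
```

Because the rule table is plain data, adding a new module only means adding one entry to `FORBIDDEN`.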


8. Why Is This Especially Friendly to Small Models?

Once computational tools take over all the judgment work, the LLM only needs to run a very simple loop:

A loop that even a small model can run stably

  1. Read the task card: clear INPUT and OUTPUT contracts, no need to understand the whole system.
  2. Write code: only do the “create new output” step, not the “judge correctness” step.
  3. Read the error message: computational tools give deterministic, concrete failures instead of vague guesses.
  4. Repair: apply a local fix against a concrete error, then go back to step 2.
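
The loop above can be sketched as a small driver function. `generate` stands in for the model call and `validate` for the computational checks; both are hypothetical placeholders, and the sketch is synchronous for brevity even though a real model call would be asynchronous. The default cap of three attempts follows the limit set in section 7.1.

```javascript
// generate(prompt) -> code string (the model's job: create new output).
// validate(code) -> { ok: true } or { ok: false, error: "..." } (the tools' job: judge it).
function runRepairLoop(taskPrompt, generate, validate, maxAttempts = 3) {
  let prompt = taskPrompt;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const code = generate(prompt);
    const result = validate(code);
    if (result.ok) return { ok: true, attempts: attempt, code };
    // The deterministic error text becomes part of the next prompt.
    prompt = `${taskPrompt}\n\nPrevious attempt failed with:\n${result.error}`;
  }
  // Give up and hand the card back to a human instead of guessing further.
  return { ok: false, attempts: maxAttempts, code: null };
}
```

Nothing in the driver depends on model size: the model only ever sees the task card plus one concrete error.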

The design goal is that even a small model in the 1.5B-parameter range can finish complex work reliably, as long as it gets a clear task card and deterministic error messages. The point is not that the model became smarter. The point is that I moved the judgment burden away from the model.


9. Harness Health Score and Reinforcement Roadmap

Current state

| Category | Score | Notes |
| --- | --- | --- |
| Feedforward Guides | 80% | Instructions, skills, task cards, and consensus docs are already solid |
| Computational Sensors | 30% | Data validation is good, but linting, unit tests, and boundary guards are still missing |
| Behaviour Harness | 40% | Screenshot regression is in place, but approved fixtures are not yet built |
| Overall | ~55 / 100 | Strong feedforward, but a serious shortage of computational feedback |

Reinforcement priority

| Priority | Task | Estimate | Goal |
| --- | --- | --- | --- |
| P0 | Integrate tsc --noEmit into compute-gate | 0.5 day | Eliminate type-error backtracking |
| P0 | Basic ESLint setup | 0.5 day | Turn coding rules from memory into enforced tooling |
| P1 | check-import-boundaries.js | 1 day | Automatic detection of cross-module coupling |
| P1 | compute-gate.js unified gate | 1 day | Integrate the auto-repair loop |
| P2 | approved-fixture-check.js | 1.5 days | Automate behaviour validation |
| P3 | harness-health-report.js | 1 day | Visualize harness coverage as a dashboard |
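
For the approved-fixture item, the core comparison could be as simple as the sketch below. The script is still on the roadmap, so this is only one possible shape: it assumes both the actual output and the approved baseline are plain JSON-style objects, and it reports only which top-level keys differ.

```javascript
const { isDeepStrictEqual } = require("node:util");

// Compares actual output with an approved baseline and lists differing top-level keys.
function checkAgainstFixture(actual, approved) {
  if (isDeepStrictEqual(actual, approved)) return { ok: true, changedKeys: [] };
  const keys = new Set([...Object.keys(approved), ...Object.keys(actual)]);
  const changedKeys = [...keys].filter(
    (key) => !isDeepStrictEqual(actual[key], approved[key])
  );
  return { ok: false, changedKeys };
}
```

A human approves the baseline once; after that, any behavioural drift shows up as a named key rather than as a question for the LLM.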

10. Five Suggestions for Other Developers

  1. Start with feedforward, because it is the cheapest.
    Instructions and skills do not need complicated tooling. They just need you to write down your experience.
  2. Computational sensors beat inferential sensors.
    Every time you ask the LLM to judge whether something is correct, ask yourself instead: can I write a script for this? If the answer is yes, write the script.
  3. Error messages are the best prompts.
    A computational tool’s failure output can be used directly as the agent’s next prompt. The location and nature of the failure are already explicit, so no human translation is needed.
  4. Consensus docs are the most underrated feedforward tool.
    Every design decision you do not want the agent to rediscover on every turn should be written into your consensus docs.
  5. Greenfield projects have a natural advantage, so start now.
    Legacy code needs harness the most but is the hardest place to build it. If your project is new, make validation scripts part of every module from day one.

Conclusion

The essence of Harness Engineering is turning a developer’s tacit knowledge into explicit, formalized structure, so an AI Agent can keep acting in line with your intent even when you are not directly supervising it.

My project is already strong on the feedforward side — 29 skills, 10 instruction files, and 142 task cards. But computational feedback sensors are still far too weak. That is the main direction I need to strengthen next.

One-sentence action guide: if the LLM is currently making a judgment and a script could make that judgment instead, write the script. Let the LLM handle new output and human-language understanding.

This article is based on the core framework from Martin Fowler’s Harness Engineering for Coding Agent Users (2026-04-02), combined with practical engineering lessons from my own Cocos Creator game project, 3KLife.