3KProject Engineering Notes

How I Use Harness Engineering to Make 3KProject More Stable

This article is my own working note on how I run engineering around 3KProject. The project already had instructions, skills, task cards, a doc-id registry, a context budget guard, UI contract validation, and runtime smoke checks. What really made the agent more stable, though, was moving the workflow from model guessing toward computation-backed validation.

Guide: Give the agent clear routes, boundaries, and task contracts before it starts, so it is less likely to drift off course on the first try.

Sensor: Run executable checks immediately after edits so mistakes are blocked before they reach human review.

Loop: Feed failure messages straight into the next repair round so even smaller models can converge through repetition and ship reliably.

A Core Idea

The harness is not the model itself. It is the full set of mechanisms around the model that make mistakes less likely and make recovery easier when mistakes do happen.

If you think of a coding agent as a fast engine for producing code, then the harness is the steering wheel, the brakes, the dashboard, and the fuse box. My experience inside 3KProject is simple: without the harness, the model can look brilliant at times but stays unstable; with it, spec summaries, task cards, validators, and handoffs start behaving like a repeatable production line.

Inside 3KProject, I treat the harness as an internal engineering control panel: `keep.summary` sets direction, task cards define slices, encoding and contract checks block mistakes before review, and runtime checks plus screenshot regression handle the closing validation.

The Two Axes That Matter Most

When I planned these tools for 3KProject, the most useful move was reducing control into two simple dimensions. The first is guidance before action versus feedback after action. The second is what can be decided computationally versus what still requires semantic judgment. That framing changes where I invest first instead of blindly piling on more prompt text.

Figure 1: The practical takeaway is not that all four matter equally, but that if a tool can decide it, do not ask the model to guess.

Feedforward

Before the agent edits anything, give it a spec summary, file scope, task contracts, and explicit no-go zones. The goal is not a longer prompt. The goal is fewer wasted attempts.

Feedback

Return executable validation results immediately after changes. If the failure message is short and specific enough, even a smaller model can converge through local fixes.
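The feedback loop can be sketched in a few lines. This is a minimal illustration, not 3KProject's actual tooling: `runRepairLoop`, `applyEdit`, `validate`, and `maxRounds` are hypothetical names, and the "agent" here is a stand-in function that only succeeds once it has seen a failure message.

```javascript
// Hypothetical sketch: `applyEdit` stands in for the agent, `validate` for a deterministic check.
function runRepairLoop({ applyEdit, validate, maxRounds = 3 }) {
  let feedback = null; // there is no failure message before the first attempt
  for (let round = 1; round <= maxRounds; round++) {
    applyEdit(feedback); // the agent edits, seeing the previous failure (if any)
    const result = validate(); // expected shape: { ok, message }
    if (result.ok) {
      return { ok: true, rounds: round };
    }
    feedback = result.message; // the failure message becomes the next round's input
  }
  return { ok: false, rounds: maxRounds, lastFailure: feedback };
}

// Usage with a fake agent that needs one failure message before it gets it right.
let value = "wrong";
const outcome = runRepairLoop({
  applyEdit: (feedback) => {
    if (feedback) value = "right";
  },
  validate: () =>
    value === "right"
      ? { ok: true }
      : { ok: false, message: 'config.json: field "mode" must be "right"' },
});
console.log(outcome); // { ok: true, rounds: 2 }
```

The round cap matters as much as the loop itself: if the check still fails after a few rounds, the last failure message goes to a human instead of burning more attempts.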

Split Harness Into Three Buckets So Governance Stays Focused

I ran into the same trap early in 3KProject: if something felt important, I kept adding another rule, and eventually it turned into an unmaintainable pile. The steadier approach was to split harness work into three governance dimensions: maintainability, architecture fitness, and behavioural correctness.

Figure 2: Once the team separates harness concerns clearly, the conversation shifts from “should we add a check?” to “what class of risk is this, and which sensor should catch it?”

This split matters in 3KProject because it turns quality from an abstract slogan into concrete responsibilities. Maintainability protects day-to-day changeability. Architecture fitness protects long-term evolution. Behaviour checks whether the system actually does the right thing. When all three are mixed together, rules keep growing while the feedback loop gets weaker.
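One way to make the three-way split visible is to tag every sensor with the dimension it protects and list the dimensions that have no sensor at all. The sketch below is illustrative only: the dimension assigned to each sensor is my assumption for the example, and `coverageByDimension` is a hypothetical helper.

```javascript
const DIMENSIONS = ["maintainability", "architecture", "behaviour"];

// Hypothetical assignments: which dimension each sensor belongs to is an assumption.
const sensors = [
  { name: "syntax-check", dimension: "maintainability" },
  { name: "encoding-check", dimension: "maintainability" },
  { name: "ui-contract", dimension: "behaviour" },
];

// Group sensor names by dimension, rejecting any dimension outside the three buckets.
function coverageByDimension(list) {
  const report = Object.fromEntries(DIMENSIONS.map((d) => [d, []]));
  for (const sensor of list) {
    if (!report[sensor.dimension]) {
      throw new Error(`unknown dimension: ${sensor.dimension}`);
    }
    report[sensor.dimension].push(sensor.name);
  }
  return report;
}

const report = coverageByDimension(sensors);
const hollow = DIMENSIONS.filter((d) => report[d].length === 0);
console.log(report);
console.log("no sensor yet:", hollow); // no sensor yet: [ 'architecture' ]
```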

3KProject’s Current Strengths and Gaps

Looking back at the current internal flow of 3KProject, feedforward is not in bad shape at all. Instructions, workflow skills, task cards, consensus docs, the doc-id registry, and the context budget guard are all alive. The weak spot is computational feedback, which is still incomplete.

Figure 3: Many teams do not lack rules. The problem is that the rules never became checks that run often, so quality still gets stuck in manual review.

What 3KProject Already Got Right

Writing experience down as instructions, skill flows, task cards, summary cards, and spec indexes was my first real step in turning an external harness into project infrastructure. Those mechanisms genuinely improved the agent’s first-pass success rate.

What I Most Want to Add Next

Without high-frequency, low-cost computational feedback, the agent is still guessing in the end. On the surface it looks like engineering, but in practice it only moves human review further downstream.

The Small-Model Workflow I Want to Land in 3KProject

The most valuable part of this approach for me is that it does not only serve the strongest models. 3KProject has a lot of internal specs, UI surfaces, data, and tooling. As long as I split work into atomic steps and give every step explicit inputs, outputs, and validation commands, even medium or small models can deliver complex features reliably.

Figure 4: What helps smaller models most is not longer prompts, but shorter task slices and more deterministic validation signals.

Once the workflow looks like the diagram above, the LLM’s job inside 3KProject becomes very simple: read the task card, edit locally, read the failure message, and repair locally. I try to hand any judgment about whether something is actually correct to the type system, spec validators, boundary checkers, and fixture comparisons.
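To make "explicit inputs, outputs, and validation commands" checkable, a task card can be treated as data and rejected when a field is missing. This is a sketch under assumptions: the field names (`id`, `inputs`, `outputs`, `validate`), the card id, and the output path are hypothetical, not the real task card format.

```javascript
// Hypothetical task card shape; the field names are assumptions for illustration.
const REQUIRED_FIELDS = ["id", "inputs", "outputs", "validate"];
const LIST_FIELDS = ["inputs", "outputs", "validate"];

// Return a list of problems; an empty list means the card is executable as written.
function checkTaskCard(card) {
  const problems = [];
  for (const field of REQUIRED_FIELDS) {
    if (card[field] === undefined) problems.push(`missing field "${field}"`);
  }
  for (const field of LIST_FIELDS) {
    const value = card[field];
    if (value !== undefined && (!Array.isArray(value) || value.length === 0)) {
      problems.push(`"${field}" must be a non-empty array`);
    }
  }
  return problems;
}

const card = {
  id: "ui-contract-001", // hypothetical id
  inputs: ["docs/ui/UI-tech-spec.md"],
  outputs: ["docs/tasks/ui-contract-001.md"], // hypothetical output path
  validate: ["node tools_node/validate-ui-specs.js"],
};
console.log(checkTaskCard(card)); // []
console.log(checkTaskCard({ id: "x", inputs: [] }));
```

A card that fails this check never reaches the model, which keeps vague tasks from turning into vague edits.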

Example 1: Atomic Task Decomposer

node tools/task-decomposer.js \
  --feature "UI contract gate consolidation" \
  --spec "docs/ui/UI-tech-spec.md" \
  --output-dir "docs/tasks/"

Example 2: Computational Gate Configuration

{
  "gates": [
    {
      "name": "syntax-check",
      "cmd": "npx tsc --noEmit --project tsconfig.json",
      "priority": 1
    },
    {
      "name": "encoding-check",
      "cmd": "node tools_node/check-encoding-touched.js",
      "priority": 2
    },
    {
      "name": "domain-data",
      "cmd": "node tools_node/validate-generals-data.js",
      "priority": 3
    },
    {
      "name": "ui-contract",
      "cmd": "node tools_node/validate-ui-specs.js",
      "priority": 4
    },
    {
      "name": "runtime-registry",
      "cmd": "node tools_node/check-ui-runtime-state-registry.js",
      "priority": 5
    },
    {
      "name": "import-boundary",
      "cmd": "node tools_node/check-import-boundaries.js",
      "priority": 6
    }
  ]
}
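A runner could consume this configuration roughly as follows. This is a sketch, not the real compute-gate.js: `runGates`, `runShell`, and the injectable `exec` parameter are hypothetical, and stopping at the first failure is a design assumption rather than something the configuration itself specifies.

```javascript
const { spawnSync } = require("node:child_process");

// Default executor: run one shell command and capture its exit code and output.
function runShell(cmd) {
  const res = spawnSync(cmd, { shell: true, encoding: "utf8" });
  return { code: res.status, output: `${res.stdout || ""}${res.stderr || ""}`.trim() };
}

// Run gates in ascending priority and stop at the first failure,
// so each repair round gets exactly one failure to work on.
// Array.prototype.sort is stable, so gates with equal priority keep their listed order.
function runGates(gates, exec = runShell) {
  const ordered = [...gates].sort((a, b) => a.priority - b.priority);
  const passed = [];
  for (const gate of ordered) {
    const { code, output } = exec(gate.cmd);
    if (code !== 0) {
      return { ok: false, passed, failed: gate.name, output };
    }
    passed.push(gate.name);
  }
  return { ok: true, passed };
}

// Usage with a fake executor, so the example runs without the real tools installed.
const demoGates = [
  { name: "encoding-check", cmd: "node tools_node/check-encoding-touched.js", priority: 2 },
  { name: "syntax-check", cmd: "npx tsc --noEmit --project tsconfig.json", priority: 1 },
];
const fakeExec = (cmd) =>
  cmd.includes("check-encoding")
    ? { code: 1, output: "src/a.ts: invalid UTF-8" }
    : { code: 0, output: "" };
console.log(runGates(demoGates, fakeExec));
// { ok: false, passed: [ 'syntax-check' ], failed: 'encoding-check', output: 'src/a.ts: invalid UTF-8' }
```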

The real key is not the gate name. It is whether the failure message is local enough and actionable enough. As long as the agent can tell which file, which rule, and which field failed, the repair success rate goes up sharply. That is why I keep saying inside 3KProject that the error message itself is also a prompt.
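As an illustration of what "local and actionable" could mean in practice, a validator might emit every failure through one small formatter. The record fields (`gate`, `file`, `rule`, `field`, `expected`, `actual`) and the sample rule and field path are hypothetical; only the shape of the message is the point.

```javascript
// Hypothetical failure record; field names are assumptions for illustration.
function formatFailure({ gate, file, rule, field, expected, actual }) {
  const where = field ? `${file} (${field})` : file;
  return [
    `[${gate}] ${where}`,
    `  rule: ${rule}`,
    `  expected: ${expected}`,
    `  actual: ${actual}`,
  ].join("\n");
}

const message = formatFailure({
  gate: "ui-contract",
  file: "docs/ui/UI-tech-spec.md",
  rule: "required-field", // hypothetical rule name
  field: "screens[0].id", // hypothetical field path
  expected: "non-empty string",
  actual: "undefined",
});
console.log(message);
// [ui-contract] docs/ui/UI-tech-spec.md (screens[0].id)
//   rule: required-field
//   expected: non-empty string
//   actual: undefined
```

Four short lines answer the three questions the agent needs answered (which file, which rule, which field) without a stack trace to wade through.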

The Five Tools I Most Want to Add Next

3KProject does not really lack process documents. What it lacks are feedback gates that can run every day and feed results straight back into the agent. So for me, the efficient move is not writing more policy, but filling in a few tools that would actually be used daily.

compute-gate.js: unify type checks, encoding checks, data validation, and contract validation behind one entry point.
check-import-boundaries.js: turn module boundaries from verbal convention into tool-enforced guardrails.
approved-fixture-check.js: preserve human-approved expected outputs as comparable baselines.
task-decomposer.js: split large requests into atomic task cards that can be executed in sequence.
harness-health-report.js: produce a fixed report showing which guides and sensors are still hollow.

What these five tools share is that they translate abstract knowledge into repeatable operations. In 3KProject, once a rule becomes executable, the agent can actually be governed. Once a result becomes measurable, I can discuss improvement with the team instead of debating feelings.
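To show how a verbal convention turns into an executable rule, here is a rough sketch of an import boundary check. It is not the real check-import-boundaries.js: the layer names in `FORBIDDEN` are invented, and the regex scan is a simplification that misses side-effect imports such as `import "x"`; a production tool would more likely use a proper parser.

```javascript
const path = require("node:path");

// Hypothetical layering rules: each layer lists path prefixes it must NOT import.
const FORBIDDEN = {
  "src/data/": ["src/ui/"],
  "src/ui/": ["src/tools/"],
};

// Extract module specifiers from `... from "x"` and `require("x")`.
function findImports(source) {
  const pattern = /(?:from\s+|require\(\s*)["']([^"']+)["']/g;
  return [...source.matchAll(pattern)].map((m) => m[1]);
}

// Turn a relative specifier into a project-relative path; leave package names alone.
function resolveSpec(file, spec) {
  return spec.startsWith(".") ? path.posix.join(path.posix.dirname(file), spec) : spec;
}

function checkBoundaries(file, source) {
  const layer = Object.keys(FORBIDDEN).find((prefix) => file.startsWith(prefix));
  if (!layer) return []; // file is outside every governed layer
  return findImports(source)
    .map((spec) => resolveSpec(file, spec))
    .filter((target) => FORBIDDEN[layer].some((banned) => target.startsWith(banned)))
    .map((target) => `${file}: layer "${layer}" must not import "${target}"`);
}

const violations = checkBoundaries(
  "src/data/generals.ts",
  'import { Panel } from "../ui/panel";\nimport fs from "node:fs";'
);
console.log(violations);
// [ 'src/data/generals.ts: layer "src/data/" must not import "src/ui/panel"' ]
```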

Rollout Order: Patch the Cheap, Stable, Daily Sensors First

When I rank the rollout order for 3KProject, I care about which checks are cheapest, most stable, and most likely to run every day. The best strategy is to start with low-cost, high-frequency checks, then add behaviour and architecture guards, and only later handle expensive semantic review.

Figure 5: If the order is wrong, teams spend real money on expensive review while still leaving the basic daily failures untouched.

In one sentence, my priority order in 3KProject is this: let machines do the deterministic judgment they are good at first, then let models do the new-content generation and semantic understanding they are good at.

The Final Decision Rule

Back in 3KProject, my decision rule is very simple now: if a quality judgment can be made by scripts, the type system, schema checks, snapshots, or fixtures, then I should not leave it to an LLM to guess. The model is most valuable when it generates something new, understands ambiguous requirements, and fills in missing context, not when it acts like an expensive, unstable if/else engine.

Better Handed to Models

Requirement decomposition, code writing, refactoring suggestions, document consolidation, semantic diff reading, and high-level design tradeoffs.

Better Handed to Tools

Type correctness, naming rules, module boundaries, data formats, fixed-output comparison, and encoding integrity.
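Fixed-output comparison is the easiest of these to sketch. The example below assumes a human-approved baseline stored as a plain text file; `checkAgainstApproved`, the file name, and the sample content are hypothetical and do not describe the real approved-fixture-check.js.

```javascript
const fs = require("node:fs");
const os = require("node:os");
const path = require("node:path");

// Compare actual output against a human-approved baseline file.
// A missing baseline is reported as a failure, never silently created.
function checkAgainstApproved(approvedPath, actual) {
  if (!fs.existsSync(approvedPath)) {
    return { ok: false, reason: `no approved fixture at ${approvedPath}` };
  }
  const approved = fs.readFileSync(approvedPath, "utf8");
  if (approved === actual) return { ok: true };
  const want = approved.split("\n");
  const got = actual.split("\n");
  const count = Math.max(want.length, got.length);
  for (let i = 0; i < count; i++) {
    if (want[i] !== got[i]) {
      // Report only the first differing line to keep the message short.
      return {
        ok: false,
        reason: `line ${i + 1}: expected ${JSON.stringify(want[i])}, got ${JSON.stringify(got[i])}`,
      };
    }
  }
  return { ok: false, reason: "outputs differ" };
}

// Usage against a throwaway baseline in a temp directory.
const dir = fs.mkdtempSync(path.join(os.tmpdir(), "fixture-"));
const approvedFile = path.join(dir, "example.approved.txt");
fs.writeFileSync(approvedFile, "title: Example\nstatus: ok\n");
const same = checkAgainstApproved(approvedFile, "title: Example\nstatus: ok\n");
const drift = checkAgainstApproved(approvedFile, "title: Example\nstatus: bad\n");
console.log(same); // { ok: true }
console.log(drift.reason); // line 2: expected "status: ok", got "status: bad"
```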

Inside 3KProject, my rule now is to turn as much model-side judgment as possible into scripts, and save the LLM for the work that really needs understanding and creation.