3KProject Engineering Notes

How I Use Harness Engineering to Make 3KProject More Stable

This article is my own working note on how I run engineering around 3KProject. The project already had instructions, skills, task cards, a doc-id registry, a context budget guard, UI contract validation, and runtime smoke checks. What really made the agent more stable, though, was moving the workflow from model guessing toward computation-backed validation.

Guide: Give the agent clear routes, boundaries, and task contracts before it starts, so it is less likely to drift off course on the first try.
Sensor: Run executable checks immediately after edits so mistakes are blocked before they reach human review.
Loop: Feed failure messages straight into the next repair round so even smaller models can converge through repetition and ship reliably.

A Core Idea

The harness is not the model itself. It is the full set of mechanisms around the model that make mistakes less likely and make recovery easier when mistakes do happen.

If you think of a Coding Agent as a fast engine for producing code, then the harness is the steering wheel, the brakes, the dashboard, and the fuse box. My experience inside 3KProject is simple: without that engineering harness, the model can look brilliant at times but stays unstable; with it, spec summaries, task cards, validators, and handoffs start behaving like a repeatable production line.

Inside 3KProject, I treat the harness as an internal engineering control panel: `keep.summary` sets direction, task cards define slices, encoding and contract checks block mistakes before review, and runtime checks plus screenshot regression handle the closing validation.

The Two Axes That Matter Most

When I planned these tools for 3KProject, the most useful move was reducing control into two simple dimensions. The first is guidance before action versus feedback after action. The second is what can be decided computationally versus what still requires semantic judgment. That framing changes where I invest first instead of blindly piling on more prompt text.

The Four Quadrants of Harness: steer first, verify second; if it can be computed, give it to tools first

Axes
• Feedforward: guidance before the agent acts
• Feedback: feedback after the agent acts
• Computational: deterministic, high-frequency
• Inferential: semantic judgment, selectively used

Computational × Feedforward: highest-leverage place to invest first
• Task-card input/output contracts
• File routing rules and module boundaries
• Executable scaffolds and templates

Computational × Feedback: backbone of stable quality
• Type checks, lint, unit tests
• JSON, contract, and snapshot validation
• Encoding, guardrail, and dependency checks

Inferential × Feedforward
• Architecture explanations, examples, review rules
• Teach the model what a good answer looks like

Inferential × Feedback
• AI code review and semantic review
• Higher cost, best reserved for key checkpoints

Figure 1: The practical takeaway is not that all four matter equally, but that if a tool can decide it, do not ask the model to guess.

Feedforward

Before the agent edits anything, give it a spec summary, file scope, task contracts, and explicit no-go zones. The goal is not a longer prompt. The goal is fewer wasted attempts.
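One way to keep file scope and no-go zones from being prompt text only is to write them as data, so the same declaration can be shown to the agent up front and checked mechanically afterwards. The sketch below is a minimal illustration under assumptions: the `scope` and `noGo` field names, the path prefixes, and the `findScopeViolations` helper are hypothetical, not 3KProject's actual task-card schema.

```javascript
// Hypothetical scope declaration: prefixes the agent may edit, and no-go prefixes.
const card = {
  scope: ['assets/ui/', 'docs/tasks/'],
  noGo: ['assets/ui/generated/'],
};

// Return the touched files that fall inside a no-go zone or outside the declared scope.
function findScopeViolations(card, touchedFiles) {
  const violations = [];
  for (const file of touchedFiles) {
    const normalized = file.replace(/\\/g, '/'); // tolerate Windows-style separators
    const forbidden = card.noGo.find((prefix) => normalized.startsWith(prefix));
    if (forbidden) {
      violations.push({ file: normalized, rule: 'no-go', detail: forbidden });
      continue;
    }
    if (!card.scope.some((prefix) => normalized.startsWith(prefix))) {
      violations.push({ file: normalized, rule: 'out-of-scope', detail: card.scope.join(', ') });
    }
  }
  return violations;
}

// Illustrative file list; in practice this might come from the version-control diff.
const report = findScopeViolations(card, [
  'assets/ui/panel.ts',
  'assets/ui/generated/atlas.json',
  'server/main.ts',
]);
for (const v of report) {
  console.log(`${v.file}: ${v.rule} (${v.detail})`);
}
```

Each violation names the file and the rule it broke, which keeps the message short enough to hand back to the agent unchanged.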

Feedback

Return executable validation results immediately after changes. If the failure message is short and specific enough, even a smaller model can converge through local fixes.

Split Harness Into Three Buckets So Governance Stays Focused

I ran into the same trap early in 3KProject: if something felt important, I kept adding another rule, and eventually it turned into an unmaintainable pile. The steadier approach was to split harness work into three governance dimensions: maintainability, architecture fitness, and behavioural correctness.

Harness Engineering

Maintainability
• Duplicate-code and complexity analysis
• Type health and test coverage
• Encoding consistency and dead-code scans
Purpose: keep the system safe to maintain for both humans and agents under change.

Architecture Fitness
• Module boundaries and dependency limits
• Performance, memory, and load baselines
• Observability and API quality guards
Purpose: turn architecture principles into executable rules with visible violations.

Behaviour
• Functional spec validation
• Approved fixtures / golden cases
• Regression snapshots and interaction smoke checks
Purpose: verify that what the system does actually matches product expectations.

Figure 2: Once the team separates harness concerns clearly, the conversation shifts from “should we add a check?” to “what class of risk is this, and which sensor should catch it?”

This split matters in 3KProject because it turns quality from an abstract slogan into concrete responsibilities. Maintainability protects day-to-day changeability. Architecture fitness protects long-term evolution. Behaviour checks whether the system actually does the right thing. When all three are mixed together, rules keep growing while the feedback loop gets weaker.

3KProject’s Current Strengths and Gaps

Looking back at the current internal flow of 3KProject, feedforward is not in bad shape at all. Instructions, workflow skills, task cards, consensus docs, the doc-id registry, and the context budget guard are all alive. The weaker area is that computational feedback still is not complete enough.

Current Harness Capability Snapshot for 3KProject

Guidance layer is mature
• keep.summary, Instructions, Skills, task cards, and the doc-id registry are already in operation
• Agents usually know the route, no-go zones, naming rules, and handoff style before they begin
Low risk, but easy to mistake this for enough

Structural validation has a foundation
• We already have encoding checks, UI contract validation, runtime smoke checks, screenshot regression, and data validation
• 3KProject already knows where eyeballing alone is not enough
Still missing an integrated gate that feeds results back to the agent

Lint and tests are thin
• Missing high-frequency linting, unit tests, and a type-check pipeline
• Quality still leans too much on model memory and manual code review
This is the main stability gap

Module boundaries are not tooled
• We know which dependencies should not happen, but we do not have complete automated guards
• Some architecture principles still live only in docs and senior-member reminders
This becomes coupling sprawl over time

Approved fixtures are missing
• Many internal workflows have expected outputs, but they are not stored as reusable regression baselines
• Validation still has to be judged by people each time, so the cost keeps compounding
Behaviour correctness cannot be reused consistently

Overall judgment
• 3KProject does not lack rules; feedback is just not cheap, dense, and automated enough yet
• My next step is not more prompt text, but more computational sensors
This directly determines whether the agent can scale stably

Figure 3: Many teams do not lack rules. The problem is that the rules never became checks that run often, so quality still gets stuck in manual review.

What 3KProject Already Got Right

Writing experience down as instructions, skill flows, task cards, summary cards, and spec indexes was my first real step in turning an external harness into project infrastructure. Those mechanisms genuinely improved the agent’s first-pass success rate.

What I Most Want to Add Next

Without high-frequency, low-cost computational feedback, the agent is still guessing in the end. On the surface it looks like engineering, but in practice it only moves human review further downstream.

The Small-Model Workflow I Want to Land in 3KProject

The most valuable part of this approach for me is that it does not only serve the strongest models. 3KProject has a lot of internal specs, UI surfaces, data, and tooling. As long as I split work into atomic steps and give every step explicit inputs, outputs, and validation commands, even medium or small models can deliver complex features reliably.

  1. Complex requirements and milestone breakdown
  2. Atomic task cards: each card defines INPUT_CONTRACT, OUTPUT_CONTRACT, VALIDATION_CMD, and ROLLBACK_CMD
  3. Agent makes local edits: the model focuses on the local code and known context of the current card, not the entire system at once
  4. Computational validation: Type, Lint, Test, Schema
  5. FAIL: auto-repair. Feed the error message directly back to the agent and return to step 3
  6. PASS: output summary. Update status and move to the next task card

Figure 4: What helps smaller models most is not longer prompts, but shorter task slices and more deterministic validation signals.
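A task card carrying the four fields named in the workflow could be represented as a plain object and checked for completeness before the agent starts. This is a minimal sketch under assumptions: only the four field names come from the workflow above; the `id` field, the example values, and the `validateTaskCard` helper are hypothetical.

```javascript
// The four required fields come from the task-card workflow; everything else is illustrative.
const REQUIRED_FIELDS = ['INPUT_CONTRACT', 'OUTPUT_CONTRACT', 'VALIDATION_CMD', 'ROLLBACK_CMD'];

// Return a list of human-readable problems; an empty list means the card is usable.
function validateTaskCard(card) {
  const problems = [];
  for (const field of REQUIRED_FIELDS) {
    const value = card[field];
    if (typeof value !== 'string' || value.trim() === '') {
      problems.push(`${card.id || '(no id)'}: missing or empty ${field}`);
    }
  }
  return problems;
}

// Hypothetical card; the contract text is a placeholder and ROLLBACK_CMD is left
// empty on purpose to show what a rejected card looks like.
const exampleCard = {
  id: 'TASK-001',
  INPUT_CONTRACT: 'docs/ui/UI-tech-spec.md',
  OUTPUT_CONTRACT: 'One consolidated UI contract gate script',
  VALIDATION_CMD: 'node tools_node/validate-ui-specs.js',
  ROLLBACK_CMD: '',
};

console.log(validateTaskCard(exampleCard));
```

Rejecting an incomplete card before any edit happens is cheaper than discovering mid-task that there is no validation or rollback command to run.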

Once the workflow looks like the diagram above, the LLM’s job inside 3KProject becomes very simple: read the task card, edit locally, read the failure message, and repair locally. I try to hand any judgment about whether something is actually correct to the type system, spec validators, boundary checkers, and fixture comparisons.
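The validate-then-repair cycle can be sketched as a small bounded loop. This is an illustration under assumptions, not 3KProject's actual runner: `runRepairLoop`, its `validate` and `repair` callbacks, and the `maxRepairs` budget are hypothetical names, and the agent is replaced by a stub so the example runs on its own.

```javascript
// Validate, and on failure hand the message to `repair`, until the check passes
// or the repair budget is spent. `validate` returns { ok, message }.
// Both callbacks are injected so the loop does not depend on any model or tool.
function runRepairLoop({ validate, repair, maxRepairs = 3 }) {
  const history = [];
  for (let attempt = 0; attempt <= maxRepairs; attempt += 1) {
    const result = validate();
    history.push(result);
    if (result.ok) {
      return { status: 'PASS', repairsUsed: attempt, history };
    }
    // Only repair if another validation round will follow to verify it.
    if (attempt < maxRepairs) {
      repair(result.message);
    }
  }
  return { status: 'FAIL', repairsUsed: maxRepairs, history };
}

// Stub standing in for the validator and the agent: each repair clears one pending error.
const pending = ['src/a.ts: rule no-any', 'src/b.ts: missing field "id"'];
const outcome = runRepairLoop({
  validate: () =>
    pending.length === 0 ? { ok: true, message: '' } : { ok: false, message: pending[0] },
  repair: () => {
    pending.shift();
  },
});
console.log(outcome.status, outcome.repairsUsed); // prints "PASS 2"
```

The fixed budget matters: when the loop ends in FAIL, the card goes back to a human with the full message history instead of letting the agent retry indefinitely.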

Example 1: Atomic Task Decomposer

node tools/task-decomposer.js \
  --feature "UI contract gate consolidation" \
  --spec "docs/ui/UI-tech-spec.md" \
  --output-dir "docs/tasks/"

Example 2: Computational Gate Configuration

{
  "gates": [
    {
      "name": "syntax-check",
      "cmd": "npx tsc --noEmit --project tsconfig.json",
      "priority": 1
    },
    {
      "name": "encoding-check",
      "cmd": "node tools_node/check-encoding-touched.js",
      "priority": 2
    },
    {
      "name": "domain-data",
      "cmd": "node tools_node/validate-generals-data.js",
      "priority": 3
    },
    {
      "name": "ui-contract",
      "cmd": "node tools_node/validate-ui-specs.js",
      "priority": 4
    },
    {
      "name": "runtime-registry",
      "cmd": "node tools_node/check-ui-runtime-state-registry.js",
      "priority": 5
    },
    {
      "name": "import-boundary",
      "cmd": "node tools_node/check-import-boundaries.js",
      "priority": 6
    }
  ]
}

The real key is not the gate name. It is whether the failure message is local enough and actionable enough. As long as the agent can tell which file, which rule, and which field failed, the repair success rate goes up sharply. That is why I keep saying inside 3KProject that the error message itself is also a prompt.
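A runner for a `gates` list shaped like Example 2 might look like the sketch below. It rests on assumptions rather than an existing 3KProject tool: running gates in ascending `priority`, stopping at the first failure, and trimming output to a fixed number of lines are design choices of this example, and `runShell`, `runGates`, and `maxLines` are hypothetical names. The demo uses a stub runner with an invented failure message so it runs without the real tools.

```javascript
const { spawnSync } = require('node:child_process');

// Default runner: execute one gate command in a shell and capture its output.
function runShell(cmd) {
  const result = spawnSync(cmd, { shell: true, encoding: 'utf8' });
  const output = `${result.stdout || ''}${result.stderr || ''}`.trim();
  return { ok: result.status === 0, output };
}

// Run gates in ascending priority and stop at the first failure, so the agent
// receives one short, local message instead of a wall of logs.
function runGates(gates, run = runShell, maxLines = 20) {
  const ordered = [...gates].sort((a, b) => a.priority - b.priority);
  const passed = [];
  for (const gate of ordered) {
    const { ok, output } = run(gate.cmd);
    if (!ok) {
      const message = output.split('\n').slice(0, maxLines).join('\n');
      return { status: 'FAIL', gate: gate.name, passed, message };
    }
    passed.push(gate.name);
  }
  return { status: 'PASS', gate: null, passed, message: '' };
}

// Demo: two gates from Example 2, listed out of order on purpose.
const demoGates = [
  { name: 'encoding-check', cmd: 'node tools_node/check-encoding-touched.js', priority: 2 },
  { name: 'syntax-check', cmd: 'npx tsc --noEmit --project tsconfig.json', priority: 1 },
];
// Stub runner; the failure text is invented for illustration, not real tool output.
const stubRun = (cmd) =>
  cmd.includes('check-encoding')
    ? { ok: false, output: 'docs/a.md: not valid UTF-8 at byte 12' }
    : { ok: true, output: '' };
console.log(runGates(demoGates, stubRun));
```

To run against a real configuration, something like `runGates(JSON.parse(fs.readFileSync('gates.json', 'utf8')).gates)` would fall back to the default shell runner; the file name `gates.json` is an assumption.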

The Five Tools I Most Want to Add Next

3KProject does not really lack process documents. What it lacks are feedback gates that can run every day and feed results straight back into the agent. So for me, the efficient move is not writing more policy, but filling in a few tools that would actually be used daily.

  1. compute-gate.js: unify type checks, encoding checks, data validation, and contract validation behind one entry point.
  2. check-import-boundaries.js: turn module boundaries from verbal convention into tool-enforced guardrails.
  3. approved-fixture-check.js: preserve human-approved expected outputs as comparable baselines.
  4. task-decomposer.js: split large requests into atomic task cards that can be executed in sequence.
  5. harness-health-report.js: produce a fixed report showing which guides and sensors are still hollow.
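As an illustration of the second item, an import-boundary check can be approximated by scanning import specifiers against a small rule table. This is a rough sketch, not the planned `check-import-boundaries.js`: the rules, directory names, and `checkImportBoundaries` helper are hypothetical, the regex only handles single-line `import` and `require` statements, and sources are passed in memory so the example runs without touching the file system.

```javascript
const path = require('node:path');

// Hypothetical rules: files under `from` must not import anything under `forbid`.
const RULES = [
  { from: 'src/ui/', forbid: 'src/data/internal/' },
  { from: 'src/core/', forbid: 'src/ui/' },
];

// Matches `import ... from 'x'`, `import 'x'`, and `require('x')` on one line.
const IMPORT_RE = /(?:import\s+(?:[^'"]*?\sfrom\s+)?|require\(\s*)['"]([^'"]+)['"]/g;

// Return one message per violation, naming the file, line, and rule that failed.
function checkImportBoundaries(files, rules) {
  const violations = [];
  for (const { file, source } of files) {
    const active = rules.filter((rule) => file.startsWith(rule.from));
    if (active.length === 0) continue;
    source.split('\n').forEach((text, index) => {
      for (const match of text.matchAll(IMPORT_RE)) {
        const spec = match[1];
        // Resolve relative specifiers against the importing file's directory.
        const target = spec.startsWith('.')
          ? path.posix.join(path.posix.dirname(file), spec)
          : spec;
        for (const rule of active) {
          if (target.startsWith(rule.forbid)) {
            violations.push(
              `${file}:${index + 1} imports ${target} (rule: ${rule.from} must not import ${rule.forbid})`
            );
          }
        }
      }
    });
  }
  return violations;
}

// Illustrative sources; real use would read the touched files from disk.
const sampleFiles = [
  { file: 'src/ui/panel.ts', source: "import { store } from '../data/internal/store';\nimport { Button } from './button';" },
  { file: 'src/core/rules.ts', source: "const view = require('../ui/view');" },
  { file: 'src/data/loader.ts', source: "import '../ui/anything';" },
];
checkImportBoundaries(sampleFiles, RULES).forEach((line) => console.log(line));
```

A production version would more likely parse the files with a real parser than rely on a regex, but even this rough form turns a verbal convention into a message that names the file, the line, and the rule.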

What these five tools share is that they translate abstract knowledge into repeatable operations. In 3KProject, once a rule becomes executable, the agent can actually be governed. Once a result becomes measurable, I can discuss improvement with the team instead of debating feelings.
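Of the five, the approved-fixture idea is perhaps the easiest to prototype: store a human-approved output once, then report exactly which fields of a new output differ from it. The sketch below assumes JSON-like data; `diffPaths` and the sample record are hypothetical and do not reflect the real generals data schema. As a known simplification, it compares arrays and objects by key, so it does not distinguish an array from an object with the same numeric keys.

```javascript
// Walk two JSON-like values and collect the paths where they differ.
function diffPaths(approved, actual, prefix = '') {
  const isObject = (value) => value !== null && typeof value === 'object';
  if (!isObject(approved) || !isObject(actual)) {
    return Object.is(approved, actual)
      ? []
      : [`${prefix || '(root)'}: expected ${JSON.stringify(approved)}, got ${JSON.stringify(actual)}`];
  }
  const keys = new Set([...Object.keys(approved), ...Object.keys(actual)]);
  const diffs = [];
  for (const key of keys) {
    diffs.push(...diffPaths(approved[key], actual[key], prefix ? `${prefix}.${key}` : key));
  }
  return diffs;
}

// Hypothetical approved baseline and a new output that drifted in two places.
const approvedFixture = { id: 'general-001', stats: { attack: 90, defense: 75 }, tags: ['cavalry'] };
const actualOutput = { id: 'general-001', stats: { attack: 90, defense: 70 }, tags: ['cavalry', 'leader'] };

const fixtureDiffs = diffPaths(approvedFixture, actualOutput);
fixtureDiffs.forEach((line) => console.log(line));
// prints:
// stats.defense: expected 75, got 70
// tags.1: expected undefined, got "leader"
```

Because every difference is reported as a field path with expected and actual values, the output can go straight back to the agent as a repair instruction, or to a human who decides whether the baseline should be re-approved.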

Rollout Order: Patch the Cheap, Stable, Daily Sensors First

When I rank this rollout order for 3KProject, I care about which checks are cheapest, most stable, and likely to run every day. The best strategy is to start with low-cost high-frequency checks, then add behaviour and architecture guards, and only later handle expensive semantic review.

Rollout order: stabilize the feedback chain first, then expand semantic and governance layers

P0: Immediate stability lift
• Type-check integration
• Basic lint and encoding checks
Eliminate the most common and cheapest-to-catch errors first

P1: Reduce long-term coupling
• Import boundary checks
• Atomic task-card tooling
Make architecture rules and task slicing executable by machines

P2: Build sustainable regression
• Approved fixtures
• Health report
Institutionalize behaviour validation and governance coverage

P3: Control cost and latency
• AI semantic review
• High-cost inferential sensors
Reserve them for high-value checkpoints instead of every change

Figure 5: If the order is wrong, teams spend real money on expensive review while still leaving the basic daily failures untouched.

In one sentence, my priority order in 3KProject is this: first let machines do the deterministic judgment they are good at, then let models do the new-content generation and semantic understanding they are good at.

The Final Decision Rule

Back in 3KProject, my decision rule is very simple now: if a quality judgment can be made by scripts, the type system, schema checks, snapshots, or fixtures, then I should not leave it to an LLM to guess. The model is most valuable when it generates something new, understands ambiguous requirements, and fills in missing context, not when it acts like an expensive, unstable if/else engine.

Better Handed to Models

Requirement decomposition, code writing, refactoring suggestions, document consolidation, semantic diff reading, and high-level design tradeoffs.

Better Handed to Tools

Type correctness, naming rules, module boundaries, data formats, fixed-output comparison, and encoding integrity.

Inside 3KProject, my rule now is to turn as much model-side judgment as possible into scripts, and save the LLM for the work that really needs understanding and creation.