3KProject Data Pipeline Notes

How I Turned RAG / ETL Into a Knowledge Pipeline That Can Actually Ship

When people hear RAG, they still picture "vector search + an LLM answering questions." What I built in 3KProject feels more like a data factory: start from source text, pull out raw material, then turn people, relationships, and events into assets that can be reviewed, rerun, and finally fed straight into the NPC brain and the Cocos UI.

Grounded: Every record should point back to sourceRef, sourceQuote, and chapterNo so the answer never drifts away from the evidence.

Deterministic: Let scripts organize structure and evidence first, then let the LLM act as a reviewer instead of the author.

Runtime-ready: The line does not stop at documents. It exports runtime profiles that the NPC brain API and the Cocos UI can consume directly.
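To make the "Grounded" principle concrete, here is a minimal sketch of a record that carries its own evidence. Only sourceRef, sourceQuote, and chapterNo come from the pipeline as described; the claim field, the reference format "ch001#p12", and the is_grounded check are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GroundedRecord:
    """One extracted fact plus the evidence it came from."""
    claim: str        # hypothetical field: the extracted statement
    sourceRef: str    # pointer to the source passage (format is assumed)
    sourceQuote: str  # verbatim text that supports the claim
    chapterNo: int    # chapter the quote was taken from


def is_grounded(record: GroundedRecord, chapter_text: str) -> bool:
    """A record counts as grounded only if its quote occurs in the chapter text."""
    quote = record.sourceQuote.strip()
    return (
        bool(record.sourceRef)
        and record.chapterNo > 0
        and bool(quote)
        and quote in chapter_text
    )


# Illustrative English stand-in for one line of chapter text.
chapter_1 = "Liu Bei, Guan Yu, and Zhang Fei swore an oath in the peach garden."

good = GroundedRecord(
    claim="Liu Bei swore an oath with Guan Yu and Zhang Fei",
    sourceRef="ch001#p12",
    sourceQuote="swore an oath in the peach garden",
    chapterNo=1,
)
drifted = GroundedRecord(
    claim="Liu Bei crossed the river at night",
    sourceRef="ch001#p12",
    sourceQuote="crossed the river at night",
    chapterNo=1,
)
```

The second record sounds plausible, but its quote does not appear in the chapter, so a check like this would keep it out of review until the evidence is fixed.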

The One Sentence That Matters Most

The goal of this pipeline is not to let the model answer questions directly. It is to turn raw text into traceable, replayable, rerunnable knowledge artifacts.

In plainer language, this system is not about asking AI a Three Kingdoms question and waiting for a clever answer. It is more like a small factory. The raw material is classical text. The production steps are extraction, alignment, review, and repair. The finished product is a packet of data that can enter the game runtime.

My simplest mental model now is this: RAG finds the right data, ETL makes that data stable, the review loop lifts quality, and runtime export is what finally sends the result into the game.

The Whole Flow Looks Like This

Do not start with script names. Start with six big steps instead: source text comes in, entity alignment happens, events are organized, review gets routed, repair loops run, and the result is exported into the game. Once that backbone is clear, the tool names stop looking intimidating.

Figure 1: Once you remember the overall line, each script stops feeling scary. It is just one station in the production flow.
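The six-step backbone can be sketched as an ordered list of stations that each take a batch and hand it on. The stage names below are illustrative labels for the six steps, not the project's script names, and each station is a placeholder that only records that it ran.

```python
from typing import Callable, Dict, List

Batch = Dict[str, object]          # hypothetical: a plain dict carried down the line
Stage = Callable[[Batch], Batch]


def make_stage(name: str) -> Stage:
    """Build a placeholder station that appends its name to the batch trail."""
    def stage(batch: Batch) -> Batch:
        trail = list(batch.get("trail", []))
        return {**batch, "trail": trail + [name]}
    return stage


# The six big steps, in order. Names are assumptions for illustration.
STAGE_NAMES = [
    "ingest_source_text",
    "align_entities",
    "organize_events",
    "route_review",
    "run_repair_loop",
    "export_runtime",
]
PIPELINE: List[Stage] = [make_stage(name) for name in STAGE_NAMES]


def run_pipeline(batch: Batch, stages: List[Stage]) -> Batch:
    """Pass the batch through every station in order."""
    for stage in stages:
        batch = stage(batch)
    return batch


result = run_pipeline({"chapterNo": 1}, PIPELINE)
```

Because every station has the same shape, a single stage can be rerun or swapped without touching the others, which is what makes the line replayable.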

Why Do I Keep Stressing Source-Grounded Retrieval?

Because once data can no longer point back to the original text, the whole review step becomes fuzzy. You may feel that the model is saying something plausible, but you no longer know where that statement came from. Once that kind of record enters runtime, the whole system gets harder to repair cleanly.

Figure 2: Bring the data back to the source first, then run review. That way the reviewer and the LLM are not talking to thin air.

What does this layer really solve?

It solves the two problems that most easily poison everything downstream: the same person appearing under many names, and text fragments that look relevant but do not actually mention the target figure.

Why invest here first?

Because if the retrieval key is not normalized first, the event layer, relationship layer, and character context layer all drift together.
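Both problems can be shown with a small sketch: one alias table that maps every surface form to a single canonical key, and a filter that drops fragments which never mention the target. The table contents and the key format are assumptions; a real alias table would be far larger and would work on the original text rather than English names.

```python
from typing import List, Optional

# Hypothetical alias table: every known surface form maps to one canonical key.
ALIASES = {
    "Liu Bei": "liu_bei",
    "Xuande": "liu_bei",    # courtesy name of Liu Bei
    "Guan Yu": "guan_yu",
    "Yunchang": "guan_yu",  # courtesy name of Guan Yu
}


def canonical_key(name: str) -> Optional[str]:
    """Normalize a surface form to its retrieval key, or None if unknown."""
    return ALIASES.get(name.strip())


def mentions(fragment: str, key: str) -> bool:
    """True only if the fragment literally contains an alias of the target."""
    return any(alias in fragment for alias, k in ALIASES.items() if k == key)


def filter_fragments(fragments: List[str], key: str) -> List[str]:
    """Keep only the fragments that actually mention the target figure."""
    return [f for f in fragments if mentions(f, key)]


fragments = [
    "Xuande wept when he heard the news.",
    "The army crossed the river at dawn.",
    "Yunchang rode out alone.",
]
liu_bei_hits = filter_fragments(fragments, "liu_bei")
```

The second fragment might score well in a similarity search about Liu Bei's campaigns, but it names nobody, so it never reaches the event layer under his key.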

The Heaviest Part Is Really the ETL Middle

To me, the most valuable lesson in the source document is not the long script list. It is the reminder that Extract only lifts material out of the corpus. The expensive part is Transform, because that is where the pipeline decides what should go into review, what can move into staging, and what should be sent to the repair queue.

Figure 3: An A item is not the end, and a B item is not a failure. It simply tells you where the current batch should go next.

I like this part of the design a lot because it turns review from a vague "someone looked at it" step into a real flow with outputs. A items move into ready staging. B items enter the repair queue. That is when the pipeline finally starts closing the loop instead of losing track of items after every review round.
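The routing rule itself is small enough to sketch. The item shape (a dict with id and grade) is an assumption, and so is the choice to reject any grade other than A or B rather than guess where it belongs.

```python
from typing import Dict, List, Tuple

Item = Dict[str, str]


def route(items: List[Item]) -> Tuple[List[Item], List[Item]]:
    """Split reviewed items by grade: A goes to ready staging, B to the repair queue."""
    ready_staging: List[Item] = []
    repair_queue: List[Item] = []
    for item in items:
        grade = item["grade"]
        if grade == "A":
            ready_staging.append(item)
        elif grade == "B":
            repair_queue.append(item)
        else:
            # Unknown grades are an error here; silently dropping them would hide work.
            raise ValueError(f"unexpected grade {grade!r} for item {item.get('id')}")
    return ready_staging, repair_queue


reviewed = [
    {"id": "e1", "grade": "A"},
    {"id": "e2", "grade": "B"},
    {"id": "e3", "grade": "A"},
]
ready, repair = route(reviewed)
```

Every item leaves the function in exactly one of the two outputs, which is the property that makes the review round a flow rather than a comment.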

Why keep the LLM as a reviewer?

Because that keeps its instability inside the proposal and review layer instead of letting it write directly into canonical runtime data.
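One way to picture that containment, as a sketch under assumptions: the reviewer is a function that receives a copy of a script-built candidate and returns only a grade. The stub below stands in for an LLM call, and the candidate fields are hypothetical.

```python
from typing import Callable, Dict, List

Candidate = Dict[str, object]
Reviewer = Callable[[Candidate], str]  # returns a grade such as "A" or "B"


def review_batch(candidates: List[Candidate], reviewer: Reviewer) -> List[dict]:
    """Attach the reviewer's grade to each candidate without letting it edit anything."""
    reviewed = []
    for cand in candidates:
        # The reviewer gets a shallow copy, so edits to top-level fields never reach cand.
        grade = reviewer(dict(cand))
        reviewed.append({"candidate": cand, "grade": grade})
    return reviewed


def stub_reviewer(cand: Candidate) -> str:
    """Stand-in for an LLM reviewer; a real one would be a model prompt."""
    cand["location"] = "invented by the reviewer"  # lands on the copy only
    return "A" if cand.get("sourceQuote") else "B"


candidates = [
    {"id": "e1", "sourceQuote": "swore an oath in the peach garden"},
    {"id": "e2", "sourceQuote": ""},
]
results = review_batch(candidates, stub_reviewer)
```

Even when the stub tries to write a location, the canonical candidate stays exactly as the scripts built it. The only thing that crosses the boundary is the grade.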

What is the value of the repair queue?

It turns "we know this part is weak" into "here is the next repair class to schedule." Things like fill_location and repair_relationship_edges stop being vague complaints and become executable work.
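A sketch of how weak records could become schedulable work. The repair class names fill_location and repair_relationship_edges come from the pipeline; the rules that decide when each one applies, and the event and task shapes, are assumptions for illustration.

```python
from collections import Counter
from typing import Dict, List


def repair_classes(event: Dict[str, object]) -> List[str]:
    """Name the repair classes a B-grade event needs (the rules here are assumed)."""
    needed = []
    if not event.get("location"):
        needed.append("fill_location")
    if not event.get("relationships"):
        needed.append("repair_relationship_edges")
    return needed


def build_repair_queue(b_events: List[Dict[str, object]]) -> List[Dict[str, str]]:
    """Expand each event into one task per repair class it needs."""
    return [
        {"eventId": str(ev["id"]), "repairClass": cls}
        for ev in b_events
        for cls in repair_classes(ev)
    ]


b_events = [
    {"id": "e2", "location": "", "relationships": ["liu_bei->guan_yu"]},
    {"id": "e5", "location": "Xuzhou", "relationships": []},
    {"id": "e9", "location": None, "relationships": []},
]
queue = build_repair_queue(b_events)
backlog_by_class = Counter(task["repairClass"] for task in queue)
```

Grouping the backlog by class is what lets the next round be planned as "run fill_location over these events" instead of "look at the weak ones again."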

The Last Leg Is Where the Data Finally Enters the Game

A lot of data projects stop at staging. They look tidy, but they never really enter the product. The practical thing about this line is that it does export runtime-general-profiles, and those profiles are then served through the NPC brain API for the Cocos UI to consume.

Figure 4: For me, the sign that this line is real is that it no longer ends as a pipeline report. The UI is actually using it.

This leg matters because it turns the knowledge line from a research exercise into a runtime lookup layer for the game itself. That is when it starts creating product value instead of just documentation value.
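As a rough sketch of the export step, ready events can be regrouped per general and serialized into one payload. The name runtime-general-profiles comes from the pipeline; the profile schema, the event fields, and the lookup function standing in for the NPC brain API are all assumptions.

```python
import json
from typing import Dict, List, Optional


def build_runtime_profiles(ready_events: List[dict]) -> Dict[str, dict]:
    """Regroup ready events by participant so the runtime can look up one general at a time."""
    profiles: Dict[str, dict] = {}
    for ev in ready_events:
        for general in ev["participants"]:
            profile = profiles.setdefault(general, {"generalId": general, "events": []})
            profile["events"].append({
                "summary": ev["summary"],
                "sourceRef": ev["sourceRef"],  # evidence travels with the profile
                "chapterNo": ev["chapterNo"],
            })
    return profiles


def export_profiles(profiles: Dict[str, dict]) -> str:
    """Serialize the profiles; sorted keys keep reruns byte-for-byte comparable."""
    return json.dumps({"runtime-general-profiles": profiles},
                      ensure_ascii=False, sort_keys=True, indent=2)


def lookup_profile(payload: str, general_id: str) -> Optional[dict]:
    """Hypothetical stand-in for what an API handler would return to the UI."""
    return json.loads(payload)["runtime-general-profiles"].get(general_id)


ready_events = [
    {"summary": "Oath in the peach garden", "participants": ["liu_bei", "guan_yu"],
     "sourceRef": "ch001#p12", "chapterNo": 1},
]
payload = export_profiles(build_runtime_profiles(ready_events))
liu_bei_profile = lookup_profile(payload, "liu_bei")
```

Keeping sourceRef and chapterNo inside each exported event means a wrong line in the game can still be traced back to the text it came from.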

Where Is It Now, and What Has It Already Proven?

If I only look at the current numbers, my read is: the skeleton is already standing, but the repair queue is still heavy. Right now the line has 20,718 resolved mentions, 171 ready events, 1,601 source event packets, and 1,766 repair tasks. That tells me the base production line is alive, but cleaning the backlog still needs more rounds.

What is already working

This is no longer just vector search plus an LLM. Data now returns to source, goes through review, enters staging, and then gets exported as runtime profiles. That is a real knowledge production line, not a chat bot with better memory.

What needs the most work next

The key is not calling larger or flashier models. The key is draining the repair queue so the B backlog actually turns into A items or publishable data.

My current judgment is that this is no longer a proof of concept. It is already a working pipeline that can keep pushing coverage upward. The next question is not whether the method exists, but whether repair velocity can keep up with new intake.
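The velocity question can be framed as simple arithmetic. The backlog of 1,766 repair tasks is the current figure reported above; the per-round repair and intake rates below are made-up placeholders, and the model assumes both rates stay constant from round to round.

```python
import math
from typing import Optional


def rounds_to_drain(backlog: int, repaired_per_round: int, new_per_round: int) -> Optional[int]:
    """Rounds needed to empty the queue, or None if intake keeps pace with repair."""
    net = repaired_per_round - new_per_round
    if net <= 0:
        return None  # the queue never shrinks
    return math.ceil(backlog / net)


CURRENT_BACKLOG = 1766  # repair tasks reported in this article

# Hypothetical rates, for illustration only.
keeping_up = rounds_to_drain(CURRENT_BACKLOG, repaired_per_round=200, new_per_round=50)
falling_behind = rounds_to_drain(CURRENT_BACKLOG, repaired_per_round=50, new_per_round=50)
```

With these placeholder rates the queue shrinks by 150 tasks per round and empties in 12 rounds. If repair only matches intake, it never empties, which is why velocity matters more than model size.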

The Simplest Way I Explain It Now

If I had to summarize this article in the least awkward sentence possible, I would say this: the system is turning Three Kingdoms text from something humans can read into knowledge assets that the game runtime can actually use.

It is not only RAG, and it is not only ETL. RAG finds the right data, ETL stabilizes that data, the review loop repairs and upgrades it, and runtime export is what finally moves it into the product.

So the real message of this third article is not a script inventory. It is a much more practical idea: once data is meant to enter a game, what you need is not a model that answers questions. You need a pipeline that can reliably manufacture knowledge assets.