3KProject Data Pipeline Notes
How I Turned RAG / ETL Into a Knowledge Pipeline That Can Actually Ship
When people hear RAG, they still picture "vector search + an LLM answering questions." What I built in 3KProject feels more like a data factory: start from source text, pull out raw material, then turn people, relationships, and events into assets that can be reviewed, rerun, and finally fed straight into the NPC brain and the Cocos UI.
The One Sentence That Matters Most
The goal of this pipeline is not to let the model directly answer questions. It is to turn raw text into traceable, replayable, rerunnable knowledge artifacts.
In plainer language, this system is not about asking AI a Three Kingdoms question and waiting for a clever answer. It is more like a small factory. The raw material is classical text. The production steps are extraction, alignment, review, and repair. The finished product is a packet of data that can enter the game runtime.
The Whole Flow Looks Like This
Do not start with script names. Start with six big steps instead: source text comes in, entity alignment happens, events are organized, review gets routed, repair loops run, and the result is exported into the game. Once that backbone is clear, the tool names stop looking intimidating.
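To make that backbone concrete, here is a minimal sketch of the six steps as an ordered chain of stages. The stage names are illustrative labels I made up for this example, not the project's real script names, and each stage is a placeholder that only records that it ran.

```python
from typing import Callable, Dict, List

Record = Dict[str, object]
Stage = Callable[[List[Record]], List[Record]]

# Hypothetical labels for the six steps in the text; not real script names.
STAGE_NAMES = [
    "ingest_source",
    "align_entities",
    "organize_events",
    "route_review",
    "run_repair_loop",
    "export_runtime",
]


def make_stage(name: str) -> Stage:
    """Build a placeholder stage that only appends its name to each record's trace."""

    def stage(records: List[Record]) -> List[Record]:
        return [{**r, "trace": [*r.get("trace", []), name]} for r in records]

    return stage


PIPELINE: List[Stage] = [make_stage(name) for name in STAGE_NAMES]


def run_pipeline(records: List[Record], stages: List[Stage] = PIPELINE) -> List[Record]:
    """Run every stage in order, feeding each stage's output into the next."""
    for stage in stages:
        records = stage(records)
    return records
```

Calling `run_pipeline([{"text": "raw passage"}])` returns one record whose `trace` lists the six stage names in order. The point of the shape is that each step has one input and one output, so any step can be rerun on its own.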
Why Do I Keep Stressing Source-Grounded Retrieval?
Because once data can no longer point back to the original text, the whole review step becomes fuzzy. You may feel that the model is saying something plausible, but you no longer know where that statement came from. Once that kind of record enters runtime, the whole system gets harder to repair cleanly.
What does this layer really solve?
It solves the two problems that most easily poison everything downstream: the same person appearing under many names, and text fragments that look relevant but do not actually mention the target figure.
Why invest here first?
Because if the retrieval key is not normalized first, the event layer, relationship layer, and character context layer all drift together.
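A minimal sketch of that idea, assuming a hand-written alias table and plain substring matching. The canonical ids, the `source_ref` field, and the sample fragments are all hypothetical; the real pipeline's alignment is certainly more involved than this.

```python
from typing import Dict, List, Set

# Hypothetical alias table: every surface name maps to one canonical id.
ALIASES: Dict[str, str] = {
    "Cao Cao": "cao_cao",
    "Mengde": "cao_cao",
    "Liu Bei": "liu_bei",
    "Xuande": "liu_bei",
}


def mentioned_ids(text: str) -> Set[str]:
    """Return the canonical ids of every figure whose alias appears in the text."""
    return {canonical for alias, canonical in ALIASES.items() if alias in text}


def grounded_fragments(target_id: str, fragments: List[dict]) -> List[dict]:
    """Keep only fragments that actually mention the target figure.

    Each fragment keeps its source_ref, so a kept record can still
    point back to the original text during review.
    """
    return [f for f in fragments if target_id in mentioned_ids(f["text"])]


fragments = [
    {"source_ref": "frag-1", "text": "Mengde led the army north."},
    {"source_ref": "frag-2", "text": "Xuande stayed in the city."},
    {"source_ref": "frag-3", "text": "The army crossed the river at dawn."},
]
```

With these samples, `grounded_fragments("cao_cao", fragments)` keeps only `frag-1`: the second fragment mentions a different figure, and the third looks topically related but names nobody. Normalizing to one key first means the event, relationship, and character context layers all look up the same id.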
The Heaviest Part Is Really the ETL Middle
To me, the most valuable lesson in the source document is not the long script list. It is the reminder that Extract only lifts material out of the corpus. The expensive part is Transform, because that is where the pipeline decides what should go into review, what can move into staging, and what should be sent to the repair queue.
I like this part of the design a lot because it turns review from a vague "someone looked at it" step into a real flow with outputs. A items move into ready staging. B items enter the repair queue. That is when the pipeline finally starts closing the loop instead of scattering after every review round.
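The routing rule can be sketched in a few lines. This assumes each reviewed item carries a `grade` field, which is my own naming; the text only describes A and B outcomes, so anything else is simply set aside here as unrouted.

```python
from typing import Dict, List


def route_reviewed(items: List[dict]) -> Dict[str, List[dict]]:
    """Split reviewed items into explicit outputs instead of a vague 'looked at' state."""
    routed: Dict[str, List[dict]] = {
        "ready_staging": [],
        "repair_queue": [],
        "unrouted": [],
    }
    for item in items:
        grade = item.get("grade")
        if grade == "A":
            routed["ready_staging"].append(item)
        elif grade == "B":
            routed["repair_queue"].append(item)
        else:
            # Assumption: grades other than A/B are not covered by the text.
            routed["unrouted"].append(item)
    return routed
```

Every review round then produces two concrete lists, one that can move forward and one that feeds the next repair pass.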
Why keep the LLM as a reviewer?
Because that keeps its instability inside the proposal and review layer instead of letting it write directly into canonical runtime data.
What is the value of the repair queue?
It turns "we know this part is weak" into "here is the next repair class to schedule." Once a weakness is filed under a repair class such as fill_location or repair_relationship_edges, it stops being a vague complaint and becomes executable work.
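As a rough sketch, a repair queue organized by class might look like this. The `repair_class` field name and the "largest class first" scheduling policy are assumptions made for illustration; only the two class names come from the pipeline itself.

```python
from collections import Counter
from typing import Dict, List, Optional


def group_by_class(tasks: List[dict]) -> Dict[str, List[dict]]:
    """Bucket repair tasks by their repair class."""
    groups: Dict[str, List[dict]] = {}
    for task in tasks:
        groups.setdefault(task["repair_class"], []).append(task)
    return groups


def next_repair_class(tasks: List[dict]) -> Optional[str]:
    """Pick the class with the most open tasks (an assumed policy), or None if empty."""
    counts = Counter(task["repair_class"] for task in tasks)
    if not counts:
        return None
    return counts.most_common(1)[0][0]


tasks = [
    {"event_id": "e2", "repair_class": "fill_location"},
    {"event_id": "e5", "repair_class": "repair_relationship_edges"},
    {"event_id": "e7", "repair_class": "fill_location"},
]
```

With these sample tasks, `next_repair_class(tasks)` picks `fill_location`, since it has two open tasks against one. Whatever the real policy is, the useful property is the same: the backlog becomes a list of named work types that can be scheduled.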
The Last Leg Is Where the Data Finally Enters the Game
A lot of data projects stop at staging. They look tidy, but they never really enter the product. What makes this pipeline practical is that it actually exports runtime-general-profiles, and those profiles are then served through the NPC brain API for the Cocos UI to consume.
This leg matters because it turns the knowledge line from a research exercise into a runtime lookup layer for the game itself. That is when it starts creating product value instead of just documentation value.
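Here is a minimal sketch of what that export step could look like, assuming ready events carry `participants`, `summary`, and `source_ref` fields. That schema is hypothetical; the actual layout of runtime-general-profiles and the NPC brain API are not shown here.

```python
import json
from typing import Dict, List


def build_profiles(ready_events: List[dict]) -> Dict[str, dict]:
    """Fold ready events into one profile per character, keeping source references."""
    profiles: Dict[str, dict] = {}
    for event in ready_events:
        for character_id in event["participants"]:
            profile = profiles.setdefault(character_id, {"id": character_id, "events": []})
            profile["events"].append(
                {"summary": event["summary"], "source_ref": event["source_ref"]}
            )
    return profiles


def export_profiles(profiles: Dict[str, dict]) -> str:
    """Serialize profiles to JSON text that a runtime lookup layer could load."""
    return json.dumps(profiles, ensure_ascii=False, sort_keys=True, indent=2)


ready_events = [
    {"participants": ["cao_cao", "liu_bei"], "summary": "A meeting", "source_ref": "frag-9"},
    {"participants": ["cao_cao"], "summary": "A march north", "source_ref": "frag-1"},
]
```

Because every exported event still carries its `source_ref`, a profile that looks wrong in the game can be traced back to the text that produced it.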
Where Is It Now, and What Has It Already Proven?
If I only look at the current numbers, my read is that the skeleton is standing, but the repair queue is still heavy. Right now the pipeline has 20,718 resolved mentions, 171 ready events, 1,601 source event packets, and 1,766 repair tasks. That tells me the base production line is alive, but clearing the backlog will take more rounds.
What is already working
This is no longer just vector search plus an LLM. Data now traces back to its source text, goes through review, enters staging, and then gets exported as runtime profiles. That is a real knowledge production line, not a chatbot with better memory.
What needs the most work next
The key is not calling larger or louder models. The key is draining the repair queue so the B backlog actually turns into A items or publishable data.
The Simplest Way I Explain It Now
If I had to summarize this article in the least awkward sentence possible, I would say this: the system is turning Three Kingdoms text from something humans can read into knowledge assets that the game runtime can actually use.
It is not only RAG, and it is not only ETL. RAG finds the right data, ETL stabilizes that data, the review loop repairs and upgrades it, and runtime export is what finally moves it into the product.
So the real message of this third article is not a script inventory. It is a much more practical idea: once data is meant to enter a game, what you need is not a model that answers questions. You need a pipeline that can reliably manufacture knowledge assets.