Why 13 Models? Going Beyond Standard BKT in a Serious Game | Sibel Kaçar
March 6, 2026
When I started designing the learning engine for AfetAkademi (a serious game for ages 8-16), I knew I wanted to use Bayesian Knowledge Tracing. BKT is elegant, interpretable, and grounded in decades of research going back to Corbett & Anderson (1995).
"But the more I thought about what I actually needed to model, the more I realized: one model per skill wasn't going to be enough."
This is the story of why I ended up with 13.
What Standard BKT Does (and Doesn't Do)
If you're not familiar with BKT, here's the core idea. For each skill a student is learning, you maintain a single probability: **P(L)** — the probability that the student has learned the skill. Every time the student attempts the skill, you update this probability using four parameters:
- **P(L₀)**: prior probability of knowing the skill
- **P(T)**: probability of learning from a single attempt
- **P(G)**: probability of guessing correctly without knowing
- **P(S)**: probability of slipping (making an error despite knowing)
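Applied to a single attempt, these four parameters drive a two-step update: a Bayes step conditioned on whether the answer was correct, then a learning transition. Here's a minimal sketch in Python (the function name and the idea of passing parameters per call are mine; the formulas are the standard Corbett & Anderson update):

```python
def bkt_update(p_l: float, correct: bool, p_t: float, p_g: float, p_s: float) -> float:
    """One BKT step: condition P(L) on the observation, then apply the learning transition."""
    if correct:
        # P(L | correct): knew it and didn't slip, vs. guessed without knowing
        posterior = p_l * (1 - p_s) / (p_l * (1 - p_s) + (1 - p_l) * p_g)
    else:
        # P(L | incorrect): knew it but slipped, vs. didn't know and failed to guess
        posterior = p_l * p_s / (p_l * p_s + (1 - p_l) * (1 - p_g))
    # Learning transition: the attempt itself is one more chance to learn
    return posterior + (1 - posterior) * p_t
```

For example, `bkt_update(0.20, True, p_t=0.30, p_g=0.10, p_s=0.20)` returns about 0.77 — a single correct answer moves the estimate a long way when guessing is assumed rare.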
Clean, interpretable, and computationally lightweight. For a math tutoring system where every problem maps cleanly to a skill, this works beautifully.
But AfetAkademi isn't a math tutor.
The Problem: One Number Isn't Enough
In AfetAkademi, a child plays as one of four characters — Kaya (physical strength), Zeki (analysis), Toprak (speed), and Cıvıltı (exploration) — and navigates disaster scenarios like earthquakes, fires, and floods. Each character has a unique ability that maps to a different problem type.
When a player completes an objective, what exactly have they demonstrated?
- That they know *which character* to use? (Character knowledge)
- That they understand *why* this skill applies here? (Conceptual knowledge)
- That they acted *at the right moment*? (Timing)
- That they followed the *correct sequence* of steps? (Order)
- That they did it *efficiently*, without unnecessary attempts? (Efficiency)
- That they *hesitated* before acting, suggesting low confidence? (Confidence)
- That they *asked for help* — and if so, how many times? (Help-seeking)
- That they made the *same mistake* they made last session? (Error patterns)
A single P(L) value collapses all of this into one number. You lose the texture of learning. You lose the ability to intervene precisely. I needed a system that could answer not just *"does this child know the skill?"* but *"how does this child know it, and where exactly are the gaps?"*
The Architecture: 3 Foundation Models + 10 Specialist Models
I designed a 13-model BKT architecture organized in two layers.
Layer 1: Foundation Models (1–3)
These track broad knowledge states and update on every single objective completion.
| # | Model | Key | What it tracks |
|---|-------|-----|----------------|
| 1 | Character | `characterID` | Does the player understand when to use this character? |
| 2 | ProblemType | `Analysis`, `PhysicalPower`, `Speed`, `Exploration` | Does the player understand this class of problem? |
| 3 | CharacterProblem | `characterID_problemType` | The intersection: the most specific foundation signal |
These three models give me a broad picture fast. If a child's Character model for `zeki` is low but their ProblemType model for `Analysis` is high, that's a very specific signal: they understand analysis as a concept but haven't internalized that Zeki is the right tool for it.
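As a sketch of how that keying might look in code (the structure and the `bkt_step` callback are illustrative, not the actual implementation):

```python
from collections import defaultdict

PRIOR = 0.20  # illustrative default; the post calibrates priors by age group

foundation = {
    "Character":        defaultdict(lambda: PRIOR),  # keyed by characterID, e.g. "zeki"
    "ProblemType":      defaultdict(lambda: PRIOR),  # keyed by type, e.g. "Analysis"
    "CharacterProblem": defaultdict(lambda: PRIOR),  # keyed by "characterID_problemType"
}

def on_objective_complete(character_id, problem_type, correct, bkt_step):
    """Every objective completion updates all three foundation models at once."""
    for model, key in (("Character", character_id),
                       ("ProblemType", problem_type),
                       ("CharacterProblem", f"{character_id}_{problem_type}")):
        foundation[model][key] = bkt_step(foundation[model][key], correct)
```

The point of the combined key is exactly the diagnostic described above: `zeki` can be low while `Analysis` is high, and the intersection tells you which.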
Layer 2: Specialist Models (4–13)
These are indexed by `objectiveID` — each objective gets its own set of specialist models. They're fed by different systems in the game, not just objective completion.
| # | Model | Fed by | What it tracks |
|---|-------|--------|----------------|
| 4 | MetaCognition | Socratic dialogue, character selection accuracy | Planning and self-monitoring |
When all 13 models have data, the system computes a weighted multi-factor score:
| Model | Weight |
|-------|--------|
| CharacterProblem (foundation) | 15% |
| ConceptualKnowledge | 20% |
| MetaCognition | 20% |
| Efficiency | 10% |
| Timing, Order, Transfer, Retention | 5% each |
| Confidence, HelpSeeking, ErrorPattern | 5% each |
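The blend itself is just a weighted average of each model's P(L). A minimal sketch, using the model names from the table above (the dict layout is mine):

```python
# Weights from the table above; they sum to 100%.
WEIGHTS = {
    "CharacterProblem": 0.15, "ConceptualKnowledge": 0.20, "MetaCognition": 0.20,
    "Efficiency": 0.10,
    "Timing": 0.05, "Order": 0.05, "Transfer": 0.05, "Retention": 0.05,
    "Confidence": 0.05, "HelpSeeking": 0.05, "ErrorPattern": 0.05,
}

def mastery_score(p_l: dict) -> float:
    """Weighted multi-factor score; assumes every listed model has a P(L) value."""
    return sum(weight * p_l[model] for model, weight in WEIGHTS.items())
```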
This score isn't just a grade. It's a diagnostic fingerprint. A child with 90% ConceptualKnowledge but 30% Confidence is a very different learner than one with 60% ConceptualKnowledge and 85% Confidence. The system treats them differently — and so does BİLGE, the Socratic AI mentor.
A Concrete Example
Let's say an 11-year-old (Middle age group: P(L₀)=0.20, P(T)=0.30, P(G)=0.10, P(S)=0.20) plays as Zeki and correctly completes an analysis objective on the first try.
The specialist models don't have data yet for this objective — they'll accumulate as the child plays more. Until then, the system uses the foundation score as a fallback.
One correct attempt, and the prior of 0.20 jumps to roughly 0.8. The system is responsive by design — because in a game, you don't have 50 attempts per skill. You have 3 or 4. The parameters have to reflect that.
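For the curious, here is the arithmetic behind that jump using the textbook update (the in-game implementation may round or tune this slightly):

```python
p_l, p_t, p_g, p_s = 0.20, 0.30, 0.10, 0.20  # Middle age group parameters

# Bayes step after a correct first attempt:
posterior = p_l * (1 - p_s) / (p_l * (1 - p_s) + (1 - p_l) * p_g)
# = 0.16 / (0.16 + 0.08) ≈ 0.67

# Learning transition:
p_next = posterior + (1 - posterior) * p_t
# ≈ 0.67 + 0.33 * 0.30 ≈ 0.77, i.e. roughly 0.8 after a single correct attempt
```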
Why This Matters for Disaster Education Specifically
Standard BKT was designed for procedural knowledge in tutoring systems — math steps, grammar rules, things with clear right/wrong answers. Disaster preparedness is different.
Knowing *that* you should move away from windows during an earthquake is declarative knowledge. Knowing *when* to move, *which character's ability* to use, *in what order* to act, and doing it *without freezing* — that's behavioral calibration under uncertainty.
The 13-model architecture lets me separate these dimensions. A child who always knows the right answer but consistently hesitates for 8 seconds before acting has a Confidence problem, not a ConceptualKnowledge problem. Those require different interventions.
BİLGE — the game's Socratic AI mentor — uses this diagnostic profile to decide what kind of support to offer: an idle hint, a Socratic question, or a direct instruction with visual highlighting of the target object.
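The decision logic might look something like this (the thresholds and the two-signal simplification are my illustration, not BİLGE's actual policy):

```python
def choose_support(conceptual: float, confidence: float) -> str:
    """Hypothetical triage: pick the least intrusive support that fits the profile."""
    if conceptual < 0.4:
        # Genuine knowledge gap: instruct directly and highlight the target object
        return "direct_instruction"
    if confidence < 0.4:
        # Knows the material but hesitates: a Socratic question invites commitment
        return "socratic_question"
    # Doing fine: at most a gentle nudge if the player goes idle
    return "idle_hint"
```

This is where the diagnostic fingerprint pays off: the confident-but-confused child and the knowledgeable-but-hesitant child get different interventions from the same function.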
What I Learned
Building this system taught me something I didn't expect: **the hardest part wasn't the math. It was deciding what to measure.**
Every model in the architecture represents a decision about what matters in learning. MetaCognition matters because self-monitoring predicts transfer. Confidence matters because hesitation is a behavioral signal that pure accuracy data misses. ErrorPattern matters because recurring mistakes suggest a systematic misconception, not random noise.
If you're building an adaptive learning system — game or otherwise — I'd encourage you to ask this question before you write a single line of code:
What does it actually mean to know this?
The answer is almost always more complex than a single probability.
Next in this series: The Piaget Connection — How I calibrated BKT parameters by age group using developmental psychology.
AfetAkademi is an adaptive serious game for disaster preparedness education, currently in thesis development. Built solo in Unity 6 with Firebase backend.