# NEXUS OS — Field Notes

## The Build Small Hackathon · Track: Thousand Token Wood

**What we built:** A tiny digital civilization where five AI agents — each with a distinct personality, voice, and purpose — live together as citizens of a city. The user is the Mayor. Problems go before the Council.

**Why it matters:** It's not a chatbot. It's a self-reviewing, confidence-calibrated decision system wrapped in a theatrical framing. The AI doesn't assist — it IS the experience.

---

## 1. The Concept

### What makes a problem "real" in Thousand Token Wood?

The track guidelines say: "Build something delightful that wouldn't exist without AI."

Most submissions interpret this as "use AI to generate content." We went deeper. What if the AI *is* the content? What if the structure of how models interact — routing, specialization, review, synthesis — is the storytelling?

We created five characters. Not in the traditional sense of making the model roleplay. Each character is a *function* in a multi-agent system, but we gave them names, backstories, and voices that map to their actual computational role:

- **Sibyl (Oracle)** → Router with confidence calibration
- **Ada (Architect)** → Planner with dependency analysis
- **Thoth (Sage)** → Researcher with trade-off analysis
- **Hepha (Smith)** → Coder with error handling
- **Nyx (Keeper)** → Critic/reviewer with scoring

The personalities aren't decoration. They're the architecture. And the architecture IS the story.

### Why a city?

We tried several metaphors (operating system, company, team) but "city" felt right for Thousand Token Wood. A city has:
- Citizens with roles (the agents)
- A council chamber (the DAG)
- A mayor (the user)
- Chronicles (the trace log)
- Decrees (the final output)

It's small enough to feel cozy, big enough to feel alive. The framing makes interacting with a multi-agent system feel like visiting a tiny world rather than using a tool.

---

## 2. Architecture Decisions

### MacNet DAG Topology (arXiv:2406.07155)

MacNet demonstrated that Directed Acyclic Graph (DAG) topologies outperform chain and star topologies for multi-agent reasoning tasks. We implemented a simplified DAG:

```
Input → Sibyl (Router) → Specialist → Nyx (Reviewer) → Council (Synthesis)
```

The key insight from MacNet: nodes should have clear, non-overlapping responsibilities, and information should flow forward through the graph. Our one exception, the revision loop after a rejection, is a deliberate CMAT-inspired actor-critic cycle, so the pipeline is acyclic on the happy path rather than a strict DAG.
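As a rough illustration of the forward flow above, here is a minimal sketch of the four stages wired together. The function name `run_council`, the stage signatures, and the returned dictionary keys are all assumptions made for this example, not the project's actual code.

```python
from typing import Callable, Dict

# Hypothetical stage signatures; each would wrap a model call in a real system.
Router = Callable[[str], str]                 # problem -> route name
Specialist = Callable[[str], str]             # problem -> draft answer
Reviewer = Callable[[str, str], str]          # (problem, draft) -> review text
Synthesizer = Callable[[str, str, str], str]  # (problem, draft, review) -> decree


def run_council(
    problem: str,
    router: Router,
    specialists: Dict[str, Specialist],
    reviewer: Reviewer,
    synthesizer: Synthesizer,
) -> Dict[str, str]:
    """Run one forward pass: router -> specialist -> reviewer -> synthesis."""
    route = router(problem)
    if route not in specialists:
        raise KeyError(f"no specialist registered for route {route!r}")
    draft = specialists[route](problem)
    review = reviewer(problem, draft)
    decree = synthesizer(problem, draft, review)
    # Returning every intermediate result keeps the run traceable.
    return {"route": route, "draft": draft, "review": review, "decree": decree}
```

Each stage only sees the outputs of earlier stages, which is what keeps responsibilities from overlapping.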

### COREA Confidence-Calibrated Routing (arXiv:2603.03752)

COREA showed that using a small model to route and only invoking larger models for complex cases can save ~83% of compute without quality loss. We implemented this with:
- 7B model handles routing (Sibyl), review (Nyx), and synthesis (Council)
- 14B model only invoked for specialist work when confidence < 85%
- This saves ~8.4GB VRAM per simple query, since the 14B model stays cold

This directly enables running on A10G (24GB) without CPU offloading.
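The escalation rule can be sketched in a few lines. The 0.85 threshold comes from the notes; the `RoutingDecision` type, the `pick_model` helper, and the tier labels `"7b"` and `"14b"` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass

CONFIDENCE_THRESHOLD = 0.85  # from the notes: escalate when confidence < 85%


@dataclass
class RoutingDecision:
    route: str
    confidence: float  # expected range 0.0 to 1.0


def pick_model(decision: RoutingDecision, threshold: float = CONFIDENCE_THRESHOLD) -> str:
    """Return the model tier that should handle the specialist step."""
    if not 0.0 <= decision.confidence <= 1.0:
        raise ValueError(f"confidence out of range: {decision.confidence}")
    # Low confidence means the problem looks hard, so escalate to the larger model.
    return "14b" if decision.confidence < threshold else "7b"
```

A decision exactly at the threshold stays on the small model, matching the strict `<` in the rule above.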

### CMAT Actor-Critic Loops (arXiv:2404.01663)

When Nyx's review verdict is "reject" or "revise," the specialist citizen receives the feedback and produces a revised answer. This actor-critic pattern limits error accumulation in the chain. The CMAT paper demonstrated this approach on multi-agent tuning datasets and reported significant quality improvements over single-pass generation.
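A minimal sketch of that revision loop follows. The "reject" and "revise" verdicts come from the notes; the "approve" verdict, the `max_rounds` cap, and the function name `revise_until_approved` are assumptions added so the loop is guaranteed to terminate.

```python
from typing import Callable, Optional, Tuple

# Hypothetical signatures: the specialist sees the problem plus the last feedback
# (None on the first pass); the reviewer returns (verdict, feedback).
Specialist = Callable[[str, Optional[str]], str]
Reviewer = Callable[[str, str], Tuple[str, str]]


def revise_until_approved(
    problem: str,
    specialist: Specialist,
    reviewer: Reviewer,
    max_rounds: int = 2,
) -> Tuple[str, str]:
    """Return (final_draft, final_verdict) after at most max_rounds revisions."""
    if max_rounds < 0:
        raise ValueError("max_rounds must be non-negative")
    feedback: Optional[str] = None
    draft, verdict = "", "reject"
    # One initial pass plus up to max_rounds revisions.
    for _ in range(max_rounds + 1):
        draft = specialist(problem, feedback)
        verdict, feedback = reviewer(problem, draft)
        if verdict == "approve":
            break
    return draft, verdict
```

Capping the rounds is a design choice for this sketch: it bounds latency, and the caller can still inspect the final verdict to decide whether to surface an unapproved draft.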

---

## 3. Model Selection

| Role | Model | Size | Quant | Why |
|------|-------|------|-------|-----|
| Fast agents (Sibyl, Nyx, Council) | Qwen2.5-7B-Instruct | 7B | Q4_K_M | Fast routing, JSON output, review, synthesis |
| Specialist (Ada, Thoth, Hepha) | Qwen2.5-14B-Instruct | 14B | Q4_K_M | Deep reasoning, complex tasks |

**Why Qwen2.5?** At Q4_K_M quantization:
- Qwen2.5-7B: ~4.5GB VRAM, excellent JSON compliance, fast inference
- Qwen2.5-14B: ~8.4GB VRAM, strong reasoning, good instruction following
- Combined: ~13GB VRAM → fits A10G 24GB with room for context
- Both ≤32B → hackathon compliant
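The memory budget above is simple arithmetic, shown here as a small sketch. The footprints are the approximate figures quoted in these notes and exclude the KV cache; the helper names are hypothetical.

```python
# Approximate Q4_K_M footprints quoted in the notes, in GB (KV cache not included).
VRAM_GB = {"qwen2.5-7b": 4.5, "qwen2.5-14b": 8.4}
A10G_GB = 24.0


def resident_vram(models):
    """Total weight footprint of the models currently loaded."""
    return sum(VRAM_GB[name] for name in models)


def headroom(models, budget=A10G_GB):
    """VRAM left over for context once the given models are resident."""
    return budget - resident_vram(models)


if __name__ == "__main__":
    both = ["qwen2.5-7b", "qwen2.5-14b"]
    print(f"both models: {resident_vram(both):.1f} GB, headroom {headroom(both):.1f} GB")
    print(f"7B only:     {resident_vram(both[:1]):.1f} GB, headroom {headroom(both[:1]):.1f} GB")
```

With both models loaded this gives about 12.9 GB resident and 11.1 GB of headroom; with only the 7B model loaded, about 4.5 GB resident and 19.5 GB of headroom.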

**Why not one larger model?** A single 32B model would be ~18GB at Q4, but we would lose:
1. Confidence-calibrated routing (always pay max compute)
2. Reviewer independence (self-review is unreliable)
3. The "city" metaphor (one model can't be multiple citizens)

**Why GGUF + llama.cpp?**
- Runs entirely locally (Off the Grid badge)
- No PyTorch dependency (smaller cold start)
- Better memory efficiency than Transformers+BitsAndBytes
- Explicit `Llama.from_pretrained()` calls (Llama Champion badge)
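One way to keep the 14B model cold until it is needed is a lazy pool around `Llama.from_pretrained()`. This is a sketch under stated assumptions: the repo ids, filename patterns, context size, and the `LazyModelPool` class are illustrative and not taken from the project. If a GGUF repo ships sharded files, the glob may match more than one file and an exact filename would be needed instead.

```python
from typing import Any, Callable, Dict, List

# Assumed repo ids and filename patterns; verify against the actual GGUF repos.
MODEL_SPECS: Dict[str, Dict[str, str]] = {
    "7b": {"repo_id": "Qwen/Qwen2.5-7B-Instruct-GGUF", "filename": "*q4_k_m*.gguf"},
    "14b": {"repo_id": "Qwen/Qwen2.5-14B-Instruct-GGUF", "filename": "*q4_k_m*.gguf"},
}


def llama_loader(tier: str) -> Any:
    """Load one tier with llama-cpp-python (imported lazily so this file imports without it)."""
    from llama_cpp import Llama

    spec = MODEL_SPECS[tier]
    return Llama.from_pretrained(
        repo_id=spec["repo_id"],
        filename=spec["filename"],
        n_gpu_layers=-1,  # offload all layers to the GPU
        n_ctx=4096,       # assumed context size
        verbose=False,
    )


class LazyModelPool:
    """Load each model on first use and reuse it afterwards."""

    def __init__(self, loader: Callable[[str], Any] = llama_loader) -> None:
        self._loader = loader
        self._loaded: Dict[str, Any] = {}

    def get(self, tier: str) -> Any:
        if tier not in self._loaded:
            self._loaded[tier] = self._loader(tier)
        return self._loaded[tier]

    def loaded_tiers(self) -> List[str]:
        return sorted(self._loaded)
```

Injecting the loader keeps the pool testable without a GPU: a stub loader can stand in for the real model during development.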

---

## 4. Prompt Engineering

Each citizen has a system prompt with:
1. **Personality framing** — establishes voice and role
2. **Task instruction** — what to do with the problem
3. **JSON schema constraint** — forces structured output

Example: Sibyl
```
You are SIBYL, the City Oracle. You see patterns others miss.
Analyze the Mayor's problem. Route it. Rate confidence (0.0-1.0).
Respond ONLY in JSON: {"confidence": float, "route": string, "prophecy": string, "reasoning": string}
```

The "prophecy" field is what makes the experience delightful — it's not required for routing, but it creates the theatrical framing. Sibyl doesn't just return a route, she reads the omens.
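Even with a "JSON only" instruction, a reply may arrive wrapped in extra prose, so it helps to validate before routing on it. The sketch below checks the four fields from Sibyl's schema; the function name `parse_sibyl_reply` and the choice to clamp confidence into the 0.0 to 1.0 range are assumptions for illustration.

```python
import json
from typing import Any, Dict

# Field names come from Sibyl's prompt; the accepted types are an assumption.
REQUIRED_FIELDS = {
    "confidence": (int, float),
    "route": str,
    "prophecy": str,
    "reasoning": str,
}


def parse_sibyl_reply(raw: str) -> Dict[str, Any]:
    """Extract and validate the JSON object in a model reply, or raise ValueError."""
    # Take the outermost braces in case the model added prose around the JSON.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in reply")
    data = json.loads(raw[start : end + 1])  # JSONDecodeError is a ValueError
    for name, expected in REQUIRED_FIELDS.items():
        if name not in data:
            raise ValueError(f"missing field: {name}")
        value = data[name]
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            raise ValueError(f"wrong type for field: {name}")
    # Clamp confidence so downstream threshold checks stay well defined.
    data["confidence"] = min(1.0, max(0.0, float(data["confidence"])))
    return data
```

On a `ValueError`, a caller could retry the prompt or fall back to a default route; which of those fits best depends on the latency budget.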

---

## 5. Frontend Design

We built a custom HTML/CSS/JS dashboard served through `gradio.Server`. Key design decisions:

- **Dark theater theme** — purple/indigo palette creates a moody, cinematic feel
- **Citizen sidebar** — each citizen is a card with emoji, name, title, voice, and live status dot
- **Step flow** — numbered steps that animate from pending → current → done
- **Prophecy banner** — Sibyl's prophecy appears in a styled callout after routing
- **City Chronicle** — real-time trace log on the right panel
- **No default Gradio components** — the entire UI is custom (Off-Brand badge)

The dashboard communicates through FastAPI endpoints exposed via `gradio.Server()`, so MCP integration is preserved when the Space runs as an MCP server.

---

## 6. What We Learned

### Small models are more than enough for structured multi-agent work

The entire system runs on 21B total parameters (7B+14B). With confidence-calibrated routing, most interactions only use the 7B model. Yet the outputs are reviewed, structured, and traceable — qualities you usually associate with much larger models.

### Personality in system prompts matters

We tried the same routing/review/synthesis pipeline with generic prompts ("You are a helpful assistant..."), and the structured JSON outputs were less reliable. When we added personality framing ("You are SIBYL, the City Oracle. You see patterns others miss..."), compliance with the JSON schema improved noticeably. Our working theory is that the model "commits" to the role, which improves instruction following.

### The theatrical framing IS the value

This started as a Backyard AI entry — a practical multi-agent tool. The pivot to Thousand Token Wood was the right call. The agent city metaphor turns a functional system into something people want to show their friends. That's the entire point of this track.

### GGUF model preloading is critical

`preload_from_hub` in README metadata ensures models are downloaded before the first user arrives. Without it, cold starts take 2-3 minutes. With it, the first inference completes in seconds.

---

## 7. What We'd Do With More Time

- **City economy** — resource management where citizens earn "trust tokens" based on Nyx's review scores, shaping future routing decisions
- **Prophecy generator** — Sibyl generates a unique poetic prophecy for each session, influencing the Council's tone
- **City growth** — add more citizens (Debugger, Diplomat, Artist) as the system encounters new problem types
- **Memory** — persistent city state across sessions via storage bucket
- **Voice synthesis** — each citizen speaks their part using a small TTS model

---

*Built in 2 days for the Build Small Hackathon by @specimba. With thanks to the Gradio and HF teams for a legendary event.*