A Coding Tutorial on OpenMythos: Recurrent-Depth Transformers with Depth Extrapolation, Adaptive Computation, and Mixture-of-Experts Routing

In this tutorial, we explore the implementation of OpenMythos, a theoretical reconstruction of the Claude Mythos architecture that enables deeper reasoning through iterative computation rather than increased parameter size. We build and analyze models using both GQA and MLA attention mechanisms, examine memory efficiency through KV-cache comparisons, and validate stability via the spectral properties of the recurrent update. We then train the model on a structured parity task and investigate how increasing loop depth at inference improves performance without retraining.

Along the way, we also inspect adaptive computation via ACT halting and monitor expert utilization in the MoE layers, giving a hands-on view of how the pieces of this architecture fit together.

    import subprocess, sys

    # Install open-mythos on first run if it is not already available.
    try:
        import open_mythos  # noqa: F401
    except ImportError:
        subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", "open-mythos"])

    import math, time, copy
    from collections import Counter, defaultdict

    import numpy as np
    import torch, torch.nn as nn, torch.nn.functional as F
    import matplotlib.pyplot as plt

    from open_mythos.main import (
        OpenMythos, MythosConfig,
        ACTHalting, MoEFFN,
    )

    # Fix the seeds so the printed numbers are reproducible.
    torch.manual_seed(0); np.random.seed(0)
    device = "cuda" if torch.cuda.is_available() else "cpu"
    print(f"▸ device = {device} | torch = {torch.__version__}")


    def make_config(attn_type: str, *, dim=128, n_heads=4, n_experts=4,
                    max_loops=8, seq_len=128, vocab=256):
        # Settings shared by both attention variants.
        base = dict(
            vocab_size=vocab, dim=dim, n_heads=n_heads,
            max_seq_len=seq_len, max_loop_iters=max_loops,
            prelude_layers=1, coda_layers=1,
            n_experts=n_experts, n_shared_experts=1,
            n_experts_per_tok=2, expert_dim=dim // 2,
            lora_rank=8, attn_type=attn_type,
        )
        if attn_type == "gqa":
            # GQA: 4 query heads share 2 key/value heads.
            return MythosConfig(**base, n_kv_heads=2)
        # MLA: low-rank query/KV projections plus separate RoPE and non-RoPE head dims.
        return MythosConfig(
            **base, n_kv_heads=n_heads,
            kv_lora_rank=32, q_lora_rank=64,
            qk_rope_head_dim=16, qk_nope_head_dim=16, v_head_dim=16,
        )


    cfg_gqa = make_config("gqa")
    cfg_mla = make_config("mla")
    m_gqa = OpenMythos(cfg_gqa).to(device)
    m_mla = OpenMythos(cfg_mla).to(device)

    print("\n─── Part 1 ─ model sizes ──────────────────────────────")
    print(f"GQA params : {sum(p.numel() for p in m_gqa.parameters()):>10,}")
    print(f"MLA params : {sum(p.numel() for p in m_mla.parameters()):>10,}")

We install and import all required dependencies and initialize our environment for running OpenMythos.
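
The configuration above sets `prelude_layers=1`, `coda_layers=1`, and `max_loop_iters=8`, which suggests a prelude, a looped middle block, and a coda. As a rough illustration of that idea (not OpenMythos's actual forward pass), the sketch below reuses one block for every loop iteration and re-injects the encoded input each time. The names `recurrent_depth_forward`, `identity`, and `toy_block` are hypothetical stand-ins, and the zero initial state is an assumption.

```python
from typing import Callable, List

Vector = List[float]


def recurrent_depth_forward(
    x: Vector,
    prelude: Callable[[Vector], Vector],
    block: Callable[[Vector, Vector], Vector],
    coda: Callable[[Vector], Vector],
    n_loops: int,
) -> Vector:
    """Prelude once, one shared block n_loops times, coda once."""
    e = prelude(x)        # encoded input, re-injected on every iteration
    h = [0.0] * len(e)    # zero initial state (an assumption of this sketch)
    for _ in range(n_loops):
        h = block(h, e)   # same callable each time, so no new parameters
    return coda(h)


# Toy stand-ins (hypothetical, not OpenMythos layers).
def identity(v: Vector) -> Vector:
    return list(v)


def toy_block(h: Vector, e: Vector) -> Vector:
    # Contractive update: keep half of the state, add the injected input.
    return [0.5 * hi + ei for hi, ei in zip(h, e)]


# More loops refine the state while the "parameters" stay fixed.
for n in (1, 2, 3, 8):
    print(n, recurrent_depth_forward([1.0], identity, toy_block, identity, n))
```

In this toy setting, depth is purely a runtime argument: changing `n_loops` changes how much computation is applied without adding anything to the model.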

We construct configurations for both GQA and MLA attention mechanisms and instantiate their respective models. We also compare their parameter counts to see how the architectural differences affect model scale.

    def cache_bytes(kv: dict) -> int:
        # Sum the storage of every tensor held in the KV cache.
        total = 0
        for entry in kv.values():
            for t in entry.values():
                total += t.element_size() * t.numel()
        return total


    x = torch.randint(0, 256, (1, 64), device=device)
    ck_gqa, ck_mla = {}, {}
    with torch.no_grad():
        # The forward pass fills the dict passed as kv_cache.
        m_gqa(x, n_loops=4, kv_cache=ck_gqa)
        m_mla(x, n_loops=4, kv_cache=ck_mla)

    gqa_kb = cache_bytes(ck_gqa) / 1024
    mla_kb = cache_bytes(ck_mla) / 1024
    print("\n─── Part 2 ─ KV-cache footprint (1×64 tokens, 4 loops) ─")
    print(f"GQA cache : {gqa_kb:6.2f} KB ({len(ck_gqa)} layer-keys)")
    print(f"MLA cache : {mla_kb:6.2f} KB ({len(ck_mla)} layer-keys)")
    print(f"ratio : MLA is ≈{gqa_kb / max(mla_kb, 1e-9):.2f}× smaller")


    def show_stability(model, tag):
        A = model.recurrent.injection.get_A()
        # Assumed criterion: every entry of A lies strictly inside (0, 1).
        print(f"{tag:3s} ρ(A): min={A.min():.4f} max={A.max():.4f} "
              f"mean={A.mean():.4f} stable={bool(((A > 0) & (A < 1)).all())}")


    print("\n─── Part 3 ─ spectral radius at init ──────────────────")
    show_stability(m_gqa, "GQA")
    show_stability(m_mla, "MLA")

    # Stress test: an absurdly high learning rate on a meaningless loss.
    opt = torch.optim.Adam(m_mla.parameters(), lr=1.0)
    for _ in range(30):
        loss = m_mla(torch.randint(0, 256, (2, 16), device=device), n_loops=2).square().mean()
        opt.zero_grad(); loss.backward(); opt.step()
    show_stability(m_mla, "MLA after abusive training (lr=1.0, 30 steps)")

We compute and compare the KV-cache memory footprint for both GQA and MLA attention types during forward passes.
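
The measured ratio can be sanity-checked with back-of-the-envelope arithmetic from the configuration values. The sketch below assumes that GQA stores one key and one value vector per KV head with `head_dim = dim // n_heads`, and that MLA stores only a compressed KV latent plus one shared RoPE key per token, as in DeepSeek-V2-style MLA. Whether OpenMythos caches exactly these tensors is an assumption, and the helper names are hypothetical; the output of `cache_bytes` remains the authoritative number.

```python
def gqa_cache_floats_per_token(n_kv_heads: int, head_dim: int) -> int:
    # One key vector and one value vector per KV head.
    return 2 * n_kv_heads * head_dim


def mla_cache_floats_per_token(kv_lora_rank: int, qk_rope_head_dim: int) -> int:
    # One compressed KV latent plus one shared RoPE key (assumed layout).
    return kv_lora_rank + qk_rope_head_dim


# Values taken from make_config in the tutorial.
dim, n_heads, n_kv_heads = 128, 4, 2
head_dim = dim // n_heads  # assumption: GQA head size is dim // n_heads

gqa = gqa_cache_floats_per_token(n_kv_heads, head_dim)
mla = mla_cache_floats_per_token(kv_lora_rank=32, qk_rope_head_dim=16)
ratio = gqa / mla

print(f"GQA: {gqa} floats/token/cache entry")
print(f"MLA: {mla} floats/token/cache entry")
print(f"expected ratio ≈ {ratio:.2f}×")
```

If the measured ratio differs noticeably from this estimate, the cache probably holds different tensors than assumed here, which is worth checking by printing the keys and shapes inside `ck_gqa` and `ck_mla`.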

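The `show_stability` printout reports the minimum, maximum, and mean of `A`, which suggests that `A` acts elementwise, so its spectral radius is simply its largest entry in magnitude. A minimal sketch of why entries inside (0, 1) keep the loop stable, and why even destructive training cannot push them out of that range when they are produced by a squashing function, is given here. The sigmoid parameterization and the names `sigmoid` and `run_recurrence` are assumptions for illustration; the source does not show how `get_A` computes its values.

```python
import math


def sigmoid(raw: float) -> float:
    """Numerically stable logistic function."""
    if raw >= 0:
        return 1.0 / (1.0 + math.exp(-raw))
    z = math.exp(raw)
    return z / (1.0 + z)


def run_recurrence(a: float, drive: float, steps: int) -> float:
    """Iterate h <- a * h + drive from h = 0 and return the final state."""
    h = 0.0
    for _ in range(steps):
        h = a * h + drive
    return h


# Raw parameters as large as ±30 still map to a decay strictly inside (0, 1).
decays = [sigmoid(raw) for raw in (-30.0, -5.0, 0.0, 5.0, 30.0)]
print(all(0.0 < a < 1.0 for a in decays))

# With 0 < a < 1 the state approaches drive / (1 - a) and never exceeds it,
# no matter how many loop iterations are run.
a = sigmoid(5.0)
bound = 1.0 / (1.0 - a)
print(run_recurrence(a, 1.0, 200) <= bound)
```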
We then inspect the stability of the recurrent component by checking the spectral radius of the matrix A. We further stress-test the model with a deliberately destructive run (Adam at lr=1.0 for 30 steps) to confirm that stability is preserved.

    VOCAB = 64
    SEQ_LEN = 24


    def make_batch(batch=64, seq_len=SEQ_LEN):
        # Tokens 1 and 2 encode the bits 0 and 1.
        x = torch.randint(1, 3, (batch, seq_len), device=device)
        bits = x - 1
        # Target at each position is the running parity of all bits so far.
        parity = bits.cumsum(dim=1) % 2
        y = parity + 1  # map parity 0/1 back to tokens 1/2
        return x, y


    cfg = MythosConfig(
        vocab_size=VOCAB, dim=64, n_heads=4, n_kv_heads=2,
        max_seq_len=SEQ_LEN + 4, max_loop_iters=16,
        prelude_layers=1, coda_layers=1,
        n_experts=4, n_shared_experts=1, n_experts_per_tok=2,
        expert_dim=32, lora_rank=4,
        attn_type="gqa", act_threshold=0.99,
    )
    model = OpenMythos(cfg).to(device)
    opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
    T_TRAIN = 3  # number of loop iterations used during training

    print("\n─── Part 5 ─ training (T_train = 3) ───────────────────")
    print(f"params: {sum(p.numel() for p in model.parameters()):,}")

    losses = []
    t0 = time.time()
    for step in rang