This document describes the research topics we are opening for fellowships. Each topic suggests several projects indicative of the scope and complexity we expect you to undertake.
12 places will be opened for the 2025-2026 term, with a duration of 6 or 12 months at the applicant's option.
"High risk research" means ambitious projects that may not produce results, but will have lasting impact if they work.
We believe that publication in AI/ML has become overly biased towards short-term and iterative results, despite major gaps in our understanding of how to build optimal models.
The intent of these fellowships is to give awardees uninterrupted time to focus on harder open problems, with adequate compute, talented peers, and weekly 1:1 mentorship from senior researchers, but without chasing specific metrics.
You will be expected to spend about 80% of your time on your own research, and up to 20% of your time either assisting other fellows or participating in wider research programs at IMI. We focus largely on applied research for online security problems, but often publish and support frontier research aligned with our broader interests. Serving hundreds of millions of people gives us a unique perspective on what works at scale.
Eligibility: previous fellows and research staff have come from disparate backgrounds, including early-career researchers previously at MSR, FAIR, Mila, MPI, etc., and self-taught mid-career engineers transitioning into research. We do not discriminate on the basis of pedigree or age. If you have done interesting work, that is enough. You may reside anywhere in the world, excluding sanctioned jurisdictions. This will be a remote fellowship.
Deliverables: we do not set hard targets, but generally aim for 1-2 papers with code per year, targeting NeurIPS, ICML, ICLR, etc. At the end of your fellowship, if the threshold for publication at a conference or in a journal is not met, we will expect a final report, which may be published as a blog post.
Deadline: fellows are admitted in two cohorts. Deadlines for consideration: October 1, 2025 and February 1, 2026, each followed by a three-week decision period. Applications are reviewed on a rolling basis thereafter.
Compensation: a competitive, location-adjusted stipend, plus travel support for conferences with accepted papers.
Applying: send a brief bio / CV link via this page. Include 1) the topic you are interested in working on, 2) a few lines on any relevant prior work you've done, 3) your GitHub / Scholar / X links, 4) your desired start date, duration, and other obligations (if any) during that period, and 5) a brief analysis of one of the projects outlined below.
Each project intentionally includes some gaps or glosses. List the ones you see, and how you'd solve them. Alternatively, if you dislike the projects outlined under a particular topic, write up your own idea and why it is more promising, along with your estimate of time and compute required.
Selection criteria: novelty and importance, clarity of approach, feasibility given time/compute, alignment with topics. Panel review and one interview.
Selection will be based solely on merit. IMI is an equal opportunity employer, and does not discriminate on the basis of age, disability, sex, orientation, race, religion or belief. We promote equality of opportunity for all, and welcome applications from anyone with talent, skills and potential.
Thesis: training power usage should be many orders of magnitude lower. No single change will get us there, but a plausible research program is to:
1) make most tokens unnecessary,
2) quickly get to a good solution analytically or with amortized learners,
3) learn to use external structure rather than training every fact into the weights, and
4) combine second order methods with low precision training to converge in fewer, larger steps.
Motivation: training big networks via SGD is hideously inefficient. Can approximate analytic solutions provide a 100x reduction in training time?
Idea: use curvature information from a tiny fraction of the corpus, solve the resulting quadratic problem once per block, then de-linearize by injecting the update through small gates so that the network stays in its local linear regime.
Suggested approach: estimate a good solution by solving a sketched natural-gradient step in closed form. Stream the corpus once to estimate a block-diagonal or Kronecker-factored Fisher (or Gauss-Newton) in each layer using random projections, compute the ridge solution for linearized parameters, de-linearize by composing layers near identity.
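To make the closed-form step concrete, here is a minimal numpy sketch for a single linear layer, assuming a Kronecker-factored Fisher/Gauss-Newton approximation estimated from a small sample of layer inputs and output-side gradients; the dimensions, damping value, and random data are placeholders.

```python
# Sketch: one-shot K-FAC-style ridge step for one linear layer (illustrative only).
# A_samples / G_samples stand in for layer inputs and backpropagated output gradients
# collected from a single streaming pass over a corpus sketch; lambda_ is the damping term.
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, n_sketch = 256, 128, 1024

A_samples = rng.standard_normal((n_sketch, d_in))    # layer inputs
G_samples = rng.standard_normal((n_sketch, d_out))   # output-side gradients
grad_W = G_samples.T @ A_samples / n_sketch          # gradient of the linearized loss

# Kronecker factors of the curvature: F ~= A_cov (x) G_cov.
A_cov = A_samples.T @ A_samples / n_sketch
G_cov = G_samples.T @ G_samples / n_sketch

lambda_ = 1e-2                                       # ridge / damping
A_inv = np.linalg.inv(A_cov + lambda_ * np.eye(d_in))
G_inv = np.linalg.inv(G_cov + lambda_ * np.eye(d_out))

# Closed-form natural-gradient step via the Kronecker identity:
# (A (x) G + damping)^-1 vec(grad) ~= vec(G_inv @ grad_W @ A_inv).
delta_W = G_inv @ grad_W @ A_inv
print(delta_W.shape)                                 # (d_out, d_in)
```

In the full proposal this step would be solved per block from the streamed, sketched Fisher estimates, then injected through small gates as described above.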
Why: layerwise K-FAC/Shampoo already works, randomized second-order solvers are mature, neural nets are close to linear early in training. A high-quality one-shot quasi-Newton init could remove 90-99% of steps.
Risks: linearization error, imperfect Fisher, stability in deep stacks. (+ still needs a brief fine-tune at the end). LLMs also benefit from feature learning outside the lazy/NTK regime, where analytic linearized steps help least.
Possible procedure:
Literature:
Kronecker-Factored Approximate Curvature for Modern Neural Network Architectures
SKFAC: Training Neural Networks with Faster Kronecker-Factored Approximate Curvature
Near-optimal Sketchy Natural Gradients for Physics-Informed Neural Networks
Motivation: storing facts in weights is wasteful and gets stale. Can we externalize knowledge as an explicit, updatable hierarchy the model learns to write to and query?
Idea: make the model 1) induce its own ontology of concept nodes, typed edges, and table schemas, and 2) plan short programs over it. The LM becomes a planner/fuser, facts live in a compact, interpretable store.
Suggested approach: build a small, learned hierarchy: nodes with hyperbolic/tree codes and prototypes, sparse typed edges with evidence, leaves as text passages and table rows. Train write (create/link) and plan (DSL) heads, retrieve tiny typed subgraphs/rows, fuse with gated cross-attention, penalize param-only answers.
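One possible shape for the store, sketched as Python dataclasses; the field names, the typed-edge schema, and the retrieval unit are illustrative assumptions rather than a fixed design.

```python
# Sketch of the external knowledge store: concept nodes, typed edges, and leaf evidence.
# Field names and types are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class Node:
    node_id: int
    prototype: list[float]                              # prototype / hyperbolic tree code
    label: str = ""                                      # optional human-readable name
    leaves: list[str] = field(default_factory=list)      # text passages / table rows

@dataclass
class Edge:
    src: int
    dst: int
    relation: str                                        # typed relation, e.g. "is_a"
    evidence: list[str] = field(default_factory=list)    # spans justifying the link

@dataclass
class Store:
    nodes: dict[int, Node] = field(default_factory=dict)
    edges: list[Edge] = field(default_factory=list)

    def write(self, node: Node) -> None:                 # target of the learned write head
        self.nodes[node.node_id] = node

    def link(self, edge: Edge) -> None:
        self.edges.append(edge)

    def neighborhood(self, node_id: int) -> list[Edge]:
        # The tiny typed subgraph handed to the LM's gated cross-attention fusion.
        return [e for e in self.edges if node_id in (e.src, e.dst)]
```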
Why: transformers already form latent hierarchies. Making them explicit improves sample efficiency, updatability, and interpretability. RAG/FiD, Poincare embeddings, schema induction, and programmatic retrieval provide working parts.
Risks: ontology drift/fragmentation, noisy links, planner brittleness, latency/complexity. Model may bypass the store. Some fast/robust knowledge must remain parametric.
Possible procedure:
Literature:
Memory^3: Language Modeling with Explicit Memory
Ontology Generation using Large Language Models
Motivation: unconstrained writes bloat the store and kill interpretability. Can we grow a compact, legible hierarchy by starting ultra-sparse and relaxing only when utility is proven?
Idea: treat create node/edge/schema as gated actions with an explicit cost. Start with near-zero write capacity, force reuse/links, relax L0/MDL penalties as evidence accumulates that new structure reduces loss.
Suggested approach: hard-concrete gates per write op (create/link/split), group-lasso over relation types, and an MDL-style budget. A staged schedule: 1) link-only, 2) controlled splitting, 3) schema induction, 4) relaxed growth. Each write gets a justification (support spans, proto summary), enabling audit and rollback.
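A minimal PyTorch sketch of one such gate, following the standard hard-concrete / L0 relaxation; the stretch constants are the usual defaults and the write-cost weight is a placeholder.

```python
# Sketch: hard-concrete gate over a single write op (create node/edge/schema).
# beta, gamma, zeta follow the usual hard-concrete constants; the 1e-3 write-cost
# weight is an illustrative assumption.
import torch
import torch.nn as nn

class HardConcreteGate(nn.Module):
    def __init__(self, beta: float = 2 / 3, gamma: float = -0.1, zeta: float = 1.1):
        super().__init__()
        self.log_alpha = nn.Parameter(torch.zeros(()))   # learned gate logit per write op
        self.beta, self.gamma, self.zeta = beta, gamma, zeta

    def forward(self) -> torch.Tensor:
        if self.training:                                # stochastic, differentiable sample
            u = torch.rand(())
            s = torch.sigmoid((torch.log(u) - torch.log(1 - u) + self.log_alpha) / self.beta)
        else:                                            # deterministic at eval time
            s = torch.sigmoid(self.log_alpha)
        s_bar = s * (self.zeta - self.gamma) + self.gamma
        return s_bar.clamp(0.0, 1.0)                     # gate in [0, 1], often exactly 0 or 1

    def expected_l0(self) -> torch.Tensor:
        # P(gate != 0): summed over write ops this is the MDL-style write budget term.
        return torch.sigmoid(
            self.log_alpha - self.beta * torch.log(torch.tensor(-self.gamma / self.zeta))
        )

gate = HardConcreteGate()
write_cost = 1e-3 * gate.expected_l0()   # added to the loss to discourage new structure
```

Relaxing the L0/MDL penalty over the staged schedule then amounts to annealing the weight on this expected-L0 term.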
Why: early noise is what makes ontologies sprawl. A sparsity curriculum preserves tree-like structure, encourages reuse, and keeps concepts interpretable while still allowing growth when it pays off.
Risks: over-sparsity (missed concepts), late discovery tax (hard to recover if you never split), planner gaming the penalties, merge instability.
Possible procedure:
Signals it's working: sublinear graph growth, rising citation rate, stable small branching factors, and large accuracy drops when the store is ablated in the domains it covers.
Literature:
Crisp Attention: Regularizing Transformers via Structured Sparsity, https://arxiv.org/abs/2508.06016
The Role of Sparsity for Length Generalization in Transformers, https://arxiv.org/pdf/2502.16792
White-Box Transformers via Sparse Rate Reduction: Compression Is All There Is?, https://arxiv.org/abs/2311.13110
Growing neural networks: dynamic evolution through gradient descent, https://arxiv.org/abs/2501.18012
Identifying hubs in directed networks, https://arxiv.org/pdf/2312.03347
GKG-LLM: A Unified Framework for Generalized Knowledge Graph Construction, https://arxiv.org/html/2503.11227v2
Motivation: pretraining rereads the same patterns and reasoning chains. If models learn reusable templates, we should only pay for residual bits, not repetitions. Maybe direct + hierarchical compressibility can cut token exposure.
Idea: turn pretraining into compression. Learn a hierarchical dictionary of semantic spans and reasoning templates. Encode the corpus into 1) template calls with slot fills and 2) residual tokens the dictionary can't predict. Train primarily on the residual, keep the dictionary executable so a few examples teach many instances.
Suggested approach: 1) a neural compressor that induces macros (semantic LZ) and templates (reasoning VM) with MDL pressure, and 2) a residual trainer that samples only novel bits with importance weighting. Refresh the dictionary online so the residual steadily shrinks.
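A toy sketch of the MDL accounting for admitting a macro into the dictionary, using raw token counts as a stand-in for code lengths under the current model; the corpus, n-gram candidates, and vocabulary size are illustrative.

```python
# Toy MDL check: keep a macro only if the bits it saves across the corpus exceed the
# bits needed to describe it once in the dictionary. Residual = tokens no admitted
# macro covers; only those feed the residual trainer.
import math
from collections import Counter

def description_bits(macro: tuple[str, ...], vocab_size: int) -> float:
    return len(macro) * math.log2(vocab_size)            # cost of writing the macro once

def savings_bits(macro: tuple[str, ...], occurrences: int, vocab_size: int) -> float:
    per_use = (len(macro) - 1) * math.log2(vocab_size)   # each use becomes one call token
    return occurrences * per_use

def admit(macro, occurrences, vocab_size=50_000) -> bool:
    return savings_bits(macro, occurrences, vocab_size) > description_bits(macro, vocab_size)

corpus = ["the cat sat on the mat", "the cat sat on the rug"]
ngrams = Counter()
for line in corpus:
    toks = line.split()
    for n in (3, 4):
        for i in range(len(toks) - n + 1):
            ngrams[tuple(toks[i:i + n])] += 1

dictionary = {m for m, c in ngrams.items() if admit(m, c)}
print(sorted(dictionary))   # repeated spans pass the MDL test; one-off spans stay residual
```

A real compressor would replace the log2(vocab) costs with code lengths under the evolving model, which is what keeps the residual shrinking as the dictionary is refreshed.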
Why: language is highly compressible, gradient contributions are heavy-tailed. Grammar induction, dataset distillation, and macro tokenization already show large sample cuts. If chain-of-thought compresses into a small set of schemas, we need far fewer demonstrations.
Risks: over-compression (miss rarities), drift as the model/dictionary co-evolve, template brittleness, compressor overhead. Requires tight guards to avoid bias and loss spikes.
Possible procedure:
Expected outcome: train on 3-10% of original tokens at steady state for compressible domains. Large wall-clock/power savings in early training, better factual and reasoning generalization per token via explicit reuse of templates, fast adaptation by updating the dictionary.
Literature:
Language Modeling is Compression, https://arxiv.org/abs/2309.10668
Compression Represents Intelligence Linearly, https://arxiv.org/pdf/2404.09937
Entropy Law: The Story Behind Data Compression and LLM Performance, https://arxiv.org/html/2407.06645v1
Thesis: people learn from their mistakes. Models don't, but pass@512 works better than pass@1, indicating the answer is often already within reach. Let's make max(pass@N) == pass@1:
1) turn search into supervision,
2) convert in-context fixes into weights,
3) compile failures into repairs, and
4) keep gains without regressions/forgetting.
Rough ideas: 1) harvest pass@N trees with verifiers and partial checks, learn prefix value functions, and distill winner prefixes into the policy. 2) trigger on surprise from tool returns, crystallize traces into reusable templates and low-rank adapters with trust-region commits. 3) align fail vs. win traces to learn detectors and conditional rewrites, store them as routed patches and planner rules. 4) bound updates with KL/Fisher caps, run canaries, use selective counterfactual replay, and periodically distill validated patches into the main model.
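As one concrete instance of idea 4 (keeping gains without regressions), a minimal sketch of a KL-capped commit: a candidate patch is merged only if the KL divergence from a frozen reference on a canary batch stays under a budget. The model, patch function, and budget value are illustrative assumptions.

```python
# Sketch: accept a candidate weight patch only if KL(reference || patched) on canary
# inputs stays under a trust-region budget; otherwise roll back. kl_budget is a placeholder.
import copy
import torch
import torch.nn.functional as F

def kl_on_canaries(ref_model, patched_model, canary_inputs) -> torch.Tensor:
    with torch.no_grad():
        ref_logp = F.log_softmax(ref_model(canary_inputs), dim=-1)
        new_logp = F.log_softmax(patched_model(canary_inputs), dim=-1)
    # KL(reference || patched), averaged over the canary batch.
    return F.kl_div(new_logp, ref_logp, log_target=True, reduction="batchmean")

def commit_if_safe(model, patch_fn, canary_inputs, kl_budget: float = 0.02) -> bool:
    ref = copy.deepcopy(model).eval()
    patch_fn(model)                                  # e.g. merge a validated low-rank adapter
    if kl_on_canaries(ref, model, canary_inputs) > kl_budget:
        model.load_state_dict(ref.state_dict())      # revert the patch
        return False
    return True
```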
Motivation: many problems succeed with pass@64-512 but miss on pass@1 because early plan choices diverge. Can we shift policy to prefer winners without leaking finals or eval labels?
Idea: distill the advantage of early prefixes from winning traces over near-misses. Learn to emit better plan tokens, decompositions, and invariants in the first few steps, under a KL trust region and without training on final answers.
Suggested approach: mine pass@N logs with a verifier. Extract prefixes up to the first tool call or the first K tokens. Run preference learning on winner vs. loser prefixes, plus step-wise advantage-weighted updates from partial checks. Update the planner head and small early-layer adapters; freeze late layers.
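A sketch of the preference step, assuming DPO-style training on summed token log-probabilities of winner vs. loser prefixes under the policy and a frozen reference; beta and the toy numbers are placeholders.

```python
# Sketch: DPO-style preference loss on winning vs. losing plan prefixes mined from
# pass@N logs. logp_* are summed token log-probs of each prefix (prompt + prefix up to
# the first tool call or K tokens) under the policy and the frozen reference.
import torch
import torch.nn.functional as F

def prefix_dpo_loss(logp_win_policy, logp_lose_policy,
                    logp_win_ref, logp_lose_ref, beta: float = 0.1) -> torch.Tensor:
    # Margin: how much more the policy (relative to the reference) prefers the winner.
    margin = beta * ((logp_win_policy - logp_win_ref) - (logp_lose_policy - logp_lose_ref))
    return -F.logsigmoid(margin).mean()

# Toy usage with fake log-probs for a batch of four mined prefix pairs.
lw_p = torch.tensor([-12.0, -9.5, -11.0, -8.0])
ll_p = torch.tensor([-13.0, -10.0, -10.5, -9.0])
lw_r = torch.tensor([-12.5, -9.8, -11.2, -8.5])
ll_r = torch.tensor([-12.8, -9.9, -10.8, -8.9])
print(prefix_dpo_loss(lw_p, ll_p, lw_r, ll_r))
```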
Why: early decisions carry most of the causal weight on success, pass@N pools expose reliable winner patterns, DPO/IPO and advantage-weighted regression work with preferences and partial rewards, avoiding ground-truth leakage.
Risks: spurious correlations in prefixes, verbosity inflation, domain shift from synthetic near-misses, regressions if KL is loose.
Possible procedure:
Expected outcome: pass@1 gains from the same pass@N budget, localized updates that do not overfit finals, minimal compute overhead beyond log mining and short preference training.
Literature:
Rewarding Progress: Scaling Automated Process Verifiers for LLM Reasoning, https://arxiv.org/pdf/2410.08146
RECURSIVE INTROSPECTION: Teaching Foundation Model Agents How to Self-Improve, https://openreview.net/pdf?id=UPoQqreegH
Counterspeech the ultimate shield! Multi-Conditioned Counterspeech Generation through Attributed Prefix Learning, https://arxiv.org/pdf/2505.11958
Motivation: best-of-N search induces an implicit decision tree where early nodes determine success. Can we turn those trees into a value signal over plan tokens and perform policy iteration to improve first-shot behavior?
Idea: build a prefix value model that predicts expected verifier success given a partial plan. Estimate Q over plan decisions from search trees and partial checks, then perform KL-regularized policy improvement on the planner head.
Suggested approach: construct shallow trees from pass@N rollouts, annotate nodes with verifier outcomes and partial rewards, and compute soft Q via backward value propagation. Train a small critic over prefixes, and improve the planner policy with advantage-weighted cross-entropy while freezing late layers.
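A small sketch of the backward value propagation, assuming leaves carry verifier or partial-check rewards and internal nodes back up a soft (log-sum-exp) value over children; the tree shape and temperature are illustrative.

```python
# Sketch: soft-Q backup over a shallow pass@N search tree. Leaf rewards come from the
# verifier / partial checks; internal nodes take a temperature-smoothed maximum over
# children. The resulting advantages weight the planner's cross-entropy update.
import math
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class TreeNode:
    reward: Optional[float] = None                   # set on leaves by the verifier
    children: list["TreeNode"] = field(default_factory=list)

def soft_q(node: TreeNode, tau: float = 0.5) -> float:
    if not node.children:
        return node.reward if node.reward is not None else 0.0
    vals = [soft_q(c, tau) for c in node.children]
    # Log-sum-exp backup: interpolates between mean (large tau) and max (small tau).
    return tau * math.log(sum(math.exp(v / tau) for v in vals) / len(vals))

# Toy tree: a root plan choice with a mostly-failing branch and a mostly-winning branch.
root = TreeNode(children=[
    TreeNode(children=[TreeNode(reward=0.0), TreeNode(reward=0.2)]),
    TreeNode(children=[TreeNode(reward=1.0), TreeNode(reward=0.7)]),
])
values = [soft_q(child) for child in root.children]
advantages = [v - soft_q(root) for v in values]
print(values, advantages)                            # second branch carries positive advantage
```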
Why: tree policy iteration is a principled way to learn from search, verifier and partial checks provide dense rewards, KL-regularized policy updates are stable and sample-efficient.
Risks: value leakage if trees are overfit, instability from off-policy bias, verbosity creep, and interference with non-searched domains.
Possible procedure:
Expected outcome: first-shot accuracy approaches best-of-N without extra sampling at inference, stable, interpretable improvements via early plan value shaping, modest compute overhead with strong sample efficiency.
Thesis: Neural nets can implicitly learn small, reusable programs (sin/cos, sorting, date arithmetic, unit conversion) encoded as superpositions in their weights. We should extract these programs into explicit, verifiable modules and make models call them, reducing parameter bloat, improving generalization, and enabling updates without retraining. If this works well we'll next tackle composability, i.e. a 'neural linker' step.
Motivation: Storing algorithms in weights is opaque, hard to update, and redundancy-prone. If a net "knows" trig, a calendar, or a regex engine, we should externalize them as code and prune the corresponding circuits.
Idea: Build a compiler that maps network behaviors on targeted subspaces into a small typed intermediate representation of numerical and symbolic modules, with verification tests and equivalence checks. Replace the discovered circuit with a call to the extracted module.
Approach:
- Scope: Identify candidate latent programs by probing for low-entropy, low-rank, highly repeatable behaviors at specific layers/heads like 'angle to sin(angle)', 'string to match(pattern)'.
- Specification mining: Generate test benches using counterfactual inputs, invariance checks (e.g. sin(x+2pi)=sin(x)), and smoothness/periodicity detectors, then fit candidates from a library (trig, polynomials, finite-state transducers, arithmetic, set ops). A toy sketch follows this list.
- Synthesis: Combine sparse identification of nonlinear dynamics (SINDy-style), symbolic regression, and enumerative search over a compact IR (typed SSA with vector ops, conditionals, and bounded loops).
- Verification: Run equivalence tests against the original subnetwork over adversarial and randomized inputs, then certify error bounds and input domain. If verified, freeze/retire the circuit and insert a differentiable "call-module" stub with gradients routed to arguments, not the replaced weights.
- Maintenance: Version modules, track provenance and input domains, and maintain a regression suite. Allow hot-swaps (e.g. better trig approximations) without touching base weights.
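A toy sketch of the specification-mining step, assuming the isolated circuit can be queried as a black-box input-to-output map (faked here with a noisy sin); the candidate library, tolerances, and noise level are illustrative.

```python
# Sketch: test-bench mining for a probed subnetwork suspected of computing sin().
# `circuit` stands in for the isolated circuit; the real pipeline would probe the
# network's targeted subspace instead.
import numpy as np

rng = np.random.default_rng(0)

def circuit(x: np.ndarray) -> np.ndarray:
    return np.sin(x) + 1e-4 * rng.standard_normal(x.shape)   # stand-in for the circuit

LIBRARY = {"sin": np.sin, "cos": np.cos, "identity": lambda x: x}

def mine_spec(fn, n: int = 2048, tol: float = 1e-2) -> dict:
    x = rng.uniform(-10, 10, n)
    y = fn(x)
    report = {}
    # Invariance check: periodicity with period 2*pi.
    report["periodic_2pi"] = bool(np.max(np.abs(fn(x + 2 * np.pi) - y)) < tol)
    # Candidate fit: library function with the lowest max error on randomized inputs.
    errors = {name: float(np.max(np.abs(g(x) - y))) for name, g in LIBRARY.items()}
    best = min(errors, key=errors.get)
    report["best_candidate"] = (best, errors[best])
    report["verified"] = errors[best] < tol and report["periodic_2pi"]
    return report

print(mine_spec(circuit))   # a verified match would trigger synthesis + circuit retirement
```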
Why: Neural circuits often realize simple, reusable functions. Explicit modules are easier to verify, cache, optimize, and upgrade. Synthesis with invariants curbs overfitting, and explicit calls avoid recomputation and shrink parameter/activation footprints.
Risks:
- Misspecification: Wrong library or IR misses real behavior, producing brittle extractions.
- Distributed representations: Useful programs spread across layers/heads, hard to isolate.
- Verification gaps: Passing tests but failing on rare regimes.
- Integration tax: Latency/ABI overhead and gradient mismatch at the call boundary.
Goals:
- 30-60% of occurrences of common algorithmic skills offloaded to modules with certified error <=1e-6 on their domains.
- 10-30% parameter and 10-25% activation reduction at equal or better quality on tasks invoking those skills.
- Measurable gains in out-of-distribution robustness for offloaded skills (e.g. long-range dates, big numbers).
Literature:
Opening the AI Black Box: Distilling Machine-Learned Algorithms into Code, https://www.mdpi.com/1099-4300/26/12/1046
Deriving Equivalent Symbol-Based Decision Models from Feedforward Neural Networks, https://arxiv.org/html/2504.12446v1
Self-Guided Function Calling in Large Language Models via Stepwise Experience Recall, https://arxiv.org/html/2508.15214v1
Motivation: Even with a library, models won't use it unless calling is easy and rewarded. We need a clean ABI (types, shapes, side effects) and a training regime that prefers external calls over re-learning in weights.
Idea: Introduce a differentiable application binary interface (dABI) and "call" tokens that route subproblems to external modules. Train with an MDL-style objective: prefer short "call+args" programs over long parametric computation. Use write-through learning so successful in-context calls are consolidated into persistent call policies.
Approach:
- ABI design: Typed arguments (scalars, vectors, strings, sets), effect annotations (pure/impure), shape contracts, and gradient rules (exact, straight-through, or stop-grad). A minimal sketch follows this list.
- Router: A lightweight planner head predicts when to call, which module, and with what arguments. Provide partial credit via differentiable surrogates (e.g. relaxed arg parsing, soft alignment of spans to args).
- Costs and rewards: Penalize param-only solutions when a verified module exists; reward correct calls with small KL bonuses and latency/energy credits; enforce per-batch "offload budgets" to shift usage gradually.
- Pruning: After stable adoption, gradually L0-prune circuits shadowed by calls, and keep a safety adapter to catch drift and trigger re-training.
- Continual learning: When no module fits, log traces and auto-propose new candidates for Project 1 to compile. Close the loop: discovered modules immediately become callable via the dABI.
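A minimal sketch of what a typed module signature and call stub might look like under the dABI, assuming a straight-through gradient rule; the ModuleSig fields and the trig example are illustrative, not a spec.

```python
# Sketch: typed module signature plus a call stub that routes gradients straight through
# to the arguments (one of the gradient rules named above). Names are illustrative.
from dataclasses import dataclass
from typing import Callable
import torch

@dataclass(frozen=True)
class ModuleSig:
    name: str
    arg_types: tuple[str, ...]        # e.g. ("scalar",) or ("string", "string")
    ret_type: str
    pure: bool                        # effect annotation: True = no side effects
    grad: str                         # "exact" | "straight_through" | "stop_grad"

class CallStub(torch.autograd.Function):
    """External module call with a straight-through gradient to the arguments."""
    @staticmethod
    def forward(ctx, args: torch.Tensor, fn: Callable[[torch.Tensor], torch.Tensor]):
        return fn(args.detach())

    @staticmethod
    def backward(ctx, grad_out):
        return grad_out, None         # pass the gradient back to the arguments only

SIN = ModuleSig("sin", ("scalar",), "scalar", pure=True, grad="straight_through")

x = torch.tensor([0.1, 1.2], requires_grad=True)
y = CallStub.apply(x, torch.sin)      # in practice the router picks module + arguments
y.sum().backward()
print(x.grad)                         # straight-through: gradient of ones reaches the args
```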
Why: Models already plan tool use; making library calls first-class and cheap creates selection pressure to externalize. A typed ABI contains complexity, reduces integration bugs, and allows acceleration (e.g. vectorized trig kernels).
Risks:
- Over-calling: Router overuses modules where param paths suffice.
- Cold start: Early errors discourage calls; a careful curriculum and safety nets are needed.
- Gradient pathologies: Approximate gradients through discrete calls can bias training.
- Library sprawl: Too many niche modules increase complexity and latency.
Goals:
- >=80% correct module call rate on benchmarks containing extractable skills.
- Net wall-clock speedups of 1.3-2.0x on workloads heavy in offloaded operations.
- Stable or improved task accuracy with >=70% reduction in gradients flowing through replaced circuits.
Thesis: Today's 'explanations' are mostly rationalizations written after the fact. Let's bake interpretability into the forward pass: layers and heads get compact, typed docstrings and citations that (a) summarize what each component is doing on this input, (b) constrain what information is allowed to flow next ('explain to execute'), and (c) compose into a real-time execution path/stack trace. Target two domains initially (multi-hop QA over provided context, grade-school math word problems) to demonstrate faithful, causal traces that show the facts and logic used at each step.
Motivation: Attention heads and MLPs often specialize, but their behavior is hidden. We want two levels of documentation: (1) a static capability docstring per head/block, and (2) a dynamic "call-site docstring" on each token that declares what the unit is doing now and which facts it uses.
Idea: Add a small documentation head and a citation head to each attention head and MLP block. The documentation head emits a short program sketch from a tiny vocabulary (the codebook), and the citation head names the spans/tokens used as evidence. Train them to be (i) predictive of the unit's effect, (ii) minimal via MDL pressure, and (iii) causally faithful by enforcing that downstream computation depends on the cited evidence and declared operation.
Suggested approach (a code sketch follows this list):
- Representations
- Static capability vector z_h per head/block with a canonical docstring (learned once, versioned).
- Dynamic call-site docstring d_h,t per token/time: a short sequence in a constrained DSL (e.g. COPY_FROM(span), MATCH(pattern), COREF_ANTE(span), AGG(NUM, window), DATE_ADD, ARGMAX(key), ROUTE_TO(node), ASSERT(invariant)).
- Citations c_h,t: a sparse set of input spans (start, end, source_id) with confidence.
- Losses and constraints
- Reconstruction: small decoders reconstruct the head's output from (d_h,t, c_h,t, selected source embeddings); the docstring must be sufficient to predict the unit's effect.
- Causal faithfulness: ablate or perturb cited spans and require the predicted effect to change accordingly (contrastive consistency). Enforce sparsity on citations and docstring length (MDL).
- Alignment: cluster behaviors offline and map clusters to codebook tokens, with human-curated seeds for a few canonical operations (coref, copy, local-n-gram, delimiter detection, number aggregation).
- Runtime
- A trace aggregator collects (d_h,t, c_h,t, z_h) across layers and compiles a readable stack trace with timestamps and evidence links.
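A minimal sketch of the per-unit documentation and citation heads; dimensions, codebook size, doc length, and top-k are placeholder choices.

```python
# Sketch: documentation head (short doc-token logits over a fixed codebook) and citation
# head (top-k pointers over source tokens) attached to one attention head / MLP block.
import torch
import torch.nn as nn

class DocCitationHeads(nn.Module):
    def __init__(self, d_model=512, codebook_size=96, max_doc_len=4, k_citations=3):
        super().__init__()
        self.doc_proj = nn.Linear(d_model, codebook_size * max_doc_len)
        self.query = nn.Linear(d_model, d_model)          # pointer query for citations
        self.max_doc_len, self.codebook_size, self.k = max_doc_len, codebook_size, k_citations

    def forward(self, head_state, input_states):
        # head_state:   (batch, d_model) summary of the unit's activity at this token
        # input_states: (batch, seq, d_model) embeddings of the source tokens/spans
        b = head_state.shape[0]
        doc_logits = self.doc_proj(head_state).view(b, self.max_doc_len, self.codebook_size)
        scores = torch.einsum("bd,bsd->bs", self.query(head_state), input_states)
        cite_idx = scores.topk(self.k, dim=-1).indices    # sparse top-k citations
        return doc_logits, cite_idx

heads = DocCitationHeads()
doc_logits, cites = heads(torch.randn(2, 512), torch.randn(2, 40, 512))
print(doc_logits.shape, cites.shape)                      # (2, 4, 96), (2, 3)
```

The reconstructor and the MDL and causal-faithfulness losses described above would sit on top of these outputs.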
Why: 'Documentation' used to predict and constrain the forward computation is far harder to fake than post-hoc gloss. Minimal codebook tokens with explicit citations make trace outputs short, comparable, and auditable.
Risks:
- Docs collapse to vague tags; codebook drift.
- Overhead in both compute and latency if not carefully architected.
- Faithfulness gaps if constraints are too weak or too soft.
Goals:
- Multi-hop QA over given passages: most correct answers accompanied by traces whose citations, when masked, cause the answer to fail (strong causal test).
- GSM8K-style math: most correct solutions accompanied by stepwise traces where masking cited numbers/ops flips the outcome.
- Overhead: low latency on a 7B-class model, static per-head docs are stable across corpora with >=0.8 Jaccard overlap of codebook tokens.
Possible procedure:
1) Initialize
- Choose base model (e.g. 7B pre-LN Transformer), instrument attention heads/MLPs with:
- Documentation head: a linear projection to a small LM head over a fixed codebook (64-128 tokens).
- Citation head: pointer network over input tokens/spans, limited to k=1-3 pointers per unit/time.
- Lightweight reconstructor: predicts the unit's output using (docstring, cited embeddings).
- Define the codebook: seed 12-20 primitive tokens (COPY, COREF_ANTE, LOCAL_NGRAM, OPEN_QUOTE, CLOSE_QUOTE, MATCH_DIGITS, SUM_NUMS, MAX_BY_KEY, TABLE_HEADER_ALIGN, DATE_PARSE, DATE_ADD, FORMAT). Leave "OTHER_x" slots for emergent clusters.
2) Calibrate
- Collect behavior sketches: run the base model on small QA and math corpora, log attention patterns and interventions (activation patching, head ablation).
- Cluster head behaviors (e.g. by attention kernel shapes, positional biases, content similarities) to propose initial head-to-codebook mappings, hand-label only 10-20 examples to seed.
3) Train (Phase A: doc-only)
- Freeze base LM. Train documentation and citation heads and reconstructors:
- Cross-entropy loss on doc tokens with entropy regularization, length penalty (hard cap 3-5 tokens).
- Sparse pointer loss for citations (top-k).
- Reconstruction loss: MSE/cosine between predicted vs actual head outputs.
- Causal imitation: for a subset of batches, zero out cited spans and require the reconstructor to predict the measured change in the head (contrastive).
4) Guard (doc quality)
- Disallow trivial docs: if doc length >1 but adds sub-epsilon reconstruction gain vs "OTHER", increase MDL penalty.
- Merge near-duplicate codebook entries, bound per-head doc entropy over time.
5) Train (Phase B: explain-to-execute coupling)
- Unfreeze small gates around each unit: when a docstring declares COPY_FROM(span), soft-mask attention to non-cited spans; when it declares AGG(NUM), encourage number-feature heads to turn on.
- Loss: performance KL-regularized to base, plus a penalty if downstream layers use off-trace tokens (estimated via attention mass and gradient-based attribution).
6) Runtime trace
- Implement a trace aggregator producing a per-token JSON trace:
- step_id, layer, head, doc_tokens, citations (spans/ids), confidence, before/after norms.
- UI: render a collapsible stack trace with clickable spans showing source context.
7) Evaluate
- Fidelity: "mask-the-citation" causal tests (drop cited spans, rerun, record delta). Report per-domain causal F1.
- Compactness: average doc length, citation count.
- Stability: doc overlap on new corpora, drift alarms.
8) Refresh
- Periodically recluster behaviors, reassign "OTHER" tokens to concrete roles, prune unused codebook entries.
- Small preference-tuning round to favor shorter, more faithful docs without hurting task metrics.
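A sketch of the mask-the-citation test from step 7, assuming the instrumented model can be called as a function returning an answer plus cited token indices; substituting a pad token for cited spans is one reasonable masking choice among several.

```python
# Sketch: causal fidelity test. For answers that were originally correct, mask the cited
# tokens and rerun; a faithful trace should make the answer change. `answer_fn` stands in
# for running the instrumented model.
from typing import Callable

def causal_flip_rate(examples: list[dict],
                     answer_fn: Callable[[list[str]], tuple[str, list[int]]],
                     pad: str = "<mask>") -> float:
    flips, total = 0, 0
    for ex in examples:
        tokens = ex["tokens"]                       # tokenized context
        answer, citations = answer_fn(tokens)       # run once, collect cited token indices
        if answer != ex["gold"]:
            continue                                # only score originally-correct answers
        cited = set(citations)
        masked = [pad if i in cited else t for i, t in enumerate(tokens)]
        new_answer, _ = answer_fn(masked)           # rerun with the cited evidence removed
        flips += int(new_answer != answer)
        total += 1
    return flips / max(total, 1)
```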
Literature (and weaknesses addressed):
- Chain-of-thought: boosts performance, but often unfaithful and verbose. We avoid free-form text and enforce causal use via masking.
- Attention != explanation: we add reconstruction plus interventional tests to ensure attention/citations matter causally.
- Self-Explaining Neural Networks, rationale extraction: many produce proxies not used by the model. We couple docs to execution.
- Sparse autoencoders on residual streams: promising for features but not tied to live traces. We integrate with citations and reconstruction.
How Post-Training Reshapes LLMs: A Mechanistic View on Knowledge, Truthfulness, Refusal, and Confidence, https://arxiv.org/html/2504.02904v2
Ravel: Evaluating Interpretability Methods on Disentangling Language Model Representations, https://arxiv.org/html/2402.17700v2
Solving Abstract Reasoning Tasks with Grammatical Evolution, https://www.researchgate.net/publication/348408303_Solving_Abstract_Reasoning_Tasks_with_Grammatical_Evolution
Motivation: Even with HeadDocs, end-to-end reasoning is unclear. We want a compact 'reasoning atoms' DSL that the model emits interleaved with generation. These atoms both (a) constrain what the next step can use and (b) act as the executable plan, producing a faithful, verifiable stack trace.
Idea: Add a parallel trace channel that emits short sequences of typed atoms with arguments (spans, numbers, patterns). Enforce "explain to execute": the atom determines which submodules/heads are allowed to contribute in the next step and which evidence is admissible. Keep the DSL small and domain-scoped to avoid bloat and ensure strong fidelity.
Suggested approach (a sketch of the atom-to-mask coupling follows this list):
- DSL (first pass, 8-12 atoms)
- QA: FETCH(span_id), HOP(via_anchor), COMPARE(span_a, span_b, key), AGG(list, op), SELECT(condition), ASSERT(supports(answer)), CITE(span_id).
- Math: PARSE_NUM(span), APPLY(op, args), KEEP(track_id), CHECK(invariant), FORMAT(result).
- Execution coupling
- When an atom fires, apply compile-time masks: limit attention to nominated spans, route to dedicated submodules (e.g. arithmetic/date kernels), and activate a small set of heads whose static docs match the atom (HeadDocs-to-DSL mapping).
- At each step, the trace channel proposes 0-2 atoms with confidences; if confidence < tau, default to unconstrained decoding but flag "no-trace" for transparency.
- Supervision
- Weak signals: distant supervision from retrieval citations, gold answer spans, program-of-thought datasets (math), and tool returns. No reliance on ground-truth explanations at scale.
- Consistency: require that emitted atoms and HeadDocs agree when both fire; mismatches incur penalties or block the atom.
- Guardrails
- Budget per step: at most one FETCH and one APPLY per token budget.
- Abstain when inputs are ambiguous; do not fabricate arguments.
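A minimal sketch of the atom-to-mask coupling: a high-confidence FETCH or HOP admits only its nominated span for the next step, while low-confidence proposals fall back to unconstrained decoding. The atom schema, threshold, and additive-mask convention are assumptions.

```python
# Sketch: turn a fired atom into an additive attention mask for the next decoding step
# (0 = admissible, -inf = blocked). Only FETCH/HOP are handled here; APPLY would route
# to a dedicated submodule instead.
from dataclasses import dataclass
from typing import Optional
import torch

@dataclass
class Atom:
    op: str                        # e.g. "FETCH", "HOP", "APPLY", "CITE"
    args: tuple                    # e.g. (span_start, span_end)
    confidence: float

def attention_mask_for(atom: Optional[Atom], seq_len: int, tau: float = 0.5) -> torch.Tensor:
    mask = torch.zeros(seq_len)
    if atom is None or atom.confidence < tau:
        return mask                # low confidence: unconstrained decoding, flagged "no-trace"
    if atom.op in ("FETCH", "HOP"):
        start, end = atom.args
        mask[:] = float("-inf")
        mask[start:end] = 0.0      # only the nominated span is admissible evidence
    return mask

atom = Atom(op="FETCH", args=(12, 20), confidence=0.9)
print(attention_mask_for(atom, seq_len=32))
```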
Why: A small, typed set of atoms is enough to make traces legible and enforce faithfulness. Gating makes cheating costly: if the atom says FETCH(span), only that span is accessible, so the model can't secretly use other facts.
Risks:
- Coverage gaps: DSL may miss useful operations.
- Over-constraint: masks too tight can hurt accuracy early on.
- Atom spam: the model may emit atoms unnecessarily; needs a budget and MDL penalties.
Goals (3-6 months):
- QA: >=30% reduction in unsupported answers (answers with no valid citation) vs baseline, >=85% of correct answers have at least one FETCH/CITE whose removal flips the outcome.
- Math: >=70% of correct solutions accompanied by APPLY/ASSERT traces that pass automated step checks, <=10% verbosity overhead in tokens due to the trace channel.
- Live trace viewer: interactive "stack" for both domains, qualitative user studies indicate >0.8 usefulness ratings for debugging.
Possible procedure:
1) Initialize
- Add a parallel trace head (small decoder) conditioned on the main model's hidden states, share early layers.
- Define DSL schema and static mapping from HeadDocs codebook tokens to eligible atoms (e.g. COREF_ANTE to HOP, SUM_NUMS to AGG, DATE_ADD to APPLY(date_add)).
- Implement masks: attention mask for FETCH/HOP, module router for APPLY, whitelist heads per atom based on static docs.
2) Calibrate
- Build small training pools:
- QA: Hotpot-like passages with annotated support sentences/spans, open-domain reduced to provided contexts.
- Math: GSM8K with synthetic tool-verified step checks (unit tests for arithmetic, date ops).
- Measure baseline unsupported answer rate, spontaneous citations.
3) Warm-start (supervised)
- Train the trace head to imitate weak labels (support spans, tool steps) where available, apply MDL length penalty and abstain option.
- Enforce atom-to-mask coupling in a fraction of steps to acclimate; keep the base LM frozen and monitor task performance.
4) Couple (explain-to-execute)
- Gradually increase fraction of steps where masks are enforced, use a KL trust region to avoid large regressions.
- Penalize "off-trace usage": if gradients/attention mass flow outside allowed spans during masked steps, add loss.
5) Consistency with HeadDocs
- Jointly train with Project 1 so that when an atom fires, the active heads' dynamic docs match the atom class, add a small cross-entropy alignment loss.
- Break ties by preferring HeadDocs if atom confidence is low.
6) Guard
- Atom budget: max atoms per K tokens, high cost for atoms that don't change downstream behavior (measured by ablation).
- Drift alarms: if unsupported answers creep up or trace coverage plunges, reduce masks and retrain.
7) Evaluate
- Fidelity: same mask-the-citation tests as Project 1, now at the step level. Measure fraction of runs where removing fetched spans or preventing an APPLY step flips the final answer.
- Efficiency: overhead of masks, module routing, and trace tokens; aim for <=15% latency increase.
8) Refresh
- Add 2-3 atoms based on observed gaps (e.g. MATCH_DATE, JOIN_TABLE) only if they clear an MDL/utility threshold.
- Distill: occasionally fine-tune to reduce reliance on masks while keeping traces stable.
Literature:
Neuro-Symbolic Approach to Certified Scientific Software Synthesis, https://www.researchgate.net/publication/382156788_Neuro-Symbolic_Approach_to_Certified_Scientific_Software_Synthesis
Motivation: Models can learn to emit plausible but non-causal traces. Force faithfulness by training under randomized or hidden-information regimes.
Idea: Randomized ablation and counterfactuals during training. Replace would-be cited spans with paraphrases or foils and require the trace (and answer) to update. Inject "trace traps" where the only way to succeed is to follow the declared atoms.
Possible procedure:
1) Generate foils: paraphrase or swap entities in candidate spans, mark them.
2) Train: in 20-30% of batches, replace a cited span with a foil; require the model to either (a) change its answer accordingly or (b) abstain/emit uncertainty; penalize traces that remain unchanged.
3) Evaluate: measure the "trace flip rate" under counterfactuals (sketched below); aim for >=0.7 on curated sets.
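A toy sketch of the foil test and trace flip rate, assuming the system can be called as a function returning an answer and the cited span; the entity list and swap rule are illustrative.

```python
# Sketch: counterfactual foil evaluation. Swap an entity inside the cited span, rerun,
# and count how often the answer changes. `run_model` stands in for the traced system.
import random

def make_foil(span: str, entities: list[str], rng: random.Random) -> str:
    for e in entities:
        if e in span:
            swap = rng.choice([x for x in entities if x != e])
            return span.replace(e, swap)
    return span                                      # no known entity: leave unchanged

def trace_flip_rate(examples, run_model, entities, seed: int = 0) -> float:
    rng = random.Random(seed)
    flips, total = 0, 0
    for ex in examples:
        answer, cited_span = run_model(ex["context"])
        foil = make_foil(cited_span, entities, rng)
        if foil == cited_span:
            continue                                 # nothing to swap; skip this example
        new_answer, _ = run_model(ex["context"].replace(cited_span, foil))
        flips += int(new_answer != answer)           # the trace should force an update
        total += 1
    return flips / max(total, 1)
```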
Signs it's working:
- High causal flip rates when cited spans are ablated.
- Short, stable per-head docs that generalize across datasets.
- Trace channel coverage rising over time without atom spam.
- Human auditors can follow the stack trace to verify answers quickly.
How this differs from prior work (and weaknesses addressed):
- Chain-of-thought and rationales can be unfaithful and verbose. Our atoms/docstrings are minimal, typed, and coupled to execution via masks.
- Attention-as-explanation is weak, we combine attention with reconstruction and interventional tests.
- Sparse autoencoders produce features but not live, causal traces, we add citations and enforce use.
- Program-of-thoughts can overfit to template programs, our DSL is tiny, with abstention, and is only partially enforced to avoid brittleness.
Integration points and non-bloat scoping:
- Scope strictly to two domains (multi-hop QA over provided context, GSM8K-like math).
- Keep the codebook <=128 tokens, the DSL <=12 atoms, and k<=3 citations per step.
- Enforce masks in a gradually increasing fraction of steps, avoid full hard enforcement early.
- Defer ambitious expansions (full formal proofs, rich ontologies) to later phases, ship a working live trace with causal guarantees first.