This document describes the research topics we are opening for fellowships. Each topic suggests several projects indicative of the scope and complexity we expect you to undertake.
12 places will be opened for the 2025-2026 term, with a duration of 6 or 12 months at the applicant's option.
"High risk research" means ambitious projects that may not produce results, but will have lasting impact if they work.
We believe that publication in AI/ML has become overly biased towards short-term and iterative results, despite major gaps in our understanding of how to build optimal models.
The intent of these fellowships is to give awardees uninterrupted time to focus on harder open problems, with adequate compute, talented peers, and weekly 1:1 mentorship from senior researchers, but without chasing specific metrics.
You will be expected to spend about 80% of your time on your own research, and up to 20% of your time either assisting other fellows or participating in wider research programs at IMI. We focus largely on applied research for online security problems, but often publish and support frontier research aligned with our broader interests. Serving hundreds of millions of people gives us a unique perspective on what works at scale.
Eligibility: previous fellows and research staff have come from disparate backgrounds, including early-career researchers previously at MSR, FAIR, Mila, MPI, etc., and self-taught mid-career engineers transitioning into research. We do not discriminate on the basis of pedigree or age. If you have done interesting work, that is enough. You may reside anywhere in the world, excluding sanctioned jurisdictions. This will be a remote fellowship.
Deliverables: we do not set hard targets, but generally aim for 1-2 papers with code per year, targeting NeurIPS, ICML, ICLR, etc. At the end of your fellowship, if the threshold for publication at a conference or in a journal is not met, we will expect a final report, which may be published as a blog post.
Deadline: fellows are admitted in two cohorts. Deadlines for consideration: October 1, 2025 and February 1, 2026, each followed by a three-week decision period. Applications are reviewed on a rolling basis thereafter.
Compensation: a competitive, location-adjusted stipend, plus travel support for conferences with accepted papers.
Applying: send a brief bio / CV link via this page. Include 1) the topic you are interested in working on, 2) a few lines on any relevant prior work you've done, 3) your GitHub / Scholar / X links, 4) your desired start date, duration, and other obligations (if any) during that period, and 5) a brief analysis of one of the projects outlined below.
Each project intentionally includes some gaps or glosses. List the ones you see, and how you'd solve them. Alternatively, if you dislike the projects outlined under a particular topic, write up your own idea and why it is more promising, along with your estimate of time and compute required.
Selection criteria: novelty and importance, clarity of approach, feasibility given time/compute, alignment with topics. Panel review and one interview.
Selection will be based solely on merit. IMI is an equal opportunity employer, and does not discriminate on the basis of age, disability, sex, orientation, race, religion or belief. We promote equality of opportunity for all, and welcome applications from anyone with talent, skills and potential.
Thesis: training power usage should be many orders of magnitude lower. No single change will get us there, but a plausible research program is to:
1) make most tokens unnecessary,
2) quickly get to a good solution analytically or with amortized learners,
3) learn to use external structure rather than training every fact into the weights, and
4) combine second order methods with low precision training to converge in fewer, larger steps.
Motivation: training big networks via SGD is hideously inefficient. Can approximate analytic solutions provide a 100x reduction in training time?
Idea: use curvature information from a tiny fraction of the corpus, solve the resulting quadratic problem once per block, then de-linearize by injecting the update through small gates so that the network stays in its local linear regime.
Suggested approach: estimate a good solution by solving a sketched natural-gradient step in closed form. Stream the corpus once to estimate a block-diagonal or Kronecker-factored Fisher (or Gauss-Newton) in each layer using random projections, compute the ridge solution for linearized parameters, de-linearize by composing layers near identity.
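To make the closed-form step concrete, here is a minimal numpy sketch for a single linear layer, assuming a Kronecker-factored Fisher/Gauss-Newton approximation estimated from a small sample of layer inputs and output-side gradients; the dimensions, damping value, and random data are placeholders.

```python
# Sketch: one-shot K-FAC-style ridge step for one linear layer (illustrative only).
# A_samples / G_samples stand in for layer inputs and backpropagated output gradients
# collected from a single streaming pass over a corpus sketch; lambda_ is the damping term.
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, n_sketch = 256, 128, 1024

A_samples = rng.standard_normal((n_sketch, d_in))    # layer inputs
G_samples = rng.standard_normal((n_sketch, d_out))   # output-side gradients
grad_W = G_samples.T @ A_samples / n_sketch          # gradient of the linearized loss

# Kronecker factors of the curvature: F ~= A_cov (x) G_cov.
A_cov = A_samples.T @ A_samples / n_sketch
G_cov = G_samples.T @ G_samples / n_sketch

lambda_ = 1e-2                                       # ridge / damping
A_inv = np.linalg.inv(A_cov + lambda_ * np.eye(d_in))
G_inv = np.linalg.inv(G_cov + lambda_ * np.eye(d_out))

# Closed-form natural-gradient step via the Kronecker identity:
# (A (x) G + damping)^-1 vec(grad) ~= vec(G_inv @ grad_W @ A_inv).
delta_W = G_inv @ grad_W @ A_inv
print(delta_W.shape)                                 # (d_out, d_in)
```

In the full proposal this step would be solved per block from the streamed, sketched Fisher estimates, then injected through small gates as described above.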
Why: layerwise K-FAC/Shampoo already works, randomized second-order solvers are mature, neural nets are close to linear early in training. A high-quality one-shot quasi-Newton init could remove 90-99% of steps.
Risks: linearization error, imperfect Fisher, stability in deep stacks. (+ still needs a brief fine-tune at the end). LLMs also benefit from feature learning outside the lazy/NTK regime, where analytic linearized steps help least.
Possible procedure:
Literature:
Kronecker-Factored Approximate Curvature for Modern Neural Network Architectures
SKFAC: Training Neural Networks with Faster Kronecker-Factored Approximate Curvature
Near-optimal Sketchy Natural Gradients for Physics-Informed Neural Networks
Motivation: storing facts in weights is wasteful and gets stale. Can we externalize knowledge as an explicit, updatable hierarchy the model learns to write to and query?
Idea: make the model 1) induce its own ontology of concept nodes, typed edges, and table schemas, and 2) plan short programs over it. The LM becomes a planner/fuser, facts live in a compact, interpretable store.
Suggested approach: build a small, learned hierarchy: nodes with hyperbolic/tree codes and prototypes, sparse typed edges with evidence, leaves as text passages and table rows. Train write (create/link) and plan (DSL) heads, retrieve tiny typed subgraphs/rows, fuse with gated cross-attention, penalize param-only answers.
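One possible shape for the store, sketched as Python dataclasses; the field names, the typed-edge schema, and the retrieval unit are illustrative assumptions rather than a fixed design.

```python
# Sketch of the external knowledge store: concept nodes, typed edges, and leaf evidence.
# Field names and types are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class Node:
    node_id: int
    prototype: list[float]                              # prototype / hyperbolic tree code
    label: str = ""                                      # optional human-readable name
    leaves: list[str] = field(default_factory=list)      # text passages / table rows

@dataclass
class Edge:
    src: int
    dst: int
    relation: str                                        # typed relation, e.g. "is_a"
    evidence: list[str] = field(default_factory=list)    # spans justifying the link

@dataclass
class Store:
    nodes: dict[int, Node] = field(default_factory=dict)
    edges: list[Edge] = field(default_factory=list)

    def write(self, node: Node) -> None:                 # target of the learned write head
        self.nodes[node.node_id] = node

    def link(self, edge: Edge) -> None:
        self.edges.append(edge)

    def neighborhood(self, node_id: int) -> list[Edge]:
        # The tiny typed subgraph handed to the LM's gated cross-attention fusion.
        return [e for e in self.edges if node_id in (e.src, e.dst)]
```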
Why: transformers already form latent hierarchies. Making them explicit improves sample efficiency, updatability, and interpretability. RAG/FiD, Poincare embeddings, schema induction, and programmatic retrieval provide working parts.
Risks: ontology drift/fragmentation, noisy links, planner brittleness, latency/complexity. Model may bypass the store. Some fast/robust knowledge must remain parametric.
Possible procedure:
Literature:
Memory^3: Language Modeling with Explicit Memory
Ontology Generation using Large Language Models
Motivation: unconstrained writes bloat the store and kill interpretability. Can we grow a compact, legible hierarchy by starting ultra-sparse and relaxing only when utility is proven?
Idea: treat create node/edge/schema as gated actions with an explicit cost. Start with near-zero write capacity, force reuse/links, relax L0/MDL penalties as evidence accumulates that new structure reduces loss.
Suggested approach: hard-concrete gates per write op (create/link/split), group-lasso over relation types, and an MDL-style budget. A staged schedule: 1) link-only, 2) controlled splitting, 3) schema induction, 4) relaxed growth. Each write gets a justification (support spans, proto summary), enabling audit and rollback.
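A minimal PyTorch sketch of one such gate, following the standard hard-concrete / L0 relaxation; the stretch constants are the usual defaults and the write-cost weight is a placeholder.

```python
# Sketch: hard-concrete gate over a single write op (create node/edge/schema).
# beta, gamma, zeta follow the usual hard-concrete constants; the 1e-3 write-cost
# weight is an illustrative assumption.
import torch
import torch.nn as nn

class HardConcreteGate(nn.Module):
    def __init__(self, beta: float = 2 / 3, gamma: float = -0.1, zeta: float = 1.1):
        super().__init__()
        self.log_alpha = nn.Parameter(torch.zeros(()))   # learned gate logit per write op
        self.beta, self.gamma, self.zeta = beta, gamma, zeta

    def forward(self) -> torch.Tensor:
        if self.training:                                # stochastic, differentiable sample
            u = torch.rand(())
            s = torch.sigmoid((torch.log(u) - torch.log(1 - u) + self.log_alpha) / self.beta)
        else:                                            # deterministic at eval time
            s = torch.sigmoid(self.log_alpha)
        s_bar = s * (self.zeta - self.gamma) + self.gamma
        return s_bar.clamp(0.0, 1.0)                     # gate in [0, 1], often exactly 0 or 1

    def expected_l0(self) -> torch.Tensor:
        # P(gate != 0): summed over write ops this is the MDL-style write budget term.
        return torch.sigmoid(
            self.log_alpha - self.beta * torch.log(torch.tensor(-self.gamma / self.zeta))
        )

gate = HardConcreteGate()
write_cost = 1e-3 * gate.expected_l0()   # added to the loss to discourage new structure
```

Relaxing the L0/MDL penalty over the staged schedule then amounts to annealing the weight on this expected-L0 term.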
Why: early noise is what makes ontologies sprawl. A sparsity curriculum preserves tree-like structure, encourages reuse, and keeps concepts interpretable while still allowing growth when it pays off.
Risks: over-sparsity (missed concepts), late discovery tax (hard to recover if you never split), planner gaming the penalties, merge instability.
Possible procedure:
Signals it's working: sublinear graph growth, rising citation rate, stable small branching factors, and large accuracy drops when the store is ablated in the domains it covers.
Literature:
Crisp Attention: Regularizing Transformers via Structured Sparsity, https://arxiv.org/abs/2508.06016
The Role of Sparsity for Length Generalization in Transformers, https://arxiv.org/pdf/2502.16792
White-Box Transformers via Sparse Rate Reduction: Compression Is All There Is?, https://arxiv.org/abs/2311.13110
Growing neural networks: dynamic evolution through gradient descent, https://arxiv.org/abs/2501.18012
Identifying hubs in directed networks, https://arxiv.org/pdf/2312.03347
GKG-LLM: A Unified Framework for Generalized Knowledge Graph Construction, https://arxiv.org/html/2503.11227v2
Motivation: pretraining rereads the same patterns and reasoning chains. If models learn reusable templates, we should only pay for residual bits, not repetitions. Maybe direct + hierarchical compressibility can cut token exposure.
Idea: turn pretraining into compression. Learn a hierarchical dictionary of semantic spans and reasoning templates. Encode the corpus into 1) template calls with slot fills and 2) residual tokens the dictionary can't predict. Train primarily on the residual, keep the dictionary executable so a few examples teach many instances.
Suggested approach: 1) a neural compressor that induces macros (semantic LZ) and templates (reasoning VM) with MDL pressure, and 2) a residual trainer that samples only novel bits with importance weighting. Refresh the dictionary online so the residual steadily shrinks.
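A toy sketch of the MDL accounting for admitting a macro into the dictionary, using raw token counts as a stand-in for code lengths under the current model; the corpus, n-gram candidates, and vocabulary size are illustrative.

```python
# Toy MDL check: keep a macro only if the bits it saves across the corpus exceed the
# bits needed to describe it once in the dictionary. Residual = tokens no admitted
# macro covers; only those feed the residual trainer.
import math
from collections import Counter

def description_bits(macro: tuple[str, ...], vocab_size: int) -> float:
    return len(macro) * math.log2(vocab_size)            # cost of writing the macro once

def savings_bits(macro: tuple[str, ...], occurrences: int, vocab_size: int) -> float:
    per_use = (len(macro) - 1) * math.log2(vocab_size)   # each use becomes one call token
    return occurrences * per_use

def admit(macro, occurrences, vocab_size=50_000) -> bool:
    return savings_bits(macro, occurrences, vocab_size) > description_bits(macro, vocab_size)

corpus = ["the cat sat on the mat", "the cat sat on the rug"]
ngrams = Counter()
for line in corpus:
    toks = line.split()
    for n in (3, 4):
        for i in range(len(toks) - n + 1):
            ngrams[tuple(toks[i:i + n])] += 1

dictionary = {m for m, c in ngrams.items() if admit(m, c)}
print(sorted(dictionary))   # repeated spans pass the MDL test; one-off spans stay residual
```

A real compressor would replace the log2(vocab) costs with code lengths under the evolving model, which is what keeps the residual shrinking as the dictionary is refreshed.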
Why: language is highly compressible, gradient contributions are heavy-tailed. Grammar induction, dataset distillation, and macro tokenization already show large sample cuts. If chain-of-thought compresses into a small set of schemas, we need far fewer demonstrations.
Risks: over-compression (miss rarities), drift as the model/dictionary co-evolve, template brittleness, compressor overhead. Requires tight guards to avoid bias and loss spikes.
Possible procedure:
Expected outcome: train on 3-10% of original tokens at steady state for compressible domains. Large wall-clock/power savings in early training, better factual and reasoning generalization per token via explicit reuse of templates, fast adaptation by updating the dictionary.
Literature:
Language Modeling is Compression, https://arxiv.org/abs/2309.10668
Compression Represents Intelligence Linearly, https://arxiv.org/pdf/2404.09937
Entropy Law: The Story Behind Data Compression and LLM Performance, https://arxiv.org/html/2407.06645v1
Thesis: people learn from their mistakes. Models don't, but pass@512 works better than pass@1, indicating the answer is often already within reach. Let's make max(pass@N) == pass@1:
1) turn search into supervision,
2) convert in-context fixes into weights,
3) compile failures into repairs, and
4) keep gains without regressions/forgetting.
Rough ideas: 1) harvest pass@N trees with verifiers and partial checks, learn prefix value functions, and distill winner prefixes into the policy. 2) trigger on surprise from tool returns, crystallize traces into reusable templates and low-rank adapters with trust-region commits. 3) align fail vs. win traces to learn detectors and conditional rewrites, store them as routed patches and planner rules. 4) bound updates with KL/Fisher caps, run canaries, use selective counterfactual replay, and periodically distill validated patches into the main model.
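As one concrete instance of idea 4 (keeping gains without regressions), a minimal sketch of a KL-capped commit: a candidate patch is merged only if the KL divergence from a frozen reference on a canary batch stays under a budget. The model, patch function, and budget value are illustrative assumptions.

```python
# Sketch: accept a candidate weight patch only if KL(reference || patched) on canary
# inputs stays under a trust-region budget; otherwise roll back. kl_budget is a placeholder.
import copy
import torch
import torch.nn.functional as F

def kl_on_canaries(ref_model, patched_model, canary_inputs) -> torch.Tensor:
    with torch.no_grad():
        ref_logp = F.log_softmax(ref_model(canary_inputs), dim=-1)
        new_logp = F.log_softmax(patched_model(canary_inputs), dim=-1)
    # KL(reference || patched), averaged over the canary batch.
    return F.kl_div(new_logp, ref_logp, log_target=True, reduction="batchmean")

def commit_if_safe(model, patch_fn, canary_inputs, kl_budget: float = 0.02) -> bool:
    ref = copy.deepcopy(model).eval()
    patch_fn(model)                                  # e.g. merge a validated low-rank adapter
    if kl_on_canaries(ref, model, canary_inputs) > kl_budget:
        model.load_state_dict(ref.state_dict())      # revert the patch
        return False
    return True
```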
Motivation: many problems succeed with pass@64-512 but miss on pass@1 because early plan choices diverge. Can we shift policy to prefer winners without leaking finals or eval labels?
Idea: distill the advantage of early prefixes from winning traces over near-misses. Learn to emit better plan tokens, decompositions, and invariants in the first few steps, under a KL trust region and without training on final answers.
Suggested approach: mine pass@N logs with a verifier. Extract prefixes up to the first tool call or the first K tokens. Run preference learning on winner vs. loser prefixes, plus step-wise advantage-weighted updates from partial checks. Update the planner head and small early-layer adapters; freeze late layers.
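A sketch of the preference step, assuming DPO-style training on summed token log-probabilities of winner vs. loser prefixes under the policy and a frozen reference; beta and the toy numbers are placeholders.

```python
# Sketch: DPO-style preference loss on winning vs. losing plan prefixes mined from
# pass@N logs. logp_* are summed token log-probs of each prefix (prompt + prefix up to
# the first tool call or K tokens) under the policy and the frozen reference.
import torch
import torch.nn.functional as F

def prefix_dpo_loss(logp_win_policy, logp_lose_policy,
                    logp_win_ref, logp_lose_ref, beta: float = 0.1) -> torch.Tensor:
    # Margin: how much more the policy (relative to the reference) prefers the winner.
    margin = beta * ((logp_win_policy - logp_win_ref) - (logp_lose_policy - logp_lose_ref))
    return -F.logsigmoid(margin).mean()

# Toy usage with fake log-probs for a batch of four mined prefix pairs.
lw_p = torch.tensor([-12.0, -9.5, -11.0, -8.0])
ll_p = torch.tensor([-13.0, -10.0, -10.5, -9.0])
lw_r = torch.tensor([-12.5, -9.8, -11.2, -8.5])
ll_r = torch.tensor([-12.8, -9.9, -10.8, -8.9])
print(prefix_dpo_loss(lw_p, ll_p, lw_r, ll_r))
```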
Why: early decisions carry most of the causal weight on success, pass@N pools expose reliable winner patterns, DPO/IPO and advantage-weighted regression work with preferences and partial rewards, avoiding ground-truth leakage.
Risks: spurious correlations in prefixes, verbosity inflation, domain shift from synthetic near-misses, regressions if KL is loose.
Possible procedure:
Expected outcome: pass@1 gains from the same pass@N budget, localized updates that do not overfit finals, minimal compute overhead beyond log mining and short preference training.
Literature:
Rewarding Progress: Scaling Automated Process Verifiers for LLM Reasoning, https://arxiv.org/pdf/2410.08146
RECURSIVE INTROSPECTION: Teaching Foundation Model Agents How to Self-Improve, https://openreview.net/pdf?id=UPoQqreegH
Counterspeech the ultimate shield! Multi-Conditioned Counterspeech Generation through Attributed Prefix Learning, https://arxiv.org/pdf/2505.11958
Motivation: best-of-N search induces an implicit decision tree where early nodes determine success. Can we turn those trees into a value signal over plan tokens and perform policy iteration to improve first-shot behavior?
Idea: build a prefix value model that predicts expected verifier success given a partial plan. Estimate Q over plan decisions from search trees and partial checks, then perform KL-regularized policy improvement on the planner head.
Suggested approach: construct shallow trees from pass@N rollouts, annotate nodes with verifier outcomes and partial rewards, and compute soft Q via backward value propagation. Train a small critic over prefixes, and improve the planner policy with advantage-weighted cross-entropy while freezing late layers.
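A small sketch of the backward value propagation, assuming leaves carry verifier or partial-check rewards and internal nodes back up a soft (log-sum-exp) value over children; the tree shape and temperature are illustrative.

```python
# Sketch: soft-Q backup over a shallow pass@N search tree. Leaf rewards come from the
# verifier / partial checks; internal nodes take a temperature-smoothed maximum over
# children. The resulting advantages weight the planner's cross-entropy update.
import math
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class TreeNode:
    reward: Optional[float] = None                   # set on leaves by the verifier
    children: list["TreeNode"] = field(default_factory=list)

def soft_q(node: TreeNode, tau: float = 0.5) -> float:
    if not node.children:
        return node.reward if node.reward is not None else 0.0
    vals = [soft_q(c, tau) for c in node.children]
    # Log-sum-exp backup: interpolates between mean (large tau) and max (small tau).
    return tau * math.log(sum(math.exp(v / tau) for v in vals) / len(vals))

# Toy tree: a root plan choice with a mostly-failing branch and a mostly-winning branch.
root = TreeNode(children=[
    TreeNode(children=[TreeNode(reward=0.0), TreeNode(reward=0.2)]),
    TreeNode(children=[TreeNode(reward=1.0), TreeNode(reward=0.7)]),
])
values = [soft_q(child) for child in root.children]
advantages = [v - soft_q(root) for v in values]
print(values, advantages)                            # second branch carries positive advantage
```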
Why: tree policy iteration is a principled way to learn from search, verifier and partial checks provide dense rewards, KL-regularized policy updates are stable and sample-efficient.
Risks: value leakage if trees are overfit, instability from off-policy bias, verbosity creep, and interference with non-searched domains.
Possible procedure:
Expected outcome: first-shot accuracy approaches best-of-N without extra sampling at inference, stable, interpretable improvements via early plan value shaping, modest compute overhead with strong sample efficiency.
Thesis: Neural nets can implicitly learn small, reusable programs (sin/cos, sorting, date arithmetic, unit conversion) encoded as superpositions in their weights. We should extract these programs into explicit, verifiable modules and make models call them, reducing parameter bloat, improving generalization, and enabling updates without retraining. If this works well we'll next tackle composability, i.e. a 'neural linker' step.
Motivation: Storing algorithms in weights is opaque, hard to update, and redundancy-prone. If a net "knows" trig, a calendar, or a regex engine, we should externalize them as code and prune the corresponding circuits.
Idea: Build a compiler that maps network behaviors on targeted subspaces into a small typed intermediate representation of numerical and symbolic modules, with verification tests and equivalence checks. Replace the discovered circuit with a call to the extracted module.
Approach:
- Scope: Identify candidate latent programs by probing for low-entropy, low-rank, highly repeatable behaviors at specific layers/heads like 'angle to sin(angle)', 'string to match(pattern)'.
- Specification mining: Generate test benches using counterfactual inputs, invariance checks (e.g. sin(x+2pi)=sin(x)), and smoothness/periodicity detectors, then fit candidates from a library (trig, polynomials, finite-state transducers, arithmetic, set ops). A toy sketch follows this list.
- Synthesis: Combine sparse identification of nonlinear dynamics (SINDy-style), symbolic regression, and enumerative search over a compact IR (typed SSA with vector ops, conditionals, and bounded loops).
- Verification: Run equivalence tests against the original subnetwork over adversarial and randomized inputs, then certify error bounds and input domain. If verified, freeze/retire the circuit and insert a differentiable "call-module" stub with gradients routed to arguments, not the replaced weights.
- Maintenance: Version modules, track provenance and input domains, and maintain a regression suite. Allow hot-swaps (e.g. better trig approximations) without touching base weights.
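A toy sketch of the specification-mining step, assuming the isolated circuit can be queried as a black-box input-to-output map (faked here with a noisy sin); the candidate library, tolerances, and noise level are illustrative.

```python
# Sketch: test-bench mining for a probed subnetwork suspected of computing sin().
# `circuit` stands in for the isolated circuit; the real pipeline would probe the
# network's targeted subspace instead.
import numpy as np

rng = np.random.default_rng(0)

def circuit(x: np.ndarray) -> np.ndarray:
    return np.sin(x) + 1e-4 * rng.standard_normal(x.shape)   # stand-in for the circuit

LIBRARY = {"sin": np.sin, "cos": np.cos, "identity": lambda x: x}

def mine_spec(fn, n: int = 2048, tol: float = 1e-2) -> dict:
    x = rng.uniform(-10, 10, n)
    y = fn(x)
    report = {}
    # Invariance check: periodicity with period 2*pi.
    report["periodic_2pi"] = bool(np.max(np.abs(fn(x + 2 * np.pi) - y)) < tol)
    # Candidate fit: library function with the lowest max error on randomized inputs.
    errors = {name: float(np.max(np.abs(g(x) - y))) for name, g in LIBRARY.items()}
    best = min(errors, key=errors.get)
    report["best_candidate"] = (best, errors[best])
    report["verified"] = errors[best] < tol and report["periodic_2pi"]
    return report

print(mine_spec(circuit))   # a verified match would trigger synthesis + circuit retirement
```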
Why: Neural circuits often realize simple, reusable functions. Explicit modules are easier to verify, cache, optimize, and upgrade. Synthesis with invariants curbs overfitting, and explicit calls avoid recomputation and shrink parameter/activation footprints.
Risks:
- Misspecification: Wrong library or IR misses real behavior, producing brittle extractions.
- Distributed representations: Useful programs spread across layers/heads, hard to isolate.
- Verification gaps: Passing tests but failing on rare regimes.
- Integration tax: Latency/ABI overhead and gradient mismatch at the call boundary.
Goals:
- 30-60% of occurrences of common algorithmic skills offloaded to modules with certified error <=1e-6 on their domains.
- 10-30% parameter and 10-25% activation reduction at equal or better quality on tasks invoking those skills.
- Measurable gains in out-of-distribution robustness for offloaded skills (e.g. long-range dates, big numbers).
Literature:
Opening the AI Black Box: Distilling Machine-Learned Algorithms into Code, https://www.mdpi.com/1099-4300/26/12/1046
Deriving Equivalent Symbol-Based Decision Models from Feedforward Neural Networks, https://arxiv.org/html/2504.12446v1
Self-Guided Function Calling in Large Language Models via Stepwise Experience Recall, https://arxiv.org/html/2508.15214v1
Motivation: Even with a library, models won't use it unless calling is easy and rewarded. We need a clean ABI (types, shapes, side effects) and a training regime that prefers external calls over re-learning in weights.
Idea: Introduce a differentiable application binary interface (dABI) and "call" tokens that route subproblems to external modules. Train with an MDL-style objective: prefer short "call+args" programs over long parametric computation. Use write-through learning so successful in-context calls are consolidated into persistent call policies.
Approach:
- ABI design: Typed arguments (scalars, vectors, strings, sets), effect annotations (pure/impure), shape contracts, and gradient rules (exact, straight-through, or stop-grad). A minimal sketch follows this list.
- Router: A lightweight planner head predicts when to call, which module, and with what arguments. Provide partial credit via differentiable surrogates (e.g. relaxed arg parsing, soft alignment of spans to args).
- Costs and rewards: Penalize param-only solutions when a verified module exists; reward correct calls with small KL bonuses and latency/energy credits; enforce per-batch "offload budgets" to shift usage gradually.
- Pruning: After stable adoption, gradually L0-prune circuits shadowed by calls, and keep a safety adapter to catch drift and trigger re-training.
- Continual learning: When no module fits, log traces and auto-propose new candidates for Project 1 to compile. Close the loop: discovered modules immediately become callable via the dABI.
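A minimal sketch of what a typed module signature and call stub might look like under the dABI, assuming a straight-through gradient rule; the ModuleSig fields and the trig example are illustrative, not a spec.

```python
# Sketch: typed module signature plus a call stub that routes gradients straight through
# to the arguments (one of the gradient rules named above). Names are illustrative.
from dataclasses import dataclass
from typing import Callable
import torch

@dataclass(frozen=True)
class ModuleSig:
    name: str
    arg_types: tuple[str, ...]        # e.g. ("scalar",) or ("string", "string")
    ret_type: str
    pure: bool                        # effect annotation: True = no side effects
    grad: str                         # "exact" | "straight_through" | "stop_grad"

class CallStub(torch.autograd.Function):
    """External module call with a straight-through gradient to the arguments."""
    @staticmethod
    def forward(ctx, args: torch.Tensor, fn: Callable[[torch.Tensor], torch.Tensor]):
        return fn(args.detach())

    @staticmethod
    def backward(ctx, grad_out):
        return grad_out, None         # pass the gradient back to the arguments only

SIN = ModuleSig("sin", ("scalar",), "scalar", pure=True, grad="straight_through")

x = torch.tensor([0.1, 1.2], requires_grad=True)
y = CallStub.apply(x, torch.sin)      # in practice the router picks module + arguments
y.sum().backward()
print(x.grad)                         # straight-through: gradient of ones reaches the args
```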
Why: Models already plan tool use; making library calls first-class and cheap creates selection pressure to externalize. A typed ABI contains complexity, reduces integration bugs, and allows acceleration (e.g. vectorized trig kernels).
Risks:
- Over-calling: Router overuses modules where param paths suffice.
- Cold start: Early errors discourage calls; a careful curriculum and safety nets are needed.
- Gradient pathologies: Approximate gradients through discrete calls can bias training.
- Library sprawl: Too many niche modules increase complexity and latency.
Goals:
- >=80% correct module call rate on benchmarks containing extractable skills.
- Net wall-clock speedups of 1.3-2.0x on workloads heavy in offloaded operations.
- Stable or improved task accuracy with >=70% reduction in gradients flowing through replaced circuits.
Thesis: Today's 'explanations' are mostly rationalizations written after the fact. Let's bake interpretability into the forward pass: layers and heads get compact, typed docstrings and citations that (a) summarize what each component is doing on this input, (b) constrain what information is allowed to flow next ('explain to execute'), and (c) compose into a real-time execution path/stack trace. Target two domains initially (multi-hop QA over provided context, grade-school math word problems) to demonstrate faithful, causal traces that show the facts and logic used at each step.
Motivation: Attention heads and MLPs often specialize, but their behavior is hidden. We want two levels of documentation: (1) a static capability docstring per head/block, and (2) a dynamic "call-site docstring" on each token that declares what the unit is doing now and which facts it uses.
Idea: Add a small documentation head and a citation head to each attention head and MLP block. The documentation head emits a short program sketch from a tiny vocabulary (the codebook), and the citation head names the spans/tokens used as evidence. Train them to be (i) predictive of the unit's effect, (ii) minimal via MDL pressure, and (iii) causally faithful by enforcing that downstream computation depends on the cited evidence and declared operation.
Suggested approach (a code sketch follows this list):
- Representations
- Static capability vector z_h per head/block with a canonical docstring (learned once, versioned).
- Dynamic call-site docstring d_h,t per token/time: a short sequence in a constrained DSL (e.g. COPY_FROM(span), MATCH(pattern), COREF_ANTE(span), AGG(NUM, window), DATE_ADD, ARGMAX(key), ROUTE_TO(node), ASSERT(invariant)).
- Citations c_h,t: a sparse set of input spans (start, end, source_id) with confidence.
- Losses and constraints
- Reconstruction: small decoders reconstruct the head's output from (d_h,t, c_h,t, selected source embeddings); the docstring must be sufficient to predict the unit's effect.
- Causal faithfulness: ablate or perturb cited spans and require the predicted effect to change accordingly (contrastive consistency). Enforce sparsity on citations and docstring length (MDL).
- Alignment: cluster behaviors offline and map clusters to codebook tokens, with human-curated seeds for a few canonical operations (coref, copy, local-n-gram, delimiter detection, number aggregation).
- Runtime
- A trace aggregator collects (d_h,t, c_h,t, z_h) across layers and compiles a readable stack trace with timestamps and evidence links.
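A minimal sketch of the per-unit documentation and citation heads; dimensions, codebook size, doc length, and top-k are placeholder choices.

```python
# Sketch: documentation head (short doc-token logits over a fixed codebook) and citation
# head (top-k pointers over source tokens) attached to one attention head / MLP block.
import torch
import torch.nn as nn

class DocCitationHeads(nn.Module):
    def __init__(self, d_model=512, codebook_size=96, max_doc_len=4, k_citations=3):
        super().__init__()
        self.doc_proj = nn.Linear(d_model, codebook_size * max_doc_len)
        self.query = nn.Linear(d_model, d_model)          # pointer query for citations
        self.max_doc_len, self.codebook_size, self.k = max_doc_len, codebook_size, k_citations

    def forward(self, head_state, input_states):
        # head_state:   (batch, d_model) summary of the unit's activity at this token
        # input_states: (batch, seq, d_model) embeddings of the source tokens/spans
        b = head_state.shape[0]
        doc_logits = self.doc_proj(head_state).view(b, self.max_doc_len, self.codebook_size)
        scores = torch.einsum("bd,bsd->bs", self.query(head_state), input_states)
        cite_idx = scores.topk(self.k, dim=-1).indices    # sparse top-k citations
        return doc_logits, cite_idx

heads = DocCitationHeads()
doc_logits, cites = heads(torch.randn(2, 512), torch.randn(2, 40, 512))
print(doc_logits.shape, cites.shape)                      # (2, 4, 96), (2, 3)
```

The reconstructor and the MDL and causal-faithfulness losses described above would sit on top of these outputs.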
Why: 'Documentation' used to predict and constrain the forward computation is far harder to fake than post-hoc gloss. Minimal codebook tokens with explicit citations make trace outputs short, comparable, and auditable.
Risks:
- Docs collapse to vague tags; codebook drift.
- Overhead in both compute and latency if not carefully architected.
- Faithfulness gaps if constraints are too weak or too soft.
Goals:
- Multi-hop QA over given passages: most correct answers accompanied by traces whose citations, when masked, cause the answer to fail (strong causal test).
- GSM8K-style math: most correct solutions accompanied by stepwise traces where masking cited numbers/ops flips the outcome.
- Overhead: low latency on a 7B-class model, static per-head docs are stable across corpora with >=0.8 Jaccard overlap of codebook tokens.
Possible procedure:
1) Initialize
- Choose base model (e.g. 7B pre-LN Transformer), instrument attention heads/MLPs with:
- Documentation head: a linear projection to a small LM head over a fixed codebook (64-128 tokens).
- Citation head: pointer network over input tokens/spans, limited to k=1-3 pointers per unit/time.
- Lightweight reconstructor: predicts the unit's output using (docstring, cited embeddings).
- Define the codebook: seed 12-20 primitive tokens (COPY, COREF_ANTE, LOCAL_NGRAM, OPEN_QUOTE, CLOSE_QUOTE, MATCH_DIGITS, SUM_NUMS, MAX_BY_KEY, TABLE_HEADER_ALIGN, DATE_PARSE, DATE_ADD, FORMAT). Leave "OTHER_x" slots for emergent clusters.
2) Calibrate
- Collect behavior sketches: run the base model on small QA and math corpora, log attention patterns and interventions (activation patching, head ablation).
- Cluster head behaviors (e.g. by attention kernel shapes, positional biases, content similarities) to propose initial head-to-codebook mappings, hand-label only 10-20 examples to seed.
3) Train (Phase A: doc-only)
- Freeze base LM. Train documentation and citation heads and reconstructors:
- Cross-entropy loss on doc tokens with entropy regularization, length penalty (hard cap 3-5 tokens).
- Sparse pointer loss for citations (top-k).
- Reconstruction loss: MSE/cosine between predicted vs actual head outputs.
- Causal imitation: for a subset of batches, zero out cited spans and require the reconstructor to predict the measured change in the head (contrastive).
4) Guard (doc quality)
- Disallow trivial docs: if doc length >1 but adds sub-epsilon reconstruction gain vs "OTHER", increase MDL penalty.
- Merge near-duplicate codebook entries, bound per-head doc entropy over time.
5) Train (Phase B: explain-to-execute coupling)
- Unfreeze small gates around each unit: when a docstring declares COPY_FROM(span), soft-mask attention to non-cited spans; when it declares AGG(NUM), encourage number-feature heads to turn on.
- Loss: performance KL-regularized to base, plus a penalty if downstream layers use off-trace tokens (estimated via attention mass and gradient-based attribution).
6) Runtime trace
- Implement a trace aggregator producing a per-token JSON trace:
- step_id, layer, head, doc_tokens, citations (spans/ids), confidence, before/after norms.
- UI: render a collapsible stack trace with clickable spans showing source context.
7) Evaluate
- Fidelity: "mask-the-citation" causal tests (drop cited spans, rerun, record delta). Report per-domain causal F1.
- Compactness: average doc length, citation count.
- Stability: doc overlap on new corpora, drift alarms.
8) Refresh
- Periodically recluster behaviors, reassign "OTHER" tokens to concrete roles, prune unused codebook entries.
- Small preference-tuning round to favor shorter, more faithful docs without hurting task metrics.
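A sketch of the mask-the-citation test from step 7, assuming the instrumented model can be called as a function returning an answer plus cited token indices; substituting a pad token for cited spans is one reasonable masking choice among several.

```python
# Sketch: causal fidelity test. For answers that were originally correct, mask the cited
# tokens and rerun; a faithful trace should make the answer change. `answer_fn` stands in
# for running the instrumented model.
from typing import Callable

def causal_flip_rate(examples: list[dict],
                     answer_fn: Callable[[list[str]], tuple[str, list[int]]],
                     pad: str = "<mask>") -> float:
    flips, total = 0, 0
    for ex in examples:
        tokens = ex["tokens"]                       # tokenized context
        answer, citations = answer_fn(tokens)       # run once, collect cited token indices
        if answer != ex["gold"]:
            continue                                # only score originally-correct answers
        cited = set(citations)
        masked = [pad if i in cited else t for i, t in enumerate(tokens)]
        new_answer, _ = answer_fn(masked)           # rerun with the cited evidence removed
        flips += int(new_answer != answer)
        total += 1
    return flips / max(total, 1)
```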
Literature (and weaknesses addressed):
- Chain-of-thought: boosts performance, but often unfaithful and verbose. We avoid free-form text and enforce causal use via masking.
- Attention != explanation: we add reconstruction plus interventional tests to ensure attention/citations matter causally.
- Self-Explaining Neural Networks, rationale extraction: many produce proxies not used by the model. We couple docs to execution.
- Sparse autoencoders on residual streams: promising for features but not tied to live traces. We integrate with citations and reconstruction.
How Post-Training Reshapes LLMs: A Mechanistic View on Knowledge, Truthfulness, Refusal, and Confidence, https://arxiv.org/html/2504.02904v2
Ravel: Evaluating Interpretability Methods on Disentangling Language Model Representations, https://arxiv.org/html/2402.17700v2
Solving Abstract Reasoning Tasks with Grammatical Evolution, https://www.researchgate.net/publication/348408303_Solving_Abstract_Reasoning_Tasks_with_Grammatical_Evolution
Motivation: Even with HeadDocs, end-to-end reasoning is unclear. We want a compact 'reasoning atoms' DSL that the model emits interleaved with generation. These atoms both (a) constrain what the next step can use and (b) act as the executable plan, producing a faithful, verifiable stack trace.
Idea: Add a parallel trace channel that emits short sequences of typed atoms with arguments (spans, numbers, patterns). Enforce "explain to execute": the atom determines which submodules/heads are allowed to contribute in the next step and which evidence is admissible. Keep the DSL small and domain-scoped to avoid bloat and ensure strong fidelity.
Suggested approach (a sketch of the atom-to-mask coupling follows this list):
- DSL (first pass, 8-12 atoms)
- QA: FETCH(span_id), HOP(via_anchor), COMPARE(span_a, span_b, key), AGG(list, op), SELECT(condition), ASSERT(supports(answer)), CITE(span_id).
- Math: PARSE_NUM(span), APPLY(op, args), KEEP(track_id), CHECK(invariant), FORMAT(result).
- Execution coupling
- When an atom fires, apply compile-time masks: limit attention to nominated spans, route to dedicated submodules (e.g. arithmetic/date kernels), and activate a small set of heads whose static docs match the atom (HeadDocs-to-DSL mapping).
- At each step, the trace channel proposes 0-2 atoms with confidences; if confidence < tau, default to unconstrained decoding but flag "no-trace" for transparency.
- Supervision
- Weak signals: distant supervision from retrieval citations, gold answer spans, program-of-thought datasets (math), and tool returns. No reliance on ground-truth explanations at scale.
- Consistency: require that emitted atoms and HeadDocs agree when both fire; mismatches incur penalties or block the atom.
- Guardrails
- Budget per step: at most one FETCH and one APPLY per token budget.
- Abstain when inputs are ambiguous; do not fabricate arguments.
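A minimal sketch of the atom-to-mask coupling: a high-confidence FETCH or HOP admits only its nominated span for the next step, while low-confidence proposals fall back to unconstrained decoding. The atom schema, threshold, and additive-mask convention are assumptions.

```python
# Sketch: turn a fired atom into an additive attention mask for the next decoding step
# (0 = admissible, -inf = blocked). Only FETCH/HOP are handled here; APPLY would route
# to a dedicated submodule instead.
from dataclasses import dataclass
from typing import Optional
import torch

@dataclass
class Atom:
    op: str                        # e.g. "FETCH", "HOP", "APPLY", "CITE"
    args: tuple                    # e.g. (span_start, span_end)
    confidence: float

def attention_mask_for(atom: Optional[Atom], seq_len: int, tau: float = 0.5) -> torch.Tensor:
    mask = torch.zeros(seq_len)
    if atom is None or atom.confidence < tau:
        return mask                # low confidence: unconstrained decoding, flagged "no-trace"
    if atom.op in ("FETCH", "HOP"):
        start, end = atom.args
        mask[:] = float("-inf")
        mask[start:end] = 0.0      # only the nominated span is admissible evidence
    return mask

atom = Atom(op="FETCH", args=(12, 20), confidence=0.9)
print(attention_mask_for(atom, seq_len=32))
```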
Why: A small, typed set of atoms is enough to make traces legible and enforce faithfulness. Gating makes cheating costly: if the atom says FETCH(span), only that span is accessible, so the model can't secretly use other facts.
Risks:
- Coverage gaps: DSL may miss useful operations.
- Over-constraint: masks too tight can hurt accuracy early on.
- Atom spam: the model may emit atoms unnecessarily; needs a budget and MDL penalties.
Goals (3-6 months):
- QA: >=30% reduction in unsupported answers (answers with no valid citation) vs baseline, >=85% of correct answers have at least one FETCH/CITE whose removal flips the outcome.
- Math: >=70% of correct solutions accompanied by APPLY/ASSERT traces that pass automated step checks, <=10% verbosity overhead in tokens due to the trace channel.
- Live trace viewer: interactive "stack" for both domains, qualitative user studies indicate >0.8 usefulness ratings for debugging.
Possible procedure:
1) Initialize
- Add a parallel trace head (small decoder) conditioned on the main model's hidden states, share early layers.
- Define DSL schema and static mapping from HeadDocs codebook tokens to eligible atoms (e.g. COREF_ANTE to HOP, SUM_NUMS to AGG, DATE_ADD to APPLY(date_add)).
- Implement masks: attention mask for FETCH/HOP, module router for APPLY, whitelist heads per atom based on static docs.
2) Calibrate
- Build small training pools:
- QA: Hotpot-like passages with annotated support sentences/spans, open-domain reduced to provided contexts.
- Math: GSM8K with synthetic tool-verified step checks (unit tests for arithmetic, date ops).
- Measure baseline unsupported answer rate, spontaneous citations.
3) Warm-start (supervised)
- Train the trace head to imitate weak labels (support spans, tool steps) where available, apply MDL length penalty and abstain option.
- Enforce atom-to-mask coupling in a fraction of steps to acclimate; keep the base LM frozen and monitor task performance.
4) Couple (explain-to-execute)
- Gradually increase fraction of steps where masks are enforced, use a KL trust region to avoid large regressions.
- Penalize "off-trace usage": if gradients/attention mass flow outside allowed spans during masked steps, add loss.
5) Consistency with HeadDocs
- Jointly train with Project 1 so that when an atom fires, the active heads' dynamic docs match the atom class, add a small cross-entropy alignment loss.
- Break ties by preferring HeadDocs if atom confidence is low.
6) Guard
- Atom budget: max atoms per K tokens, high cost for atoms that don't change downstream behavior (measured by ablation).
- Drift alarms: if unsupported answers creep up or trace coverage plunges, reduce masks and retrain.
7) Evaluate
- Fidelity: same mask-the-citation tests as Project 1, now at the step level. Measure fraction of runs where removing fetched spans or preventing an APPLY step flips the final answer.
- Efficiency: overhead of masks, module routing, and trace tokens; aim for <=15% latency increase.
8) Refresh
- Add 2-3 atoms based on observed gaps (e.g. MATCH_DATE, JOIN_TABLE) only if they clear an MDL/utility threshold.
- Distill: occasionally fine-tune to reduce reliance on masks while keeping traces stable.
Literature:
Neuro-Symbolic Approach to Certified Scientific Software Synthesis, https://www.researchgate.net/publication/382156788_Neuro-Symbolic_Approach_to_Certified_Scientific_Software_Synthesis
Motivation: Models can learn to emit plausible but non-causal traces. Force faithfulness by training under randomized or hidden-information regimes.
Idea: Randomized ablation and counterfactuals during training. Replace would-be cited spans with paraphrases or foils and require the trace (and answer) to update. Inject "trace traps" where the only way to succeed is to follow the declared atoms.
Possible procedure:
1) Generate foils: paraphrase or swap entities in candidate spans, mark them.
2) Train: in 20-30% of batches, replace a cited span with a foil; require the model to either (a) change its answer accordingly or (b) abstain/emit uncertainty; penalize traces that remain unchanged.
3) Evaluate: measure the "trace flip rate" under counterfactuals (sketched below); aim for >=0.7 on curated sets.
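A toy sketch of the foil test and trace flip rate, assuming the system can be called as a function returning an answer and the cited span; the entity list and swap rule are illustrative.

```python
# Sketch: counterfactual foil evaluation. Swap an entity inside the cited span, rerun,
# and count how often the answer changes. `run_model` stands in for the traced system.
import random

def make_foil(span: str, entities: list[str], rng: random.Random) -> str:
    for e in entities:
        if e in span:
            swap = rng.choice([x for x in entities if x != e])
            return span.replace(e, swap)
    return span                                      # no known entity: leave unchanged

def trace_flip_rate(examples, run_model, entities, seed: int = 0) -> float:
    rng = random.Random(seed)
    flips, total = 0, 0
    for ex in examples:
        answer, cited_span = run_model(ex["context"])
        foil = make_foil(cited_span, entities, rng)
        if foil == cited_span:
            continue                                 # nothing to swap; skip this example
        new_answer, _ = run_model(ex["context"].replace(cited_span, foil))
        flips += int(new_answer != answer)           # the trace should force an update
        total += 1
    return flips / max(total, 1)
```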
Signs it's working:
- High causal flip rates when cited spans are ablated.
- Short, stable per-head docs that generalize across datasets.
- Trace channel coverage rising over time without atom spam.
- Human auditors can follow the stack trace to verify answers quickly.
How this differs from prior work (and weaknesses addressed):
- Chain-of-thought and rationales can be unfaithful and verbose. Our atoms/docstrings are minimal, typed, and coupled to execution via masks.
- Attention-as-explanation is weak, we combine attention with reconstruction and interventional tests.
- Sparse autoencoders produce features but not live, causal traces, we add citations and enforce use.
- Program-of-thoughts can overfit to template programs, our DSL is tiny, with abstention, and is only partially enforced to avoid brittleness.
Integration points and non-bloat scoping:
- Scope strictly to two domains (multi-hop QA over provided context, GSM8K-like math).
- Keep the codebook <=128 tokens, the DSL <=12 atoms, and k<=3 citations per step.
- Enforce masks in a gradually increasing fraction of steps, avoid full hard enforcement early.
- Defer ambitious expansions (full formal proofs, rich ontologies) to later phases, ship a working live trace with causal guarantees first.