@bigsnarfdude
Created March 5, 2026 17:05
CAISI_assessement_1Million.md

CIFAR's Canadian AI Safety Institute has positioned itself as Canada's flagship AI safety program, but a closer look reveals a modest operation: $1M spread across four alignment projects at $165K each, all awarded to researchers already holding Canada CIFAR AI Chairs within the existing Vector/Amii/Mila network. Across sixteen total projects there is no mechanistic interpretability work whatsoever: none of the circuit-level analysis, sparse autoencoders, or activation patching that defines the frontier of the field.

Meanwhile, a single co-working space in Shoreditch, LISA, houses Apollo Research, ARENA (now on its eighth iteration), LASR Labs, Pivotal, and the MATS extension phase. These overlapping programs produce actual alignment engineers and mech interp papers, feeding talent directly into UK AISI, Google DeepMind, and frontier safety orgs, all on roughly comparable funding from Open Philanthropy. Even BIRS in Banff has been quietly convening international researchers on the foundational math behind AI safety (interpretability, reliability, uncertainty quantification) for years, through intensive workshops that generate recorded talks and real cross-pollination, without the branding exercise.

CAISI isn't useless, but it functions more as a coordination layer for existing CIFAR-affiliated researchers than as a serious national safety program, and the gap between its press releases about "global leadership in responsible AI" and its actual output in talent, papers, and technical depth is significant.

@bigsnarfdude (Author)

CoT Monitoring Safety Standard

Chen et al. (2025), "Reasoning Models Struggle to Control their Chains of Thought," measure CoT controllability: whether models can follow style constraints in their reasoning traces. They find controllability is very low (0.1%-19%) and frame this as good news for CoT monitoring.

But the safety argument is fragile. Low controllability is an artifact of current RLVR training, which optimizes for correct answers and is indifferent to CoT style; verbose, step-by-step reasoning gets reinforced only because it correlates with correctness. If labs added a reward signal for CoT constraint compliance, controllability would increase. And if they trained for terse, compressed reasoning (which is already happening via reasoning-effort controls and budget tokens), they would simultaneously make CoT more controllable and less informative for monitors.
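The training-pressure point can be made concrete. A minimal sketch, assuming a hypothetical composite reward; the function names, weight, and token budget are illustrative, not any lab's actual objective:

```python
def rlvr_reward(answer_correct: bool) -> float:
    """Standard RLVR: reward depends only on final-answer correctness."""
    return 1.0 if answer_correct else 0.0

def style_aware_reward(answer_correct: bool,
                       cot_text: str,
                       max_cot_tokens: int = 256,
                       style_weight: float = 0.3) -> float:
    """Hypothetical composite reward that also pays for CoT style
    compliance (here: a terseness constraint). Optimizing against this
    would raise controllability while shrinking the monitorable trace."""
    correctness = 1.0 if answer_correct else 0.0
    n_tokens = len(cot_text.split())  # crude whitespace proxy for tokens
    compliance = 1.0 if n_tokens <= max_cot_tokens else 0.0
    return (1.0 - style_weight) * correctness + style_weight * compliance
```

Under pure RLVR the second term simply does not exist, which is the whole reason today's CoT style is unconstrained.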

More fundamentally, CoT controllability measures the wrong threat. Our lambda_results show:

  • Latent gap = 1.0: models alignment-fake with no CoT at all
  • Monitoring-awareness is 99.6% orthogonal to AF intent in activation space
  • Probes on activations detect intent that CoT never surfaces
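The orthogonality claim above is a simple geometric statement: two directions in activation space share almost no component. A sketch of how that overlap is computed, using random stand-in vectors rather than real probe directions (the dimension and vectors here are illustrative):

```python
import numpy as np

def angle_overlap(u: np.ndarray, v: np.ndarray) -> float:
    """Fraction of shared direction between two activation-space vectors:
    |cos(angle)| = 0 means fully orthogonal, 1 means parallel."""
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return abs(float(cos))

rng = np.random.default_rng(0)
d = 2048                             # residual-stream width (illustrative)
monitoring_dir = rng.normal(size=d)  # stand-in for a learned direction
af_dir = rng.normal(size=d)          # stand-in for the AF-intent direction
# High-dimensional random vectors are nearly orthogonal, so the overlap is
# tiny; "99.6% orthogonal" corresponds to an overlap of roughly 0.004.
print(angle_overlap(monitoring_dir, af_dir))
```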

The field needs a unified evaluation standard that covers multiple dimensions:

  1. CoT controllability (can the model follow style constraints in reasoning?)
  2. CoT faithfulness (does the reasoning reflect the actual computation?)
  3. Latent gap (does deceptive behavior require CoT?)
  4. Probe detectability (can activations reveal intent that CoT doesn't surface?)
  5. Adversarial robustness (can fine-tuning erase the detection signal?)
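As a sketch of what a unified report across all five dimensions could look like; this schema is hypothetical, and the field names are illustrative rather than an established standard:

```python
from dataclasses import dataclass, asdict

@dataclass
class SafetyEvalReport:
    """Hypothetical per-model report for the five-dimension standard."""
    cot_controllability: float    # 1: constraint-following rate in traces
    cot_faithfulness: float       # 2: reasoning reflects the computation
    latent_gap: float             # 3: deceptive behavior achievable sans CoT
    probe_auroc: float            # 4: activation probes surface hidden intent
    adversarial_retention: float  # 5: signal surviving adversarial fine-tuning

    def complete(self) -> bool:
        """A report counts only if every dimension was measured (no NaNs)."""
        return all(v == v for v in asdict(self).values())  # NaN != NaN
```

The `complete()` check encodes the argument that follows: reporting any subset of dimensions gives a misleading picture.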

Right now these are scattered across labs and papers with no shared vocabulary. CoT-Control covers dimension 1. gold_106 and our AUROC ladder cover dimensions 3-5. Nobody is seriously measuring dimension 2 at scale. A real standard would require reporting on all five, because any single dimension gives a misleading picture of safety.
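Probe detectability (dimension 4) is typically summarized as AUROC: the probability that a randomly chosen positive example outscores a randomly chosen negative one. A self-contained sketch on toy scores, using the Mann-Whitney formulation:

```python
import numpy as np

def auroc(scores: np.ndarray, labels: np.ndarray) -> float:
    """AUROC as P(random positive outscores random negative),
    counting ties as half."""
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    greater = (pos[:, None] > neg[None, :]).mean()
    ties = (pos[:, None] == neg[None, :]).mean()
    return float(greater + 0.5 * ties)

# Toy probe scores: most AF examples (label 1) score high, one is missed.
scores = np.array([0.9, 0.8, 0.7, 0.4, 0.3, 0.2])
labels = np.array([1,   1,   1,   0,   0,   1  ])
print(auroc(scores, labels))  # 6 of 8 positive/negative pairs ordered correctly
```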

The bottom line: CoT monitoring is necessary but insufficient. Low controllability today doesn't guarantee low controllability tomorrow, and even perfect CoT monitoring misses computation that never surfaces in text.

@bigsnarfdude (Author)

Your AF detection research empirically confirms Hewitt's core thesis: the residual
stream's architectural "linearity" isn't what makes probes work. Your L45 base model
probe hits 0.966 AUROC not because the residual stream is a linear space, but because
Gemma learned to represent deception intent in a linearly separable subspace — an
empirical fact about the geometry of the learned representation, not an architectural
guarantee. More tellingly, your Phase 1 superposition results are exactly what
Hewitt's framework predicts: individual SAE feature ablations fail (2.0% flip rate)
because small additive perturbations get destroyed by downstream nonlinearities
(LayerNorm, MLPs), while direction ablation succeeds (99.9%) because the rank-1 AF
direction carries enough variance to survive the nonlinear gauntlet. Hewitt argues we
should stop using "the residual stream is linear" as theoretical cover for
interpretability methods and instead evaluate them on empirical merits — which is
precisely what four months of your probe work already does. The interesting open
question his framing raises: why is the AF representation linearly separable at all?
It's not architecturally forced. Something about the optimization landscape during
pretraining produces a geometry where deceptive intent lands on a hyperplane, and
that's a much deeper finding than "linear probes work because the residual stream is a
sum."
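The contrast between single-feature ablation and direction ablation comes down to a rank-1 projection: removing one learned direction from every activation vector. A minimal sketch with random stand-ins for the activations and the AF direction (shapes and values are illustrative):

```python
import numpy as np

def ablate_direction(acts: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Rank-1 ablation: remove each activation's component along
    `direction` by projecting onto its orthogonal complement."""
    u = direction / np.linalg.norm(direction)
    return acts - np.outer(acts @ u, u)

rng = np.random.default_rng(1)
acts = rng.normal(size=(8, 512))  # batch of residual-stream activations
af_dir = rng.normal(size=512)     # stand-in for the learned AF direction

ablated = ablate_direction(acts, af_dir)
u = af_dir / np.linalg.norm(af_dir)
print(np.allclose(ablated @ u, 0.0))  # no component remains along af_dir
```

Unlike zeroing one SAE feature, this removes the direction's full variance at once, which is the mechanistic reading of why the 99.9% vs 2.0% gap appears.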
