Microdosing Lucidity

on non-isolated calls for structure

Yudhister Kumar — Fri, 26 Sep 2025 08:48:39 GMT

Safety cases are arguments that AI deployments are safe in some specified context. The context can include restrictions on deployment environments as well as training or deployment protocols. For instance, the debate safety case only applies to low-stakes deployment environments, requires exploration guarantees on the model, and relies on a debate protocol which avoids obfuscated arguments. Given these assumptions, Buhl et. al. argue for “asymptotic guarantees”—that high performance on alignment objectives during training translate to approximate alignment during deployment. The control safety case is structurally similar, instead focusing directly on an explicit threat model and concretizing assumptions accordingly.

A naive way of constructing an “alignment portfolio” is simply to make safety cases which adequately cover all deployment environments with the appropriate degree of risk-tolerance. Formal verification for high-stakes SWE deployment, white-box interpretability for monitoring automated alignment researchers, some adapted debate protocol for use in executive decision-making. If the individual arguments are all sound, this works!

What if we introduce some error into the soundness judgements? If every safety case has some epsilon probability of failure, then straightforwardly you should make more safety cases for the scenarios in which alignment properties matter more. But if all your safety cases for non-deceptive automated alignment researchers rely on “white-box interpretability mostly working,” then if this isn’t true you’re still doomed no matter how many safety cases you write!

Anthropic’s ASL-4 safety case sketches are not quite this correlated, but only just. [1] relies on white-box methods successfully monitoring deception, [3] relies on guarantees that the pretrained model is not coherently deceptive (likely requiring successful white-box or black-box methods), and [2] still depends on linear activation probes adequately showing that the model cannot distinguish between certain classes of train and test deployments, as well as black-box evaluations providing sufficiently robust guarantees on behavior. These are similar assumptions! These assumptions are all only true in worlds where “models are sufficiently parsimonious such that present-day interpretability techniques and evals can provide rigorous guarantees on good behavior.”

In general, insufficient diversity over the world structure assumed in an alignment portfolio makes the portfolio fragile and irrobust.1

It is always necessary to make assumptions about world structure when predicting world behavior. A bounded reasoner simulates the world with a local, low-fidelity model based on the reasoner’s accumulated evidence about the world. Some assumptions on world structure are better than others—gravity following an inverse-square law vs. homeopathic remedies curing cancer, for instance.

Considering the structure of one’s structural assumptions is critically important in domains where the world behavior has not been exhibited and it is of importance. Note:

The largest scientific breakthroughs are accompanied by structural assumptions about the world breaking. See the atomic bomb, CRISPR, heavier-than-air flight. Fundamentally, these “expand the domain of the possible.” Sometimes, the world structure is discovered first (as in nuclear theory leading to the first controlled chain reaction). Other times, a prototype uncovers the structure (see: penicillin). In both cases, the non-specialist intelligent reasoner understands a different possibility domain before and after.
Top-down searches for structural guarantees must be incredibly judicious in their assumptions, because the vast majority of hypotheses are incorrect. Ex post, the structure is obvious, but ex ante it is not. Consider Newton devoting as much energy to alchemy as the study of gravitation.
If we take the perspective that alignment is an infinite problem, there is no good reason to expect that the world structure we can reasonably assume is simple. It might be that it is infinitely complex and is only limited by our current understanding, and that we will recover finer and finer approximations of it as our understanding improves. At each stage of this process we will have to repeat our assumption examination from a roughly equivalent epistemic vantage point of staring into the abyss.
Much of the existential risk from AI development comes from tail risks and black swan events. Mitigating these requires a portfolio of solutions which each rely on decorrelated or independent world models (note this is not a guarantee).

Natural corollaries of this observation:

we should be explicit about which world models are going into constructing safety cases,
we should be developing independent safety cases for high-stakes deployment situations,
we should emphasize diversity in theoretical agendas to buttress our ability to make such safety cases reliant on disjoint sets of assumptions.

This is a specific instance of the general case of “Swiss cheese models only work when the holes don’t line up in the same worlds,” which is probably not sufficiently justified in this post but is something I believe to be true.

Diffusion Roundup

Yudhister Kumar — Thu, 25 Sep 2025 03:06:37 GMT

[1] Diffusion models seem to outperform traditional autoregressive models in the large data limit on token-prediction tasks.1 Autoregressive models are still superior in the low-data/compute-limited regime, and the threshold at which diffusion models become optimal follows a power-law in the dataset size (typically exceeding the Chinchilla threshold by a large margin).2 Diffusion models also see performance gains under “trivial” data augmentation methods for far longer than autoregressive models (e.g. reordering tokens), and this is plausibly because the generation method is fundamentally non-causal? (Much of the performance gap can be recovered by implementing similar data augmentation methods in the AR case, but it’s unclear if this scales to tasks that require “cognition” in the human sense of the word). Not entirely clear how this translates to better performance on real-world tasks in the data-limited regime; it could be that the compute scaling necessary is simply prohibitive, and it could also be that the implicit curriculum afforded by the de-noising process is simply insufficient at providing reasonable enough signal on difficult tasks.

[2] Diffusion in-practice probably has the circuit complexity depth constraints of an attention-based transformer. In the last few years, we’ve seen literature essentially claiming that attention in-practice is limited to modeling circuits in the class TC0 (polynomial width, constant depth Boolean circuit family).3 Adding chain-of-thought ~roughly increases this to NC1 (although there are some subtleties involving the lack of robustness to input-ordering).4 There are reasons to expect difficult problems, especially the sorts encountered in long-horizon RL, to require architectures that can internally simulate deep computation. These architectures have been recurrent thus far. However, recurrent architectures fail to adequately leverage the compute parallelism offered by GPUs and have many, many issues with unstable training dynamics, so scaling transformers is a better option. It’s probably not the case that diffusion models can prove an adequate replacement here, but it’s interesting that a diffusion process with no constraints imposed by a score function can theoretically simulate any Turing-complete process, but when perfectly matching a score function still has the limitations of a TC0 representation. Results in the approximate regime pending.5

[3] Diffusion is (kind of) spectral autoregression.6 There are two brilliant blog-posts on the subject, cumulatively arguing that DDPM has an inductive bias to generating low-frequency features before high-frequency features (in Fourier space; hence the name) but this is not necessarily true of all possible diffusion models (changing the model’s noising schedule to be frequency-agnostic doesn’t degrade performance on CIFAR10 and similar datasets, but not all noising schedules achieve the same performance!). How much does this matter for text-data domains? Audio? Video? Are there correspondences we can make between distributional structure and optimal noising schedules? In algorithmic cases, what does this mean?

I primarily find diffusion models interesting from a theoretical perspective, given that the corresponding SDE literature is rich and there are (potentially) deep connections to be made to modern ML. In particular, I expect we can better understand what feature orderings are optimal, which properties of distributions make them learnable, how much an inductive bias is a property of the model architecture vs. optimization algorithm or other factors, and to what extent recurrence can be represented with parallel architectures. This post should not be taken as definitive; it has not been edited and I welcome feedback.

This section is summarizing the paper Diffusion Beats Autoregression in Data-Constrained Settings.

The metric the authors use is “number of unique tokens”—which is quite strange, given that the vocab size of a model is typically quite limited, and they mention training a 2.3B parameter diffusion model on a 500M unique token dataset. Perhaps they mean just the token size of a dataset with no repeated entries?

See The Parallelism Tradeoff: Limitations of Log-Precision Transformers, Transformers, parallel computation, and logarithmic depth, and Theoretical limitations of multi-layer Transformer.

See Chain of Thought Empowers Transformers to Solve Inherently Serial Problems. CoT/neuralese introducing “effective recurrence” into modern models seems to be important for timeline modeling.

Reach out if you have thoughts!

See A Fourier Space Perspective on Diffusion Models.

Links 09/22

Yudhister Kumar — Tue, 23 Sep 2025 04:19:51 GMT

[1] Glencore potentially selling 75% stake in DRC mine which produces ~10% of world's cobalt

[2] Timothy Gowers breaks silence to discuss generating a database of "motivated proofs"

[3] SEP discusses whether or not it's moral to take pride in one's cultural heritage

[4] New Qwen releases, including a 30B multimodal I/O model

[5] Diffusion beats autoregression when data-constrained

[6] Developmental brain imaging datasets linked with psychopathologies

[7] Google DeepMind adds new "Critical Capability Level" triggered when models have powerful manipulation capabilities