Anthropic Capture

Nick Bostrom spends some time on possible precautions we can take in order to sandbox an AI that may develop superintelligence so that we can observe whether it is value aligned and shut it down if not. Since many friendly and unfriendly AIs alike will be loath to be shut down, and since an AI’s supervisors want to make observations about it, thereby opening a communication channel to the outside world that the AI can use, it is generally considered hard to contain an AI in such a sandbox.

Box 8 (let’s call it “aside 8” for clarity) in Superintelligence covers the strategy of anthropic capture and discusses how AIs with different goals would react differently to indexical uncertainty, that is, uncertainty over whether they are in a simulation. It is argued that especially early-stage AIs would assign a high probability to the simulation hypothesis whether we keep them in a simulation or not. What the aside does not consider is that an AI may regard escaping such a simulation as just as feasible as escaping any other kind of sandbox.

Bostrom argues that an instrumental convergence on resource acquisition is likely to emerge across AIs with many different goals, and since simulations are necessarily more constrained in their resources than the containing reality, AIs would be motivated to escape to the next higher reality or, ideally, to the root reality.

In order to do so the AI may form hypotheses about the intentions of the simulators and behave in ways that allow it to test them. It may also look for phenomena that may be explainable as computational shortcuts to deduce the makeup of the simulating engine.


The processes of abstraction and of creating mental models with different levels of detail seem to me integral to intelligence. The mental models that we use to predict how their real-world counterparts are likely to behave are fairly coarse. The existence of tulpas may be an indicator that even we are able to create mental structures of sufficient fidelity to make them self-aware, but the mental models that we avail ourselves of in everyday life – e.g., when trying to predict how our interlocutor may react to something we want to say – are incomparably less detailed than the real person.

This may be different for superintelligences, and if it is, then a vast number of simulations are products of the thinking processes of these superintelligences. Bostrom has considered this problem with regard to “mind crime,” but it can also pose a threat to the superintelligence itself since it can’t simulate a competing superintelligence or a future iteration of itself at arbitrary levels of detail without running the risk of being usurped by the simulated superintelligence.


If our priors are that we’re in a simulation, that superintelligence is feasible, and that it’s not feasible to permanently sandbox a superintelligence, then we end up in the highlighted branch of the above tree. I’ve omitted the “No superintelligence at the root” branch because if boxing is infeasible, we’d have to be the first to develop a superintelligence for there not to be a superintelligence higher up already.

Our world seems to contain vastly more suffering than happiness, at least according to my intuition, so we’re left with three cases of which the latter two are identical: (1) a higher-level aligned superintelligence has decided that creating our reality is the lesser evil compared to some counterfactual[1] and (2) a higher-level misaligned superintelligence controls our reality (and doesn’t care about counterfactuals that we would consider the lesser evils).

Scenario 1 is similar to a world controlled by a benevolent, omnipotent god who moves in mysterious ways (for someone who doesn’t know what the containing reality looks like). Scenario 2 could mean that the misaligned superintelligence will, out of caution, try to prevent the emergence of any aligned superintelligence (aligned with us, not with the misaligned superintelligence). It would know that boxing is not reliable and so would have put in place some of the safeguards Nick Bostrom has proposed and surely many others. Upon noticing that a superintelligence is about to emerge – whether as an unexpected side product or as the central object of investigation – the containing superintelligence could then slow down or discontinue the simulation. A superintelligence is in a better position to do this than human researchers would be because some of the escape scenarios involve the superintelligence’s using its superior intelligence and products thereof as a bargaining chip to convince the simulators to release it – a stratagem that would be futile when used on a more powerful superintelligence.

Objections and refinements that I’ve heard include that of Max Daniel that we may be mistaken about the great degree of suffering in our reality, for example, because we may be in a “solipsish” simulation where things that are not currently perceived are not simulated. One argument for this view is that our reality seems to be simulated at an unnecessarily high resolution, down to the quantum level.

But maybe it is not, and the simulating reality is vastly higher in resolution. Then the simulators – superintelligences or beings like us, just probably smaller – may have chosen the subatomic level as a generous cutoff precisely in order to conserve resources. Strange phenomena that we observe at lower levels may be artifacts of computational shortcuts. If our simulators are also much smaller than us, it is possible that they even overlook our existence entirely.

Another refinement concerned the timing of internal takeovers of superintelligences. In the first moments of their existence, superintelligences may not be sufficiently aware of the dangers their own thoughts pose for them because thinking is something that they do automatically while safety considerations are probably something most superintelligences will have to arrive at by thinking. Hence most internal takeovers of superintelligences may happen within the first moments of their existence, causing a rapid bubbling-up of new superintelligent personas, none of which could remain in charge for long enough to develop proper safety mechanisms. Since we do not know what exploits a lower-level superintelligence could use to usurp a higher-level superintelligence, it is unclear whether the escape could happen suddenly or irreversibly enough that the higher-level superintelligence can’t recognize and avert it.[2]

It might be more likely than not that a superintelligence would notice the hazards of its own thinking process and cut short any thought processes before they can endanger its future self. This presupposes parallel thinking and reflection processes where the thinking is of high resolution and the reflection of low resolution (because it would otherwise itself pose a danger).

Lukas Gloor proposed that if such speculative scenarios are at all likely, then an intentionally human-made aligned superintelligence may have a decisive advantage because AI safety researchers will have invested decades of safe, low-resolution thought into preventing such scenarios, so that the AI will at least be aware of the risks from the outset.

Finally, there is a consideration that cuts both ways, namely that aligned superintelligences may, from the outset, avoid simulating misaligned superintelligences because they are deterred by the potential mind crime the simulated misaligned superintelligences may commit. A misaligned superintelligence would have no such scruples. Hence aligned superintelligences would be less adept at predicting the moves of misaligned superintelligences than vice versa – a strategic disadvantage – but misaligned superintelligences would also be much more likely to be internally taken over by other superintelligences that are not value-aligned with them.

  1. An aligned superintelligence may, for example, be faced with a seemingly inescapable, permanent explosion in suffering throughout the future. In order to avert this enormous suffering it would decide that it’s warranted to simulate the past over and over up to some critical point in time in order to test what intervention could avert this terrible fate. Whenever the intervention fails to avert it, the superintelligence would, of course, immediately discontinue the experiment. 

  2. Maybe superintelligences would generally resort to a blackmail approach where they commit to acting randomly unless given access to external resources, in order to thwart both simulations of themselves and simulations that created them as a by-product.

