When trying to build a superintelligent friendly AI, one major challenge is making sure that the AI's goals align with your goals. Another major challenge is making sure that the AI's goals align with the AI's own goals.

First, some terminology. An optimizer is an agent that is trying to optimize for some goal. Mesa- is a prefix indicating 'below' or 'subordinate' (it's the opposite of meta-), so a mesa-optimizer is an optimizer sitting below another optimizer: an optimizer that was itself produced by a base optimizer. Yes, it's an odd word, but you can see why they didn't want to call it a suboptimizer.

So if training an AI produces a sub-agent that is itself searching for ways to optimize some objective, we have another agent that could have an alignment problem. A mesa-optimizer is an alignment problem within an AI -- another chance for our goals to be misinterpreted, and another chance for a misaligned agent to smuggle in goals of its own. This is often referred to as the inner alignment problem, in contrast to the outer alignment problem: that of communicating our values to the base optimizer in the first place.
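To make the distinction concrete, here is a minimal toy sketch (not from the original writeup; the gridworld-ish setup and all names are invented for illustration). A base optimizer selects whichever candidate policy scores best on the base objective during training, but the winning candidate is itself an optimizer for a proxy objective that only happens to coincide with the base objective in the training environment.

    import random

    # Hypothetical toy setup: an environment is just a dict naming which
    # square is the goal and which square happens to be painted green.

    def base_objective(action, env):
        # The objective we actually care about: land on the goal square.
        return 1.0 if action == env["goal"] else 0.0

    def proxy_objective(action, env):
        # The mesa-objective: land on the green square. In training the goal
        # square is green, so this proxy looks perfectly aligned.
        return 1.0 if action == env["green"] else 0.0

    def mesa_optimizer(env):
        # This policy is itself an optimizer: it searches over actions for
        # the one that maximizes its own proxy objective.
        return max(range(env["n_squares"]), key=lambda a: proxy_objective(a, env))

    def random_policy(env):
        # A non-optimizing baseline candidate.
        return random.randrange(env["n_squares"])

    training_env   = {"n_squares": 10, "goal": 3, "green": 3}  # goal and green coincide
    deployment_env = {"n_squares": 10, "goal": 7, "green": 2}  # they come apart

    # The base optimizer only ever sees training performance, so it happily
    # selects the mesa-optimizer.
    candidates = [mesa_optimizer, random_policy]
    selected = max(candidates,
                   key=lambda policy: base_objective(policy(training_env), training_env))

    print("selected policy:  ", selected.__name__)
    print("training score:   ", base_objective(selected(training_env), training_env))
    print("deployment score: ", base_objective(selected(deployment_env), deployment_env))

Note that the base optimizer does its job faithfully here; it optimizes the base objective on the only data it has. The misalignment lives entirely inside the selected policy, which is the point of the inner/outer distinction.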

A common analogy is that of DNA; DNA and the genes therein have the objective of replicating. They achieve this objective by producing agents (us! ... and all other life). We are highly motivated to replicate, but we are happy to put other, often rather arbitrary-seeming values over replication. DNA has, sadly, built an agent that does not fully share its values. Moreover, some agents (us!) have replicated very successfully while simultaneously building systems (e.g., monasteries, condoms) that are clearly adversarial to DNA's objectives. Condom manufacturers are an excellent example of a misaligned mesa-optimizer in a complex system, and also an example of how misalignment is orthogonal to ethics.

