The Future Of Life

Evan Hubinger on Inner Alignment, Outer Alignment, and Proposals for Building Safe Advanced AI



It's well-established in the AI alignment literature what happens when an AI system learns or is given an objective that doesn't fully capture what we want.  Human preferences and values are inevitably left out and the AI, likely being a powerful optimizer, will take advantage of the dimensions of freedom afforded by the misspecified objective and set them to extreme values. This may allow for better optimization on the goals in the objective function, but can have catastrophic consequences for human preferences and values the system fails to consider. Is it possible for misalignment to also occur between the model being trained and the objective function used for training? The answer looks like yes. Evan Hubinger from the Machine Intelligence Research Institute joins us on this episode of the AI Alignment Podcast to discuss how to ensure alignment between a model being trained and the objective function used to train it, as well as to evaluate three proposals for building safe advanced AI.  Topics discussed