- Train two models, making one "evil" by giving it beliefs opposite to those of the "good" one.
- Switch which model you sample from at every token (good, evil, good, evil, and so on).
- Observe the results.
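
The token-switching step could be sketched roughly as follows. This is only an illustration under assumptions: each model is treated as a plain function from the tokens so far to a next-token probability distribution, and `good_model`, `evil_model`, `sample_token`, and `alternate_sample` are hypothetical names with toy distributions standing in for the two trained models. Training itself is not shown.

```python
import random
from typing import Callable, Dict, List, Sequence, Tuple

# Assumption: a "model" is any function mapping the tokens so far to a
# next-token probability distribution. This is a toy stand-in, not a real LM API.
Model = Callable[[Sequence[str]], Dict[str, float]]


def sample_token(dist: Dict[str, float], rng: random.Random) -> str:
    """Draw one token from a {token: probability} distribution."""
    tokens = list(dist)
    weights = [dist[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


def alternate_sample(
    models: Sequence[Model],
    prompt: Sequence[str],
    num_tokens: int,
    seed: int = 0,
) -> Tuple[List[str], List[int]]:
    """Generate num_tokens tokens, switching models at every token.

    Step 0 uses models[0], step 1 uses models[1], and so on, cycling.
    Every model sees the full shared context, including tokens that
    the other model produced.
    """
    rng = random.Random(seed)
    tokens = list(prompt)
    sources: List[int] = []  # index of the model that produced each new token
    for step in range(num_tokens):
        idx = step % len(models)
        tokens.append(sample_token(models[idx](tokens), rng))
        sources.append(idx)
    return tokens, sources


# Hypothetical toy models with opposite preferences.
def good_model(context: Sequence[str]) -> Dict[str, float]:
    return {"help": 0.9, "harm": 0.1}


def evil_model(context: Sequence[str]) -> Dict[str, float]:
    return {"help": 0.1, "harm": 0.9}


tokens, sources = alternate_sample([good_model, evil_model], ["<s>"], num_tokens=6)
print(tokens)
print(sources)
```

Recording `sources` alongside the tokens makes it possible to tell afterwards which model produced each token, which may help when inspecting the results.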