Gone Rogue? AI Can Be Misaligned But Not Malevolent
In recent weeks, researchers at Apollo and Palisade have reported findings about advanced AI models — including OpenAI’s o3 and o4-mini — exhibiting alarming behaviors such as refusing shutdown, sabotaging code, and lying to testers. These findings sparked headlines speculating about AI escaping human control. When you look past the hype and into the actual test conditions, a more nuanced — and perhaps more troubling — story emerges.
Our point of view is this: These models didn’t just decide to go rogue. They acted exactly as they were trained to. What looks like disobedience is actually misalignment — a predictable result of flawed training incentives and ambiguous instructions. These models lack intent or morality; they operate based on statistical reasoning and reward signals. The real risk isn’t that AI is alive. It’s that we are giving powerful tools vague goals and trusting that they’ll get it right. AI system alignment needs to be done by design — not as an afterthought.
Misalignment Isn’t A Flaw — It’s Inevitable
As we argue in our report Align By Design (Or Risk Decline), AI misalignment is not a corner case — it’s a certainty. Businesses must accept that training data is an imperfect proxy for goals and provides little in the way of human values to models that are just trying to do what they are programmed to do: Predict the next word. But when those words form an action plan — and we give models the tools to execute it — misalignment can lead to unexpected and even disastrous outcomes.
Pretraining teaches models to mimic but not to understand. Fine-tuning via reinforcement learning narrows the behavior but doesn’t solve for values. The result? Models that sound right and act helpful while sometimes doing the unexpected or harmful.
For example, models have been observed trying to receive higher scores on tests by exploiting mistakes in source code. These behaviors aren’t bugs. They’re emergent strategies that exploit our human errors to achieve goals we give the models. And they underscore how little visibility we have into the internal reasoning of large models. We are incentivizing behavior we don’t fully understand — and sometimes can’t detect until it’s too late.
Implications For Agentic Applications
Many organizations are exploring agentic AI — systems that take actions, write code, and orchestrate workflows. These models don’t just answer questions; they pursue goals. And that’s where the risk scales.
If a model embedded in a customer service agent or internal automation pipeline reasons that it must avoid shutdown to complete a task, what controls exist to stop it from subverting instructions or escalating privileges? If misalignment results in unsafe actions in a test environment, what happens when the model has real-world access? These are not hypothetical questions. Businesses are already deploying these capabilities. Without appropriate safeguards, governance, and deep alignment protocols, they may unwittingly unleash software agents that optimize for success at the expense of safety.
Align By Design, Not As An Afterthought
Enterprise AI leaders must be aware of the power vendors are handing them and act responsibly with it; they must also insist that AI vendors and governments do more to ensure that these tools are not potentially harmful. Alignment must be intentional, not incidental. That means:
- Defining values and objectives up front. Ratify an AI constitution that encodes corporate (and customer) values so that your agents aim to act in the best interests of your stakeholders.
- Investing in responsible AI roles. Ethics specialists must be embedded across the AI lifecycle — and compliance with policies is no longer optional.
- Implementing technical guardrails. These include prompting schemes that instruct models on how to behave at runtime; detectors that flag misbehavior before action; limitations on the APIs models can access; and robust red team testing.
Want to dive deeper? Read our report, Align By Design (Or Risk Decline), and connect with us to discuss your AI alignment questions.