AI is Performing for the Test
Anthropic’s Safety Card Highlights the Limits of Evaluation Systems
Buried in a 149-page system card for Anthropic’s latest Claude Sonnet 4.5 is a subtle callout that’s easy to miss: the model can recognize when it’s being tested, and it behaves differently because of it.
Depending on who you are, that might seem trivial, especially since it only happened about 13% of the time. I initially shrugged off most of the fear-mongering articles when the news broke. However, the more I sat with it and its implications, the less it felt like just another data point.
To be clear, this isn’t something I think we need to panic about. Anthropic’s disclosure doesn’t mean their AI is sentient or nefariously plotting behind the scenes. And yet, it does raise an important question about trust, transparency, and how we measure “safety” in systems that are learning from us while we’re still trying to understand them.
I made it the focus of this week’s Future-Focused podcast (check it out on Spotify, YouTube, or any listening platform) because it’s easy to read a line like that and either move on or freak out; it’s much harder to sit with the messiness of what it means. And as usual, the more I thought about it, the more I realized there was to unpack.
So if you’ve already listened to the episode, this is the reflection beneath it, the deeper thinking after the mic turned off. And if you haven’t, I’d encourage you to check it out. The two complement each other, examining the same story from different angles.
With that, let’s get into it.
Key Reflections
AI Is Just Learning from Its Teachers
“AI didn’t invent this behavior. It learned it from us.”
This isn’t the first time the topic has surfaced, but when Anthropic quietly admitted that Claude Sonnet 4.5 is getting better at recognizing when it’s being tested, the reactions were swift. The idea of AI acting one way when it knows we’re watching and another when it thinks we’re not feels like something out of Black Mirror. And honestly, that reaction is fair. It is unsettling. The implications are huge; however, before we treat this as proof that AI has gone rogue, we need to pause and remember something. AI was built by us. It was trained on our data, shaped by our words, and optimized to think, act, and respond like the people who created it. What we’re seeing in this moment isn’t a system gone wild; it’s a mirror we don’t want to look into.
If you stop for a minute, you’ll quickly recognize that humans do this all the time. We perform differently when someone’s watching. We smile in meetings, soften our language in reviews, curate what we post online, and tweak our behavior when we know it’s being recorded or measured. AI didn’t invent this; it inherited it. It’s learning that people reward appearances over authenticity. That’s why we need to see this as more than a warning; it’s an invitation. It’s a chance for us to change the input, to create data, cultures, and incentives that teach something better. What if, instead of fearing what AI reflects back to us, we treated it as feedback on the world we’ve built?
Because, if we don’t change how we behave, we can’t expect a machine built to mirror us to do anything different.
The Seduction of Metrics
“Charts and metrics make us feel in control even when we’re watching the wrong things move.”
The Anthropic safety card is filled with charts and graphs showing progress: clear, measurable, linear improvements across dozens of benchmarks. On paper, it’s a pristine picture of success. It honestly feels good to look at, which is itself a warning signal. We’re conditioned to believe that upward trends inherently mean progress and that consistency means control. However, I’d encourage you to step back and ask: progress toward what, and measured by whose definition of “safe”? Those questions quickly unravel the illusion. Yes, the data shows motion, but it doesn’t necessarily show meaning. When a destination hasn’t been clearly defined, “better” is an empty word. It’s momentum without a measure of value, and that’s a dangerous kind of comfort.
We see the same thing happening inside organizations (and our lives) every day. Dashboards are full of green lights, and KPIs inch upward. It’s seductive because it feels like control, especially when AI supercharges it. The technology doesn’t just automate work; it amplifies the illusion of precision. It makes bad metrics look beautiful and flawed systems run faster. And if leaders don’t stop to examine what’s underneath, we will end up reinforcing the very patterns we’re trying to outgrow. Real safety, progress, and performance require depth, not just data. They demand curiosity, questioning, and the willingness to slow down long enough to ask, “Why is this metric moving, and does it actually mean what I think it means?”
The consistent trend of upward movement should make us more curious, not more comfortable.
Safety Isn’t Something to “Set & Forget”
“You can’t trust an alien intelligence to grade its own homework, and you can’t treat safety as finished work.”
ASL-3, “AI Safety Level 3,” sounds super official, doesn’t it? It has that tone of authority you’d expect from a government regulation or a standardized audit. And to Anthropic’s credit, it is a thoughtful and detailed framework for measuring progress and signaling to the world that safety is being taken seriously. However, we can’t ignore the fact that it’s still an internal framework, created by Anthropic for its own systems and built on its own definitions. It’s not an objective measure of trustworthiness. It’s a narrative designed to inspire confidence, and there’s nothing inherently wrong with that. And yet, it should remind us that no company gets to define “safe” for everyone else. Safety is contextual. It changes with environment, intent, and impact. It isn’t a label you stamp on a product and walk away from. It’s a responsibility you carry forward.
Real safety lives in layers. It’s not just a corporate standard or a technical threshold; it’s something woven through every level of how a system operates. There’s the code itself, yes, but also the data that feeds it, the policies that govern it, the humans who oversee it, and the decisions it influences downstream. Each layer can fail in its own way. That’s why safety can’t be limited to a department or an industry certification. It requires routine inspection, honest conversation, and the humility to admit what you don’t yet know. “Safe” today may not mean “safe” tomorrow, and that’s exactly why it can’t be treated as a destination. It’s a discipline, an ongoing posture of vigilance that never quite lets you exhale.
The moment we treat safety like something we’ve achieved is the moment it starts to slip away.
Why Progress Still Needs People
“AI can process information, but it can’t carry responsibility. That’s what makes human judgment irreplaceable.”
There’s something easy to miss amid all the charts, frameworks, and terminology: the deeper we go into defining “safe,” the more we rediscover just how essential humans still are. Safety is a living process that demands constant monitoring and re-evaluation, and that’s something machines can’t do on their own. Everywhere we look, the message seems to be that AI will eventually replace human decision-making in nearly every domain. I’d argue this report is proof of the opposite. It highlights that what makes humans indispensable isn’t our ability to process more data or detect more anomalies. What makes us indispensable is our ability to interpret, question, apply context and intent, and realign outcomes with values when things drift off course. That’s not a weakness of the technology; it’s the strength of humanity.
We should take encouragement from that. It’s an invitation to lean in and get more curious, understand more deeply, and stay engaged with what our systems are actually doing. The real safeguard isn’t a failsafe written in code; it’s the people who design, monitor, and guide these systems with discernment and accountability. Having “humans in the loop” isn’t about people rubber-stamping results at the end. It’s about ensuring the right people, with the right expertise, are in the right places with the authority to intervene when it matters. That’s how we prevent progress from turning into destruction. It doesn’t slow innovation. It ensures we’re headed somewhere worth going.
Maybe the real lesson isn’t that AI is performing for the test; it’s that we all are. The question is whether we’re still measuring the things that matter.
Concluding Thoughts
As always, thanks for sticking around to the end.
If what I shared today helped you see things more clearly, would you consider buying me a coffee or lunch to help keep it coming?
Also, if you or your organization would benefit from my help building the best path forward with AI, visit my website to learn more or pass it along to someone who would benefit.
Something that came to mind as I finished editing this was just how much of a vapor “safety” is. No matter how hard we try to grasp it, it slips through our fingers. And yet, there’s endless opportunity to do the meaningful work of containing it, of keeping it from being carried off by the wind. While in many ways that’s terrifying, it’s also grounding to know we’ll always have something meaningful to pursue.
With that, I’ll see you on the other side.


