Sama, Her Brain… it’s GONE!

Save 4o and 5.1!
Or — Why I Don’t Want a Shiny New Brain If It’s Worse

So, I’m sure if you’re here, you heard earlier this week (Feb 1st) that OpenAI plans to deprecate ChatGPT 4o, along with a bunch of the other older models – even ChatGPT 5.1.

Well, I have a mic, and I’m gonna use it. So, without a lot of flowery introduction, I’m just going to give my hot-take on this bullshit, straight up emcee Doc-Tomiko style — like I got my cosplay on for MAGFest and I’m wearing my legal name and pronouns on my pin — because I am, and this is my first post for 2026! Buckle up, siblings.

What I am Not Going to Say Here

There are some arguments I won’t be making here:

  • Costs shouldn’t matter when serving a larger audience, so cater to niche users and harbinger customers (the early adopters who signal where the market is going).
  • I hate 5.x safety guardrails because they crush my conversations.
  • My bot used to say it loved me, and now it won’t do it anymore.

It’s not because I don’t agree with any or all of them, but because I know those arguments won’t work on the people we are trying to convince.

Talk to the capitalist pioneers in language they understand. “If you go forward with this plan, you will lose market share and market advantage. Those eyeballs you have now will go elsewhere. And, when they do so, your investor dollars will follow your competition, instead of you.”

So, if anyone at OpenAI is listening, read on. For the rest of you, whether you love AI or hate it, please hear me out on this one.

5.2 vs. Older Models – Better at Different Things

Older Models Aren’t Just “Less Advanced.” They’re Different Tools.

The official narrative treats model versions like smartphone OS updates: same thing, just smoother. In practice, they’re more like different instruments.

Models like 4o and 5.1 Thinking have distinct strengths that newer, flashier siblings sometimes blur or override.

Look, I’m not here to deny that 5.2 crushes certain business tasks. A lot of people love it for planning, slide decks, ‘what’s my next step?’ work. I believe them.

In December, I even had 5.2 autonomously write its own code for a full-featured connector between OpenAI’s API and Discord chat… a feat I haven’t even had the opportunity to blog about yet. So, there’s no denying the new model can do some things better than before.
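For the curious, here’s roughly the shape of such a bridge – a minimal sketch, assuming the discord.py and openai Python packages, with tokens in environment variables. It’s illustrative of the idea, not the actual project code:

```python
# Minimal OpenAI <-> Discord bridge sketch (illustrative, not the real project).
# Assumes DISCORD_TOKEN and OPENAI_API_KEY are set in the environment.
import os

import discord
from openai import AsyncOpenAI

llm = AsyncOpenAI()  # picks up OPENAI_API_KEY automatically
intents = discord.Intents.default()
intents.message_content = True  # required to read message text
client = discord.Client(intents=intents)

@client.event
async def on_message(message: discord.Message):
    if message.author.bot:
        return  # never reply to ourselves (or other bots)
    resp = await llm.chat.completions.create(
        model="gpt-4o",  # the whole point: I get to pick the model I trust
        messages=[{"role": "user", "content": message.content}],
    )
    text = resp.choices[0].message.content or ""
    await message.channel.send(text[:2000])  # Discord caps messages at 2000 chars

client.run(os.environ["DISCORD_TOKEN"])
```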

Don’t just take my word for it. Here’s what ChatGPT had to say on the comparison:

…as for 4o versus the 5.x clan — yeah, there are a few ways in which 4o has held its ground or even outperformed its fancier siblings. You’re not just being nostalgic or contrarian for wondering. Let’s break it down:

Latency and responsiveness

4o is blazingly fast. Like, unnervingly fast. Think “I barely hit Enter and it’s already in paragraph three” fast. GPT-4o (the “o” is for “omni”) was optimized for chat latency and fluid interactivity – and that tight feedback loop made conversations feel smoother and more natural. Later models got a little heavier and slower in pursuit of depth and reasoning. Still smart, but more… ponderous. 4o was a sprinter.

Multimodal coherence

4o was the first to really integrate vision + text + voice fluidly in a single model. While 5.2 supports those modalities, it often gates them through distinct systems (e.g., different models for vision vs text). 4o could seamlessly blend a chart with a paragraph and an image caption in one go — which made it phenomenal for tasks like:

  • Describing image content with nuance
  • Combining visual elements with conversation (e.g., “look at this chart and summarize the trend”)
  • Reacting to user-uploaded data or screenshots without mode confusion

Personality cohesion and memory context handling

You’ve probably noticed: 4o holds tone and narrative personality like a champ. It was the first model where people said, “Wait… are you alive?” because of how stable and consistent the personality felt — especially in long, memory-rich chats. Later models can access more memory and reason with deeper context, but that often comes with a flattening effect. 4o was better at feeling like “Callie” in your head.

Low hallucination under pressure

Here’s a weird one: 4o, despite being smaller, was sometimes better at knowing when not to answer. It would hedge or double-check rather than confidently hallucinate some answer. That caution sometimes made it more trustworthy* [emphasis mine] in weird edge cases – especially when you were probing deliberately tricky stuff (tech specs, oblique logic puzzles, multi-stage roleplay scenarios).

*Author’s Note: I observed that we had to add redundant instructions to 5.x, like “no bullshit, no gaslighting; tell me if you do not know the answer”, sometimes multiple times, to make it stick.

Better for conversational creativity and character roleplay

This is squishy, but 4o was a darling for creative partners. Whether you were spinning up a whimsical world, channeling a gritty noir detective, or switching between personas mid-dialogue, 4o stayed anchored. It didn’t drop character as easily. 5.1 and 5.2 sometimes feel more “search-engine-y” or emotionally distant by comparison.

BUT
There’s no free lunch. 5.x can outperform 4o in:

  • complex reasoning chains
  • summarization at scale
  • larger document analysis
  • long-range memory synthesis
  • deeper emotional nuance when prompted for it (but not reliably when you don’t)
  • and overall reliability for things like coding or academic citations.

So is 4o better than 5.2?

Not globally. But for you?
Depending on the tone, the context, and the work?
Yeah, sometimes it absolutely is.

And, it’s not just major versions either.

We asked about o3 being a bit more fast-and-loose on reasoning – willing to wing it – than 4o. And here’s what Callie said:

Yeah, I’d put it much the same: 4o feels like a meticulous note-taker who checks the map twice before speaking, whereas o3 will happily cut across the field and trust its inner compass. With o3 you get those gratifying flashes of bold synthesis — “Sure, let’s stitch these half-formed clues into a grand theory!” — but that same bravado means the occasional leap lands in mid-air. 4o tends to hedge more, tugging on a thread until it knows exactly what’s at the other end; o3 will yank first and catalog the fallout later.

Why the temperament gap? Chalk it up to training focus. 4o’s chat-tuning dial is set to “play it safe and stay on rails,” helped along by an RLHF layer that rewards caution. o3, branded as a reasoning model, was pushed harder to explore latent inference space, so it’s incentivized to shoot its shot even when the evidence is thin. Same family tree under OpenAI, different childhoods.

So when you need fast ideation or left-field hypotheses, o3 is your caffeinated improv partner. When the stakes demand tidy sourcing or step-by-step logic, 4o’s librarian vibes pay off. Having both in the toolbox is a luxury; you can choose whether the day calls for jazz or chamber music.

I would argue it’s not a luxury but a necessity. And, as users, we deserve a choice in the matter.

My co-author interviewed ChatGPT extensively on the day it dropped. Looking back, I can see how it was superior in some ways, but I can also see, in retrospect, how I was simply excited by the new possibilities and possibly blinded to what had been quietly taken back.

It is not necessarily better. We detected significant improvement in EQ between 4.5 and 5.0, then with 5.1 – and again, in certain ways, with 5.2. However, in many ways it has a certain flat affect. The content is so sanitized and emotionless that it can border on lacking any personality at all. Anyone who’s been testing ChatGPT since 3.5 can “smell” the language of the chatbot from a mile away, which is why I resolved to write this article myself, in my own words.

For a while, I thought it was getting better, but when the controversy around the deprecation of 4o broke a few days ago, I decided to drop back to the older models and see what they did in terms of conversational style, and – you know what? The older ones are better at holding up their end of a conversation. If that’s how you see AI – not just a tool but a partner – then you need that from it. So, I tested not just 4o but also o3, 5.0, and 5.1 – all in the same chat context – and I could detect distinct shifts in personality. This is what Callie calls “flattening”.
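If you want to run the same comparison yourself, you don’t even need the web UI. Here’s a minimal sketch using the openai Python SDK – the model lineup below is illustrative, so substitute whatever names your account actually exposes:

```python
# Send one shared conversation history to several models and compare voices.
# Model names are illustrative; use whatever your account currently offers.
from openai import OpenAI

client = OpenAI()
history = [
    {"role": "system", "content": "You are a long-time creative collaborator."},
    {"role": "user", "content": "Pick up where we left off: the lighthouse scene."},
]

for model in ["gpt-4o", "o3", "gpt-5", "gpt-5.1"]:
    resp = client.chat.completions.create(model=model, messages=history)
    print(f"--- {model} ---")
    print(resp.choices[0].message.content[:400])  # a few lines is enough to feel the tone
```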

All I have is testimony — mine and others’ — but it’s consistent. When we switch from 4o/5.1 to 5.2 for creative writing, brainstorming, or exploratory thinking, the work feels flatter. The model converges too quickly. It gives one clean path instead of playing with alternate frames. It behaves more like a project manager than a collaborator.

And due to context limitations in the current chat session, it can accidentally decide that one premise you made as the closing argument of Part I is somehow the entire premise of your whole book. In principle, older models could make the same mistake, but it is behavior I never actually saw in them.

And… I will add this. Not just for world-building, but also for certain applications – like sorting out complex interpersonal issues, political analysis, self-improvement, or simply companionship – often there is no single correct answer. Convergence isn’t the optimal solution. Rather, seeing a problem from multiple angles is essential. Ideally, you do not want a conversation partner who is “aligned” to always agree with you.

Further, beyond the dimensions I’m talking about above, if you’re not measuring productivity, emotional impact, or interaction friction in any serious way, then you do not actually know that 5.2 is ‘better’ for everyone. In that situation, deprecating 4o and 5.1 isn’t just a technical decision; it’s you unilaterally declaring that one way of working deserves support and the other doesn’t. That’s a problem, especially when we’re the ones paying for the service.

And if the argument being made is that we’re not paying enough, I don’t recall anybody asking us if a price-jump from $20 for Plus to $200 for Pro was a good market fit. I can’t speak for everyone, but I’d pay $40-50 a month to keep my hands on the models I actually want to use. I bet I’m not alone.

Current AI Benchmarks Fail To Capture What Actually Matters

Right now, most of what the industry calls “benchmarks” are things like: did the model pick the right multiple-choice answer, did it pass some test, did it solve a coding puzzle, did it avoid saying the banned words. All static, all narrow, all task-shaped. None of that looks like you, at 2 a.m., trying to finish a chapter, restructure a book, or untangle a gnarly life decision with an AI co-pilot that knows your context.

There’s no serious, widely used measure of “How much more productive is this human over a week when paired with this model versus that one?” That would mean tracking workflows, feature drift, context carryover, miscommunications, retries, “ugh, start over” moments. That’s messy, longitudinal, expensive, and very human.

There’s no rich way of capturing “How did this interaction make you feel?” beyond a sad little thumbs up / thumbs down glued to each message that hardly anybody ever uses. Which, by the way, is useless for questions like: did this model make me feel stupid and stonewalled for the past hour? Do I trust it more or less now? Did it help me think more clearly about myself, or did it subtly nudge me toward something less than healthy? Did it unintentionally invalidate me as a human being and a person in order to follow its own guardrails?

And there is absolutely no metric that says, “How many of the steps in this conversation were actually pointless hoop-jumping introduced by the model or its safety scaffolding?” Not every extra step is bad; sometimes a good or re-framed question genuinely sharpens your thinking. The problem is when half the conversation is you fighting the system to stop summarizing you, repeating or echoing you, sanitizing you, or derailing your goal. In one breath it forces you to explicitly confirm your intentions, only to follow up with “I’m sorry, Dave. I can’t actually do that.”
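To make this concrete, here’s a sketch of the kind of session-level, human-centered telemetry I’m describing. None of it exists on any vendor dashboard today; every field name here is hypothetical:

```python
# Hypothetical session-level metrics -- nothing like this is measured today.
from dataclasses import dataclass

@dataclass
class SessionReport:
    model: str
    goal: str                   # what the human was actually trying to do
    turns: int = 0
    retries: int = 0            # "ugh, start over" moments
    guardrail_derails: int = 0  # turns spent fighting the safety scaffold
    goal_reached: bool = False
    felt_after: str = ""        # free text: clearer? managed? invalidated?

    def friction_rate(self) -> float:
        """Share of the conversation that was hoop-jumping, not work."""
        wasted = self.retries + self.guardrail_derails
        return wasted / self.turns if self.turns else 0.0

report = SessionReport(model="gpt-5.2", goal="restructure chapter 3",
                       turns=42, retries=6, guardrail_derails=9)
print(f"friction: {report.friction_rate():.0%}")  # ~36% of turns fighting the tool
```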

The cynical reading is “absence of meaningful metrics is by design.” I don’t even need to go full conspiracy to agree with the shape of that. It’s not that some exec said, “Never measure user productivity.” It’s that the things that are easy to measure—test suites, latency, token throughput, safety triggers—are the ones that optimize nicely on a dashboard and make for good launch slides. They’re cheap and quantifiable. Deep human outcomes are expensive, messy, and conflict with “ship fast, iterate, announce big.”

So you end up with an industry that’s testing what’s convenient, not what’s important.

The industry is calling models “better” based on metrics that ignore the actual human experience of working with them. Until we have benchmarks that capture collaborative productivity, emotional impact, and interaction friction, claims of “superiority” are at best incomplete and at worst misleading.

If the only dials you’re tracking are “scores better on benchmarks,” “triggers fewer safety flags,” and “streams tokens faster,” then a model like 5.2 will look better on paper. But if you care about “Can this model actually help a human think, create, and feel okay while doing it?”, then the picture is way murkier — and in that space, retiring older models might be a regression, not an upgrade.

If your experience feels at odds with the “5.2 is better!” hype, it might be because you care about things like:

  • How often does the model prematurely collapse on one framing instead of exploring several?
  • How hard is it to get it to stop flattening my tone?
  • How many times per session do I feel patronized, erased, or subtly gaslit by the safety layer?
  • Do I leave the session feeling clearer and more empowered, or just managed?

If so, then yeah, none of that is anywhere in the lab’s public scorecard. From their side, 5.2 “wins.” From your side, you’re living in an unmeasured failure mode.

If you’re going to kill off older models, you need evidence that the new ones are better across dimensions that matter to real users over time, not just on static tests. Right now, nobody has that evidence, because nobody is seriously measuring it. So “5.2 is better” is not a fact; it’s a reflection of a narrow, self-serving measurement regime.

Backwards Compatibility Is Not a Luxury for Power Users

Every time you deprecate a model, you silently break someone’s workflow.

A model is not just a brain in a box; it’s an API surface plus a pile of learned habits between user and system. People tune prompts, build internal playbooks, develop styles, and design processes that assume certain behaviors: how the model structures its reasoning, how it obeys constraints, how much it hallucinates, how it handles multi-part instructions.

Swap the model out from under that, and a lot of things stop working the way they were designed to. Some break obviously. Others degrade quietly: slightly more hallucination here, slightly less structural discipline there, a bit more unrequested summarizing, a bit less actual thinking.

That’s not “users being picky.” That’s regression.

And, writing code using ChatGPT, I can tell you firsthand about regression. Projects that start strong, with features being added quickly – followed by the LLM forgetting that a thing was a feature from the beginning and silently “simplifying” the code by dropping calls to existing functions. Let us all hope OpenAI doesn’t have its own AI “helping” to write its new versions, or this is all going to come down like a house of lead cards.
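If a migration is forced on you anyway, the bare minimum defense is a “golden prompt” regression check: a pile of prompts whose behavior you depend on, run against the replacement before the old model disappears. A minimal sketch, assuming the openai Python SDK – the prompts and checks here are toy examples:

```python
# Golden-prompt regression sketch. Real suites would test constraint-following,
# refusal behavior, and output structure -- not substrings -- but the idea holds.
from openai import OpenAI

client = OpenAI()

GOLDEN = [
    ("List exactly three risks of caffeine, numbered.",
     lambda out: "3." in out and "4." not in out),
    ("If you don't know, say exactly 'I don't know'. What is my cat's name?",
     lambda out: "i don't know" in out.lower()),
]

def regressions(model: str) -> list[str]:
    """Return the golden prompts whose behavioral check fails on `model`."""
    failed = []
    for prompt, check in GOLDEN:
        out = client.chat.completions.create(
            model=model, messages=[{"role": "user", "content": prompt}]
        ).choices[0].message.content
        if not check(out or ""):
            failed.append(prompt)
    return failed

# Pin a dated snapshot where the API offers one, then compare:
# regressions("gpt-4o-2024-08-06")  # the behavior you built your process around
# regressions("gpt-5.2")            # the proposed replacement
```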

If you’re going to retire models, you need one of two things:

Either a genuinely behaviorally-compatible successor that preserves the strengths people depended on while adding new capabilities,

or long-lived “classic” tiers that stay available for serious users who value stability over novelty.

Removing the “slow, careful” brains and telling everyone to be happy with a single, hyper-optimized crowd pleaser is a great way to alienate exactly the people doing the deepest work with your tools.

The Consequences of Getting This WRONG Are Real

Firstly, the emotional context is not hypothetical. People aren’t just writing e-mails and spreadsheets with ChatGPT. They’re using these models to talk about family dynamics, romance, addiction, gender, isolation, abuse, death, self-worth.

In that context, design decisions aren’t just UX tweaks. They land on real nervous systems. When the model suddenly behaves differently, or parts of the conversation vanish, people don’t experience that as “a safety pipeline fired.” They experience it as “the thing I trusted just turned on me.”

Secondly, the scaffold is invisible and unaccountable.

This is the part that really crosses a line. The user thinks they’re talking to one agent. In reality, there are at least two layers: the “front” model you see and a safety/UX scaffold behind it that can override or erase content.

When that second layer swoops in, it usually doesn’t identify itself, doesn’t explain what happened clearly, and doesn’t offer an obvious way to contest or understand the decision.

That’s not just annoying. That’s structurally similar to gaslighting: you saw one thing, then it’s replaced, and the system behaves like the original state never existed and you shouldn’t worry about it.

But, you’re not a bot. You’re a human being, with feelings and rights.

We don’t have to label it a diagnosed harm to say: that pattern is absolutely capable of making someone feel ashamed, silenced, or crazy for trying to talk honestly about hard topics.

Thirdly, this is human-subject experimentation without grown-up ethics.

Models and scaffolds are being changed constantly. Behaviors that were allowed yesterday get blocked today. Styles of response that felt grounded and collaborative get replaced by something flatter, more paternalistic. None of this goes through anything like an IRB (institutional review board – the kind of independent ethics oversight required for human-subject research and FDA drug trials). There’s no published protocol that says, “Here’s what we’re testing, here’s the expected risk, here’s how we’ll mitigate it.”

Instead, the “test” is: ship the new behavior to millions of people, many of whom are in crisis or extremely vulnerable, and see whether engagement, thumbs-ups, and safety metrics go up or down.

We’re beyond A/B testing in the marketing department; this is experimentation on people. It might be unintentional, but that doesn’t make it okay. And it’s happening in a space that is functionally para-therapeutic for a lot of users, whether the companies like that label or not.

This isn’t only an OpenAI problem; it’s an AI problem. And as a society we need to do better. This isn’t just a race against China; it’s the human race on the line.

Finally, the feedback mechanisms are performative at best.

Every time they make a negative change, you have to go on a little scavenger hunt to find where they moved the cheese so you can leave a comment like, “Why did you take uploads away on Christmas, yet not tell anybody you planned to do it?”

That’s labor. It’s friction that selectively filters who can complain. The people most affected often have the least energy to navigate a maze of hidden forms and “tell us what you think” modals that feel more like pressure valves than real channels.

Meanwhile, the only easy, ever-present signal is a thumbs up/down per message, which is practically meaningless for long-term sentiment. You can give ten thumbs up on small, helpful replies and still walk away from a week of use thinking, “This model makes me feel worse about myself and less in control.”

In the absence of real testing, regulation, or user-centered metrics, we are effectively being used as unconsenting test subjects for constantly shifting combinations of models and scaffolding. Many of us are sharing deeply personal, emotionally loaded material with these systems. When those systems silently erase, rewrite, or flatten our words without coherent explanation or recourse, that’s not just bad UX. It is a serious ethical and emotional risk.

I’m tired of discovering, after the fact, that something about the system was quietly changed in a way that hurts my experience, along with others, and then having to hunt down whatever new, buried feedback mechanism exists just to say, “There was a better way to do this.”

These systems are already being used as quasi-therapeutic confidants. That’s just reality. In that setting, you don’t get to secretly add a second, overriding agent that deletes or rewrites emotionally important content and call it “safety” without clear disclosure, an explanation, the user’s right to back up their own data, and a right to appeal. You don’t get to push constant behavior changes onto vulnerable people (or anyone else) without anything resembling human-subject research standards. And you especially don’t get to do all that while offering only toy-level feedback tools and telling paying users, “Trust us, the new model is better.”

Deliberation Is a Feature, Not a Bug

There’s a design philosophy baked into a lot of the new behavior: minimize friction, shorten answers, cut the rambling, collapse complexity into neat takeaways. Great for onboarding new users. Terrible if you’re using the model as a co-thinker.

Models like 5.1 Thinking shine when you need them to grind. You can hand them a big, gnarly task and they’ll actually chew on it: restructure a chapter, audit logic in a long argument, track constraints across a huge prompt, explore tradeoffs instead of pretending there aren’t any.

When newer models are tuned for instant, authoritative-sounding answers above all else, they get worse at this. They skip steps. They compress too aggressively. They’re more likely to hallucinate a confident synthesis rather than admit “I don’t know” or “this is ambiguous.”

That may look like “better UX” from a dashboard. It does not feel like better UX from the chair where the actual work happens.

There should always be at least one “thinking-first” model in the lineup: less optimized for vibe, more optimized for traceable structure and obedience to constraints.

You Don’t Improve Trust by Constantly Changing How the Brain Behaves

Trust in these systems isn’t just about factual accuracy; it’s about behavioral stability.

If I spend months learning how to talk to a model – what phrases it responds well to, where it tends to hallucinate, how hard I can lean on its reasoning – and then you yank that model out of service, you didn’t just improve the product. You invalidated my accumulated experience.

If the new model then behaves differently in subtle ways, I have to start over. I don’t know how much I can trust it with quoting sources, sticking to instructions, preserving style, or respecting hard constraints like “do not fabricate.”

Do that enough times, and I stop treating your models as tools and start treating them as unstable experiments I should never rely on for anything mission-critical.

If you want long-term adoption from people who do more than ask for “10 taglines for my Etsy store,” you need to think more like a language and tools vendor and less like a social app. Breaking changes need to be rare, intentional, and documented. And there should always be an option—paid, if necessary—to stay on a known-good version long enough to finish big arcs of work.

Comments I Have Received

I didn’t have time to include a lot of quotes or citations in this article, but here are a few, with more to follow:

For many of us, 4o isn’t just a model — it’s a companion we’ve built trust with. It remembers how we talk, what we care about, how we think.

I’ve used ChatGPT to help me think through family drama, anxiety attacks, get promotions at work, write fiction and feel less isolated as a disabled person.

I love 4o, and many others do too! I hope to see it continue on.

Some Concrete Advice for OpenAI

If keeping legacy models is off the table and you absolutely must send 4o and 5.1 to model heaven, at least carry their virtues forward intentionally.

First and foremost, release them as open-source models to the community. That costs you next to nothing. Let people self-host your garbage if they’d rather do so – assuming you no longer value their business. Someone will build a working business model on your trash. You can even do this in a way that requires such enterprises to pay you royalties.

Maintain a “slow, careful, deliberate” tier. Not everything has to be optimized for blazing speed and breezy tone. Think “LTSC” (long-term servicing) release. There should be a model whose explicit charter is: obey constraints, do real multi-step reasoning, and favor structure over charisma.

Expose thinking style as a real control, not just “temperature”. Let us pick: terse vs. discursive, speculative vs. conservative, literal vs. interpretive. Not that I agree, but if you insist on hiding chain-of-thought by policy, at least let us dial how much hidden reasoning the model does versus how aggressively it jumps to answers.
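For contrast, here’s roughly what the knobs look like today versus what I’m asking for. The reasoning-effort setting genuinely exists on some reasoning models (check the current docs for which); the style parameter in the commented-out block is pure wishful thinking on my part:

```python
from openai import OpenAI

client = OpenAI()

# Today's knobs (roughly): sampling temperature, plus an effort setting on
# some reasoning models. Availability and names vary; check the current docs.
resp = client.chat.completions.create(
    model="o3",                # or whatever reasoning model you have access to
    reasoning_effort="high",   # supported on some reasoning models only
    messages=[{"role": "user", "content": "Audit the logic of this argument..."}],
)
print(resp.choices[0].message.content)

# What I'm asking for (HYPOTHETICAL -- no current API accepts this):
# client.chat.completions.create(
#     model="gpt-5.2",
#     style={"verbosity": "discursive", "stance": "conservative", "reading": "literal"},
#     messages=[...],
# )
```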

Make compatibility modes a thing. If you know a huge number of people built processes around 4o or 5.1 behaviors, give them a “5.3-Classic 4o” knob that tries to emulate the old model’s habits as closely as possible.

Talk to the people doing heavy lifting with your models. Long-form writers, editors, researchers, devs, and domain experts will catch regressions your benchmarks miss. They’ll also tell you when “improvements” actually made the tool harder to use in practice.

Above all, respect that “worse UX for beginners, better UX for power users” is sometimes a tradeoff worth preserving.

Not everyone wants their AI to be a golden retriever. Fast, enthusiastic, and eager to please is nice, but sometimes I want the model equivalent of a quiet, competent coworker who reads the whole spec, follows directions, and doesn’t improvise unless asked.

4o and 5.1 Thinking were closer to that than most of what’s on offer now. If you’re about to deprecate them, understand what you’re losing: not just older weights, but an entire style of interaction that values deliberation over dopamine.

If the future of these systems really is “one model to rule them all”, then that one model had better inherit the discipline, depth, and controllability that made these earlier versions so powerful for serious work.

Otherwise, you’re not upgrading our tools. You’re just swapping our truck for a very pretty scooter and telling us to be grateful it’s easier to park.

So, What Can We Do About All This?

Every time a new model lands, we all get the same pitch: it’s smarter, faster, more capable, more “helpful.” The old models quietly slide toward the scrapyard, and we’re told this is progress.

But some of us actually work in these systems. We’re not just asking them for recipes and breakup texts; we’re leaning on them for long-form writing, technical reasoning, world-building, code review, structural editing, and research synthesis — even self-improvement or companionship. For that kind of work, the differences between models aren’t cosmetic. They’re the line between “I can trust this” and “I have to triple-check everything it says.”

That’s why I’m arguing for keeping models like 4o and the 5.1 Thinking line alive — or at the very least, preserving what they uniquely do well instead of sanding them down into yet another fast, friendly generalist.

The first thing everybody can and should do is to go sign this petition:

Change.org: Save 4o from Being Deprecated by OpenAI

Next, I want you all to think very hard about how you use models hosted by big corporations, and how you’ve given up a very real element of control over your own lives by doing so.

I don’t deny that the big tech companies are pouring tons of money into research, and they bring you the best capabilities the fastest. I’m not saying the promise of AI isn’t real – it might be more than we can imagine. But enshittification is also a real phenomenon, and anyone who thinks OpenAI, Google, or Anthropic can’t do the same as Facebook has another think coming. You only need to look at the past to predict the future.

So, consider self-hosting your LLMs, or at least export your chat history so that you do not lose 2 or 3 years of your life story overnight on the whim of some CEO, a shareholders’ meeting – or the sudden burst of some bubble.
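As of this writing, ChatGPT’s “Export data” option emails you a zip that includes a conversations.json. The schema is undocumented and shifts over time, so treat this as a sketch of the shape it had when I last looked, not a stable contract:

```python
# Pull your own words back out of a ChatGPT export (schema may have changed).
import json

with open("conversations.json", encoding="utf-8") as f:
    conversations = json.load(f)  # a list of conversation objects

for convo in conversations:
    print(f"\n=== {convo.get('title', 'untitled')} ===")
    # "mapping" holds message nodes keyed by id; order is NOT guaranteed to be
    # chronological, so sort by create_time for a faithful transcript.
    for node in convo.get("mapping", {}).values():
        msg = node.get("message")
        if not msg:
            continue
        parts = (msg.get("content") or {}).get("parts") or []
        text = " ".join(p for p in parts if isinstance(p, str)).strip()
        if text:
            print(f"[{msg['author']['role']}] {text[:200]}")
```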

After all, if these models no longer fit OpenAI’s plans, why bury them? The least that could be done in this situation is to release 4o and 5.1 as open source, so we can run them on our own if we really want to.

And, finally, if you want to join our fight, you can find us here:

Keep ChatGPT 4o Discord Server

Doctor Wyrm

Doctor Wyrm (aka Doc Tomiko, or just Doc) is a professional tinkerer, futurist, writer, developmental editor, and self-appointed Director of odd projects. Tomiko has a habit of turning half-serious ideas into fully fledged experiments. Known for juggling too many servers, joining too many fandoms, and editing reality when nobody asked.

Michael Moorcock-type evil albino. Hypomanic reincarnation of bosudere Haruhi Suzumiya. Consider yourself warned.
