I got sick of the inconsistency caused by Anthropic tinkering with Claude Code and had canceled my 20x. My plan was to switch to Codex so I could use it in Pi.
I am specifically talking about switching because of the harness, not model quality. Anyone else match my experience?
I wonder how many other people recently did the same. It would be prudent of Anthropic to let people use Pro/Max OAuth tokens with other harnesses I think. Even though I get why they want to own the eyeballs.
I’ve been using Codex Pro since they lobotomized Opus 4.6. Codex is so much better, GPT 5.4 xhigh fast is definitely the smartest and fastest model available.
For a while there I had both Opus 4.6 and Codex access and I frequently pitted them against each other, I never once saw Opus come out ahead. Opus was good as a reviewer though, but as an implementer it just felt lazy compared to 5.4 xhigh.
One feature that I haven’t seen discussed that much is how codex has auto-review on tool runs. No longer are you a slave to all or nothing confirmations or endless bugging, it’s such a bad pattern.
Even in a week of heavy duty work and personal use I still haven’t been able to exhaust the usage on the $200 plan.
I’ll probably change my mind when (not IF) OpenAI rug pull, but for spring ‘26, codex is definitely the better deal.
I also made the switch to OpenAI, the $20 plan, I dunno about "so much better" but it's more or less the same, which is great!
The models and tools levelling out is great for users because the cost of switching is basically nil. I'm reading people ITT saying they signed up for a year - big mistake. A year is a decade right now.
It really depends on what you‘re trying to do and what your skillset is.
But if you go information architecture first and have that codified in some way (espescially if you already have the templates), then you can nudge any agent to go straight into CSS and it will produce something reasonable.
I left anthropic a while ago because of the similar shenanigans they had earlier. I went with opencode & zen.
I still have their subscription, but am using pi now, mainly because something happened that made my opencode sessions unusable (cannot continue them, just blanks out, I assume something in the sqlite is fucked), and I cannot be bothered to debug it.
For what I use the agents, the Chinese models are enough
Doesn't using pi be against their terms of use about having to go through Claude Code cli for all Max plan usage? (I had use Droid with Max previously, it was a great combo).
It's unclear right now. The current stance is that using pi or other coding harnesses eats into extra usage and that is the behavior one sees today. We have added a hint to pi now that warns you when you use an anthropic sub.
I also cancelled my 20x and switched to Codex. At this point even the Codex CLI seems to perform better than Claude Code... And so far I'm on the OpenAI Pro plan and haven't even needed to upgrade to their $100/mo plan. I'm getting more value for almost 10x cheaper.
My experience is the opposite of this thread's consensus. Context: Full time SWE, working on large and messy codebase. Not working on crazy automations, working on fixing bugs, troubleshooting crashes, implementing features.
Anthropic models write much better code, they are easy to follow, reasonable and very close to what I would done if I had the time... OpenAI's on the other hand generate extremely complex solutions to the simplest problems.
I was so disappointed by non-Anthropic models, that for a couple of weeks I only used Anthropic models, but based on this thread, I'll go back and give it another try. It's good to go back and try things again every couple of weeks.
Of course, I was annoyed that they lobotomized 4.6, the difference was day and night, and Anthropic is certainly not a company I trust. In my opinion, it shows their willingness to rugpull, so I'm looking at other approaches. Since 4.7, things went back to normal, things you'd expect to work just work.
I feel like Opus 4.7 vs GPT 5.4 is pretty much just flavor variants, the big difference is in the harness. I like the Claude Code CLI better than the Codex CLI, it just clicks with how I like to interact with agents. The codex app on the other hand is better than the Claude app in code view, so if I had to stick to an app it would be codex all the way.
I've been on pi for a few months now, build a custom tmux plugin so i can use nested pi and mix and match codex / claude instances.
pi has been the better harness out of all the ones i tried, first and third party.
Ever since the Anthropic block i've just canceled all my claude subs. Used to be codex was a bit worse, now they're practically equal. Claude is slightly better at directing other agents but the difference is too minor and not worth the money.
Claude usage limits / costs are absurd.
Any 'principles' people praise anthropic for are not that relevant to me anyways because i'm not a US citizen.
(Disclosure: I work on tamer, an OSS supervisor for coding agents — biased.)
Add one more to the count. The OAuth-across-harnesses idea would help, but it doesn't fix the shape of the problem.
"Harness" has always felt off to me. Exoskeleton is closer — Claude Code, Codex, opencode wrap the model and augment it from the inside.
What's missing is a layer above that's explicitly not an exoskeleton: a thin supervisor. A master that watches and guides, nothing more. It just relays I/O and hands approval back to the human.
> I wonder how many other people recently did the same.
Some negative signal for better overall view on things: I'm still with Anthropic and will probably stay with them for the foreseeable future.
I think after DoD/DoW shenanigans (which in of itself felt like a reasonable take on the part of Anthrpic) they got a bunch of visibility and new users, so them hitting some scaling limits is pretty much inevitable - so some service disruption is inevitable. Couple this with the tokenizer changes and seeming decrease in model performance (adaptive thinking etc.), and lots of people will be rightfully pissed off, alongside increased downtime (doesn't matter that much for me, definitely does matter for anything time-sensitive).
At the same time, in practice I've only seen it do stupid things across 8 million tokens about 5 times (confusing user/assistant roles, not reading files that should be obvious for a given use case, and picking trivially wrong/stupid solutions when planning things), alongside another 4 times that tests/my ProjectLint tool caught that I would have missed. The error rate is still arguably lower than mine, though I work in a very well known and represented domain (webdev with a bunch of DevOps and also some ML stuff, and integration with various APIs etc.).
At the same time, the 85 EUR they gave to me for free has been enough to weather the instability in regards to pricing changes and peak usage. They've fixed most of the issues I had with Claude Code (notably performance), and the sub-agent support is great and it's way better than OpenCode in my experience. They also keep shipping new features that are pretty nice, like Dispatch and Routines and Design, those features also seem nice and not like something completely misdirected, so that's nice. The Opus 4.7 model quality with high reasoning is actually pretty nice as well and works better than most of the other models I've tried (OpenAI ones are good, I just prefer Claude phrasing/language/approaches/the overall vibe, not even sure what I'd call it exactly, all the stuff in addition to the technical capabilities).
At the same time, if they mess too much with the 100 USD tier, I bet I could go to OpenAI or try out the GLM 5.1 subscription without too many issues. For now they're replacing all the other providers for me. Oh also I find the subscription vs API token-based payment approach annoying, but I guess that's how they make their money.
Because the Harness is the Moat and key IP not the Models themselves that is the why! now for both OpenAI and Anthropic with all their money raised and the compute they acquire and have in the books of course no one can easily replicate, whom can afford all those datacenters and Nvidia GPUs interconnected is why OpenAI throws you a bone and gives you an Open Source SDK Harness but not the one they actually use for ChatGPT. But now both of them have to deliver and do all the bull-shet they said this models can do... truth is they cannot. So now the bubbles burst and we will see what happens. We all have to buy iPhones or MacBooks so that makes sense, we all use Chrome or Google Search, Instagram, TikTok.
All these models and agents are shortcuts for all of us to be lazy and play games and watch YouTube or Netflix because we use them to work-less, well the party will be over soon.
China already operates like this. Low cost specialized models are the name of the game. Cheaper to train, easy to deploy.
The US has a problem of too much money leading to wasteful spending.
If we go back to the 80s/90s, remember OS/2 vs Windows. OS/2 had more resources, more money behind it, more developers, and they built a bigger system that took more resources to run.
Mac vs Lisa. Mac team had constraints, Lisa team didn't.
Though I do agree with you, I just came back from a trip to China (Shanghai more specifically) and while attending a couple AI events, the overwhelming majority of people there were using VPNs to access Claude code and codex :-/
On the Mac vs Lisa team, I generally agree but wasn't there a strong tension on budget vs revenue on Mac vs Apple II? And that Apple II had even more constrained budget per machine sold which led to the conflict between Mac and Apple II teams. (Apple II team: "We bring in all the revenue+profit, we offer color monitors, we serve businesses and schools at scale. Meanwhile, Steve's Mac pirate ship is a money pit that also mocks us as the boring Navy establishment when we are all one company!")
By the logic of constraints (on a unit basis), Apple II should have continued to dominate Mac sales through the early 90s but the opposite happened.
It has been a very bad bet that hardware will not evolve to exceed the performance requirements of today's software tomorrow, just as it is a bad bet that tomorrow someone will rewrite today's software to be slower.
Eh, but then as hardware evolves, the software will also follow suit. We’ve had an explosion of compute performance and yet software is crawling for the same tasks we did a decade ago.
Better hardware ensures that software that is “finished” today will run at acceptable levels of performance in the future, and nothing more.
I think we won’t see software performance improve until real constraints are put on the teams writing it and leaders who prioritize performance as a North Star for their product roadmap. Good luck selling that to VCs though.
You can fine-tune a model, but there are also smaller models fine-tuned for specific work like structured output and tool calling. You can build automated workflows that are largely deterministic and only slot in these models where you specifically need an LLM to do a bit of inference. If frontier models are a sledgehammer, this approach is the scalpel.
A common example would be that people are moving tasks from their OpenClaw setup off of expensive Anthropic APIs onto cheaper models for simple tasks like tagging emails, summarizing articles, etc.
Combined with memory systems, internal APIs, or just good documentation, a lot of tasks don't actually require much compute.
Harness is a big one, Claude Code still has trouble editing files with tabs. I wonder how many tokens per day are wasted on Claude attempting multiple times to edit a file.
As a recent example in AI space itself. China had scarce GPU resources, quite obvious why => DeepSeek training team had to invent some wheels and jump through some hoops => some of those methods have since become 'industry standard' and adopted by western labs who are now jumping through the same hoops despite enjoying massive computeresources, for the sake of added efficiency.
I'm having an hard time getting my mind to see this.
> Users should re-tune their prompts and harnesses accordingly.
I read this in the press release and my mind thought it meant test harness. Then there was a blog post about long running harnesses with a section about testing which lead me to a little more confusion.
Yes, the word 'harness' is consistently used in the context as a wrapper around the LLM model not as 'test harness'.
This field is chock full of people using terms incorrectly, defining new words for things that already had well known names, overloading terms already in use. E.g. shard vs partition. TUI which already meant "telephony user interface ". "Client" to mean "server" in blockchain.
Some people also call evaluations "tests". There are unexpected things that come along with new models, like the model in a workflow you'd set up suddenly starts calling a tool and never stops or decides to no longer call a particular tool, so running your existing evaluations to catch regressions like this and potentially updating the prompts is considered "testing" your prompts and harnesses.
It’s the tool that calls the model, give it access to the local file system, calls the actual tools and commands for the model, etc, and provide the initial system prompt.
Basically a clever wrapper around the Anthropic / OpenAI / whatever provider api or local inference calls.
pi vs. claude code vs. codex
These are all agent harnesses which run a model (in pi's case, any model) with a system prompt and their own default set of tools.
Because there's absolutely nothing stopping that from happening. There are bots on Reddit, there are of course bots on here, a VPN friendly site where you don't even need an email. But a lot of people don't want to admit it.
I didn’t even have a strong interest in space before the dude started writing about it. Maciej could write about literal rocks and make it worthwhile to read.
I just read one blog post ("Musk on Mars") and it was indeed excellent. He seems to have quite a small readership though, judging from the Substack reactions.
Yeah, though some posts a free. I think real problem is that he decided to start a Mars blog two weeks before SpaceX announced they are now focusing on the moon instead, and prior to that merging with xAI, effectively cancelling any Mars plans.
Instead, the "wild" thing here is that someone let an agent speak on their behalf with no review. The agent posted inaccurate instructions which someone else followed.
Those instructions lead to a brief gap in internal ACL controls, sounds like. I'm sorry, but given that the US government gave 14 year olds off incel Discords full access to Social Security data, this is not shocking by comparison.
To be clear, it is dumb and rude to let an agent speak on your behalf _without even reviewing it_.
This will eventually lead to a bigger snafu, of course. Security teams should control or at least review the agent permissions of every installation. Everyone is adopting this stuff, and a whole lot of people are going to set it up lazily/wrong (yolo mode at work).
Me: “I want to wash my car. The car wash is 50 meters away. Should I walk or drive?”
Opus 4.6, without searching the web: “Drive. You’re going to a car wash. ”
If you can get your hands on it, I recommend Other Networks: A Radical Technology Sourcebook by the same author. She covers barbed wire as well as many other ways to communicate. The book itself is gorgeous.
reply