Without fast parallel hardware there would neither have been the incentive to design the Transformer, nor much benefit even if someone had come up with the design all the same!
The incentive to design something new - which became the Transformer - came from language-model researchers who had been working with recurrent models such as LSTMs, whose recurrent nature made them inefficient to train (needing BPTT), and who wanted to come up with a new seq-2-seq/language model that could take advantage of the parallel hardware that now existed and (since AlexNet) was being used to good effect for other types of model.
As I understand it, the inspiration for the concept of what would become the Transformer came from Attention paper co-author Jakob Uszkoreit, who realized that language, while superficially appearing sequential (hence a good match for RNNs), is really parallel + hierarchical. You can see this in linguists' sentence parse trees, where different branches of the tree reflect parallel analysis of different parts of the sentence, which are then combined at higher levels of the hierarchy. This insight gave rise to the idea of a language model that mirrored this analytical structure with hierarchical layers of parallel processing - the parallel processing being the whole point, since it could be accelerated by GPUs. While the concept was Uszkoreit's, it took another researcher, Noam Shazeer, to take the concept and realize it as a performant architecture - the Transformer.
Without the fast parallel hardware already in place, there would not have been any incentive to design a new type of language model to take advantage of it!
The other point is that while the Transformer is a very powerful, general-purpose, and scalable type of model, it only really comes into its own at scale. If a Transformer had somehow been designed in the pre-GPU-compute era, before the compute power to scale it up to massive size existed, then it would likely not have appeared so promising/interesting.
The other aspect to the history is that neural networks, of various types, have evolved in complexity and sophistication over time. RNNs and LSTMs came first, then Bahdanau attention as a way to improve their context focus and performance. Attention was now seen to be a valuable part of language and seq-2-seq modelling, so when GPUs motivated the Transformer, attention was retained, recurrence ditched, and hence "Attention is all you need".
The time was right for the Transformer to appear when it did: designed to take advantage of recent GPU advances and building on top of this new attention mechanism, with the compute power and dataset sizes now available, it started to really shine when scaled from GPT-1 to GPT-2 size, and beyond.
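To make the parallelism point concrete, here's a minimal sketch (my own illustration, not anything from the paper) of scaled dot-product attention in plain Python/NumPy. Every position attends to every other position through a couple of matrix multiplies, with no sequential loop over tokens the way an RNN needs - which is exactly the shape of computation GPUs are good at. It omits the learned projections, multiple heads, and masking, but the structure is the point.

    import numpy as np

    def scaled_dot_product_attention(Q, K, V):
        # Q, K, V: (seq_len, d) arrays; all positions are processed at once,
        # which is what makes this easy to parallelize on a GPU.
        d = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d)                  # (seq_len, seq_len) similarities
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over keys
        return weights @ V                             # weighted sum of values

    # Toy usage: 4 token positions, 8-dim embeddings, self-attention
    x = np.random.randn(4, 8)
    out = scaled_dot_product_attention(x, x, x)
    print(out.shape)                                   # (4, 8)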
Promises/hopes of what AI can do, and also execs being misinformed about what their own companies are doing/achieving with AI. I know of one very well known large company where the CEO is in the press preaching about the need to restructure/lay off because of AI, yet in the trenches there is close to zero AI adoption - only contractors claiming on their JIRA close-outs to be using GitHub Copilot because they have been told to say so.
> execs being misinformed about what their own companies are doing/achieving with AI
And a bunch of yes-men down the lower layers of management funneling these ideas.
In a meeting at my last job, one of the execs was bragging about how a chatbot was reading Jira customer-service tickets and calling tools/APIs to solve them, and it "only costs $1.50 per ticket. How much would a human cost, huh?"
Little did the exec know, but my team was already using a ~600-line Python script to solve the problem with higher precision. The chatbot-automation thing was largely pushed by my manager while I was out on vacation, just so he could earn his good-boy points with the higher-ups. Worst manager I've had in my 14-year career, btw.
Anecdotal, but I saw a tweet from someone who interviewed at Anthropic and was explicitly rejected as a cultural mismatch because they were not against open-weight models.
It's hard to see Anthropic's messaging of "this tech that we're pushing on you is going to take your job and maybe kill you" as being about anything other than regulatory capture, with the goal of getting the government to shut down competitors.
I think OpenAI and Anthropic are both really in a tough spot - spending so much on what is becoming a commodity product, for which neither seems positioned to be the low-cost producer. Maybe a bit like the UK-France Channel Tunnel project, where the product itself is a success but it was a bloodbath for those who invested to build it.
I think your "winner takes all, first mover wins" premise is wrong, even if it may be what Anthropic believe. Their mission has certainly shifted from "save the world from AI" to "push AI onto the world ASAP, because we've got an IPO coming up".
In reality the coding market, which is really the biggest success story for frontier AI (because code is uniquely suited to LLMs and RL), is rapidly headed for, if not already arrived at, commodification, with each release from any of the US big 3 heralded as the best yet, and the Chinese models like DeepSeek, Kimi, Qwen and GLM maybe no more than 6 months behind.
As far as code quality and level of bugs go, Claude Code has certainly been hugely successful despite them, for two reasons.
1) It's a revolutionary product, and people are willing to accept a high level of bugs because of that.
2) The product is an LLM, itself an inherently flawed and unreliable technology, but one that people have got used to. The fact that the agent/harness, as well as the LLM itself, is unreliable and regresses from release to release doesn't much change the vibe.
The quality of code produced by Claude Code, at least the way it has been used to write itself, would be a complete non-starter for any business where reliability is important. Maybe best suited for things like consumer web apps where the cost of product failure, or version regression, is just an annoyed customer rather than a lawsuit.
If you don't have the source code then it makes no difference. If you have the weights and are running some model via llama.cpp, then you are using whatever API llama.cpp exposes, not the API that was used to train the model or that anyone else may be using to serve it.
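As a minimal sketch of what that looks like in practice (using the llama-cpp-python bindings; the model path and prompt here are just placeholders): the interface you program against is whatever the runtime gives you, independent of how the weights were trained or how anyone else serves them.

    # pip install llama-cpp-python
    from llama_cpp import Llama

    # Load a local GGUF weights file; this API is defined by llama.cpp's bindings,
    # not by whoever trained or originally served the model.
    llm = Llama(model_path="path/to/model.gguf", n_ctx=2048)

    out = llm("Q: What is the capital of France? A:", max_tokens=16, stop=["Q:"])
    print(out["choices"][0]["text"])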
> This is why the swing voters / swing states are so important in the US, because only a few million are flexible enough to switch sides.
Of course if the USA were an actual democracy, electing its president by popular vote, then this would not be an issue - every vote would count to tip the balance in favor of whoever the people wanted to elect, not just the votes of the 20% fortunate enough to live in a "swing" state.
You do realize that the US has a greater percentage of its citizens in prison than any other country, including China?
In the US it's not the Uighurs or Tibetans who are being oppressed - it's the blacks and immigrants. The US elected a president who characterizes immigrants as rapists and murderers (while he himself is a convicted rapist, suspected pedophile, and wants to commit war crimes in Iran).
The facade, believed by many Americans, is that the USA is the land of the free, a democracy (despite no popular vote), and one of the good guys, but its actions say otherwise.
I've seen YouTube videos of people growing citrus, among other things, in colder climates in "greenhouses" made of plastic sheeting heated by a thick layer of woodchips which slowly decompose and give off heat.
A number of the RISC processors have a special zero register, giving you a "mov reg, zero" instruction.
Of course many of the RISC processors also have fixed-length instructions, with small literal values being encoded as part of the instruction, so "mov reg, #0" and "mov reg, zero" would both be the same length.
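As a concrete illustration (my own sketch, using RISC-V rather than the generic mnemonics above): every I-type instruction is the same 32 bits wide, with a 12-bit immediate packed into the word, so clearing a register via the zero register or via a literal 0 costs the same four bytes either way - in fact both assemble to the exact same instruction.

    # Encode RISC-V "addi rd, rs1, imm" (I-type, always 32 bits wide).
    # "mv a0, x0" and "li a0, 0" both assemble to addi a0, x0, 0 - the same
    # 4-byte word whether you think of it as copying the zero register or
    # loading the literal 0.
    def encode_addi(rd, rs1, imm):
        opcode = 0b0010011          # OP-IMM
        funct3 = 0b000              # ADDI
        return ((imm & 0xFFF) << 20) | (rs1 << 15) | (funct3 << 12) | (rd << 7) | opcode

    A0, X0 = 10, 0                  # register numbers for a0 and the zero register
    print(hex(encode_addi(A0, X0, 0)))   # 0x513 -> instruction word 0x00000513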