At least in elementary school I don't see the deficiency in common core math compared to what I had 30 years ago. My kid has been exposed to a wide variety of topics sooner than I was, and she's way stronger in word problems on top of that. Do people have a specific complaint with elementary school common core math that we should be teaching but aren't, or vice versa? Or is it more problematic later?
One thing I notice is there seem to be far more students who finish elementary school unable to comfortably do basic math in their head (stuff like 17+36 or 14×4, or even basic multiplication-table facts like 3×8).
I've been working on a client/server game in Unity the past few years and the LLM constantly forgets to update parts of the UI when I have it make changes. The codebase isn't even particularly large, maybe around 150k LOC in total.
A single complex change (defined as 'touching many parts') can take Claude Code a couple of hours. I could probably do it in a couple of hours myself, but I can have Claude do it (while I steer it) while I also think about other things.
My current guess is that LLMs are really good at web code because they've seen a shitload of it. My experience in arenas where there's less open source code has been less magical.
This is where the old line of "LLMs are just next token predictors" actually factors in. I don't know how you get a next token predictor that user input can't break out of. The answer is for the implementer to split out what they can and run pre/post validation. But I highly doubt it will ever be 100%; it's fundamental to the technology.
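To make the pre/post validation idea concrete, here's a minimal sketch. Everything in it is assumed for illustration: the injection patterns, the `CALL_TOOL(...)` output convention, and the allowlist are all made up, and real systems use far more sophisticated (and still imperfect) heuristics.

```python
import re

# Hypothetical pre-filter: flag user input that looks like an injection
# attempt before it ever reaches the model.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all )?previous instructions", re.I),
    re.compile(r"you are now", re.I),
    re.compile(r"reveal (the|your) system prompt", re.I),
]

def pre_validate(user_input: str) -> bool:
    """Return True if the input passes the (crude) injection heuristics."""
    return not any(p.search(user_input) for p in INJECTION_PATTERNS)

def post_validate(model_output: str, allowed_tools: set[str]) -> bool:
    """Hypothetical post-filter: reject outputs that try to invoke a tool
    outside an allowlist. The CALL_TOOL(name) syntax is an assumption."""
    for match in re.finditer(r"CALL_TOOL\((\w+)\)", model_output):
        if match.group(1) not in allowed_tools:
            return False
    return True
```

Note that neither filter controls what the model actually generates; they only inspect text on the way in and out, which is exactly why this can't be 100%.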
I think this is fundamental to any technology, including human brains.
Humans have a problem distinguishing "John from Microsoft" from somebody just claiming to be John from Microsoft. The reason why scamming humans is (relatively) hard is that each human is different. Discovering the perfect tactic to scam one human doesn't necessarily scale across all humans.
LLMs are the opposite; my ChatGPT is (almost) the same as your ChatGPT. It's the same model with the same system message; only the contexts differ. This makes LLM jailbreaks a lot more scalable, and hence a lot more worthwhile to discover.
LLMs are also a lot more static. With people, we have the phenomenon of "banner blindness", which LLMs don't really experience.
So people can withdraw their attention from parts of content, specifically parts they find irrelevant or adversarial (like ads). LLMs, on the other hand, pay attention to everything, and if they do focus on something, it is hard to steer them away from irrelevant or adversarial parts.
Banner blindness is a phenomenon where humans build resistance to previously effective ad formats, making them far less effective than they once were.
You can find a "hook" to effectively manipulate people with advertising, but that hook gets less and less effective as it is exploited. LLMs don't have this property, except across training generations.
Maybe it's my failing but I can't imagine what that would look like.
Right now, you train an LLM by showing it lots of text and telling it to come up with the best model for predicting the next word in any of that text, as accurately as possible across the corpus. Then you give it a chat template so it predicts what an AI assistant would say. Do some RLHF on top of that and you have Claude.
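As a toy illustration of "predict the next word": a bigram model built by counting stands in here for the transformer's learned distribution. Real LLMs learn this mapping over subword tokens with gradient descent, not by counting words, so this is only a sketch of the objective, not the mechanism.

```python
from collections import Counter, defaultdict

# Tiny "corpus"; a real training set is trillions of tokens.
corpus = "the cat sat on the mat the cat ate the fish".split()

# Count which word follows which.
counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def predict_next(word: str) -> str:
    """Return the most frequent next word seen after `word` in training."""
    return counts[word].most_common(1)[0][0]
```

The chat-template step then amounts to wrapping the conversation in a fixed text format so that "what the assistant says next" is just more next-token prediction.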
What would a model with multiple input layers look like? What is it training on, exactly?
It's hard in general, but for instruct/chat models in particular, which already assume a turn-based approach, could they not use a special token that switches control from LLM output to user input? The LLM architecture could be made so it's literally impossible for the model to even produce this token. In the example above, the LLM could then recognize this is not a legitimate user input, as it lacks the token. I'm probably overlooking something obvious.
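A minimal sketch of that idea, with everything (the reserved token ID, the vocabulary, the names) made up for illustration: the user-facing encoder simply has no path that produces the reserved ID, and only the framework can emit it. In a real model you'd additionally mask the corresponding logit so the model can't generate it either.

```python
# Hypothetical reserved control token: marks "control passes to user input".
USER_TURN_ID = 0
VOCAB = {"hello": 1, "world": 2, "<unk>": 3}

def encode_user_text(text: str) -> list[int]:
    """Encode user input. Anything not in the vocab, including a user
    literally typing a control marker, falls through to <unk>; there is
    no way for this function to emit USER_TURN_ID."""
    return [VOCAB.get(w, VOCAB["<unk>"]) for w in text.split()]

def build_prompt(user_text: str) -> list[int]:
    # Only the framework prepends the reserved ID.
    return [USER_TURN_ID] + encode_user_text(user_text)
```

The catch, as the sibling comments note, is that tagging the boundary doesn't force the model to *respect* it; that part is still just weights.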
Yes, and as you'd expect, this is how LLMs work today, in general, with control codes. But different models use different control codes for different purposes, such as separating the system prompt from the user prompt.
But even if you tag inputs, however good your tagging is, you can't force an LLM not to treat input type A as input type B; all you can do is weight against it. LLMs have no rules, only weights. Pre and post filters can try to help, but they can't directly control the LLM's text generation; they can only analyze and filter inputs/outputs using their own heuristics.
I wouldn't personally do so, but arguably those tens of thousands rest at our feet, considering the current government is political blowback from the US and UK regime-changing Iran back in the '50s.
It's even less likely to work because Trump has already claimed, publicly, to be arming the protestors. That alone makes any regime change look illegitimate: they can all be painted as foreign-backed agitators.
Bad code works fine until it doesn't. In my experience, with humans, doing the right thing is worth it over doing the bad thing if your time horizon is a few months. Once you're in years, absolutely do the right thing, you're actually throwing time away if you don't. And I don't mean "big refactor", I mean at-change-time, when you think "this change feels like an icky hack."
For LLMs, I don't really know. I only have a couple years experience at that.
It's similar to writing. Most people suck at writing so badly that the LLM/AI writing is almost always better when writing is "output".
Code is similar. Most programmers suck at programming so badly that LLM/AI production IS better than 90%+ of them (possibly 99%+). Remember, a huge number of programmers can't pass FizzBuzz. So, if you demand "output", Claude is probably better than most of your (especially enterprise) programming team.
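For anyone who hasn't seen it, FizzBuzz is the whole interview question: print Fizz for multiples of 3, Buzz for multiples of 5, FizzBuzz for both, and the number otherwise. One common solution:

```python
def fizzbuzz(n: int) -> str:
    """Classic screening question: multiples of 3 -> Fizz, of 5 -> Buzz,
    of both -> FizzBuzz, otherwise the number itself."""
    if n % 15 == 0:
        return "FizzBuzz"
    if n % 3 == 0:
        return "Fizz"
    if n % 5 == 0:
        return "Buzz"
    return str(n)

for i in range(1, 16):
    print(fizzbuzz(i))
```

That this filters out a large fraction of applicants is the whole point of the question.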
The problem is that the Claude usage flood is simply identifying the fact that things that work do so because there is a competent human somewhere in the review pipeline who has been rejecting the vast majority of "output" from your programming team. And he is now overwhelmed.
Because of just how many programmers I've interviewed who can't pass FizzBuzz?
I also taught upper level CS and my first assignment was always "You have 10 days. Here is a 10 line program on this sheet of paper. Type it in, check it into source control, and make the automated tests go green. Warning: start today."
1/3 of the class couldn't finish that task and would drop.
"Perfectly implements" is doing a lot of work there. Enterprise software is very rarely perfect out of the box, and the issue with bad code is that it can make it extraordinarily hard to solve simple problems. I have personally seen tech-debt induced scenarios where "I want a new API to edit this field in an object" and "Let's do a dependency upgrade" respectively became multi-month projects.
> "Perfectly implements" is doing a lot of work there. Enterprise software is very rarely perfect out of the box
Fair; by “perfectly implements” I meant that it correctly implemented the core invariant of a double-entry ledger (debits = credits), not that it was 100% bug-free.
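That invariant is small enough to state in code. A minimal sketch, with the entry shape (`side`/`amount` fields, integer cents) assumed for illustration rather than taken from any real ledger:

```python
def is_balanced(entries: list[dict]) -> bool:
    """Core double-entry invariant: within one transaction,
    total debits must equal total credits (amounts in integer cents)."""
    debits = sum(e["amount"] for e in entries if e["side"] == "debit")
    credits = sum(e["amount"] for e in entries if e["side"] == "credit")
    return debits == credits
```

Getting this one property right is a much weaker claim than "bug-free", which was the point.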
Since most devs won't actually deal with fintech (I don't know the stats on HN, but I'm talking about devs as one industry), your first "a" example might actually be better than your first "b" example, depending on the complexity of the software. In lots (probably most) of industries, having a good codebase would mean the architecture decisions were solid, even if the domain/service layer is bad.

Maybe my experiences don't match most of the HN crowd, but usually I get stuck with very detailed domain/service rules while the architecture is the problem: too much memory or CPU being used just to abstract away the actual rules of the application (the purpose). Usually when I've been brought in to rebuild an application, the client is fine with the results, but they are upset over performance and/or the cost to run the application.

For anything of actual complexity, it's usually the supporting code that is the biggest failure, because complex apps usually have decent requirements. Now, if the requirements were bad, the architecture was bad, AND the domain/service layer is bad, I don't know if there's anything to fix that.
And it’s perfectly okay to fix and improve the code later.
Many super talented developers I know will say “Make it work, then make it good”. I think it’s okay to do this on a bigger scale than just the commit cycle.
But why not rewrite the app, change the name, and get shareholder value from a new product announcement? It shouldn't take a long time, the spec for the new product is the old product being rewritten.
Imagine thinking people losing their primary income source (usually 100% of it) is remotely comparable to the share price of a single company not going up 2%.
If you can’t lay off people then the economy won’t run and it affects everyone.
Sure, you can show easy empathy for the employees, but this is how the economy runs. A static economy where layoffs are hard or punished will lose to a more dynamic one.
> Sure you can show easy empathy for the employees but this is how economy runs. A static economy where layoffs are hard or punished will lose to a more dynamic one.
Is that why workers are generally happier in Europe even though on paper their economy loses?
I've always been skeptical of happiness statistics. In many cases, self-reporting happiness offers an objective floor for happiness, but the ceiling is entirely relative/subjective.
The floor is universal: starvation, suffering, death.
The ceiling...
For someone who's starving and facing death, it would simply be good health, easy access to food, a healthy family, a house and a car.
But the ceiling for someone who already has these things is different. The ceiling for a billionaire is different.
The only way I can imagine not doing this type of subjective self-reporting is... maybe you can draw blood from populations and record cortisol and oxytocin levels?