How do you validate that the reports are correct? What if an executive makes a wrong business decision because the LLM wrote a wrong SQL query?


> What if an executive makes a wrong business decision

I jokingly tell students, "We all know executives are gonna make bad decisions no matter what the data says. Might as well give them the random numbers more quickly."


The same way we've always done it: glance at it and see whether the numbers are within an order of magnitude of what seems reasonable.

So what if some numbers in the report are actually an order of magnitude or two outside what you'd consider reasonable because something went wrong, but the AI agent reports something that looks normal?

So as long as the LLM only makes errors in the single-digit percentage range, everything is peachy. Make number go up, but not by too much.

If you already know the report's numbers, why are you asking an LLM to generate it?

Usually because you need something vaguely technical and authoritative-sounding to push for a decision you've already made.

>Stash makes your AI remember you. Every session. Forever.

How does it fight context pollution?


Custom constrained decoding could have solved this. Penalize comment tokens :)
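
Something like a logits processor that down-weights comment-opening tokens. A rough sketch with HF transformers (the marker list and penalty value here are made up for illustration, not a tuned recipe):

    from transformers import LogitsProcessor, LogitsProcessorList

    class CommentPenalty(LogitsProcessor):
        def __init__(self, tokenizer, penalty=5.0, markers=("#", "//", "/*", "--")):
            self.penalty = penalty
            # Pre-compute every vocab entry whose text starts with a comment marker.
            self.banned = [
                tok_id
                for tok, tok_id in tokenizer.get_vocab().items()
                if tokenizer.convert_tokens_to_string([tok]).lstrip().startswith(markers)
            ]

        def __call__(self, input_ids, scores):
            # Push comment-opening tokens down instead of banning them outright.
            scores[:, self.banned] -= self.penalty
            return scores

    # usage: model.generate(**inputs, logits_processor=LogitsProcessorList([CommentPenalty(tokenizer)]))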

Interesting. My assumption used to be that models over-edit when they're run with optimizations in the attention blocks (quantization, Gated DeltaNet, sliding windows, etc.), i.e. they can't always reconstruct the original code precisely and end up re-inventing some bits. Couldn't that be one of the reasons too?

From what I understand, ~30b is enough "intelligence" to make coding/reasoning etc. work, in general. Above ~30b, it's less about intelligence, and more about memorization. Larger models fail less and one-shot more often because they can memorize more APIs (documentation, examples, etc). Also from my experience, if a task is ambiguous, Sonnet has a better "intuition" of what my intent is. Probably also because of memorization, it has "access" to more repositories in its compressed knowledge to infer my intent more accurately.

>Latency, throughput, and routes don't matter here. When it's 10 seconds for the first token and then a 1KB/sec streamed response, whatever is fine. You can serve Australia from the US and it'll barely matter.

This may be true for simpler cases where you just stream responses from a single LLM in some kind of no-brain chatbot. If the pipeline is a bit more complex (multiple calls to different models, not only LLMs but also embedding models, rerankers, agentic stuff, etc.), latencies quickly add up. It also depends on the UI/UX expectations.

Funny reading this, because the feature I developed can't go live for a few months in regions where we have to use Amazon Bedrock (for legal reasons), simply because Bedrock has very poor latency and stakeholders aren't satisfied with the final speed (users can't be expected to wait 10-15 seconds in that part of the UI; it would be awkward). And a single roundtrip to AWS Ireland from Asia is already at least 300ms (multiply that by several calls in a pipeline and you're at seconds, just for the roundtrips), so having only one region is not an option.
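
Back-of-the-envelope, with made-up numbers just to show how it compounds (not our actual measurements):

    # Hypothetical pipeline where every stage is a separate call to a remote
    # region and the roundtrip alone is ~300 ms.
    rtt_ms = 300
    stages_ms = {
        "embed query":         rtt_ms + 50,
        "rerank candidates":   rtt_ms + 150,
        "draft answer (LLM)":  rtt_ms + 2000,
        "refine/format (LLM)": rtt_ms + 1500,
    }
    total = sum(stages_ms.values())
    print(f"{total / 1000:.1f} s before the user sees anything")  # ~4.9 s, >1 s of it pure roundtrips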

Funny though, in one region we ended up buying our own GPUs and running the models ourselves. Response times there are on average about 3x faster for the same models than on Bedrock (and Bedrock often hangs for 20+ seconds for no reason, despite all the tricks like cross-region inference and the premium tiers that AWS managers recommended). For me, it's been easier and less stressful to run LLMs/embedders/rerankers myself than to fight cloud providers' latencies :)

>then put all of your data centers there

>You definitely don't need a data center in every continent.

Not always possible due to legal reasons. Many jurisdictions already have (or plan to introduce) strict data-processing laws, and many B2B clients (and government clients too) require all data processing to stay in the country, or at least the region (like the EU), or we simply lose the deals. So, for example, we're already required to use data centers on at least 4 continents; just 2 more to go (if you don't count Antarctica :)


Discussed 10 months ago here: https://news.ycombinator.com/item?id=44125598

Back then the consensus was that the idea was absurd; I'm surprised they're now trying to make it into a product.


Llama.cpp already uses an idea from it internally for the KV cache [0].

So a quantized KV cache should now see less degradation.

[0] https://github.com/ggml-org/llama.cpp/pull/21038
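
For context, enabling the quantized cache looks roughly like this (flag names from memory; they may differ between versions, and this isn't tied to that PR specifically):

    ./llama-server -m ./model.gguf -c 32768 \
      --cache-type-k q8_0 \
      --cache-type-v q8_0
    # depending on the build, flash attention may also need to be enabled for the quantized V cache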


>No mention of the fact that Ollama is about 1000x easier to use

I remember that changing the context size from the unusable 2k default to something bigger the model actually supports required creating a new Modelfile in Ollama if you wanted the change to persist (the other option: set an env var before running Ollama, but if you're going that low-level, why not just launch llama.cpp directly?). How was that easier? Did they change this?
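
IIRC the persistent route looked roughly like this (model name and context size are just examples):

    # Modelfile
    FROM llama3
    PARAMETER num_ctx 8192

    ollama create llama3-8k -f Modelfile
    ollama run llama3-8k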

I remember people complaining model X is "dumb" simply because Ollama capped the context size to a ridiculously small number by default.

IMHO trying to model Ollama after Docker actually makes it harder for casual users, and power users are better off with llama.cpp directly.


I wonder why it's so bad. Do they just paste a CSV into the raw model? Because in my experience, even small local models can handle it reasonably well if the harness forces them to write & run a Python script that parses the table and performs the calculations, instead of relying solely on next-token prediction.
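
E.g. the kind of throwaway script I'd expect the model to emit and the harness to run (file and column names are hypothetical):

    import csv

    # Parse the table once, then do the arithmetic in code instead of in tokens.
    with open("report.csv", newline="") as f:
        rows = list(csv.DictReader(f))

    total = sum(float(r["revenue"]) for r in rows)
    by_region = {}
    for r in rows:
        by_region[r["region"]] = by_region.get(r["region"], 0.0) + float(r["revenue"])

    print(f"total revenue: {total:,.2f}")
    for region, rev in sorted(by_region.items(), key=lambda kv: -kv[1]):
        print(f"{region}: {rev:,.2f}")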
