More

irthomasthomas · 2026-04-24T12:37:47 1777034267

Can you actually get caffeine free coffee? I thought most decaf brands where only about 50% less.

hrldcpr · 2026-04-24T12:48:04 1777034884

You're right that the caffeine isn't entirely removed, but it's supposedly more than 90% removed.

(I'm even seeing the number 97% mentioned a lot online.)

winrid · 2026-04-24T15:49:33 1777045773

The bean is already only like 2% caffeine lol so cheap decaf can definitely really be "half caf" even if they say that.

jeffbee · 2026-04-24T12:57:16 1777035436

That fraction is going to depend a lot on the definition and the reference. I believe the 97% is the US standard for how much of the natural caffeine in green beans must be removed. You will note how this can be manipulated by using a more caffeine-abundant variety. EU standards are more sensible, stated in terms of caffeine content in the final product.

Either way, commercial decaf processes and normal brewing methods will yield something like 5-10mg of caffeine in a "decaf" dose of coffee, which is an order of magnitude less than usual.

irthomasthomas · 2026-04-23T20:31:12 1776976272

This was something I worried about after openai started building apps as well as models. Now all of the labs make no secret of the fact that they are going after the whole software industry. Its going to be hard to maintain functioning fair markets unless governments step in.

irthomasthomas · 2026-04-23T20:19:59 1776975599

Try llm-consortium with --judging-method rank

irthomasthomas · 2026-04-23T20:18:50 1776975530

It beats opus-4.7 but looks like open models actually have the lead here.

irthomasthomas · 2026-04-23T18:57:08 1776970628

> Think Step By Step

What is this, 2023?

I feel like this was generated by a model tapping in to 2023 notions of prompt engineering.

irthomasthomas · 2026-04-23T18:46:50 1776970010

I like chutes because they always use the full weights, and prompts are encrypted with TEE.

irthomasthomas · 2026-04-20T16:22:37 1776702157

10T? Impossible! They told us the training run was under 10^26 flops.

irthomasthomas · 2026-04-20T15:44:22 1776699862

Beats opus 4.6! They missed claiming the frontier by a few days.

NitpickLawyer · 2026-04-20T16:03:24 1776701004

While I'm skeptical of any "beats opus" claims (many were said, none turned out to be true), I still think it's insane that we can now run close-to-SotA models locally on ~100k worth of hardware, for a small team, and be 100% sure that the data stays local. Should be a no-brainer for teams that work in areas where privacy matters.

cedws · 2026-04-20T16:11:22 1776701482

Even the smaller quantized models which can run on consumer hardware pack in an almost unfathomable amount of knowledge. I don't think I expected to be able to run a 'local Google' in my lifetime before the LLM boom.

sterlind · 2026-04-20T17:49:37 1776707377

I'm extremely curious how these models learn to pack a lossily-compressed representation of the entire Internet (more or less) into a few hundred billion parameters. like, what's the ontology?

osti · 2026-04-20T16:33:42 1776702822

I think this one is only about 600GB VRAM usage, so it could fit on two mac studios with 512GB vram each. That would have costed (albeit no longer available) something like less than 20k.

NitpickLawyer · 2026-04-20T16:38:30 1776703110

Yeah, but that's personal use at best, not much agentic anything happening on that hardware. Macs are great for small models at small-medium context lengths, but at > 64k (something very common with agentic usage) it struggles and slows down a lot.

The ~100k hardware is suitable for multi-user, small team usage. That's what you'd use for actual work in reasonable timeframes. For personal use, sure macs could work.

osti · 2026-04-20T18:52:10 1776711130

True, but I think for local models, we are mostly considering personal usage.

zozbot234 · 2026-04-20T17:00:05 1776704405

You could run it with SSD offload, earlier experiments with Kimi 2.5 on M5 hardware had it running at 2 tok/s. K2.6 has a similar amount of total and active parameters.

osti · 2026-04-20T18:50:54 1776711054

Yeah... I would definitely call 2t/s unusable. For simple chats, I'd want at least 15 t/s. For agentic coding (which this model is advertised for), I'd want good prefill performance as well.

veber-alex · 2026-04-20T23:48:00 1776728880

That's just throwing money away. The performance with large context would have been unusable especially if you need to serve more then a single person.

BoorishBears · 2026-04-20T16:05:55 1776701155

Opus is clearly a sidegrade meant to help Anthropic manage cost, so I would say they may have it if it actually beats 4.6

irthomasthomas · 2026-04-20T16:11:46 1776701506

Could be right. I just noticed my feed is absent the usual flood of posts demoing the new hotness on 3D modeling, game design and SVG drawings of animals on vehicles.

pixel_popping · 2026-04-20T16:42:39 1776703359

It doesn't beat Opus 4.6, no way, don't be fooled by benchmarks.

irthomasthomas · 2026-04-20T15:24:20 1776698660

It is pretty obvious from the token speed that opus now is sonnet or haiku size a few versions ago. So Mythos is likely what was called opus. They dont tell us the size but they did co firm the training run for Mythos was under the 10^26 flops reporting requirement.

In an alternate universe, opus 4.7 is sonnet 5, and Mythos is released as Opus. Can you imagine how much praise would be heaped on Anthropic if it opus 4.7 was < half the price it is now?

irthomasthomas · 2026-04-19T11:36:56 1776598616

The link you are commenting on shows data from actual prompts from real users, and the COST of the average prompt increased 37%. I do not think synthetic benchmarks are a rebuttal to real usage data.

andai · 2026-04-19T12:21:25 1776601285

The cost of the input tokens, not the reasoning or output.

Agree though that benchmarks aren't very helpful w.r.t. estimating real world performance or costs.

What we'd need are people giving the same real world tasks to 4.6 and 4.7 and measuring time, quality and costs.

irthomasthomas · 2026-04-19T14:02:26 1776607346

Thanks, that wasn't clear because it mentioned conversations, but it is only measuring the input tokens. So its just measuring the difference in the tokenizer.