This isn't the first open-weight LLM to be released. People tend to get a feel for this stuff over time.
Let me give you some more baseless speculation: Based on the quality of the 3.5 27B and the 3.6 35B models, this model is going to absolutely crush it.
Divide the number before the B by 2, and that's roughly how many GB of VRAM you need for a Q4_K_M quant, plus a bit of room for KV cache.
TLDR: If you have 14GB of VRAM, you can try out this model with a 4-bit quant.
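If you want that rule of thumb as something you can plug numbers into, here's a rough sketch. The function name and the `kv_cache_gb` allowance are my own placeholders, not anything from a real tool, and the "half a byte per parameter" figure is only the back-of-envelope approximation above; actual quant files tend to come out somewhat larger.

```python
def estimate_vram_gb(params_billions: float, kv_cache_gb: float = 1.0) -> float:
    """Back-of-envelope VRAM estimate for a ~4-bit quant.

    Assumes roughly half a byte per parameter (the "divide by 2" rule),
    plus a user-chosen allowance for KV cache. Both are approximations.
    """
    weights_gb = params_billions / 2  # ~0.5 GB per billion parameters at 4-bit
    return weights_gb + kv_cache_gb   # hypothetical headroom for KV cache


if __name__ == "__main__":
    # A 27B dense model with half a GB set aside for KV cache
    print(estimate_vram_gb(27, kv_cache_gb=0.5))
```

Treat the result as a floor rather than a guarantee: longer contexts need more KV cache than any fixed allowance.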
Tokens per second is an unreasonable ask, since every card is different and the answer depends on whether you're using GGUF or not, whether you're on CUDA or ROCm or Vulkan or MLX, what optimizations are in your version of your inference software, what flags you're running, etc.
Note that it's a dense model (the Qwen models have another value at the end of the MoE model names, e.g. A3B) so it will not run very well in RAM, whereas with a MoE model, you can spill over into RAM if you don't have enough VRAM, and still have reasonable performance.
Using these models requires some technical know-how, and there's no getting around that.
That's the interesting question, right? Because if this unwinds during a period of external inflation (say, because of a big war and an energy shortage), then even Bernanke would say helicopter money won't work.
Not that I'm some paragon when it comes to critical thinking exactly, but is there any sort of proof or evidence of Anthropic "silencing negativity"? Wouldn't surprise me, but I also haven't seen anything conclusive about it, so spreading it as fact is, ironically, FUD itself.