> The tokens stored in a KV cache are not arbitrary floating-point data -- they are samples from the exact formal language the model was trained on, and the model is by construction a near-optimal predictor of that language.
You can compress the KV cache to 0 bytes by just recomputing it from the token sequence at every step. This observation is not worth an arXiv paper, though.
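To spell the joke out: the cache is a pure function of the token ids and the fixed model weights, so you can always drop it and rebuild it from the prompt. A toy numpy sketch of one layer's K/V projections, with made-up shapes:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head, vocab = 64, 16, 1000

# Fixed "model weights" (stand-ins for one attention layer's projections).
embed = rng.standard_normal((vocab, d_model))
W_k = rng.standard_normal((d_model, d_head))
W_v = rng.standard_normal((d_model, d_head))

def kv_for(tokens):
    """Keys/values are a deterministic function of the token ids."""
    x = embed[tokens]           # (seq, d_model)
    return x @ W_k, x @ W_v     # (seq, d_head) each

tokens = rng.integers(0, vocab, size=8)

k_cached, v_cached = kv_for(tokens)   # the "cache"
k_rebuilt, v_rebuilt = kv_for(tokens) # throw it away, recompute: "0-byte compression"

assert np.allclose(k_cached, k_rebuilt) and np.allclose(v_cached, v_rebuilt)
```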
This is a MoE model with only ~3B active parameters per token, which works well with partial CPU offload. So in practice you can run the -A(N)B models on systems with somewhat less VRAM than the full model would otherwise need. The more you offload to the CPU, the slower it gets, though.
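For example, with llama-cpp-python you can split the model between GPU and CPU by limiting how many layers get offloaded. A minimal sketch, assuming a GPU-enabled build and a GGUF file (the path and layer count here are made up; tune n_gpu_layers to whatever fits your VRAM):

```python
from llama_cpp import Llama

llm = Llama(
    model_path="models/qwen3-30b-a3b-q4_k_m.gguf",  # hypothetical path
    n_gpu_layers=24,   # fewer layers on the GPU = less VRAM used, but slower
    n_ctx=8192,
)

print(llm("Q: What is 2 + 2?\nA:", max_tokens=16)["choices"][0]["text"])
```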
The IEEE 754 FP16 (half precision) is the older 16-bit format; it has a fairly balanced split between exponent and significand bits (5 exponent, 10 significand).
It was initially supported by GPUs, where it is especially useful for storing the color components of pixels; for geometry data, FP32 is preferred.
On CPUs, some support first appeared in 2012 with Intel Ivy Bridge (the F16C conversion instructions). Better support is provided in some server CPUs and, starting next year, also in the desktop AMD Zen 6 and Intel Nova Lake.
BF16 is a format introduced by Google, intended only for AI/ML applications, not for graphics, so it was initially implemented in some Intel server CPUs and only later in GPUs. Unlike FP16, which is balanced, BF16 has a huge dynamic range (8 exponent bits) but very low precision (only 7 significand bits). That is fine for ML but inappropriate for most other applications.
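To put numbers on the "huge dynamic range, very low precision" trade-off, here is a back-of-the-envelope sketch that derives the largest finite value and the relative precision near 1.0 from each format's bit split (ignoring subnormals and the NaN/inf encodings):

```python
# (exponent bits, significand bits) for each format
formats = {"FP16": (5, 10), "BF16": (8, 7), "FP32": (8, 23)}

for name, (e, m) in formats.items():
    bias = 2 ** (e - 1) - 1                    # IEEE-style exponent bias
    max_finite = (2 - 2 ** -m) * 2.0 ** bias   # largest normal number
    eps = 2.0 ** -m                            # spacing of values just above 1.0
    print(f"{name}: max ~ {max_finite:.3g}, relative precision ~ {eps:.1e}")

# FP16: max ~ 6.55e+04, precision ~ 9.8e-04
# BF16: max ~ 3.39e+38, precision ~ 7.8e-03
# FP32: max ~ 3.40e+38, precision ~ 1.2e-07
```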
Nowadays, most LLMs are trained predominantly in BF16, with a small number of parameters kept in FP32 for higher precision.
From the biggest model, which uses BF16, smaller quantized models are then derived that use 8 bits or fewer per parameter, trading accuracy for speed and memory.
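For a rough idea of what "8 bits or fewer per parameter" looks like, here is a minimal sketch of symmetric per-tensor int8 quantization (real schemes use per-channel or per-block scales, zero points, etc.; this is just the basic idea):

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor quantization: store int8 values plus one float scale."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).standard_normal((4, 8)).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

print("max abs error:", np.abs(w - w_hat).max())  # small but nonzero: the accuracy trade-off
```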
Yes, it's a "Brain float": basically an ordinary 32-bit float with the low 16 mantissa bits cut off. Exact same range as fp32, lower precision, and not the same as the other fp16, which has fewer exponent bits and more mantissa bits.
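That "cut off the low 16 bits" description translates almost directly into code. A sketch using numpy and plain truncation (hardware conversions usually round to nearest even, but the idea is the same):

```python
import numpy as np

def fp32_to_bf16_bits(x):
    """Keep only the top 16 bits of each FP32 value's bit pattern."""
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    return (bits >> np.uint32(16)).astype(np.uint16)

def bf16_bits_to_fp32(b):
    """Pad the 16 BF16 bits back out to an FP32 bit pattern."""
    return (b.astype(np.uint32) << np.uint32(16)).view(np.float32)

x = np.array([3.14159265, 1e38, 1e-5], dtype=np.float32)
print(bf16_bits_to_fp32(fp32_to_bf16_bits(x)))
# Huge values like 1e38 survive (they would overflow to inf in FP16),
# but only about 3 significant decimal digits remain.
```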
the model generates a probability distribution over the next token; you then set the probability of disallowed tokens to 0 before sampling (whether you sample deterministically or probabilistically)
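A minimal sketch of that masking step, with made-up logits and a hypothetical list of banned token ids:

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size = 10
logits = rng.standard_normal(vocab_size)   # what the model emits for the next token
banned = [2, 5, 7]                         # hypothetical "not allowed" token ids

logits[banned] = -np.inf                   # probability becomes exactly 0 after softmax
probs = np.exp(logits - logits.max())
probs /= probs.sum()

next_token = int(np.argmax(probs))         # deterministic (greedy) ...
# next_token = int(rng.choice(vocab_size, p=probs))  # ... or probabilistic sampling

assert next_token not in banned and all(probs[b] == 0 for b in banned)
```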
but some tokens are disallowed only in certain contexts, not in others.
You might be talking about how to defuse a bomb instead of building one. Or you might be talking about a bomb in a video game. Or you could be talking about someone being "da bomb!". Or maybe the history of certain types of bombs. Or a ton of other possible contexts. You can't just block the "bomb" token. Or the word "explosive" when followed by "device", or "rapid unscheduled disassembly contraption". You just can't predict the infinite number of wrong possibilities.
And there is no way to figure out which contexts the word is safe in.
If you're syntax-checking every token, you're doing it AFTER the LLM has spat out its output. You didn't actually do anything to force the LLM to produce correct code. You just reject invalid output after the fact.
If you could force it to emit syntactically correct code, you wouldn't need to perform a separate manual syntax check afterwards.
how do you disallow it from generating specific things? My point is that you can't. And again, how do you stop it from generating certain tokens, but only in certain contexts?
You would need to somehow analyze the prompt, figure out that the user is asking for the addition of two numbers, and selectively enable that filter. If the filter were left enabled permanently, you'd effectively just have a calculator.
But analyzing the prompt is not a task that can be reliably automated either, for exactly the same reasons the original model couldn't do addition consistently.
So your solution has the exact same problem as the original. If you ask for an addition, you can't be sure you will get numbers back (you can't be sure the filter will always be enabled when needed). You just shifted the problem into a separate component "left as an exercise to the reader" and declared it trivial.
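To make that concrete, here is a toy sketch of the pipeline being described: the digits-only filter itself is trivial, while the hypothetical "prompt analyzer" gate (a regex here) is exactly the part that can't be made reliable:

```python
import re

DIGITS = set("0123456789")

def allowed_tokens(vocab, calculator_mode):
    """If the filter is on, only digit tokens may be sampled; otherwise anything goes."""
    if calculator_mode:
        return [t for t in vocab if t and set(t) <= DIGITS]
    return list(vocab)

def looks_like_addition(prompt):
    # The hypothetical "prompt analyzer" -- exactly the step that can't be made reliable.
    return re.search(r"\badd\b|\bplus\b|\+", prompt.lower()) is not None

vocab = ["4", "7", "42", "four", "approximately", "sorry"]

print(allowed_tokens(vocab, looks_like_addition("add 19 and 23")))
# filter fires: ['4', '7', '42']
print(allowed_tokens(vocab, looks_like_addition("what's the sum of 19 and 23?")))
# same intent, but the naive gate misses it, so nothing constrains the output
```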