hu3 3 days ago

Repo with demo video and benchmark:

https://github.com/microsoft/BitNet

"...It matches the full-precision (i.e., FP16 or BF16) Transformer LLM with the same model size and training tokens in terms of both perplexity and end-task performance, while being significantly more cost-effective in terms of latency, memory, throughput, and energy consumption..."

https://arxiv.org/abs/2402.17764

  • Animats 2 days ago

    That essay on the water cycle makes no sense. Some sentences are repeated three times. The conclusion about the water cycle and energy appears wrong. And what paper is "Jenkins (2010)"?

    Am I missing something, or is this regressing to GPT-1 level?

    • yorwba 2 days ago

      They should probably redo the demo with their latest model. I tried the same prompt on https://bitnet-demo.azurewebsites.net/ and it looked significantly more coherent. At least it didn't get stuck in a loop.

    • int_19h 2 days ago

      2B parameters should be in the ballpark of GPT-2, no?

  • godelski 2 days ago

      > "...It matches the full-precision (i.e., FP16 or BF16)
    
    Wait... WHAT?!

    When did //HALF PRECISION// become //FULL PRECISION//?

    FWIW, I cannot find where you're quoting from. I cannot find "matches" in TFA or in the GitHub link. And in the paper I see

      3.2 Inference Accuracy
      
      The bitnet.cpp framework enables lossless inference for ternary BitNet b1.58 LLMs. To evaluate inference accuracy, we randomly selected 1,000 prompts from WildChat [ ZRH+24 ] and compared the outputs generated by bitnet.cpp and llama.cpp to those produced by an FP32 kernel. The evaluation was conducted on a token-by-token basis, with a maximum of 100 tokens per model output, considering an inference sample lossless only if it exactly matched the full-precision output.

ilrwbwrkhv 3 days ago

This will happen more and more. This is why NVidia is rushing to make CUDA a software-level lock-in; otherwise their stock will go the way of Zoom.

  • soup10 3 days ago

    i agree, no matter how much wishful thinking jensen sells to investors about paradigm shifts, the days of everyone rushing out to get 6-figure tensor core clusters for their data center probably won't last forever.

    • bigyabai 2 days ago

      If Nvidia was at all in a hurry to lock-out third-parties, then I don't think they would support OpenCL and Vulkan compute, or allow customers to write PTX compilers that interface with Nvidia hardware.

      In reality, the demand for highly parallelized compute simply blindsided OEMs. AMD, Intel and Apple were all laser-focused on raster efficiency; none of them has a GPU architecture optimized for GPGPU workloads. AMD and Intel don't have competitive fab access and Apple can't sell datacenter hardware to save their life; Nvidia's monopoly on attractive TSMC hardware isn't going anywhere.

      • mlinhares 2 days ago

        The profit margins on Macs must be insane, because it makes no sense at all that Apple just doesn't give a fuck about data center workloads when they have some of the best ARM CPUs and whole packages on the market.

        • bigyabai 2 days ago

          If Xserve is any basis of comparison, Apple struggles to sell datacenter hardware in the best of markets. The competition is too hot nowadays, and Apple likely knows the investment wouldn't be worth it. ARM CPUs are available from Ampere and Nvidia now, Apple Silicon would have to differentiate itself more than it does on mobile. After a certain point, it probably does come down to the size of the margins on consumer hardware.

          • ahmeni 2 days ago

            I will never not be saddened by the fact that Apple killed their Xserve line shortly before the App Store got big. We all ended up having to do dumb things like rack-mount Mac Minis for app CI builds for years, and it was such a pain.

        • pzo 2 days ago

          There was news that they recently bought a lot of Nvidia GPUs, since progress on their own chips was too slow to use them even in their own data centers for their own purposes.

      • imtringued 2 days ago

        I don't know how it happened, but Intel completely dropped out of the AI accelerator market.

        There are really only three competitors in this market, plus one also-ran.

        Obviously that's Nvidia, Google and Tenstorrent.

        The also-ran is AMD, whose products are only bought as a hedge against Nvidia. Even though the hardware is better on paper, the software is so bad that you get worse performance than on Nvidia. Hence "also-ran".

        Tenstorrent isn't there yet, but it's just a matter of time. They are improving with every generation of hardware and their software stack is 100% open source.

    • int_19h 2 days ago

      If you can squeeze an existing model into smaller hardware, that also means you can squeeze a larger (and hence smarter) model into that 6-figure cluster. And they aren't anywhere near smart enough for many things people attempt to use them for, so I don't see the hardware demand for inference subsiding substantially anytime soon.

      At least not for these reasons - if it does, it'll be because of a consistent pattern of overhyping and underdelivering on real-world applications of generative AI, like what's going on with Apple right now.

    • layoric 2 days ago

      He is fully aware; that is why he is selling his stock on the daily.

  • Sonnigeszeug 2 days ago

    Comparing Zoom and Nvidia is just not valid at all.

    Was the revaluation of Nvidia wild? Yes.

    Will others start taking contracts away with their fast custom inference solutions? Yes, of course, but I'm sure everyone is aware of it.

    What is very unclear is how strong Nvidia is with their robot platform.

  • jcadam 2 days ago

    So Microsoft is about to do to Nvidia what Nvidia did to SGI.

zamadatix 2 days ago

"Parameter count" is the "GHz" of AI models: the number you're most likely to see but least likely to need. All of the models compared (in the table on the huggingface link) are 1-2 billion parameters but the models range in actual size by more than a factor of 10.

  • int_19h 2 days ago

    Because of different quantization. However, parameter count is generally the more interesting number so long as quantization isn't too extreme (as it is here). E.g. FP32 is 4x the size of an 8-bit quant, but the quality difference is close to non-existent in most cases.

    • orbital-decay 2 days ago

      >so long as quantization isn't too extreme (as it is here)

      This is true for post-training quantization, not for quantization-aware training, and not for something like BitNet. Here they claim performance per parameter comparable to normal models; that's the entire point.

  • charcircuit 2 days ago

    TPS is the GHz of AI models. Both are related to the propagation time of data.

    • idonotknowwhy 2 days ago

      Then I guess vocab is the IPC. 10k Mistral tokens are about 8k Llama 3 tokens.

Jedd 2 days ago

I think almost all the free LLMs (not AI) that you find on hf can 'run on CPUs'.

The claim here seems to be that it runs usefully fast on CPU.

We're not sure how accurate this claim is, because we don't know how fast this model runs on a GPU, because:

  > Absent from the list of supported chips are GPUs [...]

And TFA doesn't really quantify anything, just offers:

  > Perhaps more impressively, BitNet b1.58 2B4T is speedier than other models of its size — in some cases, twice the speed — while using a fraction of the memory.

The model they link to is just over 1GB in size, and there are plenty of existing 1-2GB models that are quite serviceable on even a mildly modern CPU-only rig.

  • sheepscreek 2 days ago

    If you click the demo link, you can type a live prompt and see it run on CPU or GPU (A100). From my test, the CPU was laughably slower. To my eyes, it seems comparable to the models I can run with llama.cpp today. Perhaps I am completely missing the point of this.

ein0p 2 days ago

This is over a year old. The sky did not fall, and everyone did not switch to this in spite of the "advantages". If you look into why, you'll see that it does, in fact, affect the metrics, some more than others, and there is no silver bullet.

  • yorwba 2 days ago

    The 2B4T model was literally released yesterday, and it's both smaller and better than what they had a year ago. Presumably the next step is that they get more funding for a larger model trained on even more data to see whether performance keeps improving. Of course the extreme quantization is always going to impact scores a bit, but if it lets you run models that otherwise wouldn't even fit into RAM, it's still worth it.

  • justanotheratom 2 days ago

    are you predicting, or is there already a documented finding somewhere?

    • ein0p 2 days ago

      Take a look at their own paper or at many attempts to train something large with this. There's no replacement for displacement. If this actually worked without quality degradation literally everyone would be using this.

  • imtringued 2 days ago

    AQLM, EfficientQAT and ParetoQ get reasonable benchmark scores at 2-bit quantization. At least 90% of the original unquantized scores.

stogot 2 days ago

The pricing war will continue all the way to rock bottom

falcor84 3 days ago

Why do they call it "1-bit" if it uses ternary {-1, 0, 1}? Am I missing something?

  • Maxious 3 days ago

    • falcor84 3 days ago

      Thanks, but I've skimmed through both and couldn't find an answer on why they call it "1-bit".

      • AzN1337c0d3r 2 days ago

        The original BitNet paper (https://arxiv.org/pdf/2310.11453)

          BitNet: Scaling 1-bit Transformers for Large Language Models
        
        was actually binary (weights of -1 or 1),

        but then in the follow-up paper they started using 1.58-bit weights (https://arxiv.org/pdf/2402.17764)

          The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits
        
        This seems to be the first source of the conflation of "1-bit LLM" and ternary weights that I could find.

          In this work, we introduce a 1-bit LLM variant, namely BitNet b1.58, in which every single parameter (or weight) of the LLM is ternary {-1, 0, 1}.

        • LeonB 2 days ago

          It’s “1-bit, for particularly large values of ‘bit’”

    • taneq 2 days ago

      That’s pretty cool. :) One thing I don’t get is why do multiple operations when a 243-entry lookup table would be simpler and hopefully faster?

      • compilade 2 days ago

        Because lookup tables are not necessarily faster than 8-bit SIMD operations, at least when implemented naïvely.

        Lookup tables can be fast, but they're not simpler; see T-MAC https://arxiv.org/abs/2407.00088 (note that all comparisons with `llama.cpp` were made before I introduced the types from https://github.com/ggml-org/llama.cpp/pull/8151, where the 1.6-bit type uses the techniques described in the aforementioned blog post).

        I wanted to try without lookup tables to at least have a baseline, and also because the fixed point packing idea lent itself naturally to using multiplications by powers of 3 when unpacking.
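
        For anyone curious what the base-3 idea looks like, here's a toy Python version (much simpler than the actual packed layouts of those llama.cpp types, but the same principle: 3^5 = 243 <= 256, so 5 ternary weights fit in one byte):

          def pack5(trits):
              # 5 weights in {-1, 0, 1} -> one byte, since 3**5 == 243 <= 256
              b = 0
              for t in trits:
                  b = b * 3 + (t + 1)   # shift digits to {0, 1, 2}
              return b

          def unpack5(b):
              out = []
              for _ in range(5):
                  out.append(b % 3 - 1)  # peel off base-3 digits
                  b //= 3
              return out[::-1]

          assert unpack5(pack5([-1, 0, 1, 1, -1])) == [-1, 0, 1, 1, -1]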

        • taneq a day ago

          Thanks for taking the time to reply! I haven’t done any serious low level optimisation on modern CPUs so most of my intuitions are probably way out of date.

  • sambeau 3 days ago

    Maybe they are rounding down from 1.5-bit :)

    • BuyMyBitcoins 3 days ago

      Classic Microsoft naming shenanigans.

      • 1970-01-01 2 days ago

        It's not too late to claim 1bitdotnet.net before they do.

  • Nevermark 2 days ago

    Once you know how to compress 32-bit parameters to ternary, compressing ternary to binary is the easy part. :)

    They would keep re-compressing the model in its entirety, recursively until the whole thing was a single bit, but the unpacking and repacking during inference is a bitch.

  • prvc 2 days ago

    There are about 1.58 (i.e. log_2(3)) bits per digit, so they just applied to it the constant function that maps the reals to 1.

    • falcor84 2 days ago

      I like that as an explanation, but then every system is 1-bit, right? It definitely would simplify things.

    • ufocia 2 days ago

      1.58 is still more than 1 in general unless the parameters are correlated. At 1 bit it seems unlikely that you could pack/unpack independent parameters reliably without additional data.
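
      To put numbers on it (my own arithmetic, not from the paper): an independent ternary weight carries log2(3) ≈ 1.58 bits of information, and practical packings only get close to that, e.g. 5 weights per byte:

        import math

        print(math.log2(3))   # ~1.585 bits per independent ternary weight
        print(3 ** 5)         # 243 <= 256, so 5 ternary weights fit in one byte
        print(8 / 5)          # = 1.6 bits per weight with that packing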

nodesocket 2 days ago

There are projects working on distributed LLMs, such as exo[1]. If they can crack the distributed problem fully and get good performance, it's a game changer. Instead of spending insane amounts on Nvidia GPUs, you can just deploy commodity clusters of AMD EPYC servers with tons of memory, NVMe disks, and 40G or 100G networking, which is vastly less expensive. Goodbye Nvidia AI moat.

[1] https://github.com/exo-explore/exo

  • lioeters 2 days ago

    Do you think this is inevitable? It sounds like, if distributed LLMs are technically feasible, it will eventually happen. Maybe it's unknown whether it can be solved at all, but I imagine there are enough people working on the problem that they will find a breakthrough one way or another. LLMs themselves could participate in solving it.

    Edit: Oh I just saw the Git repo:

    > exo: Run your own AI cluster at home with everyday devices.

    So the "distributed problem" is in the process of being solved. Impressive.

esafak 2 days ago

Is there a library to distill bigger models into BitNet?

  • timschmidt 2 days ago

    I could be wrong, but my understanding is that bitnet models have to be trained that way.

    • babelfish 2 days ago

      They don't have to be trained that way! The training data for 1-bit LLMs is the same as for any other LLM. A common way to generate this data is called 'model distillation', where you take completions from a teacher model and use them to train the child model (what you're describing)!

      • timschmidt 2 days ago

        Maybe I wasn't clear, I think you've misunderstood me. I understand that all sorts of LLMs can be trained using a common corpus of data. But my understanding is that the choice of creating a bitnet LLM must be made at training time, as modifications to the training algorithms are required. In other words, an existing FP16 model cannot be quantized to bitnet.
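
        Roughly, the BitNet papers do quantization-aware training: the forward pass uses weights snapped to {-1, 0, 1} via absmean scaling, while gradients flow to latent full-precision weights through a straight-through estimator. A simplified PyTorch sketch of that idea (my own simplification, not the reference code; it omits the activation quantization and normalization the papers also use):

          import torch
          import torch.nn.functional as F

          def ternary_quantize(w, eps=1e-5):
              scale = w.abs().mean().clamp(min=eps)          # absmean scale
              return (w / scale).round().clamp(-1, 1), scale

          class BitLinear(torch.nn.Linear):
              def forward(self, x):
                  w_q, scale = ternary_quantize(self.weight)
                  # Forward uses the ternary weights; backward passes gradients
                  # straight through to the latent full-precision weights.
                  w_ste = self.weight + (w_q * scale - self.weight).detach()
                  return F.linear(x, w_ste, self.bias)

        The latent full-precision weights only exist during training, which is why you can't just take an existing FP16 checkpoint and post-hoc quantize it to ternary without losing a lot of quality.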

        • babelfish 17 hours ago

          Ah yes, definitely misunderstood you, my bad

justanotheratom 3 days ago

Super cool. Imagine specialized hardware for running these.

  • llama_drama 2 days ago

    I wonder if instructions like VPTERNLOGQ would help speed these up

  • LargoLasskhyfv 3 days ago

    It already exists. Dynamically reconfigurable. Some smartass designed it alone on ridiculously EOL'd FPGAs. Meanwhile, ASICs were produced in small batches, without the FPGA baggage. Unfortunately said smartass is under heavy NDA. Or luckily, because said NDA paid him very well.

    • djmips 2 days ago

      Nicely done!

      • LargoLasskhyfv a day ago

        It was actually sort of a sideways pivot, and hard for me to do, because of the mathematics involved.

        Initially it was more general 'architecture astronautics' in the context of dynamic reconfigurability / systolic arrays / transport-triggered architectures / VLIW, which got me some nice results.

        Having read and thought a lot about balanced ternary hardware, 'playing' with that, and reading about how it could be favourably applied to ML led to that 'pivot'.

        A few years before 'this', I might add.

        Now I can 'play' much more relaxed and carefree, to see what else I can get out of this :-)

instagraham 2 days ago

> it’s openly available under an MIT license and can run on CPUs, including Apple’s M2.

Weird comparison? The M2 already runs 7 or 13 GB Llama and Mistral models with relative ease.

The M-series and MacBooks are so ubiquitous that perhaps we're forgetting how weak the average CPU (think i3 or i5) can be.

  • nine_k 2 days ago

    The M-series have a built-in GPU and unified RAM accessible to both. Running a model on an M-series chip without using the GPU is, imho, pointless. (That said, it's still a long shot from an H100 with a ton of VRAM, or from a Google TPU.)

    If a model can be "run on a CPU", it should acceptably run on a general-purpose 8-core CPU, like an i7, or i9, or a Ryzen 7, or even an ARM design like a Snapdragon.

1970-01-01 2 days ago

..and eventually the Skynet Funding Bill was passed.