this post was submitted on 14 Feb 2024
659 points (95.3% liked)
Technology
What if you don't have a decent graphics card? Wait 5 minutes for your URL completion to finish?
Using an LLM is quite fast, especially if it's optimised to run on normal hardware.
Decent models are huge; an average one requires 8GB to be kept in memory (better models require something like 40 to 70 GB), and most currently available engines are extremely slow on a CPU and require dedicated hardware (and even a relatively powerful GPU requires a few seconds of "thinking" time). It is unlikely that these requirements can easily be squeezed into current computers, and more likely that dedicated hardware will be required.
Sorry, but has anyone in this thread actually tried running local LLMs on CPU? You can easily run a 7B model at varying levels of quantization (e.g. 5-bit quantization) and get a generalized, promptable LLM. Yeah, of course it's going to take ~4GB of RAM (which is memory-mapped and paged in as needed), but you can easily fine-tune smaller, more specific models (like the translation one mentioned above) and get surprising intelligence at a fraction of the resources.
Take, for example, phi-2, which performs as well as 13B-param models but with 2.7B params. Yeah, that's still going to take 1.5GB of RAM, which Firefox wouldn't reasonably ship, but many lighter-weight specialized tasks could easily use something like a fine-tuned 0.3B model with quantization.
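For anyone who wants to sanity-check the RAM figures quoted above, here is a rough back-of-the-envelope sketch. It counts the weights only (parameter count times bits per weight) and ignores the KV cache, activations, and runtime overhead, so real usage will be somewhat higher. The function name and the 4-bit and 16-bit rows are assumptions chosen for illustration; the thread only states 5-bit for the 7B case.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate memory for the model weights alone, in decimal GB.

    Assumption: every parameter is stored at the same bit width and
    nothing else (KV cache, activations, runtime buffers) is counted.
    """
    total_bits = params_billions * 1e9 * bits_per_weight
    return total_bits / 8 / 1e9  # bits -> bytes -> GB


# (label, parameters in billions, bits per weight)
# Only the "7B at 5-bit" row comes from the thread; the other bit
# widths are illustrative assumptions.
examples = [
    ("7B at 16-bit (unquantized)", 7.0, 16),
    ("7B at 5-bit", 7.0, 5),
    ("2.7B at 5-bit", 2.7, 5),
    ("2.7B at 4-bit", 2.7, 4),
    ("0.3B at 4-bit", 0.3, 4),
]

for label, params, bits in examples:
    print(f"{label}: ~{weight_memory_gb(params, bits):.2f} GB")
```

By this estimate a 7B model at 5 bits lands a little above 4 GB, and a 2.7B model falls between roughly 1.35 GB (4-bit) and 1.7 GB (5-bit), which is in line with the ~4GB and 1.5GB figures mentioned in the comments.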
Yes, I did. And yes, it is possible. It's terribly slow in comparison, making it less useful. It very quickly devolves into random mumbling or gets stuck in weird loops. It also hogs resources that are actually needed by other tasks you may be doing.
I mainly test dev AI solutions, and moving from 1B to 7B models made them vastly more pertinent. Moving from a CPU implementation (Ryzen 7 3700X) to GPU (RTX 3080 Ti) made them fast enough to be used for quick completion and immediate suggestions without breaking workflow, in addition to freeing resources for the IDE, build tools, and the actual software being run. Running it on CPU had a multi-second delay, which made it completely useless for this use case.