this post was submitted on 15 Feb 2025
10 points (100.0% liked)

LocalLLaMA


Community to discuss LLaMA, the large language model created by Meta AI.

This is intended to be a replacement for r/LocalLLaMA on Reddit.

I didn't expect an 8B-F16 model that takes 16 GB on disk to run on my laptop, which has only 16 GB of RAM and an integrated GPU. It was painfully slow, around 0.3 t/s, but it ran. Then I learned that you can effectively run a model straight from storage without loading it into memory, and I confirmed that this was exactly what was happening: memory usage stayed constant at around 20% whether or not the model was running. The problem is that gpt4all-chat runs every model larger than 1.5B this way, and the difference is huge, since the 1.5B model runs at 20 t/s. Even a distilled 6.7B_Q8 model of roughly 7 GB on disk, which has plenty of room (12 GB of RAM free), didn't move the memory usage and was also very slow (3 t/s). I'm pretty new to this field, so I'm probably missing something basic, but I just followed the instructions for downloading and compiling it.
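
What's described here is consistent with memory-mapped loading: llama.cpp (the backend gpt4all uses) mmaps the GGUF file by default, so pages are read from disk on demand and the process's resident memory barely grows. Below is a minimal sketch of how that behaviour can be controlled through the llama-cpp-python bindings, offered only as an illustration; the model path and prompt are placeholders, and on the llama.cpp CLI the corresponding switches are --no-mmap and --mlock.

```python
# Sketch only: assumes the llama-cpp-python bindings and a local GGUF file.
# "model.gguf" and the prompt are placeholders.
from llama_cpp import Llama

llm = Llama(
    model_path="model.gguf",
    n_ctx=2048,
    use_mmap=True,    # default: map the file from disk, pages loaded on demand
    use_mlock=False,  # set True to pin the whole model in RAM (needs enough free memory)
)

out = llm("Explain memory mapping in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```

With use_mmap=True and not enough free RAM, the OS keeps evicting and re-reading pages from disk, which would explain the constant memory reading together with the very low token rate.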

corvus@lemmy.ml 1 points 4 days ago

I don't like intermediaries ;) Fortunately I compiled llama.cpp with the Vulkan backend and everything went smoothly, so now I have the option to offload to the GPU. Next I will compare CPU vs CPU+GPU performance. I downloaded DeepSeek 14B and it's really good, the best I could run so far on my limited hardware.
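
For the CPU vs CPU+GPU comparison, a rough timing sketch is shown below, using the llama-cpp-python bindings as a stand-in for the llama.cpp CLI (where the equivalent knob is -ngl / --n-gpu-layers); the GGUF filename and prompt are placeholders.

```python
# Rough benchmark sketch: assumes llama-cpp-python built with GPU support.
# The GGUF filename and prompt are placeholders.
import time
from llama_cpp import Llama

def tokens_per_second(n_gpu_layers: int) -> float:
    llm = Llama(
        model_path="deepseek-14b.Q4_K_M.gguf",
        n_ctx=2048,
        n_gpu_layers=n_gpu_layers,  # 0 = CPU only, -1 = offload every layer
        verbose=False,
    )
    start = time.time()
    out = llm("Summarize what GPU offloading does.", max_tokens=128)
    generated = out["usage"]["completion_tokens"]
    return generated / (time.time() - start)

print(f"CPU only:  {tokens_per_second(0):.1f} t/s")
print(f"CPU + GPU: {tokens_per_second(-1):.1f} t/s")
```

If the whole model doesn't fit in VRAM, a partial offload (e.g. n_gpu_layers=20) is the usual middle ground: only as many layers as fit are pushed to the GPU and the rest stay on the CPU.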