Try LM Studio.
Did you try your CPU?
Yes, gpt4all runs it in CPU mode. The GPU option does not appear in the drop-down menu, which means either the GPU is not supported or there is an error. I'm trying to run the models with the SYCL backend implemented in llama.cpp, which performs specific optimizations for CPU+GPU with the Intel DPC++/C++ Compiler and the oneAPI Toolkit.
Also try Deepseek 14b. It will be much faster.
ok, I'll test it out.
Why don't you just use ollama?
I don't like intermediaries ;) Fortunately I compiled llama.cpp with the Vulkan backend, everything went smoothly, and now I have the option to offload to the GPU. Next I will test performance, CPU vs. CPU+GPU. I downloaded Deepseek 14b and it is really good, the best I could run so far on my limited hardware.
I'm not sure what kind of laptop you own. Mine does about 2-3 tokens/sec when I'm running an 8B parameter model, so your last try seems about right. Concerning the memory: llama.cpp can load models "memory mapped". That means the system decides which parts to load into memory as they are needed. It might all be in there, but it doesn't count as active memory usage. I believe it counts towards the "cached" value in the statistics. If you want to make sure, you have to force it not to memory-map the model. In llama.cpp that's the parameter --no-mmap
I have no idea how to do it in gpt4all-chat. But I'd say it's already loaded in your case, it just doesn't show up as used memory, since it's the mmap thing.
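To illustrate the mmap point: here is a minimal Python sketch of the difference between memory-mapping a file and reading it fully. This is not how llama.cpp does it internally, just the same operating system mechanism shown with Python's standard mmap module. The file is a small dummy created for the demo, and all function names are made up for illustration.

```python
import mmap
import os
import tempfile


def make_dummy_model(size_bytes):
    """Create a throwaway file standing in for a model file (hypothetical helper)."""
    fd, path = tempfile.mkstemp(suffix=".bin")
    with os.fdopen(fd, "wb") as f:
        f.write(bytes(i % 256 for i in range(size_bytes)))
    return path


def read_slice_mmap(path, offset, length):
    """Map the file and copy out only the requested range.

    The OS pages in just the parts that are touched, and those pages
    live in the page cache rather than in the process's own heap.
    """
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            return bytes(mapped[offset:offset + length])


def read_slice_full(path, offset, length):
    """Read the whole file into process memory first (the no-mmap style)."""
    with open(path, "rb") as f:
        data = f.read()
    return data[offset:offset + length]


if __name__ == "__main__":
    path = make_dummy_model(1 << 20)  # 1 MiB dummy file
    try:
        same = read_slice_mmap(path, 4096, 8) == read_slice_full(path, 4096, 8)
        print("both approaches return the same bytes:", same)
    finally:
        os.remove(path)
```

Both functions return the same data. The difference is only in where the memory gets accounted, which is why a memory-mapped model may not show up as "used" memory.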
Maybe try a few other programs as well, such as ollama, koboldcpp or llama.cpp, and see how they do. And I wouldn't run full precision models on an iGPU. Keep it to quantized models. Q8 or Q5... or Q4...
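To get a feeling for why quantization matters, here is a rough size estimate in Python. The bits-per-weight values are approximations I'm assuming for the simple block formats (the quantized weights plus a small per-block scale). Real model files differ a bit because some tensors are stored at other precisions, so treat the output as a ballpark only.

```python
# Approximate bits per weight, including per-block scale overhead.
# These are assumed ballpark values, not exact file sizes.
BITS_PER_WEIGHT = {
    "F16": 16.0,
    "Q8_0": 8.5,
    "Q5_0": 5.5,
    "Q4_0": 4.5,
}


def model_size_gb(n_params, quant):
    """Estimate the weight size in GB (10^9 bytes) for a given quantization."""
    return n_params * BITS_PER_WEIGHT[quant] / 8 / 1e9


if __name__ == "__main__":
    for quant in BITS_PER_WEIGHT:
        print(f"8B model at {quant}: about {model_size_gb(8e9, quant):.1f} GB")
```

By this estimate an 8B model shrinks from about 16 GB at F16 to about 4.5 GB at Q4_0, which is the difference between fitting into a laptop's RAM comfortably or not at all.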
I tried llama.cpp but I was getting errors about a library not being found, so I tried gpt4all and it worked. I'll try to recompile and test it again. I have a ThinkBook with an Intel i5-1335U and integrated Xe graphics. I installed the Intel oneAPI toolkit so llama.cpp could take advantage of the SYCL backend for Intel GPUs, but I hit an execution error that I was unable to solve after many days. I also installed the Vulkan SDK needed to compile gpt4all, hoping to be able to use the GPU, but gpt4all-chat doesn't show the option to run on it. From what I read, that means it's not supported. From some other posts I read, I should not expect a big performance boost from that GPU anyway.
That laptop should be a bit faster than mine. It's a few generations newer, has DDR5 RAM and maybe even proper dual channel. As far as I know, LLM inference is almost always memory bound. That means the bottleneck is your RAM speed (and how wide the bus is between CPU and memory). So whether you use SYCL, Vulkan or even the CPU cores shouldn't have a dramatic effect. The main thing limiting speed is that the computer has to transfer gigabytes worth of numbers from memory to the processor on each step. So the iGPU or processor spends most of its time waiting for memory transfers. I haven't kept up with development, so I might be wrong here, but I don't think more than single digit tokens/sec is possible on such a computer. It'd have to be a workstation or server with multiple separate memory banks, or something like a MacBook with Apple silicon and its unified memory. Or a GPU with fast VRAM on it. Though, you might be able to do a bit more than 3 t/s.
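The memory-bound argument can be turned into a back-of-the-envelope formula: if all the weights have to be read from memory once per generated token, then tokens/sec can't exceed memory bandwidth divided by model size. Here is a small Python sketch of that. The bandwidth and size figures in the example are made-up illustrative numbers, not measurements of any particular laptop.

```python
def max_tokens_per_sec(bandwidth_gb_s, model_size_gb):
    """Upper bound on generation speed for a memory-bound model.

    Assumes every weight is streamed from memory once per token and
    ignores compute time, caches and the KV cache, so real speeds
    will come out lower than this.
    """
    if bandwidth_gb_s <= 0 or model_size_gb <= 0:
        raise ValueError("bandwidth and model size must be positive")
    return bandwidth_gb_s / model_size_gb


if __name__ == "__main__":
    # Hypothetical numbers: 40 GB/s effective RAM bandwidth, 8 GB of weights.
    print(max_tokens_per_sec(40.0, 8.0))  # 5.0
```

With those assumed numbers the ceiling is 5 tokens/sec no matter which backend does the math, which is why switching between CPU, Vulkan and SYCL tends to change less than you'd hope.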
Maybe keep trying the different computation backends. Have a look at your laptop's power settings as well. Mine is a bit slow when it's on the default "balanced" power profile. It'll speed up once I set it to "performance" or gaming mode. And if you can't get llama.cpp compiled, maybe just try Ollama or Koboldcpp instead. They use the same framework and might be easier to install. And SYCL might prove to be a bit of a letdown. It's nice, but it seems few people are using it, so it might not be very polished or optimized.
I'll vouch for Koboldcpp. I use the CUDA version currently, and it exposes most of the settings you'd need to find a configuration that works for you. Just remember to save what works best as a .kcpps file, or else you'll be entering it manually every time you start it up (though saving doesn't work on Linux afaik, and it's a pain that it doesn't).