The Rule (lemmy.ml)

submitted 11 months ago by roon@lemmy.ml to c/196@lemmy.blahaj.zone


[–] josefo@leminal.space 3 points 11 months ago (2 children)

Are there other options that use less RAM?

[–] PumpkinEscobar@lemmy.world 9 points 11 months ago (1 children)

There's quantization, which basically compresses the model by using a smaller data type for each weight. It reduces memory requirements by half or even more.
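To make that concrete, here's a minimal sketch of symmetric 8-bit quantization in plain Python. It's only an illustration of the idea, not how any particular library does it: the function names and the toy weights are made up, and real schemes work per block or per channel and often go down to 4 bits.

```python
from array import array

def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale (hypothetical helper)."""
    scale = max(abs(w) for w in weights) / 127 or 1.0  # fall back to 1.0 if all weights are 0
    return [round(w / scale) for w in weights], scale

def dequantize(quantized, scale):
    """Recover approximate floats from the stored integers."""
    return [q * scale for q in quantized]

weights = [0.12, -0.5, 0.33, 0.0, 0.49]   # toy values, purely for illustration
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)

packed = array("b", quantized)            # signed 8-bit storage: 1 byte per weight
int8_bytes = len(packed) * packed.itemsize
fp16_bytes = len(weights) * 2             # fp16 would need 2 bytes per weight

print(int8_bytes, "bytes as int8 vs", fp16_bytes, "bytes as fp16")
print("largest rounding error:", max(abs(a - b) for a, b in zip(weights, restored)))
```

The trade-off is visible in the last line: each weight comes back slightly off, by at most half a quantization step.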

There's also AirLLM, which loads part of the model into RAM, runs those calculations, unloads that part, loads the next part, and so on. It's a nice option, but with all that loading and unloading the performance is never going to be great, especially on a huge model like Llama 405B.
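The load, compute, unload loop looks roughly like the sketch below. To be clear, this is not AirLLM's actual API: the file layout, the function names and the toy "layers" (just a scale and a bias) are all assumptions made up to show the pattern of keeping only one part in memory at a time.

```python
import json
import os
import tempfile

def save_layers(layers, directory):
    """Write each toy layer to its own file, standing in for model shards on disk."""
    paths = []
    for i, layer in enumerate(layers):
        path = os.path.join(directory, f"layer_{i}.json")
        with open(path, "w") as f:
            json.dump(layer, f)
        paths.append(path)
    return paths

def run_streaming(paths, activations):
    """Run the layers in order while holding only one of them in memory."""
    for path in paths:
        with open(path) as f:
            layer = json.load(f)                      # load this part
        activations = [layer["scale"] * v + layer["bias"] for v in activations]  # compute
        del layer                                     # unload before the next part
    return activations

with tempfile.TemporaryDirectory() as directory:
    toy_layers = [{"scale": 2.0, "bias": 1.0}, {"scale": 0.5, "bias": 0.0}]
    layer_paths = save_layers(toy_layers, directory)
    result = run_streaming(layer_paths, [1.0, 2.0])

print(result)
```

Every forward pass re-reads every layer from disk, which is where the slowness comes from.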

Then there are some neat projects to distribute models across multiple computers like exo and petals. They're more targeted at a p2p-style random collection of computers. I've run petals in a small cluster and it works reasonably well.

[–] AdrianTheFrog@lemmy.world 1 points 11 months ago

Yes, but 200 GB is probably already with 4-bit quantization; the weights in fp16 would be more like 800 GB. IDK if it's even possible to quantize more, and if it is, you're probably better off going with a smaller model anyway.
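The back-of-the-envelope math, assuming 405 billion parameters and counting only the weights (no KV cache, activations or runtime overhead), with `weight_gb` as a made-up helper name:

```python
PARAMS = 405_000_000_000  # assumed parameter count for Llama 405B

def weight_gb(params, bits_per_weight):
    """Size of the weights alone, in decimal gigabytes."""
    return params * bits_per_weight // 8 / 1e9

for name, bits in [("fp16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{name}: {weight_gb(PARAMS, bits)} GB")
```

That gives 810 GB for fp16, 405 GB for 8-bit and 202.5 GB for 4-bit, which lines up with the roughly 800 GB and 200 GB figures.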

[–] theneverfox@pawb.social 5 points 11 months ago

Why, of course! People on here saying it's impossible, smh

Let me introduce you to the wonderful world of thrashing. What is thrashing? It's what happens when you run out of RAM. Luckily, most computers these days have something like swap space - they just treat your SSD as extra-slow extra RAM.

Your computer still locks up when it genuinely doesn't have enough RAM, though. It moves some of what's in RAM out to disk, pulls what it needs right now back into RAM, executes a bit of processing, and then the program tells it that it actually needs some of what just got shelved on disk. And it does this super fast, so it's dropping the thing it needs hundreds of times a second - technology is truly remarkable

Depending on how the software handles it, it might just crash... or it might just take literal hours
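Here's a toy simulation of that, assuming the OS evicts the least recently used page (real kernels use approximations of this, and the page counts here are made up). Walking over the same data in a loop, being one page short of RAM turns a handful of disk reads into a disk read on every single access:

```python
from collections import OrderedDict

def count_page_faults(accesses, ram_pages):
    """Count how often a page has to come from disk under LRU eviction."""
    ram = OrderedDict()                  # pages currently in RAM, oldest first
    faults = 0
    for page in accesses:
        if page in ram:
            ram.move_to_end(page)        # mark as most recently used
        else:
            faults += 1                  # not in RAM: read it from swap
            if len(ram) >= ram_pages:
                ram.popitem(last=False)  # evict the least recently used page
            ram[page] = True
    return faults

fits = [page for _ in range(100) for page in range(4)]     # loops over 4 pages
too_big = [page for _ in range(100) for page in range(5)]  # loops over 5 pages

print(count_page_faults(fits, ram_pages=4))     # 4
print(count_page_faults(too_big, ram_pages=4))  # 500
```

With 4 pages of RAM, the 4-page loop faults 4 times in total, while the 5-page loop faults on all 500 accesses, because the page it needs next is always the one it just threw out.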