this post was submitted on 23 May 2025

42 points (87.5% liked)

No Stupid Questions

42370 readers

1471 users here now

No such thing. Ask away!

!nostupidquestions is a community dedicated to being helpful and answering each others' questions on various topics.

The rules for posting and commenting, besides the rules defined here for lemmy.world, are as follows:

Rules (interactive)

Rule 1- All posts must be legitimate questions. All post titles must include a question.

All posts must be legitimate questions, and all post titles must include a question. Questions that are joke or trolling questions, memes, song lyrics as title, etc. are not allowed here. See Rule 6 for all exceptions.

Rule 2- Your question subject cannot be illegal or NSFW material.

Your question subject cannot be illegal or NSFW material. You will be warned first, banned second.

Rule 3- Do not seek mental, medical and professional help here.

Do not seek mental, medical and professional help here. Breaking this rule will not get you or your post removed, but it will put you at risk, and possibly in danger.

Rule 4- No self promotion or upvote-farming of any kind.

That's it.

Rule 5- No baiting or sealioning or promoting an agenda.

Questions which, instead of being of an innocuous nature, are specifically intended (based on reports and in the opinion of our crack moderation team) to bait users into ideological wars on charged political topics will be removed and the authors warned - or banned - depending on severity.

Rule 6- Regarding META posts and joke questions.

Provided it is about the community itself, you may post non-question posts using the [META] tag on your post title.

On fridays, you are allowed to post meme and troll questions, on the condition that it's in text format only, and conforms with our other rules. These posts MUST include the [NSQ Friday] tag in their title.

If you post a serious question on friday and are looking only for legitimate answers, then please include the [Serious] tag on your post. Irrelevant replies will then be removed by moderators.

Rule 7- You can't intentionally annoy, mock, or harass other members.

If you intentionally annoy, mock, harass, or discriminate against any individual member, you will be removed.

Likewise, if you are a member, sympathiser or a resemblant of a movement that is known to largely hate, mock, discriminate against, and/or want to take lives of a group of people, and you were provably vocal about your hate, then you will be banned on sight.

Rule 8- All comments should try to stay relevant to their parent content.

Rule 9- Reposts from other platforms are not allowed.

Let everyone have their own content.

Rule 10- Majority of bots aren't allowed to participate here. This includes using AI responses and summaries.

Credits

Our breathtaking icon was bestowed upon us by @Cevilia!

The greatest banner of all time: by @TheOneWithTheHair!

founded 2 years ago

MODERATORS

L3s@lemmy.world

technopagan@lemmy.world

jeffw@lemmy.world

L3s@hackingne.ws

Are there any initiatives aimed at training generative AI using 100% public domain works and works authorized by the creator? (lemmy.ml)

submitted 1 month ago by HiddenLayer555@lemmy.ml to c/nostupidquestions@lemmy.world

31 comments fedilink hide all child comments

The biggest issue with generative AI, at least to me, is the fact that it's trained using human-made works where the original authors didn't consent to or even know that their work is being used to train the AI. Are there any initiatives to address this issue? I'm thinking something like an open source AI model and training data store that only has works that are public domain and highly permissive no-attribution licenses, as well as original works submitted by the open source community and explicitly licensed to allow AI training.

I guess the hard part is moderating the database and ensuring all works are licensed properly and people are actually submitting their own works, but does anything like this exist?

top 31 comments

sorted by: hot top controversial new old

[–] trxxruraxvr@lemmy.world 15 points 1 month ago (2 children)

I'd say the biggest issue with generative AI is the energy use and the fact that it's increasing the rate at which we're destroying the climate and our planet.

[–] masterspace@lemmy.ca 9 points 1 month ago* (last edited 1 month ago) (2 children)

If they pay to power it with sustainable energy then it doesn't. Simple as that. Energy use is really not a problem.

AI's biggest problem is that it accelerates the effects of capitalism and wealth concentration, and our societies are not set up to handle that, or even to adapt particularly quickly.

[–] sanguinepar@lemmy.world 6 points 1 month ago* (last edited 1 month ago)

If they pay to power it with sustainable energy then it doesn't. Simple as that. Energy use is really not a problem.

It is if doing so means taking up existing capacity in sustainable energy.

If they were always adding new sustainable capacity specifically for their data centres, that would be one thing, but if all they do is pay for the use of existing capacity, that potentially just pushes the issue down the road a bit.

If/when there's enough capacity to supply all homes and businesses then this issue would disappear, but I don't know how close that is.

Agreed on the capitalism thing, by the way, plus the points about stolen IP.

[–] Yermaw@lemm.ee 3 points 1 month ago

if they pay

We're boned

[–] Ceedoestrees@lemmy.world 3 points 1 month ago (1 children)

Do we know how energy usage of AI compares to other daily tasks?

Like: rendering a minute of a fully animated film, flying from L.A. to New York, watching a whole series on Netflix, scrolling this site for an hour, or manufacturing a bottle of tylenol?

How does asking AI "2+2" compare to generating a three second animation in 1080p? There has to be a wide gamut of energy use per task.

And then the impact would depend on where your energy comes from. Which is a whole other thing, we should be demanding cleaner, more efficient energy sources.

A quick search on energy consumption by AI brings up a list of articles repeating the mantra that it's substantial, but sources are vague or non-existent. None provide details to be able to confidently answer any of the above questions.

That's not to say AI doesn't consume significant power, it's saying most people don't regulate their lives by energy consumption.

[–] kadup@lemmy.world 4 points 1 month ago (1 children)

We do have fairly precise numbers of how much energy it takes to train the models using the best GPUs available, and slightly less precise but also reasonable estimates on how much it costs to run servers for users to toy around with.

It's extremely high, but not different from what it would be like if these were cloud gaming or 3D rendering servers.

The main point is usually is it worth it and that's highly subjective.

[–] Ceedoestrees@lemmy.world 1 points 1 month ago

That's my point. We don't allocate energy resources based on importance, and when this argument is brought up there's no scale for comparison when someone says AI, specifically, is destroying the planet.

[–] Artisian@lemmy.world 11 points 1 month ago (1 children)

As I understand it, there are many many such models. Especially those made for academic use. Some common training corpus's are listed here: https://www.tensorflow.org/datasets

Examples include wikipedia edits and discussions, and open source scientific articles.

Almost all research models are going to be trained on stuff like this. Many of them have demos, open code, and local installation instructions. They generally don't have a marketing budget. Some of the models listed here certainly qualify: https://github.com/eugeneyan/open-llms?tab=readme-ov-file

Both of these are lists that are not so difficult to get on; so I imagine some of these have trouble with falsification or mislabeling, as you point out. But there's little reason for people to do so (beyond improving a papers results I guess?).

Art generation seems to have had a harder time, but there are stable diffusion equivalents that used only CC work. A few minutes of search found: Common Canvas, claims to have been competitive.

[–] Crackhappy@lemmy.world 7 points 1 month ago

Excellent, thank you for posting sources and being a generally excellent human.

[–] tkw8@lemm.ee 5 points 1 month ago

https://allenai.org/

[–] General_Effort@lemmy.world 4 points 1 month ago* (last edited 1 month ago) (1 children)

For images, yes. Most notable is probably Adobe. Their AI, which powers photoshop's generative fill among other things, is trained on public domain and licensed works.

For text, there's nothing similar. LLMs get better the more data you have. So, the less training data you use, the less useful they are. I think there are 1 or a few small models for research purposes, but it really doesn't get you there.

Of course, such open source projects are tricky. When you take these extreme, maximalist views of (intellectual) property, then giving stuff away for free isn't the obvious first step.

[–] kadup@lemmy.world 3 points 1 month ago (1 children)

It's also very hard to keep track of licenses for text based content on the internet. Do most users know what's the default licence for their comments on Reddit? How about Facebook? How about the comments section of a random blog? How about the title of their Medium post? And so on

[–] General_Effort@lemmy.world 2 points 1 month ago (1 children)

The usual tends to be that the platform can do basically whatever. That shouldn't really be surprising. But I see your point. If you literally want consent, not just legally licensed material, then you need more than just a clause in the TOS.

You could raise the same issue with permissively licensed material. People who released it may not have foreseen AI training as a use, and might not have wanted to actually allow it.

[–] kadup@lemmy.world 1 points 1 month ago (1 children)

Exactly - the platform owner usually can do everything. Can a third party crawler? I don't know

[–] General_Effort@lemmy.world 1 points 1 month ago

You mean legally? Yeah, no problem. It depends on the location, though. In the EU, the rights-holder can opt out. So if you want to do it in the EU you have to pay off Reddit, Meta, and so on. In Japan, it's fine regardless. In the US, it should turn out similarly, but it's up to the courts to work out the details, and it's quite up in the air if you can trust the system to work.

[–] solrize@lemmy.ml 3 points 1 month ago

For text generation the result would be almost useless since most public domain works are very old. For images, you could train with video feeds maybe.

[–] masterspace@lemmy.ca 3 points 1 month ago (3 children)

I worked at a major tech company and their attitude was 'if this becomes popular enough, and copyright is an issue, we'll just pay artists to produce training data en masse'.

[–] hisao@ani.social 2 points 1 month ago

we’ll just pay artists to produce training data en masse’.

If they want to make sure it was actually drawn specifically for them and not generated by other AI or stolen from internet, they'll need to ask timeline of work. And people doing commissions like this with also timeline provided will ask considerable payment. The smallest I'd expect is like maybe 30$ per small drawing of beginners. But it might as well be 300$ or more per drawing for pro works. Even with 30$, are they really able to pay that? How many drawings they need? Can they spend millions on this?

[–] Zwuzelmaus@feddit.org 1 points 1 month ago

Yes, who doesn't know them, the high paid actors from Namibia and Phillipines...

[–] breakingcups@lemmy.world 1 points 1 month ago

I'd like that job. And poison that company's data pool by using another company's AI tools to create what they want.

[–] Zwuzelmaus@feddit.org 1 points 1 month ago

using human-made works where the original authors didn't consent to or even know that their work is being used to train the AI. Are there any initiatives to address this issue?

I don't think so.

Training is expensive, and they all want to make money. Most of them want to make insane amounts of money.

Therefore they don't care if it's illegal.

Free and Nonprofit usually don't start with huge piles of play money. OpenAI was the exception once before it got so thoroughly corrupted.

[–] isekaihero@ani.social -1 points 1 month ago (1 children)

It's not an issue to me, and is completely befuddling to begin with. Training an AI on copyrighted material doesn't mean the AI violates that material when it generates new artwork. AI models don't contain a copy of all the works they were trained on - which could be petabytes of data. They reduce what they learned to math algorithms and use those algorithms to generate new stuff.

Humans work much the same way. We are all exposed to copyrighted material all the time, and when we create new artwork a lot of the ideas churning inside our heads originate from other people's works. When a human artist draws a mouse man smiling and whistling a tune, for some reason it's not considered a copyright violation as long as it doesn't strictly resemble mickey mouse. But when an AI generates a mouse man smiling and whistling a tune? Suddenly the anti-AI crowd points at it and screams about it violating Disney IP.

It's not an issue. It never was. AI training is a strawman argument manufactured by the anti-AI crowd to justify their hatred of AI. If you created an AI trained on public domain stuff, they would still hate it. They would just clutch at some other reason.

[–] RandomVideos@programming.dev 1 points 1 month ago (1 children)

Has anyone ever defended the copyright of a massive corporation when talking about AI?

Image generation, during its training, tries to get as close as possible to the image its training on. The way the AI trains isnt even remotely close to how humans do it

Also, copyright is not the only reason why people hate AI. Obviously another reason would be presented if one is eliminated. It doesnt just appear out of nowhere

[–] isekaihero@ani.social -1 points 1 month ago (2 children)

No that's not how it works. AI models don't carry a repository of images. They use algorithms. The model itself is a few gigabytes where as the training data would be petabytes - far larger than I could fit on my home desktop running stable diffusion.

It actually is close to how humans do it. You're thinking "it's copying that image" and it's not. It's using algorithms to create an image in a similar style. It knows different artistic styles because it has been fed a repository of millions of images in that style and can generate similar images in that style.

As for copyright, it was recently all over social media that AI could copy studio ghibli's art style. To the rage of social media and their fanbase, this is allowed. Studio Ghibli can't copyright an art style, and that's why AI image generators continue to include the option to generate art in that art style.

[–] RandomVideos@programming.dev 2 points 1 month ago

I never said that the images were saved. I said that the AI was trained to copy the images, not that it had a way to check them after it trained

Even though both can "know" styles, the methods used to train humans and AI and how they act is completely different. A human doesnt start with noise and gradually removes it to create an image

[–] masterspace@lemmy.ca 2 points 1 month ago

It's not a popular opinion but you're entirely right.

AI isn't copying in the way that most people think it is. It truly is transformative in all the tradition copyright ways.

Is it copyright infringements if my company pays an employee to study the internet and that makes them capable of animating a frame from the Simpsons? No, it's copyright infringement when that company publishes that copyright infringing work.

The reality is that copyright has always been a nonsense system and 'fair use' concepts were also nonsense and arbitrary. AI algorithms just let us expose how nonsense they are at scale.

[–] sturlabragason@lemmy.world -1 points 1 month ago (1 children)

Hey PubDomainLLM tell me something that only exists in that proprietary dataset? “I’m sorry, you’ve caught me lackin’”

You would want your LLM to be trained on as comprehensive a dataset as you can. But I would suggest we should be coming up with better ways to license proprietary works for uses like this instead of walling it up for the cable tv of proprietary knowledge gardens.

I agree with you partially in principle but not in practice.

Ultimately we want as smart LLMs as we can, just compare the best models with the mediocre ones, or use them all day long, there is a vast difference.

[–] cecilkorik@lemmy.ca 6 points 1 month ago (1 children)

Ultimately we want as smart LLMs as we can,

We do? I want LLMs to die in a fire (which they will likely cause by vastly and rapidly increasing global warming, so the problem at least solves itself)

We are not the same.

[–] sturlabragason@lemmy.world 1 points 1 month ago* (last edited 1 month ago) (1 children)

I’ll make sure to let them know once they come for us.

But yeah agree about energy efficiency… and that we are probably pretty polarized on this matter.

Edit; which is cool because looking at your comment history we agree on a lot of other shit😊

[–] cecilkorik@lemmy.ca 2 points 1 month ago (1 children)

I was just making some snide commentary for fun. It was a little bit at your expense I admit. I appreciate you for not taking it personally! This is why we can sometimes have nice things.

[–] sturlabragason@lemmy.world 1 points 1 month ago

Thanks! I love lemmy. ❤️

I use LLMs daily and think they are amazing technology.

Capitalism sadly seems to agree with me very aggressively.