I'm sure the AI devs who are so lazy they can't train their AI on anything other than scraped HTML can manage to set up a Lemmy instance and point their crawlers at that.
I think Lemmy content is scraped too, just like the whole web is being scraped. I don't have any proof of it, though.
I have seen a user add a sort of anti-commercial-AI license as a footer to every comment he writes lol
Those are truly useless against bad actors and are instead only annoying for the humans who read them. And good actors who handle licensing properly won't be scraping Lemmy, Reddit, or Twitter anyway.
You just cannot prevent it on Lemmy, because if one instance puts up filters like Anubis, another will not, and it is not feasible to mandate that every instance do so. Also, this is an open platform by nature, and there is no group or company that can dictate rules of access. And while you're limiting non-humans, you might also be limiting real users with peculiar configurations or behind heavy privacy middleware.
The point (as I see it) is not so much to stop scraping as it is to prevent bots from effectively DDoS-ing web services. As others have said, ActivityPub content is public and there are ways to get it without slamming instances with scraper bots.
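For example, a polite consumer can just page through the public API with a sensible delay instead of hammering the HTML frontend. A minimal Python sketch, assuming the Lemmy v3 `/api/v3/post/list` endpoint; the path, parameters, and response shape here are from memory, so check the instance's API docs:

```python
# Rough sketch: pull public posts from a Lemmy instance's REST API instead of
# scraping HTML. The endpoint path, parameters, and response keys are assumptions
# based on the Lemmy v3 API; verify against the instance's documentation.
import time
import requests

INSTANCE = "https://lemmy.example"  # placeholder instance

def fetch_posts(page: int = 1, limit: int = 20) -> list[dict]:
    resp = requests.get(
        f"{INSTANCE}/api/v3/post/list",
        params={"sort": "New", "page": page, "limit": limit},
        headers={"User-Agent": "polite-archiver/0.1 (admin@example.org)"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json().get("posts", [])

if __name__ == "__main__":
    for page in range(1, 4):
        for post_view in fetch_posts(page):
            print(post_view["post"]["name"])
        time.sleep(5)  # back off between pages instead of hammering the server
```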
It is. I saw ClaudeBot and GPTBot scraping my instance and made a post about it on fuckai, but I have blocked all these bots now and my instance is a lot faster.
Out of curiosity (I am not familiar at all with the stack that runs behind the scenes for Lemmy): are you blocking IP ranges or something else?
I use this nginx extension. It has a lot of rules that mix IPs, user agents, etc. to decide whether to block a request, it seems. Like adblocking rules, but for bots.
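Very roughly, the matching idea looks like this. Python sketch only; the bot names and the CIDR range below are examples, not the actual ruleset the nginx extension ships with:

```python
# Toy illustration of a deny-list check: reject a request if its user agent or
# source IP matches a rule. The bot names and CIDR range are example values only.
import ipaddress

BLOCKED_AGENTS = ("GPTBot", "ClaudeBot", "CCBot", "Bytespider")
BLOCKED_NETS = [ipaddress.ip_network("203.0.113.0/24")]  # documentation range, example only

def should_block(user_agent: str, client_ip: str) -> bool:
    if any(bot.lower() in user_agent.lower() for bot in BLOCKED_AGENTS):
        return True
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in BLOCKED_NETS)

print(should_block("Mozilla/5.0 (compatible; GPTBot/1.0)", "198.51.100.7"))  # True
print(should_block("Mozilla/5.0 Firefox/128.0", "198.51.100.7"))             # False
```

The real extension does this inside nginx itself, so blocked requests never reach Lemmy at all.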
If AI crawlers respected that, people would not need AI mazes.
I don't host a Lemmy instance, but I post links in my comments. I sometimes generate unique-ish URLs to share updates to specific versions of my hobby projects, and I've seen them queried a few times in my Apache logs by user agents claiming to be from OpenAI, Anthropic, etc. Also by search engine crawler bots.
Here's an IP whose user agent claimed to be an Anthropic bot; it seems like others have encountered the same behaviour: https://abuseipdb.com/check/216.73.216.135
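If you want to pull those hits out of the logs yourself, something along these lines works on a combined-format Apache access log. The log path and the user-agent keywords are assumptions, so adjust them for your setup:

```python
# Scan an Apache combined-format access log for requests from AI crawler user agents.
# LOG_PATH and the AI_AGENTS keywords are assumptions; tune them to your own server.
import re

LOG_PATH = "/var/log/apache2/access.log"
AI_AGENTS = ("GPTBot", "OAI-SearchBot", "ClaudeBot", "anthropic-ai", "PerplexityBot")

# combined format: IP - - [time] "request" status size "referer" "user agent"
LINE_RE = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "([^"]*)" \d+ \S+ "[^"]*" "([^"]*)"')

with open(LOG_PATH, encoding="utf-8", errors="replace") as log:
    for line in log:
        match = LINE_RE.match(line)
        if not match:
            continue
        ip, when, request, agent = match.groups()
        if any(bot.lower() in agent.lower() for bot in AI_AGENTS):
            print(f"{when}  {ip}  {request}  ({agent})")
```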
They don't really need to scrape. They just have to set up their own federated instance and the ActivityPub protocol will willingly hand it all to them in a nicely parsable format.
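You often don't even need a full instance: many servers will return the structured JSON representation of a public post if you simply ask for it with the right Accept header. A rough sketch; the post URL is a placeholder, and instances running in "authorized fetch" mode will refuse an unsigned request like this:

```python
# Sketch: request the ActivityPub (JSON-LD) representation of a public post via
# content negotiation. POST_URL is a placeholder; servers that require signed
# ("authorized fetch") requests will reject this plain GET.
import json
import requests

POST_URL = "https://lemmy.example/post/12345"  # placeholder object URL

resp = requests.get(
    POST_URL,
    headers={"Accept": "application/activity+json"},
    timeout=30,
)
resp.raise_for_status()
activity = resp.json()
print(json.dumps(activity, indent=2)[:500])  # structured object, no HTML parsing needed
```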
One link on your website leads to a never-ending labyrinth of nonsense to slowly poison an LLM.
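A minimal sketch of that idea, in the spirit of tarpits like Nepenthes or Iocaine but not their actual code: every maze page is deterministic nonsense that links to more maze pages, so a crawler that keeps following links never runs out.

```python
# Toy "AI maze": each page is seeded from its own URL, so the nonsense is stable
# per page, and every page links to five more maze pages. Sketch only.
import hashlib
import random

from flask import Flask, abort

app = Flask(__name__)
WORDS = ["quantum", "lemur", "teapot", "ontology", "banjo", "mist", "parser", "gourd"]

@app.route("/maze/<token>")
def maze(token: str):
    if len(token) > 64:
        abort(404)
    # Seed the RNG from the URL so each page is reproducible but effectively endless.
    rng = random.Random(hashlib.sha256(token.encode()).hexdigest())
    paragraph = " ".join(rng.choice(WORDS) for _ in range(80))
    links = "".join(
        f'<p><a href="/maze/{hashlib.sha256((token + str(i)).encode()).hexdigest()[:16]}">more</a></p>'
        for i in range(5)
    )
    return f"<html><body><p>{paragraph}</p>{links}</body></html>"

if __name__ == "__main__":
    app.run(port=8080)
```

Typically you'd disallow the maze path in robots.txt, so only crawlers that ignore it end up inside.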
slrpnk.net has an AI intercept called Anubis, fwiw
It's very easy for any ActivityPub content to be scraped; servers practically hand the content over on a silver platter to any federated server.