I'm sure the AI devs who are so lazy they can't train their AI on anything other than scraped HTML can manage to set up a Lemmy instance and point their crawlers at that.
I think Lemmy content is scraped too, just like the whole web is being scraped. I don't have any proof of it, though.
I have seen a user add a sort of anti-commercial-AI license as a footer to every comment he writes lol
Those are truly useless against bad actors and are instead only annoying for the humans who read them. And good actors who handle licensing properly won't be scraping Lemmy, Reddit, or Twitter anyway.
You just cannot prevent it on Lemmy, because if one instance puts up filters like Anubis, another will not, and it is not feasible to mandate that every instance do so. Also, this is an open platform by nature, and there is no group or company that can dictate rules of access. And while you're limiting non-humans, you might also be limiting real users with peculiar configurations or behind heavy privacy middleware.
The point (as I see it) is not so much to stop scraping as it is to prevent bots from effectively DDoS-ing web services. As others have said, ActivityPub content is public and there are ways to get it without slamming instances with scraper bots.
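For example, a polite consumer can just page through the public API with a sensible delay instead of hammering the HTML frontend. A minimal Python sketch, assuming the Lemmy v3 `/api/v3/post/list` endpoint; the path, parameters, and response shape here are from memory, so check the instance's API docs:

```python
# Rough sketch: pull public posts from a Lemmy instance's REST API instead of
# scraping HTML. The endpoint path, parameters, and response keys are assumptions
# based on the Lemmy v3 API; verify against the instance's documentation.
import time
import requests

INSTANCE = "https://lemmy.example"  # placeholder instance

def fetch_posts(page: int = 1, limit: int = 20) -> list[dict]:
    resp = requests.get(
        f"{INSTANCE}/api/v3/post/list",
        params={"sort": "New", "page": page, "limit": limit},
        headers={"User-Agent": "polite-archiver/0.1 (admin@example.org)"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json().get("posts", [])

if __name__ == "__main__":
    for page in range(1, 4):
        for post_view in fetch_posts(page):
            print(post_view["post"]["name"])
        time.sleep(5)  # back off between pages instead of hammering the server
```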
It is. I saw ClaudeBot and GPTBot scraping my instance and made a post about it on fuckai, but I have blocked all these bots now and my instance is a lot faster.
Out of curiosity (I am not familiar at all with the stack that runs behind the scenes for Lemmy): are you blocking IP ranges or something else?
I use this nginx extension. It has a lot of rules that mix IPs, user agents, etc. to decide whether to block a request, it seems. Like adblocking rules, but for bots.
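Very roughly, the matching idea looks like this. Python sketch only; the bot names and the CIDR range below are examples, not the actual ruleset the nginx extension ships with:

```python
# Toy illustration of a deny-list check: reject a request if its user agent or
# source IP matches a rule. The bot names and CIDR range are example values only.
import ipaddress

BLOCKED_AGENTS = ("GPTBot", "ClaudeBot", "CCBot", "Bytespider")
BLOCKED_NETS = [ipaddress.ip_network("203.0.113.0/24")]  # documentation range, example only

def should_block(user_agent: str, client_ip: str) -> bool:
    if any(bot.lower() in user_agent.lower() for bot in BLOCKED_AGENTS):
        return True
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in BLOCKED_NETS)

print(should_block("Mozilla/5.0 (compatible; GPTBot/1.0)", "198.51.100.7"))  # True
print(should_block("Mozilla/5.0 Firefox/128.0", "198.51.100.7"))             # False
```

The real extension does this inside nginx itself, so blocked requests never reach Lemmy at all.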
If AI crawlers respected that, people would not need AI mazes.
I don't host a Lemmy instance, but I post links in my comments. I sometimes generate unique-ish URLs to share updates to specific versions of my hobby projects, and I've seen them queried a few times in my Apache logs by user agents claiming to be from OpenAI, Anthropic, etc. Also by search engine crawler bots.
Here's an IP whose user agent claimed to be an Anthropic bot; it seems like others have encountered the same behaviour: https://abuseipdb.com/check/216.73.216.135
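If you want to pull those hits out of the logs yourself, something along these lines works on a combined-format Apache access log. The log path and the user-agent keywords are assumptions, so adjust them for your setup:

```python
# Scan an Apache combined-format access log for requests from AI crawler user agents.
# LOG_PATH and the AI_AGENTS keywords are assumptions; tune them to your own server.
import re

LOG_PATH = "/var/log/apache2/access.log"
AI_AGENTS = ("GPTBot", "OAI-SearchBot", "ClaudeBot", "anthropic-ai", "PerplexityBot")

# combined format: IP - - [time] "request" status size "referer" "user agent"
LINE_RE = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "([^"]*)" \d+ \S+ "[^"]*" "([^"]*)"')

with open(LOG_PATH, encoding="utf-8", errors="replace") as log:
    for line in log:
        match = LINE_RE.match(line)
        if not match:
            continue
        ip, when, request, agent = match.groups()
        if any(bot.lower() in agent.lower() for bot in AI_AGENTS):
            print(f"{when}  {ip}  {request}  ({agent})")
```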
They don't really need to scrape. They just have to set up their own federated instance and the ActivityPub protocol will willingly hand it all to them in a nicely parsable format.
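You often don't even need a full instance: many servers will return the structured JSON representation of a public post if you simply ask for it with the right Accept header. A rough sketch; the post URL is a placeholder, and instances running in "authorized fetch" mode will refuse an unsigned request like this:

```python
# Sketch: request the ActivityPub (JSON-LD) representation of a public post via
# content negotiation. POST_URL is a placeholder; servers that require signed
# ("authorized fetch") requests will reject this plain GET.
import json
import requests

POST_URL = "https://lemmy.example/post/12345"  # placeholder object URL

resp = requests.get(
    POST_URL,
    headers={"Accept": "application/activity+json"},
    timeout=30,
)
resp.raise_for_status()
activity = resp.json()
print(json.dumps(activity, indent=2)[:500])  # structured object, no HTML parsing needed
```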
One link on your website leads to a never-ending labyrinth of nonsense to slowly poison an LLM.
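A minimal sketch of that idea, in the spirit of tarpits like Nepenthes or Iocaine but not their actual code: every maze page is deterministic nonsense that links to more maze pages, so a crawler that keeps following links never runs out.

```python
# Toy "AI maze": each page is seeded from its own URL, so the nonsense is stable
# per page, and every page links to five more maze pages. Sketch only.
import hashlib
import random

from flask import Flask, abort

app = Flask(__name__)
WORDS = ["quantum", "lemur", "teapot", "ontology", "banjo", "mist", "parser", "gourd"]

@app.route("/maze/<token>")
def maze(token: str):
    if len(token) > 64:
        abort(404)
    # Seed the RNG from the URL so each page is reproducible but effectively endless.
    rng = random.Random(hashlib.sha256(token.encode()).hexdigest())
    paragraph = " ".join(rng.choice(WORDS) for _ in range(80))
    links = "".join(
        f'<p><a href="/maze/{hashlib.sha256((token + str(i)).encode()).hexdigest()[:16]}">more</a></p>'
        for i in range(5)
    )
    return f"<html><body><p>{paragraph}</p>{links}</body></html>"

if __name__ == "__main__":
    app.run(port=8080)
```

Typically you'd disallow the maze path in robots.txt, so only crawlers that ignore it end up inside.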
slrpnk.net has an AI intercept called Anubis, fwiw
It's very easy for any ActivityPub content to be scraped; servers practically hand the content over on a silver platter to any federated server.