this post was submitted on 07 Jul 2025

145 points (98.7% liked)

The Open-Source Software Saving the Internet From AI Bot Scrapers (www.404media.co)

submitted 4 days ago by sabreW4K3@lazysoci.al to c/technology@beehaw.org

[–] theangriestbird@beehaw.org 42 points 4 days ago (1 children)

This snip at the end is so good:

Iaso said she thinks AI companies follow her work, and that if they really want to stop her and Anubis they just need to distract her.

“If you are working at an AI company, here's how you can sabotage Anubis development as easily and quickly as possible,” she wrote on her site. “So first is quit your job, second is work for Square Enix, and third is make absolute banger stuff for Final Fantasy XIV. That’s how you can sabotage this the best.”

[–] Geodad@beehaw.org 8 points 4 days ago

I'd be fine with this... 🤣

[–] who@feddit.org 16 points 4 days ago* (last edited 4 days ago) (1 children)

She told me she’s [...] also thinking about a version that doesn’t require JavaScript, which some privacy-minded disable in their browsers.

As someone who is keenly aware of the privacy and security problems that come with allowing web scripts, I hope she prioritizes this soon. It's really disappointing to find sites that were formerly readable without JavaScript suddenly inaccessible since adopting Anubis. The more sites that do this, the more people are pushed toward enabling scripts by default, exposing them to a great many trackers and web exploits that would otherwise be blocked.

[–] exu@feditown.com 2 points 3 days ago (1 children)

There's an option using some very new HTML tag, but it's not the default.

https://anubis.techaro.lol/docs/admin/configuration/challenges/metarefresh

[–] who@feddit.org 1 points 3 days ago (1 children)

Interesting. Judging by that option's name, it seems to refer to use of the HTML <meta> tag to refresh a page.

https://developer.mozilla.org/en-US/docs/Web/HTML/Reference/Elements/meta/http-equiv

Neither this tag nor using it for refresh is new at all. I don't think I've seen it used to detect bots, though. I wonder what Anubis is doing here.

[–] JohnEdwa@sopuli.xyz 2 points 3 days ago

It's simply checking if the connection is from an actual browser, as a scraper pretending to be one won't actually refresh the page as instructed. It's going to buy some time, but like the rest of Anubis in general, it will only work until the scrapers get modified to work around it.
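
To make the idea concrete, here is a minimal Python sketch of a meta-refresh check. It is not Anubis's actual implementation: the secret, the delay, the `ts`/`sig` query parameters, and all function names are assumptions made up for illustration. The server hands out a page whose `<meta http-equiv="refresh">` tag points at a URL carrying a signed timestamp, then only lets the client through if it comes back with a valid token after the delay.

```python
import hashlib
import hmac
import html

SECRET = b"demo-secret"  # hypothetical server-side key, for illustration only
DELAY_SECONDS = 2        # hypothetical wait before the browser is told to reload


def sign(issued_at: int) -> str:
    """Return an HMAC-SHA256 tag that binds a token to its issue time."""
    return hmac.new(SECRET, str(issued_at).encode(), hashlib.sha256).hexdigest()


def challenge_page(now: int) -> str:
    """Build an HTML page asking the client to reload with a signed token."""
    target = f"/?ts={now}&sig={sign(now)}"
    # Escape the URL so the '&' is valid inside an HTML attribute value.
    content = html.escape(f"{DELAY_SECONDS}; url={target}", quote=True)
    return (
        "<!doctype html><html><head>"
        f'<meta http-equiv="refresh" content="{content}">'
        "</head><body>Checking your browser...</body></html>"
    )


def passes(ts: int, sig: str, now: int) -> bool:
    """Accept only a correctly signed token presented after the delay."""
    if not hmac.compare_digest(sig, sign(ts)):
        return False  # token was forged or tampered with
    return now - ts >= DELAY_SECONDS  # came back too early: likely not a browser
```

A client that fetches the HTML and never follows the refresh never presents a token at all, so it never gets past the check. As noted above, a scraper that parses the tag and waits out the delay would pass, so a check like this only buys time.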

[–] FundMECFSResearch@lemmy.blahaj.zone 13 points 4 days ago (3 children)

Anubis always flags me for some reason. I use Mullvad and Safari (iOS) with some ad- and tracker-blocking extensions.

[–] Appoxo@lemmy.dbzer0.com 4 points 3 days ago

I wonder why traffic from known VPN companies is under more scrutiny than traffic from domestic households...

[–] Photuris@lemmy.ml 6 points 4 days ago (2 children)

More sites in general are blocking mullvad traffic lately (in my experience), and I’m not sure what, if anything, can be done about it.

[–] FundMECFSResearch@lemmy.blahaj.zone 6 points 4 days ago (1 children)

I expect better from a popular FOSS tool being used by privacy aware people though.

[–] SweetCitrusBuzz@beehaw.org 2 points 4 days ago

Can you open an issue, or see if one is open already for this?

[–] Powderhorn@beehaw.org 3 points 4 days ago

Agreed. Luckily, they don't seem to have the full list of Mullvad IPs, so if I really want to read something, I just try another tunnel.

[–] simple@piefed.social 6 points 4 days ago (1 children)

Do you have javascript or cookies disabled? That might stop you from getting past.

[–] FundMECFSResearch@lemmy.blahaj.zone 3 points 4 days ago

nope

[–] leaky_shower_thought@feddit.nl 11 points 4 days ago (1 children)

i like this one better than cloudflare's turnstile.

cf blocks me all the time for the smallest reasons and i can't seem to find their nag email.

[–] fuckwit_mcbumcrumble@lemmy.dbzer0.com 2 points 4 days ago (1 children)

I have no issues with Cloudflare, but Anubis always takes its sweet-ass time to verify me. Like 30+ seconds just sitting there, but then eventually I get in.

[–] Vanilla_PuddinFudge@infosec.pub 1 points 3 days ago* (last edited 3 days ago)

Windows XP ended support like 20 years ago if you were wondering if the Pentium 4 build you're using was still viable.

[–] remington@beehaw.org 2 points 4 days ago (2 children)

Would you edit your post and add the following archive link to the body, please?

https://archive.is/VcoE1

[–] who@feddit.org 7 points 4 days ago* (last edited 4 days ago) (1 children)

Unfortunately, archive.is seems to have moved behind a big corporate CAPTCHA service, subjecting readers to having their reading habits (both the articles and the referring communities) tracked at a large scale.

I suggest this archive link instead:

https://web.archive.org/web/20250707135819/https://www.404media.co/the-open-source-software-saving-the-internet-from-ai-bot-scrapers/

[–] remington@beehaw.org 1 points 4 days ago (1 children)

Unfortunately, archive.is has moved behind Cloudflare, subjecting readers to having their reading habits (both the articles and the referring communities) tracked at a large scale.

How do you know this?

What about https://ghostarchive.org/?

[–] who@feddit.org 6 points 4 days ago* (last edited 4 days ago) (1 children)

Sorry; I shouldn't have written Cloudflare specifically. Their CAPTCHA page now contains scripts from Google, not Cloudflare. I have corrected my comment.

How do you know this?

Because a couple months ago, archive.is/archive.today started showing me CAPTCHA pages instead of the archived articles when I use Firefox with scripts disabled. The current page contains scripts hosted by Google, which I won't enable, so I can't read the archived articles.

What about https://ghostarchive.org/?

I haven't used that site enough to have a consistent picture of what it's doing. When I tried it a few minutes ago, it directed me to a CAPTCHA wall when trying to submit an article, but not when searching for an archived article. I'll try to remember to look at it again periodically, to be able to answer this question in the future.

[–] remington@beehaw.org 3 points 4 days ago

Thanks. I appreciate the info and effort.

[–] sabreW4K3@lazysoci.al 5 points 4 days ago (1 children)

To be honest with you, I refuse on moral grounds. 404 are independent and do good work. You've already linked a paywall bypass in the comments; if anyone would like to find it, it's not hard to scroll.

[–] remington@beehaw.org 4 points 4 days ago

OK. Fair enough.