this post was submitted on 14 Feb 2024
1014 points (98.7% liked)

Technology


This is a most excellent place for technology news and articles.


Our Rules


  1. Follow the lemmy.world rules.
  2. Only tech related content.
  3. Be excellent to each other!
  4. Mod approved content bots can post up to 10 articles per day.
  5. Threads asking for personal tech support may be deleted.
  6. Politics threads may be removed.
  7. No memes allowed as posts, OK to post as comments.
  8. Only approved bots from the list below; to ask if your bot can be added, please contact us.
  9. Check for duplicates before posting; duplicates may be removed.

top 50 comments
[–] CosmicCleric@lemmy.world 135 points 7 months ago (3 children)

As unscrupulous AI companies crawl for more and more data, the basic social contract of the web is falling apart.

Honestly it seems like in all aspects of society the social contract is being ignored these days; that's why things seem so much worse now.

[–] maness300@lemmy.world 23 points 7 months ago

It's abuse, plain and simple.

[–] TheObviousSolution@lemm.ee 12 points 7 months ago

Governments could do something about it, if they weren't overwhelmed by bullshit from bullshit generators and led by people driven by their personal wealth.

[–] homesweethomeMrL@lemmy.world 120 points 7 months ago (2 children)

Well, the Trump era has shown that ignoring social contracts and straight-up crime are only met with profit and slavish devotion from a huge community of dipshits. So. Y’know.

[–] MonsiuerPatEBrown@reddthat.com 95 points 7 months ago* (last edited 7 months ago) (3 children)

The open and free web is long dead.

Just thinking about robots.txt as a working solution for people who literally broker in people's entire digital lives for hundreds of billions of dollars is so ... quaint.

[–] lightnegative@lemmy.world 26 points 7 months ago (2 children)

It's up there with Do-Not-Track.

Completely pointless because it's not enforced

[–] rtxn@lemmy.world 87 points 7 months ago* (last edited 7 months ago) (2 children)

I would be shocked if any big corpo actually gave a shit about it, AI or no AI.

if exists("/robots.txt"):
    no it fucking doesn't
[–] bionicjoey@lemmy.ca 48 points 7 months ago (1 children)

Robots.txt is in theory meant to be there so that web crawlers don't waste their time traversing a website in an inefficient way. It's there to help, not hinder them. There is a social contract being broken here and in the long term it will have a negative impact on the web.
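
For illustration, this is roughly what "helping" looks like in practice: a hypothetical robots.txt that steers crawlers away from dead ends and points them at a sitemap (Crawl-delay is a non-standard directive that not every crawler honors):

    User-agent: *
    Disallow: /search                # endless result pages, a classic crawler trap
    Disallow: /login                 # nothing useful to index here
    Crawl-delay: 10                  # please don't hammer the server
    Sitemap: https://example.com/sitemap.xml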

[–] moitoi@feddit.de 82 points 7 months ago (8 children)

Alternative title: Capitalism doesn't care about morals and contracts. It wants to make more money.

[–] AutistoMephisto@lemmy.world 12 points 7 months ago (5 children)

Exactly. Capitalism spits in the face of the concept of a social contract, especially if companies themselves didn't write it.

[–] circuitfarmer@lemmy.world 68 points 7 months ago (2 children)

Most every other social contract has been violated already. If they don't ignore robots.txt, what is left to violate?? Hmm??

[–] blanketswithsmallpox@lemmy.world 44 points 7 months ago (8 children)

It's almost as if leaving things to social contracts vs regulating them is bad for the layperson... 🤔

Nah fuck it. The market will regulate itself! Tax is theft and I don't want that raise or I'll get in a higher tax bracket and make less!

[–] Jimmyeatsausage@lemmy.world 15 points 7 months ago* (last edited 7 months ago)

This can actually be an issue for poor people, not because of tax brackets but because of income-based assistance cutoffs. If a $1/hr raise throws you above those cutoffs, that extra $160 a month could cost you $500 in food assistance, $5-$10/day for school lunch, or get you kicked out of government-subsidized housing.
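
Back-of-the-envelope math using the figures above (the assistance amounts are the hypothetical cutoff losses already cited):

    # Benefits-cliff arithmetic with the comment's example numbers.
    monthly_raise = 1 * 40 * 4       # $1/hr x 40 hr/wk x ~4 wk = $160/month
    lost_food_assistance = 500       # hypothetical loss past the income cutoff
    lost_school_lunch = 7.50 * 20    # ~$5-10/day over ~20 school days
    net = monthly_raise - lost_food_assistance - lost_school_lunch
    print(net)                       # -490.0: the "raise" leaves you worse off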

Yet another form of persecution that the poor actually suffer and the rich pretend to.

[–] SlopppyEngineer@lemmy.world 8 points 7 months ago

And then the companies hit the "trust thermocline", customers leave them in droves and companies wonder how this could've happened.

[–] KairuByte@lemmy.dbzer0.com 8 points 7 months ago

God the number of people I’ve heard say this over the years is nuts.

[–] ytg@feddit.ch 60 points 7 months ago (15 children)

We need laws mandating respect of robots.txt. This is what happens when you don’t codify stuff

[–] echodot@feddit.uk 33 points 7 months ago

It's a bad solution to a problem anyway. If we are going to legally mandate a solution, I want to take the opportunity to come up with an actually better fix than the hacky solution that is robots.txt.

[–] patatahooligan@lemmy.world 22 points 7 months ago

AI companies will probably get a free pass to ignore robots.txt even if it were enforced by law. That's what they're trying to do with copyright and it looks likely that they'll get away with it.

[–] nutsack@lemmy.world 17 points 7 months ago* (last edited 7 months ago) (1 children)

you can't really make laws in the United States, it's too hard

[–] SPRUNT@lemmy.world 15 points 7 months ago (4 children)

The battle cry of conservatives everywhere: It's too hard!

Except if it involves oppressing minorities and women. Then it's a moral imperative worth all the time and money you can shovel at it regardless of whether the desired outcome is realistic or not.

[–] AA5B@lemmy.world 11 points 7 months ago (3 children)

Turning that into a law is ridiculous - you really can’t consider that more than advisory unless you enforce it with technical means. For example, maybe put it behind a login or captcha if you want only humans to see it

[–] ArmokGoB@lemmy.dbzer0.com 11 points 7 months ago

Sounds like the type of thing that would either be unenforceable or profitable to violate compared to the fines.

[–] maynarkh@feddit.nl 54 points 7 months ago (1 children)

They didn't violate the social contract, they disrupted it.

[–] lando55@lemmy.world 13 points 7 months ago

True innovation. So brave.

[–] KillingTimeItself@lemmy.dbzer0.com 27 points 7 months ago (5 children)

hmm, i thought websites just blocked crawler traffic directly? I know one site in particular has rules about it, and will even go so far as to ban you permanently if you continually ignore them.

[–] Bogasse@lemmy.ml 32 points 7 months ago (1 children)

Detecting crawlers can be easier said than done 🙁
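
Part of the problem is that nothing in a request proves it came from a bot; the User-Agent header is self-reported. A minimal sketch in Python (using the requests library; the URL is a placeholder):

    import requests

    # A scraper that "identifies" as an ordinary desktop browser.
    # Without behavioral analysis (request rate, link patterns, etc.),
    # the server can't tell this apart from a real visitor.
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}
    resp = requests.get("https://example.com/some-page", headers=headers)
    print(resp.status_code, len(resp.text))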

[–] ricdeh@lemmy.world 26 points 7 months ago (3 children)

You cannot simply block crawlers lol

[–] bigMouthCommie@kolektiva.social 18 points 7 months ago (5 children)

hide a link no one would ever click. if an IP requests the link, it's a ban
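
A minimal sketch of that trap, assuming Flask; the /honeypot path, the inline robots.txt, and the in-memory ban set are all made up for illustration:

    from flask import Flask, request, abort

    app = Flask(__name__)
    banned_ips = set()  # toy ban list; a real setup would persist this or use a firewall

    @app.before_request
    def block_banned():
        if request.remote_addr in banned_ips:
            abort(403)

    @app.route("/robots.txt")
    def robots():
        # Polite crawlers are told to stay out of the trap.
        return "User-agent: *\nDisallow: /honeypot\n", 200, {"Content-Type": "text/plain"}

    @app.route("/honeypot")
    def honeypot():
        # No human ever follows this link; any visitor here ignored robots.txt.
        banned_ips.add(request.remote_addr)
        abort(403)

    @app.route("/")
    def index():
        # The trap link is present in the markup but invisible to people.
        return '<a href="/honeypot" style="display:none" aria-hidden="true"></a>Welcome!'

    if __name__ == "__main__":
        app.run()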

[–] FrankTheHealer@lemmy.world 27 points 7 months ago (2 children)

TIL that robots.txt is a thing

[–] i_have_no_enemies@lemmy.world 8 points 7 months ago (2 children)
[–] wise_pancake@lemmy.ca 55 points 7 months ago* (last edited 7 months ago)

robots.txt is a file available in a standard location on web servers (example.com/robots.txt) which sets guidelines for how scrapers should behave.

That can range from saying "don't bother indexing the login page" to "Googlebot go away".
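
For example, a hypothetical robots.txt covering exactly those two cases:

    User-agent: *
    Disallow: /login        # don't bother indexing the login page

    User-agent: Googlebot
    Disallow: /             # Googlebot go away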

It's also in the first paragraph of the article.

[–] mrnarwall@lemmy.world 17 points 7 months ago (1 children)

Robots.txt is a file that is accessible as part of an HTTP request. It's a configuration file that sets rules for what automated web crawlers are allowed to do. It can set both who is and who isn't allowed. Google is usually the most widely allowed bot, just because its crawler is how Google finds websites for search results. But it's basically the honor system. You could write a scraper today that gets told it doesn't have permission to view a page, ignores that, and still grabs the information.
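
The honor system is visible in how a polite crawler is written: the check happens entirely on the client side, and a rude one simply skips it. A sketch using Python's standard urllib.robotparser (the URLs are placeholders):

    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser()
    rp.set_url("https://example.com/robots.txt")
    rp.read()  # fetch and parse the rules

    url = "https://example.com/private/report.html"
    if rp.can_fetch("MyCrawler", url):
        print("allowed, go ahead")
    else:
        print("disallowed -- a polite crawler stops here")
        # ...a rude one just fetches the page anyway; nothing enforces this.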

[–] KingThrillgore@lemmy.ml 25 points 7 months ago (1 children)

I explicitly have my robots.txt set to block out AI crawlers, but I don't know if anyone else will observe the protocol. They should have tools I can submit a sitemap.xml against to know if I've been parsed. Until they bother to address this I can only assume their intent is hostile, and unless someone serious about building a honeypot exposes the tooling for us to deploy at large, my options are limited.
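
For reference, that kind of blocking looks something like the following. GPTBot (OpenAI), Google-Extended (Google's AI-training opt-out), and CCBot (Common Crawl) are real, documented user-agent tokens, but whether the rule is respected is still entirely up to the crawler:

    User-agent: GPTBot
    Disallow: /

    User-agent: Google-Extended
    Disallow: /

    User-agent: CCBot
    Disallow: /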

[–] phx@lemmy.ca 37 points 7 months ago* (last edited 7 months ago) (6 children)

The funny (in a "wtf" not "haha" sense) thing is, individuals such as security researchers have been charged under digital trespassing laws for stuff like accessing publicly available systems and changing a number in the URL to get access to data that normally wouldn't be exposed, even after doing responsible disclosure.

Meanwhile, companies completely ignore the standards meant to say "you are not allowed to scrape this data" and then use OUR content/data to build up THEIR datasets, including for AI etc.

That's not a "violation of a social contract" in my book, that's violating the terms of service for the site and essentially infringement on copyright etc.

No consequences for them though. Shit is fucked.

[–] FartsWithAnAccent@lemmy.world 14 points 7 months ago

Remember Aaron Swartz

[–] mo_lave@reddthat.com 25 points 7 months ago

Strong "the constitution is a piece of paper" energy right there

[–] lily33@lemm.ee 17 points 7 months ago (2 children)

What social contract? When sites regularly have a robots.txt that says "only Google may crawl", and are effectively helping enforce a monolopy, that's not a social contract I'd ever agree to.

[–] Imgonnatrythis@sh.itjust.works 17 points 7 months ago (3 children)

I had a one-eared rabbit. He was a monolopy.

[–] Yoz@lemmy.world 14 points 7 months ago (4 children)

No laws govern this, so they can do anything they want. Blame boomer politicians, not the companies.

[–] gian@lemmy.grys.it 15 points 7 months ago (1 children)

Why not blame the companies? After all, they are the ones doing it, not the boomer politicians.

And in the long term they are the ones that risk being "punished": just imagine people getting tired of this shit and starting to block them at the firewall level...

[–] Ascend910@lemmy.ml 13 points 7 months ago (1 children)

This is a very interesting read. It is very rare that people on the internet agree to follow one thing without being forced.

[–] echodot@feddit.uk 16 points 7 months ago (1 children)

Loads of crawlers don't follow it; I'm not quite sure why AI companies not following it is anything special. Really it's just there to stop Google indexing random internal pages that mess with your SEO.

It barely even works for all search providers.

[–] autotldr@lemmings.world 9 points 7 months ago

This is the best summary I could come up with:


If you hosted your website on your computer, as many people did, or on hastily constructed server software run through your home internet connection, all it took was a few robots overzealously downloading your pages for things to break and the phone bill to spike.

AI companies like OpenAI are crawling the web in order to train large language models that could once again fundamentally change the way we access and share information.

In the last year or so, the rise of AI products like ChatGPT, and the large language models underlying them, have made high-quality training data one of the internet’s most valuable commodities.

You might build a totally innocent one to crawl around and make sure all your on-page links still lead to other live pages; you might send a much sketchier one around the web harvesting every email address or phone number you can find.

The New York Times blocked GPTBot as well, months before launching a suit against OpenAI alleging that OpenAI’s models “were built by copying and using millions of The Times’s copyrighted news articles, in-depth investigations, opinion pieces, reviews, how-to guides, and more.” A study by Ben Welsh, the news applications editor at Reuters, found that 606 of 1,156 surveyed publishers had blocked GPTBot in their robots.txt file.

“We recognize that existing web publisher controls were developed before new AI and research use cases,” Google’s VP of trust Danielle Romain wrote last year.


The original article contains 2,912 words, the summary contains 239 words. Saved 92%. I'm a bot and I'm open source!
