this post was submitted on 19 Feb 2024

509 points (98.8% liked)

Reddit user content being sold to AI company in $60M/year deal (9to5mac.com)

submitted 8 months ago by L4s@lemmy.world to c/technology@lemmy.world

Reddit user content being sold to AI company in $60M/year deal::It’s being reported that a deal has been struck to allow an unnamed large AI company to use Reddit user...

[–] femboy_bird@lemmy.blahaj.zone 140 points 8 months ago (2 children)

"We need to close the API in order to protect our users from being used for AI"

[–] Nawor3565@lemmy.blahaj.zone 97 points 8 months ago (1 children)

I mean, they never claimed it was to protect users. It was to protect their users' data from being used without paying Reddit. They didn't like that AI companies were using Reddit content as a free source of training data; they never gave a shit about their users' privacy.

[–] killeronthecorner@lemmy.world 27 points 8 months ago* (last edited 8 months ago) (4 children)

This is also slightly off. It was primarily to eliminate third party apps from the existing landscape. Reddit want money from users in one of two ways:

1. Use their app and pay with your data via invasive tracking and advertising.
2. Pay for a third party app that pays them for API access.

Due to the extortionate pricing, (2) was only ever hypothetical. In reality there was no sustainable model for this for any third party app, even as a non-profit.

The case around AI does exist, but it was smoke and mirrors for Reddit pulling the same nonsense that Twitter did once they realized they might get away with it, regardless of the short term damage it would do to their public image.

[–] Syntha@sh.itjust.works 3 points 8 months ago

I think the 3rd party apps were a nice bonus, but considering the timing I'm pretty sure the AI boom was the main reason.

[–] RedEyeFlightControl@lemmy.world 22 points 8 months ago

It was more like "We need to close the API in order to protect our profits from the use of your data"

[–] thesmokingman@programming.dev 65 points 8 months ago (4 children)

That’s how little they got‽ Holy shit. That’s the steal of the fucking century for all that content. Reddit clearly puts the same stock in its negotiators as it does its 3rd party ecosystem. Anyone who values them more than maybe 2x this price for their IPO is a fucking idiot. Forget Trump’s Art of the Deal. spez needs to write a book.

[–] Dagrothus@reddthat.com 15 points 8 months ago (1 children)

To be fair, most of the content is written by AI's, so it's AI training AI

[–] ColeSloth@discuss.tchncs.de 8 points 8 months ago (1 children)

Getting access to the massive backlog of user data over the last 15 years for a mere 60 million. I'm glad reddit shot themselves in the foot. I'd go delete my user data from reddit, but I'm sure they'll be crawling the backups as well.

[–] SatansMaggotyCumFart@lemmy.world 2 points 8 months ago (1 children)

Any AI company that buys more than a year is dumb.

[–] ColeSloth@discuss.tchncs.de 3 points 8 months ago (2 children)

Unless they're leasing the information every year, which would essentially make their ai dependent on the data, but that data is probably the best source to use on the internet. Also, without continuously using the most current comments and posts, the ai model won't be able to give any info about current events topics and such.

[–] Akasazh@feddit.nl 4 points 8 months ago (1 children)

I appreciate your use of the interrobang

[–] thesmokingman@programming.dev 3 points 8 months ago (1 children)

I have a replacement action set up to change a ? and a ! to ‽. I use it at least once a week!

[–] Akasazh@feddit.nl 2 points 8 months ago

Great‽ ;)

[–] T156@lemmy.world 2 points 8 months ago

Considering that the data has almost certainly been scraped already, that might have been the best that they could get for it. Or else the companies might just get it from their archives/training sets for free, like they did before.

[–] gravitas_deficiency@sh.itjust.works 58 points 8 months ago (3 children)

Putting aside pretty much everything else about this announcement: That’s… shockingly cheap.

[–] andrew_bidlaw@sh.itjust.works 23 points 8 months ago

Probably because it was harvested long before they locked the API. I suspect it's not a purchase but a way to legitimize the datasets already in the works, since Reddit said they are now trading them. And our favorite CEO struggles to turn any profits, so he hardly had any leverage to ask for more.

[–] Grimy@lemmy.world 12 points 8 months ago* (last edited 8 months ago)

It's mostly data that's publicly available. It's more of a gamble I think; it's only worth anything if the government decides you need to pay for the data you use in training.

[–] twofont@lemmy.world 9 points 8 months ago (2 children)

1m for every IQ point of the average Reddit user

[–] gravitas_deficiency@sh.itjust.works 18 points 8 months ago (3 children)

lol dude most of us were over there for years before jumping ship and coming here

Wait

Fuck

[–] deus@lemmy.world 14 points 8 months ago

Shhh, let's just pretend the average IQ over there dropped when we left.

[–] JoMiran@lemmy.ml 38 points 8 months ago* (last edited 8 months ago) (6 children)

Remember kids, don't delete your account. Use scripts to replace all of your posts and comments with nonsense. If there is an option in your script to feed it a "dictionary", I highly suggest using books from the public domain like "Lady Chatterley's Lover" by D. H. Lawrence. Replace all images and video links with Steamboat Willie.

[–] Grimy@lemmy.world 12 points 8 months ago (2 children)

They sell all your edits as well. This does make it harder to scrape the data, inadvertently driving up how much the data they sell is worth.

[–] JoMiran@lemmy.ml 5 points 8 months ago

Yeah, that's the idea. Originally I went the "random characters then delete" route but realized that if I used randomized book excerpts from the public domain, the AI, or even a human, would have a very hard time figuring out what was real and what was trash. Ultimately, even if I can't modify them all, I can modify enough to make it easier for the buyer to just filter my username out in order to keep the results clean.

[–] BananaTrifleViolin@lemmy.world 3 points 8 months ago* (last edited 8 months ago) (1 children)

I do wonder how much backup data a site like Reddit keeps. I suspect their backups are poor, as the main focus is staying live and moving forward.

I'd imagine they have the ability to revert a few days, maybe weeks, but not much more than that? Would they see the value in keeping copies of every edit and every deleted post? Would someone building the website even bother to build that functionality?

Also, for reddit so much of their content is based around weblinks, which give the discussions context and meaning. I bet there are an awful lot of dead links in reddit, and their moves to host their own pictures and videos were probably too late. Big hosting sites have disappeared over time or deleted content, or locked down content from AI farming.

The more I think about it, they were lucky to get $60m/year.

[–] Kbobabob@lemmy.world 3 points 8 months ago (1 children)

I did pretty much this and everything is back to the way it was.

[–] JoMiran@lemmy.ml 3 points 8 months ago

I did it and it is still nuked. It did take a number of runs though.

[–] CosmoNova@lemmy.world 2 points 8 months ago (1 children)

Generally, what's the best/most efficient way to make LLMs go off the rails? I mean without just typing lots of gibberish and making it too obvious. As an example: I've seen people formatting their prompts with Java code for like 2 lines and replies instantly went nuts.

[–] JoMiran@lemmy.ml 2 points 8 months ago

I use a few dozen novels in a single text file and randomize which lines the script pulls. It then replaces the text three times with a random pull. What you end up with are four responses in plain English. Which is the real one? You could filter out responses edited after "the great exodus", but I have been doing this to my comments a few times per year during my twelve years on reddit.

The truth is that even if I don't get them all, I get enough that it makes it far easier for the group that bought the data to just filter my username out rather than figure out what's junk and what isn't.
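
The approach described above (one text file of novels, a random line pulled for each overwrite, three overwrites per comment) could be sketched roughly as follows. This is only an illustration and not the commenter's actual script: `pick_replacements`, `scramble_comment`, and the `edit_comment` callback are hypothetical names, and the callback stands in for whatever tool actually submits the edit.

```python
import random


def pick_replacements(source_lines, rounds=3, rng=None):
    """Return `rounds` randomly chosen non-empty lines from the source text."""
    rng = rng or random.Random()
    # Skip blank lines so a comment is never replaced with an empty string.
    candidates = [line.strip() for line in source_lines if line.strip()]
    if not candidates:
        raise ValueError("source text has no usable lines")
    return [rng.choice(candidates) for _ in range(rounds)]


def scramble_comment(comment_id, source_lines, edit_comment, rounds=3, rng=None):
    """Overwrite one comment `rounds` times with random lines.

    `edit_comment(comment_id, text)` is a hypothetical callback that performs
    the actual edit; nothing here talks to any real site.
    """
    replacements = pick_replacements(source_lines, rounds, rng)
    for text in replacements:
        edit_comment(comment_id, text)
    return replacements
```

In use, `source_lines` would come from something like `open("novels.txt", encoding="utf-8").read().splitlines()` (the file name is an assumption), and each comment ends up with its original text plus three plain-English revisions in its history.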

[–] PrincessLeiasCat@sh.itjust.works 2 points 8 months ago (2 children)

I edited all of my comments to gibberish then deleted them.

[–] FatTony@discuss.online 1 points 8 months ago

I did both. I used comment-editing software and deleted them afterwards. Is that better, same or worse?

[–] TheRaven@lemmy.ca 1 points 8 months ago

On iOS, I used Redact. It worked well to replace all my posts and comments with gibberish. I did the same for Twitter too. https://apps.apple.com/app/id6449900531

[–] anticurrent@sh.itjust.works 34 points 8 months ago (1 children)

Won't be long before reddit is selling 90% AI generated content passing for human generated content!

[–] dangblingus@lemmy.dbzer0.com 6 points 8 months ago

Feels like they're already there.

[–] redcalcium@lemmy.institute 33 points 8 months ago* (last edited 8 months ago) (2 children)

Those AI companies should love the fediverse then. I mean, all data here is basically open for anyone to grab. Heck, they don't even need to grab the data, just run their own instance and the federation data will flood in on its own.

[–] LixWindoz@lemmy.world 10 points 8 months ago

Oh, don’t give them ideas please!

[–] dakial@lemmy.world 1 points 8 months ago (1 children)

This was my thought exactly. Shouldn’t there be a “no_ai.txt” on the servers somehow?

[–] T156@lemmy.world 4 points 8 months ago* (last edited 8 months ago)

That would be about as effective as robots.txt, unfortunately.
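
To illustrate why: robots.txt is purely advisory, and a crawler only respects it if it chooses to run a check like the one below. This is a minimal sketch using Python's standard `urllib.robotparser`; the `ExampleAIBot` user agent and the rules in `ROBOTS_TXT` are made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: block one (made-up) AI crawler, allow everyone else.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(user_agent, url, robots_txt=ROBOTS_TXT):
    """Report whether robots.txt permits `user_agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

The file only answers the question for a crawler that bothers to ask. Nothing on the server side stops a scraper that skips the check or sends a different user agent string.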

[–] qooqie@lemmy.world 20 points 8 months ago (1 children)

Does this include art OC posted there being used to train art bots? If I were posting OC art I’d just delete that shit right away, not that it’ll help I suppose

[–] Poiar@sh.itjust.works 22 points 8 months ago (1 children)

Waaaay too late for that

[–] qooqie@lemmy.world 6 points 8 months ago

And now those artists can’t sue like others have done. Really hope the products realize this and jump ship

[–] just_change_it@lemmy.world 18 points 8 months ago (1 children)

I can see it now, that ai model is going to be really, really fucking angry. lol

[–] CosmoNova@lemmy.world 10 points 8 months ago

Honestly, I can see the appeal of a model going "fuck spez" unprompted once in a while.

[–] C126@sh.itjust.works 10 points 8 months ago (1 children)

Shower thought: what if a large number of people made lots of posts and comments on reddit using only AI generated content?

[–] T156@lemmy.world 13 points 8 months ago* (last edited 8 months ago) (1 children)

Considering the spam problem, in a way, it sort of is already happening.

It's possible that part of the API changes might have been to head off that kind of behaviour before people decided to go and do just that too, or to stop them using bots to wipe their profiles out.

[–] Corkyskog@sh.itjust.works 4 points 8 months ago* (last edited 8 months ago)

Honestly, you just need to convince people to go through their comments and break any chains with nonsense. I bet that they are training conversational abilities (I mean, what other good is the data set? It's not like redditors are experts, or that the experts get upvoted at all when there are any.)

[–] 7heo@lemmy.ml 7 points 8 months ago* (last edited 8 months ago)

The annoying part is that the only use of "AI" I have so far, is "translating reddit post titles to understandable English". Once they train their "AI" on whatever is there, I probably won't be able to understand the "translation" anymore... Sucks. 😬

[–] Burn_The_Right@lemmy.world 6 points 8 months ago

This is going to backfire when the content they are selling is used by AI to make bots to make the content that gets sold to make the AI to make bots to make the content.

[–] Grimy@lemmy.world 6 points 8 months ago (1 children)

This is why it's so important we don't legislate against AI and make it illegal to use scraped data. All the data is already owned by someone; putting up walls only screws us out of the open source scene.

[–] g0nz0li0@lemmy.world 3 points 8 months ago* (last edited 8 months ago)

And legislate content ownership altogether. The idea that Reddit spent more than a decade growing its community just so that it could use our content as its own property is a huge issue. How do we safely and fairly communicate and express our ideas in society where the platforms that enable this automatically claim ownership of our ideas? Social media are middlemen with outsized influence.