this post was submitted on 17 Feb 2024
1058 points (98.6% liked)
Technology
They've also got vote counts and breakdowns of who is making those votes. This data will be worth more for AI training than any similar volume of data other than maybe the contents of Wikipedia. Assuming they didn't have it set up to delete the vote breakdowns when they archived threads.
Why are those breakdowns worth so much? Because they can be used to build profiles on each voter (including those who only had lurker accounts to vote with), so they can build AIs that know how to speak with the MAGA cult, Republicans who aren't MAGA, liberals, moderates, centrists, socialists, communists, anarchists. Not only that, they'll be able to look at how sentiments about various things changed over time with each of these groups, watch people move from one to another as their opinions evolved, see how someone pretends to be a member of whatever group (assuming they voted honestly and posted under their fake persona).
Oh, and also: all of that data is available through the fediverse, and it's free to train on for anyone who sets up a server. Which makes me question whether the fediverse is a good thing, because even changing federation to opt-in instead of opt-out just covers whether your server accepts data from another. It's always shared.
Open and private are on opposite sides of a spectrum. You can't have both, best you can do is settle for something in the middle.
I'd argue it's good, because it means open source AI has a fighting chance with FOSS data to train on without needing to fork over a morbillion dollars to Reddit's owners.
Whatever use cases the reddit data can train on, FOSS researchers can repeat it on Lemmy data and release free models that average joes can use on their own without having to subscribe to shit like Microsoft Copilot and friends to stay relevant.
What if reddit also kept all deleted comments and posts? I'm sure there are shit loads of things people type out just to delete, thinking all the while it'll never see the light of day.
I'd be surprised if they don't keep all of that. There were a number of sites for looking at deleted posts. They'd just go and grab everything and compare what was still there with what wasn't and highlight the stuff that wasn't there anymore.
Which is also possible here, though the mod log reduces the need for it. But if someone is looking for posts people changed their mind about wanting anyone to see, deleting them highlights them instead of hiding them for anyone who is watching for that.
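For anyone curious what that "grab everything and compare" step looks like, here's a rough sketch. It assumes you already have two snapshots of a thread as dicts mapping a comment ID to its text; the IDs, the text, and the function name are all made up for illustration, and this is not how any particular site actually did it.

```python
def find_deleted(earlier: dict[str, str], later: dict[str, str]) -> dict[str, str]:
    """Return comments that were in the earlier snapshot but are gone from the later one."""
    return {cid: text for cid, text in earlier.items() if cid not in later}


# Hypothetical snapshots of the same thread, taken at two different times.
earlier = {"c1": "first comment", "c2": "something regrettable", "c3": "third comment"}
later = {"c1": "first comment", "c3": "third comment"}

deleted = find_deleted(earlier, later)
print(deleted)
```

The point is that the deleted comment is the only thing the diff returns, so deleting makes it stand out to anyone who kept the earlier copy.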
I think that site was unddit, but yes, those were posted and then later deleted. I'm talking about just typing out a post or comment and never posting it, simply backing out of the page or hitting cancel. I'm not sure if any of that is stored on the site or just locally.
You would be able to tell by monitoring the network tab of the browser developer tools. If post requests are being made (which they probably are, though I’m too lazy to go check) while you are typing a comment, they are most likely saving work in progress records for comments.
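If you don't want to eyeball the network tab, most browsers let you export it as a HAR file (a JSON log of the requests), and you can filter that for POSTs. A minimal sketch, assuming the standard HAR layout (`log.entries[].request.method` and `.url`); the sample data and URLs below are invented, and in practice you would load the exported file with `json.load` instead.

```python
def post_requests(har: dict) -> list[tuple[str, str]]:
    """List (timestamp, url) for every POST request recorded in a HAR capture."""
    entries = har.get("log", {}).get("entries", [])
    return [
        (entry["startedDateTime"], entry["request"]["url"])
        for entry in entries
        if entry["request"]["method"].upper() == "POST"
    ]


# Invented capture: one page load and one POST made while typing a comment.
sample_har = {
    "log": {
        "entries": [
            {
                "startedDateTime": "2024-02-17T10:00:00.000Z",
                "request": {"method": "GET", "url": "https://example.invalid/thread/1"},
            },
            {
                "startedDateTime": "2024-02-17T10:00:05.000Z",
                "request": {"method": "POST", "url": "https://example.invalid/api/draft"},
            },
        ]
    }
}

posts = post_requests(sample_har)
print(posts)
```

POSTs that show up before you ever hit submit would be the thing to look at, though you'd still need to check the request bodies to see whether the draft text is actually in them.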
Oh, yeah, I've wondered the same myself. Hell, that might have been a motivation for removing the API access.
They definitely do, it's common for such systems to never actually delete anything because storage is cheap. It's likely just flagged deleted=true, and the searches just return WHERE [post].Deleted = False on queries on the backend. So it looks deleted to the consumer, but it's all saved and squirreled away on the backend.
It's good to keep all this shit for both legal reasons (if someone posts illegal stuff then deletes it, you still can give it to the feds), as well as auditing (mods can't just delete stuff to cover it up, the original still exists and admins can see it)
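To make the soft-delete idea concrete, here's a toy version using an in-memory SQLite database. The table and column names are made up for the example, and this is only a guess at the general pattern, not a claim about what reddit's backend actually looks like.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE post ("
    "id INTEGER PRIMARY KEY, body TEXT, deleted INTEGER NOT NULL DEFAULT 0)"
)
conn.executemany(
    "INSERT INTO post (body) VALUES (?)",
    [("hello",), ("oops, regret this",)],
)


def soft_delete(conn: sqlite3.Connection, post_id: int) -> None:
    """'Delete' a post by flagging it; the row itself is never removed."""
    conn.execute("UPDATE post SET deleted = 1 WHERE id = ?", (post_id,))


def visible_posts(conn: sqlite3.Connection) -> list[str]:
    """What regular users see: only rows that are not flagged as deleted."""
    rows = conn.execute("SELECT body FROM post WHERE deleted = 0 ORDER BY id")
    return [row[0] for row in rows]


def all_posts(conn: sqlite3.Connection) -> list[str]:
    """What an admin or auditor could see: every row, flagged or not."""
    return [row[0] for row in conn.execute("SELECT body FROM post ORDER BY id")]


soft_delete(conn, 2)
print(visible_posts(conn))
print(all_posts(conn))
```

The user-facing query hides the flagged row, while the unfiltered query still returns it, which is the whole trick.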
This is how system storage works generally: the disk "de-lists" the data in the block registry, so it appears there is no data in that block.
Obviously a server back end is keeping it for redundancy and not efficiency, but procedurally it's the same.
The problem (for most) was never that people's public posts/comments were being used for AI training, it was that someone else was claiming ownership over them and being paid for access, and the resulting AI was privately owned. The fediverse was always about avoiding the pitfalls of private ownership, not privacy.
It's exhausting constantly being "that guy," but it really needs to be said constantly; private ownership is at the core of nearly every major issue in the 21st century.
The same goes for piracy and copyright. The same goes for DMCA circumvention and format shifting content you own. The same goes for proprietary tech ecosystems and walled gardens. Private ownership is at the core of the most contentious practices in the 21st century, and if we don't address it shit like this will just keep happening.