mm_maybe

joined 1 year ago
[–] mm_maybe@sh.itjust.works 2 points 13 hours ago (1 children)

Tell that to the MAGA working class

[–] mm_maybe@sh.itjust.works 1 points 1 day ago* (last edited 1 day ago) (1 children)

Wait, did I miss something? I was under the impression that Vivaldi wouldn't be affected by the Manifest v3 change since their adblocker is independently developed... is that not the case?

[–] mm_maybe@sh.itjust.works 2 points 1 day ago

Yes, and I loved it at first sight--it's the only version of Firefox that feels modern and delivers competitive performance in terms of resource efficiency. I'm backing the project via Patreon and really hope it develops into something even better... though I have to admit that I've mostly switched back to Vivaldi because of its greater customizability, its mobile browser (Zen is desktop-only), and its built-in adblocker, which even works on iOS, unlike uBlock Origin.

[–] mm_maybe@sh.itjust.works 1 points 3 days ago

I think there are some people working on this, and a few groups that have claimed to do it, but I'm not aware of any that actually meet the description you gave. Can you cite a paper or share a link of some sort?

[–] mm_maybe@sh.itjust.works 5 points 1 week ago

It's 100% this. Politics is treated like a sport in the USA: the only thing that matters is your side winning, and which side you root for is largely dictated by location and family history. The private news media encourage this by intentionally covering election campaigns in this manner to boost ratings and ad revenue. Social media only made it worse, because it made abstract identity dimensions, such as political affiliation, feel more salient to people than their everyday lives.

[–] mm_maybe@sh.itjust.works 4 points 1 week ago (1 children)

Y'all should really stop expecting people to buy into the analogy between human learning and machine learning, i.e. "humans do it, so it's okay if a computer does it too". First, there are vast differences between how humans learn and how machines "learn"; second, it doesn't matter anyway, because there is plenty of legal and moral precedent for not granting machines the same rights normally granted to humans (for example, as far as I'm aware, no intellectual property rights have yet been granted to any synthetic media).

That said, I agree that "the model contains a copy of the training data" is not a very good critique. A much stronger one would be to simply point to all of the works in the training data that carry a Creative Commons "No Derivatives" license, since it's hard to argue that the model checkpoint isn't derived from the training data.

[–] mm_maybe@sh.itjust.works 4 points 1 week ago (2 children)

Yeah, I've struggled with that myself, since my first AI detection model was technically trained on potentially non-free data scraped from Reddit image links. The more recent fine-tune of that used only Wikimedia and SDXL outputs, but because it was seeded with the earlier base model, I ultimately decided to apply a non-commercial CC license to the checkpoint. But here's an important distinction: that model, like many of the use cases you mention, is non-generative; you can't coerce it into reproducing any of the original training material--it's just a classification tool. I personally rate those models as much fairer uses of copyrighted material, though perhaps no better in terms of harm from a data dignity or bias propagation standpoint.
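Here's a rough sketch of what I mean by "just a classification tool"--the model name and labels below are illustrative stand-ins, not my actual model:

```python
# Toy sketch of a non-generative classifier: its entire output is a
# probability over two labels, so there is no pathway for it to emit
# any training image. Model name and labels are illustrative.
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForImageClassification

base = "google/vit-base-patch16-224"  # generic ViT backbone, stand-in only
processor = AutoImageProcessor.from_pretrained(base)
model = AutoModelForImageClassification.from_pretrained(
    base,
    num_labels=2,                  # e.g. 0 = "real photo", 1 = "AI-generated"
    ignore_mismatched_sizes=True,  # swap the original 1000-class head for 2
)

img = Image.open("example.jpg").convert("RGB")
inputs = processor(images=img, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # shape (1, 2) -- the model's whole output
print(logits.softmax(-1))            # two probabilities; nothing to "extract"
```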

[–] mm_maybe@sh.itjust.works 16 points 1 week ago (1 children)

> Model sizes are larger than their training sets

Excuse me, what? You think Hugging Face is hosting hundreds of checkpoints, each of which is a multiple of its training data in size--putting them on the order of terabytes or petabytes of disk space each? I don't know if I agree with the compression argument myself, but for other reasons--your retort is objectively false.
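A quick back-of-the-envelope check makes the scale obvious (the numbers below are rough assumptions based on publicly reported Llama-2-7B figures, not measurements of any particular checkpoint):

```python
# Rough sizes, assuming Llama-2-7B-scale numbers (7B params, ~2T tokens).
params = 7e9                       # parameter count
checkpoint_gb = params * 2 / 1e9   # fp16 = 2 bytes/param -> ~14 GB

tokens = 2e12                      # reported training tokens for Llama 2
corpus_gb = tokens * 4 / 1e9       # ~4 bytes of raw text per token -> ~8,000 GB

print(f"checkpoint ~{checkpoint_gb:.0f} GB, corpus ~{corpus_gb:.0f} GB")
print(f"checkpoint is ~{100 * checkpoint_gb / corpus_gb:.2f}% of the corpus")
```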

[–] mm_maybe@sh.itjust.works 13 points 1 week ago (4 children)

I'm getting really tired of saying this over and over on the Internet and getting either ignored or pounced on by pompous AI bros and boomers, but this "there isn't enough free data" claim has never actually been tested. The experiments that have come closest (look up the early Phi and StarCoder papers, or the CommonCanvas text-to-image model) suggest that the claim is false, by showing that a) models trained on small, well-curated datasets can match or outperform models trained on lazily curated large web scrapes, and b) models trained solely on permissively licensed data can perform on par with at least the earlier versions of models trained more lazily (e.g. StarCoder 1.5 performing on par with Code-Davinci).

But yes, a social network or other organization with access to a bunch of data that it owns or has licensed could almost certainly fine-tune a base LLM trained solely on permissively licensed data into a tremendously useful tool--one that would probably be safer and more helpful than ChatGPT for that organization's specific business, at vastly lower risk of copyright claims, or of toxic generated content for that matter.
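To make that concrete, here's a rough sketch of what such a fine-tune could look like--the base model, file path, and hyperparameters are illustrative assumptions on my part, not a recipe:

```python
# Rough sketch: LoRA fine-tune of a permissively-licensed base model on
# text the organization owns. Names, paths, and hyperparameters are
# placeholders, not recommendations.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base = "bigcode/starcoderbase"  # trained on permissively licensed code
tok = AutoTokenizer.from_pretrained(base)
model = get_peft_model(
    AutoModelForCausalLM.from_pretrained(base),
    LoraConfig(r=8, lora_alpha=16, task_type="CAUSAL_LM"),
)

# "internal_docs.txt" stands in for data the org owns or has licensed
ds = load_dataset("text", data_files="internal_docs.txt")["train"]
ds = ds.map(lambda x: tok(x["text"], truncation=True, max_length=512),
            remove_columns=["text"])

Trainer(
    model=model,
    args=TrainingArguments("out", per_device_train_batch_size=2,
                           num_train_epochs=1),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
).train()
```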

[–] mm_maybe@sh.itjust.works 63 points 1 week ago (26 children)

The problem with your argument is that it is 100% possible to get ChatGPT to produce verbatim extracts of copyrighted works. OpenAI has suppressed this in a rather brute-force way, by prohibiting the prompts found so far to trigger it (e.g. the infamous "poetry poetry poetry..." ad infinitum hack), but the possibility is still there, no matter how much they try to plaster over it. In fact, some people much smarter than me see technical similarities between compression technology and the process of training an LLM, calling a model a "blurry JPEG of the Internet"... the point being, you wouldn't allow distribution of a copyrighted book just because it was compressed into a ZIP file first.
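To make the ZIP analogy concrete--as a toy illustration, not a claim that LLM training is lossless (the "blurry JPEG" metaphor exists precisely because it isn't):

```python
# Compressing a text changes its bytes beyond recognition, but the work
# is fully recoverable -- so the ZIP is still a copy. (Training is lossy,
# which is what the "blurry JPEG" metaphor is getting at.)
import zlib

original = b"Call me Ishmael. Some years ago..."  # stand-in for a protected text
compressed = zlib.compress(original)

assert compressed != original                    # looks nothing like the source
assert zlib.decompress(compressed) == original   # ...yet the work is all there
```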

[–] mm_maybe@sh.itjust.works 1 points 1 week ago

Yeah, I would agree that there's something really off about that framework: it just doesn't fit most people's sense of justice or injustice. A synth YouTuber, of all people, made a video about this that I liked, though his proposed solution is about as workable as Jaron Lanier's: https://youtu.be/PJSTFzhs1O4?si=ZvY9yfOuIJI7CVUk

Again, I don't have a proposal of my own, I've just decided for myself that if I'm going to do anything money-making with LLMs in my practice as a professional data scientist, I'll rely on StarCoder as my base model instead of the others, particularly because a lot of my clients are in the public sector and face public scrutiny.

[–] mm_maybe@sh.itjust.works 1 points 1 week ago

Yes, I've written extensively about Phi and other related issues in a blog post, which I'll share here: https://medium.com/@matthewmaybe/data-dignity-is-difficult-64ba41ee9150
