ah, the NP-complete problem of just fucking pulling the file into memory (there’s no way this clown was burning a rainforest asking ChatGPT for a memory-optimized way to do this),
It's worse than that, because there have been incredibly simple, efficient ways to k-sample a stream for decades, with all sorts of guarantees about the distribution and no need to buffer anything beyond the k items you keep. And it took me all of 1 minute with a traditional search engine to find all kinds of articles detailing this.
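For anyone curious, the textbook version of this is reservoir sampling (often called Algorithm R). Here's a minimal sketch of it in Python. The function name and signature are my own for illustration, not from any particular library, and it assumes you want a uniform sample of k items without replacement from a stream of unknown length:

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(
    stream: Iterable[T], k: int, rng: Optional[random.Random] = None
) -> List[T]:
    """Return up to k items chosen uniformly at random from stream.

    Makes a single pass and holds at most k items in memory.
    (Hypothetical helper name, written for this example.)
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    rng = rng or random.Random()
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Pick a slot in [0, i]; the new item survives with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir
```

Usage is just `reservoir_sample(open("big.log"), 10)` or similar (the filename is made up): the file is read line by line and never fully loaded. If the stream has fewer than k items, you get all of them back.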
If you can't be bothered to learn a thing, it isn't surprising when you end up worshiping the magic of the thing.
Currently. Though I think there is a future where adversarial machine learning might greatly increase the cost of training on pilfered data, by encoding human-generated inputs in a way that runs counter to the training algorithms.
https://glaze.cs.uchicago.edu/