Both are happening. Samples of casual writing are more valuable than research papers for generating an article, though.
Asklemmy
A loosely moderated place to ask open-ended questions
Yeah. Scientific papers may teach an AI about science, but Reddit posts teach AI how to interact with people and "talk" to them. Both are valuable.
Hopefully not too pedantic, but no one is "teaching" AI anything. They're just feeding it data in the hopes that it can learn probabilities for certain types of output. It "understands" neither the Reddit post nor the scientific paper.
Because AI needs a lot of training data to reliably generate something appropriate. It's easier to get millions of reddit posts than millions of research papers.
Even then, an LLM simply generates text but has no idea what the text means. It just knows those words have a high probability of matching the expected response. It doesn't check that what was generated is factual.
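To make the "high probability" point concrete, here is a minimal sketch in Python. It is a toy, not how any real LLM is built: the `NEXT_WORD_PROBS` table and the `generate` function are made up for illustration, and a real model computes its probabilities with a neural network over a much longer context. The point is only that sampling likely next words involves no step that checks facts.

```python
import random

# Hypothetical toy table: probability of the next word given the previous word.
# A real LLM learns such probabilities from data; these numbers are invented.
NEXT_WORD_PROBS = {
    "the": {"moon": 0.5, "cat": 0.5},
    "moon": {"is": 1.0},
    "cat": {"is": 1.0},
    "is": {"cheese": 0.6, "rock": 0.4},
}


def generate(start, max_words, rng):
    """Extend `start` one word at a time by sampling from the probability table."""
    words = [start]
    while len(words) < max_words and words[-1] in NEXT_WORD_PROBS:
        options = NEXT_WORD_PROBS[words[-1]]
        choices = list(options)
        weights = [options[word] for word in choices]
        # Pick the next word in proportion to its probability. Nothing here
        # checks whether the resulting sentence is true.
        words.append(rng.choices(choices, weights=weights, k=1)[0])
    return words


print(" ".join(generate("the", 10, random.Random())))
```

With this table, "the moon is cheese" is a perfectly likely output, simply because "cheese" has a high probability after "is".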
I find it amusing that everyone is answering the question with the assumption that the premise of OP's question is correct. You're all hallucinating the same way that an LLM would.
LLMs are rarely trained on a single source of data exclusively. All the big ones you find will have been trained on a huge dataset including Reddit, research papers, books, letters, government documents, Wikipedia, GitHub, and much more.
Example datasets:
Rules of lemmy
Ignore facts, don't do research to see if the comment/post is correct, don't look at other comments to see if anyone else has corrected the post/comment already, there is only one right side (and that is the side of the loudest group)
"AI" is a parlor trick. Very impressive at first, then you realize there isn't much to it that is actually meaningful. It regurgitates language patterns, patterns in images, etc. It can make a great Markov chain. But if you want to create an "AI" that just mines research papers, it will be unable to do useful things like synthesize information or describe the state of a research field. It is incapable of critical or analytical approaches. It will only be able to answer simple questions with dubious accuracy and to summarize texts (also with dubious accuracy).
Let's say you want to understand research on sugar and obesity using only a corpus of peer-reviewed articles. You want to ask something like, "what is the relationship between sugar and obesity?". What will LLMs do when you ask this question? Well, they will just attempt to make associations and construct reasonable-sounding sentences based on their set of research articles. They might even just take an actual sentence from an article and reframe it a little, just like a high schooler trying to get away with plagiarism. But they won't be able to actually explain the underlying mechanisms, and they will fall flat on their face when trying to discern nonsense funded by food lobbies from critical research. LLMs do not think or criticize. If they do produce an answer that suggests controversy, it will be because they either recognized diversity in the papers or, more likely, their corpus contains review articles that criticize articles funded by the food industry. But they will be unable to actually criticize the poor work or provide a summary of the relationship between sugar and obesity based on any actual understanding, one that questions, for example, whether this is even a valid question to ask in the first place (bodies are not simple!). They can only copy and mimic.
They might even just take an actual sentence from an article and reframe it a little
That's the case for many things that can be answered via Stack Overflow searches. Even the order in which GPT-4o brings up points is exactly the same as in SO answers or comments.
Yeah it's actually one of the ways I caught a previous manager using AI for their own writing (things that should not have been done with AI). They were supposed to write about something in a hyper-specific field and an entire paragraph ended up just being a rewording of one of two (third party) website pages that discuss this topic directly.
Why does everyone keep calling them Markov chains? They're missing ~~all the required properties, including~~ the eponymous Markovian property. Wouldn't it be more correct to call them stochastic processes?
Edit: Correction, turns out the only difference between a stochastic process and a Markov process is the Markovian property. It's literally defined as "stochastic process but Markovian".
Because it's close enough. Turn off beam and redefine your state space and the property holds.
Why settle for good enough when you have a term that is both actually correct and more widely understood?
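For anyone following the "redefine your state space" argument, here is a rough sketch of what that means, using a toy word-level chain in Python. It is an illustration under stated assumptions, not a claim about how LLMs work internally: `build_chain`, `step`, and the tiny corpus are all made up for the example. The idea is that if the "state" is the last k words rather than just the last one, the next word depends only on the current state, which is the Markovian property.

```python
import random
from collections import defaultdict


def build_chain(words, k):
    """Map each state (a tuple of k consecutive words) to the words seen right after it."""
    chain = defaultdict(list)
    for i in range(len(words) - k):
        state = tuple(words[i:i + k])
        chain[state].append(words[i + k])
    return dict(chain)


def step(chain, state, rng):
    """Sample the next word from the current state only, then slide the window."""
    next_word = rng.choice(chain[state])
    return next_word, state[1:] + (next_word,)


# Toy corpus, invented for illustration.
corpus = "a b a c a b a c".split()

# With a one-word state, what follows "a" looks random: sometimes "b", sometimes "c".
chain1 = build_chain(corpus, 1)

# With a two-word state, the extra history is part of the state itself,
# so the current state alone determines what comes next.
chain2 = build_chain(corpus, 2)

word, state = step(chain2, ("b", "a"), random.Random())
print(word, state)
```

In this corpus the word after "a" depends on what came before it, so a one-word state is not enough. Folding that history into the state (here, two words) restores the property, at the cost of a much larger state space.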
You could feed all the research papers in the world to an LLM and it will still have zero understanding of what you trained it on. It will still make shit up, it can't save the world.
Training it on research papers wouldn't make it smarter, it would just make it better at mimicking their writing style.
Don't fall for the hype.
Short answer: they already are
Slightly longer answer: GPT models like ChatGPT are part of an experiment in "if we train the AI model on shedloads of data, does it make a more powerful AI model?" After OpenAI made such big waves, every company is copying them, including by training models similar to ChatGPT rather than trying to innovate and do more.
Even longer answer: There's tons of different AI models out there for doing tons of different things. Just look at the over 1 million models on Hugging Face (a company which operates as a repository for AI models among other services) and look at all of the different types of models you can filter for on the left.
Training an image generation model on research papers probably would make it a lot worse at generating pictures of cats, but training a model that you want to either generate or process research papers on existing research papers would probably make a very high quality model for either goal.
More to your point, there are some neat, very targeted models with smaller training sets out there, like Microsoft's Phi-3 model, which is primarily trained on heavily filtered, textbook-quality data.
As for saving the world, I'm curious what you mean by that exactly? These generative text models are great at generating text similar to their training data, and summarization models are great at summarizing text. But ultimately AI isn't going to save the world. Once the current hype cycle dies down AI will be a better known and more widely used technology, but ultimately it's just a tool in the toolbox.
also the answer to that question, shitloads of data for a better ai, is yes... with logarithmic returns. massively underpriced (by cost to generate) returns that have questionable value statement at best.
Redditors are always right, peer reviewed papers always wrong. Pretty obvious really. :D
editor's note: it will not save the world
Who is "we"? My understanding is LLMs are mostly being trained on a large amount of publicly available texts, including both reddit posts and research papers.
They already do that. You're being a troglodyte.
Hmmm. Not sure if I'm being insulted. Is that one of those fish fossils that looks kind of like a horseshoe crab?
You're thinking of a trilobite
Papers are, most importantly, documentation of exactly what procedure was performed and how. Adding a vagueness filter over that is only going to decrease its value infinitely.
Real question is why are we using generative ai at all (gets money out of idiot rich people)
They're trained on both, and the kitchen sink.
The Ghost of Aaron Swartz
What he was fighting for was an awful lot more important than a tool to write your emails while causing a ginormous tech bubble.
They are. T&F recently cut a deal with Microsoft. Without the authors' consent, of course.
I'm fairly sure a few others have too, but that's the only article I could find quickly.
We are. I just read an article yesterday about how Microsoft paid research publishers so they could use the papers to train AI, with or without the consent of the papers' authors. The publishers also reduced the peer review window so they could publish papers faster and get more money from Microsoft. So... expect AI to be trained on a lot of sloppy, poorly-reviewed research papers because of corporate greed.
Anyone running a webserver and looking at their logs will know AI is being trained on EVERYTHING. There are so many crawlers for AI that are literally ripping the internet wholesale. Reddit just got in on charging the AI companies for access to freely contributed content. For everyone else, they're just outright stealing it.
Saving the world isn't profitable in the short term.
Vulture capitalists don't care about the future. They care about the immediate. Short term profitability. And nothing else.
Because they are looking for conversations.
I saw an article about one trained on research papers. (Built by Meta, maybe?) It also spewed out garbage: it would make up answers that mimicked the style of the papers but had its own fabricated content! Something about the largest nuclear reactor made of cheese in the world...
Brain damage is cheaper than professionals
Who's going to peer review that?
Tons of people already are. The following site is useful for searching papers using AI: https://consensus.app/
How does that help disempower the fossil fuel mafia?
Part of it is the same "human speech" aspects that have plagued NLP work over the past few years. Nobody (except the poor postdoctoral bastard who is running the paper farm for their boss) actually speaks in the same way that scholarly articles are written because... that should be obvious.
This combines with the decades of work by right wing fascists to vilify intellectuals and academia. If you have ever seen (or written) a comment that boils down to "This youtuber sounds smug" or "They are presenting their opinion as fact" then you see why people prefer "natural human speech" over actual authoritatively researched and tested statements.
And... while not all pay to publish journals are trash, I feel confident saying that most are. And filtering those can be shockingly hard by design.
But the big one? Most of the owners of the various journals are REALLY fucking litigious and will go scorched earth on anyone who is using their work (because Elsevier et al own your work) to train a model.
money. there's no money in saving the world. lots of money in not saving the world.
greed will be humanity's downfall
Because the broken English and relatively rigid style of research papers will be even worse than Reddit posts
Came to wonder about this.
The few I've seen weren't shining examples of the language, and could have used some editing.
As well, rumours abound that a lot of papers are available before review, and that's likely to cause some harm if we trust a model predicting on bad data.
(Yes, I know: reddit isn't going to be better; but it has its own warning because, well, Reddit)
I think I read this post wrong.
I was thinking the sentence "We could be saving the world!" meant 'we' as in humans only.
No need to be training AI. No need to do anything with AI at all. Humans simply start saving the world. Our Research Papers can train on Reddit. We cannot be training, we are saving the world. Let the Research Papers run a train on Reddit AI. Humanity Saves World.
No cynical replies please.
AuroraGPT. They are trying to do it.
It's because the number of people who can read and understand research papers, and then create the dataset needed to train and test the LLM, is very, very small, whereas data for pop culture is easier to source.