overview for scruiser

Stubsack: weekly thread for sneers not worth an entire post, week ending 22nd June 2025 in c/techtakes@awful.systems

[–] scruiser@awful.systems 11 points 2 weeks ago (1 children)

We did make fun of titotal for the effort they put into meeting rationalists on their own terms, charitably addressing their arguments, and, you know, being an EA themselves (albeit one of the saner ones)...

Stubsack: weekly thread for sneers not worth an entire post, week ending 22nd June 2025 in c/techtakes@awful.systems

[–] scruiser@awful.systems 13 points 2 weeks ago* (last edited 2 weeks ago) (8 children)

So us sneerclubbers correctly dismissed AI 2027 as bad scifi with a forecasting model basically amounting to "line goes up", but if you end up in any discussions with people who want more detail, titotal did a really detailed breakdown of why their model is bad, even given their assumptions and even as an attempt to model "line goes up": https://www.lesswrong.com/posts/PAYfmG2aRbdb74mEp/a-deep-critique-of-ai-2027-s-bad-timeline-models

tldr; the AI 2027 model, regardless of inputs and current state, has task time horizons basically going to infinity at some near future date because they set it up weird. Also the authors make a lot of other questionable choices and have a lot of other red flags in their modeling. And the picture they had in their fancy graphical interactive webpage for fits of the task time horizon is unrelated to the model they actually used and is missing some earlier points that make it look worse.
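To make the "going to infinity" point concrete, here is a minimal sketch. It assumes, purely for illustration, that each doubling of the task time horizon takes a fixed fraction less calendar time than the previous one. This is a simplification and not the AI 2027 authors' actual code; the names `first_doubling_months` and `shrink` are hypothetical. Under that assumption the doubling times form a geometric series, so infinitely many doublings fit into a finite amount of time, and changing the starting values only moves the date.

```python
def months_until_n_doublings(first_doubling_months: float, shrink: float, n: int) -> float:
    """Total calendar time for n doublings when each doubling takes
    `shrink` times as long as the one before it (0 < shrink < 1)."""
    return sum(first_doubling_months * shrink**k for k in range(n))


def blowup_date_months(first_doubling_months: float, shrink: float) -> float:
    """Limit of the geometric series above: the finite time by which
    infinitely many doublings have already happened."""
    return first_doubling_months / (1.0 - shrink)


if __name__ == "__main__":
    # Hypothetical inputs: first doubling takes 4 months, each later one 10% less.
    for n in (10, 50, 200):
        print(n, round(months_until_n_doublings(4.0, 0.9, n), 3))
    print("blow-up after", blowup_date_months(4.0, 0.9), "months")
```

Doubling the hypothetical `first_doubling_months` doubles the blow-up time but does not remove it, which is the sense in which the outcome does not depend much on the inputs.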

Google's Gemini 2.5 pro is out of beta. in c/techtakes@awful.systems

[–] scruiser@awful.systems 8 points 2 weeks ago

If you wire the LLM directly into a proof-checker (like with AlphaGeometry) or evaluation function (like with AlphaEvolve) and the raw LLM outputs aren't allowed to do anything on their own, you can get reliability. So you can hope for better, it just requires a narrow domain and a much more thorough approach than slapping some extra firm instructions in an unholy blend of markup languages in the prompt.
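A toy sketch of that "propose, then verify" setup, with the caveat that nothing here is AlphaGeometry's or AlphaEvolve's actual code: `unreliable_proposer` is a hypothetical stand-in for the LLM (it just guesses), and `verifier` is a hypothetical stand-in for the proof-checker or evaluation function. The point is only that a raw guess never leaves the loop unless the exact check accepts it.

```python
import random
from typing import Optional, Tuple


def unreliable_proposer(n: int, rng: random.Random) -> Tuple[int, int]:
    """Stand-in for the LLM: guesses a factor pair of n, usually wrongly."""
    p = rng.randint(2, n - 1)
    return p, n // p


def verifier(n: int, candidate: Tuple[int, int]) -> bool:
    """Stand-in for the proof-checker: exact and cheap to run."""
    p, q = candidate
    return p > 1 and q > 1 and p * q == n


def solve(n: int, attempts: int = 1000, seed: int = 0) -> Optional[Tuple[int, int]]:
    """Return only a verified factor pair, or None if none was found."""
    rng = random.Random(seed)
    for _ in range(attempts):
        candidate = unreliable_proposer(n, rng)
        if verifier(n, candidate):
            return candidate  # the only path by which a guess gets out
    return None  # honest failure instead of a confident wrong answer


if __name__ == "__main__":
    print(solve(91))  # some verified factor pair of 91
    print(solve(97))  # 97 is prime, so no guess can pass the check
```

The reliability comes entirely from the checker and the narrow domain; the proposer can be as flaky as it likes.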

In this case, solving math problems is actually something Google search could previously do (before dumping AI into it) and Wolfram Alpha can do, so it really seems like Google should be able to offer a product that does math problems right. Of course, this solution would probably involve bypassing the LLM altogether through preprocessing and post-processing.

Also, btw, LLMs can be (technically speaking) deterministic if the temperature is set all the way down; it's just that this doesn't actually improve their performance at math or anything else. And it would still be "random" in the sense that minor variations in the prompt or previous context can induce seemingly arbitrary changes in output.
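A minimal sketch of what the temperature knob does at the sampling step, assuming the usual softmax-with-temperature scheme. `sample_token` and the example scores are hypothetical, not any vendor's API. At temperature zero it degenerates to always picking the highest-scoring token, which is repeatable, but a tiny change in the scores (as a slightly different prompt might cause) can still flip which token that is.

```python
import math
import random
from typing import Sequence


def sample_token(logits: Sequence[float], temperature: float, rng: random.Random) -> int:
    """Pick a token index from raw scores (logits)."""
    if temperature <= 0.0:
        # Greedy decoding: always the highest-scoring token, no randomness.
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [x / temperature for x in logits]
    top = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    scores = [1.0, 2.0, 2.0001]      # hypothetical scores for three tokens
    nudged = [1.0, 2.0001, 2.0]      # same scores after a tiny perturbation
    print([sample_token(scores, 0.0, rng) for _ in range(5)])  # same index every time
    print(sample_token(nudged, 0.0, rng))                      # a different index
    print([sample_token(scores, 1.0, rng) for _ in range(5)])  # varies run to run
```

Repeatable is not the same as correct: greedy decoding just returns the same answer every time, right or wrong.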

Google's Gemini 2.5 pro is out of beta. in c/techtakes@awful.systems

[–] scruiser@awful.systems 9 points 2 weeks ago (3 children)

Have they fixed it, as in it genuinely uses Python completely reliably, or "fixed" it, as in they tweaked the prompt and now it uses Python 95% of the time instead of 50/50? I'm betting on the latter.

Google's Gemini 2.5 pro is out of beta. in c/techtakes@awful.systems

[–] scruiser@awful.systems 18 points 2 weeks ago

We barely understand how LLMs actually work

I would be careful how you say this. Eliezer likes to go on about giant inscrutable matrices to fearmonger, and the promptfarmers use the (supposed) mysteriousness as another avenue for crithype.

It's true that reverse engineering any specific output or task takes a lot of effort, requires access to the model's internal weights, and hasn't been done for most tasks, but the techniques for doing so exist. And in general there is a good high-level conceptual understanding of what makes LLMs work.

which means LLMs don’t understand their own functioning (not that they “understand” anything strictly speaking).

This part is absolutely true. If you catch them in a mistake, most of their data about responding comes from how humans respond or, at best, from fine-tuning on other LLM output, and they don't have any way of checking their own internals, so the words they say in response to mistakes are just more bs unrelated to anything.

Stubsack: weekly thread for sneers not worth an entire post, week ending 15th June 2025 in c/techtakes@awful.systems

[–] scruiser@awful.systems 15 points 3 weeks ago

Example #"I've lost count" of LLMs ignoring instructions and operating like the bullshit spewing machines they are.

Apple: ‘Reasoning’ AIs fail hard if they actually have to think in c/techtakes@awful.systems

[–] scruiser@awful.systems 6 points 3 weeks ago

https://awful.systems/comment/7326260

Apple: ‘Reasoning’ AIs fail hard if they actually have to think in c/techtakes@awful.systems

[–] scruiser@awful.systems 17 points 3 weeks ago (1 children)

Another thing that's been annoying me about responses to this paper... lots of promptfondlers are suddenly upset that we are judging LLMs by arbitrary puzzle solving capabilities... as opposed to the arbitrary and artificial benchmarks they love to tout.

Stubsack: weekly thread for sneers not worth an entire post, week ending 15th June 2025 in c/techtakes@awful.systems

[–] scruiser@awful.systems 26 points 3 weeks ago (2 children)

So, I've been spending too much time on subreddits with heavy promptfondler presence, such as /r/singularity, and the reddit algorithm keeps recommending me subreddits with even more unhinged LLM hype. One annoying trend I've noted is that people constantly conflate LLM-hybrid approaches, such as AlphaGeometry or AlphaEvolve (or even approaches that don't involve LLMs at all, such as AlphaFold), with LLMs themselves. From there they act like of course LLMs can [insert things LLMs can't do: invent drugs, optimize networks, reliably solve geometry exercises, etc.].

Like I saw multiple instances of commenters questioning/mocking/criticizing the recent Apple paper using AlphaGeometry as a counterexample. AlphaGeometry can actually solve most of the problems without an LLM at all. The LLM component replaces a set of heuristics that make suggestions on proof approaches; the majority of the proof work is done by a symbolic AI working with a rigid formal proof system.

I don't really have anywhere I'm going with this, just something I noted that I don't want to waste the energy repeatedly re-explaining on reddit, so I'm letting a primal scream out here to get it out of my system.

Apple: ‘Reasoning’ AIs fail hard if they actually have to think in c/techtakes@awful.systems

[–] scruiser@awful.systems 10 points 3 weeks ago

Just one more training run bro. Just gotta make the model bigger, then it can do bigger puzzles, obviously!

Apple: ‘Reasoning’ AIs fail hard if they actually have to think in c/techtakes@awful.systems

[–] scruiser@awful.systems 34 points 3 weeks ago* (last edited 3 weeks ago) (7 children)

The promptfondlers on places like /r/singularity are trying so hard to spin this paper. "It's still doing reasoning, it just somehow mysteriously fails when its reasoning gets too long!" or "LRMs improved with an intermediate number of reasoning tokens" or some other excuse. They are missing the point that short and medium length "reasoning" traces are potentially the result of pattern memorization. If the LLMs are actually reasoning and aren't just pattern memorizing, then extending the number of reasoning tokens proportionately with the task length should let the LLMs maintain performance on the tasks instead of catastrophically failing. Because this isn't the case, Apple's paper is evidence for what big names like Gary Marcus, Yann LeCun, and many pundits and analysts have been repeatedly saying: LLMs achieve their results through memorization, not generalization, especially not out-of-distribution generalization.

Deep in Mordor where the shadows lie: Dystopian tales of that time when I sold out to Google in c/techtakes@awful.systems

[–] scruiser@awful.systems 8 points 4 weeks ago

A surprising number of the commenters seem to be at least considering the intended message... which makes the contrast with the number of comments failing at basic reading comprehension that much more absurd (seriously, it's absurd how many comments somehow missed that the author was living in and working from Brazil and felt it didn't reflect badly on them to say as much in the HN comments).