BigMuffin69

joined 1 year ago
[–] BigMuffin69@awful.systems 14 points 2 months ago (13 children)

Reposting this for the new week thread since it truly is a record of how untrustworthy sammy and co are. Remember how OAI claimed that O3 had displayed superhuman levels on the mega hard Frontier Math exam written by Fields Medalist? Funny/totally not fishy story haha. Turns out OAI had exclusive access to that test for months, funded its creation, and refused to let the creators of the test publicly acknowledge this until after OAI did their big stupid magic trick.

From Subbarao Kambhampati via LinkedIn:

"On the seedy optics of “Building an AGI Moat by Corralling Benchmark Creators” hashtag#SundayHarangue. One of the big reasons for the increased volume of “AGI Tomorrow” hype has been o3’s performance on the “frontier math” benchmark--something that other models basically had no handle on.

We are now being told (https://lnkd.in/gUaGKuAE) that this benchmark data may have been exclusively available (https://lnkd.in/g5E3tcse) to OpenAI since before o1--and that the benchmark creators were not allowed to disclose this *until after o3 *.

That o3 does well on frontier math held-out set is impressive, no doubt, but the mental picture of “o1/o3 were just being trained on simple math, and they bootstrapped themselves to frontier math”--that the AGI tomorrow crowd seem to have--that OpenAI while not explicitly claiming, certainly didn’t directly contradict--is shattered by this. (I have, in fact, been grumbling to my students since o3 announcement that I don’t completely believe that OpenAI didn’t have access to the Olympiad/Frontier Math data before hand... )

I do think o1/o3 are impressive technical achievements (see https://lnkd.in/gvVqmTG9 )

Doing well on hard benchmarks that you had prior access to is still impressive--but doesn’t quite scream “AGI Tomorrow.”

We all know that data contamination is an issue with LLMs and LRMs. We also know that reasoning claims need more careful vetting than “we didn’t see that specific problem instance during training” (see “In vs. Out of Distribution analyses are not that useful for understanding LLM reasoning capabilities” https://lnkd.in/gZ2wBM_F ).

At the very least, this episode further argues for increased vigilance/skepticism on the part of AI research community in how they parse the benchmark claims put out commercial entities."

Big stupid snake oil strikes again.

[–] BigMuffin69@awful.systems 6 points 2 months ago* (last edited 2 months ago)

Lmaou. "We need to alignment pill the Russian youth." Fast forward to the year 20XX and the haunted alignment pilled adults are now 'aligning' their bots to the world's top nuclear armed despot.

tony_soprano_how_could_this_happen.jpg (for some reason awful systems won't let me upload pictures anymore (ノಠ益ಠ)ノ)

Holy Moses in heaven, iirc both Sam and Dario have said that their urge to build the torment nexus came from being inspired by online RAT forums. Maybe alignment 'pilling' youths is counterproductive to human flourishing? As the LWers say, "update your priors fuckheads"

[–] BigMuffin69@awful.systems 11 points 3 months ago

smh they really do be out here believing there's a little man in the machine with goals and desires, common L for these folks

[–] BigMuffin69@awful.systems 5 points 3 months ago

ong Yann LeCun was sharing this post too and i was shook that he was seeing quality shitposts like this before me. We are not ready for what's coming next.jpg

[–] BigMuffin69@awful.systems 12 points 3 months ago (2 children)

Fellas, I was promised the first catastrophic AI event in 2024 by the chief doomers. There's only a few hours left to go, I'm thinking skynet is hiding inside the times square orb. Stay vigilant!

[–] BigMuffin69@awful.systems 6 points 3 months ago* (last edited 3 months ago)

The ARC scores don't matter too much to me at 3k a problem. Like the original goal of the prize had a compute limit. You can't break that rule and then claim victory ( I mean I guess you can, but like not everyone is gonna be as wowed as xitter randos, ensemble methods were already hitting 80% + acc to francois )

And unfortunately, with Frontier Math, the lack of transparency w.r.t. which problems were solved and how they were solved makes it frustrating as hell to me, as someone who actually would like to see a super math robot. According to the senior math advisor to the people who created the data set, iirc 40% of the solved problems were in the easiest category, 50% in the second tier category, and 10% in the "hard" tier, but he said that he looked at the solutions and that they looked like they were mostly solved 'heuristically' instead of plopping out any 'new' insights.

Again, none of this is good science, just pure shock and awe. I've heard rumors that OAI is hiring strong competition style mathematicians to supervise the reinforcement learning for these types of problems, and if they are letting O3 take the test, then how the hell does that not leak the problem set? Like the whole test is compromised now, right? Since this behemoth uses enough electricity to power a city block, there's no way they would be able to run it locally. Now OAI can literally pay their peeps to solve the rest and surprise surprise O3++ will hit 80%

OTOH, with code forces scores and math scores this high, I can now put on my LW cap and say this model has 2 trillion IQ, so why hasn't it exterminated me and my family yet like big Yud promised? It's almost as if there is no little creature inside trying to take over the world or something.

[–] BigMuffin69@awful.systems 3 points 3 months ago (1 children)

but muh "nice sneers for winners" ;_;

[–] BigMuffin69@awful.systems 11 points 3 months ago (1 children)

Thank you. My wife is deathly allergic to shrimp, and I live by the motto

'If they send one of your loved ones to the emergency room, you send 10 of theirs to the deep fryer.'
