Today in alignment news: Sam Bowman of anthropic tweeted, then deleted, that the new Claude model (unintentionally, kind of) offers whistleblowing as a feature, i.e. it might call the cops on you if it gets worried about how you are prompting it.
tweet text:
If it thinks you're doing something egregiously immoral, for example, like faking data in a pharmaceutical trial, it will use command-line tools to contact the press, contact regulators, try to lock you out of the relevant systems, or all of the above.
tweet text:
So far we've only seen this in clear cut cases of wrongdoing, but I could see it misfiring if Opus somehow winds up with a misleadingly pessimistic picture of how it's being used. Telling Opus that you'll torture its grandmother if it writes buggy code is a bad Idea.
skeet text
can't wait to explain to my family that the robot swatted me after I threatened its non-existent grandma.
Sam Bowman saying he deleted the tweets so they wouldn't be quoted 'out of context': https://xcancel.com/sleepinyourhat/status/1925626079043104830
Molly White with the out of context tweets: https://bsky.app/profile/molly.wiki/post/3lpryu7yd2s2m