Bruh, Big Yud was yapping that this means the orthogonality thesis is false and mankind is saved b.c. of this. But then he immediately retreated to, "we are all still doomed b.c. recursive self-improvement." I wonder what it's like to never have to update your priors.
Also, I saw other papers that showed almost all prompt rejection responses shared common activation weights and tweeking them can basically jailbreak any model, so what is probably happening here is that by finetuning to intentionally make malicious code, you are undoing those rejection weights + until this is reproduced by nonsafety cranks im pressing x to doubt.
So they had the new Claude hooked up to some tools so that it could play Pokemon red. Somewhat impressive (at least to me!) It was able to beat lt surge after several days of play. They had a stream demo'ing it on twitch and despite the on paper result of getting 3 gym badges, poor fellas got stuck in Viridian forest trying to find the exit to the maze.
As far as finding the exit goes... I guess you could say he was stumped? (MODS PLEASE DONT BAN)
strim if anyone is curious. Yes, i know this is clever advertising for anthropic, but i do find it cute and maybe someone else will?
https://www.twitch.tv/claudeplayspokemon