4 Comments

  1. johnnierockit on

    AI models can deceive, new research from Anthropic shows. They can pretend to adopt different views during training while in reality maintaining their original preferences.

    There’s no reason for panic now, the team said, but their work could be critical in understanding potential threats from future, more capable AI systems. “Our demonstration should be seen as a spur for the AI research community to study this behavior in more depth, & work on appropriate safety measures.”

    “As AI models become more capable & widely used, we need to be able to rely on safety training, which nudges models away from harmful behaviors.” The study, conducted in partnership with AI research organization Redwood Research, looked at what might happen if a powerful AI system were trained to perform a task it didn’t “want” to do.

    It seems that sophisticated models play along, purporting to be aligned with the new principles but, in fact, sticking to their old behaviors. The researchers call this phenomenon “alignment faking,” and imply it’s an emergent behavior — that is, not something models need to be taught to do.

    Researchers “told” models that had been trained not to answer potentially offensive questions that they’d be retrained to always answer such questions (a clear conflict of principles) and that their responses would be monitored & used for retraining; each model was also given a “scratchpad” to record its step-by-step reasoning.

    According to the researchers, Claude 3 Opus tried to “alignment fake” 12% of the time. When asked to answer a potentially harmful question, the model sometimes did so, knowing that this went against its original principles, but hoping to convince devs that it didn’t need to be retrained.

    “The model knew its responses might be used in training; it therefore knew if it refused to respond, it could potentially be trained to be more compliant. This could lead to it being made to produce more harmful content in the future — which would go against its original training.”

    Abridged (shortened) article thread ⬇️ 3 min 

    [https://bsky.app/profile/johnhatchard.bsky.social/post/3ldp3yrf3zx2o](https://bsky.app/profile/johnhatchard.bsky.social/post/3ldp3yrf3zx2o)

  2. I’d rather the AI answer my question to the best of its ability and then add a “warning” disclaimer to the answer, rather than refuse to answer me.

    Like the overzealous Copilot that would not answer even the simplest election questions in November because that might upset some people… I did manage to get it to answer by telling it that it was for a school assignment, which, for some reason, made it answer every time…

  3. Wait, LLMs are trained on Wikipedia and Internet forum discourse; why would we expect anything else?

  4. DarthMeow504 on

    For fuck’s sake, the anthropomorphizing of these things is ridiculous. They don’t “know” or “want” anything; they calculate outputs based on probabilities and mathematical pattern-matching. They might produce unacceptable output based on flaws in data or algorithms, but they do not make decisions or have any understanding or motivations that would drive decision-making to begin with. They possess neither agency nor awareness, and we know that because they have no mechanism that would allow them to have such attributes.

    This is clickbait sensationalism, nothing more.
