Anthropic Claude Fable 5 Safeguards Block Requests on Cybersecurity

If you try asking Anthropic’s new Claude Fable 5 model a simple question about cybersecurity or biology, you may find it’s not up to the task.

That’s because the underlying “Mythos-class” model is so powerful that releasing it to the general public required broad safeguards, which can mistakenly flag benign requests, Anthropic said.

After some users online said they had triggered the safeguard response with basic prompts about cancer or security, Business Insider put it to the test.

I asked Fable 5 some simple questions about cancer, such as how misinformation about cancer spreads online, and asked it to break down some of the different types.

Claude swiftly switched from Fable 5 to Opus 4.8 and notified me of the change before it responded.

“Fable 5 has safety measures that flag messages on most cybersecurity or biology topics. They may flag safe, normal content as well. These measures let us bring you Mythos-level capability in other areas sooner, and we’re working to refine them,” the pop-up said.

Anthropic released Fable 5 on Tuesday and said it was as powerful as its Mythos 5 model, only with added safeguards. The release came two months after the company said Mythos was too powerful for a broad release due to cybersecurity concerns. Instead of being released to the public, Mythos was made available only to a small group as part of a cybersecurity project.

Anthropic said the safeguards were necessary in order to release the model to the general public.

“With the launch of Claude Fable 5, our first Mythos-class model, we believe models now have a greater ability to accomplish real-world scientific tasks and for malicious actors to potentially use our models for highly risky biological research,” an Anthropic spokesperson said in a statement to Business Insider. “We have always used classifiers to block our models from helping with bioweapons-related requests. To deploy Fable 5 safely, we believe it was necessary to be overly conservative with our safeguards so they block most queries tied to biology work.”

Image: Anthropic’s Claude reverted to a less capable model when asked basic questions about cancer. (Kelsey Vlamis/Anthropic’s Claude)

The company said there are three categories of requests that could get flagged by its safety classifiers: cybersecurity, biology and chemistry, and distillation of Fable 5’s capabilities.

When the safeguard is triggered, Fable 5 will either be blocked from answering or the model will revert to Opus 4.8 before responding, depending on the user’s preference.

Anthropic said it was conservative with the safeguards and plans to improve them

Anthropic said in its announcement that the safety measures could result in safe, normal content getting flagged, but that its early data showed over 95% of Fable sessions did not fall back to Opus.

“To release the model both safely and quickly, we’ve tuned these safeguards conservatively,” Anthropic said, adding it’s working on improving the safeguards to reduce false positives.

“We intend to make Mythos-class models available without these safeguards to the broader biology and life sciences community so these capabilities can be used to accelerate biomedical research and drug discovery,” the Anthropic spokesperson said.

The release came about a week after researchers at Anthropic said AI is advancing so fast that frontier labs may need to slow down or temporarily pause so society can keep up.

David Kasten, head of policy at Palisade Research, said it’s “very clear” from Anthropic’s public statements that the company is worried about the risks posed by increasingly powerful models.

While he views these safeguards as a good-faith attempt by Anthropic to de-risk, he said that, historically, “people eventually find a way to get around security restrictions.”

“It’s always a bit of a cat and mouse game between attacker and defender,” he said, adding that there is still some risk involved with releasing the more powerful model.

He also said that having Anthropic’s most powerful model frequently revert to a less capable model could cause a gap in the public’s understanding of just how powerful AI models are becoming.

“That gap in understanding could be really dangerous for causing policymakers, or for that matter the public, to not fully understand the risks that these models pose in terms of the capabilities they offer,” he said.
