Tamper-Resistant Safeguards for Open-Weight LLMs

Title: Tamper-Resistant Safeguards for Open-Weight LLMs
Presenter: Dmitrii Gusev
Abstract: Rapid advances in the capabilities of large language models (LLMs) have raised widespread concerns regarding their potential for malicious use. Open-weight LLMs present unique challenges, as existing safeguards lack robustness to tampering attacks that modify model weights. For example, recent works have demonstrated that refusal and unlearning safeguards can be trivially removed with a few steps of fine-tuning. These vulnerabilities necessitate new approaches for enabling the safe release of open-weight LLMs. The authors develop a method, called TAR, for building tamper-resistant safeguards into open-weight LLMs such that adversaries cannot remove the safeguards even after thousands of steps of fine-tuning. In extensive evaluations and red teaming analyses, they find that their method greatly improves tamper-resistance while preserving benign capabilities. Their results demonstrate that tamper-resistance is a tractable problem, opening up a promising new avenue to improve the safety and security of open-weight LLMs.
Paper link:
Disclaimer: The presenter is not one of the paper's authors.