Please use this identifier to cite or link to this item: http://hdl.handle.net/11531/110126
Title : Stopping LLMs from Going Rogue: A Control Barrier Approach to Text Generation
Author : Silvestre, Joao Pedro
Rodríguez Abella, Álvaro
Tabuada, Paulo
Publication date : 25-Dec-2025
Publisher : IEEE (Rio de Janeiro, Brazil)
Abstract :
The rapid integration of large language models (LLMs) into our everyday lives has outpaced safety considerations aimed at protecting users from toxic outputs and preventing malicious actors from generating harmful text at scale. As a result, LLMs have been exploited by bots capable of producing vast amounts of harmful and toxic content, enabling users to manipulate online opinions and, in some cases, create dangerous online environments. Our work addresses this issue by developing a framework for designing safety filters that preclude toxic outputs. To achieve this, we leverage Control Barrier Functions (CBFs), which enable the design of closed-loop systems that remain safe. We consider the continuous-time model of an LLM, where tokens are regarded as the state of the model, and prove that by only controlling the first token, any function satisfying mild assumptions becomes a CBF. Our approach can be utilized to design LLMs capable of ensuring the safety of their outputs without significantly affecting the original model’s behavior.
Description : Book chapters
URI : 10.1109/CDC57313.2025.11312450
Appears in collections: Articles



Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.