Estudio Comparativo y Desarrollo de Algoritmos de Reward Learning y Goal Conditioned Learning: Análisis y Aplicación.

Lozano Mendoza, Adolfo

Please use this identifier to cite or link to this item: http://hdl.handle.net/11531/97182

Full metadata record

DC Field	Value	Language
dc.contributor.advisor	Güitta López, Lucía	es-ES
dc.contributor.advisor	López López, Álvaro Jesús	es-ES
dc.contributor.author	Lozano Mendoza, Adolfo	es-ES
dc.contributor.other	Universidad Pontificia Comillas, Escuela Técnica Superior de Ingeniería (ICAI)	es_ES
dc.date.accessioned	2025-02-03T08:27:16Z	-
dc.date.available	2025-02-03T08:27:16Z	-
dc.date.issued	2025	es_ES
dc.identifier.uri	http://hdl.handle.net/11531/97182	-
dc.description	Bachelor's Degree in Mathematical Engineering and Artificial Intelligence	es_ES
dc.description.abstract	This work presents a comparative analysis of three representative Reward Learning algorithms (MCE-IRL, AIRL and Text2Reward), together with a review of Goal Conditioned Learning (GCL). The aim is to assess their strengths, limitations and possible applications, and to provide usage guidelines for future researchers. MCE-IRL, based on maximum causal entropy, yields interpretable and faithful rewards, although only in low-complexity tabular environments. Its results show high fidelity to the expert trajectories, but computational costs grow exponentially as the environment size increases. AIRL, in contrast, uses an adversarial approach with a generator and a discriminator. It shows strong imitation ability even with few trajectories, but the learned rewards do not generalize well enough to train agents from scratch. Text2Reward (T2R), supported by language models such as GPT-4o and DeepSeek-o1, generates dense reward functions from textual descriptions. These were used to train SAC and PPO agents that achieved scores above 3500 in Hopper, with SAC standing out for its stability. The combination of T2R and AIRL produced an expert in Hopper-Seals with unprecedented results. In addition, GCL approaches are reviewed. Goal GAN uses adversarial networks to generate goals of increasing difficulty, accelerating learning in sparse-reward environments. RIG, based on variational autoencoders, makes it possible to train agents from images without explicit rewards, defining goals in a latent space. In conclusion, MCE-IRL provides interpretability, AIRL ensures stable imitation, and T2R opens a promising horizon thanks to LLMs, combining advantages in generalization and efficiency. As future work, the study proposes broadening the exploration of language models and the practical implementation of GCL algorithms.	es-ES
dc.description.abstract	This work presents a comparative analysis of three representative algorithms in the field of Reward Learning (MCE-IRL, AIRL, and Text2Reward), along with a review of Goal Conditioned Learning (GCL). The main objective is to evaluate their strengths, limitations, and potential applications, offering guidelines for future research. MCE-IRL, based on maximum causal entropy, produces interpretable and accurate reward functions, though it is restricted to small, tabular domains. Its results closely match expert trajectories, but computational costs grow exponentially with environment size, limiting scalability. AIRL, following an adversarial framework with generator and discriminator, demonstrates strong imitation capabilities even with few trajectories. However, the learned reward functions do not generalize well enough to train agents from scratch, serving more as identifiers of expert-like behavior. Text2Reward (T2R), leveraging large language models such as GPT-4o and DeepSeek-o1, generates dense reward functions from textual prompts describing the environment and the desired behavior. Agents trained with these functions in Hopper achieved exceptional performance, with SAC trained on GPT rewards reaching over 3500 points with remarkable stability. The combination of T2R and AIRL produced a state-of-the-art expert in the modified Hopper-Seals environment. Additionally, the study reviews GCL approaches. Goal GAN employs adversarial networks to generate progressively challenging goals, accelerating learning in sparse-reward settings. RIG, based on variational autoencoders, enables goal-conditioned policies directly from images without explicit reward signals, defining objectives in a latent space and achieving results close to oracle performance. In conclusion, MCE-IRL provides interpretability in simple settings, AIRL ensures robust imitation with limited data, and T2R highlights the potential of LLMs to produce dense, effective reward functions. Future work should further explore language model–based approaches and practical implementations of GCL algorithms.	en-GB
dc.format.mimetype	application/pdf	es_ES
dc.language.iso	es-ES	es_ES
dc.rights	Attribution-NonCommercial-NoDerivs 3.0 United States	es_ES
dc.rights.uri	http://creativecommons.org/licenses/by-nc-nd/3.0/us/	es_ES
dc.subject.other	KMI	es_ES
dc.title	Estudio Comparativo y Desarrollo de Algoritmos de Reward Learning y Goal Conditioned Learning: Análisis y Aplicación.	es_ES
dc.type	info:eu-repo/semantics/bachelorThesis	es_ES
dc.rights.accessRights	info:eu-repo/semantics/openAccess	es_ES
dc.keywords	Reward Learning; MCE-IRL; AIRL; Text2Reward; LLMs; Goal Conditioned Learning; Goal GAN; RIG.	es-ES
dc.keywords	Reward Learning; MCE-IRL; AIRL; Text2Reward; LLMs; Goal Conditioned Learning; Goal GAN; RIG.	en-GB
Appears in collections:	TFG, TFM (temporary)

Files in this item:

File	Description	Size	Format
TFG _ADOLFO LOZANO MENDOZA.pdf	Bachelor's thesis	2.64 MB	Adobe PDF
Anexo I.pdf	Authorization	74.62 kB	Adobe PDF
