A Multi-Layer Verification Harness for LLM-Generated Code Patches

Hernández Bas, Ignacio

dc.contributor.advisor	Huang, Furong	es-ES
dc.contributor.author	Hernández Bas, Ignacio	es-ES
dc.contributor.other	Universidad Pontificia Comillas, Escuela Técnica Superior de Ingeniería (ICAI)	es_ES
dc.date.accessioned	2025-12-02T12:33:37Z
dc.date.available	2025-12-02T12:33:37Z
dc.date.issued	2026	es_ES
dc.identifier.uri	http://hdl.handle.net/11531/107492
dc.description	Máster Universitario en Ingeniería de Telecomunicación + Máster Universitario en Big Data	es_ES
dc.description.abstract	This work presents a multi-layer verification environment for evaluating code patches generated by LLMs. The main motivation is a common limitation of automated program repair systems: many patches are considered correct only because they pass the repository's existing tests. However, passing a test suite does not always guarantee that the patch actually satisfies the expected behavior described by the user in the original issue. To address this problem, the project proposes a verification framework composed of three complementary layers. The first layer performs static verification without executing the code. It checks the syntactic validity of the patch, extracts structural metrics, and computes a Static Quality Index from tools such as Pylint, Radon, Flake8, Mypy, and Bandit. This layer provides an early signal about the quality, maintainability, style, typing, and security of the code. The second layer performs dynamic verification by applying the patch in a containerized environment and running the repository's original tests, following the same evaluation logic used in SWE-bench. This step checks whether the patch resolves the expected failures without introducing regressions. The third layer introduces semantic verification. To this end, verifiable claims are extracted from the natural-language description of the issue. These claims guide the generation of additional tests through an agentic loop, with the aim of checking whether the patch reflects the user's actual intent and not only whether it passes the available tests. The system was evaluated on SWE-bench Lite. The results show that multi-layer verification offers more complete signals and makes it possible to detect behavioral differences that a validation based only on unit tests could overlook.	es-ES
dc.description.abstract	This thesis presents a multi-layer verification harness for evaluating Python code patches generated by Large Language Models. The main motivation is that current automated program repair systems are often evaluated primarily against existing unit tests, which can confirm test-suite correctness but do not always guarantee that a patch truly satisfies the intended behavior described in a software issue. To address this limitation, the project proposes a verification framework composed of three complementary layers. The first layer performs static verification without executing the code. It checks syntax validity, extracts structural information from the patch, and computes a Static Quality Index based on tools such as Pylint, Radon, Flake8, Mypy, and Bandit. This provides an early signal about code quality, maintainability, typing, style, and security. The second layer performs dynamic verification by applying the candidate patch inside a controlled containerized environment and running the repository’s original test suite, following the evaluation logic used in SWE-bench. This step verifies whether the patch preserves existing behavior and resolves the original failing tests. The third layer introduces semantic verification. It extracts behavioral claims from the natural-language issue description and uses them to guide the generation of additional executable tests through an agentic loop. These tests are validated against both the buggy and reference versions of the repository, aiming to identify whether the generated patch captures the intended behavior rather than merely passing the available tests. The system was evaluated on SWE-bench Lite, and the results show that the layered approach provides richer verification signals and detects behavioral differences that traditional test-based validation misses.	en-GB
dc.format.mimetype	application/pdf	es_ES
dc.language.iso	es-ES	es_ES
dc.rights	Attribution-NonCommercial-NoDerivs 3.0 United States	es_ES
dc.rights.uri	http://creativecommons.org/licenses/by-nc-nd/3.0/us/	es_ES
dc.subject.other	H67 (MIT)	es_ES
dc.title	A Multi-Layer Verification Harness for LLM-Generated Code Patches	es_ES
dc.type	info:eu-repo/semantics/masterThesis	es_ES
dc.rights.accessRights	info:eu-repo/semantics/openAccess	es_ES
dc.keywords	Parches de Código Generados por LLMs; Verificación Multicapa; Evaluación de Corrección de Parches; Verificación Semántica; Pruebas Diferenciales; SWE-bench Lite	es-ES
dc.keywords	LLM-Generated Code Patches; Multi-Layer Verification; Patch Correctness Evaluation; Semantic Verification; Differential Testing; SWE-bench Lite	en-GB
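
The two acceptance checks mentioned in the abstract can be illustrated with a minimal sketch. It is not taken from the thesis. The first function follows the publicly documented SWE-bench criterion (a patch counts as resolved when every FAIL_TO_PASS and every PASS_TO_PASS test passes after it is applied). The second function is an assumption about how a generated test might be validated against the buggy and reference versions: it is kept only if it fails on the former and passes on the latter. All names (`GeneratedTestOutcome`, `is_resolved`, `is_discriminating`) are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class GeneratedTestOutcome:
    """Result of running one generated test on both repository versions (hypothetical type)."""
    passed_on_buggy: bool
    passed_on_reference: bool


def is_resolved(results: Dict[str, bool],
                fail_to_pass: Iterable[str],
                pass_to_pass: Iterable[str]) -> bool:
    """SWE-bench style check: every listed test must pass on the patched repository.

    `results` maps test names to pass/fail; a test missing from it counts as failed.
    """
    required = list(fail_to_pass) + list(pass_to_pass)
    return all(results.get(name, False) for name in required)


def is_discriminating(outcome: GeneratedTestOutcome) -> bool:
    """Assumed acceptance rule: the test fails on the buggy code and passes on the reference."""
    return (not outcome.passed_on_buggy) and outcome.passed_on_reference


if __name__ == "__main__":
    results = {"test_fix": True, "test_existing": True}
    print(is_resolved(results, ["test_fix"], ["test_existing"]))
    print(is_discriminating(GeneratedTestOutcome(False, True)))
```

A test that passes on both versions says nothing about the bug, and one that fails on the reference version is probably wrong itself, which is why this sketch filters on both conditions.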

Files in this item

Name: TFM-HernandezBasIgnacio.pdf
Size: 6.024Mb
Format: PDF
Description: Master's Thesis (Trabajo Fin de Máster)


Name: AnexoI_2026_signed.pdf
Size: 259.2Kb
Format: PDF
Description: Authorization


This item appears in the following Collection(s)

TFG, TFM (temporales)


Except where otherwise noted, this item's license is described as Attribution-NonCommercial-NoDerivs 3.0 United States