The Generalization Gap: Do Audio Deepfake Detectors Actually Protect Against Modern Vishing?

García Martínez-Echevarría, Victoria; Palacios Hielscher, Rafael; López López, Gregorio; Gupta, Amar

Please use this identifier to cite or link to this item: http://hdl.handle.net/11531/111097

Full metadata record

DC Field	Value	Language
dc.contributor.author	García Martínez-Echevarría, Victoria	es-ES
dc.contributor.author	Palacios Hielscher, Rafael	es-ES
dc.contributor.author	López López, Gregorio	es-ES
dc.contributor.author	Gupta, Amar	es-ES
dc.date.accessioned	2026-07-02T04:32:29Z	-
dc.date.available	2026-07-02T04:32:29Z	-
dc.date.issued	2026-07-01	es_ES
dc.identifier.issn	2079-9292	es_ES
dc.identifier.uri	https://doi.org/10.3390/electronics15132846	es_ES
dc.identifier.uri	http://hdl.handle.net/11531/111097	-
dc.description	Journal articles	es_ES
dc.description.abstract	Voice phishing, commonly known as vishing, has become one of the fastest-growing threats in social engineering. The rapid advancement and accessibility of AI voice cloning tools have enabled attackers to produce highly convincing synthetic speech at minimal cost, driving a sharp increase in impersonation fraud. Accordingly, automatic detection of synthetic voices could contribute, as one component of a broader defense, to mitigating vishing attacks. This paper studies the automatic detection of AI-generated speech, with a particular focus on how well such detectors generalize beyond their training data to modern, unseen synthesis methods. Two detection approaches are evaluated: a Residual CNN (convolutional neural network) trained as a binary classifier on three different time–frequency representations and a one-class learning strategy with a ResNet-18 backbone, yielding four models in total. Models were trained on the well-known ASVspoof 2019 Logical Access dataset and tested on its standard partitions. Then, models were tested on the SONAR benchmark, which gathers voices generated with state-of-the-art synthesis techniques unseen during training. Experimental results show that, on the modern systems gathered in SONAR, all four configurations fall close to chance. The LFCC one-class detector generalizes comparatively best, but the apparently higher accuracy of some models reflects a tendency to label most speech as spoofed. These findings indicate that the evaluated detectors can provide, at most, a partial security layer against vishing driven by current and emerging speech-synthesis technologies, although continuous model updates are recommended.	en-GB
dc.language.iso	en-GB	es_ES
dc.source	Journal: Electronics, Period: 1, Volume: online, Issue: 13, First page: 2846, Last page: 0	es_ES
dc.subject.other	Instituto de Investigación Tecnológica (IIT)	es_ES
dc.title	The Generalization Gap: Do Audio Deepfake Detectors Actually Protect Against Modern Vishing?	es_ES
dc.type	info:eu-repo/semantics/article	es_ES
dc.description.version	info:eu-repo/semantics/publishedVersion	es_ES
dc.rights.holder		es_ES
dc.rights.accessRights	info:eu-repo/semantics/openAccess	es_ES
dc.keywords	AI-generated speech; spoofing detection; residual CNN (convolutional neural network); one-class learning; generalization; vishing	en-GB
Appears in collections:	Articles

Files in this item:

File	Description	Size	Format
IIT-26-200R.pdf		1.21 MB	Adobe PDF
IIT-26-200R_preview.pdf		3.82 kB	Adobe PDF
