Show simple item record

dc.contributor.author: de Zarzà i Cubero, Irene (es-ES)
dc.contributor.author: Liz, Mauro (es-ES)
dc.contributor.author: de Curtò i Díaz, Joaquim (es-ES)
dc.contributor.author: Calafate, Carlos T. (es-ES)
dc.date.accessioned: 2026-04-07T07:39:25Z
dc.date.available: 2026-04-07T07:39:25Z
dc.date.issued: 2026-03-27 (es_ES)
dc.identifier.issn: 2079-9292 (es_ES)
dc.identifier.uri: https://doi.org/10.3390/electronics15071395 (es_ES)
dc.identifier.uri: http://hdl.handle.net/11531/109444
dc.description: Journal articles (es_ES)
dc.description.abstract: The rapid deployment of Large Language Models (LLMs) in multilingual, production-scale systems has made inference-time energy consumption a critical yet systematically under-evaluated dimension of model quality. While accuracy-centric benchmarks dominate current evaluation practice, they fail to capture the energy cost of reasoning, particularly across languages and task complexities where consumption profiles diverge substantially. In this work, we present a comprehensive energy–performance evaluation of five instruction-tuned LLMs, spanning Transformer, Grouped-Query Attention, and State Space Model architectures, across thirteen typologically diverse languages and multiple task difficulty levels under controlled GPU-level energy measurement on NVIDIA H200 hardware. Our analysis encompasses 65 model–language configurations totaling over 5100 individual inference runs, supported by rigorous non-parametric statistical testing (Friedman tests, pairwise Wilcoxon signed-rank tests with Holm correction, and paired Cohen's d effect sizes). We report four principal findings. First, energy consumption varies up to threefold across models under identical workloads (χ² = 49.42, p = 4.78 × 10⁻¹⁰, Friedman test), stratifying into three distinct energy regimes driven by architecture and generation dynamics rather than parameter count. Second, energy expenditure and reasoning performance are only weakly coupled, as confirmed by Spearman rank correlation analysis (r_s = 0.109, p = 0.386). Third, task category and difficulty level introduce substantial, model-dependent variation in both energy demand and performance, with cross-lingual performance variance amplifying at higher difficulty levels. Fourth, language choice acts as a measurable deployment parameter: Romance languages on average achieve lower energy consumption than English across multiple models, while model efficiency rankings shift across languages, yielding language-dependent Pareto-optimal frontiers. We formalize these trade-offs through multi-objective Pareto analysis and introduce a composite AI Energy Score metric that captures reasoning quality per unit of energy. Of the 65 evaluated configurations, only four are Pareto-optimal: three Mistral-7B configurations at the low-energy extreme and one Phi-4-mini-instruct configuration at the high-performance end, while three of the five models are entirely dominated across all language configurations. These findings provide actionable guidelines for energy-aware model selection in multilingual deployments and support the integration of AI Energy Scores as a standard complementary criterion in LLM evaluation frameworks. (en-GB)
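The statistical pipeline named in the abstract (Friedman test across models, pairwise Wilcoxon signed-rank tests with Holm correction, paired Cohen's d) can be sketched as below. This is an illustrative reconstruction only: the energy values are invented placeholders, not the paper's measurements, and the model labels are hypothetical.

```python
# Illustrative sketch of the non-parametric comparison described in the
# abstract. All energy numbers are made up for demonstration; they are NOT
# the paper's data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Rows: paired workloads (language/task configurations); columns: 3 models.
energy = np.column_stack([
    rng.normal(100, 5, 30),   # "model A" (placeholder joules)
    rng.normal(160, 5, 30),   # "model B"
    rng.normal(300, 10, 30),  # "model C"
])

# Friedman test: do the models' energy distributions differ overall?
chi2, p = stats.friedmanchisquare(*energy.T)

# Pairwise Wilcoxon signed-rank tests with Holm step-down correction.
pairs = [(0, 1), (0, 2), (1, 2)]
raw_p = [stats.wilcoxon(energy[:, i], energy[:, j]).pvalue for i, j in pairs]
m = len(raw_p)
holm_p = {}
running_max = 0.0
for rank, idx in enumerate(np.argsort(raw_p)):
    adj = min(1.0, (m - rank) * raw_p[idx])
    running_max = max(running_max, adj)  # enforce monotone adjusted p-values
    holm_p[pairs[idx]] = running_max

def paired_cohens_d(x, y):
    """Cohen's d for paired samples: mean difference over its std dev."""
    diff = x - y
    return diff.mean() / diff.std(ddof=1)

d_ab = paired_cohens_d(energy[:, 0], energy[:, 1])
print(f"Friedman chi2={chi2:.2f}, p={p:.3g}, d(A,B)={d_ab:.2f}")
```

With clearly separated energy regimes like these placeholders, the Friedman statistic is large and every Holm-adjusted pairwise p-value stays significant, mirroring the stratification the abstract reports.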
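The multi-objective Pareto filtering over model–language configurations can be sketched as follows. The configurations and scores are invented for illustration, and the simple accuracy-per-joule ratio stands in for the paper's composite AI Energy Score, whose exact formula is not given in this record.

```python
# Hypothetical sketch: Pareto-optimal filtering of (energy, accuracy)
# configurations. Config names and numbers are invented; the
# accuracy/energy ratio is an assumed stand-in for the composite
# AI Energy Score, not the paper's metric.
from dataclasses import dataclass

@dataclass(frozen=True)
class Config:
    name: str
    energy_j: float   # energy per run, lower is better
    accuracy: float   # reasoning accuracy, higher is better

def pareto_front(configs):
    """Keep configs not dominated by any other: a dominator uses no more
    energy, scores no lower, and is strictly better on at least one axis."""
    front = []
    for c in configs:
        dominated = any(
            o.energy_j <= c.energy_j and o.accuracy >= c.accuracy
            and (o.energy_j < c.energy_j or o.accuracy > c.accuracy)
            for o in configs
        )
        if not dominated:
            front.append(c)
    return front

configs = [
    Config("mistral-7b/es", 80.0, 0.62),
    Config("mistral-7b/en", 95.0, 0.64),
    Config("phi-4-mini/en", 210.0, 0.78),
    Config("model-x/en", 230.0, 0.70),   # dominated by phi-4-mini/en
]

front = pareto_front(configs)
energy_score = {c.name: c.accuracy / c.energy_j for c in configs}
print([c.name for c in front])
```

In this toy instance, `model-x/en` is dominated (more energy, lower accuracy than `phi-4-mini/en`) and drops out, while the low-energy and high-performance extremes both survive on the frontier, the same shape of result the abstract reports for its 65 configurations.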
dc.format.mimetype: application/pdf (es_ES)
dc.language.iso: en-GB (es_ES)
dc.rights: Creative Commons Reconocimiento-NoComercial-SinObraDerivada España (es_ES)
dc.rights.uri: http://creativecommons.org/licenses/by-nc-nd/3.0/es/ (es_ES)
dc.source: Journal: Electronics, Period: 1, Volume: 15(7), Issue: 1395, First page: 1, Last page: 44 (es_ES)
dc.title: Energy-Aware Multilingual Evaluation of Large Language Models (es_ES)
dc.type: info:eu-repo/semantics/article (es_ES)
dc.description.version: info:eu-repo/semantics/publishedVersion (es_ES)
dc.rights.accessRights: info:eu-repo/semantics/openAccess (es_ES)
dc.keywords: large language models; energy efficiency; multilingual evaluation; sustainable AI; GPU energy consumption; AI energy scores (en-GB)


This item appears in the following Collection(s)

  • Artículos
    Journal articles, book chapters, and published conference contributions.

