Energy-Aware Multilingual Evaluation of Large Language Models
Abstract
The rapid deployment of Large Language Models (LLMs) in multilingual, production-scale systems has made inference-time energy consumption a critical yet systematically under-evaluated dimension of model quality. While accuracy-centric benchmarks dominate current evaluation practice, they fail to capture the energy cost of reasoning, particularly across languages and task complexities where consumption profiles diverge substantially. In this work, we present a comprehensive energy–performance evaluation of five instruction-tuned LLMs, spanning Transformer, Grouped-Query Attention, and State Space Model architectures, across thirteen typologically diverse languages and multiple task difficulty levels under controlled GPU-level energy measurement on NVIDIA H200 hardware. Our analysis encompasses 65 model–language configurations totaling over 5,100 individual inference runs, supported by rigorous non-parametric statistical testing (Friedman tests, pairwise Wilcoxon signed-rank tests with Holm correction, and paired Cohen's d effect sizes). We report four principal findings. First, energy consumption varies up to threefold across models under identical workloads (χ² = 49.42, p = 4.78 × 10⁻¹⁰, Friedman test), stratifying into three distinct energy regimes driven by architecture and generation dynamics rather than parameter count. Second, energy expenditure and reasoning performance are only weakly coupled, as confirmed by Spearman rank correlation analysis (r_s = 0.109, p = 0.386). Third, task category and difficulty level introduce substantial, model-dependent variation in both energy demand and performance, with cross-lingual performance variance amplifying at higher difficulty levels. Fourth, language choice acts as a measurable deployment parameter: Romance languages achieve, on average, lower energy consumption than English across multiple models, while model efficiency rankings shift across languages, yielding language-dependent Pareto-optimal frontiers. We formalize these trade-offs through multi-objective Pareto analysis and introduce a composite AI Energy Score metric that captures reasoning quality per unit of energy. Of the 65 evaluated configurations, only four are Pareto-optimal: three Mistral-7B configurations at the low-energy extreme and one Phi-4-mini-instruct configuration at the high-performance end, while three of the five models are entirely dominated across all language configurations. These findings provide actionable guidelines for energy-aware model selection in multilingual deployments and support the integration of AI Energy Scores as a standard complementary criterion in LLM evaluation frameworks.
Activity Type
Journal articles
ISSN
2079-9292
Keywords
large language models; energy efficiency; multilingual evaluation; sustainable AI; GPU energy consumption; AI energy scores


