SemTab: A Hybrid Framework for Semantic Feature Generation on Tabular Data

Chen, Olivia; Chou, Kara; Nagpal, Rashmi; Palacios Hielscher, Rafael; Gupta, Amar

Please use this identifier to cite or link to this item: http://hdl.handle.net/11531/110698

Title:	SemTab: A Hybrid Framework for Semantic Feature Generation on Tabular Data
Authors:	Chen, Olivia; Chou, Kara; Nagpal, Rashmi; Palacios Hielscher, Rafael; Gupta, Amar
Publication date:	26-May-2026
Publisher:	Massachusetts Institute of Technology; Institute of Electrical and Electronics Engineers (Cambridge, United States of America)
Abstract:	Machine learning models on tabular datasets often struggle to understand the context between features, which can limit their accuracy. We propose SemTab, a hybrid framework for generating semantic features that utilizes an open-source Large Language Model (LLM). We evaluated our framework using three benchmark datasets: Adult Income, German Credit, and Bank Marketing. We compared its performance against several off-the-shelf LLMs. The results show that SemTab achieved the highest accuracy across all the classification tasks. For instance, on the Bank Marketing dataset, SemTab achieved an accuracy of 80%, which is approximately a 20% improvement over the baseline models. This work highlights that a hybrid architecture is a practical approach for applying language models to structured tabular data, yielding accurate and interpretable results for various downstream tasks.
Description:	Book chapters
URI:	http://hdl.handle.net/11531/110698
Appears in collections:	Articles

Files in this item:

File	Size	Format
IIT-25-413C.pdf	199.36 kB	Adobe PDF
