SemTab: A Hybrid Framework for Semantic Feature Generation on Tabular Data

Chen, Olivia; Chou, Kara; Nagpal, Rashmi; Palacios Hielscher, Rafael; Gupta, Amar

dc.contributor.author	Chen, Olivia	es-ES
dc.contributor.author	Chou, Kara	es-ES
dc.contributor.author	Nagpal, Rashmi	es-ES
dc.contributor.author	Palacios Hielscher, Rafael	es-ES
dc.contributor.author	Gupta, Amar	es-ES
dc.date.accessioned	2026-06-12T06:49:28Z
dc.date.available	2026-06-12T06:49:28Z
dc.date.issued	2026-05-26	es_ES
dc.identifier.uri	http://hdl.handle.net/11531/110698
dc.description	Book chapters	es_ES
dc.description.abstract	Machine learning models on tabular datasets often struggle to understand the context between features, which can limit their accuracy. We propose SemTab, a hybrid framework for generating semantic features that utilizes an open-source Large Language Model (LLM). We evaluated our framework using three benchmark datasets: Adult Income, German Credit, and Bank Marketing. We compared its performance against several off-the-shelf LLMs. The results show that SemTab achieved the highest accuracy across all the classification tasks. For instance, on the Bank Marketing dataset, SemTab achieved an accuracy of 80%, which is an approximately 20% improvement over the baseline models. This work highlights that a hybrid architecture is a practical approach for applying language models to structured tabular data, yielding accurate and interpretable results for various downstream tasks.	es-ES
dc.description.abstract	Machine learning models on tabular datasets often struggle to understand the context between features, which can limit their accuracy. We propose SemTab, a hybrid framework for generating semantic features that utilizes an open-source Large Language Model (LLM). We evaluated our framework using three benchmark datasets: Adult Income, German Credit, and Bank Marketing. We compared its performance against several off-the-shelf LLMs. The results show that SemTab achieved the highest accuracy across all the classification tasks. For instance, on the Bank Marketing dataset, SemTab achieved an accuracy of 80%, which is an approximately 20% improvement over the baseline models. This work highlights that a hybrid architecture is a practical approach for applying language models to structured tabular data, yielding accurate and interpretable results for various downstream tasks.	en-GB
dc.format.mimetype	application/pdf	es_ES
dc.language.iso	en-GB	es_ES
dc.publisher	Massachusetts Institute of Technology; Institute of Electrical and Electronics Engineers (Cambridge, United States of America)	es_ES
dc.rights		es_ES
dc.rights.uri		es_ES
dc.source	Book: Undergraduate Research Technology Conference - MIT URTC 2025, First page: 1-5, Last page:	es_ES
dc.subject.other	Instituto de Investigación Tecnológica (IIT)	es_ES
dc.title	SemTab: A Hybrid Framework for Semantic Feature Generation on Tabular Data	es_ES
dc.type	info:eu-repo/semantics/bookPart	es_ES
dc.description.version	info:eu-repo/semantics/publishedVersion	es_ES
dc.rights.accessRights	info:eu-repo/semantics/restrictedAccess	es_ES
dc.keywords	Tabular Data, Semantic Feature Generation, LLMs, Model Interpretability	es-ES
dc.keywords	Tabular Data, Semantic Feature Generation, LLMs, Model Interpretability	en-GB

Files in this item

Name: IIT-25-413C.pdf
Size: 199.3Kb
Format: PDF

This item appears in the following Collection(s)

Articles
Published journal articles, book chapters, and conference contributions.
