Enhancing Open Data Findability
Fine-Tuning LLMs (T5) for Metadata Generation
DOI:
https://doi.org/10.59490/dgo.2025.941
Keywords:
FAIR Metadata, European Data Portal, Open Data, Large Language Models, LLM, T5, Artificial Intelligence, Annotation, Tagging, Keyword Extraction
Abstract
Metadata is essential for improving the discoverability and findability of datasets. According to the European Data Portal (EDP), Europe's largest open data portal, keywords and categories alone contribute to 60 percent of a dataset's visibility. Given that datasets can have highly variable and complex contexts, we propose leveraging the power of Large Language Models (LLMs), specifically T5-small and T5-large, to generate this metadata. We used the EDP as our case study, obtaining 60,000 datasets and undertaking thorough data cleaning and transformation. This process yielded 3,131 datasets for keyword extraction and 2,790 datasets for category extraction, ready for model fine-tuning. The base versions of T5-small and T5-large initially struggled to produce keywords representative of those generated by humans, resulting in F1 scores of only 0.0455 and 0.1051, respectively. However, fine-tuning significantly improved their performance, achieving an F1 score of 0.4538 for T5-small and 0.6085 for T5-large. A similar pattern was observed in category extraction: the base T5-small and T5-large models had F1 scores of only 0.1222 and 0.3326, respectively, whereas the fine-tuned models reached 0.6284 and 0.8322. Notably, T5-large produced keywords similar to those generated by humans in over 60% of cases, and its category predictions matched human-generated ones in over 80% of cases. This highlights the potential of LLMs for generating human-like metadata, thereby significantly enhancing data findability and usability across various applications.
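To make the described pipeline concrete, the sketch below illustrates one way the abstract's two steps could be implemented in Python with the Hugging Face transformers library: fine-tuning T5-small to map a dataset description to comma-separated keywords, and scoring a prediction against human keywords with set-based precision, recall, and F1. It is an assumption-laden illustration, not the authors' released code; the toy record, the "generate keywords:" task prefix, the field names, and all hyperparameters are illustrative choices.

    # Minimal sketch (assumptions, not the authors' pipeline): fine-tune T5-small
    # to turn a dataset description into a comma-separated keyword string.
    from datasets import Dataset
    from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                              DataCollatorForSeq2Seq, Seq2SeqTrainer,
                              Seq2SeqTrainingArguments)

    # Toy stand-in for the cleaned EDP records (the paper uses 3,131 of them).
    train = Dataset.from_dict({
        "description": ["Hourly air-quality measurements for Berlin, 2015-2020."],
        "keywords": ["air quality, environment, berlin"],
    })

    tokenizer = AutoTokenizer.from_pretrained("t5-small")
    model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

    def preprocess(batch):
        # T5 is text-to-text: prefix the task, use the keyword string as the target.
        inputs = tokenizer(["generate keywords: " + d for d in batch["description"]],
                           max_length=512, truncation=True)
        inputs["labels"] = tokenizer(text_target=batch["keywords"],
                                     max_length=64, truncation=True)["input_ids"]
        return inputs

    tokenized = train.map(preprocess, batched=True, remove_columns=train.column_names)

    trainer = Seq2SeqTrainer(
        model=model,
        args=Seq2SeqTrainingArguments(output_dir="t5-keywords",
                                      num_train_epochs=3,
                                      per_device_train_batch_size=8,
                                      learning_rate=3e-4),
        train_dataset=tokenized,
        data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    )
    trainer.train()

    def keyword_f1(predicted: str, reference: str) -> float:
        # Set-overlap F1 between comma-separated keyword lists (one possible metric).
        pred = {k.strip().lower() for k in predicted.split(",") if k.strip()}
        gold = {k.strip().lower() for k in reference.split(",") if k.strip()}
        if not pred or not gold:
            return 0.0
        overlap = len(pred & gold)
        if overlap == 0:
            return 0.0
        precision, recall = overlap / len(pred), overlap / len(gold)
        return 2 * precision * recall / (precision + recall)

    print(keyword_f1("air quality, berlin, pollution", "air quality, environment, berlin"))

The same recipe would apply to category extraction by swapping the target field, and to T5-large by changing the model name; the specific training settings used for the reported results are not given here.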
License
Copyright (c) 2025 Umair Ahmed, Andrea Polini

This work is licensed under a Creative Commons Attribution 4.0 International License.
