Enhancing Open Data Findability
Fine-Tuning LLMs (T5) for Metadata Generation
DOI: https://doi.org/10.59490/dgo.2025.941
Keywords: FAIR Metadata, European Data Portal, Open Data, Large Language Models, LLM, T5, Artificial Intelligence, Annotation, Tagging, Keyword Extraction
Abstract
Metadata is essential for improving the discoverability and findability of datasets. According to the European Data Portal (EDP), Europe’s largest open data portal, keywords and categories alone contribute to 60 percent of a dataset’s visibility. Because datasets can have highly variable and complex contexts, we propose using Large Language Models (LLMs), specifically T5-small and T5-large, to generate this metadata, with EDP as our case study. We obtained 60,000 datasets from EDP and carried out thorough data cleaning and transformation. This process yielded 3,131 datasets for keyword extraction and 2,790 datasets for category extraction, ready for model fine-tuning. The base versions of T5-small and T5-large struggled to produce keywords representative of those generated by humans, with F1 scores of only 0.0455 and 0.1051, respectively. Fine-tuning improved their performance significantly, to an F1 score of 0.4538 for T5-small and 0.6085 for T5-large. A similar pattern was observed in category extraction: the base T5-small and T5-large models had F1 scores of only 0.1222 and 0.3326, respectively, whereas the fine-tuned models reached F1 scores of 0.6284 and 0.8322. Notably, T5-large produced keywords similar to those generated by humans in over 60% of cases, and its category predictions matched human-generated ones in over 80% of cases. This highlights the potential of LLMs for generating human-like metadata, thereby significantly enhancing data findability and usability across various applications.
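The abstract reports F1 scores but does not state how generated keywords were matched against human-assigned ones. As a rough illustration only, the sketch below computes a set-based precision, recall, and F1 for one dataset. It assumes exact matching after lowercasing and whitespace trimming. The function name `keyword_f1` and the example keywords are hypothetical and are not taken from the paper.

```python
def keyword_f1(predicted, reference):
    """Set-based precision, recall, and F1 between two keyword lists.

    Assumption: keywords match only if identical after trimming and
    lowercasing. The paper may use a different matching rule.
    """
    pred = {k.strip().lower() for k in predicted if k.strip()}
    ref = {k.strip().lower() for k in reference if k.strip()}

    overlap = len(pred & ref)
    if overlap == 0:
        # Covers empty inputs and the no-match case without dividing by zero.
        return 0.0, 0.0, 0.0

    precision = overlap / len(pred)  # share of generated keywords that are correct
    recall = overlap / len(ref)      # share of human keywords that were recovered
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical example: model output vs. human-assigned keywords.
generated = ["Air Quality", "pollution", "traffic"]
human = ["air quality", "pollution", "emissions", "environment"]
precision, recall, f1 = keyword_f1(generated, human)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Here two of the three generated keywords appear among the four human keywords, so precision is 2/3 and recall is 1/2. A corpus-level score could then be obtained by averaging per-dataset values, although the abstract does not say which averaging scheme was used.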
License
Copyright (c) 2025 Umair Ahmed, Andrea Polini

This work is licensed under a Creative Commons Attribution 4.0 International License.