Leveraging String Similarity Algorithms for Educational Data Validation
A Scalable Approach for Digital Governance
DOI:
https://doi.org/10.59490/dgo.2025.942
Keywords:
Data Governance, String Similarity, Educational Data Interoperability, Levenshtein distance, public policies
Abstract
Data validation is critical for ensuring the reliability of information in educational systems, particularly in the context of digital governance. In Brazil, student records are fragmented across various governmental databases, which hinders the implementation of educational public policies that rely on large volumes of high-quality student data. One effective method for validating these records is cross-referencing data across different databases. However, textual data, such as student names, are prone to errors, including misspellings. Overly strict validation rules may exclude valid records, while overly lenient rules may let incorrect data pass undetected. This study addresses these challenges by proposing a new data validation methodology based on the Levenshtein distance algorithm. Because excluded students are manually verified in a subsequent step, the approach identifies an optimal similarity threshold by taking into account the capacity of the manual validation team, which yields a balanced solution. We applied this methodology to validate student data from the Sistema Gestão Presente (SGP), which manages around 7 million student records, and integrated it with the Brazilian Federal Revenue database. Through two key experiments, we demonstrated how an optimal validation threshold can be determined from the manual validation team's capacity. In this case study, we found an optimal similarity threshold of 80% when the manual validation capacity is approximately 950,000.
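The idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the normalization of the Levenshtein distance into a similarity score (one minus the distance divided by the longer string's length), the selection rule (the strictest candidate threshold whose rejected records still fit the manual validation capacity), and the function names `similarity` and `pick_threshold` are all assumptions made for illustration.

```python
from typing import Iterable, Optional, Sequence


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def pick_threshold(scores: Sequence[float], capacity: int,
                   candidates: Iterable[float]) -> Optional[float]:
    """Hypothetical rule: strictest threshold whose rejected records
    (score below threshold) do not exceed the manual validation capacity."""
    feasible = [t for t in candidates
                if sum(s < t for s in scores) <= capacity]
    return max(feasible) if feasible else None
```

For example, `pick_threshold(scores, capacity, [0.70, 0.80, 0.90])` would be called with one similarity score per cross-referenced record pair and the number of records the manual team can review; the candidate thresholds here are placeholders, not values from the study.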
License
Copyright (c) 2025 Débora Barbosa Leite Silva, Emanuel Marques Queiroga, Abílio Nogueira Barros, Markson Rebelo Marcolino, Diego Dermeval, André Lima, Leonardo Brandão Marques, Christian Cechinel, Thales Vieira

This work is licensed under a Creative Commons Attribution 4.0 International License.