Leveraging String Similarity Algorithms for Educational Data Validation

A Scalable Approach for Digital Governance

Authors

  • Débora Barbosa Leite Silva Center of Excellence for Social Technologies (NEES), Brazil
  • Emanuel Marques Queiroga Instituto Federal de Educação, Ciência e Tecnologia Sul-Rio-Grandense (IFSul) | Center of Excellence for Social Technologies (NEES), Brazil
  • Abílio Nogueira Barros Departamento de Computação, Universidade Federal Rural de Pernambuco (UFRPE), Brazil
  • Markson Rebelo Marcolino Centro de Ciências, Tecnologias e Saúde, Universidade Federal de Santa Catarina | Center of Excellence for Social Technologies (NEES), Brazil
  • Diego Dermeval Center of Excellence for Social Technologies (NEES), Brazil
  • André Lima Center of Excellence for Social Technologies (NEES), Brazil
  • Leonardo Brandão Marques Center of Excellence for Social Technologies (NEES), Brazil
  • Christian Cechinel Centro de Ciências, Tecnologias e Saúde, Universidade Federal de Santa Catarina | Center of Excellence for Social Technologies (NEES), Brazil
  • Thales Vieira Center of Excellence for Social Technologies (NEES), Brazil

DOI:

https://doi.org/10.59490/dgo.2025.942

Keywords:

Data Governance, String Similarity, Educational Data Interoperability, Levenshtein Distance, Public Policies

Abstract

Data validation is critical for ensuring the reliability of information in educational systems, particularly in the context of digital governance. In Brazil, student records are fragmented across several governmental databases, which hinders the implementation of educational public policies that depend on large volumes of high-quality student data. One effective way to validate these records is to cross-reference them across databases. However, textual fields such as student names are prone to errors, including misspellings. Overly strict validation rules may exclude valid records, while overly lenient rules may let incorrect data pass undetected. This study addresses these challenges by proposing a new data validation methodology based on the Levenshtein distance algorithm. Because records rejected by automatic validation are verified manually in a subsequent step, the approach selects the similarity threshold by taking into account the capacity of the manual validation team, balancing strictness against manual workload. We applied this methodology to validate student data from the Sistema Gestão Presente (SGP), which manages around 7 million student records, and integrated it with the Brazilian Federal Revenue database. Through two key experiments, we demonstrated how an optimal validation threshold can be determined from the manual validation team's capacity. In this case study, the optimal similarity threshold was 80% when the manual validation capacity was approximately 950,000.
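
The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the normalisation of the Levenshtein distance by the longer string's length, the rule of picking the strictest threshold whose rejected records fit the manual team's capacity, the function names, and the sample names are all assumptions made for illustration only.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via dynamic programming (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]. Assumption: normalise by the longer length."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def pick_threshold(pairs, capacity, candidates):
    """Hypothetical selection rule: return the strictest candidate threshold
    whose rejected pairs (similarity below threshold) fit the manual
    validation capacity, as (threshold, rejected_count), or None."""
    scores = [similarity(x, y) for x, y in pairs]
    for threshold in sorted(candidates, reverse=True):
        rejected = sum(1 for s in scores if s < threshold)
        if rejected <= capacity:
            return threshold, rejected
    return None


# Made-up name pairs: (name in source system, name in reference database).
pairs = [
    ("MARIA DA SILVA", "MARIA DA SILVA"),
    ("JOAO PEREIRA", "JOAO PERREIRA"),
    ("ANA SOUZA", "ANA SOUSA"),
    ("PEDRO LIMA", "CARLA ROCHA"),
]
candidates = [0.70, 0.80, 0.90, 0.95]

print(pick_threshold(pairs, capacity=2, candidates=candidates))
print(pick_threshold(pairs, capacity=1, candidates=candidates))
```

In this toy data, a larger manual capacity allows a stricter threshold, which is the trade-off the abstract describes: fewer incorrect records accepted automatically, at the cost of more records sent to manual review.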

Published

2025-05-19

How to Cite

Barbosa Leite Silva, D., Marques Queiroga, E., Nogueira Barros, A., Rebelo Marcolino, M., Dermeval, D., Lima, A., Brandão Marques, L., Cechinel, C., & Vieira, T. (2025). Leveraging String Similarity Algorithms for Educational Data Validation: A Scalable Approach for Digital Governance. Conference on Digital Government Research, 1. https://doi.org/10.59490/dgo.2025.942