Leveraging String Similarity Algorithms for Educational Data Validation

A Scalable Approach for Digital Governance

Authors

  • Débora Barbosa Leite Silva, Center of Excellence for Social Technologies (NEES), Brazil
  • Emanuel Marques Queiroga, Instituto Federal de Educação, Ciência e Tecnologia Sul-Rio-Grandense (IFSul) | Center of Excellence for Social Technologies (NEES), Brazil
  • Abílio Nogueira Barros, Departamento de Computação, Universidade Federal Rural de Pernambuco (UFRPE), Brazil
  • Markson Rebelo Marcolino, Centro de Ciências, Tecnologias e Saúde, Universidade Federal de Santa Catarina | Center of Excellence for Social Technologies (NEES), Brazil
  • Diego Dermeval, Center of Excellence for Social Technologies (NEES), Brazil
  • André Lima, Center of Excellence for Social Technologies (NEES), Brazil
  • Leonardo Brandão Marques, Center of Excellence for Social Technologies (NEES), Brazil
  • Christian Cechinel, Centro de Ciências, Tecnologias e Saúde, Universidade Federal de Santa Catarina | Center of Excellence for Social Technologies (NEES), Brazil
  • Thales Vieira, Center of Excellence for Social Technologies (NEES), Brazil

DOI:

https://doi.org/10.59490/dgo.2025.942

Keywords:

Data Governance, String Similarity, Educational Data Interoperability, Levenshtein Distance, Public Policies

Abstract

Data validation is critical for ensuring the reliability of information in educational systems, particularly in the context of digital governance. In Brazil, student records are fragmented across various governmental databases, which hinders the implementation of educational public policies that rely on large volumes of high-quality student data. One effective way to validate these records is to cross-reference them across different databases. However, textual data such as student names are prone to errors, including misspellings. Overly strict validation rules may exclude valid records, while overly lenient rules may let incorrect data pass undetected. This study addresses these challenges by proposing a new data validation methodology based on the Levenshtein distance algorithm. Because excluded students are manually verified in a subsequent step, the approach selects the similarity threshold according to the capacity of the manual validation team, which balances strictness against manual workload. We applied this methodology to validate student data from the Sistema Gestão Presente (SGP), which manages around 7 million student records, and integrated it with the Brazilian Federal Revenue database. Through two key experiments, we demonstrated how an optimal validation threshold can be determined from the manual validation team's capacity. In this case study, we found an optimal similarity threshold of 80% when the manual validation capacity is approximately 950,000 records.
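
As an illustration only, the Python sketch below shows one way a capacity-aware threshold could be chosen. It is not taken from the paper: the normalization (one minus the edit distance divided by the longer name's length), the selection rule (the strictest candidate threshold whose excluded records still fit the manual validation capacity), the function names, and the sample names are all assumptions made for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def strictest_feasible_threshold(pairs, candidates, capacity):
    """Assumed rule: the highest threshold whose exclusions fit the capacity.

    A pair is excluded (sent to manual verification) when its similarity
    falls below the threshold. Returns None if no candidate is feasible.
    """
    feasible = None
    for threshold in sorted(candidates):
        excluded = sum(1 for a, b in pairs if similarity(a, b) < threshold)
        if excluded <= capacity:
            feasible = threshold
    return feasible


# Hypothetical name pairs: (name in one database, name in the other).
sample_pairs = [
    ("MARIA SILVA", "MARIA SILVA"),
    ("JOAO SOUZA", "JOAO SOUSA"),
    ("ANA LIMA", "ANA PAULA LIMA"),
    ("PEDRO ALVES", "CARLA ROCHA"),
]
chosen = strictest_feasible_threshold(sample_pairs, [0.5, 0.8, 0.95], capacity=2)
```

With these four hypothetical pairs and a capacity of two manual checks, the sketch selects 0.8: raising the threshold to 0.95 would also exclude the one-letter misspelling and exceed the capacity.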

Published

2025-05-19

How to Cite

Barbosa Leite Silva, D., Marques Queiroga, E., Nogueira Barros, A., Rebelo Marcolino, M., Dermeval, D., Lima, A., Brandão Marques, L., Cechinel, C., & Vieira, T. (2025). Leveraging String Similarity Algorithms for Educational Data Validation: A Scalable Approach for Digital Governance. Conference on Digital Government Research, 26. https://doi.org/10.59490/dgo.2025.942

Section

Research papers