Applying Active Learning in Named Entity Recognition Corpora Expansion in Legal Domain
DOI: https://doi.org/10.59490/dgo.2025.945
Keywords: Active Learning, Named Entity Recognition (NER), Selection Criteria, Information Retrieval, Natural Language Processing, Machine Learning
Abstract
This work investigates the application of Active Learning methodologies to data annotation for Named Entity Recognition (NER), with a focus on Portuguese-language documents from the legal domain. It aims to identify a selection algorithm that improves the efficiency of the annotation process and reduces the human cost involved, without compromising the quality of the classifiers trained on the resulting corpora. Three sample selection methods were explored: (i) Multi-Criteria Active Learning, which uses informativeness, representativeness, and diversity as selection criteria; (ii) Dynamic Selection Guided by Entity Volume; and (iii) Random Sentence Selection, used as a baseline for evaluating the other two. The study used a BERT model for classification and evaluated each approach with different amounts of labeled data (annotation budgets). The results show that, although Multi-Criteria Active Learning performed better in some scenarios, Dynamic Selection Guided by Entity Volume performed consistently well, especially at low annotation budgets, while also being computationally more efficient. The results therefore suggest that the volume of named entities in a sample is a good predictor of how informative that sample is. This study contributes to the Active Learning field by applying these techniques to modern language models and by providing efficient solutions for reducing the cost of data annotation for Named Entity Recognition.
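To make the contrast between the random baseline and entity-volume-guided selection concrete, here is a minimal sketch under stated assumptions. The abstract does not say how entity volume is measured or how the selection is updated between annotation rounds, so the function names, the single-round ranking, and the toy entity counter below are all hypothetical illustrations, not the authors' implementation.

```python
import random
from typing import Callable, List, Sequence


def select_random(pool: Sequence[str], budget: int, seed: int = 0) -> List[int]:
    """Baseline: pick `budget` sentence indices uniformly at random."""
    rng = random.Random(seed)
    k = min(budget, len(pool))
    return sorted(rng.sample(range(len(pool)), k))


def select_by_entity_volume(
    pool: Sequence[str],
    budget: int,
    count_entities: Callable[[str], int],
) -> List[int]:
    """Assumed reading of 'entity volume': rank unlabeled sentences by the
    number of entities the current model predicts, and keep the top `budget`.
    Ties are broken by original position so the result is deterministic."""
    scored = [(count_entities(sentence), i) for i, sentence in enumerate(pool)]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return sorted(i for _, i in scored[:budget])


def toy_entity_counter(sentence: str) -> int:
    """Hypothetical stand-in for a NER model's predictions: counts
    capitalized tokens that are not sentence-initial."""
    tokens = sentence.split()
    return sum(1 for token in tokens[1:] if token[:1].isupper())


# Tiny unlabeled pool (invented sentences, for illustration only).
pool = [
    "O projeto foi aprovado ontem.",
    "A deputada Maria Silva apresentou o PL 1234 em Brasília.",
    "A sessão terminou tarde.",
    "O Senado Federal enviou o texto à Câmara dos Deputados.",
]

chosen = select_by_entity_volume(pool, budget=2, count_entities=toy_entity_counter)
baseline = select_random(pool, budget=2)
```

In a real setting, `toy_entity_counter` would be replaced by predictions from the classifier being trained, and the ranking would presumably be recomputed as the labeled set grows; both points are assumptions here.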
License
Copyright (c) 2025 Rafael P. Gouveia, André C.P.L.F. de Carvalho, Ellen Souza, Hidelberg O. Albuquerque, Douglas Vitório, Nádia F.F. da Silva

This work is licensed under a Creative Commons Attribution 4.0 International License.
