Applying Active Learning in Named Entity Recognition Corpora Expansion in Legal Domain

Rafael P. Gouveia; André C.P.L.F. de Carvalho; Ellen Souza; Hidelberg O. Albuquerque; Douglas Vitório; Nádia F.F. da Silva

doi:10.59490/dgo.2025.945

Authors

Rafael P. Gouveia, Institute of Mathematics and Computer Sciences, University of São Paulo, Brazil. ORCID: https://orcid.org/0000-0002-4684-8037
André C.P.L.F. de Carvalho, Institute of Mathematics and Computer Sciences, University of São Paulo, Brazil. ORCID: https://orcid.org/0000-0002-4765-6459
Ellen Souza, MiningBR Research Group, Federal Rural University of Pernambuco, Brazil. ORCID: https://orcid.org/0000-0002-7706-4809
Hidelberg O. Albuquerque, MiningBR Research Group, Federal Rural University of Pernambuco, Brazil. ORCID: https://orcid.org/0000-0003-2277-8860
Douglas Vitório, Centro de Informática, Federal University of Pernambuco, Brazil. ORCID: https://orcid.org/0000-0003-2285-574X
Nádia F.F. da Silva, Institute of Informatics, Federal University of Goiás, Brazil. ORCID: https://orcid.org/0000-0002-3875-2211


Keywords:

Active Learning, Named Entity Recognition (NER), Selection Criteria, Information Retrieval, Natural Language Processing, Machine Learning

Abstract

This work investigates Active Learning methods for data annotation in Named Entity Recognition (NER) tasks, focusing on Portuguese-language documents from the legal domain. The aim is to identify an algorithm that improves the efficiency of the annotation process and reduces the human cost involved without compromising the quality of classifiers trained on the resulting corpora. Three sample selection methods were explored: (i) Multi-Criteria Active Learning, which uses informativeness, representativeness, and diversity as selection criteria; (ii) Dynamic Selection Guided by Entity Volume; and (iii) Random Sentence Selection, used as a baseline for evaluating the other two. The study used a BERT model for classification and evaluated each approach under different amounts of labeled data (annotation budgets). The results show that, although Multi-Criteria Active Learning performed better in some scenarios, Dynamic Selection Guided by Entity Volume performed consistently well, especially at low annotation budgets, while also being computationally more efficient. The results therefore suggest that the volume of named entities is a good predictor for selecting informative samples. This study contributes to the Active Learning field by applying these techniques to modern language models and by providing efficient solutions for reducing data annotation costs in Named Entity Recognition.
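As a rough illustration of the contrast between entity-volume-guided selection and the random baseline, the following is a minimal sketch and not the authors' implementation. The abstract does not say how entity volume is measured, so the sketch assumes a per-sentence count of predicted entities is already available; the function names, the `budget` parameter, and the example counts are all hypothetical.

```python
import random
from typing import List, Sequence


def select_by_entity_volume(entity_counts: Sequence[int], budget: int) -> List[int]:
    """Return indices of the `budget` sentences with the most predicted entities.

    Assumption: `entity_counts[i]` is the number of entities a current model
    predicts in sentence i. How the paper computes this is not stated here.
    """
    # Rank indices by descending count; sorted() is stable, so ties keep input order.
    ranked = sorted(range(len(entity_counts)), key=lambda i: -entity_counts[i])
    return ranked[:budget]


def select_random(n_sentences: int, budget: int, seed: int = 0) -> List[int]:
    """Baseline: return `budget` distinct sentence indices chosen uniformly at random."""
    rng = random.Random(seed)
    return rng.sample(range(n_sentences), min(budget, n_sentences))


# Hypothetical predicted entity counts for six unlabeled sentences.
counts = [0, 3, 1, 5, 0, 2]
volume_pick = select_by_entity_volume(counts, budget=2)
random_pick = select_random(len(counts), budget=2)
```

Under these assumptions, the entity-volume selector sends the two densest sentences (indices 3 and 1) to the annotator, while the baseline ignores the counts entirely. Ranking by a single precomputed count needs only one sort per round, which is one plausible reason such a criterion would be cheaper than combining several criteria.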


