Applying Active Learning in Named Entity Recognition Corpora Expansion in Legal Domain
DOI: https://doi.org/10.59490/dgo.2025.945
Keywords: Active Learning, Named Entity Recognition (NER), Selection Criteria, Information Retrieval, Natural Language Processing, Machine Learning
Abstract
This work investigates Active Learning methods for data annotation in Named Entity Recognition (NER) tasks, primarily in Portuguese-language documents from the legal domain. The aim is to identify an algorithm that improves the efficiency of the annotation process and reduces the human cost involved without compromising the quality of the classifiers trained on the resulting corpora. Three sample selection methods were explored: (i) Multi-Criteria Active Learning, which uses informativeness, representativeness, and diversity as selection criteria; (ii) Dynamic Selection Guided by Entity Volume; and (iii) Random Sentence Selection, used as a baseline for evaluating the other two. The study used the BERT model for classification, with different amounts of labeled data (annotation budgets) for each approach. The results show that, although Multi-Criteria Active Learning performed better in some scenarios, Dynamic Selection Guided by Entity Volume performed consistently well, especially at low annotation budgets, while also being computationally more efficient. The analysis therefore suggests that the volume of named entities is a good predictor for selecting informative samples. This study contributes to the Active Learning field by applying these techniques to modern language models and by providing efficient solutions for reducing data annotation costs in Named Entity Recognition.
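The abstract does not spell out how the selection methods are implemented, so the following is only a minimal sketch of the general idea behind entity-volume-guided selection next to a random baseline. It assumes that the current model has already tagged each unlabeled sentence with BIO labels, and that "entity volume" can be approximated by the number of predicted entity spans. The function names, the tag set, and the toy pool are all hypothetical, and the "dynamic" aspect of the authors' method is not modeled here.

```python
import random
from typing import List, Sequence


def count_entities(tags: Sequence[str]) -> int:
    """Count entity spans in a BIO tag sequence (each span opens with a 'B-' tag)."""
    return sum(1 for tag in tags if tag.startswith("B-"))


def select_by_entity_volume(predicted_tags: List[Sequence[str]], budget: int) -> List[int]:
    """Return indices of the `budget` sentences with the most predicted entities.

    Assumption: `predicted_tags[i]` holds the model's BIO predictions for
    unlabeled sentence i. Ties keep their original pool order (stable sort).
    """
    ranked = sorted(
        range(len(predicted_tags)),
        key=lambda i: count_entities(predicted_tags[i]),
        reverse=True,
    )
    return ranked[:budget]


def select_random(pool_size: int, budget: int, seed: int = 0) -> List[int]:
    """Baseline: pick `budget` distinct sentence indices uniformly at random."""
    rng = random.Random(seed)
    return rng.sample(range(pool_size), min(budget, pool_size))


# Hypothetical predictions for a pool of four unlabeled sentences.
pool_tags = [
    ["O", "O", "O"],                                   # 0 entities
    ["B-PER", "I-PER", "O", "B-ORG"],                  # 2 entities
    ["B-LOC", "O"],                                    # 1 entity
    ["B-LAW", "I-LAW", "O", "B-ORG", "O", "B-PER"],    # 3 entities
]

to_annotate = select_by_entity_volume(pool_tags, budget=2)
baseline = select_random(len(pool_tags), budget=2, seed=42)
```

In this sketch, ranking costs only one pass over the model's predictions, which is one plausible reason a volume-based criterion could be cheaper than criteria that compare sentences against each other, such as representativeness or diversity.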
License
Copyright (c) 2025 Rafael P. Gouveia, André C.P.L.F. de Carvalho, Ellen Souza, Hidelberg O. Albuquerque, Douglas Vitório, Nádia F.F. da Silva

This work is licensed under a Creative Commons Attribution 4.0 International License.