Applying Active Learning in Named Entity Recognition Corpora Expansion in Legal Domain

Authors

DOI:

https://doi.org/10.59490/dgo.2025.945

Keywords:

Active Learning, Named Entity Recognition (NER), Selection Criteria, Information Retrieval, Natural Language Processing, Machine Learning

Abstract

This work investigates Active Learning methods for data annotation in Named Entity Recognition (NER) tasks, focusing on legal-domain documents in Portuguese. Its aim is to identify an algorithm that improves the efficiency of the annotation process and reduces the human cost involved, without compromising the quality of the classifiers trained on the resulting corpora. Three sample selection methods were explored: (i) Multi-Criteria Active Learning, which uses informativeness, representativeness, and diversity as selection criteria; (ii) Dynamic Selection Guided by Entity Volume; and (iii) Random Sentence Selection, used as a baseline for evaluating the other two. The study used the BERT model for classification and gave each approach different amounts of labeled data (annotation budgets). The results show that, although Multi-Criteria Active Learning performed better in some scenarios, Dynamic Selection Guided by Entity Volume performed consistently well, especially at low annotation budgets, while also being computationally more efficient. The results thus suggest that the volume of named entities is a good predictor for selecting informative samples. This study contributes to the Active Learning field by applying these techniques to modern language models and providing efficient solutions for reducing data annotation costs in Named Entity Recognition.
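To make the contrast between entity-volume selection and the random baseline concrete, the following is a minimal sketch and not the authors' implementation. It assumes that "entity volume" means the number of entity spans the current model predicts in a sentence, and that selection simply takes the highest-scoring sentences up to the annotation budget; the abstract does not specify either detail. The names `count_entities`, `select_by_entity_volume`, `select_random`, and `toy_predict` are hypothetical, and `toy_predict` is a crude stand-in for a BERT tagger.

```python
import random
from typing import Callable, List, Sequence

Sentence = List[str]


def count_entities(tags: Sequence[str]) -> int:
    """Count entity spans in a BIO tag sequence (each B- tag opens one span)."""
    return sum(1 for tag in tags if tag.startswith("B-"))


def select_by_entity_volume(
    pool: Sequence[Sentence],
    predict_tags: Callable[[Sentence], List[str]],
    budget: int,
) -> List[int]:
    """Return indices of the `budget` sentences with the most predicted entities.

    Assumption: ranking by predicted entity count approximates the paper's
    entity-volume criterion. Ties are broken by original position.
    """
    scored = [(count_entities(predict_tags(sent)), idx) for idx, sent in enumerate(pool)]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scored[:budget]]


def select_random(pool: Sequence[Sentence], budget: int, seed: int = 0) -> List[int]:
    """Baseline: pick `budget` sentence indices uniformly at random."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(len(pool)), min(budget, len(pool))))


def toy_predict(sentence: Sentence) -> List[str]:
    """Hypothetical tagger: marks every capitalized token as an entity start."""
    return ["B-ENT" if token[:1].isupper() else "O" for token in sentence]


pool = [
    ["o", "réu", "foi", "citado"],
    ["Maria", "Silva", "processou", "Banco", "Central"],
    ["João", "recorreu"],
]

print(select_by_entity_volume(pool, toy_predict, budget=2))  # → [1, 2]
print(select_random(pool, budget=2))
```

In an actual active learning loop, the selected sentences would be sent to annotators, added to the labeled set, and the classifier retrained before the next selection round; the scoring step here needs only one forward pass per sentence, which is consistent with the abstract's remark on computational efficiency.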

Published

2025-05-19

How to Cite

Gouveia, R. P., de Carvalho, A. C., Souza, E., Albuquerque, H. O., Vitório, D., & da Silva, N. F. (2025). Applying Active Learning in Named Entity Recognition Corpora Expansion in Legal Domain. Conference on Digital Government Research, 1. https://doi.org/10.59490/dgo.2025.945