Assessing rule-based document segmentation and word normalization for legal ruling classification
DOI:
https://doi.org/10.59490/dgo.2025.948
Keywords:
Legal document classification, text preprocessing operations, document segmentation and legal terms normalization
Abstract
The Brazilian judiciary has shown promising interest in applying Natural Language Processing (NLP) techniques to various legal tasks. One such application is classifying legal rulings by the topics of recurring appeals. This study investigates two key strategies for preprocessing legal documents, drawing on insights from legal domain experts: whether using specific sections of the document is more effective for legal ruling classification than analyzing the entire document, and which expressions can be normalized to standardize the document vocabulary. The experimental results indicate that combining normalization preprocessing with the extraction of the judge’s manifestation section yields better performance, as measured by the F1 score. Additionally, we demonstrate how the Jaccard similarity index provides valuable insight into the impact of the preprocessing pipeline on the TF-IDF feature extraction method and, by extension, on document representation. This paper underscores the importance of leveraging domain expertise to guide an optimal set of preprocessing operations.
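As a rough illustration of the Jaccard comparison mentioned in the abstract, the sketch below measures how much the extracted vocabulary changes when a normalization step is added to a preprocessing pipeline. The sample sentences, the single rule in `NORMALIZATION_RULES`, and all function names are hypothetical and not taken from the paper. The sketch also compares plain token sets for simplicity, whereas the paper applies the index to TF-IDF features.

```python
import re

# Hypothetical normalization rule (illustrative only, not the paper's rules):
# map "art. 5" and "artigo 5" to the single canonical token "artigo_5".
NORMALIZATION_RULES = [
    (re.compile(r"\bart(?:igo|\.)?\s*(\d+)", re.IGNORECASE), r"artigo_\1"),
]


def normalize(text):
    """Apply every normalization rule to the text, in order."""
    for pattern, replacement in NORMALIZATION_RULES:
        text = pattern.sub(replacement, text)
    return text


def vocabulary(docs, preprocess=None):
    """Collect the set of lowercase word tokens found across all documents."""
    vocab = set()
    for doc in docs:
        if preprocess is not None:
            doc = preprocess(doc)
        vocab.update(re.findall(r"\w+", doc.lower()))
    return vocab


def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical sample sentences, not drawn from the paper's corpus.
docs = [
    "Nos termos do art. 5 da Constituição",
    "Conforme o artigo 5 da Constituição",
]

raw_vocab = vocabulary(docs)
norm_vocab = vocabulary(docs, preprocess=normalize)

# A value near 1 means normalization barely changed the vocabulary;
# lower values mean the two pipelines yield noticeably different features.
print(round(jaccard(raw_vocab, norm_vocab), 3))
```

In this toy case, normalization collapses `art` and `artigo` into one token, so the two vocabularies overlap only partially. The same comparison could be applied to the top-weighted terms from a TF-IDF extractor under each pipeline.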
License
Copyright (c) 2025 Giliard Almeida de Godoi, Adriano Rivolli, Daniela Lopes Freire, Fabíola Souza Fernandes Pereira, Nubia Regina Ventura, Alex Marino Gonçalves de Almeida, Luís Paulo Faina Garcia, Márcio de Souza Dias, André Carlos Ponce de Leon Ferreira de Carvalho

This work is licensed under a Creative Commons Attribution 4.0 International License.