Assessing rule-based document segmentation and word normalization for legal ruling classification

Authors

  • Giliard Almeida de Godoi, Institute of Mathematics and Computer Sciences, University of São Paulo, Brazil
  • Adriano Rivolli, Federal University of Technology Paraná, Brazil
  • Daniela Lopes Freire, Institute of Mathematics and Computer Sciences, University of São Paulo, Brazil
  • Fabíola Souza Fernandes Pereira, Federal University of Uberlândia, Brazil
  • Nubia Regina Ventura, Institute of Mathematics and Computer Sciences, University of São Paulo, Brazil
  • Alex Marino Gonçalves de Almeida, Ourinhos College of Technology, Brazil
  • Luís Paulo Faina Garcia, University of Brasília, Brazil
  • Márcio de Souza Dias, Federal University of Catalão (UFCAT), Brazil
  • André Carlos Ponce de Leon Ferreira de Carvalho, Institute of Mathematics and Computer Sciences, University of São Paulo, Brazil

DOI:

https://doi.org/10.59490/dgo.2025.948

Keywords:

Legal document classification, text preprocessing operations, document segmentation and legal terms normalization

Abstract

The Brazilian judiciary has shown promising interest in applying Natural Language Processing (NLP) techniques to various legal tasks. One such application is classifying legal rulings by the topics of recurring appeals. This study investigates two key strategies for preprocessing legal documents, drawing on insights from legal domain experts: whether using specific sections of the document is more effective for legal ruling classification than analyzing the entire document, and which expressions can be normalized to standardize the document vocabulary. The experimental results indicate that combining normalization preprocessing with the extraction of the judge’s manifestation section yields better performance, as measured by the F1 score. Additionally, we demonstrate how the Jaccard similarity index provides valuable insight into the impact of the preprocessing pipeline on the TF-IDF feature extraction method and, by extension, on document representation. This paper underscores the importance of leveraging domain expertise to guide an optimal set of preprocessing operations.
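
The Jaccard comparison mentioned above can be illustrated with a minimal sketch. It is not the authors' pipeline: it compares the raw token vocabularies that a TF-IDF extractor would index, before and after a single normalization rule. The helper names, the regular expression, and the two sample sentences are assumptions made up for illustration.

```python
import re

def vocabulary(docs, normalize=None):
    """Collect the set of lowercase word tokens across docs, after optional normalization."""
    terms = set()
    for doc in docs:
        text = normalize(doc) if normalize else doc
        terms.update(re.findall(r"\w+", text.lower()))
    return terms

def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B|; defined here as 1.0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def normalize_articles(text):
    """Hypothetical rule: collapse statute article citations into one token."""
    return re.sub(r"\bart(?:igo|\.)?\s*\d+", "ARTIGO", text, flags=re.IGNORECASE)

# Invented sample sentences, not taken from the paper's corpus.
docs = ["Nos termos do art. 5 da lei", "Conforme o artigo 37 da lei"]

raw_vocab = vocabulary(docs)
norm_vocab = vocabulary(docs, normalize_articles)
score = jaccard(raw_vocab, norm_vocab)
print(f"raw={len(raw_vocab)} normalized={len(norm_vocab)} jaccard={score:.3f}")
```

A score near 1 would suggest that the preprocessing step barely changes the feature vocabulary, while a lower score indicates that many terms were merged or removed.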

Published

2025-05-19

How to Cite

Almeida de Godoi, G., Rivolli, A., Lopes Freire, D., Souza Fernandes Pereira, F., Regina Ventura, N., Marino Gonçalves de Almeida, A., Paulo Faina Garcia, L., de Souza Dias, M., & Carlos Ponce de Leon Ferreira de Carvalho, A. (2025). Assessing rule-based document segmentation and word normalization for legal ruling classification. Conference on Digital Government Research, 1. https://doi.org/10.59490/dgo.2025.948