Assessing rule-based document segmentation and word normalization for legal ruling classification

Authors

  • Giliard Almeida de Godoi, Institute of Mathematics and Computer Sciences, University of São Paulo, Brazil
  • Adriano Rivolli, Federal University of Technology Paraná, Brazil
  • Daniela Lopes Freire, Institute of Mathematics and Computer Sciences, University of São Paulo, Brazil
  • Fabíola Souza Fernandes Pereira, Federal University of Uberlândia, Brazil
  • Nubia Regina Ventura, Institute of Mathematics and Computer Sciences, University of São Paulo, Brazil
  • Alex Marino Gonçalves de Almeida, Ourinhos College of Technology, Brazil
  • Luís Paulo Faina Garcia, University of Brasília, Brazil
  • Márcio de Souza Dias, Federal University of Catalão (UFCAT), Brazil
  • André Carlos Ponce de Leon Ferreira de Carvalho, Institute of Mathematics and Computer Sciences, University of São Paulo, Brazil

DOI:

https://doi.org/10.59490/dgo.2025.948

Keywords:

Legal document classification, text preprocessing operations, document segmentation and legal terms normalization

Abstract

The Brazilian judiciary has shown growing interest in applying Natural Language Processing (NLP) techniques to legal tasks. One such application is classifying legal rulings by the topics of recurring appeals. Drawing on insights from legal domain experts, this study investigates two preprocessing questions: whether using specific sections of a document is more effective for legal ruling classification than using the entire document, and which expressions can be normalized to standardize the document vocabulary. The experimental results indicate that combining normalization with extraction of the judge's manifestation section yields better performance, as measured by the F1 score. We also show how the Jaccard similarity index offers insight into the effect of the preprocessing pipeline on TF-IDF feature extraction and, by extension, on document representation. The paper underscores the importance of using domain expertise to guide the choice of preprocessing operations.
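
As a rough illustration of the idea in the abstract, the sketch below compares the vocabulary produced with and without a normalization step by means of the Jaccard index, J(A, B) = |A ∩ B| / |A ∪ B|. It is not the authors' pipeline: the two regular-expression rules, the placeholder tokens `artigo_ref` and `lei_ref`, the sample sentences, and the function names are all assumptions made for this example, and a plain token vocabulary stands in for the TF-IDF feature set.

```python
import re

# Hypothetical normalization rules (assumed for illustration only; the
# expert-curated rules used in the paper are not listed on this page).
NORMALIZATION_RULES = [
    (re.compile(r"\bart\.\s*\d+", re.IGNORECASE), "artigo_ref"),
    (re.compile(r"\blei\s+\d+", re.IGNORECASE), "lei_ref"),
]


def normalize(text):
    """Replace each matched legal expression with a single placeholder token."""
    for pattern, placeholder in NORMALIZATION_RULES:
        text = pattern.sub(placeholder, text)
    return text


def vocabulary(docs, use_normalization=False):
    """Return the set of lowercase word tokens found in the documents."""
    vocab = set()
    for doc in docs:
        if use_normalization:
            doc = normalize(doc)
        vocab.update(re.findall(r"\w+", doc.lower()))
    return vocab


def jaccard(a, b):
    """Jaccard index of two sets; two empty sets are treated as identical."""
    union = a | b
    if not union:
        return 1.0
    return len(a & b) / len(union)


# Invented sample sentences, not taken from the paper's dataset.
docs = [
    "Art. 5 da Lei 8078 aplica-se ao recurso",
    "Art. 12 da Lei 9099 aplica-se ao recurso",
]

raw_vocab = vocabulary(docs)
norm_vocab = vocabulary(docs, use_normalization=True)

print(len(raw_vocab), len(norm_vocab))
print(round(jaccard(raw_vocab, norm_vocab), 3))
```

In this toy case the normalized vocabulary is smaller than the raw one, and the Jaccard index between the two quantifies how much of the feature space the normalization step changes. With a real TF-IDF vectorizer, the same comparison could be applied to the fitted feature names of each pipeline.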

Published

2025-05-19

How to Cite

Almeida de Godoi, G., Rivolli, A., Lopes Freire, D., Souza Fernandes Pereira, F., Regina Ventura, N., Marino Gonçalves de Almeida, A., Paulo Faina Garcia, L., de Souza Dias, M., & Carlos Ponce de Leon Ferreira de Carvalho, A. (2025). Assessing rule-based document segmentation and word normalization for legal ruling classification. Conference on Digital Government Research, 26. https://doi.org/10.59490/dgo.2025.948

Section

Research papers