Assessing rule-based document segmentation and word normalization for legal ruling classification
DOI: https://doi.org/10.59490/dgo.2025.948
Keywords: Legal document classification, text preprocessing operations, document segmentation and legal terms normalization
Abstract
The Brazilian judiciary has shown promising interest in applying Natural Language Processing (NLP) techniques to various legal tasks. One such application is classifying legal rulings by the topics of recurring appeals. This study investigates two key strategies for preprocessing legal documents, drawing on insights from legal domain experts: whether using specific sections of the document is more effective for legal ruling classification than analyzing the entire document, and which expressions can be normalized to standardize the document vocabulary. The experimental results indicate that combining normalization preprocessing with the extraction of the judge’s manifestation section yields better performance, as measured by the F1 score. Additionally, we demonstrate how the Jaccard similarity index provides valuable insight into the impact of the preprocessing pipeline on the TF-IDF feature extraction method and, by extension, on document representation. This paper underscores the importance of leveraging domain expertise to guide an optimal set of preprocessing operations.
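The use of the Jaccard index described above can be illustrated with a minimal sketch. The abstract does not say exactly which sets are compared, so this example assumes the index is computed between the token vocabularies obtained with and without normalization (the terms a TF-IDF extractor would turn into features). The normalization rules, function names, and sample sentences are hypothetical and are not taken from the paper.

```python
import re

# Hypothetical normalization rules (not from the paper): map variant
# spellings of a legal expression to one canonical token.
NORMALIZATION_RULES = [
    (re.compile(r"\bart(?:igo|\.)?\s*(\d+)", re.IGNORECASE), r"artigo_\1"),
    (re.compile(r"\brecurso\s+especial\b|\bresp\b", re.IGNORECASE), "recurso_especial"),
]


def normalize(text):
    """Apply each rule in order, then lowercase the result."""
    for pattern, replacement in NORMALIZATION_RULES:
        text = pattern.sub(replacement, text)
    return text.lower()


def vocabulary(documents):
    """Collect the set of lowercase word tokens across all documents."""
    vocab = set()
    for doc in documents:
        vocab.update(re.findall(r"\w+", doc.lower()))
    return vocab


def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Illustrative sentences only; they are not drawn from the study's corpus.
docs = [
    "O REsp foi provido nos termos do art. 105 da Constituição.",
    "Recurso especial não conhecido, conforme artigo 105.",
]

raw_vocab = vocabulary(docs)
norm_vocab = vocabulary([normalize(d) for d in docs])
score = jaccard(raw_vocab, norm_vocab)
print(f"raw terms: {len(raw_vocab)}, normalized terms: {len(norm_vocab)}, "
      f"Jaccard: {score:.3f}")
```

Under this assumption, a score close to 1 means that normalization barely changed the feature vocabulary, while a lower score means that many variant terms were merged or replaced, which in turn changes the document representation seen by the classifier.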
License
Copyright (c) 2025 Giliard Almeida de Godoi, Adriano Rivolli, Daniela Lopes Freire, Fabíola Souza Fernandes Pereira, Nubia Regina Ventura, Alex Marino Gonçalves de Almeida, Luís Paulo Faina Garcia, Márcio de Souza Dias, André Carlos Ponce de Leon Ferreira de Carvalho

This work is licensed under a Creative Commons Attribution 4.0 International License.
