Comparing Machine Learning and an Expert System for Legal Document Classification

DOI:

https://doi.org/10.59490/dgo.2025.947

Keywords:

machine learning, legal document classification, expert systems, overfitting, natural language processing

Abstract

This study compares machine learning models with a rule-based expert system for classifying legal documents, specifically for distinguishing relevant from irrelevant cases. The evaluated models are Random Forest, Naive Bayes, XGBoost, SVM, and Decision Tree; the expert system was developed by a State Attorney from PGE-PE. The datasets cover three types of legal process (Alvará, Arrolamento, and Inventário) and contain labeled instances of legal cases. All classifiers were assessed on accuracy, precision, recall, and F1-score. The results suggest that machine learning models, particularly Random Forest, achieve higher accuracy and precision, whereas the expert system achieves higher recall and F1-score, ensuring that no relevant cases are overlooked. The choice between the two approaches therefore depends on the legal context and requires a balance between efficiency (reducing false positives) and reliability (capturing all relevant cases).
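The four metrics named above can be illustrated with a minimal sketch. The labels below are invented toy data, not the paper's datasets or results, and the names `binary_metrics`, `cautious`, and `inclusive` are hypothetical. The sketch only shows how a classifier that flags few cases can score higher on accuracy and precision while a classifier that flags more cases scores higher on recall and F1-score.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    accuracy: float
    precision: float
    recall: float
    f1: float


def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 where 1 = relevant case."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # relevant, flagged
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # irrelevant, skipped
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # irrelevant, flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # relevant, missed

    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return Metrics(accuracy, precision, recall, f1)


# Toy ground truth (assumption for illustration): 4 relevant, 6 irrelevant.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# Hypothetical classifier that flags few cases: no false positives,
# but it misses two relevant cases.
cautious = binary_metrics(y_true, [1, 1, 0, 0, 0, 0, 0, 0, 0, 0])

# Hypothetical classifier that flags many cases: it catches every
# relevant case at the cost of three false positives.
inclusive = binary_metrics(y_true, [1, 1, 1, 1, 1, 1, 1, 0, 0, 0])

print("cautious: ", cautious)
print("inclusive:", inclusive)
```

On this toy data the cautious classifier has the higher accuracy and precision, and the inclusive one has the higher recall and F1-score, which is the shape of trade-off the abstract describes between reducing false positives and capturing all relevant cases.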

Published

2025-05-19

How to Cite

de Queiroz Santos Filho, J. J., Araújo Dantas, F., da Silva Lima, M., Barbosa dos Santos, S., Genesis, G., Lima da Salva, M. G., Farias Pinheiro, Á., & Galdino da Silva, E. (2025). Comparing Machine Learning and an Expert System for Legal Document Classification. Conference on Digital Government Research, 26. https://doi.org/10.59490/dgo.2025.947

Section

Research papers