Automating Test Design Using LLM

Results from an Empirical Study on the Public Sector

Authors

  • Artur Raffael Cavalcanti Computer Science Department, Federal Rural University of Pernambuco, Brazil https://orcid.org/0009-0005-9079-766X
  • Luan Accioly Computer Science Department, Federal Rural University of Pernambuco, Brazil
  • George Valença Computer Science Department, Federal Rural University of Pernambuco, Brazil https://orcid.org/0000-0001-9375-5354
  • Sidney C. Nogueira Computer Science Department, Federal Rural University of Pernambuco, Brazil https://orcid.org/0000-0002-8817-5029
  • Ana Carolina Morais Tribunal de Contas do Estado de Pernambuco, Brazil
  • Antônio Oliveira Tribunal de Contas do Estado de Pernambuco, Brazil
  • Sérgio Gomes Tribunal de Contas do Estado de Pernambuco, Brazil

DOI:

https://doi.org/10.59490/dgo.2025.1025

Keywords:

automated test design, LLM, user story, system test cases

Abstract

An efficient test process can detect failures earlier in software development, contributing to the quality of software produced by governmental entities and potentially improving government service delivery. Given the high cost of testing and the limited resources available, automating test activities plays a strategic role in improving testing efficiency. Designing test cases from user stories is a common approach to assessing the quality of a system under test. This paper reports on the research, implementation, and evaluation of a tool that automatically generates system test designs from user stories with the support of Generative Pre-trained Transformer 4 (GPT-4) in the context of a public sector organization. The tool was conceived to match the needs and particularities of the organization's test process: it reads user stories from the Redmine tool, interacts with GPT-4 using a prompt that outputs test cases, and stores the automatically designed tests in the Squash TM test management tool. The organization's test analysts stated that the tool produces good-quality tests and reduces the effort of creating them, allowing analysts to devote more energy to other testing-related activities. Moreover, a comparison of the tests designed manually by test analysts with the tests designed by the tool shows that both achieve the same functional coverage. The paper discusses the impacts of the approach on the process, its limitations, and related and future work.
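The workflow summarized above (fetch user stories from Redmine, prompt GPT-4 to design test cases, store the results in Squash TM) could be sketched roughly as follows. All function names, the prompt wording, and the expected output format are illustrative assumptions, not the authors' actual implementation; the HTTP integrations with Redmine, the OpenAI API, and Squash TM are deliberately omitted.

```python
# Illustrative sketch of the pipeline described in the abstract.
# The prompt text and the parsing convention below are assumptions;
# the tool's real prompt and integrations are not shown in this page.

def build_prompt(user_story: str) -> str:
    """Wrap a user story (as read from Redmine) in a test-design prompt."""
    return (
        "You are a test analyst. Design system test cases for the "
        "following user story. Number each test case and give it a "
        "title, steps, and an expected result.\n\n"
        f"User story:\n{user_story}"
    )

def parse_test_cases(llm_output: str) -> list[dict]:
    """Split a numbered LLM response into test-case dicts suitable for
    uploading to a test management tool such as Squash TM."""
    cases = []
    for block in llm_output.strip().split("\n\n"):
        lines = [line for line in block.splitlines() if line.strip()]
        if not lines:
            continue
        # First line is assumed to be "N. Title"; remaining lines are the body.
        title = lines[0].lstrip("0123456789. ").strip()
        cases.append({"title": title, "body": "\n".join(lines[1:])})
    return cases
```

In a real deployment the parsed dictionaries would be posted to the test management tool's API; here they simply illustrate the intermediate representation between the LLM response and the stored test design.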


References

Augustine, S. (2005). Managing agile projects. Prentice Hall PTR.

Bahrami, M., Mansoorizadeh, M., & Khotanlou, H. (2023). Few-shot learning with prompting methods. 2023 6th International Conference on Pattern Recognition and Image Analysis (IPRIA), 1–5. DOI: https://doi.org/10.1109/IPRIA59240.2023.10147172.

Carvalho, G., Barros, F., Carvalho, A., Cavalcanti, A., Mota, A., & Sampaio, A. (2015). NAT2TEST tool: From natural language requirements to test cases based on CSP. In R. Calinescu & B. Rumpe (Eds.), Software engineering and formal methods (pp. 283–290). Springer International Publishing.

Chinnaswamy, A., Sabarish, B. A., & Deepak Menan, R. (2024). User story based automated test case generation using NLP. In M. L. Owoc, F. E. Varghese Sicily, K. Rajaram, & P. Balasundaram (Eds.), Computational intelligence in data science (pp. 156–166). Springer Nature Switzerland.

Claude. (2025). [link]

Dias Neto, A. C., Subramanyan, R., Vieira, M., & Travassos, G. H. (2007). A survey on model-based testing approaches: A systematic review. Proceedings of the 1st ACM International Workshop on Empirical Assessment of Software Engineering Languages and Technologies: Held in Conjunction with the 22nd IEEE/ACM International Conference on Automated Software Engineering (ASE) 2007, 31–36. DOI: https://doi.org/10.1145/1353673.1353681.

Garousi, V., & Varma, T. (2010). A replicated survey of software testing practices in the Canadian province of Alberta: What has changed from 2004 to 2009? [Interplay between Usability Evaluation and Software Development]. Journal of Systems and Software, 83(11), 2251–2262. DOI: https://doi.org/10.1016/j.jss.2010.07.012.

Gemini. (2025). [link]

GPT-4. (2025). [link]

Granda, M. F., Parra, O., & Alba-Sarango, B. (2021). Towards a model-driven testing framework for GUI test cases generation from user stories. Proceedings of the 16th International Conference on Evaluation of Novel Approaches to Software Engineering (ENASE), 453–460. SciTePress. ISBN: 978-989-758-508-1. DOI: https://doi.org/10.5220/0010499004530460.

Kushwaha, D. S., & Misra, A. K. (2008). Software test effort estimation. SIGSOFT Softw. Eng. Notes, 33(3). DOI: https://doi.org/10.1145/1360602.1361211.

Leotta, M., Yousaf, H. Z., Ricca, F., & Garcia, B. (2024). AI-generated test scripts for web E2E testing with ChatGPT and Copilot: A preliminary study. Proceedings of the 28th International Conference on Evaluation and Assessment in Software Engineering, 339–344. DOI: https://doi.org/10.1145/3661167.3661192.

Likert scale. (2006). [link]

Lins, L., Accioly, L., Nogueira, S., Valença, G., Machado, A., Lira, A., & Gomes, S. (2023). Evolução do processo de testes do TCE-PE: Resultados preliminares de um projeto de BPM [Evolution of the TCE-PE test process: Preliminary results of a BPM project]. Anais Estendidos do XIX Simpósio Brasileiro de Sistemas de Informação, 120–122. DOI: https://doi.org/10.5753/sbsi_estendido.2023.229378.

Mathur, A., Pradhan, S., Soni, P., Patel, D., & Regunathan, R. (2023). Automated test case generation using T5 and GPT-3. 2023 9th International Conference on Advanced Computing and Communication Systems (ICACCS), 1, 1986–1992. DOI: https://doi.org/10.1109/ICACCS57279.2023.10112971.

Mollah, H., & van den Bos, P. (2023). From user stories to end-to-end web testing. 2023 IEEE International Conference on Software Testing, Verification and Validation Workshops (ICSTW), 140–148. DOI: https://doi.org/10.1109/ICSTW58534.2023.00036.

Myers, G. J., & Sandler, C. (2004). The art of software testing. John Wiley & Sons, Inc.

Ng, S., Murnane, T., Reed, K., Grant, D., & Chen, T. (2004). A preliminary survey on software testing practices in australia. 2004 Australian Software Engineering Conference. Proceedings., 116–125. DOI: https://doi.org/10.1109/ASWEC.2004.1290464.

Nogueira, S., Araujo, H., Araujo, R., Iyoda, J., & Sampaio, A. (2019). Test case generation, selection and coverage from natural language. Science of Computer Programming, 181, 84–110. DOI: https://doi.org/10.1016/j.scico.2019.01.003.

Nomura, N., Kikushima, Y., & Aoyama, M. (2013). Business-driven acceptance testing methodology and its practice for e-government software systems. 2013 20th Asia-Pacific Software Engineering Conference (APSEC), 2, 99–104. DOI: https://doi.org/10.1109/APSEC.2013.122.

Pandit, P., & Tahiliani, S. (2015). AgileUAT: A framework for user acceptance testing based on user stories and acceptance criteria. International Journal of Computer Applications, 120(10), 16–21. DOI: https://doi.org/10.5120/21262-3533.

Raharjana, I. K., Siahaan, D., & Fatichah, C. (2021). User stories and natural language processing: A systematic literature review. IEEE Access, 9, 53811–53826. DOI: https://doi.org/10.1109/ACCESS.2021.3070606.

Redmine. (2006). [link]

Ross, S. I., Martinez, F., Houde, S., Muller, M., & Weisz, J. D. (2023). The programmer’s assistant: Conversational interaction with a large language model for software development. Proceedings of the 28th International Conference on Intelligent User Interfaces, 491–514. DOI: https://doi.org/10.1145/3581641.3584037.

Al-Tarawneh, E. S., & Al-Tarawneh, L. K. (2022). Introducing comprehensive software quality model for evaluating and developing e-government websites in Jordan using fuzzy analytical hierarchy process. Webology, 19(1), 890–902.

Sivaji, A., Abdullah, A., & Downe, A. G. (2011). Usability testing methodology: Effectiveness of heuristic evaluation in e-government website development. 2011 Fifth Asia Modelling Symposium, 68–72. DOI: https://doi.org/10.1109/AMS.2011.24.

Sommerville, I. (2007). Software engineering. Addison-Wesley. [link]

Squash TM. (2025). [link]

Published

2025-05-23

How to Cite

Cavalcanti, A. R., Accioly, L., Valença, G., Nogueira, S. C., Morais, A. C., Oliveira, A., & Gomes, S. (2025). Automating Test Design Using LLM: Results from an Empirical Study on the Public Sector. Conference on Digital Government Research, 26. https://doi.org/10.59490/dgo.2025.1025

Section

Research papers