Can Large Language Models Automate Systematic Studies of Literature? Exploring Automated Screening - A Case Study in the Field of Computer Science Education

Authors

DOI:

https://doi.org/10.20453/spirat.v3iNE1.5639

Keywords:

computer science education, large language models, systematic literature studies

Abstract

This study evaluates how effectively Large Language Models (LLMs) can support the screening process of Systematic Literature Studies in Computer Science Education, a field with a growing volume of contributions. It explores automating the screening process with models such as GPT-4o, Claude-3.5-Sonnet, and Llama-3-70B, and compares their results with a manual screening carried out by researchers in the area. The data date from July 2024. All models show high sensitivity (≥0.8644) in the selection process, meaning that each includes at least 86% of the relevant articles; Claude-3.5-Sonnet includes 96.6% of them. The F1-Score values for Claude-3.5-Sonnet and GPT-4o (≥0.74) indicate acceptable performance in the context of this study. Although the low precision (≥0.355) shows that the models tend to include non-relevant articles, the results suggest that LLMs have significant potential as support tools at the inclusion/exclusion stage and could reduce manual review time. However, a hybrid approach that combines automation with human judgment in the final tasks of this stage is recommended.
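The three metrics reported in the abstract can be illustrated with a minimal sketch that compares a model's include/exclude decisions against the manual ones. The function names and the eight toy decisions below are hypothetical and are not taken from the study; the manual screening is treated as the ground truth, which is an assumption about how the comparison is set up.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ScreeningCounts:
    tp: int  # relevant articles the model included
    fp: int  # non-relevant articles the model included
    fn: int  # relevant articles the model excluded
    tn: int  # non-relevant articles the model excluded


def count_decisions(human: Sequence[bool], model: Sequence[bool]) -> ScreeningCounts:
    """Tally agreement between manual (reference) and model decisions.

    True means "include the article", False means "exclude it".
    """
    if len(human) != len(model):
        raise ValueError("both decision lists must cover the same articles")
    pairs = list(zip(human, model))
    tp = sum(h and m for h, m in pairs)
    fp = sum((not h) and m for h, m in pairs)
    fn = sum(h and (not m) for h, m in pairs)
    tn = sum((not h) and (not m) for h, m in pairs)
    return ScreeningCounts(tp, fp, fn, tn)


def sensitivity(c: ScreeningCounts) -> float:
    """Share of relevant articles that the model included (recall)."""
    denom = c.tp + c.fn
    return c.tp / denom if denom else 0.0


def precision(c: ScreeningCounts) -> float:
    """Share of model-included articles that are actually relevant."""
    denom = c.tp + c.fp
    return c.tp / denom if denom else 0.0


def f1_score(c: ScreeningCounts) -> float:
    """Harmonic mean of precision and sensitivity."""
    denom = 2 * c.tp + c.fp + c.fn
    return 2 * c.tp / denom if denom else 0.0


# Hypothetical toy data: eight articles, not the study's dataset.
human_labels = [True, True, True, False, False, False, False, False]
model_labels = [True, True, False, True, True, False, False, False]

counts = count_decisions(human_labels, model_labels)
print(f"sensitivity={sensitivity(counts):.3f}")
print(f"precision={precision(counts):.3f}")
print(f"F1={f1_score(counts):.3f}")
```

In a screening setting, a missed relevant article (a false negative) is usually costlier than an extra article passed on to human reviewers (a false positive), which is why high sensitivity with modest precision can still be useful when a human check follows.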

Author Biographies

Franklin Leonel Sánchez Catota, Universidad Carlos III de Madrid. Getafe, Spain.

Tenured professor in the Department of Electronics, Telecommunications and Information Networks at the Escuela Politécnica Nacional. PhD student in Telematics Engineering at the Universidad Carlos III de Madrid, where he also earned a Master's degree in Telematics Engineering. His main research area is innovation in the teaching of programming.

Carlos Alario Hoyos, Universidad Carlos III de Madrid. Getafe, Spain.

Tenured professor in the Department of Telematics Engineering at the Universidad Carlos III de Madrid. He holds a PhD in Information and Communication Technologies and a degree in Telecommunications Engineering, both from the Universidad de Valladolid. His skills and experience include research and development in MOOCs and SPOCs, social networks, collaborative learning, and the evaluation of learning experiences.

Danny Santiago Guamán Loachamín, Escuela Politécnica Nacional. Madrid, Spain.

He received a degree in Electronics and Information Networks Engineering from the Escuela Politécnica Nacional, Quito, Ecuador, in 2010, and a PhD from the Universidad Politécnica de Madrid, Madrid, Spain, in 2021.

He is currently an assistant professor at the Escuela Politécnica Nacional. His main research areas are innovation in the teaching of programming and the assessment of privacy and data protection compliance in information systems.

Julio César Caiza Ñacato, Escuela Politécnica Nacional. Madrid, Spain.

He received a degree in Electronics and Information Networks Engineering from the Escuela Politécnica Nacional, Quito, Ecuador (2010), and a PhD from the Universidad Politécnica de Madrid (2020).

He is currently an assistant professor at the Escuela Politécnica Nacional. His main research is in the area of innovation in the teaching of programming.

Published

2025-07-22

How to Cite

Sánchez Catota, F. L., Alario Hoyos, C., Guamán Loachamín, D. S., & Caiza Ñacato, J. C. (2025). Can Large Language Models Automate Systematic Studies of Literature? Exploring Automated Screening - A Case Study in the Field of Computer Science Education. Spirat. Revista Académica De Docencia Y Gestión Universitaria, 3(NE1), e5639. https://doi.org/10.20453/spirat.v3iNE1.5639