Can Large Language Models Automate Systematic Studies of Literature? Exploring Automated Screening - A Case Study in the Field of Computer Science Education
DOI:
https://doi.org/10.20453/spirat.v3iNE1.5639
Keywords:
computer science education, large language models, systematic literature studies
Abstract
This study evaluates the efficacy of Large Language Models (LLMs) in the screening process of Systematic Literature Studies in Computer Science Education, a domain with a growing number of contributions. Using GPT-4o, Claude-3.5-Sonnet, and Llama-3-70B, it explores the automation of the screening process and compares the results with a manual process carried out by researchers in the area. The data used date from July 2024. The results of the selection process show high sensitivity (≥0.8644) in all models, indicating that at least 86% of the relevant articles are included; Claude-3.5-Sonnet stands out, including 96.6% of the relevant articles. The F1-Score values for Claude-3.5-Sonnet and GPT-4o (≥0.74) show that the models' performance is acceptable for this study's context. Although the low precision (≥0.355) indicates that the models tend to include non-relevant articles, the results suggest that LLMs have significant potential as support tools at the inclusion/exclusion stage, potentially reducing manual review time. However, a hybrid approach that combines automation with human judgment in the final tasks of this stage is recommended.
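The sensitivity, precision, and F1-Score reported in the abstract follow the standard definitions for a binary include/exclude decision. The sketch below shows how they relate; the function name `screening_metrics` and the counts passed to it are hypothetical and are not taken from the study's data.

```python
def screening_metrics(tp: int, fp: int, fn: int) -> dict[str, float]:
    """Compute screening metrics from confusion-matrix counts.

    tp: relevant articles the model included
    fp: non-relevant articles the model included
    fn: relevant articles the model excluded
    """
    # Sensitivity (recall): share of relevant articles that were included.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    # Precision: share of included articles that were actually relevant.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # F1-Score: harmonic mean of precision and sensitivity.
    total = precision + sensitivity
    f1 = 2 * precision * sensitivity / total if total else 0.0
    return {"sensitivity": sensitivity, "precision": precision, "f1": f1}


# Hypothetical counts for illustration only (not the paper's results).
metrics = screening_metrics(tp=45, fp=55, fn=5)
print(metrics)
```

With counts like these, sensitivity is high while precision is low: few relevant articles are missed, but reviewers still have to discard many non-relevant ones by hand. This is the trade-off that motivates the hybrid approach recommended above.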
License
Copyright (c) 2025 Franklin L. Sánchez, Carlos Alario-Hoyos, Danny S. Guamán, Julio C. Caiza

This work is licensed under a Creative Commons Attribution 4.0 International License.












