
Comparing AI and human feedback at higher education: Level appropriateness, quality and coverage

Published online by Cambridge University Press:  09 September 2025

Yusuf Cengiz*
Affiliation:
School of Foreign Languages, Boğaziçi University, Istanbul, Türkiye
Nur Yigitoglu Aptoula
Affiliation:
Department of Foreign Language Education, Boğaziçi University, Istanbul, Türkiye
*
Corresponding author: Yusuf Cengiz; Email: yusuf.cengiz@bogazici.edu.tr


Type
Research in Progress
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution-NonCommercial-ShareAlike licence (http://creativecommons.org/licenses/by-nc-sa/4.0), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the same Creative Commons licence is used to distribute the re-used or adapted article and the original article is properly cited. The written permission of Cambridge University Press must be obtained prior to any commercial use.
Copyright
© The Author(s), 2025. Published by Cambridge University Press.

1. Introduction

Feedback is recognized as a vital component of second language (L2) writing due to its role in bridging the gap between current and desired performance levels (Hattie et al., 2021; Hattie & Timperley, 2007; Lee, 2017). Corrective feedback practices in L2 writing have been demonstrated to be effective in fostering a deeper understanding of language structures (Bitchener, 2008; Link et al., 2022; Van Beuningen, 2010) and lexical units, as well as improving the contextual (Kang, 2020) and organizational (Yu & Lee, 2014) aspects of written texts.

Although traditional feedback practices have been investigated for many years (Ngo et al., 2024), the use of automated tools for writing feedback has gained significant traction with the introduction of Large Language Models (LLMs) (Steinert et al., 2024; Venter et al., 2024), such as ChatGPT, Gemini, and Copilot. These generative Artificial Intelligence (AI) tools have introduced a paradigm shift in education (O’Dea, 2024), and their use for corrective feedback in writing is being investigated in the recent literature.

Research suggests that integrating AI tools into L2 writing processes can facilitate immediate and personalized feedback opportunities (Kasneci et al., 2023; Venter et al., 2024), while also supporting the development of learners’ self-efficacy (Nazari et al., 2021), analytical thinking, and critical thinking skills (Han & Li, 2024; Li et al., 2024). However, issues have been raised regarding the consistency (Han & Li, 2024) and comprehensiveness (Liu & Chang, 2024) of AI feedback in L2 writing. These tools are susceptible to generating fabricated (Win Myint et al., 2024), oversimplified, or inaccurate responses (Al-khresheh, 2024). Although certain studies have compared human feedback to AI feedback to investigate the extent of its consistency, comprehensiveness, and correctness (Lin & Crosthwaite, 2024; Steiss et al., 2024), scant research has examined these dynamics within authentic classroom settings or made use of prompt engineering opportunities when comparing AI feedback to human feedback.

To investigate the level appropriateness, quality, and coverage of AI feedback compared with human feedback given in a real-life classroom setting, this study aims to answer the following research questions:

  1. What is the perceived quality of AI feedback compared to human feedback, as evaluated by teachers?

  2. How does AI feedback compare to human feedback in terms of coverage of content, grammar, vocabulary, organization, and mechanics?

2. Methodology

2.1. Context and participants

The study was conducted in the pre-departmental intensive English program of a state university in Istanbul, Türkiye. Three English instructors and 56 students participated in the study. The students were 18-year-old Turkish learners of English. Based on CEFR standards (Council of Europe, 2020), 10 students had A1-level English proficiency, 18 had A2-level proficiency, and the remaining 28 had B1-level proficiency. The teachers were aged between 26 and 30, with between 2 and 7 years of teaching experience.

2.2. Data collection

The data were collected throughout the 2024 fall term of the program. As part of the program, the students were required to write a paragraph, and at later stages an essay, on an assigned topic every two weeks. The students produced 117 paragraphs and essays in total.

The instructors provided feedback on the assignments during their office hours. ChatGPT 4o (OpenAI, 2025) was used for AI feedback; the texts were submitted to it using a Python script. The AI prompt included a definition of the context, the task, the feedback areas, a sample text, and sample feedback for that text.
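For illustration, the following is a minimal sketch of such a submission script, assuming the openai Python client (v1+) with an API key in the environment. The prompt wording, directory layout, CEFR levels, and the sample text/feedback pair are placeholders, not the authors' actual materials.

```python
# Minimal sketch of a batch-submission script. Assumptions: the `openai`
# Python client (>=1.0) is installed and OPENAI_API_KEY is set; the prompt
# text and sample feedback below are illustrative, not the study's own.
from pathlib import Path

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = (
    "You are an EFL writing instructor in an intensive English program. "
    "Give corrective feedback on the student's text in five areas: "
    "content, grammar, vocabulary, organization, and mechanics. "
    "Keep the feedback appropriate to the student's CEFR level."
)

# One worked example (sample text + sample feedback), mirroring the
# few-shot structure of the prompt described above.
SAMPLE_TEXT = "my favourite city is istanbul because it have many historical place."
SAMPLE_FEEDBACK = (
    "Mechanics: capitalize 'My' and 'Istanbul'. "
    "Grammar: 'it have' -> 'it has'. "
    "Grammar: 'many historical place' -> 'many historical places'."
)


def get_feedback(student_text: str, cefr_level: str) -> str:
    """Submit one student text to GPT-4o and return the model's feedback."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"Student level: A1\n\n{SAMPLE_TEXT}"},
            {"role": "assistant", "content": SAMPLE_FEEDBACK},
            {"role": "user", "content": f"Student level: {cefr_level}\n\n{student_text}"},
        ],
    )
    return response.choices[0].message.content


if __name__ == "__main__":
    Path("feedback").mkdir(exist_ok=True)
    for path in sorted(Path("texts").glob("*.txt")):  # one file per student text
        result = get_feedback(path.read_text(encoding="utf-8"), cefr_level="B1")
        Path("feedback", path.name).write_text(result, encoding="utf-8")
```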

2.3. Data analysis

Using the error analysis categories developed by Ferris (Reference Ferris, Hyland and Hyland2006) and Ene and Upton (Reference Ene and Upton2014), all human and AI feedback instances were coded by the researchers based on their area of focus, that is, grammar, vocabulary, organization, content, or mechanics.

To answer RQ1, each AI feedback instance was investigated to determine whether (i) it was correct, and (ii) it was appropriate and necessary for the students' level. To answer RQ2, the remaining correct AI feedback instances were juxtaposed with the human feedback to identify overlapping and non-overlapping feedback instances in each feedback area.
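As a schematic illustration of the RQ2 comparison (the actual juxtaposition was carried out manually by the researchers), the sketch below models each coded feedback instance as a record with a text identifier, the targeted error span, and a feedback area, then tallies the per-area matching percentage reported in Table 2. All names are hypothetical.

```python
# Schematic sketch of the RQ2 comparison; the study's coding and matching
# were done by the researchers, not automatically. Each feedback instance
# is modeled as a (text_id, span, category) record.
from collections import Counter
from dataclasses import dataclass

CATEGORIES = ("content", "grammar", "vocabulary", "organization", "mechanics")


@dataclass(frozen=True)
class FeedbackInstance:
    text_id: str   # which student text the feedback targets
    span: str      # the error span addressed (span identification is assumed)
    category: str  # one of CATEGORIES


def matching_percentages(human: list[FeedbackInstance],
                         ai: list[FeedbackInstance]) -> dict[str, float]:
    """Per-area matching percentage: matches / human instances * 100."""
    ai_keys = {(f.text_id, f.span, f.category) for f in ai}
    human_totals = Counter(f.category for f in human)
    matches = Counter(
        f.category for f in human
        if (f.text_id, f.span, f.category) in ai_keys
    )
    return {c: 100 * matches[c] / human_totals[c]
            for c in CATEGORIES if human_totals[c]}
```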

3. Results

3.1. AI feedback quality and level appropriateness

In total, ChatGPT provided 1986 feedback instances across all texts. Table 1 presents the feedback areas of the correct, incorrect, and unnecessary instances.

Table 1. Feedback areas for correct, incorrect, and unnecessary AI feedback instances

In 1750 of these instances (88.11% of all instances), ChatGPT correctly identified the issues with the text and provided level-appropriate, useful feedback. In 126 instances (6.35%), ChatGPT marked a correct student production as incorrect or provided incorrect feedback on an erroneous use, while 110 instances (5.54%) contained unnecessary feedback that was correct but either beyond the students' current proficiency level or not necessary to implement in the texts.

3.2. Comparing AI and human feedback

The teachers in this study provided 1751 instances of feedback for the whole dataset, while the number of correct ChatGPT feedback instances was 1750. The analysis showed that ChatGPT matched 985 (56.29%) of the human feedback instances. Table 2 provides the statistics on these matches by feedback area.

Table 2. Matches between human and correct AI feedback based on feedback areas

Note: The matching percentage for each feedback area is calculated by dividing the number of matches by the total number of human feedback instances and multiplying by 100.

* Human feedback includes 59 (3.37%) instances of personal remarks of the teachers, which could not be classified under these categories and thus were not included in the table or analysis.
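Restating the note above as a formula, with c ranging over the five feedback areas:

```latex
\[
\text{Match}_c(\%) =
\frac{\#\{\text{human feedback instances in area } c \text{ matched by AI}\}}
     {\#\{\text{human feedback instances in area } c\}} \times 100
\]
```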

The highest alignment between AI and human feedback was found in mechanics-related issues – such as punctuation, capitalization, and spelling – with AI covering 67.76% of human feedback. Grammar feedback showed a 65.07% match, while vocabulary issues aligned at 51.12%. Among the subcategories, the highest overlaps were observed in capitalization (100%), word forms (86.96%), spelling (84.76%), subject-verb agreement (84.06%), and idiomatic usage (81.82%).

Conversely, the lowest alignment was observed in content (23.28%) and organization (22.67%). Apart from these categories, the lowest matches occurred in the use of pronouns (35.29%), punctuation (35.82%), and word choice (37.80%).

3.3. AI feedback without corresponding human feedback

While the overall alignment between human and AI feedback was 56.29%, the remaining 858 instances – representing 49.08% of all accurate AI feedback – addressed errors not identified by the teachers. Table 3 presents, by feedback area, the number of these AI feedback instances with no corresponding human feedback.

Table 3. Correct AI feedback without corresponding human feedback

4. Discussion and implications

Even though concerns have been raised about the quality, accuracy (Steiss et al., 2024), and consistency (Lin & Crosthwaite, 2024) of AI-generated feedback in L2 writing, our study revealed that ChatGPT 4o can produce highly accurate feedback: 88.11% of all AI feedback instances correctly addressed the issues, and only 11.89% were classified as inaccurate or unnecessary. Our findings align with the literature showing that AI tools can effectively provide corrective feedback (Jamshed et al., 2024), particularly when context-relevant prompts are utilized (Venter et al., 2024).

Providing feedback within a classroom setting can be challenging, exhausting, and time-consuming for L2 teachers (Dikli & Bleyle, 2014; Lee, 2019; Wilson et al., 2014). Given that ChatGPT 4o aligned with 56.29% of human feedback instances and produced 858 further correct instances (49.08% of all correct AI feedback) that the teachers had not addressed, it can be suggested that ChatGPT 4o offers a valuable feedback mechanism that can meaningfully complement and support human feedback practices (Han & Li, 2024; Lin & Crosthwaite, 2024; Xue, 2024). The data further indicate that ChatGPT 4o could provide extensive feedback on mechanics and grammar. By using AI to identify surface-level errors, language teachers' efforts could be directed towards more attention-demanding issues such as organization and content (Link et al., 2022; Steinert et al., 2024), where AI feedback showed limited alignment in this study.

This study has certain limitations. Student evaluations of ChatGPT 4o's feedback are needed to assess its comprehensibility and perceived usefulness. Additionally, potential biases may have influenced the classification of "unnecessary feedback" instances, as this decision rested on the teachers' evaluation of the feedback according to their own teaching preferences. Moreover, despite prompt optimization, ChatGPT remains susceptible to generating different responses even for the same texts and prompts (Lin & Crosthwaite, 2024).

Supplementary material

The supplementary material for this article can be found at https://doi.org/10.1017/S0261444825000199.

Yusuf Cengiz is a Ph.D. candidate in English Language Teaching at Boğaziçi University, Istanbul, Türkiye, where he also works as an English instructor. His research focuses on second language (L2) writing, corrective feedback, and the use of AI.

Nur Yiğitoğlu Aptoula is an associate professor in the Department of Foreign Language Education at Boğaziçi University, Istanbul, Türkiye. Her current research focuses on L2 writing and L2 teacher education. She is one of the recipients of the 2023 Turkish Academy of Sciences (TÜBA) Outstanding Young Scientists Awards (GEBİP).

Footnotes

A reproduction of the poster discussed is available in the supplementary material published alongside this article on Cambridge Core.

References

Al-khresheh, M. H. (2024). Bridging technology and pedagogy from a global lens: Teachers’ perspectives on integrating ChatGPT in English language teaching. Computers and Education: Artificial Intelligence, 6, 100218. https://doi.org/10.1016/j.caeai.2024.100218
Bitchener, J. (2008). Evidence in support of written corrective feedback. Journal of Second Language Writing, 17(2), 102–118. https://doi.org/10.1016/j.jslw.2007.11.004
Council of Europe. (2020). Common European Framework of Reference for Languages: Companion Volume. Cambridge University Press.
Dikli, S., & Bleyle, S. (2014). Automated essay scoring feedback for second language writers: How does it compare to instructor feedback? Assessing Writing, 22, 1–17. https://doi.org/10.1016/j.asw.2014.03.006
Ene, E., & Upton, T. A. (2014). Learner uptake of teacher electronic feedback in ESL composition. System, 46, 80–95. https://doi.org/10.1016/j.system.2014.07.011
Ferris, D. (2006). Does error feedback help student writers? New evidence on the short- and long-term effects of written error correction. In Hyland, F., & Hyland, K. (Eds.), Feedback in second language writing: Contexts and issues (pp. 81–104). Cambridge University Press. https://doi.org/10.1017/CBO9781139524742.007
Han, J., & Li, M. (2024). Exploring ChatGPT-supported teacher feedback in the EFL context. System, 126, 103502. https://doi.org/10.1016/j.system.2024.103502
Hattie, J., Crivelli, J., Van Gompel, K., West-Smith, P., & Wike, K. (2021). Feedback that leads to improvement in student essays: Testing the hypothesis that “Where to Next” feedback is most powerful. Frontiers in Education, 6. https://doi.org/10.3389/feduc.2021.645758
Hattie, J., & Timperley, H. (2007). The power of feedback. Review of Educational Research, 77(1), 81–112. https://doi.org/10.3102/003465430298487
Jamshed, M., Ahmed, A. S. M. M., Sarfaraj, M., & Warda, W. U. (2024). The impact of ChatGPT on English language learners’ writing skills: An assessment of AI feedback on mobile. International Journal of Interactive Mobile Technologies (iJIM), 18(19), 18–36. https://doi.org/10.3991/ijim.v18i19.50361
Kang, E. Y. (2020). Using model texts as a form of feedback in L2 writing. System, 89, 102196. https://doi.org/10.1016/j.system.2019.102196
Kasneci, E., Sessler, K., Küchemann, S., Bannert, M., Dementieva, D., Fischer, F., Gasser, U., Groh, G., Günnemann, S., Hüllermeier, E., Krusche, S., Kutyniok, G., Michaeli, T., Nerdel, C., Pfeffer, J., Poquet, O., Sailer, M., Schmidt, A., Seidel, T., & Kasneci, G. (2023). ChatGPT for good? On opportunities and challenges of large language models for education. Learning and Individual Differences, 103, 102274. https://doi.org/10.1016/j.lindif.2023.102274
Lee, I. (2017). Classroom writing assessment and feedback in L2 school contexts. Springer. https://doi.org/10.1007/978-981-10-3924-9
Lee, I. (2019). Teacher written corrective feedback: Less is more. Language Teaching, 52(4), 524–536. https://doi.org/10.1017/S0261444819000247
Li, X., Jiang, S., Hu, Y., Feng, X., Chen, W., & Ouyang, F. (2024). Investigating the impact of structured knowledge feedback on collaborative academic writing. Education and Information Technologies, 29(14), 19005–19033. https://doi.org/10.1007/s10639-024-12560-y
Lin, S., & Crosthwaite, P. (2024). The grass is not always greener: Teacher vs. GPT-assisted written corrective feedback. System, 127, 103529. https://doi.org/10.1016/j.system.2024.103529
Link, S., Mehrzad, M., & Rahimi, M. (2022). Impact of automated writing evaluation on teacher feedback, student revision, and writing improvement. Computer Assisted Language Learning, 35(4), 605–634. https://doi.org/10.1080/09588221.2020.1743323
Liu, Y., & Chang, P. (2024). Exploring EFL teachers’ emotional experiences and adaptive expertise in the context of AI advancements: A positive psychology perspective. System, 126, 103463. https://doi.org/10.1016/j.system.2024.103463
Nazari, N., Shabbir, M. S., & Setiawan, R. (2021). Application of Artificial Intelligence powered digital writing assistant in higher education: Randomized controlled trial. Heliyon, 7(5), e07014. https://doi.org/10.1016/j.heliyon.2021.e07014
Ngo, T. T.-N., Chen, H. H.-J., & Lai, K. K.-W. (2024). The effectiveness of automated writing evaluation in EFL/ESL writing: A three-level meta-analysis. Interactive Learning Environments, 32(2), 727–744. https://doi.org/10.1080/10494820.2022.2096642
O’Dea, X. (2024). Generative AI: Is it a paradigm shift for higher education? Studies in Higher Education, 49(5), 811–816. https://doi.org/10.1080/03075079.2024.2332944
OpenAI. (2025). ChatGPT 4o (February 2025 version). https://chat.openai.com/chat
Steinert, S., Avila, K. E., Ruzika, S., Kuhn, J., & Küchemann, S. (2024). Harnessing large language models to develop research-based learning assistants for formative feedback. Smart Learning Environments, 11(1), 62. https://doi.org/10.1186/s40561-024-00354-1
Steiss, J., Tate, T., Graham, S., Cruz, J., Hebert, M., Wang, J., Moon, Y., Tseng, W., Warschauer, M., & Olson, C. B. (2024). Comparing the quality of human and ChatGPT feedback of students’ writing. Learning and Instruction, 91, 101894. https://doi.org/10.1016/j.learninstruc.2024.101894
Van Beuningen, C. (2010). Corrective feedback in L2 writing: Theoretical perspectives, empirical insights, and future directions. International Journal of English Studies, 10(2), 1. https://doi.org/10.6018/ijes/2010/2/119171
Venter, J., Coetzee, S. A., & Schmulian, A. (2024). Exploring the use of artificial intelligence (AI) in the delivery of effective feedback. Assessment and Evaluation in Higher Education, 1–21.
Wilson, J., Olinghouse, N. G., & Andrada, G. N. (2014). Does automated feedback improve writing quality? Learning Disabilities: A Contemporary Journal, 12(1), 93–118. https://eric.ed.gov/?id=EJ1039856
Win Myint, P. Y., Lo, S. L., & Zhang, Y. (2024). Harnessing the power of AI-instructor collaborative grading approach: Topic-based effective grading for semi open-ended multipart questions. Computers and Education: Artificial Intelligence, 7, 100339. https://doi.org/10.1016/j.caeai.2024.100339
Xue, Y. (2024). Towards automated writing evaluation: A comprehensive review with bibliometric, scientometric, and meta-analytic approaches. Education and Information Technologies, 29(15), 19553–19594. https://doi.org/10.1007/s10639-024-12596-0
Yu, S., & Lee, I. (2014). An analysis of Chinese EFL students’ use of first and second language in peer feedback of L2 writing. System, 47, 28–38. https://doi.org/10.1016/j.system.2014.08.007
