1. Introduction
Feedback is recognized as a vital component of second language (L2) writing due to its role in bridging the gap between current and desired performance levels (Hattie et al., Reference Hattie, Crivelli, Van Gompel, West-Smith and Wike2021; Hattie & Timperley, Reference Hattie and Timperley2007; Lee, Reference Lee2017). Corrective feedback practices in L2 writing have been demonstrated to be effective in fostering a deeper understanding of language structures (Bitchener, Reference Bitchener2008; Link et al., Reference Link, Mehrzad and Rahimi2022; Van Beuningen, Reference Van Beuningen2010) and lexical units, as well as improving the contextual (Kang, Reference Kang2020) and organizational (Yu & Lee, Reference Yu and Lee2014) aspects of written texts.
Although traditional feedback practices have been investigated for many years (Ngo et al., Reference Ngo, Chen and Lai2024), the use of automated tools for writing feedback has gained significant traction with the introduction of Large Language Models (LLMs) (Steinert et al., Reference Steinert, Avila, Ruzika, Kuhn and Küchemann2024; Venter et al., Reference Venter, Coetzee and Schmulian2024), such as ChatGPT, Gemini, and Copilot. These generative Artificial Intelligence (AI) tools have introduced a paradigm shift within the field of education (O’Dea, Reference O’Dea2024), and their use for corrective feedback in writing is being investigated in the recent literature.
Research suggests that integrating AI tools in L2 writing processes can facilitate immediate and personalized (Kasneci et al., Reference Kasneci, Sessler, Küchemann, Bannert, Dementieva, Fischer, Gasser, Groh, Günnemann, Hüllermeier, Krusche, Kutyniok, Michaeli, Nerdel, Pfeffer, Poquet, Sailer, Schmidt, Seidel and Kasneci2023; Venter et al., Reference Venter, Coetzee and Schmulian2024) feedback opportunities, while also supporting the development of learners’ self-efficacy (Nazari et al., Reference Nazari, Shabbir and Setiawan2021), analytical thinking, and critical thinking skills (Han & Li, Reference Han and Li2024; Li et al., Reference Li, Jiang, Hu, Feng, Chen and Ouyang2024). However, issues have been raised regarding the consistency (Han & Li, Reference Han and Li2024) and comprehensiveness (Liu & Chang, Reference Liu and Chang2024) of AI feedback in L2 writing. These tools are susceptible to generating fabricated (Win Myint et al., Reference Win Myint, Lo and Zhang2024), oversimplified, or inaccurate responses (Al-khresheh, Reference Al-khresheh2024). Although certain studies have compared human feedback to AI feedback to investigate the extent of its consistency, comprehensiveness, and correctness (Lin & Crosthwaite, Reference Lin and Crosthwaite2024; Steiss et al., Reference Steiss, Tate, Graham, Cruz, Hebert, Wang, Moon, Tseng, Warschauer and Olson2024), scant research has examined these dynamics within authentic classroom settings or made use of prompt engineering opportunities when comparing AI feedback to human feedback.
To investigate the appropriateness, quality, and coverage of AI feedback compared to human feedback given in a real-life classroom setting, this study aims to answer the following research questions:
1. What is the perceived quality of AI feedback compared to human feedback, as evaluated by teachers?
2. How does AI feedback compare to human feedback in terms of coverage of content, grammar, vocabulary, organization, and mechanics?
2. Methodology
2.1. Context and participants
The study was conducted at the pre-departmental, intensive English program of a state university in Istanbul, Türkiye. Three English instructors and 56 students participated in the study. The students were 18-year-old Turkish learners of English. Based on CEFR (Council of Europe, 2020) standards, 10 of the students had A1-level English proficiency, 18 students had A2-level proficiency, while the remaining 28 students had B1-level proficiency. The teachers were aged between 26 and 30, and their years of experience ranged between 2 and 7.
2.2. Data collection
The data were collected throughout the 2024 fall term of the program. As part of their program, the students were required to write a paragraph, and at later stages an essay, on a given topic on a biweekly basis. The total number of paragraphs and essays produced by the students was 117.
The instructors provided feedback on the assignments during their office hours. ChatGPT 4o (OpenAI, 2025) was used for AI feedback. The texts were submitted to ChatGPT 4o using a Python script. The AI prompt included the definition of the context, the task, the feedback areas, a sample text, and sample feedback for that text.
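A minimal sketch of such a submission step is given below, assuming the official openai Python library and the prompt components listed above; the model identifier, prompt wording, and file layout are illustrative placeholders rather than the script used in the study.

from pathlib import Path
from openai import OpenAI

client = OpenAI()  # reads the API key from the OPENAI_API_KEY environment variable

# Illustrative prompt components: context, task, and feedback areas
SYSTEM_PROMPT = (
    "You are an EFL writing instructor in an intensive English program. "
    "Give corrective feedback on the student's text in the areas of grammar, "
    "vocabulary, organization, content, and mechanics, appropriate to the student's CEFR level."
)

# Illustrative one-shot example: a sample text followed by sample feedback
EXAMPLE = "Sample student text:\n<sample paragraph>\n\nSample feedback:\n<sample feedback instances>\n"

def get_feedback(student_text: str, cefr_level: str) -> str:
    """Submit one student text to ChatGPT 4o and return the feedback as a string."""
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model identifier
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"{EXAMPLE}\nCEFR level: {cefr_level}\n\nStudent text:\n{student_text}"},
        ],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    Path("feedback").mkdir(exist_ok=True)
    for path in sorted(Path("texts").glob("*.txt")):  # one plain-text file per student text
        Path("feedback", path.name).write_text(get_feedback(path.read_text(), "A2"))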
2.3. Data analysis
Using the error analysis categories developed by Ferris (Reference Ferris, Hyland and Hyland2006) and Ene and Upton (Reference Ene and Upton2014), all human and AI feedback instances were coded by the researchers based on their area of focus, that is, grammar, vocabulary, organization, content, or mechanics.
To answer RQ1, each AI feedback instance was investigated to determine whether (i) it was correct, and (ii) it was appropriate and necessary for the students’ level. To answer RQ2, the remaining correct AI feedback instances were juxtaposed with human feedback to determine overlapping and non-overlapping feedback instances in each feedback area.
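To illustrate only the counting step (the coding and juxtaposition themselves were carried out by the researchers), the sketch below tallies per-area matching percentages from already-coded human feedback instances; the field names are assumptions made for the example.

from collections import Counter
from typing import NamedTuple

class HumanFeedback(NamedTuple):
    area: str            # grammar, vocabulary, organization, content, or mechanics
    matched_by_ai: bool  # True if a correct AI feedback instance addressed the same issue

def matching_percentages(instances: list[HumanFeedback]) -> dict[str, float]:
    """Matching percentage = matches / human feedback instances x 100, per feedback area."""
    totals = Counter(f.area for f in instances)
    matches = Counter(f.area for f in instances if f.matched_by_ai)
    return {area: 100 * matches[area] / totals[area] for area in totals}

# Example: three coded grammar instances, two of which overlap with AI feedback -> {'grammar': 66.67}
coded = [
    HumanFeedback("grammar", True),
    HumanFeedback("grammar", True),
    HumanFeedback("grammar", False),
]
print({area: round(pct, 2) for area, pct in matching_percentages(coded).items()})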
3. Results
3.1. AI feedback quality and level appropriateness
In total, ChatGPT provided 1986 feedback instances for all the texts. Table 1 provides the feedback areas of correct, incorrect, and unnecessary instances.
Table 1. Feedback areas for correct, incorrect, and unnecessary AI feedback instances

In 1750 of these instances, or 88.11% of all instances, ChatGPT correctly identified the issues with the text and provided level-appropriate and useful feedback. In 126 feedback instances (6.35%), ChatGPT marked a correct student production as incorrect or provided incorrect feedback on an erroneous use, while 110 instances (5.54%) included unnecessary feedback, which was correct but was either beyond the students’ current proficiency levels or did not need to be implemented in the texts.
3.2. Comparing AI and human feedback
The teachers in this study provided 1751 instances of feedback for the whole dataset, while the number of correct ChatGPT feedback instances was 1750. The analysis showed that ChatGPT matched 985 (56.29%) of the human feedback instances. Table 2 presents the statistics regarding these matches by feedback area.
Table 2. Matches between human and correct AI feedback based on feedback areas

Note: Matching percentage is calculated by dividing the total number of matches by the total number of human feedback and multiplying by 100 for each feedback area.
* Human feedback includes 59 (3.37%) instances of personal remarks of the teachers, which could not be classified under these categories and thus were not included in the table or analysis.
The highest alignment between AI and human feedback was found in mechanics-related issues – such as punctuation, capitalization, and spelling – with AI covering 67.76% of human feedback. Grammar feedback showed a 65.07% match, while vocabulary issues aligned at 51.12%. Among the subcategories, the highest overlaps were observed in capitalization (100%), word forms (86.96%), spelling (84.76%), subject-verb agreement (84.06%), and idiomatic usage (81.82%).
Conversely, the lowest alignment was observed in content (23.28%) and organization (22.67%). Apart from these categories, the lowest matches occurred in the use of pronouns (35.29%), punctuation (35.82%), and word choice (37.80%).
3.3. AI feedback without corresponding human feedback
While the overall alignment between human and AI feedback was 56.29%, the remaining 858 instances – representing 49.08% of all accurate AI feedback – addressed errors not identified by the teachers. Table 3 presents the number of these AI feedback instances without corresponding human feedback by category.
Table 3. Correct AI feedback without corresponding human feedback

4. Discussion and implications
Even though concerns have been raised in terms of the quality, accuracy (Steiss et al., Reference Steiss, Tate, Graham, Cruz, Hebert, Wang, Moon, Tseng, Warschauer and Olson2024), and consistency (Lin & Crosthwaite, Reference Lin and Crosthwaite2024) of AI-generated feedback in L2 writing, our study revealed that ChatGPT 4o can produce highly accurate feedback, as 88.11% of all AI feedback instances correctly addressed the issues and only 11.89% were classified as inaccurate or unnecessary. Our findings align with the literature showing that AI tools can effectively provide corrective feedback (Jamshed et al., Reference Jamshed, Ahmed, Sarfaraj and Warda2024), particularly when context-relevant prompts are utilized (Venter et al., Reference Venter, Coetzee and Schmulian2024).
Providing feedback within a classroom setting can be challenging, exhausting, and time-consuming for L2 teachers (Dikli & Bleyle, Reference Dikli and Bleyle2014; Lee, Reference Lee2019; Wilson et al., Reference Wilson, Olinghouse and Andrada2014). Given that ChatGPT 4o aligned with 56.29% of the human feedback instances and produced 858 additional instances that the teachers had not addressed (49.08% of all correct AI feedback instances), it can be suggested that ChatGPT 4o provides a valuable feedback mechanism that can meaningfully complement and support human feedback practices (Han & Li, Reference Han and Li2024; Lin & Crosthwaite, Reference Lin and Crosthwaite2024; Xue, Reference Xue2024). The data further indicate that ChatGPT 4o can provide extensive feedback on mechanics and grammar. By utilizing AI to identify surface-level errors, the efforts of language teachers could be directed towards more attention-demanding, language-related issues, such as organization and content (Link et al., Reference Link, Mehrzad and Rahimi2022; Steinert et al., Reference Steinert, Avila, Ruzika, Kuhn and Küchemann2024), where AI feedback showed limited alignment in this study.
This study has certain limitations. Student evaluations of ChatGPT 4o’s feedback are needed to assess the comprehensibility and perceived usefulness of the feedback. Additionally, potential biases may have influenced the classification of “unnecessary feedback” instances, as this decision was based on the teachers’ evaluation of the feedback according to their own teaching preferences. Moreover, despite prompt optimization, ChatGPT remains susceptible to generating different responses even with the same texts and prompts (Lin & Crosthwaite, Reference Lin and Crosthwaite2024).
Supplementary material
The supplementary material for this article can be found at https://doi.org/10.1017/S0261444825000199.
Yusuf Cengiz is a Ph.D. candidate in English Language Teaching at Boğaziçi University, Istanbul, Türkiye, where he also works as an English instructor. His research focuses on second language (L2) writing, corrective feedback, and use of AI.
Nur Yiğitoğlu Aptoula is an associate professor in the Department of Foreign Language Education at Boğaziçi University, Istanbul, Türkiye. Her current research focuses on L2 writing and L2 teacher education. She is one of the recipients of the 2023 Turkish Academy of Sciences (TÜBA) Outstanding Young Scientists Awards (GEBİP).