1. Introduction
Technology is applied to all aspects of foreign language learning and teaching, including assessment. Among these technologies, automated writing evaluation (AWE) has seen growing use in writing assessment. AWE systems employ natural language processing (NLP) and machine learning to provide language learners with automated corrective feedback (Li, Dursun, and Hegelheimer 2017) and with more accurate and objective scoring, which can otherwise be biased when performed by human raters. Because automated scoring is faster and more cost-effective than human scoring, it helps language teachers assess large numbers of essays. Owing to these benefits, scholars have developed and implemented AWE systems for various languages, including English (Shermis and Burstein 2003), Japanese, Bahasa Malay, Chinese, Hebrew, Spanish, and Turkish.
Despite the large number of existing AWE systems, AWE for Korean L2 writing remains unexplored. According to the Modern Language Association (MLA) report, Korean is the only language that showed a sharp increase in enrollment over the past few years compared to other foreign languages. Furthermore, Korean was consistently ranked as the 15th most commonly taught foreign language in US colleges and universities between 2013 and 2016. Therefore, it is necessary to develop AWE for Korean to provide innovative resources in the growing field of Korean language education.
In the most basic terms, AWE is defined as “the process of evaluating and scoring written prose via computer programs” (Shermis and Burstein 2003). Since the advent of automatic scoring in the 1960s (Page 1966), advances in language processing technologies and statistical methods have led to the development of various AWE systems (Li et al. 2017). The first computerized scoring system, Project Essay Grader ${}^{\text{TM}}$ (PEG ${}^{\text{TM}}$ ), could detect syntactic errors and predict scores that were comparable to those of human raters (Page and Petersen 1995).
More advanced AWE systems were developed in the 1990s; the Intelligent Essay Assessor ${}^{\text{TM}}$ (IEA) utilized latent semantic analysis to move beyond scoring and include feedback on semantics (Foltz, Laham, and Landauer 1999). More recently, several scoring engines with more sophisticated language processing techniques and statistical methods have been developed (Li et al. 2014). E-rater ®, Knowledge Analysis Technologies ${}^{\text{TM}}$, and IntelliMetric ${}^{\text{TM}}$ analyze a wide range of text features at the lexical, semantic, syntactic, and discourse levels.
E-rater (developed by ETS) is an early AWE scoring engine designed to evaluate essays written by nonnative English learners; it is still widely used for TOEFL and GMAT, which are high-stakes tests for undergraduate admission or graduate business admission in the United States (Burstein, Tetreault, and Madnani 2013). E-rater identifies and extracts several feature classes for model building and scoring using statistical and rule-based NLP (Attali and Burstein 2006). Some of the feature classes include (1) grammatical errors (e.g., subject–verb agreement errors); (2) word usage errors (e.g., here versus hear); (3) errors in mechanics (e.g., spelling and punctuation); (4) presence of discourse elements (e.g., thesis statement, supporting details, and concluding paragraphs); (5) development of discourse elements; (6) style (e.g., repeated use of the same word); (7) content-vector analysis (CVA)-based features to evaluate topical word usage; (8) features associated with the correct usage of prepositions and collocations (e.g., powerful versus strong); and (9) variety in sentence structure formation (Burstein et al. 2013). After measuring these features, e-rater provides a holistic score that corresponds with human-rated scores. A randomly selected sample of human-scored essays is run through e-rater, after which a variety of linguistic features are extracted and converted to numerical values. Using a regression modeling approach, the values obtained from this sample are used to determine the weight for each feature. To score a new essay, e-rater extracts the same set of features and converts them to a vector of values, which are then multiplied by the weights of the corresponding features. Finally, the sum of the weighted features is computed to predict the final score, which represents the overall quality of an essay (Attali, Bridgeman, and Trapani 2010).
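The regression-then-weighted-sum procedure described above can be sketched as follows. This is only an illustration under stated assumptions, not e-rater's actual implementation: the three features (a bias term, a grammar error count, and a length measure), the toy training data, and the function names `fit_weights` and `predict_score` are all hypothetical, and ordinary least squares stands in for whatever regression variant the engine uses.

```python
def fit_weights(features, scores):
    """Ordinary least squares via the normal equations (X^T X) w = X^T y."""
    n_feat = len(features[0])
    # Augmented matrix [X^T X | X^T y].
    aug = [
        [sum(row[i] * row[j] for row in features) for j in range(n_feat)]
        + [sum(row[i] * y for row, y in zip(features, scores))]
        for i in range(n_feat)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n_feat):
        pivot = max(range(col, n_feat), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n_feat):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][n_feat] / aug[i][i] for i in range(n_feat)]


def predict_score(weights, feature_vector):
    """Score a new essay as the weighted sum of its feature values."""
    return sum(w * x for w, x in zip(weights, feature_vector))


# Hypothetical human-scored sample: [bias, grammar errors, length in 100 words].
train_x = [[1.0, 0.0, 3.0], [1.0, 2.0, 2.0], [1.0, 5.0, 1.0], [1.0, 1.0, 4.0]]
train_y = [5.0, 3.0, 0.5, 5.5]

weights = fit_weights(train_x, train_y)
print(predict_score(weights, [1.0, 3.0, 2.5]))
```

In this toy setup the learned weight for the error count is negative and the weight for length is positive, which mirrors how a regression model assigns each feature its contribution to the holistic score.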
Another important scoring engine is IntelliMetric, which uses the same holistic scoring approach employed by human raters (Schultz 2013). Just as human raters must be trained to score a specific prompt, the IntelliMetric system needs to be trained with a set of responses previously scored by human raters. The system then internalizes the features of the responses linked to each score point and applies them to score essays with unknown scores. The IntelliMetric system uses a multistage process to score essays. First, the essays need to be provided in an electronic form. After the information is received and prepared for analysis, the text is parsed to understand the grammatical and syntactic structure of the language. Each sentence is identified in terms of parts of speech, vocabulary, sentence structure, and expression. After all the information is collected from the text, statistical techniques are employed to translate the text into a numerical form. Then, IntelliMetric uses virtual raters (mathematical models) to assign scores. Each virtual rater attempts to link the features extracted from the text to the scores assigned in the training set to ensure accurate scoring for essays with unknown scores. Finally, IntelliMetric integrates the information received from the virtual raters to present a single, reliable score.
Powered by the scoring engines described above, AWE tools such as Criterion and MYAccess! have been developed. These tools provide writing scores and feedback instantly, so students can practice writing and receive immediate feedback. In the context of writing instruction, AWE tools can assist instructors by providing immediate scoring and feedback, especially in large classroom scenarios.
In general, AWE studies have focused on the validity and reliability of AWE tools (Dikli and Bleyle 2014). Previous validation studies reported high agreement rates between the AWE tools and human raters (Burstein et al. 1998; Landauer, Laham, and Foltz 2003; Chodorow, Gamon, and Tetreault 2010). For example, Shermis et al. (2002) showed that PEG ${}^{\text{TM}}$ achieved scores that were highly correlated with human scores ( $r = 0.82$ ) compared with human inter-rater reliability ( $r = 0.71$ ). Furthermore, Enright and Quinlan (2010) found high agreement indices between ratings provided by two human raters and those provided by e-rater and one human in TOEFL iBT. E-rater proved to be a reliable complement to human ratings under specific testing contexts (Burstein et al. 1998; Powers et al. 2000; Burstein 2003; Chodorow and Burstein 2004; Attali 2007; Lee, Gentile, and Kantor 2008).
Neural models dominate current AWE systems. Ke and Ng (2019), Ramesh and Sanampudi (2021), and Uto (2021) provide thorough summaries of recent neural models. For automatic essay scoring, there are two main model types. First, in RNN-based models, either the RNN output is passed through a mean-over-time layer that aggregates it into a fixed-length vector, followed by a linear layer that produces a scalar value (Taghipour and Ng 2016), or a simple BiLSTM followed by a linear layer is used to predict essay scores (Alikaniotis, Yannakoudakis, and Rei 2016). Second, transformer-based models, for example, BERT with BiLSTM with attention (Nadeem et al. 2019) or BERT concatenated with handcrafted features (Uto, Xie, and Ueno 2020), can be used to predict the score. Fine-tuning BERT using multiple losses, including regression loss and reranking loss for constraining automated essay scores, has been shown to produce state-of-the-art results (Yang et al. 2020).
Although many studies explore AWE tools and their validation, a majority of them focus on AWE systems developed for native English-speaking writers (Powers et al. 2001; Rudner, Garcia, and Welch 2006; Wang and Brown 2007) or English as a second language (ESL) writers (Chen and Cheng 2008; Choi and Lee 2010). Only a few studies investigate the use of AWE systems for less commonly taught languages, and to the best of our knowledge, no studies investigate AWE for Korean as a foreign language (KFL) because of the lack of available AWE tools. This study aims to extend the scope of research in this area by introducing a state-of-the-art AWE system developed from the Korean learner corpus for Korean.
The goal of this study is to develop a neural Korean AWE engine and validate it in terms of its capacity to distinguish the developmental level of second language learners. We address two questions: how recent advancements in neural network models can help improve automatic writing evaluation, and how neural network models can use different linguistic features in a complementary manner to improve AWE performance. This paper describes the automated essay scoring system and its natural language processing-centered approach within the neural system, and it details the validation of the AWE system in terms of simultaneously predicting the learners’ proficiency level and holistic score.
The rest of this paper is organized as follows. First, the paper presents the Korean learner corpus used to develop the Korean AWE program and discusses how we define features in the learner corpus (Section 2). Next, the basic AWE model is presented (Section 3), followed by a proposed neural AWE model that was designed to compensate for the limitations of the basic model (Section 4). Finally, the results from an experiment are reported with detailed discussions (Section 5) and future perspectives for the AWE model in the conclusion (Section 6).
2. Korean learner corpus
2.1 Learner corpus dataset
We use the dataset from the Korean learner corpus (Park and Lee 2016); this database contains proficiency levels (from Level 1 to Level 6) (<level>), native language by nationality (<nationality>), gender (<gender>), teacher-attributed score (<score>), and text. Figure 1 shows an example of the Korean learner corpus dataset which indicates the learner’s proficiency level = Level 1 (A1), L1 = Chinese, gender = F, and score = 70. Furthermore, it shows the title of the text (<topic>) and the entire text, where sentences are delimited using s (the beginning of a sentence) and /s (the end of a sentence), and paragraphs using p (the beginning of a paragraph) and /p (the end of a paragraph).
The Common European Framework of Reference for Languages (CEFR) suggests common reference levels divided into three level groups: A1 and A2 (basic), B1 and B2 (independent), and C1 and C2 (proficient) users. The Korean proficiency test divides students into beginner, intermediate, and advanced groups, which are further divided into levels based on each student’s ability. These groups are subdivided into Levels 1 (A1) and 2 (A2) for the beginner levels (chogeub, literally “beginner”), Levels 3 (B1) and 4 (B2) for the intermediate levels (junggeub, “intermediate”), and Levels 5 (C1) and 6 (C2) for the advanced levels (gogeub, “advanced”). For foreign students whose first language is not Korean, universities require at least Level 3 for admission and at least Level 4 for completing a degree, regardless of major. For students in Korean studies, Levels 5 and 6 are required for admission and degree completion, respectively.
Although they hailed from over 80 different countries, the majority of the learners were from Asian countries where Chinese and Japanese are the first and second most spoken languages. Writing examples for L1 Mandarin Chinese and Japanese in the corpus represent 38.27% and 21.09%, respectively. If we place students from China, Hong Kong, and Taiwan together, the percentage of learners who speak Chinese as L1 increases to 49.72%, and thus, half of the writing tests can be said to be produced by Chinese L1 learners.
A total of 2523 learners participated in a writing examination and produced 4094 writing examples. All examinees provided their native language (L1) and gender; there were 700 men, 1822 women, and one participant who did not specify their gender. The corpus also specifies that all students were high school graduates and that over 60% were university graduates. In the learner corpus, the beginner levels (Levels 1 and 2) represent almost 50% of the corpus. If Level 3 (intermediate level) is also included, these levels account for about 75% of the corpus. Table 1 presents the most frequently used prompts in the learner corpus. While some writing prompts are given only to learners at a specific proficiency level (e.g., My weekend, which is given only at Level 1), other topics are used at different proficiency levels (e.g., The day that I remember the most, for Level 3 and Level 5).
Over 100 prompts are each used by only a small number of writing examples (“Other prompts” row in Table 1). Most prompts are given only at specific proficiency levels, such as My weekend and Seasons and weather in my country.
Twenty-one writing prompts are given to multiple proficiency levels, and these prompts represent 42.96% of the dataset. For the proposed AWE system, we use <level> and <score> as target classes, and extract various linguistic features only from sentences. Although other annotations in the learner corpus could serve as target classes for other learner corpus-related applications, such as <nationality> for native language identification, we do not use them in this study.
2.2 Features in the learner corpus
We explore various automatic metrics that aim to describe the characteristics of the learner corpus, and we find relevant features for the classification tasks. Such characteristics are represented in terms of complexity, fluency, and accuracy features. These features can be used for learner corpus-related applications such as automated assessment and language proficiency classification. All metrics described here should be measured and extracted automatically from the corpus. Therefore, they are evaluated without any human intervention to assess writing quality and classify language proficiency automatically.
2.2.1 Complexity features
Complexity features use quantitative measures such as the numbers and mean lengths of words and sentences in the text. The length of the written text is considered an important feature in the learner corpus. Most previous work on proficiency classification focused on the number of words (Ortega 2003; Vajjala and Loo 2013; Alfter et al. 2016). Since many official writing tests for proficiency levels define the number of words for each level, the quantitative measures of text in the learner corpus become the most obvious feature for learner corpus applications.
We use a part-of-speech (POS) tagging system for Korean morphological analysis to count the number of morphemes instead of eojeols (blank-separated word units in Korean). The POS tagger assigns POS tags while segmenting Korean words into morphemes. For example, the sentence in (1b) is morphologically analyzed and segmented in (1c). Although the number of tokens depends on the basic unit (eojeols or morphemes), tokenizing Korean words into morphemes yields a consistent number of tokens for compound words, which may be written with or without a blank space. For example, the two identical but differently segmented compound nouns hakseubja kopeoseu and hakseubjakopeoseu (“a learner corpus”), both of which are correct and grammatical, are each counted as two morphemes under the proposed counting scheme. The scheme counts what the compound word or phrase semantically represents rather than its surface segmentation, so both forms yield two tokens (as for hakseubja kopeoseu) instead of one token (as for hakseubjakopeoseu).
(1)
a. hajiman bili ssi-hago naoko ssi-neun modu sajingi-ga eobs-eoss-eoyo.
However, Billy Mr.-conj Naoko Ms.-top all camera-nom do_not_have-past-decl.
“However, Mr. Billy and Ms. Naoko, both of them do not have a camera.”
b. hajiman bili ssi-hago naoko ssi-neun modu sajingi-ga eobs-eoss-eoyo. (# of tokens by word = 8)
c. hajiman bili ssi -hago naoko ssi -neun modu sajingi -ga eobs -eoss -eoyo. (# of tokens by morpheme = 13, punctuation excluded)
A type/token ratio is calculated using $\frac{\text{\# of types}}{\text{\# of tokens}}$ , where the number of types is the number of unique tokens, and the number of tokens is the number of morphemes. This ratio measures the vocabulary richness of a corpus on a scale between 0 and 1, where values near 0 indicate low lexical variation and values near 1 indicate high lexical variation. We use the morphological analysis and POS tagging model described in Park and Tyers (2019), which can generate POS tagging results, as shown in Figure 2.
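As a small worked illustration of these counts, the sketch below computes the eojeol count, the morpheme count, and the type/token ratio for the sentence in (1). It assumes the morpheme segmentation is already given as in (1c); in the actual pipeline this segmentation comes from the POS tagger, and the function names here are hypothetical.

```python
def count_eojeols(sentence):
    """Count blank-separated word units, ignoring the final period."""
    return len(sentence.rstrip(".").split())


def split_morphemes(segmented_sentence):
    """Split an already morpheme-segmented sentence into tokens."""
    return segmented_sentence.rstrip(".").split()


def type_token_ratio(tokens):
    """Number of unique tokens divided by the total number of tokens."""
    return len(set(tokens)) / len(tokens)


raw = "hajiman bili ssi-hago naoko ssi-neun modu sajingi-ga eobs-eoss-eoyo."
segmented = "hajiman bili ssi -hago naoko ssi -neun modu sajingi -ga eobs -eoss -eoyo."

morphemes = split_morphemes(segmented)
print(count_eojeols(raw))        # 8 tokens by word
print(len(morphemes))            # 13 tokens by morpheme
print(type_token_ratio(morphemes))  # 12 types / 13 tokens
```

The morpheme ssi occurs twice, so the sentence has 12 types over 13 tokens.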
Complexity features can also measure syntactic complexity in L2 writing (Polio 1997; Ortega 2003; Lu 2010). Although first language syntactic complexity measures include Yngve’s depth algorithm (Yngve 1960), Frazier’s local non-terminal numbers (Frazier 1985), and the D-level scale (Rosenberg and Abbeduto 1987; Covington et al. 2006), we do not consider them in this manuscript because our focus is second language learning. A tree structure obtained by constituent parsing can reveal linguistic discrepancies. For example, if the subject is omitted in the sentence, the tree structure of the parsing result has a vp node as its root, whereas a standard tree has an s node as its root, as shown in Figure 3. If the root node is a vp, we may consider it as a syntactic complexity feature. We note that a sentence with a vp root may still be grammatical in Korean. We use the phrase-structure models described in Kim and Park (2022), which were trained on the Sejong treebank for Korean using the Berkeley neural parser (Kitaev, Cao, and Klein 2019) with the pre-training of deep bidirectional transformers (Devlin et al. 2019). For syntactic complexity features, we also add the distribution of grammatical morphemes, such as the number of verbal endings and postpositions.
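The vp-root feature can be sketched as a simple check on the root label of each parse. The snippet assumes that parses are available as bracketed strings such as `(s (np ...) (vp ...))`; this output format, the placeholder leaves `w1`, `w2`, ..., and the function names are assumptions for illustration and are not taken from the parser used in this study.

```python
import re


def root_label(bracketed_tree):
    """Return the label of the outermost node of a bracketed parse string."""
    match = re.match(r"\s*\(\s*([^\s()]+)", bracketed_tree)
    if match is None:
        raise ValueError("not a bracketed tree: %r" % bracketed_tree)
    return match.group(1)


def count_vp_roots(trees):
    """Count the sentences whose parse tree is rooted in a vp node."""
    return sum(1 for tree in trees if root_label(tree).lower() == "vp")


# Placeholder parses: the second sentence has no subject, so its root is vp.
parses = [
    "(s (np w1) (vp w2 w3))",
    "(vp (np w1) (vp w2))",
    "(s (np w1 w2) (vp w3))",
]
print(count_vp_roots(parses))  # 1
```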
2.2.2 Fluency features
We define fluency as the capability of producing language effortlessly. Fluency is the potential of a language learner to apply their knowledge of grammar to produce intelligible speech and writing, and it plays an important role in language production. We assess a learner’s fluency by observing their level of comfort when using the language and by identifying whether they can express themselves efficiently, both verbally and in text. Pauses in production and the length of written text are good indicators of fluency (Towell, Hawkins, and Bazergui 1996; Ge, Wei, and Zhou 2018; Martindale and Carpuat 2018; Qiu and Park 2019). Previous work defined various metrics for fluency. Two metrics defined in previous work and an additional fluency metric based on the unigram language model are given below.
1. Fluency by Asano, Mizumoto, and Inui (2017): $\displaystyle f(h) = \frac{\log P_m(h) - \log P_u(h)}{|h|}$
2. Fluency by Ge et al. (2018): $\displaystyle f(h) = \frac{1}{1+H(h)}$ where $\displaystyle H(h) = -\frac{\log P_m(h)}{|h|}$
3. Fluency by the unigram language model: $\displaystyle f(h) = -\frac{\log P_u(h)}{|h|}$
Here, $P_m$ represents the probability of the sentence given by the language model, $P_u$ denotes the unigram probability of the sentence, and $|h|$ denotes the length of the sentence $h$.
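The three metrics translate directly into code once the log-probabilities of a sentence are available. The sketch below takes $\log P_m(h)$, $\log P_u(h)$, and $|h|$ as plain numbers; the function names and the example values are hypothetical, and how the log-probabilities are obtained is left to the language model.

```python
def fluency_asano(logp_m, logp_u, length):
    """Length-normalized difference between LM and unigram log-probabilities."""
    return (logp_m - logp_u) / length


def fluency_ge(logp_m, length):
    """1 / (1 + H), where H is the per-token cross-entropy under the LM."""
    cross_entropy = -logp_m / length
    return 1.0 / (1.0 + cross_entropy)


def fluency_unigram(logp_u, length):
    """Per-token negative log-probability under the unigram model."""
    return -logp_u / length


# Hypothetical sentence of 10 morphemes with log P_m = -20 and log P_u = -30.
print(fluency_asano(-20.0, -30.0, 10))  # 1.0
print(fluency_ge(-20.0, 10))            # 1 / (1 + 2)
print(fluency_unigram(-30.0, 10))       # 3.0
```

Note that the score by Ge et al. is bounded between 0 and 1, whereas the other two metrics are unbounded.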
We collect a large monolingual dataset for Korean, containing over 9.6 M sentences and 130.6 M eojeols, to create a language model. It comprises Korean Wikipedia (5.3 M sentences and 71.8 M eojeols), the Sejong morphologically analyzed corpus (3.0 M and 40.0 M), and articles from The Hankyoreh daily newspaper published in 2016 (1.2 M and 18.6 M, previously presented in Park 2017). After preprocessing the raw text into morpheme-segmented text using the POS tagging system (Park and Tyers 2019), we create a linearly interpolated trigram model and implement the fluency metrics described in Asano et al. (2017) and Ge et al. (2018), as well as the fluency feature computed with the unigram language model. As indicated in (2), we attach the POS label to the morpheme-segmented lexicon and explicitly include a + symbol for consecutive morphemes. The raw text collection used to create the language model is provided by the authors at http://doi.org/10.5281/zenodo.4317288.
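A linearly interpolated trigram model mixes maximum-likelihood unigram, bigram, and trigram estimates with fixed weights. The following is a minimal sketch of that idea, not the model built for this study: the interpolation weights, the boundary symbols `<s>` and `</s>`, the class name, and the toy corpus are all assumptions, and unseen words receive zero probability here, which a practical model would have to handle.

```python
import math
from collections import Counter


class InterpolatedTrigramLM:
    """P(w | u, v) = l1 * P(w) + l2 * P(w | v) + l3 * P(w | u, v)."""

    def __init__(self, sentences, lambdas=(0.1, 0.3, 0.6)):
        self.l1, self.l2, self.l3 = lambdas  # assumed weights, sum to 1
        self.uni, self.bi, self.tri = Counter(), Counter(), Counter()
        self.bi_ctx, self.tri_ctx = Counter(), Counter()
        for sent in sentences:
            toks = ["<s>", "<s>"] + list(sent) + ["</s>"]
            for i in range(2, len(toks)):
                u, v, w = toks[i - 2], toks[i - 1], toks[i]
                self.uni[w] += 1
                self.bi[(v, w)] += 1
                self.bi_ctx[v] += 1
                self.tri[(u, v, w)] += 1
                self.tri_ctx[(u, v)] += 1
        self.total = sum(self.uni.values())

    def prob(self, u, v, w):
        p1 = self.uni[w] / self.total
        p2 = self.bi[(v, w)] / self.bi_ctx[v] if self.bi_ctx[v] else 0.0
        ctx = self.tri_ctx[(u, v)]
        p3 = self.tri[(u, v, w)] / ctx if ctx else 0.0
        return self.l1 * p1 + self.l2 * p2 + self.l3 * p3

    def log_prob(self, sent):
        """Log-probability of a sentence whose tokens were all seen in training."""
        toks = ["<s>", "<s>"] + list(sent) + ["</s>"]
        return sum(
            math.log(self.prob(toks[i - 2], toks[i - 1], toks[i]))
            for i in range(2, len(toks))
        )


# Toy corpus of morpheme sequences (placeholder symbols, not real Korean data).
lm = InterpolatedTrigramLM([["a", "b", "c"], ["a", "b", "d"]])
print(lm.prob("a", "b", "c"))
print(lm.log_prob(["a", "b", "c"]))
```

The value returned by `log_prob` is what the fluency metrics would use as $\log P_m(h)$, and the unigram term alone gives $\log P_u(h)$.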
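(2) [Example not recoverable from the source: a morpheme-segmented sentence with POS labels attached and a + symbol between consecutive morphemes.]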
2.2.3 Accuracy features
Thus far, we have discussed features that can be extracted automatically from the learner corpus. Now, we define accuracy as a feature in the learner corpus. This feature represents the ability to produce correct sentences using correct grammar and vocabulary. However, measuring accuracy requires a learner corpus annotated with linguistic information such as grammatical error categories and error corrections (e.g., the NUS learner corpus, Dahlmeier, Ng, and Wu 2013, or the Treebank of Learner English, Berzak et al. 2016). These errors are annotated based on target expressions that a native speaker would produce in the identical context, and they are used to distinguish non-standard linguistic expressions in the learner corpus. Figure 4 shows a conceptual example of the annotated sentence described in (3) from a Korean learner. S represents the learner’s sentence, and A represents the error correction annotation. 1 2 indicates the span of tokens where the correction needs to be introduced. The value R:ADP indicates the type of error. For example, yeseo, a functional morpheme (ADP) at 1 2, should be replaced by eseo according to the annotation.
(3)
a. gohyang yeseo jumal e chingu wa manna seo eseo eoyo.
“ø (met) a friend (in the hometown) on weekend.”
b. gohyang eseo jumal e chingu wa manna si eoss eoyo.
hometown loc weekend ajt friend cjt meet hon past ind.
“ø met a friend in the hometown on weekend.”
The correct sentence is presented in (3b). This example illustrates functional morpheme errors, which are among the most common errors. Specifically, these errors involve postposition and honorific morphemes, which we denote as adpositions (ADP) for functional morphemes using a universal part-of-speech tagset (Petrov, Das, and McDonald 2012). Using an error-annotated learner corpus, it is possible to perform grammatical error correction (GEC) by automatically detecting and correcting grammatical errors in the text. In recent years, the consistent increase in the number of foreign language learners, especially learners of Korean, and the demand to facilitate their learning with timely feedback have made GEC increasingly popular in both academia and industry. However, GEC requires an error-annotated corpus, and the current version of the learner corpus for Korean L2 writing lacks an error correction dataset. Tasks such as GEC that rely on accuracy features are therefore beyond the scope of this study, and we leave them as future work.
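To make the annotation scheme concrete, the sketch below applies a span-based correction to the learner sentence in (3a). It uses only the one annotation described in the text (span 1 2, type R:ADP, correction eseo); representing an annotation as a `(start, end, error_type, replacement)` tuple and the function name `apply_edits` are assumptions made for illustration.

```python
def apply_edits(tokens, edits):
    """Apply (start, end, error_type, replacement) edits to a token list."""
    corrected = list(tokens)
    # Apply from right to left so that earlier token indices stay valid.
    for start, end, _error_type, replacement in sorted(edits, reverse=True):
        corrected[start:end] = replacement.split()
    return corrected


learner = "gohyang yeseo jumal e chingu wa manna seo eseo eoyo".split()
edits = [(1, 2, "R:ADP", "eseo")]  # replace the token at span 1 2 with eseo

print(" ".join(apply_edits(learner, edits)))
# gohyang eseo jumal e chingu wa manna seo eseo eoyo
```

Only the postposition error is corrected here; reaching the fully corrected sentence in (3b) would require further annotations for the remaining morphemes.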
2.2.4 Summary
We summarize the list of features, including the bag of morphemes, in Table 2, which also shows examples of feature values for the learner corpus presented in Figure 1, which contains six sentences. We present several quantitative complexity features, such as the mean length of sentence by morpheme, mean length of word by morpheme, and morpheme type versus token ratio. In addition, the table shows statistical complexity features such as the number of sentences, number of paragraphs, and number of tokens using morphemes. We consider the bag of functional morphemes as a morpho-syntactic complexity feature and the number of vp heads as a syntactic complexity feature. We denote both the morpho-syntactic and syntactic complexity features as syntactic complexity features for convenience, so that they are differentiated from quantitative complexity features.
Bag of morph = bag of morphemes; # of sent = number of sentences; # of para = number of paragraphs; # of tok = number of tokens; sent by morph = mean number of morphemes per sentence; wd by morph = mean number of morphemes per word; type/token ratio = ratio of morpheme types to tokens; bag of funct = bag of functional morphemes; # of vp heads = number of vp heads. The fluency assessment by Asano et al. (2017) uses $f(h) = \frac{\log P_m(h) - \log P_u(h)}{|h|}$ , the fluency assessment by Ge et al. (2018) uses $f(h) = \frac{1}{1+H(h)}$ where $H(h) = -\frac{\log P_m(h)}{|h|}$ , and the fluency by the unigram language model uses $f(h) = -\frac{\log P_u(h)}{|h|}$ .
3. Baseline statistical automated writing evaluation models
As a baseline, we use a statistical automated writing evaluation system. Statistical AWE systems use linear and logistic regression models; accordingly, we implemented two independent systems, one to predict scores and one to classify proficiency levels, instead of a single integrated system. Figure 5 shows the distribution of criterial features for each level, including quantitative measures (mean length of sentence by morpheme and fluency by Asano et al. 2017). This figure corresponds to the logistic regression model for classifying proficiency levels. Figure 6 presents the distribution of learners’ scores across levels. Note that there are 22 writing examples for which scores are either not provided or annotated as not specified (NA).
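We evaluated the models using 5-fold cross-validation, with accuracy for proficiency classification as in (1) and mean squared error (MSE) regression loss for assessing writing quality as in (2), given gold values $y_i$ and predictions $\hat{y}_i$:
$\displaystyle \text{accuracy} = \frac{1}{n}\sum_{i=1}^{n} \mathbb{1}(\hat{y}_i = y_i)$ (1)
$\displaystyle \text{MSE} = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$ (2)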
where n denotes the total number of examples. Table 3 shows the results of the baseline statistical models. Our experiments using the baseline statistical models followed the experimental settings suggested in previous work by either predicting a score or classifying the proficiency of a learner, independently. We obtained rudimentary initial results using the basic statistical models, which have the following limitations. First, it is difficult to determine the effect of each feature. Second, it has low performance compared to current deep learning-based systems. Third, it requires two separate systems that predict the score and classify the proficiency level; this makes it difficult to use these models for general purposes as a complete AWE system. Therefore, we propose a state-of-the-art neural model for automated writing evaluation which is able to assess proficiency levels and scores simultaneously, and we aim to introduce a system that can be widely used throughout Korean language teaching classrooms.
Average results of 5-fold cross-validation with the standard deviation.
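The evaluation protocol (accuracy or mean squared error, averaged over five folds with a standard deviation) can be sketched as below. This is an illustration under assumptions rather than the experimental code: the fold assignment by index stride, the use of the sample standard deviation, the `fit` callback interface, and the mean-predicting toy model are all hypothetical.

```python
import statistics


def accuracy(gold, pred):
    """Fraction of examples whose predicted label equals the gold label."""
    return sum(1 for g, p in zip(gold, pred) if g == p) / len(gold)


def mean_squared_error(gold, pred):
    """Average squared difference between gold and predicted scores."""
    return sum((g - p) ** 2 for g, p in zip(gold, pred)) / len(gold)


def k_fold_indices(n, k=5):
    """Yield (train_indices, test_indices) for k folds assigned by stride."""
    for fold in range(k):
        test = list(range(fold, n, k))
        held_out = set(test)
        train = [i for i in range(n) if i not in held_out]
        yield train, test


def cross_validate(examples, targets, fit, metric, k=5):
    """Return the mean and sample standard deviation of a metric over k folds."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(examples), k):
        model = fit([examples[i] for i in train_idx],
                    [targets[i] for i in train_idx])
        preds = [model(examples[i]) for i in test_idx]
        scores.append(metric([targets[i] for i in test_idx], preds))
    return statistics.mean(scores), statistics.stdev(scores)


def fit_mean_baseline(train_x, train_y):
    """Toy regressor that always predicts the mean training score."""
    mean_score = statistics.mean(train_y)
    return lambda _example: mean_score


essays = list(range(20))                      # placeholder essay ids
scores = [60.0 + (i % 4) * 10 for i in essays]  # placeholder scores
print(cross_validate(essays, scores, fit_mean_baseline, mean_squared_error))
```

Passing `accuracy` together with a classifier-producing `fit` function gives the corresponding protocol for proficiency classification.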
4. Neural automated writing evaluation models
We propose a state-of-the-art neural Korean AWE model and provide a deeper investigation into each feature proposed in Section 2.2. Our system applies XLM-RoBERTa to represent word forms as word representations, along with a multitask learning (MTL) approach that trains several tasks simultaneously (Hashimoto et al. 2017; Lim et al. 2020). The details of the XLM-RoBERTa feature representation method and our MTL approach are depicted in Figure 7.
4.1 Representation of words
Machine learning (ML)-based grammar checking (Soni and Thakur 2018) and AWE (Persing, Davis, and Ng 2010; Taghipour and Ng 2016; Yang, Xia, and Zhao 2019) have been proposed and widely used in recent years because of their outstanding performance. The main idea behind ML-based AWE is applying deep learning techniques to automated essay scoring. To compute the score of a piece of writing, the system has to learn from a training dataset T that comprises pairs of essays $x_i$ and scores $y_i$ , where $(x_{i},y_{i})\in T$ . In deep learning-based AWE such as in Yang et al. (2019), the sequence of words from the essay $x_i$ is represented as a sequence of vector representations (i.e., word embeddings). Thus, the essay $x_i$ is composed of m words such that $x_i = (w_{i,1}, \cdots , w_{i,m})$ , and the system creates a sequence of word embeddings $e^{w}_{i,1}, \cdots, e^{w}_{i,m}$ . The vector representation of a word $e^{w}_{i,j}$ is trained to capture the syntactic and semantic meaning of the word in a sentence (Pennington, Socher, and Manning 2014). We apply a bidirectional encoder representations from transformers (BERT)-like word representation method that is trained using a masked language model (MLM). Many MLM pre-training methods such as BERT (Devlin et al. 2019) perform training by replacing certain input words with [MASK] and restoring them to the original tokens by training a deep neural network. For example, let the input text be I have no clue; then, the system selects tokens randomly and replaces them, as in I have [MASK] clue. This process makes the system predict the masked word based on its surrounding words. 
During training, the system may struggle to learn the best parameters by comparing its prediction and the masked word.
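As a minimal illustration of the masking step described above (a sketch, not the implementation used in this study), the following Python code randomly replaces tokens with [MASK] and records the original tokens that a model would be trained to restore. The function name `mask_tokens` and the masking probability are assumptions made for this example. BERT itself selects about 15% of tokens and does not always substitute [MASK] for a selected token; that refinement is omitted here.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=None):
    """Replace each token with [MASK] with probability mask_prob.

    Returns the masked sequence and a dict mapping each masked
    position to the original token (the prediction target).
    Simplification: a selected token is always replaced by [MASK].
    """
    rng = random.Random(seed)
    masked, targets = [], {}
    for pos, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(MASK)
            targets[pos] = tok      # the model must restore this token
        else:
            masked.append(tok)
    return masked, targets

tokens = "I have no clue".split()
masked, targets = mask_tokens(tokens, mask_prob=0.25, seed=0)
print(masked, targets)
```

The training signal comes only from the positions stored in `targets`: the prediction at each masked position is compared with the original token.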
BERT is a pre-trained word representation model that is trained with large quantities of Wikipedia text as input and over 110 million parameters. RoBERTa (Liu et al. Reference Liu, Ott, Goyal, Du, Joshi, Chen, Levy, Lewis, Zettlemoyer and Stoyanov2019) is an extended version of BERT, which consumes 270 million parameters and a bigger input dataset, and XLM-RoBERTa (Conneau et al. Reference Conneau, Khandelwal, Goyal, Chaudhary, Wenzek, Guzmán, Grave, Ott, Zettlemoyer and Stoyanov2020) is a multilingual version of RoBERTa trained on 100 different languages. These pre-trained models are effective when transferred to a downstream NLP task because they capture a deep contextual representation of words. In this study, we apply the multilingual XLM-RoBERTa model to transform the Korean text into a sequence of word representations as
where $E^{(w)}_i$ is a matrix that denotes a set of vector representations of words, and it comprises k subwords as $E^{(w)}_i=(e^{{(w)}}_{i,1}, \cdots , e^{(w)}_{i,k})$ . This is because XLM-RoBERTa tokenizes a word into several subwords to handle character-level subword information.
For example, the word joyful turns into two subwords, joy and ful, using XLM-RoBERTa; therefore, the number of words m in an essay is always equal to or smaller than the number of XLM-RoBERTa representations k. We implement our word representation model using the pre-trained XLM-RoBERTa provided by Hugging Face.Footnote d
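The relation between the number of words m and the number of subword representations k can be illustrated with a toy tokenizer. The sketch below uses a greedy longest-match split over a small hypothetical vocabulary. It is an assumption made purely for illustration: the actual XLM-RoBERTa tokenizer is a learned SentencePiece model and is not reproduced here.

```python
def toy_subword_tokenize(word, vocab):
    """Greedy longest-match split of one word into subword pieces.

    Characters that match no vocabulary entry become single-character
    pieces, so every word yields at least one piece.
    """
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate span until it is in the vocabulary
        # or only one character is left.
        while end > start + 1 and word[start:end] not in vocab:
            end -= 1
        pieces.append(word[start:end])
        start = end
    return pieces

vocab = {"joy", "ful", "a", "day"}          # hypothetical vocabulary
words = "a joyful day".split()
subwords = [p for w in words for p in toy_subword_tokenize(w, vocab)]
m, k = len(words), len(subwords)
print(subwords, m, k)
```

Because each word produces one or more pieces, m can never exceed k, which is the inequality stated above.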
4.2 Representation of linguistic features
Quantitative complexity, syntactic complexity, and fluency of the learner’s writing are features that are traditionally important for predicting essay scores, and they can be transformed into vector representations in ML applications using a simple linear transformation method. First, we concatenate the features in Section 2.2 for each of quantitative complexity $t_i^{(q)}$ , syntactic complexity $t_i^{(s)}$ , and fluency $t_i^{(f)}$ . Then, we transform each concatenated output based on a linear model with a ReLU activation function as
where $e_i^{(q)}$ , $e_i^{(s)}$ , and $e_i^{(f)}$ denote a vector representation of each feature, G and U represent learnable parameters, and b indicates a bias. We concatenate the representation of features with the vector representation of words. Finally, we unify the representation between the word and the linguistic features as
where k denotes the number of subword representations of the learner’s writing. Such a unified representation is commonly used with BERT-like models. For example, Prakash and Madabushi (Reference Prakash and Madabushi2020) designed an enhanced version of contextual representation based on count-based features (BERT with a term frequency), and Xue et al. (Reference Xue, Zhou, Ma, Ruan, Zhang and He2019) investigated the effect of relational features with BERT for the Chinese NER task. In these studies, pairs of features were combined by simple concatenation. However, concatenation may not be the best method in our case because our model uses diverse features simultaneously. To investigate this issue, we applied an attention-based method to form unified representations.
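The projection and concatenation described in this section can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: each feature group is assumed to have its own weight matrix and bias, the raw feature values and sizes are made up, and a toy embedding size of 4 stands in for the 768 dimensions reported in Section 5.1.

```python
import random

def relu(v):
    """Element-wise ReLU."""
    return [max(0.0, x) for x in v]

def linear(x, W, b):
    """Compute W x + b, where each row of W produces one output dimension."""
    return [sum(w * xi for w, xi in zip(row, x)) + bj for row, bj in zip(W, b)]

def init_layer(in_dim, out_dim, rng):
    """Small random weights and zero bias (assumed initialisation)."""
    W = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    return W, [0.0] * out_dim

rng = random.Random(0)
d = 4                                   # toy embedding size
# Stand-in for the k = 5 subword representations of one essay.
word_embs = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(5)]

# Hypothetical raw feature vectors t^(q), t^(s), t^(f) for the same essay.
features = {"q": [0.3, 1.2, 0.5], "s": [2.0, 0.1], "f": [0.7, 0.4, 0.9, 0.2]}

feature_embs = []
for name in ("q", "s", "f"):
    W, b = init_layer(len(features[name]), d, rng)
    feature_embs.append(relu(linear(features[name], W, b)))

# Unified representation: k word rows followed by 3 feature rows.
unified = word_embs + feature_embs
print(len(unified), len(unified[0]))
```

Projecting every feature group to the same size as the word embeddings is what allows the k + 3 rows to be stacked into a single matrix.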
4.3 Self-attentive representations
The vector representations for each word (Section 4.1) and the linguistic features (Section 4.2) are independent representations although they are concatenated. Thus, no correlational information between the two representations is captured. The self-attention mechanism is well suited to address this issue. Self-attention involves applying a linear transformation (Cao and Rei Reference Cao and Rei2016) over the matrix of the unified representation $E^{(wl)}_{i}$ for which the attention weights $a^{(wl)}_{i}$ are computed as
where $R^{(wl)}$ denotes a learnable parameter. The attention weights $a^{(wl)}_{i}$ indicate the most informative words $w_{j}$ ( $1\leq j\leq k$ ) in the learner’s writing and the most informative linguistic features. The system obtains a self-attentive vector representation $c^{(wl)}_{i}$ through the dot-product between the attention weight and the unified representation. Intuitively, the attention weight denotes a probability score that represents “how much our system focuses on a specific word or linguistic feature that we propose.” Given an input, when a specific word or expression is important, the system gives it more weight when building the self-attentive representation. The attention weight is discussed in Section 5.3.
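A minimal sketch of this attention pooling is given below. It assumes that the learnable parameter is a single vector `r` that scores each row of the unified representation, and that the scores are normalised with a softmax; the exact parameter shape used in the study is not stated in this section, so both choices are assumptions.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(E, r):
    """Pool the rows of E into one vector using attention weights.

    E: list of (k + 3) row vectors of size d (words plus features).
    r: learnable scoring vector of size d (assumed shape).
    Returns (weights, context), where the weights sum to 1.
    """
    scores = [sum(ri * ei for ri, ei in zip(r, row)) for row in E]
    weights = softmax(scores)
    d = len(E[0])
    context = [sum(w * row[j] for w, row in zip(weights, E)) for j in range(d)]
    return weights, context

E = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # toy unified representation
r = [2.0, 0.0]                             # toy scoring vector
weights, context = attention_pool(E, r)
print(weights, context)
```

Rows that score higher under `r` receive larger weights and therefore contribute more to the pooled vector, which is the behaviour that the attention visualisations in Section 5.3 inspect.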
4.4 Prediction of a proficiency level and a score
Our final goal is to build a system that can automatically measure the proficiency level and the score of a learner’s writing. We use a linear classifier to measure the proficiency level of the essay and use another linear regressor for scoring.
z denotes an index number of levels where level = {Level 1, …, Level 6}, and $P^{(c)}$ and $P^{(r)}$ are learnable parameters. The classification result $\hat{y}^{(c)}_{i}$ is computed by selecting the maximum value of $e^{(c)}_{i,z}$ . During the training phase, our system learns by backpropagation of the prediction errors over the entire training dataset T. Because we train two different tasks, classification and regression, we use a separate objective function for each: cross-entropy for predicting the proficiency level and mean squared error (MSE) for assigning the score of the learner’s writing.
where $(x_{i},y_{i})\in T$ denotes an element from the training set T, $y_i$ denotes a set of gold labels ( $y^{level}_i$ , $y^{score}_i$ ), and $\hat{y}_i$ represents a set of predicted results.
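The two objectives can be illustrated with the following sketch for a single essay. The equal-weight sum of the two losses, the example logits, and the score values are assumptions made for this illustration; how the two terms are weighted in the actual system is not specified in this section.

```python
import math

def cross_entropy(logits, gold_index):
    """Negative log-likelihood of the gold class under a softmax over logits."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum - logits[gold_index]

def joint_loss(level_logits, gold_level, pred_score, gold_score):
    """Cross-entropy for the level plus squared error for the score.

    Assumption: both terms are summed with equal weight.
    """
    return cross_entropy(level_logits, gold_level) + (pred_score - gold_score) ** 2

# One logit per proficiency level (Level 1 to Level 6); values are made up.
level_logits = [0.1, 2.0, 0.3, -1.0, 0.0, 0.5]
# The predicted level is the index with the largest logit (1-based here).
predicted_level = max(range(6), key=lambda z: level_logits[z]) + 1
loss = joint_loss(level_logits, gold_level=1, pred_score=71.0, gold_score=73.0)
print(predicted_level, loss)
```

Over a mini-batch, the squared-error term would be averaged across essays, which gives the MSE reported in Section 5.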
5. Results of neural AWE models and discussion
5.1 Experiment setup
As presented in Section 4, we evaluate the scores using 5-fold cross-validation with the proposed regression loss to assess writing quality and the prediction accuracy for its proficiency level. Table 4 lists our hyperparameter settings. We apply 768 dimensions for parameters U and Q in (4) and set 400 dimensions for P and D in (10). In each fold, we train on 80% of the dataset with a batch size of 6 randomly selected sentences, and the remaining 20% is used as the test dataset. We report the best performance on the test dataset within 100 epochs over the five runs of the 5-fold cross-validation.
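The 80%/20% partition used in each fold can be sketched as below. The dataset size of 100 and the shuffling seed are hypothetical, and whether the original experiments shuffle or stratify the essays is not stated, so this shows only one common way to build the folds.

```python
import random

def k_fold_indices(n_items, n_folds=5, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation.

    Every item appears in the test split of exactly one fold.
    """
    idx = list(range(n_items))
    random.Random(seed).shuffle(idx)
    folds = [idx[f::n_folds] for f in range(n_folds)]
    for f in range(n_folds):
        test_idx = folds[f]
        train_idx = [i for g, fold in enumerate(folds) if g != f for i in fold]
        yield train_idx, test_idx

splits = list(k_fold_indices(100))
for train_idx, test_idx in splits:
    print(len(train_idx), len(test_idx))
```

With five folds, each run trains on four fifths of the essays and tests on the remaining fifth, matching the 80%/20% proportions described above.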
5.2 Experiment results
Table 5 summarizes how different linguistic information affects AWE results using XLM-RoBERTa. The linguistic features are syntactic complexity features (S), fluency features (F), and quantitative features (Q); (A) denotes the self-attention mechanism. To investigate the effect of LMs on AWE performance, we compare results between multilingual BERT (M) and XLM-RoBERTa (X). Besides word representation methods, we also evaluate performance that is solely based on linguistic features without the pre-trained language model. For the models without self-attention, we applied a uniformly weighted average of the BERT word representations and linguistic features as $c_i^{(wl)} = \frac{1}{(k+3)} ( e_{i}^{(q)} + e_{i}^{(s)} + e_{i}^{(f)} + \sum^{k}_{j=1}{e^{(w)}_{i,j}} )$ . Note that the dimension of linguistic features is identical to that of the BERT embedding.
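The pooling used for the models without self-attention amounts to a plain mean over the k word rows and the three feature rows. The following sketch computes it on made-up two-dimensional vectors; the function name `mean_pool` is an assumption made for this example.

```python
def mean_pool(word_embs, feature_embs):
    """Average the k word vectors and the 3 feature vectors, i.e. 1/(k+3) * sum."""
    rows = list(word_embs) + list(feature_embs)
    d = len(rows[0])
    return [sum(row[j] for row in rows) / len(rows) for j in range(d)]

word_embs = [[1.0, 2.0], [3.0, 4.0]]                  # k = 2 toy word vectors
feature_embs = [[0.0, 0.0], [1.0, 1.0], [5.0, 3.0]]   # e^(q), e^(s), e^(f)
c = mean_pool(word_embs, feature_embs)
print(c)  # [2.0, 2.0]
```

The sum is only defined because all rows share one dimension, which is why the linguistic features are projected to the size of the BERT embedding.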
Accuracy for predicting a proficiency level and MSE for assigning a score for the learner’s writing: (M) multilingual BERT only, (X) XLM-RoBERTa only, (X) + (A) XLM-RoBERTa and attention, (X) + (S) XLM-RoBERTa and syntactic complexity features, (X) + (F) XLM-RoBERTa and fluency features, (X) + (Q) XLM-RoBERTa and quantitative complexity features, (X) + (A) + (S) + (F) + (Q) XLM-RoBERTa and all features, and (A) + (S) + (F) + (Q) w/o pre-trained LMs.
Overall observations. XLM-RoBERTa with syntactic complexity features outperforms the other experimental settings in terms of predicting both the proficiency level and the score. The features described in Section 2.2 only narrowly impact the overall results, and linguistic features without the pre-trained language model result in a severely limited performance.
Effect of different BERT-like pre-trained language models. The model based on XLM-RoBERTa outperforms the multilingual BERT system, as expected: XLM-RoBERTa was empirically evaluated for result gains, including the trade-offs between positive transfer and capacity dilution (Conneau et al. Reference Conneau, Khandelwal, Goyal, Chaudhary, Wenzek, Guzmán, Grave, Ott, Zettlemoyer and Stoyanov2020).
Effect of linguistic features for AWE. We observed a meaningful improvement in the results when using linguistic features, as shown by the comparison between XLM-RoBERTa only and XLM-RoBERTa with all other features in Table 5. Among the three different linguistic features, syntactic complexity is found to be the most impactful factor in assessing both the proficiency level and the score. Furthermore, we found that quantitative complexity features have a positive effect in our empirical experiment; however, fluency features lead to a performance degradation of about 0.1 points.
Effect of self-attention. In practice, there are no clear result gains from using self-attention: we observed a change of -0.02 in accuracy for predicting a proficiency level (a negative result) and of -0.48 in MSE for assigning a score (a positive result). This may be because multi-head self-attention, which computes several attentions simultaneously (Vaswani et al. Reference Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser and Polosukhin2017), is already applied in the XLM-RoBERTa model; therefore, our attention representation is less effective than expected.
5.3 Analysis
In the previous section, we showed the performance of our model using different feature selection scenarios. Among the proposed features, syntactic complexity features are relatively more important than other features. However, these observations are based on empirical experiments, and thus, one cannot explain why the neural model makes such a decision. To gain a better understanding of the decision-making process of the system, we conduct additional experiments to visualize the attention score added on top of the feature representations. The visualization of the attention score is a powerful explainable AI (XAI) method by which the results of the solution can be understood by humans (Park et al. Reference Park, Hendricks, Akata, Schiele, Darrell and Rohrbach2016).
During training, the attention score of the i-th learner’s writing, $a^{(wl)}_i$ in (8), is computed as a probability distribution where $\sum^{k+3}_{j=1}{a^{(wl)}_{i,j}} = 1$ . Here, k denotes the number of subtokens, and the three additional elements correspond to the three proposed types of linguistic features: syntactic complexity, quantitative complexity, and fluency. Intuitively, the attention score therefore represents the importance assigned by our system to each linguistic feature and each word when predicting a proficiency level and assigning a score to a learner’s writing.
Attention results on all features. Visualization results in Figure 8 show the attention score on the learner’s writing number 1. In the figure, the darker the color, the higher the attention score assigned to the element. Figure 8a shows the result of applying three different linguistic features as well as words. We make two observations in Figure 8a compared to Figure 8b, where attention is applied only to words. First, we observe that the system focuses on [S-COMPLEXITY] (syntactic complexity features). This result is in line with the result reported in Table 5, where the accuracy of our system was improved by 0.69 points when syntactic complexity features were introduced. Second, the system pays little attention to misspelled words. In this figure, there are several misspelled words such as (ibun) instead of (ibeon, ‘this time’); (jejudu) instead of (jejudo, ‘Jeju Island’); (gapi) instead of (gati, ‘together’); (bipingbab) instead of (bibimbab); and (jacheonggeo) instead of (jajeongeo, ‘bicycle’). Since we do not use accuracy features provided by human annotation, our system can be considered to be sound for the following reasons: (1) the attention mechanism focuses on the proposed linguistic features based on automatic metrics, and (2) a pre-trained large language model can be associated with more proper words instead of spelling errors to yield classification and prediction results.
Attention results on only words. As reported, our system tends to focus on syntactic complexity features when all linguistic features are available. Then, what happens if the system can only see words? Figure 8b presents the results when we apply only words, as in the (X) + (A) model in Table 5. We found that higher attention scores are assigned to verbs such as (gabnida, “be going”). However, the distribution of attention scores on words varies based on the input dataset. Therefore, it is difficult to find a specific word or an expression that can directly affect the score of the learner’s writing.
Attention results on only linguistic features. Table 5 shows that our system can predict a score and a proficiency level of the learner’s writing with only the proposed linguistic features. We are interested in which linguistic features are the most important. Figure 9 presents attention scores of the (A) + (S) + (F) + (Q) model in Table 5 for three sample instances in the dataset. This model does not have any word information, that is, it is without the pre-trained language model. In the graphs for Essay Number 1 and Essay Number 2, syntactic complexity is found to be the most significant feature. For 82.7% of essays in the test dataset, the mean attention score of syntactic complexity features is more than 0.8 out of 1. However, we also observe that quantitative complexity is a more crucial feature for decision-making for some essays such as Essay Number 273. We assume that the attention mechanism attempts to capture quantitative complexity features if it fails to utilize syntactic complexity features. In any case, the fluency weight does not exceed 6.7% of the attention score. Thus, we can assume that fluency is relatively the least important property for AWE when the system has other complexity information.
The most frequent and important words based on proficiency level. In most Korean textbooks, the polite verbal ending (yo) is introduced first because it is the most commonly used ending in everyday contexts. Then, the deferential ending (seubnida) is introduced at the upper beginner level, followed by the plain ending (da) at the intermediate level. Accordingly, Figure 10 shows the distributions of verbal endings based on learners’ proficiency levels.
Discussion of the usage of Korean monolingual BERT. Table 5 shows that XLM-RoBERTa outperforms the multilingual BERT. However, both the multilingual BERT and the XLM-RoBERTa models are designed for multilingual purposes. There are several publicly available Korean monolingual BERT models, such as KLUE-RoBERTa,Footnote e KoBERT,Footnote f DistilKoBERT,Footnote g and KoELECTRA.Footnote h Because these models have been trained with different amounts of training data, their parameters also vary. We additionally investigate the performance of Korean AWE using these monolingual BERT models for the following reasons. First, we are interested in whether monolingual Korean BERT models perform better than multilingual BERTs. Second, we must determine the importance of the different hyperparameters in the monolingual BERT models, as well as the optimally cost-effective BERT model size. Table 6 provides data from the ablation study on multilingual and Korean monolingual BERT models. Overall, we did not observe performance improvement by using the monolingual BERTs. Instead, the comparison between KoBERT and DistilKoBERT shows that model size is more important for monolingual BERT models. One interesting result of the experiment is that KoELECTRA small-V2 and small-V3 show almost identical results, even with different sizes of training data. Among the monolingual models, KLUE-RoBERTa (Park et al. Reference Park, Moon, Kim, Cho, Han, Park, Song, Kim, Song, Oh, Lee, Oh, Lyu, Jeong, Lee, Seo, Lee, Kim, Lee, Jang, Do, Kim, Lim, Lee, Park, Shin, Kim, Park, Oh, Ha and Cho2021) showed the best performance regardless of its model size.
We evaluated the model using only the BERT model (i.e., we did not apply the proposed linguistic features). Training Data denotes the size of the Korean corpus used for training BERT.
Feature comparison with previous work. We compare our linguistic features with others previously proposed and utilized. Most previous work focused on what are complexity features by our criteria, such as statistical features (e.g., length and n-gram) or style-based features (e.g., part-of-speech labels, sentence structure, and other lexical patterns) (Ramesh and Sanampudi Reference Ramesh and Sanampudi2021). There are also content-based features (e.g., similarities between sentences and prompt overlapping), in which a similarity metric is introduced: for example, Sakaguchi, Heilman, and Madnani (Reference Sakaguchi, Heilman and Madnani2015) used BLEU, Word2vec similarity, and WordNet similarity for their reference-based approach, and Dong and Zhang (Reference Dong and Zhang2016) counted the number of words and their synonyms in the essay appearing in the prompt. Due to the availability of spell checkers for English, spelling, punctuation, and capitalization errors could also be utilized as accuracy features (Persing and Ng Reference Persing and Ng2013; Sakaguchi et al. Reference Sakaguchi, Heilman and Madnani2015; Dong and Zhang Reference Dong and Zhang2016; Cummins, Zhang, and Briscoe Reference Cummins, Zhang and Briscoe2016; Dong, Zhang, and Yang Reference Dong, Zhang and Yang2017). Table 7 shows a summary of handcrafted features in previous work. We used more detailed quantitative measures (token ratio; length of morphemes, words, and sentences for lexical diversity) and linguistic features obtained by POS tagging and syntactic parsing. We also introduced fluency measures, which no previous work has considered. As we mentioned, in future work we are planning to include a grammar error correction system from which we can obtain accuracy features beyond simple spelling errors.
PN13 (Persing and Ng Reference Persing and Ng2013), SH15 (Sakaguchi et al. Reference Sakaguchi, Heilman and Madnani2015), DZ16 (Dong and Zhang Reference Dong and Zhang2016), CZ16 (Cummins et al. Reference Cummins, Zhang and Briscoe2016), DE18 (Dasgupta et al. Reference Dasgupta, Naskar, Dey and Saha2018), UE20 (Uto et al. Reference Uto, Xie and Ueno2020), RE21 (Ridley et al. Reference Ridley, He, Dai, Huang and Chen2021).
6. Conclusion
In this paper, we explored several types of linguistic features in the learner corpus: quantitative complexity, syntactic complexity, and fluency. These features can be used for learner corpus-related applications that make use of machine learning techniques in addition to pre-trained language models for the neural system.
We used various metrics that were automatically measured for these features. Therefore, these metrics could be evaluated without any human intervention to assess the proficiency and holistic score of writing automatically. The proposed neural-based state-of-the-art system applied the transformer-based multilingual masked language model XLM-RoBERTa. In addition, based on the proposed attention mechanism score, we observed how the proposed linguistic features benefit AWE in a complementary manner for neural systems, and we analyzed which sequences of words and expressions the neural system focuses on.
Because our AWE system could provide a reliable holistic score while simultaneously detecting students’ proficiency levels, it could offer potential solutions for Korean language instructors who might be struggling with the workload. Furthermore, it can be used as a resource for grading student essays in large classes or placement tests that need to be graded accurately and promptly. In addition, the AWE system can benefit Korean language learners in their writing practice. Learners can use the AWE system to self-grade their essays before submission and learn how their scores change as they change vocabulary, syntactic structure, etc. in their writing.
Although the proposed neural AWE engine can judge the grammaticality of the learner’s writing using linguistic features and a pre-trained neural language model, the current AWE tool has several limitations. One is that it does not “read” students’ essays. That is, the program can detect syntactic complexity and fluency, but it does not judge whether the content addresses the given writing topic. Similarity between the content and the topic can be estimated by defining the distance between words in the content and the concept of the topic. While previous work has proposed content-based features to calculate similarities with the prompt or reference text (Sakaguchi et al. Reference Sakaguchi, Heilman and Madnani2015; Dong and Zhang Reference Dong and Zhang2016), we have left this for future work. Another limitation is that our approach may show biased performance because the training dataset covers a limited set of topics. However, we observed that this issue can be mitigated by utilizing the pre-trained neural language model. Lastly, the current model does not provide specific error feedback to students. Although learners can check their scores and proficiency level with the AWE tool, they cannot check their errors, which makes it hard for them to learn from those errors.
Given that adding error types to the learner corpus has been presented for multiple grammatical (either morphological or syntactic) levels and for several languages (Ramos et al. Reference Ramos, Wanner, Vincze, del Bosque, Veiga, Suárez and González2010; Boyd Reference Boyd2010; Han et al. Reference Han, Tetreault, Lee and Ha2010; Seo et al. Reference Seo, Lee, Lee, Kweon and Kim2012; Dickinson and Ledbetter Reference Dickinson and Ledbetter2012), our next goal is to add error annotations to the Korean learner corpus to broaden the usage of our AWE system. As the current NLP systems used for feature extraction are developed for the standard Korean language, it is expected that the automatic processing system may produce errors. This error-annotated learner corpus can lead to grammatical error correction (GEC) as a preprocessing step for learner corpus applications. We hope that the additional GEC task will improve learner corpus applications. It is important that the writing be relevant to the given subject, which is an aspect we cannot deal with using the proposed system. To the best of the authors’ knowledge, this has not been presented in previous literature on learner corpus applications, and we will consider this problem for future work.
Acknowledgement
This research was supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (2021R1F1A1063474) for KyungTae Lim.