
CON-FOLD Explainable Machine Learning with Confidence

Published online by Cambridge University Press:  28 October 2024

LACHLAN MCGINNESS
Affiliation:
School of Computer Science, ANU and CSIRO/Data61, Canberra, Australia (e-mail: lachlan.mcginness@anu.edu.au)
PETER BAUMGARTNER
Affiliation:
CSIRO/Data61 and School of Computer Science, ANU, Canberra, Australia (e-mail: peter.baumgartner@data61.csiro.au)

Abstract

FOLD-RM is an explainable machine learning classification algorithm that uses training data to create a set of classification rules. In this paper, we introduce CON-FOLD which extends FOLD-RM in several ways. CON-FOLD assigns probability-based confidence scores to rules learned for a classification task. This allows users to know how confident they should be in a prediction made by the model. We present a confidence-based pruning algorithm that uses the unique structure of FOLD-RM rules to efficiently prune rules and prevent overfitting. Furthermore, CON-FOLD enables the user to provide preexisting knowledge in the form of logic program rules that are either (fixed) background knowledge or (modifiable) initial rule candidates. The paper describes our method in detail and reports on practical experiments. We demonstrate the performance of the algorithm on benchmark datasets from the UCI Machine Learning Repository. For that, we introduce a new metric, Inverse Brier Score, to evaluate the accuracy of the produced confidence scores. Finally, we apply this extension to a real-world example that requires explainability: marking of student responses to a short answer question from the Australian Physics Olympiad.

Type
Original Article
Creative Commons
CC BY
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2024. Published by Cambridge University Press

1 Introduction

Machine learning (ML) has been shown to be incredibly successful at learning patterns from data to solve problems and automate tasks. However, it is often difficult to interpret and explain the results obtained from ML models. Decision trees are one of the few ML methods that offer transparency with regard to how decisions are made. They allow users to follow a set of rules to determine what the outcome of a task (say, classification) should be. The difficulty with this approach is finding an algorithm that is able to construct a reliable set of decision trees.

One approach to generating a set of rules equivalent to a decision tree is the First Order Learner of Default (FOLD) algorithm introduced by Shakerin et al. (2017). To improve a model's ability to handle exceptions in rule sets, Shakerin, Wang, Gupta, and others have refined FOLD into a family of explainable ML algorithms (Shakerin et al. 2017; Wang 2022; Wang and Gupta 2022; Wang et al. 2022; Wang and Gupta 2023; Padalkar et al. 2024) that learn non-monotonic stratified logic programs (Quinlan 1990b). The FOLD algorithm is capable of handling numerical data (FOLD-R) (Shakerin et al. 2017), multi-class classification (FOLD-RM) (Wang et al. 2022), and image inputs (NeSyFOLD) (Padalkar et al. 2024). FOLD-SE uses Gini Impurity instead of information gain in order to obtain a more concise set of rules (Wang and Gupta 2023). Thanks to these improvements, variants of the FOLD algorithm are now competitive with state-of-the-art ML techniques such as XGBoost and RIPPER in some domains (Wang and Gupta 2022, 2023).

The rules produced by the FOLD algorithm are highly interpretable; however, they can be misleading. As an example, consider the popular Titanic dataset (Kaggle 2012), where passengers are classified into two categories: perished or survived. One rule from the FOLD algorithm might say that a passenger survives if they are female and do not have a third-class ticket. Given a new set of data and this rule, a user might be unpleasantly surprised to find that such a passenger perished, because the rule has the appearance of being definitive. In reality, it could be a good rule that is correct 99% of the time. To make a FOLD model more understandable and trustworthy for users, a confidence value could be provided. This would give the user a measure of the certainty of the rule and make it clear that not all such passengers survive.

In this paper, we introduce Confidence-FOLD (CON-FOLD), an extension of the FOLD-RM algorithm. CON-FOLD provides confidence values for each rule generated by the FOLD algorithm, so that users know how confident they should be in each rule of the model. In addition, we present a pruning algorithm that makes use of these confidence values. These techniques applied to the Titanic example yield the following rules and confidences (footnote 1):

We also provide a metric, the Inverse Brier Score, which can be used to evaluate probabilistic or confidence-based predictions while maintaining compatibility with the traditional metric of accuracy. We provide the capability to introduce modifiable initial knowledge into a FOLD model. Finally, we demonstrate the effectiveness of adding background knowledge in the marking of student responses to physics questions. Note that we choose to focus on multi-class classification tasks in our experiments; however, CON-FOLD can also be applied to binary classification.

2 Formal framework and background

We work with the usual logic programming terminology. A (logic program) rule is of the form

(1) \begin{equation} h\ :\hbox{-}\ l_1,\ldots, l_k, \texttt{not}\,e_1, \ldots, \texttt{not}\,e_n\ . \end{equation}

where $h$, $l_1,\ldots, l_k$, and $e_1,\ldots, e_n$ are all atoms, for some $k, n \ge 0$. A program is a finite set of rules. We adopt the formal learning framework described for the FOLD-RM algorithm. In it, binary predicate symbols are used for representing feature values, which can be categorical or numeric. Example feature atoms are name(i1,sam) and age(i1,30) for an individual i1. Auxiliary predicates and Prolog-like built-in predicates can be used as well. Rules always pertain to one single individual and its features, as in this example:
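
For illustration, rules of the kind meant here could look as follows; the auxiliary predicate names ab and likes_purple are illustrative placeholders, not prescribed by the framework:

female(X) :- age(X,N), N > 16, not ab(X).
ab(X) :- name(X,sam), not likes_purple(X).
likes_purple(X) :- fav_color(X,purple).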

These rules for a target relation female could have been learned from training data where all individuals older than 16 are female, except those named sam whose favorite color is not purple. Below, we use the letter $r$ to refer to the rule for the target relation (here, the female rule) and the letter $R$ for the set of auxiliary rules (here, the remaining two rules) that are needed to define the predicates in $r$.

The learning task in general is defined in Wang et al. (2022). The learning algorithm takes as input two disjoint sets $X = X_p \uplus X_n$ of positive and negative training examples, respectively. It assumes that the training set distribution is approximately the same as the test set distribution.

For any $d \in X$ , let ${features}(d)$ denote $d$ ’s features as a set of atoms over some a priori fixed Skolem constant, say, c, for example: ${features}(d) = \{\texttt{age(c,18)}, \texttt{name(c,adam)}, \texttt{fav\_color(c,red)}\}$ . The learning algorithm below checks if a current set $R \cup \{r\}$ puts an example $d$ into the target class. Letting $\ell$ denote the target class as an atom, for example $\ell = \texttt{female(c)}$ , this is accomplished via an entailment check $R \cup \{r\} \cup{features}(d) \models \ell$ .

Because all learned programs are stratified, we adopt the standard semantics for that case, the perfect model semantics (layered bottom-up fixpoint computation), which is also reflected in the FOLD-RM algorithm. We emphasize that default negation causes no problems in calculating the confidence scores of a rule.

The confidence scores are attached only to the top-level rules $r$, never to the rules $R$ referred to under default negation. That is, having to deal with confidence scores in a negated context will never be necessary. We describe the rationale behind this design decision below.

The Boolean learning task is to determine a stratified program $P$ such that

\begin{align*} P \cup{features}(d) \models \ell & \text{ for all $d \in X_p$} & \text{and} & & P \cup{features}(d) \not \models \ell & \text{ for all $d \in X_n$}\text{ .} \end{align*}

The multi-class learning problem is a generalization to a finite set of classes. The training data $X$ then consists of atoms indicating the target class for each data point, for example, survived(c,false) or survived(c,true) in the Titanic example. Conversely, by splitting, multi-class learning problems can be treated as a sequence of Boolean learning problems.

The basis of CON-FOLD is the FOLD-RM algorithm (Wang et al. 2022). FOLD-RM reduces a multi-class classification task to a series of Boolean classification tasks. It first chooses the class with the most examples and treats the data belonging to this class as positive examples and all other data as negative examples. It then generates a rule over the input features that maximizes information gain. This process is repeated on the remaining training examples after eliminating those that are correctly classified by the new rule, and it stops when all examples are classified (see the sketch below).
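
As a rough illustration of this loop (not the actual FOLD-RM implementation), the following Python sketch assumes helper routines most_frequent_class, learn_rule, and covers that stand in for the corresponding FOLD-RM internals:

def one_vs_rest_loop(examples, most_frequent_class, learn_rule, covers):
    """Sketch of the FOLD-RM-style outer loop described above (illustrative only).

    examples:            list of (features, label) pairs
    most_frequent_class: returns the label with the most remaining examples
    learn_rule:          learns one rule separating positive from negative examples
    covers:              tests whether a rule fires on a given feature set
    """
    rules = []
    while examples:
        target = most_frequent_class(examples)
        pos = [e for e in examples if e[1] == target]
        neg = [e for e in examples if e[1] != target]
        rule = learn_rule(pos, neg)
        # examples the new rule classifies correctly are removed before repeating
        correctly_classified = [e for e in pos if covers(rule, e[0])]
        if not correctly_classified:
            break  # no progress is possible, so stop
        rules.append((target, rule))
        examples = [e for e in examples if e not in correctly_classified]
    return rules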

3 Related work

Confidence scores have been introduced in decision tree learning for assessing the admissibility of a pruning step (Quinlan 1990a). In essence, decision tree pruning removes a sub-tree if the classification accuracy of the resulting tree does not fall below a certain threshold, for example, in terms of standard errors, and possibly corrected for small domain sizes. Decision trees can be expressed as sets of production rules (Quinlan 1987), one production rule for each branch in a decision tree. The production rules can again be simplified using scoring functions (Quinlan 1987).

The FOLD family of algorithms learns rules with exceptions by means of default negation (negation-as-failure). The rules defining the exceptions can themselves have exceptions. This sets FOLD apart from the production systems learned by decision tree classifiers, which do not take advantage of default negation and may therefore produce more complex rule sets. Indeed, Wang et al. (2022) observe that “For most datasets we experimented with, the number of leaf nodes in the trained C4.5 decision tree is much more than the number of rules that FOLD-R++/FOLD-RM generate. The FOLD-RM algorithm outperforms the above methods in efficiency and scalability due to (i) its use of learning defaults, exceptions to defaults, exceptions to exceptions, and so on, (ii) its top-down nature, and (iii) its use of improved method (prefix sum) for heuristic calculation.” We feel, however, that these experimental results could be complemented from a more conceptual point of view. This is beyond the scope of this paper and left as future work.

Scoring functions have been used in many rule-learning systems. The common idea is to allow accuracy to degrade within given thresholds for the benefit of simpler rules; see Law et al. (2020) for a discussion of the more recent ILASP3 system and the references therein. Hence, we do not claim originality in using scoring systems for rule learning. We see our main contribution differently and not in competition with other systems. Indeed, one of the main goals of this paper is to equip an existing technique that has been shown to work well – FOLD-RM – with confidence scores. With our algorithm design and experimental evaluation we show that this goal can be achieved in an “almost modular” way. Moreover, as our extension requires only minimal changes to the base algorithm, we expect that our method is transferable to other rule-learning algorithms.

4 The CON-FOLD algorithm and confidence scores

The CON-FOLD algorithm assigns each rule a confidence score as it is created. For easy interpretability, confidence scores should equal probability values (p-values from a binomial distribution) in the case of large amounts of data. However, with small amounts of training data the naive estimate $n_p/n$ would be very poor; if a rule covered only one training example, it would receive a confidence value of 1 ($100\%$). There are many techniques for estimating such probabilities from a sample; we chose the center of the Wilson Score Interval (Wilson 1927), given by Eq. (2). The Wilson Score Interval adjusts for the asymmetry in the binomial distribution, which is particularly pronounced for extreme probabilities and small sample sizes. It is less prone to producing misleading results in these situations than the normal approximation method, making it more trustworthy for users (Agresti and Coull 1998).

(2) \begin{equation} p = \frac{n_p + \frac{1}{2} Z^2}{n + Z^2}\text{, } \end{equation}

where $p$ is the confidence score, $n_p$ is the number of training examples corresponding to the target class covered by the rule, $n$ is the number of training examples covered by the rule corresponding to all classes, and $Z$ is the standard normal interval half width; by default, we use $Z=3$ .
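
As a minimal sketch (not part of the CON-FOLD implementation), Eq. (2) can be computed as follows:

def confidence(n_p, n, z=3.0):
    """Centre of the Wilson score interval, Eq. (2)."""
    if n == 0:
        return 0.0  # a rule covering no examples receives no confidence
    return (n_p + 0.5 * z * z) / (n + z * z)

print(confidence(1, 1))     # ~0.55: a single covered example does not yield confidence 1
print(confidence(99, 100))  # ~0.95: approaches n_p / n as coverage grows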

Theorem 4.1. In the limit where there is a large amount of data classified by a rule ( $n \rightarrow \infty$ ), the confidence score approaches the true probability of the sample being from the target class.

Proof Let $p_r$ denote the true probability that a randomly selected example covered by the rule belongs to the target class. As $n$ increases, the law of large numbers gives $n_p/n \rightarrow p_r$, that is, $n_p \rightarrow p_r n$. Also, as $n \rightarrow \infty$, the relative contribution of the $Z^2$ terms becomes negligible. Therefore:

\begin{equation*} \lim _{n \rightarrow \infty } p = \lim _{n \rightarrow \infty } \frac {n_p + \frac {1}{2} Z^2}{n + Z^2} = \lim _{n \rightarrow \infty } \frac {n p_r}{n} = p_r \end{equation*}

Once rules have the associated confidence scores, they are expressed as follows:

(3) \begin{equation} p:: h\ :\hbox{-}\ l_1,\ldots, l_k, \texttt{not}\,e_1, \ldots, \texttt{not}\,e_n\ . \end{equation}

where $p$ is the confidence score. This format of confidence score annotation is directly supported by probabilistic logic programming systems such as ProbLog (De Raedt et al. 2007) and Fusemate (Baumgartner and Tartaglia 2023). In this paper, we do not explore this possibility and use the confidence scores only to let the user know the reliability of a prediction made by a rule in the logic program.

Algorithm 1 CON-FOLD

Fig. 1. This toy example illustrates the difference between the FOLD-RM and CON-FOLD core algorithms. Both produce rules of the form shown. CON-FOLD would not consider the Flamingo as part of the data to fit when generating rule 2. FOLD-RM would consider the Flamingo. Note that in many cases, both algorithms would generate an abnormal rule ab(X) :- flamingo(X), preventing the Flamingo from being covered by the first rule. In this case, both FOLD-RM and CON-FOLD would include the Flamingo. When harsh pruning occurs and there are few abnormal rules, this subtle change becomes noticeable.

The CON-FOLD algorithm (Algorithm 1) closely follows the presentation of FOLD-RM in Wang et al. (2022); see there for definitions of split_by_literal, learn_rule (slightly adapted) and most. On line 11, ${conf}(P, X_p, X_n)$ computes the Wilson score $p$ as in (2), letting

\begin{equation*} n_p = |\{d \in X_p \mid P \cup {features}(d) \models \ell\}| \quad \text{and} \quad n = n_p + |\{d \in X_n \mid P \cup {features}(d) \models \ell\}|\text{, } \end{equation*}

and $\ell$ the target class as an atom. Note that line 6 is only used if pruning. Other than this, the only difference between CON-FOLD and FOLD-RM is in lines 9 and 10. In our terminology, FOLD-RM includes in the updated set $X$ the full set $X_n$ instead of $X_{{tn}}$. The consequence of this difference is highlighted in Figure 1.

Theorem 4.2. The CON-FOLD algorithm always terminates on any finite set of examples.

Proof Each pass through the while loop produces a rule that either successfully classifies at least one example or does not. If no examples are successfully classified, the algorithm terminates immediately (lines 8–9). Otherwise, the classified examples are removed from the set (line 11), strictly decreasing its size. Since the set is finite and each iteration removes at least one element, the set eventually becomes empty and the loop terminates on that condition.

FOLD-RM has a complexity of $\mathcal{O}(N M^3)$, where $N$ is the number of features and $M$ is the number of examples (Wang et al. 2022). In the worst case, each rule covers only one example and requires an additional $M-1$ literals to exclude the remaining data. The pruning algorithm can then be called once per rule for each target class, which in the worst case is $M$ times. The complexity of the pruning algorithm is analyzed below.

Once confidence values have been assigned to each rule, it is possible to use these values to allow for pruning and prevent overfitting. In the following, we introduce two such pruning methods: improvement threshold and confidence threshold pruning.

4.1 Improvement threshold pruning

Improvement threshold pruning is designed to stop rules from overfitting data by becoming unnecessarily “deep.” The rule structure in FOLD allows for exceptions, for exceptions to exceptions, and then for exceptions to these exceptions and so on. This may overfit the model to noise in the data very easily. We will refer to any exception to an exception at any depth as a sub-exception.

In order to avoid this overfitting, each time a rule is added to the model, each exception to the rule is temporarily removed, and a new confidence score is calculated. If this changes the confidence by less than the improvement threshold, then this exception is removed. If an exception is kept, then this process is applied to each sub-exception. This process is repeated until each exception and sub-exception has been checked.

Algorithm 2 Evaluate Exceptions

The details are formalized in the pseudocode in Algorithm 2. In there, ${remove\_rule}(R,r)$ removes the rule $r$ from the set $R$ of rules and removes the possibly default-negated atom with the head of $r$ from the bodies of all rules in $R$.
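
A rough Python sketch of this recursion is given below; the rule representation (a list attribute exceptions) and the helper top_level_conf are illustrative assumptions rather than the paper's implementation, and the comparison treats the threshold as the minimum improvement an exception must contribute:

def evaluate_exceptions(rule, top_level_conf, improvement_threshold):
    """Sketch of the improvement-threshold pruning described above (illustrative only).

    rule:           object with a list attribute `exceptions`; each exception is itself a rule
    top_level_conf: zero-argument function recomputing the confidence (Eq. 2) of the
                    enclosing top-level rule on the training data
    """
    for exception in list(rule.exceptions):
        with_exception = top_level_conf()
        rule.exceptions.remove(exception)     # temporarily drop this exception
        without_exception = top_level_conf()
        if with_exception - without_exception < improvement_threshold:
            continue                          # confidence changes too little: leave it out
        rule.exceptions.append(exception)     # keep it, then check its own sub-exceptions
        evaluate_exceptions(exception, top_level_conf, improvement_threshold)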

Theorem 4.3. The Evaluate Exceptions algorithm always terminates on any finite set of rules, exceptions, and data points.

Proof On any given rule, each exception is checked once, and the algorithm terminates when there are no more exceptions to be checked. The depth of recursion, and therefore the number of times that evaluate_exceptions is called recursively, is bounded by the number of exceptions.

To determine the complexity of this pruning algorithm, let $R$ be the total number of exceptions and sub-exceptions and $M=|X_p|+|X_n|$ be the number of examples. The number of sub-exceptions of any exception is then bounded in the worst case by $R-1$, that is, $\mathcal{O}(R)$. The calculation of confidence within the loop takes $\mathcal{O}(MR)$ time, as it requires comparing each sub-exception to at most $M$ data points. The loop must run once for each exception, which in the worst case is $\mathcal{O}(R)$ times. Therefore, the total complexity of the pruning algorithm is $\mathcal{O}(MR^2)$.

4.2 Confidence threshold pruning

In addition to decreasing the depth of rules, it is also desirable to decrease the number of rules. A confidence threshold is used to determine whether a rule is worth keeping in the model or whether it is too uncertain to be useful. If a rule has a confidence value below the confidence threshold, then it is removed (a minimal sketch is given after the following list). This effectively means that a rule can be removed on two grounds:

  • There are insufficient examples in the training data to lead to a high confidence value.

  • There may be many examples in the training data that match the rule; however, a large fraction of the examples are actually counterexamples that go against the rule.
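
A minimal sketch of this filter, assuming rules are stored together with their confidence scores:

def confidence_threshold_prune(scored_rules, confidence_threshold):
    """Drop whole rules whose confidence score falls below the threshold (illustrative only).

    scored_rules: list of (confidence, rule) pairs
    """
    return [(c, r) for (c, r) in scored_rules if c >= confidence_threshold]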

The two pruning methods introduced above can greatly simplify a model, making it more interpretable to humans. Pruning also reduces the model's ability to fit noise, which can lead to an increase in accuracy. However, if rules are pruned too harshly, the model will underfit and performance will decrease. In order to assess these effects, we applied CON-FOLD to a sample dataset, the E. coli dataset. In our experiments, we varied the values of the two threshold parameters corresponding to the two pruning methods. This allowed us to assess the performance with respect to accuracy and to derive recommendations for parameter settings (see Figure 2). For this dataset, accuracy is highest for pruning with a low confidence threshold and a moderate improvement threshold. Pruning too harshly leads to a significant decrease in performance.

Fig. 2. Scatter plot of the accuracy and number of rules for a ruleset generated by the pruning algorithm with different values of the improvement threshold and the confidence threshold. Each point has two circles. The background circle displays the number of rules and accuracy for no pruning; therefore, all background circles are the same. The front circle displays the rules and accuracy when pruning is applied. The accuracy is indicated by the color shown in the scale bar on the right-hand side. Pruning conditions that are more accurate than the unpruned condition are indicated with a black dot in the center. The number of rules is indicated by the area of the circle (equal amount of ink for number of rules), normalized by the number of rules in the unpruned case. The results shown for both the accuracy and the number of rules are the average of 300 trial runs for each test condition.

5 Inverse Brier Score

A standard measure of performance in ML is accuracy, which for multi-class tasks is defined by:

(4) \begin{equation} \text{Accuracy} = \frac{1}{N} \sum ^N_{i=1} A(y^*_i,y_i) \end{equation}

where $A(y^*,y) = 1$ if $y^*=y$ and $0$ otherwise.

When predictions are made with confidence scores, it is possible to use more sophisticated measures of a model's performance. In particular, the model should only receive a small reward if it makes a correct prediction with low confidence, and it should be penalized for making high-confidence predictions that are incorrect. We have three desiderata for such a scoring system:

  1. It is a proper scoring system: the expected reward is maximized when the probability given matches the true probability.

  2. The scoring system reduces to accuracy when non-probabilistic predictions are made. This allows probabilistic and non-probabilistic models to be compared in terms of performance.

  3. The scoring system has an inbuilt mechanism for dealing with the case where no prediction is given.

In order to meet all three of these desiderata, we propose a variant of the Brier scoring system and call it the Inverse Brier Score (IBS). The Brier score (or quadratic score) is often used to evaluate the quality of weather forecasts (Allen et al. 2023; Liu et al. 2023) and can be obtained with the following formula (Brier 1950):

(5) \begin{equation} \text{Brier Score} = \frac{1}{N} \sum ^N_{i=1} \sum ^K_{k=1} (p_{ik}-y_{ik})^2\text{, } \end{equation}

where $i$ is an index over each data point in a test dataset, $k$ corresponds to a class in the dataset, $N$ is the total number of test examples, and $K$ the total number of classes. This is commonly reduced to the following when only making predictions for one class (Murphy and Epstein 1967):

(6) \begin{equation} \text{One Class Brier Score} = \frac{1}{N} \sum ^N_{i=1} (p_{i}-y_{i})^2 \end{equation}

We define the IBS as:

(7) \begin{equation} \text{IBS} = 1 - \frac{1}{N} \sum ^N_{i=1} (p_{i}-y_{i})^2 \end{equation}

Theorem 5.1. Inverse Brier Score is a proper scoring system.

Proof It is known that the Brier Score is a proper scoring system (Murphy and Epstein 1967). Since the propriety of a scoring system is preserved under linear transformations, the IBS is a proper scoring system.

Theorem 5.2. When definite predictions are made ( $p=1$ or $p=0$ ), IBS is equivalent to accuracy.

Proof For non-probabilistic decisions, $p_{i}$ is replaced by the prediction $y^*$ . Therefore, $(y_i^*-y_i)^2 = 1-A(y_i^*,y_i)$ , and the IBS reduces to the definition of accuracy as follows:

\begin{equation*} 1 - \frac {1}{N} \sum ^N_{i=1} 1-A(y_i^*,y_i) = 1 - \frac {N}{N}+ \frac {1}{N} \sum ^N_{i=1} A(y_i^*,y_i) = \frac {1}{N} \sum ^N_{i=1} A(y_i^*,y_i) \end{equation*}

Finally, IBS has a natural way of dealing with the case where no prediction is made. If any class is given a prediction of $p=0.5$, then the IBS contribution is $0.75$ regardless of the class that is chosen. This provides a natural default value if a model refuses to make a prediction due to low confidence, satisfying the third desideratum. Note that this is also important for the FOLD algorithm because it may encounter a test sample that is significantly different from the training examples and does not match any rule.

Therefore, the IBS satisfies all three desiderata. Both IBS and accuracy are used to evaluate CON-FOLD against other models in Section 7. We note that confidence values are regularly used in weather forecasting, where the Brier score is used as a metric to evaluate forecast quality, and we use this metric to show that FOLD models with confidence scores obtain higher scores. Although the Brier score does not always align with users' expectations (Jewson 2004), it is a metric that credits models with confidence scores, which are more interpretable.
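
The following Python sketch (illustrative, not the evaluation code used in the experiments) computes the IBS for one-class predictions and shows the behaviour described above:

def inverse_brier_score(confidences, correct):
    """Inverse Brier Score, Eq. (7), for one-class probabilistic predictions.

    confidences: confidence p_i of each prediction, or None when no prediction is made
    correct:     booleans indicating whether prediction i matched the true class
    """
    total = 0.0
    for p, ok in zip(confidences, correct):
        if p is None:
            p, ok = 0.5, False  # abstention: with p = 0.5 the chosen class does not matter
        y = 1.0 if ok else 0.0
        total += (p - y) ** 2
    return 1.0 - total / len(confidences)

print(inverse_brier_score([1.0, 1.0, 1.0, 1.0], [True, True, True, False]))  # 0.75, i.e. accuracy
print(inverse_brier_score([None], [True]))                                   # 0.75 for an abstention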

6 Manual addition of rules and physics marking

A key advantage of FOLD over other ML methods is that users are able to incorporate background knowledge about the domain in the form of rules. In addition to fixed background knowledge, we support modifiable initial knowledge. Initial knowledge can be provided with or without confidence values. During training, CON-FOLD can add confidence values to provided rules, prune exceptions, and even prune the rules themselves if they do not match the dataset.

In order for rules to be added to a FOLD model, they must be admissible rules as defined below, with optional confidence values. Admissible rules obey the following conditions:

  • Every head must use the predicate of the value to be decided (the target).

  • Each body must consist only of predicates corresponding to features.

  • Bodies can be Boolean combinations of literals.

  • The $<$, $>$, $\leq$, $\geq$, $=$, and $\neq$ operators are allowed for comparison of numeric variables.

  • For categorical variables, only $=$ and $\neq$ are allowed.

Formulas given in this form are translated into logic programs. Stratified default negation is used to ensure that rules are evaluated sequentially, mirroring a decision list structure. The rules therefore form a hierarchical structure, and the order in which initial/background knowledge is added can influence the model output in the case of overlapping rule bodies.

The inclusion of background and initial knowledge can be very helpful in cases where only very limited training data is available. An example of such a problem domain is grading students' responses to physics problems. Usually, background domain knowledge is readily available in the form of well-defined rules for how responses should be scored.

We evaluated this idea with data from the 2023 Australian Physics Olympiad provided by Australian Science Innovations. The data included 1525 student responses to 38 Australian Physics Olympiad questions and the grades awarded for each of these responses. We chose to mark the first question of the exam because most students attempted it and the answer is simply a number with units and direction. This problem was also favorable because the marking scheme was simple with only three possible marks, $0$ , $0.5$ and $1$ . Responses to this question were typed, so there was no need for optical character recognition (OCR) to interpret hand-written work. An example of a rule used for marking is:

grade(1,X) :- rule1(X).

rule1(X) :- correct_number(X), correct_unit(X).

In order to apply the marking scheme, relevant features need to be extracted from the student responses. Feature extraction is an active area of research in natural language processing (NLP) (Zhenzhen 2024). Many approaches focus on extracting information and relationships about entities from large quantities of text and can extract large numbers of features (Carenini et al. 2005; Vicient et al. 2013). Tools based on term frequency can struggle with short answers (Liu et al. 2018), especially if these contain large numbers of symbols and numbers, making them inappropriate for feature extraction when grading physics papers.

In our work, we use SpaCy (Honnibal and Montani 2017) for part-of-speech (POS) tagging to extract noun phrases and numbers to be used as keywords. The presence or absence of these keywords is then one-hot encoded for each piece of text to create a set of features. This method alone did not extract sufficiently sophisticated features to implement the marking scheme. Therefore, we also used regular expressions to extract the required features for the marking scheme. This required significant customization to match the wide variety of notations used by students.
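
As a toy illustration of the regular-expression route (the patterns and feature names below are invented for this sketch and are far simpler than the expressions needed to cover real student notation):

import re

def extract_answer_features(response):
    """Toy regular-expression feature extractor for a short numeric physics answer."""
    number = re.search(r"[-+]?\d+(?:[.,]\d+)?(?:[eE][-+]?\d+)?", response)
    unit = re.search(r"m\s*/?\s*s\s*\^?\s*-?\s*2", response)  # e.g. m/s^2, m s^-2, ms^-2
    direction = re.search(r"\b(up|down|left|right)(?:wards?)?\b", response, re.IGNORECASE)
    return {
        "has_number": bool(number),
        "has_unit": bool(unit),
        "has_direction": bool(direction),
    }

print(extract_answer_features("The acceleration is 9.8 m/s^2 downwards"))
# {'has_number': True, 'has_unit': True, 'has_direction': True}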

7 Results

We compare the CON-FOLD algorithm to XGBoost (Chen and Guestrin 2016), a standard ML method, to FOLD-RM, and to FOLD-SE, which is currently the state-of-the-art FOLD algorithm. The results can be found in Table 1. The experiments use datasets from the UCI Machine Learning Repository (Kelly et al. 2024). The comparison uses 30 repeated trials, in each of which a random 80% of the data is selected for training and the remaining 20% is used for testing.

Table 1. Accuracy, runtime, and number of rules and predicates for different methods and different benchmarks from the UCI repository. Pruning hyperparameters are provided for the CON-FOLD algorithm. The uncertainty values provided are the standard deviations from 30 trial runs. FOLD-SE results are taken from Wang and Gupta (2023)

The hyperparameters for all XGBoost experiments are the defaults of the existing Python implementation (footnote 2). In all FOLD-RM and CON-FOLD experiments, the ratio parameter is set to $0.5$ (the default in FOLD-RM).

The hyperparameters for the pruned CON-FOLD algorithm are included with the results. FOLD-SE was not included in our experiments as its implementation is not publicly available, but values from Wang and Gupta (2023) are included. We note that the accuracy and number of rules for XGBoost and FOLD-RM are very similar to the results given in that study, and we are therefore confident that the experiments are equivalent. The times reported are wall-clock times measured on a PC with an Intel Core i9-13900K CPU and 64 GB of RAM. Note that when only small amounts of training data were used, we ensured that at least one example of each class was included in the training data; we refer to this as stratified training data.

Fig. 3. Plot of IBS against the percentage of data included in the stratified training data for the E.coli UCI dataset. Thirty trials for each condition were performed, and error bars indicate one standard deviation across the trials. Pruned CON-FOLD used a confidence threshold of $0.65$ and a pruning threshold of $0.07$ .

Fig. 4. Each of the plots shows the performance of models using the Inverse Brier Score metric with different amounts of training data. Plots a and b show the regimes where large amounts of training data are available, while plots e and f explore model performance with very small amounts of training data available. Plots a, c, and e use automatic feature extraction, while plots b, d, and f use manual feature extraction using regular expressions, which allows for domain knowledge in the form of a marking scheme to be included. The total number of student responses was $n=1525$ .

Figure 3 demonstrates how the amount of training data impacts the IBS for XGBoost and the CON-FOLD algorithm with and without pruning.

One particular use case where high levels of explainability are required is marking students' responses to exam questions. When artificial intelligence or ML is applied to automated marking, it is desirable to require only a minimal training dataset, as this reduces the cost of marking initial examples to use as training data for the models. Therefore, we tested each model on very small training datasets, the first having just three examples of student work ($0.2\%$ of the dataset).

Three different models were tested. The first two were XGBoost and the unpruned CON-FOLD algorithm. The third model tested was the CON-FOLD algorithm with the marking scheme given as domain knowledge before training. Each rule from the marking scheme was given a confidence of 0.99. Thirty trials were conducted with randomly generated stratified data ranging from $0.2\%$ to $90\%$ of the data used for training. The results of this experiment can be seen in Figure 4. The average is represented by the points on the graph, and the standard deviation from the 30 trials is given as error bars.

Figure 4 shows plots of IBS against the amount of training data for grading student responses to an Australian Physics Olympiad problem, with and without manual feature extraction.

8 Discussion

The runtime for XGBoost on the weight lifting data in Table 1 appears anomalously long. This has been observed previously (Wang and Gupta 2023) and can be attributed to the large number of features ( $m=155$ ) in the dataset, which is an order of magnitude greater than the others. This result confirms that both CON-FOLD and the pruning algorithm scale well with the number of input features.

Table 1 shows that the pruning algorithm reduces the number of rules compared with the FOLD-RM algorithm with a small decrease in accuracy. The main advantage of decreasing the number of rules is that it makes the results more interpretable to humans, as having hundreds of rules becomes quite difficult for a human to follow. Furthermore, a smaller set of rules reduces the inference time of the model. The FOLD-SE algorithm produces a smaller number of rules and higher accuracy for most datasets. We note that the CON-FOLD pruning technique and the use of Gini Impurity from FOLD-SE are not mutually exclusive and applying both may result in even more concise results while maintaining performance.

Our experiment in Figure 3 shows that for small amounts of stratified training data, the ability to put confidence values on predictions gives CON-FOLD a significant advantage over XGBoost. We also note that pruning gives a very slight advantage for very small training datasets. We attribute this to pruning preventing the model from overestimating confidence based on only a small number of examples.

For the physics marking dataset, the rules created from the marking scheme align almost perfectly with the scores that students were actually awarded, as expected. This results in very strong performance even when there is very little training data. We note that the unpruned CON-FOLD algorithm gradually increases its IBS as the amount of training data increases. We attribute this to a combination of increasing confidence in rules that were learned and being able to learn more complex rules from larger training datasets.

For the XGBoost experiments with features generated by regular expressions in Figure 4f, we note that the algorithm scores approximately 0.92 consistently until $1.8\%$ of the training data is reached. The model's performance then varies wildly between trials before jumping to an accuracy of 0.99 at $2.8\%$. We attribute the instability to sensitivity to whether specific training examples are included in the training data. Once $2.8\%$ of the data is included in training, this instability settles, as this is a sufficient amount of data for the required examples to reliably fall into the training dataset. A similar pattern is also present in Figure 4c.

For small amounts of training data, the IBS of the CON-FOLD algorithm is largely independent of whether regular-expression features were included. However, in the regime of large amounts of training data shown in Figure 4a and Figure 4d, manually extracted features make a small but significant improvement in the performance of both XGBoost and CON-FOLD. This has implications for the use of automated POS-tagging tools for feature extraction in automated marking. Regular expressions for feature extraction allow for more accurate results and for the implementation of background rules, but this comes at significant development cost.

9 Conclusion and future work

We have introduced confidence values that allow users to know the probability that a rule from a FOLD model will be correct when applied to a dataset. This removes the illusion of certainty when using rules created by the FOLD algorithm. We introduce a pruning algorithm that can use these confidence values to decrease the number and depth of rules. The pruning algorithm allows for the number of rules to be significantly decreased with a small impact on performance; however, it is not as effective as the use of the Gini Impurity methods in FOLD-SE.

IBS is a metric that rewards accurate forecasting of rule probabilities while maintaining compatibility with non-probabilistic models by reducing to accuracy in the case of non-probabilistic predictions.

CON-FOLD allows for the inclusion of background and initial knowledge in FOLD models. We use the marking of short-answer physics exams as a use case to demonstrate the effectiveness of incorporating readily available domain knowledge in the form of a marking scheme. With this background knowledge, the CON-FOLD model's performance is significantly improved and outperforms XGBoost, especially with small amounts of training data.

Besides improvements to the FOLD algorithm, an area for future work is feature extraction from short snippets of free-form text. NLP feature extraction tools were not able to capture the features required to implement a marking scheme, but were otherwise able to extract enough features to allow for accuracy and IBS of over $99\%$ in the regime of large amounts of training data. OCR could be used to extend this approach to the grading of hand-written responses. As a final suggestion for future research, more advanced NLP tools such as large language models could be used to automate the extraction of numbers and units from text.

Acknowledgments

The authors would like to thank Australian Science Innovations for access to data from the 2023 Australian Physics Olympiad. This research was supported by a scholarship from CSIRO’s Data61. The ethical aspects of this research have been approved by the ANU Human Research Ethics Committee (Protocol 2023/1362). All code can be accessed on GitHub at https://github.com/lachlanmcg123/CONFOLD . We thank Daniel Smith for helpful comments.

Footnotes

1 For simplicity of demonstration, the rules were obtained from the Titanic test dataset. We used an improvement threshold of $0.1$ (Section 4.1) and a confidence threshold of $0.5$ (Section 4.2).

2 max_depth $= 6$ , learning_rate $= 0.3$ , n_estimators $= 100$ , objective $=$ binary:logistic and scale_pos_weight $= 1$

References

Agresti, Alan and Coull, Brent A. 1998. Approximate is better than "exact" for interval estimation of binomial proportions. The American Statistician 52, 2, 119-126. doi: 10.2307/2685469.
Allen, Sam, Ferro, Christopher A. T. and Kwasniok, Frank 2023. A conditional decomposition of proper scores: quantifying the sources of information in a forecast. Quarterly Journal of the Royal Meteorological Society 149, 754, 1704-1725. doi: 10.1002/qj.4478.
Baumgartner, Peter and Tartaglia, Elena 2023. Bottom-up stratified probabilistic logic programming with Fusemate. EPTCS 385, 87-100.
Brier, Glenn W. 1950. Verification of forecasts expressed in terms of probability. Monthly Weather Review 78, 1, 1-3.
Carenini, Giuseppe, Ng, Raymond T. and Zwart, Ed 2005. Extracting knowledge from evaluative text. In Proc. of the 3rd International Conference on Knowledge Capture, K-CAP '05, New York, NY, USA, Association for Computing Machinery, 11-18. doi: 10.1145/1088622.1088626.
Chen, Tianqi and Guestrin, Carlos 2016. XGBoost: a scalable tree boosting system. In Proc. of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '16, New York, NY, USA, Association for Computing Machinery, 785-794. doi: 10.1145/2939672.2939785.
De Raedt, Luc, Kimmig, Angelika and Toivonen, Hannu 2007. ProbLog: a probabilistic Prolog and its application in link discovery. In Proc. of the 20th International Joint Conference on Artificial Intelligence, IJCAI'07, San Francisco, CA, USA, Morgan Kaufmann Publishers Inc., 2468-2473.
Honnibal, Matthew and Montani, Ines. 2017. SpaCy 2: natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. URL: https://spacy.io/. [Accessed on February 1, 2024].
Jewson, Stephen. 2004. The problem with the Brier score. arXiv:physics/0401046.
Kaggle. 2012. Titanic - Machine Learning from Disaster. Kaggle. URL: https://kaggle.com/competitions/titanic
Kelly, Markelle, Longjohn, Rachel and Nottingham, Kolby. 2024. Home - UCI Machine Learning Repository. URL: https://archive.ics.uci.edu/.
Law, Mark, Russo, Alessandra and Broda, Krysia. 2020. The ILASP system for inductive learning of answer set programs. Theory and Practice of Logic Programming 20, 4-5, 633-652. doi: 10.1017/S1471068420000257.
Liu, Jing, Yu, Jin, Lin, Shen, Zhang, Guodong, Zhang, Shuo, Li, Min and Lin, Xiaoyue. 2023. Research on rainbow probabilistic forecast model based on meteorological conditions in ZhaoSu region. Meteorological Applications 30, 3, e2131. doi: 10.1002/met.2131.
Liu, Qing, Wang, Jing, Zhang, Dehai, Yang, Yun and Wang, NaiYao 2018. Text features extraction based on TF-IDF associating semantic. In 2018 IEEE 4th International Conference on Computer and Communications (ICCC), 2338-2343. doi: 10.1109/CompComm.2018.8780663. URL: https://ieeexplore.ieee.org/document/8780663
Murphy, Allan H. and Epstein, Edward S. 1967. A note on probability forecasts and "hedging". Journal of Applied Meteorology and Climatology 6, 6, 1002-1004. doi: 10.1175/1520-0450(1967)006<1002:ANOPFA>2.0.CO;2.
Padalkar, Parth, Wang, Huaduo and Gupta, Gopal 2024. NeSyFOLD: a framework for interpretable image classification. Proceedings of the AAAI Conference on Artificial Intelligence 38, 5, 4378-4387. doi: 10.1609/aaai.v38i5.28235.
Quinlan, J. R. 1987. Simplifying decision trees. International Journal of Man-Machine Studies 27, 3, 221-234. doi: 10.1016/S0020-7373(87)80053-6.
Quinlan, J. R. 1990a. Probabilistic decision trees. In Machine Learning, Kodratoff, Yves and Michalski, Ryszard S., Eds. Morgan Kaufmann, San Francisco, CA, 140-152. doi: 10.1016/B978-0-08-051055-2.50011-0.
Quinlan, J. R. 1990b. Learning logical definitions from relations. Machine Learning 5, 3, 239-266. doi: 10.1023/A:1022699322624.
Shakerin, Farhad, Salazar, Elmer and Gupta, Gopal. 2017. A new algorithm to automate inductive learning of default theories. Theory and Practice of Logic Programming 17, 5-6, 1010-1026. doi: 10.1017/S1471068417000333.
Vicient, Carlos, Sánchez, David and Moreno, Antonio. 2013. An automatic approach for ontology-based feature extraction from heterogeneous textual resources. Engineering Applications of Artificial Intelligence 26, 3, 1092-1106. doi: 10.1016/j.engappai.2012.08.002.
Wang, Huaduo. 2022. Explainable AI algorithms for classification tasks with mixed data. URL: https://utd-ir.tdl.org/items/89289f1a-c517-42bc-bb32-a7f3337a7410
Wang, Huaduo and Gupta, Gopal 2022. FOLD-R++: a scalable toolset for automated inductive learning of default theories from mixed data. In Functional and Logic Programming, Hanus, Michael and Igarashi, Atsushi, Eds. Springer International Publishing, 224-242. doi: 10.1007/978-3-030-99461-7_13.
Wang, Huaduo and Gupta, Gopal 2023. FOLD-SE: an efficient rule-based machine learning algorithm with scalable explainability. In Practical Aspects of Declarative Languages, Gebser, Martin and Sergey, Ilya, Eds. Springer Nature Switzerland, 37-53. doi: 10.1007/978-3-031-52038-9_3.
Wang, Huaduo, Shakerin, Farhad and Gupta, Gopal 2022. FOLD-RM: a scalable, efficient, and explainable inductive learning algorithm for multi-category classification of mixed data. Theory and Practice of Logic Programming 22, 5, 658-677.
Wilson, Edwin B. 1927. Probable inference, the law of succession, and statistical inference. Journal of the American Statistical Association 22, 158, 209-212. doi: 10.2307/2276774.
Zhenzhen, Qi. 2024. English sentence semantic feature extraction method based on fuzzy logic algorithm. Journal of Electrical Systems 20, 1, 262-275. doi: 10.52783/jes.681. URL: https://journal.esrgroups.org/jes/article/view/681.