1 Introduction
The dramatic success of machine learning has led to an avalanche of applications of artificial intelligence (AI). However, the effectiveness of these systems is limited by the machines’ current inability to explain their decisions to human users. That is mainly because statistical machine learning methods produce models that are complex algebraic solutions to optimization problems such as risk minimization or geometric margin maximization. The lack of intuitive descriptions makes it hard for users to understand and verify the underlying rules that govern the model. Also, these methods cannot produce a justification for a prediction they arrive at for a new data sample. The problem of explaining (or justifying) a model’s decision to its human user is referred to as the model interpretability problem. The subfield is referred to as explainable AI (XAI).
The inductive logic programming (ILP) learning problem is the problem of searching for a set of logic programming clauses from which the training examples can be deduced. ILP provides an excellent solution for XAI. ILP is a thriving field and a large number of such clause search algorithms have been devised, as described by Muggleton et al. (2012) and Cropper and Dumancic (2020). The search in these ILP algorithms is performed either top down or bottom up. A bottom-up approach builds most-specific clauses from the training examples and searches the hypothesis space by using generalization. This approach is not applicable to large-scale datasets, nor can it incorporate negation-as-failure (NAF) into the hypothesis, as explained in the book by Baral (2003). A survey of bottom-up ILP systems and their shortcomings has been compiled by Sakama (2005). In contrast, a top-down approach starts with the most general clauses and then specializes them. A top-down algorithm guided by heuristics is better suited for large-scale and/or noisy datasets, as explained by Zeng et al. (2014).
The FOIL algorithm by Quinlan is a popular top-down ILP algorithm that learns a logic program. The FOLD algorithm by Shakerin et al. (2017) is a novel top-down algorithm that learns default rules along with exception(s) that closely model human thinking. It first learns default predicates that cover positive examples while avoiding covering negative examples. Then it swaps the covered positive examples and negative examples and calls itself recursively to learn the exception to the default. It repeats this process to learn exceptions to exceptions, exceptions to exceptions to exceptions, and so on. The FOLD-R++ algorithm by Wang and Gupta (2022) is a new scalable ILP algorithm that builds upon the FOLD algorithm to address the efficiency and scalability issues of the FOLD and FOIL algorithms. It introduces the prefix-sum computation and other optimizations to speed up the learning process, while providing human-friendly explanations for its predictions using the s(CASP) answer set programming (ASP) system of Arias et al. (2018). However, all these algorithms focus on binary classification tasks and cannot deal with multi-category classification tasks. Note that a binary classification task checks whether a data record is a member of a given class or not, for example, does a given creature fly or not fly? In multi-category classification, there can be multiple membership classes, for example, a given creature’s habitat can be predicted to be one of desert, mountain, plain, salt water, or fresh water (see the textbook by Bishop (2006)).
In this paper we propose a new ILP learning algorithm called FOLD-RM for multi-category classification that builds upon the FOLD-R++ algorithm. FOLD-RM also provides native explanations for predictions without external libraries or tools. Our experimental results indicate that the FOLD-RM algorithm is comparable in performance to traditional, popular machine learning algorithms such as XGBoost by Chen and Guestrin (2016) and multi-layer perceptrons (MLP) described in the book by Aggarwal (2018). In most cases, FOLD-RM outperforms them in execution efficiency. Of course, neither XGBoost nor MLP is interpretable.
Note that the term model has different meanings in the fields of machine learning and logic programming. We use the term model in this paper in the machine learning sense. Thus, the answer set program generated by our FOLD-RM algorithm is the model that we learn in the sense of machine learning. We use the term answer set in this paper to refer to stable models of answer set programs, where a model means an assignment of truth values to program predicates that makes the program true. Note also that we use the terms clause and rule interchangeably in this paper.
2 Background
2.1 Inductive logic programming
ILP, as described in Muggleton (1991), is a subfield of machine learning that learns models in the form of logic programming clauses comprehensible to humans. The problem is formally defined as follows:
Given
1. A background theory B, in the form of an extended logic program, that is, clauses of the form $h \leftarrow l_1, \dots, l_m,\ not\ l_{m+1}, \dots,\ not\ l_n$, where $h, l_1, \dots, l_n$ are positive literals and not denotes NAF as described in Baral (2003). For reasons of efficiency, we restrict B to be stratified (stratified logic programs are explained in the book by Gelfond and Kahl (2014)).
2. Two disjoint sets of ground target predicates $E^+, E^-$, known as positive and negative examples, respectively.
3. A hypothesis language of function-free predicates L, and a refinement operator $\rho$ under $\theta$-subsumption described in Plotkin (1971) (for more details, see the paper by Cropper and Dumancic (2020)). The hypothesis language L is also assumed to be stratified.
Find a set of clauses H such that:
1. $\forall e \in E^+,\ B \cup H \models e$.
2. $\forall e \in E^-,\ B \cup H \not\models e$.
3. $B \land H$ is consistent.
The target predicate is the predicate whose definition we want to learn as a stratified normal logic program. The positive and negative examples are ground instances of the target predicate. For example, suppose we want to learn the concept of which creatures can fly; then we would give positive examples $E^{+} = \{{\tt fly(tweety), fly(sam), \dots}\}$ and negative examples $E^{-} = \{{\tt fly(kitty), fly(polly), \dots}\}$, where tweety, sam, …, are names of creatures that can fly, and kitty, polly, …, are names of creatures that cannot fly.
Note that the reason for restricting ourselves to stratified normal logic programs is that we can then realize a simple and efficient ASP interpreter in the FOLD-RM system code for the training process. If we allowed non-stratified programs, the training process would have to invoke a full-fledged ASP interpreter during training and testing, resulting in significant inefficiency. Considering non-stratified programs is part of our future research plan. We also restrict ourselves to function-free predicates, that is, we allow only datalog rules, again for reasons of efficiency.
2.2 Default rules
Default logic, proposed by Reiter (1980), is a non-monotonic logic to formalize commonsense reasoning. A default D is an expression of the form
$$D = \frac{A : \textbf{M} B}{\Gamma}$$
which states that the conclusion $\Gamma$ can be inferred if the pre-requisite A holds and B is justified. $\textbf{M} B$ stands for “it is consistent to believe B,” as explained in the book by Gelfond and Kahl (2014). Normal logic programs can encode a default quite elegantly. A default of the form
$$\frac{\alpha_1 \land \dots \land \alpha_m : \textbf{M} \lnot \beta_1, \dots, \textbf{M} \lnot \beta_n}{\gamma}$$
can be formalized as the following normal logic program rule:
$$\gamma \leftarrow \alpha_1, \dots, \alpha_m,\ not\ \beta_1, \dots,\ not\ \beta_n$$
where the $\alpha$'s and $\beta$'s are positive predicates and not represents NAF (under the stable model semantics as described in Baral (2003)). We call such rules default rules. Thus, the default $\frac{bird(X) : \textbf{M} \lnot penguin(X)}{fly(X)}$ will be represented as the following ASP-coded default rule:
fly(X) :- bird(X), not penguin(X).
We call bird(X), the condition that allows us to jump to the default conclusion that X can fly, the default part of the rule, and not penguin(X) the exception part of the rule.
Default rules closely represent the human thought process; they are frequently used in commonsense reasoning. FOLD-R and FOLD-R++ learn default rules represented as answer set programs. Note that the programs currently generated are stratified normal logic programs; however, we eventually hope to learn non-stratified answer set programs too, as in the work of Shakerin and Gupta (2018) and Shakerin (2020). Hence, we continue to use the term answer set program for a normal logic program in this paper. An advantage of learning default rules is that we can distinguish between exceptions and noise, as explained by Shakerin et al. (2017) and Shakerin (2020). The introduction of (nested) exceptions, that is, abnormal predicates, in a default rule increases the coverage of the data by that default rule. A single rule can then cover more examples, which results in a reduced number of generated rules. The equivalent program without the abnormal predicates would have many more rules if the abnormal predicate calls were fully expanded.
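For instance, the classic flying-birds program (a textbook-style illustration in the spirit of the works cited above, not a program learned by any FOLD system) shows how one default rule with nested exceptions can absorb cases that would otherwise need separate rules:

fly(X) :- bird(X), not ab1(X).
ab1(X) :- penguin(X), not ab2(X).
ab2(X) :- superpenguin(X).

Here the single fly/1 rule covers all birds; penguins are carved out as an exception, and superpenguins as an exception to that exception, without multiplying the top-level rules.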
2.3 Classification problems
Classification problems are either binary or multi-category.
1. Binary classification is the task of classifying the elements of a set into two groups on the basis of a classification rule, for example, determining whether a specific patient (from a set of patients) has a particular disease or not, or whether a particular manufactured article (from a set of manufactured articles) will pass quality control or not. Details can be found in the book by Bishop (2006).
2. Multi-category or multinomial classification is the problem of classifying instances into one of three or more classes. For example, an animal can be predicted to have one of the following habitats: sea water, fresh water, desert, mountain, or plains. Again, details can be found in the book by Bishop (2006).
3 The FOLD-R++ algorithm
The FOLD-R++ algorithm by Wang and Gupta (2022) is a new ILP algorithm for binary classification that is built upon the FOLD algorithm of Shakerin et al. (2017). Our FOLD-RM algorithm builds upon the FOLD-R++ algorithm. FOLD-R++ increases the efficiency and scalability of the FOLD algorithm. The FOLD-R++ algorithm divides features into two categories: categorical features and numerical features. For a categorical feature, all values of the feature are considered categorical, even if some of them are numbers, and the FOLD-R++ algorithm generates only equality and inequality literals for them. For numerical features, the FOLD-R++ algorithm tries to read all values as numbers, falling back to categorical values when the conversion to a number fails; it additionally generates numerical comparison ($\leq$ and $>$) literals for numerical values. A mixed-type feature that contains both categorical and numerical values is treated as a numerical feature.
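As a concrete reading of this typing rule, the following Python sketch (our own helper, not code from the FOLD-R++ implementation) reads a raw entry of a feature declared numerical, keeping categorical entries as they are:

def as_numeric(v):
    try:
        return float(v)   # read the value as a number if possible
    except (TypeError, ValueError):
        return v          # conversion failed: keep the value as categorical

# a mixed-type feature is treated as numerical, so the column
# ['2', 'low', '3.5'] is read as [2.0, 'low', 3.5]
print([as_numeric(v) for v in ['2', 'low', '3.5']])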
The FOLD-R++ algorithm employs an information gain (IG) heuristic to guide literal selection during the learning process. It uses a simplified calculation process for IG, based on the numbers of true positive, false positive, true negative, and false negative examples that a literal implies. The IG for a given literal is calculated as shown in Algorithm 1.
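While Algorithm 1 itself is not reproduced here, the following Python sketch shows one way such a simplified IG can be computed from the four counts. The function name, the choice of natural logarithm, and the guard that rejects literals whose misclassifications outnumber their correct classifications are our assumptions, calibrated so that the sketch reproduces the worked example below, where $IG_{(i,>,4)}(1,7,7,0)=-0.647$.

import math

def info_gain(tp, fn, tn, fp):
    # Simplified IG from the four counts a literal implies (a sketch, not
    # the paper's Algorithm 1 verbatim). (tp, fp) fall in the branch where
    # the literal holds; (tn, fn) in the branch where it fails.
    def part(a, b):
        # contribution a * ln(a / (a + b)); 0 * ln(0) is taken to be 0
        return a * math.log(a / (a + b)) if a > 0 else 0.0
    total = tp + fn + tn + fp
    if total == 0 or fp + fn > tp + tn:
        return float('-inf')   # reject literals that do more harm than good
    return (part(tp, fp) + part(fp, tp) + part(tn, fn) + part(fn, tn)) / total

print(round(info_gain(1, 7, 7, 0), 3))   # -0.647, matching the worked example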
The goal of the ILP algorithm is to find an answer set program whose answer set contains all the positive examples and none of the negative examples. Our algorithm incrementally learns this program using the IG heuristic. The IG heuristic allows us to refine the program incrementally, that is, after each refinement step the answer set of the program includes more and more of the positive examples and fewer and fewer of the negative ones.
The comparison between two numerical values or two categorical values in FOLD-R++ is straightforward, as common sense would dictate, that is, two numerical (resp. categorical) values are equal if they are identical, else they are unequal. However, a different convention is used to compare a numerical value with a categorical value in FOLD-R++: equality between a numerical value and a categorical value is always false, and inequality between a numerical value and a categorical value is always true. Additionally, a numerical comparison ($\leq$ and $>$) between a numerical value and a categorical value is always false. An example is shown in Table 1 (left), while an evaluation example for a given literal, $literal{(i,>,4)}$, based on this comparison assumption is shown in Table 1 (right). Given E$^+=\{1,2,2,4,5,x,x,y\}$, E$^-=\{1,3,4,y,y,y,z\}$, and $literal{(i,>,4)}$, the true positive examples $\mathrm{E}_{tp}$, false negative examples $\mathrm{E}_{fn}$, true negative examples $\mathrm{E}_{tn}$, and false positive examples $\mathrm{E}_{fp}$ implied by the literal are $\{5\}$, $\{1,2,2,4,x,x,y\}$, $\{1,3,4,y,y,y,z\}$, and Ø, respectively. Then, the IG of literal $(i,>,4)$ is calculated as $IG_{(i,>,4)}(1,7,7,0)=-0.647$ by Algorithm 1.
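The bookkeeping in this example can be mechanized as follows, reusing the info_gain sketch above; holds is a hypothetical helper that encodes the comparison assumption for the $>$ operator (any numerical comparison involving a categorical value is false):

def holds(v, threshold):
    # literal (i, >, threshold): numeric-vs-categorical comparison is false
    return isinstance(v, (int, float)) and v > threshold

E_pos = [1, 2, 2, 4, 5, 'x', 'x', 'y']
E_neg = [1, 3, 4, 'y', 'y', 'y', 'z']

tp = sum(holds(v, 4) for v in E_pos)        # {5}                 -> 1
fn = len(E_pos) - tp                        # the other positives -> 7
fp = sum(holds(v, 4) for v in E_neg)        # none                -> 0
tn = len(E_neg) - fp                        # all negatives       -> 7

print(tp, fn, tn, fp)                       # 1 7 7 0
print(round(info_gain(tp, fn, tn, fp), 3))  # -0.647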
The FOLD-R++ algorithm starts with the clause p(…) :- true., where p(…) is the target predicate to learn. It specializes this clause by adding literals to its body during the inductive learning process, selecting at each step the literal that maximizes IG. The literal selection process is summarized in Algorithm 2. In line 2, pos and neg are dictionaries that hold, respectively, the numbers of positive and negative examples for each unique value. In line 3, xs and cs are lists that hold, respectively, the unique numerical and categorical values. In line 4, xp and xn are the total numbers of, respectively, positive and negative examples with numerical values; cp and cn are the same for categorical values. In line 11, the IG of literal $(i,\le, x)$ is calculated by taking pos[x] as the number of true positive examples, $xp-pos[x]+cp$ as the number of false negative examples, $xn-neg[x]+cn$ as the number of true negative examples, and neg[x] as the number of false positive examples. After computing the prefix sum in line 6, pos[x] holds the total number of positive examples that have a value less than or equal to x. Therefore, $xp-pos[x]$ is the total number of positive examples that have a value greater than x. cp, the total number of positive examples that have a categorical value, is added to the number of false negative examples because of the assumption that a numerical comparison between a numerical value and a categorical value is always false. The negative examples that have a value greater than x or a categorical value would be evaluated as false by literal $(i,\le, x)$, so $xn-neg[x]$ is added to the true negative parameter, and cn, the total number of negative examples that have a categorical value, is also added to it. The expression neg[x] denotes the number of negative examples that have a value less than or equal to x; neg[x] is added as the false positive parameter because these examples are evaluated as true by literal $(i,\le, x)$. The IG calculation for other literals also follows the comparison assumption mentioned above. Finally, the best_info_gain function returns the best IG score and the corresponding literal, excluding literals that have already been used in the current rule-learning process. For each feature, we compute the best literal; the find_best_literal function then returns the best among this set of per-feature best literals.
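As a sketch of this prefix-sum bookkeeping for a single feature (the names mirror the description above, but the code is our own reconstruction, not the paper's implementation), all $(i,\le,x)$ literals can be scored in one pass over the sorted unique numerical values, reusing the info_gain sketch:

from collections import Counter

def best_leq_literal(values_pos, values_neg):
    # values_pos / values_neg: the ith-feature value of each positive /
    # negative example (numbers or categorical strings)
    is_num = lambda v: isinstance(v, (int, float))
    pos, neg = Counter(values_pos), Counter(values_neg)
    xs = sorted(v for v in set(values_pos) | set(values_neg) if is_num(v))
    xp = sum(1 for v in values_pos if is_num(v))
    cp = len(values_pos) - xp
    xn = sum(1 for v in values_neg if is_num(v))
    cn = len(values_neg) - xn
    best, best_x, ppre, npre = float('-inf'), None, 0, 0
    for x in xs:
        ppre += pos[x]   # prefix sums: positives/negatives with value <= x
        npre += neg[x]
        # tp, fn, tn, fp exactly as mapped in the text above
        score = info_gain(ppre, xp - ppre + cp, xn - npre + cn, npre)
        if score > best:
            best, best_x = score, x
    return best_x, best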
Example 1 Given the positive and negative examples in Table 2, E$^+$, E$^-$, with mixed-type values on the ith feature, the target is to find the literal with the best IG on the given feature. There are 8 positive examples; their values on the ith feature are [1,2,2,4,5,x,x,y]. The values on the ith feature of the 7 negative examples are [1,3,4,y,y,y,z].
With the given examples and the specified feature, the numbers of positive and negative examples for each unique value are counted first; they are shown as pos and neg on the right side of Table 2. Then, the prefix-sum arrays psum$^+$ and psum$^-$ are calculated for computing the heuristic. Table 3 shows the IG for each literal; the literal $(i,=,x)$ is selected because it has the highest score.
4 The FOLD-RM algorithm
The FOLD-R++ algorithm performs binary classification. We generalize the FOLD-R++ algorithm to perform multi-category classification; the generalized algorithm is called FOLD-RM. The FOLD-R++ algorithm is summarized in Algorithm 3. It generates an ASP rule set in which all the rules have the same head; an example covered by any rule in the set implies that the rule head is true. The FOLD-R++ algorithm generates a model by learning one rule at a time. Ruling out the already covered examples in line 9 after learning a rule helps select better literals for the remaining examples. In the rule-learning process, the best literal is selected according to the useful information it provides for the current training examples (line 17), until literal selection fails. Any examples that cannot be covered by the selected literals are ruled out in lines 20–21. If the ratio of false positive examples to true positive examples drops below the threshold ratio in line 22, the algorithm next learns exceptions by swapping the residual positive and negative examples and calling itself recursively (line 26). The ratio in line 22 thus represents an upper bound on the ratio of the number of false positive examples to the number of true positive examples implied by the default part of a rule. It helps speed up the training process and reduces the number of rules learned.
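A heavily simplified Python sketch of this control flow is given below. It is our own reconstruction: the literal search and literal evaluation are passed in as black boxes, the data representation is a placeholder, and the real algorithm's call to fold_rpp (which learns a whole set of exception rules) is compressed into a single recursive call.

def learn_rule(E_pos, E_neg, find_best_literal, satisfies, ratio=0.5):
    # sketch of FOLD-R++ rule learning (cf. Algorithm 3)
    body, abnormals = [], []
    while E_neg:
        lit = find_best_literal(E_pos, E_neg, body)      # line 17
        if lit is None:                                  # selection failed
            break
        body.append(lit)
        # rule out examples the refined rule no longer covers (lines 20-21)
        E_pos = [e for e in E_pos if satisfies(e, lit)]
        E_neg = [e for e in E_neg if satisfies(e, lit)]
    if E_neg and len(E_neg) < len(E_pos) * ratio:        # line 22
        # few false positives remain: swap the residual examples and learn
        # exceptions (line 26); the real algorithm calls fold_rpp here
        abnormals, _ = learn_rule(E_neg, E_pos, find_best_literal,
                                  satisfies, ratio)
    return body, abnormals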
Generally, avoiding covering negative examples by adding literals to the default part of a rule reduces the number of positive examples the rule can imply. Explicitly activating the exception learning procedure (line 26 in Algorithm 3) can increase the number of positive examples a rule covers while reducing the total number of rules generated. As a result, interpretability is increased because fewer rules are generated.
The FOLD-RM algorithm performs multi-category classification. It learns rules for one category at a time. If an example is not implied by any rule in the learned rule set, the model fails to classify that example. The FOLD-RM algorithm, summarized in Algorithm 4, first finds a target literal that represents the category with the most examples in the current training set (line 4). It next splits the training set into positive and negative examples based on the target literal (line 5). Then, it learns a rule to cover the target category (line 6) by calling the learn_rule function of the FOLD-R++ algorithm. The already covered examples are ruled out of the training set in line 11, and the rule head is changed to the target literal in line 12. However, there is a difference between the outputs of FOLD-RM and FOLD-R++: unlike FOLD-R++, the output of FOLD-RM is a textually ordered answer set program, which means a rule is checked only if none of the rules before it applied. The FOLD-RM system is publicly available at https://github.com/hwd404/FOLD-RM.
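A minimal sketch of this outer loop, under the same naming assumptions as before (learn_rule and covers are black boxes, and the line references to Algorithm 4 are approximate):

from collections import Counter

def fold_rm(examples, labels, learn_rule, covers, ratio=0.5):
    # sketch of the FOLD-RM outer loop (cf. Algorithm 4)
    rules = []
    remaining = list(zip(examples, labels))
    while remaining:
        target = Counter(y for _, y in remaining).most_common(1)[0][0]  # line 4
        pos = [x for x, y in remaining if y == target]                  # line 5
        neg = [x for x, y in remaining if y != target]
        rule = learn_rule(pos, neg, ratio)                              # line 6
        covered = sum(1 for x, y in remaining
                      if y == target and covers(rule, x))
        if covered == 0:          # the new rule covers nothing: stop
            break
        remaining = [(x, y) for x, y in remaining
                     if not (y == target and covers(rule, x))]          # line 11
        rules.append((target, rule))  # the head becomes the target (line 12)
    return rules  # textually ordered program: the first rule that applies wins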
Note that for learning each rule, FOLD-RM (Algorithm 4) chooses the target predicate by finding the label value with the most examples among the remaining training examples and sets it as the target predicate for this rule. In other words, the target predicate is the “most popular” label value. The names of the predicates are the names of features in the data. The head predicate and the predicates in the rule body each have exactly two arguments. The first argument is a reference to the data record itself. For the target predicate, the second argument is the predicted label for that record, while for predicates in the body, the second argument is used to extract the appropriate feature value for that record. The abnormal predicates take only one argument, namely, the data record itself. For example, consider:
class(X,’2’) :- condition(X, ’s’), not ab5(X).
ab5(X) :- not steel(X,’r’), not enamelability(X,’2’).
The first rule says that the predicted class of data record X is ‘2’ if the condition feature of X has the value ‘s’ and the abnormal case ab5 does not apply. ab5(X) is an abnormal-case predicate and has only one argument. It says that record X should not be predicted to have class value ‘2’ if the value of the steel feature is not ‘r’ and the value of the enamelability feature is not ‘2’.
4.1 Algorithmic complexity
Next, we analyze the complexity of the FOLD-RM algorithm. If M is the number of examples and N is the number of features, it is easy to see that the time complexity of finding the best literal (Algorithm 2) is O(NM). We assume that counting sort (complexity O(M)) with a pre-sorted list is used at line 5 in Algorithm 2. The worst case for the FOLD-RM algorithm arises when each generated rule covers only one example and each literal excludes only one non-target example. Therefore, in the worst case, $O(M^2)$ literals are chosen in total, and the worst-case time complexity of the FOLD-RM algorithm (Algorithm 4) is $O(NM^3)$. However, this is a theoretical upper bound; the actual learning process is quite efficient because the heuristic we employ helps select very effective literals, reducing the number of iterations of the algorithm.
One can also prove that the FOLD-RM algorithm always terminates. The fold_rm function calls the learn_rule function to induce a rule that covers at least one ‘most popular’ remaining example, until all the examples have been covered or the learned rule fails to cover any ‘most popular’ example. The loop in the fold_rm function thus iterates at most $|E|$ times while excluding the already covered examples. The learn_rule function refines the rule for a given target by adding the best literal to the rule body. As literals are added to a rule, the numbers of true positive and false positive examples the rule implies can only monotonically decrease, and each learned valid literal excludes at least one false positive example that the rule implies. So, the loop in the learn_rule function iterates at most $|E^-|$ times. When the condition $|E^-| < |E^+| \times ratio$ is met, the fold_rpp function is called to learn exception rules for the current default rule. Similar to the fold_rm function, the fold_rpp function iterates at most $|E^+|$ times. Also, the for-loops inside the find_best_literal function perform only finitely many iterations. Therefore, we can conclude that the FOLD-RM algorithm always terminates.
4.2 An illustrative example
Next, we illustrate FOLD-RM with a simple example.
Example 2 The target is to learn rules for habitat using the FOLD-RM algorithm. B and E are the background knowledge and training examples, respectively. There are three classes: two explicit ones (land and water) and one implicit one (neither land nor water).
For the first rule, the target predicate {habitat(X,land) :- true} is specified at line 4 in Algorithm 4 because ‘land’ is the majority label. The find_best_literal function selects the literal mammal(X) and adds it to the clause r = {habitat(X,land) :- mammal(X)} at line 17 of Algorithm 3, because it provides the most useful information among the literals {cat, whale, bear, dog, fish, clownfish}. Then the covered examples are ruled out of the training set at lines 20–21 of Algorithm 3, leaving $E^+=$ Ø, $E^-=$ {john, nemo}. The default learning is finished at this point because no candidate literal can provide any further useful information. Therefore, the fold_rpp function is called recursively with swapped positive and negative examples, $E^+=$ {john, nemo}, $E^-=$ Ø, to learn exceptions. In this case, an abnormal predicate {ab1(X) :- whale(X)} is learned and added to the previously generated clause as r = {habitat(X,land) :- mammal(X), not ab1(X)}, and the exception rule {ab1(X) :- whale(X)} is added to the answer set program. FOLD-RM next learns rules for the target predicate {habitat(X,water) :- true}, and two rules are generated: {habitat(X,water) :- fish(X)} and {habitat(X,water) :- whale(X)}. The generated final answer set program is:
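habitat(X,land) :- mammal(X), not ab1(X).
habitat(X,water) :- fish(X).
habitat(X,water) :- whale(X).
ab1(X) :- whale(X).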
The program above is a logic program whose rules are not mutually exclusive. For correctness, a rule should be checked only if all the earlier rules result in failure. FOLD-RM generates further rules to make the learned rules mutually exclusive; the program above is transformed as shown below.
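One plausible encoding of this textual order (our assumption; the exact rules FOLD-RM emits may differ) guards each later rule with the negation of the earlier conclusions:

habitat(X,land) :- mammal(X), not ab1(X).
habitat(X,water) :- fish(X), not habitat(X,land).
habitat(X,water) :- whale(X), not habitat(X,land).
ab1(X) :- whale(X).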
5 Experimental results
In this section, we present our experiments on standard UCI benchmarks. The XGBoost classifier is a well-known classification model and is used as a baseline in our experiments. The settings used for the XGBoost classifier are kept simple without limiting its performance. MLP is another widely used classification model that can deal with generic classification tasks. However, neither the XGBoost model nor MLP can take mixed-type data (numerical and categorical values in a row or a column) as training data without pre-processing. For mixed-type data, one-hot encoding, as explained in the book by Aggarwal (2018), has been used for data preparation. For binary classification, we use accuracy, precision, recall, and F$_1$ score as evaluation metrics. For multi-category classification tasks, following convention, we use accuracy and the weighted averages of macro precision, macro recall, and macro F$_1$ score to compare models, as explained by Grandini et al. (2020). The average numbers of generated rules are also reported for the FOLD-R++ and FOLD-RM algorithms in Tables 4, 6, and 7; some of them are not integers because they are averaged over a number of repeated experiments.
Neither the FOLD-R++ nor the FOLD-RM algorithm needs any encoding for training. After the numerical features are specified, they deal with mixed-type data directly, that is, no one-hot encoding is needed. Even missing values are handled and do not need to be provided. We implemented both algorithms in Python. The hyper-parameter ratio is simply set to 0.5 for all the experiments, and all the learning processes have been run on a desktop with an Intel i5-10400 CPU @ 2.9 GHz and 32 GB RAM. To test performance reliably, we performed 10-fold cross-validation on each dataset; the average classification metrics and execution times are shown. The best performer is highlighted in boldface.
The XGBoost classifier utilizes a decision-tree ensemble method to build its model and provides good performance. A performance comparison of FOLD-R++ and XGBoost is shown in Table 4. The FOLD-R++ algorithm is comparable to the XGBoost classifier in classification performance, but it is more efficient in terms of execution time, especially on datasets with many unique feature values.
For the multi-category classification experiments, we collected 15 datasets for comparison with XGBoost and MLP. The drug consumption dataset has many output attributes; we perform training on the heroin, crack, and semer attributes. The size and label distribution of the datasets used are shown in Table 5: the number of rows indicates the number of data records, while the number of columns indicates the number of features. We first compare the performance of FOLD-RM and XGBoost in Table 6. XGBoost performs much better on the avila and yeast datasets, and FOLD-RM performs much better on the ecoli, dry-bean, eeg, and weight-lifting datasets. After analyzing these datasets, FOLD-RM seems to perform better on more complicated datasets with mixed-type values, while XGBoost seems to perform better on datasets that have limited information. However, for those datasets on which FOLD-RM has performance similar to XGBoost, FOLD-RM is more efficient in terms of execution speed. In addition, FOLD-RM is explainable/interpretable, and XGBoost is not.
The comparison with MLP is presented in Table 7. For most datasets, FOLD-RM achieves equivalent scores. Similar to the comparison with XGBoost, FOLD-RM performs much better on the ecoli, dry-bean, eeg, and weight-lifting datasets, while MLP performs much better on the avila and yeast datasets. MLP takes much more time for training than XGBoost because of its algorithmic complexity. Like the XGBoost classifier, for complex datasets with mixed values, MLP also suffers from pre-processing complications such as having to use one-hot encoding.
The RIPPER algorithm by Cohen (1995) is a popular rule induction algorithm that generates conjunctive normal form (CNF) formulas. Eight datasets have been used for the comparison between RIPPER and FOLD-RM. We could not find a RIPPER implementation that supports multi-class classification. Therefore, we collected the accuracy figures reported by Asadi and Shahrabi (2016) and performed the same experiment on the same datasets with the FOLD-RM algorithm. Two-thirds of each dataset was used for training by Asadi and Shahrabi (2016) and the remaining one-third as the test set; we follow the same convention. For each dataset, this process was repeated 50 times. The average accuracy is shown in Table 8. Both algorithms have similar accuracy on most datasets, though FOLD-RM outperforms RIPPER on the nursery dataset. RIPPER is explainable, as it outputs CNF formulas. However, the CNF formulas generated tend to have a large number of literals. In contrast, FOLD-RM rules are succinct due to the use of NAF, and they have an operational semantics (one that aligns with how humans reason) by virtue of being a normal logic program.
6 Prediction and justification
The FOLD-RM algorithm generates rules that can be interpreted by a human user to understand the patterns and correlations that are implicit in the tabular data. These rules can also be used to make predictions on new data, so FOLD-RM serves as a machine learning algorithm in its own right. However, making good predictions is not enough for critical tasks such as disease diagnosis and loan approval. FOLD-RM therefore comes with a built-in prediction and justification facility, which we illustrate via an example.
Example 3 The “annealing” UCI dataset is a multi-category classification task that contains 798 training examples and 100 test examples, labeled with classes based on features such as steel, carbon, hardness, condition, strength, etc. FOLD-RM generates the following answer set program with 20 rules for 5 classes, which is quite concise and precise:
The generated rule set above achieves 0.99 accuracy, 0.99 weighted macro precision, 0.99 weighted macro recall, and 0.99 weighted macro F1 score. The justification tree generated by the FOLD-RM system for the 8th test example is shown below:
The justification can also be presented in another format that lists the rules involved in the proof/justification. For each call in each rule that was invoked, FOLD-RM shows whether it is true ([T]) or false ([F]). The head of each applicable rule is similarly annotated. We illustrate this for the 8th test example:
7 Conclusions and related work
In this paper we presented FOLD-RM, an efficient and highly scalable algorithm for multi-category classification tasks. FOLD-RM generates explainable answer set programs and human-friendly justifications for predictions. Our algorithm does not need any encoding (such as one-hot encoding) for data preparation. Compared to well-known classification models such as XGBoost and MLP, our new algorithm has similar performance in terms of accuracy, weighted macro precision, weighted macro recall, and weighted macro F$_1$ score. However, our new approach is much more efficient and interpretable than these other approaches. It is remarkable that an ILP system is comparable in accuracy to state-of-the-art traditional machine learning systems.
ALEPH by Srinivasan (2001) is a well-known ILP algorithm that employs a bottom-up approach to induce rules for non-numerical data; moreover, no automatic method is available for its specialization process. A tree-ensemble based rule extraction algorithm is proposed by Takemura and Inoue (2021); its performance relies on the trained tree-ensemble model. It may also suffer from scalability issues because its running time is exponential in the number of valid rules.
In practice, statistical machine learning models show good performance for classification. Extracting rules from statistical models is also a long-standing research topic. Rule extraction algorithms are of two kinds: (1) pedagogical (learning rules from black-box models without looking into their internal structures), such as TREPAN by Craven and Shavlik (1995), which learns decision trees from neural networks, and (2) decompositional (learning rules by analyzing the models inside out), such as SVM+Prototypes by Núñez et al. (2006), which employs a clustering algorithm to extract rules from SVM classifiers by utilizing support vectors. RuleFit by Friedman and Popescu is another rule extraction algorithm; it learns sparse linear models over decision rules extracted from a shallow tree-ensemble model for both classification and regression tasks. However, its interpretability decreases when too many decision rules are generated. Simpler approaches that combine a statistical method with ILP have also been extensively explored. The kFOIL system by Landwehr et al. (2006) incrementally learns a kernel for SVM via FOIL-style rule induction. The nFOIL system by Landwehr et al. (2005) is an integration of the naive Bayes model and FOIL. TILDE by Blockeel and De Raedt (1998) is another top-down rule induction algorithm, based on the C4.5 decision tree, that can achieve performance similar to a decision tree; however, it suffers from scalability issues when there are too many unique numerical values in the dataset. For most datasets we experimented with, the number of leaf nodes in the trained C4.5 decision tree is much larger than the number of rules that FOLD-R++/FOLD-RM generate. The FOLD-RM algorithm outperforms the above methods in efficiency and scalability due to (i) its learning of defaults, exceptions to defaults, exceptions to exceptions, and so on; (ii) its top-down nature; and (iii) its improved (prefix-sum based) method for heuristic calculation.
Acknowledgment
The authors acknowledge support from NSF grants IIS 1718945, IIS 1910131, and IIP 1916206, the US DoD, Atos Corp., and Amazon Corp. We thank our colleagues Joaquin Arias, Parth Padalkar, Kinjal Basu, Sarat Chandra Varanasi, Elmer Salazar, Fang Li, Serdar Erbatur, and Doug DeGroot for discussions and help.