1 Introduction
Interpretability in machine learning is the ability to explain or to present in understandable terms to a human (Doshi-Velez and Kim 2017; Miller 2019; Molnar 2020). Interpretability is particularly important when, for example, the user's goal is to gain knowledge about the data or the underlying process from some form of explanation obtained through machine learning models, or when high-stakes decisions are made based on the models' outputs and the user has to be able to trust them. Explainability is another term that is often used interchangeably with interpretability, but some authors use it to emphasize the ability to produce post hoc explanations for black-box models (Rudin 2019). For convenience, we shall use the term explanation when referring to post hoc explanations in this paper.
In this work, we address the problem of explaining trained tree-ensemble models by extracting meaningful rules from them. This problem is of practical relevance in business and scientific domains, where understanding the behavior of high-performing machine learning models and extracting knowledge in human-readable form can aid users in the decision-making process. We use answer set programming (ASP) (Gelfond and Lifschitz 1988; Lifschitz 2008) to generate rule sets from tree-ensembles. ASP is a declarative programming paradigm for solving difficult search problems. An advantage of using ASP is its expressiveness and extensibility, especially when representing constraints. To our knowledge, ASP has never been used in the context of rule set generation from tree-ensembles, although it has been used in pattern mining (Järvisalo 2011; Guyet et al. 2014; Gebser et al. 2016; Paramonov et al. 2019).
Generating explanations for machine learning models is a challenging task, since it is often necessary to account for multiple competing objectives. For instance, if accuracy is the most important metric, then it is in direct conflict with explainability, because accuracy favors specialization while explainability favors generalization. Any explanation method should also strive to imitate the behavior of the learned models, so as to minimize misrepresentation of the models, which in turn may result in misinterpretation by the user. While there are many explanation methods available (some are covered in Section 6), we propose to use ASP as a medium to represent user requirements declaratively and to search feasible solutions quickly for faster prototyping. By implementing rule selection as a post-processing step to model training, we aim to offer an off-the-shelf, objective explanation tool that can be applied to existing processes with minimal modification, as an alternative to subjective manual rule selection.
To demonstrate the adaptability of our approach, we present implementations for both global and local explanations of learned tree-ensemble models using our method. In general, a global explanation describes how the overall system works (also referred to as model explanation), while a local explanation describes why a certain decision was made (outcome explanation) (Guidotti et al. 2018). Global explanations are more useful in situations where an explanation of the opaque model's overall behavior is needed, for example, when designing systems for faster detection of certain events such as credit issues or illnesses. In contrast, local explanations are suitable, for example, when explaining the outcome of such systems to their users, since users are more likely to be interested in the particular decisions that led to the outcome.
We consider a two-step procedure for rule set generation from trained tree-ensemble models (Figure 1): (1) extracting rules from tree-ensembles, and (2) computing sets of rules according to selection criteria and preferences encoded declaratively in ASP. For the first step, we employ the efficiency and predictive capability of modern tree-ensemble algorithms in finding useful feature partitions for prediction from data. For the second step, we exploit the expressiveness of ASP in encoding constraints and preferences to select useful rules from tree-ensembles, and rule selection is automated through a declarative encoding. In the end, we obtain generated rule sets that serve as explanations for the tree-ensemble models, providing insights into their behavior. These aim to mimic the models' behavior rather than offering exhaustive and formally correct explanations, thus aligning with heuristic-based explanation methods in the sense of, for example, Izza and Marques-Silva (2021), Ignatiev et al. (2022), and Audemard et al. (2022b).
We then evaluate our approach using public datasets. For evaluating global explanations, we use the number and relevance of rules in the rule sets. The number of rules is often associated with explainability, with many rules being less desirable. Performance metrics such as classification accuracy, precision, and recall can be used as a measure of relevance of the rules to the prediction task. For evaluating local explanations, we use precision and coverage metrics to compare against existing systems.
This paper makes the following contributions:
• We present a novel application of ASP for explaining trained machine learning models. We propose a method to generate explainable rule sets from tree-ensemble models with ASP. More broadly, this work contributes to the growing body of knowledge on integrating symbolic reasoning with machine learning.
• We present how the rule set generation problem can be reformulated as an optimization problem, where we leverage existing knowledge on declarative pattern mining with ASP.
• We show how both global and local explanations can be generated by our approach, while comparative methods tend to focus on either one exclusively.
• To demonstrate the practical applicability of our approach, we provide both qualitative and quantitative results from evaluations with public datasets, where machine learning methods are used in a realistic setting.
The rest of this paper is organized as follows. In Section 2, we review tree-ensembles, ASP, and pattern mining. Section 3 presents our method to generate rule sets from tree-ensembles using pattern mining and optimization encoded in ASP. Section 4 describes global and local explanations in the context of our approach. Section 5 presents experimental results on public datasets. In Section 6, we review and discuss related works. Finally, in Section 7 we present the conclusions.
2 Background
In the remainder of this paper, we shall use learning algorithms to refer to methods used to train models, as in machine learning literature. We use models and explanations to refer to machine learning models and post hoc explanations about the said models, respectively.
2.1 Tree-ensemble learning algorithms
Tree-Ensemble (TE) learning algorithms are machine learning methods widely used in practice, typically when learning from tabular datasets. A trained TE model consists of multiple base decision trees, each trained on an independent subset of the input data. For example, Random Forests (Breiman 2001) and Gradient Boosted Decision Trees (GBDT) (Friedman 2001) are tree-ensemble learning algorithms. The recent surge of efficient and effective GBDT algorithms, for example, LightGBM (Ke et al. 2017), has led to wide adoption of TE learning algorithms in practice. Although individual decision trees are considered to be interpretable (Huysmans et al. 2011), ensembles of decision trees are seen as less interpretable.
The purpose of using TE learning algorithms is to train models that predict the unknown values of an attribute $y$ in the dataset, referred to as labels, using the known values of other attributes $\mathbf{x}=(x_1,x_2,\ldots, x_m)$, referred to as features. For brevity, we restrict our discussion to classification problems. During the training or learning phase, each input instance to the TE learning algorithm is a pair of features and a label, that is $(\mathbf{x}_i, y_i)$, where $i$ denotes the instance index; during the prediction phase, each input instance only includes features, $\mathbf{x}_i$, and the model is tasked to produce a prediction $\hat{y}_i$. A collection of input instances, complete with features and labels, is referred to as a dataset. Given a dataset $\mathscr{D}=\{(\mathbf{x}_i, y_i)\}$ with $n\in \mathbb{N}$ examples and $m\in \mathbb{N}$ features, a decision tree classifier $t$ predicts the class label $\hat{y}_i$ based on the feature vector $\mathbf{x}_i$ of the $i$-th sample: $\hat{y}_i = t(\mathbf{x}_i)$. A tree-ensemble $\mathscr{T}$ uses $K\in \mathbb{N}$ trees together with an aggregation function $f$ that combines the outputs of the $K$ trees: $\hat{y}_i = f(t_1(\mathbf{x}_i), \ldots, t_K(\mathbf{x}_i))$. In Random Forest, for example, $f$ is a majority voting scheme (i.e., argmax of sum), while in GBDT $f$ may be a summation followed by softmax to obtain $\hat{y}_i$ in terms of probabilities.
In this paper, a decision tree is assumed to be a binary tree where the internal nodes hold split conditions (e.g., $x_1 \leq 0.5$) and leaf nodes hold information related to class labels, such as the number of supporting data points per class label that have been assigned to the leaf nodes. Richer collections of decision trees provide higher performance and less uncertainty in prediction compared to a single decision tree. Typically, each TE model has specific algorithms for learning base decision trees, adding more trees, and combining outputs from the base trees to produce the final prediction. In GBDT, the base trees are trained sequentially by fitting the residual errors from the previous step. Interested readers are referred to Friedman (2001) and its more recent implementations, LightGBM (Ke et al. 2017) and XGBoost (Chen and Guestrin 2016).
2.2 Answer set programming
Answer Set Programming (Lifschitz 2008) has its roots in logic programming and non-monotonic reasoning. A normal logic program is a set of rules of the form
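$$\mathrm{a_1} \leftarrow \mathrm{a_2}, \ldots, \mathrm{a_m}, \mathrm{not}\ \mathrm{a_{m+1}}, \ldots, \mathrm{not}\ \mathrm{a_n}$$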
where each $\mathrm{a_i}$ is a first-order atom with $1 \leq \mathrm{i} \leq \mathrm{n}$ and not is default negation. If only $\mathrm{a_1}$ is included ($\mathrm{n} = 1$), the above rule is called a fact, whereas if $\mathrm{a_1}$ is omitted, it represents an integrity constraint. A normal logic program induces a collection of intended interpretations, which are called answer sets, defined by the stable model semantics (Gelfond and Lifschitz 1988). Additionally, modern ASP systems support constructs such as conditional literals and cardinality constraints. The former in clingo (Gebser et al. 2014) is written in the form $\{ \mathrm{a}(\texttt{X}) \ \text{:} \ \mathrm{b}(\texttt{X}) \}$ and expanded into the conjunction of all instances of $\mathrm{a}(\texttt{X})$ where the corresponding $\mathrm{b}(\texttt{X})$ holds. The latter are written in the form $s_1 \ \{ \mathrm{a}(\texttt{X}) \ \text{:} \ \mathrm{b}(\texttt{X}) \} \ s_2$, which is interpreted as $s_1 \leq \texttt{#count} \{ \mathrm{a}(\texttt{X}) \ \text{:} \ \mathrm{b}(\texttt{X}) \} \leq s_2$, where $s_1$ and $s_2$ are treated as lower and upper bounds, respectively; thus the statement holds when the count of instances $\mathrm{a}(\texttt{X})$ for which $\mathrm{b}(\texttt{X})$ holds is between $s_1$ and $s_2$. The minimization (or maximization) of an objective function can be expressed with $\texttt{#minimize}$ (or $\texttt{#maximize}$). Similarly to the #count aggregate, the #sum aggregate sums the first element (weight) of the terms, while also following the set property. clingo supports multiple optimization statements in a single program, and one can implement multi-objective optimization with priorities by defining two or more optimization statements. For more details on the language of clingo, we refer the reader to the clingo manual.
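For illustration, the following small program (with hypothetical item/1 and weight/2 predicates, unrelated to the encodings used later in the paper) combines a cardinality constraint with an optimization statement: it selects between one and two items and minimizes their total weight.

item(a). item(b). item(c).
weight(a,2). weight(b,3). weight(c,5).
% choose between one and two items
1 { select(X) : item(X) } 2.
% minimize the total weight of the chosen items
#minimize { W,X : select(X), weight(X,W) }.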
2.3 Pattern mining
In a general setting, the goal of pattern mining is to find interesting patterns from data, where patterns can be, for example, itemsets, sequences, and graphs. For example, in frequent itemset mining (Agrawal and Srikant 1994), the task is to find all subsets of items that occur together more than a threshold count in databases. In this work, a pattern is a set of predictive rules. A predictive rule has the form $c \Leftarrow s_1 \wedge s_2 \wedge \ldots \wedge s_n$, where $c$ is a class label, and $\{s_i\}$ ($1 \leq i \leq n$) represents conditions.
For pattern mining with constraints, the notion of dominance is important; it intuitively reflects a pairwise preference relation $(\lt ^*)$ between patterns (Negrevergne et al. 2013). Let $C$ be a constraint function that maps a pattern to $\{\top, \bot \}$, and let $p$ be a pattern; then the pattern $p$ is valid iff $C(p)=\top$, otherwise it is invalid. An example of $C$ is a function which checks that the support of a pattern is above a threshold. The pattern $p$ is said to be dominated iff there exists a pattern $q$ such that $p \lt ^* q$ and $q$ is valid under $C$. Dominance relations have been used in ASP encodings for pattern mining (Paramonov et al. 2019).
3 Rule set generation
3.1 Problem statement
The rule set generation problem is represented as a tuple $P=(R,M,C,O)$, where $R$ is the set of all rules extracted from the tree-ensemble, $M$ is the set of meta-data and properties associated with each rule in $R$, $C$ is the set of user-defined constraints including preferences, and $O$ is the set of optimization objectives. The goal is to generate a set of rules from $R$ by selection under the constraints $C$ and optimization objectives $O$, where the constraints and optimization may refer to the meta-data $M$. In the following sections, we describe how we construct each of $R$, $M$, $C$, and $O$, and finally, how we solve this problem with ASP.
3.2 Rule extraction from decision trees
This subsection describes how $R$, the set of all rules, is constructed. The first two steps of the “tree-ensemble processing” in Figure 1 are also described in this subsection. Recall that a tree-ensemble $\mathscr{T}$ is a collection of $K$ decision trees, and we refer to individual trees $t_k$ with subscript $k$. An example of a decision tree-ensemble is shown in Figure 2. A decision tree $t_k$ has $N_{t_k}$ nodes and $L_{t_k}$ leaves. Each internal node represents a split condition, and there are $L_{t_k}$ paths from the root node to the leaves. For simplicity, we assume in the examples below that only features with orderable values (continuous features) are present in the dataset. The tree on the left in Figure 2 has four internal nodes, including the root node with condition $[x_1 \leq 0.2]$, and five leaf nodes; therefore, there are five paths from the root node to the leaf nodes 1 to 5.
From the left-most path of the decision tree on the left in Figure 2, the following prediction rule is created. We assume that node 1 predicts class label 1 in this instance.
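The exact thresholds depend on Figure 2, which is not reproduced here; writing $\theta$ for the split value of the last condition on this path, the rule takes a form such as
$$\mathit{class} = 1 \Leftarrow (x_1 \leq 0.2) \wedge (x_2 \leq 4.5) \wedge (x_4 \leq \theta).$$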
Assuming that node 2 predicts class label 0, we also construct the following rule (note the reversal of the condition on $x_4$ ):
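With the same placeholder $\theta$, this second rule reads
$$\mathit{class} = 0 \Leftarrow (x_1 \leq 0.2) \wedge (x_2 \leq 4.5) \wedge (x_4 > \theta).$$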
To obtain the candidate rule set, we essentially decompose a tree-ensemble into a rule set. The steps are outlined in Algorithm 1. By constructing the candidate rule set $R$ in this way, the bodies (antecedents) of rules included in rule sets are guaranteed to exist in at least one of the trees in the tree-ensemble. Rule sets generated in this manner are therefore faithful to the representation of the original model in this sense. If we were instead to construct rules from the unique set of split conditions, the resulting rules might contain combinations of conditions that do not exist in any of the trees.
We now analyze the computational complexities associated with constructing the set of all rules $R$ . Let us assume that (1) all $K$ trees in the ensemble are perfect binary decision trees and have the same height $h$ , (2) there are $n$ examples and $m$ features in the dataset, and (3) there are no duplicate rules and conditions across trees.
Proposition 1. The maximum size of $R$ , constructed by only considering the rules at the leaf nodes, is $K \times 2^h$ .
This follows immediately from the number of leaf nodes in a perfect binary decision tree with height $h$ , that is $2^h$ . In practice, there are duplicate split conditions across trees in a tree-ensemble, so the unique count of rules is often smaller than the maximum value.
Proposition 2. The time complexity of the proposed method to construct $R$ is $O(K \times (2^h \times n \times h))$ .
Proof. For each of the $O(2^h)$ rules, all conditions in the rule need to be applied to the data. Since there are at most $h$ conditions in a rule, and there are $n$ examples, it takes $O(n \times h)$ time to apply all conditions in a rule. Repeating this for the $O(2^h)$ rules of each of the $K$ trees gives $O(K \times (2^h \times n \times h))$ in total.
3.3 Computing metrics and meta-data for selection
After the candidate rule set $R$ is constructed, we gather information about the performance and properties of each rule and collect them into a set $M$. This is the last step of the tree-ensemble processing depicted in Figure 1 (“Assign Metrics”). The meta-data, or properties, of a rule are information such as the size of the rule, defined as the number of conditions in the rule, and the ratio of instances covered by the rule. Computing classification metrics, for example, accuracy and precision, requires access to a subset of the dataset with ground truth labels, which could be either a training or a validation set. On the other hand, when access to the labeled subset is not available at runtime, these metrics and their corresponding predicates cannot be used in the ASP encoding. In our experiments, we used the training sets to compute these classification metrics during rule set generation, and later used the validation sets to evaluate their performance.
Performance metrics measure how well a rule can predict class labels. Here we calculate the following performance metrics: accuracy, precision, recall, and F1-score, as shown below.
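$$\mathrm{accuracy} = \frac{TP+TN}{TP+TN+FP+FN}, \quad \mathrm{precision} = \frac{TP}{TP+FP}, \quad \mathrm{recall} = \frac{TP}{TP+FN}, \quad \mathrm{F1} = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$$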
For classification tasks, a true positive (TP) and a true negative (TN) refer to a correctly predicted positive class and negative class, respectively, with respect to the labels in the dataset. Conversely, a false positive (FP) and a false negative (FN) refer to an incorrectly predicted positive class and negative class, respectively. These metrics are not specific to the rules and can be computed for trained tree-ensemble models, as well as explanations of trained machine learning models, as we shall show later in Section 5.2.4. We compute multiple metrics for a single rule, to meet a range of user requirements for explanation. One user may only be interested in simply the most accurate rules (maximize accuracy), whereas another user could be interested in more precise rules (maximize precision), or rules with more balanced performance (maximize F1-score).
The candidate rule set $R$ and meta-data set $M$ are represented as facts in ASP, as shown in Table 1. For example, Rule 1 (the first rule in Section 3.2) may be represented as follows:
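A sketch of this fact representation is given below; the predicate names follow the properties and metrics of Table 1, and the metric values are illustrative rather than taken from the running example.

rule(1).
condition(1,1). condition(1,2). condition(1,3).
% illustrative meta-data facts (values are examples only)
size(1,3). support(1,10). accuracy(1,95). precision(1,90).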
Unique conditions are indexed and denoted by the condition predicate. For instance, in the example above (representing Rule 1), “condition(1,1)” represents $(x_1 \leq 0.2)$ , “condition(1,2)” corresponds to $(x_2 \leq 4.5)$ , and so forth.
$^{a}$ Properties and metrics marked with asterisks (*) are multiplied by 100 and rounded to the nearest integer.
3.4 Encoding inclusion criteria and constraints
As with previous works in pattern mining in ASP, we follow the “generate-and-test” approach, where a set of potential solutions are generated by a choice rule and subsequently constraints are used to filter out unacceptable candidates. In the context of rule set generation, we use a choice rule to generate candidate rule sets that may constitute a solution (“Generate Candidate Rule Sets” in Figure 1). In this section, we introduce the following selection criteria and constraints: (1) individual rule selection criteria that are applied on a per-rule basis, (2) pairwise constraints that are applied to pairs of rules, and (3) collective constraints that are applied to a set of rules.
The “generator” choice rule has the following form:
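A minimal sketch of such a generator rule, consistent with the description below, is:

1 { selected(X) : rule(X), valid(X) } 5.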
The choice rule above generates candidate subsets of size between 1 and 5 from $R$ , where we use the selected/1 predicate to indicate that a rule (rule(X)) is included in the subset.
Individual rule selection criteria are integrated into the generator choice rule by the valid/1 predicate, where a rule rule(X) is valid whenever invalid(X) cannot be inferred.
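In other words, a one-line sketch of this definition is:

valid(X) :- rule(X), not invalid(X).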
Example 2. The following criterion excludes rules with low support from the candidate set:
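For instance, assuming a support/2 predicate as in Table 1 and an illustrative threshold of 10:

invalid(X) :- rule(X), support(X,S), S < 10.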
Pairwise constraints can be used to encode dominance relations between rules. For a rule X to be dominated by Y, Y must be strictly better in one criterion than X and at least as good as X or better in other criteria. In the following case, we encode the dominance relation between rules using the accuracy metric and support, where we prefer rules that are accurate and cover more data.
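A sketch of such a dominance encoding (the exact rules used in the encoding may differ) is:

% Y dominates X: Y is strictly better on one metric and no worse on the other
dominated(X) :- selected(X), rule(Y), valid(Y),
                accuracy(X,AX), accuracy(Y,AY), support(X,SX), support(Y,SY),
                AY > AX, SY >= SX.
dominated(X) :- selected(X), rule(Y), valid(Y),
                accuracy(X,AX), accuracy(Y,AY), support(X,SX), support(Y,SY),
                AY >= AX, SY > SX.
% dominated rules may not be selected
:- selected(X), dominated(X).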
Collective constraints are applied to collections of rules, as opposed to individual or pairs of rules. The following restricts the maximum number of conditions in rule sets, using the #sum aggregate:
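For example, with size/2 giving the number of conditions of a rule and an illustrative bound of 20:

:- #sum{ S,X : selected(X), size(X,S) } > 20.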
We envision two main use cases for the criteria and constraints introduced in this section: (1) to generate rule sets with certain properties, and (2) to reduce the computation time. For (1), the user can use the individual selection criteria to ensure that the rules included in the candidate rule sets have certain properties, or the collective constraints to put restrictions on the aggregate properties of the rule sets. Use case (2) has more practical relevance because in our case, as in pattern mining, the complexity of a naive “generate-and-test” approach is exponential with respect to the number of candidates.
To reduce the search space, one can place an upper bound on the size of generated candidate sets and use the invalid/1 predicate to prevent unacceptable rules from being included in the candidates, as shown above. Because setting unreasonably strict conditions leads to an empty rule set being generated, care should be taken when using the selection criteria and constraints for this purpose. In particular, if any of the metric predicates listed in Table 1 are used in defining invalid/1, for example, invalid(X) :- rule(X), metric(X, N), N < B., then, to avoid all rule(X) being invalid(X), one should respect the conditions listed in Table 2. In the following example, we will show how the invalid/1 predicate can be used to reduce the search space.
Example 3. Let the logic program be:
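The program below is reconstructed from the discussion that follows; metric/2 stands for any of the metric predicates in Table 1 and B for a fixed integer bound. Line numbers are given as comments because the analysis below refers to them.

1 { selected(X) : rule(X), valid(X) } 1.       % line 1
valid(X) :- rule(X), not invalid(X).           % line 2
invalid(X) :- rule(X), metric(X,N), N < B.     % line 3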
Then, there is at least one valid rule if $\texttt{B} \leq max(\texttt{N}_1,\ldots, \texttt{N}_{|R|})$ . Let $\texttt{B} = 1+max(\texttt{N}_1,\ldots, \texttt{N}_{|R|})$ , then by line 3 ( $\texttt{N} \lt \texttt{B}$ ), all rules will be invalid, and valid(X) cannot be inferred. Then, the choice rule (line 1) is not satisfied. Alternatively, let $\texttt{B} = max(\texttt{N}_1,\ldots, \texttt{N}_{|R|})$ , then there is at least one rule such that $\texttt{N} = \texttt{B}$ . Since $\texttt{invalid(X)}$ cannot be inferred for such a rule, it is $\texttt{valid}$ and the choice rule is satisfied.
The upper bound parameter (5 in Example 1 and 1 in Example 3) controls the potential maximum number of rules that can be included in a rule set. The actual number of rules that emerge in the final rule sets is highly dependent on the selection criteria, user preferences, and the characteristics of the tree-ensemble model. Practically, we recommend initially setting this parameter to a lower value (e.g., 3) while focusing on refining other aspects of the encoding, since this allows for a more manageable starting point. Given the “generate-and-test” approach, high values may lead to excessively slow run times. If it becomes evident that a larger rule set could be beneficial, the parameter can be incrementally increased. This ensures more efficient use of computational resources while also catering to the evolving needs of the encoding process.
3.5 Optimizing rule sets
Finally, we pose the rule set generation problem as a multi-objective optimization problem, given the aforementioned facts and constraints encoded in ASP. The desiderata for generated rule sets may contain multiple competing objectives. For instance, we consider a case where the user wishes to collect accurate rules that cover many instances, while minimizing the number of conditions in the set. This is encoded as a group of optimization statements:
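A sketch of such a group of statements, one per objective and using the metric predicates of Table 1, is:

#maximize { A,X : selected(X), accuracy(X,A) }.
#maximize { S,X : selected(X), support(X,S) }.
#minimize { L,X : selected(X), size(X,L) }.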
Instead of maximizing/minimizing the sums of metrics, we may wish to optimize more nuanced metrics, such as average accuracy and coverage of selected rules:
This metric can be maximized by selecting the smallest number of short and accurate rules. Similar metrics can be defined for precision-coverage,
and for precision-recall.
For optimization, we introduce a measure of overlap between the rules to be minimized. Intuitively, minimizing this objective should result in rule sets where rules share only a few conditions, which should further improve the explainability of the resulting rule sets. Specifically, we introduce a predicate rule_overlap(X,Y,Cn) to measure the degree of overlap between rules X and Y.
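A sketch of this measure, reusing the condition/2 indexing from Section 3.3 (the concrete encoding may differ in how ties and symmetry are handled), is:

% count the conditions shared by each pair of selected rules
rule_overlap(X,Y,Cn) :- selected(X), selected(Y), X < Y,
                        Cn = #count{ C : condition(X,C), condition(Y,C) }.
% prefer rule sets whose rules share few conditions
#minimize { Cn,X,Y : rule_overlap(X,Y,Cn) }.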
4 Rule set generation for global and local explanations
In this section, we will describe how to generate global and local explanations with the rule set generation method. Guidotti et al. (2018) defined global explanation as a description of how the overall system works, and local explanation as a specific description of why a certain decision was made. We shall now adapt these definitions to our rule set generation task from tree-ensemble models.
Definition 1. A global explanation is a set of rules derived from the tree-ensemble model that approximates the overall predictive behavior of the base tree-ensemble model.
Examples of measures of approximation for global explanations are: accuracy, precision, recall, F1-score, fidelity and coverage.
Example 4. Given a tree-ensemble as in Figure 2, a global explanation can be constructed from a candidate rule set that includes all possible paths to the leaf nodes (1, 2, $\ldots$ , 10 in Figure 2), then selecting rules based on user-defined criteria.
Definition 2. An instance to be explained is called a prediction instance. A local explanation is a set of rules derived from the tree-ensemble model that approximates the predictive behavior of the base tree-ensemble model when applied to a specific prediction instance.
Example 5. Given a tree-ensemble as in Figure 2, and a prediction instance, a local explanation can be constructed by only considering rules that were active during the prediction, then selecting rules based on user-defined criteria. For example, if leaf nodes 2 and 6 were active, then $R$ only includes rules constructed from the paths leading to nodes 2 and 6. Here, a leaf node is considered active during prediction if the decision path for the prediction instance leads to it, meaning the conditions leading up to that node are satisfied by the instance’s features.
The predictive behavior in this context refers to the method by which the model makes the prediction (aggregating decision tree outputs) and the outcomes of the prediction. The differences between global and local explanations have implications for the encoding we use for rule set generation. Note that these two types of explanations serve distinct purposes. The global explanation seeks to explain the model's overall behavior, while the local explanation focuses on the reasoning behind a specific prediction instance. There is no inherent expectation for a global explanation to align with or fully encompass a local explanation. In particular, when a local explanation is applicable to multiple instances, for example because these instances have similar feature values, the local explanation might not predict accurately for all of them. This is measured by a precision metric and evaluated further in Section 5.3.
In Table 3 we show examples of global and local explanations on the same dataset (adult). For this dataset, the task is to predict whether an individual earns more or less than $50,000 annually. The global and local explanations consist of four conditions and share an attribute (hours-per-week) with different threshold values. While these two rules have the same outcome, the attributes in the bodies are different: in this instance, the global explanation focuses more on the numerical attributes, while the local explanation contains categorical attributes.
Recall that we start with the candidate rule set, $R$, which is created by processing the tree-ensemble model. The rules in $R$ are different between global and local explanations, even when the underlying tree-ensemble model is the same. For global explanations, we can enumerate all rules, including those at internal nodes (Section 3.2), regardless of the outcomes of the rules, because we are more interested in obtaining a simpler classifier with the help of constraints (Section 3.4) and optimization criteria (Section 3.5). On the other hand, for local explanations, it is necessary to consider the match between the rules' predictions and the actual outcome of the tree-ensemble model, so as to keep the precision of the explanations high.
By definition, a local explanation should describe the behavior of the model on a single prediction instance. Thus, we shall make the following modifications to $R$ when generating rule sets for local explanations. We start from the candidate rule set $R$ as in Algorithm 1 (Section 3.2), then, for each predicted instance:
1. Identify the leaf nodes that were active during the prediction.
2. Exclude rules that did not participate in the prediction.
3. Replace the outcome of the rule with the predicted label.
After the modification outlined above, the maximum size of the starting rule set will be the number of trees in the tree-ensemble. Let $K$ be the number of decision trees in a tree-ensemble model; then, since there is exactly one leaf node per tree responsible for the prediction, there will be $K$ rules. Compared to the global explanation case (Proposition 1), the size of the candidate rule set is exponentially smaller for the local explanation. This reduction is enabled by analyzing the behavior of the decision trees during prediction, and it is one of the benefits of using an explanation method which can take advantage of the structure of the model under study.
5 Experiments
In this section, we present a comprehensive evaluation of our rule set generation framework, focusing on both global and local explanations. We evaluate the explanations on public datasets using various metrics, and we also compare the performance to existing methods, including rule-based classifiers. We used several metrics to assess the quality of the generated explanations. These metrics are designed to evaluate different aspects of the explanations, including their comprehensibility, fidelity, and usability. Below, we provide an overview of the metrics used in our evaluations. Detailed discussions of these metrics can be found in the respective sections of this paper.
Global Explanations (Section 5.2):
• Number of Rules and Conditions (Section 5.2.1 and 5.2.2): Assesses the simplification of the original model by counting the number of rules and conditions.
• Relevance (Section 5.2.3): Measures the relevance of the rules by comparing the classification performance against the original model.
• Fidelity (Section 5.2.4): Measures the degree to which the rules accurately describe the behavior of the original model.
• Run Time (Section 5.2.6): Measures the efficiency of generating global explanations.
Local Explanations (Section 5.3):
• Number of Conditions (Section 5.3.1): Measures the conciseness of the local explanations by counting the number of conditions.
• Local-Precision and Coverage (Section 5.3.2): Local-precision compares the model’s predictions for instances covered by the local explanation with the prediction for the instance that induced the explanation. Local-coverage measures the proportion of instances in the validation set that are covered by the local explanation.
• Run Time (Section 5.3.3): Measures the efficiency of generating local explanations.
5.1 Experimental setup
5.1.1 Datasets
We used a total of 14 publicly available datasets; except for the adult dataset, all datasets were taken from the UCI Machine Learning Repository (Dua and Graff 2017). Datasets were chosen from public repositories to ensure a diverse range in terms of the number of instances, the number of categorical variables, and class balance. This was done intentionally, to observe the variance in, for example, explanation generation times. Additionally, the variation in categorical variable ratio and class balance was designed to produce a wide array of tree-ensemble configurations (e.g., more or fewer trees, varying widths and depths). We expected these configurations, in turn, to influence the nature of the explanations generated. We included 3 datasets (adult, credit german, compas) for comparison because they are widely used in the local explainability literature. The adult dataset is actually a subset of the census dataset, but we included the former for consistency with existing literature, and the latter to demonstrate the applicability of our approach to larger datasets. The summary of these datasets is shown in Table 4.
$^{a}$ The maximum depth parameter. The minimum and maximum values set during the hyperparameter search are shown in parentheses. The hyperparameters shown in this table are averaged over 5 folds. $^{b}$ The number of candidate rules, averaged over 5 folds. $^{c}$ The number of trees (estimators) parameter. $^{d}$ The number of features (columns). The number of categorical features is shown in parentheses.
5.1.2 Experimental settings
We used clingo 5.4.0 (Gebser et al. 2014) for ASP and set the time-out to 1,200 s. We used RIPPER as implemented in Weka (Witten et al. 2016), an open-source implementation of RuleFit where Random Forest was selected as the rule generator, and scikit-learn (Pedregosa et al. 2011) for general machine learning functionalities. Our experimental environment is a desktop machine with Ubuntu 18.04, Intel Core i9-9900K 3.6 GHz (8 cores/16 threads) and 64 GB RAM. For reproducibility, all source code for the implementation and experiments, as well as the preprocessed datasets, is available from our GitHub repository.
Unless noted otherwise, all experimental results reported here were obtained with 5-fold cross validation, with hyperparameter optimization in each fold. To evaluate the performance of the extracted rule sets, we implemented a naive rule-based classifier, which is constructed from the rule sets extracted with our method. In this classifier, we apply the rules sequentially to the validation dataset, and if all conditions within a rule are true for an instance in the dataset, the consequent of the rule is returned as the predicted class. More formally, given a set of rules $R_s \subset R$ with cardinality $|R_s|$ whose rules share the same consequent $class(Q)$, we represent this rule-based classifier as the disjunction of the antecedents of the rules:
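Writing $\mathrm{body}(r_j)$ for the antecedent (conjunction of conditions) of the $j$-th rule in $R_s$, one way to state this classifier is
$$class(Q) \Leftarrow \mathrm{body}(r_1) \vee \mathrm{body}(r_2) \vee \ldots \vee \mathrm{body}(r_{|R_s|}).$$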
For a given data point, it is possible that there are no rules applicable, and in such cases the most common class label in the training dataset is returned.
5.2 Evaluating global explanations
Let us recall that the purpose of generating global explanations is to provide the user with a simpler model of the original complex model. Thus, we introduce proxy measures to evaluate (1) the degree to which the model is simplified, by the number of extracted rules and conditions, (2) the relevance of the extracted rules, by comparing classification performance metrics against the original model, and (3) the degree to which the explanation accurately describes the behavior of the original model, by fidelity metrics.
We conducted the experiment in the following order. First, we trained Decision Tree, Random Forest, and LightGBM on the datasets in Table 4. Selected optimized hyperparameters of the tree-ensemble models are also reported in Table 4. Further details on hyperparameter optimization are available in Appendix B. We then applied our rule set generation method to the trained tree-ensemble models. Finally, we constructed a naive rule-based classifier using the set of rules extracted in the previous step and calculated performance metrics on the validation set. This process was repeated in a 5-fold stratified cross validation setting to estimate the performance. We compare the characteristics of our approach against the known methods RIPPER and RuleFit.
We used the following selection criteria to filter out rules that were considered to be undesirable, for example, rules with low accuracy or low coverage. We used the same set of selection criteria for all datasets, irrespective of the underlying label distribution or learning algorithm. When a candidate rule violates any one of these criteria, it is excluded from the candidate rule set, which means that in the worst case, where all candidate rules violate at least one criterion, this encoding will result in an empty rule set (see Section 3.4).
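For illustration, criteria of this kind can be written as invalid/1 rules; the thresholds below are placeholders, not the values used in the experiments.

invalid(X) :- rule(X), accuracy(X,A), A < 50.
invalid(X) :- rule(X), support(X,S), S < 5.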
Another scenario in which our method will produce an empty rule set is when the tree-ensemble contains only “leaf-only” or “stump” trees, that have one leaf node and no splits. In this case, we have no split information to create candidate rules; thus, an empty rule set is returned to the user. This is often caused by inadequate setting of hyperparameters that control the growth of the trees, especially when using imbalanced datasets. It is however outside the scope of this paper, and we will simply note such cases (empty rule set returned) in our results without further consideration.
5.2.1 Number of rules
The average sizes of generated rule sets are shown in Table 5. The sizes of candidate rule sets, from which the rule sets are generated, are listed in the $|R|$ columns in Table 4. A rule set size of 1 means that the rule set contains only a single rule. As one might expect, the Decision Tree consistently has the smallest candidate rule set, but in some cases Random Forest produced considerably more candidate rules than LightGBM, for example, on cars and compas. Our method can produce rule sets which are significantly smaller than the original model, based on the comparison between the sizes of the candidate rule sets $|R|$ and the resulting rule sets.
$^{a}$ Decision Tree + ASP $^{b}$ Random Forest + ASP $^c$ LightGBM + ASP
We will now compare our method to the two benchmark methods, RuleFit and RIPPER, using the rule set sizes reported in Table 5. RuleFit includes original features (called linear terms) as well as conditions extracted from the tree-ensembles in the construction of a sparse linear model, that is to say, the counts in Table 5 may be inflated by the linear terms. On the other hand, the output from RIPPER only contains rules, and RIPPER has rule pruning and rule set optimization to further reduce the rule set size. Moreover, RIPPER has direct control over which conditions to include in rules, whereas our method and RuleFit rely on the structure of the underlying decision trees to construct candidate rules. Our method consistently produced smaller rule sets compared to RuleFit and RIPPER, although the difference between our method and RIPPER was not as pronounced as the difference between our method and RuleFit. RuleFit produced the largest number of rules among the compared methods, although its rule sets were much smaller than the original Random Forest models (Table 5).
5.2.2 Number of conditions in rules
In this subsection, we compare the average number of conditions in each rule and the total number of conditions in rules. One would expect a more precise rule to have a larger number of conditions in its body compared to one that is more general. It should be noted, however, that due to the experimental setup, the maximum number of conditions in a single rule is bounded by the maximum depth parameter of each learning algorithm, which in turn is set by the hyperparameter tuning algorithm.
The average number of conditions in each rule is shown in Table 5. We note that the algorithms may sometimes produce rules without any conditions in their bodies, such as when the induced trees have only a single split node at the root; thus the average numbers reported in Table 5 may be biased toward lower values. From the table, we see that the average number of conditions in a rule generally falls in the range of 1 to 10, which is consistent with the search range of hyperparameters we set for the experiments. Table 5 also shows the total number of conditions in a rule set. Unlike the average number of conditions in a rule, here we see a large difference between our method and the benchmark methods. In all datasets, RuleFit produced the highest counts of conditions in rules, followed by RIPPER and the ensemble-based methods. From Table 5, we make the following observations: (1) the length of individual rules does not vary as much between methods as the number of rules does, and (2) the high number of conditions in rules extracted by RuleFit can be explained by its high number of rules, since the lengths of individual rules are comparable to those of other methods.
5.2.3 Relevance of rules
To quantify the relevance of the extracted rules, we measured the ratio of performance metrics using the naive rule-based classifier by 5-fold cross validation (Table 6). A performance ratio of less than 1.0 means that the rule-based classifier performed worse than the original classifier (LightGBM and Random Forest), whereas a performance ratio greater than 1.0 means the rule set’s performance is better than the original classifier. We used a version of the ASP encoding shown in Section 3.5 where the accuracy and coverage are maximized. RIPPER was excluded from this comparison because it has a built-in rule generation and refinement process, and it does not have a base model, whereas our method and RuleFit use variants of tree-ensemble models as base models.
$^a$ Decision Tree + ASP $^b$ Random Forest + ASP $^c$ LightGBM + ASP
From Table 6 we observe that in terms of accuracy, RuleFit generally performs as well as, or marginally better than, the original Random Forest. On the other hand, although our method can produce rule sets that are comparable in performance to the original model, it does not produce rules that perform significantly better. With Decision Tree and Random Forest, the generated rule sets perform much worse than the original model on some datasets, for example, kidney and voting. The LightGBM + ASP combination resulted in the second-best performance overall, where the resulting rules' performance was arguably comparable to the original model (0.8-0.9 range), with a few exceptions (e.g., the census F1-score) where the performance ratio was about half of the original. While RuleFit's performance was superior, our method could still produce reasonably performing rule sets that are an order of magnitude smaller than those of RuleFit. A rather unexpected result was that using our method (Random Forest) or RuleFit significantly improved the F1-score on the census dataset. In Table 6 we can see that recall was the major contributor to this improvement.
5.2.4 Fidelity metrics of global explanations
In Section 5.2.3, we compared the ratio of performance metrics of different methods when measured against the original labels. In the context of evaluating explanation methods, it is also important to investigate fidelity, that is, the extent to which the explanation is able to accurately imitate the original model (Guidotti et al. 2018). A fidelity metric is calculated as an agreement metric between the prediction of the original model and that of the explanation, the latter in this case being the rule set. More concretely, when the class predicted by the model is positive and that predicted by the explanation is positive, it is a true positive (TP), and when the latter is negative, it is a false negative (FN). Thus, the fidelity metrics can be calculated in the same manner as the performance metrics, using the equations shown in Section 3.3. RIPPER was excluded from this comparison for the same reasons as outlined in Section 5.2.3.
The average fidelity metrics (accuracy, F1-score, precision, and recall) are shown in Table 7. The overall trend is similar to the previous section on rule relevance, where RuleFit performs the best overall in terms of fidelity. The accuracy metrics for our method show that the global explanations, in general, behaved similarly to the original model, although RuleFit was better on most of the datasets. The precision metrics show that, even when excluding the results for Decision Tree (which is not a tree-ensemble learning algorithm), our method could produce explanations that had high fidelity in terms of precision compared to RuleFit. The fidelity metrics may be improved further by including them in the ASP encodings, since they were not part of the selection criteria or optimization goals.
$^a$ Decision Tree + ASP $^b$ Random Forest + ASP $^c$ LightGBM + ASP
5.2.5 Changing optimization criteria
The definition of optimization objectives has a direct influence over the performance of the resulting rule sets, and the objectives need to be set in accordance with user requirements. The answer sets found by clingo with multiple optimization statements are optimal with regard to the set of goals defined by the user. Instead of using accuracy, one may use other rule metrics as defined in Table 1, such as precision and/or recall. If there are priorities between optimization criteria, then one could use the priority notation (weight@priority) in clingo to define them. Optimal answer sets can be computed in this way; however, if enumeration of such optimal sets is important, then one could use the pareto or lexico preference definitions provided by asprin (Brewka et al. 2015) to enumerate Pareto-optimal answer sets. Instead of presenting a single optimal rule set to the user, this allows the user to explore other optimal rule sets.
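For example, prioritized objectives can be written as follows (illustrative; the level after @ is the priority, and higher levels are optimized first):

#maximize { A@2,X : selected(X), accuracy(X,A) }.
#maximize { S@1,X : selected(X), support(X,S) }.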
To investigate the effect of changing optimization objectives, we changed the ASP encoding from maximizing accuracy-coverage to maximizing precision-coverage (shown in Section 3.4) while keeping other parameters constant. The results are shown in Table 8. Note that the table shows the ratio of precision scores, as opposed to accuracy or F1-score in the earlier tables. Here, since we are optimizing for better precision, we expect the precision-coverage encoding to produce rule sets with better precision scores than the accuracy-coverage encoding. For the Decision Tree and Random Forest + ASP, the effect was not as pronounced as we expected, but we observed noticeable differences on the compas and credit german datasets. For the LightGBM + ASP combination, we observed a more consistent difference: except for the credit german dataset, the encoding produced the intended results on most of the datasets in this experiment.
$^a$ A performance ratio of 1 means the rule set's precision is identical to the original classifier. Numbers are shown in bold where the performance ratio was better by more than 0.01 compared to the other encoding. $^b$ acc.cov = accuracy and coverage encoding, see Section 4. $^c$ prec.cov = precision and coverage encoding, see Section 4.
5.2.6 Global explanation running time
The average running time for generating global explanations is reported in Table 9. The running time covers the rule extraction and rule set generation steps for our method, and the corresponding running time for RuleFit, but excludes the time taken for model training (e.g., Random Forest) and hyperparameter optimization. Comparing the methods that share the same base model (RF + ASP and RuleFit, both based on Random Forest), we observe that our method is slower than RuleFit except when the datasets are relatively large (e.g., adult, census, compas, and credit taiwan), in which case our method can be much faster than RuleFit. A similar trend is observed for LightGBM, but here our method was faster than RuleFit in some additional cases (e.g., autism, heart, and voting).
$^a$ DT = Decision Tree $^b$ RF = Random Forest. $^c$ LGBM = LightGBM.
5.3 Evaluating local explanations
The purpose of generating local explanations is to provide the user with an explanation of the model's prediction for each predicted instance. Here, we use the commonly used metrics local-precision and coverage as proxy measures for the quality of the explanation. The local-precision compares the (black-box) model predictions of the instances covered by the local explanation with the model prediction of the original instance used to induce the local explanation. The coverage is the ratio of instances in the validation set that are covered by the local explanation. These two metrics are in a trade-off relationship, where pursuing high coverage is likely to result in low-precision explanations and vice versa. Furthermore, we also study the number of conditions in the explanation to measure the conciseness of the extracted rules. Additionally, we compare the running time needed to generate the local explanations.
The experiments were carried out similarly to the global explanation evaluation, except that: (1) we replaced RIPPER and RuleFit with Anchors, (2) instead of using the full validation set, we resampled the validation dataset to generate 100 instances in each cross-validation fold for each dataset to estimate the metrics, in order to complete the experiments in a reasonable amount of time, and (3) in the ASP encoding, we removed the rule selection criteria to avoid excluding rules that are relevant to the predicted instance. We were unable to complete the Anchors experiment with the census dataset due to limited memory (64 GB) on our machine. For comparison, we computed direct and sufficient explanations with the PyXAI library, which internally uses SAT and MaxSAT solvers to compute local explanations (Audemard et al. 2022a,b). Our method currently does not include a rule simplification feature; therefore, to maintain a consistent comparison, we also deactivated the rule simplification feature in PyXAI. Furthermore, as of writing, PyXAI does not yet support the LightGBM classifier, so only results for Decision Tree and Random Forest are included. For the running time comparison, we exclude all data preprocessing, training, and tree processing, and focus solely on the time taken to generate local explanations.
5.3.1 Number of conditions in rules
Similarly to the evaluation of global explanations (Section 5.2.2), in this section we evaluate the number of conditions in local explanations. The average numbers of conditions in rules are listed in Tables 10, 11, and 12. For the Decision Tree (Table 10) and Random Forest (Table 11), Anchors produced rules with a smaller number of conditions on average compared to our method. As for LightGBM, Anchors produced rules with a significantly larger number of conditions than our method, often by an order of magnitude. It is possible that the precision guarantee of Anchors required the algorithm to produce more specific rules, as also indicated by the long run time, especially on the datasets that produced the longest rules (e.g., census, credit taiwan, and adult). This result shows that, depending on the underlying learning algorithm, our method can produce shorter and more concise explanations compared to Anchors.
$^a$ Anchors. $^b$ PyXAI, direct explanations. $^c$ PyXAI, sufficient explanations. $^d$ Excluded from the run time comparison since the SAT solver is not involved in computing direct explanations.
$^a$ Anchors. $^b$ PyXAI, direct explanations. $^c$ PyXAI, sufficient explanations. $^d$ Excluded from the run time comparison since the SAT solver is not involved in computing direct explanations.
$^a$ Anchors.
It is interesting to note that, while PyXAI's direct explanations performed almost identically to ours for Decision Tree, they produced much larger rules for Random Forest. Similarly, PyXAI's sufficient explanations contained more conditions than those of our method or Anchors. It is possible that rule simplification could help reduce the number of conditions in such cases.
5.3.2 Local-precision and coverage
The average local-precision, averaged over 5 cross-validation folds, is reported in Tables 10, 11, and 12. Note that while Anchors has a minimum precision threshold (we used the default setting of 0.95), ours does not, and indeed we see that all Anchors explanations have a higher local-precision than the threshold. PyXAI produced the most precise explanations of the three methods compared, and both direct and sufficient explanations almost always had a perfect local-precision of 1. The Decision Tree will always have exactly one rule that is relevant to the prediction; therefore, we expect a local-precision of exactly 1 using our method. For Random Forest and LightGBM, our method produced local explanations with local-precision in the 0.8-0.9 range for most of the datasets, but Anchors' explanations had higher local-precision in most cases.
The average coverage, averaged over 5 cross-validation folds, is reported in Tables 10, 11, 12. Interestingly, when using simpler models such as the Decision Tree and Random Forest, Anchors can produce rules that have relatively high coverage, but the pattern does not hold when using a more complex model, which in our case is LightGBM. With LightGBM, our method consistently outperformed Anchors in terms of coverage in all datasets, except for the census dataset, which we could not run. For Random Forest, PyXAI’s direct explanations are much more precise and apply to a smaller number of instances compared to our method. Sufficient explanations from PyXAI tended to show greater coverage compared to direct explanations, which aligns with expectations given their fewer number of conditions.
5.3.3 Local explanation running time
The average running time per instance is reported in Tables 10, 11, and 12. For Decision Tree, PyXAI was much faster than both our method and Anchors, whereas for Random Forest, our method was faster than Anchors and PyXAI's sufficient explanations in most datasets. For LightGBM, our method consistently outperformed Anchors in terms of run time. We also note that our method has a more consistent running time of below 1 s across all datasets, regardless of the complexity of the underlying models, whereas Anchors' running time varies from under 1 s to tens of seconds, depending on the dataset and model. This is likely caused by differences in how these methods query or use information from the original model and generate explanations. In fact, a significant amount of time is spent in tree processing in our method, whereas in Anchors the search process is often the most time-consuming step. Nonetheless, this comparative experiment demonstrated that our method can produce local explanations in a matter of seconds even when the underlying tree-ensemble is large.
To conclude the experimental section, we summarize the main results obtained in this section. For global explanations, we analyzed (1) the average size of the generated rule sets, compared against known methods as a proxy measure for the degree of simplification, (2) the relative performance of the rule sets, compared against known methods as a proxy measure for the relevance of the explanations, (3) the fidelity of the explanations, and (4) the effect of modifying the ASP encoding on the precision metric of the explanations. Overall, our method was shown to produce smaller rule sets than the known methods; however, in terms of the relevance and fidelity of the rules, RuleFit performed better in most cases, demonstrating the trade-off between the complexity of the explanations and their performance.
For local explanations, we compared (1) the number of conditions, (2) the local-precision, (3) the coverage, and (4) the running time of our method against Anchors and PyXAI. In terms of local-precision, although our method could produce explanations with reasonably high precision (in the 0.8–0.9 range), Anchors and PyXAI performed better overall. As for coverage, we found that explanations generated by our method can cover more examples for the more complex tree-ensembles. Regarding running time, our method had a consistent running time of less than 1 s, whereas the running time of Anchors varied between datasets. The experiments for local explanations also highlight the differences between our method and Anchors: while Anchors can produce high-precision rules, our method has an advantage in terms of memory requirements and consistent running time.
6 Related work
Summarizing tree-ensemble models has been studied in the literature; see, for example, Born Again Trees (Breiman and Shang Reference Breiman and Shang1996), defragTrees (Hara and Hayashi Reference Hara and Hayashi2018) and inTrees (Deng Reference Deng2019). While the exact methods and implementations differ among these examples, a popular approach to tree-ensemble simplification is to create a simplified decision tree model that approximates the behavior of the original tree-ensemble model. Depending on how the approximate tree model is constructed, this can lead to a deeper tree with an increased number of conditions, which makes it difficult to interpret.
Integrating association rule mining and classification has also been studied, for example, in Class Association Rules (CARs) (Liu et al. Reference Liu, Hsu and Ma1998), where association rules discovered by pattern mining algorithms are combined to form a classifier. Repeated Incremental Pruning to Produce Error Reduction (RIPPER) (Cohen Reference Cohen1995) was proposed as an efficient approach for classification based on association rule mining, and it is a well-known rule-based classifier. In CARs and RIPPER, rules are mined from data with dedicated association rule mining algorithms and then processed to produce a final classifier.
Interpretable classification models are another area of active research. Interpretable Decision Sets (IDSs) (Lakkaraju et al. Reference Lakkaraju, Bach and Leskovec2016) are learned through an objective function that simultaneously optimizes the accuracy and interpretability of the rules. In Scalable Bayesian Rule Lists (SBRL) (Yang et al. Reference Yang, Rudin and Seltzer2017), probabilistic IF-THEN rule lists are constructed by maximizing the posterior distribution of rule lists. In RuleFit (Friedman and Popescu Reference Friedmann and Popescu2008), a sparse linear model is trained with rules extracted from tree-ensembles. RuleFit is the closest to our work, in the sense that both RuleFit and our method extract conditions and rules from tree-ensembles, but they differ in the treatment of rules and the representation of the final rule sets. In RuleFit, rules are accompanied by regression coefficients, and it is left up to the user to further interpret the result.
Lundberg et al. (Reference Lundberg, Erion, Chen, Degrave, Prutkin, Nair, Katz, Himmelfarb, Bansal and Lee2020) showed how a variant of SHAP (Lundberg and Lee Reference Lundberg and Lee2017), which is a post hoc explanation method, can be applied to tree-ensembles. While our method does not produce importance measures for each feature, the information about which rule fired to reach the prediction can be offered as an explanation in a human-readable format. Shakerin and Gupta (Reference Shakerin and Gupta2019) proposed a method to use LIME weights (Ribeiro et al. Reference Ribeiro, Singh and Guestrin2016) as a part of learning heuristics in inductive learning of default theories. Anchors (Ribeiro et al. Reference Ribeiro, Singh and Guestrin2018) generates a single high-precision rule as a local explanation with probabilistic guarantees. It should be noted that both LIME and Anchors require the features to be discretized, while recent tree-ensemble learning algorithms can work with continuous features. Furthermore, instead of learning rules with heuristics from data, our method directly handles rules which exist in decision tree models with an answer set solver.
There are existing ASP encodings of pattern mining algorithms, for example, (Järvisalo Reference Jarvisalo2011; Gebser et al. Reference Gebser, Guyet, Quiniou, Romero and Schaub2016; Paramonov et al. Reference Paramonov, Stepanova and Miettinen2019), that can be used to mine itemsets and sequences. Here, we develop and apply our encoding to rules in order to extract explanatory rules from tree-ensembles. On the surface, our problem setting (Section 3.1) may appear similar to frequent itemset and sequence mining; however, rule set generation is different from these pattern mining problems. We can indeed borrow some ideas from frequent itemset mining for the encoding; however, our goal is not to decompose rules (cf. transactions) into individual conditions (cf. items) and then construct rule sets (cf. itemsets) from conditions, but rather to treat each rule in its entirety and then combine rules to form rule sets. The body (antecedent) of a rule can also be seen as a sequence in which the conditions are connected by the conjunction connective $\wedge$; however, in our case the ordering of conditions does not matter, so sequential mining encodings that use slots to represent positional constraints (Gebser et al. Reference Gebser, Guyet, Quiniou, Romero and Schaub2016) cannot be applied directly to our problem.
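To make this distinction concrete, the toy program below (run through the clingo Python API) treats each candidate rule as an atomic unit: a rule set is a subset of whole rules, chosen under a cardinality bound and an optimization criterion. This is only an illustrative sketch with made-up facts and bounds, not the encoding presented in Section 3.

    # Illustrative sketch only: select whole rules into a rule set with clingo.
    from clingo import Control

    program = """
    rule(1..3).
    covered(1,10).  covered(2,25).  covered(3,40).   % covered instances per rule (made up)
    { selected(R) : rule(R) } 2.                     % a rule set contains at most two rules
    #maximize { N,R : selected(R), covered(R,N) }.   % prefer rule sets covering more instances
    #show selected/1.
    """

    ctl = Control()
    ctl.add("base", [], program)
    ctl.ground([("base", [])])
    ctl.solve(on_model=lambda m: print("candidate rule set:", m.symbols(shown=True)))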
Solvers other than ASP solvers have been utilized for similar tasks. For example, Yu et al. (Reference Yu, Ignatiev, Stuckey, Le Bodic and Simonis2020) proposed SAT- and MaxSAT-based approaches to minimize the total number of conditions used in the target decision set. Their approaches construct IDSs based on SAT- and MaxSAT-encodings, instead of using a weighted objective function (Lakkaraju et al. Reference Lakkaraju, Bach and Leskovec2016) that contains multiple terms such as coverage, number of rules and conditions. Chen et al. (Reference Chen, Zheng, Si, Li, Boning and Hsieh2019) proposed an efficient algorithm for the robustness verification of tree-ensemble models, which surpasses existing MILP (mixed integer linear programming) methods in terms of speed. While they do not consider (local) explanations explicitly in their setting, their method allows computation of anchor features such that changes outside these features cannot change the prediction. More recently, alternative methods to generate explanations based on logical definitions have been proposed (Izza and Marques-Silva Reference Izza and Marques-SILVA2021; Ignatiev et al. Reference Ignatiev, Izza, Stuckey and Marques-Silva2022; Audemard et al. Reference Audemard, Bellart, Bounia, Koriche, Lagniez and Marquis2022b,a). These methods focus on local explanations conforming to logical conditions, such as abductive (sufficient) explanations. Similar to our approach, these methods are tailored for tree-ensemble models and utilize SAT and MaxSAT solvers for efficient processing. On the other hand, our method differs from these methods by using heuristics for generating explanations, and, by leveraging the flexibility of ASP, facilitates the inclusion of user-defined selection criteria and preferences. Regarding model-agnostic explanation methods, while Anchors is model-agnostic, its reliance on sampling to construct explanations often results in a longer run time, as exemplified by experimental results reported in Section 5.3.
Guns et al. (Reference Guns, Nijssen and De Raedt2011) applied constraint programming (CP), a declarative approach, to itemset mining. This constraint satisfaction perspective led to the development of ASP encoding of pattern mining (Järvisalo Reference Jarvisalo2011; Guyet et al. Reference Guyet, Moinard and Quiniou2014). Gebser et al. (Reference Gebser, Guyet, Quiniou, Romero and Schaub2016) applied preference handling to sequential pattern mining, and Paramonov et al. (Reference Paramonov, Stepanova and Miettinen2019) extended the declarative pattern mining by incorporating dominance programming (DP) from Negrevergne et al. (Reference Negrevergne, Dries, Guns and Nijssen2013) to the specification of constraints. Paramonov et al. (Reference Paramonov, Stepanova and Miettinen2019) proposed a hybrid approach where the solutions are effectively screened first with dedicated algorithms for pattern mining tasks, then declarative ASP encoding is used to extract condensed patterns. While the aforementioned works focused on extracting interesting patterns from transaction or sequence data, our focus in this paper is to generate rule sets from tree-ensemble models to help users interpret the behavior of machine learning models. As for the ASP encoding, we use dominance relations similar to the ones presented in Paramonov et al. (Reference Paramonov, Stepanova and Miettinen2019) to further constrain the search space.
7 Conclusion
In this work, we presented a method for generating rule sets as global and local explanations from tree-ensemble models using pattern mining techniques encoded in ASP. Unlike other explanation methods that focus exclusively on either global or local explanations, our two-step approach allows us to handle both global and local explanation tasks. We showed that our method can be applied to two well-known tree-ensemble learning algorithms, namely Random Forest and LightGBM. Evaluation on various datasets demonstrated that our method can produce explanations with good quality in a reasonable amount of time, compared to existing methods.
Adopting the declarative programming paradigm with ASP allows the user to take advantage of the expressiveness of ASP in representing constraints and optimization criteria. This makes our approach particularly suitable for situations where fast prototyping is required, since changing the constraint and optimization settings requires relatively little effort compared to specialized pattern mining algorithms. Useful explanations can be generated with our approach, and combined with the expressive ASP encoding, we hope that our method will help users of tree-ensemble models to better understand the behavior of such models.
A limitation of our method in terms of scalability is the size of the search space, which grows exponentially with the number of valid rules. When the number of candidate rules is large, we suggest using stricter constraints on individual rules, or reducing the maximum number of rules to be included in a rule set (Section 3.4), to achieve reasonable solving times. Another limitation is the lack of rule simplification in the generation of explanations, since more straightforward rules could enhance the user’s comprehension. Furthermore, while the current ASP encoding considers the overlap between rule sets with the same consequent class (Section 3.5), it does not consider the overlap between rule sets with different consequent classes.
There are a number of directions for further research. First, while the current work does not modify the conditions in the rules in any way, rule simplification approaches could be incorporated to remove redundant conditions. Second, the current work could be extended to support regression problems. Third, further research might explore alternative ways to implement a model-agnostic explanation method, for example, by combining a sampling-based local search strategy with a rule selection component implemented in ASP. In addition, while the multi-objective optimization approach (Section 3.5) allows for incorporating user desiderata, the fidelity to the original models can still be improved. Future work could focus on exploring alternative encodings or additional optimization strategies to better capture the nuances of the original models’ decision-making processes, thereby improving the effectiveness of the explanations. Furthermore, although local and global explanations serve different purposes and may not always align perfectly (Section 4), achieving a certain level of consistency is important for maintaining the credibility of the explanations. Future research could explore methods to reconcile such differences, thereby creating a more unified model explanation framework. More generally, we plan to explore how ASP and modern statistical machine learning can be integrated effectively to produce more interpretable machine learning systems.
Acknowledgments
This work has been supported by JSPS KAKENHI Grant Number JP21H04905 and JST CREST Grant Number JPMJCR22D3, Japan.
Appendix A Additional tables
The accuracy, F1-scores, precision and recall of the base models after hyperparameter optimization are shown in Tables A1 and A2. The values in these tables are used as the denominators when calculating the performance ratio in Table 6.
$^a$ DecisionTree $^b$ RandomForest $^c$ LightGBM $^d$ RuleFit $^e$ RIPPER
Appendix B Hyperparameter optimization
All hyperparameters were optimized using Optuna (Akiba et al. Reference Akiba, Yanase, Ohta and Koyama2019). As the evaluation metric, we chose the F1-score. Hyperparameter tuning was performed separately for each fold on the training data. Within each fold, we used 20% of the training data as the validation set. We used early stopping for the hyperparameter search, where the search was terminated if the validation metric did not improve for 30 consecutive trials. We set the maximum number of search trials to 200, and the time-out for each study was set to 1,200 seconds. The search range for each model is shown in Table B1.
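For concreteness, the setup described above roughly corresponds to the sketch below. The search space shown (a few LightGBM parameters) is only an example and does not reproduce the ranges in Table B1, and the patience-based callback is a simple stand-in for the early-stopping rule described above.

    # A sketch of the tuning setup, assuming a LightGBM classifier; the
    # parameter ranges and the F1 averaging are illustrative only.
    import optuna
    import lightgbm as lgb
    from sklearn.metrics import f1_score
    from sklearn.model_selection import train_test_split

    def make_objective(X_train, y_train):
        # Hold out 20% of the fold's training data as the validation set.
        X_tr, X_val, y_tr, y_val = train_test_split(
            X_train, y_train, test_size=0.2, random_state=0)

        def objective(trial):
            params = {
                "num_leaves": trial.suggest_int("num_leaves", 8, 256),
                "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
                "n_estimators": trial.suggest_int("n_estimators", 50, 500),
            }
            model = lgb.LGBMClassifier(**params)
            model.fit(X_tr, y_tr)
            return f1_score(y_val, model.predict(X_val), average="macro")

        return objective

    class Patience:
        """Stop the study if the best value has not improved for `rounds` trials."""
        def __init__(self, rounds=30):
            self.rounds, self.best, self.stale = rounds, None, 0

        def __call__(self, study, trial):
            if self.best is None or study.best_value > self.best:
                self.best, self.stale = study.best_value, 0
            else:
                self.stale += 1
                if self.stale >= self.rounds:
                    study.stop()

    # study = optuna.create_study(direction="maximize")
    # study.optimize(make_objective(X_train, y_train), n_trials=200,
    #                timeout=1200, callbacks=[Patience(30)])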