1 Introduction

Decision trees are one of the most widely-used techniques for classification problems. Guided by the training data \(({\varvec{x}}_i, y_i), i = 1, \ldots , n\), decision trees recursively partition the feature space and assign a label to each resulting partition. The tree is then used to classify future points according to these splits and labels. The key advantage of decision trees over other methods is their interpretability; in many applications, such as healthcare, this interpretability is often preferred over methods that may have higher accuracy but are relatively uninterpretable.

The leading work for decision tree methods in classification is classification and regression trees (CART), proposed by Breiman et al. (1984), which takes a top-down approach to determining the partitions. Starting from the root node, a split is determined by solving an optimization problem (typically minimizing an impurity measure), before proceeding to recurse on the two resulting child nodes. The main shortcoming of this top-down approach, shared by other popular decision tree methods like C4.5 (Quinlan 1993) and ID3 (Quinlan 1986), is its fundamentally greedy nature. Each split in the tree is determined in isolation without considering the possible impact of future splits in the tree. This can lead to trees that do not capture well the underlying characteristics of the dataset, potentially leading to weak performance when classifying future points.

Another limitation of top-down induction methods is that they typically require pruning to achieve trees that generalize well. Pruning is needed because top-down induction is unable to handle penalties on tree complexity while growing the tree, since powerful splits may be hidden behind weaker splits. This means that the best tree might not be found by the top-down method if the complexity penalty is too high and prevents the first weaker split from being selected. The usual approach to resolve this is to grow the tree as deep as possible before pruning back up the tree using the complexity penalty. This avoids the problem of weaker splits hiding stronger splits, but it means that the training occurs in two phases, growing via a series of greedy decisions, followed by pruning. Lookahead heuristics such as IDX (Norton 1989), LSID3 and ID3-k (Esmeir and Markovitch 2007) also aim to resolve this problem of strong splits being hidden behind weak splits by finding new splits based on optimizing deeper trees rooted at the current leaf, rather than just optimizing a single split. However, it is unclear whether these methods can lead to trees with better generalization ability and avoid the so-called look-ahead pathology of decision tree learning (Murthy and Salzberg 1995a).

These top-down induction methods typically optimize an impurity measure when selecting splits, rather than using the misclassification rate of training points. This seems odd when the misclassification rate is the final objective being targeted by the tree, and indeed is also the measure that is used when pruning the tree. Breiman et al. (1984, p. 97) explain why impurity measures are used in place of the more natural objective of misclassification:

...the [misclassification] criterion does not seem to appropriately reward splits that are more desirable in the context of the continued growth of the tree. ...This problem is largely caused by the fact that our tree growing structure is based on a one-step optimization procedure.

It seems natural to believe that growing the decision tree with respect to the final objective function would lead to better splits, but the use of top-down induction methods and their requirement for pruning prevents the use of this objective.

The natural way to resolve this problem is to form the entire decision tree in a single step, allowing each split to be determined with full knowledge of all other splits. This would result in the optimal decision tree for the training data. This is not a new idea; Breiman et al. (1984, p. 42) noted the potential for such a method:

Finally, another problem frequently mentioned (by others, not by us) is that the tree procedure is only one-step optimal and not overall optimal. ...If one could search all possible partitions ...the two results might be quite different.

This issue is analogous to the familiar question in linear regression of how well the stepwise procedures do as compared with ‘best subsets’ procedures. We do not address this problem. At this stage of computer technology, an overall optimal tree growing procedure does not appear feasible for any reasonably sized dataset.

The use of top-down induction and pruning in CART was therefore not due to a belief that such a procedure was inherently better, but instead was guided by practical limitations of the time, given the difficulty of finding an optimal tree. Indeed, it is well-known that the problem of constructing optimal binary decision trees is NP-hard (Hyafil and Rivest 1976). Nevertheless, there have been many efforts previously to develop effective ways of constructing optimal univariate decision trees using a variety of heuristic methods for growing the tree in one step, including linear optimization (Bennett 1992), continuous optimization (Bennett and Blue 1996), dynamic programming (Cox et al. 1989; Payne and Meisel 1977), genetic algorithms (Son 1998), and more recently, optimizing an upper bound on the tree error using stochastic gradient descent (Norouzi et al. 2015). However, none of these methods have been able to produce certifiably optimal trees in practical times. A different approach is taken by the methods T2 (Auer et al. 1995), T3 (Tjortjis and Keane 2002) and T3C (Tzirakis and Tjortjis 2016), a family of efficient enumeration approaches which create optimal non-binary decision trees of depths up to 3. However, trees produced using these enumeration schemes are not as interpretable as binary decision trees, and do not perform significantly better than current heuristic approaches (Tjortjis and Keane 2002; Tzirakis and Tjortjis 2016).

We believe the lack of success in developing an algorithm for optimal decision trees can be attributed to a failure to address correctly the underlying nature of the decision tree problem. Constructing a decision tree involves a series of discrete decisions—whether to split at a node in the tree, which variable to split on—and discrete outcomes—which leaf node a point falls into, whether a point is correctly classified—and as such, the problem of creating an optimal decision tree is best posed as a mixed-integer optimization (MIO) problem.

Continuous optimization methods have been widely used in statistics over the past 40 years, but MIO methods, which have been used to great effect in many other fields, have not had the same impact on statistics. Despite the knowledge that many statistical problems have natural MIO formulations (Arthanari and Dodge 1981), there is the belief within the statistics/machine learning community that MIO problems are intractable even for small to medium instances, which was true in the early 1970s when the first continuous optimization methods for statistics were being developed.

However, the last 25 years have seen an incredible increase in the computational power of MIO solvers, and modern MIO solvers such as Gurobi (Gurobi Optimization Inc 2015b) and CPLEX (IBM ILOG CPLEX 2014) are able to solve linear MIO problems of considerable size. To quantify this increase, Bixby (2012) tested a set of MIO problems on the same computer using CPLEX 1.2, released in 1991, through CPLEX 11, released in 2007. The total speedup factor was measured to be more than 29,000 between these versions (Bixby 2012; Nemhauser 2013). Gurobi 1.0, an MIO solver first released in 2009, was measured to have similar performance to CPLEX 11. Version-on-version speed comparisons of successive Gurobi releases have shown a speedup factor of nearly 50 between Gurobi 6.5, released in 2015, and Gurobi 1.0 (Bixby 2012; Nemhauser 2013; Gurobi Optimization Inc 2015a), so the combined machine-independent speedup factor in MIO solvers between 1991 and 2015 is approximately 1,400,000. This impressive speedup factor is due to incorporating both theoretical and practical advances into MIO solvers. Cutting plane theory, disjunctive programming for branching rules, improved heuristic methods, techniques for preprocessing MIOs, using linear optimization as a black box to be called by MIO solvers, and improved linear optimization methods have all contributed greatly to the speed improvements in MIO solvers (Bixby 2012). Coupled with the increase in computer hardware during this same period, a factor of approximately 570,000 (Top500 Supercomputer Sites 2015), the overall speedup factor is approximately 800 billion! This astonishing increase in MIO solver performance has enabled many recent successes when applying modern MIO methods to a selection of these statistical problems (Bertsimas et al. 2016; Bertsimas and King 2015, 2017; Bertsimas and Mazumder 2014).

The belief that MIO approaches to problems in statistics are not practically relevant was formed in the 1970s and 1980s and it was at the time justified. Given the astonishing speedup of MIO solvers and computer hardware in the last 25 years, the mindset of MIO as theoretically elegant but practically irrelevant is no longer supported. In this paper, we extend this re-examination of statistics under a modern optimization lens by using MIO to formulate and solve the decision tree training problem, and provide empirical evidence of the success of this approach.

One key advantage of using MIO is the richness offered by the modeling framework. For example, consider multivariate (or oblique) decision trees, which split on multiple variables at a time, rather than the univariate or axis-aligned trees that are more common. These multivariate splits are often much stronger than univariate splits; however, the problem of determining the split at each node is much more complicated than in the univariate case, as a simple enumeration is no longer possible. Many approaches to solving this problem have been proposed, including using logistic regression (Truong 2009), support vector machines (Bennett and Blue 1998), simulated annealing (Heath et al. 1993), linear discriminant analysis (Loh and Shih 1997; López-Chau et al. 2013), Householder transformations (Wickramarachchi et al. 2016), and perturbation of existing univariate trees (Breiman et al. 1984; Murthy et al. 1994). Most of these approaches do not have easily accessible implementations that can be used on practically-sized datasets and as such, the use of multivariate decision trees in the statistics/machine learning community has been limited. We also note that these multivariate decision tree approaches share the same major flaw as univariate decision trees, in that the splits are formed one-by-one using top-down induction, and so the choice of split cannot be guided by the possible influence of future splits. However, when viewed from an MIO perspective, the problems of generating univariate and multivariate trees are very similar, and the second is in fact simpler than the first, allowing a practical way to generate such trees which has not previously been possible for any heuristic. The flexibility of modeling the problem using MIO allows many such extensions to be incorporated very easily.

Our goal in this paper is to demonstrate that formulating and solving the decision tree problem using MIO is a tractable approach and leads to practical solutions that outperform the classical approaches, often significantly.

We summarize our contributions in this paper below:

  1. We present a novel formulation of the classical univariate decision tree problem as an MIO problem that motivates our new classification method, optimal classification trees (OCT). We also show that by relaxing constraints in this model we obtain the multivariate decision tree problem, and that the resulting classification method, optimal classification trees with hyperplanes (OCT-H), is as easy to train as its univariate counterpart, a fact that is not true of heuristic approaches for generating multivariate decision trees.

  2. Using a range of tests with synthetic data comparing OCT against CART, we demonstrate that solving the decision tree problem to optimality yields trees that better reflect the ground truth in the data, refuting the belief that such optimal methods will simply overfit to the training set and not generalize well.

  3. We demonstrate that our MIO methods outperform classical decision tree methods in practical applications. We comprehensively benchmark both OCT and OCT-H against the state-of-the-art CART on a sample of 53 datasets from the UCI machine learning repository. We show that across this sample, the OCT and OCT-H problems are practically solvable for datasets with sizes in the thousands and yield higher out-of-sample accuracy than CART, with average absolute improvements over CART of 1–2% and 3–5% for OCT and OCT-H, respectively, across all datasets depending on the depth of tree used.

  4. We provide a comparison of OCT and OCT-H to Random Forests. Across all 53 datasets, OCT closes the gap between CART and Random Forests by about one-sixth, and OCT-H by about half. Furthermore, for the 31 datasets with two or three classes and \(p \le 25\), OCT-H and Random Forests are comparable in accuracy.

  5. To provide guidance to machine learning practitioners, we present simple decision rules using characteristics of the dataset that predict with high accuracy when the optimal tree methods will deliver consistent and significant accuracy improvements over CART. OCT is highly likely to improve upon CART by 2–4% when the CART accuracy is low, and by 1.2–1.3% when the CART accuracy is high and sufficient training data is available. OCT-H improves upon CART by 4–7% when the CART accuracy is low or the dimension of the dataset is small.

We note that there have been efforts in the past to apply MIO methods to classification problems. Classification and regression via integer optimization (CRIO), proposed by Bertsimas and Shioda (2007), uses MIO to partition and classify the data points. CRIO was not able to solve moderately-sized classification problems to provable optimality, and its practical improvement over CART was not significant. In contrast, in this paper we present a very different MIO approach based around solving the same problem that CART seeks to solve, and this approach provides material improvement over CART for a variety of datasets.

The structure of the paper is as follows. In Sect. 2, we review decision tree methods and formulate the problem of optimal tree creation within an MIO framework. In Sect. 3, we present a complete training algorithm for optimal tree classification methods. In Sect. 4, we extend the MIO formulation to consider trees with multivariate splits. In Sect. 5, we conduct a range of experiments using synthetic datasets to evaluate the performance of our method in recovering a known ground truth. In Sect. 6, we perform a variety of computational tests on real-world datasets to benchmark our performance against classical decision tree methods in practical applications. In Sect. 7, we include our concluding remarks.

2 Optimal decision tree formulation using MIO

In this section, we first give an overview of the classification problem and the approach taken by typical decision tree methods. We then present the problem that CART seeks to solve as a formal optimization problem and develop an MIO model formulation for this problem allowing us to solve it to optimality, providing the basis for our method OCT.

2.1 Overview of classification problem and decision trees

We are given the training data \(({\varvec{X}}, {\varvec{Y}})\), containing n observations \(({\varvec{x}}_i, y_i)\), \(i = 1, \ldots , n\), each with p features \({\varvec{x}}_i \in \mathbb {R}^p\) and a class label \(y_i \in \{1, \ldots , K\}\) indicating which of K possible labels is assigned to this point. We assume without loss of generality that the values for each dimension across the training data are normalized to the 0–1 interval, meaning each \({\varvec{x}}_i \in {[0, 1]}^p\).
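As a concrete illustration of this normalization (not part of the original formulation), each feature can simply be min–max scaled to the unit interval; the following minimal sketch uses our own function name and guards against constant features as an assumption.

```python
import numpy as np

def minmax_scale(X):
    """Scale each column of the n x p matrix X to the [0, 1] interval."""
    X = np.asarray(X, dtype=float)
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # avoid division by zero for constant features
    return (X - lo) / span
```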

Decision tree methods seek to recursively partition \({[0, 1]}^p\) to yield a number of hierarchical, disjoint regions that represent a classification tree. An example of a decision tree is shown in Fig. 1. The final tree is comprised of branch nodes and leaf nodes:

  • Branch nodes apply a split with parameters \({\varvec{a}}\) and b. For a given point i, if \({\varvec{a}}^T {\varvec{x}}_i < b\) the point will follow the left branch from the node, otherwise it takes the right branch. A subset of methods, including CART, produce univariate or axis-aligned decision trees which restrict the split to a single dimension, i.e., a single component of \({\varvec{a}}\) will be 1 and all others will be 0.

  • Leaf nodes are assigned a class that will determine the prediction for all data points that fall into the leaf node. The assigned class is almost always taken to be the class that occurs most often among points contained in the leaf node.

Fig. 1 An example of a decision tree with two partition nodes and three leaf nodes

Classical decision tree methods like CART, ID3, and C4.5 take a top-down approach to building the tree. At each step of the partitioning process, they seek a split that partitions the current region so as to optimize a so-called splitting criterion. This criterion is typically based on the label impurity of the data points contained in the resulting regions, rather than on the resulting misclassification error, which as discussed earlier is a byproduct of using a top-down induction process to grow the tree. The algorithm proceeds to recursively partition the two new regions that are created by the split. The partitioning terminates once any one of a number of stopping criteria is met. The criteria for CART are as follows:

  • It is not possible to create a split where each side of the partition contains at least a certain number of training points, \(N_{\min }\);

  • All points in the candidate node share the same class.

Once the splitting process is complete, a class label \(1, \ldots , K\) is assigned to each region. This class will be used to predict the class of any points contained inside the region. As mentioned earlier, this assigned class will typically be the most common class among the points in the region.

The final step in the process is pruning the tree in an attempt to avoid overfitting. The pruning process works upwards through the partition nodes from the bottom of the tree. The decision of whether to prune a node is controlled by the so-called complexity parameter, denoted by \(\alpha \), which balances the additional complexity of adding the split at the node against the increase in predictive accuracy that it offers. A higher complexity parameter leads to more and more nodes being pruned off, resulting in smaller trees.

As discussed previously, this two-stage growing and pruning procedure for creating the tree is required when using a top-down induction approach, as otherwise the penalties on tree complexity may prevent the method from selecting a weaker split that then allows a selection of a stronger split in the next stage. In this sense, the strong split is hidden behind the weak split, and this strong split may be passed over if the growing-then-pruning approach is not used.

Using the details of the CART procedure, we can state the problem that CART attempts to solve as a formal optimization problem. There are two parameters in this problem. The tradeoff between accuracy and complexity of the tree is controlled by the complexity parameter \(\alpha \), and the minimum number of points we require in any leaf node is given by \(N_{\min }\). Given these parameters and the training data \(({\varvec{x}}_i, y_i), i = 1, \ldots , n\), we seek a tree T that solves the problem:

$$\begin{aligned} \begin{aligned} \min&\quad R_{xy}(T) + \alpha |T|\\ \text {s.t.}&\quad N_x(l) \ge N_{\min }&\quad \forall l \in \text {leaves}(T) \end{aligned} \end{aligned}$$
(1)

where \(R_{xy}(T)\) is the misclassification error of the tree T on the training data, |T| is the number of branch nodes in tree T, and \(N_x(l)\) is the number of training points contained in leaf node l. We refer to this problem as the optimal tree problem.

Notice that if we can solve this problem in a single step we obviate the need to use an impurity measure when growing the tree, and also remove the need to prune the tree after creation, as we have already accounted for the complexity penalty while growing the tree.

We briefly note that our choice to use CART to define the optimal tree problem was arbitrary, and one could similarly define this problem based on another method like C4.5; we simply use this problem to demonstrate the advantages of taking a problem that is traditionally solved by a heuristic and instead solving it to optimality. Additionally, we note that the experiments of Murthy and Salzberg (1995b) found that CART and C4.5 did not differ significantly in any measure of tree quality, including out-of-sample accuracy, and so we do not believe our choice of CART over C4.5 to be an important one.

2.2 Formulating optimal tree creation as an MIO problem

As mentioned previously, the top-down, greedy nature of state-of-the-art decision tree creation algorithms can lead to solutions that are only locally optimal. In this section, we first argue that the natural way to pose the task of creating the globally optimal decision tree is as an MIO problem, and then proceed to develop such a formulation.

To see that the most natural representation for formulating the optimal decision tree problem (1) is using MIO, we note that at every step in tree creation, we are required to make a number of discrete decisions:

  • At every new node, we must choose to either branch or stop.

  • After choosing to stop branching at a node, we must choose a label to assign to this new leaf node.

  • After choosing to branch, we must choose which of the variables to branch on.

  • When classifying the training points according to the tree under construction, we must choose to which leaf node a point will be assigned such that the structure of the tree is respected.

Formulating this problem using MIO allows us to model all of these discrete decisions in a single problem, as opposed to recursive, top-down methods that must consider these decision events in isolation. Modeling the construction process in this way allows us to consider the full impact of the decisions being made at the top of the tree, rather than simply making a series of locally optimal decisions, also avoiding the need for pruning and impurity measures.

We will next formulate the optimal tree creation problem (1) as an MIO problem. Consider the problem of trying to construct an optimal decision tree with a maximum depth of D. Given this depth, we can construct the maximal tree of this depth, which has \(T = 2^{(D+1)} - 1\) nodes, which we index by \( t = 1, \ldots , T\). Figure 2 shows the maximal tree of depth 2.

Fig. 2 The maximal tree for depth \(D = 2\)

We use the notation p(t) to refer to the parent node of node t, and A(t) to denote the set of ancestors of node t. We also define \(A_L(t)\) as the set of ancestors of t whose left branch has been followed on the path from the root node to t, and similarly \(A_R(t)\) is the set of right-branch ancestors, such that \(A(t) = A_L(t) \cup A_R(t)\). For example, in the tree in Fig. 2, \(A_L(5) = \{1\}\), \(A_R(5) = \{2\}\), and \(A(5) = \{1,2\}\).
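With the nodes of the maximal tree numbered 1 to T in breadth-first order (the convention used in Fig. 2), these quantities have simple closed forms. The short sketch below, with our own function names, illustrates them; it is not part of the formulation itself.

```python
def parent(t):
    """Parent p(t) of node t when nodes are numbered 1..T breadth-first."""
    return t // 2

def ancestors(t):
    """Left-branch and right-branch ancestors A_L(t), A_R(t) of node t."""
    left, right = [], []
    while t > 1:
        if t % 2 == 0:          # t is the left child of its parent
            left.append(t // 2)
        else:                   # t is the right child of its parent
            right.append(t // 2)
        t //= 2
    return left, right

# Example from Fig. 2: leaf 5 of the depth-2 maximal tree
assert ancestors(5) == ([1], [2])   # A_L(5) = {1}, A_R(5) = {2}
```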

We divide the nodes in the tree into two sets:

  • Branch nodes: nodes \(t \in \mathcal T_B = \{1, \ldots , \lfloor T/2 \rfloor \}\) apply a split of the form \({\varvec{a}}^{\mathsf{T}}{\varvec{x}}< b\). Points that satisfy this split follow the left branch in the tree, and those that do not follow the right branch.

  • Leaf nodes: nodes \(t \in \mathcal T_L = \{\lfloor T/2 \rfloor + 1, \ldots , T \}\) make a class prediction for each point that falls into the leaf node.

We track the split applied at node \(t \in \mathcal T_B\) with variables \({\varvec{a}}_t\in \mathbb {R}^{p}\) and \(b_t \in \mathbb {R}^{}\). We will choose to restrict our model to univariate decision trees (like CART), and so the hyperplane split at each node should only involve a single variable. This is enforced by setting the elements of \({\varvec{a}}_{t}\) to be binary variables that sum to 1. We want to allow the option of not splitting at a branch node. We use the indicator variables \(d_t = \mathbbm {1}\{\text {node } t \text { applies a split}\}\) to track which branch nodes apply splits. If a branch node does not apply a split, then we model this by setting \({\varvec{a}}_t = {\varvec{0}}\) and \(b_t = 0\). This has the effect of forcing all points to follow the right split at this node, since the condition for the left split is \(0 < 0\) which is never satisfied. This allows us to stop growing the tree early without introducing new variables to account for points ending up at the branch node—instead we send them all the same direction down the tree to end up in the same leaf node. We enforce this with the following constraints:

$$\begin{aligned}&\sum _{j = 1}^{p}a_{jt} = d_t,\quad \forall t \in \mathcal T_B, \end{aligned}$$
(2)
$$\begin{aligned}&0 \le b_t \le d_t,\quad \forall t \in \mathcal T_B, \end{aligned}$$
(3)
$$\begin{aligned}&a_{jt} \in \{0, 1\},\quad j = 1, \ldots , p,\quad \forall t \in \mathcal T_B, \end{aligned}$$
(4)

where the second inequality is valid for \(b_t\) since we have assumed that each \({\varvec{x}}_i \in {[0,1]}^p\), and we know that \({\varvec{a}}_t\) has exactly one element equal to 1 if and only if \(d_t=1\), with the remainder being 0. Therefore it is always true that \(0 \le {\varvec{a}}_t ^{\mathsf{T}}{\varvec{x}}_i \le d_t\) for any i and t, and we need only consider values for \(b_t\) in this same range.

Next, we will enforce the hierarchical structure of the tree. We restrict a branch node from applying a split if its parent does not also apply a split.

$$\begin{aligned} d_t \le d_{p(t)},\quad \forall t \in \mathcal T_B{\setminus } \{ 1 \}, \end{aligned}$$
(5)

where no such constraint is required for the root node.

We have constructed the variables that allow us to model the tree structure using MIO; we now need to track the allocation of points to leaves and the associated errors that are induced by this structure.

We introduce the indicator variables \(z_{it} = \mathbbm {1}\{{\varvec{x}}_i \text { is in node } t\}\) to track the points assigned to each leaf node, and we will then use the indicator variables \(l_t = \mathbbm {1}\{\text {leaf } t \text { contains any points}\}\) to enforce a minimum number of points at each leaf, given by \(N_{\min }\):

$$\begin{aligned} z_{it}&\le l_t,\quad \forall t \in \mathcal T_L, \end{aligned}$$
(6)
$$\begin{aligned} \sum _{i = 1}^{n}z_{it}&\ge N_{\min } l_t, \quad \forall t \in \mathcal T_L. \end{aligned}$$
(7)

We also force each point to be assigned to exactly one leaf:

$$\begin{aligned} \sum _{t \in \mathcal T_L}z_{it} = 1,\quad i = 1, \ldots , n. \end{aligned}$$
(8)

Finally, we apply constraints enforcing the splits that are required by the structure of the tree when assigning points to leaves:

$$\begin{aligned}&{\varvec{a}}_m ^{\mathsf{T}}{\varvec{x}}_i < b_m + M_1 (1 - z_{it}),\quad i = 1, \ldots , n, \quad \forall t \in \mathcal T_L, \quad \forall m \in A_L(t), \end{aligned}$$
(9)
$$\begin{aligned}&{\varvec{a}}_m ^{\mathsf{T}}{\varvec{x}}_i \ge b_m - M_2 (1 - z_{it}),\quad i = 1, \ldots , n, \quad \forall t \in \mathcal T_L,\quad \forall m \in A_R(t). \end{aligned}$$
(10)

Note that the constraints (9) use a strict inequality which is not supported by MIO solvers, so this must be converted into a form that does not use a strict inequality. To do this we can add a small constant \(\epsilon \) to the left-hand-side of (9) and change the inequality to be non-strict:

$$\begin{aligned} {\varvec{a}}_m^T {\varvec{x}}_{i} + \epsilon \le b_{m} + M_1 \left( 1 - z_{it}\right) , \quad i = 1, \ldots , n, \quad \forall t \in \mathcal T_L, \quad \forall m \in A_L(t) \end{aligned}$$
(11)

However, if \(\epsilon \) is too small, this could cause numerical instabilities in the MIO solver, so we seek to make \(\epsilon \) as big as possible without affecting the feasibility of any valid solution to the problem. We can achieve this by specifying a different \(\epsilon _j\) for each feature j. The largest valid value is the smallest non-zero distance between adjacent values of this feature. To find this, we sort the values of the jth feature and take

$$\begin{aligned} \epsilon _j = \min \left\{ x_j^{(i + 1)} - x_j^{(i)} \biggm | x_j^{(i + 1)} \ne x_j^{(i)},\quad i = 1, \ldots , n - 1 \right\} , \end{aligned}$$

where \(x_j^{(i)}\) is the ith largest value in the jth feature. We can then use these values for \(\epsilon \) in the constraint, where the value of \(\epsilon _j\) that is used is selected according to the feature we are using for this split:

$$\begin{aligned} {\varvec{a}}_m^T ({\varvec{x}}_{i} + {\varvec{\epsilon }}) \le b_{m} + M_1 \left( 1 - z_{it}\right) ,\quad i = 1, \ldots , n, \quad \forall t \in \mathcal T_L, \quad \forall m \in A_L(t) \end{aligned}$$
(12)
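As an illustration, the per-feature values \(\epsilon _j\) can be computed directly from the sorted training data. The sketch below is our own code; the fallback value of 1 for constant features is an assumption, not part of the formulation.

```python
import numpy as np

def feature_epsilons(X):
    """epsilon_j = smallest non-zero gap between adjacent sorted values of feature j."""
    X = np.asarray(X, dtype=float)
    eps = np.ones(X.shape[1])            # fallback for constant features (our choice)
    for j in range(X.shape[1]):
        gaps = np.diff(np.sort(X[:, j]))
        gaps = gaps[gaps > 0]
        if gaps.size > 0:
            eps[j] = gaps.min()
    return eps
```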

We must also specify values for the big-M constants \(M_1\) and \(M_2\). As mentioned previously, we know that both \({\varvec{a}}_t ^T {\varvec{x}}_i \in [0, 1]\) and \(b_t \in [0,1]\), and so the largest possible value of \({\varvec{a}}_t ^T ({\varvec{x}}_i + {\varvec{\epsilon }}) - b_t\) is \(1 + \epsilon _{\max }\), where \(\epsilon _{\max } = \max _j \{\epsilon _j\}\). We can therefore set \(M_1 = 1 + \epsilon _{\max }\). Similarly, the largest possible value of \(b_t - {\varvec{a}}_t ^T {\varvec{x}}_i\) is 1, and so we can set \(M_2 = 1\). This gives the following final constraints that will enforce the splits in the tree:

$$\begin{aligned}&{\varvec{a}}_m^T ({\varvec{x}}_{i} + {\varvec{\epsilon }}) \le b_{m} + (1 + \epsilon _{\max }) \left( 1 - z_{it}\right) , \quad i = 1, \ldots , n, \quad \forall t \in \mathcal T_L, \quad \forall m \in A_L(t), \end{aligned}$$
(13)
$$\begin{aligned}&{\varvec{a}}_m ^{\mathsf{T}}{\varvec{x}}_i \ge b_m - (1 - z_{it}), \quad i = 1, \ldots , n, \quad \forall t \in \mathcal T_L, \quad \forall m \in A_R(t). \end{aligned}$$
(14)

The objective is to minimize the misclassification error, so an incorrect label prediction has cost 1, and a correct label prediction has cost 0. Let us define the matrix \({\varvec{Y}}\) using the data, where

$$\begin{aligned} Y_{ik} = {\left\{ \begin{array}{ll} +1, &{}\quad \text {if } y_i = k \\ -1, &{}\quad \text {otherwise} \end{array}\right. } ,\quad k = 1, \ldots , K, \quad i = 1, \ldots , n. \end{aligned}$$

We set \(N_{kt}\) to be the number of points of label k in node t, and \(N_t\) to be the total number of points in node t:

$$\begin{aligned} N_{kt}&= \frac{1}{2} \sum _{i = 1}^{n}(1 + Y_{ik}) z_{it}, \quad k = 1, \ldots , K, \quad t \in \mathcal T_L, \end{aligned}$$
(15)
$$\begin{aligned} N_t&= \sum _{i = 1}^{n}z_{it}, \quad \forall t \in \mathcal T_L. \end{aligned}$$
(16)

We need to assign a label to each leaf node t in the tree, which we denote with \(c_t \in \{1, \ldots , K\}\). It is clear that the optimal label to predict is the most common of the labels among all points assigned to the node:

$$\begin{aligned} c_t = \mathop {{{\mathrm{arg\,max}}}}\limits _{k = 1, \ldots , K} \{ N_{kt} \} \end{aligned}$$
(17)

We will use binary variables \(c_{kt}\) to track the prediction of each node, where \(c_{kt} = \mathbbm {1}\{c_t = k\}\). We must make a single class prediction at each leaf node that contains points:

$$\begin{aligned} \sum _{k = 1}^{K}c_{kt} = l_t, \quad \forall t \in \mathcal T_L. \end{aligned}$$
(18)

Since we know how to make the optimal prediction at each leaf t using (17), the optimal misclassification loss in each node, denoted \(L_t\), is equal to the number of points in the node less the number of points of the most common label:

$$\begin{aligned} L_t = N_t - \max _{k = 1, \ldots , K} \{N_{kt} \} = \min _{k = 1, \ldots , K} \{ N_t - N_{kt} \}, \end{aligned}$$
(19)

which can be linearized to give

$$\begin{aligned}&L_t \ge N_t - N_{kt} - M(1 - c_{kt}), \quad k = 1, \ldots , K, \quad \forall t \in \mathcal T_L, \end{aligned}$$
(20)
$$\begin{aligned}&L_t \le N_t - N_{kt} + M c_{kt}, \quad k = 1, \ldots , K, \quad \forall t \in \mathcal T_L, \end{aligned}$$
(21)
$$\begin{aligned}&L_t \ge 0, \quad \forall t \in \mathcal T_L, \end{aligned}$$
(22)

where again M is a sufficiently large constant that makes the constraint inactive depending on the value of \(c_{kt}\). Here, we can take \(M = n\) as a valid value.

The total misclassification cost is therefore \(\sum _{t \in \mathcal T_L}L_t\), and the complexity of the tree is the number of splits included in the tree, given by \(\sum _{t \in \mathcal T_B}d_t\). Following CART, we normalize the misclassification error against the baseline loss \(\hat{L}\), obtained by simply predicting the most popular class for the entire dataset. This makes the effect of \(\alpha \) independent of the dataset size. This means the objective from problem (1) can be written:

$$\begin{aligned} \min \quad \frac{1}{\hat{L}} \sum _{t \in \mathcal T_L}L_t + \alpha \sum _{t \in \mathcal T_B}d_t. \end{aligned}$$
(23)

Putting all of this together gives the following MIO formulation for problem (1), which we call the OCT model:

$$\begin{aligned} \min&\quad \frac{1}{\hat{L}} \sum _{t \in \mathcal T_L}L_t + \alpha \sum _{t \in \mathcal T_B}d_t \\ \text {s.t.}&\quad L_t \ge N_t - N_{kt} - n(1 - c_{kt}), \quad k = 1, \ldots , K,\quad \forall t \in \mathcal T_L, \nonumber \\&\quad L_t \le N_t - N_{kt} + n c_{kt}, \quad k = 1, \ldots , K, \quad \forall t \in \mathcal T_L, \nonumber \\&\quad L_t \ge 0, \quad \forall t \in \mathcal T_L, \nonumber \\&\quad N_{kt} = \frac{1}{2} \sum _{i = 1}^{n}(1 + Y_{ik}) z_{it},\quad k = 1, \ldots , K, \quad \forall t \in \mathcal T_L,\nonumber \\&\quad N_t = \sum _{i = 1}^{n}z_{it},\quad \forall t \in \mathcal T_L, \nonumber \\&\quad \sum _{k = 1}^{K}c_{kt} = l_t,\quad \forall t \in \mathcal T_L, \nonumber \\&\quad {\varvec{a}}_m ^{\mathsf{T}}{\varvec{x}}_i \ge b_m - (1 - z_{it}),\quad i = 1, \ldots , n,\quad \forall t \in \mathcal T_L,\quad \forall m \in A_R(t),\nonumber \\&\quad {\varvec{a}}_m ^{\mathsf{T}}({\varvec{x}}_i + {\varvec{\epsilon }}) \le b_m + (1 + \epsilon _{\max })(1 - z_{it}),\quad i = 1, \ldots , n,~ \forall t \in \mathcal T_L, \quad \forall m \in A_L(t),\nonumber \\&\quad \sum _{t \in \mathcal T_L} z_{it} = 1, \quad i = 1, \ldots , n,\nonumber \\&\quad z_{it} \le l_t,\quad \forall t \in \mathcal T_L,\nonumber \\&\quad \sum _{i = 1}^{n}z_{it} \ge N_{\min } l_t,\quad \forall t \in \mathcal T_L,\nonumber \\&\quad \sum _{j = 1}^{p}a_{jt} = d_t,\quad \forall t \in \mathcal T_B,\nonumber \\&\quad 0 \le b_t \le d_t, \quad \forall t \in \mathcal T_B, \nonumber \\&\quad d_t \le d_{p(t)},\quad \forall t \in \mathcal T_B {\setminus } \{1\}, \nonumber \\&\quad z_{it}, l_t \in \{0,1\}, \quad i = 1, \ldots , n, \quad \forall t \in \mathcal T_L,\nonumber \\&\quad a_{jt}, d_t \in \{0,1\}, \quad j = 1, \ldots , p, \quad \forall t \in \mathcal T_B.\nonumber \end{aligned}$$
(24)

This model as presented is in a form that can be directly solved by any MIO solver. The difficulty of the model is primarily determined by the number of binary variables \(z_{it}\), which is \(n \cdot 2^D\). Empirically we observe that we can find high-quality solutions in minutes for depths up to 4 for datasets with thousands of points. Beyond this depth or dataset size, the rate of finding solutions is slower, and more time is required.
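To make the formulation concrete, the sketch below assembles model (24) with the gurobipy API (Gurobi's Python interface). It is a minimal illustration under our own naming and indexing conventions (nodes numbered 1 to T breadth-first, features and points 0-indexed, labels 1 to K), not the authors' implementation, and it omits the warm starts and tuning discussed later.

```python
import numpy as np
import gurobipy as gp
from gurobipy import GRB

def build_oct(X, y, D=2, alpha=0.1, N_min=1):
    """Sketch of MIO model (24) for an optimal classification tree of maximum depth D.
    X: n x p array with entries in [0, 1]; y: length-n array of labels 1..K."""
    n, p = X.shape
    K = int(max(y))
    Y = np.array([[1 if y[i] == k + 1 else -1 for k in range(K)] for i in range(n)])
    T = 2 ** (D + 1) - 1
    TB, TL = range(1, T // 2 + 1), range(T // 2 + 1, T + 1)

    def path(t):                                  # (A_L(t), A_R(t)) for node t
        left, right = [], []
        while t > 1:
            (left if t % 2 == 0 else right).append(t // 2)
            t //= 2
        return left, right

    eps = np.ones(p)                              # per-feature epsilon_j as in the text
    for j in range(p):
        gaps = np.diff(np.sort(X[:, j]))
        gaps = gaps[gaps > 0]
        if gaps.size:
            eps[j] = gaps.min()
    eps_max = eps.max()
    L_hat = n - ((1 + Y).sum(axis=0) / 2).max()   # baseline loss of predicting the most popular class

    m = gp.Model("OCT")
    a = m.addVars(p, TB, vtype=GRB.BINARY, name="a")
    d = m.addVars(TB, vtype=GRB.BINARY, name="d")
    b = m.addVars(TB, lb=0.0, ub=1.0, name="b")
    z = m.addVars(n, TL, vtype=GRB.BINARY, name="z")
    l = m.addVars(TL, vtype=GRB.BINARY, name="l")
    c = m.addVars(K, TL, vtype=GRB.BINARY, name="c")
    Nt = m.addVars(TL, name="Nt")
    Nkt = m.addVars(K, TL, name="Nkt")
    L = m.addVars(TL, lb=0.0, name="L")

    # Split structure: (2), (3), (5)
    m.addConstrs(gp.quicksum(a[j, t] for j in range(p)) == d[t] for t in TB)
    m.addConstrs(b[t] <= d[t] for t in TB)
    m.addConstrs(d[t] <= d[t // 2] for t in TB if t != 1)
    # Point assignment: (6), (7), (8)
    m.addConstrs(z[i, t] <= l[t] for i in range(n) for t in TL)
    m.addConstrs(gp.quicksum(z[i, t] for i in range(n)) >= N_min * l[t] for t in TL)
    m.addConstrs(gp.quicksum(z[i, t] for t in TL) == 1 for i in range(n))
    # Enforce the splits along the path to each leaf: (13), (14)
    for t in TL:
        left, right = path(t)
        for i in range(n):
            for mm in left:
                m.addConstr(gp.quicksum(a[j, mm] * (X[i, j] + eps[j]) for j in range(p))
                            <= b[mm] + (1 + eps_max) * (1 - z[i, t]))
            for mm in right:
                m.addConstr(gp.quicksum(a[j, mm] * X[i, j] for j in range(p))
                            >= b[mm] - (1 - z[i, t]))
    # Leaf counts, class assignment and misclassification loss: (15)-(22)
    m.addConstrs(Nt[t] == gp.quicksum(z[i, t] for i in range(n)) for t in TL)
    m.addConstrs(Nkt[k, t] == 0.5 * gp.quicksum((1 + Y[i, k]) * z[i, t] for i in range(n))
                 for k in range(K) for t in TL)
    m.addConstrs(gp.quicksum(c[k, t] for k in range(K)) == l[t] for t in TL)
    m.addConstrs(L[t] >= Nt[t] - Nkt[k, t] - n * (1 - c[k, t]) for k in range(K) for t in TL)
    m.addConstrs(L[t] <= Nt[t] - Nkt[k, t] + n * c[k, t] for k in range(K) for t in TL)
    # Objective (23)
    m.setObjective((1.0 / L_hat) * gp.quicksum(L[t] for t in TL)
                   + alpha * gp.quicksum(d[t] for t in TB), GRB.MINIMIZE)
    return m
```

The returned model can then be solved with `m.optimize()`, and the tree is read off from the values of the \(a_{jt}\), \(b_t\) and \(c_{kt}\) variables.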

There are three hyper-parameters that need to be specified: the maximum depth D, the minimum leaf size \(N_{\min }\), and the complexity parameter \(\alpha \). We will present in Sect. 3 effective methods for tuning these parameters in validation.

2.3 Improving MIO performance using warm starts

MIO solvers benefit greatly when supplied an integer-feasible solution as a warm start for the solution process. Injecting a strong warm start solution before starting the solver greatly increases the speed with which the solver is able to generate strong feasible solutions (Bertsimas and Weismantel 2005). The warm start provides a strong initial upper bound on the optimal solution that allows more of the search tree to be pruned, and it also provides a starting point for local search heuristics. The benefit realized increases with the quality of the warm start, so it is desirable to be able to quickly and heuristically find a strong integer-feasible solution before solving.

To demonstrate the effectiveness of warm starts, Fig. 3 shows an example of the typical evolution of upper and lower bounds as we solve the MIO problem (24) for the optimal tree of depth 2 on the “Wine” dataset, which has \(n=178\) and \(p=13\). We see that when no warm start is supplied, the upper bound decreases gradually until the eventual optimal solution is found after 270 s. An additional 1360 s are required to prove the optimality of this solution. When we inject a heuristic solution (in this case, the CART solution) as a warm start, the same optimal solution is found after just 105 s, and it takes an additional 230 s to prove optimality. The total time required to find and prove optimality in this example decreases by a factor of 5 when the warm start is added, and the time taken to find the optimal solution decreases by a factor of around 2.5, showing that using high-quality warm start solutions can have a significant effect on the MIO solution times. Additionally, we see that the optimal tree solution has an objective with roughly half the error of the CART warm start, showing that the CART solutions can indeed be far from optimality in-sample. Finally, we observe that the majority of the time is spent in proving that the solution is optimal. A proof of optimality is good to have for comparison to other methods, but is not necessarily required in practice when evaluating a classifier (note that other methods make no claims as to their optimality). It is therefore a reasonable option to terminate the solve early once the best solution has remained unchanged for some time, because this typically indicates the solution is indeed optimal or close to it, and proving optimality may take far more time.

Fig. 3 Comparison of upper and lower bound evolution while solving MIO problem (24) with and without warm starts for a tree of depth \(D=2\) for the Wine dataset with \(n=178\) and \(p=13\). a Without warm start, b with warm start
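Such early termination can be implemented directly in most solvers. For example, with Gurobi one could stop the solve once the incumbent has not improved for a fixed amount of time using a callback, as in the hedged sketch below; the function name and the patience value are illustrative choices of ours.

```python
from gurobipy import GRB

def make_stagnation_callback(patience=300.0):
    """Stop the solve if the best objective has not improved for `patience` seconds."""
    state = {"best": float("inf"), "last_improvement": 0.0}

    def callback(model, where):
        if where == GRB.Callback.MIP:
            runtime = model.cbGet(GRB.Callback.RUNTIME)
            objbst = model.cbGet(GRB.Callback.MIP_OBJBST)
            if objbst < state["best"] - 1e-6:      # new incumbent found
                state["best"] = objbst
                state["last_improvement"] = runtime
            elif runtime - state["last_improvement"] > patience:
                model.terminate()

    return callback

# usage: m.optimize(make_stagnation_callback(patience=300.0))
```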

We already have access to good heuristic methods for the MIO problem (24); we can use CART to generate these warm start solutions. Given a solution from CART, it is simple to construct a corresponding feasible solution to the MIO problem (24) using the splits from CART to infer the values of the remaining variables in the MIO problem.

By design, the parameters \(N_{\min }\) and \(\alpha \) have the same meaning in CART and in problem (24), so we can run CART with these parameters to generate a good solution. We must then prune the splits on the CART solution until it is below the maximum depth D for the MIO problem to ensure it is feasible.
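In gurobipy terms, injecting such a warm start amounts to setting the `Start` attribute of the variables before calling `optimize()`. The sketch below assumes the CART tree has already been translated into dictionaries keyed by node, using our own representation and the variable names from the earlier model sketch; Gurobi completes any variables left unset.

```python
def inject_warm_start(m, a_start, b_start, d_start, z_start):
    """Set Start values on an OCT model `m` built by the earlier sketch.
    a_start[(j, t)], b_start[t], d_start[t], z_start[(i, t)] describe a feasible tree."""
    m.update()  # ensure variables are registered before lookup by name
    for (j, t), val in a_start.items():
        m.getVarByName(f"a[{j},{t}]").Start = val
    for t, val in b_start.items():
        m.getVarByName(f"b[{t}]").Start = val
    for t, val in d_start.items():
        m.getVarByName(f"d[{t}]").Start = val
    for (i, t), val in z_start.items():
        m.getVarByName(f"z[{i},{t}]").Start = val
```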

We can also use MIO solutions that we have previously generated as warm starts. In particular, if we have a solution generated for a depth D, this solution is a valid warm start for depth \(D+1\). This is important because the problem difficulty increases with the depth, so for larger depths it may be advantageous to run the MIO problem with a smaller depth to generate a strong warm start. This technique is used heavily in the validation procedure described in Sect. 3.

3 Using optimal classification trees for classification

In this section, we present the implementation details of a full algorithm for training a decision tree classifier using the OCT MIO problem from Sect. 2. In particular, we devise a procedure for effectively tuning the values of the hyperparameters of this model.

The most important parameter to choose is the maximum depth of the tree D. As discussed in Sect. 2, the depth has a very large effect on the difficulty of the resulting problem, and it is advantageous to reuse solutions from lower depths as warm starts for the higher-depth problems. To take advantage of this, we will specify a maximum depth \(D_{\max }\), and build trees of increasing depth up to this maximum, maintaining a pool of possible warm start solutions.

We also have to tune the value of the complexity parameter \(\alpha \). Recall that the objective is given by

$$\begin{aligned} \min \quad \frac{1}{\hat{L}} \sum _{t \in \mathcal T_L}L_t + \alpha \sum _{t \in \mathcal T_B}d_t. \end{aligned}$$
(25)

It is hard to tune the value of the continuous-valued parameter \(\alpha \). A typical approach would be to discretize the search space and test each value. In our case, this would involve solving the MIO problem for each value, which is expensive. Additionally, multiple values of \(\alpha \) might give the same solution which would mean that solving these problems was unnecessary. Ideally we could simply search over the “critical” values of \(\alpha \) that would lead to different solutions, minimizing the number of MIO problems that need to be solved.

Observe that the penalty term can be rewritten into constraint form as follows:

$$\begin{aligned} \min&\quad \frac{1}{\hat{L}} \sum _{t \in \mathcal T_L}L_t\\ \text {s.t.}&\quad \sum _{t \in \mathcal T_B}d_t \le C, \end{aligned}$$

where C is some constant corresponding to the value of \(\alpha \).

Note that the term in this constraint is the number of splits in the model and is integer-valued. This means we need only test values of C that are integral. We can therefore search over \(C = 1, \ldots , C_{\max }\) to generate the entire range of possible solutions, where \(C_{\max } = 2^D-1\) is the maximum number of possible splits in the solution. Additionally, a solution to the problem for C is a feasible warm start solution for the problem at \(C+1\), so the solutions generated during this search can be stored and reused during the remaining search.

It can happen that the solution for a particular value of C is not actually an optimal solution to the original problem (24) for any value of \(\alpha \). This means it is dominated by solutions for other values of C. Such solutions can be removed in a post-processing step to ensure that only solutions optimal for the original problem for any value of \(\alpha \) remain.

Given this set of optimal solutions for each value of C, we evaluate each on a validation set and identify the solution that performs best. We can then calculate the interval for \(\alpha \) in which this solution is the minimizer for problem (24). We use the midpoint of this interval as the final tuned value for \(\alpha \). In case of a tie, we take the union of the intervals for all such solutions and take the midpoint of this union interval.
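To illustrate this post-processing and the recovery of the \(\alpha \) interval, note that the solution with C splits and normalized training loss \(R_C\) is the minimizer for exactly those \(\alpha \) satisfying \(R_C + \alpha C \le R_{C'} + \alpha C'\) for all other \(C'\); solutions whose interval is empty are the dominated ones. The sketch below is our own code, not part of the paper, and takes a dictionary mapping each C to its normalized training loss.

```python
def alpha_intervals(losses):
    """For each complexity C (number of splits), compute the interval of alpha for which
    the solution with C splits minimizes loss + alpha * C. `losses[C]` is the normalized
    training loss of the best tree with at most C splits; dominated C's map to None."""
    intervals = {}
    for C, R in losses.items():
        lo, hi = 0.0, float("inf")
        for C2, R2 in losses.items():
            if C2 > C:
                lo = max(lo, (R - R2) / (C2 - C))   # lower bound from more complex trees
            elif C2 < C:
                hi = min(hi, (R2 - R) / (C - C2))   # upper bound from simpler trees
        intervals[C] = (lo, hi) if lo <= hi else None
    return intervals

# The midpoint of the interval of the best validation solution gives the tuned alpha.
```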

An overview of the complete process for tuning the parameters using a validation set follows:

  1. Set the maximal depth for the MIO problem, \(D_{\max }\), and the minimum leaf size \(N_{\min }\).

  2. For \(D = 1, \ldots , D_{\max }\):

     (a) For \(C = 1, \ldots , 2^D-1\):

        i. Run CART using \(N_{\min }\) with \(\alpha =0\). Trim the solution to depth D and to a maximum of C splits.

        ii. Search through the pool of candidate warm starts (including CART) and choose the one with the lowest error.

        iii. Solve the MIO problem for depth D and C splits using the selected warm start.

        iv. Add the MIO solution to the warm start pool.

  3. Post-process the solution pool to remove all solutions that are not optimal to (24) for any value of \(\alpha \).

  4. Identify the best performing solution on a validation set, and the range of \(\alpha \) for which this solution is optimal. Use the midpoint of this interval as the tuned value of \(\alpha \).

We have not addressed how to tune the value of \(N_{\min }\) as this was not required for the computational experiments in Sect. 6. In theory it would be simple to use a similar approach for reusing solutions as warm starts by noting that a solution for a value of \(N_{\min }\) is also valid for any smaller value of this parameter. In this way, it would be best to search over the \(N_{\min }\) values in reverse order, and this could be carried out as the outermost loop in the search procedure.

Having tuned the parameter values using the training and validation sets, we proceed to use these parameters to train a final tree on the combined training and validation sets. Following a similar procedure as in the validation, we use solutions from lower depths to warm start higher depths. For each depth \(D = 1, \ldots , D_{\max }\), we run the MIO problem, with the warm start being either CART trimmed to this depth, or the best MIO solution from a lower depth. Finally, we take the final solution and evaluate this on the test set to get the out-of-sample error.

4 Optimal multivariate decision trees using MIO

So far, we have only considered decision trees that use a single variable in their splits at each node, known as univariate decision trees. In this section, we show that it is simple to extend our MIO formulation for univariate trees to yield a problem for determining the optimal multivariate decision tree.

4.1 Formulating the multivariate optimal tree problem

In a multivariate decision tree, we are no longer restricted to choosing a single variable upon which to split, and instead can choose a general hyperplane split at each node. The variables \({\varvec{a}}_t\) will be used to model the split at each node as before, except we relax (4) and instead choose \({\varvec{a}}_t \in {[-1, 1]}^p\) at each branch node t. We must modify (2) to account for the possibility that these elements are negative by dealing with the absolute values instead:

$$\begin{aligned} \sum _{j = 1}^{p}|a_{jt}| \le d_t,\quad \forall t \in \mathcal T_B,\\ \end{aligned}$$

which can be linearized using auxiliary variables to track the value of \(|a_{jt}|\):

$$\begin{aligned}&\sum _{j = 1}^{p}\hat{a}_{jt} \le d_t,\quad \forall t \in \mathcal T_B,\\&\hat{a}_{jt} \ge a_{jt}, \quad j = 1, \ldots , p, \quad \forall t \in \mathcal T_B,\\&\hat{a}_{jt} \ge -a_{jt},\quad j = 1, \ldots , p, \quad \forall t \in \mathcal T_B.\\ \end{aligned}$$

As before, these constraints force the split to be all zeros if \(d_t = 0\) and no split is applied.

We now have that \({\varvec{a}}_t^T {\varvec{x}}_i \in {[-1, 1]}\), so we replace (3) with:

$$\begin{aligned} -d_t \le b_t \le d_t, \quad \forall t \in \mathcal T_B. \end{aligned}$$

Now we consider the split constraints (9) and (14). Previously we had that the range of \(({\varvec{a}}_t^T {\varvec{x}}_i - b_t)\) was \({[-1, 1]}\), whereas it is now \({[-2, 2]}\). This means we need \(M=2\) to ensure that the constraint is trivially satisfied when \(z_{it} = 0\). The constraints therefore become:

$$\begin{aligned}&{\varvec{a}}_m^T {\varvec{x}}_{i} < b_{m} + 2 \left( 1 - z_{it}\right) , \quad i = 1, \ldots , n, \quad \forall t \in \mathcal T_L, \quad \forall m \in A_L(t), \end{aligned}$$
(26)
$$\begin{aligned}&{\varvec{a}}_m^T {\varvec{x}}_{i} \ge b_{m} - 2 \left( 1 - z_{it}\right) , \quad i = 1, \ldots , n, \quad \forall t \in \mathcal T_L, \quad \forall m \in A_R(t), \end{aligned}$$
(27)

Finally, as before we need to convert the strict inequality in (26) to a non-strict version. We do this by introducing a sufficiently small constant \(\mu \):

$$\begin{aligned} {\varvec{a}}_m^T {\varvec{x}}_{i} + \mu \le b_{m} + (2 + \mu ) \left( 1 - z_{it}\right) ,\quad i = 1, \ldots , n, \quad \forall t \in \mathcal T_L, \quad \forall m \in A_L(t), \end{aligned}$$

Note that we need to include \(\mu \) in the rightmost term to ensure the constraint is always satisfied when \(z_{it} = 0\). Additionally, unlike in the univariate case, we cannot choose a value for \(\mu \) in an intelligent manner, and instead need to choose a small constant. Choosing \(\mu \) too small can lead to numerical issues in the MIO solver, while choosing it too large reduces the size of the feasible region, potentially reducing the quality of the optimal solution. We take \(\mu = 0.005\) as a compromise between these extremes.

Previously, we penalized the number of splits in the tree. In the multivariate regime where a single split may use multiple variables, it seems sensible to generalize this to instead penalize the total number of variables used in the splits. To achieve this, we introduce binary variables \(s_{jt}\) to track if the jth feature is used in the tth split:

$$\begin{aligned} -s_{jt} \le a_{jt} \le s_{jt}, \quad j = 1, \ldots , p, \quad \forall t \in \mathcal T_B\end{aligned}$$

We must also make sure that the values of \(s_{jt}\) and \(d_t\) are compatible. The following constraints ensure that \(d_t = 1\) if and only if any variable is used in the split:

$$\begin{aligned}&s_{jt} \le d_t, \quad j = 1, \ldots , p, \quad \forall t \in \mathcal T_B,\\&\sum _{j = 1}^{p}s_{jt} \ge d_t,\quad \forall t \in \mathcal T_B. \end{aligned}$$

Finally, we modify the objective function to penalize the number of variables used across the splits in the tree:

$$\begin{aligned} \min ~ \frac{1}{\hat{L}} \sum _{t \in \mathcal T_L}L_t + \alpha \cdot \sum _{t \in \mathcal T_B}\sum _{j = 1}^{p}s_{jt} \end{aligned}$$

Combining all of these changes yields the complete OCT-H model:

$$\begin{aligned} \min&\quad \phantom {} \frac{1}{\hat{L}} \sum _{t \in \mathcal T_L}L_t + \alpha \cdot \sum _{t \in \mathcal T_B}\sum _{j = 1}^{p}s_{jt} \\ \text {s.t.}&\quad L_t \ge N_t - N_{kt} - n(1 - c_{kt}), \quad k = 1, \ldots , K, \quad \forall t \in \mathcal T_L, \nonumber \\&\quad L_t \le N_t - N_{kt} + n c_{kt}, \quad k = 1, \ldots , K, \quad \forall t \in \mathcal T_L, \nonumber \\&\quad L_t \ge 0, \quad \forall t \in \mathcal T_L, \nonumber \\&\quad N_{kt} = \frac{1}{2} \sum _{i = 1}^{n}(1 + Y_{ik}) z_{it}, \quad k = 1, \ldots , K, \quad \forall t \in \mathcal T_L,\nonumber \\&\quad N_t = \sum _{i = 1}^{n}z_{it}, \quad \forall t \in \mathcal T_L, \nonumber \\&\quad \sum _{k = 1}^{K}c_{kt} = l_t, \quad \forall t \in \mathcal T_L, \nonumber \\&\quad {\varvec{a}}_m^T {\varvec{x}}_{i} + \mu \le b_{m} + (2 + \mu ) \left( 1 - z_{it}\right) ,\quad i = 1, \ldots , n, \quad \forall t \in \mathcal T_L, \quad \forall m \in A_L(t),\nonumber \\&\quad {\varvec{a}}_m^T {\varvec{x}}_{i} \ge b_{m} - 2 \left( 1 - z_{it}\right) , \quad i = 1, \ldots , n, \quad \forall t \in \mathcal T_L, \quad \forall m \in A_R(t),\nonumber \\&\quad \sum _{t \in \mathcal T_L} z_{it} = 1, \quad i = 1, \ldots , n,\nonumber \\&\quad z_{it} \le l_t, \quad \forall t \in \mathcal T_L,\nonumber \\&\quad \sum _{i = 1}^{n}z_{it} \ge N_{\min } l_t, \quad \forall t \in \mathcal T_L,\nonumber \\&\quad \sum _{j = 1}^{p}\hat{a}_{jt} \le d_t, \quad \forall t \in \mathcal T_B,\nonumber \\&\quad \hat{a}_{jt} \ge a_{jt}, \quad j = 1, \ldots , p, \quad \forall t \in \mathcal T_B,\nonumber \\&\quad \hat{a}_{jt} \ge -a_{jt}, \quad j = 1, \ldots , p, \quad \forall t \in \mathcal T_B,\nonumber \\&\quad -s_{jt} \le a_{jt} \le s_{jt}, \quad j = 1, \ldots , p, \quad \forall t \in \mathcal T_B,\nonumber \\&\quad s_{jt} \le d_t, \quad j = 1, \ldots , p, \quad \forall t \in \mathcal T_B,\nonumber \\&\quad \sum _{j = 1}^{p}s_{jt} \ge d_t, \quad \forall t \in \mathcal T_B,\nonumber \\&\quad -d_t \le b_t \le d_t, \quad \forall t \in \mathcal T_B, \nonumber \\&\quad d_t \le d_{p(t)}, \quad \forall t \in \mathcal T_B {\setminus } \{1\}, \nonumber \\&\quad z_{it}, l_t \in \{0,1\}, \quad i = 1, \ldots , n, \quad \forall t \in \mathcal T_L,\nonumber \\&\quad s_{jt}, d_t \in \{0,1\}, \quad j = 1, \ldots , p, \quad \forall t \in \mathcal T_B.\nonumber \end{aligned}$$
(28)

Note that the OCT-H formulation contains the same number of binary variables as the OCT problem. This means the OCT-H problem tends to be as easy to solve as the OCT problem, which is not usually the case for hyperplane tree methods.

Additionally, from this formulation we can easily obtain the OCT problem as a special case by restoring the integrality constraints on \({\varvec{a}}_t\). The close relationship between the univariate and multivariate problems reinforces the notion that the MIO formulation is the natural way to view the decision tree problem.
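Relative to the gurobipy sketch given after formulation (24), only a handful of changes are needed to obtain OCT-H. The snippet below, again our own code and naming, shows the modified split variables, split-enforcement constraints and objective; it assumes the rest of the model (the z, l, c, N and L variables and constraints) is built as in the earlier sketch, with the univariate constraints (2), (3), (13) and (14) omitted.

```python
import gurobipy as gp
from gurobipy import GRB

def add_octh_splits(m, n, p, TB, TL, X, path, z, d, L, L_hat, alpha, mu=0.005):
    """Add the multivariate split variables/constraints of model (28) to model `m`;
    `path(t)` returns (A_L(t), A_R(t)) as in the earlier OCT sketch."""
    a = m.addVars(p, TB, lb=-1.0, ub=1.0, name="a")       # continuous split weights
    a_abs = m.addVars(p, TB, lb=0.0, name="a_abs")        # auxiliary |a_jt|
    s = m.addVars(p, TB, vtype=GRB.BINARY, name="s")      # feature j used at node t
    b = m.addVars(TB, lb=-1.0, ub=1.0, name="b")

    m.addConstrs(a_abs[j, t] >= a[j, t] for j in range(p) for t in TB)
    m.addConstrs(a_abs[j, t] >= -a[j, t] for j in range(p) for t in TB)
    m.addConstrs(gp.quicksum(a_abs[j, t] for j in range(p)) <= d[t] for t in TB)
    m.addConstrs(a[j, t] <= s[j, t] for j in range(p) for t in TB)
    m.addConstrs(a[j, t] >= -s[j, t] for j in range(p) for t in TB)
    m.addConstrs(s[j, t] <= d[t] for j in range(p) for t in TB)
    m.addConstrs(gp.quicksum(s[j, t] for j in range(p)) >= d[t] for t in TB)
    m.addConstrs(-d[t] <= b[t] for t in TB)
    m.addConstrs(b[t] <= d[t] for t in TB)

    # Split enforcement with M = 2 and the small constant mu
    for t in TL:
        left, right = path(t)
        for i in range(n):
            for mm in left:
                m.addConstr(gp.quicksum(a[j, mm] * X[i, j] for j in range(p)) + mu
                            <= b[mm] + (2 + mu) * (1 - z[i, t]))
            for mm in right:
                m.addConstr(gp.quicksum(a[j, mm] * X[i, j] for j in range(p))
                            >= b[mm] - 2 * (1 - z[i, t]))

    # Penalize the total number of variables used across all splits
    m.setObjective((1.0 / L_hat) * gp.quicksum(L[t] for t in TL)
                   + alpha * gp.quicksum(s[j, t] for j in range(p) for t in TB),
                   GRB.MINIMIZE)
```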

4.2 Warm starts for optimal classification trees with hyperplanes

As a result of relaxing the constraints that required the splits in the tree be univariate, the warm start trees from before are no longer ideal for warm starting the solving process. Because we have only relaxed constraints in the model, these previous warm starts are still feasible solutions, but it is possible that we can inexpensively find better warm starts that exploit the increased solution space in the OCT-H problem. Recall that the quality of the warm start solution has a large effect on the speed of the solving process, so it is generally a good idea for us to generate higher quality warm starts where possible.

The approach we employ to generate a high-quality warm start to the OCT-H problem is to use greedy top-down induction with multivariate splits. This is a natural extension of the univariate heuristics to the multivariate case. To determine the multivariate split at each node, we simply solve a reduced version of the OCT-H problem for the points in the current node by restricting the solution to a single split. This is a much smaller problem than the full OCT-H problem, and so it solves much faster and without the need for a warm start, although it can also use the best univariate split as a good warm start.

Note that it would be possible to determine the split at each node using any other multivariate classification technique, such as logistic regression or support vector machines. This may increase the speed at which the multivariate warm starts are generated, since highly optimized versions of these algorithms exist for most programming languages. It may also lead to increased quality of the warm start solution, although this has not been tested.

4.3 Training procedure for optimal classification trees with hyperplanes

The OCT-H problem (28) has exactly the same parameters as the OCT problem, and thus the procedure in Sect. 3 can be used for tuning these parameters. The only difference is that each split in the tree can now use up to p variables, and so the maximum tree complexity, \(C_{\max }\), becomes \(p(2^D-1)\). As before, we can search over all values of \(C = 1, \ldots , C_{\max }\), solving the complexity-constrained problem for each and then post-processing to obtain the corresponding tuned value for \(\alpha \).

It can happen that p is too large for this to be computationally feasible, and it may be more efficient to simply search over a discretized range of possible \(\alpha \) values as is typically done during validation. In this case, the number of problems to solve is \(D \cdot |\mathcal A|\), where \(\mathcal A\) is the discretized set of values to search. The same approach of reusing the solutions from lower depths as warm starts can still be applied to speed up the solutions. Whether this approach is better than the constraint-based approach depends on the size of \(\mathcal A\) that is chosen, and the number of problems that must be solved by each approach, which is \(D \cdot |\mathcal A|\) compared to \(p(2^D-1)\).

5 Computational experiments with synthetic datasets

In this section, we examine the performance of OCT on a variety of synthetically-generated datasets in order to understand how effective the method is at discovering the underlying ground truth in a dataset whose structure is in fact described by a decision tree.

The experiments we conduct are adaptations of the experiments by Murthy and Salzberg (1995b). These experiments use a single decision tree as the ground truth to generate datasets; different methods are then used to induce decision trees on these datasets, and the induced trees are compared to the ground truth tree to evaluate performance. To construct the ground truth, a decision tree of a specified depth is created by choosing splits at random, and the leaves of the resulting tree are labeled such that no two adjacent leaves have the same label, ensuring that this tree is the smallest representation of the ground truth. The training and test datasets are created by generating each \({\varvec{x}}_i\) as a uniform random vector, and then using the ground truth tree to assign the corresponding label to the point.
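As an illustration of this setup, the sketch below generates a random univariate ground truth tree of a given depth and labels uniform random points with it. It is our own simplified rendering, not the exact procedure of Murthy and Salzberg (1995b); in particular, we label leaves by parity so that sibling leaves always disagree, which is one simple way to satisfy the adjacent-leaf condition for two classes.

```python
import numpy as np

def random_ground_truth(depth, p, rng):
    """Random univariate splits for a complete tree (branch nodes 1..2^depth - 1)."""
    n_branch = 2 ** depth - 1
    return {t: (rng.integers(p), rng.random()) for t in range(1, n_branch + 1)}

def classify(tree, depth, x):
    """Route x down the tree; label leaves by parity so siblings always disagree."""
    t = 1
    for _ in range(depth):
        j, b = tree[t]
        t = 2 * t if x[j] < b else 2 * t + 1
    return t % 2                      # two-class labels 0/1

def make_dataset(tree, depth, p, n, rng):
    X = rng.random((n, p))            # features uniform on [0, 1]^p
    y = np.array([classify(tree, depth, x) for x in X])
    return X, y

rng = np.random.default_rng(0)
truth = random_ground_truth(depth=2, p=2, rng=rng)
X_train, y_train = make_dataset(truth, depth=2, p=2, n=100, rng=rng)
```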

Following Murthy and Salzberg (1995b), we report six measures of tree quality in all experiments:

  • In-sample accuracy: accuracy on the training set.

  • Out-of-sample accuracy: accuracy on the test set.

  • Tree size: number of leaf nodes.

  • Maximum depth: total depth of the tree.

  • Average depth: average depth of leaf nodes in the tree.

  • Expected depth: average depth of leaf nodes in the tree, weighted by the proportion of the test set that belongs to each leaf node (this is also sometimes referred to as dynamic complexity).
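For clarity, a minimal sketch of how the depth-based measures are computed for a hypothetical four-leaf tree, assuming the depth of each leaf and the fraction of the test set reaching it are known:

```julia
# Illustrative computation of the depth-based quality measures for a
# hypothetical tree; the leaf depths and test-set proportions are assumed inputs.
leaf_depths   = [1, 2, 3, 3]
test_fraction = [0.60, 0.25, 0.10, 0.05]   # proportion of the test set per leaf

tree_size      = length(leaf_depths)                     # number of leaf nodes
maximum_depth  = maximum(leaf_depths)                    # total depth of the tree
average_depth  = sum(leaf_depths) / length(leaf_depths)  # unweighted mean leaf depth
expected_depth = sum(leaf_depths .* test_fraction)       # weighted by test-set usage
```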

In each experiment, ten random trees are generated as different ground truths, and ten training and test set pairs are created using each tree. The size of the training set is specified by the experiment, and the size of the test set is always \(2^D \cdot (p - 1) \cdot 500\), where D is the depth of the ground truth tree and p is the number of features in the data, both specified by the particular experiment. The quality measures are calculated over all 100 instances of the problem, and we present the means and standard errors of these measures.

The methods we compare are our OCT method, the standard CART heuristic, and a modified CART heuristic that imposes a depth limit equal to the depth of the ground truth. The purpose of this third method is to examine the impact of trimming the CART solution as a method of obtaining a simpler and more interpretable tree. For experiments with noise in the data, 10% of the training set is reserved for validation purposes in order to tune the complexity parameter \(\alpha \). Note that Murthy and Salzberg (1995b) found that C4.5 and CART gave near-identical results for all experiments they conducted and reported only the C4.5 results, so our results for CART should also be similar to those that would be obtained using C4.5.

Table 1 Effects of ground truth depth

The first experiment evaluates the effectiveness of each method as the problem complexity increases by increasing the depth of the ground truth while holding the training set size and dimension fixed at \(n=100\) and \(p=2\), respectively. There is no noise present in the training data. Table 1 shows the results of this experiment. We see that depth-constrained CART is significantly worse than the other methods when compared to the ground truth, a result that is common to all of these experiments. In terms of out-of-sample accuracy, both CART and OCT are about the same in all tests, with OCT having a very slight advantage at all depths that is largest at depth 3. We see that the out-of-sample accuracies decrease significantly as the depth increases, which is to be expected because the training problem is becoming more complex with the same amount of training data available. In terms of the other quality measures, OCT matches the ground truth nearly identically in all cases across the measures, whereas CART is significantly different and seems to be learning trees that are structurally different to the ground truth. This experiment demonstrates that although different methods can have the same out-of-sample accuracy, the underlying structure learned by each method can still be significantly different from one another, and in this case OCT is a much closer match for the ground truth.

Table 2 Effects of training set size

The second experiment investigates the effect of the amount of available training data relative to the complexity of the problem. We increase the size of the training set while holding the depth of the ground truth fixed at \(D=2\) and the dimension fixed at \(p=2\). There is no noise in the training data. Table 2 shows a summary of the results. We see that all methods increase in out-of-sample accuracy as the training size increases, approaching 100% accuracy at the higher sizes. We see that initially depth-constrained CART is a poor approximation to the ground truth, but it becomes as accurate out-of-sample as the other methods if sufficient training data is available. In a data-poor environment, depth-constrained CART performs significantly worse, and OCT offers a slight improvement over CART in terms of out-of-sample accuracy. This offers clear evidence against the notion that optimal methods tend to overfit the training data in data-poor scenarios; in fact, the optimal method performs better. In terms of the other metrics, OCT most closely matches the attributes of the true trees in nearly all cases, indicating that it is actually learning the true structure in the data.

Table 3 Effects of noise in class labels

The third experiment examines the impact of adding noise to the labels of the training data. In this experiment, 100 training points were generated per distinct label of the ground truth tree (\(D=2\), \(p=2\)). Noise was then added to the training set by incrementing by one the labels of a random \(k\%\) of the points, where k is a parameter of the experiment. Table 3 shows the impact of increasing levels of label perturbations in the training data. As the level of noise increases, the accuracies of all methods tend to decrease, as would be expected. The difference between the out-of-sample accuracies of CART and OCT increases with the level of noise up to \(20\%\), indicating that OCT is able to better identify the true tree than CART when exposed to such noise, and this difference quickly becomes significant (1–2% for noise levels of 15–20%). At the highest level of noise, the difference between these methods is diminished, reflecting the difficulty of learning the truth in such a noisy environment. The other metrics also indicate that OCT again most closely matches the ground truth as the noise levels are increased. We remark that these results counter the idea that optimal methods are less robust to noise in the data than current heuristic approaches due to problems of overfitting, as we clearly see that the OCT trees are significantly stronger out-of-sample than the heuristic counterparts for high levels of label noise.
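A small sketch of this perturbation follows; the wraparound from the last class back to the first is an assumption, as the text only states that the label is increased by one.

```julia
using Random

# Sketch of the label perturbation: increment the labels of a random k% of
# points by one, assuming labels 1..K wrap around from K back to 1.
function perturb_labels!(y::Vector{Int}, K::Int, k::Real)
    idx = randperm(length(y))[1:round(Int, k / 100 * length(y))]
    y[idx] .= mod1.(y[idx] .+ 1, K)
    return y
end
```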

Table 4 Effects of noise in attribute values

The fourth experiment is similar to the previous one, except it considers noise in the features of the training data instead of the labels. Again, 100 training points are generated for each distinct label in the ground truth tree (\(D=2\), \(p=2\)). We introduce noise by selecting a random \(k\%\) of the training points and adding random noise \(\epsilon \sim U(-0.1, 0.1)\) to every feature of each selected point. Table 4 shows the impact of increasing levels of this feature noise in the training data. At all levels of noise, the out-of-sample accuracies of CART and OCT are comparable, as are the other tree metrics. Even at the highest level of noise, where \(25\%\) of the training data are perturbed in all features by up to \(10\%\), both methods have an out-of-sample accuracy above \(97\%\). We conclude that in the presence of significant feature noise, neither method outperforms the other, and the optimal method is clearly not overfitting the training data.
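A corresponding sketch for the feature perturbation; each feature of a selected point receiving its own independent draw of \(\epsilon\) is an assumption.

```julia
using Random

# Sketch of the feature perturbation: for a random k% of training points, add
# noise drawn from U(-0.1, 0.1) to every feature of each selected point.
function perturb_features!(X::Matrix{Float64}, k::Real)
    n, p = size(X)
    idx = randperm(n)[1:round(Int, k / 100 * n)]
    X[idx, :] .+= 0.2 .* rand(length(idx), p) .- 0.1   # U(-0.1, 0.1) noise
    return X
end
```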

Table 5 Effects of dimensionality

The final experiment considers the effect of increasing the dimensionality of the problem while holding the training size and ground truth depth fixed at \(n=100\) and \(D=2\), respectively. Table 5 shows the effect of increasing the number of features for fixed training size and tree depth. At the lower dimensions, there is no significant difference between CART and OCT, but at higher dimensions the difference is about \(0.6\%\) and is significant. The other metrics show that again the OCT trees are the best match for the ground truth. These results echo those from the second test and show that the optimal method performs better relative to CART in data-poor environments where relatively little training data is available compared to the problem complexity.

We conclude by summarizing the collective findings of these tests with synthetic data. In nearly all cases, the trees generated by the OCT method most closely matched the quality metrics of the ground truth trees. In particular, the OCT trees in most cases had out-of-sample accuracies that were significantly higher than those of CART, even in the presence of significant levels of noise in the training data. This provides strong evidence against the widely-held belief that optimal methods are more inclined to overfit the training set at the expense of out-of-sample accuracy. A secondary conclusion is that the depth-constrained CART method is a poor way of constructing a decision tree of a given depth. Creating complete trees and then trimming these to shorter depths yields trees that are significantly worse in nearly all cases than those created by OCT, which creates the tree with the depth limit in mind.

6 Computational results with real-world datasets

In this section, we report the performance of Optimal Classification Trees with and without hyperplanes on several publicly available datasets widely-used within the statistics and machine learning communities, and compare these to the prominent decision tree method, CART.

In order to provide a comprehensive benchmark of performance for optimal trees, we used a collection of 53 datasets obtained from the UCI machine learning repository (Lichman 2013). These datasets are used regularly for reporting and comparing the performance of different classification methods. The datasets we use have sizes of up to a few thousand points, which we demonstrate can be handled by our methods. Datasets with tens of thousands of points were tested, but the time required for our methods to generate high-quality solutions was prohibitive. For these experiments, we seek to demonstrate the improvement delivered by our methods on these common datasets of manageable size.

Each dataset was split into three parts: the training set (50%), the validation set (25%), and the testing set (25%). The training and validation sets were used to tune the value of the hyperparameter \(\alpha \). We then used this \(\alpha \) to train a tree on the combined training and validation sets, which we then evaluated against the testing set to determine the out-of-sample accuracy.
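This protocol can be summarized with the following sketch, in which `train_oct` and `accuracy` are hypothetical routines standing in for the OCT training and scoring code, and the \(\alpha\) tuning is shown as a simple validation grid search purely for illustration (the actual tuning procedure is described in Sect. 3).

```julia
using Random

# Sketch of the evaluation protocol: 50/25/25 split, tune alpha on the
# validation set, retrain on train + validation with the chosen alpha, and
# report accuracy on the held-out test set.
function evaluate_split(X, y, alphas; seed = 1)
    Random.seed!(seed)
    n = size(X, 1)
    perm = randperm(n)
    i_train = perm[1:div(n, 2)]
    i_valid = perm[div(n, 2)+1:div(3n, 4)]
    i_test  = perm[div(3n, 4)+1:end]

    # Tune alpha on the validation set (illustrative grid search).
    best_alpha, best_acc = alphas[1], -Inf
    for alpha in alphas
        tree = train_oct(X[i_train, :], y[i_train]; alpha = alpha)
        acc = accuracy(tree, X[i_valid, :], y[i_valid])
        if acc > best_acc
            best_alpha, best_acc = alpha, acc
        end
    end

    # Retrain on train + validation with the tuned alpha, then evaluate on test.
    i_tv = vcat(i_train, i_valid)
    tree = train_oct(X[i_tv, :], y[i_tv]; alpha = best_alpha)
    return accuracy(tree, X[i_test, :], y[i_test])
end
```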

To reduce the amount of computation required across all the datasets, we elected not to tune the parameter \(N_{\min }\). Instead, we set this to 5% of the total number of points in the dataset so that this value would work across all datasets.

We trained optimal trees of depths 1–4 on all datasets, and compared them to solutions from CART trimmed to the same depth. These CART solutions were obtained by fixing \(N_{\min }\) to the same value as for optimal trees, and tuning the hyperparameter \(\alpha \) in the usual way for CART. The resulting tree was then trimmed to the specified depth. This lets us evaluate the performance of all methods in the same depth-constrained scenario, which is often important for applications that require interpretability of the solutions. In such cases, we seek solutions with very few splits that best capture the structure in the data. We collected data for CART without the depth constraints to examine whether shallower Optimal Trees can achieve the same accuracy as deep CART trees.

In order to minimize the effect of the particular splitting of the data into training, validation and testing sets, the entire process was conducted five times for each dataset, with a different splitting each time. The final out-of-sample accuracies were then obtained by averaging the results across these five runs.

For testing the performance of CART, we used the rpart package (Therneau et al. 2015) in the R programming language (R Core Team 2015). The hyperparameters \(N_{\min }\) and \(\alpha \) correspond to the parameters minbucket and cp. As discussed earlier, we post-processed the trees to satisfy any depth constraints.

The OCT problems (OCT and OCT-H) were implemented in the Julia programming language, which is a rapidly maturing language designed for high-performance scientific computing (Bezanson et al. 2014). We formulated our MIO models using the JuMP package, a state-of-the-art library for algebraic modeling and mathematical optimization (Lubin and Dunning 2015), and solved them using the Gurobi 6.5 solver, one of the fastest available commercial MIO solvers (Gurobi Optimization Inc 2015b). Again, we tuned the value of \(\alpha \) through the validation process described in Sect. 3.

All problems were solved in parallel using the Amazon Web Services (AWS) Elastic Compute Cloud (EC2). We used the compute-optimized c4.xlarge instances, which have four virtual CPU cores (2.9 GHz) and 7.5 GB RAM.

A time limit was applied to each optimal tree problem. On most datasets for OCT this was 30 min, but for some of the larger datasets we increased the time limit up to 2 h to ensure that the solver could make progress. The OCT-H problem is easier to solve and so only required time limits of 5–15 min. When generating heuristic warm starts for the multivariate problem by recursively solving the OCT-H problem with a single split, we used a time limit of 60 s. We found this was more than enough to find a good greedy solution.
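As a minimal illustration of this setup, the following sketch uses current JuMP syntax (which differs from the JuMP interface available with Gurobi 6.5 at the time); the variables, constraints, and objective are placeholders rather than the actual OCT formulation.

```julia
using JuMP, Gurobi

# Sketch of the solver setup: a Gurobi-backed JuMP model with a time limit and
# a warm start supplied for the binary variables. Placeholder model only.
model = Model(Gurobi.Optimizer)
set_time_limit_sec(model, 1800.0)        # e.g. a 30-minute limit

@variable(model, z[1:10], Bin)
@constraint(model, sum(z) <= 4)
@objective(model, Max, sum(i * z[i] for i in 1:10))

# Provide a heuristic solution as a warm start before solving.
set_start_value.(z, [0, 0, 0, 0, 0, 0, 1, 1, 1, 1])

optimize!(model)
```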

As discussed in Sect. 2, applying a time limit when solving an MIO problem with strong warm starts and heuristics does not usually affect the quality of the solution, provided the time limit is sufficiently large. As we saw in Fig. 3, the solution that is eventually shown to be optimal is typically found very close to the beginning of the solve, especially when warm starts are used. Moreover, the majority of the time is spent proving that this solution is in fact optimal, which is not required for our purposes. If we set the time limit intelligently, we can avoid the lengthy process of proving optimality while still generating high-quality solutions. However, this does mean that most of our results for OCT and OCT-H do not carry explicit certificates of optimality.

6.1 Overall comparison of all methods

Table 6 Mean out-of-sample accuracies across all datasets for each method, according to the maximum tree depth

We first provide a summary of all methods for comparison purposes. Table 6 shows the average out-of-sample accuracy across all datasets for each method and tree depth. For reference we also include the results for a top-down induction version of OCT-H, where the tree is decided greedily, one split at a time, providing a reference point for the performance of top-down multivariate methods. We see that CART performs the weakest, and is outperformed at all depths by OCT by about 1–2%, demonstrating the advantage that can be realized by using optimal trees. OCT-H is significantly more powerful than all other methods; in particular, OCT-H performs significantly stronger than the top-down OCT-H heuristic, over 2% better on average, indicating that multivariate trees also benefit significantly from taking a global perspective when choosing the splits.

When we run CART without constraining the depth of the trees, the maximum tree depth required is 10 and the resulting average accuracy across all datasets is 80.7%. We note this is very close to the accuracy of 80.4% of OCT at depth 4. It is quite possible that with a longer time limit, OCT at depth 4 would be able to achieve accuracies similar to those of CART at depth 10. This would be very significant for applications that require interpretation of the decision trees, since shallow trees are much more interpretable. We also note that even the depth 2 accuracy of OCT-H at 81.0% is greater than this depth 10 CART accuracy, indicating that very shallow multivariate decision trees can outperform the deepest univariate trees.

Table 7 Full results for CART and OCT at depth 2

6.2 Optimal classification trees versus CART

We now present an in-depth direct comparison of OCT and CART. Both methods seek to solve the same decision tree problem, so this comparison aims to show the effectiveness of formulating the problem under an MIO framework and solving the exact same problem to optimality.

The entire set of mean out-of-sample accuracies on each dataset for both methods at depth 2 is provided in Table 7, which reports the size and dimension of the dataset, the number of class labels K, and the average out-of-sample accuracy of each method along with the mean accuracy improvement for OCT over CART and the corresponding standard error (results for depths 3 and 4 are provided in Appendix 1).

Table 8 Comparison of CART and OCT across a range of depths, showing the number of datasets for which each method had the highest out-of-sample accuracy, and the mean improvement in out-of-sample accuracy when using OCT across all datasets along with the p value indicating the statistical significance of this difference

In Table 8, we present the number of datasets for which each method performed strongest, broken down by the maximum depth of the trees. We see that at all depths, OCT is stronger on roughly twice as many datasets as CART. This table also shows the mean difference between the out-of-sample accuracies of each method across all datasets at each depth, along with the p value indicating the statistical significance of this difference. We see that the OCT trees have a higher accuracy of around 0.5–2% at all depths, and that this difference is statistically significant for all depths. The largest difference is at depth 2, and at depths 3 and 4 this difference is smaller, but still significant both statistically and in magnitude. The result at depth 2 shows there is scope for OCT to outperform CART by a large margin. We believe the smaller differences at higher depths can be attributed to the MIO problems being harder to solve, and so less progress is being made at these depths for the same time limit. Another explanation is that the natural advantage we would expect optimal methods to have over CART is less pronounced at higher depths, but this runs contrary to intuition, which would suggest that larger trees have more scope for optimization than shallower ones.

Note that even at depth 1, a significant difference is present between the methods, which is simply the effect of using the misclassification score to select the best split as opposed to an impurity measure.

We have established that OCT is more likely to outperform CART, and has a small yet significant gain in out-of-sample accuracy across the collection of datasets. Next, we will consider the relative performance of the methods according to characteristics of the dataset in order to identify the types of problems where each method is more likely to outperform the other.

Table 9 Breakdown of results according to the accuracy of CART on the dataset and the maximum depth of the trees

Table 9 presents a comparison of OCT against CART as a function of the accuracy of CART. We can see that OCT consistently has an edge when the CART accuracy is above 80% or below 60%. The win rates in these areas also favor OCT. In the region where CART accuracy is 60–80%, the results are mixed with no clear winner at the higher depths. The main conclusion of this comparison is that OCT is able to consistently deliver improvements in accuracy when CART performs either well or poorly. A possible explanation is that a high CART accuracy indicates that axis-aligned decision tree methods are well-suited to the dataset in question, which means solving the problem using MIO is more likely to lead to better decision trees with higher accuracy on the problem. Conversely, when CART performs very poorly, it may be the result of bad decisions in the trees that OCT can avoid. However, when the CART accuracy is in the middle, it could indicate that CART is hitting the limit of what is possible for a decision tree method, and so the benefits of a stronger decision tree in-sample are not realized out-of-sample.

Fig. 4
figure 4

Plot of winning method (CART vs. OCT) by the accuracy of CART and the ratio of the number of points to the dimension of points, n / p, in each dataset for trees of depth 2. The high-performance region where OCT is highly likely to perform strongest is indicated by the dashed lines

Figure 4 shows the winning method for trees of depth 2 plotted according to the accuracy of CART and the ratio of n and p. We see that the top-left corner has a very high concentration of OCT wins, and this pattern is present in both the depth 3 and 4 plots as well so they are omitted for brevity. The dashed lines on the plot indicate a region of high OCT performance, which are those datasets with CART error below 20% and n / p over 10. This region was determined by looking at the results collectively across all depths.

Table 10 Breakdown of results according to whether the dataset is in the region of high-performance and the maximum depth of the trees

Table 10 summarizes the results for each depth according to whether they are in this region of high performance. We see that in the regions of high performance, OCT is almost always the winning method if there is a winner, and the average improvement for OCT is consistently around 1.2–1.3%. For the datasets that are not in this region, the results are more mixed and there is no clear winner, although at depth 2 there is a large difference of 2.2% between the accuracies of CART and OCT, and in all other cases OCT has a slight accuracy advantage.

This breakdown suggests that OCT is very likely to outperform CART when the following condition holds:

$$\begin{aligned} (\text {CART accuracy} \ge 80\%) \wedge (n/p \ge 10) \end{aligned}$$

This rule makes sense intuitively, as datasets where CART performs well are likely those which are well suited to decision tree models, which means that creating these trees optimally could well lead to better performance, whereas this may not be as likely if tree models are a poor fit to the data. Additionally, if \(n \gg p\), this suggests we have enough data relative to the number of possible splits to learn a good tree without overfitting, echoing the results of the second synthetic test in Sect. 5.
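Expressed as a simple predicate (with accuracy as a fraction, and the thresholds taken from the region identified above):

```julia
# Rule of thumb from the high-performance region: OCT is very likely to
# outperform CART when CART already performs well and there is ample data
# relative to the dimension.
oct_likely_to_outperform(cart_accuracy, n, p) = cart_accuracy >= 0.80 && n / p >= 10
```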

The other key takeaway is that if we are not in this region, there is no indication of any ill-effects; the results outside this region are balanced, and there is even a slight edge towards optimal trees.

Given the strength of these results, this partitioning based on n / p and CART accuracy should provide a highly accurate and useful guide for when OCT is likely to yield accuracy improvements on real-world problems.

6.3 Optimal classification trees with hyperplanes versus CART

We now present a comparison of out-of-sample accuracies between OCT-H and CART. Recall that after formulating the OCT problem using MIO, we were able to easily obtain the OCT-H problem by relaxing integrality constraints on the split variables \({\varvec{a}}\). This comparison aims to examine the possible accuracy gains from first formulating the problem using MIO and then relaxing a subset of difficult constraints to generate a model that is simpler to solve.

The entire set of mean out-of-sample accuracies on each dataset for both methods at depth 2 is provided in Table 11, which again reports the size and dimension of the dataset, the number of classes K, and the average out-of-sample accuracy of each method along with the mean improvement of OCT-H over CART and the associated standard error (results for depths 1, 3 and 4 are provided in Appendix 2).

Table 11 Full results for CART and OCT-H at depth 2
Table 12 Comparison of CART and OCT-H across a range of depths, showing the number of datasets for which each method had the highest out-of-sample accuracy, and the mean improvement in out-of-sample accuracy when using OCT-H across all datasets along with the p value indicating the statistical significance of this difference

Table 12 shows a summary of the relative performance of CART and OCT-H according to the depth of the trees. We see that at all depths, OCT-H is stronger than CART on the vast majority of datasets. OCT-H also gives a significant increase of 3–5% in out-of-sample accuracy compared to CART across all datasets, depending on the tree depth. The corresponding p values confirm this difference to be highly significant statistically. Similar to the OCT comparison, the advantage of OCT-H over CART decreases as the depth of the trees increases. This could be due to the MIO problems becoming harder to solve as the depth increases, as for OCT, but it may also be the case that most of the accuracy advantage from multivariate decision trees can be realized with shallower trees, and so the advantages of deeper trees are less pronounced, allowing CART to reduce the accuracy gap with increasing depth.

Unlike the OCT versus CART comparison, the difference at depth 1 here is very significant, because multivariate trees of depth 1 can still use multiple variables in their single split, and this leads to a very significant mean improvement of around 5%.

As we did for OCT, we now consider the influence of problem characteristics on the performance of OCT-H.

Table 13 Breakdown of results according to the accuracy of CART on the dataset and the maximum depth of the trees

Table 13 compares OCT-H to CART broken down by the accuracy of CART. Unlike OCT, we see that OCT-H improves significantly upon CART accuracy in each group, with the most significant gains coming from the datasets where CART is weakest (below 80%), showing average improvements of 5–10%. As the CART accuracy increases, the improvement offered by OCT-H decreases, but it still holds an advantage even at the highest CART accuracies. This indicates that multivariate trees are able to exploit problem structure in a way that univariate trees cannot, and this exploitation is most beneficial when univariate decision trees perform poorly.

Fig. 5
figure 5

Plot of winning method (CART vs. OCT-H) by the accuracy of CART and the dimension of points, p, in each dataset for trees of depth 2. The high-performance region where OCT-H is highly likely to perform strongest is indicated by the dashed lines

Figure 5 shows the winning method for each dataset as a function of the CART accuracy on each dataset and the number of features in the data. As we did for OCT, we can identify a region where OCT-H is highly likely to outperform CART. As before, we considered the results across all depths when identifying this region; the plots for other depths show similar patterns and are omitted. We see that this region of high performance is where either the CART accuracy is low or the number of features is low, with the following test determining membership of the region:

$$\begin{aligned} (\text {CART Accuracy} \le 70\%) \vee (p \le 10) \end{aligned}$$
Table 14 Breakdown of results according to whether the dataset is in the region of high-performance and the maximum depth of the trees

Table 14 shows the relative performance of the methods in each of the regions. We see that in terms of the win rate, OCT-H is highly likely to deliver significant out-of-sample accuracy improvements on datasets that fall into the high-performance region, and in this region the average accuracy improvement over CART is 4–7% depending on depth. Outside the region, OCT-H is still likely to beat CART, but less so than in the region, and the average accuracy improvement is 2–3%. We have already addressed why OCT-H is likely to do well when the CART accuracy is poor, but these results also suggest that OCT-H performs better when the dimension of the data is small. We presume that this may be because there are fewer possible hyperplane splits to consider and so the feasible space for the problem is smaller, allowing better solutions to be found faster.

6.4 Optimal classification trees versus Random Forests

We conclude our experiments on real-world datasets by providing a brief comparison with Random Forests (Breiman 2001). Random Forests achieve state-of-the-art accuracies, and so this allows us to place our results in a wider context. We would like to note that our goal in developing these methods was not necessarily to develop the best classification method overall; rather, we intended to show the benefits of taking a problem that is traditionally solved by a heuristic and instead solving this problem to optimality. Nevertheless, these comparisons help to quantify the significance of the improvements in accuracy.

We ran Random Forests on all datasets using the randomForest package (Liaw and Wiener 2002) in R, using the same training, validation and testing splits as the other methods. The Random Forests do not require any tuning, so they were trained on the combined training and validation sets and then evaluated on the testing set. This was averaged over five splits as for the other methods. The forests were trained with 100 trees.

The average accuracy of Random Forests across all 53 datasets was 85.8%, which is about 6, 5 and 3% higher than CART, OCT and OCT-H at depth 4, respectively. We see that across all datasets, OCT closes the gap between CART and Random Forests by about one-sixth, and OCT-H by about half. We believe this demonstrates the significance of our accuracy improvements when measured against the improvement offered by Random Forests.

We found that Random Forests performed strongest relative to OCT-H on datasets with a high number of classes, K, or features, p. If we consider only datasets with \(K=2\) or \(K=3\) classes, and \(p \le 25\), we find that OCT-H and Random Forests are comparable in accuracy, with average out-of-sample accuracies of 85.2 and 85.1%, respectively. We can therefore achieve state-of-the-art accuracies in this regime with just a single tree learner, allowing us to maintain some interpretability in the model, albeit less than an axis-aligned tree, but still much more than a forest. We note that 31 of the datasets we considered fall into this regime, so we do not believe it is a restrictive setting.

Finally, we note that we did not tune the value of \(N_{\min }\) in our experiments, nor did we consider other improvements to the original OCT problem that could improve accuracy. With the right selection of model and parameters, we believe it is possible to train a single decision tree learner that has the accuracy of a random forest without sacrificing interpretability, which would be significant for those applications where interpretability is required.

7 Conclusions

In this paper, we have revisited the classical problem of decision tree creation under a modern MIO lens. We presented a novel MIO formulation for creating optimal decision trees that captures the discrete nature of this problem and motivates two new methods, OCT and OCT-H.

Experiments with synthetic data provide strong evidence that optimal decision trees can better recover the true generating decision tree in the data, contrary to the popular belief that such optimal methods will just overfit the training data at the expense of generalization ability.

Exploiting the astonishing progress of MIO solvers in the last 25 years, our comprehensive computational experiments showed that the OCT and OCT-H problems are both practically solvable and deliver solutions that outperform the classical state-of-the-art methods, with average absolute improvements in out-of-sample accuracy over CART of 1–2% for OCT and 3–5% for OCT-H across all datasets, depending on the depth of the tree. We also provided guidance for each method that predicts when it is most likely to outperform CART.

For OCT, we predict consistent accuracy improvements of 2–4% when the CART accuracy is below 60%, and improvements of 1.2–1.3% when the CART accuracy is above 80% and there is sufficient training data available relative to the complexity of the problem (\(n/p \ge 10\)). Outside these cases, we find that OCT and CART are competitive with each other.

For OCT-H, we can predict consistent accuracy improvements of 4–7% over CART if the CART accuracy is below 70% or the dimension of the data is no more than 10. On other datasets, we can still predict strong performance by OCT-H, but with a smaller yet still significant average improvement of 2–3%.

These results provide comprehensive evidence that the optimal decision tree problem is tractable for practical applications and leads to significant improvements over heuristic methods.