Classical rough set approach (CRSA)
Rough Set Theory (RST), introduced by Pawlak in [29], provides methods for generalizing or reducing information so as to facilitate knowledge discovery by exploiting the relational indiscernibility of objects in an information table. Central to RST is the notion that an observed object carries a certain amount of information. When considered in relation to a cohort of observed objects, this information is used to group similar objects into information granules. Together, the information provided by the set of observed objects can be generalized to describe the conditions required for membership in a concept class.
Notation
The methods of classical RST, hereafter referred to as the CRSA, act upon an information table of the form S=(U,A,V,f), where U is a non-empty finite set of objects, called the universe. A=C∪{d} is a set of attributes that describe a given object in U, comprised of a set C of condition attributes and an optional decision attribute d. When d is present, the information table is a decision table. The set of all values, V, contains the value sets \(V_{a}\) for every attribute a∈A. Given an object x∈U, the function f:U×A→V maps each attribute a∈A of object x to its associated value \(v = f(x,a) \in V_{a}\). An attribute-value pair (a,v) for a given object is referred to as a descriptor.
Table 1 provides an example of a discretized decision table, in which six prognostic factors, as the condition attributes, describe seven patients. The decision attribute, presence of coronary disease in the patient, is the binary attribute d with value set {Yes, No}.
Table 1
Example decision table

| Patient | Gender | Age | SystBP | HDL | Diabetic | Smoker | Coronary disease (d) |
| x1 | F | H | M | L | No | No | No |
| x2 | M | L | L | L | No | Yes | No |
| x3 | F | M | M | H | No | No | No |
| x4 | F | M | M | H | No | No | Yes |
| x5 | M | H | H | L | Yes | Yes | Yes |
| x6 | M | H | H | L | Yes | Yes | Yes |
| x7 | F | M | M | H | No | No | Yes |
The objects in a decision table can be grouped according to their descriptors. For example, patients x5 and x6 have the same attribute values and are thus indiscernible from each other. In general, two objects \(x_{i}, x_{j} \in U\) are indiscernible with respect to a set of condition attributes B⊆C if \(f(x_{i},a) = f(x_{j},a)\) for all a∈B. This relation is called an indiscernibility relation, defined as \(R(B) = \{ (x_{i}, x_{j}) \in U \times U \colon \forall a \in B, \; f(x_{i},a) = f(x_{j},a) \}\).
For example, the patients in Table 1 can be separated into four groups according to the indiscernibility relation R(C): \(X_{1} = \{x_{1}\}\), \(X_{2} = \{x_{2}\}\), \(X_{3} = \{x_{3}, x_{4}, x_{7}\}\), \(X_{4} = \{x_{5}, x_{6}\}\). These groups of objects are referred to as equivalence classes, or conditional classes for B⊆C. An equivalence class for the decision attribute is called a decision class or concept, and in this example there are two groups: \(Y_{\mathit{No}} = \{x_{1}, x_{2}, x_{3}\}\) and \(Y_{\mathit{Yes}} = \{x_{4}, x_{5}, x_{6}, x_{7}\}\). The equivalence class specified by the object \(x_{i}\) with respect to R(B) is denoted as \([x_{i}]_{B}\).
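The grouping of Table 1's patients into conditional classes reduces to partitioning objects by their descriptor tuples. Below is a minimal Python sketch (a hypothetical illustration, not the authors' code; the attribute names and column order are taken from the example table and are assumptions):

```python
from collections import defaultdict

# Condition attributes and discretized values from the example decision table.
C = ["Gender", "Age", "SystBP", "HDL", "Diabetic", "Smoker"]
table = {
    "x1": ("F", "H", "M", "L", "No", "No"),
    "x2": ("M", "L", "L", "L", "No", "Yes"),
    "x3": ("F", "M", "M", "H", "No", "No"),
    "x4": ("F", "M", "M", "H", "No", "No"),
    "x5": ("M", "H", "H", "L", "Yes", "Yes"),
    "x6": ("M", "H", "H", "L", "Yes", "Yes"),
    "x7": ("F", "M", "M", "H", "No", "No"),
}

def equivalence_classes(table, attrs):
    """Group objects that are indiscernible on the attribute subset attrs."""
    idx = [C.index(a) for a in attrs]
    groups = defaultdict(set)
    for obj, vals in table.items():
        groups[tuple(vals[i] for i in idx)].add(obj)
    return sorted(sorted(g) for g in groups.values())

print(equivalence_classes(table, C))
# [['x1'], ['x2'], ['x3', 'x4', 'x7'], ['x5', 'x6']]
```

Restricting the attribute subset (e.g. to Gender alone) yields coarser granules, mirroring the role of B⊆C in R(B).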
Set approximations
The goal of the CRSA is to provide a definition of a concept according to the attribute values of the equivalence classes containing objects that are known instantiations of the concept. As such, in a consistent decision table, membership in a conditional class implies membership in a particular decision class. In Table 1, \(x \in X_{4}\) implies \(x \in Y_{\mathit{Yes}}\). Membership in \(X_{3}\), however, does not imply \(Y_{\mathit{Yes}}\), as \(x_{4}, x_{7} \in Y_{\mathit{Yes}}\) but \(x_{3} \in Y_{\mathit{No}}\). Thus Table 1 is inconsistent, as \(f(x_{4},d) \neq f(x_{3},d)\) and \(f(x_{7},d) \neq f(x_{3},d)\).
To represent an inconsistent decision table, the CRSA establishes an upper and a lower approximation for each decision class Y. The lower approximation is comprised of all objects that definitely belong to Y, while the upper approximation includes all objects that possibly belong to Y. An object \(x_{i}\) definitely belongs to a concept Y if \([x_{i}]_{C} \subseteq Y\), and \(x_{i}\) possibly belongs to Y if \([x_{i}]_{C} \cap Y \neq \emptyset\). Thus, the lower and upper approximations are defined as follows:
$$\begin{array}{@{}rcl@{}} \underline{R}_{B} (Y) & =& \left\lbrace x \in U \colon [\!x]_{B} \subseteq Y \right\rbrace = \bigcup \left\lbrace [\!x]_{B} \colon [\!x]_{B} \subseteq Y \right\rbrace \\ \overline{R}_{B} (Y) &=& \left\lbrace x \in U \colon [\!x]_{B} \cap Y \neq \emptyset \right\rbrace = \bigcup \left\lbrace [\!x]_{B} \!\colon\! [\!x]_{B} \cap Y \neq \emptyset \right\rbrace \\ \overline{R}_{B} (Y) &- &\underline{R}_{B} (Y) = BND_{B} (Y) \end{array} $$
The boundary region, \(BND_{B}(Y)\), contains those objects that possibly, but not certainly, belong to Y. Conversely, the set \(U - \overline {R}_{B} (Y)\) is the outside region, containing those objects that certainly do not belong to Y. In our example, the lower and upper approximations for \(Y_{\mathit{Yes}}\) are \(\underline {R}_{C} (Y_{\mathit {Yes}}) = X_{4} = \{ x_{5}, x_{6} \}\) and \(\overline {R}_{C} (Y_{\mathit {Yes}}) = X_{4} \cup X_{3} = \{ x_{3}, x_{4}, x_{5}, x_{6}, x_{7} \}\), and the boundary region contains the objects \(BND_{C} (Y_{\mathit{Yes}}) = \{ x_{3}, x_{4}, x_{7} \}\).
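These definitions reduce to simple set operations over the conditional classes. A minimal sketch, assuming the four conditional classes of R(C) derived in the worked example:

```python
# Conditional classes of R(C) and the concept Y_Yes from the worked example.
conditional = [{"x1"}, {"x2"}, {"x3", "x4", "x7"}, {"x5", "x6"}]
Y_yes = {"x4", "x5", "x6", "x7"}

# Lower approximation: union of classes fully contained in the concept.
lower = set().union(*[E for E in conditional if E <= Y_yes])
# Upper approximation: union of classes that intersect the concept.
upper = set().union(*[E for E in conditional if E & Y_yes])
boundary = upper - lower

print(sorted(lower))     # ['x5', 'x6']
print(sorted(upper))     # ['x3', 'x4', 'x5', 'x6', 'x7']
print(sorted(boundary))  # ['x3', 'x4', 'x7']
```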
Let \(F = \{Y_{1}, Y_{2}, \ldots, Y_{n}\}\) represent a classification, i.e. a set of decision classes. The quality of approximation of classification, \(\gamma_{B}(F)\), with respect to attributes B expresses the ratio of all objects covered by the lower approximations \(\underline{R}_{B}(F) = \{\underline{R}_{B}(Y_{1}), \underline{R}_{B}(Y_{2}), \ldots, \underline{R}_{B}(Y_{n})\}\) to all objects in U. The quality of approximation is expressed as:
$$\gamma_{B} (F) = \frac{\sum_{t=1}^{n} \left| \underline{R}_{B} (Y_{t}) \right|}{\left| U \right|} $$
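For Table 1 this ratio works out as follows (a sketch; the lower approximations are those of the worked example):

```python
U = {"x1", "x2", "x3", "x4", "x5", "x6", "x7"}
# Lower approximations of the two decision classes under R(C):
lower_no = {"x1", "x2"}    # conditional classes {x1} and {x2} lie inside Y_No
lower_yes = {"x5", "x6"}   # conditional class {x5, x6} lies inside Y_Yes

gamma = (len(lower_no) + len(lower_yes)) / len(U)
print(round(gamma, 3))  # 0.571 (= 4/7)
```

The three boundary patients x3, x4, x7 are the only objects not covered by any lower approximation, hence γ = 4/7.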
Dominance-based rough set approach (DRSA)
Under the DRSA [39], the relations between objects are no longer established by the indiscernibility relation of the CRSA [29]. In its place, the DRSA introduces a dominance relation that allows for ordinal attributes with preference-ordered domains, wherein a monotonic relationship exists between the attribute and the decision classes. An example of such a relationship occurs when a “better” or “worse” value of an attribute leads to a “better” or “worse” decision class.
Notation
A decision table in the DRSA is expressed in the same way as the CRSA. To differentiate between attributes with and without a preference-ordered domain, those with a preference order are called criteria while those without are referred to as attributes, as in the CRSA.
In the DRSA, the domain of a criterion a∈A is completely preordered by the outranking relation \(\succeq_{a}\), representing the preference order of the domain. The outranking relation is also used to compare two objects: for \(x_{i}, x_{j} \in U\), \(x_{i} \succeq_{a} x_{j}\) means that \(x_{i}\) is at least as good as (outranks) \(x_{j}\) with respect to the criterion a∈A.
Commonly, the domain of a criterion a is a subset of the real numbers, \(V_{a} \subseteq \mathbb{R}\), and the outranking relation is then the simple order “≥” on real numbers, such that \(x_{i} \succeq_{a} x_{j} \Leftrightarrow f(x_{i},a) \geq f(x_{j},a)\). This relation is straightforward for gain-type criteria (the more, the better), and can be easily reversed for cost-type criteria (the less, the better).
Using Table 1 as an example, the decision criterion d is preference-ordered such that a positive diagnosis of coronary disease is assumed to be the “preferred” decision class. Criterion preference relations are then organized in the direction of the decision class; values which generally contribute to the incidence of coronary disease are preferred over those which indicate lower risk, much in the same way that a positive diagnosis indicates presence of coronary disease. For the criteria in Table 1, higher values are preferred to lower values—as in the case of Age, SystBP, and HDL—and “Yes” is preferred to “No”—as in the case of Smoker and Diabetic. No such preference relation exists for Gender; as such, it is considered an attribute.
Let T={1,…,n} represent increasing indexes corresponding to the order of preferences of the decision criterion d. The decision table is then partitioned into n classes \(Y_{t}\), t∈T, where each object x∈U is assigned to one and only one class \(Y_{t}\). The decision classes are preference-ordered according to the decision maker, i.e. for all r,s∈T with r>s, the objects from class \(Y_{r}\) are strictly preferred to the objects from class \(Y_{s}\).
For our example in Table 1, \(Y_{1} = \{x_{1}, x_{2}, x_{3}\}\) corresponds to the patients without coronary disease and \(Y_{2} = \{x_{4}, x_{5}, x_{6}, x_{7}\}\) corresponds to the patients with coronary disease. Therefore, each patient in \(Y_{2}\) is preferred to each patient in \(Y_{1}\).
Set approximations
In the DRSA, the approximated sets are upwards and downwards unions of decision classes rather than individual decision classes as in the CRSA. Upward and downward unions of classes are defined as:
$$ Y_{t}^{\geq} = \bigcup_{s \geq t} Y_{s} \text{\quad and \quad} Y_{t}^{\leq} = \bigcup_{s \leq t} Y_{s}, \; s,t \in T $$
For any pair of objects \((x_{i}, x_{j}) \in U \times U\), \(x_{i}\) dominates \(x_{j}\) with respect to a set of condition attributes P⊆C, denoted by \(x_{i} \, D_{P} \, x_{j}\), if the following conditions are satisfied simultaneously:
$$\begin{array}{@{}rcl@{}} x_{i} &\succeq_{q}& x_{j}, \quad \text{for all criteria}~~q \in P \\ f(x_{i},a) &=& f(x_{j},a), \quad \text{for all attributes}~~a \in P \end{array} $$
The dominance relation defines two sets, called dominance cones, where for each \(x_{i} \in U\):
$$\begin{array}{@{}rcl@{}} &D_{P}^{+} (x_{i}) = &\{ x_{j} \in U \colon x_{j} \: D_{P} \: x_{i} \}, \: \text{representing the set of}\\&& \text{objects that dominate}~~ x_{i} \\ &D_{P}^{-} (x_{i}) = & \{ x_{j} \in U \colon x_{i} \: D_{P} \: x_{j} \}, \: \text{representing the set of}\\&& \text{objects dominated by}~~ x_{i} \end{array} $$
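The dominance check and the resulting cones can be sketched directly from these definitions. In the sketch below (an illustration, not the authors' implementation), the value ordering L < M < H and No < Yes is an assumption based on the stated preference directions, and Gender is compared by equality as a plain attribute:

```python
C = ["Gender", "Age", "SystBP", "HDL", "Diabetic", "Smoker"]
criteria = {"Age", "SystBP", "HDL", "Diabetic", "Smoker"}  # preference-ordered
order = {"L": 0, "M": 1, "H": 2, "No": 0, "Yes": 1}        # assumed encoding
table = {
    "x1": ("F", "H", "M", "L", "No", "No"),
    "x2": ("M", "L", "L", "L", "No", "Yes"),
    "x3": ("F", "M", "M", "H", "No", "No"),
    "x4": ("F", "M", "M", "H", "No", "No"),
    "x5": ("M", "H", "H", "L", "Yes", "Yes"),
    "x6": ("M", "H", "H", "L", "Yes", "Yes"),
    "x7": ("F", "M", "M", "H", "No", "No"),
}

def dominates(xi, xj):
    """xi D_C xj: xi outranks xj on every criterion, equals xj on attributes."""
    for a, vi, vj in zip(C, table[xi], table[xj]):
        if a in criteria:
            if order[vi] < order[vj]:
                return False
        elif vi != vj:
            return False
    return True

def D_plus(x):   # D_C^+(x): objects dominating x
    return {y for y in table if dominates(y, x)}

def D_minus(x):  # D_C^-(x): objects dominated by x
    return {y for y in table if dominates(x, y)}

print(sorted(D_plus("x4")), sorted(D_minus("x4")))
# ['x3', 'x4', 'x7'] ['x3', 'x4', 'x7']
```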
Considering the dominance cones, the lower and upper approximations of the unions of decision classes are defined as follows. The lower approximation \(\underline {R}_{P} \left (Y_{t}^{\geq }\right)\) represents objects that certainly belong to \(Y_{t}^{\geq }\), such that no object dominating x belongs to a decision class inferior to \(Y_{t}\). Similarly, the lower approximation \(\underline {R}_{P} \left (Y_{t}^{\leq }\right)\) represents objects that certainly belong to \(Y_{t}^{\leq }\), with no object dominated by x belonging to a decision class superior to \(Y_{t}\). The upper approximations represent objects that possibly belong to one of the upward or downward unions of decision classes.
$$\begin{array}{@{}rcl@{}} \underline{R}_{P} \left(Y_{t}^{\geq} \right) &=&\left\{ x \in U \colon D_{P}^{+} (x) \subseteq Y_{t}^{\geq} \right\} \\ \overline{R}_{P} \left(Y_{t}^{\geq} \right) & = & \bigcup_{x \in Y_{t}^{\geq}} D_{P}^{+}(x) = \left\{x \in U \colon D_{P}^{-} (x) \cap Y_{t}^{\geq} \neq \emptyset \right\} \\ \underline{R}_{P} \left(Y_{t}^{\leq} \right) & = & \left\{ x \in U \colon D_{P}^{-} (x) \subseteq Y_{t}^{\leq} \right\} \\ \overline{R}_{P} \left(Y_{t}^{\leq} \right) & = & \bigcup_{x \in Y_{t}^{\leq}} D_{P}^{-} (x) = \left\{x \in U \colon D_{P}^{+} (x) \cap Y_{t}^{\leq} \neq \emptyset \right\} \\ \end{array} $$
(1)
Similar to the CRSA, the boundary regions are defined as:
$$\begin{array}{@{}rcl@{}} BND_{P} Y_{t}^{\geq} = & \overline{R}_{P} \left(Y_{t}^{\geq} \right) - \underline{R}_{P} \left(Y_{t}^{\geq} \right) \\ BND_{P} Y_{t}^{\leq} = & \overline{R}_{P} \left(Y_{t}^{\leq} \right) - \underline{R}_{P} \left(Y_{t}^{\leq} \right) \end{array} $$
Using our example decision table, Table 1, and considering the full set of condition attributes, it can be seen that \(x_{4} \, D_{C} \, x_{3}\), and furthermore \(D_{C}^{+} (x_{4}) = \{ x_{3}, x_{4}, x_{7} \}\) and \(D_{C}^{-} (x_{4}) = \{ x_{3}, x_{4}, x_{7} \}\). Considering the dominance cones for all patients, the lower and upper approximations of the unions of decision classes are \(\underline {R}_{C} \left (Y_{2}^{\geq } \right) = \{ x_{5}, x_{6} \}\), \(\overline {R}_{C} \left (Y_{2}^{\geq }\right) = \{ x_{3}, x_{4}, x_{5}, x_{6}, x_{7} \}\), \(\underline {R}_{C} \left (Y_{1}^{\leq }\right) = \{ x_{1}, x_{2} \}\), and \(\overline {R}_{C} \left (Y_{1}^{\leq } \right) = \{ x_{1}, x_{2}, x_{3}, x_{4}, x_{7} \}\), and the boundary regions are \(BND_{C} Y_{2}^{\geq } = BND_{C} Y_{1}^{\leq } = \{ x_{3}, x_{4}, x_{7} \}\).
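Given the cones, the approximations of Eq. (1) are direct set comprehensions. A sketch over the worked example, with the cones listed explicitly (these values are derived by hand from Table 1 and are assumptions of this illustration):

```python
U = {"x1", "x2", "x3", "x4", "x5", "x6", "x7"}
Y2_up = {"x4", "x5", "x6", "x7"}  # upward union Y_2^>=
Y1_down = {"x1", "x2", "x3"}      # downward union Y_1^<=

# Dominance cones D_C^+ (dominating) and D_C^- (dominated) per patient.
D_plus = {"x1": {"x1"}, "x2": {"x2", "x5", "x6"},
          "x3": {"x3", "x4", "x7"}, "x4": {"x3", "x4", "x7"},
          "x5": {"x5", "x6"}, "x6": {"x5", "x6"}, "x7": {"x3", "x4", "x7"}}
D_minus = {"x1": {"x1"}, "x2": {"x2"},
           "x3": {"x3", "x4", "x7"}, "x4": {"x3", "x4", "x7"},
           "x5": {"x2", "x5", "x6"}, "x6": {"x2", "x5", "x6"},
           "x7": {"x3", "x4", "x7"}}

lower_up = {x for x in U if D_plus[x] <= Y2_up}
upper_up = {x for x in U if D_minus[x] & Y2_up}
lower_down = {x for x in U if D_minus[x] <= Y1_down}
upper_down = {x for x in U if D_plus[x] & Y1_down}

print(sorted(lower_up), sorted(upper_up))
# ['x5', 'x6'] ['x3', 'x4', 'x5', 'x6', 'x7']
print(sorted(lower_down), sorted(upper_down))
# ['x1', 'x2'] ['x1', 'x2', 'x3', 'x4', 'x7']
```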
For every subset of attributes P⊆C, the quality of approximation of the decision classes Y with respect to the attributes P, \(\gamma_{P}(Y)\), is defined as the proportion, among all objects in U, of objects consistently defined with respect to the attributes P and the decision classes Y.
$${} \gamma_{P}(Y) = \frac {\left | U - \left\{\bigcup \limits_{t \in T} BND_{P} Y_{t}^{\leq} \right\} \right |} {|U|} = \frac {\left | U - \left\{\bigcup \limits_{t \in T} BND_{P} Y_{t}^{\geq} \right\} \right |} {|U|} $$
The variable consistency DRSA
The variable consistency DRSA (VC-DRSA) allows the decision maker to relax the strictness of the dominance relation, thus accepting a limited number of inconsistent objects in the lower approximation, according to an object consistency level threshold, l∈(0,1]. In practice, by selecting this consistency level l, a patient x∈U becomes a member of the lower approximation of a given upward union if at least l×100 % of the patients dominating x also belong to that decision class. By allowing inconsistencies, the VC-DRSA avoids overfitting the training set and thus may be more effective in classifying new cases.
The lower approximations of the VC-DRSA-based model are represented as follows:
$$\begin{array}{@{}rcl@{}} \underline{R}_{P}^{l} \left(Y_{t}^{\geq} \right) & = & \left\{ x \in Y_{t}^{\geq} \colon \frac{| D_{P}^{+} (x) \cap Y_{t}^{\geq} |}{ | D_{P}^{+} (x) |} \geq l \right\}\\ \underline{R}_{P}^{l} \left(Y_{t}^{\leq} \right) & = & \left\{ x \in Y_{t}^{\leq} \colon \frac{| D_{P}^{-} (x) \cap Y_{t}^{\leq} | }{ | D_{P}^{-} (x) |} \geq l \right\} \\ \end{array} $$
Continuing with the example described in Table 1, setting l=0.6 moves the objects \(x_{4}\) and \(x_{7}\), previously included only in the upper approximation \(\overline {R}_{C} \left (Y_{2}^{\geq }\right)\), into the lower approximation of class \(Y_{2}^{\geq }\), i.e. \(\underline {R}_{C}^{0.6} \left (Y_{2}^{\geq } \right) = \left \{ x_{4}, x_{5}, x_{6}, x_{7} \right \}\). This follows from \(\frac {| D_{C}^{+} (x_{i}) \cap Y_{2}^{\geq } |}{ | D_{C}^{+} (x_{i}) |} = \frac {2}{3} \geq l\) for i=4,7, with the ratio equal to 1 for i=5,6.
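The relaxed membership test can be sketched directly from the formula (cones again taken by hand from the worked example, an assumption of this illustration):

```python
l = 0.6
Y2_up = {"x4", "x5", "x6", "x7"}
# D_C^+(x) for the members of Y_2^>= (from the worked example).
D_plus = {"x4": {"x3", "x4", "x7"}, "x5": {"x5", "x6"},
          "x6": {"x5", "x6"}, "x7": {"x3", "x4", "x7"}}

# x joins the VC lower approximation if enough of its dominators share the class.
vc_lower = {x for x in Y2_up
            if len(D_plus[x] & Y2_up) / len(D_plus[x]) >= l}
print(sorted(vc_lower))  # ['x4', 'x5', 'x6', 'x7']
```

At l = 1.0 the same comprehension recovers the strict DRSA lower approximation {x5, x6}.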
Decision rules
There are a number of methods available for the induction of decision rules from the lower or upper approximations of the decision classes [40–42] or from reducts extracted from the decision table [43]. Decision rules in this study were obtained using the MODLEM [40, 41] and VC-DomLEM [42] algorithms for the induction of classical and dominance-based rough set rules, respectively. In both cases, decision rules are induced from approximations of decision classes. Both the MODLEM and VC-DomLEM algorithms generate a minimal set of decision rules using a minimal number of rule conditions; thus the inclusion of MODLEM allows for an evaluation of the impact of accounting for the preference-order information in the VC-DRSA. Once decision rules have been induced, the collection of these rules can then be used to classify unseen objects—in the case of our example table, a new patient who may have coronary disease.
A decision rule has the form if A then B, or A→B, where A is called the antecedent and B the consequent of the rule. The antecedent is a logical conjunction of descriptors and the consequent is the decision class or union of decision classes suggested by the rule.
Formally, in the CRSA, decision rules are generated from the lower or upper approximations. For example, for an approximation containing objects with descriptors r with respect to a set of condition attributes \(B_{r} \subseteq C\), a decision rule is expressed as
$$\mathit{if} \: \bigwedge_{i} \left(\; f(x, a_{i}) = r_{a_{i}} \right) \: \mathit{then} \: x \in Y_{t} $$
where \(a_{i} \in B_{r}\) is an attribute found in the attribute set \(B_{r}\), and \(r_{a_{i}} \in V_{a_{i}}\) and \(Y_{t}\) are the attribute values and a decision class, respectively, of the objects in the rule-generating approximation. From our example in Table 1, a decision rule induced with MODLEM from the lower approximation \(\underline {R}_{\{Age, Smoker\}} (Y_{\mathit {Yes}}) = \{ x_{5}, x_{6} \}\) would be: if Age = H and Smoker = Yes then Coronary Disease = Yes.
In the DRSA, decision rules are induced from the lower approximations and the boundaries of the unions of decision classes. From the lower approximations, two types of decision rules are considered. Decision rules generated from the P-lower approximation of the upward union of decision classes \(Y_{t}^{\geq }\) are described by
$$\mathit{if} \: \left(\bigwedge_{i} \left(\;f(x, b_{i}) \geq r_{b_{i}} \right) \right) \bigwedge \left(\bigwedge_{j} \left(\;f(x, a_{j}) = r_{a_{j}} \right) \right) \: \mathit{then} \: x \in Y_{t}^{\geq} $$
where \(b_{i} \in P\) are criteria, \(a_{j} \in P\) are attributes, \(r_{b_{i}} \in V_{b_{i}}\) and \(r_{a_{j}} \in V_{a_{j}}\). From the example in Table 1, the P-lower approximation of the upward union of decision classes, \(\underline {R}_{C} \left (Y_{2}^{\geq } \right) = \{x_{5}, x_{6}\}\), leads to the following decision rule:
Decision rules generated from the P-lower approximation of the downward union of classes \(Y_{t}^{\leq }\) are described by
$$\mathit{if} \: \left(\bigwedge_{i} \left(f(x, b_{i}) \leq r_{b_{i}} \right) \right) \bigwedge \left(\bigwedge_{j} \left(f(x, a_{j}) = r_{a_{j}} \right) \right) \: \mathit{then} \: x \in Y_{t}^{\leq} $$
where \(b_{i} \in P\) are criteria, \(a_{j} \in P\) are attributes, \(r_{b_{i}} \in V_{b_{i}}\) and \(r_{a_{j}} \in V_{a_{j}}\). From the example in Table 1, the P-lower approximation of the downward union of classes, \(\underline {R}_{C} \left (Y_{1}^{\leq } \right) = \{x_{1}, x_{2}\}\), leads to the following decision rules:
-
If Gender = F and Age ≤ H and SystBP ≤ M and HDL ≤ L and Diabetic ≤ No and Smoker ≤ Yes, then Coronary Disease = No
-
If Gender = M and Age ≤ H and SystBP ≤ M and HDL ≤ L and Diabetic ≤ No and Smoker ≤ Yes, then Coronary Disease = No
The boundaries \(BND_{P} Y_{t}^{\geq }\) and \(BND_{P} Y_{t}^{\leq }\) generate rules of the following form
$$\begin{array}{@{}rcl@{}} \mathit{if} \: \left(\bigwedge_{i} \left(\;f(x, b_{i}) \geq r_{b_{i}} \right) \right) \bigwedge \left(\bigwedge_{j} \left(f(x, b_{j}) \leq r_{b_{j}} \right) \right) \\ \bigwedge \left(\bigwedge_{k} \left(\;f(x, a_{k}) = r_{a_{k}} \right) \right) \: \mathit{then} \: x \in Y_{t} \cup Y_{t+1} \cup \ldots \cup Y_{s} \end{array} $$
where \(b_{i}, b_{j} \in P\) are criteria, \(a_{k} \in P\) are attributes, \(r_{b_{i}} \in V_{b_{i}}\), \(r_{b_{j}} \in V_{b_{j}}\) and \(r_{a_{k}} \in V_{a_{k}}\) (note that i and j are not necessarily different). From the example in Table 1, the boundary of the decision class unions, \(BND_{C} Y_{2}^{\geq } = BND_{C} Y_{1}^{\leq } = \{x_{3}, x_{4}, x_{7}\}\), leads to the following decision rule:
The MODLEM and the VC-DomLEM algorithms utilize a heuristic strategy called sequential covering [44] to iteratively construct a minimal set of minimal decision rules. The sequential covering strategy successively constructs a set of decision rules for each upward and downward union of decision classes in a training set by selecting, at each iteration, the “best” decision rule, after which the training objects described by the rule conditions are removed. Subsequent iterations again select the best decision rule and remove the covered objects until a stopping criterion is reached or until all of the objects in the unions of decision classes are described by a rule in the rule set.
To ensure minimality, antecedent descriptors, called elementary conditions, of each rule are checked at each iteration and redundant elementary conditions are removed. Similarly, redundant rules are removed from the final rule set.
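The covering loop itself is simple; the difficulty lies in how each algorithm grows the "best" rule. A schematic sketch of the loop (the rule-growing step is abstracted into a fixed candidate list with a coverage-based "best" criterion, which is an illustrative simplification of MODLEM/VC-DomLEM):

```python
def sequential_covering(targets, candidates, covers):
    """Greedily pick rules until every target object is covered.
    covers maps each candidate rule to the set of objects it describes."""
    remaining, rule_set = set(targets), []
    while remaining:
        # "best" rule here = the one covering the most still-uncovered objects
        best = max(candidates, key=lambda r: len(covers[r] & remaining))
        if not covers[best] & remaining:
            break  # stopping criterion: no candidate covers anything new
        rule_set.append(best)
        remaining -= covers[best]
    return rule_set

# Toy candidates targeting the upward union {x4, x5, x6, x7} of Table 1.
covers = {"r1": {"x5", "x6"}, "r2": {"x4", "x7"}, "r3": {"x4"}}
print(sequential_covering({"x4", "x5", "x6", "x7"}, ["r1", "r2", "r3"], covers))
# ['r1', 'r2']
```

The redundant candidate r3 is never selected, mirroring the minimality checks described above.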
In both algorithms, decision rules are grown by consecutively adding the best available elementary condition to the rule. CRSA elementary conditions are evaluated in the MODLEM algorithm in terms of either the class entropy measure [45] or Laplacian accuracy [46]; the former was used in this study. MODLEM does not restrict elementary conditions to those attributes not currently in the rule; as such, multiple elementary conditions may contain the same attribute. Therefore, a decision rule induced by MODLEM may contain antecedents in which attribute values are described as belonging to a range or a set of values, or as being greater or less than a particular value.
Dominance-based elementary conditions are evaluated according to a rule consistency measure. VC-DomLEM provides three such measures; the rule consistency measure used in this study is μ, as described in [47]. For the sake of clarity, \(Y_{t}\) shall be used to represent an individual decision class in the CRSA or, alternatively, an upward or downward union of decision classes, \(Y^{\geq }_{t}\) or \(Y^{\leq }_{t}\), with respect to the DRSA. The consistency, μ, of a proposed rule \(r_{Y_{t}}\) suggesting assignment to \(Y_{t}\) is defined as
$$\mu(r_{Y_{t}}) = \frac{\left| \left[ \Phi (r_{Y_{t}}) \right] \cap \underline{Y}_{t} \right|} {\left| \left[\Phi(r_{Y_{t}})\right] \right|}. $$
Here \(\left [\Phi (r_{Y_{t}})\right ]\) indicates the set of objects described by the elementary conditions in \(r_{Y_{t}}\). The elementary condition, ec, selected for inclusion is the one that leads to the highest rule consistency measure \(\mu (r_{Y_{t}} \cup ec)\) when combined with the current set of elementary conditions in the proposed rule. In the event of a tie, the elementary condition providing the greatest coverage of the new rule, \(\left | \left [ \Phi (r_{Y_{t}} \cup ec) \right ] \cap \underline {Y}_{t} \right |\), is selected. The rule consistency measure μ was also implemented in MODLEM to relax consistency requirements and to allow more general rules to be induced. For further details on the MODLEM and VC-DomLEM algorithms, the reader is referred to [40–42, 47].
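As a concrete check of the μ measure on Table 1, the sketch below evaluates the example rule "Age = H and Smoker = Yes" against the lower approximation of Y_Yes (the `covered` helper, the attribute layout, and the column order are assumptions of this illustration):

```python
C = ["Gender", "Age", "SystBP", "HDL", "Diabetic", "Smoker"]
table = {
    "x1": ("F", "H", "M", "L", "No", "No"),
    "x2": ("M", "L", "L", "L", "No", "Yes"),
    "x3": ("F", "M", "M", "H", "No", "No"),
    "x4": ("F", "M", "M", "H", "No", "No"),
    "x5": ("M", "H", "H", "L", "Yes", "Yes"),
    "x6": ("M", "H", "H", "L", "Yes", "Yes"),
    "x7": ("F", "M", "M", "H", "No", "No"),
}

def covered(conditions):
    """[Phi(r)]: objects matching every (attribute, value) elementary condition."""
    return {x for x, vals in table.items()
            if all(vals[C.index(a)] == v for a, v in conditions)}

approx = {"x5", "x6"}  # rule-generating approximation for Y_Yes
phi = covered([("Age", "H"), ("Smoker", "Yes")])
mu = len(phi & approx) / len(phi)
print(sorted(phi), mu)  # ['x5', 'x6'] 1.0
```

Dropping the Smoker condition would also cover x1 and lower μ to 2/3, which is why the greedy growth step keeps adding conditions.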
To classify an unseen object, a standard voting process [43] is used to allow all rules to participate in the decision process, arriving at a patient classification by majority vote. Each rule is characterized by two support metrics. The left hand side (LHS) support is the number of patients in the table whose attributes match the antecedent, i.e. \(| [\Phi(r)] |\), while the right hand side (RHS) support indicates the number of patients matching both the antecedent and the consequent of the rule, i.e. \(| [\Phi(r)] \cap Y_{t} |\). For a new, unseen patient, any rule whose antecedent descriptors match the patient descriptors “fires”, contributing as votes the RHS support for each decision class. For example, drawing on the example in Table 1, the decision rule if Age = H and Smoker = Yes, then Coronary Disease = Yes has LHS = 2, since its antecedent matches patients \(x_{5}\) and \(x_{6}\), and RHS = 2, since its antecedent and consequent match the same patients. A new patient matching the antecedent of this rule will receive two votes for decision class Yes and zero votes for decision class No.
Once all rules have “voted”, the number of votes for each decision class is normalized against the total LHS support of all fired rules. The resultant ratio of RHS to LHS support is considered a frequency-based estimate of the probability that the patient belongs to the given decision class.
A final classification is therefore determined according to a threshold value, τ∈[ 0,1]. A patient is classified as not surviving six months if the estimated probability of death in six months is greater than τ. In the event of an estimated probability equal to τ, or in the absence of any fired rules (no rule matches the patient profile), classification is not possible and the patient is labeled undefined. For example, if the threshold value is set as 0.5 and the voting process yields an estimated probability of 70 %, then the patient is classified as not surviving the six month period.
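The voting scheme can be sketched as follows (the rule list and patient profiles are illustrative assumptions; the decision-class labels follow the Table 1 example rather than the survival outcome):

```python
def classify(patient, rules, tau=0.5):
    """Majority vote: each fired rule contributes its RHS support as votes."""
    votes = {"Yes": 0, "No": 0}
    total_lhs = 0
    for antecedent, decision, lhs, rhs in rules:
        if all(patient.get(a) == v for a, v in antecedent.items()):  # rule fires
            votes[decision] += rhs
            total_lhs += lhs
    if total_lhs == 0:
        return "undefined"  # no rule matches the patient profile
    p_yes = votes["Yes"] / total_lhs  # frequency-based probability estimate
    if p_yes == tau:
        return "undefined"  # tie with the threshold
    return "Yes" if p_yes > tau else "No"

# (antecedent, decision class, LHS support, RHS support) -- illustrative rules
rules = [({"Age": "H", "Smoker": "Yes"}, "Yes", 2, 2),
         ({"Age": "L"}, "No", 1, 1)]
print(classify({"Age": "H", "Smoker": "Yes"}, rules))  # Yes
print(classify({"Age": "M", "Smoker": "No"}, rules))   # undefined
```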
Experimental design
This section provides details on the implementation and performance evaluation procedures for the comparison of the classification methods used in this study. The following two sections describe the RSA-based and comparative methods, respectively; the software used for their implementation; and the selection of appropriate parameters for each of the methods. Finally, the methods for performance evaluation are discussed.
The general schema of the experimental design is as follows: after selecting appropriate parameters for each of the methods, 5-fold cross validation was used to divide the data into training and testing sets. Methods with decision rule outputs were trained and tested on the discretized data set to demonstrate the expected performance of a clinically credible rule set. Methods without decision rule outputs were trained on the raw, non-discretized data set. For these methods, designed to be applied to continuous variables, discretization does not improve clinical credibility and would likely hinder performance [51, 52].
Rough set rule induction and classification
MODLEM algorithm for CRSA decision rules
CRSA decision rules were obtained using the MODLEM algorithm as described in [40] and [41], implemented by the authors in the R programming language [53]. Decision rules were generated from the lower approximations with a rule consistency level μ ≥ m. The rule syntax follows the presentation in section Decision rules.
VC-DomLEM algorithm for VC-DRSA decision rules
Dominance-based rules were obtained using the VC-DRSA as described in section The variable consistency DRSA and the VC-DomLEM algorithm as implemented in jMAF [54]. VC-DomLEM decision rules were generated from the lower approximation of each decision class, with an object consistency level threshold l. The syntax of the VC-DRSA decision rules is as shown in section Decision rules. Only decision rules with a rule consistency measure μ greater than the rule consistency threshold l are included in the classification model. Note that the rule consistency threshold and the object consistency threshold are equal, both set at l.
Parameter selection
In order to select the most appropriate models for comparison, the performance of the rough set based models was evaluated for varying levels of rule consistency, m and l, for the CRSA and VC-DRSA respectively. Classifier performance at a particular value of m or l is dataset-dependent; however, in general, values close to one provide rule sets that are more conservative in describing the training set objects, while values closer to zero provide rule sets that are more general. Thus, to find the appropriate balance between strict, descriptive models that are prone to overfitting and overly general models that provide little useful information, the RSA models were evaluated at m, l ∈ {0.1, 0.2, 0.4, 0.6, 0.8, 1.0}.
Comparative methods
To evaluate the performance of the RSA-based prognostic models, logistic regression, SVM, and RF were applied to the non-discretized SUPPORT dataset. To ensure directly comparable rule sets, C4.5 was applied to the discretized SUPPORT dataset. Each of these methodologies was applied using the software package Weka 3.6.9 [55], within which appropriate parameters were selected for SVM, C4.5 and RF using GridSearch with 10-fold cross validation settings. Logistic regression was selected for its popularity in classification models using non-censored data and in clinical settings [18, 56].
Support vector machines, originally presented in [57], find separating boundaries between decision classes after input vectors are non-linearly mapped into a high-dimensional feature space. Support vector machines have been investigated in survival analysis applications [58] as they—similar to the RSA-based methods—automatically incorporate non-linearities and do not make a priori assumptions about factor interactions. SVM-based models are known to perform well at classification tasks; however, they do not provide clinician-interpretable justification for their results [59]. Support vector machines were selected to evaluate whether the increased accessibility of the RSA-based methods involves a trade-off in accuracy.
C4.5 is a well-known algorithm for generating a decision tree using information entropy to select the best splitting criterion at each node [60]. A decision tree built by C4.5 can be expressed as a set of if-then decision rules, thus providing a comparative decision rule based method. Decision trees were obtained using the Weka J48 implementation [60] of the C4.5 algorithm.
Random forests is a popular ensemble classification method based on decision trees [61]. The random forests algorithm builds an ensemble of decision trees, where each tree is built on a bootstrap sample of the training data with a randomly selected subset of factors.
The performance of the models was tested by measuring the discriminatory power of both the m- and l-consistent decision rule sets when applied to the reserved testing data. For our notation, a classification of d.6months = Yes is referred to as positive, and d.6months = No as negative. Sensitivity is defined as the fraction of patients who did not survive six months and are correctly classified by the model, i.e. the fraction of true positive classifications among all test patients who did not survive six months. Conversely, specificity is defined as the fraction of patients who did survive six months and were correctly classified by the model, i.e. the fraction of true negatives among all test patients who did survive six months.
The overall accuracy of the classification models is reported in terms of area under the receiver operating characteristic (ROC) curve, or AUC (area under the curve). The ROC curve graphs the sensitivity of the classifier, or the true positive rate, versus 1 − specificity, the false positive rate, as the threshold probability, τ, for positive classification is varied from 0 to 1. The best overall classification performance is realized when AUC is equal to 1, while an AUC of 0.5 indicates a classifier performance no better than random selection. Best separation between decision classes is realized at the threshold corresponding to the point on the ROC curve closest to the point (0,1).
In order to select the most appropriate MODLEM and VC-DomLEM-based models for comparison, two performance issues related to the generated rule set were considered: coverage and AUC of the model. The coverage of the classification model is defined as the percentage of testing set patients for whom a classification is possible. Additionally, to evaluate the number of rules that would fire for an unseen patient, we collected information on the number of rules matching each test case patient for the evaluated levels of m and l.
Cohen’s Kappa coefficient was computed for both the selected RSA-based models and the comparative models [62]. Cohen’s Kappa coefficient is designed to measure the agreement between two classification methods, but it is commonly used to measure model performance by comparing a classifier with a random allocation of patients among the decision classes. A value of zero indicates classification accuracy equivalent to chance, i.e. no agreement beyond chance.
Performance of the prognostic models was evaluated using a 5-fold cross validation procedure [63], wherein training and testing sets are repeatedly selected. Cross validation is a well-known method that provides a reasonable estimate of the generalization error of a prediction model. In 5-fold cross validation, the entire dataset is randomly divided into five subsets, or folds, and then each fold (20 % of the dataset) is used once as a testing set, with the remaining folds (80 %) used for training.
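The fold construction described above can be sketched as follows (a generic illustration, not the authors' code; the object count and seed are arbitrary):

```python
import random

def five_fold_splits(n_objects, seed=0):
    """Return five (train, test) index partitions covering every object once."""
    idx = list(range(n_objects))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::5] for i in range(5)]  # five disjoint ~20 % slices
    return [(sorted(set(idx) - set(fold)), sorted(fold)) for fold in folds]

splits = five_fold_splits(35)
print(len(splits), len(splits[0][0]), len(splits[0][1]))  # 5 28 7
```

Each object appears in exactly one testing fold, so the five test-set evaluations together cover the whole dataset.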