## If-then Rules Search in Data: The Main Problems

*V.A.Duke, Argos Group LLC*

## Abstract

Popular Data Mining systems use decision-tree algorithms and constrained combinatorial sorting. The efficiency of these algorithms is usually estimated by the final results on an independent validation dataset or with the help of cross-validation procedures. At the same time, some principal drawbacks of these approaches tend to be overlooked, and they considerably limit their value in solving practical tasks. The article analyses the main problems of if-then rules search in data: the problem of attribute segmentation, the combinatorial problem, the lack of criteria for estimating individual rules, the problem of false laws in multidimensional data, etc.

## General Terms

Algorithms, Performance, Theory

## Keywords

Knowledge discovery, if-then rules

## 1. Introduction

Data analysis systems belonging to the area of Data Mining and KDD (Knowledge Discovery in Databases) include if-then rules search systems as an important component. Such systems help to solve problems of forecasting, classification, pattern recognition, database segmentation, extraction of hidden knowledge from data, data interpretation, finding associations in databases, etc. If-then rules search methods place minimal requirements on data types and can be applied to processing heterogeneous information. Their results are transparent for perception.

The popular approaches in this class of analytical systems use decision-tree algorithms and constrained combinatorial sorting. The efficiency of these algorithms is usually estimated by the final results on an independent validation set or with the help of cross-validation procedures. At the same time, some principal drawbacks of these approaches tend to be overlooked, and they considerably limit their value in solving practical tasks.

In the present article we will try to reveal some unsolved problems and explain their causes. In subsequent articles possible prospects will be considered, taking the latest DataViewSion system as an example.

We will use three tests in order to illustrate the emerging problems.

## 2. Tests

### TEST 1 Ability to solve obvious tasks

This test poses the task of dividing into 2 classes a set of objects evenly distributed on the plane within a square (Fig. 1). The square is divided into 4 fields by the lines passing through the midpoints of its sides. Each class occupies 2 fields, symmetric with respect to one of the square's diagonals. The specific feature of this data configuration is that attributes 1 and 2 taken separately, or intervals of these attributes, have no independent discriminating power.

The solution of this test task is obvious. Each class is described by 2 logical rules (4 rules in total):

**IF** (X1 > 4) & (X2 < 5) **THEN** Class 1 (crosses)

**IF** (X1 < 5) & (X2 > 4) **THEN** Class 1 (crosses)

**IF** (X1 < 5) & (X2 < 5) **THEN** Class 2 (zeros)

**IF** (X1 > 4) & (X2 > 4) **THEN** Class 2 (zeros)

Figure 1. Distribution of the objects on the plane of the analyzed attributes
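Under the assumption that the attributes take integer values 0–9 (so that the thresholds 4 and 5 partition the square cleanly), the test-1 configuration and the four rules can be sketched as follows; all names here are illustrative:

```python
import random

# Sketch of Test 1, assuming integer attribute values 0..9 so that the
# thresholds 4 and 5 in the rules partition the square cleanly.
random.seed(1)
points = [(random.randint(0, 9), random.randint(0, 9)) for _ in range(400)]

def true_class(x1, x2):
    # Crosses (class 1) occupy the two fields symmetric about a diagonal.
    return 1 if (x1 >= 5) != (x2 >= 5) else 2

def rules_class(x1, x2):
    # The four if-then rules of Test 1.
    if x1 > 4 and x2 < 5:
        return 1
    if x1 < 5 and x2 > 4:
        return 1
    if x1 < 5 and x2 < 5:
        return 2
    return 2  # remaining case: x1 > 4 and x2 > 4

# The four rules reproduce the ground truth on every object.
assert all(rules_class(x, y) == true_class(x, y) for x, y in points)
```

Note that no single attribute taken alone separates the classes here, which is exactly what defeats the first stage of the algorithms tested below.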

### TEST 2 Ability of finding the most complete and accurate rules

This test and others like it are constructed on the following principle.

An object-attribute matrix of size N×p (N is the number of objects, p the number of attributes) is filled with zeros and ones (or any other binary symbols), distributed randomly and evenly. In this matrix we choose some fragments of rows of different lengths (combinations of attribute values), and each fragment is duplicated in the matrix strictly vertically a certain number of times. This creates subgroups of objects described with 100% accuracy by a known logical rule. The subgroup sets are united into classes that are subject to recognition. To make the experiment true to life, the columns and rows of the resulting matrix are randomly permuted. The task is to find the introduced *known* rules in the data matrix.
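A minimal sketch of this construction, before the final row/column permutation (sizes and the 7-attribute pattern are illustrative assumptions):

```python
import random

# Sketch of the Test 2 construction: a random binary matrix with one known
# value combination implanted ("duplicated strictly vertically") into a
# subgroup of rows.
random.seed(2)
N, p = 400, 100
data = [[random.choice("AB") for _ in range(p)] for _ in range(N)]

attrs = random.sample(range(p), 7)              # attributes of the known rule
pattern = [random.choice("AB") for _ in attrs]  # their fixed values
for row in range(50):                           # a subgroup of 50 objects
    for a, v in zip(attrs, pattern):
        data[row][a] = v

# Every implanted row satisfies the known rule with 100% accuracy.
assert all(all(data[r][a] == v for a, v in zip(attrs, pattern))
           for r in range(50))
```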

In test 2 in particular the data table has the following characteristics: 400 objects (100 of them belong to the first class (class L), 100 to the second (class K), and 200 objects (class M) have randomly distributed values) and 100 binary attributes taking the value of either **A** or **B**. It is required to find 4 known logical rules, 2 rules for each class. These rules are combinations of 7 to 15 elementary logical events. For a fragment of the table see Fig. 2.

The task of test 2 is by far not the most complicated of those that we come across in practice. For example, 100 binary attributes can appear in an analysis that deals with only 10 initial quantitative attributes, if each falls into 10 intervals during the search for logical laws. Real tasks often contain hundreds and even thousands of quantitative and ordinal attributes, while the logical laws can be combinations of dozens or hundreds of elementary events. If a system cannot find rules of unlimited length covering the maximum possible number of objects of their own class, the analyst risks drowning in a sea of splinters of logical rules.

Figure 2. A small fragment of binary test data (the sought combinations of attribute values are highlighted)

### TEST 3 False laws

The increase in the number of analyzed attributes in a dataset of restricted volume is accompanied by a considerably higher risk of revealing false laws. This problem is not well developed in science, although a number of authors point out its great importance at present. We know only of certain attempts to suggest mathematical formulas for calculating the statistical significance of some mathematical models resulting from the analysis of high-dimensional empirical material. Nevertheless, these attempts, carried out within the framework of the multiple-comparisons problem, are extremely domain-specific and cannot be considered general.

Test 3 and others like it are necessary to estimate the feasibility of logical rules for high-dimensional data. It is a table of randomly distributed values generated according to the Bernoulli scheme. In our case, we present binary data (each attribute takes the value **A** or **B** with probability 0.5) to the different analysis algorithms.
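Such a random table is trivial to generate; the sizes below are one of the variants mentioned later in the text:

```python
import random

# Sketch of Test 3: a purely random table by the Bernoulli scheme with
# P(A) = P(B) = 0.5, split into two nominal classes.
random.seed(3)
n_objects, n_attrs = 200, 100
table = [[random.choice("AB") for _ in range(n_attrs)]
         for _ in range(n_objects)]
labels = ["L"] * 100 + ["K"] * 100  # class labels carry no real information

# Overall the symbols are close to evenly distributed.
total_a = sum(row.count("A") for row in table)
assert 0.45 < total_a / (n_objects * n_attrs) < 0.55
```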

In the sections below we consider the most popular approaches in Data Mining, present the test tasks to the relevant algorithms, and examine the reasons for their unsatisfactory performance.

## 3. Test results

Three popular if-then rules search systems were chosen for the test: See5/C5.0 (RuleQuest, Australia) and AnswerTree (SPSS), both based on decision trees, and WizWhy (WizSoft), based on constrained combinatorial sorting.

### Testing the decision trees building systems

**Test 1** turns out to be unpassable for decision-tree building algorithms. As was already pointed out, attributes 1 and 2, or any of their intervals regarded separately, are not able to divide the crosses from the zeros. So already at the first stage of choosing the best attribute these algorithms give up the search for any logical rule altogether.

The results of **test 2**, passed to the AnswerTree v. 2.1 system, are portrayed in Fig. 3. The recognized classes are marked by the letters **L** and **K**, the 100 analyzed attributes are marked as **a1, …, a100**, and each attribute can take the 2 values **A** and **B**. The test is simplified: the 200 randomly generated objects are excluded from the data matrix.

As seen in the figure, the AnswerTree system found 7 rules. They are situated at the ends of the branches of the tree that was built and are circled with ovals. However, only 2 rules can be considered more or less satisfactory with respect to their accuracy and completeness of covering the objects of their own class. Thus, rule 2

**IF** (a62=) and (a72=) **THEN** class **L**

covers with 100% accuracy 39 objects (rows) of class **L** out of 100. In its turn, rule 7

**IF** (a62=) and (a89=) and (a84=) and (a91=) **THEN** class **K**

refers 57 objects to class **K**, committing one error. So, in general, test 2 remains unsolved. The system was not able to find the 4 rules that were known to the examiner and that covered all the objects of the recognized classes with 100% accuracy. Similarly, this test is not passed by other systems that use different decision-tree building algorithms.

To run **test 3** (false laws) the See5 v. 1.16 system was used. It was presented with a table containing random data (200 binary attributes, 4000 objects, 2 classes **L** and **K** with 2000 objects each). The system found 72 if-then rules in the data, some of which can be seen in Table 1.

In general, the See5 system classified the random data in the table on the basis of the revealed logical rules with an error probability of 26.6%.

**Table 1. Some rules revealed by the See5 system in random data**

| Rule | Conditions | Conclusion |
| --- | --- | --- |
| Rule 1 (23/1, lift 1.8) | a56 = A, a72 = A, a74 = B, a133 = A, a140 = B, a158 = B, a180 = B | class L |
| Rule 28 (24/2, lift 1.8) | a47 = B, a74 = A, a114 = B, a119 = A, a127 = A, a177 = A, a194 = A | class K [0.885] |
| Rule 32 (32/4, lift 1.7) | a3 = A, a4 = A, a39 = B, a140 = A, a157 = B, a171 = B, a182 = B | class K [0.853] |

In Table 1, considering the essence of the presented test, we should pay attention to the figures in square brackets for each rule. These figures show how accurate the rule is, estimated with the help of the Laplace ratio (n − m + 1)/(n + 2), in which n is the number of objects covered by the rule and m is the number of objects wrongly covered by it. As these figures show, the rules that were found are satisfactory in terms of their accuracy (no other rule evaluation is offered by the system). The user may therefore think that the discovered rules reflect true laws in the data and constitute new knowledge. But this is an illusion. It is known a priori that these are the results of processing a table with a random distribution of the symbols **A** and **B**. Similar illusions of new knowledge arise when the See5 decision-tree building system is tested with other variants of test 3, containing different numbers of attributes and objects.
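The bracketed accuracies in Table 1 can be reproduced in a few lines (the function name is illustrative):

```python
# The Laplace ratio used by See5 to score a rule: n objects covered in
# total, m of them covered wrongly.
def laplace_ratio(n, m):
    return (n - m + 1) / (n + 2)

# Reproduces the bracketed figures of Table 1 for rules 28 (24/2) and 32 (32/4).
assert round(laplace_ratio(24, 2), 3) == 0.885
assert round(laplace_ratio(32, 4), 3) == 0.853
```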

Fig. 3. Decision tree, built by the AnswerTree system according to the data of test 2.

It is worth pointing out that the systems under consideration really function with the high speed advertised by their developers; they have convenient interface, well-developed means of initial data manipulation, etc. However, the means mentioned above seem to become less attractive when we start realizing the principal limits of the used analytical approach.

### Testing the system that uses constrained combinatorial sorting

Constrained combinatorial sorting algorithms were suggested in the mid-1960s, and since then they have demonstrated their efficiency in solving many tasks in completely different areas. These algorithms calculate the frequencies of combinations of simple logical events in data subgroups (classes). Examples of simple logical events are the following: X = C1; X < C2; X > C3; C4 < X < C5, etc., where X is some parameter (field) and the Ci are constants. The constraint is the length of the combination of simple logical events. By comparing the calculated frequencies in different data subgroups, a conclusion is drawn about how useful this or that combination is for finding associations in the data, classifying, forecasting, etc.
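On binary data the frequency-counting core of this approach can be sketched as follows (a toy version; the length limit `max_len` plays the role of the constraint):

```python
from itertools import combinations

# Count, per class, how often each combination of elementary events
# (attribute == value) occurs, limiting the combination length.
def count_combinations(rows, labels, max_len=2):
    counts = {}  # (class, combination) -> frequency
    for row, label in zip(rows, labels):
        events = tuple((i, v) for i, v in enumerate(row))
        for k in range(1, max_len + 1):
            for combo in combinations(events, k):
                counts[(label, combo)] = counts.get((label, combo), 0) + 1
    return counts

rows = [["A", "B"], ["A", "A"], ["B", "B"]]
labels = ["L", "L", "K"]
counts = count_combinations(rows, labels)
# The event (attribute 0 == "A") occurs twice in class L and never in class K.
assert counts[("L", ((0, "A"),))] == 2
assert ("K", ((0, "A"),)) not in counts
```

The exponential growth of the number of combinations with `max_len` is exactly why the length must be constrained in practice.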

The WizWhy system is a modern representative of the constrained combinatorial sorting approach. Although the system developers do not reveal the specifics of the algorithm underlying WizWhy, the results of a thorough system test lead to the conclusion that constrained combinatorial sorting is present in it (this is confirmed by the study of the results, of the dependency of the running time on the number of analyzed parameters, etc.). WizWhy seems to use constrained combinatorial sorting in a modified version, together with an additional algorithm of the Apriori kind. The WizWhy authors stress the following general features of the system:

- Discovering ALL if-then rules;
- Calculating the error probability for each rule;
- Defining the best segmentation of numerical variables;
- Calculating the forecasting power of each attribute;
- Revealing unusual phenomena in data;
- Using the discovered rules for forecasting;
- Expressing the forecast in the form of a relevant rules list;
- Calculating the forecast error;
- Forecasting with the error cost in mind.

Talking about the advantages of WizWhy, the authors also point out the following:

- The system forecasts are not influenced by any subjective reasons;
- The system users do not need any special knowledge of applied statistics;
- The calculations are more accurate and faster than those of other Data Mining methods.

To make their point more convincing, the WizWhy authors oppose their system to the neural network approach and to decision-tree building algorithms, and affirm that, having superior characteristics, WizWhy ousts the other program products on the Data Mining market.

The developers' most audacious statements about the system's characteristics are easily refuted with the help of our 3 tests: on the ability to solve obvious tasks, on the ability to find the most complete and accurate logical rules, and on leaving aside false laws in multidimensional data.

Thus, the WizWhy system categorically refuses to find any logical rule in test 1. This has to do with the fact that the claim of "Defining the best segmentation of numerical variables" is unfounded. Here is the typical mistake, committed whenever there is no understanding that a part is not the same as the whole: the developers try to find the best way to divide quantitative attributes into intervals while regarding each attribute in isolation from the whole system of attributes.

In test 2 the WizWhy system (with its default settings) was able to find only one more or less complete logical statement, which correctly refers to class **K** 52 of the 53 covered objects (records):

**IF** (a2 = B) and (a6 = B) and (a7 = A) and (a34 = B) and (a61 = A) **THEN** class = **K**.

The reason for this obvious failure is that the system can find logical rules containing no more than 6 elementary events. This is the very limit of constrained combinatorial sorting in action: the presented test requires finding logical rules, introduced by the examiner, that include up to 15 elementary events (ai = A or B).

It is also necessary to point out that the process of searching for logical laws is very time-consuming. It took more than 3 hours to get the results of test 2 on an Intel Pentium III computer at 733 MHz. Moreover, besides the one valuable rule mentioned above, the WizWhy system also presented several thousand rules that were much less accurate and less complete.

The WizWhy system's reaction to test 3 (false laws) is of special interest, as the system developers calculate the significance level **α** for each rule that is found. Here they use the following formula (found in the system User's Guide [1]):

α = Σ_{k=n}^{m} C(m, k) · p^k · (1 − p)^(m − k),

p = M / N,

where

*m* is the number of objects covered by the rule throughout the whole data set;

*n* is the number of objects correctly covered by the rule (covered objects in the rule's own class);

*N* is the volume of the whole data set;

*M* is the volume of the class described by the relevant if-then rule.
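A significance level of this kind can be computed as a binomial tail probability consistent with the quantities m, n, N, M defined above (an assumption on our part; the guide's exact formula is not reproduced here):

```python
from math import comb

# Significance of a rule: the probability of getting n or more objects of the
# target class among the m covered objects by chance, when the class base
# rate is p = M/N (binomial tail).
def rule_significance(m, n, N, M):
    p = M / N
    return sum(comb(m, k) * p**k * (1 - p)**(m - k)
               for k in range(n, m + 1))

# A rule covering 45 objects with accuracy 0.865 (~39 correct) at a base
# rate of 0.5 looks extremely significant when taken in isolation.
alpha = rule_significance(45, 39, 1000, 500)
assert alpha < 1e-5
```

Taken alone such an α is tiny, but when hundreds of thousands of candidate conjunctions are examined, thousands of equally "significant" rules are expected in purely random data; this is the multiple-comparisons effect already mentioned in the description of test 3.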

The WizWhy system was presented with several variants of test 3, from 5 attributes by 100 objects to 200 attributes by 1000 objects; the random binary data tables were divided equally into 2 classes. It hardly makes sense to list all the results here, as there are too many of them. We will confine ourselves to the most vivid examples.

Thus, for the case of 100 attributes by 200 objects the system found 14552 statistically significant if-then rules, with the help of which the object classes were recognized with 95% accuracy. In the other test examples the system revealed even more "statistically significant" logical rules. For example, for the case of 100 attributes by 1000 objects the system report contained the following result:

**IF** (a1 *is* **A**) and (a6 *is* **A**) and (a29 *is* **B**) and (a64 *is* **A**) **THEN** Class *is* **K**

*Rule accuracy:* **0.865**

*The rule covers* **45** *objects.*

*Significance level <* **0.0000001**

The examples speak for themselves. One gets the impression that the WizWhy system, which uses the principle of constrained combinatorial sorting, is directly aimed at finding false laws in random data.

## 4. Conclusions based on the test results

1. In a number of cases the popular Data Mining analytical instruments turn out to be unable to solve even the simplest obvious tasks.

2. The applied approaches to Knowledge discovery in Databases reveal only vague fragments of the true logical laws.

3. The instruments for logical rules search in high-dimensional data are unable to differentiate false laws from stable laws.

Moreover, it can be added that the known systems for if-then rules search do not support the function of generalizing the revealed rules and the function of searching the optimal composition of such rules. At the same time, the mentioned functions are quite essential for building knowledge bases that require an ability of introducing concepts, metaconcepts and semantic relationships on the foundation of many knowledge fragments of the subject matter.

## 5. Main problems

### The problem of the first stage attribute segmentation

No matter what approach is used, all the tested systems make a fundamental mistake already at the first stage of their work.

Decision-tree algorithms naively try to find the best attribute (the tree root) to optimally divide the dataset into parts. No mathematical tricks can correct the main drawback of such an approach, since this attribute is considered in isolation from the whole description system of the multidimensional object (the database record). From our point of view, decision-tree building algorithms do not withstand criticism and are not worth the generous attention paid to them in books and articles.
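The failure on test 1 can be made concrete: on an integer version of the test-1 data (an assumed 10×10 grid), every single-attribute split leaves an exact 50/50 class mix on each side, so no choice of tree root gives any separation:

```python
# Full 10x10 grid of the (assumed integer-valued) test-1 configuration.
pts = [(x, y) for x in range(10) for y in range(10)]
cls = [1 if (x >= 5) != (y >= 5) else 2 for x, y in pts]

for attr in (0, 1):
    for t in range(9):  # every possible single-attribute threshold
        left = [c for p, c in zip(pts, cls) if p[attr] <= t]
        # Each side of any axis-parallel split keeps an exact 50/50 mix.
        assert left.count(1) == left.count(2)
```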

In systems that use the constrained combinatorial sorting approach, the principal mistake of the first stage is connected with the search for the optimal segmentation of quantitative attributes. As was already mentioned when discussing the results of test 1 passed to the WizWhy system, it is impossible to find the best way to divide quantitative attributes into intervals if each attribute is taken separately from the whole system of attributes.

The stated first-stage problem derives from the major, key problem of if-then rules search in data. This key problem is that we actually deal here with the combinatorial task of a complete search over combinations of elementary logical conditions. For a high-dimensional attribute space this task is considered unsolvable within an acceptable time in the immediate future, even with the help of supercomputers. That is why all the known approaches to if-then rules search have to use heuristic restrictions, and their results are quite often just scraps of the true laws in the data, broken pieces of knowledge.

As it is impossible to predict which intervals of the initial attributes (elementary events) will prove best for the sought if-then rules, the first stage of an algorithm that aims at good results should consist in breaking the initial attributes into as many minute intervals as possible (given the available computing capacity). This, certainly, mostly concerns quantitative attributes. There seems to be no other way to solve the task of attribute segmentation.
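In the extreme, a trivial sketch of the finest-grained segmentation takes every distinct observed value of a quantitative attribute as an interval boundary (the function name is illustrative):

```python
# Finest-grained first-stage segmentation: every distinct observed value of
# a quantitative attribute becomes an interval boundary.
def fine_boundaries(values):
    return sorted(set(values))

assert fine_boundaries([3.2, 1.1, 3.2, 0.5]) == [0.5, 1.1, 3.2]
```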

### Criterion for estimating a certain rule

Suppose there is some imaginary algorithm Ψ which can carry out the complete search within an acceptable time. It turns out that even in this case the goal of the search has not yet been formally stated for if-then rules. What is to be considered the optimal decision in if-then rules search? Is the optimal decision unique, or are there many optimal decisions?

Each record in the database can be covered by many different rules of the **Rij** type: IF (condition *i*) THEN (conclusion *j*). Examples of the conditions are X = C1; X < C2; X > C3; C4 < X < C5, etc., where X is some parameter (field) and the Ci are constants.

We will use two characteristics of if-then rules: accuracy and completeness. The **accuracy** of the rule **Rij** is the share of cases satisfying conclusion *j* among the cases satisfying condition *i*. The **completeness** of the rule is the share of cases satisfying condition *i* among those satisfying conclusion *j*.

For some subject areas with a complex system organization it is often impossible to find a single if-then rule that has 100% completeness at a given accuracy for some **Bj**. A whole set of **Rij** rules is needed to cover all, or some necessary majority of, the database records that meet condition **Bj**. It is therefore senseless to set the task of finding one best rule for the whole dataset. As the rules of such a set can differ considerably in completeness, we believe that the efficiency of an if-then rules search system is defined by its ability **to find, within an acceptable time, the best rules (the most complete ones, at the accuracy provided) for each record of the database.** In our opinion, this ability is the major criterion for estimating Data Mining systems.
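The two characteristics can be written down directly (a sketch; the record format and predicates are illustrative):

```python
# Accuracy and completeness of a rule "IF condition THEN conclusion",
# computed over a list of records.
def accuracy_completeness(records, condition, target):
    covered = [r for r in records if condition(r)]
    correct = [r for r in covered if target(r)]
    in_class = [r for r in records if target(r)]
    accuracy = len(correct) / len(covered) if covered else 0.0
    completeness = len(correct) / len(in_class) if in_class else 0.0
    return accuracy, completeness

records = [{"x": 6, "cls": "L"}, {"x": 7, "cls": "L"},
           {"x": 6, "cls": "K"}, {"x": 2, "cls": "L"}]
acc, comp = accuracy_completeness(records,
                                  condition=lambda r: r["x"] > 4,
                                  target=lambda r: r["cls"] == "L")
# The rule "IF x > 4 THEN L" covers 3 records, 2 of them correctly,
# while class L has 3 records in total.
assert (acc, comp) == (2/3, 2/3)
```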

## 6. A few words about genetic algorithms

Many researchers associate the development of analytical methods with evolutionary algorithms. The most popular among them are genetic algorithms, which model the mechanisms of inheritance, variability and selection existing in animate nature.

The first step in building a genetic algorithm is creating an initial set of combinations of elementary logical statements, called chromosomes. The whole set of such combinations is called the chromosome population. Then, to implement the concept of selection, a method of comparing different chromosomes is introduced. The population is processed with such procedures as reproduction, variability (mutation), and genetic recombination. The most important procedures include random mutations in individual chromosomes, crossing over (recombination of the genetic material contained in individual parental chromosomes, similar to sexual reproduction), and gene mutations. As these procedures operate, each evolution stage produces populations with more and more perfect individuals.
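A toy illustration of these procedures on bit-string chromosomes (the fitness function, sizes and rates are arbitrary stand-ins for a real rule-quality measure):

```python
import random

# Toy genetic algorithm: truncation selection, one-point crossover and a
# point mutation, evolving 20-bit chromosomes toward a hidden target.
random.seed(4)
LENGTH = 20
target = [random.randint(0, 1) for _ in range(LENGTH)]

def fitness(ch):
    return sum(c == t for c, t in zip(ch, target))

population = [[random.randint(0, 1) for _ in range(LENGTH)] for _ in range(30)]
initial_best = max(fitness(ch) for ch in population)

for generation in range(60):
    population.sort(key=fitness, reverse=True)
    parents = population[:15]                  # selection: better half survives
    children = []
    while len(children) < 15:
        a, b = random.sample(parents, 2)
        cut = random.randrange(1, LENGTH)      # crossing over
        child = a[:cut] + b[cut:]
        if random.random() < 0.3:              # gene mutation
            i = random.randrange(LENGTH)
            child[i] ^= 1
        children.append(child)
    population = parents + children

# Keeping the parents makes the search elitist: the best fitness never drops.
assert max(fitness(ch) for ch in population) >= initial_best
```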

Genetic algorithms are attractive because they are convenient for parallel processing. For example, it is possible to break a generation into several groups and work with each group independently, providing intergroup exchange of several chromosomes from time to time. There are also other methods of parallel processing of genetic algorithms.

At the same time, these algorithms are not without serious drawbacks today. In particular, the process of creating the initial set of chromosomes, the criteria for chromosome selection and the procedures used are heuristic and by no means guarantee that the best decision will be found. As in real life, the evolution can get stuck on some non-productive branch. Conversely, there are examples when two unpromising parents, which a genetic algorithm would exclude from evolution, turn out to be able to produce a highly efficient descendant after several iterations. This becomes especially noticeable when solving high-dimensional tasks with complex internal relationships.

All the mentioned drawbacks do not yet allow saying that genetic algorithms today seriously compete with decision trees and constrained combinatorial sorting algorithms in searching for logical laws in data. They are capricious to set up and labour-consuming in solving such tasks.

## 7. SPMD: a system based on a new principle

We have developed the **SPMD** system, which claims to solve some of the existing problems. The system uses a new principle and technology of logical rules search in data. The principle is based upon the concepts of a special local geometry. In this geometry each multidimensional object exists in its own local space of events with its individual metric. The characteristics of the local spaces give a geometrical interpretation to the combinatorial problem of logical rules search. The search technology is based upon modified tools of linear algebra. Very important is the use of the Structural Resonance in Multidimensional Data (SRMD) phenomenon.

The **SPMD** system has the following major characteristics:

1. Finding the best if-then rules (the most complete ones with the accuracy provided) for each database record

2. Developing and testing data classifiers on the if-then rules basis

3. Developing fuzzy if‑then rules

4. Developing dendrograms and exploring the metastructure of the ruleset

Moreover, the additional system characteristics are also important:

5. Linear dependency of the search algorithm working time on the data volume

6. No limits for a data type

7. Working with any amount of blanks in data

8. Working with polluted data

9. Using the data + noise method which helps to find out the stability laws in data

10. Finding non-periodical complex patterns in numerical and symbol rows

11. Possibility of parallel processing of the if-then rules search process

To illustrate some of the possibilities of the **SPMD** system we will use the tests described above.

**Test 1** poses absolutely no problem for this system, as it does not make the first-stage mistake. In the process of attribute segmentation, all the observed numerical values were taken as interval boundaries.

For the results of **test 2**, on the ability to find the most complete and accurate rules, see Table 2.

**Table 2. The results of test 2 passed by the SPMD system**

| Rule | Completeness | Accuracy |
| --- | --- | --- |
| 1 | 50/100 | 1 |
| 2 | 50/100 | 1 |
| 3 | 50/100 | 1 |
| 4 | 50/100 | 1 |

As we can see, the system passed the test. It found all 4 artificially introduced laws in the data, and together these laws completely cover the given classes L and K with 100% accuracy. The search time for one rule on a Pentium III, 733 MHz, was about 2 seconds.

In the process of random data analysis (test 3), the SPMD system compares the statistical characteristics of the found rules with the characteristics of rules found in an additional, randomly generated dataset. The results of this comparison provide enough ground to reject or accept the hypothesis of false laws.
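A minimal sketch of such a random-reference comparison (a single-event search stands in for the real rule search; all names and sizes are assumptions):

```python
import random

# Best accuracy achievable by single-event rules (attribute == value) on a
# two-class table; running the same search on freshly generated random data
# of the same shape yields a reference level for rejecting false laws.
def best_single_event_accuracy(rows, labels):
    best = 0.0
    for j in range(len(rows[0])):
        for v in "AB":
            covered = [l for r, l in zip(rows, labels) if r[j] == v]
            if covered:
                acc = max(covered.count("L"), covered.count("K")) / len(covered)
                best = max(best, acc)
    return best

random.seed(5)
rows = [[random.choice("AB") for _ in range(50)] for _ in range(200)]
labels = ["L"] * 100 + ["K"] * 100
baseline = best_single_event_accuracy(rows, labels)
# Even on purely random data the best rule beats 0.5 by chance alone;
# rules found in real data are kept only if they clearly exceed this level.
assert 0.5 <= baseline < 0.75
```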

In subsequent articles we will describe the theoretical basis of the SPMD system and dwell in more detail on the phenomenon of informational structural resonance in data.

### REFERENCES

[1] Users Guide, WizWhy Version 2. WizSoft Inc. http://www.wizsoft.com.