If-then Rules Search in Data: The Main Problems

V.A.Duke, Argos Group LLC

Abstract

 Popular Data Mining systems rely on decision-tree algorithms and constrained combinatorial search. The efficiency of these algorithms is usually judged by final results on an independent validation dataset or by cross-validation procedures. At the same time, some principal drawbacks of these approaches tend to be overlooked, although they considerably diminish their value in practical tasks. The article analyses the main problems of if-then rules search in data: the problem of attribute segmentation, the combinatorial problem, the lack of criteria for evaluating individual rules, the problem of false laws in multidimensional data, and others.

General Terms

Algorithms, Performance, Theory

Keywords

Knowledge discovery, if-then rules

1. Introduction

Data analysis systems belonging to the area of Data Mining and KDD (Knowledge Discovery in Databases) include if-then rules search systems as an important component. Such systems help to solve problems of forecasting, classification, pattern recognition, database segmentation, extraction of hidden knowledge from data, data interpretation, and discovery of associations in databases. If-then rules search methods place minimal requirements on data types, can be applied to heterogeneous information, and produce results that are easy for a person to interpret.

The popular approaches in this class of analytical systems use decision-tree algorithms and constrained combinatorial search. The efficiency of these algorithms is usually judged by final results on an independent validation dataset or by cross-validation procedures. At the same time, some principal drawbacks of these approaches tend to be overlooked, although they considerably diminish their value in practical tasks.

In the present article we try to reveal some unsolved problems and explain their causes. In subsequent articles, possible prospects will be considered, taking the new DataViewSion system as an example.

We will use three tests in order to illustrate the emerging problems.

2. Tests

TEST 1 Ability to solve obvious tasks

This test poses the task of dividing into 2 classes a set of objects evenly distributed over a square on the plane (Fig. 1). The square is divided into 4 fields by lines passing through the midpoints of its sides. Each class occupies 2 fields that are symmetrical with respect to one of the square's diagonals. The specific feature of this data configuration is that attributes 1 and 2 taken separately, as well as any intervals of these attributes, have no independent discriminating power.

The solution of the present test task is obvious. Each class is described by 2 logical rules (4 rules total):

 IF (X1 > 4) & (X2 < 5)  THEN  Class 1 (crosses)

 IF (X1 < 5) & (X2 > 4)  THEN  Class 1 (crosses)

 IF (X1 < 5) & (X2 < 5)  THEN  Class 2 (zeros)

 IF (X1 > 4) & (X2 > 4)  THEN  Class 2 (zeros)

Figure 1. Distribution of the objects on the plane of the analyzed attributes
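A minimal sketch of how such a configuration can be generated, and of why neither attribute alone discriminates between the classes (our own illustration; integer coordinates 0..9 are an assumption chosen so that the thresholds 4 and 5 in the rules above split the square exactly in half):

import random

def make_test1(n=400, seed=0):
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x1, x2 = rng.randint(0, 9), rng.randint(0, 9)
        # "crosses" occupy the two fields symmetric about a diagonal of the square
        label = "cross" if (x1 > 4) != (x2 > 4) else "zero"
        data.append((x1, x2, label))
    return data

points = make_test1()
# Neither attribute alone separates the classes: the split of X1 values into the
# lower and upper half is roughly the same for crosses and for zeros.
for label in ("cross", "zero"):
    low = sum(1 for x1, _, l in points if l == label and x1 < 5)
    high = sum(1 for x1, _, l in points if l == label and x1 > 4)
    print(label, "X1<5:", low, "X1>4:", high)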


TEST 2 Ability to find the most complete and accurate rules

This test and others like it are constructed according to the following principle.

An object-attribute matrix of size N×p (N is the number of objects, p the number of attributes) is filled with zeros and ones (or any other binary symbols) distributed randomly and uniformly. In this matrix we select several row fragments of different lengths (combinations of attribute values), and each fragment is replicated strictly vertically a certain number of times. In this way object subgroups are created, each described with 100% accuracy by a known logical rule. The subgroups are united into the classes that are to be recognized. To make the experiment more realistic, the columns and rows of the resulting matrix are randomly permuted. The task is to find the implanted known rules in the data matrix.
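A minimal sketch of this construction principle (our own simplified illustration, not the authors' generator; the block sizes and implantation scheme are assumptions matching the test 2 parameters given below):

import random

def make_test2(n_objects=400, n_attrs=100, seed=1):
    """Random binary matrix with implanted, vertically replicated row fragments."""
    rng = random.Random(seed)
    # 1. Fill the object-attribute matrix with random A/B symbols.
    matrix = [[rng.choice("AB") for _ in range(n_attrs)] for _ in range(n_objects)]
    labels = ["M"] * n_objects                      # background objects stay random
    # 2. Implant two fixed value combinations (known rules) per class into row blocks.
    implants = [("L", range(0, 50)), ("L", range(50, 100)),
                ("K", range(100, 150)), ("K", range(150, 200))]
    for cls, rows in implants:
        length = rng.randint(7, 15)                 # rule length: 7..15 elementary events
        attrs = rng.sample(range(n_attrs), length)  # columns taking part in the rule
        values = [rng.choice("AB") for _ in attrs]  # the fixed combination of values
        for r in rows:
            labels[r] = cls
            for a, v in zip(attrs, values):
                matrix[r][a] = v
    # 3. Randomly permute rows and columns so the implants are not visually obvious.
    order = list(range(n_objects))
    rng.shuffle(order)
    matrix = [matrix[i] for i in order]
    labels = [labels[i] for i in order]
    cols = list(range(n_attrs))
    rng.shuffle(cols)
    matrix = [[row[c] for c in cols] for row in matrix]
    return matrix, labels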

In test 2 in particular, the data table has the following characteristics: 400 objects (100 belong to the first class, L, 100 to the second class, K, and 200 objects of class M carry randomly distributed values) and 100 binary attributes taking the value A or B. It is required to find 4 known logical rules, 2 for each class. These rules are combinations of 7 to 15 elementary logical events. A fragment of the table is shown in Fig. 2.

The task of test 2 is far from the most complicated that we come across in practice. For example, 100 binary attributes can appear in an analysis in which there are only 10 initial quantitative attributes, each of which is split into 10 intervals during the search for logical laws. Real tasks often contain hundreds and even thousands of quantitative and ordinal attributes, while the logical laws can be combinations of dozens or hundreds of elementary events. If a system cannot find rules of unlimited length that cover as many objects of their own class as possible, the analyst risks drowning in a sea of splinters of the logical rules.


Figure 2. A small fragment of binary test data
(the sought combinations of attribute values are highlighted)

TEST 3 False laws

Increasing the number of analyzed attributes in a dataset of limited size is accompanied by a considerably higher risk of revealing false laws. This problem remains poorly developed in the literature, although a number of authors point out its great current importance. We know of only isolated attempts to propose mathematical formulas for calculating the statistical significance of models obtained from the analysis of high-dimensional empirical material. These attempts, made within the framework of the multiple-comparisons problem, appear to be extremely domain-specific and cannot be considered general.

Test 3 and others like it are needed to assess the validity of logical rules found in high-dimensional data. The test is a table of randomly distributed values generated according to a Bernoulli scheme. In our case we present the analysis algorithms with binary data in which each attribute takes the value A or B with probability 0.5.
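Generating such a table is trivial; a short sketch (our own, with arbitrary sizes):

import random

rng = random.Random(42)
n_objects, n_attrs = 1000, 100
# Every cell is an independent Bernoulli(0.5) draw over the symbols A and B,
# and class labels are assigned blindly, so any "law" found here is false by design.
table = [[rng.choice("AB") for _ in range(n_attrs)] for _ in range(n_objects)]
labels = ["L" if i < n_objects // 2 else "K" for i in range(n_objects)]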

In the sections below we consider the most popular approaches in Data Mining, present the test tasks to the relevant algorithms, and examine the reasons for their unsatisfactory performance.

3. Test results

Three popular if-then rules search systems were chosen for testing: See5/C5.0 (RuleQuest, Australia) and AnswerTree (SPSS), which build decision trees, and WizWhy (WizSoft), which uses constrained combinatorial search.

Testing the decision tree building systems

 Test 1 turns out to be impassable for decision tree building algorithms. As already pointed out, attributes 1 and 2, or any of their intervals taken separately, cannot separate the crosses from the zeros. So already at the first stage of selecting the best attribute, these algorithms give up on finding any logical rule at all.

The results of test 2, run on the AnswerTree system v. 2.1, are shown in Fig. 3. The recognized classes are marked by the letters L and K, the 100 analyzed attributes are denoted a1, ..., a100, and each attribute can take the 2 values A and B. The test is simplified: the 200 randomly generated objects are excluded from the data matrix.

As can be seen in the figure, the AnswerTree system found 7 rules. They are situated at the ends of the branches of the constructed tree and are circled with ovals. However, only 2 rules can be considered more or less satisfactory with respect to their accuracy and their completeness in covering the objects of their own class. Thus, rule 2

 IF  (a62=) and (a72=)  THEN  class  L 

covers, with 100% accuracy, 39 of the 100 objects (rows) of class L. In its turn, rule 7

 IF  (a62=) and (a89=) and (a84=) and (a91=)  THEN  class K 

refers 57 objects to class K, committing one error. So, on the whole, test 2 remains unsolved. The system was not able to find the 4 rules that were known to the examiner and that together covered all the objects of the recognized classes with 100% accuracy. This test is likewise not passed by other systems that use different decision tree building algorithms.

For test 3 (false laws) the See5 system v. 1.16 was used. It was presented with a table of random data (200 binary attributes, 4000 objects, 2 classes L and K of 2000 objects each). The system found 72 if-then rules in these data; some of them are shown in Table 1.

Overall, on the basis of the revealed logical rules, the See5 system classified the random data in the table with an error probability of 26.6%.

Table 1. Some of the rules revealed by the See5 system in random data

Rule 1 (23/1, lift 1.8): a56 = A, a72 = A, a74 = B, a133 = A, a140 = B, a158 = B, a180 = B  ->  class L [0.920]

Rule 28 (24/2, lift 1.8): a47 = B, a74 = A, a114 = B, a119 = A, a127 = A, a177 = A, a194 = A  ->  class K [0.885]

Rule 32 (32/4, lift 1.7): a3 = A, a4 = A, a39 = B, a140 = A, a157 = B, a171 = B, a182 = B  ->  class K [0.853]

In Table 1, bearing in mind the essence of the test, we should pay attention to the figures in square brackets given for each rule. These figures show the estimated accuracy of the rule, computed as the Laplace ratio (n − m + 1)/(n + 2), where n is the total number of objects covered by the rule and m is the number of objects covered wrongly. As these figures show, the rules found are quite satisfactory in terms of accuracy (the system offers no other evaluation of a rule). The user may therefore believe that the discovered rules reflect true laws in the data and constitute new knowledge. But this is an illusion: it is known a priori that these are the results of processing a table in which the A and B symbols are distributed at random. Similar illusions of new knowledge arise when the See5 decision tree system is tested on other variants of test 3 containing different numbers of attributes and objects.
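The bracketed figures are easy to reproduce from the "(n/m, lift ...)" headers of Table 1; a quick check of the Laplace ratio:

def laplace_ratio(n_covered, n_errors):
    """See5's accuracy estimate for a rule: (n - m + 1) / (n + 2)."""
    return (n_covered - n_errors + 1) / (n_covered + 2)

# n and m are taken directly from the "(n/m, lift ...)" headers in Table 1.
for rule, n, m in [("Rule 1", 23, 1), ("Rule 28", 24, 2), ("Rule 32", 32, 4)]:
    print(rule, round(laplace_ratio(n, m), 3))   # 0.92, 0.885, 0.853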

Fig. 3. Decision tree built by the AnswerTree system from the data of test 2.

It is worth pointing out that the systems under consideration do run at the high speed advertised by their developers and offer a convenient interface, well-developed facilities for manipulating the initial data, and so on. However, these merits become less attractive once we realize the principal limits of the underlying analytical approach.

Testing the system that uses constrained combinatorial search

Constrained combinatorial search algorithms were proposed in the mid-1960s, and since then they have proved their efficiency in many tasks in very different areas. These algorithms calculate the frequencies of combinations of simple logical events in data subgroups (classes). Examples of simple logical events are: X = C1; X < C2; X > C3; C4 < X < C5, etc., where X is some attribute (field) and Ci are constants. The constraint is placed on the length of a combination of simple logical events. By comparing the calculated frequencies in different data subgroups, we conclude how useful a particular combination is for establishing associations in the data, for classification, for forecasting, and so on.
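In its simplest form the idea can be sketched as follows (a generic illustration of frequency counting over bounded conjunctions; it is not the code of any of the systems discussed here):

from collections import Counter
from itertools import combinations

def bounded_search(rows, labels, target, max_len=2, min_support=5):
    """Enumerate conjunctions of simple events (attribute == value) up to a
    fixed length and keep those that cover enough objects of the target class."""
    events = sorted({(a, v) for row in rows for a, v in enumerate(row)})
    found = []
    for k in range(1, max_len + 1):
        for combo in combinations(events, k):
            if len({a for a, _ in combo}) < k:      # skip contradictory conjunctions
                continue
            hits = [l for row, l in zip(rows, labels)
                    if all(row[a] == v for a, v in combo)]
            freq = Counter(hits)
            if freq[target] >= min_support:
                found.append((combo, freq[target], len(hits)))
    return found

The max_len parameter is exactly the constraint on combination length; even modest settings are slow, and removing the bound makes the enumeration explode combinatorially, which is the central difficulty discussed further below.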

The WizWhy system is a modern representative of the constrained combinatorial search approach. Although its developers do not reveal the specifics of the algorithm underlying WizWhy, thorough testing of the system leads to the conclusion that constrained combinatorial search is present in it (this is confirmed by a study of the results, by the dependence of the running time on the number of analyzed parameters, etc.). WizWhy appears to use a modified version of constrained combinatorial search together with an additional algorithm of the Apriori type. The WizWhy authors emphasize the following general features of the system:

Discovering ALL if-then rules; calculating the error probability of each rule; defining the best segmentation of numerical variables; calculating the forecasting power of each attribute; revealing unusual phenomena in the data; using the discovered rules for forecasting; expressing the forecast as a list of relevant rules; calculating the forecast error; forecasting with the cost of errors taken into account.

Speaking of the advantages of WizWhy, the authors also point out the following:

The system's forecasts are not influenced by subjective factors; its users need no special knowledge of applied statistics; its calculations are more accurate and faster than those of other Data Mining methods.

To make their point more convincing, the WizWhy authors contrast their system with the neural network approach and with decision tree building algorithms, and claim that, having superior characteristics, WizWhy is displacing the other software products on the Data Mining market.

The developers' boldest claims about the system's characteristics are easily refuted with the help of our 3 tests: the ability to solve obvious tasks, the ability to find the most complete and accurate logical rules, and the ability to avoid false laws in multidimensional data.

Thus, the WizWhy system flatly fails to find any logical rule in test 1. The reason is that the claim about "defining the best segmentation of numerical variables" does not hold. Here we see the typical mistake made whenever it is not understood that a part is not the same as the whole: the developers try to find the best way to divide quantitative attributes into intervals while treating each attribute in isolation from the whole system of attributes.

In test 2 the WizWhy system (with its default settings) was able to find only one more or less complete logical statement, which correctly assigns to class K 52 of the 53 objects (records) it covers:

 IF  (a2 = B) and (a6 = B) and (a7 = A) and (a34 = B) and (a61 = A)  THEN  class =  K. 

The reason for this obvious failure is that the system can find logical rules containing no more than 6 elementary events. This is precisely the constraint of constrained combinatorial search in action: the test requires finding the logical rules introduced by the examiner, which include up to 15 elementary events (ai = A or B).

It should also be pointed out that the search for logical laws is very time-consuming: it took more than 3 hours to obtain the results of test 2 on an Intel Pentium III 733 MHz computer. Moreover, besides the one valuable rule mentioned above, the WizWhy system also produced several thousand rules that were far less accurate and less complete.

The reaction of the WizWhy system to test 3 (false laws) is of special interest, because the developers calculate a significance level α for each rule that is found. They use the following formula (given in the system's User's Guide [1]):

α = Σ_{i=n}^{m} C(m, i) p^i (1 − p)^(m−i),   p = M / N,

where

 m is the number of objects covered by the rule throughout the whole data set;

 n is the number of objects correctly covered by the rule (i.e. covered objects that belong to the rule's target class);

 N is the volume of the whole data set;

 M is the volume of the class described by the relevant if-then rule.
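Reading the formula above as a binomial tail (the chance probability of obtaining at least n correct hits among the m covered objects when the base class rate is p = M/N), the computation is straightforward; the value of n assumed in the example is our own illustration:

from math import comb

def rule_significance(m, n, N, M):
    """Chance probability of seeing at least n hits of the target class among
    the m objects covered by the rule, with base rate p = M / N (binomial tail)."""
    p = M / N
    return sum(comb(m, i) * p**i * (1 - p)**(m - i) for i in range(n, m + 1))

# For the WizWhy rule quoted below (45 covered objects, accuracy 0.865, in a
# 1000-object table split 500/500) roughly 39 covered objects would be correct;
# the resulting chance level is tiny even though the data are pure noise.
print(rule_significance(m=45, n=39, N=1000, M=500))   # ≈ 2.7e-07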

The WizWhy system was presented with several variants of test 3, from (5 attributes, 100 objects) to (200 attributes, 1000 objects); in each case the random binary data table was divided equally into 2 classes. It hardly makes sense to list all the results here (there are too many), so we confine ourselves to the most telling examples.

Thus, for the case of (100 attributes, 200 objects) the system found 14552 statistically significant if-then rules, with whose help the object classes were recognized with 95% accuracy. In other test examples the system revealed even more statistically significant logical rules. For example, for the case of (100 attributes, 1000 objects) the system report contained the following result:

 IF (a1 is A) and (a6 is A) and (a29 is B) and (a64 is A) THEN Class is K

 Rule accuracy: 0.865

 The rule covers 45 objects.

 Significance level < 0.0000001

The examples speak for themselves. One gets the impression that the WizWhy system, built on the principle of constrained combinatorial search, is directly aimed at finding false laws in random data.

4. Conclusions based on the test results

1. In a number of cases the popular Data Mining analytical instruments turn out to be unable to solve even the simplest obvious tasks.

2. The applied approaches to Knowledge discovery in Databases reveal only vague fragments of the true logical laws.

3. The tools for logical rules search in high-dimensional data are unable to distinguish false laws from stable ones.

It can be added that the known if-then rules search systems support neither generalization of the revealed rules nor search for an optimal composition of such rules. Yet these functions are essential for building knowledge bases, which require the ability to introduce concepts, metaconcepts and semantic relationships on the basis of many fragments of knowledge about the subject area.

5. Main problems

The problem of first-stage attribute segmentation

Whatever approach is used, all the tested systems make a principal mistake at the very first stage of their work.

Decision tree algorithms naively try to find the best single attribute (the tree root) that will optimally split the dataset into parts. No mathematical tricks can correct the main drawback of such an approach: this attribute is taken out of the whole system that describes the multidimensional object (the database record). From our point of view, decision tree building algorithms do not stand up to criticism and do not deserve the unwarranted attention paid to them in books and articles.

In the systems that use constrained combinatorial search, the principal first-stage mistake lies in the search for an optimal segmentation of the quantitative attributes. This was already mentioned above in discussing the results of test 1 for the WizWhy system: it is impossible to find the best way to divide quantitative attributes into intervals if each attribute is taken separately from the whole system of attributes.

The stated first-stage problem derives from the key problem of if-then rules search in data: in essence we face the combinatorial task of exhaustively enumerating combinations of elementary logical conditions. In a high-dimensional attribute space this task is considered unsolvable within an acceptable time in the foreseeable future, even with the help of supercomputers. That is why all the known approaches to if-then rules search have to impose heuristic restrictions, and their results are quite often mere scraps of the true laws in the data, broken pieces of knowledge.
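To give a feel for the scale, a back-of-the-envelope count for the test 2 setting (our own illustration, not a figure from the cited systems):

from math import comb

# Number of distinct conjunctions of elementary events "ai = A" / "ai = B"
# of length 1..15 over 100 binary attributes (the setting of test 2).
total = sum(comb(100, k) * 2**k for k in range(1, 16))
print(f"{total:.2e}")   # ≈ 9.09e+21 candidate conjunctions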

Since it is impossible to predict which intervals of the initial attributes (elementary events) will prove best for the sought if-then rules, the first stage of any algorithm that aims at good results should consist in breaking the initial attributes into as many small intervals as the available computing capacity allows. This, of course, mainly concerns quantitative attributes. There appears to be no other way to solve the attribute segmentation problem.

Criterion for evaluating an individual rule

Suppose there exists some imaginary algorithm Ψ that can carry out an exhaustive search within acceptable time. It turns out that even in this case the search task has not yet been stated formally for if-then rules. What is to be considered the optimal solution in if-then rules search? Is the optimal solution unique, or are there many optimal solutions?

Each record in the database can be covered by many different rules of the type Rij: IF Ai THEN Bj, where Ai is some condition (a combination of elementary events) and Bj is a conclusion, for example membership of a certain class.

Examples of the elementary conditions are X = C1; X < C2; X > C3; C4 < X < C5, etc., where X is some attribute (field) and Ci are constants.

We will use two characteristics of if-then rules: accuracy and completeness. The accuracy of rule Rij is the share of records satisfying the conclusion Bj among those satisfying the condition Ai. The completeness of the rule is the share of records covered by Ai among all records satisfying Bj.
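In code, both characteristics reduce to simple conditional shares (a sketch under our own rule representation, where a rule is a list of (attribute_index, value) conditions):

def rule_quality(rows, labels, rule, target):
    """Accuracy: share of target-class records among those covered by the rule.
    Completeness: share of the target class that the rule covers."""
    covered = [l for row, l in zip(rows, labels)
               if all(row[a] == v for a, v in rule)]
    hits = covered.count(target)
    class_size = labels.count(target)
    accuracy = hits / len(covered) if covered else 0.0
    completeness = hits / class_size if class_size else 0.0
    return accuracy, completeness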

In subject areas with a complex systemic organization it is often impossible to find a single if-then rule that has 100% completeness at a given accuracy for some Bj. A whole set of Rij rules is needed to cover all, or a necessary majority, of the database records that satisfy Bj. It is therefore pointless to set the task of finding one best rule for the whole dataset. Since the rules of such a set can differ considerably in completeness, we believe the efficiency of an if-then rules search system is defined by its ability to find, within acceptable time, the best rules (the most complete ones at the required accuracy) for each record of the database. In our opinion, this ability is the major criterion for evaluating Data Mining systems.

6. A few words about genetic algorithms

Many researchers link the development of analytical methods with evolutionary algorithms. The most popular among them are genetic algorithms, which imitate the mechanisms of inheritance, variability and selection found in living nature.

The first step in building a genetic algorithm is to create an initial set of combinations of elementary logical statements, called chromosomes. The whole set of such combinations is called the chromosome population. To implement selection, a method of comparing different chromosomes is then introduced. The population is processed by procedures such as reproduction, variability (mutation) and genetic recombination. The most important of them are random mutations in individual chromosomes, crossing over and recombination of the genetic material contained in the parental chromosomes (analogous to sexual reproduction), and gene mutations. With each evolutionary stage these procedures produce populations with ever more fit individuals.
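For concreteness, a deliberately simplified, textbook-style sketch of such a scheme applied to rule search (a generic genetic algorithm, not a reconstruction of any particular system): a chromosome is a conjunction of elementary events, and fitness rewards rules that are both accurate and complete.

import random

def fitness(rule, rows, labels, target):
    """Score a chromosome (a conjunction of (attribute, value) conditions)."""
    covered = [l for row, l in zip(rows, labels)
               if all(row[a] == v for a, v in rule)]
    if not covered:
        return 0.0
    accuracy = covered.count(target) / len(covered)
    completeness = covered.count(target) / labels.count(target)
    return accuracy * completeness              # crude combined criterion

def evolve(rows, labels, target, pop_size=50, generations=100, seed=0):
    rng = random.Random(seed)
    n_attrs = len(rows[0])
    def random_rule():
        attrs = rng.sample(range(n_attrs), rng.randint(2, 6))
        return tuple((a, rng.choice("AB")) for a in attrs)
    population = [random_rule() for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=lambda r: fitness(r, rows, labels, target), reverse=True)
        survivors = population[: pop_size // 2]                    # selection
        children = []
        while len(survivors) + len(children) < pop_size:
            p1, p2 = rng.sample(survivors, 2)
            child = dict(p1[: len(p1) // 2] + p2[len(p2) // 2 :])  # crossover
            if rng.random() < 0.3:                                 # mutation
                child[rng.randrange(n_attrs)] = rng.choice("AB")
            children.append(tuple(child.items()))
        population = survivors + children
    return max(population, key=lambda r: fitness(r, rows, labels, target))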

Genetic algorithms are attractive because they lend themselves to parallel processing. For example, a generation can be split into several groups that evolve independently, with a few chromosomes exchanged between groups from time to time. Other parallelization schemes for genetic algorithms also exist.

At the same time, these algorithms still have serious drawbacks. In particular, the creation of the initial set of chromosomes, the criteria for chromosome selection and the procedures used are heuristic and by no means guarantee that the best solution will be found. As in real life, evolution can get stuck on an unproductive branch. Conversely, there are examples where two unpromising parents, which a genetic algorithm would exclude from evolution, turn out to produce a highly efficient descendant after several iterations. This becomes especially noticeable when solving high-dimensional tasks with complex internal relationships.

Because of these drawbacks, it cannot yet be said that genetic algorithms seriously compete with decision trees and constrained combinatorial search algorithms in the search for logical laws in data: they are finicky to tune and labour-intensive to apply.

7. The SPMD system: a new principle

We have developed the SPMD system, which claims to solve some of the existing problems. The system uses a new principle and technology for logical rules search in data. The principle is based on the concepts of a special local geometry, in which each multidimensional object exists in its own local space of events with its individual metric. The characteristics of these local spaces give a geometrical interpretation to the combinatorial problem of logical rules search. The search technology itself is based on modified tools of linear algebra. Very important here is the use of the Structural Resonance in Multidimensional Data (SRMD) phenomenon.

The SPMD system has the following major characteristics:

1. Finding the best if-then rules (the most complete ones at the required accuracy) for each database record

2. Developing and testing data classifiers on the basis of if-then rules

3. Developing fuzzy if-then rules

4. Developing dendrograms and exploring the metastructure of the rule set

Moreover, the additional system characteristics are also important:

5. Linear dependence of the search algorithm's running time on the data volume

6. No restrictions on data type

7. Working with any number of blanks in the data

8. Working with noisy (polluted) data

9. Using the "data + noise" method, which helps to reveal stable laws in data

10. Finding non-periodic complex patterns in numerical and symbolic sequences

11. Possibility of parallelizing the if-then rules search

To illustrate some of the possibilities of the SPMD system we will use the tests described above.

 Test 1 presents no problem for this system, since it does not make the first-stage mistake: during attribute segmentation, all the numerical values encountered are taken as interval boundaries.

The results of test 2, on the ability to find the most complete and accurate rules, are given in Table 2.

Table 2. The results of test 2 passed by the SPMD system

Rule 1 (completeness 50/100, accuracy 1): IF a12=B & a14=B & a19=B & a26=A & a41=A & a52=A & a60=B & a67=B & a88=B & a98=A THEN Class=L

Rule 2 (completeness 50/100, accuracy 1): IF a4=A & a27=A & a40=A & a55=B & a63=A & a71=A & a81=B & a82=B & a95=A THEN Class=L

Rule 3 (completeness 50/100, accuracy 1): IF a3=A & a8=B & a11=B & a13=B & a24=A & a48=B & a49=B & a51=A & a59=B & a75=B & a86=B & a100=A THEN Class=K

Rule 4 (completeness 50/100, accuracy 1): IF a2=B & a6=B & a17=B & a28=A & a32=A & a35=A & a46=B & a50=B & a54=B & a62=A & a65=B & a66=B & a77=A & a83=A & a84=B & a89=A & a91=B THEN Class=K

As we can see, the system has passed the test. It found all 4 artificially implanted laws in the data, and together these laws completely cover the given classes L and K with 100% accuracy. The search time for one rule on a Pentium III 733 MHz was about 2 seconds.

When analysing random data (test 3), the SPMD system compares the statistical characteristics of the rules it finds with the characteristics of rules found in an additional, randomly generated dataset. The result of this comparison provides sufficient grounds to accept or reject the hypothesis that the laws are false.
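In spirit this is a randomization test; a generic sketch of the idea (our own illustration, not the SPMD implementation; the rule-search scorer is a hypothetical callable):

import random

def noise_threshold(n_objects, n_attrs, best_rule_score, trials=20, seed=0):
    """Best rule score obtainable on pure-noise tables of the same shape.
    `best_rule_score` is a hypothetical callable: given (rows, labels) it runs
    the rule search and returns the score of the best rule it finds."""
    rng = random.Random(seed)
    best = 0.0
    for _ in range(trials):
        rows = [[rng.choice("AB") for _ in range(n_attrs)] for _ in range(n_objects)]
        labels = ["L" if i < n_objects // 2 else "K" for i in range(n_objects)]
        best = max(best, best_rule_score(rows, labels))
    return best   # rules in the real data scoring below this level are suspect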

In subsequent articles we shall describe the theoretical foundations of the SPMD system and dwell in more detail on the phenomenon of informational structural resonance in data.

 

REFERENCES

[1] User's Guide, WizWhy Version 2. WizSoft Inc. http://www.wizsoft.com