In Part I of Knowledge Discovery in Databases, we discussed database systems, the fundamentals of statistics and Big Data, and the fundamentals of knowledge discovery in databases. In Part II, we covered the process and the methods of knowledge discovery in databases. In this third part, we discuss the methods of knowledge discovery in databases in detail, together with application examples.
Methods of Knowledge Discovery in Databases
The main goal, as described, is the analysis of data and the recognition of patterns. Generalization serves to derive a compact data set from all the available information. It can be divided into two kinds. On the one hand, there is manual generalization, in which the data in the database are increasingly restricted. On the other hand, there is automatic generalization, in which specific parameters are given to an algorithm. The algorithm then analyzes and sorts the data using these parameters and automatically performs the necessary generalization steps.
The FIGR algorithm works by counting the number of occurrences of each value and then sorting the values by these counts. This minimizes the total number of distinct values and thus produces a correspondingly more compact data set.
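The details of FIGR itself are not reproduced here; the following is only a minimal sketch of the counting-and-sorting idea, assuming a simple frequency cutoff. The function name, the `top_n` parameter, and the "other" placeholder are illustrative assumptions, not part of the actual algorithm:

```python
from collections import Counter

def generalize_by_frequency(values, top_n=3):
    """Count occurrences of each value, sort by frequency, and keep only
    the most common values; the rest are generalized to a single
    placeholder. (Illustrative sketch; FIGR differs in detail.)"""
    counts = Counter(values)
    frequent = {v for v, _ in counts.most_common(top_n)}
    return [v if v in frequent else "other" for v in values]

data = ["red", "red", "blue", "red", "green", "blue", "yellow", "purple"]
print(generalize_by_frequency(data, top_n=2))
```

Collapsing rare values like this shrinks the set of distinct values, which is the compacting effect described above.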
Clustering describes a method in which relationships between different data are identified. These relationships may lie in the matching of properties between data; alternatively, the data may be sorted according to the largest possible differences. The goal is therefore to group elements that are as similar as possible together, while keeping the groups themselves as different from each other as possible. Clustering can be further subdivided into subareas. These include cluster analysis based on multivariate statistical methods and competitive strategies through clustering with artificial neural networks.
Cluster analysis with multivariate statistical methods
A further subdivision of the multivariate statistical methods is into hierarchical and partitioning methods. Hierarchical methods are in turn divided into agglomerative and divisive ones. The agglomerative methods merge the objects and data step by step; in theory, this procedure can be continued until only two classes are left. The divisive methods work in the opposite direction: starting from one class, further subclasses are formed. A problem with the hierarchical procedures is that wrong assignments can be made and can no longer be corrected afterwards. As a result, these methods can only be used to a limited extent; above all, they are suitable for finding hierarchies or outliers. The partitioning methods look for optimal partitions in the data. This means that data that are as coherent as possible are searched for and grouped together. The data can then be moved between partitions according to certain specified target criteria.
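The agglomerative idea can be sketched compactly: start with one cluster per object and repeatedly merge the two closest clusters. The sketch below uses one-dimensional values and single-linkage distance; both choices are illustrative assumptions, not a specific published method:

```python
def agglomerative_cluster(points, target_clusters):
    """Single-linkage agglomerative clustering on 1-D points:
    begin with one cluster per point and repeatedly merge the two
    closest clusters until the target number of clusters remains."""
    clusters = [[p] for p in points]
    while len(clusters) > target_clusters:
        best = None  # (distance, index_i, index_j) of the closest pair
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters[j])  # merge the closest pair
        del clusters[j]
    return [sorted(c) for c in clusters]

print(agglomerative_cluster([1.0, 1.2, 5.0, 5.1, 9.0], 3))
```

Note the problem mentioned above is visible here: once two points are merged, no later step can separate them again.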
Clustering with Artificial Neural Networks
For clustering with artificial neural networks, a support vector clustering algorithm is used in combination with a support vector machine. The general idea is to search, using mathematical functions, for a hyperplane that separates two classes. A further property of this hyperplane should be that its distance to the nearest points of each class is maximal. The points closest to the plane serve as support vectors, while the other data points have no influence on the algorithm.
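The role of the support vectors can be illustrated with a small sketch. A real SVM finds the hyperplane by solving an optimization problem; here the hyperplane w·x + b = 0 is simply assumed to be known, and the sketch only computes each point's distance to it and picks out the nearest point of each class. All names and the example data are illustrative:

```python
import math

def support_vectors(points, labels, w, b):
    """Given a separating hyperplane w.x + b = 0, compute each point's
    distance |w.x + b| / ||w|| to the plane and return the nearest
    point of each class -- the support vectors."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    def dist(x):
        return abs(sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
    nearest = {}  # label -> (point, distance)
    for x, y in zip(points, labels):
        d = dist(x)
        if y not in nearest or d < nearest[y][1]:
            nearest[y] = (x, d)
    return {y: x for y, (x, _) in nearest.items()}

pts = [(1, 1), (2, 2), (4, 4), (5, 5)]
lbl = [-1, -1, +1, +1]
# the plane x + y - 6 = 0 separates the two classes in this toy data
print(support_vectors(pts, lbl, w=(1, 1), b=-6))
```

The points (1, 1) and (5, 5) lie further from the plane and, as stated above, would not influence the resulting classifier.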
Within classification, the data is assigned to predefined classes. This is in contrast to the clustering methods, where matching classes are first found and then assigned. A simple example of classification is the granting of loans: based on existing records related to lending, it is decided whether or not a loan is granted.
Classification can be subdivided into two tasks. First, the pure assignment of objects to classes takes place, based on the attribute values of the individual objects. Only the second task is the actual KDD step: here the explicit knowledge about the classes is generated.
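The first task, assigning an object to a class from its attribute values, can be sketched with a nearest-neighbour rule on the loan example. The features (income and debt, both in thousands) and the decision labels are hypothetical illustration data, and 1-nearest-neighbour is just one of many possible classifiers:

```python
def classify_1nn(records, labels, query):
    """Assign the query object to the class of its nearest existing
    record (1-nearest-neighbour, squared Euclidean distance).
    Records here are hypothetical (income_k, debt_k) pairs."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    best = min(range(len(records)), key=lambda i: dist(records[i], query))
    return labels[best]

past = [(60, 5), (80, 10), (20, 30), (25, 40)]   # (income, debt)
decisions = ["grant", "grant", "deny", "deny"]   # past loan decisions
print(classify_1nn(past, decisions, (70, 8)))    # a new applicant
```

Such a rule only performs the assignment step; turning the class boundaries into explicit, human-readable knowledge is the separate second task mentioned above.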
Association analysis is based on finding rules between occurring data. An example of such a rule is an "if A and B then C" connection. As a basis, the Apriori algorithm can be used to find frequently occurring data links. It relies on a monotonicity property of frequent item sets: each subset of a frequently occurring item set must itself be frequent.
The frequently occurring links are then determined by size: first single-element, then two-element item sets, and so on. The algorithm is repeated until no further links are found. The association analysis then derives the most common rules from them.
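The level-wise procedure can be sketched as follows. The sketch grows item sets one element at a time and uses the monotonicity property to prune candidates whose subsets are not frequent; the shopping-basket data is invented for illustration:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Find all item sets occurring in at least min_support transactions:
    start with frequent single items, then grow candidates level by
    level, pruning any candidate with an infrequent subset."""
    transactions = [frozenset(t) for t in transactions]
    items = {i for t in transactions for i in t}

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t)

    frequent = {}  # item set -> support count
    current = [frozenset([i]) for i in items
               if support(frozenset([i])) >= min_support]
    k = 1
    while current:
        for s in current:
            frequent[s] = support(s)
        k += 1
        # candidates: unions of frequent (k-1)-sets of the right size
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # prune via monotonicity, then check support
        current = [c for c in candidates
                   if all(frozenset(sub) in frequent
                          for sub in combinations(c, k - 1))
                   and support(c) >= min_support]
    return frequent

baskets = [{"bread", "milk"}, {"bread", "butter"},
           {"bread", "milk", "butter"}, {"milk"}]
freq = apriori(baskets, min_support=2)
print({tuple(sorted(s)): n for s, n in freq.items()})
```

From the frequent pairs found here, rules such as "if butter then bread" can subsequently be derived.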
A classic example of this is shopping basket analysis. It analyzes which items are purchased together with other items. These are then displayed to other customers, as in "Customers who bought this item also bought" recommendations. This enables targeted advertising, as there is a likelihood that similar products are also relevant to these customers. This example is discussed in more detail among the application examples. Within association analysis, a connection to classification can also be established.
Regression analysis is used to derive related information from existing data. Various regression methods are used to make forecasts based on the existing data. This method is mainly used when certain information is available but related values are missing. The higher the dependency between the objects, the more accurate a prognosis can be.
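The simplest such method, ordinary least squares for a line y = a·x + b, can be sketched directly; the fitted line then supplies the missing related values as forecasts. The sample data is invented for illustration:

```python
def linear_regression(xs, ys):
    """Ordinary least squares fit of y = a*x + b to paired data.
    Returns the slope a and intercept b of the fitted line."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    b = my - a * mx
    return a, b

xs = [1, 2, 3, 4]
ys = [2.1, 4.0, 6.2, 7.9]          # observed, roughly linear values
a, b = linear_regression(xs, ys)
print(round(a, 2), round(b, 2))    # slope and intercept of the fitted line
print(round(a * 5 + b, 2))         # forecast for the missing value at x = 5
```

As the text notes, the forecast at x = 5 is only trustworthy to the extent that the dependency between x and y is actually strong in the data.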
There are already many successfully used everyday examples for the use of KDD. This covers both commercial and non-commercial topics. Some of the examples are explained in the following sections. Furthermore, the aspect of data protection is taken up here.
In the course of the ever-increasing amount of data, data protection is becoming more important. These collected, partly personal, data are analyzed and evaluated without the knowledge of the users. By connecting different databases, it is possible to obtain precise patterns of behavior and information related to individuals.
Currently, however, generally only group behavior is analyzed, which is why no conclusions are drawn about individuals. Nevertheless, closer attention must be paid to privacy in this context, so that no personal analysis takes place in the future.
Today, more and more analyses are being carried out on current climate and land-use changes. More and more data is collected and stored when creating simulation models. These data can then be evaluated, for example, to predict climate change or future events.
Among other things, the systems analyze geological data. Based on the evaluated data, typical routes of cyclones are predicted.
The best-known examples of KDD are marketing applications. In this case, for example, customer data are analyzed and, based on this, the advertising is adapted. This can be done based on user behavior as well as the location or other available data.
The knowledge gained is used, among other things, in web shops with the well-known statements "What other items do customers buy after they have viewed this item?" or "Other recommendations for you:". This is a significant benefit for the seller, as advertising can be tailored to different customers. Such an association is also possible for purchases within certain locations. For example, it could be observed that many people in one area are searching a website for a bicycle; on the basis of this data, a provider of bicycles or bicycle accessories could specifically target this area with its advertising.
A similar adaptation takes place today in television. Analyses are used to determine which audience is following the program at what point in time; these then enable the placement of advertising aimed specifically at the relevant audience. An example of an application in marketing is the Spotlight system, which analyzes sales quantities and reveals correlations between changes in these quantities and, for example, simultaneous price changes.
Overall, knowledge discovery in databases is a good way to extract knowledge from data. As the data collected in all areas increases, more and more knowledge is likely to be extracted, and the new knowledge can then be used for further evaluations. However, it must be checked whether the statistical basis of this new knowledge is correct, in order to avoid serious mistakes in the future.
Data can be stored almost without limit in today's world. As a result, modern technologies and similar means are generating more and more data containing potential knowledge, and ever newer and more powerful computers make more complex evaluations possible, which can extract previously intangible knowledge from the data.
Due to the growing data volumes, the process of KDD will probably change even more areas of our lives in the future and will thus become more and more important. New knowledge also constantly leads to further knowledge. At the same time, there will be changes in other areas: in terms of data protection in particular, the ever-increasing volumes of data and the growing relationships between them will bring about a great deal of change.