Knowledge Discovery in Databases : Part II

By Abhishek Ghosh August 27, 2018 3:54 pm Updated on August 27, 2018

In Part I of Knowledge Discovery in Databases, we discussed database systems, the fundamentals of statistics and Big Data, and the fundamentals of knowledge discovery in databases (KDD). In this second part, we discuss the process of KDD and give an overview of its methods. Overall, knowledge discovery in databases is a good way to extract knowledge from data. As more data is collected in all areas, more and more knowledge can likely be extracted from it. The new knowledge can then be used for further evaluations. However, the statistical basis of any new knowledge must be checked in order to avoid serious mistakes later.

Knowledge Discovery in Databases : Process of the KDD

Data selection

The first step is to gain an understanding of the application and of the application knowledge that already exists. Based on this, the goal of the discovery is defined: which previously unknown knowledge is to be found. The desired knowledge must also provide a useful added value for the application. The data in which the knowledge is to be sought is then selected. In the simplest case, we access an existing database, within which different tables can be selected. If no database is available, the data must be collected manually during data selection, for example through surveys or similar methods.

After the data has been collected, it pays to manage it properly. This is often done in a file created specifically for data mining. However, because most data is already managed in commercial database systems, such a file creates redundancy. A database system, by contrast, offers functionality that can be used profitably in every process step; for example, subsets can be selected easily and efficiently. It is therefore increasingly desirable to integrate KDD with commercial database systems.
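As a minimal sketch of how a database system makes subset selection easy, assume Python's built-in sqlite3 module and a hypothetical customers table. The table, its columns and the sample rows are invented for illustration and are not part of any real system.

```python
import sqlite3

def select_subset(conn, min_age):
    """Return (customer_id, age, city) for customers at least min_age years old."""
    cur = conn.execute(
        "SELECT customer_id, age, city FROM customers "
        "WHERE age >= ? ORDER BY customer_id",
        (min_age,),
    )
    return cur.fetchall()

# Hypothetical in-memory database standing in for an existing database system.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (customer_id INTEGER, age INTEGER, city TEXT, notes TEXT)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?, ?)",
    [(1, 34, "Kolkata", "a"), (2, 17, "Delhi", "b"), (3, 52, "Mumbai", "c")],
)

# Only the relevant rows (adults) and columns (no notes) are selected.
subset = select_subset(conn, 18)
print(subset)
```

The query picks both the rows and the attributes of interest in one statement, which is the kind of functionality a separate data mining file would have to reimplement.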

Preprocessing

The goal of this process step is to integrate the required data and make it consistent. Missing attribute values are also filled in so that gaps do not distort the data mining step. Preprocessing, together with transformation, usually accounts for the greatest effort within a KDD process. This effort can in principle be reduced by using a data warehouse, which is a durable, integrated collection of data from different sources for the purpose of analysis and decision support.

When data is obtained in different ways from different sources, it must be integrated because different names could have been used for the same attributes. In addition, inconsistencies such as different values of the same attribute or spelling errors must be resolved.
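The integration step could look roughly like the following sketch. The alias table, the spelling corrections and the sample records are all hypothetical; in practice they would come from application knowledge about the sources.

```python
# Hypothetical mapping from source-specific attribute names to one shared schema.
ATTRIBUTE_ALIASES = {
    "cust_name": "name",
    "customer": "name",
    "town": "city",
    "location": "city",
}

# Hypothetical table of known spelling variants, keyed by unified attribute name.
VALUE_CORRECTIONS = {
    "city": {"calcutta": "Kolkata", "kolkata": "Kolkata", "bombay": "Mumbai"},
}

def integrate(record):
    """Rename attributes to the shared schema and normalise known value variants."""
    unified = {}
    for key, value in record.items():
        name = ATTRIBUTE_ALIASES.get(key, key)
        if isinstance(value, str):
            value = value.strip()
            corrections = VALUE_CORRECTIONS.get(name, {})
            # Look the value up case-insensitively; keep it unchanged if unknown.
            value = corrections.get(value.lower(), value)
        unified[name] = value
    return unified

source_a = {"cust_name": "Asha", "town": "Calcutta"}
source_b = {"customer": "Ravi", "location": " kolkata "}
merged = [integrate(source_a), integrate(source_b)]
print(merged)
```

After integration both records use the same attribute names and the same spelling for the city, so later steps can treat them as one data set.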

A survey can also introduce so-called noise, a random pattern superimposed on the actual patterns. Such noise usually arises from measurement errors or from questions that were intentionally left unanswered. Depending on the algorithm used, missing attribute values must also be specified more precisely. For example, a distinction can be made between "measurement not performed" and "measurement error".
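One simple way to handle missing values while keeping the reason for the gap is sketched below. Filling with the mean is only one of several possible strategies and is chosen here as an assumption; the marker names and the sample readings are invented for illustration.

```python
from statistics import mean

# Hypothetical markers that record why a value is missing.
NOT_MEASURED = "not_measured"
MEASUREMENT_ERROR = "measurement_error"

def impute_mean(values):
    """Fill missing entries with the mean of the observed values.

    Returns the filled list and a parallel list that holds the reason for
    each gap (None where a value was actually observed).
    Raises statistics.StatisticsError if no value was observed at all.
    """
    observed = [v for v in values if isinstance(v, (int, float))]
    fill = mean(observed)
    cleaned = [v if isinstance(v, (int, float)) else fill for v in values]
    reasons = [None if isinstance(v, (int, float)) else v for v in values]
    return cleaned, reasons

readings = [20.0, NOT_MEASURED, 24.0, MEASUREMENT_ERROR, 22.0]
cleaned, reasons = impute_mean(readings)
print(cleaned)
print(reasons)
```

Keeping the reasons in a separate list means an algorithm that treats the two kinds of gap differently still has that information after the values are filled in.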

Transformation

In this step, the preprocessed data is transformed into a representation suitable for the goal of the knowledge discovery. Not all known attributes of the data are relevant for the data mining step, so one of the typical transformations is attribute selection.

Although many algorithms already make their own selection of attributes, too many attributes can impair both the efficiency and the result of the data mining. If there is sufficient application knowledge about the meaning of the attributes and the given data mining task, attribute selection can be performed manually. Otherwise, an automatic attribute selection must be performed.

An exhaustive algorithm that considers all subsets of the set of attributes is too expensive, since n attributes have 2^n subsets. Instead, heuristic algorithms are used. Typically, they start from the empty set or from the full set of attributes and, step by step, add or remove the attribute that achieves the best score for the resulting attribute set with respect to the given data mining task.
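The heuristic that starts from the empty set is often called greedy forward selection. The sketch below shows one possible form of it. The scoring function is a stand-in: in a real project it would measure how well the data mining task performs on the given attribute set, whereas here the usefulness values and the penalty per attribute are invented for illustration.

```python
def forward_selection(attributes, score):
    """Greedy forward selection.

    Starts from the empty attribute set and repeatedly adds the single
    attribute that yields the best score. Stops as soon as no addition
    improves the current score.
    """
    selected = []
    best_score = score(selected)
    remaining = list(attributes)
    while remaining:
        candidates = [(score(selected + [a]), a) for a in remaining]
        top_score, top_attr = max(candidates, key=lambda pair: pair[0])
        if top_score <= best_score:
            break  # no remaining attribute improves the score
        selected.append(top_attr)
        remaining.remove(top_attr)
        best_score = top_score
    return selected, best_score

# Hypothetical usefulness of each attribute for the mining task.
USEFULNESS = {"age": 0.5, "income": 0.3, "shoe_size": 0.0, "zip": 0.02}

def toy_score(subset):
    """Toy score: summed usefulness minus a small penalty per attribute."""
    return sum(USEFULNESS[a] for a in subset) - 0.05 * len(subset)

chosen, value = forward_selection(list(USEFULNESS), toy_score)
print(chosen, round(value, 2))
```

With four attributes the search evaluates at most a handful of candidate sets per round instead of all 16 subsets, at the price of possibly missing the best combination. Backward elimination works the same way in reverse, starting from the full set and removing one attribute at a time.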

Some data mining algorithms cannot process numeric attributes, only categorical ones. This requires discretization, that is, a transformation of numerical attributes into categorical attributes. Simple methods divide the range of values into intervals of equal length or into intervals that contain an equal number of values. More complex methods take class membership into account and form intervals so as to maximize the information gain with respect to class membership: attribute values of objects of the same class are assigned to the same interval as far as possible.
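The two simple methods can be sketched as follows. The function names and the sample ages are assumptions made for illustration; each function returns a bin index per value, which can then serve as a categorical attribute.

```python
def equal_width_bins(values, n_bins):
    """Assign each value to one of n_bins intervals of equal length."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        return [0 for _ in values]  # all values identical: a single bin
    # The maximum value would land in bin n_bins, so clamp it to the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def equal_frequency_bins(values, n_bins):
    """Assign values to n_bins intervals that hold (roughly) equally many values.

    Note: with this simple rank-based scheme, equal values can end up in
    different bins.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins

ages = [18, 20, 22, 25, 30, 90]
print(equal_width_bins(ages, 3))
print(equal_frequency_bins(ages, 3))
```

On skewed data such as these ages, the single large value stretches the equal-width intervals so that almost everything falls into the first bin, while equal-frequency binning spreads the objects evenly.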

Data Mining

Data mining is the application of efficient algorithms that find the valid patterns contained in a database. Strictly speaking, it is one step within the KDD process; however, the terms data mining and knowledge discovery in databases are now often used synonymously.

Data mining is a collective term for various computer-aided procedures that are used to analyze large databases. One definition from the literature reads: data mining is a problem-solving methodology that finds a logical or mathematical description, eventually of a complex nature, of patterns and regularities in a set of data. In other words, data mining aims to find patterns in a database that can be represented using logical or mathematical descriptions. It offers the possibility of automatically generating new hypotheses, as opposed to traditional statistical methods, which are used to validate given hypotheses.

First, the relevant data mining task is identified. Clustering, together with the discovery of outliers, aims to divide the database into groups of objects such that the objects within a group are as similar as possible and the objects of different groups are as dissimilar as possible. Outliers are objects that cannot be assigned to any group.
In classification, training objects are given whose attribute values have already been assigned to a class. From these, a function is learned that assigns future objects to classes based on their attribute values.
The goal of generalization is to describe a set of data as compactly as possible by generalizing the attribute values. This also reduces the number of records.
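To make the classification task more concrete, here is a minimal sketch of a nearest-centroid classifier. It is only one of many possible classification methods and is picked here as an assumption because it is short; the training objects and class labels are invented. It relies on math.dist, which is available from Python 3.8.

```python
from collections import defaultdict
from math import dist

def train_centroids(training_objects):
    """Compute one centroid (mean attribute vector) per class.

    training_objects is a list of (attribute_vector, class_label) pairs.
    """
    grouped = defaultdict(list)
    for vector, label in training_objects:
        grouped[label].append(vector)
    return {
        label: tuple(sum(column) / len(vectors) for column in zip(*vectors))
        for label, vectors in grouped.items()
    }

def classify(centroids, vector):
    """Assign a new object to the class whose centroid is closest (Euclidean)."""
    return min(centroids, key=lambda label: dist(centroids[label], vector))

# Hypothetical training objects with two numeric attributes each.
training = [
    ((1.0, 1.0), "low"),
    ((2.0, 1.0), "low"),
    ((8.0, 9.0), "high"),
    ((9.0, 8.0), "high"),
]
centroids = train_centroids(training)
print(classify(centroids, (2.0, 2.0)))
print(classify(centroids, (7.0, 7.0)))
```

The centroids are the learned function in this sketch: once computed from the training objects, they are all that is needed to assign future objects to a class.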

Based on the application objective and the type of data, a suitable algorithm is subsequently selected. Data with categorical attributes requires a different algorithm than data with numeric attributes.

Interpretation

In the last step of knowledge discovery in databases, the patterns found are presented appropriately. A large number of patterns or of attributes makes this task harder. A visualization is often more helpful than textual output. Once the presentation of the patterns is adequate, the patterns are evaluated by an expert, who assesses them using existing application knowledge against the goals defined at the outset.

If, according to the expert, not all goals have been reached, the knowledge discovery process has to be run through again, and any step can serve as the entry point. For example, the data mining step can be repeated using the same algorithm with other parameters. Once the expert declares the evaluation successful, the knowledge found is documented. The newly acquired knowledge can serve as a basis for future knowledge discovery processes in order to develop further knowledge.

Overview of Methods of Knowledge Discovery in Databases

The methods discussed here belong to the data mining step of the KDD process. The basics were touched on in the data mining section above and will be revisited and expanded in this series. One of the main purposes of these methods is to identify patterns in data. For this, a number of algorithms are used, many of which have their origins in mathematics and statistics. Common methods are listed below and will be explained in detail in the next part of this series.

Generalization
Clustering – Cluster analysis with multivariate statistical methods, Clustering with Artificial Neural Networks
Classification
Association analysis
Regression analysis

Conclusion of Part II

This part dealt mostly with data science. The topic “Knowledge Discovery in Databases” is too big to fit within one or two articles. In the next article, we will go into the details of the methods of Knowledge Discovery in Databases, look at application examples and draw a conclusion.
